https://github.com/apache/spark

bbec382 Preparing Spark release v2.1.3-rc1 18 June 2018, 20:53:49 UTC
09d6253 Preparing development version 2.1.4-SNAPSHOT 18 June 2018, 20:37:10 UTC
9a3a34c Preparing Spark release v2.1.3-rc1 18 June 2018, 20:37:08 UTC
858e89b [SPARK-23732][DOCS] Fix source links in generated scaladoc. Apply the suggestion on the bug to fix source links. Tested with the 2.3.1 release docs. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #21521 from vanzin/SPARK-23732. (cherry picked from commit dc22465f3e1ef5ad59306b1f591d6fd16d674eb7) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 13 June 2018, 23:12:50 UTC
d405e66 [WEBUI] Avoid possibility of script in query param keys As discussed separately, this avoids the possibility of XSS on certain request param keys. CC vanzin Author: Sean Owen <srowen@gmail.com> Closes #21464 from srowen/XSS2. (cherry picked from commit 698b9a0981f0ec322e15d6ac89cc38c8f49ed33d) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 31 May 2018, 16:51:18 UTC
fa8ef46 Fix compilation error caused by SPARK-24257 24 May 2018, 04:45:43 UTC
1d82876 [SPARK-24257][SQL] LongToUnsafeRowMap calculate the new size may be wrong LongToUnsafeRowMap has a mistake when growing its page array: it blindly grows to `oldSize * 2`, while the new record may be larger than `oldSize * 2`. Then we may have a malformed UnsafeRow when querying this map, whose actual data is smaller than its declared size, and the data is corrupted. Author: sychen <sychen@ctrip.com> Closes #21311 from cxzl25/fix_LongToUnsafeRowMap_page_size. (cherry picked from commit 888340151f737bb68d0e419b1e949f11469881f9) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 May 2018, 03:20:23 UTC
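For illustration only, a minimal sketch of the growth rule the fix describes (the names below are hypothetical, not the actual `LongToUnsafeRowMap` code): grow the page to whichever is larger, double the old size or the size the incoming record actually needs.

```scala
object PageGrowthSketch {
  // Returns the new page size: never smaller than what the incoming record needs.
  def newPageSize(oldSize: Long, usedBytes: Long, recordBytes: Long): Long = {
    val required = usedBytes + recordBytes
    math.max(oldSize * 2, required)
  }

  def main(args: Array[String]): Unit = {
    // A record larger than oldSize * 2 would have overflowed under the old "blindly double" rule.
    println(newPageSize(oldSize = 64, usedBytes = 60, recordBytes = 200)) // 260, not 128
  }
}
```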
b51ce93 [R][BACKPORT-2.2] backport lint fix ## What changes were proposed in this pull request? backport part of the commit that addresses lintr issue Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #21325 from felixcheung/rlintfix22. (cherry picked from commit 8c223b651c3330e6b7f5bcf24b9911b0c32dc7b9) Signed-off-by: Felix Cheung <felixcheung@apache.org> 16 May 2018, 02:31:34 UTC
1d569f6 [SPARKR] Match pyspark features in SparkR communication protocol. (cherry picked from commit 628c7b517969c4a7ccb26ea67ab3dd61266073ca) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 16cd9ac5264831e061c033b26fe1173ebc88e5d1) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit f96d13dd67b2d39e5fff80a6bc4a1b1fc36745c6) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 11 May 2018, 18:13:44 UTC
61a8c4c [PYSPARK] Update py4j to version 0.10.7. (cherry picked from commit cc613b552e753d03cb62661591de59e1c8d82c74) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 323dc3ad02e63a7c99b5bd6da618d6020657ecba) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 850b7d868bbe23e58e1164fc551fac2facbebead) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 11 May 2018, 18:13:32 UTC
8177b21 [SPARK-21278][PYSPARK] Upgrade to Py4J 0.10.6 This PR aims to bump Py4J in order to fix the following float/double bug. Py4J 0.10.5 fixes this (https://github.com/bartdag/py4j/issues/272) and the latest Py4J is 0.10.6. **BEFORE** ``` >>> df = spark.range(1) >>> df.select(df['id'] + 17.133574204226083).show() +--------------------+ |(id + 17.1335742042)| +--------------------+ | 17.1335742042| +--------------------+ ``` **AFTER** ``` >>> df = spark.range(1) >>> df.select(df['id'] + 17.133574204226083).show() +-------------------------+ |(id + 17.133574204226083)| +-------------------------+ | 17.133574204226083| +-------------------------+ ``` Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #18546 from dongjoon-hyun/SPARK-21278. (cherry picked from commit c8d0aba198c0f593c2b6b656c23b3d0fb7ea98a2) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 08 May 2018, 19:15:34 UTC
c4ecc04 [SPARK-23697][CORE] LegacyAccumulatorWrapper should define isZero correctly ## What changes were proposed in this pull request? It's possible that Accumulators of Spark 1.x may no longer work with Spark 2.x. This is because `LegacyAccumulatorWrapper.isZero` may return wrong answer if `AccumulableParam` doesn't define equals/hashCode. This PR fixes this by using reference equality check in `LegacyAccumulatorWrapper.isZero`. ## How was this patch tested? a new test Author: Wenchen Fan <wenchen@databricks.com> Closes #21229 from cloud-fan/accumulator. (cherry picked from commit 4d5de4d303a773b1c18c350072344bd7efca9fc4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 May 2018, 11:21:46 UTC
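A hedged sketch of the idea behind this fix, using a simplified stand-in rather than the real `LegacyAccumulatorWrapper`: decide `isZero` by reference equality against the captured zero value, so correctness does not depend on the user's `AccumulableParam` type defining equals/hashCode.

```scala
class LegacyWrapperSketch[R <: AnyRef](zero: R) {
  private var current: R = zero
  // True only while the wrapper still holds the original zero reference;
  // does not rely on the user type implementing equals/hashCode.
  def isZero: Boolean = current eq zero
  def setValue(v: R): Unit = { current = v }
}
```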
2d751fb [SPARK-23438][DSTREAMS] Fix DStreams data loss with WAL when driver crashes There is a race condition introduced in SPARK-11141 which could cause data loss. The problem is that the ReceivedBlockTracker.insertAllocatedBatch function assumes that all blocks from streamIdToUnallocatedBlockQueues are allocated to the batch and clears the whole queue. In this PR only the allocated blocks are removed from the queue, which prevents the data loss. Tested with an additional unit test and manually. Author: Gabor Somogyi <gabor.g.somogyi@gmail.com> Closes #20620 from gaborgsomogyi/SPARK-23438. (cherry picked from commit b308182f233b8840dfe0e6b5736d2f2746f40757) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 26 February 2018, 16:57:21 UTC
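A simplified sketch of the behavior change (illustrative types, not the actual `ReceivedBlockTracker` code): remove only the blocks that were allocated to the batch instead of clearing the whole queue, so blocks that arrived after allocation are kept.

```scala
import scala.collection.mutable

object AllocationSketch {
  type Block = String

  // Remove exactly the blocks that were allocated to the batch; anything that
  // arrived afterwards stays queued instead of being dropped by a clear().
  def insertAllocatedBatch(queue: mutable.Queue[Block], allocated: Seq[Block]): Unit =
    allocated.foreach { b => queue.dequeueFirst(_ == b) }
}
```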
24fe6eb [SPARK-23053][CORE][BRANCH-2.1] taskBinarySerialization and task partitions calculate in DagScheduler.submitMissingTasks should keep the same RDD checkpoint status ## What changes were proposed in this pull request? This PR backports [#20244](https://github.com/apache/spark/pull/20244) When we run concurrent jobs using the same rdd which is marked to do checkpoint. If one job has finished running the job, and start the process of RDD.doCheckpoint, while another job is submitted, then submitStage and submitMissingTasks will be called. In [submitMissingTasks](https://github.com/apache/spark/blob/branch-2.1/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L932), will serialize taskBinaryBytes and calculate task partitions which are both affected by the status of checkpoint, if the former is calculated before doCheckpoint finished, while the latter is calculated after doCheckpoint finished, when run task, rdd.compute will be called, for some rdds with particular partition type such as [UnionRDD](https://github.com/apache/spark/blob/branch-2.1/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala) who will do partition type cast, will get a ClassCastException because the part params is actually a CheckpointRDDPartition. This error occurs because rdd.doCheckpoint occurs in the same thread that called sc.runJob, while the task serialization occurs in the DAGSchedulers event loop. ## How was this patch tested? the exist tests. Author: huangtengfei <huangtengfei@huangtengfeideMacBook-Pro.local> Closes #20635 from ivoson/branch-2.1-23053. 21 February 2018, 03:01:45 UTC
1df8020 [SPARK-22862] Docs on lazy elimination of columns missing from an encoder This behavior has confused some users, so lets clarify it. Author: Michael Armbrust <michael@databricks.com> Closes #20048 from marmbrus/datasetAsDocs. (cherry picked from commit 8df1da396f64bb7fe76d73cd01498fdf3b8ed964) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 22 December 2017, 05:38:56 UTC
0f68621 [SPARK-22688][SQL] Upgrade Janino version to 3.0.8 This PR upgrade Janino version to 3.0.8. [Janino 3.0.8](https://janino-compiler.github.io/janino/changelog.html) includes an important fix to reduce the number of constant pool entries by using 'sipush' java bytecode. * SIPUSH bytecode is not used for short integer constant [#33](https://github.com/janino-compiler/janino/issues/33). Please see detail in [this discussion thread](https://github.com/apache/spark/pull/19518#issuecomment-346674976). Existing tests Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19890 from kiszk/SPARK-22688. (cherry picked from commit 8ae004b4602266d1f210e4c1564246d590412c06) Signed-off-by: Sean Owen <sowen@cloudera.com> 08 December 2017, 03:31:40 UTC
768b34a [SPARK-22373][BUILD][FOLLOWUP][BRANCH-2.1] Updates other dependency lists too for Janino ## What changes were proposed in this pull request? This PR updates other missed lists in branch-2.1 for dependency update of Janino. ## How was this patch tested? Jenkins tests Author: hyukjinkwon <gurwls223@gmail.com> Closes #19866 from HyukjinKwon/branch-2.1-SPARK-22373. 02 December 2017, 15:36:52 UTC
4e3680f [SPARK-22373] Bump Janino dependency version to fix thread safety issue… … with Janino when compiling generated code. ## What changes were proposed in this pull request? Bump up Janino dependency version to fix thread safety issue during compiling generated code ## How was this patch tested? Check https://issues.apache.org/jira/browse/SPARK-22373 for details. Converted part of the code in CodeGenerator into a standalone application, so the issue can be consistently reproduced locally. Verified that changing Janino dependency version resolved this issue. Author: Min Shen <mshen@linkedin.com> Closes #19839 from Victsm/SPARK-22373. (cherry picked from commit 7da1f5708cc96c18ddb3acd09542621275e71d83) Signed-off-by: Sean Owen <sowen@cloudera.com> 01 December 2017, 01:25:03 UTC
a02a8bd [SPARK-22548][SQL] Incorrect nested AND expression pushed down to JDBC data source ## What changes were proposed in this pull request? Let’s say I have a nested AND expression shown below and p2 can not be pushed down, (p1 AND p2) OR p3 In current Spark code, during data source filter translation, (p1 AND p2) is returned as p1 only and p2 is simply lost. This issue occurs with JDBC data source and is similar to [SPARK-12218](https://github.com/apache/spark/pull/10362) for Parquet. When we have AND nested below another expression, we should either push both legs or nothing. Note that: - The current Spark code will always split conjunctive predicate before it determines if a predicate can be pushed down or not - If I have (p1 AND p2) AND p3, it will be split into p1, p2, p3. There won't be nested AND expression. - The current Spark code logic for OR is OK. It either pushes both legs or nothing. The same translation method is also called by Data Source V2. ## How was this patch tested? Added new unit test cases to JDBCSuite gatorsmile Author: Jia Li <jiali@us.ibm.com> Closes #19776 from jliwork/spark-22548. (cherry picked from commit 881c5c807304a305ef96e805d51afbde097f7f4f) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 22 November 2017, 01:30:35 UTC
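The "push both legs or nothing" rule for a nested AND can be sketched with a toy predicate type (hypothetical names, not Spark's actual filter-translation code):

```scala
sealed trait Pred
case class Leaf(name: String, pushable: Boolean) extends Pred
case class And(left: Pred, right: Pred) extends Pred
case class Or(left: Pred, right: Pred) extends Pred

object PushdownSketch {
  // Translate a predicate for the data source; return None unless *all* of it
  // can be pushed, so (p1 AND p2) is never silently reduced to just p1.
  def translate(p: Pred): Option[String] = p match {
    case Leaf(n, ok) => if (ok) Some(n) else None
    case And(l, r)   => for (a <- translate(l); b <- translate(r)) yield s"($a AND $b)"
    case Or(l, r)    => for (a <- translate(l); b <- translate(r)) yield s"($a OR $b)"
  }
}
```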
7bdad58 [SPARK-22377][BUILD] Use /usr/sbin/lsof if lsof does not exists in release-build.sh ## What changes were proposed in this pull request? This PR proposes to use `/usr/sbin/lsof` if `lsof` is missing in the path to fix nightly snapshot jenkins jobs. Please refer https://github.com/apache/spark/pull/19359#issuecomment-340139557: > Looks like some of the snapshot builds are having lsof issues: > > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.1-maven-snapshots/182/console > >https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.2-maven-snapshots/134/console > >spark-build/dev/create-release/release-build.sh: line 344: lsof: command not found >usage: kill [ -s signal | -p ] [ -a ] pid ... >kill -l [ signal ] Up to my knowledge, the full path of `lsof` is required for non-root user in few OSs. ## How was this patch tested? Manually tested as below: ```bash #!/usr/bin/env bash LSOF=lsof if ! hash $LSOF 2>/dev/null; then echo "a" LSOF=/usr/sbin/lsof fi $LSOF -P | grep "a" ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #19695 from HyukjinKwon/SPARK-22377. (cherry picked from commit c8b7f97b8a58bf4a9f6e3a07dd6e5b0f646d8d99) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 13 November 2017, 23:28:43 UTC
ca19271 [SPARK-22327][SPARKR][TEST][BACKPORT-2.1] check for version warning ## What changes were proposed in this pull request? Will need to port this to branch-~~1.6~~, -2.0, -2.1, -2.2 ## How was this patch tested? manually; Jenkins, AppVeyor Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #19620 from felixcheung/rcranversioncheck21. 08 November 2017, 04:57:39 UTC
acef224 [SPARK-22429][STREAMING] Streaming checkpointing code does not retry after failure ## What changes were proposed in this pull request? SPARK-14930/SPARK-13693 put in a change to set the fs object to null after a failure, however the retry loop does not include initialization. Moved fs initialization inside the retry while loop to aid recoverability. ## How was this patch tested? Passes all existing unit tests. Author: Tristan Stevens <tristan@cloudera.com> Closes #19645 from tmgstevens/SPARK-22429. (cherry picked from commit fe258a7963361c1f31bc3dc3a2a2ee4a5834bb58) Signed-off-by: Sean Owen <sowen@cloudera.com> 05 November 2017, 09:11:05 UTC
8b57211 [MINOR][DOC] automatic type inference supports also Date and Timestamp ## What changes were proposed in this pull request? Easy fix in the documentation, which is reporting that only numeric types and string are supported in type inference for partition columns, while Date and Timestamp are supported too since 2.1.0, thanks to SPARK-17388. ## How was this patch tested? n/a Author: Marco Gaido <mgaido@hortonworks.com> Closes #19628 from mgaido91/SPARK-22398. (cherry picked from commit b04eefae49b96e2ef5a8d75334db29ef4e19ce58) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 02 November 2017, 00:30:35 UTC
3d6d889 [SPARK-22344][SPARKR] Set java.io.tmpdir for SparkR tests This PR sets the java.io.tmpdir for CRAN checks and also disables the hsperfdata for the JVM when running CRAN checks. Together this prevents files from being left behind in `/tmp` Tested manually on a clean EC2 machine Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #19589 from shivaram/sparkr-tmpdir-clean. (cherry picked from commit 1fe27612d7bcb8b6478a36bc16ddd4802e4ee2fc) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 30 October 2017, 01:56:51 UTC
aa023fd [SPARK-17902][R] Revive stringsAsFactors option for collect() in SparkR ## What changes were proposed in this pull request? This PR proposes to revive `stringsAsFactors` option in collect API, which was mistakenly removed in https://github.com/apache/spark/commit/71a138cd0e0a14e8426f97877e3b52a562bbd02c. Simply, it casts `charactor` to `factor` if it meets the condition, `stringsAsFactors && is.character(vec)` in primitive type conversion. ## How was this patch tested? Unit test in `R/pkg/tests/fulltests/test_sparkSQL.R`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19551 from HyukjinKwon/SPARK-17902. (cherry picked from commit a83d8d5adcb4e0061e43105767242ba9770dda96) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 26 October 2017, 11:55:14 UTC
3e77b74 [SPARK-21991][LAUNCHER][FOLLOWUP] Fix java lint ## What changes were proposed in this pull request? Fix java lint ## How was this patch tested? Run `./dev/lint-java` Author: Andrew Ash <andrew@andrewash.com> Closes #19574 from ash211/aash/fix-java-lint. (cherry picked from commit 5433be44caecaeef45ed1fdae10b223c698a9d14) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 October 2017, 21:41:29 UTC
dc09714 [SPARK-21991][LAUNCHER] Fix race condition in LauncherServer#acceptConnections ## What changes were proposed in this pull request? This patch changes the order in which _acceptConnections_ starts the client thread and schedules the client timeout action ensuring that the latter has been scheduled before the former get a chance to cancel it. ## How was this patch tested? Due to the non-deterministic nature of the patch I wasn't able to add a new test for this issue. Author: Andrea zito <andrea.zito@u-hopper.com> Closes #19217 from nivox/SPARK-21991. (cherry picked from commit 6ea8a56ca26a7e02e6574f5f763bb91059119a80) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 October 2017, 17:10:49 UTC
5a4072b [SPARK-21551][PYTHON] Increase timeout for PythonRDD.serveIterator Backport of https://github.com/apache/spark/pull/18752 (https://issues.apache.org/jira/browse/SPARK-21551) (cherry picked from commit 9d3c6640f56e3e4fd195d3ad8cead09df67a72c7) Author: peay <peay@protonmail.com> Closes #19513 from FRosner/branch-2.1. 20 October 2017, 07:29:10 UTC
eb00037 [SPARK-22273][SQL] Fix key/value schema field names in HashMapGenerators. ## What changes were proposed in this pull request? When fixing schema field names using escape characters with `addReferenceMinorObj()` at [SPARK-18952](https://issues.apache.org/jira/browse/SPARK-18952) (#16361), double-quotes around the names were remained and the names become something like `"((java.lang.String) references[1])"`. ```java /* 055 */ private int maxSteps = 2; /* 056 */ private int numRows = 0; /* 057 */ private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add("((java.lang.String) references[1])", org.apache.spark.sql.types.DataTypes.StringType); /* 058 */ private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add("((java.lang.String) references[2])", org.apache.spark.sql.types.DataTypes.LongType); /* 059 */ private Object emptyVBase; ``` We should remove the double-quotes to refer the values in `references` properly: ```java /* 055 */ private int maxSteps = 2; /* 056 */ private int numRows = 0; /* 057 */ private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add(((java.lang.String) references[1]), org.apache.spark.sql.types.DataTypes.StringType); /* 058 */ private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add(((java.lang.String) references[2]), org.apache.spark.sql.types.DataTypes.LongType); /* 059 */ private Object emptyVBase; ``` ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #19491 from ueshin/issues/SPARK-22273. (cherry picked from commit e0503a7223410289d01bc4b20da3a451730577da) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 14 October 2017, 06:25:21 UTC
920372a [SPARK-22206][SQL][SPARKR] gapply in R can't work on empty grouping columns ## What changes were proposed in this pull request? Looks like `FlatMapGroupsInRExec.requiredChildDistribution` didn't consider empty grouping attributes. It should be a problem when running `EnsureRequirements` and `gapply` in R can't work on empty grouping columns. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19436 from viirya/fix-flatmapinr-distribution. (cherry picked from commit ae61f187aa0471242c046fdeac6ed55b9b98a3f6) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 05 October 2017, 14:37:22 UTC
5bb4e93 [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration throws NPE ## What changes were proposed in this pull request? Fix for SPARK-20466, full description of the issue in the JIRA. To summarize, `HadoopRDD` uses a metadata cache to cache `JobConf` objects. The cache uses soft-references, which means the JVM can delete entries from the cache whenever there is GC pressure. `HadoopRDD#getJobConf` had a bug where it would check if the cache contained the `JobConf`, if it did it would get the `JobConf` from the cache and return it. This doesn't work when soft-references are used as the JVM can delete the entry between the existence check and the get call. ## How was this patch tested? Haven't thought of a good way to test this yet given the issue only occurs sometimes, and happens during high GC pressure. Was thinking of using mocks to verify `#getJobConf` is doing the right thing. I deleted the method `HadoopRDD#containsCachedMetadata` so that we don't hit this issue again. Author: Sahil Takiar <stakiar@cloudera.com> Closes #19413 from sahilTakiar/master. (cherry picked from commit e36ec38d89472df0dfe12222b6af54cd6eea8e98) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 03 October 2017, 23:53:58 UTC
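The race described here is the generic "check then get" problem on a cache whose entries can disappear at any time. A sketch of the single-lookup pattern that avoids it (an illustration only, not the actual `HadoopRDD#getJobConf` code):

```scala
object CacheLookupSketch {
  // With a soft-reference cache, the entry can vanish between the two calls in
  //   if (cache.contains(key)) cache.get(key)   // may come back empty
  // A single lookup avoids the race:
  def getOrCompute[K, V](cache: scala.collection.concurrent.Map[K, V], key: K)(compute: => V): V =
    cache.get(key) match {
      case Some(v) => v
      case None =>
        val v = compute
        cache.putIfAbsent(key, v).getOrElse(v)
    }
}
```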
49e8ccc Preparing development version 2.1.3-SNAPSHOT 02 October 2017, 18:57:24 UTC
2abaea9 Preparing Spark release v2.1.2-rc4 02 October 2017, 18:57:15 UTC
ff4179b [SPARK-22167][R][BUILD] sparkr packaging issue allow zinc ## What changes were proposed in this pull request? When zinc is running the pwd might be in the root of the project. A quick solution to this is to not go a level up incase we are in the root rather than root/core/. If we are in the root everything works fine, if we are in core add a script which goes and runs the level up ## How was this patch tested? set -x in the SparkR install scripts. Author: Holden Karau <holden@us.ibm.com> Closes #19402 from holdenk/SPARK-22167-sparkr-packaging-issue-allow-zinc. (cherry picked from commit 8fab7995d36c7bc4524393b20a4e524dbf6bbf62) Signed-off-by: Holden Karau <holden@us.ibm.com> 02 October 2017, 18:47:36 UTC
81e4008 [SPARK-18971][CORE] Upgrade Netty to 4.0.43.Final ## What changes were proposed in this pull request? Upgrade Netty to `4.0.43.Final` to add the fix for https://github.com/netty/netty/issues/6153 ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16568 from zsxwing/SPARK-18971. (cherry picked from commit a8567e34dc77a32ddeb280f8f9f603f301722059) Signed-off-by: Sean Owen <sowen@cloudera.com> # Conflicts: # dev/deps/spark-deps-hadoop-2.2 # dev/deps/spark-deps-hadoop-2.3 # dev/deps/spark-deps-hadoop-2.4 # dev/deps/spark-deps-hadoop-2.6 # dev/deps/spark-deps-hadoop-2.7 02 October 2017, 18:33:21 UTC
78661f9 Preparing development version 2.1.3-SNAPSHOT 29 September 2017, 16:04:35 UTC
efdbef4 Preparing Spark release v2.1.2-rc3 29 September 2017, 16:04:26 UTC
60f78c2 [SPARK-22129][SPARK-22138] Release script improvements ## What changes were proposed in this pull request? Use the GPG_KEY param, fix lsof to non-hardcoded path, remove version swap since it wasn't really needed. Use EXPORT on JAVA_HOME for downstream scripts as well. ## How was this patch tested? Rolled 2.1.2 RC2 Author: Holden Karau <holden@us.ibm.com> Closes #19359 from holdenk/SPARK-22129-fix-signing. (cherry picked from commit ecbe416ab5001b32737966c5a2407597a1dafc32) Signed-off-by: Holden Karau <holden@us.ibm.com> 29 September 2017, 15:04:38 UTC
361aa0e [SPARK-22083][CORE] Release locks in MemoryStore.evictBlocksToFreeSpace ## What changes were proposed in this pull request? MemoryStore.evictBlocksToFreeSpace acquires write locks for all the blocks it intends to evict up front. If there is a failure to evict blocks (eg., some failure dropping a block to disk), then we have to release the lock. Otherwise the lock is never released and an executor trying to get the lock will wait forever. ## How was this patch tested? Added unit test. Author: Imran Rashid <irashid@cloudera.com> Closes #19311 from squito/SPARK-22083. (cherry picked from commit 2c5b9b1173c23f6ca8890817a9a35dc7557b0776) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 September 2017, 19:02:51 UTC
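A simplified sketch of the locking discipline this fix is about (not the real `MemoryStore` code, which also handles lock downgrades): any block locked for eviction but not actually dropped must be unlocked before returning, otherwise other tasks wait on it forever.

```scala
object EvictionSketch {
  trait LockManager {
    def lockForWriting(blockId: String): Boolean
    def unlock(blockId: String): Unit
  }

  // Acquire write locks up front; whatever was locked but not successfully
  // evicted is released in the finally block.
  def evict(candidates: Seq[String], locks: LockManager)(drop: String => Boolean): Seq[String] = {
    val locked  = candidates.filter(locks.lockForWriting)
    val dropped = scala.collection.mutable.ArrayBuffer.empty[String]
    try {
      locked.foreach { id => if (drop(id)) dropped += id }
      dropped.toSeq
    } finally {
      locked.filterNot(dropped.contains).foreach(locks.unlock)
    }
  }
}
```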
0b3e7cc [SPARK-18136] Fix SPARK_JARS_DIR for Python pip install on Windows ## What changes were proposed in this pull request? Fix for setup of `SPARK_JARS_DIR` on Windows as it looks for `%SPARK_HOME%\RELEASE` file instead of `%SPARK_HOME%\jars` as it should. RELEASE file is not included in the `pip` build of PySpark. ## How was this patch tested? Local install of PySpark on Anaconda 4.4.0 (Python 3.6.1). Author: Jakub Nowacki <j.s.nowacki@gmail.com> Closes #19310 from jsnowacki/master. (cherry picked from commit c11f24a94007bbaad0835645843e776507094071) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 23 September 2017, 12:05:04 UTC
03db721 Preparing development version 2.1.3-SNAPSHOT 22 September 2017, 15:07:45 UTC
fabbb7f Preparing Spark release v2.1.2-rc2 22 September 2017, 15:07:37 UTC
d930bbb [SPARK-22072][SPARK-22071][BUILD] Improve release build scripts ## What changes were proposed in this pull request? Check JDK version (with javac) and use SPARK_VERSION for publish-release ## How was this patch tested? Manually tried local build with wrong JDK / JAVA_HOME & built a local release (LFTP disabled) Author: Holden Karau <holden@us.ibm.com> Closes #19312 from holdenk/improve-release-scripts-r2. (cherry picked from commit 8f130ad40178e35fecb3f2ba4a61ad23e6a90e3d) Signed-off-by: Holden Karau <holden@us.ibm.com> 22 September 2017, 07:15:27 UTC
1a4b6ee [SPARK-21928][CORE] Set classloader on SerializerManager's private kryo ## What changes were proposed in this pull request? We have to make sure that SerializerManager's private instance of kryo also uses the right classloader, regardless of the current thread classloader. In particular, this fixes serde during remote cache fetches, as those occur in netty threads. ## How was this patch tested? Manual tests & existing suite via jenkins. I haven't been able to reproduce this is in a unit test, because when a remote RDD partition can be fetched, there is a warning message and then the partition is just recomputed locally. I manually verified the warning message is no longer present. (cherry picked from commit b75bd1777496ce0354458bf85603a8087a6a0ff8) Author: Imran Rashid <irashid@cloudera.com> Closes #19313 from squito/SPARK-21928_2.1. 21 September 2017, 22:47:23 UTC
56865a1 [SPARK-19318][SPARK-22041][SPARK-16625][BACKPORT-2.1][SQL] Docker test case failure: `: General data types to be mapped to Oracle` ## What changes were proposed in this pull request? This PR is backport of https://github.com/apache/spark/pull/16891 to Spark 2.1. ## How was this patch tested? unit tests Author: Yuming Wang <wgyumg@gmail.com> Closes #19259 from wangyum/SPARK-22041-BACKPORT-2.1. 19 September 2017, 23:55:13 UTC
30ce056 [SPARK-22052] Incorrect Metric assigned in MetricsReporter.scala Current implementation for processingRate-total uses wrong metric: mistakenly uses inputRowsPerSecond instead of processedRowsPerSecond ## What changes were proposed in this pull request? Adjust processingRate-total from using inputRowsPerSecond to processedRowsPerSecond ## How was this patch tested? Built spark from source with proposed change and tested output with correct parameter. Before change the csv metrics file for inputRate-total and processingRate-total displayed the same values due to the error. After changing MetricsReporter.scala the processingRate-total csv file displayed the correct metric. <img width="963" alt="processed rows per second" src="https://user-images.githubusercontent.com/32072374/30554340-82eea12c-9ca4-11e7-8370-8168526ff9a2.png"> Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Taaffy <32072374+Taaffy@users.noreply.github.com> Closes #19268 from Taaffy/patch-1. (cherry picked from commit 1bc17a6b8add02772a8a0a1048ac6a01d045baf4) Signed-off-by: Sean Owen <sowen@cloudera.com> 19 September 2017, 09:20:25 UTC
b35136a [SPARK-22043][PYTHON] Improves error message for show_profiles and dump_profiles ## What changes were proposed in this pull request? This PR proposes to improve error message from: ``` >>> sc.show_profiles() Traceback (most recent call last): File "<stdin>", line 1, in <module> File ".../spark/python/pyspark/context.py", line 1000, in show_profiles self.profiler_collector.show_profiles() AttributeError: 'NoneType' object has no attribute 'show_profiles' >>> sc.dump_profiles("/tmp/abc") Traceback (most recent call last): File "<stdin>", line 1, in <module> File ".../spark/python/pyspark/context.py", line 1005, in dump_profiles self.profiler_collector.dump_profiles(path) AttributeError: 'NoneType' object has no attribute 'dump_profiles' ``` to ``` >>> sc.show_profiles() Traceback (most recent call last): File "<stdin>", line 1, in <module> File ".../spark/python/pyspark/context.py", line 1003, in show_profiles raise RuntimeError("'spark.python.profile' configuration must be set " RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python profile. >>> sc.dump_profiles("/tmp/abc") Traceback (most recent call last): File "<stdin>", line 1, in <module> File ".../spark/python/pyspark/context.py", line 1012, in dump_profiles raise RuntimeError("'spark.python.profile' configuration must be set " RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python profile. ``` ## How was this patch tested? Unit tests added in `python/pyspark/tests.py` and manual tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19260 from HyukjinKwon/profile-errors. (cherry picked from commit 7c7266208a3be984ac1ce53747dc0c3640f4ecac) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 18 September 2017, 04:20:48 UTC
99de4b8 [SPARK-21953] Show both memory and disk bytes spilled if either is present As written now, there must be both memory and disk bytes spilled to show either of them. If there is only one of those types of spill recorded, it will be hidden. Author: Andrew Ash <andrew@andrewash.com> Closes #19164 from ash211/patch-3. (cherry picked from commit 6308c65f08b507408033da1f1658144ea8c1491f) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 September 2017, 02:43:21 UTC
3ae7ab8 [SPARK-21985][PYSPARK] PairDeserializer is broken for double-zipped RDDs ## What changes were proposed in this pull request? (edited) Fixes a bug introduced in #16121 In PairDeserializer convert each batch of keys and values to lists (if they do not have `__len__` already) so that we can check that they are the same size. Normally they already are lists so this should not have a performance impact, but this is needed when repeated `zip`'s are done. ## How was this patch tested? Additional unit test Author: Andrew Ray <ray.andrew@gmail.com> Closes #19226 from aray/SPARK-21985. (cherry picked from commit 6adf67dd14b0ece342bb91adf800df0a7101e038) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 17 September 2017, 17:47:06 UTC
e49c997 Preparing development version 2.1.3-SNAPSHOT 14 September 2017, 02:34:45 UTC
6f47032 Preparing Spark release v2.1.2-rc1 14 September 2017, 02:34:41 UTC
e7696eb [SPARK-21976][DOC] Fix wrong documentation for Mean Absolute Error. ## What changes were proposed in this pull request? Fixed wrong documentation for Mean Absolute Error. Even though the code is correct for the MAE: ```scala Since("1.2.0") def meanAbsoluteError: Double = { summary.normL1(1) / summary.count } ``` In the documentation the division by N is missing. ## How was this patch tested? All of spark tests were run. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: FavioVazquez <favio.vazquezp@gmail.com> Author: faviovazquez <favio.vazquezp@gmail.com> Author: Favio André Vázquez <favio.vazquezp@gmail.com> Closes #19190 from FavioVazquez/mae-fix. (cherry picked from commit e2ac2f1c71a0f8b03743d0d916dc0ef28482a393) Signed-off-by: Sean Owen <sowen@cloudera.com> 12 September 2017, 09:33:54 UTC
ae4e8ae [SPARKR][BACKPORT-2.1] backporting package and test changes ## What changes were proposed in this pull request? cherrypick or manually porting changes to 2.1 ## How was this patch tested? Jenkins Author: Felix Cheung <felixcheung_m@hotmail.com> Author: hyukjinkwon <gurwls223@gmail.com> Author: Wayne Zhang <actuaryzhang@uber.com> Closes #19165 from felixcheung/rbackportpkg21. 10 September 2017, 17:24:46 UTC
6a8a726 [SPARK-21950][SQL][PYTHON][TEST] pyspark.sql.tests.SQLTests2 should stop SparkContext. ## What changes were proposed in this pull request? `pyspark.sql.tests.SQLTests2` doesn't stop newly created spark context in the test and it might affect the following tests. This pr makes `pyspark.sql.tests.SQLTests2` stop `SparkContext`. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #19158 from ueshin/issues/SPARK-21950. (cherry picked from commit 57bc1e9eb452284cbed090dbd5008eb2062f1b36) Signed-off-by: Takuya UESHIN <ueshin@databricks.com> 08 September 2017, 05:26:48 UTC
041eccb [SPARK-21834] Incorrect executor request in case of dynamic allocation ## What changes were proposed in this pull request? killExecutor api currently does not allow killing an executor without updating the total number of executors needed. In case of dynamic allocation is turned on and the allocator tries to kill an executor, the scheduler reduces the total number of executors needed ( see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635) which is incorrect because the allocator already takes care of setting the required number of executors itself. ## How was this patch tested? Ran a job on the cluster and made sure the executor request is correct Author: Sital Kedia <skedia@fb.com> Closes #19081 from sitalkedia/skedia/oss_fix_executor_allocation. (cherry picked from commit 6949a9c5c6120fdde1b63876ede661adbd1eb15e) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 30 August 2017, 21:19:33 UTC
5769753 [SPARK-21826][SQL][2.1][2.0] outer broadcast hash join should not throw NPE backport https://github.com/apache/spark/pull/19036 to branch 2.1 and 2.0 Author: Wenchen Fan <wenchen@databricks.com> Closes #19040 from cloud-fan/bug. 24 August 2017, 17:36:37 UTC
3d3be4d [SPARK-21721][SQL][BACKPORT-2.1][FOLLOWUP] Clear FileSystem deleteOnExit cache when paths are successfully removed ## What changes were proposed in this pull request? Fix a typo in test. ## How was this patch tested? Jenkins tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19006 from viirya/SPARK-21721-backport-2.1-followup. 20 August 2017, 13:57:27 UTC
2394ae2 [MINOR] Correct validateAndTransformSchema in GaussianMixture and AFTSurvivalRegression ## What changes were proposed in this pull request? The line SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType) did not modify the variable schema, hence only the last line had any effect. A temporary variable is used to correctly append the two columns predictionCol and probabilityCol. ## How was this patch tested? Manually. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Cédric Pelvet <cedric.pelvet@gmail.com> Closes #18980 from sharp-pixel/master. (cherry picked from commit 73e04ecc4f29a0fe51687ed1337c61840c976f89) Signed-off-by: Sean Owen <sowen@cloudera.com> 20 August 2017, 10:06:14 UTC
6f366fb [SPARK-21721][SQL][BACKPORT-2.1] Clear FileSystem deleteOnExit cache when paths are successfully removed ## What changes were proposed in this pull request? Backport SPARK-21721 to branch 2.1: We put staging path to delete into the deleteOnExit cache of FileSystem in case of the path can't be successfully removed. But when we successfully remove the path, we don't remove it from the cache. We should do it to avoid continuing grow the cache size. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18947 from viirya/SPARK-21721-backport-2.1. 15 August 2017, 15:48:00 UTC
9b749b6 [SPARK-21306][ML] For branch 2.1, OneVsRest should support setWeightCol The PR is related to #18554, and is modified for branch 2.1. ## What changes were proposed in this pull request? add `setWeightCol` method for OneVsRest. `weightCol` is ignored if classifier doesn't inherit HasWeightCol trait. ## How was this patch tested? + [x] add an unit test. Author: Yan Facai (颜发才) <facai.yan@gmail.com> Closes #18763 from facaiy/BUG/branch-2.1_OneVsRest_support_setWeightCol. 08 August 2017, 03:05:36 UTC
444cca1 [SPARK-18535][SPARK-19720][CORE][BACKPORT-2.1] Redact sensitive information ## What changes were proposed in this pull request? Backporting SPARK-18535 and SPARK-19720 to spark 2.1 It's a backport PR that redacts senstive information by configuration to Spark UI and Spark Submit console logs. Using reference from Mark Grover markapache.org PRs ## How was this patch tested? Same tests from PR applied Author: Mark Grover <mark@apache.org> Closes #18802 from dmvieira/feature-redact. 07 August 2017, 21:23:05 UTC
5634fad [SPARK-21588][SQL] SQLContext.getConf(key, null) should return null ## What changes were proposed in this pull request? SQLContext.getConf(key, null) throws an NPE for a key that is not defined in the conf and doesn't have a default value defined. This happens only when the conf entry has a value converter. Added a null check on defaultValue inside SQLConf.getConfString to avoid calling entry.valueConverter(defaultValue). ## How was this patch tested? Added unit test Author: vinodkc <vinod.kc.in@gmail.com> Closes #18852 from vinodkc/br_Fix_SPARK-21588. (cherry picked from commit 1ba967b25e6d88be2db7a4e100ac3ead03a2ade9) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 06 August 2017, 06:05:09 UTC
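A hedged sketch of the guard described above, with a simplified conf-entry type standing in for the real `SQLConf` internals:

```scala
object ConfSketch {
  final case class ConfEntry[T](key: String, valueConverter: String => T)

  // Only run the converter when a non-null default was supplied; a null default
  // simply falls through instead of blowing up inside the converter.
  def getConfString[T](settings: Map[String, String],
                       entry: ConfEntry[T],
                       defaultValue: String): String =
    settings.get(entry.key).getOrElse {
      if (defaultValue != null) {
        entry.valueConverter(defaultValue) // validate the default
      }
      defaultValue
    }
}
```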
734b144 [SPARK-21330][SQL] Bad partitioning does not allow to read a JDBC table with extreme values on the partition column ## What changes were proposed in this pull request? An overflow of the difference of bounds on the partitioning column leads to no data being read. This patch checks for this overflow. ## How was this patch tested? New unit test. Author: Andrew Ray <ray.andrew@gmail.com> Closes #18800 from aray/SPARK-21330. (cherry picked from commit 25826c77ddf0d5753d2501d0e764111da2caa8b6) Signed-off-by: Sean Owen <sowen@cloudera.com> 04 August 2017, 07:58:27 UTC
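The underlying arithmetic issue is that `upperBound - lowerBound` overflows Long for extreme bounds. One overflow-safe way to compute a stride is sketched below; this illustrates the problem rather than reproducing the exact logic of the patch:

```scala
object StrideSketch {
  // (upper - lower) overflows for extreme bounds such as Long.MinValue/MaxValue;
  // doing the subtraction in BigDecimal keeps the stride meaningful.
  def stride(lowerBound: Long, upperBound: Long, numPartitions: Int): Long = {
    val width = BigDecimal(upperBound) - BigDecimal(lowerBound)
    (width / BigDecimal(numPartitions)).toLong
  }

  def main(args: Array[String]): Unit =
    println(stride(Long.MinValue, Long.MaxValue, 4)) // positive, not a negative overflowed value
}
```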
d93e45b [SPARK-12717][PYTHON][BRANCH-2.1] Adding thread-safe broadcast pickle registry ## What changes were proposed in this pull request? When using PySpark broadcast variables in a multi-threaded environment, `SparkContext._pickled_broadcast_vars` becomes a shared resource. A race condition can occur when broadcast variables that are pickled from one thread get added to the shared ` _pickled_broadcast_vars` and become part of the python command from another thread. This PR introduces a thread-safe pickled registry using thread local storage so that when python command is pickled (causing the broadcast variable to be pickled and added to the registry) each thread will have their own view of the pickle registry to retrieve and clear the broadcast variables used. ## How was this patch tested? Added a unit test that causes this race condition using another thread. Author: Bryan Cutler <cutlerb@gmail.com> Closes #18825 from BryanCutler/pyspark-bcast-threadsafe-SPARK-12717-2_1. 03 August 2017, 01:35:56 UTC
b31b302 [SPARK-21522][CORE] Fix flakiness in LauncherServerSuite. Handle the case where the server closes the socket before the full message has been written by the client. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18727 from vanzin/SPARK-21522. (cherry picked from commit b133501800b43fa5c538a4e5ad597c9dc7d8378e) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 01 August 2017, 17:06:25 UTC
78f7cdf [SPARK-21555][SQL] RuntimeReplaceable should be compared semantically by its canonicalized child ## What changes were proposed in this pull request? When there are aliases (these aliases were added for nested fields) as parameters in `RuntimeReplaceable`, as they are not in the children expression, those aliases can't be cleaned up in analyzer rule `CleanupAliases`. An expression `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a group by query because they contain different aliases. Because those aliases are not children of `RuntimeReplaceable` which is an `UnaryExpression`. So we can't trim the aliases out by simple transforming the expressions in `CleanupAliases`. If we want to replace the non-children aliases in `RuntimeReplaceable`, we need to add more codes to `RuntimeReplaceable` and modify all expressions of `RuntimeReplaceable`. It makes the interface ugly IMO. Consider those aliases will be replaced later at optimization and so they're no harm, this patch chooses to simply override `canonicalized` of `RuntimeReplaceable`. One concern is about `CleanupAliases`. Because it actually cannot clean up ALL aliases inside a plan. To make caller of this rule notice that, this patch adds a comment to `CleanupAliases`. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18761 from viirya/SPARK-21555. (cherry picked from commit 9c8109ef414c92553335bb1e90e9681e142128a4) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 29 July 2017, 17:04:12 UTC
258ca40 Revert "[SPARK-21306][ML] OneVsRest should support setWeightCol" This reverts commit 8520d7c6d5e880dea3c1a8a874148c07222b4b4b. 28 July 2017, 12:24:54 UTC
8520d7c [SPARK-21306][ML] OneVsRest should support setWeightCol ## What changes were proposed in this pull request? add `setWeightCol` method for OneVsRest. `weightCol` is ignored if classifier doesn't inherit HasWeightCol trait. ## How was this patch tested? + [x] add an unit test. Author: Yan Facai (颜发才) <facai.yan@gmail.com> Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol. (cherry picked from commit a5a3189974ea4628e9489eb50099a5432174e80c) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 28 July 2017, 02:15:59 UTC
9498798 [SPARK-21446][SQL] Fix setAutoCommit never executed ## What changes were proposed in this pull request? JIRA Issue: https://issues.apache.org/jira/browse/SPARK-21446 options.asConnectionProperties cannot contain fetchsize, because fetchsize belongs to the Spark-only options, and Spark-only options are excluded from the connection properties. So change the properties passed to beforeFetch from options.asConnectionProperties.asScala.toMap to options.asProperties.asScala.toMap ## How was this patch tested? Author: DFFuture <albert.zhang23@gmail.com> Closes #18665 from DFFuture/sparksql_pg. (cherry picked from commit c9729187bcef78299390e53cd9af38c3e084060e) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 19 July 2017, 21:45:43 UTC
ac20693 [SPARK-21441][SQL] Incorrect Codegen in SortMergeJoinExec results failures in some cases ## What changes were proposed in this pull request? https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21441 This issue can be reproduced by the following example: ``` val spark = SparkSession .builder() .appName("smj-codegen") .master("local") .config("spark.sql.autoBroadcastJoinThreshold", "1") .getOrCreate() val df1 = spark.createDataFrame(Seq((1, 1), (2, 2), (3, 3))).toDF("key", "int") val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", "str") val df = df1.join(df2, df1("key") === df2("key")) .filter("int = 2 or reflect('java.lang.Integer', 'valueOf', str) = 1") .select("int") df.show() ``` To conclude, the issue happens when: (1) SortMergeJoin condition contains CodegenFallback expressions. (2) In PhysicalPlan tree, SortMergeJoin node is the child of root node, e.g., the Project in above example. This patch fixes the logic in `CollapseCodegenStages` rule. ## How was this patch tested? Unit test and manual verification in our cluster. Author: donnyzone <wellfengzhu@gmail.com> Closes #18656 from DonnyZone/Fix_SortMergeJoinExec. (cherry picked from commit 6b6dd682e84d3b03d0b15fbd81a0d16729e521d2) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 July 2017, 13:53:07 UTC
caf32b3 [SPARK-21332][SQL] Incorrect result type inferred for some decimal expressions ## What changes were proposed in this pull request? This PR changes the direction of expression transformation in the DecimalPrecision rule. Previously, the expressions were transformed down, which led to incorrect result types when decimal expressions had other decimal expressions as their operands. The root cause of this issue was in visiting outer nodes before their children. Consider the example below: ``` val inputSchema = StructType(StructField("col", DecimalType(26, 6)) :: Nil) val sc = spark.sparkContext val rdd = sc.parallelize(1 to 2).map(_ => Row(BigDecimal(12))) val df = spark.createDataFrame(rdd, inputSchema) // Works correctly since no nested decimal expression is involved // Expected result type: (26, 6) * (26, 6) = (38, 12) df.select($"col" * $"col").explain(true) df.select($"col" * $"col").printSchema() // Gives a wrong result since there is a nested decimal expression that should be visited first // Expected result type: ((26, 6) * (26, 6)) * (26, 6) = (38, 12) * (26, 6) = (38, 18) df.select($"col" * $"col" * $"col").explain(true) df.select($"col" * $"col" * $"col").printSchema() ``` The example above gives the following output: ``` // Correct result without sub-expressions == Parsed Logical Plan == 'Project [('col * 'col) AS (col * col)#4] +- LogicalRDD [col#1] == Analyzed Logical Plan == (col * col): decimal(38,12) Project [CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS (col * col)#4] +- LogicalRDD [col#1] == Optimized Logical Plan == Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4] +- LogicalRDD [col#1] == Physical Plan == *Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4] +- Scan ExistingRDD[col#1] // Schema root |-- (col * col): decimal(38,12) (nullable = true) // Incorrect result with sub-expressions == Parsed Logical Plan == 'Project [(('col * 'col) * 'col) AS ((col * col) * col)#11] +- LogicalRDD [col#1] == Analyzed Logical Plan == ((col * col) * col): decimal(38,12) Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS ((col * col) * col)#11] +- LogicalRDD [col#1] == Optimized Logical Plan == Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11] +- LogicalRDD [col#1] == Physical Plan == *Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11] +- Scan ExistingRDD[col#1] // Schema root |-- ((col * col) * col): decimal(38,12) (nullable = true) ``` ## How was this patch tested? This PR was tested with available unit tests. Moreover, there are tests to cover previously failing scenarios. Author: aokolnychyi <anton.okolnychyi@sap.com> Closes #18583 from aokolnychyi/spark-21332. (cherry picked from commit 0be5fb41a6b7ef4da9ba36f3604ac646cb6d4ae3) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 18 July 2017, 04:08:30 UTC
a9efce4 [SPARK-19104][BACKPORT-2.1][SQL] Lambda variables in ExternalMapToCatalyst should be global ## What changes were proposed in this pull request? This PR is a backport of #18418 to Spark 2.1. [SPARK-21391](https://issues.apache.org/jira/browse/SPARK-21391) reported this problem in Spark 2.1. The issue happens in `ExternalMapToCatalyst`. For example, the following code creates `ExternalMapToCatalyst` to convert a Scala Map to the catalyst map format. ``` val data = Seq.tabulate(10)(i => NestedData(1, Map("key" -> InnerData("name", i + 100)))) val ds = spark.createDataset(data) ``` The `valueConverter` in `ExternalMapToCatalyst` looks like: ``` if (isnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true))) null else named_struct(name, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).name, true), value, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).value) ``` There is a `CreateNamedStruct` expression (`named_struct`) to create a row of `InnerData.name` and `InnerData.value` that are referred to by `ExternalMapToCatalyst_value52`. Because `ExternalMapToCatalyst_value52` is a local variable, when `CreateNamedStruct` splits expressions into individual functions, the local variable can't be accessed anymore. ## How was this patch tested? Added a new test suite into `DatasetPrimitiveSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18627 from kiszk/SPARK-21391. 18 July 2017, 01:18:32 UTC
ca4d2aa [SPARK-21344][SQL] BinaryType comparison does signed byte array comparison ## What changes were proposed in this pull request? This PR fixes a wrong comparison for `BinaryType`. This PR enables unsigned comparison and unsigned prefix generation for an array for `BinaryType`. Previous implementations uses signed operations. ## How was this patch tested? Added a test suite in `OrderingSuite`. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18571 from kiszk/SPARK-21344. (cherry picked from commit ac5d5d795909061a17e056696cf0ef87d9e65dd1) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 15 July 2017, 03:16:35 UTC
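Unsigned lexicographic comparison of byte arrays can be sketched as follows (a standalone illustration, not the actual codegen/ordering change):

```scala
object UnsignedCompareSketch {
  // Compare byte arrays lexicographically treating each byte as unsigned
  // (0xFF sorts after 0x01), which is what BinaryType ordering needs.
  def compare(a: Array[Byte], b: Array[Byte]): Int = {
    val len = math.min(a.length, b.length)
    var i = 0
    while (i < len) {
      val x = a(i) & 0xff
      val y = b(i) & 0xff
      if (x != y) return x - y
      i += 1
    }
    a.length - b.length
  }
}
```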
2c28462 [SPARK-21083][SQL][BRANCH-2.1] Store zero size and row count when analyzing empty table ## What changes were proposed in this pull request? We should be able to store zero size and row count after analyzing empty table. This is a backport for https://github.com/apache/spark/commit/9fccc3627fa41d32fbae6dbbb9bd1521e43eb4f0. ## How was this patch tested? Added new test. Author: Zhenhua Wang <wzh_zju@163.com> Closes #18577 from wzhfy/analyzeEmptyTable-2.1. 10 July 2017, 03:31:55 UTC
5e2bfd5 [SPARK-21345][SQL][TEST][TEST-MAVEN][BRANCH-2.1] SparkSessionBuilderSuite should clean up stopped sessions. ## What changes were proposed in this pull request? `SparkSessionBuilderSuite` should clean up stopped sessions. Otherwise, it leaves behind some stopped `SparkContext`s interfereing with other test suites using `ShardSQLContext`. Recently, master branch fails consequtively. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/ ## How was this patch tested? Pass the Jenkins with a updated suite. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #18572 from dongjoon-hyun/SPARK-21345-BRANCH-2.1. 09 July 2017, 03:11:02 UTC
7f7b63b [SPARK-21312][SQL] correct offsetInBytes in UnsafeRow.writeToStream ## What changes were proposed in this pull request? Corrects offsetInBytes calculation in UnsafeRow.writeToStream. Known failures include writes to some DataSources that have own SparkPlan implementations and cause EXCHANGE in writes. ## How was this patch tested? Extended UnsafeRowSuite.writeToStream to include an UnsafeRow over byte array having non-zero offset. Author: Sumedh Wale <swale@snappydata.io> Closes #18535 from sumwale/SPARK-21312. (cherry picked from commit 14a3bb3a008c302aac908d7deaf0942a98c63be7) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 July 2017, 06:48:12 UTC
8f1ca69 [SPARK-20256][SQL][BRANCH-2.1] SessionState should be created more lazily ## What changes were proposed in this pull request? `SessionState` is designed to be created lazily. However, in reality, it created immediately in `SparkSession.Builder.getOrCreate` ([here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L943)). This PR aims to recover the lazy behavior by keeping the options into `initialSessionOptions`. The benefit is like the following. Users can start `spark-shell` and use RDD operations without any problems. **BEFORE** ```scala $ bin/spark-shell java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder' ... Caused by: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.security.AccessControlException: Permission denied: user=spark, access=READ, inode="/apps/hive/warehouse":hive:hdfs:drwx------ ``` As reported in SPARK-20256, this happens when the warehouse directory is not allowed for this user. **AFTER** ```scala $ bin/spark-shell ... Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.1.2-SNAPSHOT /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131) Type in expressions to have them evaluated. Type :help for more information. scala> sc.range(0, 10, 1).count() res0: Long = 10 ``` ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #18530 from dongjoon-hyun/SPARK-20256-BRANCH-2.1. 05 July 2017, 02:29:37 UTC
3ecef24 Revert "[SPARK-21258][SQL] Fix WindowExec complex object aggregation with spilling" This reverts commit d995dac1cdeec940364453675f59ce5cf2b53684. 30 June 2017, 06:45:55 UTC
d995dac [SPARK-21258][SQL] Fix WindowExec complex object aggregation with spilling ## What changes were proposed in this pull request? `WindowExec` currently improperly stores complex objects (UnsafeRow, UnsafeArrayData, UnsafeMapData, UTF8String) during aggregation by keeping a reference in the buffer used by `GeneratedMutableProjections` to the actual input data. Things go wrong when the input object (or the backing bytes) are reused for other things. This could happen in window functions when it starts spilling to disk. When reading the back the spill files the `UnsafeSorterSpillReader` reuses the buffer to which the `UnsafeRow` points, leading to weird corruption scenario's. Note that this only happens for aggregate functions that preserve (parts of) their input, for example `FIRST`, `LAST`, `MIN` & `MAX`. This was not seen before, because the spilling logic was not doing actual spills as much and actually used an in-memory page. This page was not cleaned up during window processing and made sure unsafe objects point to their own dedicated memory location. This was changed by https://github.com/apache/spark/pull/16909, after this PR Spark spills more eagerly. This PR provides a surgical fix because we are close to releasing Spark 2.2. This change just makes sure that there cannot be any object reuse at the expensive of a little bit of performance. We will follow-up with a more subtle solution at a later point. ## How was this patch tested? Added a regression test to `DataFrameWindowFunctionsSuite`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #18470 from hvanhovell/SPARK-21258. (cherry picked from commit e2f32ee45ac907f1f53fde7e412676a849a94872) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 June 2017, 04:34:53 UTC
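The general defensive pattern behind this fix, as a sketch with a toy row type (the real change involves `UnsafeRow` and generated mutable projections): copy a reused row before keeping a reference to it in a buffer.

```scala
import scala.collection.mutable.ArrayBuffer

object BufferCopySketch {
  // A tiny stand-in for a mutable, reused row object (spill readers reuse the same object).
  final class ReusableRow(var values: Array[Any]) {
    def copyRow(): ReusableRow = new ReusableRow(values.clone())
  }

  // Buffering the reference itself would leave every entry pointing at the same
  // mutating object; copying first keeps each buffered row's data stable.
  def bufferAll(rows: Iterator[ReusableRow]): ArrayBuffer[ReusableRow] = {
    val buf = ArrayBuffer.empty[ReusableRow]
    rows.foreach(r => buf += r.copyRow())
    buf
  }
}
```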
083adb0 [SPARK-21176][WEB UI] Limit number of selector threads for admin ui proxy servlets to 8 ## What changes were proposed in this pull request? Please see also https://issues.apache.org/jira/browse/SPARK-21176 This change limits the number of selector threads that jetty creates to a maximum of 8 per proxy servlet (the Jetty default is number of processors / 2). The newHttpClient for Jetty's ProxyServlet class is overridden to avoid the Jetty defaults (which are designed for high-performance http servers). Once https://github.com/eclipse/jetty.project/issues/1643 is available, the code could be cleaned up to avoid the method override. I really need this on v2.1.1 - what is the best way for a backport (automatic merge works fine)? Shall I create another PR? ## How was this patch tested? The patch was tested manually on a Spark cluster with a head node that has 88 processors, using JMX to verify that the number of selector threads is now limited to 8 per proxy. gurvindersingh zsxwing can you please review the change? Author: IngoSchuster <ingo.schuster@de.ibm.com> Author: Ingo Schuster <ingo.schuster@de.ibm.com> Closes #18437 from IngoSchuster/master. (cherry picked from commit 88a536babf119b7e331d02aac5d52b57658803bf) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 June 2017, 03:16:38 UTC
26f4f34 Revert "[SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: Constant Pool Limit - Class Splitting" This reverts commit 6b37c863848d5991821aa2c3233754ca3061f4d6. 25 June 2017, 06:22:16 UTC
0d6b701 [SPARK-21203][SQL] Fix wrong results of insertion of Array of Struct ### What changes were proposed in this pull request? ```SQL CREATE TABLE `tab1` (`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>) USING parquet INSERT INTO `tab1` SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b')) SELECT custom_fields.id, custom_fields.value FROM tab1 ``` The above query always return the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is we always use the same `GenericInternalRow` object when doing the cast. ### How was this patch tested? Author: gatorsmile <gatorsmile@gmail.com> Closes #18412 from gatorsmile/castStruct. (cherry picked from commit 2e1586f60a77ea0adb6f3f68ba74323f0c242199) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 June 2017, 14:36:43 UTC
6750db3 [SPARK-21159][CORE] Don't try to connect to launcher in standalone cluster mode. Monitoring for standalone cluster mode is not implemented (see SPARK-11033), but the same scheduler implementation is used, and if it tries to connect to the launcher it will fail. So fix the scheduler so it only tries that in client mode; cluster mode applications will be correctly launched and will work, but monitoring through the launcher handle will not be available. Tested by running a cluster mode app with "SparkLauncher.startApplication". Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18397 from vanzin/SPARK-21159. (cherry picked from commit bfd73a7c48b87456d1b84d826e04eca938a1be64) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 June 2017, 05:36:53 UTC
f12883e [SPARK-20555][SQL] Fix mapping of Oracle DECIMAL types to Spark types in read path This PR is to revert some code changes in the read path of https://github.com/apache/spark/pull/14377. The original fix is https://github.com/apache/spark/pull/17830. When merging this PR, please give the credit to gaborfeher. Added a test case to OracleIntegrationSuite.scala. Author: Gabor Feher <gabor.feher@lynxanalytics.com> Author: gatorsmile <gatorsmile@gmail.com> Closes #18408 from gatorsmile/OracleType. 24 June 2017, 05:08:43 UTC
bcaf06c [MINOR][DOCS] Docs in DataFrameNaFunctions.scala use wrong method ## What changes were proposed in this pull request? * Following the first few examples in this file, the remaining methods should also be methods of `df.na` not `df`. * Filled in some missing parentheses ## How was this patch tested? N/A Author: Ong Ming Yang <me@ongmingyang.com> Closes #18398 from ongmingyang/master. (cherry picked from commit 4cc62951a2b12a372a2b267bf8597a0a31e2b2cb) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 23 June 2017, 17:57:22 UTC
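A brief example of the corrected usage, calling the null-handling methods on `df.na` (a `DataFrameNaFunctions` handle) rather than on `df` itself; the column names and fill value are illustrative only.

```scala
// Assumes an existing SparkSession `spark`.
import spark.implicits._

val df = Seq[(Option[Double], String)]((Some(1.0), "a"), (None, "b"))
  .toDF("age", "name")

val filled  = df.na.fill(0.0, Seq("age")) // fill nulls in "age" with 0.0
val dropped = df.na.drop()                // drop rows containing any null
```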
f8fd3b4 [SPARK-21181] Release byteBuffers to suppress netty error messages ## What changes were proposed in this pull request? We are explicitly calling release on the ByteBufs used to encode the string to Base64, to suppress the memory leak error message reported by netty. This is to make it less confusing for the user. ### Changes proposed in this fix By explicitly invoking release on the ByteBufs, we decrement the internal reference counts of the wrapped ByteBufs. Now, when the GC kicks in, these are reclaimed as before, just that netty no longer reports any memory leak error messages, as the internal ref counts are now 0. ## How was this patch tested? Ran a few spark-applications and examined the logs. The error message no longer appears. Original PR was opened against branch-2.1 => https://github.com/apache/spark/pull/18392 Author: Dhruve Ashar <dhruveashar@gmail.com> Closes #18407 from dhruve/master. (cherry picked from commit 1ebe7ffe072bcac03360e65e959a6cd36530a9c4) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 23 June 2017, 17:36:51 UTC
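A sketch of the release pattern using standard netty APIs (`Unpooled.wrappedBuffer`, `Base64.encode`, `ByteBuf.release`); it illustrates the idea, not the PR's exact code.

```scala
import io.netty.buffer.Unpooled
import io.netty.handler.codec.base64.Base64
import io.netty.util.CharsetUtil

val input   = Unpooled.wrappedBuffer("some payload".getBytes(CharsetUtil.UTF_8))
val encoded = Base64.encode(input)
val result  = encoded.toString(CharsetUtil.UTF_8)

// Dropping the reference counts to zero keeps netty's leak detector quiet;
// the underlying memory is reclaimed as usual.
encoded.release()
input.release()
```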
1a98d5d [SPARK-21167][SS] Decode the path generated by File sink to handle special characters ## What changes were proposed in this pull request? Decode the path generated by File sink to handle special characters. ## How was this patch tested? The added unit test. Author: Shixiong Zhu <shixiong@databricks.com> Closes #18381 from zsxwing/SPARK-21167. (cherry picked from commit d66b143eec7f604595089f72d8786edbdcd74282) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 22 June 2017, 06:43:42 UTC
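A small sketch of the decoding idea using standard Hadoop/Java APIs (the concrete file name is hypothetical): paths recorded by the file sink may be percent-encoded, so the reader has to decode them before resolving the file.

```scala
import java.net.URI
import org.apache.hadoop.fs.Path

// A file whose name contains a space ends up percent-encoded in the sink log.
val encoded = "file:/tmp/out/part-00000-a%20b.snappy.parquet"

// Constructing the Path from a URI decodes "%20" back to a space.
val decoded = new Path(new URI(encoded))
// decoded.getName == "part-00000-a b.snappy.parquet"
```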
6b37c86 [SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: Constant Pool Limit - Class Splitting ## What changes were proposed in this pull request? This is a backport patch for Spark 2.1.x of the class splitting feature over excess generated code as was merged in #18075. ## How was this patch tested? The same test provided in #18075 is included in this patch. Author: ALeksander Eskilson <alek.eskilson@cerner.com> Closes #18354 from bdrillard/class_splitting_2.1. 22 June 2017, 05:23:59 UTC
8923bac [SPARK-21123][DOCS][STRUCTURED STREAMING] Options for file stream source are in a wrong table - version to fix 2.1 ## What changes were proposed in this pull request? The description for several options of File Source for structured streaming appeared in the File Sink description instead. This commit continues on PR #18342 and targets the fixes for the documentation of Spark version 2.1. ## How was this patch tested? Built the documentation with SKIP_API=1 jekyll build and visually inspected the structured streaming programming guide. zsxwing This is the PR to fix version 2.1 as discussed in PR #18342 Author: assafmendelson <assaf.mendelson@gmail.com> Closes #18363 from assafmendelson/spark-21123-for-spark2.1. 20 June 2017, 20:07:51 UTC
7799f35 [SPARK-21138][YARN] Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different ## What changes were proposed in this pull request? When I set different clusters for "spark.hadoop.fs.defaultFS" and "spark.yarn.stagingDir" as follows: ``` spark.hadoop.fs.defaultFS hdfs://tl-nn-tdw.tencent-distribute.com:54310 spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark ``` the staging dir cannot be deleted; it prompts the following message: ``` java.lang.IllegalArgumentException: Wrong FS: hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310 ``` ## How was this patch tested? Existing tests Author: sharkdtu <sharkdtu@tencent.com> Closes #18352 from sharkdtu/master. (cherry picked from commit 3d4d11a80fe8953d48d8bfac2ce112e37d38dc90) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 19 June 2017, 21:55:47 UTC
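A sketch of the underlying idea using standard Hadoop APIs (the application directory name is a placeholder): resolve the `FileSystem` from the staging path itself rather than from `fs.defaultFS`, so deletion works even when the two point at different clusters.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val hadoopConf = new Configuration()
// Staging path on a cluster other than fs.defaultFS (app id is a placeholder).
val stagingDir = new Path("hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_123")

// Ask the path for its own FileSystem instead of using FileSystem.get(conf),
// which would return the default (wrong) cluster and throw "Wrong FS".
val fs = stagingDir.getFileSystem(hadoopConf)
fs.delete(stagingDir, true)
```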
a44c118 [SPARK-19688][STREAMING] Not to read `spark.yarn.credentials.file` from checkpoint. ## What changes were proposed in this pull request? Reload the `spark.yarn.credentials.file` property when restarting a streaming application from checkpoint. ## How was this patch tested? Manually tested with 1.6.3 and 2.1.1. I didn't test this with master because of some compile problems, but I expect the same result. ## Notice This should be merged into maintenance branches too. jira: [SPARK-21008](https://issues.apache.org/jira/browse/SPARK-21008) Author: saturday_s <shi.indetail@gmail.com> Closes #18230 from saturday-shi/SPARK-21008. (cherry picked from commit e92ffe6f1771e3fe9ea2e62ba552c1b5cf255368) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 19 June 2017, 17:24:56 UTC
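A minimal sketch of the reload idea (the function and variable names, and the surrounding checkpoint-restore code, are hypothetical): properties tied to the current submission are taken from the freshly created `SparkConf` instead of the checkpointed one.

```scala
import org.apache.spark.SparkConf

// `checkpointedConf` was deserialized from the checkpoint; `currentConf` was
// built for the restarted submission. Both are assumed to exist in the caller.
def reloadSubmissionProps(checkpointedConf: SparkConf, currentConf: SparkConf): SparkConf = {
  val propertiesToReload = Seq("spark.yarn.credentials.file")
  val restored = checkpointedConf.clone()
  propertiesToReload.foreach { key =>
    currentConf.getOption(key) match {
      case Some(value) => restored.set(key, value) // take the new submission's value
      case None        => restored.remove(key)     // drop the stale checkpointed value
    }
  }
  restored
}
```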
0ebb3b8 [SPARK-21114][TEST][2.1] Fix test failure in Spark 2.1/2.0 due to name mismatch ## What changes were proposed in this pull request? Name mismatch between 2.1/2.0 and 2.2. Thus, the test cases failed after we backported a fix to 2.1/2.0. This PR is to fix the issue. https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/ https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.0-test-maven-hadoop-2.2/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/ ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #18319 from gatorsmile/fixDecimal. 16 June 2017, 04:19:55 UTC
915a201 [SPARK-21072][SQL] TreeNode.mapChildren should only apply to the children node. ## What changes were proposed in this pull request? Just as the function name and comments of `TreeNode.mapChildren` suggest, the function should be applied only to the node's current children. So the following code should check whether the argument is a child node. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342 ## How was this patch tested? Existing tests. Author: Xianyang Liu <xianyang.liu@intel.com> Closes #18284 from ConeyLiu/treenode. (cherry picked from commit 87ab0cec65b50584a627037b9d1b6fdecaee725c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 June 2017, 04:10:38 UTC
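A toy Scala sketch of the rule (a hypothetical `Node` class, not Spark's `TreeNode`): when walking a node's fields, only arguments registered as children go through the mapping function; other node-typed references are left alone.

```scala
// Hypothetical tree node; `ref` points at a node that is *not* a child.
case class Node(value: Int, children: Seq[Node], ref: Option[Node] = None) {
  private lazy val containsChild: Set[Node] = children.toSet

  // Apply `f` only to registered children; `ref` may alias an arbitrary node
  // elsewhere in the tree and must stay untouched.
  def mapChildren(f: Node => Node): Node = {
    def mapIfChild(n: Node): Node = if (containsChild(n)) f(n) else n
    copy(children = children.map(mapIfChild), ref = ref.map(mapIfChild))
  }
}
```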
62f2b80 [SPARK-16251][SPARK-20200][CORE][TEST] Flaky test: org.apache.spark.rdd.LocalCheckpointSuite.missing checkpoint block fails with informative message ## What changes were proposed in this pull request? Currently we don't wait to confirm the removal of the block from the slave's BlockManager; if the removal takes too much time, the assertion in this test case fails. The failure can be easily reproduced if we sleep for a while before we remove the block in BlockManagerSlaveEndpoint.receiveAndReply(). ## How was this patch tested? N/A Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18314 from jiangxb1987/LocalCheckpointSuite. (cherry picked from commit 7dc3e697c74864a4e3cca7342762f1427058b3c3) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 15 June 2017, 16:07:26 UTC
a890466 [SPARK-20211][SQL][BACKPORT-2.2] Fix the Precision and Scale of Decimal Values when the Input is BigDecimal between -1.0 and 1.0 ### What changes were proposed in this pull request? This PR is to backport https://github.com/apache/spark/pull/18244 to 2.2 --- The precision and scale of decimal values are wrong when the input is a BigDecimal between -1.0 and 1.0. The BigDecimal's precision is the digit count starting from the leftmost nonzero digit, based on [JAVA's BigDecimal definition](https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html). However, our Decimal definition follows the database decimal standard, which is the total number of digits, including both to the left and the right of the decimal point. Thus, this PR is to fix the issue by doing the conversion. Before this PR, the following queries failed: ```SQL select 1 > 0.0001 select floor(0.0001) select ceil(0.0001) ``` ### How was this patch tested? Added test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #18297 from gatorsmile/backport18244. (cherry picked from commit 626511953b87747e933e4f64b9fcd4c4776a5c4e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 14 June 2017, 11:18:43 UTC
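A short worked example of the mismatch in Scala; the `max(precision, scale)` adjustment below is a sketch of the conversion idea, not necessarily the exact code in the PR.

```scala
import java.math.{BigDecimal => JBigDecimal}

val d = new JBigDecimal("0.0001")
// Java counts the digits of the unscaled value (1), so precision < scale here:
println(d.precision) // 1
println(d.scale)     // 4

// Under the SQL convention, 0.0001 needs 4 digits after the decimal point,
// so the declared type should be at least DECIMAL(4, 4).
val sqlPrecision = math.max(d.precision, d.scale) // 4
```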
ee0e74e [SPARK-21064][CORE][TEST] Fix the default value bug in NettyBlockTransferServiceSuite ## What changes were proposed in this pull request? The default value for `spark.port.maxRetries` is 100, but we use 10 in the suite file. So we change it to 100 to avoid test failure. ## How was this patch tested? No test Author: DjvuLee <lihu@bytedance.com> Closes #18280 from djvulee/NettyTestBug. (cherry picked from commit b36ce2a2469ff923a3367a530d4a14899ecf9238) Signed-off-by: Sean Owen <sowen@cloudera.com> 13 June 2017, 14:56:21 UTC
58a8a37 [SPARK-20920][SQL] ForkJoinPool pools are leaked when writing hive tables with many partitions ## What changes were proposed in this pull request? Don't leave thread pool running from AlterTableRecoverPartitionsCommand DDL command ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #18216 from srowen/SPARK-20920. (cherry picked from commit 7b7c85ede398996aafffb126440e5f0c67f67210) Signed-off-by: Sean Owen <sowen@cloudera.com> 13 June 2017, 09:48:25 UTC
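A sketch of the non-leaking pattern with the standard JDK pool (the submitted work is a placeholder): create the pool for the duration of the command and shut it down when done.

```scala
import java.util.concurrent.ForkJoinPool

val pool = new ForkJoinPool(8)
try {
  // Placeholder for the parallel partition-recovery work the command runs.
  pool.submit(new Runnable { def run(): Unit = println("recovering partitions") }).get()
} finally {
  // Without this, every invocation of the DDL command leaves a live pool behind.
  pool.shutdown()
}
```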
03cc18b [SPARK-20914][DOCS] Javadoc contains code that is invalid ## What changes were proposed in this pull request? Fix Java, Scala Dataset examples in scaladoc, which didn't compile. ## How was this patch tested? Existing compilation/test Author: Sean Owen <sowen@cloudera.com> Closes #18215 from srowen/SPARK-20914. (cherry picked from commit 847efe12656756f9ad6a4dc14bd183ac1a0760a6) Signed-off-by: Sean Owen <sowen@cloudera.com> 08 June 2017, 09:56:42 UTC
afab855 [SPARK-20974][BUILD] we should run REPL tests if SQL module has code changes ## What changes were proposed in this pull request? REPL module depends on SQL module, so we should run REPL tests if SQL module has code changes. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #18191 from cloud-fan/test. (cherry picked from commit 864d94fe879a32de324da65a844e62a0260b222d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 June 2017, 05:00:19 UTC
0b25a7d [SPARK-20922][CORE][HOTFIX] Don't use Java 8 lambdas in older branches. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18178 from vanzin/SPARK-20922-hotfix. 01 June 2017, 23:45:31 UTC