https://github.com/apache/spark

2abaea9 Preparing Spark release v2.1.2-rc4 02 October 2017, 18:57:15 UTC
ff4179b [SPARK-22167][R][BUILD] sparkr packaging issue allow zinc ## What changes were proposed in this pull request? When zinc is running, the pwd might be the root of the project rather than root/core/. A quick solution is to not go a level up in case we are already in the root. If we are in the root, everything works fine; if we are in core, a script goes a level up first and runs from there. ## How was this patch tested? `set -x` in the SparkR install scripts. Author: Holden Karau <holden@us.ibm.com> Closes #19402 from holdenk/SPARK-22167-sparkr-packaging-issue-allow-zinc. (cherry picked from commit 8fab7995d36c7bc4524393b20a4e524dbf6bbf62) Signed-off-by: Holden Karau <holden@us.ibm.com> 02 October 2017, 18:47:36 UTC
81e4008 [SPARK-18971][CORE] Upgrade Netty to 4.0.43.Final ## What changes were proposed in this pull request? Upgrade Netty to `4.0.43.Final` to add the fix for https://github.com/netty/netty/issues/6153 ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16568 from zsxwing/SPARK-18971. (cherry picked from commit a8567e34dc77a32ddeb280f8f9f603f301722059) Signed-off-by: Sean Owen <sowen@cloudera.com> # Conflicts: # dev/deps/spark-deps-hadoop-2.2 # dev/deps/spark-deps-hadoop-2.3 # dev/deps/spark-deps-hadoop-2.4 # dev/deps/spark-deps-hadoop-2.6 # dev/deps/spark-deps-hadoop-2.7 02 October 2017, 18:33:21 UTC
78661f9 Preparing development version 2.1.3-SNAPSHOT 29 September 2017, 16:04:35 UTC
efdbef4 Preparing Spark release v2.1.2-rc3 29 September 2017, 16:04:26 UTC
60f78c2 [SPARK-22129][SPARK-22138] Release script improvements ## What changes were proposed in this pull request? Use the GPG_KEY param, fix lsof to use a non-hardcoded path, and remove the version swap since it wasn't really needed. Export JAVA_HOME for downstream scripts as well. ## How was this patch tested? Rolled 2.1.2 RC2 Author: Holden Karau <holden@us.ibm.com> Closes #19359 from holdenk/SPARK-22129-fix-signing. (cherry picked from commit ecbe416ab5001b32737966c5a2407597a1dafc32) Signed-off-by: Holden Karau <holden@us.ibm.com> 29 September 2017, 15:04:38 UTC
361aa0e [SPARK-22083][CORE] Release locks in MemoryStore.evictBlocksToFreeSpace ## What changes were proposed in this pull request? MemoryStore.evictBlocksToFreeSpace acquires write locks for all the blocks it intends to evict up front. If there is a failure to evict blocks (e.g., some failure dropping a block to disk), then we have to release the locks. Otherwise a lock is never released and an executor trying to get it will wait forever. ## How was this patch tested? Added a unit test. Author: Imran Rashid <irashid@cloudera.com> Closes #19311 from squito/SPARK-22083. (cherry picked from commit 2c5b9b1173c23f6ca8890817a9a35dc7557b0776) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 September 2017, 19:02:51 UTC
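A minimal sketch of the locking discipline this fix describes; the function names and lock callbacks are illustrative stand-ins, not the actual MemoryStore code:

```scala
// Hedged sketch: acquire write locks up front; on any failure, release the locks
// for blocks that were NOT actually dropped so other readers don't wait forever.
def evictBlocksToFreeSpace(
    candidates: Seq[String],
    lockForWriting: String => Unit,
    unlock: String => Unit,
    dropBlock: String => Boolean): Unit = {
  candidates.foreach(lockForWriting)
  val dropped = scala.collection.mutable.Set.empty[String]
  try {
    candidates.foreach { id => if (dropBlock(id)) dropped += id }
  } finally {
    // Dropped blocks' locks go away with the block; everything else must be unlocked.
    candidates.filterNot(dropped).foreach(unlock)
  }
}
```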
0b3e7cc [SPARK-18136] Fix SPARK_JARS_DIR for Python pip install on Windows ## What changes were proposed in this pull request? Fix for the setup of `SPARK_JARS_DIR` on Windows, as it looks for the `%SPARK_HOME%\RELEASE` file instead of `%SPARK_HOME%\jars` as it should. The RELEASE file is not included in the `pip` build of PySpark. ## How was this patch tested? Local install of PySpark on Anaconda 4.4.0 (Python 3.6.1). Author: Jakub Nowacki <j.s.nowacki@gmail.com> Closes #19310 from jsnowacki/master. (cherry picked from commit c11f24a94007bbaad0835645843e776507094071) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 23 September 2017, 12:05:04 UTC
03db721 Preparing development version 2.1.3-SNAPSHOT 22 September 2017, 15:07:45 UTC
fabbb7f Preparing Spark release v2.1.2-rc2 22 September 2017, 15:07:37 UTC
d930bbb [SPARK-22072][SPARK-22071][BUILD] Improve release build scripts ## What changes were proposed in this pull request? Check JDK version (with javac) and use SPARK_VERSION for publish-release ## How was this patch tested? Manually tried local build with wrong JDK / JAVA_HOME & built a local release (LFTP disabled) Author: Holden Karau <holden@us.ibm.com> Closes #19312 from holdenk/improve-release-scripts-r2. (cherry picked from commit 8f130ad40178e35fecb3f2ba4a61ad23e6a90e3d) Signed-off-by: Holden Karau <holden@us.ibm.com> 22 September 2017, 07:15:27 UTC
1a4b6ee [SPARK-21928][CORE] Set classloader on SerializerManager's private kryo ## What changes were proposed in this pull request? We have to make sure that SerializerManager's private instance of kryo also uses the right classloader, regardless of the current thread classloader. In particular, this fixes serde during remote cache fetches, as those occur in netty threads. ## How was this patch tested? Manual tests & existing suite via jenkins. I haven't been able to reproduce this in a unit test, because when a remote RDD partition can be fetched, there is a warning message and then the partition is just recomputed locally. I manually verified the warning message is no longer present. (cherry picked from commit b75bd1777496ce0354458bf85603a8087a6a0ff8) Author: Imran Rashid <irashid@cloudera.com> Closes #19313 from squito/SPARK-21928_2.1. 21 September 2017, 22:47:23 UTC
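A minimal sketch of the fix shape, assuming only Kryo's public `setClassLoader` API; this is not the actual SerializerManager code:

```scala
import com.esotericsoftware.kryo.Kryo

// Pin the private instance to a known classloader instead of inheriting whatever
// the current thread (e.g. a netty IO thread) happens to carry.
def newKryoForSerializerManager(defaultClassLoader: ClassLoader): Kryo = {
  val kryo = new Kryo()
  kryo.setClassLoader(defaultClassLoader)
  kryo
}
```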
56865a1 [SPARK-19318][SPARK-22041][SPARK-16625][BACKPORT-2.1][SQL] Docker test case failure: `: General data types to be mapped to Oracle` ## What changes were proposed in this pull request? This PR is backport of https://github.com/apache/spark/pull/16891 to Spark 2.1. ## How was this patch tested? unit tests Author: Yuming Wang <wgyumg@gmail.com> Closes #19259 from wangyum/SPARK-22041-BACKPORT-2.1. 19 September 2017, 23:55:13 UTC
30ce056 [SPARK-22052] Incorrect Metric assigned in MetricsReporter.scala The current implementation of processingRate-total uses the wrong metric: it mistakenly uses inputRowsPerSecond instead of processedRowsPerSecond. ## What changes were proposed in this pull request? Adjust processingRate-total to use processedRowsPerSecond instead of inputRowsPerSecond ## How was this patch tested? Built Spark from source with the proposed change and tested the output with the correct parameter. Before the change, the csv metrics files for inputRate-total and processingRate-total displayed the same values due to the error. After changing MetricsReporter.scala, the processingRate-total csv file displayed the correct metric. (screenshot: processed rows per second) Author: Taaffy <32072374+Taaffy@users.noreply.github.com> Closes #19268 from Taaffy/patch-1. (cherry picked from commit 1bc17a6b8add02772a8a0a1048ac6a01d045baf4) Signed-off-by: Sean Owen <sowen@cloudera.com> 19 September 2017, 09:20:25 UTC
b35136a [SPARK-22043][PYTHON] Improves error message for show_profiles and dump_profiles ## What changes were proposed in this pull request? This PR proposes to improve error message from: ``` >>> sc.show_profiles() Traceback (most recent call last): File "<stdin>", line 1, in <module> File ".../spark/python/pyspark/context.py", line 1000, in show_profiles self.profiler_collector.show_profiles() AttributeError: 'NoneType' object has no attribute 'show_profiles' >>> sc.dump_profiles("/tmp/abc") Traceback (most recent call last): File "<stdin>", line 1, in <module> File ".../spark/python/pyspark/context.py", line 1005, in dump_profiles self.profiler_collector.dump_profiles(path) AttributeError: 'NoneType' object has no attribute 'dump_profiles' ``` to ``` >>> sc.show_profiles() Traceback (most recent call last): File "<stdin>", line 1, in <module> File ".../spark/python/pyspark/context.py", line 1003, in show_profiles raise RuntimeError("'spark.python.profile' configuration must be set " RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python profile. >>> sc.dump_profiles("/tmp/abc") Traceback (most recent call last): File "<stdin>", line 1, in <module> File ".../spark/python/pyspark/context.py", line 1012, in dump_profiles raise RuntimeError("'spark.python.profile' configuration must be set " RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python profile. ``` ## How was this patch tested? Unit tests added in `python/pyspark/tests.py` and manual tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19260 from HyukjinKwon/profile-errors. (cherry picked from commit 7c7266208a3be984ac1ce53747dc0c3640f4ecac) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 18 September 2017, 04:20:48 UTC
99de4b8 [SPARK-21953] Show both memory and disk bytes spilled if either is present As written now, there must be both memory and disk bytes spilled to show either of them. If there is only one of those types of spill recorded, it will be hidden. Author: Andrew Ash <andrew@andrewash.com> Closes #19164 from ash211/patch-3. (cherry picked from commit 6308c65f08b507408033da1f1658144ea8c1491f) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 September 2017, 02:43:21 UTC
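The shape of that condition change, as a hedged self-contained sketch (the real code lives in the UI layer; names here are illustrative):

```scala
// Before the fix, BOTH spill types had to be non-zero for either column to show;
// now either one suffices.
def shouldShowSpillColumns(memoryBytesSpilled: Long, diskBytesSpilled: Long): Boolean =
  memoryBytesSpilled > 0 || diskBytesSpilled > 0  // was: &&, which hid single-type spills
```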
3ae7ab8 [SPARK-21985][PYSPARK] PairDeserializer is broken for double-zipped RDDs ## What changes were proposed in this pull request? Fixes a bug introduced in #16121. In PairDeserializer, convert each batch of keys and values to lists (if they do not have `__len__` already) so that we can check that they are the same size. Normally they already are lists, so this should not have a performance impact, but it is needed when repeated `zip`s are done. ## How was this patch tested? Additional unit test Author: Andrew Ray <ray.andrew@gmail.com> Closes #19226 from aray/SPARK-21985. (cherry picked from commit 6adf67dd14b0ece342bb91adf800df0a7101e038) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 17 September 2017, 17:47:06 UTC
e49c997 Preparing development version 2.1.3-SNAPSHOT 14 September 2017, 02:34:45 UTC
6f47032 Preparing Spark release v2.1.2-rc1 14 September 2017, 02:34:41 UTC
e7696eb [SPARK-21976][DOC] Fix wrong documentation for Mean Absolute Error. ## What changes were proposed in this pull request? Fixed wrong documentation for Mean Absolute Error. The code for the MAE is correct: ```scala Since("1.2.0") def meanAbsoluteError: Double = { summary.normL1(1) / summary.count } ``` but in the documentation the division by N was missing. ## How was this patch tested? All of the Spark tests were run. Author: FavioVazquez <favio.vazquezp@gmail.com> Author: faviovazquez <favio.vazquezp@gmail.com> Author: Favio André Vázquez <favio.vazquezp@gmail.com> Closes #19190 from FavioVazquez/mae-fix. (cherry picked from commit e2ac2f1c71a0f8b03743d0d916dc0ef28482a393) Signed-off-by: Sean Owen <sowen@cloudera.com> 12 September 2017, 09:33:54 UTC
ae4e8ae [SPARKR][BACKPORT-2.1] backporting package and test changes ## What changes were proposed in this pull request? Cherry-pick or manually port the package and test changes to 2.1. ## How was this patch tested? Jenkins Author: Felix Cheung <felixcheung_m@hotmail.com> Author: hyukjinkwon <gurwls223@gmail.com> Author: Wayne Zhang <actuaryzhang@uber.com> Closes #19165 from felixcheung/rbackportpkg21. 10 September 2017, 17:24:46 UTC
6a8a726 [SPARK-21950][SQL][PYTHON][TEST] pyspark.sql.tests.SQLTests2 should stop SparkContext. ## What changes were proposed in this pull request? `pyspark.sql.tests.SQLTests2` doesn't stop the newly created SparkContext in the test, which might affect the following tests. This PR makes `pyspark.sql.tests.SQLTests2` stop the `SparkContext`. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #19158 from ueshin/issues/SPARK-21950. (cherry picked from commit 57bc1e9eb452284cbed090dbd5008eb2062f1b36) Signed-off-by: Takuya UESHIN <ueshin@databricks.com> 08 September 2017, 05:26:48 UTC
041eccb [SPARK-21834] Incorrect executor request in case of dynamic allocation ## What changes were proposed in this pull request? The killExecutor API currently does not allow killing an executor without updating the total number of executors needed. When dynamic allocation is turned on and the allocator tries to kill an executor, the scheduler reduces the total number of executors needed (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635), which is incorrect because the allocator already takes care of setting the required number of executors itself. ## How was this patch tested? Ran a job on the cluster and made sure the executor request is correct Author: Sital Kedia <skedia@fb.com> Closes #19081 from sitalkedia/skedia/oss_fix_executor_allocation. (cherry picked from commit 6949a9c5c6120fdde1b63876ede661adbd1eb15e) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 30 August 2017, 21:19:33 UTC
5769753 [SPARK-21826][SQL][2.1][2.0] outer broadcast hash join should not throw NPE This is a backport of https://github.com/apache/spark/pull/19036 to branches 2.1 and 2.0. Author: Wenchen Fan <wenchen@databricks.com> Closes #19040 from cloud-fan/bug. 24 August 2017, 17:36:37 UTC
3d3be4d [SPARK-21721][SQL][BACKPORT-2.1][FOLLOWUP] Clear FileSystem deleteOnExit cache when paths are successfully removed ## What changes were proposed in this pull request? Fix a typo in test. ## How was this patch tested? Jenkins tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19006 from viirya/SPARK-21721-backport-2.1-followup. 20 August 2017, 13:57:27 UTC
2394ae2 [MINOR] Correct validateAndTransformSchema in GaussianMixture and AFTSurvivalRegression ## What changes were proposed in this pull request? The line SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType) did not modify the variable schema; hence only the last line had any effect. A temporary variable is used to correctly append the two columns predictionCol and probabilityCol. ## How was this patch tested? Manually. Author: Cédric Pelvet <cedric.pelvet@gmail.com> Closes #18980 from sharp-pixel/master. (cherry picked from commit 73e04ecc4f29a0fe51687ed1337c61840c976f89) Signed-off-by: Sean Owen <sowen@cloudera.com> 20 August 2017, 10:06:14 UTC
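A self-contained sketch of the bug shape, using the public `StructType.add` to stand in for `SchemaUtils.appendColumn` (both return a new schema rather than mutating the argument):

```scala
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructType}

val schema = new StructType()

// Buggy shape: the first call's result is discarded, so only the last column survives.
// schema.add("prediction", IntegerType)
// val out = schema.add("probability", DoubleType)

// Fixed shape: thread each appended schema into the next call via a temporary.
val withPrediction = schema.add("prediction", IntegerType)
val out = withPrediction.add("probability", DoubleType)
```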
6f366fb [SPARK-21721][SQL][BACKPORT-2.1] Clear FileSystem deleteOnExit cache when paths are successfully removed ## What changes were proposed in this pull request? Backport SPARK-21721 to branch 2.1: We put the staging path into the deleteOnExit cache of FileSystem in case the path can't be successfully removed. But when we do successfully remove the path, we don't remove it from the cache. We should do so, to avoid the cache growing without bound. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18947 from viirya/SPARK-21721-backport-2.1. 15 August 2017, 15:48:00 UTC
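A hedged sketch of the cleanup shape using Hadoop's public `FileSystem` API (the function and flow are illustrative, not the actual Spark code):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Once a deferred-delete path is actually removed, cancel its deleteOnExit
// registration so the cache cannot grow unbounded.
def removeStagingPath(fs: FileSystem, staging: Path): Unit = {
  if (fs.delete(staging, /* recursive = */ true)) {
    fs.cancelDeleteOnExit(staging)
  } else {
    fs.deleteOnExit(staging) // fall back to deleting when the JVM exits
  }
}
```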
9b749b6 [SPARK-21306][ML] For branch 2.1, OneVsRest should support setWeightCol The PR is related to #18554, and is modified for branch 2.1. ## What changes were proposed in this pull request? add `setWeightCol` method for OneVsRest. `weightCol` is ignored if the classifier doesn't inherit the HasWeightCol trait. ## How was this patch tested? + [x] add a unit test. Author: Yan Facai (颜发才) <facai.yan@gmail.com> Closes #18763 from facaiy/BUG/branch-2.1_OneVsRest_support_setWeightCol. 08 August 2017, 03:05:36 UTC
444cca1 [SPARK-18535][SPARK-19720][CORE][BACKPORT-2.1] Redact sensitive information ## What changes were proposed in this pull request? Backporting SPARK-18535 and SPARK-19720 to Spark 2.1. It's a backport PR that redacts sensitive information, by configuration, in the Spark UI and Spark Submit console logs. Uses Mark Grover's (mark@apache.org) PRs as reference. ## How was this patch tested? Same tests from the PR applied Author: Mark Grover <mark@apache.org> Closes #18802 from dmvieira/feature-redact. 07 August 2017, 21:23:05 UTC
5634fad [SPARK-21588][SQL] SQLContext.getConf(key, null) should return null ## What changes were proposed in this pull request? SQLContext.getConf(key, null), for a key that is not defined in the conf and doesn't have a default value defined, throws an NPE. It happens only when the conf entry has a value converter. Added a null check on defaultValue inside SQLConf.getConfString to avoid calling entry.valueConverter(defaultValue). ## How was this patch tested? Added unit test Author: vinodkc <vinod.kc.in@gmail.com> Closes #18852 from vinodkc/br_Fix_SPARK-21588. (cherry picked from commit 1ba967b25e6d88be2db7a4e100ac3ead03a2ade9) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 06 August 2017, 06:05:09 UTC
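A hedged repro shape; the key is chosen only for illustration (any registered entry with a value converter and no explicit setting would exercise the same path):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("getconf-null").getOrCreate()
// Before the fix: entry.valueConverter(null) threw a NullPointerException for a
// registered key with a value converter; after the fix this simply returns null.
val v = spark.sqlContext.getConf("spark.sql.crossJoin.enabled", null)
assert(v == null)
spark.stop()
```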
734b144 [SPARK-21330][SQL] Bad partitioning does not allow to read a JDBC table with extreme values on the partition column ## What changes were proposed in this pull request? An overflow of the difference of bounds on the partitioning column leads to no data being read. This patch checks for this overflow. ## How was this patch tested? New unit test. Author: Andrew Ray <ray.andrew@gmail.com> Closes #18800 from aray/SPARK-21330. (cherry picked from commit 25826c77ddf0d5753d2501d0e764111da2caa8b6) Signed-off-by: Sean Owen <sowen@cloudera.com> 04 August 2017, 07:58:27 UTC
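A repro shape for the overflow, assuming a SparkSession in scope as `spark` and a reachable JDBC endpoint (URL and table are placeholders):

```scala
import java.util.Properties

// upperBound - lowerBound overflows a Long here; before the fix the computed
// partition stride was garbage and the read returned no data.
val df = spark.read.jdbc(
  "jdbc:postgresql://host/db", // placeholder URL
  "tab",                       // placeholder table
  "id",                        // partition column
  Long.MinValue,               // lowerBound
  Long.MaxValue,               // upperBound
  4,                           // numPartitions
  new Properties())
```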
d93e45b [SPARK-12717][PYTHON][BRANCH-2.1] Adding thread-safe broadcast pickle registry ## What changes were proposed in this pull request? When using PySpark broadcast variables in a multi-threaded environment, `SparkContext._pickled_broadcast_vars` becomes a shared resource. A race condition can occur when broadcast variables that are pickled from one thread get added to the shared ` _pickled_broadcast_vars` and become part of the python command from another thread. This PR introduces a thread-safe pickled registry using thread local storage so that when python command is pickled (causing the broadcast variable to be pickled and added to the registry) each thread will have their own view of the pickle registry to retrieve and clear the broadcast variables used. ## How was this patch tested? Added a unit test that causes this race condition using another thread. Author: Bryan Cutler <cutlerb@gmail.com> Closes #18825 from BryanCutler/pyspark-bcast-threadsafe-SPARK-12717-2_1. 03 August 2017, 01:35:56 UTC
b31b302 [SPARK-21522][CORE] Fix flakiness in LauncherServerSuite. Handle the case where the server closes the socket before the full message has been written by the client. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18727 from vanzin/SPARK-21522. (cherry picked from commit b133501800b43fa5c538a4e5ad597c9dc7d8378e) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 01 August 2017, 17:06:25 UTC
78f7cdf [SPARK-21555][SQL] RuntimeReplaceable should be compared semantically by its canonicalized child ## What changes were proposed in this pull request? When there are aliases (added for nested fields) as parameters in `RuntimeReplaceable`, those aliases can't be cleaned up by the analyzer rule `CleanupAliases` because they are not in the children expressions. An expression `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a group-by query because they contain different aliases. Because those aliases are not children of `RuntimeReplaceable`, which is a `UnaryExpression`, we can't trim the aliases out by simply transforming the expressions in `CleanupAliases`. If we wanted to replace the non-children aliases in `RuntimeReplaceable`, we would need to add more code to `RuntimeReplaceable` and modify all expressions of `RuntimeReplaceable`, which makes the interface ugly IMO. Considering that those aliases will be replaced later at optimization and so do no harm, this patch chooses to simply override `canonicalized` of `RuntimeReplaceable`. One concern is about `CleanupAliases`: it actually cannot clean up ALL aliases inside a plan. To make callers of this rule notice that, this patch adds a comment to `CleanupAliases`. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18761 from viirya/SPARK-21555. (cherry picked from commit 9c8109ef414c92553335bb1e90e9681e142128a4) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 29 July 2017, 17:04:12 UTC
258ca40 Revert "[SPARK-21306][ML] OneVsRest should support setWeightCol" This reverts commit 8520d7c6d5e880dea3c1a8a874148c07222b4b4b. 28 July 2017, 12:24:54 UTC
8520d7c [SPARK-21306][ML] OneVsRest should support setWeightCol ## What changes were proposed in this pull request? add `setWeightCol` method for OneVsRest. `weightCol` is ignored if the classifier doesn't inherit the HasWeightCol trait. ## How was this patch tested? + [x] add a unit test. Author: Yan Facai (颜发才) <facai.yan@gmail.com> Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol. (cherry picked from commit a5a3189974ea4628e9489eb50099a5432174e80c) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 28 July 2017, 02:15:59 UTC
9498798 [SPARK-21446][SQL] Fix setAutoCommit never executed ## What changes were proposed in this pull request? JIRA Issue: https://issues.apache.org/jira/browse/SPARK-21446 options.asConnectionProperties cannot contain fetchsize, because fetchsize belongs to the Spark-only options, and Spark-only options are excluded from the connection properties. So change the properties used by beforeFetch from options.asConnectionProperties.asScala.toMap to options.asProperties.asScala.toMap ## How was this patch tested? Author: DFFuture <albert.zhang23@gmail.com> Closes #18665 from DFFuture/sparksql_pg. (cherry picked from commit c9729187bcef78299390e53cd9af38c3e084060e) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 19 July 2017, 21:45:43 UTC
ac20693 [SPARK-21441][SQL] Incorrect Codegen in SortMergeJoinExec results in failures in some cases ## What changes were proposed in this pull request? https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21441 This issue can be reproduced by the following example: ``` val spark = SparkSession .builder() .appName("smj-codegen") .master("local") .config("spark.sql.autoBroadcastJoinThreshold", "1") .getOrCreate() val df1 = spark.createDataFrame(Seq((1, 1), (2, 2), (3, 3))).toDF("key", "int") val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", "str") val df = df1.join(df2, df1("key") === df2("key")) .filter("int = 2 or reflect('java.lang.Integer', 'valueOf', str) = 1") .select("int") df.show() ``` To conclude, the issue happens when: (1) the SortMergeJoin condition contains CodegenFallback expressions; (2) in the PhysicalPlan tree, the SortMergeJoin node is the child of the root node, e.g., the Project in the above example. This patch fixes the logic in the `CollapseCodegenStages` rule. ## How was this patch tested? Unit test and manual verification in our cluster. Author: donnyzone <wellfengzhu@gmail.com> Closes #18656 from DonnyZone/Fix_SortMergeJoinExec. (cherry picked from commit 6b6dd682e84d3b03d0b15fbd81a0d16729e521d2) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 July 2017, 13:53:07 UTC
caf32b3 [SPARK-21332][SQL] Incorrect result type inferred for some decimal expressions ## What changes were proposed in this pull request? This PR changes the direction of expression transformation in the DecimalPrecision rule. Previously, the expressions were transformed down, which led to incorrect result types when decimal expressions had other decimal expressions as their operands. The root cause of this issue was in visiting outer nodes before their children. Consider the example below: ``` val inputSchema = StructType(StructField("col", DecimalType(26, 6)) :: Nil) val sc = spark.sparkContext val rdd = sc.parallelize(1 to 2).map(_ => Row(BigDecimal(12))) val df = spark.createDataFrame(rdd, inputSchema) // Works correctly since no nested decimal expression is involved // Expected result type: (26, 6) * (26, 6) = (38, 12) df.select($"col" * $"col").explain(true) df.select($"col" * $"col").printSchema() // Gives a wrong result since there is a nested decimal expression that should be visited first // Expected result type: ((26, 6) * (26, 6)) * (26, 6) = (38, 12) * (26, 6) = (38, 18) df.select($"col" * $"col" * $"col").explain(true) df.select($"col" * $"col" * $"col").printSchema() ``` The example above gives the following output: ``` // Correct result without sub-expressions == Parsed Logical Plan == 'Project [('col * 'col) AS (col * col)#4] +- LogicalRDD [col#1] == Analyzed Logical Plan == (col * col): decimal(38,12) Project [CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS (col * col)#4] +- LogicalRDD [col#1] == Optimized Logical Plan == Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4] +- LogicalRDD [col#1] == Physical Plan == *Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4] +- Scan ExistingRDD[col#1] // Schema root |-- (col * col): decimal(38,12) (nullable = true) // Incorrect result with sub-expressions == Parsed Logical Plan == 'Project [(('col * 'col) * 'col) AS ((col * col) * col)#11] +- LogicalRDD [col#1] == Analyzed Logical Plan == ((col * col) * col): decimal(38,12) Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS ((col * col) * col)#11] +- LogicalRDD [col#1] == Optimized Logical Plan == Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11] +- LogicalRDD [col#1] == Physical Plan == *Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11] +- Scan ExistingRDD[col#1] // Schema root |-- ((col * col) * col): decimal(38,12) (nullable = true) ``` ## How was this patch tested? This PR was tested with available unit tests. Moreover, there are tests to cover previously failing scenarios. Author: aokolnychyi <anton.okolnychyi@sap.com> Closes #18583 from aokolnychyi/spark-21332. (cherry picked from commit 0be5fb41a6b7ef4da9ba36f3604ac646cb6d4ae3) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 18 July 2017, 04:08:30 UTC
a9efce4 [SPARK-19104][BACKPORT-2.1][SQL] Lambda variables in ExternalMapToCatalyst should be global ## What changes were proposed in this pull request? This PR is backport of #18418 to Spark 2.1. [SPARK-21391](https://issues.apache.org/jira/browse/SPARK-21391) reported this problem in Spark 2.1. The issue happens in `ExternalMapToCatalyst`. For example, the following code creates `ExternalMapToCatalyst` to convert a Scala Map to the catalyst map format. ``` val data = Seq.tabulate(10)(i => NestedData(1, Map("key" -> InnerData("name", i + 100)))) val ds = spark.createDataset(data) ``` The `valueConverter` in `ExternalMapToCatalyst` looks like: ``` if (isnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true))) null else named_struct(name, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).name, true), value, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).value) ``` There is a `CreateNamedStruct` expression (`named_struct`) to create a row of `InnerData.name` and `InnerData.value` that are referred to by `ExternalMapToCatalyst_value52`. Because `ExternalMapToCatalyst_value52` is a local variable, when `CreateNamedStruct` splits expressions into individual functions, the local variable can't be accessed anymore. ## How was this patch tested? Added a new test suite into `DatasetPrimitiveSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18627 from kiszk/SPARK-21391. 18 July 2017, 01:18:32 UTC
ca4d2aa [SPARK-21344][SQL] BinaryType comparison does signed byte array comparison ## What changes were proposed in this pull request? This PR fixes a wrong comparison for `BinaryType`. This PR enables unsigned comparison and unsigned prefix generation for an array for `BinaryType`. The previous implementation used signed operations. ## How was this patch tested? Added a test suite in `OrderingSuite`. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18571 from kiszk/SPARK-21344. (cherry picked from commit ac5d5d795909061a17e056696cf0ef87d9e65dd1) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 15 July 2017, 03:16:35 UTC
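A self-contained illustration of why signed byte comparison is wrong for lexicographic binary order: 0x80 is -128 as a signed byte but 128 unsigned.

```scala
val a: Byte = 0x01
val b: Byte = 0x80.toByte

// Signed order (the old, wrong behavior): 1 > -128, so a sorts AFTER b.
assert(java.lang.Byte.compare(a, b) > 0)

// Unsigned order (the fix): 1 < 128, so a sorts BEFORE b, matching
// byte-wise lexicographic comparison of binary data.
assert(java.lang.Integer.compare(a & 0xff, b & 0xff) < 0)
```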
2c28462 [SPARK-21083][SQL][BRANCH-2.1] Store zero size and row count when analyzing empty table ## What changes were proposed in this pull request? We should be able to store zero size and row count after analyzing an empty table. This is a backport for https://github.com/apache/spark/commit/9fccc3627fa41d32fbae6dbbb9bd1521e43eb4f0. ## How was this patch tested? Added new test. Author: Zhenhua Wang <wzh_zju@163.com> Closes #18577 from wzhfy/analyzeEmptyTable-2.1. 10 July 2017, 03:31:55 UTC
5e2bfd5 [SPARK-21345][SQL][TEST][TEST-MAVEN][BRANCH-2.1] SparkSessionBuilderSuite should clean up stopped sessions. ## What changes were proposed in this pull request? `SparkSessionBuilderSuite` should clean up stopped sessions. Otherwise, it leaves behind some stopped `SparkContext`s interfering with other test suites using `SharedSQLContext`. Recently, the master branch has been failing consecutively. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/ ## How was this patch tested? Pass Jenkins with an updated suite. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #18572 from dongjoon-hyun/SPARK-21345-BRANCH-2.1. 09 July 2017, 03:11:02 UTC
7f7b63b [SPARK-21312][SQL] correct offsetInBytes in UnsafeRow.writeToStream ## What changes were proposed in this pull request? Corrects the offsetInBytes calculation in UnsafeRow.writeToStream. Known failures include writes to some DataSources that have their own SparkPlan implementations and cause EXCHANGE in writes. ## How was this patch tested? Extended UnsafeRowSuite.writeToStream to include an UnsafeRow over a byte array having a non-zero offset. Author: Sumedh Wale <swale@snappydata.io> Closes #18535 from sumwale/SPARK-21312. (cherry picked from commit 14a3bb3a008c302aac908d7deaf0942a98c63be7) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 July 2017, 06:48:12 UTC
8f1ca69 [SPARK-20256][SQL][BRANCH-2.1] SessionState should be created more lazily ## What changes were proposed in this pull request? `SessionState` is designed to be created lazily. However, in reality, it is created immediately in `SparkSession.Builder.getOrCreate` ([here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L943)). This PR aims to recover the lazy behavior by keeping the options in `initialSessionOptions`. The benefit is the following: users can start `spark-shell` and use RDD operations without any problems. **BEFORE** ```scala $ bin/spark-shell java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder' ... Caused by: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.security.AccessControlException: Permission denied: user=spark, access=READ, inode="/apps/hive/warehouse":hive:hdfs:drwx------ ``` As reported in SPARK-20256, this happens when the warehouse directory is not allowed for this user. **AFTER** ```scala $ bin/spark-shell ... Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.1.2-SNAPSHOT /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131) Type in expressions to have them evaluated. Type :help for more information. scala> sc.range(0, 10, 1).count() res0: Long = 10 ``` ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #18530 from dongjoon-hyun/SPARK-20256-BRANCH-2.1. 05 July 2017, 02:29:37 UTC
3ecef24 Revert "[SPARK-21258][SQL] Fix WindowExec complex object aggregation with spilling" This reverts commit d995dac1cdeec940364453675f59ce5cf2b53684. 30 June 2017, 06:45:55 UTC
d995dac [SPARK-21258][SQL] Fix WindowExec complex object aggregation with spilling ## What changes were proposed in this pull request? `WindowExec` currently improperly stores complex objects (UnsafeRow, UnsafeArrayData, UnsafeMapData, UTF8String) during aggregation by keeping a reference in the buffer used by `GeneratedMutableProjections` to the actual input data. Things go wrong when the input object (or the backing bytes) is reused for other things. This could happen in window functions when they start spilling to disk. When reading back the spill files, the `UnsafeSorterSpillReader` reuses the buffer to which the `UnsafeRow` points, leading to weird corruption scenarios. Note that this only happens for aggregate functions that preserve (parts of) their input, for example `FIRST`, `LAST`, `MIN` & `MAX`. This was not seen before, because the spilling logic was not doing actual spills as much and actually used an in-memory page. This page was not cleaned up during window processing and made sure unsafe objects pointed to their own dedicated memory location. This was changed by https://github.com/apache/spark/pull/16909; after that PR, Spark spills more eagerly. This PR provides a surgical fix because we are close to releasing Spark 2.2. This change just makes sure that there cannot be any object reuse, at the expense of a little bit of performance. We will follow up with a more subtle solution at a later point. ## How was this patch tested? Added a regression test to `DataFrameWindowFunctionsSuite`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #18470 from hvanhovell/SPARK-21258. (cherry picked from commit e2f32ee45ac907f1f53fde7e412676a849a94872) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 June 2017, 04:34:53 UTC
083adb0 [SPARK-21176][WEB UI] Limit number of selector threads for admin ui proxy servlets to 8 ## What changes were proposed in this pull request? Please see also https://issues.apache.org/jira/browse/SPARK-21176 This change limits the number of selector threads that jetty creates to a maximum of 8 per proxy servlet (Jetty's default is number of processors / 2). The newHttpClient for Jetty's ProxyServlet class is overwritten to avoid the Jetty defaults (which are designed for high-performance http servers). Once https://github.com/eclipse/jetty.project/issues/1643 is available, the code could be cleaned up to avoid the method override. I really need this on v2.1.1 - what is the best way for a backport (automatic merge works fine)? Shall I create another PR? ## How was this patch tested? The patch was tested manually on a Spark cluster with a head node that has 88 processors, using JMX to verify that the number of selector threads is now limited to 8 per proxy. gurvindersingh zsxwing can you please review the change? Author: IngoSchuster <ingo.schuster@de.ibm.com> Author: Ingo Schuster <ingo.schuster@de.ibm.com> Closes #18437 from IngoSchuster/master. (cherry picked from commit 88a536babf119b7e331d02aac5d52b57658803bf) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 June 2017, 03:16:38 UTC
26f4f34 Revert "[SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: Constant Pool Limit - Class Splitting" This reverts commit 6b37c863848d5991821aa2c3233754ca3061f4d6. 25 June 2017, 06:22:16 UTC
0d6b701 [SPARK-21203][SQL] Fix wrong results of insertion of Array of Struct ### What changes were proposed in this pull request? ```SQL CREATE TABLE `tab1` (`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>) USING parquet INSERT INTO `tab1` SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b')) SELECT custom_fields.id, custom_fields.value FROM tab1 ``` The above query always returns the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is that we always use the same `GenericInternalRow` object when doing the cast. ### How was this patch tested? Author: gatorsmile <gatorsmile@gmail.com> Closes #18412 from gatorsmile/castStruct. (cherry picked from commit 2e1586f60a77ea0adb6f3f68ba74323f0c242199) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 June 2017, 14:36:43 UTC
6750db3 [SPARK-21159][CORE] Don't try to connect to launcher in standalone cluster mode. Monitoring for standalone cluster mode is not implemented (see SPARK-11033), but the same scheduler implementation is used, and if it tries to connect to the launcher it will fail. So fix the scheduler so it only tries that in client mode; cluster mode applications will be correctly launched and will work, but monitoring through the launcher handle will not be available. Tested by running a cluster mode app with "SparkLauncher.startApplication". Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18397 from vanzin/SPARK-21159. (cherry picked from commit bfd73a7c48b87456d1b84d826e04eca938a1be64) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 June 2017, 05:36:53 UTC
f12883e [SPARK-20555][SQL] Fix mapping of Oracle DECIMAL types to Spark types in read path This PR is to revert some code changes in the read path of https://github.com/apache/spark/pull/14377. The original fix is https://github.com/apache/spark/pull/17830. When merging this PR, please give the credit to gaborfeher. Added a test case to OracleIntegrationSuite.scala. Author: Gabor Feher <gabor.feher@lynxanalytics.com> Author: gatorsmile <gatorsmile@gmail.com> Closes #18408 from gatorsmile/OracleType. 24 June 2017, 05:08:43 UTC
bcaf06c [MINOR][DOCS] Docs in DataFrameNaFunctions.scala use wrong method ## What changes were proposed in this pull request? * Following the first few examples in this file, the remaining methods should also be methods of `df.na` not `df`. * Filled in some missing parentheses ## How was this patch tested? N/A Author: Ong Ming Yang <me@ongmingyang.com> Closes #18398 from ongmingyang/master. (cherry picked from commit 4cc62951a2b12a372a2b267bf8597a0a31e2b2cb) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 23 June 2017, 17:57:22 UTC
f8fd3b4 [SPARK-21181] Release byteBuffers to suppress netty error messages ## What changes were proposed in this pull request? We explicitly call release on the ByteBufs used to encode the string to Base64, to suppress the memory leak error message reported by netty. This is to make it less confusing for the user. ### Changes proposed in this fix By explicitly invoking release on the ByteBufs, we decrement the internal reference counts of the wrapped ByteBufs. Now, when the GC kicks in, these are reclaimed as before; netty just won't report any memory leak error messages, since the internal ref counts are now 0. ## How was this patch tested? Ran a few spark applications and examined the logs. The error message no longer appears. Original PR was opened against branch-2.1 => https://github.com/apache/spark/pull/18392 Author: Dhruve Ashar <dhruveashar@gmail.com> Closes #18407 from dhruve/master. (cherry picked from commit 1ebe7ffe072bcac03360e65e959a6cd36530a9c4) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 23 June 2017, 17:36:51 UTC
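A hedged sketch of the release pattern using netty's public API; the buffer flow is illustrative, not the actual Spark RPC encoding path:

```scala
import java.nio.charset.StandardCharsets
import io.netty.buffer.{ByteBuf, Unpooled}
import io.netty.handler.codec.base64.Base64

val raw: ByteBuf = Unpooled.wrappedBuffer("payload".getBytes(StandardCharsets.UTF_8))
val encoded: ByteBuf = Base64.encode(raw)
try {
  val s = encoded.toString(StandardCharsets.UTF_8)
  // ... use the Base64 string ...
} finally {
  // Drop the reference counts to 0 so the buffers are reclaimed deterministically
  // and netty's leak detector has nothing to report.
  raw.release()
  encoded.release()
}
```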
1a98d5d [SPARK-21167][SS] Decode the path generated by File sink to handle special characters ## What changes were proposed in this pull request? Decode the path generated by File sink to handle special characters. ## How was this patch tested? The added unit test. Author: Shixiong Zhu <shixiong@databricks.com> Closes #18381 from zsxwing/SPARK-21167. (cherry picked from commit d66b143eec7f604595089f72d8786edbdcd74282) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 22 June 2017, 06:43:42 UTC
6b37c86 [SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: Constant Pool Limit - Class Splitting ## What changes were proposed in this pull request? This is a backport patch for Spark 2.1.x of the class splitting feature over excess generated code as was merged in #18075. ## How was this patch tested? The same test provided in #18075 is included in this patch. Author: ALeksander Eskilson <alek.eskilson@cerner.com> Closes #18354 from bdrillard/class_splitting_2.1. 22 June 2017, 05:23:59 UTC
8923bac [SPARK-21123][DOCS][STRUCTURED STREAMING] Options for file stream source are in a wrong table - version to fix 2.1 ## What changes were proposed in this pull request? The description for several options of the File Source for Structured Streaming appeared in the File Sink description instead. This commit continues PR #18342 and targets the fixes for the documentation of Spark version 2.1. ## How was this patch tested? Built the documentation with `SKIP_API=1 jekyll build` and visually inspected the Structured Streaming programming guide. zsxwing This is the PR to fix version 2.1 as discussed in PR #18342 Author: assafmendelson <assaf.mendelson@gmail.com> Closes #18363 from assafmendelson/spark-21123-for-spark2.1. 20 June 2017, 20:07:51 UTC
7799f35 [SPARK-21138][YARN] Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different ## What changes were proposed in this pull request? When I set different clusters for "spark.hadoop.fs.defaultFS" and "spark.yarn.stagingDir" as follows: ``` spark.hadoop.fs.defaultFS hdfs://tl-nn-tdw.tencent-distribute.com:54310 spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark ``` the staging dir cannot be deleted; it will prompt the following message: ``` java.lang.IllegalArgumentException: Wrong FS: hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310 ``` ## How was this patch tested? Existing tests Author: sharkdtu <sharkdtu@tencent.com> Closes #18352 from sharkdtu/master. (cherry picked from commit 3d4d11a80fe8953d48d8bfac2ce112e37d38dc90) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 19 June 2017, 21:55:47 UTC
a44c118 [SPARK-19688][STREAMING] Not to read `spark.yarn.credentials.file` from checkpoint. ## What changes were proposed in this pull request? Reload the `spark.yarn.credentials.file` property when restarting a streaming application from checkpoint. ## How was this patch tested? Manually tested with 1.6.3 and 2.1.1. I didn't test this with master because of some compile problems, but I think the result will be the same. ## Notice This should be merged into maintenance branches too. jira: [SPARK-21008](https://issues.apache.org/jira/browse/SPARK-21008) Author: saturday_s <shi.indetail@gmail.com> Closes #18230 from saturday-shi/SPARK-21008. (cherry picked from commit e92ffe6f1771e3fe9ea2e62ba552c1b5cf255368) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 19 June 2017, 17:24:56 UTC
0ebb3b8 [SPARK-21114][TEST][2.1] Fix test failure in Spark 2.1/2.0 due to name mismatch ## What changes were proposed in this pull request? Name mismatch between 2.1/2.0 and 2.2. Thus, the test cases failed after we backported a fix to 2.1/2.0. This PR is to fix the issue. https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/ https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.0-test-maven-hadoop-2.2/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/ ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #18319 from gatorsmile/fixDecimal. 16 June 2017, 04:19:55 UTC
915a201 [SPARK-21072][SQL] TreeNode.mapChildren should only apply to the children node. ## What changes were proposed in this pull request? Just as the function name and comments of `TreeNode.mapChildren` indicate, the function should apply only to the current node's children. So the following code should check whether a node is a child node. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342 ## How was this patch tested? Existing tests. Author: Xianyang Liu <xianyang.liu@intel.com> Closes #18284 from ConeyLiu/treenode. (cherry picked from commit 87ab0cec65b50584a627037b9d1b6fdecaee725c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 June 2017, 04:10:38 UTC
62f2b80 [SPARK-16251][SPARK-20200][CORE][TEST] Flaky test: org.apache.spark.rdd.LocalCheckpointSuite.missing checkpoint block fails with informative message ## What changes were proposed in this pull request? Currently we don't wait to confirm the removal of the block from the slave's BlockManager; if the removal takes too much time, we will fail the assertion in this test case. The failure can be easily reproduced if we sleep for a while before we remove the block in BlockManagerSlaveEndpoint.receiveAndReply(). ## How was this patch tested? N/A Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18314 from jiangxb1987/LocalCheckpointSuite. (cherry picked from commit 7dc3e697c74864a4e3cca7342762f1427058b3c3) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 15 June 2017, 16:07:26 UTC
a890466 [SPARK-20211][SQL][BACKPORT-2.2] Fix the Precision and Scale of Decimal Values when the Input is BigDecimal between -1.0 and 1.0 ### What changes were proposed in this pull request? This PR is to backport https://github.com/apache/spark/pull/18244 to 2.2 --- The precision and scale of decimal values are wrong when the input is BigDecimal between -1.0 and 1.0. The BigDecimal's precision is the digit count starts from the leftmost nonzero digit based on the [JAVA's BigDecimal definition](https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html). However, our Decimal decision follows the database decimal standard, which is the total number of digits, including both to the left and the right of the decimal point. Thus, this PR is to fix the issue by doing the conversion. Before this PR, the following queries failed: ```SQL select 1 > 0.0001 select floor(0.0001) select ceil(0.0001) ``` ### How was this patch tested? Added test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #18297 from gatorsmile/backport18244. (cherry picked from commit 626511953b87747e933e4f64b9fcd4c4776a5c4e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 14 June 2017, 11:18:43 UTC
ee0e74e [SPARK-21064][CORE][TEST] Fix the default value bug in NettyBlockTransferServiceSuite ## What changes were proposed in this pull request? The default value for `spark.port.maxRetries` is 100, but we use 10 in the suite file. So we change it to 100 to avoid test failure. ## How was this patch tested? No test Author: DjvuLee <lihu@bytedance.com> Closes #18280 from djvulee/NettyTestBug. (cherry picked from commit b36ce2a2469ff923a3367a530d4a14899ecf9238) Signed-off-by: Sean Owen <sowen@cloudera.com> 13 June 2017, 14:56:21 UTC
58a8a37 [SPARK-20920][SQL] ForkJoinPool pools are leaked when writing hive tables with many partitions ## What changes were proposed in this pull request? Don't leave the thread pool running from the AlterTableRecoverPartitionsCommand DDL command ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #18216 from srowen/SPARK-20920. (cherry picked from commit 7b7c85ede398996aafffb126440e5f0c67f67210) Signed-off-by: Sean Owen <sowen@cloudera.com> 13 June 2017, 09:48:25 UTC
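A hedged sketch of the leak shape using the Scala 2.11-era parallel-collections API (the workload is illustrative): a pool created per command invocation must be shut down when the work finishes, or its worker threads linger forever.

```scala
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

val pool = new ForkJoinPool(8)
try {
  val partitions = (1 to 100).par
  partitions.tasksupport = new ForkJoinTaskSupport(pool)
  partitions.foreach { _ => /* recover one partition */ }
} finally {
  pool.shutdown() // without this, each invocation leaks a pool and its threads
}
```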
03cc18b [SPARK-20914][DOCS] Javadoc contains code that is invalid ## What changes were proposed in this pull request? Fix Java, Scala Dataset examples in scaladoc, which didn't compile. ## How was this patch tested? Existing compilation/test Author: Sean Owen <sowen@cloudera.com> Closes #18215 from srowen/SPARK-20914. (cherry picked from commit 847efe12656756f9ad6a4dc14bd183ac1a0760a6) Signed-off-by: Sean Owen <sowen@cloudera.com> 08 June 2017, 09:56:42 UTC
afab855 [SPARK-20974][BUILD] we should run REPL tests if SQL module has code changes ## What changes were proposed in this pull request? REPL module depends on SQL module, so we should run REPL tests if SQL module has code changes. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #18191 from cloud-fan/test. (cherry picked from commit 864d94fe879a32de324da65a844e62a0260b222d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 June 2017, 05:00:19 UTC
0b25a7d [SPARK-20922][CORE][HOTFIX] Don't use Java 8 lambdas in older branches. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18178 from vanzin/SPARK-20922-hotfix. 01 June 2017, 23:45:31 UTC
772a9b9 [SPARK-20922][CORE] Add whitelist of classes that can be deserialized by the launcher. Blindly deserializing classes using Java serialization opens the code up to issues in other libraries, since just deserializing data from a stream may end up executing code (think readObject()). Since the launcher protocol is pretty self-contained, there's just a handful of classes it legitimately needs to deserialize, and they're in just two packages, so add a filter that throws errors if classes from any other package show up in the stream. This also maintains backwards compatibility (the updated launcher code can still communicate with the backend code in older Spark releases). Tested with new and existing unit tests. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18166 from vanzin/SPARK-20922. (cherry picked from commit 8efc6e986554ae66eab93cd64a9035d716adbab0) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 01 June 2017, 21:44:58 UTC
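A hedged sketch of a package-whitelist deserializer, similar in spirit to the launcher's filter; the class name and allowed package list are illustrative assumptions, not the actual Spark code:

```scala
import java.io.{InputStream, InvalidClassException, ObjectInputStream, ObjectStreamClass}

// Refuse to resolve any class outside the handful of packages the protocol
// legitimately uses, failing fast before readObject() can run attacker-chosen code.
class FilteredObjectInputStream(in: InputStream) extends ObjectInputStream(in) {
  private val allowedPackages = Seq("org.apache.spark.launcher.", "java.lang.")
  override protected def resolveClass(desc: ObjectStreamClass): Class[_] = {
    if (!allowedPackages.exists(p => desc.getName.startsWith(p))) {
      throw new InvalidClassException(desc.getName, "Unexpected class in launcher stream")
    }
    super.resolveClass(desc)
  }
}
```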
dade85f [SPARK-20940][CORE] Replace IllegalAccessError with IllegalStateException ## What changes were proposed in this pull request? `IllegalAccessError` is a fatal error (a subclass of LinkageError) and its meaning is `Thrown if an application attempts to access or modify a field, or to call a method that it does not have access to`. Throwing a fatal error for AccumulatorV2 is not necessary and is pretty bad because it usually will just kill executors or SparkContext ([SPARK-20666](https://issues.apache.org/jira/browse/SPARK-20666) is an example of killing SparkContext due to `IllegalAccessError`). I think the correct type of exception in AccumulatorV2 should be `IllegalStateException`. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #18168 from zsxwing/SPARK-20940. (cherry picked from commit 24db35826a81960f08e3eb68556b0f51781144e1) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 01 June 2017, 00:26:37 UTC
4640086 [SPARK-20275][UI] Do not display "Completed" column for in-progress applications ## What changes were proposed in this pull request? The HistoryServer currently displays the completed date of an in-progress application as `1969-12-31 23:59:59`, which is not meaningful. Instead of unnecessarily showing this incorrect completed date, this proposes to make the column invisible for in-progress applications. The purpose of only making this column invisible, rather than deleting the field, is that this data is fetched through the REST API, and in the REST API the format is as shown below, in which `endTime` matches `endTimeEpoch`. So instead of changing the REST API and breaking backward compatibility, we choose the simple solution of only making this column invisible. ``` [ { "id" : "local-1491805439678", "name" : "Spark shell", "attempts" : [ { "startTime" : "2017-04-10T06:23:57.574GMT", "endTime" : "1969-12-31T23:59:59.999GMT", "lastUpdated" : "2017-04-10T06:23:57.574GMT", "duration" : 0, "sparkUser" : "", "completed" : false, "startTimeEpoch" : 1491805437574, "endTimeEpoch" : -1, "lastUpdatedEpoch" : 1491805437574 } ] } ] ``` Here is the UI before the change: (screenshot omitted) And after: (screenshot omitted) ## How was this patch tested? Manual verification. Author: jerryshao <sshao@hortonworks.com> Closes #17588 from jerryshao/SPARK-20275. (cherry picked from commit 52ed9b289d169219f7257795cbedc56565a39c71) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 31 May 2017, 03:25:16 UTC
38f37c5 [SPARK-20393][WEB UI] Strengthen Spark to prevent XSS vulnerabilities Add stripXSS and stripXSSMap to Spark Core's UIUtils, and call these functions at any point where getParameter is called against an HttpServletRequest. Tested with unit tests; IBM Security AppScan Standard no longer shows vulnerabilities; manual verification of WebUI pages. Author: NICHOLAS T. MARION <nmarion@us.ibm.com> Closes #17686 from n-marion/xss-fix. (cherry picked from commit b512233a457092b0e2a39d0b42cb021abc69d375) Signed-off-by: Sean Owen <sowen@cloudera.com> 27 May 2017, 11:06:23 UTC
ebd72f4 [SPARK-20843][CORE] Add a config to set driver terminate timeout ## What changes were proposed in this pull request? Add a `worker` configuration to set how long to wait before forcibly killing a driver. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #18126 from zsxwing/SPARK-20843. (cherry picked from commit 6c1dbd6fc8d49acf7c1c902d2ebf89ed5e788a4e) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 27 May 2017, 05:25:54 UTC
6e6adcc [SPARK-20868][CORE] UnsafeShuffleWriter should verify the position after FileChannel.transferTo ## What changes were proposed in this pull request? A long time ago we fixed a [bug](https://issues.apache.org/jira/browse/SPARK-3948) in the shuffle writer about `FileChannel.transferTo`. We were not very confident about that fix, so we added a position check after the writing, to try to discover the bug earlier. However this check is missing in the new `UnsafeShuffleWriter`; this PR adds it. https://issues.apache.org/jira/browse/SPARK-18105 may be related to that `FileChannel.transferTo` bug; hopefully we can find out the root cause after adding this position check. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #18091 from cloud-fan/shuffle. (cherry picked from commit d9ad78908f6189719cec69d34557f1a750d2e6af) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 May 2017, 07:02:04 UTC
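A hedged sketch of the verification described above (names and structure are illustrative; it assumes copying starts at position 0 of the source): after the `transferTo` loop, assert that the target channel's position advanced by exactly `size` bytes.

```scala
import java.nio.channels.FileChannel

def copyFileChannel(in: FileChannel, out: FileChannel, size: Long): Unit = {
  val initialPos = out.position()
  var count = 0L
  while (count < size) {
    count += in.transferTo(count, size - count, out)
  }
  val finalPos = out.position()
  // Guard against the known kernel/JDK transferTo bug (see SPARK-3948) where the
  // call reports success but the target position has not actually advanced.
  assert(finalPos == initialPos + size,
    s"Position after transferTo is $finalPos, expected ${initialPos + size}")
}
```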
4f6fccf [SPARK-20874][EXAMPLES] Add Structured Streaming Kafka Source to examples project ## What changes were proposed in this pull request? Add Structured Streaming Kafka Source to the `examples` project so that people can run `bin/run-example StructuredKafkaWordCount ...`. ## How was this patch tested? manually tested it. Author: Shixiong Zhu <shixiong@databricks.com> Closes #18101 from zsxwing/add-missing-example-dep. (cherry picked from commit 98c3852986a2cb5f2d249d6c8ef602be283bd90e) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 25 May 2017, 17:49:32 UTC
7fc2347 [SPARK-20250][CORE] Improper OOM error when a task is killed while spilling data Currently, when a task is calling spill() but receives a kill request from the driver (e.g., a speculative task), the `TaskMemoryManager` will throw an `OOM` exception. And we don't catch fatal exceptions when an error is caused by `Thread.interrupt()`. So for `ClosedByInterruptException`, we should throw `RuntimeException` instead of `OutOfMemoryError`. https://issues.apache.org/jira/browse/SPARK-20250?jql=project%20%3D%20SPARK Existing unit tests. Author: Xianyang Liu <xianyang.liu@intel.com> Closes #18090 from ConeyLiu/SPARK-20250. (cherry picked from commit 731462a04f8e33ac507ad19b4270c783a012a33e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 May 2017, 07:51:27 UTC
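A minimal sketch of the exception-translation shape described above (the wrapper function is illustrative, not the actual TaskMemoryManager code):

```scala
import java.nio.channels.ClosedByInterruptException

// A kill that interrupts the spilling thread should surface as a plain
// RuntimeException, not a fatal OutOfMemoryError that can take the executor down.
def spillSafely(doSpill: () => Long): Long =
  try doSpill()
  catch {
    case e: ClosedByInterruptException =>
      throw new RuntimeException("Task was killed while spilling", e)
  }
```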
7015f6f [SPARK-20848][SQL][FOLLOW-UP] Shutdown the pool after reading parquet files ## What changes were proposed in this pull request? This is a follow-up to #18073, taking a safer approach to shutting down the pool to prevent possible issues. Also using `ThreadUtils.newForkJoinPool` instead, to set a better thread name. ## How was this patch tested? Manual test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18100 from viirya/SPARK-20848-followup. (cherry picked from commit 6b68d61cf31748a088778dfdd66491b2f89a3c7b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 May 2017, 01:56:16 UTC
c3302e8 [SPARK-18406][CORE][BACKPORT-2.1] Race between end-of-task and completion iterator read lock release This is a backport PR of #18076 to 2.1. ## What changes were proposed in this pull request? When a TaskContext is not propagated properly to all child threads for the task, just like the reported cases in this issue, we fail to get the TID from the TaskContext, which leaves us unable to release the lock and causes assertion failures. To resolve this, we have to explicitly pass the TID value to the `unlock` method. ## How was this patch tested? Added a new failing regression test case in `RDDSuite`. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18099 from jiangxb1987/completion-iterator-2.1. 25 May 2017, 00:31:04 UTC
2f68631 [SPARK-20848][SQL] Shutdown the pool after reading parquet files ## What changes were proposed in this pull request? From JIRA: on each call to spark.read.parquet, a new ForkJoinPool is created. One of the threads in the pool is kept in the WAITING state and never stopped, which leads to unbounded growth in the number of threads. We should shut down the pool after reading parquet files. ## How was this patch tested? Added a test to ParquetFileFormatSuite. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18073 from viirya/SPARK-20848. (cherry picked from commit f72ad303f05a6d99513ea3b121375726b177199c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 May 2017, 16:36:22 UTC
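A minimal sketch of the pattern, using the Scala 2.11-era parallel-collection APIs that Spark 2.1 builds against (the footer-reading body is elided):
```scala
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

// Give the parallel collection a dedicated pool and always shut it down, so
// each read no longer leaks a ForkJoinPool with a permanently WAITING thread.
def readFootersInParallel(paths: Seq[String]): Seq[String] = {
  val parPaths = paths.par
  val pool = new ForkJoinPool(8)
  parPaths.tasksupport = new ForkJoinTaskSupport(pool)
  try {
    parPaths.map { path =>
      path  // placeholder: read and parse the Parquet footer for `path`
    }.seq
  } finally {
    pool.shutdown()  // shut the pool down even if reading fails (per the follow-up above)
  }
}
```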
13adc0f [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel ## What changes were proposed in this pull request? Fixed a TypeError with Python 3 and numpy 1.12.1. Numpy's `reshape` no longer takes floats as arguments as of 1.12. Also, Python 3 uses float division for `/`, so we should use `//` to ensure that `_dataWithBiasSize` doesn't get set to a float. ## How was this patch tested? Existing tests run using Python 3 and numpy 1.12. Author: Bago Amirbekian <bago@databricks.com> Closes #18081 from MrBago/BF-py3floatbug. (cherry picked from commit bc66a77bbe2120cc21bd8da25194efca4cde13c3) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 24 May 2017, 14:58:16 UTC
f4538c9 [SPARK-20763][SQL][BACKPORT-2.1] The function of `month` and `day` return the value which is not we expected. What changes were proposed in this pull request? This PR is to backport #17997 to Spark 2.1 when the date before "1582-10-04", the function of month and day return the value which is not we expected. How was this patch tested? unit tests Author: liuxian <liu.xian3@zte.com.cn> Closes #18054 from 10110346/wip-lx-0522. 23 May 2017, 01:09:18 UTC
f5ef076 [SPARK-20756][YARN] yarn-shuffle jar references unshaded guava and contains scala classes ## What changes were proposed in this pull request? This change ensures that all references to guava from within the yarn shuffle jar point to the shaded guava class already provided in the jar. Also, it explicitly excludes scala classes from being added to the jar. ## How was this patch tested? Ran unit tests on the module and they passed. javap now returns the expected result - a reference to the shaded guava under `org/spark_project` (previously this was referring to `com.google...`):
```
javap -cp common/network-yarn/target/scala-2.11/spark-2.3.0-SNAPSHOT-yarn-shuffle.jar -c org/apache/spark/network/yarn/YarnShuffleService | grep Lists
  57: invokestatic  #138  // Method org/spark_project/guava/collect/Lists.newArrayList:()Ljava/util/ArrayList;
```
Guava is still shaded in the jar:
```
jar -tf common/network-yarn/target/scala-2.11/spark-2.3.0-SNAPSHOT-yarn-shuffle.jar | grep guava | head
META-INF/maven/com.google.guava/
META-INF/maven/com.google.guava/guava/
META-INF/maven/com.google.guava/guava/pom.properties
META-INF/maven/com.google.guava/guava/pom.xml
org/spark_project/guava/
org/spark_project/guava/annotations/
org/spark_project/guava/annotations/Beta.class
org/spark_project/guava/annotations/GwtCompatible.class
org/spark_project/guava/annotations/GwtIncompatible.class
org/spark_project/guava/annotations/VisibleForTesting.class
```
(not sure if the above META-INF/* is a problem or not) I took this jar, deployed it on a yarn cluster with shuffle service enabled, and made sure the YARN node managers came up. An application with a shuffle was run and it succeeded. Author: Mark Grover <mark@apache.org> Closes #17990 from markgrover/spark-20756. (cherry picked from commit 36309110046a89d749a7c9746eaa16997de26922) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 22 May 2017, 17:11:04 UTC
c3a986b [SPARK-20687][MLLIB] mllib.Matrices.fromBreeze may crash when converting from Breeze sparse matrix ## What changes were proposed in this pull request? When two Breeze SparseMatrices are combined arithmetically, the result matrix may contain extra provisional zero entries in its rowIndices and data arrays. This is inconsistent with the colPtrs data, but Breeze gets away with the inconsistency by keeping a counter of the valid entries. In Spark, when these matrices are converted to SparseMatrices, Spark relies solely on rowIndices, data, and colPtrs, which may therefore be incorrect because of Breeze's internal bookkeeping. We need to slice both rowIndices and data using Breeze's counter of active entries. This conversion is called at least by BlockMatrix when performing distributed block operations, causing exceptions on valid operations. See http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add ## How was this patch tested? Added a test to MatricesSuite that verifies the conversions are valid and that the code doesn't crash; the same code previously crashed on Spark. Bugfix for https://issues.apache.org/jira/browse/SPARK-20687 Author: Ignacio Bermudez <ignaciobermudez@gmail.com> Author: Ignacio Bermudez Corrales <icorrales@splunk.com> Closes #17940 from ghoto/bug-fix/SPARK-20687. (cherry picked from commit 06dda1d58f8670e996921e935d5f5402d664699e) Signed-off-by: Sean Owen <sowen@cloudera.com> 22 May 2017, 09:27:48 UTC
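The core idea, sketched against Breeze's `CSCMatrix` (mirroring the description above; not necessarily the exact patch):
```scala
import breeze.linalg.{CSCMatrix => BSM}
import org.apache.spark.mllib.linalg.SparseMatrix

// Breeze may leave provisional zero entries past `activeSize` in `data` and
// `rowIndices` after arithmetic, so slice both arrays down to the active
// range before constructing Spark's SparseMatrix.
def fromBreezeSparse(sm: BSM[Double]): SparseMatrix = {
  if (sm.rowIndices.length > sm.activeSize) {
    new SparseMatrix(sm.rows, sm.cols, sm.colPtrs,
      sm.rowIndices.slice(0, sm.activeSize), sm.data.slice(0, sm.activeSize))
  } else {
    new SparseMatrix(sm.rows, sm.cols, sm.colPtrs, sm.rowIndices, sm.data)
  }
}
```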
e9804b3 [SPARK-20781] the location of Dockerfile in docker.properties.template is wrong [https://issues.apache.org/jira/browse/SPARK-20781](https://issues.apache.org/jira/browse/SPARK-20781) The location of the Dockerfile in docker.properties.template should be "../external/docker/spark-mesos/Dockerfile". Author: liuzhaokun <liu.zhaokun@zte.com.cn> Closes #18013 from liu-zhaokun/dockerfile_location. (cherry picked from commit 749418d285461958a0f22ed355edafd87f1ee913) Signed-off-by: Sean Owen <sowen@cloudera.com> 19 May 2017, 19:47:48 UTC
c53fe79 [SPARK-20759] SCALA_VERSION in _config.yml should be consistent with pom.xml [https://issues.apache.org/jira/browse/SPARK-20759](https://issues.apache.org/jira/browse/SPARK-20759) SCALA_VERSION in _config.yml is 2.11.7, but it is 2.11.8 in pom.xml, so SCALA_VERSION in _config.yml should be made consistent with pom.xml. Author: liuzhaokun <liu.zhaokun@zte.com.cn> Closes #17992 from liu-zhaokun/new. (cherry picked from commit dba2ca2c129b6d2597f1707e0315d4e238c40ed6) Signed-off-by: Sean Owen <sowen@cloudera.com> 19 May 2017, 14:27:48 UTC
e326de4 [SPARK-20798] GenerateUnsafeProjection should check if a value is null before calling the getter ## What changes were proposed in this pull request? GenerateUnsafeProjection.writeStructToBuffer() did not honor the assumption that the caller must make sure that a value is not null before using the getter. This could lead to various errors. This change fixes that behavior. Example of code generated before:
```scala
/* 059 */ final UTF8String fieldName = value.getUTF8String(0);
/* 060 */ if (value.isNullAt(0)) {
/* 061 */   rowWriter1.setNullAt(0);
/* 062 */ } else {
/* 063 */   rowWriter1.write(0, fieldName);
/* 064 */ }
```
Example of code generated now:
```scala
/* 060 */ boolean isNull1 = value.isNullAt(0);
/* 061 */ UTF8String value1 = isNull1 ? null : value.getUTF8String(0);
/* 062 */ if (isNull1) {
/* 063 */   rowWriter1.setNullAt(0);
/* 064 */ } else {
/* 065 */   rowWriter1.write(0, value1);
/* 066 */ }
```
## How was this patch tested? Adds GenerateUnsafeProjectionSuite. Author: Ala Luszczak <ala@databricks.com> Closes #18030 from ala/fix-generate-unsafe-projection. (cherry picked from commit ce8edb8bf4db5f82bcfeb11efbdf5229b0d25dfa) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 19 May 2017, 11:19:17 UTC
e06d936 [SPARK-20796] the location of start-master.sh in spark-standalone.md is wrong [https://issues.apache.org/jira/browse/SPARK-20796](https://issues.apache.org/jira/browse/SPARK-20796) the location of start-master.sh in spark-standalone.md should be "sbin/start-master.sh" rather than "bin/start-master.sh". Author: liuzhaokun <liu.zhaokun@zte.com.cn> Closes #18027 from liu-zhaokun/sbin. (cherry picked from commit 99452df44fb98c2721d427da4c97f549793615fe) Signed-off-by: Sean Owen <sowen@cloudera.com> 18 May 2017, 16:44:58 UTC
ba35c6b [SPARK-20769][DOC] Incorrect documentation for using Jupyter notebook ## What changes were proposed in this pull request? SPARK-13973 incorrectly removed the required PYSPARK_DRIVER_PYTHON_OPTS=notebook from documentation to use pyspark with Jupyter notebook. This patch corrects the documentation error. ## How was this patch tested? Tested invocation locally with ```bash PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook ./bin/pyspark ``` Author: Andrew Ray <ray.andrew@gmail.com> Closes #18001 from aray/patch-1. (cherry picked from commit 1995417696a028f8a4fa7f706a77537c7182528d) Signed-off-by: Sean Owen <sowen@cloudera.com> 17 May 2017, 09:06:19 UTC
14b6a9d [SPARK-20735][SQL][TEST] Enable cross join in TPCDSQueryBenchmark ## What changes were proposed in this pull request? Since [SPARK-17298](https://issues.apache.org/jira/browse/SPARK-17298), some queries (q28, q61, q77, q88, q90) in the test suites fail with a message "_Use the CROSS JOIN syntax to allow cartesian products between these relations_". This benchmark is used as a reference model for Spark TPC-DS, so this PR aims to enable the correct configuration in `TPCDSQueryBenchmark.scala`. ## How was this patch tested? Manual. (Run TPCDSQueryBenchmark) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #17977 from dongjoon-hyun/SPARK-20735. (cherry picked from commit bbd163d589e7503c5cb150d934e7565b18a908f2) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 15 May 2017, 18:25:03 UTC
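For reference, the setting being enabled is presumably the cross-join flag introduced by SPARK-17298; one way to set it when building a session (a sketch, with the app name chosen for illustration):
```scala
import org.apache.spark.sql.SparkSession

// Allow cartesian products so the five affected TPC-DS queries can run.
val spark = SparkSession.builder()
  .appName("TPCDSQueryBenchmark")
  .config("spark.sql.crossJoin.enabled", "true")
  .getOrCreate()
```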
62969e9 [SPARK-20705][WEB-UI] The sort function cannot be used on the master page in Firefox or Google Chrome. ## What changes were proposed in this pull request? When you open the master page in Firefox or Google Chrome, errors appear in the browser console and the sort function does not work; IE has no such problem. e.g. ![error](https://cloud.githubusercontent.com/assets/26266482/25946143/74467a5c-367c-11e7-8f9f-d3585b1aea88.png) My Firefox version is 48.0.2. My Google Chrome version is 49.0.2623.75 m. ## How was this patch tested? Manual tests. Author: guoxiaolong <guo.xiaolong1@zte.com.cn> Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn> Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn> Closes #17952 from guoxiaolongzte/SPARK-20705. (cherry picked from commit 99d5799927301e7dfceb9405e2829af3433f104b) Signed-off-by: Sean Owen <sowen@cloudera.com> 15 May 2017, 06:52:09 UTC
95de467 [SPARK-17424] Fix unsound substitution bug in ScalaReflection. ## What changes were proposed in this pull request? This method gets a type's primary constructor and fills in type parameters with concrete types, for example `MapPartitions[T, U] -> MapPartitions[Int, String]`. The substitution fails when the actual type args are empty because they are still unknown. Instead, when there are no resolved types to substitute, this returns the original args with unresolved type parameters. ## How was this patch tested? This doesn't affect substitutions where the type args are determined. This fixes our case, where the actual type args are empty and our job now runs successfully. Author: Ryan Blue <blue@apache.org> Closes #15062 from rdblue/SPARK-17424-fix-unsound-reflect-substitution. (cherry picked from commit b23693390781a99ff9248ea07a22e68884ffc747) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 May 2017, 12:39:12 UTC
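A simplified sketch of the guard using Scala runtime reflection (illustrative only; `ScalaReflection` itself handles more cases):
```scala
import scala.reflect.runtime.universe._

// Resolve a case class's constructor parameter types, substituting type
// parameters only when the actual type arguments are known; otherwise return
// the original, still-parameterized signatures instead of substituting
// unsoundly against an empty argument list.
def constructorParamTypes(tpe: Type): Seq[(String, Type)] = {
  val formalTypeArgs = tpe.typeSymbol.asClass.typeParams
  val actualTypeArgs = tpe.typeArgs
  val ctor = tpe.member(termNames.CONSTRUCTOR).asMethod
  ctor.paramLists.flatten.map { p =>
    val sig = p.typeSignature
    val resolved =
      if (actualTypeArgs.nonEmpty) sig.substituteTypes(formalTypeArgs, actualTypeArgs)
      else sig  // type args not yet resolved: leave the parameters in place
    p.name.toString -> resolved
  }
}
```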
6e89d57 [SPARK-20665][SQL] "Bround" and "Round" functions return NULL
```
spark-sql> select bround(12.3, 2);
NULL
```
For this case the expected result is 12.3, but it is NULL. So when the second parameter is bigger than the scale of the decimal value, the result is not what we expect. The "round" function has the same problem. This PR solves the problem for both of them. Tested with unit test cases in MathExpressionsSuite and MathFunctionsSuite. Author: liuxian <liu.xian3@zte.com.cn> Closes #17906 from 10110346/wip_lx_0509. (cherry picked from commit 2b36eb696f6c738e1328582630755aaac4293460) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 May 2017, 03:41:40 UTC
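The expected semantics, sketched with plain `java.math.BigDecimal` (HALF_EVEN corresponds to `bround`, HALF_UP to `round`): asking for more scale than the value carries should leave the value unchanged, not produce NULL.
```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

val v = new JBigDecimal("12.3")
// Scale 2 exceeds the value's scale of 1: the value is padded, not nulled.
assert(v.setScale(2, RoundingMode.HALF_EVEN).doubleValue() == 12.3)  // bround
assert(v.setScale(2, RoundingMode.HALF_UP).doubleValue() == 12.3)    // round
```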
92a71a6 [SPARK-20685] Fix BatchPythonEvaluation bug in case of single UDF w/ repeated arg. ## What changes were proposed in this pull request? There's a latent corner-case bug in PySpark UDF evaluation where executing a `BatchPythonEvaluation` with a single multi-argument UDF where _at least one argument value is repeated_ will crash at execution with a confusing error. This problem was introduced in #12057: the code there has a fast path for handling a "batch UDF evaluation consisting of a single Python UDF", but that branch incorrectly assumes that a single UDF won't have repeated arguments and therefore skips the code for unpacking arguments from the input row (whose schema may not necessarily match the UDF inputs due to de-duplication of repeated arguments which occurred in the JVM before sending UDF inputs to Python). This fix here is simply to remove this special-casing: it turns out that the code in the "multiple UDFs" branch just so happens to work for the single-UDF case because Python treats `(x)` as equivalent to `x`, not as a single-argument tuple. ## How was this patch tested? New regression test in `pyspark.python.sql.tests` module (tested and confirmed that it fails before my fix). Author: Josh Rosen <joshrosen@databricks.com> Closes #17927 from JoshRosen/SPARK-20685. (cherry picked from commit 8ddbc431d8b21d5ee57d3d209a4f25e301f15283) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 10 May 2017, 23:51:16 UTC
bdc08ab [SPARK-20688][SQL] correctly check analysis for scalar sub-queries In `CheckAnalysis`, we should call `checkAnalysis` for `ScalarSubquery` at the beginning, as we later call `plan.output`, which is invalid if `plan` is not resolved. Tested with a new regression test. Author: Wenchen Fan <wenchen@databricks.com> Closes #17930 from cloud-fan/tmp. (cherry picked from commit 789bdbe3d0d9558043872161bdfa148ec021a849) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 May 2017, 11:32:44 UTC
69786ea [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params ## What changes were proposed in this pull request? - Replace `getParam` calls with `getOrDefault` calls. - Fix exception message to avoid unintended `TypeError`. - Add unit tests ## How was this patch tested? New unit tests. Author: zero323 <zero323@users.noreply.github.com> Closes #17891 from zero323/SPARK-20631. (cherry picked from commit 804949c6bf00b8e26c39d48bbcc4d0470ee84e47) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 10 May 2017, 08:58:34 UTC
8e09789 [SPARK-20686][SQL] PropagateEmptyRelation incorrectly handles aggregate without grouping The query ``` SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1 ``` should return a single row of output because the subquery is an aggregate without a group-by and thus should return a single row. However, Spark incorrectly returns zero rows. This is caused by SPARK-16208 / #13906, a patch which added an optimizer rule to propagate EmptyRelation through operators. The logic for handling aggregates is wrong: it checks whether aggregate expressions are non-empty for deciding whether the output should be empty, whereas it should be checking grouping expressions instead: An aggregate with non-empty grouping expression will return one output row per group. If the input to the grouped aggregate is empty then all groups will be empty and thus the output will be empty. It doesn't matter whether the aggregation output columns include aggregate expressions since that won't affect the number of output rows. If the grouping expressions are empty, however, then the aggregate will always produce a single output row and thus we cannot propagate the EmptyRelation. The current implementation is incorrect and also misses an optimization opportunity by not propagating EmptyRelation in the case where a grouped aggregate has aggregate expressions (in other words, `SELECT COUNT(*) from emptyRelation GROUP BY x` would _not_ be optimized to `EmptyRelation` in the old code, even though it safely could be). This patch resolves this issue by modifying `PropagateEmptyRelation` to consider only the presence/absence of grouping expressions, not the aggregate functions themselves, when deciding whether to propagate EmptyRelation. - Added end-to-end regression tests in `SQLQueryTest`'s `group-by.sql` file. - Updated unit tests in `PropagateEmptyRelationSuite`. Author: Josh Rosen <joshrosen@databricks.com> Closes #17929 from JoshRosen/fix-PropagateEmptyRelation. (cherry picked from commit a90c5cd8226146a58362732171b92cb99a7bc4c7) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 May 2017, 06:41:46 UTC
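The distinction can be checked end-to-end with the DataFrame API (a sketch; assumes an existing `spark` session):
```scala
import org.apache.spark.sql.functions.count

val empty = spark.range(10).where("false")     // an empty relation
empty.agg(count("*")).show()                   // global aggregate: one row, count = 0
empty.groupBy("id").agg(count("*")).show()     // grouped aggregate: zero rows
```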
50f28df [SPARK-17685][SQL] Set SortMergeJoinExec's currentVars to null when calling createJoinKey ## What changes were proposed in this pull request? The following SQL query causes an `IndexOutOfBoundsException` when `LIMIT > 1310720`:
```sql
CREATE TABLE tab1(int int, int2 int, str string);
CREATE TABLE tab2(int int, int2 int, str string);
INSERT INTO tab1 values(1,1,'str');
INSERT INTO tab1 values(2,2,'str');
INSERT INTO tab2 values(1,1,'str');
INSERT INTO tab2 values(2,3,'str');

SELECT
  count(*)
FROM
(
  SELECT t1.int, t2.int2
  FROM (SELECT * FROM tab1 LIMIT 1310721) t1
  INNER JOIN (SELECT * FROM tab2 LIMIT 1310721) t2
  ON (t1.int = t2.int AND t1.int2 = t2.int2)
) t;
```
This pull request fixes the issue. ## How was this patch tested? Unit tests. Author: Yuming Wang <wgyumg@gmail.com> Closes #17920 from wangyum/SPARK-17685. (cherry picked from commit 771abeb46f637592aba2e63db2ed05b6cabfd0be) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 10 May 2017, 02:45:22 UTC
12c937e [SPARK-20627][PYSPARK] Drop the hadoop distribution name from the Python version ## What changes were proposed in this pull request? Drop the hadoop distribution name from the Python version (PEP440 - https://www.python.org/dev/peps/pep-0440/). We've been using the local version string to disambiguate between different hadoop versions packaged with PySpark, but PEP 440 states that local versions should not be used when publishing upstream. Since we no longer make PySpark pip packages for different hadoop versions, we can simply drop the hadoop information. If at a later point we need to start publishing different hadoop versions, we can look at making separate packages or similar. ## How was this patch tested? Ran `make-distribution` locally. Author: Holden Karau <holden@us.ibm.com> Closes #17885 from holdenk/SPARK-20627-remove-pip-local-version-string. (cherry picked from commit 1b85bcd9298cf84dd746fe8e91ab0b0df69ef17e) Signed-off-by: Holden Karau <holden@us.ibm.com> 09 May 2017, 18:26:25 UTC
f7a91a1 [SPARK-20615][ML][TEST] SparseVector.argmax throws IndexOutOfBoundsException ## What changes were proposed in this pull request? Added a check for the number of defined values. Previously the argmax function assumed that at least one value was defined if the vector size was greater than zero. ## How was this patch tested? Tests were added to the existing VectorsSuite to cover this case. Author: Jon McLean <jon.mclean@atsid.com> Closes #17877 from jonmclean/vectorArgmaxIndexBug. (cherry picked from commit be53a78352ae7c70d8a07d0df24574b3e3129b4a) Signed-off-by: Sean Owen <sowen@cloudera.com> 09 May 2017, 08:48:09 UTC
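A hedged sketch of the guard being described (tie-breaking details of the real implementation are simplified):
```scala
// argmax over a sparse vector of logical length `size`, with stored entries
// (indices(i), values(i)). The guards handle the empty cases up front.
def argmax(size: Int, indices: Array[Int], values: Array[Double]): Int = {
  if (size == 0) {
    -1                  // empty vector: no argmax
  } else if (values.isEmpty) {
    0                   // the fix: no stored values means every element is 0.0
  } else {
    val maxStored = values.max
    if (maxStored < 0 && values.length < size) {
      // All stored values are negative but implicit zeros exist, so the
      // maximum (0.0) sits at the first index with no stored entry.
      (0 until size).find(i => !indices.contains(i)).get
    } else {
      indices(values.indexOf(maxStored))  // first stored occurrence of the max
    }
  }
}
```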
a1112c6 [SPARK-20616] RuleExecutor logDebug of batch results should show diff to start of batch ## What changes were proposed in this pull request? Due to a likely typo, the logDebug message printing the diff of query plans showed a diff against the initial plan, rather than against the plan at the start of the batch. ## How was this patch tested? Verified that the debug message now prints the diff between the start and end of the batch. Author: Juliusz Sompolski <julek@databricks.com> Closes #17875 from juliuszsompolski/SPARK-20616. (cherry picked from commit 5d75b14bf0f4c1f0813287efaabf49797908ed55) Signed-off-by: Reynold Xin <rxin@databricks.com> 05 May 2017, 22:31:23 UTC
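The logic of the fix, sketched with plain strings standing in for query plans (each batch is reduced to a transformation and Catalyst's side-by-side diffing to a println):
```scala
// Run each batch and log the change it made, diffing against the plan as it
// stood when the batch started -- not against the very first plan.
def run(initialPlan: String, batches: Seq[String => String]): String = {
  var curPlan = initialPlan
  for (batch <- batches) {
    val batchStartPlan = curPlan   // the typo had effectively kept `initialPlan` here
    curPlan = batch(curPlan)
    if (curPlan != batchStartPlan) {
      println(s"=== Result of batch ===\nbefore: $batchStartPlan\nafter:  $curPlan")
    }
  }
  curPlan
}
```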