Revision Author Date Message Commit Date
a8d981d [SPARK-20718][SQL] FileSourceScanExec with different filter orders should be the same after canonicalization ## What changes were proposed in this pull request? Since `constraints` in `QueryPlan` is a set, the order of filters can differ. Usually this is ok because of canonicalization. However, in `FileSourceScanExec`, its data filters and partition filters are sequences, and their orders are not canonicalized. So `def sameResult` returns different results for different orders of data/partition filters. This leads to, e.g. different decision for `ReuseExchange`, and thus results in unstable performance. ## How was this patch tested? Added a new test for `FileSourceScanExec.sameResult`. Author: wangzhenhua <wangzhenhua@huawei.com> Closes #17959 from wzhfy/canonicalizeFileSourceScanExec. (cherry picked from commit c8da5356000c8e4ff9141e4a2892ebe0b9641d63) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 May 2017, 05:43:04 UTC
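The core idea of the fix above can be illustrated with a small sketch. This is not Spark's actual `doCanonicalize` code; the `Filter` type and the canonical key are simplified stand-ins, but it shows why sorting filter sequences by a canonical key makes `sameResult` order-insensitive.

```scala
// Minimal sketch of the idea behind the fix (illustrative, not Spark code):
// treat filter sequences as order-insensitive by sorting them on a canonical
// key before comparing plans.
object FilterCanonicalization {
  case class Filter(expr: String) // stand-in for a catalyst Expression

  // Hypothetical canonical key; Spark uses canonicalized expressions instead.
  private def canonicalKey(f: Filter): String = f.expr.trim.toLowerCase

  def canonicalize(filters: Seq[Filter]): Seq[Filter] =
    filters.sortBy(canonicalKey)

  def sameResult(a: Seq[Filter], b: Seq[Filter]): Boolean =
    canonicalize(a) == canonicalize(b)

  def main(args: Array[String]): Unit = {
    val x = Seq(Filter("a > 1"), Filter("b = 2"))
    val y = Seq(Filter("b = 2"), Filter("a > 1"))
    println(sameResult(x, y)) // true: filter order no longer matters
  }
}
```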
2cac317 [SPARK-20665][SQL] "Bround" and "Round" functions return NULL ## What changes were proposed in this pull request? spark-sql>select bround(12.3, 2); spark-sql>NULL For this case, the expected result is 12.3, but it is null. So, when the second parameter (the target scale) is bigger than the decimal's scale, the result is not what we expected. The "round" function has the same problem. This PR solves the problem for both of them. ## How was this patch tested? unit test cases in MathExpressionsSuite and MathFunctionsSuite Author: liuxian <liu.xian3@zte.com.cn> Closes #17906 from 10110346/wip_lx_0509. (cherry picked from commit 2b36eb696f6c738e1328582630755aaac4293460) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 May 2017, 03:39:02 UTC
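To make the expected semantics concrete, here is a minimal sketch (my own illustration, not Spark's `Round`/`BRound` implementation) of rounding a decimal to a target scale that may exceed the value's own scale; in that case the value should come back unchanged rather than null.

```scala
// Sketch of the expected behavior described above, assuming half-even
// (banker's) rounding for bround.
import scala.math.BigDecimal.RoundingMode

object RoundSketch {
  def bround(value: BigDecimal, scale: Int): BigDecimal =
    if (scale >= value.scale) value // nothing to round away, return as-is
    else value.setScale(scale, RoundingMode.HALF_EVEN)

  def main(args: Array[String]): Unit = {
    println(bround(BigDecimal("12.3"), 2))   // 12.3, the result the PR expects
    println(bround(BigDecimal("12.345"), 2)) // 12.34 (half-even rounding)
  }
}
```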
3d1908f [SPARK-20399][SQL] Add a config to fallback string literal parsing consistent with old sql parser behavior ## What changes were proposed in this pull request? The new SQL parser was introduced in Spark 2.0. All string literals are unescaped in the parser. This seems to bring an issue regarding regex pattern strings. The following code can reproduce it: val data = Seq("\u0020\u0021\u0023", "abc") val df = data.toDF() // 1st usage: works in 1.6 // Let parser parse pattern string val rlike1 = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'") // 2nd usage: works in 1.6, 2.x // Call Column.rlike so the pattern string is a literal which doesn't go through parser val rlike2 = df.filter($"value".rlike("^\\x20[\\x20-\\x23]+$")) // In 2.x, we need to add backslashes to make the regex pattern parse correctly val rlike3 = df.filter("value rlike '^\\\\x20[\\\\x20-\\\\x23]+$'") Following the discussion in #17736, this patch adds a config to fall back to 1.6 string literal parsing and mitigate the migration issue. ## How was this patch tested? Jenkins tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #17887 from viirya/add-config-fallback-string-parsing. (cherry picked from commit 609ba5f2b9fd89b1b9971d08f7cc680d202dbc7c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 May 2017, 03:15:26 UTC
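A hedged usage sketch of the fallback follows. The config key `spark.sql.parser.escapedStringLiterals` is believed to be the one added by this patch, but treat the exact key and its default as assumptions and verify them against the Spark version in use.

```scala
// Sketch: enabling the fallback so 1.6-style regex literals in SQL strings
// parse as the user wrote them. Config key is an assumption (see lead-in).
import org.apache.spark.sql.SparkSession

object EscapedLiteralFallback {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    import spark.implicits._
    spark.conf.set("spark.sql.parser.escapedStringLiterals", "true")
    val df = Seq("\u0020\u0021\u0023", "abc").toDF("value")
    // With the fallback on, the 1.6-style pattern works again in a SQL string:
    df.filter("value rlike '^\\x20[\\x20-\\x23]+$'").show()
    spark.stop()
  }
}
```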
5844151 [SPARK-20600][SS] KafkaRelation should be pretty printed in web UI ## What changes were proposed in this pull request? User-friendly name of `KafkaRelation` in web UI (under Details for Query). ### Before <img width="516" alt="spark-20600-before" src="https://cloud.githubusercontent.com/assets/62313/25841955/74479ac6-34a2-11e7-87fb-d9f62a1356a7.png"> ### After <img width="439" alt="spark-20600-after" src="https://cloud.githubusercontent.com/assets/62313/25841829/f5335630-34a1-11e7-85a4-afe9b66d73c8.png"> ## How was this patch tested? Local build ``` ./bin/spark-shell --jars ~/.m2/repository/org/apache/spark/spark-sql-kafka-0-10_2.11/2.3.0-SNAPSHOT/spark-sql-kafka-0-10_2.11-2.3.0-SNAPSHOT.jar --packages org.apache.kafka:kafka-clients:0.10.0.1 ``` Author: Jacek Laskowski <jacek@japila.pl> Closes #17917 from jaceklaskowski/SPARK-20600-KafkaRelation-webUI. (cherry picked from commit 7144b51809aa99ac076786c369389e2330142beb) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 11 May 2017, 17:55:31 UTC
dd9e3b2 [SPARK-20569][SQL] RuntimeReplaceable functions should not take extra parameters ## What changes were proposed in this pull request? `RuntimeReplaceable` always has a constructor with the expression to replace with, and this constructor should not be the function builder. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #17876 from cloud-fan/minor. (cherry picked from commit b4c99f43690f8cfba414af90fa2b3998a510bba8) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 11 May 2017, 07:41:35 UTC
80a57fa [SPARK-20606][ML] Revert "[] ML 2.2 QA: Remove deprecated methods for ML" This reverts commit b8733e0ad9f5a700f385e210450fd2c10137293e. Author: Yanbo Liang <ybliang8@gmail.com> Closes #17944 from yanboliang/spark-20606-revert. (cherry picked from commit 0698e6c88ca11fdfd6e5498cab784cf6dbcdfacb) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 11 May 2017, 06:48:26 UTC
3eb0ee0 [SPARK-20685] Fix BatchPythonEvaluation bug in case of single UDF w/ repeated arg. ## What changes were proposed in this pull request? There's a latent corner-case bug in PySpark UDF evaluation where executing a `BatchPythonEvaluation` with a single multi-argument UDF where _at least one argument value is repeated_ will crash at execution with a confusing error. This problem was introduced in #12057: the code there has a fast path for handling a "batch UDF evaluation consisting of a single Python UDF", but that branch incorrectly assumes that a single UDF won't have repeated arguments and therefore skips the code for unpacking arguments from the input row (whose schema may not necessarily match the UDF inputs due to de-duplication of repeated arguments which occurred in the JVM before sending UDF inputs to Python). This fix here is simply to remove this special-casing: it turns out that the code in the "multiple UDFs" branch just so happens to work for the single-UDF case because Python treats `(x)` as equivalent to `x`, not as a single-argument tuple. ## How was this patch tested? New regression test in `pyspark.python.sql.tests` module (tested and confirmed that it fails before my fix). Author: Josh Rosen <joshrosen@databricks.com> Closes #17927 from JoshRosen/SPARK-20685. (cherry picked from commit 8ddbc431d8b21d5ee57d3d209a4f25e301f15283) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 10 May 2017, 23:51:05 UTC
86cef4d [SPARK-19447] Remove remaining references to generated rows metric ## What changes were proposed in this pull request? https://github.com/apache/spark/commit/b486ffc86d8ad6c303321dcf8514afee723f61f8 left behind references to "number of generated rows" metrics, that should have been removed. ## How was this patch tested? Existing unit tests. Author: Ala Luszczak <ala@databricks.com> Closes #17939 from ala/SPARK-19447-fix. (cherry picked from commit 5c2c4dcce529d228a97ede0386b95213ea0e1da5) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 10 May 2017, 15:41:14 UTC
358516d [MINOR][BUILD] Fix lint-java breaks. ## What changes were proposed in this pull request? This PR proposes to fix the lint-breaks as below: ``` [ERROR] src/main/java/org/apache/spark/unsafe/Platform.java:[51] (regexp) RegexpSingleline: No trailing whitespace allowed. [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[45,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[62,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[78,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[92,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[102,25] (naming) MethodName: Method name 'Once' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. [ERROR] src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisInputDStreamBuilderSuite.java:[28,8] (imports) UnusedImports: Unused import - org.apache.spark.streaming.api.java.JavaDStream. ``` after: ``` dev/lint-java Checkstyle checks passed. ``` [Test Result](https://travis-ci.org/ConeyLiu/spark/jobs/229666169) ## How was this patch tested? Travis CI Author: Xianyang Liu <xianyang.liu@intel.com> Closes #17890 from ConeyLiu/codestyle. (cherry picked from commit fcb88f9211e39c705073db5300c96ceeb3f227d7) Signed-off-by: Sean Owen <sowen@cloudera.com> 10 May 2017, 12:56:42 UTC
5f6029c [SPARK-20678][SQL] Ndv for columns not in filter condition should also be updated ## What changes were proposed in this pull request? In filter estimation, we update column stats for the columns in the filter condition. However, if the number of rows decreases after the filter (i.e. the overall selectivity is less than 1), we need to update (scale down) the number of distinct values (NDV) for all columns, whether or not they are in filter conditions. This PR also fixes the inconsistency of rounding mode for ndv and rowCount. ## How was this patch tested? Added new tests. Author: wangzhenhua <wangzhenhua@huawei.com> Closes #17918 from wzhfy/scaleDownNdvAfterFilter. (cherry picked from commit 76e4a5566b1e9579632e03440cecd04dd142bc44) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 May 2017, 11:43:05 UTC
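A small sketch of the scaling idea follows; the exact formula and rounding Spark uses may differ, and `scaledNdv` is a hypothetical helper, but it shows how a column not referenced by the filter still gets its distinct count capped when the row count drops.

```scala
// Illustrative NDV scaling after a filter (assumed formula, not Spark's code).
object NdvScaling {
  def scaledNdv(oldNdv: Long, oldRows: Long, newRows: Long): Long = {
    val scaled = math.ceil(oldNdv.toDouble * newRows / oldRows).toLong
    // NDV can never exceed its old value or the new row count.
    math.min(math.min(scaled, oldNdv), newRows)
  }

  def main(args: Array[String]): Unit = {
    // A filter on column a keeps 100 of 1000 rows; column b is NOT in the
    // filter, but its distinct count must still be scaled down.
    println(scaledNdv(oldNdv = 400, oldRows = 1000, newRows = 100)) // 40
  }
}
```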
0851b6c [SPARK-20688][SQL] correctly check analysis for scalar sub-queries ## What changes were proposed in this pull request? In `CheckAnalysis`, we should call `checkAnalysis` for `ScalarSubquery` at the beginning, as later we will call `plan.output` which is invalid if `plan` is not resolved. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #17930 from cloud-fan/tmp. (cherry picked from commit 789bdbe3d0d9558043872161bdfa148ec021a849) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 May 2017, 11:30:26 UTC
7597a52 [SPARK-20630][WEB UI] Fixed column visibility in Executor Tab ## What changes were proposed in this pull request? #14617 added new columns to the executor table causing the visibility checks for the logs and threadDump columns to toggle the wrong columns since they used hard-coded column numbers. I've updated the checks to use column names instead of numbers so future updates don't accidentally break this again. Note: This will also need to be back ported into 2.2 since #14617 was merged there ## How was this patch tested? Manually tested Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #17904 from ajbozarth/spark20630. (cherry picked from commit ca4625e0e58df7f02346470d22a9478d9640709d) Signed-off-by: Sean Owen <sowen@cloudera.com> 10 May 2017, 09:20:19 UTC
3ed2f4d [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params ## What changes were proposed in this pull request? - Replace `getParam` calls with `getOrDefault` calls. - Fix exception message to avoid unintended `TypeError`. - Add unit tests ## How was this patch tested? New unit tests. Author: zero323 <zero323@users.noreply.github.com> Closes #17891 from zero323/SPARK-20631. (cherry picked from commit 804949c6bf00b8e26c39d48bbcc4d0470ee84e47) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 10 May 2017, 08:58:08 UTC
ef50a95 [SPARK-20686][SQL] PropagateEmptyRelation incorrectly handles aggregate without grouping ## What changes were proposed in this pull request? The query ``` SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1 ``` should return a single row of output because the subquery is an aggregate without a group-by and thus should return a single row. However, Spark incorrectly returns zero rows. This is caused by SPARK-16208 / #13906, a patch which added an optimizer rule to propagate EmptyRelation through operators. The logic for handling aggregates is wrong: it checks whether aggregate expressions are non-empty for deciding whether the output should be empty, whereas it should be checking grouping expressions instead: An aggregate with non-empty grouping expression will return one output row per group. If the input to the grouped aggregate is empty then all groups will be empty and thus the output will be empty. It doesn't matter whether the aggregation output columns include aggregate expressions since that won't affect the number of output rows. If the grouping expressions are empty, however, then the aggregate will always produce a single output row and thus we cannot propagate the EmptyRelation. The current implementation is incorrect and also misses an optimization opportunity by not propagating EmptyRelation in the case where a grouped aggregate has aggregate expressions (in other words, `SELECT COUNT(*) from emptyRelation GROUP BY x` would _not_ be optimized to `EmptyRelation` in the old code, even though it safely could be). This patch resolves this issue by modifying `PropagateEmptyRelation` to consider only the presence/absence of grouping expressions, not the aggregate functions themselves, when deciding whether to propagate EmptyRelation. ## How was this patch tested? - Added end-to-end regression tests in `SQLQueryTest`'s `group-by.sql` file. - Updated unit tests in `PropagateEmptyRelationSuite`. Author: Josh Rosen <joshrosen@databricks.com> Closes #17929 from JoshRosen/fix-PropagateEmptyRelation. (cherry picked from commit a90c5cd8226146a58362732171b92cb99a7bc4c7) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 May 2017, 06:36:54 UTC
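The corrected decision rule can be shown with a toy sketch; the `Aggregate` case class and `canPropagateEmpty` helper below are illustrative stand-ins, not Spark's optimizer API.

```scala
// Toy version of the fixed rule: an aggregate over an empty input may be
// replaced by an empty relation only when it has grouping expressions; a
// global aggregate (no grouping) always returns exactly one row.
object PropagateEmptySketch {
  case class Aggregate(groupingExprs: Seq[String], aggExprs: Seq[String], inputIsEmpty: Boolean)

  def canPropagateEmpty(agg: Aggregate): Boolean =
    agg.inputIsEmpty && agg.groupingExprs.nonEmpty

  def main(args: Array[String]): Unit = {
    // SELECT COUNT(*) WHERE FALSE -> one output row, must NOT become empty
    println(canPropagateEmpty(Aggregate(Nil, Seq("count(*)"), inputIsEmpty = true)))      // false
    // SELECT COUNT(*) ... GROUP BY x over empty input -> zero rows, can become empty
    println(canPropagateEmpty(Aggregate(Seq("x"), Seq("count(*)"), inputIsEmpty = true))) // true
  }
}
```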
7b6f3a1 [SPARK-20590][SQL] Use Spark internal datasource if multiples are found for the same shorten name ## What changes were proposed in this pull request? One of the common usability problems around reading data in spark (particularly CSV) is that there can often be a conflict between different readers in the classpath. As an example, if someone launches a 2.x spark shell with the spark-csv package in the classpath, Spark currently fails in an extremely unfriendly way (see databricks/spark-csv#367): ```bash ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0 scala> val df = spark.read.csv("/foo/bar.csv") java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name. at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574) at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85) at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412) ... 48 elided ``` This PR proposes a simple way of fixing this error by picking up the internal datasource if there is single (the datasource that has "org.apache.spark" prefix). ```scala scala> spark.range(1).write.format("csv").mode("overwrite").save("/tmp/abc") 17/05/10 09:47:44 WARN DataSource: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), defaulting to the internal datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat). ``` ```scala scala> spark.range(1).write.format("Csv").mode("overwrite").save("/tmp/abc") 17/05/10 09:47:52 WARN DataSource: Multiple sources found for Csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), defaulting to the internal datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat). ``` ## How was this patch tested? Manually tested as below: ```bash ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0 ``` ```scala spark.sparkContext.setLogLevel("WARN") ``` **positive cases**: ```scala scala> spark.range(1).write.format("csv").mode("overwrite").save("/tmp/abc") 17/05/10 09:47:44 WARN DataSource: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), defaulting to the internal datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat). ``` ```scala scala> spark.range(1).write.format("Csv").mode("overwrite").save("/tmp/abc") 17/05/10 09:47:52 WARN DataSource: Multiple sources found for Csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), defaulting to the internal datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat). ``` (newlines were inserted for readability). 
```scala scala> spark.range(1).write.format("com.databricks.spark.csv").mode("overwrite").save("/tmp/abc") ``` ```scala scala> spark.range(1).write.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").mode("overwrite").save("/tmp/abc") ``` **negative cases**: ```scala scala> spark.range(1).write.format("com.databricks.spark.csv.CsvRelation").save("/tmp/abc") java.lang.InstantiationException: com.databricks.spark.csv.CsvRelation ... ``` ```scala scala> spark.range(1).write.format("com.databricks.spark.csv.CsvRelatio").save("/tmp/abc") java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv.CsvRelatio. Please find packages at http://spark.apache.org/third-party-projects.html ... ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #17916 from HyukjinKwon/datasource-detect. (cherry picked from commit 3d2131ab4ddead29601fb3c597b798202ac25fdd) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 May 2017, 05:45:07 UTC
6a996b3 [SPARK-17685][SQL] Make SortMergeJoinExec's currentVars null when calling createJoinKey ## What changes were proposed in this pull request? The following SQL query causes an `IndexOutOfBoundsException` when `LIMIT > 1310720`: ```sql CREATE TABLE tab1(int int, int2 int, str string); CREATE TABLE tab2(int int, int2 int, str string); INSERT INTO tab1 values(1,1,'str'); INSERT INTO tab1 values(2,2,'str'); INSERT INTO tab2 values(1,1,'str'); INSERT INTO tab2 values(2,3,'str'); SELECT count(*) FROM ( SELECT t1.int, t2.int2 FROM (SELECT * FROM tab1 LIMIT 1310721) t1 INNER JOIN (SELECT * FROM tab2 LIMIT 1310721) t2 ON (t1.int = t2.int AND t1.int2 = t2.int2) ) t; ``` This pull request fixes this issue. ## How was this patch tested? unit tests Author: Yuming Wang <wgyumg@gmail.com> Closes #17920 from wangyum/SPARK-17685. (cherry picked from commit 771abeb46f637592aba2e63db2ed05b6cabfd0be) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 10 May 2017, 02:45:09 UTC
7600a7a [SPARK-20373][SQL][SS] Batch queries with `Dataset/DataFrame.withWatermark()` do not execute ## What changes were proposed in this pull request? Any Dataset/DataFrame batch query with the operation `withWatermark` does not execute because the batch planner does not have any rule to explicitly handle the EventTimeWatermark logical plan. The right solution is to simply remove the plan node, as the watermark should not affect any batch query in any way. Changes: - In this PR, we add a new rule `EliminateEventTimeWatermark` to check if we need to ignore the event time watermark. We will ignore the watermark in any batch query. Depends upon: - [SPARK-20672](https://issues.apache.org/jira/browse/SPARK-20672). We cannot add this rule into the analyzer directly, because the streaming query will be copied to `triggerLogicalPlan` in every trigger, and the rule would be applied to `triggerLogicalPlan` mistakenly. Others: - A typo fix in an example. ## How was this patch tested? Added a new unit test. Author: uncleGen <hustyugm@gmail.com> Closes #17896 from uncleGen/SPARK-20373. (cherry picked from commit c0189abc7c6ddbecc1832d2ff0cfc5546a010b60) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 09 May 2017, 22:08:38 UTC
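A toy sketch of the rule's intent (simplified stand-in types, not catalyst classes): the watermark node is dropped only when the child plan is a batch (non-streaming) plan, and kept otherwise.

```scala
// Illustrative-only model of EliminateEventTimeWatermark's behavior.
object EliminateWatermarkSketch {
  sealed trait Plan { def isStreaming: Boolean }
  case class Relation(isStreaming: Boolean) extends Plan
  case class EventTimeWatermark(child: Plan) extends Plan {
    def isStreaming: Boolean = child.isStreaming
  }

  def eliminate(plan: Plan): Plan = plan match {
    case EventTimeWatermark(child) if !child.isStreaming => child // drop for batch queries
    case other => other                                           // keep for streaming queries
  }

  def main(args: Array[String]): Unit = {
    println(eliminate(EventTimeWatermark(Relation(isStreaming = false)))) // Relation(false)
    println(eliminate(EventTimeWatermark(Relation(isStreaming = true))))  // watermark kept
  }
}
```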
d191b96 Revert "[SPARK-20311][SQL] Support aliases for table value functions" This reverts commit 714811d0b5bcb5d47c39782ff74f898d276ecc59. 09 May 2017, 21:49:02 UTC
9e8d23b [SPARK-20627][PYSPARK] Drop the hadoop distribution name from the Python version ## What changes were proposed in this pull request? Drop the hadoop distribution name from the Python version (PEP440 - https://www.python.org/dev/peps/pep-0440/). We've been using the local version string to disambiguate between different hadoop versions packaged with PySpark, but PEP0440 states that local versions should not be used when publishing upstream. Since we no longer make PySpark pip packages for different hadoop versions, we can simply drop the hadoop information. If at a later point we need to start publishing different hadoop versions we can look at making different packages or similar. ## How was this patch tested? Ran `make-distribution` locally Author: Holden Karau <holden@us.ibm.com> Closes #17885 from holdenk/SPARK-20627-remove-pip-local-version-string. (cherry picked from commit 1b85bcd9298cf84dd746fe8e91ab0b0df69ef17e) Signed-off-by: Holden Karau <holden@us.ibm.com> 09 May 2017, 18:26:00 UTC
c7bd909 [SPARK-19876][BUILD] Move Trigger.java to java source hierarchy ## What changes were proposed in this pull request? Simply moves `Trigger.java` to `src/main/java` from `src/main/scala` See https://github.com/apache/spark/pull/17219 ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #17921 from srowen/SPARK-19876.2. (cherry picked from commit 25ee816e090c42f0e35be2d2cb0f8ec60726317c) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 09 May 2017, 17:22:32 UTC
73aa23b [SPARK-20674][SQL] Support registering UserDefinedFunction as named UDF ## What changes were proposed in this pull request? For some reason we don't have an API to register UserDefinedFunction as named UDF. It is a no brainer to add one, in addition to the existing register functions we have. ## How was this patch tested? Added a test case in UDFSuite for the new API. Author: Reynold Xin <rxin@databricks.com> Closes #17915 from rxin/SPARK-20674. (cherry picked from commit d099f414d2cb53f5a61f6e77317c736be6f953a0) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 09 May 2017, 16:24:36 UTC
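A hedged example of the new registration path; the API shape (`spark.udf.register(name, udf)`) follows this PR's description, but verify it against the Spark version in use.

```scala
// Sketch: registering an existing UserDefinedFunction under a name so it can
// be called from SQL. API shape is assumed from the PR description.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object RegisterNamedUdf {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("udf-demo").getOrCreate()
    val plusOne = udf((x: Int) => x + 1)    // an unnamed UserDefinedFunction
    spark.udf.register("plus_one", plusOne) // the registration path added here
    spark.sql("SELECT plus_one(41)").show() // returns 42
    spark.stop()
  }
}
```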
08e1b78 [SPARK-20548][FLAKY-TEST] share one REPL instance among REPL test cases `ReplSuite.newProductSeqEncoder with REPL defined class` was flaky and frequently threw OOM exceptions. By analyzing the heap dump, we found the reason is that, in each test case of `ReplSuite`, we create a REPL instance, which creates a classloader and loads a lot of classes related to `SparkContext`. For more details please see https://github.com/apache/spark/pull/17833#issuecomment-298711435. In this PR, we create a new test suite, `SingletonReplSuite`, which shares one REPL instance among all the test cases. Then we move most of the tests from `ReplSuite` to `SingletonReplSuite`, to avoid creating a lot of REPL instances and reduce the memory footprint. Test-only change. Author: Wenchen Fan <wenchen@databricks.com> Closes #17844 from cloud-fan/flaky-test. (cherry picked from commit f561a76b2f895dea52f228a9376948242c3331ad) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 09 May 2017, 16:11:25 UTC
272d2a1 [SPARK-20311][SQL] Support aliases for table value functions ## What changes were proposed in this pull request? This pr added parsing rules to support aliases in table value functions. ## How was this patch tested? Added tests in `PlanParserSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #17666 from maropu/SPARK-20311. (cherry picked from commit 714811d0b5bcb5d47c39782ff74f898d276ecc59) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 09 May 2017, 12:23:53 UTC
b330967 [SPARK-20667][SQL][TESTS] Cleanup the cataloged metadata after completing the package of sql/core and sql/hive ## What changes were proposed in this pull request? So far, we do not drop all the cataloged objects after each package. Sometimes, we might hit strange test case errors because the previous test suite did not drop the cataloged/temporary objects (tables/functions/database). At least, we can first clean up the environment when completing the package of `sql/core` and `sql/hive`. ## How was this patch tested? N/A Author: Xiao Li <gatorsmile@gmail.com> Closes #17908 from gatorsmile/reset. (cherry picked from commit 0d00c768a860fc03402c8f0c9081b8147c29133e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 09 May 2017, 12:11:08 UTC
4b7aa0b [SPARK-20606][ML] ML 2.2 QA: Remove deprecated methods for ML ## What changes were proposed in this pull request? Remove ML methods we deprecated in 2.1. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #17867 from yanboliang/spark-20606. (cherry picked from commit b8733e0ad9f5a700f385e210450fd2c10137293e) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 09 May 2017, 09:30:50 UTC
4bbfad4 [SPARK-20615][ML][TEST] SparseVector.argmax throws IndexOutOfBoundsException ## What changes were proposed in this pull request? Added a check for the number of defined values. Previously the argmax function assumed that at least one value was defined if the vector size was greater than zero. ## How was this patch tested? Tests were added to the existing VectorsSuite to cover this case. Author: Jon McLean <jon.mclean@atsid.com> Closes #17877 from jonmclean/vectorArgmaxIndexBug. (cherry picked from commit be53a78352ae7c70d8a07d0df24574b3e3129b4a) Signed-off-by: Sean Owen <sowen@cloudera.com> 09 May 2017, 08:47:58 UTC
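A hedged repro sketch of the reported issue; the `ml.linalg` package choice here is an assumption (the same `argmax` API also exists in `mllib.linalg`), and the exact index returned after the fix is not asserted.

```scala
// Sketch: a sparse vector with no explicitly stored values (all implicit
// zeros) used to make argmax throw IndexOutOfBoundsException; after the fix
// the call should succeed.
import org.apache.spark.ml.linalg.Vectors

object SparseArgmaxRepro {
  def main(args: Array[String]): Unit = {
    val allZeros = Vectors.sparse(5, Array.empty[Int], Array.empty[Double])
    println(allZeros.argmax) // previously threw; now returns an index of a zero value
  }
}
```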
ca3f7ed [SPARK-20587][ML] Improve performance of ML ALS recommendForAll This PR is a `DataFrame` version of #17742 for [SPARK-11968](https://issues.apache.org/jira/browse/SPARK-11968), for improving the performance of `recommendAll` methods. ## How was this patch tested? Existing unit tests. Author: Nick Pentreath <nickp@za.ibm.com> Closes #17845 from MLnick/ml-als-perf. (cherry picked from commit 10b00abadf4a3473332eef996db7b66f491316f2) Signed-off-by: Nick Pentreath <nickp@za.ibm.com> 09 May 2017, 08:13:36 UTC
72fca9a [SPARK-11968][MLLIB] Optimize MLLIB ALS recommendForAll The recommendForAll method of MLLIB ALS is very slow. GC is a key problem of the current method. The task uses the following code to keep the temp result: val output = new Array[(Int, (Int, Double))](m*n) m = n = 4096 (default value, no method to set) so output is about 4k * 4k * (4 + 4 + 8) bytes = 256 MB. This is a large allocation that causes serious GC problems, and it frequently leads to OOM. Actually, we don't need to save all the temp results. Suppose we recommend topK (topK is about 10 or 20) products for each user; then we only need 4k * topK * (4 + 4 + 8) bytes to save the temp result. The test environment: 3 workers, each with 10 cores, 30 GB memory, and 1 executor. The data: 480,000 users and 17,000 items. For block sizes 1024 / 2048 / 4096 / 8192, the old method takes 245s / 332s / 488s / OOM, while this solution takes 121s / 118s / 117s / 120s. Tested with the existing UT. Author: Peng <peng.meng@intel.com> Author: Peng Meng <peng.meng@intel.com> Closes #17742 from mpjlu/OptimizeAls. (cherry picked from commit 8079424763c2043264f30a6898ce964379bd9b56) Signed-off-by: Nick Pentreath <nickp@za.ibm.com> 09 May 2017, 08:08:23 UTC
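The memory figures quoted above can be checked with quick arithmetic, assuming 16 bytes per `(Int, (Int, Double))` entry and ignoring JVM object overhead:

```scala
// Quick check of the 256 MB figure vs. the top-K buffer size.
object AlsBufferMath {
  def main(args: Array[String]): Unit = {
    val bytesPerEntry = 4 + 4 + 8                  // Int + Int + Double payload
    val full = 4096L * 4096L * bytesPerEntry       // old m*n buffer
    val topK = 4096L * 20L * bytesPerEntry         // only top-20 kept per user
    println(full / (1024 * 1024)) // 256 (MB), matching the text
    println(topK / 1024)          // 1280 (KB), orders of magnitude smaller
  }
}
```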
54e0743 [SPARK-20661][SPARKR][TEST][FOLLOWUP] SparkR tableNames() test fails ## What changes were proposed in this pull request? Change it to check for relative count like in this test https://github.com/apache/spark/blame/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L3355 for catalog APIs ## How was this patch tested? unit tests, this needs to combine with another commit with SQL change to check Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17905 from felixcheung/rtabletests. (cherry picked from commit b952b44af4d243f1e3ad88bccf4af7d04df3fc81) Signed-off-by: Felix Cheung <felixcheung@apache.org> 09 May 2017, 05:49:53 UTC
4179ffc [SPARK-20661][SPARKR][TEST] SparkR tableNames() test fails ## What changes were proposed in this pull request? Cleaning existing temp tables before running tableNames tests ## How was this patch tested? SparkR Unit tests Author: Hossein <hossein@databricks.com> Closes #17903 from falaki/SPARK-20661. (cherry picked from commit 2abfee18b6511482b916c36f00bf3abf68a59e19) Signed-off-by: Yin Huai <yhuai@databricks.com> 08 May 2017, 21:48:29 UTC
23681e9 [SPARK-20621][DEPLOY] Delete deprecated config parameter in 'spark-env.sh' ## What changes were proposed in this pull request? Currently, `spark.executor.instances` is deprecated in `spark-env.sh`, because we suggest configuring it in `spark-defaults.conf` or another config file. Moreover, this parameter has no effect even if you set it in `spark-env.sh`, so this patch removes it. ## How was this patch tested? Existing tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Xianyang Liu <xianyang.liu@intel.com> Closes #17881 from ConeyLiu/deprecatedParam. (cherry picked from commit aeb2ecc0cd898f5352df0a04be1014b02ea3e20e) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 08 May 2017, 17:25:35 UTC
7b9d05a [SPARK-20596][ML][TEST] Consolidate and improve ALS recommendAll test cases Existing test cases for `recommendForAllX` methods (added in [SPARK-19535](https://issues.apache.org/jira/browse/SPARK-19535)) test `k < num items` and `k = num items`. Technically we should also test that `k > num items` returns the same results as `k = num items`. ## How was this patch tested? Updated existing unit tests. Author: Nick Pentreath <nickp@za.ibm.com> Closes #17860 from MLnick/SPARK-20596-als-rec-tests. (cherry picked from commit 58518d070777fc0665c4d02bad8adf910807df98) Signed-off-by: Nick Pentreath <nickp@za.ibm.com> 08 May 2017, 10:45:17 UTC
d8a5a0d [SPARKR][DOC] fix typo in vignettes ## What changes were proposed in this pull request? Fix typo in vignettes Author: Wayne Zhang <actuaryzhang@uber.com> Closes #17884 from actuaryzhang/typo. (cherry picked from commit 2fdaeb52bbe2ed1a9127ac72917286e505303c85) Signed-off-by: Felix Cheung <felixcheung@apache.org> 08 May 2017, 06:16:44 UTC
6c5b7e1 [SPARK-20626][SPARKR] address date test warning with timezone on windows ## What changes were proposed in this pull request? set timezone on windows ## How was this patch tested? unit test, AppVeyor Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17892 from felixcheung/rtimestamptest. (cherry picked from commit c24bdaab5a234d18b273544cefc44cc4005bf8fc) Signed-off-by: Felix Cheung <felixcheung@apache.org> 08 May 2017, 06:10:42 UTC
048e989 [SPARK-20543][SPARKR][FOLLOWUP] Don't skip tests on AppVeyor ## What changes were proposed in this pull request? add environment ## How was this patch tested? wait for appveyor run Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17878 from felixcheung/appveyorrcran. (cherry picked from commit 7087e01194964a1aad0b45bdb41506a17100eacf) Signed-off-by: Felix Cheung <felixcheung@apache.org> 07 May 2017, 20:10:25 UTC
423a786 [SPARK-20208][DOCS][FOLLOW-UP] Add FP-Growth to SparkR programming guide ## What changes were proposed in this pull request? Add `spark.fpGrowth` to SparkR programming guide. ## How was this patch tested? Manual tests. Author: zero323 <zero323@users.noreply.github.com> Closes #17775 from zero323/SPARK-20208-FOLLOW-UP. (cherry picked from commit ba7666274e71f1903e5050a5e53fbdcd21debde5) Signed-off-by: Felix Cheung <felixcheung@apache.org> 06 May 2017, 03:55:44 UTC
1d9b7a7 [SPARK-20616] RuleExecutor logDebug of batch results should show diff to start of batch ## What changes were proposed in this pull request? Due to a likely typo, the logDebug msg printing the diff of query plans shows a diff to the initial plan, not diff to the start of batch. ## How was this patch tested? Now the debug message prints the diff between start and end of batch. Author: Juliusz Sompolski <julek@databricks.com> Closes #17875 from juliuszsompolski/SPARK-20616. (cherry picked from commit 5d75b14bf0f4c1f0813287efaabf49797908ed55) Signed-off-by: Reynold Xin <rxin@databricks.com> 05 May 2017, 22:31:13 UTC
f59c74a [SPARK-20132][DOCS] Add documentation for column string functions ## What changes were proposed in this pull request? Add docstrings to column.py for the Column functions `rlike`, `like`, `startswith`, and `endswith`. Pass these docstrings through `_bin_op` There may be a better place to put the docstrings. I put them immediately above the Column class. ## How was this patch tested? I ran `make html` on my local computer to remake the documentation, and verified that the html pages were displaying the docstrings correctly. I tried running `dev-tests`, and the formatting tests passed. However, my mvn build didn't work I think due to issues on my computer. These docstrings are my original work and free license. davies has done the most recent work reorganizing `_bin_op` Author: Michael Patterson <map222@gmail.com> Closes #17469 from map222/patterson-documentation. 05 May 2017, 20:26:49 UTC
24fffac [SPARK-20603][SS][TEST] Set default number of topic partitions to 1 to reduce the load ## What changes were proposed in this pull request? I checked the logs of https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.2-test-maven-hadoop-2.7/47/ and found it took several seconds to create Kafka internal topic `__consumer_offsets`. As Kafka creates this topic lazily, the topic creation happens in the first test `deserialization of initial offset with Spark 2.1.0` and causes it timeout. This PR changes `offsets.topic.num.partitions` from the default value 50 to 1 to make creating `__consumer_offsets` (50 partitions -> 1 partition) much faster. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #17863 from zsxwing/fix-kafka-flaky-test. (cherry picked from commit bd5788287957d8610a6d19c273b75bd4cdd2d166) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 05 May 2017, 18:08:32 UTC
f71aea6 [SPARK-20381][SQL] Add SQL metrics of numOutputRows for ObjectHashAggregateExec ## What changes were proposed in this pull request? ObjectHashAggregateExec is missing numOutputRows; this PR adds this metric for it. ## How was this patch tested? Added unit tests for the new metric. Author: Yucai <yucai.yu@intel.com> Closes #17678 from yucai/objectAgg_numOutputRows. (cherry picked from commit 41439fd52dd263b9f7d92e608f027f193f461777) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 05 May 2017, 16:52:10 UTC
1fa3c86 [SPARK-20613] Remove excess quotes in Windows executable ## What changes were proposed in this pull request? Quotes are already added to the RUNNER variable on line 54. There is no need to put quotes on line 67. If you do, you will get an error when launching Spark. '""C:\Program' is not recognized as an internal or external command, operable program or batch file. ## How was this patch tested? Tested manually on Windows 10. Author: Jarrett Meyer <jarrettmeyer@gmail.com> Closes #17861 from jarrettmeyer/fix-windows-cmd. (cherry picked from commit b9ad2d1916af5091c8585d06ccad8219e437e2bc) Signed-off-by: Felix Cheung <felixcheung@apache.org> 05 May 2017, 15:30:56 UTC
dbb54a7 [SPARK-20546][DEPLOY] spark-class gets syntax error in posix mode ## What changes were proposed in this pull request? Updated spark-class to turn off posix mode so the process substitution doesn't cause a syntax error. ## How was this patch tested? Existing unit tests, manual spark-shell testing with posix mode on Author: jyu00 <jessieyu@us.ibm.com> Closes #17852 from jyu00/master. (cherry picked from commit 5773ab121d5d7cbefeef17ff4ac6f8af36cc1251) Signed-off-by: Sean Owen <sowen@cloudera.com> 05 May 2017, 10:36:58 UTC
7cb566a [SPARK-19660][SQL] Replace the deprecated property name fs.default.name to fs.defaultFS that newly introduced ## What changes were proposed in this pull request? Replace the deprecated property name `fs.default.name` to `fs.defaultFS` that newly introduced. ## How was this patch tested? Existing tests Author: Yuming Wang <wgyumg@gmail.com> Closes #17856 from wangyum/SPARK-19660. (cherry picked from commit 37cdf077cd3f436f777562df311e3827b0727ce7) Signed-off-by: Sean Owen <sowen@cloudera.com> 05 May 2017, 10:32:07 UTC
c875628 [SPARK-20574][ML] Allow Bucketizer to handle non-Double numeric column ## What changes were proposed in this pull request? Bucketizer currently requires input column to be Double, but the logic should work on any numeric data types. Many practical problems have integer/float data types, and it could get very tedious to manually cast them into Double before calling bucketizer. This PR extends bucketizer to handle all numeric types. ## How was this patch tested? New test. Author: Wayne Zhang <actuaryzhang@uber.com> Closes #17840 from actuaryzhang/bucketizer. (cherry picked from commit 0d16faab90e4cd1f73c5b749dbda7bc2a400b26f) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 05 May 2017, 02:24:12 UTC
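A hedged usage sketch of bucketizing an integer column directly, which this change is meant to allow without a manual cast to Double (the column and bucket names are illustrative):

```scala
// Sketch: Bucketizer applied to an IntegerType column, relying on the
// non-Double numeric support described above.
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

object BucketizeInts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("bucketizer").getOrCreate()
    import spark.implicits._
    val df = Seq(1, 5, 20, 50).toDF("age") // IntegerType, no explicit cast
    val bucketizer = new Bucketizer()
      .setInputCol("age")
      .setOutputCol("ageBucket")
      .setSplits(Array(Double.NegativeInfinity, 18.0, 40.0, Double.PositiveInfinity))
    bucketizer.transform(df).show()
    spark.stop()
  }
}
```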
425ed26 [SPARK-20047][FOLLOWUP][ML] Constrained Logistic Regression follow up ## What changes were proposed in this pull request? Address some minor comments for #17715: * Put bound-constrained optimization params under expertParams. * Update some docs. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #17829 from yanboliang/spark-20047-followup. (cherry picked from commit c5dceb8c65545169bc96628140b5acdaa85dd226) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 04 May 2017, 09:57:08 UTC
b672779 [SPARK-20571][SPARKR][SS] Flaky Structured Streaming tests ## What changes were proposed in this pull request? Make tests more reliable by having them wait until the data is processed. Increasing the timeout value might help, but ultimately the flakiness from processing delay when Jenkins is busy is hard to account for. This isn't an actual supported public API. ## How was this patch tested? unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17857 from felixcheung/rsstestrelia. (cherry picked from commit 57b64703e66ec8490d8d9dbf6beebc160a61ec29) Signed-off-by: Felix Cheung <felixcheung@apache.org> 04 May 2017, 08:55:13 UTC
3f5c548 [SPARK-20585][SPARKR] R generic hint support Adds support for generic hints on `SparkDataFrame` Unit tests, `check-cran.sh` Author: zero323 <zero323@users.noreply.github.com> Closes #17851 from zero323/SPARK-20585. (cherry picked from commit 9c36aa27919fb7625e388f5c3c90af62ef902b24) Signed-off-by: Felix Cheung <felixcheung@apache.org> 04 May 2017, 08:42:55 UTC
6c5c594 [SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example Add - R vignettes - R programming guide - SS programming guide - R example Also disable spark.als in vignettes for now since it's failing (SPARK-20402) manually Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17814 from felixcheung/rdocss. (cherry picked from commit b8302ccd02265f9d7a7895c7b033441fa2d8ffd1) Signed-off-by: Felix Cheung <felixcheung@apache.org> 04 May 2017, 07:29:20 UTC
5fe9313 [SPARK-20544][SPARKR] skip tests when running on CRAN General rule on skip or not: skip if - RDD tests - tests could run long or complicated (streaming, hivecontext) - tests on error conditions - tests won't likely change/break unit tests, `R CMD check --as-cran`, `R CMD check` Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17817 from felixcheung/rskiptest. (cherry picked from commit fc472bddd1d9c6a28e57e31496c0166777af597e) Signed-off-by: Felix Cheung <felixcheung@apache.org> 04 May 2017, 04:51:33 UTC
d8bd213 [SPARK-20584][PYSPARK][SQL] Python generic hint support ## What changes were proposed in this pull request? Adds `hint` method to PySpark `DataFrame`. ## How was this patch tested? Unit tests, doctests. Author: zero323 <zero323@users.noreply.github.com> Closes #17850 from zero323/SPARK-20584. (cherry picked from commit 02bbe73118a39e2fb378aa2002449367a92f6d67) Signed-off-by: Reynold Xin <rxin@databricks.com> 04 May 2017, 02:15:42 UTC
a3a5fcf Preparing development version 2.2.1-SNAPSHOT 03 May 2017, 23:50:12 UTC
1d4017b Preparing Spark release v2.2.0-rc2 03 May 2017, 23:50:08 UTC
2629e7c [MINOR][SQL] Fix the test title from =!= to <=>, remove a duplicated test and add a test for =!= ## What changes were proposed in this pull request? This PR proposes three things as below: - This test looks not testing `<=>` and identical with the test above, `===`. So, it removes the test. ```diff - test("<=>") { - checkAnswer( - testData2.filter($"a" === 1), - testData2.collect().toSeq.filter(r => r.getInt(0) == 1)) - - checkAnswer( - testData2.filter($"a" === $"b"), - testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1))) - } ``` - Replace the test title from `=!=` to `<=>`. It looks the test actually testing `<=>`. ```diff + private lazy val nullData = Seq( + (Some(1), Some(1)), (Some(1), Some(2)), (Some(1), None), (None, None)).toDF("a", "b") + ... - test("=!=") { + test("<=>") { - val nullData = spark.createDataFrame(sparkContext.parallelize( - Row(1, 1) :: - Row(1, 2) :: - Row(1, null) :: - Row(null, null) :: Nil), - StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType)))) - checkAnswer( nullData.filter($"b" <=> 1), ... ``` - Add the tests for `=!=` which looks not existing. ```diff + test("=!=") { + checkAnswer( + nullData.filter($"b" =!= 1), + Row(1, 2) :: Nil) + + checkAnswer(nullData.filter($"b" =!= null), Nil) + + checkAnswer( + nullData.filter($"a" =!= $"b"), + Row(1, 2) :: Nil) + } ``` ## How was this patch tested? Manually running the tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #17842 from HyukjinKwon/minor-test-fix. (cherry picked from commit 13eb37c860c8f672d0e9d9065d0333f981db71e3) Signed-off-by: Reynold Xin <rxin@databricks.com> 03 May 2017, 20:08:31 UTC
36d8079 [SPARK-19965][SS] DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output ## The Problem Right now DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output: ``` [info] - partitioned writing and batch reading with 'basePath' *** FAILED *** (3 seconds, 928 milliseconds) [info] java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths: [info] ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637 [info] ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637/_spark_metadata [info] [info] If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them. [info] at scala.Predef$.assert(Predef.scala:170) [info] at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:133) [info] at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:98) [info] at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:156) [info] at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:54) [info] at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:55) [info] at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:133) [info] at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361) [info] at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:160) [info] at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:536) [info] at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:520) [info] at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply$mcV$sp(FileStreamSinkSuite.scala:292) [info] at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268) [info] at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268) ``` ## What changes were proposed in this pull request? This patch alters `InMemoryFileIndex` to filter out these `basePath`s whose ancestor is the streaming metadata dir (`_spark_metadata`). E.g., the following and other similar dir or files will be filtered out: - (introduced by globbing `basePath/*`) - `basePath/_spark_metadata` - (introduced by globbing `basePath/*/*`) - `basePath/_spark_metadata/0` - `basePath/_spark_metadata/1` - ... ## How was this patch tested? Added unit tests Author: Liwei Lin <lwlin7@gmail.com> Closes #17346 from lw-lin/filter-metadata. (cherry picked from commit 6b9e49d12fc4c9b29d497122daa4cc9bf4540b16) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 03 May 2017, 18:10:31 UTC
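A toy sketch of the filtering idea; the paths and the `MetadataPathFilter` helper are illustrative only, not the actual `InMemoryFileIndex` code.

```scala
// Illustrative filter: drop any candidate base path that is the streaming
// sink's _spark_metadata directory or lives underneath it.
object MetadataPathFilter {
  private val MetadataDir = "_spark_metadata"

  def isInsideMetadataDir(path: String): Boolean =
    path.endsWith("/" + MetadataDir) ||
      path.split("/").dropRight(1).contains(MetadataDir)

  def filterBasePaths(paths: Seq[String]): Seq[String] =
    paths.filterNot(isInsideMetadataDir)

  def main(args: Array[String]): Unit = {
    val globbed = Seq("/out/part=1", "/out/_spark_metadata", "/out/_spark_metadata/0")
    println(filterBasePaths(globbed)) // List(/out/part=1)
  }
}
```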
f0e80aa [SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame ## What changes were proposed in this pull request? We allow users to specify hints (currently only "broadcast" is supported) in SQL and DataFrame. However, while SQL has a standard hint format (/*+ ... */), DataFrame doesn't have one and sometimes users are confused that they can't find how to apply a broadcast hint. This ticket adds a generic hint function on DataFrame that allows using the same hint on DataFrames as well as SQL. As an example, after this patch, the following will apply a broadcast hint on a DataFrame using the new hint function: ``` df1.join(df2.hint("broadcast")) ``` ## How was this patch tested? Added a test case in DataFrameJoinSuite. Author: Reynold Xin <rxin@databricks.com> Closes #17839 from rxin/SPARK-20576. (cherry picked from commit 527fc5d0c990daaacad4740f62cfe6736609b77b) Signed-off-by: Reynold Xin <rxin@databricks.com> 03 May 2017, 16:22:41 UTC
b1a732f [SPARK-20441][SPARK-20432][SS] Within the same streaming query, one StreamingRelation should only be transformed to one StreamingExecutionRelation ## What changes were proposed in this pull request? Within the same streaming query, when one `StreamingRelation` is referred multiple times – e.g. `df.union(df)` – we should transform it only to one `StreamingExecutionRelation`, instead of two or more different `StreamingExecutionRelation`s (each of which would have a separate set of source, source logs, ...). ## How was this patch tested? Added two test cases, each of which would fail without this patch. Author: Liwei Lin <lwlin7@gmail.com> Closes #17735 from lw-lin/SPARK-20441. (cherry picked from commit 27f543b15f2f493f6f8373e46b4c9564b0a1bf81) Signed-off-by: Burak Yavuz <brkyvz@gmail.com> 03 May 2017, 15:55:17 UTC
b5947f5 [SPARK-20523][BUILD] Clean up build warnings for 2.2.0 release ## What changes were proposed in this pull request? Fix build warnings primarily related to Breeze 0.13 operator changes, Java style problems ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #17803 from srowen/SPARK-20523. (cherry picked from commit 16fab6b0ef3dcb33f92df30e17680922ad5fb672) Signed-off-by: Sean Owen <sowen@cloudera.com> 03 May 2017, 09:18:48 UTC
4f647ab [SPARK-6227][MLLIB][PYSPARK] Implement PySpark wrappers for SVD and PCA (v2) Add PCA and SVD to PySpark's wrappers for `RowMatrix` and `IndexedRowMatrix` (SVD only). Based on #7963, updated. ## How was this patch tested? New doc tests and unit tests. Ran all examples locally. Author: MechCoder <manojkumarsivaraj334@gmail.com> Author: Nick Pentreath <nickp@za.ibm.com> Closes #17621 from MLnick/SPARK-6227-pyspark-svd-pca. (cherry picked from commit db2fb84b4a3c45daa449cc9232340193ce8eb37d) Signed-off-by: Nick Pentreath <nickp@za.ibm.com> 03 May 2017, 08:58:20 UTC
c80242a [SPARK-20567] Lazily bind in GenerateExec It is not valid to eagerly bind with the child's output as this causes failures when we attempt to canonicalize the plan (replacing the attribute references with dummies). Author: Michael Armbrust <michael@databricks.com> Closes #17838 from marmbrus/fixBindExplode. (cherry picked from commit 6235132a8ce64bb12d825d0a65e5dd052d1ee647) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 03 May 2017, 05:44:35 UTC
c199764 [SPARK-20558][CORE] clear InheritableThreadLocal variables in SparkContext when stopping it ## What changes were proposed in this pull request? To better understand this problem, let's take a look at an example first: ``` object Main { def main(args: Array[String]): Unit = { var t = new Test new Thread(new Runnable { override def run() = {} }).start() println("first thread finished") t.a = null t = new Test new Thread(new Runnable { override def run() = {} }).start() } } class Test { var a = new InheritableThreadLocal[String] { override protected def childValue(parent: String): String = { println("parent value is: " + parent) parent } } a.set("hello") } ``` The result is: ``` parent value is: hello first thread finished parent value is: hello parent value is: hello ``` Once an `InheritableThreadLocal` has been set value, child threads will inherit its value as long as it has not been GCed, so setting the variable which holds the `InheritableThreadLocal` to `null` doesn't work as we expected. In `SparkContext`, we have an `InheritableThreadLocal` for local properties, we should clear it when stopping `SparkContext`, or all the future child threads will still inherit it and copy the properties and waste memory. This is the root cause of https://issues.apache.org/jira/browse/SPARK-20548 , which creates/stops `SparkContext` many times and finally have a lot of `InheritableThreadLocal` alive, and cause OOM when starting new threads in the internal thread pools. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #17833 from cloud-fan/core. (cherry picked from commit b946f3160eb7953fb30edf1f097ea87be75b33e7) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 May 2017, 02:09:04 UTC
871b073 [SPARK-20421][CORE] Add a missing deprecation tag. In the previous patch I deprecated StorageStatus, but not the method in SparkContext that exposes that class publicly. So deprecate the method too. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #17824 from vanzin/SPARK-20421. (cherry picked from commit ef3df9125a30f8fb817fe855b74d7130be45b0ee) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 02 May 2017, 21:30:15 UTC
4f4083b [SPARK-19235][SQL][TEST][FOLLOW-UP] Enable Test Cases in DDLSuite with Hive Metastore ### What changes were proposed in this pull request? This is a follow-up of enabling test cases in DDLSuite with Hive Metastore. It consists of the following remaining tasks: - Run all the `alter table` and `drop table` DDL tests against data source tables when using Hive metastore. - Do not run any `alter table` and `drop table` DDL test against Hive serde tables when using InMemoryCatalog. - Reenable `alter table: set serde partition` and `alter table: set serde` tests for Hive serde tables. ### How was this patch tested? N/A Author: Xiao Li <gatorsmile@gmail.com> Closes #17524 from gatorsmile/cleanupDDLSuite. (cherry picked from commit b1e639ab09d3a7a1545119e45a505c9a04308353) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 May 2017, 08:54:17 UTC
01f3be7 [SPARK-20300][ML][PYSPARK] Python API for ALSModel.recommendForAllUsers,Items Add Python API for `ALSModel` methods `recommendForAllUsers`, `recommendForAllItems` ## How was this patch tested? New doc tests. Author: Nick Pentreath <nickp@za.ibm.com> Closes #17622 from MLnick/SPARK-20300-pyspark-recall. (cherry picked from commit e300a5a145820ecd466885c73245d6684e8cb0aa) Signed-off-by: Nick Pentreath <nickp@za.ibm.com> 02 May 2017, 08:49:39 UTC
ef5e2a0 [SPARK-20549] java.io.CharConversionException: Invalid UTF-32' in JsonToStructs ## What changes were proposed in this pull request? A fix for the same problem was made in #17693 but ignored `JsonToStructs`. This PR uses the same fix for `JsonToStructs`. ## How was this patch tested? Regression test Author: Burak Yavuz <brkyvz@gmail.com> Closes #17826 from brkyvz/SPARK-20549. (cherry picked from commit 86174ea89b39a300caaba6baffac70f3dc702788) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 May 2017, 06:08:31 UTC
b146481 [SPARK-20537][CORE] Fixing OffHeapColumnVector reallocation ## What changes were proposed in this pull request? As #17773 revealed `OnHeapColumnVector` may copy a part of the original storage. `OffHeapColumnVector` reallocation also copies to the new storage data up to 'elementsAppended'. This variable is only updated when using the `ColumnVector.appendX` API, while `ColumnVector.putX` is more commonly used. This PR copies the new storage data up to the previously-allocated size in`OffHeapColumnVector`. ## How was this patch tested? Existing test suites Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #17811 from kiszk/SPARK-20537. (cherry picked from commit afb21bf22a59c9416c04637412fb69d1442e6826) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 May 2017, 05:57:08 UTC
b7c1c2f [SPARK-20192][SPARKR][DOC] SparkR migration guide to 2.2.0 ## What changes were proposed in this pull request? Updating R Programming Guide ## How was this patch tested? manually Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17816 from felixcheung/r22relnote. (cherry picked from commit d20a976e8918ca8d607af452301e8014fe14e64a) Signed-off-by: Felix Cheung <felixcheung@apache.org> 02 May 2017, 04:04:04 UTC
5a0a8b0 [SPARK-20459][SQL] JdbcUtils throws IllegalStateException: Cause already initialized after getting SQLException ## What changes were proposed in this pull request? Avoid failing to initCause on JDBC exception with cause initialized to null ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #17800 from srowen/SPARK-20459. (cherry picked from commit af726cd6117de05c6e3b9616b8699d884a53651b) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 02 May 2017, 00:01:13 UTC
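A sketch of the defensive pattern the fix implies (illustrative only, not the exact `JdbcUtils` change): guard `initCause` so an exception whose cause was already initialized does not turn a SQLException into an `IllegalStateException`.

```scala
// Only attach a cause when it is safe to do so; swallow the
// "Cause already initialized" case instead of failing the caller.
object SafeInitCause {
  def withCause(e: Throwable, cause: Throwable): Throwable = {
    if (e.getCause == null && (e ne cause)) {
      try {
        e.initCause(cause)
      } catch {
        case _: IllegalStateException => // cause was already (explicitly) set, even to null
      }
    }
    e
  }

  def main(args: Array[String]): Unit = {
    val sqlEx = new java.sql.SQLException("bad row")
    println(withCause(sqlEx, new RuntimeException("root")).getCause) // java.lang.RuntimeException: root
  }
}
```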
cfa6bcb [SPARK-20540][CORE] Fix unstable executor requests. There are two problems fixed in this commit. First, the ExecutorAllocationManager sets a timeout to avoid requesting executors too often. However, the timeout is always updated based on its value and a timeout, not the current time. If the call is delayed by locking for more than the ongoing scheduler timeout, the manager will request more executors on every run. This seems to be the main cause of SPARK-20540. The second problem is that the total number of requested executors is not tracked by the CoarseGrainedSchedulerBackend. Instead, it calculates the value based on the current status of 3 variables: the number of known executors, the number of executors that have been killed, and the number of pending executors. But, the number of pending executors is never less than 0, even though there may be more known than requested. When executors are killed and not replaced, this can cause the request sent to YARN to be incorrect because there were too many executors due to the scheduler's state being slightly out of date. This is fixed by tracking the currently requested size explicitly. ## How was this patch tested? Existing tests. Author: Ryan Blue <blue@apache.org> Closes #17813 from rdblue/SPARK-20540-fix-dynamic-allocation. (cherry picked from commit 2b2dd08e975dd7fbf261436aa877f1d7497ed31f) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 01 May 2017, 21:48:11 UTC
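The first problem described above can be shown with a toy contrast of the two deadline-update strategies (names and numbers are illustrative, not the ExecutorAllocationManager code):

```scala
// Advancing a deadline from its own previous value drifts behind real time
// when a run is delayed; basing it on "now" does not.
object DeadlineSketch {
  def main(args: Array[String]): Unit = {
    val intervalMs = 100L
    var buggyDeadline = 0L
    var fixedDeadline = 0L
    var now = 0L
    now += 500L                                // a run delayed 500 ms by lock contention
    buggyDeadline = buggyDeadline + intervalMs // 100: already in the past
    fixedDeadline = now + intervalMs           // 600: still in the future
    println(s"now=$now buggy=$buggyDeadline fixed=$fixedDeadline")
    println(s"buggy deadline expired immediately: ${buggyDeadline <= now}") // true
  }
}
```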
6f0d296 [SPARK-20464][SS] Add a job group and description for streaming queries and fix cancellation of running jobs using the job group ## What changes were proposed in this pull request? Job group: adding a job group is required to properly cancel running jobs related to a query. Description: the new description makes it easier to group the batches of a query by sorting by name in the Spark Jobs UI. ## How was this patch tested? - Unit tests - UI screenshot - Order by job id: ![screen shot 2017-04-27 at 5 10 09 pm](https://cloud.githubusercontent.com/assets/7865120/25509468/15452274-2b6e-11e7-87ba-d929816688cf.png) - Order by description: ![screen shot 2017-04-27 at 5 10 22 pm](https://cloud.githubusercontent.com/assets/7865120/25509474/1c298512-2b6e-11e7-99b8-fef1ef7665c1.png) - Order by job id (no query name): ![screen shot 2017-04-27 at 5 21 33 pm](https://cloud.githubusercontent.com/assets/7865120/25509482/28c96dc8-2b6e-11e7-8df0-9d3cdbb05e36.png) - Order by description (no query name): ![screen shot 2017-04-27 at 5 21 44 pm](https://cloud.githubusercontent.com/assets/7865120/25509489/37674742-2b6e-11e7-9357-b5c38ec16ac4.png) Author: Kunal Khamar <kkhamar@outlook.com> Closes #17765 from kunalkhamar/sc-6696. (cherry picked from commit 6fc6cf88d871f5b05b0ad1a504e0d6213cf9d331) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 01 May 2017, 18:37:44 UTC
38edb92 [SPARK-20517][UI] Fix broken history UI download link The download link in the history server UI is built as: ``` <td><a href="{{uiroot}}/api/v1/applications/{{id}}/{{num}}/logs" class="btn btn-info btn-mini">Download</a></td> ``` Here the `num` field represents the number of attempts, which does not match the REST API. In the REST API, if the attempt id does not exist the URL should be `api/v1/applications/<id>/logs`; otherwise it should be `api/v1/applications/<id>/<attemptId>/logs`. Using `<num>` to represent `<attemptId>` leads to the "no such app" issue. Manual verification. CC ajbozarth can you please review this change, since you added this feature before? Thanks! Author: jerryshao <sshao@hortonworks.com> Closes #17795 from jerryshao/SPARK-20517. (cherry picked from commit ab30590f448d05fc1864c54a59b6815bdeef8fc7) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 01 May 2017, 17:26:21 UTC
813abd2 [SPARK-20534][SQL] Make outer generate exec return empty rows ## What changes were proposed in this pull request? Generate exec does not produce `null` values if the generator for the input row is empty and the Generate exec operates in outer mode without join. This is caused by the fact that the `join=false` code path is different from the `join=true` code path, and that the `join=false` code path did not deal with outer properly. This PR addresses this issue. ## How was this patch tested? Updated `outer*` tests in `GeneratorFunctionSuite`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #17810 from hvanhovell/SPARK-20534. (cherry picked from commit 6b44c4d63ab14162e338c5f1ac77333956870a90) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 01 May 2017, 16:46:44 UTC
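A hedged spark-shell illustration of the behavior being fixed (assumes `spark.implicits._` is in scope; the data is made up):
```scala
import org.apache.spark.sql.functions.explode_outer

val df = Seq(Seq.empty[Int], Seq(1, 2)).toDF("xs")
// With this fix, the row with the empty array is kept and the exploded
// column is null, instead of the row being dropped.
df.select($"xs", explode_outer($"xs")).show()
```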
c890e93 [SPARK-20541][SPARKR][SS] support awaitTermination without timeout ## What changes were proposed in this pull request? Add a variant without the timeout param - this is needed to submit a job that runs until stopped. Needed for 2.2. ## How was this patch tested? manually, unit test Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17815 from felixcheung/rssawaitinfinite. (cherry picked from commit a355b667a3718d9c5d48a0781e836bf5418ab842) Signed-off-by: Felix Cheung <felixcheung@apache.org> 01 May 2017, 06:24:03 UTC
994d9da [MINOR][DOCS][PYTHON] Adding missing boolean type for replacement value in fillna ## What changes were proposed in this pull request? Currently the PySpark DataFrame.fillna API supports boolean type when we pass a dict, but this is missing from the documentation. ## How was this patch tested? >>> spark.createDataFrame([Row(a=True),Row(a=None)]).fillna({"a" : True}).show() +----+ | a| +----+ |true| |true| +----+ Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Srinivasa Reddy Vundela <vsr@cloudera.com> Closes #17688 from vundela/fillna_doc_fix. (cherry picked from commit 6613046c8c2daaf46a8ec13dd0a016aad22af1a4) Signed-off-by: Felix Cheung <felixcheung@apache.org> 01 May 2017, 04:45:16 UTC
c5beabc [SPARK-20492][SQL] Do not print empty parentheses for invalid primitive types in parser ## What changes were proposed in this pull request? Currently, when the type string is invalid, the error message prints empty parentheses. This PR proposes a small improvement to the error message by removing them in the parser, as below: ```scala spark.range(1).select($"col".cast("aa")) ``` **Before** ``` org.apache.spark.sql.catalyst.parser.ParseException: DataType aa() is not supported.(line 1, pos 0) == SQL == aa ^^^ ``` **After** ``` org.apache.spark.sql.catalyst.parser.ParseException: DataType aa is not supported.(line 1, pos 0) == SQL == aa ^^^ ``` ## How was this patch tested? Unit tests in `DataTypeParserSuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #17784 from HyukjinKwon/SPARK-20492. (cherry picked from commit 1ee494d0868a85af3154996732817ed63679f382) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 30 April 2017, 15:24:24 UTC
c5f5593 [SPARK-20521][DOC][CORE] The default of 'spark.worker.cleanup.appDataTtl' should be 604800 in spark-standalone.md ## What changes were proposed in this pull request? Currently, our project needs the worker directory cleanup cycle to be three days. Following http://spark.apache.org/docs/latest/spark-standalone.html, I configured the 'spark.worker.cleanup.appDataTtl' parameter to 3 * 24 * 3600. When I started the Spark service, startup failed and the worker log showed the following error: 2017-04-28 15:02:03,306 INFO Utils: Successfully started service 'sparkWorker' on port 48728. Exception in thread "main" java.lang.NumberFormatException: For input string: "3 * 24 * 3600" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:430) at java.lang.Long.parseLong(Long.java:483) at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276) at scala.collection.immutable.StringOps.toLong(StringOps.scala:29) at org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380) at org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380) at scala.Option.map(Option.scala:146) at org.apache.spark.SparkConf.getLong(SparkConf.scala:380) at org.apache.spark.deploy.worker.Worker.<init>(Worker.scala:100) at org.apache.spark.deploy.worker.Worker$.startRpcEnvAndEndpoint(Worker.scala:730) at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:709) at org.apache.spark.deploy.worker.Worker.main(Worker.scala) **Because the documented value 7 * 24 * 3600 is taken as a string and force-converted to a Long, the program fails.** **So the documented default of this configuration should be the concrete long value 604800 rather than the expression 7 * 24 * 3600, which misleads users into similar configurations and causes Spark startup to fail.** ## How was this patch tested? manual tests Please review http://spark.apache.org/contributing.html before opening a pull request. Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn> Author: guoxiaolong <guo.xiaolong1@zte.com.cn> Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn> Closes #17798 from guoxiaolongzte/SPARK-20521. (cherry picked from commit 4d99b95ad0d0c7ef909c8e492ec45e94cf0189b4) Signed-off-by: Sean Owen <sowen@cloudera.com> 30 April 2017, 08:06:34 UTC
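A hedged Scala sketch of the point being made (values are illustrative): the TTL has to reach the config as a plain number of seconds, so any arithmetic must be evaluated before it becomes a string.
```scala
import org.apache.spark.SparkConf

// In code the arithmetic can be evaluated first; in spark-defaults.conf or a
// --conf flag the literal value (e.g. 259200 for 3 days, 604800 for 7 days)
// must be written out, since "3 * 24 * 3600" is parsed as-is and fails.
val conf = new SparkConf()
  .set("spark.worker.cleanup.enabled", "true")
  .set("spark.worker.cleanup.appDataTtl", (3 * 24 * 3600).toString)  // "259200"
```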
9789d5c [SPARK-19791][ML] Add doc and example for fpgrowth ## What changes were proposed in this pull request? Add a new section for fpm. Add examples for FPGrowth in Scala and Java. Updated: rewrote transform to be more compact. ## How was this patch tested? local doc generation. Author: Yuhao Yang <yuhao.yang@intel.com> Closes #17130 from hhbyyh/fpmdoc. (cherry picked from commit add9d1bba5cf33218a115428a03d3c76a514aa86) Signed-off-by: Felix Cheung <felixcheung@apache.org> 29 April 2017, 17:52:01 UTC
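A hedged sketch of the `ml.fpm.FPGrowth` usage the new docs section covers (spark-shell style, with `spark.implicits._` in scope; the transactions and thresholds are made up):
```scala
import org.apache.spark.ml.fpm.FPGrowth

val dataset = Seq("1 2 5", "1 2 3 5", "1 2")
  .map(_.split(" "))
  .toDF("items")                      // one row per transaction

val model = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.5)
  .setMinConfidence(0.6)
  .fit(dataset)

model.freqItemsets.show()             // frequent itemsets
model.associationRules.show()         // antecedent => consequent rules
model.transform(dataset).show()       // rule consequents predicted per transaction
```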
4a86d8d [SPARK-20477][SPARKR][DOC] Document R bisecting k-means in R programming guide ## What changes were proposed in this pull request? Add hyper link in the SparkR programming guide. ## How was this patch tested? Build doc and manually check the doc link. Author: wangmiao1981 <wm624@hotmail.com> Closes #17805 from wangmiao1981/doc. (cherry picked from commit b28c3bc2020a6936e4ac4c28d49fd832a952af42) Signed-off-by: Felix Cheung <felixcheung@apache.org> 29 April 2017, 17:31:14 UTC
ca6c59e [SPARK-20487][SQL] Display `serde` for `HiveTableScan` node in explained plan ## What changes were proposed in this pull request? This was a suggestion by rxin at https://github.com/apache/spark/pull/17780#issuecomment-298073408 ## How was this patch tested? - modified existing unit test - manual testing: ``` scala> hc.sql(" SELECT * FROM tejasp_bucketed_partitioned_1 where name = '' ").explain(true) == Parsed Logical Plan == 'Project [*] +- 'Filter ('name = ) +- 'UnresolvedRelation `tejasp_bucketed_partitioned_1` == Analyzed Logical Plan == user_id: bigint, name: string, ds: string Project [user_id#24L, name#25, ds#26] +- Filter (name#25 = ) +- SubqueryAlias tejasp_bucketed_partitioned_1 +- CatalogRelation `default`.`tejasp_bucketed_partitioned_1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [user_id#24L, name#25], [ds#26] == Optimized Logical Plan == Filter (isnotnull(name#25) && (name#25 = )) +- CatalogRelation `default`.`tejasp_bucketed_partitioned_1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [user_id#24L, name#25], [ds#26] == Physical Plan == *Filter (isnotnull(name#25) && (name#25 = )) +- HiveTableScan [user_id#24L, name#25, ds#26], CatalogRelation `default`.`tejasp_bucketed_partitioned_1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [user_id#24L, name#25], [ds#26] ``` Author: Tejas Patil <tejasp@fb.com> Closes #17806 from tejasapatil/add_serde. (cherry picked from commit 814a61a867ded965433c944c90961df529ac83ab) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 29 April 2017, 06:12:34 UTC
1405862 [SPARK-19525][CORE] Add RDD checkpoint compression support ## What changes were proposed in this pull request? This PR adds RDD checkpoint compression support and add a new config `spark.checkpoint.compress` to enable/disable it. Credit goes to aramesh117 Closes #17024 ## How was this patch tested? The new unit test. Author: Shixiong Zhu <shixiong@databricks.com> Author: Aaditya Ramesh <aramesh@conviva.com> Closes #17789 from zsxwing/pr17024. (cherry picked from commit 77bcd77ed5fbd91fe61849cca76a8dffe5e4d6b2) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 28 April 2017, 22:29:34 UTC
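A hedged sketch of turning the new flag on (the checkpoint path and job are made up):
```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("checkpoint-compression-demo")
  .set("spark.checkpoint.compress", "true")   // new flag introduced by this change; off by default
val sc = new SparkContext(conf)

sc.setCheckpointDir("/tmp/checkpoints")        // made-up path
val rdd = sc.parallelize(1 to 1000000).map(i => (i % 100, i))
rdd.checkpoint()                               // checkpoint data is now written compressed
rdd.count()                                    // action materializes the checkpoint
```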
5547002 [SPARK-20471] Remove AggregateBenchmark testsuite warning: Two level hashmap is disabled but vectorized hashmap is enabled What changes were proposed in this pull request? Remove the AggregateBenchmark test suite warning, such as '14:26:33.220 WARN org.apache.spark.sql.execution.aggregate.HashAggregateExec: Two level hashmap is disabled but vectorized hashmap is enabled.' How was this patch tested? unit tests: AggregateBenchmark. Changed the 'ignore' function to the 'test' function. Author: caoxuewen <cao.xuewen@zte.com.cn> Closes #17771 from heary-cao/AggregateBenchmark. (cherry picked from commit ebff519c5ead31536e17a5b16cc47c2bf380d55e) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 28 April 2017, 21:47:33 UTC
f66aabd [SPARK-20514][CORE] Upgrade Jetty to 9.3.11.v20160721 Upgrade Jetty so it can work with Hadoop 3 (alpha 2 release, in particular). Without this change, because of an incompatibility between Jetty versions, Spark fails to compile when built against Hadoop 3. ## How was this patch tested? Unit tests being run. Author: Mark Grover <mark@apache.org> Closes #17790 from markgrover/spark-20514. (cherry picked from commit 5d71f3db83138bf50749dcd425ef7365c34bd799) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 28 April 2017, 21:07:07 UTC
ec712d7 [SPARK-20496][SS] Bug in KafkaWriter Looks at Unanalyzed Plans ## What changes were proposed in this pull request? We didn't enforce analyzed plans in Spark 2.1 when writing out to Kafka. ## How was this patch tested? New unit test. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Bill Chambers <bill@databricks.com> Closes #17804 from anabranch/SPARK-20496-2. (cherry picked from commit 733b81b835f952ab96723c749461d6afc0c71974) Signed-off-by: Burak Yavuz <brkyvz@gmail.com> 28 April 2017, 17:18:50 UTC
ea5b114 [SPARK-20465][CORE] Throws a proper exception when any temp directory could not be got ## What changes were proposed in this pull request? This PR proposes to throw an exception with better message rather than `ArrayIndexOutOfBoundsException` when temp directories could not be created. Running the commands below: ```bash ./bin/spark-shell --conf spark.local.dir=/NONEXISTENT_DIR_ONE,/NONEXISTENT_DIR_TWO ``` produces ... **Before** ``` Exception in thread "main" java.lang.ExceptionInInitializerError ... Caused by: java.lang.ArrayIndexOutOfBoundsException: 0 ... ``` **After** ``` Exception in thread "main" java.lang.ExceptionInInitializerError ... Caused by: java.io.IOException: Failed to get a temp directory under [/NONEXISTENT_DIR_ONE,/NONEXISTENT_DIR_TWO]. ... ``` ## How was this patch tested? Unit tests in `LocalDirsSuite.scala`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #17768 from HyukjinKwon/throws-temp-dir-exception. (cherry picked from commit 8c911adac56a1b1d95bc19915e0070ce7305257c) Signed-off-by: Sean Owen <sowen@cloudera.com> 28 April 2017, 07:54:15 UTC
af3a141 [SPARK-14471][SQL] Aliases in SELECT could be used in GROUP BY ## What changes were proposed in this pull request? This pr added a new rule in `Analyzer` to resolve aliases in `GROUP BY`. The current master throws an exception if `GROUP BY` clauses have aliases in `SELECT`; ``` scala> spark.sql("select a a1, a1 + 1 as b, count(1) from t group by a1") org.apache.spark.sql.AnalysisException: cannot resolve '`a1`' given input columns: [a]; line 1 pos 51; 'Aggregate ['a1], [a#83L AS a1#87L, ('a1 + 1) AS b#88, count(1) AS count(1)#90L] +- SubqueryAlias t +- Project [id#80L AS a#83L] +- Range (0, 10, step=1, splits=Some(8)) at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) ``` ## How was this patch tested? Added tests in `SQLQuerySuite` and `SQLQueryTestSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #17191 from maropu/SPARK-14471. (cherry picked from commit 59e3a564448777657125b6f65057ed20d0162d13) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 28 April 2017, 06:42:05 UTC
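A hedged sketch of the query shape that now resolves (mirrors the example above; the temp view is made up):
```scala
spark.range(10).selectExpr("id AS a").createOrReplaceTempView("t")
// The SELECT alias a1 can now be referenced in GROUP BY instead of failing to resolve.
spark.sql("SELECT a AS a1, a1 + 1 AS b, count(1) FROM t GROUP BY a1").show()
```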
26a9e29 [SPARK-20476][SQL] Block users to create a table that use commas in the column names ### What changes were proposed in this pull request? ```SQL hive> create table t1(`a,` string); OK Time taken: 1.399 seconds hive> create table t2(`a,` string, b string); FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 3 elements while columns.types has 2 elements!) hive> create table t2(`a,` string, b string) stored as parquet; FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.IllegalArgumentException: ParquetHiveSerde initialization failed. Number of column name and column type differs. columnNames = [a, , b], columnTypes = [string, string] ``` The Hive metastore has a bug here. When users do not provide an alias name in the SELECT query, we call `toPrettySQL` to generate the alias name. For example, the string `get_json_object(jstring, '$.f1')` will be the alias name for the function call in the statement ```SQL SELECT key, get_json_object(jstring, '$.f1') FROM tempView ``` The above is not an issue for SELECT query statements. However, for CTAS, we hit the issue due to the bug in the Hive metastore: it does not accept column names containing commas and returns a confusing error message, like: ``` 17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements! org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements! ``` Thus, this PR blocks users from creating a table in the Hive metastore when the table has a column whose name contains a comma. ### How was this patch tested? Added a test case Author: Xiao Li <gatorsmile@gmail.com> Closes #17781 from gatorsmile/blockIllegalColumnNames. (cherry picked from commit e3c816043389e227db5e7a328c7c554209b4f394) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 28 April 2017, 06:17:05 UTC
f60ed0c [SPARKR][DOC] Document LinearSVC in R programming guide ## What changes were proposed in this pull request? add link to svmLinear in the SparkR programming document. ## How was this patch tested? Build doc manually and click the link to the document. It looks good. Author: wangmiao1981 <wm624@hotmail.com> Closes #17797 from wangmiao1981/doc. (cherry picked from commit 7fe8249793bd3eed4fa67cb4a210264a80520786) Signed-off-by: Felix Cheung <felixcheung@apache.org> 28 April 2017, 05:30:00 UTC
e02b6eb [SPARK-12837][CORE] Do not send the name of internal accumulator to executor side ## What changes were proposed in this pull request? When sending accumulator updates back to the driver, the network overhead is pretty big as there are a lot of accumulators, e.g. `TaskMetrics` will send about 20 accumulators every time, and there may be a lot of `SQLMetric`s if the query plan is complicated. Therefore, it's critical to reduce the size of serialized accumulators. A simple way is to not send the names of internal accumulators to the executor side, as they are unnecessary. When the executor sends accumulator updates back to the driver, we can look up the accumulator name in `AccumulatorContext` easily. Note that we still need to send the names of normal accumulators, as user code run on the executor side may rely on accumulator names. In the future, we should reimplement `TaskMetrics` to not rely on accumulators and use custom serialization. Tried on the example in https://issues.apache.org/jira/browse/SPARK-12837; the size of the serialized accumulators has been cut down by about 40%. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #17596 from cloud-fan/oom. (cherry picked from commit b90bf520fd7b979a90d1377cfc2ee7f0bf82c705) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 28 April 2017, 02:38:25 UTC
3d53d82 [SPARK-20452][SS][KAFKA] Fix a potential ConcurrentModificationException for batch Kafka DataFrame ## What changes were proposed in this pull request? If a batch Kafka query is cancelled but one of its tasks cannot be cancelled, rerunning the same DataFrame may cause a ConcurrentModificationException because two tasks may end up sharing the same group id. This PR always creates a new consumer when `reuseKafkaConsumer = false` to avoid the ConcurrentModificationException. It also contains other minor fixes. ## How was this patch tested? Jenkins. Author: Shixiong Zhu <shixiong@databricks.com> Closes #17752 from zsxwing/kafka-fix. (cherry picked from commit 823baca2cb8edb62885af547d3511c9e8923cefd) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 27 April 2017, 20:58:54 UTC
753e129 [SPARK-20461][CORE][SS] Use UninterruptibleThread for Executor and fix the potential hang in CachedKafkaConsumer ## What changes were proposed in this pull request? This PR changes Executor's threads to `UninterruptibleThread` so that we can use `runUninterruptibly` in `CachedKafkaConsumer`. However, this is just a best effort to avoid hanging forever. If the user uses `CachedKafkaConsumer` in another thread (e.g., a new thread or a Future), the potential hang may still happen. ## How was this patch tested? The newly added test. Author: Shixiong Zhu <shixiong@databricks.com> Closes #17761 from zsxwing/int. (cherry picked from commit 01c999e7f94d5e6c2fce67304dc62351dfbdf963) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 27 April 2017, 20:55:19 UTC
4512e2a [SPARK-20047][ML] Constrained Logistic Regression ## What changes were proposed in this pull request? MLlib ```LogisticRegression``` should support bound constrained optimization (only for L2 regularization). Users can add bound constraints to coefficients to make the solver produce solution in the specified range. Under the hood, we call Breeze [```L-BFGS-B```](https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/LBFGSB.scala) as the solver for bound constrained optimization. But in the current breeze implementation, there are some bugs in L-BFGS-B, and https://github.com/scalanlp/breeze/pull/633 fixed them. We need to upgrade dependent breeze later, and currently we use the workaround L-BFGS-B in this PR temporary for reviewing. ## How was this patch tested? Unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #17715 from yanboliang/spark-20047. (cherry picked from commit 606432a13ad22d862c7cb5028ad6fe73c9985423) Signed-off-by: DB Tsai <dbtsai@dbtsai.com> 27 April 2017, 20:49:00 UTC
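A hedged sketch of the new bound-constrained fit (binomial case with two features; the bounds and training DataFrame are made up, and the setter names follow the 2.2 API):
```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Matrices, Vectors}

// Coefficient bounds form a (numCoefficientSets x numFeatures) matrix -- here 1 x 2
// for a binomial model -- and intercept bounds are a vector of matching length.
val lr = new LogisticRegression()
  .setElasticNetParam(0.0)   // bound constraints are only supported with L2 regularization
  .setLowerBoundsOnCoefficients(Matrices.dense(1, 2, Array(0.0, 0.0)))
  .setUpperBoundsOnCoefficients(Matrices.dense(1, 2, Array(1.0, 1.0)))
  .setLowerBoundsOnIntercepts(Vectors.dense(-1.0))
  .setUpperBoundsOnIntercepts(Vectors.dense(1.0))

// val model = lr.fit(training)   // training: DataFrame with "label" and "features" columns
```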
c29c6de [SPARK-20483][MINOR] Test for Mesos Coarse mode may starve other Mesos frameworks ## What changes were proposed in this pull request? Add test case for scenarios where executor.cores is set as a (non)divisor of spark.cores.max This tests the change in #17786 ## How was this patch tested? Ran the existing test suite with the new tests dbtsai Author: Davis Shepherd <dshepherd@netflix.com> Closes #17788 from dgshep/add_mesos_test. (cherry picked from commit 039e32ca19d113e3be2c09171c7c921698be7ab8) Signed-off-by: DB Tsai <dbtsai@dbtsai.com> 27 April 2017, 20:26:08 UTC
c69d862 [SPARK-20426] Lazy initialization of FileSegmentManagedBuffer for shuffle service. ## What changes were proposed in this pull request? When an application contains a large number of shuffle blocks, the NodeManager requires lots of memory to keep the metadata (`FileSegmentManagedBuffer`) in `StreamManager`. When the number of shuffle blocks is big enough, the NodeManager can run out of memory. This PR proposes lazy initialization of `FileSegmentManagedBuffer` in the shuffle service. ## How was this patch tested? Manual test. Author: jinxing <jinxing6042@126.com> Closes #17744 from jinxing64/SPARK-20426. (cherry picked from commit 85c6ce61930490e2247fb4b0e22dfebbb8b6a1ee) Signed-off-by: Tom Graves <tgraves@yahoo-inc.com> 27 April 2017, 19:21:28 UTC
92b61f0 [SPARK-20487][SQL] `HiveTableScan` node is quite verbose in explained plan ## What changes were proposed in this pull request? Changed `TreeNode.argString` to handle `CatalogTable` separately (otherwise it would call the default `toString` on the `CatalogTable`) ## How was this patch tested? - Expanded scope of existing unit test to ensure that verbose information is not present - Manual testing Before ``` scala> hc.sql(" SELECT * FROM my_table WHERE name = 'foo' ").explain(true) == Parsed Logical Plan == 'Project [*] +- 'Filter ('name = foo) +- 'UnresolvedRelation `my_table` == Analyzed Logical Plan == user_id: bigint, name: string, ds: string Project [user_id#13L, name#14, ds#15] +- Filter (name#14 = foo) +- SubqueryAlias my_table +- CatalogRelation CatalogTable( Database: default Table: my_table Owner: tejasp Created: Fri Apr 14 17:05:50 PDT 2017 Last Access: Wed Dec 31 16:00:00 PST 1969 Type: MANAGED Provider: hive Properties: [serialization.format=1] Statistics: 9223372036854775807 bytes Location: file:/tmp/warehouse/my_table Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Partition Provider: Catalog Partition Columns: [`ds`] Schema: root -- user_id: long (nullable = true) -- name: string (nullable = true) -- ds: string (nullable = true) ), [user_id#13L, name#14], [ds#15] == Optimized Logical Plan == Filter (isnotnull(name#14) && (name#14 = foo)) +- CatalogRelation CatalogTable( Database: default Table: my_table Owner: tejasp Created: Fri Apr 14 17:05:50 PDT 2017 Last Access: Wed Dec 31 16:00:00 PST 1969 Type: MANAGED Provider: hive Properties: [serialization.format=1] Statistics: 9223372036854775807 bytes Location: file:/tmp/warehouse/my_table Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Partition Provider: Catalog Partition Columns: [`ds`] Schema: root -- user_id: long (nullable = true) -- name: string (nullable = true) -- ds: string (nullable = true) ), [user_id#13L, name#14], [ds#15] == Physical Plan == *Filter (isnotnull(name#14) && (name#14 = foo)) +- HiveTableScan [user_id#13L, name#14, ds#15], CatalogRelation CatalogTable( Database: default Table: my_table Owner: tejasp Created: Fri Apr 14 17:05:50 PDT 2017 Last Access: Wed Dec 31 16:00:00 PST 1969 Type: MANAGED Provider: hive Properties: [serialization.format=1] Statistics: 9223372036854775807 bytes Location: file:/tmp/warehouse/my_table Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Partition Provider: Catalog Partition Columns: [`ds`] Schema: root -- user_id: long (nullable = true) -- name: string (nullable = true) -- ds: string (nullable = true) ), [user_id#13L, name#14], [ds#15] ``` After ``` scala> hc.sql(" SELECT * FROM my_table WHERE name = 'foo' ").explain(true) == Parsed Logical Plan == 'Project [*] +- 'Filter ('name = foo) +- 'UnresolvedRelation `my_table` == Analyzed Logical Plan == user_id: bigint, name: string, ds: string Project [user_id#13L, name#14, ds#15] +- Filter (name#14 = foo) +- SubqueryAlias my_table +- CatalogRelation `default`.`my_table`, [user_id#13L, name#14], [ds#15] == Optimized Logical Plan == Filter (isnotnull(name#14) && (name#14 = foo)) +- CatalogRelation 
`default`.`my_table`, [user_id#13L, name#14], [ds#15] == Physical Plan == *Filter (isnotnull(name#14) && (name#14 = foo)) +- HiveTableScan [user_id#13L, name#14, ds#15], CatalogRelation `default`.`my_table`, [user_id#13L, name#14], [ds#15] ``` Author: Tejas Patil <tejasp@fb.com> Closes #17780 from tejasapatil/SPARK-20487_verbose_plan. (cherry picked from commit a4aa4665a6775b514b714c88b70576090d2b4a7e) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 27 April 2017, 19:13:31 UTC
090b337 [SPARK-20482][SQL] Resolving Casts is too strict on having time zone set ## What changes were proposed in this pull request? Relax the requirement that a `TimeZoneAwareExpression` has to have its `timeZoneId` set to be considered resolved. With this change, a `Cast` (which is a `TimeZoneAwareExpression`) can be considered resolved if the `(fromType, toType)` combination doesn't require time zone information. Also de-relaxed test cases in `CastSuite` so Casts in that test suite don't get a default `timeZoneId = Option("GMT")`. ## How was this patch tested? Ran the de-relaxed `CastSuite` and it's passing. Also ran the SQL unit tests and they're passing too. Author: Kris Mok <kris.mok@databricks.com> Closes #17777 from rednaxelafx/fix-catalyst-cast-timezone. (cherry picked from commit 26ac2ce05cbaf8f152347219403e31491e9c9bf1) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 27 April 2017, 19:08:41 UTC
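A hedged illustration against Catalyst internals (internal API, shown only to make the relaxed resolution rule concrete):
```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.StringType

// int -> string needs no time zone, so with this change the Cast is considered
// resolved even though no timeZoneId has been set on it.
val c = Cast(Literal(1), StringType)
assert(c.resolved)
```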
87d27e5 [SPARK-20421][CORE] Mark internal listeners as deprecated. These listeners weren't really meant for external consumption, but they're public and marked with DeveloperApi. Adding the deprecated tag warns people that they may soon go away (as they will as part of the work for SPARK-18085). Note that not all types made public by https://github.com/apache/spark/pull/648 are being deprecated. Some remaining types are still exposed through the SparkListener API. Also note the text for StorageStatus is a tiny bit different, since I'm not so sure I'll be able to remove it. But the effect for the users should be the same (they should stop trying to use it). Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #17766 from vanzin/SPARK-20421. (cherry picked from commit 561e9cc390b429e4252f59f00a7ca4f6f8c853f8) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 27 April 2017, 18:31:11 UTC
c86c078 [SPARK-20483] Mesos Coarse mode may starve other Mesos frameworks ## What changes were proposed in this pull request? Set maxCores to be a multiple of the smallest executor that can be launched. This ensures that we correctly detect the condition where no more executors will be launched when spark.cores.max is not a multiple of spark.executor.cores ## How was this patch tested? This was manually tested with other sample frameworks measuring their incoming offers to determine if starvation would occur. dbtsai mgummelt Author: Davis Shepherd <dshepherd@netflix.com> Closes #17786 from dgshep/fix_mesos_max_cores. (cherry picked from commit 7633933e54ffb08ab9d959be5f76c26fae29d1d9) Signed-off-by: DB Tsai <dbtsai@dbtsai.com> 27 April 2017, 18:06:31 UTC
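A hedged arithmetic illustration of the rounding described above (the numbers are made up):
```scala
// With spark.cores.max = 10 and spark.executor.cores = 4, only 8 cores can ever
// be launched; rounding the effective max down to the nearest multiple of the
// executor size lets the scheduler see that no further executors are coming
// and stop holding on to offers.
val coresMax = 10
val executorCores = 4
val effectiveMax = (coresMax / executorCores) * executorCores   // 8
```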
75544c0 Preparing development version 2.2.0-SNAPSHOT 27 April 2017, 00:32:23 UTC
8ccb4a5 Preparing Spark release v2.2.0-rc1 27 April 2017, 00:32:19 UTC
d6efda5 [SPARK-20435][CORE] More thorough redaction of sensitive information This change does a more thorough redaction of sensitive information from logs and the UI, and adds unit tests that ensure no regressions leak sensitive information to the logs. The motivation for this change was the appearance of a password like the following in `SparkListenerEnvironmentUpdate` in event logs under some JVM configurations: `"sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ..." ` Previously the redaction logic only checked whether the key matched the secret regex pattern; if it did, it redacted the key's value. That worked for most cases. However, in the above case, the key (sun.java.command) doesn't tell much, so the value needs to be searched as well. This PR expands the check to cover values too. ## How was this patch tested? New unit tests added that ensure that no sensitive information is present in the event logs or the yarn logs. An old unit test in UtilsSuite was modified because it asserted that a non-sensitive property's value would not be redacted; however, the non-sensitive value had the literal "secret" in it, which caused it to be redacted. Simply updating the non-sensitive property's value to another arbitrary value (that didn't have "secret" in it) fixed it. Author: Mark Grover <mark@apache.org> Closes #17725 from markgrover/spark-20435. (cherry picked from commit 66636ef0b046e5d1f340c3b8153d7213fa9d19c7) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 27 April 2017, 00:06:30 UTC
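A hedged sketch of the broadened check (the regex mirrors the default `spark.redaction.regex`; the helper itself is illustrative, not the real `Utils.redact`):
```scala
val sensitive = "(?i)secret|password".r

// Redact when either the key or the value matches, so entries like
// sun.java.command that embed a password in the value are also caught.
def redact(kvs: Seq[(String, String)]): Seq[(String, String)] = kvs.map {
  case (k, v) if sensitive.findFirstIn(k).isDefined || sensitive.findFirstIn(v).isDefined =>
    (k, "*********(redacted)")
  case kv => kv
}
```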
b48bb3a [SPARK-12868][SQL] Allow adding jars from hdfs ## What changes were proposed in this pull request? Spark 2.2 is going to be cut; it would be great if SPARK-12868 can be resolved before that. There have been several PRs for this, like [PR#16324](https://github.com/apache/spark/pull/16324), but all of them have been inactive for a long time or have been closed. This PR adds a SparkUrlStreamHandlerFactory, which relies on the 'protocol' to choose the appropriate URLStreamHandlerFactory, such as FsUrlStreamHandlerFactory, to create the URLStreamHandler. ## How was this patch tested? 1. Added a new unit test. 2. Checked manually. Before: throws an exception with "failed unknown protocol: hdfs" <img width="914" alt="screen shot 2017-03-17 at 9 07 36 pm" src="https://cloud.githubusercontent.com/assets/8546874/24075277/5abe0a7c-0bd5-11e7-900e-ec3d3105da0b.png"> After: <img width="1148" alt="screen shot 2017-03-18 at 11 42 18 am" src="https://cloud.githubusercontent.com/assets/8546874/24075283/69382a60-0bd5-11e7-8d30-d9405c3aaaba.png"> Author: Weiqing Yang <yangweiqing001@gmail.com> Closes #17342 from weiqingy/SPARK-18910. (cherry picked from commit 2ba1eba371213d1ac3d1fa1552e5906e043c2ee4) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 26 April 2017, 20:54:49 UTC
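A hedged example of the kind of call this enables (the jar path is made up):
```scala
// Before this change, resolving an hdfs:// URL through java.net.URL could fail
// with "unknown protocol: hdfs"; with the handler factory registered it works.
spark.sql("ADD JAR hdfs:///user/demo/libs/my-udfs.jar")
spark.sparkContext.addJar("hdfs:///user/demo/libs/my-udfs.jar")
```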