https://github.com/apache/spark

584354e Preparing Spark release v2.0.2-rc3 07 November 2016, 20:26:31 UTC
10525c2 [SPARK-18283][STRUCTURED STREAMING][KAFKA] Added test to check whether the default starting offset is latest ## What changes were proposed in this pull request? Added test to check whether the default starting offset is latest. ## How was this patch tested? New unit test Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #15778 from tdas/SPARK-18283. (cherry picked from commit b06c23db9aedae48c9eba9d702ae82fa5647cfe5) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 07 November 2016, 18:44:05 UTC
b5d7217 [SPARK-18125][SQL][BRANCH-2.0] Fix a compilation error in codegen due to splitExpression ## What changes were proposed in this pull request? Backport to branch 2.0. As reported in the jira, sometimes the generated java code in codegen will cause compilation error. Code snippet to test it: case class Route(src: String, dest: String, cost: Int) case class GroupedRoutes(src: String, dest: String, routes: Seq[Route]) val ds = sc.parallelize(Array( Route("a", "b", 1), Route("a", "b", 2), Route("a", "c", 2), Route("a", "d", 10), Route("b", "a", 1), Route("b", "a", 5), Route("b", "c", 6)) ).toDF.as[Route] val grped = ds.map(r => GroupedRoutes(r.src, r.dest, Seq(r))) .groupByKey(r => (r.src, r.dest)) .reduceGroups { (g1: GroupedRoutes, g2: GroupedRoutes) => GroupedRoutes(g1.src, g1.dest, g1.routes ++ g2.routes) }.map(_._2) The problem here is, in `ReferenceToExpressions` we evaluate the children vars to local variables. Then the result expression is evaluated to use those children variables. In the above case, the result expression code is too long and will be split by `CodegenContext.splitExpression`. So those local variables cannot be accessed and cause compilation error. ## How was this patch tested? Jenkins tests. Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #15796 from viirya/fix-codege-compilation-error-2.0. 07 November 2016, 18:09:18 UTC
dd5cb0a [SPARK-17849][SQL] Fix NPE problem when using grouping sets ## What changes were proposed in this pull request? Prior this pr, the following code would cause an NPE: `case class point(a:String, b:String, c:String, d: Int)` `val data = Seq( point("1","2","3", 1), point("4","5","6", 1), point("7","8","9", 1) )` `sc.parallelize(data).toDF().registerTempTable("table")` `spark.sql("select a, b, c, count(d) from table group by a, b, c GROUPING SETS ((a)) ").show()` The reason is that when the grouping_id() behavior was changed in #10677, some code (which should be changed) was left out. Take the above code for example, prior #10677, the bit mask for set "(a)" was `001`, while after #10677 the bit mask was changed to `011`. However, the `nonNullBitmask` was not changed accordingly. This pr will fix this problem. ## How was this patch tested? add integration tests Author: wangyang <wangyang@haizhi.com> Closes #15416 from yangw1234/groupingid. (cherry picked from commit fb0d60814a79747beb68da9613679141c44f2540) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 05 November 2016, 13:33:04 UTC
5b9eb42 [SPARK-17693][SQL][BACKPORT-2.0] Fixed Insert Failure To Data Source Tables when the Schema has the Comment Field ### What changes were proposed in this pull request? **This PR is to backport the fix https://github.com/apache/spark/pull/15615 to 2.0.** ``` SQL CREATE TABLE tab1(col1 int COMMENT 'a', col2 int) USING parquet INSERT INTO TABLE tab1 SELECT 1, 2 ``` The insert attempt will fail if the target table has a column with comments. The error is strange to the external users: ``` assertion failed: No plan for InsertIntoTable Relation[col1#15,col2#16] parquet, false, false +- Project [1 AS col1#19, 2 AS col2#20] +- OneRowRelation$ ``` This PR is to fix the above bug by checking the metadata when comparing the schema between the table and the query. If not matched, we also copy the metadata. This is an alternative to https://github.com/apache/spark/pull/15266 ### How was this patch tested? Added a test case Author: gatorsmile <gatorsmile@gmail.com> Closes #15782 from gatorsmile/insertDataSourceTableWithCommentSolutionBackPort2.0. 05 November 2016, 10:45:16 UTC
d023c6c [SPARK-17981][SPARK-17957][SQL][BACKPORT-2.0] Fix Incorrect Nullability Setting to False in FilterExec ### What changes were proposed in this pull request? **This PR is to backport the fix https://github.com/apache/spark/pull/15523 to 2.0.** When `FilterExec` contains `isNotNull`, which could be inferred and pushed down or users specified, we convert the nullability of the involved columns if the top-layer expression is null-intolerant. However, this is not correct, if the top-layer expression is not a leaf expression, it could still tolerate the null when it has null-tolerant child expressions. For example, `cast(coalesce(a#5, a#15) as double)`. Although `cast` is a null-intolerant expression, but obviously`coalesce` is null-tolerant. Thus, it could eat null. When the nullability is wrong, we could generate incorrect results in different cases. For example, ``` Scala val df1 = Seq((1, 2), (2, 3)).toDF("a", "b") val df2 = Seq((2, 5), (3, 4)).toDF("a", "c") val joinedDf = df1.join(df2, Seq("a"), "outer").na.fill(0) val df3 = Seq((3, 1)).toDF("a", "d") joinedDf.join(df3, "a").show ``` The optimized plan is like ``` Project [a#29, b#30, c#31, d#42] +- Join Inner, (a#29 = a#41) :- Project [cast(coalesce(cast(coalesce(a#5, a#15) as double), 0.0) as int) AS a#29, cast(coalesce(cast(b#6 as double), 0.0) as int) AS b#30, cast(coalesce(cast(c#16 as double), 0.0) as int) AS c#31] : +- Filter isnotnull(cast(coalesce(cast(coalesce(a#5, a#15) as double), 0.0) as int)) : +- Join FullOuter, (a#5 = a#15) : :- LocalRelation [a#5, b#6] : +- LocalRelation [a#15, c#16] +- LocalRelation [a#41, d#42] ``` Without the fix, it returns an empty result. With the fix, it can return a correct answer: ``` +---+---+---+---+ | a| b| c| d| +---+---+---+---+ | 3| 0| 4| 1| +---+---+---+---+ ``` ### How was this patch tested? Added test cases to verify the nullability changes in FilterExec. Also added a test case for verifying the reported incorrect result. Author: gatorsmile <gatorsmile@gmail.com> Closes #15781 from gatorsmile/nullabilityFix. 05 November 2016, 10:32:41 UTC
8b99e20 [SPARK-18189][SQL][FOLLOWUP] Move test from ReplSuite to prevent java.lang.ClassCircularityError ## What changes were proposed in this pull request? Move the test which is causing java.lang.ClassCircularityError from ReplSuite to DatasetSuite. ## How was this patch tested? > build/mvn -DskipTests -Phadoop-2.3 -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl -Pmesos clean package > build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.repl.ReplSuite test Author: Ergin Seyfe <eseyfe@fb.com> Closes #15774 from seyfe/fix_replsuite_test_error_branch2.0. 05 November 2016, 06:29:20 UTC
399597b [SPARK-17337][SPARK-16804][SQL][BRANCH-2.0] Backport subquery related PRs ## What changes were proposed in this pull request? This PR backports two subquery related PRs to branch-2.0: - https://github.com/apache/spark/pull/14411 - https://github.com/apache/spark/pull/15761 ## How was this patch tested? Added a tests to `SubquerySuite`. Author: Nattavut Sutyanyong <nsy.can@gmail.com> Author: Herman van Hovell <hvanhovell@databricks.com> Closes #15772 from hvanhovell/SPARK-17337-2.0. 04 November 2016, 22:54:58 UTC
c864e8a [SPARK-18200][GRAPHX][FOLLOW-UP] Support zero as an initial capacity in OpenHashSet ## What changes were proposed in this pull request? This is a follow-up PR of #15741 in order to keep `nextPowerOf2` consistent. **Before** ``` nextPowerOf2(0) => 2 nextPowerOf2(1) => 1 nextPowerOf2(2) => 2 nextPowerOf2(3) => 4 nextPowerOf2(4) => 4 nextPowerOf2(5) => 8 ``` **After** ``` nextPowerOf2(0) => 1 nextPowerOf2(1) => 1 nextPowerOf2(2) => 2 nextPowerOf2(3) => 4 nextPowerOf2(4) => 4 nextPowerOf2(5) => 8 ``` ## How was this patch tested? N/A Author: Dongjoon Hyun <dongjoon@apache.org> Closes #15754 from dongjoon-hyun/SPARK-18200-2. (cherry picked from commit 27602c33751cebf6cd173c0de103454608cf6625) Signed-off-by: Reynold Xin <rxin@databricks.com> 04 November 2016, 06:17:15 UTC
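A standalone sketch (not the actual OpenHashSet code) of a `nextPowerOf2` that reproduces the "After" behaviour above; the function name and the `Integer.highestOneBit` approach are illustrative assumptions.
```scala
// Sketch: smallest power of two >= n, with nextPowerOf2(0) == 1 as in the "After" table.
def nextPowerOf2(n: Int): Int = {
  val highBit = Integer.highestOneBit(n)   // 0 for n == 0
  if (highBit == n) math.max(n, 1)         // n is already a power of two (or zero)
  else highBit << 1                        // round up to the next power of two
}

assert((0 to 5).map(nextPowerOf2) == Seq(1, 1, 2, 4, 4, 8))
```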
dae1581 [SPARK-18200][GRAPHX] Support zero as an initial capacity in OpenHashSet ## What changes were proposed in this pull request? [SPARK-18200](https://issues.apache.org/jira/browse/SPARK-18200) reports that Apache Spark 2.x raises `java.lang.IllegalArgumentException: requirement failed: Invalid initial capacity` while running `triangleCount`. The root cause is that `VertexSet`, a type alias of `OpenHashSet`, does not allow zero as an initial size. This PR loosens the restriction to allow zero. ## How was this patch tested? Pass the Jenkins test with a new test case in `OpenHashSetSuite`. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #15741 from dongjoon-hyun/SPARK-18200. (cherry picked from commit d24e736471f34ef8f2c12766393379c4213fe96e) Signed-off-by: Reynold Xin <rxin@databricks.com> 03 November 2016, 06:51:26 UTC
3253ae7 [SPARK-18111][SQL] Wrong approximate quantile answer when multiple records have the minimum value (for branch 2.0) ## What changes were proposed in this pull request? When multiple records have the minimum value, the answer of `StatFunctions.multipleApproxQuantiles` is wrong. ## How was this patch tested? Added a test case. Author: wangzhenhua <wangzhenhua@huawei.com> Closes #15732 from wzhfy/percentile2. 02 November 2016, 18:49:30 UTC
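A minimal reproduction sketch of the scenario described (several records sharing the minimum value), using the public DataFrame stat API rather than `StatFunctions` directly; the sample data and app name are assumptions.
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("quantile-check").getOrCreate()
import spark.implicits._

// Several records share the minimum value 1.0.
val df = Seq(1.0, 1.0, 1.0, 2.0, 3.0).toDF("x")

// With the fix, low quantiles (e.g. 0.0) should report the true minimum.
val quantiles = df.stat.approxQuantile("x", Array(0.0, 0.5, 1.0), 0.0)
println(quantiles.mkString(", "))
```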
1696bcf [SPARK-18160][CORE][YARN] spark.files & spark.jars should not be passed to driver in yarn mode ## What changes were proposed in this pull request? spark.files is still passed to the driver in yarn mode, so SparkContext will still handle it, which causes the error described in the JIRA. ## How was this patch tested? Tested manually in a 5-node cluster. As this issue only happens in a multi-node cluster, I didn't write a test for it. Author: Jeff Zhang <zjffdu@apache.org> Closes #15669 from zjffdu/SPARK-18160. (cherry picked from commit 3c24299b71e23e159edbb972347b13430f92a465) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 02 November 2016, 18:48:25 UTC
eb790c5 [SPARK-16796][WEB UI] Mask spark.authenticate.secret on Spark environment page ## What changes were proposed in this pull request? Mask `spark.authenticate.secret` on Spark environment page (Web UI). This is an addition to https://github.com/apache/spark/pull/14409 ## How was this patch tested? `./dev/run-tests` [info] ScalaTest [info] Run completed in 1 hour, 8 minutes, 38 seconds. [info] Total number of tests run: 2166 [info] Suites: completed 65, aborted 0 [info] Tests: succeeded 2166, failed 0, canceled 0, ignored 590, pending 0 [info] All tests passed. Author: Artur Sukhenko <artur.sukhenko@gmail.com> Closes #14484 from Devian-ua/SPARK-16796. (cherry picked from commit 14dba45208d8a5511be2cf8ddf22e688ef141e88) Signed-off-by: Sean Owen <sowen@cloudera.com> 02 November 2016, 18:06:16 UTC
09178b6 [SPARK-18133][BRANCH-2.0][EXAMPLES][ML] Python ML Pipeline Example has syntax errors ## What changes were proposed in this pull request? [Fix] [branch-2.0] In Python 3, there is only one integer type (i.e., int), which mostly behaves like the long type in Python 2. Since Python 3 won't accept "L", removed "L" in all examples. ## How was this patch tested? Unit tests. Author: Jagadeesan <as2@us.ibm.com> Closes #15729 from jagadeesanas2/SPARK-18133_branch2.0. 02 November 2016, 09:23:30 UTC
81f0804 [SPARK-18144][SQL] logging StreamingQueryListener$QueryStartedEvent ## What changes were proposed in this pull request? The PR fixes the bug that the QueryStartedEvent is not logged. The postToAll() in the original code is actually calling StreamingQueryListenerBus.postToAll(), which has no listeners at all; we should post via sparkListenerBus.postToAll(s) and this.postToAll() to trigger local listeners as well as the listeners registered in LiveListenerBus. cc zsxwing ## How was this patch tested? The following snapshot shows that QueryStartedEvent has been logged correctly. ![image](https://cloud.githubusercontent.com/assets/678008/19821553/007a7d28-9d2d-11e6-9f13-49851559cdaa.png) Author: CodingCat <zhunansjtu@gmail.com> Closes #15675 from CodingCat/SPARK-18144. (cherry picked from commit 85c5424d466f4a5765c825e0e2ab30da97611285) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 02 November 2016, 06:40:12 UTC
d401a74 Preparing development version 2.0.3-SNAPSHOT 01 November 2016, 19:46:05 UTC
a6abe1e Preparing Spark release v2.0.2-rc2 01 November 2016, 19:45:54 UTC
a01b950 [SPARK-18114][HOTFIX] Fix line-too-long style error from backport of SPARK-18114 ## What changes were proposed in this pull request? Fix style error introduced in cherry-pick of https://github.com/apache/spark/pull/15643 to branch-2.0. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #15719 from srowen/SPARK-18114.2. 01 November 2016, 19:43:50 UTC
4176da8 [SPARK-18148][SQL] Misleading Error Message for Aggregation Without Window/GroupBy ## What changes were proposed in this pull request? Aggregation Without Window/GroupBy expressions will fail in `checkAnalysis`, the error message is a bit misleading, we should generate a more specific error message for this case. For example, ``` spark.read.load("/some-data") .withColumn("date_dt", to_date($"date")) .withColumn("year", year($"date_dt")) .withColumn("week", weekofyear($"date_dt")) .withColumn("user_count", count($"userId")) .withColumn("daily_max_in_week", max($"user_count").over(weeklyWindow)) ) ``` creates the following output: ``` org.apache.spark.sql.AnalysisException: expression '`randomColumn`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.; ``` In the error message above, `randomColumn` doesn't appear in the query(acturally it's added by function `withColumn`), so the message is not enough for the user to address the problem. ## How was this patch tested? Manually test Before: ``` scala> spark.sql("select col, count(col) from tbl") org.apache.spark.sql.AnalysisException: expression 'tbl.`col`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;; ``` After: ``` scala> spark.sql("select col, count(col) from tbl") org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and 'tbl.`col`' is not an aggregate function. Wrap '(count(col#231L) AS count(col)#239L)' in windowing function(s) or wrap 'tbl.`col`' in first() (or first_value) if you don't care which value you get.;; ``` Also add new test sqls in `group-by.sql`. Author: jiangxingbo <jiangxb1987@gmail.com> Closes #15672 from jiangxb1987/groupBy-empty. (cherry picked from commit d0272b436512b71f04313e109d3d21a6e9deefca) Signed-off-by: Reynold Xin <rxin@databricks.com> 01 November 2016, 18:25:18 UTC
58655f5 [SPARK-18189][SQL] Fix serialization issue in KeyValueGroupedDataset ## What changes were proposed in this pull request? Likewise [DataSet.scala](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L156) KeyValueGroupedDataset should mark the queryExecution as transient. As mentioned in the Jira ticket, without transient we saw serialization issues like ``` Caused by: java.io.NotSerializableException: org.apache.spark.sql.execution.QueryExecution Serialization stack: - object not serializable (class: org.apache.spark.sql.execution.QueryExecution, value: == ``` ## How was this patch tested? Run the query which is specified in the Jira ticket before and after: ``` val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)] val grouped = a.groupByKey( {x:(Int,Int)=>x._1} ) val mappedGroups = grouped.mapGroups((k,x)=> {(k,1)} ) val yyy = sc.broadcast(1) val last = mappedGroups.rdd.map(xx=> { val simpley = yyy.value 1 } ) ``` Author: Ergin Seyfe <eseyfe@fb.com> Closes #15706 from seyfe/keyvaluegrouped_serialization. (cherry picked from commit 8a538c97b556f80f67c80519af0ce879557050d5) Signed-off-by: Reynold Xin <rxin@databricks.com> 01 November 2016, 18:18:50 UTC
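A generic sketch of the `@transient` pattern the fix applies (class names are hypothetical, not the actual KeyValueGroupedDataset code): the non-serializable field is excluded from Java serialization so objects referencing it can still be shipped in closures.
```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

class NotSerializableState { val handle = new Object }   // stand-in for QueryExecution

// Marking the field @transient keeps it out of the serialized form.
class Wrapper(@transient val state: NotSerializableState, val name: String) extends Serializable

val out = new ObjectOutputStream(new ByteArrayOutputStream())
out.writeObject(new Wrapper(new NotSerializableState, "grouped"))   // no NotSerializableException
```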
4d2672a [SPARK-18114][MESOS] Fix mesos cluster scheduler generate command option error Enclose the --conf option value with "" to support multi-value configs like spark.driver.extraJavaOptions; without "", the driver will fail to start. Jenkins tests. Tested in our production environment, plus unit tests; it is a very small change. Author: Wang Lei <lei.wang@kongming-inc.com> Closes #15643 from LeightonWong/messos-cluster. (cherry picked from commit 9b377aa49f14af31f54164378d60e0fdea2142e5) Signed-off-by: Sean Owen <sowen@cloudera.com> 01 November 2016, 13:43:47 UTC
e06f43e [SPARK-18030][TESTS] Fix flaky FileStreamSourceSuite by not deleting the files ## What changes were proposed in this pull request? The test `when schema inference is turned on, should read partition data` should not delete files because the source maybe is listing files. This PR just removes the delete actions since they are not necessary. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #15699 from zsxwing/SPARK-18030. (cherry picked from commit de3f87fa712c305fdd463fc36acffc5418c95c4d) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 31 October 2016, 23:05:26 UTC
300d596 [SPARK-18143][SQL] Ignore Structured Streaming event logs to avoid breaking history server (branch 2.0) ## What changes were proposed in this pull request? Backport #15663 to branch-2.0 and fixed conflicts in `ReplayListenerBus`. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #15695 from zsxwing/fix-event-log-2.0. 31 October 2016, 23:03:44 UTC
9f92474 [SPARK-16312][FOLLOW-UP][STREAMING][KAFKA][DOC] Add java code snippet for Kafka 0.10 integration doc ## What changes were proposed in this pull request? added java code snippet for Kafka 0.10 integration doc ## How was this patch tested? SKIP_API=1 jekyll build ## Screenshot ![kafka-doc](https://cloud.githubusercontent.com/assets/15843379/19826272/bf0d8a4c-9db8-11e6-9e40-1396723df4bc.png) Author: Liwei Lin <lwlin7@gmail.com> Closes #15679 from lw-lin/kafka-010-examples. (cherry picked from commit 505b927cb7ff037adb797b9c3b9ecac3f885b7c8) Signed-off-by: Sean Owen <sowen@cloudera.com> 30 October 2016, 09:32:37 UTC
9ed8976 [SPARK-18164][SQL] ForeachSink should fail the Spark job if `process` throws exception ## What changes were proposed in this pull request? Fixed the issue that ForeachSink didn't rethrow the exception. ## How was this patch tested? The fixed unit test. Author: Shixiong Zhu <shixiong@databricks.com> Closes #15674 from zsxwing/foreach-sink-error. (cherry picked from commit 59cccbda489f25add3e10997e950de7e88704aa7) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 29 October 2016, 03:14:53 UTC
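A minimal sketch of the behaviour being fixed, using the public ForeachWriter API; the socket source host/port and the failure condition are assumptions. After this patch, an exception thrown from `process` fails the query instead of being swallowed.
```scala
import org.apache.spark.sql.{ForeachWriter, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("foreach-sink").getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999)
  .load().as[String]

val query = lines.writeStream.foreach(new ForeachWriter[String] {
  def open(partitionId: Long, version: Long): Boolean = true
  def process(value: String): Unit = {
    if (value.isEmpty) throw new RuntimeException("bad record")  // now propagated to the query
  }
  def close(errorOrNull: Throwable): Unit = {}
}).start()
```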
578e40e [SPARK-16963][SQL] Fix test "StreamExecution metadata garbage collection" ## What changes were proposed in this pull request? A follow up PR for #14553 to fix the flaky test. It's flaky because the file list API doesn't guarantee any order of the return list. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #15661 from zsxwing/fix-StreamingQuerySuite. (cherry picked from commit 79fd0cc0584e48fb021c4237877b15abbffb319a) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 27 October 2016, 19:33:07 UTC
6fb1f73 [SPARK-17813][SQL][KAFKA] Maximum data per trigger ## What changes were proposed in this pull request? maxOffsetsPerTrigger option for rate limiting, proportionally based on volume of different topicpartitions. ## How was this patch tested? Added unit test Author: cody koeninger <cody@koeninger.org> Closes #15527 from koeninger/SPARK-17813. (cherry picked from commit 104232580528c097a284d753adb5795f6de8b0a5) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 27 October 2016, 17:31:10 UTC
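A usage sketch of the new option on the structured streaming Kafka source, assuming an active SparkSession named `spark`; the broker address, topic name, and limit value are assumptions.
```scala
val kafkaDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // assumed broker
  .option("subscribe", "events")                       // assumed topic
  .option("maxOffsetsPerTrigger", "10000")             // cap total offsets consumed per trigger
  .load()
```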
1a4be51 [SPARK-18132] Fix checkstyle This PR fixes checkstyle. Author: Yin Huai <yhuai@databricks.com> Closes #15656 from yhuai/fix-format. (cherry picked from commit d3b4831d009905185ad74096ce3ecfa934bc191d) Signed-off-by: Yin Huai <yhuai@databricks.com> 27 October 2016, 05:22:55 UTC
dcf2f09 [SPARK-18009][SQL] Fix ClassCastException while calling toLocalIterator() on dataframe produced by RunnableCommand A short code snippet that uses toLocalIterator() on a dataframe produced by a RunnableCommand reproduces the problem. toLocalIterator() is called by thriftserver when `spark.sql.thriftServer.incrementalCollect`is set to handle queries producing large result set. **Before** ```SQL scala> spark.sql("show databases") res0: org.apache.spark.sql.DataFrame = [databaseName: string] scala> res0.toLocalIterator() 16/10/26 03:00:24 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow ``` **After** ```SQL scala> spark.sql("drop database databases") res30: org.apache.spark.sql.DataFrame = [] scala> spark.sql("show databases") res31: org.apache.spark.sql.DataFrame = [databaseName: string] scala> res31.toLocalIterator().asScala foreach println [default] [parquet] ``` Added a test in DDLSuite Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #15642 from dilipbiswal/SPARK-18009. (cherry picked from commit dd4f088c1df6abd728e5544a17ba85322bedfe4c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 October 2016, 05:14:32 UTC
ea205e3 [SPARK-16963][STREAMING][SQL] Changes to Source trait and related implementation classes ## What changes were proposed in this pull request? This PR contains changes to the Source trait such that the scheduler can notify data sources when it is safe to discard buffered data. Summary of changes: * Added a method `commit(end: Offset)` that tells the Source that is OK to discard all offsets up `end`, inclusive. * Changed the semantics of a `None` value for the `getBatch` method to mean "from the very beginning of the stream"; as opposed to "all data present in the Source's buffer". * Added notes that the upper layers of the system will never call `getBatch` with a start value less than the last value passed to `commit`. * Added a `lastCommittedOffset` method to allow the scheduler to query the status of each Source on restart. This addition is not strictly necessary, but it seemed like a good idea -- Sources will be maintaining their own persistent state, and there may be bugs in the checkpointing code. * The scheduler in `StreamExecution.scala` now calls `commit` on its stream sources after marking each batch as complete in its checkpoint. * `MemoryStream` now cleans committed batches out of its internal buffer. * `TextSocketSource` now cleans committed batches from its internal buffer. ## How was this patch tested? Existing regression tests already exercise the new code. Author: frreiss <frreiss@us.ibm.com> Closes #14553 from frreiss/fred-16963. (cherry picked from commit 5b27598ff50cb08e7570fade458da0a3d4d4eabc) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 27 October 2016, 00:33:16 UTC
72b3cff Preparing development version 2.0.3-SNAPSHOT 26 October 2016, 21:02:05 UTC
1c2908e Preparing Spark release v2.0.2-rc1 26 October 2016, 21:01:59 UTC
76b71ee [SPARK-13747][SQL] Fix concurrent executions in ForkJoinPool for SQL (branch 2.0) ## What changes were proposed in this pull request? Backport #15520 to 2.0. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #15646 from zsxwing/SPARK-13747-2.0. 26 October 2016, 20:21:46 UTC
b482b3d [SPARK-18104][DOC] Don't build KafkaSource doc ## What changes were proposed in this pull request? Don't need to build doc for KafkaSource because the user should use the data source APIs to use KafkaSource. All KafkaSource APIs are internal. ## How was this patch tested? Verified manually. Author: Shixiong Zhu <shixiong@databricks.com> Closes #15630 from zsxwing/kafka-unidoc. (cherry picked from commit 7d10631c16b980adf1f55378c128436310daed65) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 26 October 2016, 18:16:27 UTC
5b81b01 [SPARK-18063][SQL] Failed to infer constraints over multiple aliases ## What changes were proposed in this pull request? The `UnaryNode.getAliasedConstraints` function fails to replace all expressions by their alias where constraints contains more than one expression to be replaced. For example: ``` val tr = LocalRelation('a.int, 'b.string, 'c.int) val multiAlias = tr.where('a === 'c + 10).select('a.as('x), 'c.as('y)) multiAlias.analyze.constraints ``` currently outputs: ``` ExpressionSet(Seq( IsNotNull(resolveColumn(multiAlias.analyze, "x")), IsNotNull(resolveColumn(multiAlias.analyze, "y")) ) ``` The constraint `resolveColumn(multiAlias.analyze, "x") === resolveColumn(multiAlias.analyze, "y") + 10)` is missing. ## How was this patch tested? Add new test cases in `ConstraintPropagationSuite`. Author: jiangxingbo <jiangxb1987@gmail.com> Closes #15597 from jiangxb1987/alias-constraints. (cherry picked from commit fa7d9d70825a6816495d239da925d0087f7cb94f) Signed-off-by: Reynold Xin <rxin@databricks.com> 26 October 2016, 18:12:44 UTC
773fbfe [SPARK-16304] LinkageError should not crash Spark executor ## What changes were proposed in this pull request? This patch updates the failure handling logic so Spark executor does not crash when seeing LinkageError. ## How was this patch tested? Added an end-to-end test in FailureSuite. Author: petermaxlee <petermaxlee@gmail.com> Closes #13982 from petermaxlee/SPARK-16304. 26 October 2016, 17:27:54 UTC
b4a7b65 [SPARK-18093][SQL] Fix default value test in SQLConfSuite to work regardless of warehouse dir's existence ## What changes were proposed in this pull request? Appending a trailing slash, if there isn't one already, for the sake of comparison of the two paths. It doesn't take away from the essence of the check, but removes any potential mismatch due to lack of trailing slash. ## How was this patch tested? Ran unit tests and they passed. Author: Mark Grover <mark@apache.org> Closes #15623 from markgrover/spark-18093. (cherry picked from commit 4bee9540790a40acb74db4b0b44c364c4b3f537d) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 26 October 2016, 16:07:42 UTC
192c1dd [SPARK-17733][SQL] InferFiltersFromConstraints rule never terminates for query ## What changes were proposed in this pull request? The function `QueryPlan.inferAdditionalConstraints` and `UnaryNode.getAliasedConstraints` can produce a non-converging set of constraints for recursive functions. For instance, if we have two constraints of the form(where a is an alias): `a = b, a = f(b, c)` Applying both these rules in the next iteration would infer: `f(b, c) = f(f(b, c), c)` This process repeated, the iteration won't converge and the set of constraints will grow larger and larger until OOM. ~~To fix this problem, we collect alias from expressions and skip infer constraints if we are to transform an `Expression` to another which contains it.~~ To fix this problem, we apply additional check in `inferAdditionalConstraints`, when it's possible to generate recursive constraints, we skip generate that. ## How was this patch tested? Add new testcase in `SQLQuerySuite`/`InferFiltersFromConstraintsSuite`. Author: jiangxingbo <jiangxb1987@gmail.com> Closes #15319 from jiangxb1987/constraints. (cherry picked from commit 3c023570b28bc1ed24f5b2448311130fd1777fd3) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 26 October 2016, 15:10:01 UTC
c2cce2e [SPARK-18022][SQL] java.lang.NullPointerException instead of real exception when saving DF to MySQL ## What changes were proposed in this pull request? On null next exception in JDBC, don't init it as cause or suppressed ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #15599 from srowen/SPARK-18022. (cherry picked from commit 6c7d094ec4d45a05c1ec8a418e507e45f5a88b7d) Signed-off-by: Sean Owen <sowen@cloudera.com> 26 October 2016, 12:19:59 UTC
912487e [SPARK-16988][SPARK SHELL] Spark history server log needs to be fixed to show the https URL when SSL is enabled Author: chie8842 <chie@chie-no-Mac-mini.local> Closes #15611 from hayashidac/SPARK-16988. (cherry picked from commit c329a568b58d65c492a43926bf0f588f2ae6a66e) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp> 25 October 2016, 22:14:14 UTC
7c8d9a5 [SPARK-18070][SQL] binary operator should not consider nullability when comparing input types ## What changes were proposed in this pull request? Binary operator requires its inputs to be of same type, but it should not consider nullability, e.g. `EqualTo` should be able to compare an element-nullable array and an element-non-nullable array. ## How was this patch tested? a regression test in `DataFrameSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #15606 from cloud-fan/type-bug. (cherry picked from commit a21791e3164f4e6546fbe0a90017a4394a05deb1) Signed-off-by: Yin Huai <yhuai@databricks.com> 25 October 2016, 19:08:28 UTC
1c1e847 [SPARK-17624][SQL][STREAMING][TEST] Fixed flaky StateStoreSuite.maintenance ## What changes were proposed in this pull request? The reason for the flakiness was follows. The test starts the maintenance background thread, and then writes 20 versions of the state store. The maintenance thread is expected to create snapshots in the middle, and clean up old files that are not needed any more. The earliest delta file (1.delta) is expected to be deleted as snapshots will ensure that the earliest delta would not be needed. However, the default configuration for the maintenance thread is to retain files such that last 2 versions can be recovered, and delete the rest. Now while generating the versions, the maintenance thread can kick in and create snapshots anywhere between version 10 and 20 (at least 10 deltas needed for snapshot). Then later it will choose to retain only version 20 and 19 (last 2). There are two cases. - Common case: One of the version between 10 and 19 gets snapshotted. Then recovering versions 19 and 20 just needs 19.snapshot and 20.delta, so 1.delta gets deleted. - Uncommon case (reason for flakiness): Only version 20 gets snapshotted. Then recovering versoin 20 requires 20.snapshot, and recovering version 19 all the previous 19...1.delta. So 1.delta does not get deleted. This PR rearranges the checks such that it create 20 versions, and then waits that there is at least one snapshot, then creates another 20. This will ensure that the latest 2 versions cannot require anything older than the first snapshot generated, and therefore will 1.delta will be deleted. In addition, I have added more logs, and comments that I felt would help future debugging and understanding what is going on. ## How was this patch tested? Ran the StateStoreSuite > 6K times in a heavily loaded machine (10 instances of tests running in parallel). No failures. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #15592 from tdas/SPARK-17624. (cherry picked from commit 407c3cedf29a4413339dcde758295dc3225a0054) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 25 October 2016, 00:21:26 UTC
bad15bc [SPARK-18044][STREAMING] FileStreamSource should not infer partitions in every batch ## What changes were proposed in this pull request? In `FileStreamSource.getBatch`, we will create a `DataSource` with specified schema, to avoid inferring the schema again and again. However, we don't pass the partition columns, and will infer the partition again and again. This PR fixes it by keeping the partition columns in `FileStreamSource`, like schema. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #15581 from cloud-fan/stream. 24 October 2016, 17:51:04 UTC
aef65ac [SPARK-17153][SQL] Should read partition data when reading new files in filestream without globbing ## What changes were proposed in this pull request? When reading file stream with non-globbing path, the results return data with all `null`s for the partitioned columns. E.g., case class A(id: Int, value: Int) val data = spark.createDataset(Seq( A(1, 1), A(2, 2), A(2, 3)) ) val url = "/tmp/test" data.write.partitionBy("id").parquet(url) spark.read.parquet(url).show +-----+---+ |value| id| +-----+---+ | 2| 2| | 3| 2| | 1| 1| +-----+---+ val s = spark.readStream.schema(spark.read.load(url).schema).parquet(url) s.writeStream.queryName("test").format("memory").start() sql("SELECT * FROM test").show +-----+----+ |value| id| +-----+----+ | 2|null| | 3|null| | 1|null| +-----+----+ ## How was this patch tested? Jenkins tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #14803 from viirya/filestreamsource-option. 24 October 2016, 17:45:05 UTC
064db17 [SPARK-17810][SQL] Default spark.sql.warehouse.dir is relative to local FS but can resolve as HDFS path Always resolve spark.sql.warehouse.dir as a local path, relative to the working dir rather than the home dir. Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #15382 from srowen/SPARK-17810. (cherry picked from commit 4ecbe1b92f4c4c5b2d734895c09d8ded0ed48d4d) Signed-off-by: Sean Owen <sowen@cloudera.com> 24 October 2016, 10:19:52 UTC
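A small config sketch illustrating the resolution described (the relative path "spark-warehouse" is an assumption): with this change the value resolves on the local filesystem relative to the working directory rather than as an HDFS path.
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "spark-warehouse")  // resolves under the current working dir
  .getOrCreate()

println(spark.conf.get("spark.sql.warehouse.dir"))
```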
00a2e01 [SPARK-18058][SQL][BRANCH-2.0] Comparing column types ignoring Nullability in Union and SetOperation ## What changes were proposed in this pull request? The PR tries to fix [SPARK-18058](https://issues.apache.org/jira/browse/SPARK-18058), a bug where column types in Union and SetOperation are compared with unnecessary strictness about nullability. This PR converts the column types by setting all fields as nullable before comparison. ## How was this patch tested? Regular unit test cases. Author: CodingCat <zhunansjtu@gmail.com> Closes #15602 from CodingCat/branch-2.0. 23 October 2016, 20:20:18 UTC
0e0d83a [SPARKR][BRANCH-2.0] R merge API doc and example fix ## What changes were proposed in this pull request? Fixes for R doc ## How was this patch tested? N/A Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15589 from felixcheung/rdocmergefix. 23 October 2016, 17:53:27 UTC
e21e9d4 [SPARK-17123][SQL][BRANCH-2.0] Use type-widened encoder for DataFrame for set operations ## What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/15072 Please note that the test code is a bit different with the master as https://github.com/apache/spark/pull/14786 was only merged into master and therefore, it does not support type-widening between `DateType` and `TimestampType`. So, both types were taken out from the test. ## How was this patch tested? Unit test in `DataFrameSuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15601 from HyukjinKwon/backport-17123. 23 October 2016, 12:00:35 UTC
3d58787 [SPARK-17698][SQL] Join predicates should not contain filter clauses ## What changes were proposed in this pull request? This is a backport of https://github.com/apache/spark/pull/15272 to 2.0 branch. Jira : https://issues.apache.org/jira/browse/SPARK-17698 `ExtractEquiJoinKeys` is incorrectly using filter predicates as the join condition for joins. `canEvaluate` [0] tries to see if the an `Expression` can be evaluated using output of a given `Plan`. In case of filter predicates (eg. `a.id='1'`), the `Expression` passed for the right hand side (ie. '1' ) is a `Literal` which does not have any attribute references. Thus `expr.references` is an empty set which theoretically is a subset of any set. This leads to `canEvaluate` returning `true` and `a.id='1'` is treated as a join predicate. While this does not lead to incorrect results but in case of bucketed + sorted tables, we might miss out on avoiding un-necessary shuffle + sort. See example below: [0] : https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L91 eg. ``` val df = (1 until 10).toDF("id").coalesce(1) hc.sql("DROP TABLE IF EXISTS table1").collect df.write.bucketBy(8, "id").sortBy("id").saveAsTable("table1") hc.sql("DROP TABLE IF EXISTS table2").collect df.write.bucketBy(8, "id").sortBy("id").saveAsTable("table2") sqlContext.sql(""" SELECT a.id, b.id FROM table1 a FULL OUTER JOIN table2 b ON a.id = b.id AND a.id='1' AND b.id='1' """).explain(true) ``` BEFORE: This is doing shuffle + sort over table scan outputs which is not needed as both tables are bucketed and sorted on the same columns and have same number of buckets. This should be a single stage job. ``` SortMergeJoin [id#38, cast(id#38 as double), 1.0], [id#39, 1.0, cast(id#39 as double)], FullOuter :- *Sort [id#38 ASC NULLS FIRST, cast(id#38 as double) ASC NULLS FIRST, 1.0 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(id#38, cast(id#38 as double), 1.0, 200) : +- *FileScan parquet default.table1[id#38] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table1, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int> +- *Sort [id#39 ASC NULLS FIRST, 1.0 ASC NULLS FIRST, cast(id#39 as double) ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(id#39, 1.0, cast(id#39 as double), 200) +- *FileScan parquet default.table2[id#39] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int> ``` AFTER : ``` SortMergeJoin [id#32], [id#33], FullOuter, ((cast(id#32 as double) = 1.0) && (cast(id#33 as double) = 1.0)) :- *FileScan parquet default.table1[id#32] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table1, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int> +- *FileScan parquet default.table2[id#33] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int> ``` ## How was this patch tested? - Added a new test case for this scenario : `SPARK-17698 Join predicates should not contain filter clauses` - Ran all the tests in `BucketedReadSuite` Author: Tejas Patil <tejasp@fb.com> Closes #15600 from tejasapatil/SPARK-17698_2.0_backport. 22 October 2016, 23:32:49 UTC
b959dab [SPARK-17986][ML] SQLTransformer should remove temporary tables ## What changes were proposed in this pull request? A call to the method `SQLTransformer.transform` previously would create a temporary table and never delete it. This change adds a call to `dropTempView()` that deletes this temporary table before returning the result so that the table will not remain in spark's table catalog. Because `tableName` is randomized and not exposed, there should be no expected use of this table outside of the `transform` method. ## How was this patch tested? A single new assertion was added to the existing test of the `SQLTransformer.transform` method that all temporary tables are removed. Without the corresponding code change, this new assertion fails. I am not aware of any circumstances in which removing this temporary view would be bad for performance or correctness in other ways, but some expertise here would be helpful. Author: Drew Robb <drewrobb@gmail.com> Closes #15526 from drewrobb/SPARK-17986. (cherry picked from commit ab3363e9f6b1f7fc26682509fe7382c570f91778) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 22 October 2016, 09:00:05 UTC
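A minimal usage sketch of SQLTransformer (the input DataFrame `df` and its columns v1/v2 are assumptions); after this change the temporary view created inside `transform` is dropped before the result is returned.
```scala
import org.apache.spark.ml.feature.SQLTransformer

val sqlTrans = new SQLTransformer().setStatement(
  "SELECT *, (v1 + v2) AS v3 FROM __THIS__")

val transformed = sqlTrans.transform(df)   // no temp table is left behind in the catalog
```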
a0c03c9 [SPARK-16606][MINOR] Tiny follow-up to SPARK-16606, to correct more instances of the same log message typo ## What changes were proposed in this pull request? Tiny follow-up to SPARK-16606 / https://github.com/apache/spark/pull/14533, to correct more instances of the same log message typo ## How was this patch tested? Existing tests (no functional change anyway) Author: Sean Owen <sowen@cloudera.com> Closes #15586 from srowen/SPARK-16606.2. (cherry picked from commit 7178c56433cd138dae53db9194c55e3f4fa0fa69) Signed-off-by: Reynold Xin <rxin@databricks.com> 22 October 2016, 05:21:07 UTC
d3c78c4 [STREAMING][KAFKA][DOC] clarify kafka settings needed for larger batches ## What changes were proposed in this pull request? Minor doc change to mention kafka configuration for larger spark batches. ## How was this patch tested? Doc change only, confirmed via jekyll. The configuration issue was discussed / confirmed with users on the mailing list. Author: cody koeninger <cody@koeninger.org> Closes #15570 from koeninger/kafka-doc-heartbeat. (cherry picked from commit c9720b2195a465653690b3e221ce789142217b0d) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 21 October 2016, 23:27:26 UTC
3e9840f [SPARK-17812][SQL][KAFKA] Assign and specific startingOffsets for structured stream ## What changes were proposed in this pull request? startingOffsets takes specific per-topicpartition offsets as a JSON argument, usable with any consumer strategy; assign takes specific topicpartitions as a consumer strategy. ## How was this patch tested? Unit tests Author: cody koeninger <cody@koeninger.org> Closes #15504 from koeninger/SPARK-17812. (cherry picked from commit 268ccb9a48dfefc4d7bc85155e7e20a2dfe89307) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 21 October 2016, 22:55:11 UTC
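A usage sketch of the two new capabilities together, assuming an active SparkSession named `spark` (topic names, partitions, and offsets are assumptions): an explicit assign consumer strategy plus per-topicpartition startingOffsets given as JSON, where -2 means earliest and -1 means latest.
```scala
val kafkaDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("assign", """{"topicA":[0,1]}""")                     // explicit topicpartitions
  .option("startingOffsets", """{"topicA":{"0":23,"1":-2}}""")  // offset 23 on partition 0, earliest on 1
  .load()
```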
b113b5d [SPARK-17929][CORE] Fix deadlock when CoarseGrainedSchedulerBackend reset ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-17929 Now `CoarseGrainedSchedulerBackend` reset will get the lock, ``` protected def reset(): Unit = synchronized { numPendingExecutors = 0 executorsPendingToRemove.clear() // Remove all the lingering executors that should be removed but not yet. The reason might be // because (1) disconnected event is not yet received; (2) executors die silently. executorDataMap.toMap.foreach { case (eid, _) => driverEndpoint.askWithRetry[Boolean]( RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered."))) } } ``` but on removeExecutor also need the lock "CoarseGrainedSchedulerBackend.this.synchronized", this will cause deadlock. ``` private def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = { logDebug(s"Asked to remove executor $executorId with reason $reason") executorDataMap.get(executorId) match { case Some(executorInfo) => // This must be synchronized because variables mutated // in this block are read when requesting executors val killed = CoarseGrainedSchedulerBackend.this.synchronized { addressToExecutorId -= executorInfo.executorAddress executorDataMap -= executorId executorsPendingLossReason -= executorId executorsPendingToRemove.remove(executorId).getOrElse(false) } ... ## How was this patch tested? manual test. Author: w00228970 <wangfei1@huawei.com> Closes #15481 from scwf/spark-17929. (cherry picked from commit c1f344f1a09b8834bec70c1ece30b9bff63e55ea) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 21 October 2016, 21:44:05 UTC
af2e6e0 [SPARK-17926][SQL][STREAMING] Added json for statuses ## What changes were proposed in this pull request? StreamingQueryStatus exposed through StreamingQueryListener often needs to be recorded (similar to SparkListener events). This PR adds `.json` and `.prettyJson` to `StreamingQueryStatus`, `SourceStatus` and `SinkStatus`. ## How was this patch tested? New unit tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #15476 from tdas/SPARK-17926. (cherry picked from commit 7a531e3054f8d4820216ed379433559f57f571b8) Signed-off-by: Yin Huai <yhuai@databricks.com> 21 October 2016, 20:07:59 UTC
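A small sketch of the new JSON accessors (the memory sink, query name, and a streaming Dataset named `streamingDF` are assumptions), using the status fields described in the SPARK-17731 entry further down this log.
```scala
val query = streamingDF.writeStream
  .format("memory").queryName("status_demo")
  .start()

println(query.status.prettyJson)                        // whole StreamingQueryStatus as JSON
println(query.status.sourceStatuses.map(_.json).mkString("[", ",", "]"))
println(query.status.sinkStatus.json)
```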
78458a7 [SPARK-17811] SparkR cannot parallelize data.frame with NA or NULL in Date columns ## What changes were proposed in this pull request? NA date values are serialized as "NA" and NA time values are serialized as NaN from R. In the backend we did not have proper logic to deal with them. As a result we got an IllegalArgumentException for Date and wrong value for time. This PR adds support for deserializing NA as Date and Time. ## How was this patch tested? * [x] TODO Author: Hossein <hossein@databricks.com> Closes #15421 from falaki/SPARK-17811. (cherry picked from commit e371040a0150e4ed748a7c25465965840b61ca63) Signed-off-by: Felix Cheung <felixcheung@apache.org> 21 October 2016, 19:45:35 UTC
a65d40a [SPARK-18034] Upgrade to MiMa 0.1.11 to fix flakiness We should upgrade to the latest release of MiMa (0.1.11) in order to include a fix for a bug which led to flakiness in the MiMa checks (https://github.com/typesafehub/migration-manager/issues/115). Author: Josh Rosen <joshrosen@databricks.com> Closes #15571 from JoshRosen/SPARK-18034. 21 October 2016, 18:28:49 UTC
6cc6cb2 [SPARKR] fix warnings ## What changes were proposed in this pull request? Fix for a bunch of test warnings that were added recently. We need to investigate why warnings are not turning into errors. ``` Warnings ----------------------------------------------------------------------- 1. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Sepal_Length instead of Sepal.Length as column name 2. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Sepal_Width instead of Sepal.Width as column name 3. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Petal_Length instead of Petal.Length as column name 4. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Petal_Width instead of Petal.Width as column name Consider adding importFrom("utils", "object.size") to your NAMESPACE file. ``` ## How was this patch tested? unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15560 from felixcheung/rwarnings. (cherry picked from commit 3180272d2d49e440516085c0e4aebd5bad18bcad) Signed-off-by: Felix Cheung <felixcheung@apache.org> 21 October 2016, 04:13:12 UTC
e8923d2 [SPARK-17999][KAFKA][SQL] Add getPreferredLocations for KafkaSourceRDD ## What changes were proposed in this pull request? The newly implemented Structured Streaming `KafkaSource` did calculate the preferred locations for each topic partition, but didn't offer this information through RDD's `getPreferredLocations` method. So here propose to add this method in `KafkaSourceRDD`. ## How was this patch tested? Manual verification. Author: jerryshao <sshao@hortonworks.com> Closes #15545 from jerryshao/SPARK-17999. (cherry picked from commit 947f4f25273161dc4719419a35613a71c2e2a150) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 20 October 2016, 17:50:42 UTC
4131623 [SPARK-18003][SPARK CORE] Fix bug of RDD zipWithIndex & zipWithUniqueId index value overflowing ## What changes were proposed in this pull request? - Fix bug of RDD `zipWithIndex` generating wrong result when one partition contains more than 2147483647 records. - Fix bug of RDD `zipWithUniqueId` generating wrong result when one partition contains more than 2147483647 records. ## How was this patch tested? test added. Author: WeichenXu <WeichenXu123@outlook.com> Closes #15550 from WeichenXu123/fix_rdd_zipWithIndex_overflow. (cherry picked from commit 39755169fb5bb07332eef263b4c18ede1528812d) Signed-off-by: Reynold Xin <rxin@databricks.com> 20 October 2016, 06:41:46 UTC
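A standalone arithmetic sketch (the per-partition record counts are made up) of the class of bug being fixed: accumulating counts in an Int wraps past 2147483647, while Long arithmetic keeps the start indices correct.
```scala
val countsPerPartition = Array(3L, 2147483647L, 5L)

// Start index of each partition; Long arithmetic never wraps.
val startIndices: Array[Long] = countsPerPartition.scanLeft(0L)(_ + _).init

// The same computation squeezed through Int overflows for the third partition.
val overflowed = countsPerPartition.map(_.toInt).scanLeft(0)(_ + _).init

println(startIndices.mkString(", "))   // 0, 3, 2147483650
println(overflowed.mkString(", "))     // 0, 3, -2147483646
```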
995f602 [SPARK-17989][SQL] Check ascendingOrder type in sort_array function rather than throwing ClassCastException ## What changes were proposed in this pull request? This PR proposes to check the second argument, `ascendingOrder` rather than throwing `ClassCastException` exception message. ```sql select sort_array(array('b', 'd'), '1'); ``` **Before** ``` 16/10/19 13:16:08 ERROR SparkSQLDriver: Failed in [select sort_array(array('b', 'd'), '1')] java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Boolean at scala.runtime.BoxesRunTime.unboxToBoolean(BoxesRunTime.java:85) at org.apache.spark.sql.catalyst.expressions.SortArray.nullSafeEval(collectionOperations.scala:185) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:416) at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:50) at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:43) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:297) ``` **After** ``` Error in query: cannot resolve 'sort_array(array('b', 'd'), '1')' due to data type mismatch: Sort order in second argument requires a boolean literal.; line 1 pos 7; ``` ## How was this patch tested? Unit test in `DataFrameFunctionsSuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15532 from HyukjinKwon/SPARK-17989. (cherry picked from commit 4b2011ec9da1245923b5cbd883240fef0dbf3ef0) Signed-off-by: Reynold Xin <rxin@databricks.com> 20 October 2016, 02:36:53 UTC
cdd2570 [SPARK-18001][DOCUMENT] fix broken link to SparkDataFrame ## What changes were proposed in this pull request? In http://spark.apache.org/docs/latest/sql-programming-guide.html, Section "Untyped Dataset Operations (aka DataFrame Operations)", the link to the R DataFrame doesn't work: it returns "The requested URL /docs/latest/api/R/DataFrame.html was not found on this server." The correct link is SparkDataFrame.html for Spark 2.0. ## How was this patch tested? Manually checked. Author: Tommy YU <tummyyu@163.com> Closes #15543 from Wenpei/spark-18001. (cherry picked from commit f39852e59883c214b0d007faffb406570ea3084b) Signed-off-by: Reynold Xin <rxin@databricks.com> 19 October 2016, 04:15:40 UTC
3796a98 [SPARK-17711][TEST-HADOOP2.2] Fix hadoop2.2 compilation error ## What changes were proposed in this pull request? Fix hadoop2.2 compilation error. ## How was this patch tested? Existing tests. cc tdas zsxwing Author: Yu Peng <loneknightpy@gmail.com> Closes #15537 from loneknightpy/fix-17711. (cherry picked from commit 2629cd74602cfe77188b76428fed62a7a7149315) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 19 October 2016, 02:43:17 UTC
99943bf [SPARK-17731][SQL][STREAMING][FOLLOWUP] Refactored StreamingQueryListener APIs for branch-2.0 This is the branch-2.0 PR of #15530 to make the APIs consistent with the master. Since these APIs are experimental and not direct user facing (StreamingQueryListener is advanced Structured Streaming APIs), its okay to change them in branch-2.0. ## What changes were proposed in this pull request? As per rxin request, here are further API changes - Changed `Stream(Started/Progress/Terminated)` events to `Stream*Event` - Changed the fields in `StreamingQueryListener.on***` from `query*` to `event` ## How was this patch tested? Existing unit tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #15535 from tdas/SPARK-17731-1-branch-2.0. 19 October 2016, 00:31:21 UTC
f6b8793 [SPARK-17841][STREAMING][KAFKA] drain commitQueue ## What changes were proposed in this pull request? Actually drain commit queue rather than just iterating it. iterator() on a concurrent linked queue won't remove items from the queue, poll() will. ## How was this patch tested? Unit tests Author: cody koeninger <cody@koeninger.org> Closes #15407 from koeninger/SPARK-17841. (cherry picked from commit cd106b050ff789b6de539956a7f01159ab15c820) Signed-off-by: Reynold Xin <rxin@databricks.com> 18 October 2016, 21:01:59 UTC
6ef9231 [MINOR][DOC] Add more built-in sources in sql-programming-guide.md ## What changes were proposed in this pull request? Add more built-in sources in sql-programming-guide.md. ## How was this patch tested? Manually. Author: Weiqing Yang <yangweiqing001@gmail.com> Closes #15522 from weiqingy/dsDoc. (cherry picked from commit 20dd11096cfda51e47b9dbe3b715a12ccbb4ce1d) Signed-off-by: Reynold Xin <rxin@databricks.com> 18 October 2016, 20:38:50 UTC
26e978a [SPARK-17711] Compress rolled executor log ## What changes were proposed in this pull request? This PR adds support for executor log compression. ## How was this patch tested? Unit tests cc: yhuai tdas mengxr Author: Yu Peng <loneknightpy@gmail.com> Closes #15285 from loneknightpy/compress-executor-log. (cherry picked from commit 231f39e3f6641953a90bc4c40444ede63f363b23) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 18 October 2016, 20:23:50 UTC
2aa2583 [SPARK-17751][SQL][BACKPORT-2.0] Remove spark.sql.eagerAnalysis and Output the Plan if Existed in AnalysisException ### What changes were proposed in this pull request? This PR is to backport the fix https://github.com/apache/spark/pull/15316 to 2.0. Dataset always does eager analysis now. Thus, `spark.sql.eagerAnalysis` is not used any more. Thus, we need to remove it. This PR also outputs the plan. Without the fix, the analysis error is like ``` cannot resolve '`k1`' given input columns: [k, v]; line 1 pos 12 ``` After the fix, the analysis error becomes: ``` org.apache.spark.sql.AnalysisException: cannot resolve '`k1`' given input columns: [k, v]; line 1 pos 12; 'Project [unresolvedalias(CASE WHEN ('k1 = 2) THEN 22 WHEN ('k1 = 4) THEN 44 ELSE 0 END, None), v#6] +- SubqueryAlias t +- Project [_1#2 AS k#5, _2#3 AS v#6] +- LocalRelation [_1#2, _2#3] ``` ### How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #15529 from gatorsmile/eagerAnalysis20. 18 October 2016, 17:58:19 UTC
9e806f2 [SQL][STREAMING][TEST] Follow up to remove Option.contains for Scala 2.10 compatibility ## What changes were proposed in this pull request? Scala 2.10 does not have Option.contains, which broke Scala 2.10 build. ## How was this patch tested? Locally compiled and ran sql/core unit tests in 2.10 Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #15531 from tdas/metrics-flaky-test-fix-1. 18 October 2016, 09:31:27 UTC
01520de [SQL][STREAMING][TEST] Fix flaky tests in StreamingQueryListenerSuite This work has largely been done by lw-lin in his PR #15497. This is a slight refactoring of it. ## What changes were proposed in this pull request? There were two sources of flakiness in StreamingQueryListener test. - When testing with manual clock, consecutive attempts to advance the clock can occur without the stream execution thread being unblocked and doing some work between the two attempts. Hence the following can happen with the current ManualClock. ``` +-----------------------------------+--------------------------------+ | StreamExecution thread | testing thread | +-----------------------------------+--------------------------------+ | ManualClock.waitTillTime(100) { | | | _isWaiting = true | | | wait(10) | | | still in wait(10) | if (_isWaiting) advance(100) | | still in wait(10) | if (_isWaiting) advance(200) | <- this should be disallowed ! | still in wait(10) | if (_isWaiting) advance(300) | <- this should be disallowed ! | wake up from wait(10) | | | current time is 600 | | | _isWaiting = false | | | } | | +-----------------------------------+--------------------------------+ ``` - Second source of flakiness is that the adding data to memory stream may get processing in any trigger, not just the first trigger. My fix is to make the manual clock wait for the other stream execution thread to start waiting for the clock at the right wait start time. That is, `advance(200)` (see above) will wait for stream execution thread to complete the wait that started at time 0, and start a new wait at time 200 (i.e. time stamp after the previous `advance(100)`). In addition, since this is a feature that is solely used by StreamExecution, I removed all the non-generic code from ManualClock and put them in StreamManualClock inside StreamTest. ## How was this patch tested? Ran existing unit test MANY TIME in Jenkins Author: Tathagata Das <tathagata.das1565@gmail.com> Author: Liwei Lin <lwlin7@gmail.com> Closes #15519 from tdas/metrics-flaky-test-fix. (cherry picked from commit 7d878cf2da04800bc4147b05610170865b148c64) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 18 October 2016, 07:50:12 UTC
881e0eb [SPARK-17731][SQL][STREAMING] Metrics for structured streaming for branch-2.0 **This PR adds the same metrics to branch-2.0 that was added to master in #15307.** The differences compared to the #15307 are - The new configuration is added only in the `SQLConf `object (i.e. `SQLConf.STREAMING_METRICS_ENABLED`) and not in the `SQLConf` class (i.e. no `SQLConf.isStreamingMetricsEnabled`). Spark master has all the streaming configurations exposed as actual fields in SQLConf class (e.g. [streamingPollingDelay](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L642)), but [not in Spark 2.0](https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L608). So I didnt add it in this 2.0 PR. - In the previous master PR, the aboveconfiguration was read in `StreamExecution` as `sparkSession.sessionState.conf.isStreamingMetricsEnabled`. In this 2.0 PR, I am instead reading it as `sparkSession.conf.get(STREAMING_METRICS_ENABLED)`(i.e. no `sessionState`) to keep it consistent with how other confs are read in `StreamExecution` (e.g. [STREAMING_POLLING_DELAY](https://github.com/tdas/spark/blob/ee8e899e4c274c363a8b4d13e8bf57b0b467a50e/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L62)). - Different Mima exclusions ------ ## What changes were proposed in this pull request? Metrics are needed for monitoring structured streaming apps. Here is the design doc for implementing the necessary metrics. https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing Specifically, this PR adds the following public APIs changes. - `StreamingQuery.status` returns a `StreamingQueryStatus` object (renamed from `StreamingQueryInfo`, see later) - `StreamingQueryStatus` has the following important fields - inputRate - Current rate (rows/sec) at which data is being generated by all the sources - processingRate - Current rate (rows/sec) at which the query is processing data from all the sources - ~~outputRate~~ - *Does not work with wholestage codegen* - latency - Current average latency between the data being available in source and the sink writing the corresponding output - sourceStatuses: Array[SourceStatus] - Current statuses of the sources - sinkStatus: SinkStatus - Current status of the sink - triggerStatus - Low-level detailed status of the last completed/currently active trigger - latencies - getOffset, getBatch, full trigger, wal writes - timestamps - trigger start, finish, after getOffset, after getBatch - numRows - input, output, state total/updated rows for aggregations - `SourceStatus` has the following important fields - inputRate - Current rate (rows/sec) at which data is being generated by the source - processingRate - Current rate (rows/sec) at which the query is processing data from the source - triggerStatus - Low-level detailed status of the last completed/currently active trigger - Python API for `StreamingQuery.status()` ### Breaking changes to existing APIs **Existing direct public facing APIs** - Deprecated direct public-facing APIs `StreamingQuery.sourceStatuses` and `StreamingQuery.sinkStatus` in favour of `StreamingQuery.status.sourceStatuses/sinkStatus`. - Branch 2.0 should have it deprecated, master should have it removed. 
**Existing advanced listener APIs** - `StreamingQueryInfo` renamed to `StreamingQueryStatus` for consistency with `SourceStatus`, `SinkStatus` - Earlier StreamingQueryInfo was used only in the advanced listener API, but now it is used in the direct public-facing API (StreamingQuery.status) - Field `queryInfo` in listener events `QueryStarted`, `QueryProgress`, `QueryTerminated` changed to have the name `queryStatus` and return type `StreamingQueryStatus`. - Field `offsetDesc` in `SourceStatus` was Option[String]; converted it to `String`. - For `SourceStatus` and `SinkStatus`, made the constructor private instead of private[sql] to make them more Java-safe. Instead added `private[sql] object SourceStatus/SinkStatus.apply()` which are harder to accidentally use in Java. ## How was this patch tested? Old and new unit tests. - Rate calculation and other internal logic of StreamMetrics are tested by StreamMetricsSuite. - New info in statuses returned through StreamingQueryListener is tested in StreamingQueryListenerSuite. - New and old info returned through StreamingQuery.status is tested in StreamingQuerySuite. - Source-specific tests for making sure input rows are counted are in source-specific test suites. - Additional tests to test minor additions in LocalTableScanExec, StateStore, etc. Metrics were also manually tested using the Ganglia sink. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #15472 from tdas/SPARK-17731-branch-2.0. 17 October 2016, 23:56:40 UTC
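For orientation, here is a hedged usage sketch of the status API described in this entry. It is written against the field names listed above; the socket host/port are placeholder assumptions, and exact accessor names and types may differ slightly in the released branch-2.0 API.

```scala
// Hedged sketch: polling StreamingQuery.status for the metrics described above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("status-demo").getOrCreate()

val query = spark.readStream
  .format("socket")                 // placeholder source for illustration
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .writeStream
  .format("console")
  .start()

val status = query.status
println(s"input rate:      ${status.inputRate} rows/sec")
println(s"processing rate: ${status.processingRate} rows/sec")
println(s"latency:         ${status.latency}")
status.sourceStatuses.foreach(s => println(s"source: ${s.description}"))
println(s"sink:   ${status.sinkStatus.description}")
```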
a0d9015 Fix example of tf_idf with minDocFreq ## What changes were proposed in this pull request? The Python example for tf_idf with the parameter "minDocFreq" is not set up properly, because the same variable is used to transform the documents both with and without the "minDocFreq" parameter. The IDF(minDocFreq=2) model is stored in the variable "idfIgnore", but the original variable "idf" is then used to transform "tf" instead of "idfIgnore". ## How was this patch tested? Before, the results for "tfidf" and "tfidfIgnore" were the same: tfidf: (1048576,[1046921],[3.75828890549]) (1048576,[1046920],[3.75828890549]) (1048576,[1046923],[3.75828890549]) (1048576,[892732],[3.75828890549]) (1048576,[892733],[3.75828890549]) (1048576,[892734],[3.75828890549]) tfidfIgnore: (1048576,[1046921],[3.75828890549]) (1048576,[1046920],[3.75828890549]) (1048576,[1046923],[3.75828890549]) (1048576,[892732],[3.75828890549]) (1048576,[892733],[3.75828890549]) (1048576,[892734],[3.75828890549]) After the fix, the results are as they should be: tfidf: (1048576,[1046921],[3.75828890549]) (1048576,[1046920],[3.75828890549]) (1048576,[1046923],[3.75828890549]) (1048576,[892732],[3.75828890549]) (1048576,[892733],[3.75828890549]) (1048576,[892734],[3.75828890549]) tfidfIgnore: (1048576,[1046921],[0.0]) (1048576,[1046920],[0.0]) (1048576,[1046923],[0.0]) (1048576,[892732],[0.0]) (1048576,[892733],[0.0]) (1048576,[892734],[0.0]) Author: Maxime Rihouey <maxime.rihouey@gmail.com> Closes #15503 from maximerihouey/patch-1. (cherry picked from commit e3bf37fa3ada43624b2e77bef90ad3d3dbcd8ce1) Signed-off-by: Sean Owen <sowen@cloudera.com> 17 October 2016, 09:56:37 UTC
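The same pitfall applies to the Scala form of the tf-idf example. A hedged Scala sketch of the corrected flow is below; the input path is a placeholder and `sc` is assumed to be an existing SparkContext (e.g. in spark-shell).

```scala
// Hedged sketch of the corrected tf-idf flow: the minDocFreq variant must be
// transformed with its own model (idfIgnore), not the plain idf model.
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val documents: RDD[Seq[String]] =
  sc.textFile("data/documents.txt").map(_.split(" ").toSeq)   // placeholder path

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()

val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

// Ignore terms that appear in fewer than 2 documents.
val idfIgnore = new IDF(minDocFreq = 2).fit(tf)
val tfidfIgnore: RDD[Vector] = idfIgnore.transform(tf)        // idfIgnore, not idf
```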
d1a0211 [SPARK-17892][SQL][2.0] Do Not Optimize Query in CTAS More Than Once #15048 ### What changes were proposed in this pull request? This PR is to backport https://github.com/apache/spark/pull/15048 and https://github.com/apache/spark/pull/15459. However, in 2.0, we do not have a unified logical node `CreateTable` and the analyzer rule `PreWriteCheck` is also different. To minimize the code changes, this PR adds a new rule `AnalyzeCreateTableAsSelect`. Please treat it as a new PR to review. Thanks! As explained in https://github.com/apache/spark/pull/14797: >Some analyzer rules have assumptions on logical plans, optimizer may break these assumption, we should not pass an optimized query plan into QueryExecution (will be analyzed again), otherwise we may some weird bugs. For example, we have a rule for decimal calculation to promote the precision before binary operations, use PromotePrecision as placeholder to indicate that this rule should not apply twice. But a Optimizer rule will remove this placeholder, that break the assumption, then the rule applied twice, cause wrong result. We should not optimize the query in CTAS more than once. For example, ```Scala spark.range(99, 101).createOrReplaceTempView("tab1") val sqlStmt = "SELECT id, cast(id as long) * cast('1.0' as decimal(38, 18)) as num FROM tab1" sql(s"CREATE TABLE tab2 USING PARQUET AS $sqlStmt") checkAnswer(spark.table("tab2"), sql(sqlStmt)) ``` Before this PR, the results do not match ``` == Results == !== Correct Answer - 2 == == Spark Answer - 2 == ![100,100.000000000000000000] [100,null] [99,99.000000000000000000] [99,99.000000000000000000] ``` After this PR, the results match. ``` +---+----------------------+ |id |num | +---+----------------------+ |99 |99.000000000000000000 | |100|100.000000000000000000| +---+----------------------+ ``` In this PR, we do not treat the `query` in CTAS as a child. Thus, the `query` will not be optimized when optimizing CTAS statement. However, we still need to analyze it for normalizing and verifying the CTAS in the Analyzer. Thus, we do it in the analyzer rule `PreprocessDDL`, because so far only this rule needs the analyzed plan of the `query`. ### How was this patch tested? Author: gatorsmile <gatorsmile@gmail.com> Closes #15502 from gatorsmile/ctasOptimize2.0. 17 October 2016, 07:29:53 UTC
ca66f52 [MINOR][SQL] Add prettyName for current_database function ## What changes were proposed in this pull request? Added a `prettyName` for the current_database function. ## How was this patch tested? Manually. Before: ``` scala> sql("select current_database()").show +-----------------+ |currentdatabase()| +-----------------+ | default| +-----------------+ ``` After: ``` scala> sql("select current_database()").show +------------------+ |current_database()| +------------------+ | default| +------------------+ ``` Author: Weiqing Yang <yangweiqing001@gmail.com> Closes #15506 from weiqingy/prettyName. (cherry picked from commit 56b0f5f4d1d7826737b81ebc4ec5dad83b6463e3) Signed-off-by: Reynold Xin <rxin@databricks.com> 17 October 2016, 05:38:36 UTC
3cc2fe5 [SPARK-17819][SQL][BRANCH-2.0] Support default database in connection URIs for Spark Thrift Server ## What changes were proposed in this pull request? Currently, Spark Thrift Server ignores the default database in URI. This PR supports that like the following. ```sql $ bin/beeline -u jdbc:hive2://localhost:10000 -e "create database testdb" $ bin/beeline -u jdbc:hive2://localhost:10000/testdb -e "create table t(a int)" $ bin/beeline -u jdbc:hive2://localhost:10000/testdb -e "show tables" ... +------------+--------------+--+ | tableName | isTemporary | +------------+--------------+--+ | t | false | +------------+--------------+--+ 1 row selected (0.347 seconds) $ bin/beeline -u jdbc:hive2://localhost:10000 -e "show tables" ... +------------+--------------+--+ | tableName | isTemporary | +------------+--------------+--+ +------------+--------------+--+ No rows selected (0.098 seconds) ``` ## How was this patch tested? Pass the Jenkins with a newly added testsuite. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #15507 from dongjoon-hyun/SPARK-17819-BACK. 17 October 2016, 05:15:47 UTC
2a1b10b [SPARK-17953][DOCUMENTATION] Fix typo in SparkSession scaladoc ## What changes were proposed in this pull request? ### Before: ```scala SparkSession.builder() .master("local") .appName("Word Count") .config("spark.some.config.option", "some-value"). .getOrCreate() ``` ### After: ```scala SparkSession.builder() .master("local") .appName("Word Count") .config("spark.some.config.option", "some-value") .getOrCreate() ``` There was one unexpected dot! Author: Jun Kim <i2r.jun@gmail.com> Closes #15498 from tae-jun/SPARK-17953. (cherry picked from commit 36d81c2c68ef4114592b069287743eb5cb078318) Signed-off-by: Reynold Xin <rxin@databricks.com> 15 October 2016, 07:37:04 UTC
c53b837 [SPARK-17863][SQL] should not add column into Distinct ## What changes were proposed in this pull request? We try to resolve attributes in a sort by pulling up some columns from the grandchild into the child, but that's wrong when the child is a Distinct, because the added column will change the behavior of Distinct; we should not do that. ## How was this patch tested? Added a regression test. Author: Davies Liu <davies@databricks.com> Closes #15489 from davies/order_distinct. (cherry picked from commit da9aeb0fde589f7c21c2f4a32036a68c0353965d) Signed-off-by: Yin Huai <yhuai@databricks.com> 14 October 2016, 21:45:29 UTC
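As a rough illustration of the class of query involved, here is a hedged sketch (assuming a spark-shell session; this is not the PR's actual regression test):

```scala
// Hedged repro sketch: ordering a Distinct by a column the Distinct does not output.
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("k", "v")
df.createOrReplaceTempView("t")

spark.sql("SELECT DISTINCT k FROM t").show()     // two rows: a, b
// Resolving ORDER BY v by pulling v into the Distinct would yield three rows,
// silently changing which rows count as duplicates. After this change the query
// below is expected to fail analysis instead of being "fixed" that way:
// spark.sql("SELECT DISTINCT k FROM t ORDER BY v").show()
```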
d7fa3e3 [SPARK-17834][SQL] Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer ## What changes were proposed in this pull request? Because `KafkaConsumer.poll(0)` may update the partition offsets, this PR just calls `seekToBeginning` to manually set the earliest offsets for the KafkaSource initial offsets. ## How was this patch tested? Existing tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #15397 from zsxwing/SPARK-17834. (cherry picked from commit 08eac356095c7faa2b19d52f2fb0cbc47eb7d1d1) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 13 October 2016, 20:31:58 UTC
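A hedged sketch of the underlying Kafka consumer pattern referenced above (broker address, topic, and group id are placeholder assumptions; this is not the KafkaSource code itself):

```scala
// Hedged sketch: explicitly seek to the beginning and read positions, instead of
// relying on poll(0), which may itself move the fetch offsets.
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // placeholder broker address
props.put("key.deserializer",
  "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("value.deserializer",
  "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("group.id", "offset-probe")              // placeholder group id

val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
val partitions = Seq(new TopicPartition("topic1", 0), new TopicPartition("topic1", 1))
consumer.assign(partitions.asJava)
consumer.seekToBeginning(partitions.asJava)        // no poll(0) needed
val earliest = partitions.map(tp => tp -> consumer.position(tp)).toMap
consumer.close()
```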
d38f38a minor doc fix for Row.scala ## What changes were proposed in this pull request? Minor doc fix for "getAnyValAs" in class Row. ## How was this patch tested? None. Author: buzhihuojie <ren.weiluo@gmail.com> Closes #15452 from david-weiluo-ren/minorDocFixForRow. (cherry picked from commit 7222a25a11790fa9d9d1428c84b6f827a785c9e8) Signed-off-by: Reynold Xin <rxin@databricks.com> 13 October 2016, 05:52:00 UTC
ab00e41 [SPARK-17876] Write StructuredStreaming WAL to a stream instead of materializing all at once ## What changes were proposed in this pull request? The CompactibleFileStreamLog materializes the whole metadata log in memory as a String. This can cause issues when there are lots of files that are being committed, especially during a compaction batch. You may come across stacktraces that look like: ``` java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.lang.StringCoding.encode(StringCoding.java:350) at java.lang.String.getBytes(String.java:941) at org.apache.spark.sql.execution.streaming.FileStreamSinkLog.serialize(FileStreamSinkLog.scala:127) ``` The safer way is to write to an output stream so that we don't have to materialize a huge string. ## How was this patch tested? Existing unit tests Author: Burak Yavuz <brkyvz@gmail.com> Closes #15437 from brkyvz/ser-to-stream. (cherry picked from commit edeb51a39d76d64196d7635f52be1b42c7ec4341) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 13 October 2016, 04:40:56 UTC
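A hedged sketch of the safer pattern described above: stream each serialized entry to the output instead of concatenating one giant String. The file path and entry format are illustrative assumptions, not the actual CompactibleFileStreamLog format.

```scala
// Hedged sketch: write each log entry straight to the stream; nothing requires the
// whole batch to exist as a single in-memory String.
import java.io.{BufferedOutputStream, FileOutputStream, OutputStream}
import java.nio.charset.StandardCharsets

def serializeTo(entries: Iterator[String], out: OutputStream): Unit = {
  entries.foreach { entry =>
    out.write(entry.getBytes(StandardCharsets.UTF_8))
    out.write('\n')
  }
  out.flush()
}

// Illustrative path and entries.
val out = new BufferedOutputStream(new FileOutputStream("/tmp/file-sink-log/42"))
try serializeTo(Iterator("v1", """{"path":"part-00000.parquet","action":"add"}"""), out)
finally out.close()
```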
5903dab [SPARK-16827][BRANCH-2.0] Avoid reporting spill metrics as shuffle metrics ## What changes were proposed in this pull request? Fix a bug where spill metrics were being reported as shuffle metrics. Eventually these spill metrics should be reported (SPARK-3577), but separate from shuffle metrics. The fix itself basically reverts the line to what it was in 1.6. ## How was this patch tested? Cherry-picked from master (#15347) Author: Brian Cho <bcho@fb.com> Closes #15455 from dafrista/shuffle-metrics-2.0. 13 October 2016, 03:43:18 UTC
050b817 [SPARK-17782][STREAMING][KAFKA] alternative eliminate race condition of poll twice ## What changes were proposed in this pull request? Alternative approach to https://github.com/apache/spark/pull/15387 Author: cody koeninger <cody@koeninger.org> Closes #15401 from koeninger/SPARK-17782-alt. (cherry picked from commit f9a56a153e0579283160519065c7f3620d12da3e) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 12 October 2016, 22:22:15 UTC
d55ba30 [SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB ## What changes were proposed in this pull request? If the R data structure that is being parallelized is larger than `INT_MAX`, we use files to transfer data to the JVM. The serialization protocol mimics Python pickling. This allows us to simply call `PythonRDD.readRDDFromFile` to create the RDD. I tested this on my MacBook. The following code works with this patch: ```R intMax <- .Machine$integer.max largeVec <- 1:intMax rdd <- SparkR:::parallelize(sc, largeVec, 2) ``` ## How was this patch tested? * [x] Unit tests Author: Hossein <hossein@databricks.com> Closes #15375 from falaki/SPARK-17790. (cherry picked from commit 5cc503f4fe9737a4c7947a80eecac053780606df) Signed-off-by: Felix Cheung <felixcheung@apache.org> 12 October 2016, 17:32:53 UTC
5451541 [SPARK-17884][SQL] To resolve Null pointer exception when casting from empty string to interval type. ## What changes were proposed in this pull request? This change adds a check in the castToInterval method of the Cast expression, such that if the converted value is null, the isNull variable is set to true. Earlier, the expression Cast(Literal(), CalendarIntervalType) threw a NullPointerException for this reason. ## How was this patch tested? Added a test case in CastSuite.scala. JIRA entry for details: https://issues.apache.org/jira/browse/SPARK-17884 Author: prigarg <prigarg@adobe.com> Closes #15449 from priyankagargnitk/SPARK-17884. (cherry picked from commit d5580ebaa086b9feb72d5428f24c5b60cd7da745) Signed-off-by: Reynold Xin <rxin@databricks.com> 12 October 2016, 17:14:55 UTC
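A minimal hedged sketch of the behavior described above, using the internal Catalyst expression mentioned in the entry (internal API, so suitable only for tests):

```scala
// Hedged sketch: casting an empty string to the interval type should yield null,
// not throw a NullPointerException.
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.CalendarIntervalType

val result = Cast(Literal(""), CalendarIntervalType).eval()
assert(result == null)    // threw NullPointerException before this fix
```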
4dcbde4 [SPARK-17808][PYSPARK] Upgraded version of Pyrolite to 4.13 ## What changes were proposed in this pull request? Upgraded to a newer version of Pyrolite which supports serialization of a BinaryType StructField for PySpark.SQL ## How was this patch tested? Added a unit test which fails with a raised ValueError when using the previous version of Pyrolite 4.9 and Python3 Author: Bryan Cutler <cutlerb@gmail.com> Closes #15386 from BryanCutler/pyrolite-upgrade-SPARK-17808. (cherry picked from commit 658c7147f5bf637f36e8c66b9207d94b1e7c74c5) Signed-off-by: Sean Owen <sowen@cloudera.com> 12 October 2016, 09:04:40 UTC
f12b74c [SPARK-17853][STREAMING][KAFKA][DOC] make it clear that reusing group.id is bad ## What changes were proposed in this pull request? Documentation fix to make it clear that reusing group id for different streams is super duper bad, just like it is with the underlying Kafka consumer. ## How was this patch tested? I built jekyll doc and made sure it looked ok. Author: cody koeninger <cody@koeninger.org> Closes #15442 from koeninger/SPARK-17853. (cherry picked from commit c264ef9b1918256a5018c7a42a1a2b42308ea3f7) Signed-off-by: Reynold Xin <rxin@databricks.com> 12 October 2016, 07:40:52 UTC
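As a hedged illustration of the documentation's point (all values are placeholders): give every stream its own group.id, mirroring plain Kafka consumer rules.

```scala
// Hedged sketch: one unique group.id per stream, so streams do not steal
// partitions from each other.
import java.util.UUID

def kafkaParamsFor(streamName: String): Map[String, Object] = Map(
  "bootstrap.servers"  -> "localhost:9092",   // placeholder broker address
  "key.deserializer"   -> "org.apache.kafka.common.serialization.StringDeserializer",
  "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "group.id"           -> s"$streamName-${UUID.randomUUID()}"
)

val paramsA = kafkaParamsFor("clickstream")
val paramsB = kafkaParamsFor("billing")       // distinct group.id for the second stream
```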
f3d82b5 [SPARK-17880][DOC] The url linking to `AccumulatorV2` in the document is incorrect. ## What changes were proposed in this pull request? In `programming-guide.md`, the url which links to `AccumulatorV2` says `api/scala/index.html#org.apache.spark.AccumulatorV2` but `api/scala/index.html#org.apache.spark.util.AccumulatorV2` is correct. ## How was this patch tested? manual test. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #15439 from sarutak/SPARK-17880. (cherry picked from commit b512f04f8e546843d5a3f35dcc6b675b5f4f5bc0) Signed-off-by: Reynold Xin <rxin@databricks.com> 12 October 2016, 05:37:03 UTC
e68e95e Fix hadoop.version in building-spark.md A couple of mvn build examples use `-Dhadoop.version=VERSION` instead of an actual version number. Author: Alexander Pivovarov <apivovarov@gmail.com> Closes #15440 from apivovarov/patch-1. (cherry picked from commit 299eb04ba05038c7dbb3ecf74a35d4bbfa456643) Signed-off-by: Reynold Xin <rxin@databricks.com> 12 October 2016, 05:32:35 UTC
5ec3e66 [SPARK-17816][CORE][BRANCH-2.0] Fix ConcurrentModificationException issue in BlockStatusesAccumulator ## What changes were proposed in this pull request? Replaced `BlockStatusesAccumulator` with `CollectionAccumulator`, which is thread-safe, plus a few more cleanups. ## How was this patch tested? Tested in the master branch and cherry-picked. Author: Ergin Seyfe <eseyfe@fb.com> Closes #15425 from seyfe/race_cond_jsonprotocal_branch-2.0. 11 October 2016, 19:51:08 UTC
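For reference, a hedged sketch of how the replacement accumulator type is used; the accumulator name and values are illustrative, not what Spark registers internally.

```scala
// Hedged sketch: CollectionAccumulator is safe to update from concurrent tasks.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[4]").appName("acc-demo").getOrCreate()
val sc = spark.sparkContext

val blockIds = sc.collectionAccumulator[String]("blockStatuses")   // illustrative name
sc.parallelize(1 to 1000, 8).foreach { i =>
  if (i % 100 == 0) blockIds.add(s"rdd_0_$i")
}
println(blockIds.value)   // a java.util.List[String] collected across tasks
```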
a6b5e1d [SPARK-17346][SQL][TESTS] Fix the flaky topic deletion in KafkaSourceStressSuite ## What changes were proposed in this pull request? A follow-up PR for SPARK-17346 to fix the flaky `org.apache.spark.sql.kafka010.KafkaSourceStressSuite`. Test log: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1855/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceStressSuite/_It_is_not_a_test_/ It looks like deleting the Kafka internal topic `__consumer_offsets` is flaky. This PR simply ignores internal topics. ## How was this patch tested? Existing tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #15384 from zsxwing/SPARK-17346-flaky-test. (cherry picked from commit 75b9e351413dca0930e8545e6283874db09d8482) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 11 October 2016, 17:53:26 UTC
ff9f5bb [SPARK-17738][TEST] Fix flaky test in ColumnTypeSuite ## What changes were proposed in this pull request? The default buffer size is not big enough for randomly generated MapType values. ## How was this patch tested? Ran the tests 100 times; they never failed (they failed 8 times before the patch). Author: Davies Liu <davies@databricks.com> Closes #15395 from davies/flaky_map. (cherry picked from commit d5ec4a3e014494a3d991a6350caffbc3b17be0fd) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 11 October 2016, 02:14:11 UTC
d719e9a [SPARK-17417][CORE] Fix # of partitions for Reliable RDD checkpointing ## What changes were proposed in this pull request? Currently the number of partition files is limited to 10000 (%05d format). If there are more than 10000 part files, the logic goes for a toss while recreating the RDD, as it sorts them as strings. More details can be found in the JIRA description [here](https://issues.apache.org/jira/browse/SPARK-17417). ## How was this patch tested? I tested this patch by checkpointing an RDD, then manually renaming the part files to the old format and trying to access the RDD. It was successfully created from the old format. Also verified loading a sample Parquet file, saving it in multiple formats - CSV, JSON, Text, Parquet, ORC - and reading them back successfully from the saved files. I couldn't launch the unit test from my local box, so I will wait for the Jenkins output. Author: Dhruve Ashar <dhruveashar@gmail.com> Closes #15370 from dhruve/bug/SPARK-17417. (cherry picked from commit 4bafacaa5f50a3e986c14a38bc8df9bae303f3a0) Signed-off-by: Tom Graves <tgraves@yahoo-inc.com> 10 October 2016, 15:57:30 UTC
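A hedged illustration of the failure mode (not Spark's actual checkpoint code): once file indices outgrow the %05d padding, string order and numeric order diverge.

```scala
// Hedged illustration: sorting padded part-file names as strings breaks past the padding width.
val names = Seq(99999, 100000, 100001).map(i => f"part-$i%05d")
// names: Seq("part-99999", "part-100000", "part-100001")

names.sorted
// List(part-100000, part-100001, part-99999)          <- lexicographic, wrong order

names.sortBy(_.stripPrefix("part-").toInt)
// List(part-99999, part-100000, part-100001)          <- numeric, intended order
```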
d27df35 [SPARK-17832][SQL] TableIdentifier.quotedString creates un-parseable names when name contains a backtick ## What changes were proposed in this pull request? The `quotedString` method in `TableIdentifier` and `FunctionIdentifier` produce an illegal (un-parseable) name when the name contains a backtick. For example: ``` import org.apache.spark.sql.catalyst.parser.CatalystSqlParser._ import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute val complexName = TableIdentifier("`weird`table`name", Some("`d`b`1")) parseTableIdentifier(complexName.unquotedString) // Does not work parseTableIdentifier(complexName.quotedString) // Does not work parseExpression(complexName.unquotedString) // Does not work parseExpression(complexName.quotedString) // Does not work ``` We should handle the backtick properly to make `quotedString` parseable. ## How was this patch tested? Add new testcases in `TableIdentifierParserSuite` and `ExpressionParserSuite`. Author: jiangxingbo <jiangxb1987@gmail.com> Closes #15403 from jiangxb1987/backtick. (cherry picked from commit 26fbca480604ba258f97b9590cfd6dda1ecd31db) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 10 October 2016, 04:52:58 UTC
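A hedged sketch of the quoting rule the fix relies on: backticks inside an identifier are escaped by doubling them, so the quoted form can round-trip through the parser. The helper name below is illustrative, not Spark's exact method.

```scala
// Hedged sketch: double any embedded backtick before wrapping the name in backticks.
def quoteIdentifier(name: String): String = "`" + name.replace("`", "``") + "`"

quoteIdentifier("weird`table`name")
// res: String = `weird``table``name`   (parseable, unlike the unescaped form)
```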
6d056c1 [SPARK-17806] [SQL] fix bug in join key rewritten in HashJoin ## What changes were proposed in this pull request? In HashJoin, we try to rewrite the join key as a Long to improve the performance of finding a match. The rewriting part is not well tested and has a bug that could cause a wrong result when there are at least three integral columns in the join key and the total length of the key exceeds 8 bytes. ## How was this patch tested? Added unit tests covering the rewriting with different numbers of columns and different data types. Manually tested the reported case and confirmed that this PR fixes the bug. Author: Davies Liu <davies@databricks.com> Closes #15390 from davies/rewrite_key. (cherry picked from commit 94b24b84a666517e31e9c9d693f92d9bbfd7f9ad) Signed-off-by: Davies Liu <davies.liu@gmail.com> 07 October 2016, 22:03:58 UTC
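A hedged illustration of why the 8-byte limit matters (this is not Spark's actual rewriting code): two 4-byte ints can be packed into one Long losslessly, but a third cannot.

```scala
// Hedged illustration: packing int join keys into a single Long.
def packTwoInts(a: Int, b: Int): Long = (a.toLong << 32) | (b & 0xFFFFFFFFL)

val key = packTwoInts(7, -1)       // 2 ints = 8 bytes: fits and is reversible
val aBack = (key >> 32).toInt      // 7
val bBack = key.toInt              // -1

// A third int would need 12 bytes; shifting again drops the high 32 bits,
// so different (a, b, c) triples could collapse onto the same Long key.
```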
a84d8ef [SPARK-17782][STREAMING][BUILD] Add Kafka 0.10 project to build modules ## What changes were proposed in this pull request? This PR adds the Kafka 0.10 subproject to the build infrastructure. This makes sure Kafka 0.10 tests are only triggered when it or one of its dependencies changes. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #15355 from hvanhovell/SPARK-17782. 07 October 2016, 18:53:07 UTC
f460a19 [SPARK-17346][SQL][TEST-MAVEN] Add Kafka source for Structured Streaming (branch 2.0) ## What changes were proposed in this pull request? Backport https://github.com/apache/spark/commit/9293734d35eb3d6e4fd4ebb86f54dd5d3a35e6db and https://github.com/apache/spark/commit/b678e465afa417780b54db0fbbaa311621311f15 into branch 2.0. The only difference is the Spark version in the pom file. ## How was this patch tested? Jenkins. Author: Shixiong Zhu <shixiong@databricks.com> Closes #15367 from zsxwing/kafka-source-branch-2.0. 07 October 2016, 18:32:39 UTC
9f2eb27 [SPARK-17707][WEBUI] Web UI prevents spark-submit application to be finished This expands calls to Jetty's simple `ServerConnector` constructor to explicitly specify a `ScheduledExecutorScheduler` that makes daemon threads. It should otherwise result in exactly the same configuration, because the other args are copied from the constructor that is currently called. (I'm not sure we should change the Hive Thriftserver impl, but I did anyway.) This also adds `sc.stop()` to the quick start guide example. Existing tests; _pending_ at least manual verification of the fix. Author: Sean Owen <sowen@cloudera.com> Closes #15381 from srowen/SPARK-17707. (cherry picked from commit cff560755244dd4ccb998e0c56e81d2620cd4cff) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 07 October 2016, 17:37:42 UTC
3487b02 [SPARK-17805][PYSPARK] Fix in sqlContext.read.text when pass in list of paths ## What changes were proposed in this pull request? If given a list of paths, `pyspark.sql.readwriter.text` will attempt to use an undefined variable `paths`. This change checks if the param `paths` is a basestring and then converts it to a list, so that the same variable `paths` can be used for both cases ## How was this patch tested? Added unit test for reading list of files Author: Bryan Cutler <cutlerb@gmail.com> Closes #15379 from BryanCutler/sql-readtext-paths-SPARK-17805. (cherry picked from commit bcaa799cb01289f73e9f48526e94653a07628983) Signed-off-by: Reynold Xin <rxin@databricks.com> 07 October 2016, 07:28:02 UTC
380b099 [SPARK-17612][SQL][BRANCH-2.0] Support `DESCRIBE table PARTITION` SQL syntax ## What changes were proposed in this pull request? This is a backport of SPARK-17612. This implements `DESCRIBE table PARTITION` SQL Syntax again. It was supported until Spark 1.6.2, but was dropped since 2.0.0. **Spark 1.6.2** ```scala scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY (c STRING, d STRING)") res1: org.apache.spark.sql.DataFrame = [result: string] scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)") res2: org.apache.spark.sql.DataFrame = [result: string] scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false) +----------------------------------------------------------------+ |result | +----------------------------------------------------------------+ |a string | |b int | |c string | |d string | | | |# Partition Information | |# col_name data_type comment | | | |c string | |d string | +----------------------------------------------------------------+ ``` **Spark 2.0** - **Before** ```scala scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY (c STRING, d STRING)") res0: org.apache.spark.sql.DataFrame = [] scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)") res1: org.apache.spark.sql.DataFrame = [] scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false) org.apache.spark.sql.catalyst.parser.ParseException: Unsupported SQL statement ``` - **After** ```scala scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY (c STRING, d STRING)") res0: org.apache.spark.sql.DataFrame = [] scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)") res1: org.apache.spark.sql.DataFrame = [] scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false) +-----------------------+---------+-------+ |col_name |data_type|comment| +-----------------------+---------+-------+ |a |string |null | |b |int |null | |c |string |null | |d |string |null | |# Partition Information| | | |# col_name |data_type|comment| |c |string |null | |d |string |null | +-----------------------+---------+-------+ scala> sql("DESC EXTENDED partitioned_table PARTITION (c='Us', d=1)").show(100,false) +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-------+ |col_name |data_type|comment| +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-------+ |a |string |null | |b |int |null | |c |string |null | |d |string |null | |# Partition Information | | | |# col_name |data_type|comment| |c |string |null | |d |string |null | | | | | |Detailed Partition Information CatalogPartition( Partition Values: [Us, 1] Storage(Location: 
file:/Users/dhyun/SPARK-17612-DESC-PARTITION/spark-warehouse/partitioned_table/c=Us/d=1, InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, Serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Properties: [serialization.format=1]) Partition Parameters:{transient_lastDdlTime=1475001066})| | | +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-------+ scala> sql("DESC FORMATTED partitioned_table PARTITION (c='Us', d=1)").show(100,false) +--------------------------------+---------------------------------------------------------------------------------------+-------+ |col_name |data_type |comment| +--------------------------------+---------------------------------------------------------------------------------------+-------+ |a |string |null | |b |int |null | |c |string |null | |d |string |null | |# Partition Information | | | |# col_name |data_type |comment| |c |string |null | |d |string |null | | | | | |# Detailed Partition Information| | | |Partition Value: |[Us, 1] | | |Database: |default | | |Table: |partitioned_table | | |Location: |file:/Users/dhyun/SPARK-17612-DESC-PARTITION/spark-warehouse/partitioned_table/c=Us/d=1| | |Partition Parameters: | | | | transient_lastDdlTime |1475001066 | | | | | | |# Storage Information | | | |SerDe Library: |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | | |InputFormat: |org.apache.hadoop.mapred.TextInputFormat | | |OutputFormat: |org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | | |Compressed: |No | | |Storage Desc Parameters: | | | | serialization.format |1 | | +--------------------------------+---------------------------------------------------------------------------------------+-------+ ``` ## How was this patch tested? Pass the Jenkins tests with a new testcase. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #15351 from dongjoon-hyun/SPARK-17612-BACK. 07 October 2016, 05:27:20 UTC
594a2cf [SPARK-17792][ML] L-BFGS solver for linear regression does not accept general numeric label column types ## What changes were proposed in this pull request? Before, we computed `instances` in LinearRegression in two spots, even though they did the same thing. One of them did not cast the label column to `DoubleType`. This patch consolidates the computation and always casts the label column to `DoubleType`. ## How was this patch tested? Added a unit test to check all solvers. This test failed before this patch. Author: sethah <seth.hendrickson16@gmail.com> Closes #15364 from sethah/linreg_numeric_type. (cherry picked from commit 3713bb199142c5e06e2e527c99650f02f41f47b1) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 07 October 2016, 04:14:44 UTC
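A hedged sketch of the consolidated behavior described above; the helper name and signature are illustrative, not the actual LinearRegression internals.

```scala
// Hedged sketch: always cast the label column to DoubleType before building training
// instances, so IntegerType, FloatType, etc. label columns are accepted as well.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

def selectLabeledData(dataset: DataFrame, labelCol: String, featuresCol: String): DataFrame =
  dataset.select(col(labelCol).cast(DoubleType).as("label"), col(featuresCol).as("features"))
```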