6c2a38a [MINOR] Correct validateAndTransformSchema in GaussianMixture and AFTSurvivalRegression ## What changes were proposed in this pull request? The line SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType) did not modify the variable schema, hence only the last line had any effect. A temporary variable is used to correctly append the two columns predictionCol and probabilityCol. ## How was this patch tested? Manually. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Cédric Pelvet <cedric.pelvet@gmail.com> Closes #18980 from sharp-pixel/master. (cherry picked from commit 73e04ecc4f29a0fe51687ed1337c61840c976f89) Signed-off-by: Sean Owen <sowen@cloudera.com> 20 August 2017, 10:06:02 UTC
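A minimal sketch of the pitfall fixed above (not the actual MLlib code; the column names are placeholders): Spark's `StructType` is immutable, so an append whose return value is ignored is a no-op.

```scala
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructType}

val schema = new StructType().add("features", DoubleType)

// Buggy pattern: the returned schema is discarded, so only the last append takes effect.
schema.add("prediction", IntegerType)                      // result dropped
val broken = schema.add("probability", DoubleType)         // "prediction" is missing here

// Fixed pattern: thread each appended schema through a temporary variable.
val withPrediction = schema.add("prediction", IntegerType)
val fixed = withPrediction.add("probability", DoubleType)  // both columns present

println(broken.fieldNames.mkString(", "))  // features, probability
println(fixed.fieldNames.mkString(", "))   // features, prediction, probability
```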
fdea642 [SPARK-21739][SQL] Cast expression should initialize timezoneId when it is called statically to convert something into TimestampType ## What changes were proposed in this pull request? https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21739 This issue is caused by introducing TimeZoneAwareExpression. When the **Cast** expression converts something into TimestampType, it should be resolved with the `timezoneId` set. In general, it is resolved in the LogicalPlan phase. However, there are still some places that use the Cast expression statically to convert datatypes without setting `timezoneId`. In such cases, `NoSuchElementException: None.get` will be thrown for TimestampType. This PR is proposed to fix the issue. We have checked the whole project and found two such usages (i.e., in `TableReader` and `HiveTableScanExec`). ## How was this patch tested? Unit test. Author: donnyzone <wellfengzhu@gmail.com> Closes #18960 from DonnyZone/spark-21739. (cherry picked from commit 310454be3b0ce5ff6b6ef0070c5daadf6fb16927) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 18 August 2017, 05:37:41 UTC
2a96975 [SPARK-18464][SQL][BACKPORT] support old table which doesn't store schema in table properties backport https://github.com/apache/spark/pull/18907 to branch 2.2 Author: Wenchen Fan <wenchen@databricks.com> Closes #18963 from cloud-fan/backport. 16 August 2017, 16:36:33 UTC
f5ede0d [SPARK-21656][CORE] spark dynamic allocation should not idle timeout executors when tasks still to run ## What changes were proposed in this pull request? Right now Spark lets go of executors when they have been idle for 60s (or a configurable time). I have seen Spark let them go while they were idle but still really needed. I have seen this issue when the scheduler was waiting to get node locality but that takes longer than the default idle timeout. In these jobs the number of executors drops very low (fewer than 10) while there are still around 80,000 tasks to run. We should consider not allowing executors to idle timeout if they are still needed according to the number of tasks to be run. ## How was this patch tested? Tested by manually adding executors to the `executorsIdsToBeRemoved` list and seeing if those executors were removed when there are a lot of tasks and a high `numExecutorsTarget` value. Code used: In `ExecutorAllocationManager.start()` ``` start_time = clock.getTimeMillis() ``` In `ExecutorAllocationManager.schedule()` ``` val executorIdsToBeRemoved = ArrayBuffer[String]() if ( now > start_time + 1000 * 60 * 2) { logInfo("--- REMOVING 1/2 of the EXECUTORS ---") start_time += 1000 * 60 * 100 var counter = 0 for (x <- executorIds) { counter += 1 if (counter == 2) { counter = 0 executorIdsToBeRemoved += x } } } ``` Author: John Lee <jlee2@yahoo-inc.com> Closes #18874 from yoonlee95/SPARK-21656. (cherry picked from commit adf005dabe3b0060033e1eeaedbab31a868efc8c) Signed-off-by: Tom Graves <tgraves@yahoo-inc.com> 16 August 2017, 14:44:22 UTC
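For context, a sketch of the dynamic-allocation settings involved, runnable in spark-shell or an application; the 60s value is the default idle timeout discussed above, and the other values are purely illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-example")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")                // required by dynamic allocation
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")   // the idle timeout in question
  .config("spark.dynamicAllocation.maxExecutors", "200")          // illustrative cap
  .getOrCreate()
```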
f1accc8 [SPARK-21723][ML] Fix writing LibSVM (key not found: numFeatures) Check the option "numFeatures" only when reading LibSVM, not when writing. When writing, Spark was raising an exception. After the change it will ignore the option completely. liancheng HyukjinKwon (Maybe the usage should be forbidden when writing, in a major version change?). Manual test, that loading and writing LibSVM files work fine, both with and without the numFeatures option. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Jan Vrsovsky <jan.vrsovsky@firma.seznam.cz> Closes #18872 from ProtD/master. (cherry picked from commit 8321c141f63a911a97ec183aefa5ff75a338c051) Signed-off-by: Sean Owen <sowen@cloudera.com> 16 August 2017, 07:24:09 UTC
d9c8e62 [SPARK-21721][SQL] Clear FileSystem deleteOnExit cache when paths are successfully removed ## What changes were proposed in this pull request? We put the staging path to delete into the deleteOnExit cache of `FileSystem` in case the path can't be successfully removed. But when we successfully remove the path, we don't remove it from the cache. We should do so to avoid continually growing the cache. ## How was this patch tested? Added a test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18934 from viirya/SPARK-21721. (cherry picked from commit 4c3cf1cc5cdb400ceef447d366e9f395cd87b273) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 15 August 2017, 05:29:27 UTC
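A sketch of the cache-hygiene pattern described above, using the public Hadoop `FileSystem` API rather than Spark's internal code; the staging path is hypothetical.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val staging = new Path("/tmp/.hive-staging_example")   // hypothetical staging directory

// Register the path for deletion at JVM exit as a safety net...
fs.deleteOnExit(staging)

// ...but if the explicit delete succeeds, drop it from the deleteOnExit cache
// so the cache does not keep growing over the lifetime of the application.
if (fs.delete(staging, true)) {
  fs.cancelDeleteOnExit(staging)
}
```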
48bacd3 [SPARK-21696][SS] Fix a potential issue that may generate partial snapshot files ## What changes were proposed in this pull request? Directly writing a snapshot file may generate a partial file. This PR changes it to write to a temp file then rename to the target file. ## How was this patch tested? Jenkins. Author: Shixiong Zhu <shixiong@databricks.com> Closes #18928 from zsxwing/SPARK-21696. (cherry picked from commit 282f00b410fdc4dc69b9d1f3cb3e2ba53cd85b8b) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 14 August 2017, 22:07:44 UTC
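The write-to-temp-then-rename idea, sketched with the Hadoop `FileSystem` API under assumed paths; the actual change lives in the streaming state/metadata code, not in this form.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val target = new Path("/tmp/state/1.snapshot")                    // hypothetical snapshot file
val temp = new Path(target.getParent, s".${target.getName}.tmp")

// Write the full contents to a temp file first...
val out = fs.create(temp, true)
try {
  out.write("snapshot-bytes".getBytes("UTF-8"))
} finally {
  out.close()
}

// ...then publish it with a rename, so readers never observe a partially written snapshot.
fs.rename(temp, target)
```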
7b98077 [SPARK-21563][CORE] Fix race condition when serializing TaskDescriptions and adding jars ## What changes were proposed in this pull request? Fix the race condition when serializing TaskDescriptions and adding jars by keeping the set of jars and files for a TaskSet constant across the lifetime of the TaskSet. Otherwise TaskDescription serialization can produce an invalid serialization when new file/jars are added concurrently as the TaskDescription is serialized. ## How was this patch tested? Additional unit test ensures jars/files contained in the TaskDescription remain constant throughout the lifetime of the TaskSet. Author: Andrew Ash <andrew@andrewash.com> Closes #18913 from ash211/SPARK-21563. (cherry picked from commit 6847e93cf427aa971dac1ea261c1443eebf4089e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 14 August 2017, 14:57:11 UTC
406eb1c [SPARK-21595] Separate thresholds for buffering and spilling in ExternalAppendOnlyUnsafeRowArray ## What changes were proposed in this pull request? [SPARK-21595](https://issues.apache.org/jira/browse/SPARK-21595) reported that there is excessive spilling to disk due to default spill threshold for `ExternalAppendOnlyUnsafeRowArray` being quite small for WINDOW operator. Old behaviour of WINDOW operator (pre https://github.com/apache/spark/pull/16909) would hold data in an array for first 4096 records post which it would switch to `UnsafeExternalSorter` and start spilling to disk after reaching `spark.shuffle.spill.numElementsForceSpillThreshold` (or earlier if there was paucity of memory due to excessive consumers). Currently the (switch from in-memory to `UnsafeExternalSorter`) and (`UnsafeExternalSorter` spilling to disk) for `ExternalAppendOnlyUnsafeRowArray` is controlled by a single threshold. This PR aims to separate that to have more granular control. ## How was this patch tested? Added unit tests Author: Tejas Patil <tejasp@fb.com> Closes #18843 from tejasapatil/SPARK-21595. (cherry picked from commit 94439997d57875838a8283c543f9b44705d3a503) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 11 August 2017, 20:01:11 UTC
c909496 [SPARK-21699][SQL] Remove unused getTableOption in ExternalCatalog ## What changes were proposed in this pull request? This patch removes the unused SessionCatalog.getTableMetadataOption and ExternalCatalog.getTableOption. ## How was this patch tested? Removed the test case. Author: Reynold Xin <rxin@databricks.com> Closes #18912 from rxin/remove-getTableOption. (cherry picked from commit 584c7f14370cdfafdc6cd554b2760b7ce7709368) Signed-off-by: Reynold Xin <rxin@databricks.com> 11 August 2017, 01:56:43 UTC
3ca55ea [SPARK-21663][TESTS] test("remote fetch below max RPC message size") should call masterTracker.stop() in MapOutputTrackerSuite Signed-off-by: 10087686 <wang.jiaochun@zte.com.cn> ## What changes were proposed in this pull request? After the unit tests end, masterTracker.stop() should be called to free resources. ## How was this patch tested? Ran the unit tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: 10087686 <wang.jiaochun@zte.com.cn> Closes #18867 from wangjiaochun/mapout. (cherry picked from commit 6426adffaf152651c30d481bb925d5025fd6130a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 09 August 2017, 10:45:53 UTC
f6d56d2 [SPARK-21596][SS] Ensure places calling HDFSMetadataLog.get check the return value Same PR as #18799 but for branch 2.2. The main discussion is in the other PR. -------- When I was investigating a flaky test, I realized that many places don't check the return value of `HDFSMetadataLog.get(batchId: Long): Option[T]`. When a batch is supposed to be there, the caller just ignores None rather than throwing an error. If some bug causes a query not to generate a batch metadata file, this behavior will hide it and allow the query to continue running, eventually deleting metadata logs and making it hard to debug. This PR ensures that places calling HDFSMetadataLog.get always check the return value. Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #18890 from tdas/SPARK-21596-2.2. 09 August 2017, 06:49:33 UTC
7446be3 [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong wolfe line search ## What changes were proposed in this pull request? Update breeze to 0.13.2 for an emergency bugfix in strong Wolfe line search https://github.com/scalanlp/breeze/pull/651 ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #18797 from WeichenXu123/update-breeze. (cherry picked from commit b35660dd0e930f4b484a079d9e2516b0a7dacf1d) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 09 August 2017, 06:44:39 UTC
d023314 [SPARK-21503][UI] Spark UI shows incorrect task status for a killed Executor Process The executor tab on the Spark UI page shows a task as completed when the executor process running that task is killed using the kill command. Added the ExecutorLostFailure case, which was previously missing; without it the default case was executed, so the task was marked as completed. This case covers all situations where the executor's connection to the Spark driver was lost, e.g. due to killing the executor process, a dropped network connection, etc. ## How was this patch tested? Manually tested the fix by observing the UI change before and after. Before: <img width="1398" alt="screen shot-before" src="https://user-images.githubusercontent.com/22228190/28482929-571c9cea-6e30-11e7-93dd-728de5cdea95.png"> After: <img width="1385" alt="screen shot-after" src="https://user-images.githubusercontent.com/22228190/28482964-8649f5ee-6e30-11e7-91bd-2eb2089c61cc.png"> Please review http://spark.apache.org/contributing.html before opening a pull request. Author: pgandhi <pgandhi@yahoo-inc.com> Author: pgandhi999 <parthkgandhi9@gmail.com> Closes #18707 from pgandhi999/master. (cherry picked from commit f016f5c8f6c6aae674e9905a5c0b0bede09163a4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 09 August 2017, 05:46:25 UTC
e87ffca Revert "[SPARK-21567][SQL] Dataset should work with type alias" This reverts commit 86609a95af4b700e83638b7416c7e3706c2d64c6. 08 August 2017, 08:32:49 UTC
86609a9 [SPARK-21567][SQL] Dataset should work with type alias If we create a type alias for a type workable with Dataset, the type alias doesn't work with Dataset. A reproducible case looks like: object C { type TwoInt = (Int, Int) def tupleTypeAlias: TwoInt = (1, 1) } Seq(1).toDS().map(_ => ("", C.tupleTypeAlias)) It throws an exception like: type T1 is not a class scala.ScalaReflectionException: type T1 is not a class at scala.reflect.api.Symbols$SymbolApi$class.asClass(Symbols.scala:275) ... This patch accesses the dealias of type in many places in `ScalaReflection` to fix it. Added test case. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18813 from viirya/SPARK-21567. (cherry picked from commit ee1304199bcd9c1d5fc94f5b06fdd5f6fe7336a1) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 08 August 2017, 08:17:05 UTC
a1c1199 [SPARK-21648][SQL] Fix confusing assert failure in JDBC source when parallel fetching parameters are not properly provided. ### What changes were proposed in this pull request? ```SQL CREATE TABLE mytesttable1 USING org.apache.spark.sql.jdbc OPTIONS ( url 'jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}?user=${jdbcUsername}&password=${jdbcPassword}', dbtable 'mytesttable1', paritionColumn 'state_id', lowerBound '0', upperBound '52', numPartitions '53', fetchSize '10000' ) ``` The above option name `paritionColumn` is misspelled, which means the users did not actually provide a value for `partitionColumn`. In such a case, users hit a confusing error. ``` AssertionError: assertion failed java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:156) at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:39) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:312) ``` ### How was this patch tested? Added a test case Author: gatorsmile <gatorsmile@gmail.com> Closes #18864 from gatorsmile/jdbcPartCol. (cherry picked from commit baf5cac0f8c35925c366464d7e0eb5f6023fce57) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 07 August 2017, 20:04:22 UTC
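For reference, a correctly spelled partitioned JDBC read using the public `DataFrameReader` options; the connection details below are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-partitioned-read").getOrCreate()

// partitionColumn, lowerBound, upperBound and numPartitions must all be provided together;
// a misspelling such as "paritionColumn" means partitionColumn is effectively absent.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/dbname?user=user&password=pass")  // placeholder
  .option("dbtable", "mytesttable1")
  .option("partitionColumn", "state_id")
  .option("lowerBound", "0")
  .option("upperBound", "52")
  .option("numPartitions", "53")
  .option("fetchsize", "10000")
  .load()
```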
fa92a7b [SPARK-21565][SS] Propagate metadata in attribute replacement. ## What changes were proposed in this pull request? Propagate metadata in attribute replacement during streaming execution. This is necessary for EventTimeWatermarks consuming replaced attributes. ## How was this patch tested? new unit test, which was verified to fail before the fix Author: Jose Torres <joseph-torres@databricks.com> Closes #18840 from joseph-torres/SPARK-21565. (cherry picked from commit cce25b360ee9e39d9510134c73a1761475eaf4ac) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 07 August 2017, 19:27:30 UTC
43f9c84 [SPARK-21374][CORE] Fix reading globbed paths from S3 into DF with disabled FS cache This PR replaces #18623 to do some clean up. Closes #18623 Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Author: Andrey Taptunov <taptunov@amazon.com> Closes #18848 from zsxwing/review-pr18623. 07 August 2017, 18:04:32 UTC
4f0eb0c [SPARK-21647][SQL] Fix SortMergeJoin when using CROSS ### What changes were proposed in this pull request? author: BoleynSu closes https://github.com/apache/spark/pull/18836 ```Scala val df = Seq((1, 1)).toDF("i", "j") df.createOrReplaceTempView("T") withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") { sql("select * from (select a.i from T a cross join T t where t.i = a.i) as t1 " + "cross join T t2 where t2.i = t1.i").explain(true) } ``` The above code could cause the following exception: ``` SortMergeJoinExec should not take Cross as the JoinType java.lang.IllegalArgumentException: SortMergeJoinExec should not take Cross as the JoinType at org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputOrdering(SortMergeJoinExec.scala:100) ``` Our SortMergeJoinExec supports CROSS. We should not hit such an exception. This PR is to fix the issue. ### How was this patch tested? Modified the two existing test cases. Author: Xiao Li <gatorsmile@gmail.com> Author: Boleyn Su <boleyn.su@gmail.com> Closes #18863 from gatorsmile/pr-18836. (cherry picked from commit bbfd6b5d24be5919a3ab1ac3eaec46e33201df39) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 07 August 2017, 16:00:16 UTC
7a04def [SPARK-21621][CORE] Reset numRecordsWritten after DiskBlockObjectWriter.commitAndGet called ## What changes were proposed in this pull request? We should reset numRecordsWritten to zero after DiskBlockObjectWriter.commitAndGet is called, because when `revertPartialWritesAndClose` is called, we decrease the written records in `ShuffleWriteMetrics`. However, we currently decrease the written records all the way to zero, which is wrong; we should only decrease the number of records written after the last `commitAndGet` call. ## How was this patch tested? Modified existing test. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Xianyang Liu <xianyang.liu@intel.com> Closes #18830 from ConeyLiu/DiskBlockObjectWriter. (cherry picked from commit 534a063f7c693158437d13224f50d4ae789ff6fb) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 07 August 2017, 09:05:02 UTC
098aaec [SPARK-21588][SQL] SQLContext.getConf(key, null) should return null ## What changes were proposed in this pull request? SQLContext.getConf(key, null), for a key that has no value set in the conf and no default value defined, throws an NPE. It happens only when the conf entry has a value converter. Added a null check on defaultValue inside SQLConf.getConfString to avoid calling entry.valueConverter(defaultValue). ## How was this patch tested? Added unit test Author: vinodkc <vinod.kc.in@gmail.com> Closes #18852 from vinodkc/br_Fix_SPARK-21588. (cherry picked from commit 1ba967b25e6d88be2db7a4e100ac3ead03a2ade9) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 06 August 2017, 06:04:52 UTC
841bc2f [SPARK-21580][SQL] Integers in aggregation expressions are wrongly taken as group-by ordinal ## What changes were proposed in this pull request? create temporary view data as select * from values (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2) as data(a, b); `select 3, 4, sum(b) from data group by 1, 2;` `select 3 as c, 4 as d, sum(b) from data group by c, d;` When running these two cases, the following exception occurred: `Error in query: GROUP BY position 4 is not in select list (valid range is [1, 3]); line 1 pos 10` The cause of this failure: if an aggregateExpression is an integer, then after the group-by expression is replaced with this aggregateExpression, it is still considered an ordinal. The solution: this bug is due to re-entrance of an analyzed plan. We can solve it by using `resolveOperators` in `SubstituteUnresolvedOrdinals`. ## How was this patch tested? Added unit test case Author: liuxian <liu.xian3@zte.com.cn> Closes #18779 from 10110346/groupby. (cherry picked from commit 894d5a453a3f47525408ee8c91b3b594daa43ccb) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 05 August 2017, 05:55:19 UTC
f9aae8e [SPARK-21330][SQL] Bad partitioning does not allow to read a JDBC table with extreme values on the partition column ## What changes were proposed in this pull request? An overflow of the difference of bounds on the partitioning column leads to no data being read. This patch checks for this overflow. ## How was this patch tested? New unit test. Author: Andrew Ray <ray.andrew@gmail.com> Closes #18800 from aray/SPARK-21330. (cherry picked from commit 25826c77ddf0d5753d2501d0e764111da2caa8b6) Signed-off-by: Sean Owen <sowen@cloudera.com> 04 August 2017, 07:58:08 UTC
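A small worked example of the overflow in question (illustrative arithmetic only, not the patched code): with extreme bounds the width of the range no longer fits in a `Long`.

```scala
val lowerBound = Long.MinValue
val upperBound = Long.MaxValue
val numPartitions = 10

// Long.MaxValue - Long.MinValue wraps around to -1, so the naive stride collapses to 0
// and the generated partition predicates end up selecting no rows.
val naiveStride = (upperBound - lowerBound) / numPartitions
println(naiveStride)  // 0

// Doing the width calculation in BigInt avoids the overflow.
val safeStride = ((BigInt(upperBound) - BigInt(lowerBound)) / numPartitions).toLong
println(safeStride)   // 1844674407370955161
```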
1bcfa2a Fix Java SimpleApp spark application ## What changes were proposed in this pull request? Add missing import and missing parentheses to invoke `SparkSession::text()`. ## How was this patch tested? Built and ran the code for this application, and ran jekyll locally per docs/README.md. Author: Christiam Camacho <camacho@ncbi.nlm.nih.gov> Closes #18795 from christiam/master. (cherry picked from commit dd72b10aba9997977f82605c5c1778f02dd1f91e) Signed-off-by: Sean Owen <sowen@cloudera.com> 03 August 2017, 22:40:33 UTC
690f491 [SPARK-12717][PYTHON][BRANCH-2.2] Adding thread-safe broadcast pickle registry ## What changes were proposed in this pull request? When using PySpark broadcast variables in a multi-threaded environment, `SparkContext._pickled_broadcast_vars` becomes a shared resource. A race condition can occur when broadcast variables that are pickled from one thread get added to the shared ` _pickled_broadcast_vars` and become part of the python command from another thread. This PR introduces a thread-safe pickled registry using thread local storage so that when python command is pickled (causing the broadcast variable to be pickled and added to the registry) each thread will have their own view of the pickle registry to retrieve and clear the broadcast variables used. ## How was this patch tested? Added a unit test that causes this race condition using another thread. Author: Bryan Cutler <cutlerb@gmail.com> Closes #18823 from BryanCutler/branch-2.2. 03 August 2017, 01:28:19 UTC
467ee8d [SPARK-21546][SS] dropDuplicates should ignore watermark when it's not a key ## What changes were proposed in this pull request? When the watermark is not a column of `dropDuplicates`, right now it will crash. This PR fixed this issue. ## How was this patch tested? The new unit test. Author: Shixiong Zhu <shixiong@databricks.com> Closes #18822 from zsxwing/SPARK-21546. (cherry picked from commit 0d26b3aa55f9cc75096b0e2b309f64fe3270b9a5) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 02 August 2017, 21:02:22 UTC
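The shape of the query being fixed, sketched against the built-in rate source; the column names come from that source and the dedup key is arbitrary.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dedup-with-watermark").getOrCreate()
import spark.implicits._

// The watermark column ("timestamp") is intentionally not part of the dropDuplicates key;
// before this fix such a query crashed.
val deduped = spark.readStream
  .format("rate").load()                    // columns: timestamp, value
  .withColumn("id", $"value" % 10)
  .withWatermark("timestamp", "10 seconds")
  .dropDuplicates("id")
```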
397f904 [SPARK-21597][SS] Fix a potential overflow issue in EventTimeStats ## What changes were proposed in this pull request? This PR fixed a potential overflow issue in EventTimeStats. ## How was this patch tested? The new unit tests Author: Shixiong Zhu <shixiong@databricks.com> Closes #18803 from zsxwing/avg. (cherry picked from commit 7f63e85b47a93434030482160e88fe63bf9cff4e) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 02 August 2017, 18:00:09 UTC
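A sketch of the incremental-average idea behind this kind of fix (simplified; the real `EventTimeStats` also merges partial stats): keeping a running mean avoids accumulating a sum of epoch-millisecond timestamps that can overflow a `Long`.

```scala
case class EventTimeStatsSketch(
    var max: Long = Long.MinValue,
    var min: Long = Long.MaxValue,
    var avg: Double = 0.0,
    var count: Long = 0L) {

  def add(eventTimeMs: Long): Unit = {
    max = math.max(max, eventTimeMs)
    min = math.min(min, eventTimeMs)
    count += 1
    // Incremental mean: no unbounded running sum that could overflow.
    avg += (eventTimeMs - avg) / count
  }
}
```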
67c60d7 [SPARK-21339][CORE] spark-shell --packages option does not add jars to classpath on windows The --packages option jars are getting added to the classpath with the scheme as "file:///", in Unix it doesn't have problem with this since the scheme contains the Unix Path separator which separates the jar name with location in the classpath. In Windows, the jar file is not getting resolved from the classpath because of the scheme. Windows : file:///C:/Users/<user>/.ivy2/jars/<jar-name>.jar Unix : file:///home/<user>/.ivy2/jars/<jar-name>.jar With this PR, we are avoiding the 'file://' scheme to get added to the packages jar files. I have verified manually in Windows and Unix environments, with the change it adds the jar to classpath like below, Windows : C:\Users\<user>\.ivy2\jars\<jar-name>.jar Unix : /home/<user>/.ivy2/jars/<jar-name>.jar Author: Devaraj K <devaraj@apache.org> Closes #18708 from devaraj-kavali/SPARK-21339. (cherry picked from commit 58da1a2455258156fe8ba57241611eac1a7928ef) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 01 August 2017, 20:42:50 UTC
79e5805 [SPARK-21593][DOCS] Fix 2 rendering errors on configuration page ## What changes were proposed in this pull request? Fix 2 rendering errors on configuration doc page, due to SPARK-21243 and SPARK-15355. ## How was this patch tested? Manually built and viewed docs with jekyll Author: Sean Owen <sowen@cloudera.com> Closes #18793 from srowen/SPARK-21593. (cherry picked from commit b1d59e60dee2a41f8eff8ef29b3bcac69111e2f0) Signed-off-by: Sean Owen <sowen@cloudera.com> 01 August 2017, 18:06:04 UTC
1745434 [SPARK-21522][CORE] Fix flakiness in LauncherServerSuite. Handle the case where the server closes the socket before the full message has been written by the client. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18727 from vanzin/SPARK-21522. (cherry picked from commit b133501800b43fa5c538a4e5ad597c9dc7d8378e) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 01 August 2017, 17:06:15 UTC
e2062b9 Revert "[SPARK-19451][SQL] rangeBetween method should accept Long value as boundary" This reverts commit 66fa6bd6d48b08625ecedfcb5a976678141300bd. 30 July 2017, 03:35:22 UTC
66fa6bd [SPARK-19451][SQL] rangeBetween method should accept Long value as boundary ## What changes were proposed in this pull request? Long values can be passed to `rangeBetween` as range frame boundaries, but we silently convert it to Int values, this can cause wrong results and we should fix this. Further more, we should accept any legal literal values as range frame boundaries. In this PR, we make it possible for Long values, and make accepting other DataTypes really easy to add. This PR is mostly based on Herman's previous amazing work: https://github.com/hvanhovell/spark/commit/596f53c339b1b4629f5651070e56a8836a397768 After this been merged, we can close #16818 . ## How was this patch tested? Add new tests in `DataFrameWindowFunctionsSuite` and `TypeCoercionSuite`. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18540 from jiangxb1987/rangeFrame. (cherry picked from commit 92d85637e7f382aae61c0f26eb1524d2b4c93516) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 29 July 2017, 17:11:43 UTC
24a9bac [SPARK-21555][SQL] RuntimeReplaceable should be compared semantically by its canonicalized child ## What changes were proposed in this pull request? When there are aliases (these aliases were added for nested fields) as parameters in `RuntimeReplaceable`, as they are not in the children expression, those aliases can't be cleaned up in analyzer rule `CleanupAliases`. An expression `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a group by query because they contain different aliases. Because those aliases are not children of `RuntimeReplaceable` which is an `UnaryExpression`. So we can't trim the aliases out by simple transforming the expressions in `CleanupAliases`. If we want to replace the non-children aliases in `RuntimeReplaceable`, we need to add more codes to `RuntimeReplaceable` and modify all expressions of `RuntimeReplaceable`. It makes the interface ugly IMO. Consider those aliases will be replaced later at optimization and so they're no harm, this patch chooses to simply override `canonicalized` of `RuntimeReplaceable`. One concern is about `CleanupAliases`. Because it actually cannot clean up ALL aliases inside a plan. To make caller of this rule notice that, this patch adds a comment to `CleanupAliases`. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18761 from viirya/SPARK-21555. (cherry picked from commit 9c8109ef414c92553335bb1e90e9681e142128a4) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 29 July 2017, 17:03:09 UTC
df6cd35 [SPARK-21508][DOC] Fix example code provided in Spark Streaming Documentation ## What changes were proposed in this pull request? JIRA ticket : [SPARK-21508](https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21508) correcting a mistake in example code provided in Spark Streaming Custom Receivers Documentation The example code provided in the documentation on 'Spark Streaming Custom Receivers' has an error. doc link : https://spark.apache.org/docs/latest/streaming-custom-receivers.html ``` // Assuming ssc is the StreamingContext val customReceiverStream = ssc.receiverStream(new CustomReceiver(host, port)) val words = lines.flatMap(_.split(" ")) ... ``` instead of `lines.flatMap(_.split(" "))` it should be `customReceiverStream.flatMap(_.split(" "))` ## How was this patch tested? this documentation change is tested manually by jekyll build , running below commands ``` jekyll build jekyll serve --watch ``` screen-shots provided below ![screenshot1](https://user-images.githubusercontent.com/8828470/28744636-a6de1ac6-7482-11e7-843b-ff84b5855ec0.png) ![screenshot2](https://user-images.githubusercontent.com/8828470/28744637-a6def496-7482-11e7-9512-7f4bbe027c6a.png) Author: Remis Haroon <Remis.Haroon@insdc01.pwc.com> Closes #18770 from remisharoon/master. (cherry picked from commit c14382030b373177cf6aa3c045e27d754368a927) Signed-off-by: Sean Owen <sowen@cloudera.com> 29 July 2017, 12:26:19 UTC
9379031 [SPARK-21306][ML] OneVsRest should support setWeightCol ## What changes were proposed in this pull request? add `setWeightCol` method for OneVsRest. `weightCol` is ignored if classifier doesn't inherit HasWeightCol trait. ## How was this patch tested? + [x] add an unit test. Author: Yan Facai (颜发才) <facai.yan@gmail.com> Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol. (cherry picked from commit a5a3189974ea4628e9489eb50099a5432174e80c) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 28 July 2017, 02:14:55 UTC
06b2ef0 [SPARK-21538][SQL] Attribute resolution inconsistency in the Dataset API ## What changes were proposed in this pull request? This PR contains a tiny update that removes an attribute resolution inconsistency in the Dataset API. The following example is taken from the ticket description: ``` spark.range(1).withColumnRenamed("id", "x").sort(col("id")) // works spark.range(1).withColumnRenamed("id", "x").sort($"id") // works spark.range(1).withColumnRenamed("id", "x").sort('id) // works spark.range(1).withColumnRenamed("id", "x").sort("id") // fails with: org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among (x); ``` The above `AnalysisException` happens because the last case calls `Dataset.apply()` to convert strings into columns, which triggers attribute resolution. To make the API consistent between overloaded methods, this PR defers the resolution and constructs columns directly. Author: aokolnychyi <anton.okolnychyi@sap.com> Closes #18740 from aokolnychyi/spark-21538. (cherry picked from commit f44ead89f48f040b7eb9dfc88df0ec995b47bfe9) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 27 July 2017, 23:49:55 UTC
1bfd1a8 [SPARK-21494][NETWORK] Use correct app id when authenticating to external service. There was some code based on the old SASL handler in the new auth client that was incorrectly using the SASL user as the user to authenticate against the external shuffle service. This caused the external service to not be able to find the correct secret to authenticate the connection, failing the connection. In the course of debugging, I found that some log messages from the YARN shuffle service were a little noisy, so I silenced some of them, and also added a couple of new ones that helped find this issue. On top of that, I found that a check in the code that records app secrets was wrong, causing more log spam and also using an O(n) operation instead of an O(1) call. Also added a new integration suite for the YARN shuffle service with auth on, and verified it failed before, and passes now. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18706 from vanzin/SPARK-21494. (cherry picked from commit 300807c6e3011e4d78c6cf750201d0ab8e5bdaf5) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 26 July 2017, 00:57:35 UTC
c91191b [SPARK-21447][WEB UI] Spark history server fails to render compressed inprogress history file in some cases. Add failure handling for EOFException that can be thrown during decompression of an inprogress spark history file, treat same as case where can't parse the last line. ## What changes were proposed in this pull request? Failure handling for case of EOFException thrown within the ReplayListenerBus.replay method to handle the case analogous to json parse fail case. This path can arise in compressed inprogress history files since an incomplete compression block could be read (not flushed by writer on a block boundary). See the stack trace of this occurrence in the jira ticket (https://issues.apache.org/jira/browse/SPARK-21447) ## How was this patch tested? Added a unit test that specifically targets validating the failure handling path appropriately when maybeTruncated is true and false. Author: Eric Vandenberg <ericvandenberg@fb.com> Closes #18673 from ericvandenbergfb/fix_inprogress_compr_history_file. (cherry picked from commit 06a9793793ca41dcef2f10ca06af091a57c721c4) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 July 2017, 18:45:46 UTC
e5ec339 [SPARK-21383][YARN] Fix the YarnAllocator allocates more Resource When NodeManagers are slow to launch executors, the `missing` value will exceed the real value, which can lead YARN to allocate more resources than needed. We add `numExecutorsRunning` when calculating `missing` to avoid this. Tested by experiment. Author: DjvuLee <lihu@bytedance.com> Closes #18651 from djvulee/YarnAllocate. (cherry picked from commit 8de080d9f9d3deac7745f9b3428d97595975701d) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 July 2017, 17:21:56 UTC
62ca13d [SPARK-20904][CORE] Don't report task failures to driver during shutdown. Executors run a thread pool with daemon threads to run tasks. This means that those threads remain active when the JVM is shutting down, meaning those tasks are affected by code that runs in shutdown hooks. So if a shutdown hook messes with something that the task is using (e.g. an HDFS connection), the task will fail and will report that failure to the driver. That will make the driver mark the task as failed regardless of what caused the executor to shut down. So, for example, if YARN pre-empted that executor, the driver would consider that task failed when it should instead ignore the failure. This change avoids reporting failures to the driver when shutdown hooks are executing; this fixes the YARN preemption accounting, and doesn't really change things much for other scenarios, other than reporting a more generic error ("Executor lost") when the executor shuts down unexpectedly - which is arguably more correct. Tested with a hacky app running on spark-shell that tried to cause failures only when shutdown hooks were running, verified that preemption didn't cause the app to fail because of task failures exceeding the threshold. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18594 from vanzin/SPARK-20904. (cherry picked from commit cecd285a2aabad4e7db5a3d18944b87fbc4eee6c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 23 July 2017, 15:23:59 UTC
da403b9 [SPARK-21434][PYTHON][DOCS] Add pyspark pip documentation. Update the Quickstart and RDD programming guides to mention pip. Built docs locally. Author: Holden Karau <holden@us.ibm.com> Closes #18698 from holdenk/SPARK-21434-add-pyspark-pip-documentation. (cherry picked from commit cc00e99d5396893b2d3d50960161080837cf950a) Signed-off-by: Holden Karau <holden@us.ibm.com> 21 July 2017, 23:53:39 UTC
88dccda [SPARK-21243][CORE] Limit no. of map outputs in a shuffle fetch For configurations with external shuffle enabled, we have observed that if a very large no. of blocks are being fetched from a remote host, it puts the NM under extra pressure and can crash it. This change introduces a configuration `spark.reducer.maxBlocksInFlightPerAddress`, to limit the no. of map outputs being fetched from a given remote address. The changes applied here are applicable to both scenarios - when external shuffle is enabled as well as disabled. Ran the job with the default configuration, which does not change the existing behavior, and ran it with a few configurations of lower values - 10, 20, 50, 100. The job ran fine and there is no change in the output. (I will update the metrics related to NM in some time.) Author: Dhruve Ashar <dhruveashar@gmail.com> Closes #18487 from dhruve/impr/SPARK-21243. Author: Dhruve Ashar <dhruveashar@gmail.com> Closes #18691 from dhruve/branch-2.2. 21 July 2017, 19:03:46 UTC
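A usage sketch for the new knob; the value shown is illustrative, and the default leaves the existing behavior unchanged.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("shuffle-fetch-limit-example")
  // Cap the number of map outputs fetched concurrently from any single remote address,
  // easing pressure on external shuffle services / NodeManagers.
  .set("spark.reducer.maxBlocksInFlightPerAddress", "100")
```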
9949fed [SPARK-21333][DOCS] Removed invalid joinTypes from javadoc of Dataset#joinWith ## What changes were proposed in this pull request? Two invalid join types were mistakenly listed in the javadoc for joinWith, in the Dataset class. I presume these were copied from the javadoc of join, but since joinWith returns a Dataset\<Tuple2\>, left_semi and left_anti are invalid, as they only return values from one of the datasets, instead of from both ## How was this patch tested? I ran the following code : ``` public static void main(String[] args) { SparkSession spark = new SparkSession(new SparkContext("local[*]", "Test")); Dataset<Row> one = spark.createDataFrame(Arrays.asList(new Bean(1), new Bean(2), new Bean(3), new Bean(4), new Bean(5)), Bean.class); Dataset<Row> two = spark.createDataFrame(Arrays.asList(new Bean(4), new Bean(5), new Bean(6), new Bean(7), new Bean(8), new Bean(9)), Bean.class); try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "inner").show();} catch (Exception e) {e.printStackTrace();} try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "cross").show();} catch (Exception e) {e.printStackTrace();} try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "outer").show();} catch (Exception e) {e.printStackTrace();} try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "full").show();} catch (Exception e) {e.printStackTrace();} try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "full_outer").show();} catch (Exception e) {e.printStackTrace();} try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left").show();} catch (Exception e) {e.printStackTrace();} try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_outer").show();} catch (Exception e) {e.printStackTrace();} try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "right").show();} catch (Exception e) {e.printStackTrace();} try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "right_outer").show();} catch (Exception e) {e.printStackTrace();} try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_semi").show();} catch (Exception e) {e.printStackTrace();} try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_anti").show();} catch (Exception e) {e.printStackTrace();} } ``` which tests all the different join types, and the last two (left_semi and left_anti) threw exceptions. The same code using join instead of joinWith did fine. The Bean class was just a java bean with a single int field, x. Author: Corey Woodfield <coreywoodfield@gmail.com> Closes #18462 from coreywoodfield/master. (cherry picked from commit 8cd9cdf17a7a4ad6f2eecd7c4b388ca363c20982) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 19 July 2017, 22:21:49 UTC
308bce0 [SPARK-21446][SQL] Fix setAutoCommit never executed ## What changes were proposed in this pull request? JIRA Issue: https://issues.apache.org/jira/browse/SPARK-21446 options.asConnectionProperties cannot contain fetchsize, because fetchsize belongs to the Spark-only options, and Spark-only options are excluded from the connection properties. So change the properties used by beforeFetch from options.asConnectionProperties.asScala.toMap to options.asProperties.asScala.toMap ## How was this patch tested? Author: DFFuture <albert.zhang23@gmail.com> Closes #18665 from DFFuture/sparksql_pg. (cherry picked from commit c9729187bcef78299390e53cd9af38c3e084060e) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 19 July 2017, 21:45:23 UTC
86cd3c0 [SPARK-21464][SS] Minimize deprecation warnings caused by ProcessingTime class ## What changes were proposed in this pull request? Use of `ProcessingTime` class was deprecated in favor of `Trigger.ProcessingTime` in Spark 2.2. However interval uses to ProcessingTime causes deprecation warnings during compilation. This cannot be avoided entirely as even though it is deprecated as a public API, ProcessingTime instances are used internally in TriggerExecutor. This PR is to minimize the warning by removing its uses from tests as much as possible. ## How was this patch tested? Existing tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #18678 from tdas/SPARK-21464. (cherry picked from commit 70fe99dc62ef636a99bcb8a580ad4de4dca95181) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 19 July 2017, 18:02:18 UTC
4c212ee [SPARK-21441][SQL] Incorrect Codegen in SortMergeJoinExec results failures in some cases ## What changes were proposed in this pull request? https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21441 This issue can be reproduced by the following example: ``` val spark = SparkSession .builder() .appName("smj-codegen") .master("local") .config("spark.sql.autoBroadcastJoinThreshold", "1") .getOrCreate() val df1 = spark.createDataFrame(Seq((1, 1), (2, 2), (3, 3))).toDF("key", "int") val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", "str") val df = df1.join(df2, df1("key") === df2("key")) .filter("int = 2 or reflect('java.lang.Integer', 'valueOf', str) = 1") .select("int") df.show() ``` To conclude, the issue happens when: (1) SortMergeJoin condition contains CodegenFallback expressions. (2) In PhysicalPlan tree, SortMergeJoin node is the child of root node, e.g., the Project in above example. This patch fixes the logic in `CollapseCodegenStages` rule. ## How was this patch tested? Unit test and manual verification in our cluster. Author: donnyzone <wellfengzhu@gmail.com> Closes #18656 from DonnyZone/Fix_SortMergeJoinExec. (cherry picked from commit 6b6dd682e84d3b03d0b15fbd81a0d16729e521d2) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 July 2017, 13:52:01 UTC
5a0a76f [SPARK-21414] Refine SlidingWindowFunctionFrame to avoid OOM. ## What changes were proposed in this pull request? In `SlidingWindowFunctionFrame`, it currently adds all rows to the buffer for which the input row value is equal to or less than the output row upper bound, and then drops all rows from the buffer for which the input row value is smaller than the output row lower bound. This could result in a very big buffer even though the window is small. For example: ``` select a, b, sum(a) over (partition by b order by a range between 1000000 following and 1000001 following) from table ``` We can refine the logic and just add the qualified rows into the buffer. ## How was this patch tested? Manual test: Run sql `select shop, shopInfo, district, sum(revenue) over(partition by district order by revenue range between 100 following and 200 following) from revenueList limit 10` against a table with 4 columns (shop: String, shopInfo: String, district: String, revenue: Int). The biggest partition is around 2G bytes, containing 200k lines. Configure the executor with 2G bytes memory. With the change in this PR, it works fine. Without this change, the exception below is thrown. ``` MemoryError: Java heap space at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:504) at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:62) at org.apache.spark.sql.execution.window.SlidingWindowFunctionFrame.write(WindowFunctionFrame.scala:201) at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:365) at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:289) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ``` Author: jinxing <jinxing6042@126.com> Closes #18634 from jinxing64/SPARK-21414. (cherry picked from commit 4eb081cc870a9d1c42aae90418535f7d782553e9) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 July 2017, 13:35:59 UTC
df061fd [SPARK-21457][SQL] ExternalCatalog.listPartitions should correctly handle partition values with dot ## What changes were proposed in this pull request? When we list partitions from hive metastore with a partial partition spec, we are expecting exact matching according to the partition values. However, hive treats dot specially and match any single character for dot. We should do an extra filter to drop unexpected partitions. ## How was this patch tested? new regression test. Author: Wenchen Fan <wenchen@databricks.com> Closes #18671 from cloud-fan/hive. (cherry picked from commit f18b905f6cace7686ef169fda7de474079d0af23) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 18 July 2017, 22:56:27 UTC
99ce551 [SPARK-21445] Make IntWrapper and LongWrapper in UTF8String Serializable ## What changes were proposed in this pull request? Making those two classes will avoid Serialization issues like below: ``` Caused by: java.io.NotSerializableException: org.apache.spark.unsafe.types.UTF8String$IntWrapper Serialization stack: - object not serializable (class: org.apache.spark.unsafe.types.UTF8String$IntWrapper, value: org.apache.spark.unsafe.types.UTF8String$IntWrapper326450e) - field (class: org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$1, name: result$2, type: class org.apache.spark.unsafe.types.UTF8String$IntWrapper) - object (class org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$1, <function1>) ``` ## How was this patch tested? - [x] Manual testing - [ ] Unit test Author: Burak Yavuz <brkyvz@gmail.com> Closes #18660 from brkyvz/serializableutf8. (cherry picked from commit 26cd2ca0402d7d49780116d45a5622a45c79f661) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 July 2017, 04:09:27 UTC
83bdb04 [SPARK-21332][SQL] Incorrect result type inferred for some decimal expressions ## What changes were proposed in this pull request? This PR changes the direction of expression transformation in the DecimalPrecision rule. Previously, the expressions were transformed down, which led to incorrect result types when decimal expressions had other decimal expressions as their operands. The root cause of this issue was in visiting outer nodes before their children. Consider the example below: ``` val inputSchema = StructType(StructField("col", DecimalType(26, 6)) :: Nil) val sc = spark.sparkContext val rdd = sc.parallelize(1 to 2).map(_ => Row(BigDecimal(12))) val df = spark.createDataFrame(rdd, inputSchema) // Works correctly since no nested decimal expression is involved // Expected result type: (26, 6) * (26, 6) = (38, 12) df.select($"col" * $"col").explain(true) df.select($"col" * $"col").printSchema() // Gives a wrong result since there is a nested decimal expression that should be visited first // Expected result type: ((26, 6) * (26, 6)) * (26, 6) = (38, 12) * (26, 6) = (38, 18) df.select($"col" * $"col" * $"col").explain(true) df.select($"col" * $"col" * $"col").printSchema() ``` The example above gives the following output: ``` // Correct result without sub-expressions == Parsed Logical Plan == 'Project [('col * 'col) AS (col * col)#4] +- LogicalRDD [col#1] == Analyzed Logical Plan == (col * col): decimal(38,12) Project [CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS (col * col)#4] +- LogicalRDD [col#1] == Optimized Logical Plan == Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4] +- LogicalRDD [col#1] == Physical Plan == *Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4] +- Scan ExistingRDD[col#1] // Schema root |-- (col * col): decimal(38,12) (nullable = true) // Incorrect result with sub-expressions == Parsed Logical Plan == 'Project [(('col * 'col) * 'col) AS ((col * col) * col)#11] +- LogicalRDD [col#1] == Analyzed Logical Plan == ((col * col) * col): decimal(38,12) Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS ((col * col) * col)#11] +- LogicalRDD [col#1] == Optimized Logical Plan == Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11] +- LogicalRDD [col#1] == Physical Plan == *Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11] +- Scan ExistingRDD[col#1] // Schema root |-- ((col * col) * col): decimal(38,12) (nullable = true) ``` ## How was this patch tested? This PR was tested with available unit tests. Moreover, there are tests to cover previously failing scenarios. Author: aokolnychyi <anton.okolnychyi@sap.com> Closes #18583 from aokolnychyi/spark-21332. (cherry picked from commit 0be5fb41a6b7ef4da9ba36f3604ac646cb6d4ae3) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 18 July 2017, 04:08:03 UTC
0ef98fd [SPARK-21321][SPARK CORE] Spark very verbose on shutdown ## What changes were proposed in this pull request? The current code is very verbose on shutdown. The change I propose is to lower the log level when the driver is shutting down and the RPC connections are closed (RpcEnvStoppedException). ## How was this patch tested? Tested with word count (deploy-mode = cluster, master = yarn, num-executors = 4) with 300GB of data. Author: John Lee <jlee2@yahoo-inc.com> Closes #18547 from yoonlee95/SPARK-21321. (cherry picked from commit 0e07a29cf4a5587f939585e6885ed0f7e68c31b5) Signed-off-by: Tom Graves <tgraves@yahoo-inc.com> 17 July 2017, 18:14:25 UTC
8e85ce6 [SPARK-21267][DOCS][MINOR] Follow up to avoid referencing programming-guide redirector ## What changes were proposed in this pull request? Update internal references from programming-guide to rdd-programming-guide See https://github.com/apache/spark-website/commit/5ddf243fd84a0f0f98a5193a207737cea9cdc083 and https://github.com/apache/spark/pull/18485#issuecomment-314789751 Let's keep the redirector even if it's problematic to build, but not rely on it internally. ## How was this patch tested? (Doc build) Author: Sean Owen <sowen@cloudera.com> Closes #18625 from srowen/SPARK-21267.2. (cherry picked from commit 74ac1fb081e9532d77278a4edca9f3f129fd62eb) Signed-off-by: Sean Owen <sowen@cloudera.com> 15 July 2017, 08:22:06 UTC
1cb4369 [SPARK-21344][SQL] BinaryType comparison does signed byte array comparison ## What changes were proposed in this pull request? This PR fixes a wrong comparison for `BinaryType`. This PR enables unsigned comparison and unsigned prefix generation for an array for `BinaryType`. Previous implementations uses signed operations. ## How was this patch tested? Added a test suite in `OrderingSuite`. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18571 from kiszk/SPARK-21344. (cherry picked from commit ac5d5d795909061a17e056696cf0ef87d9e65dd1) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 15 July 2017, 03:16:16 UTC
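A tiny worked example of why signed byte comparison is wrong for binary data: the byte 0x80 must sort after 0x7F when bytes are treated as unsigned, but signed comparison puts it first.

```scala
val a: Byte = 0x7f.toByte   //  127 signed, 127 unsigned
val b: Byte = 0x80.toByte   // -128 signed, 128 unsigned

// Signed comparison (the old, incorrect behaviour for binary data): claims a > b.
println(java.lang.Byte.compare(a, b))                                        // positive

// Unsigned comparison (the intended ordering for binary data): a < b.
println(java.lang.Byte.toUnsignedInt(a) - java.lang.Byte.toUnsignedInt(b))   // negative
```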
bfe3ba8 [SPARK-21376][YARN] Fix yarn client token expire issue when cleaning the staging files in long running scenario ## What changes were proposed in this pull request? This issue happens in long running application with yarn cluster mode, because yarn#client doesn't sync token with AM, so it will always keep the initial token, this token may be expired in the long running scenario, so when yarn#client tries to clean up staging directory after application finished, it will use this expired token and meet token expire issue. ## How was this patch tested? Manual verification is secure cluster. Author: jerryshao <sshao@hortonworks.com> Closes #18617 from jerryshao/SPARK-21376. (cherry picked from commit cb8d5cc90ff8d3c991ff33da41b136ab7634f71b) 14 July 2017, 17:51:28 UTC
cf0719b Revert "[SPARK-18646][REPL] Set parent classloader as null for ExecutorClassLoader" This reverts commit 39eba3053ac99f03d9df56471bae5fc5cc9f4462. 13 July 2017, 00:34:42 UTC
39eba30 [SPARK-18646][REPL] Set parent classloader as null for ExecutorClassLoader ## What changes were proposed in this pull request? `ClassLoader` will preferentially load class from `parent`. Only when `parent` is null or the load failed, that it will call the overridden `findClass` function. To avoid the potential issue caused by loading class using inappropriate class loader, we should set the `parent` of `ClassLoader` to null, so that we can fully control which class loader is used. This is take over of #17074, the primary author of this PR is taroplus . Should close #17074 after this PR get merged. ## How was this patch tested? Add test case in `ExecutorClassLoaderSuite`. Author: Kohki Nishio <taroplus@me.com> Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18614 from jiangxb1987/executor_classloader. (cherry picked from commit e08d06b37bc96cc48fec1c5e40f73e0bca09c616) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 13 July 2017, 00:22:53 UTC
cb6fc89 [SPARK-21219][CORE] Task retry occurs on same executor due to race condition with blacklisting There's a race condition in the current TaskSetManager where a failed task is added for retry (addPendingTask) and can asynchronously be assigned to an executor *prior* to the blacklist state being updated (updateBlacklistForFailedTask); the result is that the task might re-execute on the same executor. This is particularly problematic if the executor is shutting down, since the retry task immediately becomes a lost task (ExecutorLostFailure). Another side effect is that the actual failure reason gets obscured by the retry task, which never actually executed. There are sample logs showing the issue in https://issues.apache.org/jira/browse/SPARK-21219 The fix is to change the ordering of the addPendingTask and updateBlacklistForFailedTask calls in TaskSetManager.handleFailedTask Implemented a unit test that verifies the task is blacklisted before it is added to the pending tasks. Ran the unit test without the fix and it fails. Ran the unit test with the fix and it passes. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Eric Vandenberg <ericvandenberg@fb.com> Closes #18427 from ericvandenbergfb/blacklistFix. ## What changes were proposed in this pull request? This is a backport of the fix to SPARK-21219, already checked in as 96d58f2. ## How was this patch tested? Ran TaskSetManagerSuite tests locally. Author: Eric Vandenberg <ericvandenberg@fb.com> Closes #18604 from jsoltren/branch-2.2. 12 July 2017, 06:49:15 UTC
399aa01 [SPARK-21366][SQL][TEST] Add sql test for window functions ## What changes were proposed in this pull request? Add sql test for window functions, also remove uncecessary test cases in `WindowQuerySuite`. ## How was this patch tested? Added `window.sql` and the corresponding output file. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18591 from jiangxb1987/window. (cherry picked from commit 66d21686556681457aab6e44e19f5614c5635f0c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 11 July 2017, 13:53:10 UTC
edcd9fb [SPARK-21369][CORE] Don't use Scala Tuple2 in common/network-* ## What changes were proposed in this pull request? Remove all usages of Scala Tuple2 from common/network-* projects. Otherwise, Yarn users cannot use `spark.reducer.maxReqSizeShuffleToMem`. ## How was this patch tested? Jenkins. Author: Shixiong Zhu <shixiong@databricks.com> Closes #18593 from zsxwing/SPARK-21369. (cherry picked from commit 833eab2c9bd273ee9577fbf9e480d3e3a4b7d203) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 11 July 2017, 03:26:28 UTC
a05edf4 [SPARK-21272] SortMergeJoin LeftAnti does not update numOutputRows ## What changes were proposed in this pull request? Updating numOutputRows metric was missing from one return path of LeftAnti SortMergeJoin. ## How was this patch tested? Non-zero output rows manually seen in metrics. Author: Juliusz Sompolski <julek@databricks.com> Closes #18494 from juliuszsompolski/SPARK-21272. 10 July 2017, 16:30:55 UTC
40fd0ce [SPARK-21342] Fix DownloadCallback to work well with RetryingBlockFetcher. When `RetryingBlockFetcher` retries fetching blocks. There could be two `DownloadCallback`s download the same content to the same target file. It could cause `ShuffleBlockFetcherIterator` reading a partial result. This pr proposes to create and delete the tmp files in `OneForOneBlockFetcher` Author: jinxing <jinxing6042@126.com> Author: Shixiong Zhu <zsxwing@gmail.com> Closes #18565 from jinxing64/SPARK-21342. (cherry picked from commit 6a06c4b03c4dd86241fb9d11b4360371488f0e53) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 July 2017, 13:10:02 UTC
3bfad9d [SPARK-21083][SQL][BRANCH-2.2] Store zero size and row count when analyzing empty table ## What changes were proposed in this pull request? We should be able to store zero size and row count after analyzing empty table. This is a backport for https://github.com/apache/spark/commit/9fccc3627fa41d32fbae6dbbb9bd1521e43eb4f0. ## How was this patch tested? Added new test. Author: Zhenhua Wang <wangzhenhua@huawei.com> Closes #18575 from wzhfy/analyzeEmptyTable-2.2. 09 July 2017, 10:51:06 UTC
964332b [SPARK-21343] Refine the document for spark.reducer.maxReqSizeShuffleToMem. ## What changes were proposed in this pull request? In the current code, the reducer can break the old shuffle service when `spark.reducer.maxReqSizeShuffleToMem` is enabled. Let's refine the document. Author: jinxing <jinxing6042@126.com> Closes #18566 from jinxing64/SPARK-21343. (cherry picked from commit 062c336d06a0bd4e740a18d2349e03e311509243) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 08 July 2017, 16:28:14 UTC
c8d7855 [SPARK-20342][CORE] Update task accumulators before sending task end event. This makes sure that listeners get updated task information; otherwise it's possible to write incomplete task information into event logs, for example, making the information in a replayed UI inconsistent with the original application. Added a new unit test to try to detect the problem; it's not guaranteed to fail since it's a race, but it fails pretty reliably for me without the scheduler changes. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18393 from vanzin/SPARK-20342.try2. (cherry picked from commit 9131bdb7e12bcfb2cb699b3438f554604e28aaa8) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 08 July 2017, 16:25:09 UTC
a64f108 [SPARK-21345][SQL][TEST][TEST-MAVEN] SparkSessionBuilderSuite should clean up stopped sessions. `SparkSessionBuilderSuite` should clean up stopped sessions. Otherwise, it leaves behind some stopped `SparkContext`s that interfere with other test suites using `SharedSQLContext`. Recently, the master branch has been failing consecutively. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/ Pass Jenkins with an updated suite. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #18567 from dongjoon-hyun/SPARK-SESSION. (cherry picked from commit 0b8dd2d08460f3e6eb578727d2c336b6f11959e7) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 08 July 2017, 12:21:18 UTC
7d0b1c9 [SPARK-21228][SQL][BRANCH-2.2] InSet incorrect handling of structs ## What changes were proposed in this pull request? This is backport of https://github.com/apache/spark/pull/18455 When data type is struct, InSet now uses TypeUtils.getInterpretedOrdering (similar to EqualTo) to build a TreeSet. In other cases it will use a HashSet as before (which should be faster). Similarly, In.eval uses Ordering.equiv instead of equals. ## How was this patch tested? New test in SQLQuerySuite. Author: Bogdan Raducanu <bogdan@databricks.com> Closes #18563 from bogdanrdc/SPARK-21228-BRANCH2.2. 08 July 2017, 12:14:59 UTC
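To illustrate the idea behind the InSet change above, here is a minimal, self-contained Scala sketch (not Spark's actual implementation): set membership for row-like values is decided by an explicit ordering instead of relying on equals/hashCode, which mirrors switching from a HashSet to a TreeSet built with an interpreted ordering.

```scala
import scala.collection.immutable.TreeSet

// Hypothetical stand-in for a struct value; Spark's real code works on InternalRow.
case class Row2(a: Int, b: String)

// Membership decided by an explicit Ordering, analogous in spirit to
// TypeUtils.getInterpretedOrdering for struct types.
val rowOrdering: Ordering[Row2] = Ordering.by(r => (r.a, r.b))

val hashBased = Set(Row2(1, "x"), Row2(2, "y"))                  // equals/hashCode based
val treeBased = TreeSet(Row2(1, "x"), Row2(2, "y"))(rowOrdering) // ordering based

assert(hashBased.contains(Row2(1, "x")))
assert(treeBased.contains(Row2(1, "x")))
```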
ab12848 [SPARK-21069][SS][DOCS] Add rate source to programming guide. ## What changes were proposed in this pull request? SPARK-20979 added a new structured streaming source: Rate source. This patch adds the corresponding documentation to programming guide. ## How was this patch tested? Tested by running jekyll locally. Author: Prashant Sharma <prashant@apache.org> Author: Prashant Sharma <prashsh1@in.ibm.com> Closes #18562 from ScrapCodes/spark-21069/rate-source-docs. (cherry picked from commit d0bfc6733521709e453d643582df2bdd68f28de7) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 08 July 2017, 06:33:20 UTC
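Based on the documentation added above, a minimal Scala usage sketch of the rate source; the option value and console sink are just for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rate-source-demo").master("local[2]").getOrCreate()

// The rate source generates rows with `timestamp` and `value` columns at a fixed rate.
val rates = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

val query = rates.writeStream
  .format("console")
  .start()

query.awaitTermination(10000) // run for ~10 seconds, then stop
query.stop()
```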
576fd4c [SPARK-21267][SS][DOCS] Update Structured Streaming Documentation ## What changes were proposed in this pull request? A few changes to the Structured Streaming documentation: - Clarify that the entire stream input table is not materialized - Add information for Ganglia - Add the Kafka Sink to the main docs - Remove a couple of leftover experimental tags - Add more associated reading material and talk videos. In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those (cc sameeragarwal, cloud-fan): - Added a redirection to avoid breaking internal and possibly external links. - Removed unnecessary redirection pages that had been there since the separate Scala, Java, and Python programming guides were merged together in 2013 or 2014. ## How was this patch tested? Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #18485 from tdas/SPARK-21267. (cherry picked from commit 0217dfd26f89133f146197359b556c9bf5aca172) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 07 July 2017, 00:28:28 UTC
4e53a4e [SS][MINOR] Fix flaky test in DatastreamReaderWriterSuite. temp checkpoint dir should be deleted ## What changes were proposed in this pull request? Stopping query while it is being initialized can throw interrupt exception, in which case temporary checkpoint directories will not be deleted, and the test will fail. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #18442 from tdas/DatastreamReaderWriterSuite-fix. (cherry picked from commit 60043f22458668ac7ecba94fa78953f23a6bdcec) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 06 July 2017, 07:20:40 UTC
6e1081c [SPARK-21312][SQL] correct offsetInBytes in UnsafeRow.writeToStream ## What changes were proposed in this pull request? Corrects the offsetInBytes calculation in UnsafeRow.writeToStream. Known failures include writes to some DataSources that have their own SparkPlan implementations and cause an EXCHANGE in writes. ## How was this patch tested? Extended UnsafeRowSuite.writeToStream to include an UnsafeRow over a byte array with a non-zero offset. Author: Sumedh Wale <swale@snappydata.io> Closes #18535 from sumwale/SPARK-21312. (cherry picked from commit 14a3bb3a008c302aac908d7deaf0942a98c63be7) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 July 2017, 06:47:43 UTC
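The bug class is easiest to see with a plain byte array: when a row's bytes sit inside a larger backing array at a non-zero offset, a write must start from that offset rather than from position 0. A minimal, self-contained sketch, independent of Spark's UnsafeRow internals:

```scala
import java.io.ByteArrayOutputStream

val backing = new Array[Byte](32)
val payload = "row-bytes".getBytes("UTF-8")
val baseOffset = 8 // the row does not start at index 0 of the backing array

System.arraycopy(payload, 0, backing, baseOffset, payload.length)

val out = new ByteArrayOutputStream()
out.write(backing, baseOffset, payload.length) // honor the offset when writing
assert(new String(out.toByteArray, "UTF-8") == "row-bytes")
```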
770fd2a [SPARK-21300][SQL] ExternalMapToCatalyst should null-check map key prior to converting to internal value. ## What changes were proposed in this pull request? `ExternalMapToCatalyst` should null-check map key prior to converting to internal value to throw an appropriate Exception instead of something like NPE. ## How was this patch tested? Added a test and existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #18524 from ueshin/issues/SPARK-21300. (cherry picked from commit ce10545d3401c555e56a214b7c2f334274803660) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 July 2017, 03:24:55 UTC
db21b67 [SPARK-20256][SQL] SessionState should be created more lazily ## What changes were proposed in this pull request? `SessionState` is designed to be created lazily. However, in reality, it is created immediately in `SparkSession.Builder.getOrCreate` ([here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L943)). This PR aims to recover the lazy behavior by keeping the options in `initialSessionOptions`. The benefit is the following: users can start `spark-shell` and use RDD operations without any problems. **BEFORE** ```scala $ bin/spark-shell java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder' ... Caused by: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.security.AccessControlException: Permission denied: user=spark, access=READ, inode="/apps/hive/warehouse":hive:hdfs:drwx------ ``` As reported in SPARK-20256, this happens when the warehouse directory is not accessible to this user. **AFTER** ```scala $ bin/spark-shell ... Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112) Type in expressions to have them evaluated. Type :help for more information. scala> sc.range(0, 10, 1).count() res0: Long = 10 ``` ## How was this patch tested? Manual. This closes #18512. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #18501 from dongjoon-hyun/SPARK-20256. (cherry picked from commit 1b50e0e0d6fd9d1b815a3bb37647ea659222e3f1) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 04 July 2017, 16:48:58 UTC
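A minimal sketch of the lazy-initialization pattern this patch restores; the class and field names here are hypothetical, not Spark's actual code. The point is that expensive, failure-prone state is built only on first access, so callers that never touch it are unaffected by its failures.

```scala
class Builder {
  private val options = scala.collection.mutable.HashMap.empty[String, String]

  def config(key: String, value: String): Builder = { options += key -> value; this }

  // No session state is constructed here; only the options are captured.
  def getOrCreate(): Session = new Session(options.toMap)
}

class Session(initialSessionOptions: Map[String, String]) {
  // Built lazily on first access, mirroring how SessionState becomes lazy.
  lazy val sessionState: Map[String, String] = {
    // Hypothetical stand-in for the expensive (and possibly failing) initialization.
    initialSessionOptions
  }
}

// Work that never touches sessionState can no longer fail on its construction.
val session = new Builder().config("spark.some.option", "value").getOrCreate()
```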
6fd39ea [SPARK-21170][CORE] Utils.tryWithSafeFinallyAndFailureCallbacks throws IllegalArgumentException: Self-suppression not permitted ## What changes were proposed in this pull request? Do not add the exception to the suppressed exceptions if it is the same instance as originalThrowable. ## How was this patch tested? Added new tests to verify this; these tests fail without the source code changes and pass with the change. Author: Devaraj K <devaraj@apache.org> Closes #18384 from devaraj-kavali/SPARK-21170. (cherry picked from commit 6beca9ce94f484de2f9ffb946bef8334781b3122) Signed-off-by: Sean Owen <sowen@cloudera.com> 01 July 2017, 14:54:18 UTC
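The underlying JVM rule is that a throwable cannot suppress itself: `Throwable.addSuppressed(t)` throws IllegalArgumentException ("Self-suppression not permitted") when `t` is the same instance. A small Scala sketch of the guard, simplified rather than the actual Utils code:

```scala
def addSuppressedIfDistinct(original: Throwable, fromFinally: Throwable): Unit = {
  // Only suppress a *different* instance; suppressing itself is illegal.
  if (fromFinally ne original) {
    original.addSuppressed(fromFinally)
  }
}

val original = new RuntimeException("task failed")
addSuppressedIfDistinct(original, original)                                    // no-op instead of IllegalArgumentException
addSuppressedIfDistinct(original, new IllegalStateException("cleanup failed")) // recorded normally
assert(original.getSuppressed.length == 1)
```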
85fddf4 Preparing development version 2.2.1-SNAPSHOT 30 June 2017, 22:54:39 UTC
a2c7b21 Preparing Spark release v2.2.0-rc6 30 June 2017, 22:54:34 UTC
29a0be2 [SPARK-21129][SQL] Arguments of SQL function call should not be named expressions ### What changes were proposed in this pull request? Function argument should not be named expressions. It could cause two issues: - Misleading error message - Unexpected query results when the column name is `distinct`, which is not a reserved word in our parser. ``` spark-sql> select count(distinct c1, distinct c2) from t1; Error in query: cannot resolve '`distinct`' given input columns: [c1, c2]; line 1 pos 26; 'Project [unresolvedalias('count(c1#30, 'distinct), None)] +- SubqueryAlias t1 +- CatalogRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#30, c2#31] ``` After the fix, the error message becomes ``` spark-sql> select count(distinct c1, distinct c2) from t1; Error in query: extraneous input 'c2' expecting {')', ',', '.', '[', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^'}(line 1, pos 35) == SQL == select count(distinct c1, distinct c2) from t1 -----------------------------------^^^ ``` ### How was this patch tested? Added a test case to parser suite. Author: Xiao Li <gatorsmile@gmail.com> Author: gatorsmile <gatorsmile@gmail.com> Closes #18338 from gatorsmile/parserDistinctAggFunc. (cherry picked from commit eed9c4ef859fdb75a816a3e0ce2d593b34b23444) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 30 June 2017, 21:24:05 UTC
8b08fd0 [SPARK-21258][SQL] Fix WindowExec complex object aggregation with spilling ## What changes were proposed in this pull request? `WindowExec` currently improperly stores complex objects (UnsafeRow, UnsafeArrayData, UnsafeMapData, UTF8String) during aggregation by keeping a reference in the buffer used by `GeneratedMutableProjections` to the actual input data. Things go wrong when the input object (or its backing bytes) is reused for other things. This can happen in window functions when they start spilling to disk. When reading back the spill files, the `UnsafeSorterSpillReader` reuses the buffer to which the `UnsafeRow` points, leading to weird corruption scenarios. Note that this only happens for aggregate functions that preserve (parts of) their input, for example `FIRST`, `LAST`, `MIN` & `MAX`. This was not seen before, because the spilling logic was not doing actual spills as much and actually used an in-memory page. This page was not cleaned up during window processing and made sure unsafe objects pointed to their own dedicated memory location. This was changed by https://github.com/apache/spark/pull/16909; after that PR, Spark spills more eagerly. This PR provides a surgical fix because we are close to releasing Spark 2.2. The change just makes sure that there cannot be any object reuse, at the expense of a little bit of performance. We will follow up with a more subtle solution at a later point. ## How was this patch tested? Added a regression test to `DataFrameWindowFunctionsSuite`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #18470 from hvanhovell/SPARK-21258. (cherry picked from commit e2f32ee45ac907f1f53fde7e412676a849a94872) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 June 2017, 04:34:28 UTC
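The reuse hazard described above is generic and can be shown without Spark: if a consumer buffers a reference to an object that the producer keeps mutating, previously "stored" values silently change; copying before buffering is the essence of the surgical fix. A minimal sketch:

```scala
import scala.collection.mutable.ArrayBuffer

final class ReusableRow(var value: Int) {
  def copy(): ReusableRow = new ReusableRow(value)
}

val shared      = new ReusableRow(0)
val byReference = ArrayBuffer.empty[ReusableRow]
val byCopy      = ArrayBuffer.empty[ReusableRow]

for (i <- 1 to 3) {
  shared.value = i
  byReference += shared        // keeps a reference to the reused object
  byCopy      += shared.copy() // materializes a private copy before buffering
}

assert(byReference.map(_.value) == Seq(3, 3, 3)) // corrupted by reuse
assert(byCopy.map(_.value)      == Seq(1, 2, 3)) // values preserved
```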
d16e262 [SPARK-21253][CORE][HOTFIX] Fix Scala 2.10 build ## What changes were proposed in this pull request? A follow up PR to fix Scala 2.10 build for #18472 ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #18478 from zsxwing/SPARK-21253-2. (cherry picked from commit cfc696f4a4289acf132cb26baf7c02c5b6305277) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 30 June 2017, 03:56:44 UTC
c6ba647 [SPARK-21176][WEB UI] Limit number of selector threads for admin ui proxy servlets to 8 ## What changes were proposed in this pull request? Please see also https://issues.apache.org/jira/browse/SPARK-21176 This change limits the number of selector threads that Jetty creates to a maximum of 8 per proxy servlet (the Jetty default is the number of processors / 2). The newHttpClient method of Jetty's ProxyServlet class is overridden to avoid the Jetty defaults (which are designed for high-performance HTTP servers). Once https://github.com/eclipse/jetty.project/issues/1643 is available, the code could be cleaned up to avoid the method override. I really need this on v2.1.1 - what is the best way for a backport (automatic merge works fine)? Shall I create another PR? ## How was this patch tested? The patch was tested manually on a Spark cluster with a head node that has 88 processors, using JMX to verify that the number of selector threads is now limited to 8 per proxy. gurvindersingh, zsxwing, can you please review the change? Author: IngoSchuster <ingo.schuster@de.ibm.com> Author: Ingo Schuster <ingo.schuster@de.ibm.com> Closes #18437 from IngoSchuster/master. (cherry picked from commit 88a536babf119b7e331d02aac5d52b57658803bf) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 June 2017, 03:16:19 UTC
8de67e3 [SPARK-21253][CORE] Disable spark.reducer.maxReqSizeShuffleToMem Disable spark.reducer.maxReqSizeShuffleToMem because it breaks the old shuffle service. Credits to wangyum Closes #18466 Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Author: Yuming Wang <wgyumg@gmail.com> Closes #18467 from zsxwing/SPARK-21253. (cherry picked from commit 80f7ac3a601709dd9471092244612023363f54cd) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 June 2017, 03:04:27 UTC
20cf511 [SPARK-21253][CORE] Fix a bug that StreamCallback may not be notified if network errors happen ## What changes were proposed in this pull request? If a network error happens before processing StreamResponse/StreamFailure events, StreamCallback.onFailure won't be called. This PR fixes `failOutstandingRequests` to also notify outstanding StreamCallbacks. ## How was this patch tested? The new unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #18472 from zsxwing/fix-stream-2. (cherry picked from commit 4996c53949376153f9ebdc74524fed7226968808) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 June 2017, 02:57:21 UTC
17a04b9 [SPARK-21210][DOC][ML] Javadoc 8 fixes for ML shared param traits PR #15999 included fixes for doc strings in the ML shared param traits (occurrences of `>` and `>=`). This PR simply uses the HTML-escaped version of the param doc to embed into the Scaladoc, to ensure that when `SharedParamsCodeGen` is run, the generated javadoc will be compliant for Java 8. ## How was this patch tested? Existing tests Author: Nick Pentreath <nickp@za.ibm.com> Closes #18420 from MLnick/shared-params-javadoc8. (cherry picked from commit 70085e83d1ee728b23f7df15f570eb8d77f67a7a) Signed-off-by: Sean Owen <sowen@cloudera.com> 29 June 2017, 08:51:21 UTC
970f68c [SPARK-19104][SQL] Lambda variables in ExternalMapToCatalyst should be global The issue happens in `ExternalMapToCatalyst`. For example, the following code creates `ExternalMapToCatalyst` to convert a Scala Map to the catalyst map format. ``` val data = Seq.tabulate(10)(i => NestedData(1, Map("key" -> InnerData("name", i + 100)))) val ds = spark.createDataset(data) ``` The `valueConverter` in `ExternalMapToCatalyst` looks like: ``` if (isnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true))) null else named_struct(name, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).name, true), value, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).value) ``` There is a `CreateNamedStruct` expression (`named_struct`) to create a row of `InnerData.name` and `InnerData.value` that are referred to by `ExternalMapToCatalyst_value52`. Because `ExternalMapToCatalyst_value52` is a local variable, when `CreateNamedStruct` splits expressions into individual functions, the local variable can no longer be accessed. Jenkins tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18418 from viirya/SPARK-19104. (cherry picked from commit fd8c931a30a084ee981b75aa469fc97dda6cfaa9) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 June 2017, 16:59:11 UTC
d8e3a4a [SPARK-21079][SQL] Calculate total size of a partition table as a sum of individual partitions ## What changes were proposed in this pull request? Storage URI of a partitioned table may or may not point to a directory under which individual partitions are stored. In fact, individual partitions may be located in totally unrelated directories. Before this change, ANALYZE TABLE table COMPUTE STATISTICS command calculated total size of a table by adding up sizes of files found under table's storage URI. This calculation could produce 0 if partitions are stored elsewhere. This change uses storage URIs of individual partitions to calculate the sizes of all partitions of a table and adds these up to produce the total size of a table. CC: wzhfy ## How was this patch tested? Added unit test. Ran ANALYZE TABLE xxx COMPUTE STATISTICS on a partitioned Hive table and verified that sizeInBytes is calculated correctly. Before this change, the size would be zero. Author: Masha Basmanova <mbasmanova@fb.com> Closes #18309 from mbasmanova/mbasmanova-analyze-part-table. (cherry picked from commit b449a1d6aa322a50cf221cd7a2ae85a91d6c7e9f) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 25 June 2017, 05:49:53 UTC
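A hedged sketch of how the corrected statistic can be exercised; the table name below is hypothetical, and a Hive-enabled session with an existing partitioned table is assumed:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("analyze-partitioned-table")
  .enableHiveSupport()
  .getOrCreate()

// After this fix, the computed total size should be the sum of all partition
// locations, even when partitions live outside the table's root directory.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("DESCRIBE EXTENDED sales").show(100, false)
```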
ad44ab5 [SPARK-21203][SQL] Fix wrong results of insertion of Array of Struct ### What changes were proposed in this pull request? ```SQL CREATE TABLE `tab1` (`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>) USING parquet INSERT INTO `tab1` SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b')) SELECT custom_fields.id, custom_fields.value FROM tab1 ``` The above query always returns the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is that we always use the same `GenericInternalRow` object when doing the cast. ### How was this patch tested? Author: gatorsmile <gatorsmile@gmail.com> Closes #18412 from gatorsmile/castStruct. (cherry picked from commit 2e1586f60a77ea0adb6f3f68ba74323f0c242199) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 June 2017, 14:36:14 UTC
96c04f1 [SPARK-21159][CORE] Don't try to connect to launcher in standalone cluster mode. Monitoring for standalone cluster mode is not implemented (see SPARK-11033), but the same scheduler implementation is used, and if it tries to connect to the launcher it will fail. So fix the scheduler so it only tries that in client mode; cluster mode applications will be correctly launched and will work, but monitoring through the launcher handle will not be available. Tested by running a cluster mode app with "SparkLauncher.startApplication". Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18397 from vanzin/SPARK-21159. (cherry picked from commit bfd73a7c48b87456d1b84d826e04eca938a1be64) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 June 2017, 05:36:08 UTC
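For context, a hedged sketch of the launch path this fixes: submitting a standalone cluster-mode application through `SparkLauncher`. The master URL, jar path, and main class below are placeholders; as noted above, the returned handle can launch the app but cannot monitor it in this mode.

```scala
import org.apache.spark.launcher.SparkLauncher

val handle = new SparkLauncher()
  .setMaster("spark://master-host:7077") // hypothetical standalone master
  .setDeployMode("cluster")
  .setAppResource("/path/to/app.jar")    // hypothetical application jar
  .setMainClass("com.example.Main")      // hypothetical main class
  .startApplication()

// State transitions are not reported back in standalone cluster mode.
println(s"Launcher handle state: ${handle.getState}")
```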
a3088d2 [SPARK-20555][SQL] Fix mapping of Oracle DECIMAL types to Spark types in read path ## What changes were proposed in this pull request? This PR is to revert some code changes in the read path of https://github.com/apache/spark/pull/14377. The original fix is https://github.com/apache/spark/pull/17830 When merging this PR, please give the credit to gaborfeher ## How was this patch tested? Added a test case to OracleIntegrationSuite.scala Author: Gabor Feher <gabor.feher@lynxanalytics.com> Author: gatorsmile <gatorsmile@gmail.com> Closes #18408 from gatorsmile/OracleType. (cherry picked from commit b837bf9ae97cf7ee7558c10a5a34636e69367a05) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 24 June 2017, 04:53:53 UTC
3394b06 [MINOR][DOCS] Docs in DataFrameNaFunctions.scala use wrong method ## What changes were proposed in this pull request? * Following the first few examples in this file, the remaining methods should also be methods of `df.na` not `df`. * Filled in some missing parentheses ## How was this patch tested? N/A Author: Ong Ming Yang <me@ongmingyang.com> Closes #18398 from ongmingyang/master. (cherry picked from commit 4cc62951a2b12a372a2b267bf8597a0a31e2b2cb) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 23 June 2017, 17:57:08 UTC
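A small usage sketch of the convention the corrected docs describe: the null-handling methods hang off `df.na` (a `DataFrameNaFunctions` instance), not off `df` itself.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("na-functions-demo").master("local[1]").getOrCreate()
import spark.implicits._

val df = Seq((Some(1), "a"), (None, "b")).toDF("x", "y")

df.na.drop().show()              // drop rows containing any null
df.na.fill(Map("x" -> 0)).show() // replace nulls in column "x" with 0
```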
f160267 [SPARK-21181] Release byteBuffers to suppress netty error messages ## What changes were proposed in this pull request? We explicitly call release on the byteBufs used to encode the string to Base64, to suppress the memory leak error message reported by netty. This is to make it less confusing for the user. ### Changes proposed in this fix By explicitly invoking release on the byteBufs, we decrement the internal reference counts for the wrappedByteBufs. Now, when the GC kicks in, these would be reclaimed as before, but netty won't report any memory leak error messages since the internal reference counts are now 0. ## How was this patch tested? Ran a few Spark applications and examined the logs. The error message no longer appears. The original PR was opened against branch-2.1 => https://github.com/apache/spark/pull/18392 Author: Dhruve Ashar <dhruveashar@gmail.com> Closes #18407 from dhruve/master. (cherry picked from commit 1ebe7ffe072bcac03360e65e959a6cd36530a9c4) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 23 June 2017, 17:36:40 UTC
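A minimal sketch of the reference-counting behavior this change relies on: Netty `ByteBuf`s are reference counted, and an explicit `release()` (rather than waiting for GC) is what keeps the leak detector quiet.

```scala
import io.netty.buffer.Unpooled

val buf = Unpooled.copiedBuffer("some payload".getBytes("UTF-8"))
assert(buf.refCnt() == 1)

// ... use the buffer, e.g. to encode its contents ...

buf.release()             // decrement the reference count once done
assert(buf.refCnt() == 0) // nothing left for the leak detector to flag
```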
9d29808 [SPARK-21144][SQL] Print a warning if the data schema and partition schema have the duplicate columns ## What changes were proposed in this pull request? The current master outputs unexpected results when the data schema and partition schema have duplicate columns: ``` withTempPath { dir => val basePath = dir.getCanonicalPath spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=1").toString) spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=a").toString) spark.read.parquet(basePath).show() } +---+ |foo| +---+ | 1| | 1| | a| | a| | 1| | a| +---+ ``` This patch adds code to print a warning when such duplication is found. ## How was this patch tested? Manually checked. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18375 from maropu/SPARK-21144-3. (cherry picked from commit f3dea60793d86212ba1068e88ad89cb3dcf07801) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 23 June 2017, 16:28:21 UTC
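A minimal sketch of the kind of check that motivates the warning (simplified; Spark's actual resolution rules live elsewhere): detect column names shared by the data schema and the partition schema, case-insensitively.

```scala
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val dataSchema      = StructType(Seq(StructField("foo", IntegerType), StructField("bar", IntegerType)))
val partitionSchema = StructType(Seq(StructField("foo", IntegerType)))

val duplicated = dataSchema.fieldNames.map(_.toLowerCase)
  .intersect(partitionSchema.fieldNames.map(_.toLowerCase))

if (duplicated.nonEmpty) {
  println(s"WARN: columns appear in both data and partition schema: ${duplicated.mkString(", ")}")
}
```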
b6749ba [SPARK-21165] [SQL] [2.2] Use executedPlan instead of analyzedPlan in INSERT AS SELECT [WIP] ### What changes were proposed in this pull request? The input query schema of INSERT AS SELECT could be changed after optimization. For example, the following query's output schema is changed by the rules `SimplifyCasts` and `RemoveRedundantAliases`. ```SQL SELECT word, length, cast(first as string) as first FROM view1 ``` This PR is to fix the issue in Spark 2.2. Instead of using the analyzed plan of the input query, this PR uses its executed plan to determine the attributes in `FileFormatWriter`. The related issue in the master branch has been fixed by https://github.com/apache/spark/pull/18064. After this PR is merged, I will submit a separate PR to merge the test case into master. ### How was this patch tested? Added a test case. Author: Xiao Li <gatorsmile@gmail.com> Author: gatorsmile <gatorsmile@gmail.com> Closes #18386 from gatorsmile/newRC5. 23 June 2017, 12:44:25 UTC
b99c0e9 Revert "[SPARK-18016][SQL][CATALYST][BRANCH-2.2] Code Generation: Constant Pool Limit - Class Splitting" This reverts commit 198e3a036fbed373be3d26964e005b4f2fed2bb7. 23 June 2017, 00:37:24 UTC
d625734 [SQL][DOC] Fix documentation of lpad ## What changes were proposed in this pull request? Fix incomplete documentation for `lpad`. Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #18367 from actuaryzhang/SQLDoc. (cherry picked from commit 97b307c87c0f262ea3e020bf3d72383deef76619) Signed-off-by: Sean Owen <sowen@cloudera.com> 22 June 2017, 09:12:42 UTC
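A short usage sketch of the function whose documentation is completed here: `lpad` left-pads a string column to the requested length with the given pad string.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lpad

val spark = SparkSession.builder().appName("lpad-demo").master("local[1]").getOrCreate()
import spark.implicits._

Seq("hi").toDF("s")
  .select(lpad($"s", 5, "?").as("padded")) // yields "???hi"
  .show()
```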
6ef7a5b [SPARK-21167][SS] Decode the path generated by File sink to handle special characters ## What changes were proposed in this pull request? Decode the path generated by File sink to handle special characters. ## How was this patch tested? The added unit test. Author: Shixiong Zhu <shixiong@databricks.com> Closes #18381 from zsxwing/SPARK-21167. (cherry picked from commit d66b143eec7f604595089f72d8786edbdcd74282) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 22 June 2017, 06:43:30 UTC
198e3a0 [SPARK-18016][SQL][CATALYST][BRANCH-2.2] Code Generation: Constant Pool Limit - Class Splitting ## What changes were proposed in this pull request? This is a backport patch for Spark 2.2.x of the class splitting feature over excess generated code as was merged in #18075. ## How was this patch tested? The same test provided in #18075 is included in this patch. Author: ALeksander Eskilson <alek.eskilson@cerner.com> Closes #18377 from bdrillard/class_splitting_2.2. 22 June 2017, 05:25:45 UTC
529c04f [MINOR][DOCS] Add lost <tr> tag for configuration.md ## What changes were proposed in this pull request? Add lost `<tr>` tag for `configuration.md`. ## How was this patch tested? N/A Author: Yuming Wang <wgyumg@gmail.com> Closes #18372 from wangyum/docs-missing-tr. (cherry picked from commit 987eb8faddbb533e006c769d382a3e4fda3dd6ee) Signed-off-by: Sean Owen <sowen@cloudera.com> 21 June 2017, 14:30:39 UTC
e883498 Preparing development version 2.2.1-SNAPSHOT 20 June 2017, 17:56:55 UTC
62e442e Preparing Spark release v2.2.0-rc5 20 June 2017, 17:56:51 UTC
b8b80f6 [SPARK-21150][SQL] Persistent view stored in Hive metastore should be case preserving ## What changes were proposed in this pull request? This is a regression in Spark 2.2. In Spark 2.2, we introduced a new way to resolve persisted view: https://issues.apache.org/jira/browse/SPARK-18209 , but this makes the persisted view non case-preserving because we store the schema in hive metastore directly. We should follow data source table and store schema in table properties. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #18360 from cloud-fan/view. (cherry picked from commit e862dc904963cf7832bafc1d3d0ea9090bbddd81) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 20 June 2017, 16:15:41 UTC