https://github.com/apache/spark

4062cda Preparing Spark release v1.6.0-rc4 22 December 2015, 01:50:29 UTC
ca39985 [SPARK-12466] Fix harmless NPE in tests ``` [info] ReplayListenerSuite: [info] - Simple replay (58 milliseconds) java.lang.NullPointerException at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982) at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980) ``` https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull This was introduced in #10284. It's harmless because the NPE is caused by a race that occurs mainly in `local-cluster` tests (but doesn't actually fail the tests). Tested locally to verify that the NPE is gone. Author: Andrew Or <andrew@databricks.com> Closes #10417 from andrewor14/fix-harmless-npe. (cherry picked from commit d655d37ddf59d7fb6db529324ac8044d53b2622a) Signed-off-by: Andrew Or <andrew@databricks.com> 21 December 2015, 22:09:11 UTC
c754a08 Doc typo: ltrim = trim from left end, not right Author: pshearer <pshearer@massmutual.com> Closes #10414 from pshearer/patch-1. (cherry picked from commit fc6dbcc7038c2b030ef6a2dc8be5848499ccee1c) Signed-off-by: Andrew Or <andrew@databricks.com> 21 December 2015, 22:05:07 UTC
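For reference, a minimal sketch of the corrected behavior (not part of the patch; it assumes a `SQLContext` named `sqlContext` is in scope):

```scala
import org.apache.spark.sql.functions.{lit, ltrim, rtrim}

// ltrim strips whitespace from the left end of the string, rtrim from the right end.
val df = sqlContext.range(1).select(
  ltrim(lit("  spark  ")).as("left_trimmed"),   // "spark  "
  rtrim(lit("  spark  ")).as("right_trimmed"))  // "  spark"
df.show()
```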
d6a519f [SQL] Fix mistaken doc of join type for dataframe.join Fix the mistaken doc of join type for ```dataframe.join```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10378 from yanboliang/leftsemi. (cherry picked from commit a073a73a561e78c734119c8b764d37a4e5e70da4) Signed-off-by: Reynold Xin <rxin@databricks.com> 19 December 2015, 08:34:38 UTC
eca401e [SPARK-11985][STREAMING][KINESIS][DOCS] Update Kinesis docs - Provide example on `message handler` - Provide bit on KPL record de-aggregation - Fix typos Author: Burak Yavuz <brkyvz@gmail.com> Closes #9970 from brkyvz/kinesis-docs. (cherry picked from commit 2377b707f25449f4557bf048bb384c743d9008e5) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 18 December 2015, 23:24:49 UTC
bd33d4e [SPARK-12404][SQL] Ensure objects passed to StaticInvoke are Serializable Now `StaticInvoke` receives `Any` as an object and `StaticInvoke` can be serialized, but sometimes the object passed is not serializable. For example, the following code raises an Exception because `RowEncoder#extractorsFor`, invoked indirectly, makes a `StaticInvoke`. ``` case class TimestampContainer(timestamp: java.sql.Timestamp) val rdd = sc.parallelize(1 to 2).map(_ => TimestampContainer(new java.sql.Timestamp(System.currentTimeMillis))) val df = rdd.toDF val ds = df.as[TimestampContainer] val rdd2 = ds.rdd <----------------- invokes extractorsFor indirectly ``` I'll add test cases. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Author: Michael Armbrust <michael@databricks.com> Closes #10357 from sarutak/SPARK-12404. (cherry picked from commit 6eba655259d2bcea27d0147b37d5d1e476e85422) Signed-off-by: Michael Armbrust <michael@databricks.com> 18 December 2015, 22:05:16 UTC
3b903e4 Revert "[SPARK-12365][CORE] Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called" This reverts commit 4af64385b085002d94c54d11bbd144f9f026bbd8. 18 December 2015, 20:56:03 UTC
1dc71ec [SPARK-12218][SQL] Invalid splitting of nested AND expressions in Data Source filter API JIRA: https://issues.apache.org/jira/browse/SPARK-12218 When creating filters for Parquet/ORC, we should not push nested AND expressions partially. Author: Yin Huai <yhuai@databricks.com> Closes #10362 from yhuai/SPARK-12218. (cherry picked from commit 41ee7c57abd9f52065fd7ffb71a8af229603371d) Signed-off-by: Yin Huai <yhuai@databricks.com> 18 December 2015, 18:53:31 UTC
df02319 [SPARK-12413] Fix Mesos ZK persistence I believe this fixes SPARK-12413. I'm currently running an integration test to verify. Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #10366 from mgummelt/fix-zk-mesos. (cherry picked from commit 2bebaa39d9da33bc93ef682959cd42c1968a6a3e) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp> 18 December 2015, 11:21:42 UTC
9177ea3 [SPARK-11749][STREAMING] Duplicate creation of the RDD in file stream when recovering from checkpoint data Add a transient flag `DStream.restoredFromCheckpointData` to control the restore processing in DStream to avoid duplicate work: check this flag first in `DStream.restoreCheckpointData`; only when it is `false` will the restore process be executed. Author: jhu-chang <gt.hu.chang@gmail.com> Closes #9765 from jhu-chang/SPARK-11749. (cherry picked from commit f4346f612b6798517153a786f9172cf41618d34d) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 18 December 2015, 01:54:14 UTC
4df1dd4 [SPARK-12376][TESTS] Spark Streaming Java8APISuite fails in assertOrderInvariantEquals method org.apache.spark.streaming.Java8APISuite.java is failing due to trying to sort an immutable list in the assertOrderInvariantEquals method. Author: Evan Chen <chene@us.ibm.com> Closes #10336 from evanyc15/SPARK-12376-StreamingJavaAPISuite. 17 December 2015, 22:23:45 UTC
48dcee4 [SPARK-12397][SQL] Improve error messages for data sources when they are not found Point users to spark-packages.org to find them. Author: Reynold Xin <rxin@databricks.com> Closes #10351 from rxin/SPARK-12397. (cherry picked from commit e096a652b92fc64a7b3457cd0766ab324bcc980b) Signed-off-by: Michael Armbrust <michael@databricks.com> 17 December 2015, 22:16:58 UTC
c0ab14f [SPARK-12410][STREAMING] Fix places that use '.' and '|' directly in split String.split accepts a regular expression, so we should escape "." and "|". Author: Shixiong Zhu <shixiong@databricks.com> Closes #10361 from zsxwing/reg-bug. (cherry picked from commit 540b5aeadc84d1a5d61bda4414abd6bf35dc7ff9) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 17 December 2015, 21:23:58 UTC
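A quick illustration of why the escaping matters (plain Scala, independent of Spark):

```scala
// String.split takes a regular expression, so an unescaped "." or "|" does not mean
// the literal character.
"1.2.3".split(".")      // Array()           -- "." matches every character, so nothing remains
"1.2.3".split("\\.")    // Array(1, 2, 3)
"a|b|c".split("|")      // splits between every character, since "|" matches the empty string
"a|b|c".split("\\|")    // Array(a, b, c)
```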
88bbb54 [SPARK-12390] Clean up unused serializer parameter in BlockManager No change in functionality is intended. This only changes internal API. Author: Andrew Or <andrew@databricks.com> Closes #10343 from andrewor14/clean-bm-serializer. Conflicts: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 17 December 2015, 20:01:13 UTC
881f254 [SPARK-12345][MESOS] Properly filter out SPARK_HOME in the Mesos REST server Fixes a problem with #10332; this one should fix cluster mode on Mesos. Author: Iulian Dragos <jaguarul@gmail.com> Closes #10359 from dragos/issue/fix-spark-12345-one-more-time. (cherry picked from commit 8184568810e8a2e7d5371db2c6a0366ef4841f70) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp> 17 December 2015, 18:37:43 UTC
1fbca41 [SPARK-12220][CORE] Make Utils.fetchFile support files that contain special characters This PR encodes and decodes the file name to fix the issue. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10208 from zsxwing/uri. (cherry picked from commit 86e405f357711ae93935853a912bc13985c259db) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 17 December 2015, 17:55:46 UTC
41ad8ac [SQL] Update SQLContext.read.text doc Since we rename the column name from ```text``` to ```value``` for DataFrame load by ```SQLContext.read.text```, we need to update doc. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10349 from yanboliang/text-value. (cherry picked from commit 6e0771665b3c9330fc0a5b2c7740a796b4cd712e) Signed-off-by: Reynold Xin <rxin@databricks.com> 17 December 2015, 17:20:04 UTC
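A minimal sketch of the documented behavior (the path and the `sqlContext` reference are placeholders):

```scala
import org.apache.spark.sql.functions.col

val lines = sqlContext.read.text("/path/to/logs.txt")   // hypothetical path
lines.printSchema()                                      // root |-- value: string
lines.filter(col("value").contains("ERROR")).count()     // refer to the column as "value"
```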
1ebedb2 [SPARK-12395] [SQL] fix resulting columns of outer join For the API DataFrame.join(right, usingColumns, joinType), if the joinType is right_outer or full_outer, the resulting join columns could be wrong (will be null). The order of columns has been changed to match that of MySQL and PostgreSQL [1]. This PR also fixes the nullability of output for outer join. [1] http://www.postgresql.org/docs/9.2/static/queries-table-expressions.html Author: Davies Liu <davies@databricks.com> Closes #10353 from davies/fix_join. (cherry picked from commit a170d34a1b309fecc76d1370063e0c4f44dc2142) Signed-off-by: Davies Liu <davies.liu@gmail.com> 17 December 2015, 16:04:24 UTC
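A small reproduction sketch of the behavior being fixed, assuming a `sqlContext` in scope:

```scala
val left  = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("k", "lv")
val right = sqlContext.createDataFrame(Seq((2, "x"), (3, "y"))).toDF("k", "rv")

// With the fix, the row matched only on the right side carries k = 3 rather than null.
left.join(right, Seq("k"), "full_outer").show()
```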
a846648 Revert "Once driver register successfully, stop it to connect to master." This reverts commit da7542f2408140a9a3b7ea245350976ac18676a5. 17 December 2015, 16:01:59 UTC
da7542f Once driver register successfully, stop it to connect to master. This commit is to resolve SPARK-12396. Author: echo2mei <534384876@qq.com> Closes #10354 from echoTomei/master. (cherry picked from commit 5a514b61bbfb609c505d8d65f2483068a56f1f70) Signed-off-by: Davies Liu <davies.liu@gmail.com> 17 December 2015, 15:59:27 UTC
d509194 [SPARK-12057][SQL] Prevent failure on corrupt JSON records This PR makes the JSON parser and schema inference handle more cases where we have unparsed records. It is based on #10043. The last commit fixes the failed test and updates the logic of schema inference. Regarding the schema inference change, if we have something like ``` {"f1":1} [1,2,3] ``` originally, we will get a DF without any columns. After this change, we will get a DF with columns `f1` and `_corrupt_record`. Basically, for the second row, `[1,2,3]` will be the value of `_corrupt_record`. When merging this PR, please make sure that the author is simplyianm. JIRA: https://issues.apache.org/jira/browse/SPARK-12057 Closes #10043 Author: Ian Macalinao <me@ian.pw> Author: Yin Huai <yhuai@databricks.com> Closes #10288 from yhuai/handleCorruptJson. (cherry picked from commit 9d66c4216ad830812848c657bbcd8cd50949e199) Signed-off-by: Reynold Xin <rxin@databricks.com> 17 December 2015, 07:19:06 UTC
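A sketch of the example above as it would be run against a `SQLContext` (the `sc` and `sqlContext` names are assumed to be in scope):

```scala
val records = sc.parallelize(Seq("""{"f1":1}""", """[1,2,3]"""))
val df = sqlContext.read.json(records)

// After this change the schema contains both f1 and _corrupt_record;
// the unparsable second line becomes the value of _corrupt_record.
df.printSchema()
df.show()
```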
4ad0803 [SPARK-12386][CORE] Fix NPE when spark.executor.port is set. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10339 from vanzin/SPARK-12386. (cherry picked from commit d1508dd9b765489913bc948575a69ebab82f217b) Signed-off-by: Andrew Or <andrew@databricks.com> 17 December 2015, 03:47:57 UTC
154567d [SPARK-12186][WEB UI] Send the complete request URI including the query string when redirecting. Author: Rohit Agarwal <rohita@qubole.com> Closes #10180 from mindprince/SPARK-12186. (cherry picked from commit fdb38227564c1af40cbfb97df420b23eb04c002b) Signed-off-by: Andrew Or <andrew@databricks.com> 17 December 2015, 03:04:43 UTC
4af6438 [SPARK-12365][CORE] Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called SPARK-9886 fixed ExternalBlockStore.scala This PR fixes the remaining references to Runtime.getRuntime.addShutdownHook() Author: tedyu <yuzhihong@gmail.com> Closes #10325 from ted-yu/master. (cherry picked from commit f590178d7a06221a93286757c68b23919bee9f03) Signed-off-by: Andrew Or <andrew@databricks.com> Conflicts: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala 17 December 2015, 03:03:30 UTC
fb02e4e [SPARK-10248][CORE] track exceptions in dagscheduler event loop in tests `DAGSchedulerEventLoop` normally only logs errors (so it can continue to process more events, from other jobs). However, this is not desirable in the tests -- the tests should be able to easily detect any exception, and also shouldn't silently succeed if there is an exception. This was suggested by mateiz on https://github.com/apache/spark/pull/7699. It may have already turned up an issue in "zero split job". Author: Imran Rashid <irashid@cloudera.com> Closes #8466 from squito/SPARK-10248. (cherry picked from commit 38d9795a4fa07086d65ff705ce86648345618736) Signed-off-by: Andrew Or <andrew@databricks.com> 17 December 2015, 03:01:13 UTC
638b89b [MINOR] Add missing interpolation in NettyRPCEnv ``` Exception in thread "main" org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in ${timeout.duration}. This timeout is controlled by spark.rpc.askTimeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) ``` Author: Andrew Or <andrew@databricks.com> Closes #10334 from andrewor14/rpc-typo. (cherry picked from commit 861549acdbc11920cde51fc57752a8bc241064e5) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 17 December 2015, 00:13:55 UTC
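The bug pattern, boiled down to a standalone illustration (the `Timeout` class here is hypothetical; in Spark the real class is `RpcTimeout`):

```scala
case class Timeout(duration: String)
val timeout = Timeout("120 seconds")

val broken = "Cannot receive any reply in ${timeout.duration}"   // missing s prefix: placeholder printed verbatim
val fixed  = s"Cannot receive any reply in ${timeout.duration}"  // "Cannot receive any reply in 120 seconds"
```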
552b38f [SPARK-12380] [PYSPARK] use SQLContext.getOrCreate in mllib MLlib should use SQLContext.getOrCreate() instead of creating new SQLContext. Author: Davies Liu <davies@databricks.com> Closes #10338 from davies/create_context. (cherry picked from commit 27b98e99d21a0cc34955337f82a71a18f9220ab2) Signed-off-by: Davies Liu <davies.liu@gmail.com> 16 December 2015, 23:48:21 UTC
04e868b [SPARK-12364][ML][SPARKR] Add ML example for SparkR We have a DataFrame example for SparkR; we also need to add an ML example under ```examples/src/main/r```. cc mengxr jkbradley shivaram Author: Yanbo Liang <ybliang8@gmail.com> Closes #10324 from yanboliang/spark-12364. (cherry picked from commit 1a8b2a17db7ab7a213d553079b83274aeebba86f) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 16 December 2015, 20:59:33 UTC
dffa610 [SPARK-11608][MLLIB][DOC] Added migration guide for MLlib 1.6 No known breaking changes, but some deprecations and changes of behavior. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #10235 from jkbradley/mllib-guide-update-1.6. (cherry picked from commit 8148cc7a5c9f52c82c2eb7652d9aeba85e72d406) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 16 December 2015, 19:53:15 UTC
aee88eb Preparing development version 1.6.0-SNAPSHOT 16 December 2015, 19:23:52 UTC
168c89e Preparing Spark release v1.6.0-rc3 16 December 2015, 19:23:41 UTC
e1adf6d [SPARK-6518][MLLIB][EXAMPLE][DOC] Add example code and user guide for bisecting k-means This PR includes only example code in order to finish it quickly. I'll send another PR for the docs soon. Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9952 from yu-iskw/SPARK-6518. (cherry picked from commit 7b6dc29d0ebbfb3bb941130f8542120b6bc3e234) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 16 December 2015, 18:55:54 UTC
e5b8571 [SPARK-12345][MESOS] Filter SPARK_HOME when submitting Spark jobs with Mesos cluster mode. SPARK_HOME is now causing problems with Mesos cluster mode since the spark-submit script has recently been changed to take precedence when running spark-class scripts, looking in SPARK_HOME if it's defined. We should skip passing SPARK_HOME from the Spark client in cluster mode with Mesos, since Mesos shouldn't use this configuration but should use spark.executor.home instead. Author: Timothy Chen <tnachen@gmail.com> Closes #10332 from tnachen/scheduler_ui. (cherry picked from commit ad8c1f0b840284d05da737fb2cc5ebf8848f4490) Signed-off-by: Andrew Or <andrew@databricks.com> 16 December 2015, 18:55:25 UTC
f815127 [SPARK-12318][SPARKR] Save mode in SparkR should be error by default shivaram Please help review. Author: Jeff Zhang <zjffdu@apache.org> Closes #10290 from zjffdu/SPARK-12318. (cherry picked from commit 2eb5af5f0d3c424dc617bb1a18dd0210ea9ba0bc) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 16 December 2015, 18:48:54 UTC
16edd93 [SPARK-12215][ML][DOC] User guide section for KMeans in spark.ml cc jkbradley Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #10244 from yu-iskw/SPARK-12215. (cherry picked from commit 26d70bd2b42617ff731b6e9e6d77933b38597ebe) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 16 December 2015, 18:43:55 UTC
ac0e2ea [SPARK-12310][SPARKR] Add write.json and write.parquet for SparkR Add ```write.json``` and ```write.parquet``` for SparkR, and deprecated ```saveAsParquetFile```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10281 from yanboliang/spark-12310. (cherry picked from commit 22f6cd86fc2e2d6f6ad2c3aae416732c46ebf1b1) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 16 December 2015, 18:34:54 UTC
a2d584e [SPARK-12324][MLLIB][DOC] Fixes the sidebar in the ML documentation This fixes the sidebar, using a pure CSS mechanism to hide it when the browser's viewport is too narrow. Credit goes to the original author Titan-C (mentioned in the NOTICE). Note that I am not a CSS expert, so I can only address comments up to some extent. Default view: <img width="936" alt="screen shot 2015-12-14 at 12 46 39 pm" src="https://cloud.githubusercontent.com/assets/7594753/11793597/6d1d6eda-a261-11e5-836b-6eb2054e9054.png"> When collapsed manually by the user: <img width="1004" alt="screen shot 2015-12-14 at 12 54 02 pm" src="https://cloud.githubusercontent.com/assets/7594753/11793669/c991989e-a261-11e5-8bf6-aecf3bdb6319.png"> Disappears when column is too narrow: <img width="697" alt="screen shot 2015-12-14 at 12 47 22 pm" src="https://cloud.githubusercontent.com/assets/7594753/11793607/7754dbcc-a261-11e5-8b15-e0d074b0e47c.png"> Can still be opened by the user if necessary: <img width="651" alt="screen shot 2015-12-14 at 12 51 15 pm" src="https://cloud.githubusercontent.com/assets/7594753/11793612/7bf82968-a261-11e5-9cc3-e827a7a6b2b0.png"> Author: Timothy Hunter <timhunter@databricks.com> Closes #10297 from thunterdb/12324. (cherry picked from commit a6325fc401f68d9fa30cc947c44acc9d64ebda7b) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 16 December 2015, 18:12:47 UTC
fb08f7b [SPARK-10477][SQL] using DSL in ColumnPruningSuite to improve readability Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8645 from cloud-fan/test. (cherry picked from commit a89e8b6122ee5a1517fbcf405b1686619db56696) Signed-off-by: Andrew Or <andrew@databricks.com> 16 December 2015, 02:29:25 UTC
93095eb [SPARK-12062][CORE] Change Master to async rebuild UI when application completes This change builds the event history of completed apps asynchronously so the RPC thread will not be blocked, allowing new workers to register/remove if the event log history is very large and takes a long time to rebuild. Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #10284 from BryanCutler/async-MasterUI-SPARK-12062. (cherry picked from commit c5b6b398d5e368626e589feede80355fb74c2bd8) Signed-off-by: Andrew Or <andrew@databricks.com> 16 December 2015, 02:28:26 UTC
8e9a600 [SPARK-9886][CORE] Fix to use ShutdownHookManager in ExternalBlockStore.scala Author: Naveen <naveenminchu@gmail.com> Closes #10313 from naveenminchu/branch-fix-SPARK-9886. (cherry picked from commit 8a215d2338c6286253e20122640592f9d69896c8) Signed-off-by: Andrew Or <andrew@databricks.com> 16 December 2015, 02:25:28 UTC
2c324d3 [SPARK-12351][MESOS] Add documentation about submitting Spark with Mesos cluster mode. Adding more documentation about submitting jobs with Mesos cluster mode. Author: Timothy Chen <tnachen@gmail.com> Closes #10086 from tnachen/mesos_supervise_docs. (cherry picked from commit c2de99a7c3a52b0da96517c7056d2733ef45495f) Signed-off-by: Andrew Or <andrew@databricks.com> 16 December 2015, 02:20:09 UTC
9e4ac56 [SPARK-12056][CORE] Part 2 Create a TaskAttemptContext only after calling setConf This is a continuation of SPARK-12056 where the change is applied to SqlNewHadoopRDD.scala andrewor14 FYI Author: tedyu <yuzhihong@gmail.com> Closes #10164 from tedyu/master. (cherry picked from commit f725b2ec1ab0d89e35b5e2d3ddeddb79fec85f6d) Signed-off-by: Andrew Or <andrew@databricks.com> 16 December 2015, 02:15:53 UTC
08aa3b4 Preparing development version 1.6.0-SNAPSHOT 15 December 2015, 23:10:04 UTC
00a39d9 Preparing Spark release v1.6.0-rc3 15 December 2015, 23:09:57 UTC
80d2617 Update branch-1.6 for 1.6.0 release Author: Michael Armbrust <michael@databricks.com> Closes #10317 from marmbrus/versions. 15 December 2015, 23:03:33 UTC
23c8846 [STREAMING][MINOR] Fix typo in function name of StateImpl cc tdas zsxwing, please review. Thanks a lot. Author: jerryshao <sshao@hortonworks.com> Closes #10305 from jerryshao/fix-typo-state-impl. (cherry picked from commit bc1ff9f4a41401599d3a87fb3c23a2078228a29b) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 15 December 2015, 17:41:50 UTC
352a0c8 [SPARK-12327] Disable commented code lintr temporarily cc yhuai felixcheung shaneknapp Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #10300 from shivaram/comment-lintr-disable. (cherry picked from commit fb3778de685881df66bf0222b520f94dca99e8c8) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 15 December 2015, 00:14:03 UTC
c0f0f6c [MINOR][DOC] Fix broken word2vec link Follow-up of [SPARK-12199](https://issues.apache.org/jira/browse/SPARK-12199) and #10193 where a broken link has been left as is. Author: BenFradet <benjamin.fradet@gmail.com> Closes #10282 from BenFradet/SPARK-12199. (cherry picked from commit e25f1fe42747be71c6b6e6357ca214f9544e3a46) Signed-off-by: Sean Owen <sowen@cloudera.com> 14 December 2015, 13:50:40 UTC
94ce502 [SPARK-12275][SQL] No plan for BroadcastHint in some conditions When SparkStrategies.BasicOperators's "case BroadcastHint(child) => apply(child)" is hit, it only recursively invokes BasicOperators.apply with this "child". This leaves many strategies no chance to process this plan, which probably leads to a "No plan" issue, so we use planLater to go through all strategies. https://issues.apache.org/jira/browse/SPARK-12275 Author: yucai <yucai.yu@intel.com> Closes #10265 from yucai/broadcast_hint. (cherry picked from commit ed87f6d3b48a85391628c29c43d318c26e2c6de7) Signed-off-by: Yin Huai <yhuai@databricks.com> 14 December 2015, 07:08:40 UTC
fbf16da [SPARK-12281][CORE] Fix a race condition when reporting ExecutorState in the shutdown hook 1. Make sure workers and masters exit so that no worker or master will still be running when triggering the shutdown hook. 2. Set ExecutorState to FAILED if it's still RUNNING when executing the shutdown hook. This should fix the potential exceptions when exiting a local cluster ``` java.lang.AssertionError: assertion failed: executor 4 state transfer from RUNNING to RUNNING is illegal at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:260) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown. at org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:246) at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:191) at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:180) at org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:73) at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:474) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ``` Author: Shixiong Zhu <shixiong@databricks.com> Closes #10269 from zsxwing/executor-state. (cherry picked from commit 2aecda284e22ec608992b6221e2f5ffbd51fcd24) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 14 December 2015, 06:06:56 UTC
d7e3bfd [SPARK-12267][CORE] Store the remote RpcEnv address to send the correct disconnection message Author: Shixiong Zhu <shixiong@databricks.com> Closes #10261 from zsxwing/SPARK-12267. (cherry picked from commit 8af2f8c61ae4a59d129fb3530d0f6e9317f4bff8) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 13 December 2015, 05:59:03 UTC
e05364b [SPARK-12199][DOC] Follow-up: Refine example code in ml-features.md https://issues.apache.org/jira/browse/SPARK-12199 Follow-up PR of SPARK-11551. Fix some errors in ml-features.md mengxr Author: Xusen Yin <yinxusen@gmail.com> Closes #10193 from yinxusen/SPARK-12199. (cherry picked from commit 98b212d36b34ab490c391ea2adf5b141e4fb9289) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 13 December 2015, 01:47:10 UTC
2679fce [SPARK-11193] Use Java ConcurrentHashMap instead of SynchronizedMap trait in order to avoid ClassCastException due to KryoSerializer in KinesisReceiver Author: Jean-Baptiste Onofré <jbonofre@apache.org> Closes #10203 from jbonofre/SPARK-11193. (cherry picked from commit 03138b67d3ef7f5278ea9f8b9c75f0e357ef79d8) Signed-off-by: Sean Owen <sowen@cloudera.com> 12 December 2015, 08:52:02 UTC
47461fe [SPARK-12158][SPARKR][SQL] Fix 'sample' functions that break R unit test cases The existing sample functions are missing the parameter `seed`; however, the corresponding function interface in `generics` has such a parameter. Thus, although the function caller can call the function with the 'seed', we are not using the value. This could cause SparkR unit tests to fail. For example, I hit it in another PR: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47213/consoleFull Author: gatorsmile <gatorsmile@gmail.com> Closes #10160 from gatorsmile/sampleR. (cherry picked from commit 1e3526c2d3de723225024fedd45753b556e18fc6) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 12 December 2015, 04:55:24 UTC
03d8015 [SPARK-12298][SQL] Fix infinite loop in DataFrame.sortWithinPartitions Modifies the String overload to call the Column overload and ensures this is called in a test. Author: Ankur Dave <ankurdave@gmail.com> Closes #10271 from ankurdave/SPARK-12298. (cherry picked from commit 1e799d617a28cd0eaa8f22d103ea8248c4655ae5) Signed-off-by: Yin Huai <yhuai@databricks.com> 12 December 2015, 03:08:03 UTC
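A sketch of the two overloads involved, assuming a DataFrame `df` with a column `key`:

```scala
import org.apache.spark.sql.functions.col

df.sortWithinPartitions("key")        // String overload: looped forever before this fix
df.sortWithinPartitions(col("key"))   // Column overload: the one the fix now delegates to
```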
c2f2046 [SPARK-11978][ML] Move dataset_example.py to examples/ml and rename to dataframe_example.py Since ```Dataset``` has a new meaning in Spark 1.6, we should rename it to avoid confusion. #9873 finished the work for the Scala example; here we focus on the Python one. Move dataset_example.py to ```examples/ml``` and rename it to ```dataframe_example.py```. BTW, fix minor leftover issues from #9873. cc mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #9957 from yanboliang/SPARK-11978. (cherry picked from commit a0ff6d16ef4bcc1b6ff7282e82a9b345d8449454) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 12 December 2015, 02:02:37 UTC
75531c7 [SPARK-12217][ML] Document invalid handling for StringIndexer Added a paragraph regarding StringIndexer#setHandleInvalid to the ml-features documentation. I wonder if I should also add a snippet to the code example, input welcome. Author: BenFradet <benjamin.fradet@gmail.com> Closes #10257 from BenFradet/SPARK-12217. (cherry picked from commit aea676ca2d07c72b1a752e9308c961118e5bfc3c) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 11 December 2015, 23:43:09 UTC
bfcc8cf [SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type Erasure Issue As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. This PR blocks #9441, so once this is merged, the other can be rebased. cc holdenk Author: Mike Dusenberry <mwdusenb@us.ibm.com> Closes #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue. (cherry picked from commit 1b8220387e6903564f765fabb54be0420c3e99d7) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 11 December 2015, 22:21:48 UTC
2ddd104 [SPARK-11964][DOCS][ML] Add in Pipeline Import/Export Documentation Adding in Pipeline Import and Export Documentation. Author: anabranch <wac.chambers@gmail.com> Author: Bill Chambers <wchambers@ischool.berkeley.edu> Closes #10179 from anabranch/master. (cherry picked from commit aa305dcaf5b4148aba9e669e081d0b9235f50857) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 11 December 2015, 20:56:20 UTC
f05bae4 [SPARK-12146][SPARKR] SparkR jsonFile should support multiple input files * ```jsonFile``` should support multiple input files, such as: ```R jsonFile(sqlContext, c("path1", "path2")) # character vector as arguments jsonFile(sqlContext, "path1,path2") ``` * Meanwhile, ```jsonFile``` has been deprecated by Spark SQL and will be removed in Spark 2.0. So we mark ```jsonFile``` deprecated and use ```read.json``` on the SparkR side. * Replace all ```jsonFile``` with ```read.json``` in test_sparkSQL.R, but still keep the jsonFile test case. * If this PR is accepted, we should also make almost the same change for ```parquetFile```. cc felixcheung sun-rui shivaram Author: Yanbo Liang <ybliang8@gmail.com> Closes #10145 from yanboliang/spark-12146. (cherry picked from commit 0fb9825556dbbcc98d7eafe9ddea8676301e09bb) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 11 December 2015, 19:47:43 UTC
2e45231 Preparing development version 1.6.0-SNAPSHOT 11 December 2015, 19:25:09 UTC
23f8dfd Preparing Spark release v1.6.0-rc2 11 December 2015, 19:25:03 UTC
eec3660 [SPARK-12258] [SQL] passing null into ScalaUDF (follow-up) This is a follow-up PR for #10259 Author: Davies Liu <davies@databricks.com> Closes #10266 from davies/null_udf2. (cherry picked from commit c119a34d1e9e599e302acfda92e5de681086a19f) Signed-off-by: Davies Liu <davies.liu@gmail.com> 11 December 2015, 19:16:04 UTC
250249e Preparing development version 1.6.0-SNAPSHOT 11 December 2015, 02:45:42 UTC
3e39925 Preparing Spark release v1.6.0-rc2 11 December 2015, 02:45:36 UTC
d09af2c [SPARK-12258][SQL] passing null into ScalaUDF Check nullability and pass it into ScalaUDF. Closes #10249 Author: Davies Liu <davies@databricks.com> Closes #10259 from davies/udf_null. (cherry picked from commit b1b4ee7f3541d92c8bc2b0b4fdadf46cfdb09504) Signed-off-by: Yin Huai <yhuai@databricks.com> 11 December 2015, 01:22:57 UTC
5d3722f [STREAMING][DOC][MINOR] Update the description of direct Kafka stream doc With the merge of [SPARK-8337](https://issues.apache.org/jira/browse/SPARK-8337), now the Python API has the same functionalities compared to Scala/Java, so here changing the description to make it more precise. zsxwing tdas , please review, thanks a lot. Author: jerryshao <sshao@hortonworks.com> Closes #10246 from jerryshao/direct-kafka-doc-update. (cherry picked from commit 24d3357d66e14388faf8709b368edca70ea96432) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 10 December 2015, 23:31:55 UTC
c247b6a [SPARK-12155][SPARK-12253] Fix executor OOM in unified memory management **Problem.** In unified memory management, acquiring execution memory may lead to eviction of storage memory. However, the space freed from evicting cached blocks is distributed among all active tasks. Thus, an incorrect upper bound on the execution memory per task can cause the acquisition to fail, leading to OOM's and premature spills. **Example.** Suppose total memory is 1000B, cached blocks occupy 900B, `spark.memory.storageFraction` is 0.4, and there are two active tasks. In this case, the cap on task execution memory is 100B / 2 = 50B. If task A tries to acquire 200B, it will evict 100B of storage but can only acquire 50B because of the incorrect cap. For another example, see this [regression test](https://github.com/andrewor14/spark/blob/fix-oom/core/src/test/scala/org/apache/spark/memory/UnifiedMemoryManagerSuite.scala#L233) that I stole from JoshRosen. **Solution.** Fix the cap on task execution memory. It should take into account the space that could have been freed by storage in addition to the current amount of memory available to execution. In the example above, the correct cap should have been 600B / 2 = 300B. This patch also guards against the race condition (SPARK-12253): (1) Existing tasks collectively occupy all execution memory (2) New task comes in and blocks while existing tasks spill (3) After tasks finish spilling, another task jumps in and puts in a large block, stealing the freed memory (4) New task still cannot acquire memory and goes back to sleep Author: Andrew Or <andrew@databricks.com> Closes #10240 from andrewor14/fix-oom. (cherry picked from commit 5030923ea8bb94ac8fa8e432de9fc7089aa93986) Signed-off-by: Andrew Or <andrew@databricks.com> 10 December 2015, 23:30:14 UTC
9870e5c [SPARK-12251] Document and improve off-heap memory configurations This patch adds documentation for Spark configurations that affect off-heap memory and makes some naming and validation improvements for those configs. - Change `spark.memory.offHeapSize` to `spark.memory.offHeap.size`. This is fine because this configuration has not shipped in any Spark release yet (it's new in Spark 1.6). - Deprecated `spark.unsafe.offHeap` in favor of a new `spark.memory.offHeap.enabled` configuration. The motivation behind this change is to gather all memory-related configurations under the same prefix. - Add a check which prevents users from setting `spark.memory.offHeap.enabled=true` when `spark.memory.offHeap.size == 0`. After SPARK-11389 (#9344), which was committed in Spark 1.6, Spark enforces a hard limit on the amount of off-heap memory that it will allocate to tasks. As a result, enabling off-heap execution memory without setting `spark.memory.offHeap.size` will lead to immediate OOMs. The new configuration validation makes this scenario easier to diagnose, helping to avoid user confusion. - Document these configurations on the configuration page. Author: Josh Rosen <joshrosen@databricks.com> Closes #10237 from JoshRosen/SPARK-12251. (cherry picked from commit 23a9e62bad9669e9ff5dc4bd714f58d12f9be0b5) Signed-off-by: Andrew Or <andrew@databricks.com> 10 December 2015, 23:29:13 UTC
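A configuration sketch under the new names (the 2 GB figure is arbitrary, and the size is given as a plain byte count to stay on the safe side regarding unit parsing):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  // Must be non-zero whenever off-heap memory is enabled, otherwise tasks OOM immediately.
  .set("spark.memory.offHeap.size", (2L * 1024 * 1024 * 1024).toString)
```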
d0307de [SPARK-12212][ML][DOC] Clarifies the difference between spark.ml, spark.mllib and mllib in the documentation. Replaces a number of occurrences of `MLlib` in the documentation that were meant to refer to the `spark.mllib` package instead. It should clarify for new users the difference between `spark.mllib` (the package) and MLlib (the umbrella project for ML in Spark). It also removes some files that I forgot to delete with #10207 Author: Timothy Hunter <timhunter@databricks.com> Closes #10234 from thunterdb/12212. (cherry picked from commit 2ecbe02d5b28ee562d10c1735244b90a08532c9e) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 10 December 2015, 20:51:00 UTC
594fafc [SPARK-12250][SQL] Allow users to define a UDAF without providing details of its inputSchema https://issues.apache.org/jira/browse/SPARK-12250 Author: Yin Huai <yhuai@databricks.com> Closes #10236 from yhuai/SPARK-12250. (cherry picked from commit bc5f56aa60a430244ffa0cacd81c0b1ecbf8d68f) Signed-off-by: Yin Huai <yhuai@databricks.com> 10 December 2015, 20:03:40 UTC
e541f70 [SPARK-12012][SQL][BRANCH-1.6] Show more comprehensive PhysicalRDD metadata when visualizing SQL query plan This PR backports PR #10004 to branch-1.6 It adds a private[sql] method metadata to SparkPlan, which can be used to describe detail information about a physical plan during visualization. Specifically, this PR uses this method to provide details of PhysicalRDDs translated from a data source relation. Author: Cheng Lian <lian@databricks.com> Closes #10250 from liancheng/spark-12012.for-1.6. 10 December 2015, 18:19:49 UTC
93ef246 [SPARK-12234][SPARKR] Fix ```subset``` function error when only the ```select``` argument is set Fix the ```subset``` function error when only the ```select``` argument is set. Please refer to the [JIRA](https://issues.apache.org/jira/browse/SPARK-12234) about the error and how to reproduce it. cc sun-rui felixcheung shivaram Author: Yanbo Liang <ybliang8@gmail.com> Closes #10217 from yanboliang/spark-12234. (cherry picked from commit d9d354ed40eec56b3f03d32f4e2629d367b1bf02) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 10 December 2015, 18:19:06 UTC
e65c885 [SPARK-11602][MLLIB] Refine visibility for 1.6 scala API audit jira: https://issues.apache.org/jira/browse/SPARK-11602 Made a pass on the API change of 1.6. Open the PR for efficient discussion. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9939 from hhbyyh/auditScala. (cherry picked from commit 9fba9c8004d2b97549e5456fa7918965bec27336) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 10 December 2015, 18:16:04 UTC
b7b9f77 [SPARK-12198][SPARKR] SparkR support read.parquet and deprecate parquetFile SparkR support ```read.parquet``` and deprecate ```parquetFile```. This change is similar with #10145 for ```jsonFile```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10191 from yanboliang/spark-12198. (cherry picked from commit eeb58722ad73441eeb5f35f864be3c5392cfd426) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 10 December 2015, 17:45:01 UTC
f939c71 [SPARK-12242][SQL] Add DataFrame.transform method Author: Reynold Xin <rxin@databricks.com> Closes #10226 from rxin/df-transform. (cherry picked from commit 76540b6df5370b463277d3498097b2cc2d2e97a8) Signed-off-by: Reynold Xin <rxin@databricks.com> 10 December 2015, 14:23:26 UTC
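A usage sketch of the added method (the DataFrame `df` and its `amount` column are assumed for illustration):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def withDoubled(input: DataFrame): DataFrame =
  input.withColumn("doubled", col("amount") * 2)

// transform threads the frame through the function, so custom steps chain like built-ins.
val result = df.transform(withDoubled)   // equivalent to withDoubled(df)
```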
b5e5812 [SPARK-12136][STREAMING] rddToFileName does not properly handle prefix and suffix parameters The original code does not properly handle the cases where the prefix is null, but suffix is not null - the suffix should be used but is not. The fix is using StringBuilder to construct the proper file name. Author: bomeng <bmeng@us.ibm.com> Author: Bo Meng <mengbo@bos-macbook-pro.usca.ibm.com> Closes #10185 from bomeng/SPARK-12136. (cherry picked from commit e29704f90dfe67d9e276d242699ac0a00f64fb91) Signed-off-by: Sean Owen <sowen@cloudera.com> 10 December 2015, 12:54:08 UTC
f6d8661 [SPARK-12244][SPARK-12245][STREAMING] Rename trackStateByKey to mapWithState and change tracking function signature SPARK-12244: Based on feedback from early users and personal experience attempting to explain it, the name trackStateByKey had two problems. "trackState" is a completely new term which really does not give any intuition on what the operation is; and the resultant data stream of objects returned by the function is called in the docs the "emitted" data, for lack of a better term. "mapWithState" makes sense because the API is like a mapping function like (Key, Value) => T with State as an additional parameter. The resultant data stream is "mapped data". So both problems are solved. SPARK-12245: From initial experiences, not having the key in the function makes it hard to return mapped stuff, as the whole information of the records is not there. Basically the user is restricted to doing something like mapValue() instead of map(). So adding the key as a parameter. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #10224 from tdas/rename. 10 December 2015, 04:59:21 UTC
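A sketch of the renamed API, assuming a `DStream[(String, Int)]` named `pairStream`; note that the key is now passed to the mapping function:

```scala
import org.apache.spark.streaming.{State, StateSpec}

def mappingFunc(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
  val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (key, sum)   // the "mapped" record emitted downstream
}

val mapped = pairStream.mapWithState(StateSpec.function(mappingFunc _))
```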
699f497 [SPARK-11796] Fix httpclient and httpcore dependency issues related to docker-client This commit fixes dependency issues which prevented the Docker-based JDBC integration tests from running in the Maven build. Author: Mark Grover <mgrover@cloudera.com> Closes #9876 from markgrover/master_docker. (cherry picked from commit 2166c2a75083c2262e071a652dd52b1a33348b6e) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 10 December 2015, 02:39:59 UTC
9fe8dc9 [SPARK-11678][SQL][DOCS] Document basePath in the programming guide. This PR adds document for `basePath`, which is a new parameter used by `HadoopFsRelation`. The compiled doc is shown below. ![image](https://cloud.githubusercontent.com/assets/2072857/11673132/1ba01192-9dcb-11e5-98d9-ac0b4e92e98c.png) JIRA: https://issues.apache.org/jira/browse/SPARK-11678 Author: Yin Huai <yhuai@databricks.com> Closes #10211 from yhuai/basePathDoc. (cherry picked from commit ac8cdf1cdc148bd21290ecf4d4f9874f8c87cc14) Signed-off-by: Yin Huai <yhuai@databricks.com> 10 December 2015, 02:09:48 UTC
d86a88d [SPARK-12165][ADDENDUM] Fix outdated comments on unroll test JoshRosen Author: Andrew Or <andrew@databricks.com> Closes #10229 from andrewor14/unroll-test-comments. (cherry picked from commit 8770bd1213f9b1051dabde9c5424ae7b32143a44) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 10 December 2015, 01:24:18 UTC
9bc6a27 [SPARK-12211][DOC][GRAPHX] Fix version number in graphx doc for migration from 1.1 The "Migrating from 1.1" section added to the GraphX doc in 1.2.0 (see https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#migrating-from-spark-11) uses {{site.SPARK_VERSION}} as the version where changes were introduced; it should be just 1.2. Author: Andrew Ray <ray.andrew@gmail.com> Closes #10206 from aray/graphx-doc-1.1-migration. (cherry picked from commit 7a8e587dc04c2fabc875d1754eae7f85b4fba6ba) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 10 December 2015, 01:16:13 UTC
bfb4201 [SPARK-11551][DOC] Replace example code in ml-features.md using include_example PR on behalf of somideshmukh, thanks! Author: Xusen Yin <yinxusen@gmail.com> Author: somideshmukh <somilde@us.ibm.com> Closes #10219 from yinxusen/SPARK-11551. (cherry picked from commit 051c6a066f7b5fcc7472412144c15b50a5319bd5) Signed-off-by: Xiangrui Meng <meng@databricks.com> 09 December 2015, 20:01:00 UTC
ee0a6e7 [SPARK-11824][WEBUI] WebUI does not render descriptions with 'bad' HTML, throws console error Don't warn when a description isn't valid HTML, since it may legitimately be something like "SELECT ... where foo <= 1". The tests for this code indicate that it's normal to handle strings like this that don't contain HTML as a string rather than markup. Hence logging every such instance as a warning is too noisy, since it's not a problem. This is an issue for stages whose names contain SQL like the above. CC tdas as author of this bit of code Author: Sean Owen <sowen@cloudera.com> Closes #10159 from srowen/SPARK-11824. (cherry picked from commit 1eb7c22ce72a1b82ed194a51bbcf0da9c771605a) Signed-off-by: Sean Owen <sowen@cloudera.com> 09 December 2015, 19:47:51 UTC
05e441e [SPARK-12165][SPARK-12189] Fix bugs in eviction of storage memory by execution This patch fixes a bug in the eviction of storage memory by execution. ## The bug: In general, execution should be able to evict storage memory when the total storage memory usage is greater than `maxMemory * spark.memory.storageFraction`. Due to a bug, however, Spark might wind up evicting no storage memory in certain cases where the storage memory usage was between `maxMemory * spark.memory.storageFraction` and `maxMemory`. For example, here is a regression test which illustrates the bug: ```scala val maxMemory = 1000L val taskAttemptId = 0L val (mm, ms) = makeThings(maxMemory) // Since we used the default storage fraction (0.5), we should be able to allocate 500 bytes // of storage memory which are immune to eviction by execution memory pressure. // Acquire enough storage memory to exceed the storage region size assert(mm.acquireStorageMemory(dummyBlock, 750L, evictedBlocks)) assertEvictBlocksToFreeSpaceNotCalled(ms) assert(mm.executionMemoryUsed === 0L) assert(mm.storageMemoryUsed === 750L) // At this point, storage is using 250 more bytes of memory than it is guaranteed, so execution // should be able to reclaim up to 250 bytes of storage memory. // Therefore, execution should now be able to require up to 500 bytes of memory: assert(mm.acquireExecutionMemory(500L, taskAttemptId, MemoryMode.ON_HEAP) === 500L) // <--- fails by only returning 250L assert(mm.storageMemoryUsed === 500L) assert(mm.executionMemoryUsed === 500L) assertEvictBlocksToFreeSpaceCalled(ms, 250L) ``` The problem relates to the control flow / interaction between `StorageMemoryPool.shrinkPoolToReclaimSpace()` and `MemoryStore.ensureFreeSpace()`. While trying to allocate the 500 bytes of execution memory, the `UnifiedMemoryManager` discovers that it will need to reclaim 250 bytes of memory from storage, so it calls `StorageMemoryPool.shrinkPoolToReclaimSpace(250L)`. This method, in turn, calls `MemoryStore.ensureFreeSpace(250L)`. However, `ensureFreeSpace()` first checks whether the requested space is less than `maxStorageMemory - storageMemoryUsed`, which will be true if there is any free execution memory because it turns out that `MemoryStore.maxStorageMemory = (maxMemory - onHeapExecutionMemoryPool.memoryUsed)` when the `UnifiedMemoryManager` is used. The control flow here is somewhat confusing (it grew to be messy / confusing over time / as a result of the merging / refactoring of several components). In the pre-Spark 1.6 code, `ensureFreeSpace` was called directly by the `MemoryStore` itself, whereas in 1.6 it's involved in a confusing control flow where `MemoryStore` calls `MemoryManager.acquireStorageMemory`, which then calls back into `MemoryStore.ensureFreeSpace`, which, in turn, calls `MemoryManager.freeStorageMemory`. ## The solution: The solution implemented in this patch is to remove the confusing circular control flow between `MemoryManager` and `MemoryStore`, making the storage memory acquisition process much more linear / straightforward. The key changes: - Remove a layer of inheritance which made the memory manager code harder to understand (53841174760a24a0df3eb1562af1f33dbe340eb9). - Move some bounds checks earlier in the call chain (13ba7ada77f87ef1ec362aec35c89a924e6987cb). - Refactor `ensureFreeSpace()` so that the part which evicts blocks can be called independently from the part which checks whether there is enough free space to avoid eviction (7c68ca09cb1b12f157400866983f753ac863380e). 
- Realize that this lets us remove a layer of overloads from `ensureFreeSpace` (eec4f6c87423d5e482b710e098486b3bbc4daf06). - Realize that `ensureFreeSpace()` can simply be replaced with an `evictBlocksToFreeSpace()` method which is called [after we've already figured out](https://github.com/apache/spark/blob/2dc842aea82c8895125d46a00aa43dfb0d121de9/core/src/main/scala/org/apache/spark/memory/StorageMemoryPool.scala#L88) how much memory needs to be reclaimed via eviction; (2dc842aea82c8895125d46a00aa43dfb0d121de9). Along the way, I fixed some problems with the mocks in `MemoryManagerSuite`: the old mocks would [unconditionally](https://github.com/apache/spark/blob/80a824d36eec9d9a9f092ee1741453851218ec73/core/src/test/scala/org/apache/spark/memory/MemoryManagerSuite.scala#L84) report that a block had been evicted even if there was enough space in the storage pool such that eviction would be avoided. I also fixed a problem where `StorageMemoryPool._memoryUsed` might become negative due to freed memory being double-counted when execution evicts storage. The problem was that `StorageMemoryPool.shrinkPoolToFreeSpace` would [decrement `_memoryUsed`](https://github.com/apache/spark/commit/7c68ca09cb1b12f157400866983f753ac863380e#diff-935c68a9803be144ed7bafdd2f756a0fL133) even though `StorageMemoryPool.freeMemory` had already decremented it as each evicted block was freed. See SPARK-12189 for details. Author: Josh Rosen <joshrosen@databricks.com> Author: Andrew Or <andrew@databricks.com> Closes #10170 from JoshRosen/SPARK-12165. (cherry picked from commit aec5ea000ebb8921f42f006b694ef26f5df67d83) Signed-off-by: Andrew Or <andrew@databricks.com> 09 December 2015, 19:40:09 UTC
acd4624 [SPARK-10299][ML] word2vec should allow users to specify the window size Currently word2vec has the window hard coded at 5, some users may want different sizes (for example if using on n-gram input or similar). User request comes from http://stackoverflow.com/questions/32231975/spark-word2vec-window-size . Author: Holden Karau <holden@us.ibm.com> Author: Holden Karau <holden@pigscanfly.ca> Closes #8513 from holdenk/SPARK-10299-word2vec-should-allow-users-to-specify-the-window-size. (cherry picked from commit 22b9a8740d51289434553d19b6b1ac34aecdc09a) Signed-off-by: Sean Owen <sowen@cloudera.com> 09 December 2015, 16:45:23 UTC
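A sketch of the new setter, assuming an `RDD[Seq[String]]` of tokenized sentences named `tokenizedCorpus`:

```scala
import org.apache.spark.mllib.feature.Word2Vec

val model = new Word2Vec()
  .setVectorSize(100)
  .setWindowSize(10)      // previously fixed at 5
  .fit(tokenizedCorpus)
```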
b5a76b4 [SPARK-12031][CORE][BUG] Integer overflow when do sampling Author: uncleGen <hustyugm@gmail.com> Closes #10023 from uncleGen/1.6-bugfix. (cherry picked from commit a113216865fd45ea39ae8f104e784af2cf667dcf) Signed-off-by: Sean Owen <sowen@cloudera.com> 09 December 2015, 15:09:52 UTC
0be792a [SPARK-12222] [CORE] Deserialize RoaringBitmap using Kryo serializer throws Buffer underflow exception Jira: https://issues.apache.org/jira/browse/SPARK-12222 Deserializing RoaringBitmap using the Kryo serializer throws a Buffer underflow exception: ``` com.esotericsoftware.kryo.KryoException: Buffer underflow. at com.esotericsoftware.kryo.io.Input.require(Input.java:156) at com.esotericsoftware.kryo.io.Input.skip(Input.java:131) at com.esotericsoftware.kryo.io.Input.skip(Input.java:264) ``` This is caused by a bug in Kryo's `Input.skip(long count)` (https://github.com/EsotericSoftware/kryo/issues/119) and we call this method in `KryoInputDataInputBridge`. Instead of upgrading Kryo's version, this PR bypasses Kryo's `Input.skip(long count)` by directly calling another `skip` method in Kryo's Input.java (https://github.com/EsotericSoftware/kryo/blob/kryo-2.21/src/com/esotericsoftware/kryo/io/Input.java#L124), i.e. it writes the bug-fixed version of `Input.skip(long count)` in KryoInputDataInputBridge's `skipBytes` method. For more detail see https://github.com/apache/spark/pull/9748#issuecomment-162860246 Author: Fei Wang <wangfei1@huawei.com> Closes #10213 from scwf/patch-1. (cherry picked from commit 3934562d34bbe08d91c54b4bbee27870e93d7571) Signed-off-by: Davies Liu <davies.liu@gmail.com> 09 December 2015, 05:32:58 UTC
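A sketch of the workaround described above, written as a standalone helper around a Kryo `Input` (the real change lives inside `KryoInputDataInputBridge.skipBytes`):

```scala
import com.esotericsoftware.kryo.io.Input

// Skip in Int-sized chunks via the non-buggy Input.skip(Int) overload instead of
// delegating to the broken Input.skip(Long).
def skipBytes(input: Input, n: Int): Int = {
  var remaining: Long = n
  while (remaining > 0) {
    val chunk = math.min(Integer.MAX_VALUE.toLong, remaining).toInt
    input.skip(chunk)
    remaining -= chunk
  }
  n
}
```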
9e82273 [SPARK-11343][ML] Documentation of float and double prediction/label columns in RegressionEvaluator felixcheung , mengxr Just added a message to require() Author: Dominik Dahlem <dominik.dahlem@gmail.combination> Closes #9598 from dahlem/ddahlem_regression_evaluator_double_predictions_message_04112015. (cherry picked from commit a0046e379bee0852c39ece4ea719cde70d350b0e) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 09 December 2015, 02:54:23 UTC
b1d5a78 [SPARK-8517][ML][DOC] Reorganizes the spark.ml user guide This PR moves pieces of the spark.ml user guide to reflect suggestions in SPARK-8517. It does not introduce new content, as requested. <img width="192" alt="screen shot 2015-12-08 at 11 36 00 am" src="https://cloud.githubusercontent.com/assets/7594753/11666166/e82b84f2-9d9f-11e5-8904-e215424d8444.png"> Author: Timothy Hunter <timhunter@databricks.com> Closes #10207 from thunterdb/spark-8517. (cherry picked from commit 765c67f5f2e0b1367e37883f662d313661e3a0d9) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 09 December 2015, 02:40:32 UTC
2a5e4d1 [SPARK-12069][SQL] Update documentation with Datasets Author: Michael Armbrust <michael@databricks.com> Closes #10060 from marmbrus/docs. (cherry picked from commit 39594894232e0b70c5ca8b0df137da0d61223fd5) Signed-off-by: Michael Armbrust <michael@databricks.com> 08 December 2015, 23:58:45 UTC
25249d1 [SPARK-12187] *MemoryPool classes should not be fully public This patch tightens them to `private[memory]`. Author: Andrew Or <andrew@databricks.com> Closes #10182 from andrewor14/memory-visibility. (cherry picked from commit 9494521695a1f1526aae76c0aea34a3bead96251) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 08 December 2015, 22:34:26 UTC
3e31e7e [SPARK-12159][ML] Add user guide section for IndexToString transformer Documentation regarding the `IndexToString` label transformer with code snippets in Scala/Java/Python. Author: BenFradet <benjamin.fradet@gmail.com> Closes #10166 from BenFradet/SPARK-12159. (cherry picked from commit 06746b3005e5e9892d0314bee3bfdfaebc36d3d4) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 08 December 2015, 20:45:51 UTC
7e45feb [SPARK-11605][MLLIB] ML 1.6 QA: API: Java compatibility, docs jira: https://issues.apache.org/jira/browse/SPARK-11605 Check Java compatibility for MLlib for this release. fix: 1. `StreamingTest.registerStream` needs java friendly interface. 2. `GradientBoostedTreesModel.computeInitialPredictionAndError` and `GradientBoostedTreesModel.updatePredictionError` has java compatibility issue. Mark them as `developerAPI`. TBD: [updated] no fix for now per discussion. `org.apache.spark.mllib.classification.LogisticRegressionModel` `public scala.Option<java.lang.Object> getThreshold();` has wrong return type for Java invocation. `SVMModel` has the similar issue. Yet adding a `scala.Option<java.util.Double> getThreshold()` would result in an overloading error due to the same function signature. And adding a new function with different name seems to be not necessary. cc jkbradley feynmanliang Author: Yuhao Yang <hhbyyh@gmail.com> Closes #10102 from hhbyyh/javaAPI. (cherry picked from commit 5cb4695051e3dac847b1ea14d62e54dcf672c31c) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 08 December 2015, 19:47:04 UTC
9145bfb [SPARK-12205][SQL] Pivot fails Analysis when aggregate is UnresolvedFunction Delays application of ResolvePivot until all aggregates are resolved to prevent problems with UnresolvedFunction and adds unit test Author: Andrew Ray <ray.andrew@gmail.com> Closes #10202 from aray/sql-pivot-unresolved-function. (cherry picked from commit 4bcb894948c1b7294d84e2bf58abb1d79e6759c6) Signed-off-by: Yin Huai <yhuai@databricks.com> 08 December 2015, 18:53:24 UTC
1c8451b [SPARK-10393] use ML pipeline in LDA example jira: https://issues.apache.org/jira/browse/SPARK-10393 Since the logic of the text processing part has been moved to ML estimators/transformers, replace the related code in LDA Example with the ML pipeline. Author: Yuhao Yang <hhbyyh@gmail.com> Author: yuhaoyang <yuhao@zhanglipings-iMac.local> Closes #8551 from hhbyyh/ldaExUpdate. (cherry picked from commit 872a2ee281d84f40a786f765bf772cdb06e8c956) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 08 December 2015, 18:30:00 UTC
be0fe9b [SPARK-12188][SQL] Code refactoring and comment correction in Dataset APIs This PR contains the following updates: - Created a new private variable `boundTEncoder` that can be shared by multiple functions, `RDD`, `select` and `collect`. - Replaced all the `queryExecution.analyzed` by the function call `logicalPlan` - A few API comments are using wrong class names (e.g., `DataFrame`) or parameter names (e.g., `n`) - A few API descriptions are wrong. (e.g., `mapPartitions`) marmbrus rxin cloud-fan Could you take a look and check if they are appropriate? Thank you! Author: gatorsmile <gatorsmile@gmail.com> Closes #10184 from gatorsmile/datasetClean. (cherry picked from commit 5d96a710a5ed543ec81e383620fc3b2a808b26a1) Signed-off-by: Michael Armbrust <michael@databricks.com> 08 December 2015, 18:26:17 UTC
9eeb0f2 [SPARK-12195][SQL] Adding BigDecimal, Date and Timestamp into Encoder This PR is to add three more data types into Encoder, including `BigDecimal`, `Date` and `Timestamp`. marmbrus cloud-fan rxin Could you take a quick look at these three types? Not sure if it can be merged to 1.6. Thank you very much! Author: gatorsmile <gatorsmile@gmail.com> Closes #10188 from gatorsmile/dataTypesinEncoder. (cherry picked from commit c0b13d5565c45ae2acbe8cfb17319c92b6a634e4) Signed-off-by: Michael Armbrust <michael@databricks.com> 08 December 2015, 18:16:06 UTC
8ef33aa [SPARK-12201][SQL] add type coercion rule for greatest/least checked with hive, greatest/least should cast their children to a tightest common type, i.e. `(int, long) => long`, `(int, string) => error`, `(decimal(10,5), decimal(5, 10)) => error` Author: Wenchen Fan <wenchen@databricks.com> Closes #10196 from cloud-fan/type-coercion. (cherry picked from commit 381f17b540d92507cc07adf18bce8bc7e5ca5407) Signed-off-by: Michael Armbrust <michael@databricks.com> 08 December 2015, 18:13:54 UTC
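A quick check of the rule, assuming a `sqlContext` in scope:

```scala
// int and bigint coerce to the tightest common type (bigint); int and string do not
// coerce and should fail analysis instead.
sqlContext.sql("SELECT greatest(1, cast(2 as bigint)) AS g").printSchema()   // g: bigint
sqlContext.sql("SELECT least(1, cast(2 as bigint)) AS l").show()             // 1
```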
c8f9eb7 [SPARK-11652][CORE] Remote code execution with InvokerTransformer Fix commons-collection group ID to commons-collections for version 3.x Patches earlier PR at https://github.com/apache/spark/pull/9731 Author: Sean Owen <sowen@cloudera.com> Closes #10198 from srowen/SPARK-11652.2. (cherry picked from commit e3735ce1602826f0a8e0ca9e08730923843449ee) Signed-off-by: Sean Owen <sowen@cloudera.com> 08 December 2015, 14:35:01 UTC