https://github.com/apache/spark

Revision Message Commit Date
4c56ad7 Preparing Spark release v1.5.0-rc1 20 August 2015, 23:24:07 UTC
175c1d9 Preparing development version 1.5.0-SNAPSHOT 20 August 2015, 22:33:10 UTC
d837d51 Preparing Spark release v1.5.0-rc1 20 August 2015, 22:33:04 UTC
2beea65 [SPARK-9245] [MLLIB] LDA topic assignments For each (document, term) pair, return top topic. Note that instances of (doc, term) pairs within a document (a.k.a. "tokens") are exchangeable, so we should provide an estimate per document-term, rather than per token. CC: rotationsymmetry mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8329 from jkbradley/lda-topic-assignments. (cherry picked from commit eaafe139f881d6105996373c9b11f2ccd91b5b3e) Signed-off-by: Xiangrui Meng <meng@databricks.com> 20 August 2015, 22:01:37 UTC
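The API surfaced by this commit is `DistributedLDAModel.topicAssignments`. A minimal usage sketch (the toy corpus, `local[2]` master, and object name are illustrative assumptions, not from the commit):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.linalg.Vectors

object TopicAssignmentsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lda-topics").setMaster("local[2]"))
    // Toy corpus: (docId, termCountVector)
    val corpus = sc.parallelize(Seq(
      (0L, Vectors.dense(1.0, 2.0, 0.0)),
      (1L, Vectors.dense(0.0, 1.0, 3.0))))
    // The default EM optimizer yields a DistributedLDAModel
    val model = new LDA().setK(2).run(corpus).asInstanceOf[DistributedLDAModel]
    // One top-topic estimate per (document, term) pair, not per token
    model.topicAssignments.collect().foreach { case (docId, terms, topics) =>
      println(s"doc $docId: " + terms.zip(topics).mkString(", "))
    }
    sc.stop()
  }
}
```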
560ec12 [SPARK-10108] Add since tags to mllib.feature Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8309 from MechCoder/tags_feature. (cherry picked from commit 7cfc0750e14f2c1b3847e4720cc02150253525a9) Signed-off-by: Xiangrui Meng <meng@databricks.com> 20 August 2015, 21:59:55 UTC
2e0d2a9 [SPARK-10138] [ML] move setters to MultilayerPerceptronClassifier and add Java test suite Otherwise, setters do not return self type. jkbradley avulanov Author: Xiangrui Meng <meng@databricks.com> Closes #8342 from mengxr/SPARK-10138. (cherry picked from commit 2a3d98aae285aba39786e9809f96de412a130f39) Signed-off-by: Xiangrui Meng <meng@databricks.com> 20 August 2015, 21:47:11 UTC
eac31ab Preparing development version 1.5.0-SNAPSHOT 20 August 2015, 19:43:13 UTC
99eeac8 Preparing Spark release v1.5.0-rc1 20 August 2015, 19:43:08 UTC
6026f4f [SPARK-10126] [PROJECT INFRA] Fix typo in release-build.sh which broke snapshot publishing for Scala 2.11 The current `release-build.sh` has a typo which breaks snapshot publication for Scala 2.11. We should change the Scala version to 2.11 and clean before building a 2.11 snapshot. Author: Josh Rosen <joshrosen@databricks.com> Closes #8325 from JoshRosen/fix-2.11-snapshots. (cherry picked from commit 12de348332108f8c0c5bdad1d4cfac89b952b0f8) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 20 August 2015, 18:31:21 UTC
a1785e3 Preparing development version 1.5.0-SNAPSHOT 20 August 2015, 18:06:41 UTC
19b92c8 Preparing Spark release v1.5.0-rc1 20 August 2015, 18:06:31 UTC
2f47e09 [SPARK-10136] [SQL] Fixes Parquet support for Avro array of primitive array I caught SPARK-10136 while adding more test cases to `ParquetAvroCompatibilitySuite`. Actual bug fix code lies in `CatalystRowConverter.scala`. Author: Cheng Lian <lian@databricks.com> Closes #8341 from liancheng/spark-10136/parquet-avro-nested-primitive-array. (cherry picked from commit 85f9a61357994da5023b08b0a8a2eb09388ce7f8) Signed-off-by: Michael Armbrust <michael@databricks.com> 20 August 2015, 18:02:02 UTC
a7027e6 [SPARK-9982] [SPARKR] SparkR DataFrame fails to return data of Decimal type Author: Alex Shkurenko <ashkurenko@enova.com> Closes #8239 from ashkurenko/master. (cherry picked from commit 39e91fe2fd43044cc734d55625a3c03284b69f09) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 20 August 2015, 17:16:57 UTC
257e9d7 [MINOR] [SQL] Fix sphinx warnings in PySpark SQL Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8171 from MechCoder/sql_sphinx. (cherry picked from commit 52c60537a274af5414f6b0340a4bd7488ef35280) Signed-off-by: Xiangrui Meng <meng@databricks.com> 20 August 2015, 17:05:39 UTC
5be5175 [SPARK-10100] [SQL] Eliminate hash table lookup if there is no grouping key in aggregation. This improves performance by ~ 20 - 30% in one of my local test and should fix the performance regression from 1.4 to 1.5 on ss_max. Author: Reynold Xin <rxin@databricks.com> Closes #8332 from rxin/SPARK-10100. (cherry picked from commit b4f4e91c395cb69ced61d9ff1492d1b814f96828) Signed-off-by: Yin Huai <yhuai@databricks.com> 20 August 2015, 14:53:40 UTC
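A sketch of the idea behind this optimization, with made-up helper names: when there is no grouping key, all rows update one shared aggregation buffer, so the per-row hash-table probe disappears.

```scala
// Illustrative only: aggregation with no grouping key needs exactly one buffer.
def sumWithoutGrouping(rows: Iterator[Long]): Long = {
  var buffer = 0L              // the one and only aggregation buffer
  while (rows.hasNext) {
    buffer += rows.next()      // no hashing, no probing
  }
  buffer
}

// With grouping keys, a per-key buffer lookup is unavoidable:
def sumByKey(rows: Iterator[(String, Long)]): Map[String, Long] = {
  val buffers = scala.collection.mutable.HashMap.empty[String, Long]
  rows.foreach { case (k, v) => buffers(k) = buffers.getOrElse(k, 0L) + v }
  buffers.toMap
}
```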
675e224 [SPARK-10092] [SQL] Backports #8324 to branch-1.5 Author: Yin Huai <yhuai@databricks.com> Closes #8336 from liancheng/spark-10092/for-branch-1.5. 20 August 2015, 10:43:24 UTC
71aa547 [SPARK-10128] [STREAMING] Used correct classloader to deserialize WAL data Recovering Kinesis sequence numbers from WAL leads to a `ClassNotFoundException` because the ObjectInputStream does not use the correct classloader, so the SequenceNumberRanges class (in the streaming-kinesis-asl package, added through spark-submit) cannot be found while deserializing. The solution is to use `Thread.currentThread().getContextClassLoader` while deserializing. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8328 from tdas/SPARK-10128 and squashes the following commits: f19b1c2 [Tathagata Das] Used correct classloader to deserialize WAL data (cherry picked from commit b762f9920f7587d3c08493c49dd2fede62110b88) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 20 August 2015, 04:16:17 UTC
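The classloader pattern in question, as a self-contained sketch (class and method names here are illustrative, not Spark's internals):

```scala
import java.io.{ByteArrayInputStream, InputStream, ObjectInputStream, ObjectStreamClass}

// ObjectInputStream resolves classes against the caller's defining classloader by
// default, which misses classes added via spark-submit/--jars. Overriding
// resolveClass to consult the thread context classloader fixes that.
class ContextAwareObjectInputStream(in: InputStream) extends ObjectInputStream(in) {
  override def resolveClass(desc: ObjectStreamClass): Class[_] =
    Class.forName(desc.getName, false, Thread.currentThread().getContextClassLoader)
}

def deserialize[T](bytes: Array[Byte]): T = {
  val ois = new ContextAwareObjectInputStream(new ByteArrayInputStream(bytes))
  try ois.readObject().asInstanceOf[T] finally ois.close()
}
```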
63922fa [SPARK-10125] [STREAMING] Fix a potential deadlock in JobGenerator.stop Because `lazy val` uses `this` lock, if JobGenerator.stop and JobGenerator.doCheckpoint (JobGenerator.shouldCheckpoint has not yet been initialized) run at the same time, it may hang. Here are the stack traces for the deadlock: ```Java "pool-1-thread-1-ScalaTest-running-StreamingListenerSuite" #11 prio=5 os_prio=31 tid=0x00007fd35d094800 nid=0x5703 in Object.wait() [0x000000012ecaf000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1245) - locked <0x00000007b5d8d7f8> (a org.apache.spark.util.EventLoop$$anon$1) at java.lang.Thread.join(Thread.java:1319) at org.apache.spark.util.EventLoop.stop(EventLoop.scala:81) at org.apache.spark.streaming.scheduler.JobGenerator.stop(JobGenerator.scala:155) - locked <0x00000007b5d8cea0> (a org.apache.spark.streaming.scheduler.JobGenerator) at org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:95) - locked <0x00000007b5d8ced8> (a org.apache.spark.streaming.scheduler.JobScheduler) at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:687) "JobGenerator" #67 daemon prio=5 os_prio=31 tid=0x00007fd35c3b9800 nid=0x9f03 waiting for monitor entry [0x0000000139e4a000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.spark.streaming.scheduler.JobGenerator.shouldCheckpoint$lzycompute(JobGenerator.scala:63) - waiting to lock <0x00000007b5d8cea0> (a org.apache.spark.streaming.scheduler.JobGenerator) at org.apache.spark.streaming.scheduler.JobGenerator.shouldCheckpoint(JobGenerator.scala:63) at org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:290) at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:182) at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:83) at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:82) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) ``` I can use this patch to produce this deadlock: https://github.com/zsxwing/spark/commit/8a88f28d1331003a65fabef48ae3d22a7c21f05f And a timeout build in Jenkins due to this deadlock: https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1654/ This PR initializes `checkpointWriter` before `eventLoop` uses it to avoid this deadlock. Author: zsxwing <zsxwing@gmail.com> Closes #8326 from zsxwing/SPARK-10125. 20 August 2015, 02:44:33 UTC
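A condensed sketch of the hazard (class names hypothetical): forcing a `lazy val` synchronizes on the enclosing instance, so it can deadlock against any `synchronized` method on the same object that waits for the thread doing the forcing. Initializing the value eagerly, as this PR does for `checkpointWriter`, removes the race.

```scala
// Hazard: `lazy val` initialization takes the monitor of `this`.
class Generator {
  lazy val shouldCheckpoint: Boolean = expensiveCheck() // locks `this` on first access

  def stop(): Unit = synchronized {
    // also locks `this`, then joins the event-loop thread, which may at this
    // very moment be blocked forcing `shouldCheckpoint` -> deadlock
  }

  private def expensiveCheck(): Boolean = true
}

// Fix in spirit: initialize eagerly, before any thread can race for the monitor.
class SafeGenerator {
  private val shouldCheckpoint: Boolean = true // plain val, set in the constructor
  def stop(): Unit = synchronized { /* ... */ }
}
```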
a3ed2c3 [SPARK-10124] [MESOS] Fix removing queued driver in mesos cluster mode. Currently spark applications can be queued to the Mesos cluster dispatcher, but when multiple jobs are in the queue we don't handle removing jobs from the buffer correctly while iterating, which causes a NullPointerException. This patch copies the buffer before iterating over it, so exceptions aren't thrown when the jobs are removed. Author: Timothy Chen <tnachen@gmail.com> Closes #8322 from tnachen/fix_cluster_mode. (cherry picked from commit 73431d8afb41b93888d2642a1ce2d011f03fb740) Signed-off-by: Andrew Or <andrew@databricks.com> 20 August 2015, 02:43:34 UTC
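The copy-before-iterate pattern the patch applies, reduced to a minimal form (buffer contents and the predicate are made up for illustration):

```scala
import scala.collection.mutable.ArrayBuffer

val queuedDrivers = ArrayBuffer("driver-1", "driver-2", "driver-3")
def shouldRemove(d: String): Boolean = d == "driver-2" // illustrative predicate

// Removing from an ArrayBuffer while iterating over it corrupts the traversal.
// Iterating over a snapshot keeps the mutation safe:
for (driver <- queuedDrivers.toList) { // toList takes the snapshot
  if (shouldRemove(driver)) {
    queuedDrivers -= driver            // mutates the original, not the snapshot
  }
}
```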
16414da [SPARK-9812] [STREAMING] Fix Python 3 compatibility issue in PySpark Streaming and some docs This PR includes the following fixes: 1. Use `range` instead of `xrange` in `queue_stream.py` to support Python 3. 2. Fix the issue that `utf8_decoder` will return `bytes` rather than `str` when receiving an empty `bytes` in Python 3. 3. Fix the commands in the docs so that the user can copy them directly to the command line. The previous commands were broken in the middle of a path, so when copied to the command line the path would be split into two parts by the extra spaces, forcing the user to fix it manually. Author: zsxwing <zsxwing@gmail.com> Closes #8315 from zsxwing/SPARK-9812. (cherry picked from commit 1f29d502e7ecd6faa185d70dc714f9ea3922fb6d) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 20 August 2015, 01:36:10 UTC
321cb99 [SPARK-9242] [SQL] Audit UDAF interface. A few minor changes: 1. Improved documentation 2. Rename apply(distinct....) to distinct. 3. Changed MutableAggregationBuffer from a trait to an abstract class. 4. Renamed returnDataType to dataType to be more consistent with other expressions. And unrelated to UDAFs: 1. Renamed file names in expressions to use suffix "Expressions" to be more consistent. 2. Moved regexp related expressions out to its own file. 3. Renamed StringComparison => StringPredicate. Author: Reynold Xin <rxin@databricks.com> Closes #8321 from rxin/SPARK-9242. (cherry picked from commit 2f2686a73f5a2a53ca5b1023e0d7e0e6c9be5896) Signed-off-by: Reynold Xin <rxin@databricks.com> 20 August 2015, 00:35:48 UTC
56a37b0 [SPARK-9895] User Guide for RFormula Feature Transformer mengxr Author: Eric Liang <ekl@databricks.com> Closes #8293 from ericl/docs-2. (cherry picked from commit 8e0a072f78b4902d5f7ccc6b15232ed202a117f9) Signed-off-by: Xiangrui Meng <meng@databricks.com> 19 August 2015, 22:43:15 UTC
5c749c8 [SPARK-6489] [SQL] add column pruning for Generate This PR takes over https://github.com/apache/spark/pull/5358 Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8268 from cloud-fan/6489. (cherry picked from commit b0dbaec4f942a47afde3490b9339ad3bd187024d) Signed-off-by: Michael Armbrust <michael@databricks.com> 19 August 2015, 22:05:25 UTC
a59475f [SPARK-10119] [CORE] Fix isDynamicAllocationEnabled when config is explicitly disabled. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8316 from vanzin/SPARK-10119. (cherry picked from commit e0dd1309ac248375f429639801923570f14de18d) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 19 August 2015, 21:33:45 UTC
1494d58 [SPARK-10083] [SQL] CaseWhen should support type coercion of DecimalType and FractionalType create table t1 (a decimal(7, 2), b long); select case when 1=1 then a else 1.0 end from t1; select case when 1=1 then a else b end from t1; Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #8270 from adrian-wang/casewhenfractional. (cherry picked from commit 373a376c04320aab228b5c385e2b788809877d3e) Signed-off-by: Michael Armbrust <michael@databricks.com> 19 August 2015, 21:32:43 UTC
b32a31d [SPARK-9899] [SQL] Disables customized output committer when speculation is on Speculation hates direct output committer, as there are multiple corner cases that may cause data corruption and/or data loss. Please see this [PR comment] [1] for more details. [1]: https://github.com/apache/spark/pull/8191#issuecomment-131598385 Author: Cheng Lian <lian@databricks.com> Closes #8317 from liancheng/spark-9899/speculation-hates-direct-output-committer. (cherry picked from commit f3ff4c41d2e32bd0f2419d1c9c68fcd0c2593e41) Signed-off-by: Michael Armbrust <michael@databricks.com> Conflicts: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala 19 August 2015, 21:26:11 UTC
d9dfd43 [SPARK-10090] [SQL] fix decimal scale of division We should round the result of decimal multiplication/division to the expected precision/scale, and also check for overflow. Author: Davies Liu <davies@databricks.com> Closes #8287 from davies/decimal_division. (cherry picked from commit 1f4c4fe6dfd8cc52b5fddfd67a31a77edbb1a036) Signed-off-by: Michael Armbrust <michael@databricks.com> 19 August 2015, 21:04:09 UTC
77269fc [SPARK-9627] [SQL] Stops using Scala runtime reflection in DictionaryEncoding `DictionaryEncoding` uses Scala runtime reflection to avoid boxing costs while building the dictionary array. However, this code path may hit [SI-6240] [1] and throw an exception. [1]: https://issues.scala-lang.org/browse/SI-6240 Author: Cheng Lian <lian@databricks.com> Closes #8306 from liancheng/spark-9627/in-memory-cache-scala-reflection. (cherry picked from commit 21bdbe9fe69be47be562de24216a469e5ee64c7b) Signed-off-by: Michael Armbrust <michael@databricks.com> 19 August 2015, 20:58:03 UTC
afaed7e [SPARK-10073] [SQL] Python withColumn should replace the old column DataFrame.withColumn in Python should be consistent with the Scala one (replacing the existing column that has the same name). cc marmbrus Author: Davies Liu <davies@databricks.com> Closes #8300 from davies/with_column. (cherry picked from commit 08887369c890e0dd87eb8b34e8c32bb03307bf24) Signed-off-by: Michael Armbrust <michael@databricks.com> 19 August 2015, 20:56:54 UTC
829c33a [SPARK-10087] [CORE] [BRANCH-1.5] Disable spark.shuffle.reduceLocality.enabled by default. https://issues.apache.org/jira/browse/SPARK-10087 In some cases, when spark.shuffle.reduceLocality.enabled is enabled, we schedule all reducers to the same executor even though the cluster has plenty of resources. Changing spark.shuffle.reduceLocality.enabled to false resolves the problem. Comments on https://github.com/apache/spark/pull/8280 provide more details of the symptoms of this issue. This PR changes the default setting of `spark.shuffle.reduceLocality.enabled` to `false` for branch 1.5. Author: Yin Huai <yhuai@databricks.com> Closes #8296 from yhuai/setNumPartitionsCorrectly-branch1.5. 19 August 2015, 20:43:46 UTC
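On builds where the default is still `true`, the same effect can be had by setting the flag explicitly; a minimal sketch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("no-reduce-locality")
  .set("spark.shuffle.reduceLocality.enabled", "false") // the workaround described above
val sc = new SparkContext(conf)
```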
1038f67 [SPARK-10107] [SQL] fix NPE in format_number Author: Davies Liu <davies@databricks.com> Closes #8305 from davies/format_number. (cherry picked from commit e05da5cb5ea253e6372f648fc8203204f2a8df8d) Signed-off-by: Reynold Xin <rxin@databricks.com> 19 August 2015, 20:43:20 UTC
8c0a5a2 [SPARK-8918] [MLLIB] [DOC] Add @since tags to mllib.clustering This continues the work from #8256. I removed `since` tags from private/protected/local methods/variables (see https://github.com/apache/spark/commit/72fdeb64630470f6f46cf3eed8ffbfe83a7c4659). MechCoder Closes #8256 Author: Xiangrui Meng <meng@databricks.com> Author: Xiaoqing Wang <spark445@126.com> Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8288 from mengxr/SPARK-8918. (cherry picked from commit 5b62bef8cbf73f910513ef3b1f557aa94b384854) Signed-off-by: Xiangrui Meng <meng@databricks.com> 19 August 2015, 20:17:34 UTC
ba36925 [SPARK-10106] [SPARKR] Add `ifelse` Column function to SparkR ### JIRA [[SPARK-10106] Add `ifelse` Column function to SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10106) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8303 from yu-iskw/SPARK-10106. (cherry picked from commit d898c33f774b9a3db2fb6aa8f0cb2c2ac6004b58) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 19 August 2015, 19:39:44 UTC
f25c324 [SPARK-10097] Adds `shouldMaximize` flag to `ml.evaluation.Evaluator` Previously, users of evaluators (`CrossValidator` and `TrainValidationSplit`) would only maximize the metric in the evaluator, leading to a hacky solution which negated metrics to be minimized and caused erroneous negative values to be reported to the user. This PR adds an `isLargerBetter` attribute to the `Evaluator` base class, telling users of `Evaluator` whether the chosen metric should be maximized or minimized. CC jkbradley Author: Feynman Liang <fliang@databricks.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #8290 from feynmanliang/SPARK-10097. (cherry picked from commit 28a98464ea65aa7b35e24fca5ddaa60c2e5d53ee) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 19 August 2015, 18:35:17 UTC
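A hedged sketch of how a tuning loop can honor the new attribute (the helper is hypothetical, not part of the PR):

```scala
import org.apache.spark.ml.evaluation.Evaluator

// Pick the best of several validation metrics, respecting the metric's direction.
def bestMetric(evaluator: Evaluator, metrics: Seq[Double]): Double =
  if (evaluator.isLargerBetter) metrics.max else metrics.min
```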
a8e8808 [SPARK-9856] [SPARKR] Add expression functions into SparkR whose params are complicated I added lots of Column functions into SparkR, and I also added `rand(seed: Int)` and `randn(seed: Int)` in Scala, since we need such APIs for the R integer type. ### JIRA [[SPARK-9856] Add expression functions into SparkR whose params are complicated - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9856) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8264 from yu-iskw/SPARK-9856-3. (cherry picked from commit 2fcb9cb9552dac1d78dcca5d4d5032b4fa6c985c) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 19 August 2015, 17:41:22 UTC
bebe63d [SPARK-10084] [MLLIB] [DOC] Add Python example for mllib FP-growth user guide 1, Add Python example for mllib FP-growth user guide. 2, Correct mistakes of Scala and Java examples. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8279 from yanboliang/spark-10084. (cherry picked from commit 802b5b8791fc2c892810981b2479a04175aa3dcd) Signed-off-by: Xiangrui Meng <meng@databricks.com> 19 August 2015, 15:53:42 UTC
f8dc427 [SPARK-10060] [ML] [DOC] spark.ml DecisionTree user guide New user guide section ml-decision-tree.md, including code examples. I have run all examples, including the Java ones. CC: manishamde yanboliang mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8244 from jkbradley/ml-dt-docs. (cherry picked from commit 39e4ebd521defdb68a0787bcd3bde6bc855f5198) Signed-off-by: Xiangrui Meng <meng@databricks.com> 19 August 2015, 14:38:39 UTC
522b0b6 [SPARK-8949] Print warnings when using preferred locations feature Add warnings according to SPARK-8949 in `SparkContext` - warnings in scaladoc - log warnings when the preferred locations feature is used through `SparkContext`'s constructor However, I didn't find any documentation reference for this feature. Please direct me to it if you know of any. Author: Han JU <ju.han.felix@gmail.com> Closes #7874 from darkjh/SPARK-8949. (cherry picked from commit 3d16a545007922ee6fa36e5f5c3959406cb46484) Signed-off-by: Sean Owen <sowen@cloudera.com> 19 August 2015, 12:04:24 UTC
5553f02 [SPARK-9977] [DOCS] Update documentation for StringIndexer By using `StringIndexer`, we can obtain the indexed label in a new column, so a downstream estimator in the pipeline should use this new column if it wants to use the string-indexed label. I think it is better to make this explicit in the documentation. Author: lewuathe <lewuathe@me.com> Closes #8205 from Lewuathe/SPARK-9977. (cherry picked from commit ba2a07e2b6c5a39597b64041cd5bf342ef9631f5) Signed-off-by: Sean Owen <sowen@cloudera.com> 19 August 2015, 08:54:11 UTC
e56bcc6 [DOCS] [SQL] [PYSPARK] Fix typo in ntile function Fix typo in ntile function. Author: Moussa Taifi <moutai10@gmail.com> Closes #8261 from moutai/patch-2. (cherry picked from commit 865a3df3d578c0442c97d749c81f554b560da406) Signed-off-by: Sean Owen <sowen@cloudera.com> 19 August 2015, 08:42:50 UTC
561390d [SPARK-10070] [DOCS] Remove Guava dependencies in user guides `Lists.newArrayList` -> `Arrays.asList` CC jkbradley feynmanliang Anybody into replacing usages of `Lists.newArrayList` in the examples / source code too? this method isn't useful in Java 7 and beyond. Author: Sean Owen <sowen@cloudera.com> Closes #8272 from srowen/SPARK-10070. (cherry picked from commit f141efeafb42b14b5fcfd9aa8c5275162042349f) Signed-off-by: Sean Owen <sowen@cloudera.com> 19 August 2015, 08:41:19 UTC
417852f Fix Broken Link Link was broken because it included tick marks. Author: Bill Chambers <wchambers@ischool.berkeley.edu> Closes #8302 from anabranch/patch-1. (cherry picked from commit b23c4d3ffc36e47c057360c611d8ab1a13877699) Signed-off-by: Reynold Xin <rxin@databricks.com> 19 August 2015, 07:05:12 UTC
392bd19 [SPARK-9967] [SPARK-10099] [STREAMING] Renamed conf spark.streaming.backpressure.{enable-->enabled} and fixed deprecated annotations Small changes - Renamed conf spark.streaming.backpressure.{enable --> enabled} - Change Java Deprecated annotations to Scala deprecated annotation with more information. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8299 from tdas/SPARK-9967. (cherry picked from commit bc9a0e03235865d2ec33372f6400dec8c770778a) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 19 August 2015, 06:38:13 UTC
3ceee55 [SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses cacheLocs In Scala, `Seq.fill` always seems to return a List. Accessing a list by index is an O(N) operation. Thus, the following code will be really slow (~10 seconds on my machine): ```scala val numItems = 100000 val s = Seq.fill(numItems)(1) for (i <- 0 until numItems) s(i) ``` It turns out that we had a loop like this in DAGScheduler code, although it's a little tricky to spot. In `getPreferredLocsInternal`, there's a call to `getCacheLocs(rdd)(partition)`. The `getCacheLocs` call returns a Seq. If this Seq is a List and the RDD contains many partitions, then indexing into this list will cost O(partitions). Thus, when we loop over our tasks to compute their individual preferred locations we implicitly perform an N^2 loop, reducing scheduling throughput. This patch fixes this by replacing `Seq` with `Array`. Author: Josh Rosen <joshrosen@databricks.com> Closes #8178 from JoshRosen/dagscheduler-perf. (cherry picked from commit 010b03ed52f35fd4d426d522f8a9927ddc579209) Signed-off-by: Reynold Xin <rxin@databricks.com> 19 August 2015, 05:30:20 UTC
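The fix in spirit, mirroring the patch's `Seq` -> `Array` change: positional access on an `Array` is O(1), so the loop over partitions stays linear.

```scala
val numItems = 100000

val slow: Seq[Int] = Seq.fill(numItems)(1)     // concretely a List
val fast: Array[Int] = Array.fill(numItems)(1)

// O(N^2) in total: each slow(i) walks the list from the head
// for (i <- 0 until numItems) slow(i)

// O(N) in total: constant-time indexing
for (i <- 0 until numItems) fast(i)
```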
4163926 [SPARK-9508] GraphX Pregel docs update with new Pregel code SPARK-9436 simplifies the Pregel code. graphx-programming-guide needs to be modified accordingly since it lists the old Pregel code Author: Alexander Ulanov <nashb@yandex.ru> Closes #7831 from avulanov/SPARK-9508-pregel-doc2. (cherry picked from commit 1c843e284818004f16c3f1101c33b510f80722e3) Signed-off-by: Reynold Xin <rxin@databricks.com> 19 August 2015, 05:13:57 UTC
03a8a88 [SPARK-9705] [DOC] fix docs about Python version cc JoshRosen Author: Davies Liu <davies@databricks.com> Closes #8245 from davies/python_doc. (cherry picked from commit de3223872a217c5224ba7136604f6b7753b29108) Signed-off-by: Reynold Xin <rxin@databricks.com> 19 August 2015, 05:11:32 UTC
3c33931 [SPARK-10093] [SPARK-10096] [SQL] Avoid transformation on executors & fix UDFs on complex types This is kind of a weird case, but given a sufficiently complex query plan (in this case a TungstenProject with an Exchange underneath), we could have NPEs on the executors due to the time at which we were calling transformAllExpressions. In general we should ensure that all transformations occur on the driver and not on the executors. Some reasons for avoiding executor-side transformations include: * (this case) Some operator constructors require state such as access to the Spark/SQL conf, so doing a makeCopy on the executor can fail. * (an unrelated reason to avoid executor transformations) ExprIds are calculated using an atomic integer, so you can violate their uniqueness constraint by constructing them anywhere other than the driver. This subsumes #8285. Author: Reynold Xin <rxin@databricks.com> Author: Michael Armbrust <michael@databricks.com> Closes #8295 from rxin/SPARK-10096. (cherry picked from commit 1ff0580eda90f9247a5233809667f5cebaea290e) Signed-off-by: Reynold Xin <rxin@databricks.com> 19 August 2015, 05:08:22 UTC
11c9335 [SPARK-10095] [SQL] use public API of BigInteger In UnsafeRow, we used the private field of BigInteger for better performance, but it actually didn't contribute much (3% in one benchmark) to end-to-end runtime, and it makes the code non-portable (it may fail on other JVM implementations). So we should use the public API instead. cc rxin Author: Davies Liu <davies@databricks.com> Closes #8286 from davies/portable_decimal. (cherry picked from commit 270ee677750a1f2adaf24b5816857194e61782ff) Signed-off-by: Davies Liu <davies.liu@gmail.com> 19 August 2015, 03:40:12 UTC
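The portable alternative, sketched: round-trip the value through `BigInteger`'s public two's-complement byte representation instead of reading its private internals.

```scala
import java.math.BigInteger

def write(b: BigInteger): Array[Byte] = b.toByteArray // public, portable encoding
def read(bytes: Array[Byte]): BigInteger = new BigInteger(bytes)

val x = new BigInteger("123456789012345678901234567890")
assert(read(write(x)) == x) // lossless round-trip on any JVM
```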
ebaeb18 [SPARK-10075] [SPARKR] Add `when` expression function in SparkR - Add `when` and `otherwise` as `Column` methods - Add `When` as an expression function - Add `%otherwise%` infix as an alias of `otherwise` Since R doesn't support a feature like method chaining, the `otherwise(when(condition, value), value)` style is a little annoying for me. If `%otherwise%` looks strange to shivaram, I can remove it. What do you think? ### JIRA [[SPARK-10075] Add `when` expression function in SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10075) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8266 from yu-iskw/SPARK-10075. (cherry picked from commit bf32c1f7f47dd907d787469f979c5859e02ce5e6) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 19 August 2015, 03:29:34 UTC
bb2fb59 [SPARK-9939] [SQL] Resorts to Java process API in CliSuite, HiveSparkSubmitSuite and HiveThriftServer2 test suites Scala process API has a known bug ([SI-8768] [1]), which may be the reason why several test suites which fork sub-processes are flaky. This PR replaces Scala process API with Java process API in `CliSuite`, `HiveSparkSubmitSuite`, and `HiveThriftServer2` related test suites to see whether it fix these flaky tests. [1]: https://issues.scala-lang.org/browse/SI-8768 Author: Cheng Lian <lian@databricks.com> Closes #8168 from liancheng/spark-9939/use-java-process-api. (cherry picked from commit a5b5b936596ceb45f5f5b68bf1d6368534fb9470) Signed-off-by: Cheng Lian <lian@databricks.com> 19 August 2015, 03:22:31 UTC
a6f8979 [SPARK-10102] [STREAMING] Fix a race condition that startReceiver may happen before setting trackerState to Started Test failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=spark-test/3305/testReport/junit/org.apache.spark.streaming/StreamingContextSuite/stop_gracefully/ There is a race condition that setting `trackerState` to `Started` could happen after calling `startReceiver`. Then `startReceiver` won't start the receivers because it uses `! isTrackerStarted` to check if ReceiverTracker is stopping or stopped. But actually, `trackerState` is `Initialized` and will be changed to `Started` soon. Therefore, we should use `isTrackerStopping || isTrackerStopped`. Author: zsxwing <zsxwing@gmail.com> Closes #8294 from zsxwing/SPARK-9504. (cherry picked from commit 90273eff9604439a5a5853077e232d34555c67d7) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 19 August 2015, 03:16:18 UTC
08c5962 [SPARK-10072] [STREAMING] BlockGenerator can deadlock when the queue of generated blocks fills up to capacity Generated blocks are inserted into an ArrayBlockingQueue, and another thread pulls blocks from the ArrayBlockingQueue and pushes them into the BlockManager. Now if that queue fills up to capacity (default is 10 blocks), then the insertion into the queue (done in the function updateCurrentBuffer) gets blocked inside a synchronized block. However, the thread that is pulling blocks from the queue uses the same lock to check the current state (active or stopped) while pulling from the queue. Since the block-generating thread is blocked (as the queue is full) on the lock, the thread that is supposed to drain the queue gets blocked too. Ergo, deadlock. Solution: Moved the blocking call to ArrayBlockingQueue outside the synchronized block to prevent the deadlock. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8257 from tdas/SPARK-10072. (cherry picked from commit 1aeae05bb20f01ab7ccaa62fe905a63e020074b5) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 19 August 2015, 02:26:51 UTC
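The locking pattern at fault, reduced to a sketch (names illustrative): the blocking `put` must not run while holding the lock the consumer also needs.

```scala
import java.util.concurrent.ArrayBlockingQueue

val queue = new ArrayBlockingQueue[String](10)
val lock = new Object

// Anti-pattern: a blocking call under a lock the draining thread also takes.
def pushBlockBad(block: String): Unit = lock.synchronized {
  queue.put(block) // can block forever if the consumer needs `lock` to drain
}

// Fix in spirit: check state under the lock, block outside of it.
def pushBlockGood(block: String): Unit = {
  val active = lock.synchronized { /* check generator state */ true }
  if (active) queue.put(block) // blocking call no longer holds the lock
}
```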
0a1385e [SPARKR] [MINOR] Get rid of a long line warning ``` R/functions.R:74:1: style: lines should not be more than 100 characters. jc <- callJStatic("org.apache.spark.sql.functions", "lit", ifelse(class(x) == "Column", xjc, x)) ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``` Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8297 from yu-iskw/minor-lint-r. (cherry picked from commit b4b35f133aecaf84f04e8e444b660a33c6b7894a) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 19 August 2015, 02:18:13 UTC
9b42e24 Bump SparkR version string to 1.5.0 This patch is against master, but we need to apply it to 1.5 branch as well. cc shivaram and rxin Author: Hossein <hossein@databricks.com> Closes #8291 from falaki/SparkRVersion1.5. (cherry picked from commit 04e0fea79b9acfa3a3cb81dbacb08f9d287b42c3) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 19 August 2015, 01:02:31 UTC
4ee225a [SPARK-8473] [SPARK-9889] [ML] User guide and example code for DCT mengxr jkbradley Author: Feynman Liang <fliang@databricks.com> Closes #8184 from feynmanliang/SPARK-9889-DCT-docs. (cherry picked from commit badf7fa650f9801c70515907fcc26b58d7ec3143) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 19 August 2015, 00:54:58 UTC
e1b50c7 [SPARK-10098] [STREAMING] [TEST] Cleanup active context after test in FailureSuite Failures in streaming.FailureSuite can leak StreamingContext and SparkContext which fails all subsequent tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8289 from tdas/SPARK-10098. (cherry picked from commit 9108eff74a2815986fd067b273c2a344b6315405) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 19 August 2015, 00:00:21 UTC
fb207b2 [SPARK-10012] [ML] Missing test case for Params#arrayLengthGt Currently there is no test case for `Params#arrayLengthGt`. Author: lewuathe <lewuathe@me.com> Closes #8223 from Lewuathe/SPARK-10012. (cherry picked from commit c635a16f64c939182196b46725ef2d00ed107cca) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 18 August 2015, 22:30:34 UTC
56f4da2 [SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.tree Added since tags to mllib.tree Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #7380 from BryanCutler/sinceTag-mllibTree-8924. (cherry picked from commit 1dbffba37a84c62202befd3911d25888f958191d) Signed-off-by: Xiangrui Meng <meng@databricks.com> 18 August 2015, 21:58:37 UTC
8b0df5a [SPARK-10088] [SQL] Add support for "stored as avro" in HiveQL parser. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8282 from vanzin/SPARK-10088. (cherry picked from commit 492ac1facbc79ee251d45cff315598ec9935a0e2) Signed-off-by: Michael Armbrust <michael@databricks.com> 18 August 2015, 21:45:35 UTC
74a6b1a [SPARK-10089] [SQL] Add missing golden files. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8283 from vanzin/SPARK-10089. (cherry picked from commit fa41e0242f075843beff7dc600d1a6bac004bdc7) Signed-off-by: Michael Armbrust <michael@databricks.com> 18 August 2015, 21:43:19 UTC
80a6fb5 [SPARK-10080] [SQL] Fix binary incompatibility for $ column interpolation Turns out that inner classes of inner objects are referenced directly, and thus moving it will break binary compatibility. Author: Michael Armbrust <michael@databricks.com> Closes #8281 from marmbrus/binaryCompat. (cherry picked from commit 80cb25b228e821a80256546a2f03f73a45cf7645) Signed-off-by: Michael Armbrust <michael@databricks.com> 18 August 2015, 20:51:03 UTC
2bccd91 [SPARK-9574] [STREAMING] Remove unnecessary contents of spark-streaming-XXX-assembly jars Removed contents already included in Spark assembly jar from spark-streaming-XXX-assembly jars. Author: zsxwing <zsxwing@gmail.com> Closes #8069 from zsxwing/SPARK-9574. (cherry picked from commit bf1d6614dcb8f5974e62e406d9c0f8aac52556d3) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 18 August 2015, 20:36:25 UTC
9bd2e6f [SPARK-10085] [MLLIB] [DOCS] removed unnecessary numpy array import See https://issues.apache.org/jira/browse/SPARK-10085 Author: Piotr Migdal <pmigdal@gmail.com> Closes #8284 from stared/spark-10085. (cherry picked from commit 8bae9015b7e7b4528ca2bc5180771cb95d2aac13) Signed-off-by: Xiangrui Meng <meng@databricks.com> 18 August 2015, 19:59:36 UTC
ec7079f [SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide Add Python example for mllib LDAModel user guide Author: Yanbo Liang <ybliang8@gmail.com> Closes #8227 from yanboliang/spark-10032. (cherry picked from commit 747c2ba8006d5b86f3be8dfa9ace639042a35628) Signed-off-by: Xiangrui Meng <meng@databricks.com> 18 August 2015, 19:56:43 UTC
80debff [SPARK-10029] [MLLIB] [DOC] Add Python examples for mllib IsotonicRegression user guide Add Python examples for mllib IsotonicRegression user guide Author: Yanbo Liang <ybliang8@gmail.com> Closes #8225 from yanboliang/spark-10029. (cherry picked from commit f4fa61effe34dae2f0eab0bef57b2dee220cf92f) Signed-off-by: Xiangrui Meng <meng@databricks.com> 18 August 2015, 19:55:42 UTC
7ff0e5d [SPARK-9900] [MLLIB] User guide for Association Rules Updates FPM user guide to include Association Rules. Author: Feynman Liang <fliang@databricks.com> Closes #8207 from feynmanliang/SPARK-9900-arules. (cherry picked from commit f5ea3912900ccdf23e2eb419a342bfe3c0c0b61b) Signed-off-by: Xiangrui Meng <meng@databricks.com> 18 August 2015, 19:54:05 UTC
b86378c [SPARK-9028] [ML] Add CountVectorizer as an estimator to generate CountVectorizerModel jira: https://issues.apache.org/jira/browse/SPARK-9028 Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to term frequency. I changed the meaning of minCount to be a filter across the corpus. This aligns with Word2Vec and the similar parameter in scikit-learn. Author: Yuhao Yang <hhbyyh@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #7388 from hhbyyh/cvEstimator. (cherry picked from commit 354f4582b637fa25d3892ec2b12869db50ed83c9) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 18 August 2015, 18:00:22 UTC
20a760a [SPARK-10007] [SPARKR] Update `NAMESPACE` file in SparkR for simple parameters functions ### JIRA [[SPARK-10007] Update `NAMESPACE` file in SparkR for simple parameters functions - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10007) Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8277 from yu-iskw/SPARK-10007. (cherry picked from commit 1968276af0f681fe51328b7dd795bd21724a5441) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 18 August 2015, 16:11:22 UTC
a512250 [SPARK-8118] [SQL] Redirects Parquet JUL logger via SLF4J Parquet hard coded a JUL logger which always writes to stdout. This PR redirects it via SLF4j JUL bridge handler, so that we can control Parquet logs via `log4j.properties`. This solution is inspired by https://github.com/Parquet/parquet-mr/issues/390#issuecomment-46064909. Author: Cheng Lian <lian@databricks.com> Closes #8196 from liancheng/spark-8118/redirect-parquet-jul. (cherry picked from commit 5723d26d7e677b89383de3fcf2c9a821b68a65b7) Signed-off-by: Cheng Lian <lian@databricks.com> 18 August 2015, 12:16:13 UTC
42a0b48 [MINOR] fix the comments in IndexShuffleBlockResolver It might be a typo introduced at the start, or some leftover after a renaming. The name of the method accessing the index file is `getBlockData` now (not `getBlockLocation` as indicated in the comments). Author: CodingCat <zhunansjtu@gmail.com> Closes #8238 from CodingCat/minor_1. (cherry picked from commit c34e9ff0eac2032283b959fe63b47cc30f28d21c) Signed-off-by: Sean Owen <sowen@cloudera.com> 18 August 2015, 09:31:25 UTC
40b89c3 [SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public Fix the issue that `layers` and `weights` should be public variables of `MultilayerPerceptronClassificationModel`. Users currently cannot get `layers` and `weights` from a `MultilayerPerceptronClassificationModel`. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8263 from yanboliang/mlp-public. (cherry picked from commit dd0614fd618ad28cb77aecfbd49bb319b98fdba0) Signed-off-by: Xiangrui Meng <meng@databricks.com> 18 August 2015, 06:57:14 UTC
e5fbe4f [SPARK-10038] [SQL] fix bug in generated unsafe projection when there is binary in ArrayData The type for an array of arrays in Java is slightly different from arrays of other types. cc cloud-fan Author: Davies Liu <davies@databricks.com> Closes #8250 from davies/array_binary. (cherry picked from commit 5af3838d2e59ed83766f85634e26918baa53819f) Signed-off-by: Reynold Xin <rxin@databricks.com> 18 August 2015, 06:28:02 UTC
2803e8b [MINOR] Format the comment of `translate` at `functions.scala` Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8265 from yu-iskw/minor-translate-comment. (cherry picked from commit a0910315dae88b033e38a1de07f39ca21f6552ad) Signed-off-by: Reynold Xin <rxin@databricks.com> 18 August 2015, 06:27:19 UTC
3554250 [SPARK-7808] [ML] add package doc for ml.feature This PR adds a short description of `ml.feature` package with code example. The Java package doc will come in a separate PR. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8260 from mengxr/SPARK-7808. (cherry picked from commit e290029a356222bddf4da1be0525a221a5a1630b) Signed-off-by: Xiangrui Meng <meng@databricks.com> 18 August 2015, 02:40:58 UTC
bfb4c84 [SPARK-10059] [YARN] Explicitly add JSP dependencies for tests. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8251 from vanzin/SPARK-10059. (cherry picked from commit ee093c8b927e8d488aeb76115c7fb0de96af7720) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 18 August 2015, 02:35:50 UTC
9740d43 [SPARK-9902] [MLLIB] Add Java and Python examples to user guide for 1-sample KS test added doc examples for python. Author: jose.cambronero <jose.cambronero@cloudera.com> Closes #8154 from josepablocam/spark_9902. (cherry picked from commit c90c605dc6a876aef3cc204ac15cd65bab9743ad) Signed-off-by: Xiangrui Meng <meng@databricks.com> 18 August 2015, 02:09:51 UTC
5de0ffb [SPARK-7707] User guide and example code for KernelDensity Author: Sandy Ryza <sandy@cloudera.com> Closes #8230 from sryza/sandy-spark-7707. (cherry picked from commit f9d1a92aa1bac4494022d78559b871149579e6e8) Signed-off-by: Xiangrui Meng <meng@databricks.com> 18 August 2015, 00:58:06 UTC
18b3d11 [SPARK-9898] [MLLIB] Prefix Span user guide Adds user guide for `PrefixSpan`, including Scala and Java example code. mengxr zhangjiajin Author: Feynman Liang <fliang@databricks.com> Closes #8253 from feynmanliang/SPARK-9898. (cherry picked from commit 0b6b01761370629ce387c143a25d41f3a334ff28) Signed-off-by: Xiangrui Meng <meng@databricks.com> 18 August 2015, 00:53:31 UTC
f5ed9ed SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regression Added since tags to mllib.regression Author: Prayag Chandran <prayagchandran@gmail.com> Closes #7518 from prayagchandran/sinceTags and squashes the following commits: fa4dda2 [Prayag Chandran] Re-formatting 6c6d584 [Prayag Chandran] Corrected a few tags. Removed few unnecessary tags 1a0365f [Prayag Chandran] Reformating and adding a few more tags 89fdb66 [Prayag Chandran] SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regression (cherry picked from commit 18523c130548f0438dff8d1f25531fd2ed36e517) Signed-off-by: DB Tsai <dbt@netflix.com> 18 August 2015, 00:26:27 UTC
eaeebb9 [SPARK-9768] [PYSPARK] [ML] Add Python API and user guide for ml.feature.ElementwiseProduct Add Python API, user guide and example for ml.feature.ElementwiseProduct. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8061 from yanboliang/SPARK-9768. (cherry picked from commit 0076e8212334c613599dcbc2ac23f49e9e50cc44) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 18 August 2015, 00:25:50 UTC
407175e [SPARK-9974] [BUILD] [SQL] Makes sure com.twitter:parquet-hadoop-bundle:1.6.0 is in SBT assembly jar PR #7967 enables Spark SQL to persist Parquet tables in Hive compatible format when possible. One of the consequences is that we have to set input/output classes to `MapredParquetInputFormat`/`MapredParquetOutputFormat`, which rely on com.twitter:parquet-hadoop:1.6.0 bundled with Hive 1.2.1. When loading such a table in Spark SQL, `o.a.h.h.ql.metadata.Table` first loads these input/output format classes, and thus classes in com.twitter:parquet-hadoop:1.6.0. However, the scope of this dependency is defined as "runtime", and it is not packaged into the Spark assembly jar. This results in a `ClassNotFoundException`. This issue can be worked around by asking users to add parquet-hadoop 1.6.0 via the `--driver-class-path` option. However, considering the Maven build is immune to this problem, I feel it can be confusing and inconvenient for users. So this PR fixes this issue by changing the scope of parquet-hadoop 1.6.0 to "compile". Author: Cheng Lian <lian@databricks.com> Closes #8198 from liancheng/spark-9974/bundle-parquet-1.6.0. (cherry picked from commit 52ae952574f5d641a398dd185e09e5a79318c8a9) Signed-off-by: Reynold Xin <rxin@databricks.com> 18 August 2015, 00:25:21 UTC
0f1417b [SPARK-8920] [MLLIB] Add @since tags to mllib.linalg Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.Samavihome> Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.local> Closes #7729 from sabhyankar/branch_8920. (cherry picked from commit 088b11ec5949e135cb3db2a1ce136837e046c288) Signed-off-by: Xiangrui Meng <meng@databricks.com> 17 August 2015, 23:00:31 UTC
bb3bb2a [SPARK-10068] [MLLIB] Adds links to MLlib types, algos, utilities listing mengxr jkbradley Author: Feynman Liang <fliang@databricks.com> Closes #8255 from feynmanliang/SPARK-10068. (cherry picked from commit fdaf17f63f751f02623414fbc7d0a2f545364050) Signed-off-by: Xiangrui Meng <meng@databricks.com> 17 August 2015, 22:42:21 UTC
f77eaaf [SPARK-9592] [SQL] Fix Last function implemented based on AggregateExpression1. https://issues.apache.org/jira/browse/SPARK-9592 #8113 has the fundamental fix. But, if we want to minimize the number of changed lines, we can go with this one. Then, in 1.6, we merge #8113. Author: Yin Huai <yhuai@databricks.com> Closes #8172 from yhuai/lastFix and squashes the following commits: b28c42a [Yin Huai] Regression test. af87086 [Yin Huai] Fix last. (cherry picked from commit 772e7c18fb1a79c0f080408cb43307fe89a4fa04) Signed-off-by: Michael Armbrust <michael@databricks.com> 17 August 2015, 22:31:47 UTC
24765cc [SPARK-9526] [SQL] Utilize randomized tests to reveal potential bugs in sql expressions JIRA: https://issues.apache.org/jira/browse/SPARK-9526 This PR is a follow up of #7830, aiming at utilizing randomized tests to reveal more potential bugs in sql expression. Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7855 from yjshen/property_check. (cherry picked from commit b265e282b62954548740a5767e97ab1678c65194) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 17 August 2015, 21:10:40 UTC
4daf79f [SPARK-10036] [SQL] Load JDBC driver in DataFrameReader.jdbc and DataFrameWriter.jdbc This PR uses `JDBCRDD.getConnector` to load JDBC driver before creating connection in `DataFrameReader.jdbc` and `DataFrameWriter.jdbc`. Author: zsxwing <zsxwing@gmail.com> Closes #8232 from zsxwing/SPARK-10036 and squashes the following commits: adf75de [zsxwing] Add extraOptions to the connection properties 57f59d4 [zsxwing] Load JDBC driver in DataFrameReader.jdbc and DataFrameWriter.jdbc (cherry picked from commit f10660fe7b809be2059da4f9781a5743f117a35a) Signed-off-by: Michael Armbrust <michael@databricks.com> 17 August 2015, 18:53:45 UTC
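The JDBC subtlety underneath, sketched outside Spark (the driver class and URL are placeholder examples): `DriverManager` can only match a URL to a driver whose class has already been initialized, so the class is loaded explicitly first.

```scala
import java.sql.{Connection, DriverManager}
import java.util.Properties

def connect(driverClass: String, url: String, props: Properties): Connection = {
  // Initializing the class triggers the driver's self-registration
  Class.forName(driverClass, true, Thread.currentThread().getContextClassLoader)
  DriverManager.getConnection(url, props)
}

// e.g. connect("org.postgresql.Driver", "jdbc:postgresql://host/db", new Properties)
```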
76390ec [SPARK-9950] [SQL] Wrong Analysis Error for grouping/aggregating on struct fields This issue has been fixed by https://github.com/apache/spark/pull/8215, this PR added regression test for it. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8222 from cloud-fan/minor and squashes the following commits: 0bbfb1c [Wenchen Fan] fix style... 7e2d8d9 [Wenchen Fan] add test (cherry picked from commit a4acdabb103f6d04603163c9555c1ddc413c3b80) Signed-off-by: Michael Armbrust <michael@databricks.com> 17 August 2015, 18:36:34 UTC
7279445 [SPARK-7837] [SQL] Avoids double closing output writers when commitTask() fails When inserting data into a `HadoopFsRelation`, if `commitTask()` of the writer container fails, `abortTask()` will be invoked. However, both `commitTask()` and `abortTask()` try to close the output writer(s). The problem is that, closing underlying writers may not be an idempotent operation. E.g., `ParquetRecordWriter.close()` throws NPE when called twice. Author: Cheng Lian <lian@databricks.com> Closes #8236 from liancheng/spark-7837/double-closing. (cherry picked from commit 76c155dd4483d58499e5cb66e5e9373bb771dbeb) Signed-off-by: Cheng Lian <lian@databricks.com> 17 August 2015, 16:59:19 UTC
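The patch's fix is to avoid the second `close()` call; a complementary defensive pattern, sketched here with a hypothetical wrapper, is to make `close()` itself idempotent:

```scala
import java.io.Closeable

class IdempotentCloseable(underlying: Closeable) extends Closeable {
  private var closed = false

  override def close(): Unit = synchronized {
    if (!closed) {        // a second call becomes a no-op instead of an NPE
      closed = true
      underlying.close()
    }
  }
}
```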
d554bf4 [SPARK-9959] [MLLIB] Association Rules Java Compatibility mengxr Author: Feynman Liang <fliang@databricks.com> Closes #8206 from feynmanliang/SPARK-9959-arules-java. (cherry picked from commit f7efda3975d46a8ce4fd720b3730127ea482560b) Signed-off-by: Xiangrui Meng <meng@databricks.com> 17 August 2015, 16:58:44 UTC
78275c4 [SPARK-9871] [SPARKR] Add expression functions into SparkR which have a variable parameter ### Summary - Add `lit` function - Add `concat`, `greatest`, `least` functions I think we need to improve the `collect` function in order to implement the `struct` function, since `collect` doesn't work with arguments which include a nested `list` variable. It seems that a list representing a `struct` still contains `jobj` classes, so it would be better to solve this problem in another issue. ### JIRA [[SPARK-9871] Add expression functions into SparkR which have a variable parameter - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9871) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8194 from yu-iskw/SPARK-9856. (cherry picked from commit 26e760581fdf7ca913da93fa80e73b7ddabcedf6) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 17 August 2015, 06:33:28 UTC
90245f6 [SPARK-10005] [SQL] Fixes schema merging for nested structs In case of schema merging, we only handled first level fields when converting Parquet groups to `InternalRow`s. Nested struct fields are not properly handled. For example, the schema of a Parquet file to be read can be: ``` message individual { required group f1 { optional binary f11 (utf8); } } ``` while the global schema is: ``` message global { required group f1 { optional binary f11 (utf8); optional int32 f12; } } ``` This PR fixes this issue by padding missing fields when creating actual converters. Author: Cheng Lian <lian@databricks.com> Closes #8228 from liancheng/spark-10005/nested-schema-merging. (cherry picked from commit ae2370e72f93db8a28b262e8252c55fe1fc9873c) Signed-off-by: Yin Huai <yhuai@databricks.com> 16 August 2015, 17:18:08 UTC
e2c6ef8 [SPARK-9973] [SQL] Correct in-memory columnar buffer size The `initialSize` argument of `ColumnBuilder.initialize()` should be the number of rows rather than bytes. However `InMemoryColumnarTableScan` passes in a byte size, which makes Spark SQL allocate more memory than necessary when building in-memory columnar buffers. Author: Kun Xu <viper_kun@163.com> Closes #8189 from viper-kun/errorSize. (cherry picked from commit 182f9b7a6d3a3ee7ec7de6abc24e296aa794e4e8) Signed-off-by: Cheng Lian <lian@databricks.com> 16 August 2015, 11:35:04 UTC
fa55c27 [SPARK-10008] Ensure shuffle locality doesn't take precedence over narrow deps The shuffle locality patch made the DAGScheduler aware of shuffle data, but for RDDs that have both narrow and shuffle dependencies, it can cause them to place tasks based on the shuffle dependency instead of the narrow one. This case is common in iterative join-based algorithms like PageRank and ALS, where one RDD is hash-partitioned and one isn't. Author: Matei Zaharia <matei@databricks.com> Closes #8220 from mateiz/shuffle-loc-fix. (cherry picked from commit cf016075a006034c24c5b758edb279f3e151d25d) Signed-off-by: Matei Zaharia <matei@databricks.com> 16 August 2015, 07:35:09 UTC
4f75ce2 [SPARK-8844] [SPARKR] head/collect is broken in SparkR. This is a WIP patch for SPARK-8844, for collecting reviews. This bug is about reading an empty DataFrame: in `readCol()`, `lapply(1:numRows, function(x) {...})` does not take into consideration the case where numRows = 0. Will add a unit test case. Author: Sun Rui <rui.sun@intel.com> Closes #7419 from sun-rui/SPARK-8844. (cherry picked from commit 5f9ce738fe6bab3f0caffad0df1d3876178cf469) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 16 August 2015, 07:30:10 UTC
881baf1 [SPARK-9805] [MLLIB] [PYTHON] [STREAMING] Added _eventually for ml streaming pyspark tests Recently, PySpark ML streaming tests have been flaky, most likely because of the batches not being processed in time. Proposal: Replace the use of _ssc_wait (which waits for a fixed amount of time) with a method which waits for a fixed amount of time but can terminate early based on a termination condition method. With this, we can extend the waiting period (to make tests less flaky) but also stop early when possible (making tests faster on average, which I verified locally). CC: mengxr tdas freeman-lab Author: Joseph K. Bradley <joseph@databricks.com> Closes #8087 from jkbradley/streaming-ml-tests. (cherry picked from commit 1db7179fae672fcec7b8de12c374dd384ce51c67) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 16 August 2015, 01:48:29 UTC
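A minimal Scala analogue of the `_eventually` helper described above (the real helper lives in Python test code; this sketch is illustrative):

```scala
// Poll a condition until it holds or the timeout elapses, instead of
// sleeping for a fixed interval.
def eventually(timeoutMillis: Long, intervalMillis: Long = 100)(condition: () => Boolean): Boolean = {
  val deadline = System.currentTimeMillis() + timeoutMillis
  while (System.currentTimeMillis() < deadline) {
    if (condition()) return true // terminate early: faster tests on average
    Thread.sleep(intervalMillis)
  }
  condition() // one last check at the deadline
}
```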
2fda1d8 [SPARK-9955] [SQL] correct error message for aggregate We should skip unresolved `LogicalPlan`s for `PullOutNondeterministic`, as calling `output` on unresolved `LogicalPlan` will produce confusing error message. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8203 from cloud-fan/error-msg and squashes the following commits: 1c67ca7 [Wenchen Fan] move test 7593080 [Wenchen Fan] correct error message for aggregate (cherry picked from commit 570567258b5839c1e0e28b5182f4c29b119ed4c4) Signed-off-by: Michael Armbrust <michael@databricks.com> 15 August 2015, 21:13:28 UTC
1a6f0af [SPARK-9980] [BUILD] Fix SBT publishLocal error due to invalid characters in doc Tiny modification to a few comments to make `sbt publishLocal` work again. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #8209 from hvanhovell/SPARK-9980. (cherry picked from commit a85fb6c07fdda5c74d53d6373910dcf5db3ff111) Signed-off-by: Sean Owen <sowen@cloudera.com> 15 August 2015, 09:46:16 UTC
d97af68 [SPARK-9725] [SQL] fix serialization of UTF8String across different JVMs The BYTE_ARRAY_OFFSET can differ between JVMs with different configurations (for example, different heap sizes: 24 if heap > 32G, otherwise 16), so the offset of a UTF8String is not portable; we should handle that during serialization. Author: Davies Liu <davies@databricks.com> Closes #8210 from davies/serialize_utf8string. (cherry picked from commit 7c1e56825b716a7d703dff38254b4739755ac0c4) Signed-off-by: Davies Liu <davies.liu@gmail.com> 15 August 2015, 05:31:34 UTC
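Where the non-portable constant comes from, sketched with `sun.misc.Unsafe` (illustrative; Spark obtains it through its own unsafe utilities): the byte-array base offset is JVM-specific, so it must be re-derived on the deserializing JVM rather than baked into the serialized bytes.

```scala
import sun.misc.Unsafe

val unsafeField = classOf[Unsafe].getDeclaredField("theUnsafe")
unsafeField.setAccessible(true)
val unsafe = unsafeField.get(null).asInstanceOf[Unsafe]

// e.g. 16 with compressed oops, 24 without -- differs across JVM configs
val byteArrayOffset = unsafe.arrayBaseOffset(classOf[Array[Byte]])
println(s"BYTE_ARRAY_OFFSET on this JVM: $byteArrayOffset")
```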
3301500 [SPARK-9960] [GRAPHX] sendMessage type fix in LabelPropagation.scala Author: zc he <farseer90718@gmail.com> Closes #8188 from farseer90718/farseer-patch-1. (cherry picked from commit 71a3af8a94f900a26ac7094f22ec1216cab62e15) Signed-off-by: Reynold Xin <rxin@databricks.com> 15 August 2015, 04:28:57 UTC
83cbf60 [SPARK-9634] [SPARK-9323] [SQL] cleanup unnecessary Aliases in LogicalPlan at the end of analysis Also alias the ExtractValue instead of wrapping it with UnresolvedAlias when resolve attribute in LogicalPlan, as this alias will be trimmed if it's unnecessary. Based on #7957 without the changes to mllib, but instead maintaining earlier behavior when using `withColumn` on expressions that already have metadata. Author: Wenchen Fan <cloud0fan@outlook.com> Author: Michael Armbrust <michael@databricks.com> Closes #8215 from marmbrus/pr/7957. (cherry picked from commit ec29f2034a3306cc0afdc4c160b42c2eefa0897c) Signed-off-by: Reynold Xin <rxin@databricks.com> 15 August 2015, 04:00:05 UTC