https://github.com/apache/spark

2f9b2bd [maven-release-plugin] prepare release v1.1.0-rc4 03 September 2014, 05:27:53 UTC
a52aabd Revert "[maven-release-plugin] prepare release v1.1.0-rc3" This reverts commit b2d0493b223c5f98a593bb6d7372706cc02bebad. 03 September 2014, 04:40:07 UTC
e9bff45 Revert "[maven-release-plugin] prepare for next development iteration" This reverts commit 865e6f63f63f5e881a02d1a4e3b4c5d0e86fcd8e. 03 September 2014, 04:40:01 UTC
bc4a205 SPARK-3358: [EC2] Switch back to HVM instances for m3.X. During regression tests of Spark 1.1 we discovered perf issues with PVM instances when running PySpark. This reverts a change added in #1156 which changed the default type for m3 instances to PVM. Author: Patrick Wendell <pwendell@gmail.com> Closes #2244 from pwendell/ec2-hvm and squashes the following commits: 1342d7e [Patrick Wendell] SPARK-3358: [EC2] Switch back to HVM instances for m3.X. 03 September 2014, 04:31:09 UTC
ffdb2fc [SPARK-2823][GraphX] fix GraphX EdgeRDD zipPartitions If users set “spark.default.parallelism” and the value is different from the EdgeRDD partition number, GraphX jobs will throw: java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions Author: luluorta <luluorta@gmail.com> Closes #1763 from luluorta/fix-graph-zip and squashes the following commits: 8338961 [luluorta] fix GraphX EdgeRDD zipPartitions (cherry picked from commit 9b225ac3072de522b40b46aba6df1f1c231f13ef) Signed-off-by: Ankur Dave <ankurdave@gmail.com> 03 September 2014, 02:28:57 UTC
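The failure above comes from Spark's general requirement that zipped RDDs have identical partition counts. A minimal sketch that reproduces the same exception with plain RDDs rather than the GraphX internals (sizes and partition counts are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ZipPartitionsMismatch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("zip-mismatch"))
    val a = sc.parallelize(1 to 100, numSlices = 4) // e.g. the EdgeRDD's own partitioning
    val b = sc.parallelize(1 to 100, numSlices = 8) // e.g. a count derived from spark.default.parallelism
    // Evaluating this throws:
    //   java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
    a.zipPartitions(b) { (ia, ib) => ia.zip(ib) }.count()
    sc.stop()
  }
}
```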
0c8183c [SPARK-1981][Streaming][Hotfix] Fixed docs related to kinesis - Include kinesis in the unidocs - Hide non-public classes from docs Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #2239 from tdas/kinesis-doc-fix and squashes the following commits: 156e20c [Tathagata Das] More fixes, based on PR comments. e9a6c01 [Tathagata Das] Fixed docs related to kinesis (cherry picked from commit e9bb12bea9fbef94332fbec88e3cd9197a27b7ad) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 03 September 2014, 02:03:04 UTC
9b0cff2 [SPARK-2981][GraphX] EdgePartition1D Int overflow minor fix. Details are here: https://issues.apache.org/jira/browse/SPARK-2981 Author: Larry Xiao <xiaodi@sjtu.edu.cn> Closes #1902 from larryxiao/2981 and squashes the following commits: 88059a2 [Larry Xiao] [SPARK-2981][GraphX] EdgePartition1D Int overflow (cherry picked from commit aa7de128c5987fd2e134736f07ae913ad1f5eb26) Signed-off-by: Ankur Dave <ankurdave@gmail.com> 03 September 2014, 01:51:03 UTC
7267e40 SPARK-3328 fixed make-distribution script --with-tachyon option. Directory path for dependencies jar and resources in Tachyon 0.5.0 has been changed. Author: Prudhvi Krishna <prudhvi953@gmail.com> Closes #2228 from prudhvije/SPARK-3328/make-dist-fix and squashes the following commits: d1d2c22 [Prudhvi Krishna] SPARK-3328 fixed make-distribution script --with-tachyon option. (cherry picked from commit 644e31524a6a9a22c671a368aeb3b4eaeb61cf29) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 03 September 2014, 00:37:02 UTC
dff42a7 [Build] merge changes to run-tests-jenkins from master branch Author: Nicholas Chammas <nicholas.chammas@gmail.com> Author: nchammas <nicholas.chammas@gmail.com> Closes #2237 from nchammas/branch-1.1 and squashes the following commits: 39bdd5e [Nicholas Chammas] merge updates from master f5aa841 [nchammas] Merge pull request #3 from apache/branch-1.1 02 September 2014, 20:08:52 UTC
ccf3520 [SPARK-3332] Revert spark-ec2 patch that identifies clusters using tags This reverts #1899 and #2163, two patches that modified `spark-ec2` so that clusters are identified using tags instead of security groups. The original motivation for this patch was to allow multiple clusters to run in the same security group. Unfortunately, tagging is not atomic with launching instances on EC2, so with this approach we have the possibility of `spark-ec2` launching instances and crashing before they can be tagged, effectively orphaning those instances. The orphaned instances won't belong to any cluster, so the `spark-ec2` script will be unable to clean them up. Since this feature may still be worth supporting, there are several alternative approaches that we might consider, including detecting orphaned instances and logging warnings, or maybe using another mechanism to group instances into clusters. For the 1.1.0 release, though, I propose that we just revert this patch. Author: Josh Rosen <joshrosen@apache.org> Closes #2225 from JoshRosen/revert-ec2-cluster-naming and squashes the following commits: 0c18e86 [Josh Rosen] Revert "SPARK-2333 - spark_ec2 script should allow option for existing security group" c2ca2d4 [Josh Rosen] Revert "Spark-3213 Fixes issue with spark-ec2 not detecting slaves created with "Launch More like this"" 02 September 2014, 17:47:56 UTC
e6972ea [MLlib] Squash bug in IndexedRowMatrix Kill this bug fast before it does damage. Author: Reza Zadeh <rizlar@gmail.com> Closes #2224 from rezazadeh/indexrmbug and squashes the following commits: 53386d6 [Reza Zadeh] Squash bug in IndexedRowMatrix (cherry picked from commit 0f16b23cd17002fac05f3ecc58899be1b1121b82) Signed-off-by: Xiangrui Meng <meng@databricks.com> 02 September 2014, 16:48:17 UTC
e136312 [SPARK-3342] Add SSDs to block device mapping On `m3.2xlarge` instances the 2x80GB SSDs are inaccessible if not added to the block device mapping when the instance is created. They work when added with this patch. I have not tested this with other instance types, and I do not know much about this script and EC2 deployment in general. Maybe this code needs to depend on the instance type. The requirement for this mapping is described in the AWS docs at: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html#InstanceStore_UsageScenarios "For M3 instances, you must specify instance store volumes in the block device mapping for the instance. When you launch an M3 instance, we ignore any instance store volumes specified in the block device mapping for the AMI." Author: Daniel Darabos <darabos.daniel@gmail.com> Closes #2081 from darabos/patch-1 and squashes the following commits: 1ceb2c8 [Daniel Darabos] Use %d string interpolation instead of {}. a1854d7 [Daniel Darabos] Only specify ephemeral device mapping for M3. e0d9e37 [Daniel Darabos] Create ephemeral device mapping based on get_num_disks(). 6b116a6 [Daniel Darabos] Add SSDs to block device mapping 02 September 2014, 05:18:21 UTC
865e6f6 [maven-release-plugin] prepare for next development iteration 30 August 2014, 17:48:10 UTC
b2d0493 [maven-release-plugin] prepare release v1.1.0-rc3 30 August 2014, 17:48:02 UTC
d9a1c96 Revert "[maven-release-plugin] prepare release v1.1.0-rc3" This reverts commit 2b2e02265f80e4c5172c1e498aa9ba2c6b91c6c9. 30 August 2014, 17:14:33 UTC
829025e Revert "[maven-release-plugin] prepare for next development iteration" This reverts commit 8b5f0dbd8d32a25a4e7ba3ebe1a4c3c6310aeb85. 30 August 2014, 17:14:28 UTC
d4ce264 BUILD: Adding back CDH4 as per user requests 30 August 2014, 05:26:04 UTC
8b5f0db [maven-release-plugin] prepare for next development iteration 30 August 2014, 02:26:11 UTC
2b2e022 [maven-release-plugin] prepare release v1.1.0-rc3 30 August 2014, 02:26:03 UTC
272b4a6 Adding new CHANGES.txt 30 August 2014, 01:49:51 UTC
aa9364a [SPARK-3320][SQL] Made batched in-memory column buffer building work for SchemaRDDs with empty partitions Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2213 from liancheng/spark-3320 and squashes the following commits: 45a0139 [Cheng Lian] Fixed typo in InMemoryColumnarQuerySuite f67067d [Cheng Lian] Fixed SPARK-3320 (cherry picked from commit 32b18dd52cf8920903819f23e406271ecd8ac6bb) Signed-off-by: Michael Armbrust <michael@databricks.com> 30 August 2014, 01:16:58 UTC
b0facb5 [SPARK-3296][mllib] spark-example should be run-example in head notation of DenseKMeans and SparseNaiveBayes `./bin/spark-example` should be `./bin/run-example` in DenseKMeans and SparseNaiveBayes Author: wangfei <wangfei_hello@126.com> Closes #2193 from scwf/run-example and squashes the following commits: 207eb3a [wangfei] spark-example should be run-example 27a8999 [wangfei] ./bin/spark-example should be ./bin/run-example (cherry picked from commit 13901764f4e9ed3de03e420d88ab42bdce5d5140) Signed-off-by: Xiangrui Meng <meng@databricks.com> 30 August 2014, 00:37:36 UTC
c4b7ec8 Revert "[maven-release-plugin] prepare release v1.1.0-rc2" This reverts commit 711aebb329ca28046396af1e34395a0df92b5327. 29 August 2014, 22:55:30 UTC
926f171 Revert "[maven-release-plugin] prepare for next development iteration" This reverts commit a4a7a241441489a0d31365e18476ae2e1c34464d. 29 August 2014, 22:55:26 UTC
c1333b8 [SPARK-3291][SQL] TestcaseName in createQueryTest should not contain ":" ":" is not allowed in file names on Windows. If a file name contains ":", the file can't be checked out on a Windows system, and developers using Windows must be careful not to commit the deletion of such files, which is very inconvenient. Author: qiping.lqp <qiping.lqp@alibaba-inc.com> Closes #2191 from chouqin/querytest and squashes the following commits: 0e943a1 [qiping.lqp] rename golden file 60a863f [qiping.lqp] TestcaseName in createQueryTest should not contain ":" (cherry picked from commit 634d04b87c2744d645e9c26e746ba2006371d9b5) Signed-off-by: Michael Armbrust <michael@databricks.com> 29 August 2014, 22:38:00 UTC
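A hedged sketch of the underlying idea — mapping a test case name to something Windows can store as a file name. The helper below is illustrative only, not the code actually used in createQueryTest:

```scala
// Illustrative only: replace characters Windows forbids in file names (":" among them).
def goldenFileNameFor(testCaseName: String): String =
  testCaseName.replaceAll("""[:\\/<>*?"|]""", "_")

// goldenFileNameFor("timestamp cast #1: select cast(1 as int)")
//   -> "timestamp cast #1_ select cast(1 as int)"
```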
9bae345 [SPARK-3269][SQL] Decreases initial buffer size for row set to prevent OOM When a large batch size is specified, `SparkSQLOperationManager` OOMs even if the whole result set is much smaller than the batch size. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2171 from liancheng/jdbc-fetch-size and squashes the following commits: 5e1623b [Cheng Lian] Decreases initial buffer size for row set to prevent OOM (cherry picked from commit d94a44d7caaf3fe7559d9ad7b10872fa16cf81ca) Signed-off-by: Michael Armbrust <michael@databricks.com> 29 August 2014, 22:36:19 UTC
cf049ef [SPARK-3234][Build] Fixed environment variables that rely on deprecated command line options in make-distribution.sh Please refer to [SPARK-3234](https://issues.apache.org/jira/browse/SPARK-3234) for details. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2208 from liancheng/spark-3234 and squashes the following commits: fb26de8 [Cheng Lian] Fixed SPARK-3234 (cherry picked from commit 287c0ac7722dd4bc51b921ccc6f0e3c1625b5ff4) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 29 August 2014, 22:31:00 UTC
bfa2dc9 [Docs] SQL doc formatting and typo fixes As [reported on the dev list](http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-0-RC2-tp8107p8131.html): * Code fencing with triple-backticks doesn’t seem to work like it does on GitHub. Newlines are lost. Instead, use 4-space indent to format small code blocks. * Nested bullets need 2 leading spaces, not 1. * Spellcheck! Author: Nicholas Chammas <nicholas.chammas@gmail.com> Author: nchammas <nicholas.chammas@gmail.com> Closes #2201 from nchammas/sql-doc-fixes and squashes the following commits: 873f889 [Nicholas Chammas] [Docs] fix skip-api flag 5195e0c [Nicholas Chammas] [Docs] SQL doc formatting and typo fixes 3b26c8d [nchammas] [Spark QA] Link to console output on test time out (cherry picked from commit 53aa8316e88980c6f46d3b9fc90d935a4738a370) Signed-off-by: Michael Armbrust <michael@databricks.com> 29 August 2014, 22:23:41 UTC
98d0716 [SPARK-3307] [PySpark] Fix doc string of SparkContext.broadcast() remove invalid docs Author: Davies Liu <davies.liu@gmail.com> Closes #2202 from davies/keep and squashes the following commits: aa3b44f [Davies Liu] remove invalid docs (cherry picked from commit e248328b39f52073422a12fd0388208de41be1c7) Signed-off-by: Josh Rosen <joshrosen@apache.org> 29 August 2014, 18:48:00 UTC
c71b5c6 HOTFIX: Bump spark-ec2 version to 1.1.0 29 August 2014, 18:20:45 UTC
a4a7a24 [maven-release-plugin] prepare for next development iteration 29 August 2014, 00:54:09 UTC
711aebb [maven-release-plugin] prepare release v1.1.0-rc2 29 August 2014, 00:54:02 UTC
fb2b40a Revert "[maven-release-plugin] prepare release v1.1.0-rc1" This reverts commit f07183249b74dd857069028bf7d570b35f265585. 29 August 2014, 00:18:28 UTC
587dff2 Revert "[maven-release-plugin] prepare for next development iteration" This reverts commit f8f7a0c9dce764ece8acdc41d35bbf448dba7e92. 29 August 2014, 00:18:20 UTC
7db87b3 Adding new CHANGES.txt 29 August 2014, 00:17:30 UTC
fe4df34 [SPARK-3277] Fix external spilling with LZ4 assertion error **Summary of the changes** The bulk of this PR is comprised of tests and documentation; the actual fix is really just adding 1 line of code (see `BlockObjectWriter.scala`). We currently do not run the `External*` test suites with different compression codecs, and this would have caught the bug reported in [SPARK-3277](https://issues.apache.org/jira/browse/SPARK-3277). This PR extends the existing code to test spilling using all compression codecs known to Spark, including `LZ4`. **The bug itself** In `DiskBlockObjectWriter`, we only report the shuffle bytes written before we close the streams. With `LZ4`, all the bytes written reported by our metrics were 0 because `flush()` was not taking effect for some reason. In general, compression codecs may write additional bytes to the file after we call `close()`, and so we must also capture those bytes in our shuffle write metrics. Thanks mridulm and pwendell for help with debugging. Author: Andrew Or <andrewor14@gmail.com> Author: Patrick Wendell <pwendell@gmail.com> Closes #2187 from andrewor14/fix-lz4-spilling and squashes the following commits: 1b54bdc [Andrew Or] Speed up tests by not compressing everything 1c4624e [Andrew Or] Merge branch 'master' of github.com:apache/spark into fix-lz4-spilling 6b2e7d1 [Andrew Or] Fix compilation error 92e251b [Patrick Wendell] Better documentation for BlockObjectWriter. a1ad536 [Andrew Or] Fix tests 089593f [Andrew Or] Actually fix SPARK-3277 (tests still fail) 4bbcf68 [Andrew Or] Update tests to actually test all compression codecs b264a84 [Andrew Or] ExternalAppendOnlyMapSuite code style fixes (minor) 1bfa743 [Andrew Or] Add more information to assert for better debugging 29 August 2014, 00:05:53 UTC
f4cbf5e SPARK-3082. yarn.Client.logClusterResourceDetails throws NPE if requested queue doesn't exist Author: Sandy Ryza <sandy@cloudera.com> Closes #1984 from sryza/sandy-spark-3082 and squashes the following commits: fe08c37 [Sandy Ryza] Remove log message entirely 85253ad [Sandy Ryza] SPARK-3082. yarn.Client.logClusterResourceDetails throws NPE if requested queue doesn't exist (cherry picked from commit 92af2314f27e80227174499f2fca505bd551cda7) Signed-off-by: Andrew Or <andrewor14@gmail.com> 28 August 2014, 23:19:01 UTC
0b9718a [SPARK-3190] Avoid overflow in VertexRDD.count() VertexRDDs with more than 4 billion elements are counted incorrectly due to integer overflow when summing partition sizes. This PR fixes the issue by converting partition sizes to Longs before summing them. The following code previously returned -10000000. After applying this PR, it returns the correct answer of 5000000000 (5 billion). ```scala val pairs = sc.parallelize(0L until 500L).map(_ * 10000000) .flatMap(start => start until (start + 10000000)).map(x => (x, x)) VertexRDD(pairs).count() ``` Author: Ankur Dave <ankurdave@gmail.com> Closes #2106 from ankurdave/SPARK-3190 and squashes the following commits: 641f468 [Ankur Dave] Avoid overflow in VertexRDD.count() (cherry picked from commit 96df92906978c5f58e0cc8ff5eebe5b35a08be3b) Signed-off-by: Josh Rosen <joshrosen@apache.org> 28 August 2014, 22:17:32 UTC
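The snippet in the entry above reproduces the bug; the fix itself amounts to summing partition sizes as Longs. A small sketch of that difference (made-up sizes, not the actual VertexRDD code):

```scala
// 500 hypothetical partitions of 10 million vertices each, 5 billion in total.
val partitionSizes: Seq[Int] = Seq.fill(500)(10000000)

val asInt: Int   = partitionSizes.sum                // silently wraps past Int.MaxValue -- wrong
val asLong: Long = partitionSizes.map(_.toLong).sum  // 5000000000L, the correct total
```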
069ecfe [SPARK-3264] Allow users to set executor Spark home in Mesos The executors and the driver may not share the same Spark home. There is currently one way to set the executor side Spark home in Mesos, through setting `spark.home`. However, this is neither documented nor intuitive. This PR adds a more specific config `spark.mesos.executor.home` and exposes this to the user. liancheng tnachen Author: Andrew Or <andrewor14@gmail.com> Closes #2166 from andrewor14/mesos-spark-home and squashes the following commits: b87965e [Andrew Or] Merge branch 'master' of github.com:apache/spark into mesos-spark-home f6abb2e [Andrew Or] Document spark.mesos.executor.home ca7846d [Andrew Or] Add more specific configuration for executor Spark home in Mesos (cherry picked from commit 41dc5987d9abeca6fc0f5935c780d48f517cdf95) Signed-off-by: Andrew Or <andrewor14@gmail.com> 28 August 2014, 18:06:00 UTC
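A short usage sketch of the new setting; the master URL and path below are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("mesos://mesos-master:5050")            // placeholder Mesos master
  .setAppName("example")
  .set("spark.mesos.executor.home", "/opt/spark")    // Spark home on the Mesos slaves
val sc = new SparkContext(conf)
```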
fd98020 [SPARK-3150] Fix NullPointerException in Spark recovery: Add initializing default values in DriverInfo.init() The issue happens when Spark is run standalone on a cluster. When the master and driver fail simultaneously on one node in a cluster, the master tries to recover its state and restart the Spark driver. While restarting the driver, it fails with an NPE (see the stack trace in SPARK-3150). After failing, it restarts and tries to recover its state and restart the Spark driver again, over and over in an infinite cycle. Namely, Spark tries to read DriverInfo state from ZooKeeper, but after reading, DriverInfo.worker happens to be null. https://issues.apache.org/jira/browse/SPARK-3150 Author: Tatiana Borisova <tanyatik@yandex.ru> Closes #2062 from tanyatik/spark-3150 and squashes the following commits: 9936043 [Tatiana Borisova] Add initializing default values in DriverInfo.init() (cherry picked from commit 70d814665baa8b8ca868d3126452105ecfa5cbff) Signed-off-by: Josh Rosen <joshrosen@apache.org> 28 August 2014, 17:37:20 UTC
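A hedged sketch of the pattern the fix describes — restoring default values for fields that Java serialization leaves null when state is read back from ZooKeeper. Class and field names here are illustrative, not the actual DriverInfo code:

```scala
import java.io.{IOException, ObjectInputStream}

// Illustrative stand-in for persisted driver state; names are hypothetical.
class DriverState(val id: String) extends Serializable {
  @transient var worker: Option[String] = None
  @transient var state: String = _

  private def init(): Unit = {
    worker = None
    state = "SUBMITTED"
  }
  init()

  // Deserialization bypasses the constructor, so re-apply the defaults here.
  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    init()
  }
}
```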
f8f7a0c [maven-release-plugin] prepare for next development iteration 28 August 2014, 09:29:30 UTC
f071832 [maven-release-plugin] prepare release v1.1.0-rc1 28 August 2014, 09:29:24 UTC
1d03330 Revert "[maven-release-plugin] prepare release v1.1.0-rc1" This reverts commit 58b0be6a29eab817d350729710345e9f39e4c506. 28 August 2014, 08:55:48 UTC
c0bacc1 Revert "[maven-release-plugin] prepare for next development iteration" This reverts commit 78e3c036eee7113b2ed144eec5061e070b479e56. 28 August 2014, 08:55:46 UTC
c818b2b Revert "[maven-release-plugin] prepare release v1.1.0-rc1" This reverts commit 79e86ef3e1a3ee03a7e3b166a5c7dee11c6d60d7. 28 August 2014, 08:55:44 UTC
d01b3fa Revert "[maven-release-plugin] prepare for next development iteration" This reverts commit a118ea5c59d653f5a3feda21455ba60bc722b3b1. 28 August 2014, 08:55:41 UTC
df61944 Revert "Revert "[maven-release-plugin] prepare for next development iteration"" This reverts commit 71ec0140f7e121bdba3d19e8219e91a5e9d1e320. 28 August 2014, 08:55:36 UTC
4186c45 Revert "Revert "[maven-release-plugin] prepare release v1.1.0-rc1"" This reverts commit 56070f12f455bae645cba887a74c72b12f1085f8. 28 August 2014, 08:55:33 UTC
ecdbeef Revert "[maven-release-plugin] prepare release v1.1.0-rc1" This reverts commit da4b94c86c9dd0d624b3040aa4b9449be9f60fc3. 28 August 2014, 08:55:31 UTC
473b02d Revert "[maven-release-plugin] prepare for next development iteration" This reverts commit 96926c5a42c5970ed74c50db5bd9c68cacf92207. 28 August 2014, 08:55:24 UTC
96926c5 [maven-release-plugin] prepare for next development iteration 28 August 2014, 07:50:43 UTC
da4b94c [maven-release-plugin] prepare release v1.1.0-rc1 28 August 2014, 07:50:32 UTC
a9df703 Additional CHANGES.txt 28 August 2014, 07:19:03 UTC
56070f1 Revert "[maven-release-plugin] prepare release v1.1.0-rc1" This reverts commit 79e86ef3e1a3ee03a7e3b166a5c7dee11c6d60d7. 28 August 2014, 07:16:09 UTC
71ec014 Revert "[maven-release-plugin] prepare for next development iteration" This reverts commit a118ea5c59d653f5a3feda21455ba60bc722b3b1. 28 August 2014, 07:16:09 UTC
2e8ad99 [SPARK-3230][SQL] Fix udfs that return structs We need to convert the case classes into Rows. Author: Michael Armbrust <michael@databricks.com> Closes #2133 from marmbrus/structUdfs and squashes the following commits: 189722f [Michael Armbrust] Merge remote-tracking branch 'origin/master' into structUdfs 8e29b1c [Michael Armbrust] Use existing function d8d0b76 [Michael Armbrust] Fix udfs that return structs (cherry picked from commit 76e3ba4264c4a0bc2c33ae6ac862fc40bc302d83) Signed-off-by: Michael Armbrust <michael@databricks.com> 28 August 2014, 07:15:40 UTC
c0e3bc1 [SQL] Fixed 2 comment typos in SQLConf Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2172 from liancheng/sqlconf-typo and squashes the following commits: 115cc71 [Cheng Lian] Fixed 2 comment typos in SQLConf (cherry picked from commit 68f75dcdfe7e8ab229b73824692c4b3d4c39946c) Signed-off-by: Michael Armbrust <michael@databricks.com> 28 August 2014, 07:08:39 UTC
a118ea5 [maven-release-plugin] prepare for next development iteration 28 August 2014, 06:46:02 UTC
79e86ef [maven-release-plugin] prepare release v1.1.0-rc1 28 August 2014, 06:45:54 UTC
ad0fab2 HOTFIX: Don't build with YARN support for Mapr3 28 August 2014, 06:08:44 UTC
233c283 [HOTFIX][SQL] Remove cleaning of UDFs It is not safe to run the closure cleaner on slaves. #2153 introduced this which broke all UDF execution on slaves. Will re-add cleaning of UDF closures in a follow-up PR. Author: Michael Armbrust <michael@databricks.com> Closes #2174 from marmbrus/fixUdfs and squashes the following commits: 55406de [Michael Armbrust] [HOTFIX] Remove cleaning of UDFs (cherry picked from commit 024178c57419f915d26414e1b91ea0019c3650db) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 28 August 2014, 06:06:14 UTC
54ccd93 [HOTFIX] Wait for EOF only for the PySpark shell In `SparkSubmitDriverBootstrapper`, we wait for the parent process to send us an `EOF` before finishing the application. This is applicable for the PySpark shell because we terminate the application the same way. However if we run a python application, for instance, the JVM actually never exits unless it receives a manual EOF from the user. This is causing a few tests to timeout. We only need to do this for the PySpark shell because Spark submit runs as a python subprocess only in this case. Thus, the normal Spark shell doesn't need to go through this case even though it is also a REPL. Thanks davies for reporting this. Author: Andrew Or <andrewor14@gmail.com> Closes #2170 from andrewor14/bootstrap-hotfix and squashes the following commits: 42963f5 [Andrew Or] Do not wait for EOF unless this is the pyspark shell (cherry picked from commit dafe343499bbc688e266106e4bb897f9e619834e) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 28 August 2014, 06:04:28 UTC
78e3c03 [maven-release-plugin] prepare for next development iteration 27 August 2014, 23:28:27 UTC
58b0be6 [maven-release-plugin] prepare release v1.1.0-rc1 27 August 2014, 23:28:08 UTC
8597e9c BUILD: Updating CHANGES.txt for Spark 1.1 27 August 2014, 22:56:08 UTC
d4cf7a0 Add line continuation for script to work w/ py2.7.5 Error was - $ SPARK_HOME=$PWD/dist ./dev/create-release/generate-changelist.py File "./dev/create-release/generate-changelist.py", line 128 if day < SPARK_REPO_CHANGE_DATE1 or ^ SyntaxError: invalid syntax Author: Matthew Farrellee <matt@redhat.com> Closes #2139 from mattf/master-fix-generate-changelist.py-0 and squashes the following commits: 6b3a900 [Matthew Farrellee] Add line continuation for script to work w/ py2.7.5 (cherry picked from commit 64d8ecbbe94c47236ff2d8c94d7401636ba6fca4) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 27 August 2014, 22:50:37 UTC
0b17c7d Revert "[maven-release-plugin] prepare release v1.1.0-snapshot2" This reverts commit e1535ad3c6f7400f2b7915ea91da9c60510557ba. 27 August 2014, 22:48:13 UTC
0c03fb6 Revert "[maven-release-plugin] prepare for next development iteration" This reverts commit 9af3fb7385d1f9f221962f1d2d725ff79bd82033. 27 August 2014, 22:48:00 UTC
9a62cf3 [SPARK-3235][SQL] Ensure in-memory tables don't always broadcast. Author: Michael Armbrust <michael@databricks.com> Closes #2147 from marmbrus/inMemDefaultSize and squashes the following commits: 5390360 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into inMemDefaultSize 14204d3 [Michael Armbrust] Set the context before creating SparkLogicalPlans. 8da4414 [Michael Armbrust] Make sure we throw errors when leaf nodes fail to provide statistcs 18ce029 [Michael Armbrust] Ensure in-memory tables don't always broadcast. (cherry picked from commit 7d2a7a91f263bb9fbf24dc4dbffde8fe5e2c7442) Signed-off-by: Michael Armbrust <michael@databricks.com> 27 August 2014, 22:14:29 UTC
5ea260e [SPARK-3065][SQL] Add locale setting to fix mismatched results for udf_unix_timestamp with format "yyyy MMM dd h:mm:ss a" when run in a non-"America/Los_Angeles" time zone in HiveCompatibilitySuite Running the udf_unix_timestamp test case of org.apache.spark.sql.hive.execution.HiveCompatibilitySuite in a non-"America/Los_Angeles" time zone throws an error. [https://issues.apache.org/jira/browse/SPARK-3065] Add a locale setting in the beforeAll and afterAll methods to fix the bug in the HiveCompatibilitySuite test case. Author: luogankun <luogankun@gmail.com> Closes #1968 from luogankun/SPARK-3065 and squashes the following commits: c167832 [luogankun] [SPARK-3065][SQL] Add Locale setting to HiveCompatibilitySuite 0a25e3a [luogankun] [SPARK-3065][SQL] Add Locale setting to HiveCompatibilitySuite (cherry picked from commit 65253502b913f390b26b9b631380b2c6cf1ccdf7) Signed-off-by: Michael Armbrust <michael@databricks.com> 27 August 2014, 22:08:34 UTC
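A hedged sketch of the beforeAll/afterAll locale pinning described above; the suite wiring below is assumed, not copied from HiveCompatibilitySuite:

```scala
import java.util.Locale
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class LocalePinnedSuite extends FunSuite with BeforeAndAfterAll {
  private val originalLocale = Locale.getDefault

  // Pin the JVM default locale so patterns like "yyyy MMM dd h:mm:ss a" parse consistently.
  override def beforeAll(): Unit = Locale.setDefault(Locale.US)

  // Restore whatever the environment had, so other suites are unaffected.
  override def afterAll(): Unit = Locale.setDefault(originalLocale)

  test("month names parse under the pinned locale") {
    val fmt = new java.text.SimpleDateFormat("yyyy MMM dd h:mm:ss a")
    assert(fmt.parse("2014 Aug 27 3:05:00 PM") != null)
  }
}
```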
7711687 [SQL] [SPARK-3236] Reading Parquet tables from Metastore mangles location Currently we do `relation.hiveQlTable.getDataLocation.getPath`, which returns the path-part of the URI (e.g., "s3n://my-bucket/my-path" => "/my-path"). We should do `relation.hiveQlTable.getDataLocation.toString` instead, as a URI's toString returns a faithful representation of the full URI, which can later be passed into a Hadoop Path. Author: Aaron Davidson <aaron@databricks.com> Closes #2150 from aarondav/parquet-location and squashes the following commits: 459f72c [Aaron Davidson] [SQL] [SPARK-3236] Reading Parquet tables from Metastore mangles location (cherry picked from commit cc275f4b7910f6d0ad266a43bac2fdae58e9739e) Signed-off-by: Michael Armbrust <michael@databricks.com> 27 August 2014, 22:06:04 UTC
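The distinction this fix relies on is visible with a plain java.net.URI — a sketch of the behaviour, not the Hive metastore call itself:

```scala
import java.net.URI

val location = new URI("s3n://my-bucket/my-path")
location.getPath   // "/my-path"                -- scheme and bucket dropped, location mangled
location.toString  // "s3n://my-bucket/my-path" -- full URI, safe to hand to a Hadoop Path
```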
b3d763b [SPARK-3252][SQL] Add missing condition for test According to the text message, both relations should be tested. So add the missing condition. Author: viirya <viirya@gmail.com> Closes #2159 from viirya/fix_test and squashes the following commits: b1c0f52 [viirya] add missing condition. (cherry picked from commit 28d41d627919fcb196d9d31bad65d664770bee67) Signed-off-by: Michael Armbrust <michael@databricks.com> 27 August 2014, 22:04:35 UTC
c1ffa3e [SPARK-3243] Don't use stale spark-driver.* system properties If we set both `spark.driver.extraClassPath` and `--driver-class-path`, then the latter correctly overrides the former. However, the value of the system property `spark.driver.extraClassPath` still uses the former, which is actually not added to the class path. This may cause some confusion... Of course, this also affects other options (i.e. java options, library path, memory...). Author: Andrew Or <andrewor14@gmail.com> Closes #2154 from andrewor14/driver-submit-configs-fix and squashes the following commits: 17ec6fc [Andrew Or] Fix tests 0140836 [Andrew Or] Don't forget spark.driver.memory e39d20f [Andrew Or] Also set spark.driver.extra* configs in client mode (cherry picked from commit 63a053ab140d7bf605e8c5b7fb5a7bd52aca29b2) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 27 August 2014, 21:47:05 UTC
3cb4e17 Spark-3213 Fixes issue with spark-ec2 not detecting slaves created with "Launch More like this": copy the spark_cluster_tag from spot instance requests over to the instances. Author: Vida Ha <vida@databricks.com> Closes #2163 from vidaha/vida/spark-3213 and squashes the following commits: 5070a70 [Vida Ha] Spark-3214 Fix issue with spark-ec2 not detecting slaves created with 'Launch More Like This' and using Spot Requests (cherry picked from commit 7faf755ae4f0cf510048e432340260a6e609066d) Signed-off-by: Josh Rosen <joshrosen@apache.org> 27 August 2014, 21:26:16 UTC
90f8f3e [SPARK-3138][SQL] sqlContext.parquetFile should be able to take a single file as parameter ```if (!fs.getFileStatus(path).isDir) throw Exception``` makes no sense after commit #1370. Be careful if someone is working on SPARK-2551; make sure the new change passes the test case ```test("Read a parquet file instead of a directory")``` Author: chutium <teng.qiu@gmail.com> Closes #2044 from chutium/parquet-singlefile and squashes the following commits: 4ae477f [chutium] [SPARK-3138][SQL] sqlContext.parquetFile should be able to take a single file as parameter (cherry picked from commit 48f42781dedecd38ddcb2dcf67dead92bb4318f5) Signed-off-by: Michael Armbrust <michael@databricks.com> 27 August 2014, 20:13:12 UTC
4c7f082 [SPARK-3197] [SQL] Reduce the Expression tree object creations for aggregation function (min/max) Aggregation function min/max in catalyst will create expression tree for each single row, however, the expression tree creation is quite expensive in a multithreading env currently. Hence we got a very bad performance for the min/max. Here is the benchmark that I've done in my local. Master | Previous Result (ms) | Current Result (ms) ------------ | ------------- | ------------- local | 3645 | 3416 local[6] | 3602 | 1002 The Benchmark source code. ``` case class Record(key: Int, value: Int) object TestHive2 extends HiveContext(new SparkContext("local[6]", "TestSQLContext", new SparkConf())) object DataPrepare extends App { import TestHive2._ val rdd = sparkContext.parallelize((1 to 10000000).map(i => Record(i % 3000, i)), 12) runSqlHive("SHOW TABLES") runSqlHive("DROP TABLE if exists a") runSqlHive("DROP TABLE if exists result") rdd.registerAsTable("records") runSqlHive("""CREATE TABLE a (key INT, value INT) | ROW FORMAT SERDE | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe' | STORED AS RCFILE """.stripMargin) runSqlHive("""CREATE TABLE result (key INT, value INT) | ROW FORMAT SERDE | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe' | STORED AS RCFILE """.stripMargin) hql(s"""from records | insert into table a | select key, value """.stripMargin) } object PerformanceTest extends App { import TestHive2._ hql("SHOW TABLES") hql("set spark.sql.shuffle.partitions=12") val cmd = "select min(value), max(value) from a group by key" val results = ("Result1", benchmark(cmd)) :: ("Result2", benchmark(cmd)) :: ("Result3", benchmark(cmd)) :: Nil results.foreach { case (prompt, result) => { println(s"$prompt: took ${result._1} ms (${result._2} records)") } } def benchmark(cmd: String) = { val begin = System.currentTimeMillis() val count = hql(cmd).count val end = System.currentTimeMillis() ((end - begin), count) } } ``` Author: Cheng Hao <hao.cheng@intel.com> Closes #2113 from chenghao-intel/aggregation_expression_optimization and squashes the following commits: db40395 [Cheng Hao] remove the transient and add val for the expression property d56167d [Cheng Hao] Reduce the Expressions creation (cherry picked from commit 4238c17dc9e1f2f93cc9e6c768f92bd27bf1df66) Signed-off-by: Michael Armbrust <michael@databricks.com> 27 August 2014, 19:50:59 UTC
19cda07 [SPARK-3118][SQL]add "SHOW TBLPROPERTIES tblname;" and "SHOW COLUMNS (FROM|IN) table_name [(FROM|IN) db_name]" support JIRA issue: [SPARK-3118] https://issues.apache.org/jira/browse/SPARK-3118 eg: > SHOW TBLPROPERTIES test; SHOW TBLPROPERTIES test; numPartitions 0 numFiles 1 transient_lastDdlTime 1407923642 numRows 0 totalSize 82 rawDataSize 0 eg: > SHOW COLUMNS in test; SHOW COLUMNS in test; OK Time taken: 0.304 seconds id stid bo Author: u0jing <u9jing@gmail.com> Closes #2034 from u0jing/spark-3118 and squashes the following commits: b231d87 [u0jing] add golden answer files 35f4885 [u0jing] add 'show columns' and 'show tblproperties' support (cherry picked from commit 3b5eb7083d3e1955de288e4fd365dca6221f32fb) Signed-off-by: Michael Armbrust <michael@databricks.com> 27 August 2014, 19:47:30 UTC
0c94a5b SPARK-3259 - User data should be given to the master Author: Allan Douglas R. de Oliveira <allan@chaordicsystems.com> Closes #2162 from douglaz/user_data_master and squashes the following commits: 10d15f6 [Allan Douglas R. de Oliveira] Give user data also to the master (cherry picked from commit 5ac4093c9fa29a11e38f884eebb3f5db087de76f) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 27 August 2014, 19:43:29 UTC
935bffe [SPARK-2608][Core] Fixed command line option passing issue over Mesos via SPARK_EXECUTOR_OPTS This is another try after #2145 to fix [SPARK-2608](https://issues.apache.org/jira/browse/SPARK-2608). ### Basic Idea The basic idea is to pass `extraJavaOpts` and `extraLibraryPath` together via environment variable `SPARK_EXECUTOR_OPTS`. This variable is recognized by `spark-class` and not used anywhere else. In this way, we still launch Mesos executors with `spark-class`/`spark-executor`, but avoids the executor side Spark home issue. ### Known Issue Quoted string with spaces is not allowed in either `extraJavaOpts` or `extraLibraryPath` when using Spark over Mesos. The reason is that Mesos passes the whole command line as a single string argument to `sh -c` to start the executor, and this makes shell string escaping non-trivial to handle. This should be fixed in a later release. ### Background Classes in package `org.apache.spark.deploy` shouldn't be used as they assume Spark is deployed in standalone mode, and give wrong executor side Spark home directory. Please refer to comments in #2145 for more details. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2161 from liancheng/mesos-fix-with-env-var and squashes the following commits: ba59190 [Cheng Lian] Added fine grained Mesos executor support 1174076 [Cheng Lian] Draft fix for CoarseMesosSchedulerBackend 27 August 2014, 19:39:21 UTC
092121e [SPARK-3239] [PySpark] randomize the dirs for each process This can avoid the IO contention during spilling, when you have multiple disks. Author: Davies Liu <davies.liu@gmail.com> Closes #2152 from davies/randomize and squashes the following commits: a4863c4 [Davies Liu] randomize the dirs for each process 27 August 2014, 17:40:35 UTC
8f8e2a4 [SPARK-3170][CORE][BUG]: RDD info loss in "StorageTab" and "ExecutorTab" A completed stage only needs to remove its own partitions that are no longer cached. However, "StorageTab" may lose some RDDs which are actually cached. Not only in "StorageTab": "ExecutorTab" may also lose some RDD info which has been overwritten by the last RDD in the same task. 1. "StorageTab": when multiple stages run simultaneously, a completed stage will remove RDD info that belongs to other stages that are still running. 2. "ExecutorTab": the TaskContext may lose some "updatedBlocks" info of RDDs in a dependency chain. For example: val r1 = sc.parallelize(..).cache() val r2 = r1.map(...).cache() val n = r2.count() When r2 is counted, both r1 and r2 are finally cached. So in CacheManager.getOrCompute, the TaskContext should contain the "updatedBlocks" of both r1 and r2. Currently, "updatedBlocks" only contains the info of r2. Author: uncleGen <hustyugm@gmail.com> Closes #2131 from uncleGen/master_ui_fix and squashes the following commits: a6a8a0b [uncleGen] fix some coding style 3a1bc15 [uncleGen] fix some error in unit test 56ea488 [uncleGen] there's some line too long c82ba82 [uncleGen] Bug Fix: RDD info loss in "StorageTab" and "ExecutorTab" (cherry picked from commit d8298c46b7bf566d1cd2f7ea9b1b2b2722dcfb17) Signed-off-by: Andrew Or <andrewor14@gmail.com> 27 August 2014, 17:33:13 UTC
1d468df [SPARK-3154][STREAMING] Make FlumePollingInputDStream shutdown cleaner. Currently lot of errors get thrown from Avro IPC layer when the dstream or sink is shutdown. This PR cleans it up. Some refactoring is done in the receiver code to put all of the RPC code into a single Try and just recover from that. The sink code has also been cleaned up. Author: Hari Shreedharan <hshreedharan@apache.org> Closes #2065 from harishreedharan/clean-flume-shutdown and squashes the following commits: f93a07c [Hari Shreedharan] Formatting fixes. d7427cc [Hari Shreedharan] More fixes! a0a8852 [Hari Shreedharan] Fix race condition, hopefully! Minor other changes. 4c9ed02 [Hari Shreedharan] Remove unneeded list in Callback handler. Other misc changes. 8fee36f [Hari Shreedharan] Scala-library is required, else maven build fails. Also catch InterruptedException in TxnProcessor. 445e700 [Hari Shreedharan] Merge remote-tracking branch 'asf/master' into clean-flume-shutdown 87232e0 [Hari Shreedharan] Refactor Flume Input Stream. Clean up code, better error handling. 9001d26 [Hari Shreedharan] Change log level to debug in TransactionProcessor#shutdown method e7b8d82 [Hari Shreedharan] Incorporate review feedback 598efa7 [Hari Shreedharan] Clean up some exception handling code e1027c6 [Hari Shreedharan] Merge remote-tracking branch 'asf/master' into clean-flume-shutdown ed608c8 [Hari Shreedharan] [SPARK-3154][STREAMING] Make FlumePollingInputDStream shutdown cleaner. (cherry picked from commit 6f671d04fa98f97fd48c5e749b9f47dd4a8b4f44) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 27 August 2014, 09:39:21 UTC
7286d57 [SPARK-3227] [mllib] Added migration guide for v1.0 to v1.1 The only updates are in DecisionTree. CC: mengxr Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com> Closes #2146 from jkbradley/mllib-migration and squashes the following commits: 5a1f487 [Joseph K. Bradley] small edit to doc 411d6d9 [Joseph K. Bradley] Added migration guide for v1.0 to v1.1. The only updates are in DecisionTree. (cherry picked from commit 171a41cb034f4ea80f6a3c91a6872970de16a14a) Signed-off-by: Xiangrui Meng <meng@databricks.com> 27 August 2014, 08:46:24 UTC
7401247 [SPARK-2830][MLLIB] doc update for 1.1 1. renamed mllib-basics to mllib-data-types 1. renamed mllib-stats to mllib-statistics 1. moved random data generation to the bottom of mllib-stats 1. updated toc accordingly atalwalkar Author: Xiangrui Meng <meng@databricks.com> Closes #2151 from mengxr/mllib-doc-1.1 and squashes the following commits: 0bd79f3 [Xiangrui Meng] add mllib-data-types b64a5d7 [Xiangrui Meng] update the content list of basis statistics in mllib-guide f625cc2 [Xiangrui Meng] move mllib-basics to mllib-data-types 4d69250 [Xiangrui Meng] move random data generation to the bottom of statistics e64f3ce [Xiangrui Meng] move mllib-stats.md to mllib-statistics.md (cherry picked from commit 43dfc84f883822ea27b6e312d4353bf301c2e7ef) Signed-off-by: Xiangrui Meng <meng@databricks.com> 27 August 2014, 08:20:07 UTC
ca01de1 [SPARK-3237][SQL] Fix parquet filters with UDFs Author: Michael Armbrust <michael@databricks.com> Closes #2153 from marmbrus/parquetFilters and squashes the following commits: 712731a [Michael Armbrust] Use closure serializer for sending filters. 1e83f80 [Michael Armbrust] Clean udf functions. (cherry picked from commit e1139dd60e0692e8adb1337c1f605165ce4b8895) Signed-off-by: Michael Armbrust <michael@databricks.com> 27 August 2014, 07:59:54 UTC
5cf1e44 [SPARK-3139] Made ContextCleaner to not block on shuffles As a workaround for SPARK-3015, the ContextCleaner was made "blocking", that is, it cleaned items one-by-one. But shuffles can take a long time to be deleted. Given that the RC for 1.1 is imminent, this PR makes a narrow change in the context cleaner - not wait for shuffle cleanups to complete. Also it changes the error messages on failure to delete to be milder warnings, as exceptions in the delete code path for one item does not really stop the actual functioning of the system. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #2143 from tdas/cleaner-shuffle-fix and squashes the following commits: 9c84202 [Tathagata Das] Restoring default blocking behavior in ContextCleanerSuite, and added docs to identify that spark.cleaner.referenceTracking.blocking does not control shuffle. 2181329 [Tathagata Das] Mark shuffle cleanup as non-blocking. e337cc2 [Tathagata Das] Changed semantics based on PR comments. 387b578 [Tathagata Das] Made ContextCleaner to not block on shuffles (cherry picked from commit 3e2864e40472b32e6a7eec5ba3bc83562d2a1a62) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 27 August 2014, 07:17:37 UTC
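A configuration sketch for the flag mentioned in the entry above; after this change it governs non-shuffle cleanup only, since shuffle cleanup is always non-blocking:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Block on non-shuffle cleanup (e.g. RDDs and broadcasts); shuffles are cleaned asynchronously.
  .set("spark.cleaner.referenceTracking.blocking", "true")
```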
6f82a4b HOTFIX: Minor typo in conf template 27 August 2014, 06:41:12 UTC
e7672f1 [SPARK-3167] Handle special driver configs in Windows (Branch 1.1) This is an effort to bring the Windows scripts up to speed after recent splashing changes in #1845. Author: Andrew Or <andrewor14@gmail.com> Closes #2156 from andrewor14/windows-config-branch-1.1 and squashes the following commits: 00b9dfe [Andrew Or] [SPARK-3167] Handle special driver configs in Windows 27 August 2014, 06:06:21 UTC
2381e90 [SPARK-3224] FetchFailed reduce stages should only show up once in failed stages (in UI) This is a HOTFIX for 1.1. Author: Reynold Xin <rxin@apache.org> Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #2127 from rxin/SPARK-3224 and squashes the following commits: effb1ce [Reynold Xin] Move log message. 49282b3 [Reynold Xin] Kay's feedback. 3f01847 [Reynold Xin] Merge pull request #2 from kayousterhout/SPARK-3224 796d282 [Kay Ousterhout] Added unit test for SPARK-3224 3d3d356 [Reynold Xin] Remove map output loc even for repeated FetchFaileds. 1dd3eb5 [Reynold Xin] [SPARK-3224] FetchFailed reduce stages should only show up once in the failed stages UI. (cherry picked from commit bf719056b71d55e1194554661dfa194ed03d364d) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 27 August 2014, 05:13:06 UTC
7726e56 Fix unclosed HTML tag in Yarn docs. 27 August 2014, 02:01:10 UTC
8b5af6f [SPARK-3036][SPARK-3037][SQL] Add MapType/ArrayType containing null value support to Parquet. JIRA: - https://issues.apache.org/jira/browse/SPARK-3036 - https://issues.apache.org/jira/browse/SPARK-3037 Currently this uses the following Parquet schema for `MapType` when `valueContainsNull` is `true`: ``` message root { optional group a (MAP) { repeated group map (MAP_KEY_VALUE) { required int32 key; optional int32 value; } } } ``` for `ArrayType` when `containsNull` is `true`: ``` message root { optional group a (LIST) { repeated group bag { optional int32 array; } } } ``` We have to think about compatibilities with older version of Spark or Hive or others I mentioned in the JIRA issues. Notice: This PR is based on #1963 and #1889. Please check them first. /cc marmbrus, yhuai Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #2032 from ueshin/issues/SPARK-3036_3037 and squashes the following commits: 4e8e9e7 [Takuya UESHIN] Add ArrayType containing null value support to Parquet. 013c2ca [Takuya UESHIN] Add MapType containing null value support to Parquet. 62989de [Takuya UESHIN] Merge branch 'issues/SPARK-2969' into issues/SPARK-3036_3037 8e38b53 [Takuya UESHIN] Merge branch 'issues/SPARK-3063' into issues/SPARK-3036_3037 (cherry picked from commit 727cb25bcc29481d6b744abef1ca091e64b5f91f) Signed-off-by: Michael Armbrust <michael@databricks.com> 27 August 2014, 01:28:52 UTC
0d97233 [Docs] Run tests like in contributing guide The Contributing to Spark guide [recommends](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-AutomatedTesting) running tests by calling `./dev/run-tests`. The README should, too. `./sbt/sbt test` does not cover Python tests or style tests. Author: nchammas <nicholas.chammas@gmail.com> Closes #2149 from nchammas/patch-2 and squashes the following commits: 2b3b132 [nchammas] [Docs] Run tests like in contributing guide (cherry picked from commit 73b3089b8d2901dab11bb1ef6f46c29625b677fe) Signed-off-by: Reynold Xin <rxin@apache.org> 27 August 2014, 00:50:16 UTC
c0e1f99 [SPARK-2964] [SQL] Remove duplicated code from spark-sql and start-thriftserver.sh Author: Cheng Lian <lian.cs.zju@gmail.com> Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #1886 from sarutak/SPARK-2964 and squashes the following commits: 8ef8751 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2964 26e7c95 [Kousuke Saruta] Revert "Shorten timeout to more reasonable value" ffb68fa [Kousuke Saruta] Modified spark-sql and start-thriftserver.sh to use bin/utils.sh 8c6f658 [Kousuke Saruta] Merge branch 'spark-3026' of https://github.com/liancheng/spark into SPARK-2964 81b43a8 [Cheng Lian] Shorten timeout to more reasonable value a89e66d [Cheng Lian] Fixed command line options quotation in scripts 9c894d3 [Cheng Lian] Fixed bin/spark-sql -S option typo be4736b [Cheng Lian] Report better error message when running JDBC/CLI without hive-thriftserver profile enabled (cherry picked from commit faeb9c0e1440f4af888be0dfc5de7b57efc92b00) Signed-off-by: Michael Armbrust <michael@databricks.com> 27 August 2014, 00:33:57 UTC
a308a16 [SPARK-3194][SQL] Add AttributeSet to fix bugs with invalid comparisons of AttributeReferences It is common to want to describe sets of attributes that are in various parts of a query plan. However, the semantics of putting `AttributeReference` objects into a standard Scala `Set` result in subtle bugs when references differ cosmetically. For example, with case insensitive resolution it is possible to have two references to the same attribute whose names are not equal. In this PR I introduce a new abstraction, an `AttributeSet`, which performs all comparisons using the globally unique `ExpressionId` instead of case class equality. (There is already a related class, [`AttributeMap`](https://github.com/marmbrus/spark/blob/inMemStats/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala#L32)) This new type of set is used to fix a bug in the optimizer where needed attributes were getting projected away underneath join operators. I also took this opportunity to refactor the expression and query plan base classes. In all but one instance the logic for computing the `references` of an `Expression` were the same. Thus, I moved this logic into the base class. For query plans the semantics of the `references` method were ill defined (is it the references output? or is it those used by expression evaluation? or what?). As a result, this method wasn't really used very much. So, I removed it. TODO: - [x] Finish scala doc for `AttributeSet` - [x] Scan the code for other instances of `Set[Attribute]` and refactor them. - [x] Finish removing `references` from `QueryPlan` Author: Michael Armbrust <michael@databricks.com> Closes #2109 from marmbrus/attributeSets and squashes the following commits: 1c0dae5 [Michael Armbrust] work on serialization bug. 9ba868d [Michael Armbrust] Merge remote-tracking branch 'origin/master' into attributeSets 3ae5288 [Michael Armbrust] review comments 40ce7f6 [Michael Armbrust] style d577cc7 [Michael Armbrust] Scaladoc cae5d22 [Michael Armbrust] remove more references implementations d6e16be [Michael Armbrust] Remove more instances of "def references" and normal sets of attributes. fc26b49 [Michael Armbrust] Add AttributeSet class, remove references from Expression. (cherry picked from commit c4787a3690a9ed3b8b2c6c294fc4a6915436b6f7) Signed-off-by: Reynold Xin <rxin@apache.org> 26 August 2014, 23:29:29 UTC
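A hedged, much-simplified sketch of the idea — keying the set on a globally unique expression id instead of case-class equality. The names below are hypothetical and not Catalyst's actual classes:

```scala
// Illustrative only; not the real Catalyst types.
case class ExprId(id: Long)
case class AttrRef(name: String, exprId: ExprId)

// Membership is decided by ExprId alone, so "Name" and "name" that resolve to the same
// attribute compare equal, while cosmetically identical names with different ids stay distinct.
class AttrSet private (private val byId: Map[ExprId, AttrRef]) {
  def contains(a: AttrRef): Boolean = byId.contains(a.exprId)
  def ++(other: AttrSet): AttrSet = new AttrSet(byId ++ other.byId)
  def --(other: AttrSet): AttrSet = new AttrSet(byId -- other.byId.keys)
  def subsetOf(other: AttrSet): Boolean = byId.keySet.subsetOf(other.byId.keySet)
}

object AttrSet {
  def apply(attrs: Seq[AttrRef]): AttrSet =
    new AttrSet(attrs.map(a => a.exprId -> a).toMap)
}
```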
2715eb7 [SPARK-2839][MLlib] Stats Toolkit documentation updated Documentation updated for the Statistics Toolkit of MLlib. mengxr atalwalkar https://issues.apache.org/jira/browse/SPARK-2839 P.S. Accidentally closed #2123. New commits didn't show up after I reopened the PR. I've opened this instead and closed the old one. Author: Burak <brkyvz@gmail.com> Closes #2130 from brkyvz/StatsLib-Docs and squashes the following commits: a54a855 [Burak] [SPARK-2839][MLlib] Addressed comments bfc6896 [Burak] [SPARK-2839][MLlib] Added a more specific link to colStats() for pyspark 213fe3f [Burak] [SPARK-2839][MLlib] Modifications made according to review fec4d9d [Burak] [SPARK-2830][MLlib] Stats Toolkit documentation updated (cherry picked from commit 1208f72ac78960fe5060187761479b2a9a417c1b) Signed-off-by: Xiangrui Meng <meng@databricks.com> 26 August 2014, 22:18:51 UTC
5ff9000 [SPARK-3226][MLLIB] doc update for native libraries to mention `-Pnetlib-lgpl` option. atalwalkar Author: Xiangrui Meng <meng@databricks.com> Closes #2128 from mengxr/mllib-native and squashes the following commits: 4cbba57 [Xiangrui Meng] update mllib dependencies (cherry picked from commit adbd5c1636669fc474ab02b54cd1ced353f68712) Signed-off-by: Xiangrui Meng <meng@databricks.com> 26 August 2014, 22:12:40 UTC
5d981a4 [SPARK-3063][SQL] ExistingRdd should convert Map to catalyst Map. Currently `ExistingRdd.convertToCatalyst` doesn't convert `Map` value. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #1963 from ueshin/issues/SPARK-3063 and squashes the following commits: 3ba41f2 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-3063 4d7bae2 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-3063 9321379 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-3063 d8a900a [Takuya UESHIN] Make ExistingRdd.convertToCatalyst be able to convert Map value. (cherry picked from commit 6b5584ef1c605cd30f25dbe7099ab32aea1746fb) Signed-off-by: Michael Armbrust <michael@databricks.com> 26 August 2014, 22:04:23 UTC
35a5853 [SPARK-2969][SQL] Make ScalaReflection be able to handle ArrayType.containsNull and MapType.valueContainsNull. Make `ScalaReflection` be able to handle like: - `Seq[Int]` as `ArrayType(IntegerType, containsNull = false)` - `Seq[java.lang.Integer]` as `ArrayType(IntegerType, containsNull = true)` - `Map[Int, Long]` as `MapType(IntegerType, LongType, valueContainsNull = false)` - `Map[Int, java.lang.Long]` as `MapType(IntegerType, LongType, valueContainsNull = true)` Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #1889 from ueshin/issues/SPARK-2969 and squashes the following commits: 24f1c5c [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Python API. 79f5b65 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Java API. 7cd1a7a [Takuya UESHIN] Fix json test failures. 2cfb862 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true. 2f38e61 [Takuya UESHIN] Revert the default value of MapTypes.valueContainsNull. 9fa02f5 [Takuya UESHIN] Fix a test failure. 1a9a96b [Takuya UESHIN] Modify ScalaReflection to handle ArrayType.containsNull and MapType.valueContainsNull. (cherry picked from commit 98c2bb0bbde6fb2b6f64af3efffefcb0dae94c12) Signed-off-by: Michael Armbrust <michael@databricks.com> 26 August 2014, 20:23:07 UTC
83d2730 [SPARK-2871] [PySpark] add histogram() API RDD.histogram(buckets) Compute a histogram using the provided buckets. The buckets are all open to the right except for the last, which is closed. e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50], which means 1<=x<10, 10<=x<20, 20<=x<=50. And on the input of 1 and 50 we would have a histogram of 1,0,1. If your histogram is evenly spaced (e.g. [0, 10, 20, 30]), this can be switched from an O(log n) insertion to O(1) per element (where n = # buckets). Buckets must be sorted, must not contain any duplicates, and must have at least two elements. If `buckets` is a number, it will generate buckets evenly spaced between the minimum and maximum of the RDD. For example, if the min value is 0 and the max is 100, given buckets as 2, the resulting buckets will be [0,50) [50,100]. buckets must be at least 1. If the RDD contains infinity or NaN, an exception is thrown. If the elements in the RDD do not vary (max == min), a single bucket is always returned. It will return a tuple of buckets and histogram. >>> rdd = sc.parallelize(range(51)) >>> rdd.histogram(2) ([0, 25, 50], [25, 26]) >>> rdd.histogram([0, 5, 25, 50]) ([0, 5, 25, 50], [5, 20, 26]) >>> rdd.histogram([0, 15, 30, 45, 60], True) ([0, 15, 30, 45, 60], [15, 15, 15, 6]) >>> rdd = sc.parallelize(["ab", "ac", "b", "bd", "ef"]) >>> rdd.histogram(("a", "b", "c")) (('a', 'b', 'c'), [2, 2]) closes #122, it's duplicated. Author: Davies Liu <davies.liu@gmail.com> Closes #2091 from davies/histgram and squashes the following commits: a322f8a [Davies Liu] fix deprecation of e.message 84e85fa [Davies Liu] remove evenBuckets, add more tests (including str) d9a0722 [Davies Liu] address comments 0e18a2d [Davies Liu] add histgram() API (cherry picked from commit 3cedc4f4d78e093fd362085e0a077bb9e4f28ca5) Signed-off-by: Josh Rosen <joshrosen@apache.org> 26 August 2014, 20:05:35 UTC
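For reference, the Scala side already exposes the same operation through DoubleRDDFunctions; a small sketch mirroring the doctest above, with results as I would expect them (worth double-checking against your Spark version):

```scala
import org.apache.spark.SparkContext._  // brings the Double RDD functions into scope on Spark 1.x

val rdd = sc.parallelize((0 until 51).map(_.toDouble))
rdd.histogram(2)                            // (Array(0.0, 25.0, 50.0), Array(25, 26))
rdd.histogram(Array(0.0, 5.0, 25.0, 50.0))  // Array(5, 20, 26)
```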
3a9d874 [SPARK-3131][SQL] Allow user to set parquet compression codec for writing ParquetFile in SQLContext There are 4 different compression codec available for ```ParquetOutputFormat``` in Spark SQL, it was set as a hard-coded value in ```ParquetRelation.defaultCompression``` original discuss: https://github.com/apache/spark/pull/195#discussion-diff-11002083 i added a new config property in SQLConf to allow user to change this compression codec, and i used similar short names syntax as described in SPARK-2953 #1873 (https://github.com/apache/spark/pull/1873/files#diff-0) btw, which codec should we use as default? it was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but i think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469, #1415), and parquet-mr supports Snappy codec natively (https://github.com/Parquet/parquet-mr/commit/e440108de57199c12d66801ca93804086e7f7632). Author: chutium <teng.qiu@gmail.com> Closes #2039 from chutium/parquet-compression and squashes the following commits: 2f44964 [chutium] [SPARK-3131][SQL] parquet compression default codec set to snappy, also in test suite e578e21 [chutium] [SPARK-3131][SQL] compression codec config property name and default codec set to snappy 21235dc [chutium] [SPARK-3131][SQL] Allow user to set parquet compression codec for writing ParquetFile in SQLContext (cherry picked from commit 8856c3d86009295be871989a5dc7270f31b420cd) Signed-off-by: Michael Armbrust <michael@databricks.com> 26 August 2014, 18:51:42 UTC
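A usage sketch for the new setting; the property name below (spark.sql.parquet.compression.codec) is my assumption of what the new SQLConf key is called, so verify it against the docs for your Spark version:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Assumed property name for the new SQLConf setting; "snappy" is the default chosen in this change.
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
```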