https://github.com/apache/spark

4aaf48d Preparing Spark release v1.3.0-rc3 05 March 2015, 23:02:07 UTC
4ab990c Updating CHANGES.txt for Spark 1.3 05 March 2015, 23:00:48 UTC
a0cbfe4 Revert "Preparing Spark release v1.3.0-rc3" This reverts commit 6fb4af2fbeb3d1b888191a2fa1042c80e3ef2d60. 05 March 2015, 22:56:25 UTC
d6b9dce Revert "Preparing development version 1.3.1-SNAPSHOT" This reverts commit 5097f869efbdb75d3b87bcbd8e621e7c12356942. 05 March 2015, 22:56:23 UTC
556e0de [SQL] Make Strategies a public developer API Author: Michael Armbrust <michael@databricks.com> Closes #4920 from marmbrus/openStrategies and squashes the following commits: cbc35c0 [Michael Armbrust] [SQL] Make Strategies a public developer API (cherry picked from commit eb48fd6e9d55fb034c00e61374bb9c2a86a82fb8) Signed-off-by: Michael Armbrust <michael@databricks.com> 05 March 2015, 22:50:58 UTC
083fed5 [SPARK-6163][SQL] jsonFile should be backed by the data source API jira: https://issues.apache.org/jira/browse/SPARK-6163 Author: Yin Huai <yhuai@databricks.com> Closes #4896 from yhuai/SPARK-6163 and squashes the following commits: 45e023e [Yin Huai] Address @chenghao-intel's comment. 2e8734e [Yin Huai] Use JSON data source for jsonFile. 92a4a33 [Yin Huai] Test. (cherry picked from commit 1b4bb25c10d72132d7f4f3835ef9a3b94b2349e0) Signed-off-by: Michael Armbrust <michael@databricks.com> 05 March 2015, 22:49:56 UTC
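Usage is unchanged by the switch to the data source API; a minimal sketch (the path is a placeholder and a `sqlContext` is assumed in scope):

```scala
// jsonFile now routes through the JSON data source under the hood.
val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
df.printSchema()
```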
e358f55 [SPARK-6145][SQL] fix ORDER BY on nested fields Based on #4904 with style errors fixed. `LogicalPlan#resolve` produces not only `Attribute`s but also "`GetField` chains". So in `ResolveSortReferences`, after resolving the ordering expressions, we should collect not just the `Attribute` results but also the `Attribute` at the bottom of each "`GetField` chain". Author: Wenchen Fan <cloud0fan@outlook.com> Author: Michael Armbrust <michael@databricks.com> Closes #4918 from marmbrus/pr/4904 and squashes the following commits: 997f84e [Michael Armbrust] fix style 3eedbfc [Wenchen Fan] fix 6145 (cherry picked from commit 5873c713cc47af311f517ea33a6110993a410377) Signed-off-by: Michael Armbrust <michael@databricks.com> 05 March 2015, 22:49:13 UTC
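A minimal sketch of the now-working case, using hypothetical data (the `Person`/`Address` case classes are invented for illustration; a `sqlContext` is assumed in scope):

```scala
// Ordering by a nested field now resolves through the full "GetField" chain.
case class Address(city: String)
case class Person(name: String, address: Address)

val people = sqlContext.createDataFrame(Seq(Person("alice", Address("NY"))))
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people ORDER BY address.city").collect()
```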
5097f86 Preparing development version 1.3.1-SNAPSHOT 05 March 2015, 22:40:05 UTC
6fb4af2 Preparing Spark release v1.3.0-rc3 05 March 2015, 22:40:04 UTC
988b498 [SPARK-6175] Fix standalone executor log links when ephemeral ports or SPARK_PUBLIC_DNS are used This patch fixes two issues with the executor log viewing links added in Spark 1.3. In standalone mode, the log URLs might include a port value of 0 rather than the actual bound port of the UI, which broke the ability to view logs from workers whose web UIs had been configured to bind to ephemeral ports. In addition, the URLs used workers' local hostnames instead of respecting SPARK_PUBLIC_DNS, which prevented this feature from working properly on Spark EC2 clusters because the links would point to internal DNS names instead of external ones. I included tests for both of these bugs: - We now browse to the URLs and verify that they point to the expected pages. - To test SPARK_PUBLIC_DNS, I changed the code that reads the environment variable to do so via `SparkConf.getenv`, then used a custom SparkConf subclass to mock the environment variable (this pattern is used elsewhere in Spark's tests). Author: Josh Rosen <joshrosen@databricks.com> Closes #4903 from JoshRosen/SPARK-6175 and squashes the following commits: 5577f41 [Josh Rosen] Remove println cfec135 [Josh Rosen] Use webUi.boundPort and publicAddress in log links 27918c7 [Josh Rosen] Add failing unit tests for standalone log URL viewing c250fbe [Josh Rosen] Respect SparkConf in local-cluster Workers. 422a2ef [Josh Rosen] Use conf.getenv to read SPARK_PUBLIC_DNS (cherry picked from commit 424a86a1ed2a3e6dd54cf8b09fe2f13a1311b7e6) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 05 March 2015, 20:04:14 UTC
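A sketch of the test pattern described above, with assumed names; since `getenv` is `private[spark]`, this only compiles inside Spark's own `org.apache.spark` package:

```scala
package org.apache.spark

// Hypothetical SparkConf subclass that overrides getenv, so environment
// variables such as SPARK_PUBLIC_DNS can be mocked in tests.
class MockEnvSparkConf(env: Map[String, String]) extends SparkConf(loadDefaults = false) {
  override def getenv(name: String): String =
    env.getOrElse(name, super.getenv(name))
}
```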
ae315d2 SPARK-6182 [BUILD] spark-parent pom needs to be published for both 2.10 and 2.11 Option 1 of 2: Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11 Author: Sean Owen <sowen@cloudera.com> Closes #4912 from srowen/SPARK-6182.1 and squashes the following commits: eff60de [Sean Owen] Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11 (cherry picked from commit c9cfba0cebe3eb546e3e96f3e5b9b89a74c5b7de) Signed-off-by: Patrick Wendell <patrick@databricks.com> 05 March 2015, 19:32:06 UTC
f8205d3 Revert "[SPARK-6153] [SQL] promote guava dep for hive-thriftserver" This reverts commit b92d925d9dcfa601d65d4e6590f62f2aaf59dad2. 05 March 2015, 09:58:18 UTC
b92d925 [SPARK-6153] [SQL] promote guava dep for hive-thriftserver For package thriftserver, guava is used at runtime. /cc pwendell Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #4884 from adrian-wang/test and squashes the following commits: 4600ae7 [Daoyuan Wang] only promote for thriftserver 44dda18 [Daoyuan Wang] promote guava dep for hive (cherry picked from commit e06c7dfbc2331db2d1c365959c12aaac640a610a) Signed-off-by: Cheng Lian <lian@databricks.com> 05 March 2015, 08:35:54 UTC
b1e9162 Revert "Preparing Spark release v1.3.0-rc3" This reverts commit 430a879699d2d41bc65f5c00b1f239d15fd5e549. 05 March 2015, 07:54:34 UTC
53cba58 Revert "Preparing development version 1.3.1-SNAPSHOT" This reverts commit 0ecab40e4391d0674ac86595ec09af3b9a4ac50d. 05 March 2015, 07:54:32 UTC
0ecab40 Preparing development version 1.3.1-SNAPSHOT 05 March 2015, 05:20:44 UTC
430a879 Preparing Spark release v1.3.0-rc3 05 March 2015, 05:20:43 UTC
87eac3c Updating CHANGES file 05 March 2015, 05:19:49 UTC
a1f4cc5 Revert "Preparing Spark release v1.3.0-rc2" This reverts commit 3af26870e5163438868c4eb2df88380a533bb232. 05 March 2015, 05:02:20 UTC
cbdc43f Revert "Preparing development version 1.3.1-SNAPSHOT" This reverts commit 05d5a29eb3193aeb57d177bafe39eb75edce72a1. 05 March 2015, 05:02:18 UTC
f509159 SPARK-5143 [BUILD] [WIP] spark-network-yarn 2.11 depends on spark-network-shuffle 2.10 Update `<scala.binary.version>` prop in POM when switching between Scala 2.10/2.11 ScrapCodes for review. This `sed` command is supposed to just replace the first occurrence, but it replaces them all. Are you more of a `sed` wizard than I? It may be a GNU/BSD thing that is throwing me off. Really, just the first instance should be replaced, hence the `[WIP]`. NB on OS X the original `sed` command here will create files like `pom.xml-e` through the source tree though it otherwise works. It's like `-e` is also the arg to `-i`. I couldn't get rid of that even with `-i""`. No biggie. Author: Sean Owen <sowen@cloudera.com> Closes #4876 from srowen/SPARK-5143 and squashes the following commits: b060c44 [Sean Owen] Oops, fixed reversed version numbers! e875d4a [Sean Owen] Add note about non-GNU sed; fix new pom.xml update to work as intended on GNU sed 703e1eb [Sean Owen] Update scala.binary.version prop in POM when switching between Scala 2.10/2.11 (cherry picked from commit 7ac072f74b5a9a02339cede82ad5ffec5beed715) Signed-off-by: Patrick Wendell <patrick@databricks.com> 05 March 2015, 05:00:58 UTC
a0aa24a [SPARK-6149] [SQL] [Build] Excludes Guava 15 referenced by jackson-module-scala_2.10 This PR excludes Guava 15.0 from the SBT build, to make Spark SQL CLI (`bin/spark-sql`) work when compiled against Hive 0.12.0. Author: Cheng Lian <lian@databricks.com> Closes #4890 from liancheng/exclude-guava-15 and squashes the following commits: 91ae9fa [Cheng Lian] Moves Guava 15 exclusion from SBT build to POM 282bd2a [Cheng Lian] Excludes Guava 15 referenced by jackson-module-scala_2.10 (cherry picked from commit 1aa90e39e33caa497971544ee7643fb3ff048c12) Signed-off-by: Patrick Wendell <patrick@databricks.com> 05 March 2015, 04:53:17 UTC
3fc74f4 [SPARK-6144] [core] Fix addFile when source files are on "hdfs:" The code failed in two modes: it complained when it tried to re-create a directory that already existed, and it was placing some files in the wrong parent directory. The patch fixes both issues. Author: Marcelo Vanzin <vanzin@cloudera.com> Author: trystanleftwich <trystan@atscale.com> Closes #4894 from vanzin/SPARK-6144 and squashes the following commits: 100b3a1 [Marcelo Vanzin] Style fix. 58266aa [Marcelo Vanzin] Fix fetchHcfs file for directories. 91733b7 [trystanleftwich] [SPARK-6144]When in cluster mode using ADD JAR with a hdfs:// sourced jar will fail (cherry picked from commit 3a35a0dfe940843c3f3a5f51acfe24def488faa9) Signed-off-by: Andrew Or <andrew@databricks.com> 04 March 2015, 20:58:47 UTC
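A minimal usage of the fixed code path (`sc` assumed in scope; the namenode address and file name are placeholders):

```scala
import org.apache.spark.SparkFiles

// Distribute a file sourced from HDFS; the fix makes this work for
// "hdfs:" URIs and for directories.
sc.addFile("hdfs://namenode:8020/data/lookup.txt")

// On executors, the locally fetched copy is resolved via SparkFiles.
val localPath = SparkFiles.get("lookup.txt")
```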
bfa4e31 [SPARK-6134][SQL] Fix wrong datatype for casting FloatType and default LongType value in defaultPrimitive In `CodeGenerator`, the casting on `FloatType` should use `FloatType` instead of `IntegerType`. Besides, `defaultPrimitive` for `LongType` should be `-1L` instead of `1L`. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4870 from viirya/codegen_type and squashes the following commits: 76311dd [Liang-Chi Hsieh] Fix wrong datatype for casting on FloatType. Fix the wrong value for LongType in defaultPrimitive. (cherry picked from commit aef8a84e42351419a67d56abaf1ee75a05eb11ea) Signed-off-by: Cheng Lian <lian@databricks.com> 04 March 2015, 12:25:52 UTC
035243d [SPARK-6136] [SQL] Removed JDBC integration tests which depend on docker-client Integration test suites in the JDBC data source (`MySQLIntegration` and `PostgresIntegration`) depend on docker-client 2.7.5, which transitively depends on Guava 17.0. Unfortunately, Guava 17.0 is causing test runtime binary compatibility issues when Spark is compiled against Hive 0.12.0, or Hadoop 2.4. Considering `MySQLIntegration` and `PostgresIntegration` are ignored right now, I'd suggest moving them from the Spark project to the [Spark integration tests] [1] project. This PR removes both the JDBC data source integration tests and the docker-client test dependency. [1]: https://github.com/databricks/spark-integration-tests Author: Cheng Lian <lian@databricks.com> Closes #4872 from liancheng/remove-docker-client and squashes the following commits: 1f4169e [Cheng Lian] Removes DockerHacks 159b24a [Cheng Lian] Removed JDBC integration tests which depends on docker-client (cherry picked from commit 76b472f12a57bb5bec7b3791660eb47e9177da7f) Signed-off-by: Cheng Lian <lian@databricks.com> 04 March 2015, 11:40:01 UTC
9f24977 [SPARK-6141][MLlib] Upgrade Breeze from 0.10 to 0.11 to fix convergence bug LBFGS and OWLQN in Breeze 0.10 have a convergence-check bug. This is fixed in 0.11; see the description in the Breeze project for details: https://github.com/scalanlp/breeze/pull/373#issuecomment-76879760 Author: Xiangrui Meng <meng@databricks.com> Author: DB Tsai <dbtsai@alpinenow.com> Author: DB Tsai <dbtsai@dbtsai.com> Closes #4879 from dbtsai/breeze and squashes the following commits: d848f65 [DB Tsai] Merge pull request #1 from mengxr/AlpineNow-breeze c2ca6ac [Xiangrui Meng] upgrade to breeze-0.11.1 35c2f26 [Xiangrui Meng] fix LRSuite 397a208 [DB Tsai] upgrade breeze (cherry picked from commit 76e20a0a03cf2c02db35e00271924efb070eaaa5) Signed-off-by: Xiangrui Meng <meng@databricks.com> 04 March 2015, 07:52:10 UTC
9a0b75c [SPARK-5949] HighlyCompressedMapStatus needs more classes registered w/ kryo https://issues.apache.org/jira/browse/SPARK-5949 Author: Imran Rashid <irashid@cloudera.com> Closes #4877 from squito/SPARK-5949_register_roaring_bitmap and squashes the following commits: 7e13316 [Imran Rashid] style style style 5f6bb6d [Imran Rashid] more style 709bfe0 [Imran Rashid] style a5cb744 [Imran Rashid] update tests to cover both types of RoaringBitmapContainers 09610c6 [Imran Rashid] formatting f9a0b7c [Imran Rashid] put primitive array registrations together 97beaf8 [Imran Rashid] SPARK-5949 HighlyCompressedMapStatus needs more classes registered w/ kryo (cherry picked from commit 1f1fccc5ceb0c5b7656a0594be3a67bd3b432e85) Signed-off-by: Reynold Xin <rxin@databricks.com> 03 March 2015, 23:33:26 UTC
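User code can register extra classes the same way this commit does internally; a hedged sketch (RoaringBitmap is the bitmap library named in the JIRA):

```scala
import org.apache.spark.SparkConf

// With registrationRequired enabled, every serialized class must be
// registered up front, which is what this fix adds for the bitmap containers.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(classOf[org.roaringbitmap.RoaringBitmap]))
```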
8446ad0 SPARK-1911 [DOCS] Warn users if their assembly jars are not built with Java 6 Add warning about building with Java 7+ and running the JAR on early Java 6. CC andrewor14 Author: Sean Owen <sowen@cloudera.com> Closes #4874 from srowen/SPARK-1911 and squashes the following commits: 79fa2f6 [Sean Owen] Add warning about building with Java 7+ and running the JAR on early Java 6. (cherry picked from commit e750a6bfddf1d7bf7d3e99a424ec2b83a18b40d9) Signed-off-by: Andrew Or <andrew@databricks.com> 03 March 2015, 21:40:16 UTC
ee4929d Revert "[SPARK-5423][Core] Cleanup resources in DiskMapIterator.finalize to ensure deleting the temp file" This reverts commit 25fae8e7e6c93b7817771342d370b73b40dcf92e. 03 March 2015, 21:04:15 UTC
05d5a29 Preparing development version 1.3.1-SNAPSHOT 03 March 2015, 10:23:07 UTC
3af2687 Preparing Spark release v1.3.0-rc2 03 March 2015, 10:23:07 UTC
b012ed1 Revert "Preparing Spark release v1.3.0-rc1" This reverts commit f97b0d4a6b26504916816d7aefcf3132cd1da6c2. 03 March 2015, 10:20:05 UTC
4fee08e Revert "Preparing development version 1.3.1-SNAPSHOT" This reverts commit 2ab0ba04f66683be25cbe0e83cecf2bdcb0f13ba. 03 March 2015, 10:20:03 UTC
ce7158c Adding CHANGES.txt for Spark 1.3 03 March 2015, 10:19:19 UTC
ae60eb9 BUILD: Minor tweaks to internal build scripts This adds two features: 1. The ability to publish with a different maven version than that specified in the release source. 2. Forking of different Zinc instances during the parallel dist creation (to help with some stability issues). 03 March 2015, 09:54:06 UTC
1aa8461 HOTFIX: Bump HBase version in MapR profiles. After #2982 (SPARK-4048) we rely on the newer HBase packaging format. 03 March 2015, 09:39:26 UTC
841d2a2 [SPARK-5537][MLlib][Docs] Add user guide for multinomial logistic regression Adding more description on top of #4861. Author: DB Tsai <dbtsai@alpinenow.com> Closes #4866 from dbtsai/doc and squashes the following commits: 37e9d07 [DB Tsai] doc (cherry picked from commit b196056190c569505cc32669d1aec30ed9d70665) Signed-off-by: Xiangrui Meng <meng@databricks.com> 03 March 2015, 06:37:21 UTC
81648a7 [SPARK-6120] [mllib] Warnings about memory in tree, ensemble model save Issue: When the Python DecisionTree example in the programming guide is run, it runs out of Java Heap Space when using the default memory settings for the spark shell. This patch prints a warning in that case. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #4864 from jkbradley/dt-save-heap and squashes the following commits: 02e8daf [Joseph K. Bradley] fixed based on code review 7ecb1ed [Joseph K. Bradley] Added warnings about memory when calling tree and ensemble model save with too small a Java heap size (cherry picked from commit c2fe3a6ff1a48a9da54d2c2c4d80ecd06cdeebca) Signed-off-by: Xiangrui Meng <meng@databricks.com> 03 March 2015, 06:35:22 UTC
62c53be [SPARK-6097][MLLIB] Support tree model save/load in PySpark/MLlib Similar to `MatrixFactorizaionModel`, we only need wrappers to support save/load for tree models in Python. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #4854 from mengxr/SPARK-6097 and squashes the following commits: 4586a4d [Xiangrui Meng] fix more typos 8ebcac2 [Xiangrui Meng] fix python style 91172d8 [Xiangrui Meng] fix typos 201b3b9 [Xiangrui Meng] update user guide b5158e2 [Xiangrui Meng] support tree model save/load in PySpark/MLlib (cherry picked from commit 7e53a79c30511dbd0e5d9878a4b8b0f5bc94e68b) Signed-off-by: Xiangrui Meng <meng@databricks.com> 03 March 2015, 06:27:23 UTC
4e6e008 [SPARK-5310][SQL] Fixes to Docs and Datasources API - Various Fixes to docs - Make data source traits actually interfaces Based on #4862 but with fixed conflicts. Author: Reynold Xin <rxin@databricks.com> Author: Michael Armbrust <michael@databricks.com> Closes #4868 from marmbrus/pr/4862 and squashes the following commits: fe091ea [Michael Armbrust] Merge remote-tracking branch 'origin/master' into pr/4862 0208497 [Reynold Xin] Test fixes. 34e0a28 [Reynold Xin] [SPARK-5310][SQL] Various fixes to Spark SQL docs. (cherry picked from commit 54d19689ff8d786acde5b8ada6741854ffadadea) Signed-off-by: Michael Armbrust <michael@databricks.com> 03 March 2015, 06:14:20 UTC
1b490e9 [SPARK-5950][SQL] Insert array into a metastore table saved as parquet should work when using datasource api This PR contains the following changes: 1. Add a new method, `DataType.equalsIgnoreCompatibleNullability`, which is the middle ground between DataType's equality check and `DataType.equalsIgnoreNullability`. For two data types `from` and `to`, it checks `equalsIgnoreNullability` as well as whether the nullability of `from` is compatible with that of `to`. For example, the nullability of `ArrayType(IntegerType, containsNull = false)` is compatible with that of `ArrayType(IntegerType, containsNull = true)` (for an array without null values, we can always say it may contain null values). However, the nullability of `ArrayType(IntegerType, containsNull = true)` is incompatible with that of `ArrayType(IntegerType, containsNull = false)` (for an array that may have null values, we cannot say it does not have null values). 2. For the `resolved` field of `InsertIntoTable`, use `equalsIgnoreCompatibleNullability` to replace the equality check of the data types. 3. For our data source write path, when appending data, we always use the schema of the existing table to write the data. This is important for parquet, since nullability directly impacts the way to encode/decode values. If we do not do this, we may see corrupted values when reading values from a set of parquet files generated with different nullability settings. 4. When generating a new parquet table, we always set nullable/containsNull/valueContainsNull to true. So, we will not face situations where we cannot append data because containsNull/valueContainsNull in an Array/Map column of the existing table has already been set to `false`. This change makes the whole data pipeline more robust. 5. Update the equality check of JSON relation. Since JSON does not really care about nullability, `equalsIgnoreNullability` seems a better choice for comparing schemata of JSON tables. JIRA: https://issues.apache.org/jira/browse/SPARK-5950 Thanks viirya for the initial work in #4729. cc marmbrus liancheng Author: Yin Huai <yhuai@databricks.com> Closes #4826 from yhuai/insertNullabilityCheck and squashes the following commits: 3b61a04 [Yin Huai] Revert change on equals. 80e487e [Yin Huai] asNullable in UDT. 587d88b [Yin Huai] Make methods private. 0cb7ea2 [Yin Huai] marmbrus's comments. 3cec464 [Yin Huai] Cheng's comments. 486ed08 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck d3747d1 [Yin Huai] Remove unnecessary change. 8360817 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck 8a3f237 [Yin Huai] Use equalsIgnoreNullability instead of equality check. 0eb5578 [Yin Huai] Fix tests. f6ed813 [Yin Huai] Update old parquet path. e4f397c [Yin Huai] Unit tests. b2c06f8 [Yin Huai] Ignore nullability in JSON relation's equality check. 8bd008b [Yin Huai] nullable, containsNull, and valueContainsNull will be always true for parquet data. bf50d73 [Yin Huai] When appending data, we use the schema of the existing table instead of the schema of the new data. 0a703e7 [Yin Huai] Test failed again since we cannot read correct content. 9a26611 [Yin Huai] Make InsertIntoTable happy. 8f19fe5 [Yin Huai] equalsIgnoreCompatibleNullability 4ec17fd [Yin Huai] Failed test. (cherry picked from commit 12599942e69e4d73040f3a8611661a0862514ffc) Signed-off-by: Michael Armbrust <michael@databricks.com> 03 March 2015, 03:32:08 UTC
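An illustration of the compatibility rule from point 1 (the method itself is internal to Spark SQL, so this sketch only shows the types involved):

```scala
import org.apache.spark.sql.types._

// An array that never contains null can safely be written into a column
// that may contain null...
val from = ArrayType(IntegerType, containsNull = false)
val to   = ArrayType(IntegerType, containsNull = true)
// ...but not the other way around: `from` is write-compatible with `to`,
// while `to` is not write-compatible with `from`.
```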
ffd0591 [SPARK-6127][Streaming][Docs] Add Kafka to Python api docs davies Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #4860 from tdas/SPARK-6127 and squashes the following commits: 82de92a [Tathagata Das] Add Kafka to Python api docs (cherry picked from commit 9eb22ece115c69899d100cecb8a5e20b3a268649) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 03 March 2015, 02:40:57 UTC
11389f0 [SPARK-5537] Add user guide for multinomial logistic regression This is based on #4801 from dbtsai. The linear method guide is re-organized a little bit for this change. Closes #4801 Author: Xiangrui Meng <meng@databricks.com> Author: DB Tsai <dbtsai@alpinenow.com> Closes #4861 from mengxr/SPARK-5537 and squashes the following commits: 47af0ac [Xiangrui Meng] update user guide for multinomial logistic regression cdc2e15 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into AlpineNow-mlor-doc 096d0ca [DB Tsai] first commit (cherry picked from commit 9d6c5aeebd3c7f8ff6defe3bccd8ff12ed918293) Signed-off-by: Xiangrui Meng <meng@databricks.com> 03 March 2015, 02:10:56 UTC
1b8ab57 [SPARK-6121][SQL][MLLIB] simpleString for UDT `df.dtypes` shows `null` for UDTs. This PR uses `udt` by default and `VectorUDT` overwrites it with `vector`. jkbradley davies Author: Xiangrui Meng <meng@databricks.com> Closes #4858 from mengxr/SPARK-6121 and squashes the following commits: 34f0a77 [Xiangrui Meng] simpleString for UDT (cherry picked from commit 2db6a853a53b4c25e35983bc489510abb8a73e1d) Signed-off-by: Xiangrui Meng <meng@databricks.com> 03 March 2015, 01:14:43 UTC
ea69cf2 [SPARK-6048] SparkConf should not translate deprecated configs on set There are multiple issues with translating on set outlined in the JIRA. This PR reverts the translation logic added to `SparkConf`. In the future, after the 1.3.0 release we will figure out a way to reorganize the internal structure more elegantly. For now, let's preserve the existing semantics of `SparkConf` since it's a public interface. Unfortunately this means duplicating some code for now, but this is all internal and we can always clean it up later. Author: Andrew Or <andrew@databricks.com> Closes #4799 from andrewor14/conf-set-translate and squashes the following commits: 11c525b [Andrew Or] Move warning to driver 10e77b5 [Andrew Or] Add documentation for deprecation precedence a369cb1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into conf-set-translate c26a9e3 [Andrew Or] Revert all translate logic in SparkConf fef6c9c [Andrew Or] Restore deprecation logic for spark.executor.userClassPathFirst 94b4dfa [Andrew Or] Translate on get, not set (cherry picked from commit 258d154c9f1afdd52dce19f03d81683ee34effac) Signed-off-by: Patrick Wendell <patrick@databricks.com> 03 March 2015, 00:36:50 UTC
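A minimal sketch of the restored semantics (the deprecated key below is one example from this era; treat it as illustrative):

```scala
import org.apache.spark.SparkConf

// set() stores exactly the key the caller wrote; no translation happens here.
val conf = new SparkConf()
conf.set("spark.yarn.applicationMaster.waitTries", "10")

// Translation to the new key is deferred to the read side, so the stored
// entry still carries the deprecated name.
assert(conf.get("spark.yarn.applicationMaster.waitTries") == "10")
```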
8100b79 [SPARK-6066] Make event log format easier to parse Some users have reported difficulty in parsing the new event log format. Since we embed the metadata in the beginning of the file, when we compress the event log we need to skip the metadata because we need that information to parse the log later. This means we'll end up with a partially compressed file if event logging compression is turned on. The old format looks like: ``` sparkVersion = 1.3.0 compressionCodec = org.apache.spark.io.LZFCompressionCodec === LOG_HEADER_END === // actual events, could be compressed bytes ``` The new format in this patch puts the compression codec in the log file name instead. It also removes the metadata header altogether along with the Spark version, which was not needed. The new file name looks something like: ``` app_without_compression app_123.lzf app_456.snappy ``` I tested this with and without compression, using different compression codecs and event logging directories. I verified that both the `Master` and the `HistoryServer` can render both compressed and uncompressed logs as before. Author: Andrew Or <andrew@databricks.com> Closes #4821 from andrewor14/event-log-format and squashes the following commits: 8511141 [Andrew Or] Fix test 654883d [Andrew Or] Add back metadata with Spark version 7f537cd [Andrew Or] Address review feedback 7d6aa61 [Andrew Or] Make codec an extension 59abee9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into event-log-format 27c9a6c [Andrew Or] Address review feedback 519e51a [Andrew Or] Address review feedback ef69276 [Andrew Or] Merge branch 'master' of github.com:apache/spark into event-log-format 88a091d [Andrew Or] Add tests for new format and file name f32d8d2 [Andrew Or] Fix tests 8db5a06 [Andrew Or] Embed metadata in the event log file name instead (cherry picked from commit 6776cb33ea691f7843b956b3e80979282967e826) Signed-off-by: Patrick Wendell <patrick@databricks.com> 03 March 2015, 00:34:39 UTC
866f281 [SPARK-6082] [SQL] Provides better error message for malformed rows when caching tables Constructs like Hive `TRANSFORM` may generate malformed rows (via badly authored external scripts for example). I'm a bit hesitant to have this feature, since it introduces per-tuple cost when caching tables. However, considering caching tables is usually a one-time cost, this is probably worth having. Author: Cheng Lian <lian@databricks.com> Closes #4842 from liancheng/spark-6082 and squashes the following commits: b05dbff [Cheng Lian] Provides better error message for malformed rows when caching tables (cherry picked from commit 1a49496b4a9df40c74739fc0fb8a21c88a477075) Signed-off-by: Michael Armbrust <michael@databricks.com> 03 March 2015, 00:18:10 UTC
3899c7c [SPARK-6114][SQL] Avoid metastore conversions before plan is resolved Author: Michael Armbrust <michael@databricks.com> Closes #4855 from marmbrus/explodeBug and squashes the following commits: a712249 [Michael Armbrust] [SPARK-6114][SQL] Avoid metastore conversions before plan is resolved (cherry picked from commit 8223ce6a81e4cc9fdf816892365fcdff4006c35e) Signed-off-by: Michael Armbrust <michael@databricks.com> 03 March 2015, 00:12:03 UTC
650d1e7 [SPARK-6050] [yarn] Relax matching of vcore count in received containers. Some YARN configurations return a vcore count for allocated containers that does not match the requested resource. That means Spark would always ignore those containers. So relax the matching of the vcore count to allow Spark jobs to run. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #4818 from vanzin/SPARK-6050 and squashes the following commits: 991c803 [Marcelo Vanzin] Remove config option, standardize on legacy behavior (no vcore matching). 8c9c346 [Marcelo Vanzin] Restrict lax matching to vcores only. 3359692 [Marcelo Vanzin] [SPARK-6050] [yarn] Add config option to do lax resource matching. (cherry picked from commit 6b348d90f475440c285a4b636134ffa9351580b9) Signed-off-by: Thomas Graves <tgraves@apache.org> 02 March 2015, 22:42:02 UTC
a83b9bb [SPARK-6040][SQL] Fix the percent bug in tablesample A HiveQL expression like `select count(1) from src tablesample(1 percent);` means taking a 1% sample, but it was treated as 100% in the current version of Spark. Author: q00251598 <qiyadong@huawei.com> Closes #4789 from watermen/SPARK-6040 and squashes the following commits: 2453ebe [q00251598] check and adjust the fraction. (cherry picked from commit 582e5a24c55e8c876733537c9910001affc8b29b) Signed-off-by: Michael Armbrust <michael@databricks.com> 02 March 2015, 21:16:40 UTC
f92876a [Minor] Fix doc typo for describing primitiveTerm effectiveness condition It should be `true` instead of `false`? Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4762 from viirya/doc_fix and squashes the following commits: 2e37482 [Liang-Chi Hsieh] Fix doc. (cherry picked from commit 3f9def81170c24f24f4a6b7ca7905de4f75e11e0) Signed-off-by: Michael Armbrust <michael@databricks.com> 02 March 2015, 21:11:38 UTC
58e7198 SPARK-5390 [DOCS] Encourage users to post on Stack Overflow in Community Docs Point "Community" to main Spark Community page; mention SO tag apache-spark. Separately, the Apache site can be updated to mention, under Mailing Lists: "StackOverflow also has an apache-spark tag for Spark Q&A." or similar. Author: Sean Owen <sowen@cloudera.com> Closes #4843 from srowen/SPARK-5390 and squashes the following commits: 3508ac6 [Sean Owen] Point "Community" to main Spark Community page; mention SO tag apache-spark (cherry picked from commit 0b472f60cdf4984ab5e28e6dbf12615e8997a448) Signed-off-by: Sean Owen <sowen@cloudera.com> 02 March 2015, 21:10:19 UTC
54ac243 [DOCS] Refactored Dataframe join comment to use correct parameter ordering The API signature for join requires the JoinType to be the third parameter. The code examples provided for join show JoinType being provided as the 2nd parameter, resulting in errors (e.g. "df1.join(df2, "outer", $"df1Key" === $"df2Key") ). The correct sample code is df1.join(df2, $"df1Key" === $"df2Key", "outer") Author: Paul Power <paul.power@peerside.com> Closes #4847 from peerside/master and squashes the following commits: ebc1efa [Paul Power] Merge pull request #1 from peerside/peerside-patch-1 e353340 [Paul Power] Updated comments use correct sample code for Dataframe joins (cherry picked from commit d9a8bae77826a0cc77df29d85883e914d0f0b4f3) Signed-off-by: Michael Armbrust <michael@databricks.com> 02 March 2015, 21:09:57 UTC
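The corrected call in context, assuming two DataFrames `df1` and `df2` and a `sqlContext` in scope:

```scala
import sqlContext.implicits._

// Join expression second, join type string third.
val joined = df1.join(df2, $"df1Key" === $"df2Key", "outer")
```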
4ffaf85 [SPARK-6080] [PySpark] correct LogisticRegressionWithLBFGS regType parameter for pyspark Currently LogisticRegressionWithLBFGS in python/pyspark/mllib/classification.py will invoke callMLlibFunc with a wrong "regType" parameter. It was assigned "str(regType)", which translates None (Python) to "None" (Java/Scala). The right way is to translate None (Python) to null (Java/Scala), just as we did in LogisticRegressionWithSGD. Author: Yanbo Liang <ybliang8@gmail.com> Closes #4831 from yanboliang/pyspark_classification and squashes the following commits: 12db65a [Yanbo Liang] correct LogisticRegressionWithLBFGS regType parameter for pyspark (cherry picked from commit af2effdd7b54316af0c02e781911acfb148b962b) Signed-off-by: Xiangrui Meng <meng@databricks.com> 02 March 2015, 18:17:32 UTC
f476108 [SPARK-5741][SQL] Support paths that contain commas in HiveContext When running ```select * from nzhang_part where hr = 'file,';```, it throws ```java.lang.IllegalArgumentException: Can not create a Path from an empty string```, because the HDFS path contains a comma and FileInputFormat.setInputPaths splits paths by comma. ### SQL ``` set hive.merge.mapfiles=true; set hive.merge.mapredfiles=true; set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; create table nzhang_part like srcpart; insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select key, value, hr from srcpart where ds='2008-04-08'; insert overwrite table nzhang_part partition (ds='2010-08-15', hr=11) select key, value from srcpart where ds='2008-04-08'; insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select * from ( select key, value, hr from srcpart where ds='2008-04-08' union all select '1' as key, '1' as value, 'file,' as hr from src limit 1) s; select * from nzhang_part where hr = 'file,'; ``` ### Error Log ``` 15/02/10 14:33:16 ERROR SparkSQLDriver: Failed in [select * from nzhang_part where hr = 'file,'] java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127) at org.apache.hadoop.fs.Path.<init>(Path.java:135) at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:400) at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:251) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196) Author: q00251598 <qiyadong@huawei.com> Closes #4532 from watermen/SPARK-5741 and squashes the following commits: 9758ab1 [q00251598] fix bug 1db1a1c [q00251598] use setInputPaths(Job job, Path... inputPaths) b788a72 [q00251598] change FileInputFormat.setInputPaths to jobConf.set and add test suite (cherry picked from commit 9ce12aaf283a2793e719bdc956dd858922636e8d) Signed-off-by: Michael Armbrust <michael@databricks.com> 02 March 2015, 18:13:30 UTC
b2b7f01 [SPARK-6111] Fixed usage string in documentation. Usage info in documentation does not match actual usage info. Doc string usage says ```Usage: network_wordcount.py <zk> <topic>``` whereas the actual usage is ```Usage: kafka_wordcount.py <zk> <topic>``` Author: Kenneth Myers <myerske@us.ibm.com> Closes #4852 from kennethmyers/kafka_wordcount_documentation_fix and squashes the following commits: 3855325 [Kenneth Myers] Fixed usage string in documentation. (cherry picked from commit 95ac68bf127b5370c13d6bc15adbda78228829cc) Signed-off-by: Sean Owen <sowen@cloudera.com> 02 March 2015, 17:25:37 UTC
a3fef2c [SPARK-6052][SQL] In JSON schema inference, we should always set containsNull of an ArrayType to true Always set `containsNull = true` when inferring the schema of JSON datasets. If we set `containsNull` based on records we scanned, we may miss arrays with null values when we do sampling. Also, because future data can have arrays with null values, if we convert JSON data to parquet, always setting `containsNull = true` is a more robust way to go. JIRA: https://issues.apache.org/jira/browse/SPARK-6052 Author: Yin Huai <yhuai@databricks.com> Closes #4806 from yhuai/jsonArrayContainsNull and squashes the following commits: 05eab9d [Yin Huai] Change containsNull to true. (cherry picked from commit 3efd8bb6cf139ce094ff631c7a9c1eb93fdcd566) Signed-off-by: Cheng Lian <lian@databricks.com> 02 March 2015, 15:18:31 UTC
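A hedged sketch of the new inference behavior (`sc` and `sqlContext` assumed in scope):

```scala
// Even though no sampled record contains a null element, the inferred
// element type is now always nullable.
val rdd = sc.parallelize("""{"xs": [1, 2, 3]}""" :: Nil)
val df = sqlContext.jsonRDD(rdd)
df.schema("xs").dataType // ArrayType(LongType, containsNull = true)
```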
c59871c [SPARK-6073][SQL] Need to refresh metastore cache after append data in CreateMetastoreDataSourceAsSelect JIRA: https://issues.apache.org/jira/browse/SPARK-6073 liancheng Author: Yin Huai <yhuai@databricks.com> Closes #4824 from yhuai/refreshCache and squashes the following commits: b9542ef [Yin Huai] Refresh metadata cache in the Catalog in CreateMetastoreDataSourceAsSelect. (cherry picked from commit 39a54b40aff66816f8b8f5c6133eaaad6eaecae1) Signed-off-by: Cheng Lian <lian@databricks.com> 02 March 2015, 14:44:05 UTC
1fe677a [Streaming][Minor] Fix some erroneous docs in streaming examples Small changes; please help to review, thanks a lot. Author: Saisai Shao <saisai.shao@intel.com> Closes #4837 from jerryshao/doc-fix and squashes the following commits: 545291a [Saisai Shao] Fix some error docs in streaming examples (cherry picked from commit d8fb40edea7c8c811814f1ff288d59178928964b) Signed-off-by: Sean Owen <sowen@cloudera.com> 02 March 2015, 08:49:28 UTC
6a2fc85 [SPARK-6083] [MLLib] [DOC] Make Python API example consistent in NaiveBayes Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #4834 from MechCoder/spark-6083 and squashes the following commits: 1cdd7b5 [MechCoder] Add parse function 65bbbe9 [MechCoder] [SPARK-6083] Make Python API example consistent in NaiveBayes (cherry picked from commit 3f00bb3ef1384fabf86a68180d40a1a515f6f5e3) Signed-off-by: Xiangrui Meng <meng@databricks.com> 02 March 2015, 00:28:25 UTC
b570d98 [SPARK-6053][MLLIB] support save/load in PySpark's ALS A simple wrapper to save/load `MatrixFactorizationModel` in Python. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #4811 from mengxr/SPARK-5991 and squashes the following commits: f135dac [Xiangrui Meng] update save doc 57e5200 [Xiangrui Meng] address comments 06140a4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5991 282ec8d [Xiangrui Meng] support save/load in PySpark's ALS (cherry picked from commit aedbbaa3dda9cbc154cd52c07f6d296b972b0eb2) Signed-off-by: Xiangrui Meng <meng@databricks.com> 02 March 2015, 00:27:18 UTC
bb16618 [SPARK-6074] [sql] Package pyspark sql bindings. This is needed for the SQL bindings to work on Yarn. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #4822 from vanzin/SPARK-6074 and squashes the following commits: fb52001 [Marcelo Vanzin] [SPARK-6074] [sql] Package pyspark sql bindings. (cherry picked from commit fd8d283eeb98e310b1e85ef8c3a8af9e547ab5e0) Signed-off-by: Sean Owen <sowen@cloudera.com> 01 March 2015, 11:05:20 UTC
317694c SPARK-5984: Fix TimSort bug that causes ArrayIndexOutOfBoundsException Fix the TimSort bug which causes an ArrayIndexOutOfBoundsException, using the proposed fix here: http://envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/ Author: Evan Yu <ehotou@gmail.com> Closes #4804 from hotou/SPARK-5984 and squashes the following commits: 3421b6c [Evan Yu] SPARK-5984: Add info to LICENSE e61c6b8 [Evan Yu] SPARK-5984: Fix license and document 6ccc280 [Evan Yu] SPARK-5984: Add License header to file e06c0d2 [Evan Yu] SPARK-5984: Add License header to file 4d95f75 [Evan Yu] SPARK-5984: Fix TimSort bug causes ArrayOutOfBoundsException 479a106 [Evan Yu] SPARK-5984: Fix TimSort bug causes ArrayOutOfBoundsException (cherry picked from commit 643300a6e27dac3822f9a3ced0ad5fb3b4f2ad75) Signed-off-by: Reynold Xin <rxin@databricks.com> 01 March 2015, 02:56:42 UTC
aa39460 [SPARK-5775] [SQL] BugFix: GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table This PR adapts anselmevignon's #4697 to master and branch-1.3. Please refer to the PR description of #4697 for details. Author: Cheng Lian <lian@databricks.com> Author: Cheng Lian <liancheng@users.noreply.github.com> Author: Yin Huai <yhuai@databricks.com> Closes #4792 from liancheng/spark-5775 and squashes the following commits: 538f506 [Cheng Lian] Addresses comments cee55cf [Cheng Lian] Merge pull request #4 from yhuai/spark-5775-yin b0b74fb [Yin Huai] Remove runtime pattern matching. ca6e038 [Cheng Lian] Fixes SPARK-5775 (cherry picked from commit e6003f0a571ba44fcd011e695c8622e11cfee7dd) Signed-off-by: Cheng Lian <lian@databricks.com> 28 February 2015, 13:17:31 UTC
5a55c96 [SPARK-5979][SPARK-6032] Smaller safer --packages fix pwendell tdas These are the safer parts of PR #4754: - SPARK-5979: All dependencies with the groupId `org.apache.spark` passed through `--packages` were being excluded from the dependency tree on the assumption that they would be in the assembly jar. This is not the case, therefore the exclusion rules had to be defined more explicitly. - SPARK-6032: Ivy prints a whole lot of logs while retrieving dependencies. These were printed to `System.out`. Moved the logging to `System.err`. Author: Burak Yavuz <brkyvz@gmail.com> Closes #4802 from brkyvz/simple-streaming-fix and squashes the following commits: e0f38cb [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into simple-streaming-fix bad921c [Burak Yavuz] [SPARK-5979][SPARK-6032] Smaller safer fix (cherry picked from commit 6d8e5fbc0d83411174ffa59ff6a761a862eca32c) Signed-off-by: Patrick Wendell <patrick@databricks.com> 28 February 2015, 06:59:41 UTC
1747e0a [SPARK-6070] [yarn] Remove unneeded classes from shuffle service jar. These may conflict with the classes already in the NM. We shouldn't be repackaging them. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #4820 from vanzin/SPARK-6070 and squashes the following commits: 871b566 [Marcelo Vanzin] The "d'oh how didn't I think of it before" solution. 3cba946 [Marcelo Vanzin] Use profile instead, so that dependencies don't need to be explicitly listed. 7a18a1b [Marcelo Vanzin] [SPARK-6070] [yarn] Remove unneeded classes from shuffle service jar. (cherry picked from commit dba08d1fc3bdb9245aefe695970354df088a93b6) Signed-off-by: Patrick Wendell <patrick@databricks.com> 28 February 2015, 06:44:18 UTC
49f2187 [SPARK-6055] [PySpark] fix incorrect __eq__ of DataType The __eq__ of DataType is not correct and the class cache is not used correctly (a created class cannot be found by its dataType), so it creates lots of classes (saved in _cached_cls) that are never released. Also, all instances of the same DataType have the same hash code, so a dict holding many such objects degenerates into hash collisions and becomes very slow to access (depending on the implementation of CPython). This PR also improves the performance of inferSchema (avoiding unnecessary converters for objects). cc pwendell JoshRosen Author: Davies Liu <davies@databricks.com> Closes #4808 from davies/leak and squashes the following commits: 6a322a4 [Davies Liu] tests refactor 3da44fc [Davies Liu] fix __eq__ of Singleton 534ac90 [Davies Liu] add more checks 46999dc [Davies Liu] fix tests d9ae973 [Davies Liu] fix memory leak in sql (cherry picked from commit e0e64ba4b1b8eb72e856286f756c65fa22ab0a36) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 28 February 2015, 04:07:57 UTC
5d19cf0 [SPARK-5751] [SQL] Sets SPARK_HOME as SPARK_PID_DIR when running Thrift server test suites This is a follow-up of #4720. By default, `spark-daemon.sh` writes PID files under `/tmp`, which makes it impossible to start multiple server instances simultaneously. This PR sets `SPARK_PID_DIR` to the Spark home directory to work around this problem. Many thanks to chenghao-intel for pointing out this issue! Author: Cheng Lian <lian@databricks.com> Closes #4758 from liancheng/thriftserver-pid-dir and squashes the following commits: 252fa0f [Cheng Lian] Uses temporary directory as Thrift server PID directory 1b3d1e3 [Cheng Lian] Sets SPARK_HOME as SPARK_PID_DIR when running Thrift server test suites (cherry picked from commit 8c468a6600e0deb5464990df60148212e64fdecd) Signed-off-by: Cheng Lian <lian@databricks.com> 28 February 2015, 00:42:27 UTC
ceebe3c [Streaming][Minor] Remove useless type signature of Java Kafka direct stream API cc tdas . Author: Saisai Shao <saisai.shao@intel.com> Closes #4817 from jerryshao/signature-minor-fix and squashes the following commits: eebfaac [Saisai Shao] Remove useless type parameter (cherry picked from commit 5f7f3b938e1776168be866fc9ee87dc7494696cc) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 27 February 2015, 21:01:55 UTC
117e10c [SPARK-4587] [mllib] [docs] Fixed save,load calls in ML guide examples Should pass spark context to save/load CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #4816 from jkbradley/ml-io-doc-fix and squashes the following commits: 83d369d [Joseph K. Bradley] added comment to save,load parts of ML guide examples 2841170 [Joseph K. Bradley] Fixed save,load calls in ML guide examples (cherry picked from commit d17cb2ba33b363dd346ac5a5681e1757decd0f4d) Signed-off-by: Xiangrui Meng <meng@databricks.com> 27 February 2015, 21:00:52 UTC
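The corrected pattern from the guide, sketched with a tree model (`model` and `sc` assumed in scope; the path is a placeholder):

```scala
import org.apache.spark.mllib.tree.model.DecisionTreeModel

// Both save and load take the SparkContext explicitly.
model.save(sc, "/tmp/myDecisionTreeModel")
val sameModel = DecisionTreeModel.load(sc, "/tmp/myDecisionTreeModel")
```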
bff8088 [SPARK-6058][Yarn] Log the user class exception in ApplicationMaster Because ApplicationMaster doesn't set SparkUncaughtExceptionHandler, the exception in the user class won't be logged. This PR added a `logError` for it. Author: zsxwing <zsxwing@gmail.com> Closes #4813 from zsxwing/SPARK-6058 and squashes the following commits: 806c932 [zsxwing] Log the user class exception (cherry picked from commit e747e98490f8ede23b0a9e0795e7445d0b597624) Signed-off-by: Sean Owen <sowen@cloudera.com> 27 February 2015, 13:32:07 UTC
b8db84c fix SPARK-6033, clarify the spark.worker.cleanup behavior in standalone mode JIRA: https://issues.apache.org/jira/browse/SPARK-6033 In standalone deploy mode, the cleanup only removes stopped applications' directories. The original description of the cleanup behavior was incorrect. Author: 许鹏 <peng.xu@fraudmetrix.cn> Closes #4803 from hseagle/spark-6033 and squashes the following commits: 927a6a0 [许鹏] fix the incorrect description about the spark.worker.cleanup in standalone mode (cherry picked from commit 0375a413b8a009f5820897691570a1273ee25b97) Signed-off-by: Andrew Or <andrew@databricks.com> 27 February 2015, 07:06:42 UTC
485b919 SPARK-2168 [Spark core] Use relative URIs for the app links in the History Server. As agreed in PR #1160, this adds a test to verify that the history server generates relative links to applications. Author: Lukasz Jastrzebski <lukasz.jastrzebski@gmail.com> Closes #4778 from elyast/master and squashes the following commits: 0c07fab [Lukasz Jastrzebski] Incorporating comments for SPARK-2168 6d7866d [Lukasz Jastrzebski] Adjusting test for SPARK-2168 for master branch d6f4fbe [Lukasz Jastrzebski] Added test for SPARK-2168 (cherry picked from commit 4a8a0a8ecd836bf7fe0f2e692cf20a62dda313c0) Signed-off-by: Andrew Or <andrew@databricks.com> 27 February 2015, 06:38:13 UTC
6200f07 [SPARK-6024][SQL] When a data source table has too many columns, its schema cannot be stored in metastore. JIRA: https://issues.apache.org/jira/browse/SPARK-6024 Author: Yin Huai <yhuai@databricks.com> Closes #4795 from yhuai/wideSchema and squashes the following commits: 4882e6f [Yin Huai] Address comments. 73e71b4 [Yin Huai] Address comments. 143927a [Yin Huai] Simplify code. cc1d472 [Yin Huai] Make the schema wider. 12bacae [Yin Huai] If the JSON string of a schema is too large, split it before storing it in metastore. e9b4f70 [Yin Huai] Failed test. (cherry picked from commit 5e5ad6558d60cfbf360708584e883e80d363e33e) Signed-off-by: Reynold Xin <rxin@databricks.com> 27 February 2015, 04:46:11 UTC
25a109e [SPARK-6037][SQL] Avoiding duplicate Parquet schema merging `FilteringParquetRowInputFormat` manually merges Parquet schemas before computing splits. However, this is duplicate work because the schemas are already merged in `ParquetRelation2`; we don't need to re-merge them in the `InputFormat`. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4786 from viirya/dup_parquet_schemas_merge and squashes the following commits: ef78a5a [Liang-Chi Hsieh] Avoiding duplicate Parquet schema merging. (cherry picked from commit 4ad5153f5449319a7e82c9013ccff4494ab58ef1) Signed-off-by: Cheng Lian <lian@databricks.com> 27 February 2015, 03:07:08 UTC
b83a93e SPARK-4579 [WEBUI] Scheduling Delay appears negative Ensure scheduler delay handles unfinished task case, and ensure delay is never negative even due to rounding Author: Sean Owen <sowen@cloudera.com> Closes #4796 from srowen/SPARK-4579 and squashes the following commits: ad6713c [Sean Owen] Ensure scheduler delay handles unfinished task case, and ensure delay is never negative even due to rounding (cherry picked from commit fbc469473dd529eb72046186b85dd8fc2b7c5bb5) Signed-off-by: Andrew Or <andrew@databricks.com> 27 February 2015, 01:35:30 UTC
5b426cb [SPARK-5951][YARN] Remove unreachable driver memory properties in yarn client mode Remove unreachable driver memory properties in yarn client mode Author: mohit.goyal <mohit.goyal@guavus.com> Closes #4730 from zuxqoj/master and squashes the following commits: 977dc96 [mohit.goyal] remove not reachable deprecated variables in yarn client mode (cherry picked from commit b38dec2ffdf724ff4e181cc8c7427d074b442670) Signed-off-by: Andrew Or <andrew@databricks.com> 26 February 2015, 22:28:00 UTC
297c3ef Add a note about context termination for the History Server on YARN The history server on YARN only shows completed jobs. This adds a note that the Spark context must be terminated explicitly at the end of a Spark job, which is a best practice anyway. Related to SPARK-2972 and SPARK-3458 Author: moussa taifi <moutai10@gmail.com> Closes #4721 from moutai/add-history-server-note-for-closing-the-spark-context and squashes the following commits: 9f5b6c3 [moussa taifi] Fix upper case typo for YARN 3ad3db4 [moussa taifi] Add context termination for History server on Yarn (cherry picked from commit c871e2dae0182e914135560d14304242e1f97f7e) Signed-off-by: Andrew Or <andrew@databricks.com> 26 February 2015, 22:20:36 UTC
fe79674 [SPARK-6018] [YARN] NoSuchMethodError in Spark app is swallowed by YARN AM Author: Cheolsoo Park <cheolsoop@netflix.com> Closes #4773 from piaozhexiu/SPARK-6018 and squashes the following commits: 2a919d5 [Cheolsoo Park] Rename e with cause to avoid duplicate names 1e71d2d [Cheolsoo Park] Replace placeholder with throwable eb5750d [Cheolsoo Park] NoSuchMethodError in Spark app is swallowed by YARN AM (cherry picked from commit 5f3238b3b0157091d28803aa3b1d248dfa6cdc59) Signed-off-by: Andrew Or <andrew@databricks.com> 26 February 2015, 21:53:57 UTC
731a997 [SPARK-6027][SPARK-5546] Fixed --jar and --packages not working for KafkaUtils and improved error message The problem with SPARK-6027, in short, is that JARs like the kafka-assembly.jar do not work in Python, as the added JAR is not visible in the classloader used by Py4J. Py4J uses Class.forName(), which does not use the system class loader, but the JARs are only visible in the thread's context class loader. So this patch uses the context class loader to create the KafkaUtils dstream object. This works for both cases, whether the Kafka libraries are added with --jars spark-streaming-kafka-assembly.jar or with --packages spark-streaming-kafka. It also improves the error message. davies Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #4779 from tdas/kafka-python-fix and squashes the following commits: fb16b04 [Tathagata Das] Removed import c1fdf35 [Tathagata Das] Fixed long line and improved documentation 7b88be8 [Tathagata Das] Fixed --jar not working for KafkaUtils and improved error message (cherry picked from commit aa63f633d39efa8c29095295f161eaad5495071d) Signed-off-by: Andrew Or <andrew@databricks.com> 26 February 2015, 21:47:15 UTC
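The classloader pattern described above, in miniature (the helper class name below is illustrative; this sketch only shows the lookup, not the full wrapper):

```scala
// Prefer the thread's context class loader, which sees JARs added via
// --jars / --packages, over Class.forName's default lookup.
val loader = Option(Thread.currentThread().getContextClassLoader)
  .getOrElse(getClass.getClassLoader)
val helperClass = Class.forName(
  "org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper", true, loader)
```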
62652dc Modify default value description for spark.scheduler.minRegisteredResourcesRatio in docs. The configuration is not supported in Mesos mode now. See https://github.com/apache/spark/pull/1462 Author: Li Zhihui <zhihui.li@intel.com> Closes #4781 from li-zhihui/fixdocconf and squashes the following commits: 63e7a44 [Li Zhihui] Modify default value description for spark.scheduler.minRegisteredResourcesRatio on docs. (cherry picked from commit 10094a523e3993b775111ae9b22ca31cc0d76e03) Signed-off-by: Andrew Or <andrew@databricks.com> 26 February 2015, 21:07:55 UTC
5d309ad [SPARK-5363] Fix bug in PythonRDD: remove() inside iterator is not safe Removing elements from a mutable HashSet while iterating over it can cause the iteration to incorrectly skip over entries that were not removed. If this happened, PythonRDD would write fewer broadcast variables than the Python worker was expecting to read, which would cause the Python worker to hang indefinitely. Author: Davies Liu <davies@databricks.com> Closes #4776 from davies/fix_hang and squashes the following commits: a4384a5 [Davies Liu] fix bug: remove() inside iterator is not safe (cherry picked from commit 7fa960e653a905fc48d4097b49ce560cff919fa2) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 26 February 2015, 19:56:36 UTC
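The bug pattern in miniature, as a self-contained sketch:

```scala
import scala.collection.mutable

// Removing from a mutable.HashSet while iterating over it can silently
// skip entries; materialize the elements to remove first instead.
val s = mutable.HashSet((1 to 100): _*)
val evens = s.filter(_ % 2 == 0) // filter builds a new set, safe to iterate
evens.foreach(s.remove)
assert(s.forall(_ % 2 == 1))
```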
dafb3d2 [SPARK-6015] fix links to source code in Python API docs Author: Davies Liu <davies@databricks.com> Closes #4772 from davies/source_link and squashes the following commits: 389f0c6 [Davies Liu] fix link to source code in Python API docs (cherry picked from commit 015895ab508efde0702b51c5e537a5a6a191d209) Signed-off-by: Reynold Xin <rxin@databricks.com> 26 February 2015, 18:46:07 UTC
7c779d8 [SPARK-6007][SQL] Add numRows param in DataFrame.show() It is useful to let the user decide the number of rows to show in DataFrame.show Author: Jacky Li <jacky.likun@huawei.com> Closes #4767 from jackylk/show and squashes the following commits: a0e0f4b [Jacky Li] fix testcase 7cdbe91 [Jacky Li] modify according to comment bb54537 [Jacky Li] for Java compatibility d7acc18 [Jacky Li] modify according to comments 981be52 [Jacky Li] add numRows param in DataFrame.show() (cherry picked from commit 2358657547016d647cdd2e2d363426fcd8d3e9ff) Signed-off-by: Reynold Xin <rxin@databricks.com> 26 February 2015, 18:43:20 UTC
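The new overload in action (`df` assumed in scope):

```scala
df.show()  // unchanged default: first 20 rows
df.show(5) // new: first 5 rows
```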
b5c5e93 [SPARK-6016][SQL] Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true Please see JIRA (https://issues.apache.org/jira/browse/SPARK-6016) for details of the bug. Author: Yin Huai <yhuai@databricks.com> Closes #4775 from yhuai/parquetFooterCache and squashes the following commits: 78787b1 [Yin Huai] Remove footerCache in FilteringParquetRowInputFormat. dff6fba [Yin Huai] Failed unit test. (cherry picked from commit 192e42a2933eb283e12bfdfb46e2ef895228af4a) Signed-off-by: Cheng Lian <lian@databricks.com> 26 February 2015, 17:04:24 UTC
e0f5fb0 [SPARK-6023][SQL] ParquetConversions fails to replace the destination MetastoreRelation of an InsertIntoTable node to ParquetRelation2 JIRA: https://issues.apache.org/jira/browse/SPARK-6023 Author: Yin Huai <yhuai@databricks.com> Closes #4782 from yhuai/parquetInsertInto and squashes the following commits: ae7e806 [Yin Huai] Convert MetastoreRelation in InsertIntoTable and InsertIntoHiveTable. ba543cd [Yin Huai] More tests. 50b6d0f [Yin Huai] Update error messages. 346780c [Yin Huai] Failed test. (cherry picked from commit f02394d06473889d0d7897c4583239e6e136ff46) Signed-off-by: Cheng Lian <lian@databricks.com> 26 February 2015, 14:40:25 UTC
a51d9db [SPARK-5976][MLLIB] Add partitioner to factors returned by ALS The model trained by ALS requires partitioning information to do quick lookup of a user/item factor for making recommendation on individual requests. In the new implementation, we didn't set partitioners in the factors returned by ALS, which would cause performance regression. srowen coderxiang Author: Xiangrui Meng <meng@databricks.com> Closes #4748 from mengxr/SPARK-5976 and squashes the following commits: 9373a09 [Xiangrui Meng] add partitioner to factors returned by ALS 260f183 [Xiangrui Meng] add a test for partitioner (cherry picked from commit e43139f40309995b1133c7ef2936ab858b7b44fc) Signed-off-by: Xiangrui Meng <meng@databricks.com> 26 February 2015, 07:43:37 UTC
56fa38a [SPARK-1182][Docs] Sort the configuration parameters in configuration.md Sorts all configuration options present on the `configuration.md` page to ease readability. Author: Brennon York <brennon.york@capitalone.com> Closes #3863 from brennonyork/SPARK-1182 and squashes the following commits: 5696f21 [Brennon York] fixed merge conflict with port comments 81a7b10 [Brennon York] capitalized A in Allocation e240486 [Brennon York] moved all spark.mesos properties into the running-on-mesos doc 7de5f75 [Brennon York] moved serialization from application to compression and serialization section a16fec0 [Brennon York] moved shuffle settings from network to shuffle f8fa286 [Brennon York] sorted encryption category 1023f15 [Brennon York] moved initialExecutors e9d62aa [Brennon York] fixed akka.heartbeat.interval 25e6f6f [Brennon York] moved spark.executer.user* 4625ade [Brennon York] added spark.executor.extra* items 4ee5648 [Brennon York] fixed merge conflicts 1b49234 [Brennon York] sorting mishap 2b5758b [Brennon York] sorting mishap 6fbdf42 [Brennon York] sorting mishap 55dc6f8 [Brennon York] sorted security ec34294 [Brennon York] sorted dynamic allocation 2a7c4a3 [Brennon York] sorted scheduling aa9acdc [Brennon York] sorted networking a4380b8 [Brennon York] sorted execution behavior 27f3919 [Brennon York] sorted compression and serialization 80a5bbb [Brennon York] sorted spark ui 3f32e5b [Brennon York] sorted shuffle behavior 6c51b38 [Brennon York] sorted runtime environment efe9d6f [Brennon York] sorted application properties (cherry picked from commit 46a044a36a2aff1306f7f677e952ce253ddbefac) Signed-off-by: Reynold Xin <rxin@databricks.com> 26 February 2015, 00:22:33 UTC
b32a653 [SPARK-5724] fix the misconfiguration in AkkaUtils https://issues.apache.org/jira/browse/SPARK-5724 In AkkaUtils, we set several failure-detector-related parameters as follows ``` val akkaConf = ConfigFactory.parseMap(conf.getAkkaConf.toMap[String, String]) .withFallback(akkaSslConfig).withFallback(ConfigFactory.parseString( s""" |akka.daemonic = on |akka.loggers = [""akka.event.slf4j.Slf4jLogger""] |akka.stdout-loglevel = "ERROR" |akka.jvm-exit-on-fatal-error = off |akka.remote.require-cookie = "$requireCookie" |akka.remote.secure-cookie = "$secureCookie" |akka.remote.transport-failure-detector.heartbeat-interval = $akkaHeartBeatInterval s |akka.remote.transport-failure-detector.acceptable-heartbeat-pause = $akkaHeartBeatPauses s |akka.remote.transport-failure-detector.threshold = $akkaFailureDetector |akka.actor.provider = "akka.remote.RemoteActorRefProvider" |akka.remote.netty.tcp.transport-class = "akka.remote.transport.netty.NettyTransport" |akka.remote.netty.tcp.hostname = "$host" |akka.remote.netty.tcp.port = $port |akka.remote.netty.tcp.tcp-nodelay = on |akka.remote.netty.tcp.connection-timeout = $akkaTimeout s |akka.remote.netty.tcp.maximum-frame-size = ${akkaFrameSize}B |akka.remote.netty.tcp.execution-pool-size = $akkaThreads |akka.actor.default-dispatcher.throughput = $akkaBatchSize |akka.log-config-on-start = $logAkkaConfig |akka.remote.log-remote-lifecycle-events = $lifecycleEvents |akka.log-dead-letters = $lifecycleEvents |akka.log-dead-letters-during-shutdown = $lifecycleEvents """.stripMargin)) ``` Actually, we do not have any parameter named "akka.remote.transport-failure-detector.threshold" (see: http://doc.akka.io/docs/akka/2.3.4/general/configuration.html); what we have is "akka.remote.watch-failure-detector.threshold" Author: CodingCat <zhunansjtu@gmail.com> Closes #4512 from CodingCat/SPARK-5724 and squashes the following commits: bafe56e [CodingCat] fix the grammar in configuration doc 338296e [CodingCat] remove failure-detector related info 8bfcfd4 [CodingCat] fix the misconfiguration in AkkaUtils (cherry picked from commit 242d49584c6aa21d928db2552033661950f760a5) Signed-off-by: Reynold Xin <rxin@databricks.com> 26 February 2015, 00:22:14 UTC
a1b4856 [SPARK-5974] [SPARK-5980] [mllib] [python] [docs] Update ML guide with save/load, Python GBT * Add GradientBoostedTrees Python examples to ML guide * I ran these in the pyspark shell, and they worked. * Add save/load to examples in ML guide * Added note to python docs about predict/transform not working within RDD actions/transformations in some cases (see SPARK-5981) CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #4750 from jkbradley/SPARK-5974 and squashes the following commits: c410e38 [Joseph K. Bradley] Added note to LabeledPoint about attributes bcae18b [Joseph K. Bradley] Added import of models for save/load examples in ml guide. Fixed line length for tree.py, feature.py (but not other ML Pyspark files yet). 6d81c3e [Joseph K. Bradley] completed python GBT examples 9903309 [Joseph K. Bradley] Added note to python docs about predict,transform not working within RDD actions,transformations in some cases c7dfad8 [Joseph K. Bradley] Added model save/load to ML guide. Added GBT examples to ML guide (cherry picked from commit d20559b157743981b9c09e286f2aaff8cbefab59) Signed-off-by: Xiangrui Meng <meng@databricks.com> 26 February 2015, 00:13:23 UTC
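The Python examples added here mirror the existing Scala API; as a point of reference, a minimal Scala sketch of the same train/save/load flow (assuming a spark-shell with `sc` in scope; data and model paths are illustrative):
```scala
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils

// Load a LibSVM-format dataset and train a small GBT classifier.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 3
val model = GradientBoostedTrees.train(data, boostingStrategy)

// Round-trip the model through save/load, as the updated guide demonstrates.
model.save(sc, "myGbtModel")
val sameModel = GradientBoostedTreesModel.load(sc, "myGbtModel")
```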
5bd4b49 [SPARK-5926] [SQL] make DataFrame.explain leverage queryExecution.logical DataFrame.explain returns the wrong result when the query is a DDL command. For example, the following two queries should print out the same execution plan, but they do not: `sql("create table tb as select * from src where key > 490").explain(true)` and `sql("explain extended create table tb as select * from src where key > 490")`. This is because DataFrame.explain uses logicalPlan, which has already been force-executed; we should use the unexecuted plan, queryExecution.logical, instead. Author: Yanbo Liang <ybliang8@gmail.com> Closes #4707 from yanboliang/spark-5926 and squashes the following commits: fa6db63 [Yanbo Liang] logicalPlan is not lazy 0e40a1b [Yanbo Liang] make DataFrame.explain leverage queryExecution.logical (cherry picked from commit 41e2e5acb749c25641f1f8dea5a2e1d8af319486) Signed-off-by: Michael Armbrust <michael@databricks.com> 25 February 2015, 23:37:29 UTC
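The repro above as a runnable snippet (assuming a Hive-enabled spark-shell where `sql` goes through a HiveContext and a `src` table exists):
```scala
// With the fix, both statements print the same extended plan for the CTAS command,
// because explain() now formats the unexecuted queryExecution.logical plan.
sql("create table tb as select * from src where key > 490").explain(true)
sql("explain extended create table tb as select * from src where key > 490")
  .collect().foreach(println)
```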
6fff9b8 [SPARK-5999][SQL] Remove duplicate Literal matching block Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4760 from viirya/dup_literal and squashes the following commits: 06e7516 [Liang-Chi Hsieh] Remove duplicate Literal matching block. (cherry picked from commit 12dbf98c5d270e3846e946592666160b1541d9dc) Signed-off-by: Michael Armbrust <michael@databricks.com> 25 February 2015, 23:23:54 UTC
016f1f8 [SPARK-6010] [SQL] Merging compatible Parquet schemas before computing splits `ReadContext.init` calls `InitContext.getMergedKeyValueMetadata`, which doesn't know how to merge conflicting user-defined key-value metadata and throws an exception. In our case, when dealing with different but compatible schemas, we have different Spark SQL schema JSON strings in different Parquet part-files, which causes this problem. Reading similar Parquet files generated by Hive doesn't suffer from this issue. In this PR, we manually merge the schemas before passing them to `ReadContext` to avoid the exception. Author: Cheng Lian <lian@databricks.com> Closes #4768 from liancheng/spark-6010 and squashes the following commits: 9002f0a [Cheng Lian] Fixes SPARK-6010 (cherry picked from commit e0fdd467e277867d6bec5c6605cc1cabce70ac89) Signed-off-by: Michael Armbrust <michael@databricks.com> 25 February 2015, 23:15:29 UTC
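A hypothetical repro of the compatible-schema case (a sketch assuming a spark-shell with `sc` in scope; paths and column names are illustrative):
```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Two Parquet outputs whose Spark SQL schemas differ but are compatible:
// (key: int, value: string) vs. (key: int, extra: int).
sc.parallelize(1 to 3).map(i => (i, i.toString)).toDF("key", "value")
  .saveAsParquetFile("/tmp/parquet/part1")
sc.parallelize(4 to 6).map(i => (i, i * 2)).toDF("key", "extra")
  .saveAsParquetFile("/tmp/parquet/part2")

// Reading both at once exercises the manual schema merge instead of
// failing inside ReadContext.init.
val merged = sqlContext.parquetFile("/tmp/parquet/part1", "/tmp/parquet/part2")
merged.printSchema()
```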
9aca3c6 [SPARK-5944] [PySpark] fix version in Python API docs use RELEASE_VERSION when building the Python API docs Author: Davies Liu <davies@databricks.com> Closes #4731 from davies/api_version and squashes the following commits: c9744c9 [Davies Liu] Update create-release.sh 08cbc3f [Davies Liu] fix python docs (cherry picked from commit f3f4c87b3d944c10d1200dfe49091ebb2a149be6) Signed-off-by: Michael Armbrust <michael@databricks.com> 25 February 2015, 23:13:56 UTC
791df93 [SPARK-5982] Remove incorrect Local Read Time Metric This metric is incomplete because the files are memory-mapped: much of the disk read happens later, as tasks actually read the file's data. This should be merged into 1.3 so that we never expose this incorrect metric to users. CC pwendell ksakellis sryza Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #4749 from kayousterhout/SPARK-5982 and squashes the following commits: 9737b5e [Kay Ousterhout] More fixes a1eb300 [Kay Ousterhout] Removed one more use of local read time cf13497 [Kay Ousterhout] [SPARK-5982] Remove incorrect Local Read Time Metric (cherry picked from commit 838a48036c050cef03b8c3620e16b5495cd7beab) Signed-off-by: Kay Ousterhout <kayousterhout@gmail.com> 25 February 2015, 22:55:46 UTC
8073767 [SPARK-1955][GraphX]: VertexRDD can incorrectly assume index sharing Fixes the issue whereby `VertexRDD`s that are `diff`ed, `innerJoin`ed, or `leftJoin`ed fail under the `zipPartitions` method when they have different partition counts. This fix tests whether the partition counts are equal and, if not, repartitions the other RDD to match the partitioning of the calling VertexRDD. Author: Brennon York <brennon.york@capitalone.com> Closes #4705 from brennonyork/SPARK-1955 and squashes the following commits: 0882590 [Brennon York] updated to properly handle differently-partitioned vertexRDDs (cherry picked from commit 9f603fce78fcc997926e9a72dec44d48cbc396fc) Signed-off-by: Ankur Dave <ankurdave@gmail.com> 25 February 2015, 22:11:29 UTC
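A minimal sketch of the failure mode this fixes (assuming a spark-shell with `sc` in scope; vertex counts and partition counts are illustrative):
```scala
import org.apache.spark.graphx._

// Two VertexRDDs over the same vertex ids but with different partition counts.
val a = VertexRDD(sc.parallelize((1L to 100L).map(id => (id, 1)), 2))
val b = VertexRDD(sc.parallelize((1L to 100L).map(id => (id, 2)), 4))

// Before the fix, joins like this could fail inside zipPartitions because the
// partition counts differ; the fix repartitions `b` to match `a` first.
val joined = a.innerJoin(b) { (id, x, y) => x + y }
joined.count()
```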
eaffc6e SPARK-5930 [DOCS] Documented default of spark.shuffle.io.retryWait is confusing Clarify default max wait in spark.shuffle.io.retryWait docs CC andrewor14 Author: Sean Owen <sowen@cloudera.com> Closes #4769 from srowen/SPARK-5930 and squashes the following commits: ae2792b [Sean Owen] Clarify default max wait in spark.shuffle.io.retryWait docs (cherry picked from commit 7d8e6a2e44e13a6d6cdcd98a0d0c33b243ef0dc2) Signed-off-by: Reynold Xin <rxin@databricks.com> 25 February 2015, 20:20:53 UTC
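The clarified point, as a short sketch (the values shown are the documented defaults; setting them explicitly is optional):
```scala
import org.apache.spark.SparkConf

// retryWait is the wait between consecutive fetch retries, so the maximum total
// delay introduced by retrying is maxRetries * retryWait: 3 * 5s = 15 seconds
// with the defaults below.
val conf = new SparkConf()
  .set("spark.shuffle.io.maxRetries", "3")
  .set("spark.shuffle.io.retryWait", "5")
```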
fada683 [SPARK-5996][SQL] Fix specialized outbound conversions Author: Michael Armbrust <michael@databricks.com> Closes #4757 from marmbrus/udtConversions and squashes the following commits: 3714aad [Michael Armbrust] [SPARK-5996][SQL] Fix specialized outbound conversions (cherry picked from commit f84c799ea0b82abca6a4fad39532c2515743b632) Signed-off-by: Xiangrui Meng <meng@databricks.com> 25 February 2015, 18:13:55 UTC
5c421e0 [SPARK-5994] [SQL] Python DataFrame documentation fixes
- select empty should NOT be the same as select; make sure selectExpr behaves the same
- join param documentation
- link to source doesn't work in Jekyll-generated files
- cross-referencing of columns (i.e. enabling linking)
- show(): move the df example before df.show()
- move tests in SQLContext out of the docstring, otherwise the doc is too long
- Column.desc and .asc don't have any documentation
- in documentation, sort functions.*
Author: Davies Liu <davies@databricks.com> Closes #4756 from davies/df_docs and squashes the following commits: f30502c [Davies Liu] fix doc 32f0d46 [Davies Liu] fix DataFrame docs (cherry picked from commit d641fbb39c90b1d734cc55396ca43d7e98788975) Signed-off-by: Michael Armbrust <michael@databricks.com> 25 February 2015, 04:52:05 UTC
e7a748e [SPARK-5286][SQL] SPARK-5286 followup https://issues.apache.org/jira/browse/SPARK-5286 Author: Yin Huai <yhuai@databricks.com> Closes #4755 from yhuai/SPARK-5286-throwable and squashes the following commits: 4c0c450 [Yin Huai] Catch Throwable instead of Exception. (cherry picked from commit 769e092bdc51582372093f76dbaece27149cc4ea) Signed-off-by: Michael Armbrust <michael@databricks.com> 25 February 2015, 03:51:48 UTC
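The change itself is one line in spirit: catch Throwable rather than Exception. A generic sketch of the pattern (names here are hypothetical, not the actual Spark source):
```scala
// Widening the catch clause from Exception to Throwable also traps errors that
// don't extend Exception (e.g. a NoClassDefFoundError raised while loading a
// data source), instead of letting them escape.
def safeLoad(): String =
  try loadTable()
  catch { case t: Throwable => fallback(t) }

def loadTable(): String = sys.error("boom")          // hypothetical stand-in
def fallback(t: Throwable): String = s"recovered: ${t.getMessage}"
```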