https://github.com/apache/spark

13650fc Preparing Spark release v2.0.0-rc5 19 July 2016, 21:02:27 UTC
80ab8b6 [SPARK-15705][SQL] Change the default value of spark.sql.hive.convertMetastoreOrc to false. ## What changes were proposed in this pull request? In 2.0, we add a new logic to convert HiveTableScan on ORC tables to Spark's native code path. However, during this conversion, we drop the original metastore schema (https://issues.apache.org/jira/browse/SPARK-15705). Because of this regression, I am changing the default value of `spark.sql.hive.convertMetastoreOrc` to false. Author: Yin Huai <yhuai@databricks.com> Closes #14267 from yhuai/SPARK-15705-changeDefaultValue. (cherry picked from commit 2ae7b88a07140e012b6c60db3c4a2a8ca360c684) Signed-off-by: Reynold Xin <rxin@databricks.com> 19 July 2016, 19:58:13 UTC
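For readers who still want the native ORC read path, the conversion can be re-enabled per session; a minimal sketch, assuming a spark-shell session where `spark` is in scope and the table name is hypothetical:
```scala
// Default is now false; opt back in to converting Hive metastore ORC tables to Spark's native scan.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
spark.sql("SELECT COUNT(*) FROM my_orc_table").show() // hypothetical metastore ORC table
```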
f18f9ca [SPARK-16602][SQL] `Nvl` function should support numeric-string cases ## What changes were proposed in this pull request? `Nvl` function should support numeric-string cases like Hive/Spark 1.6. Currently, `Nvl` finds the tightest common types among numeric types. This PR extends that to consider `String` type, too. ```scala - TypeCoercion.findTightestCommonTypeOfTwo(left.dataType, right.dataType).map { dtype => + TypeCoercion.findTightestCommonTypeToString(left.dataType, right.dataType).map { dtype => ``` **Before** ```scala scala> sql("select nvl('0', 1)").collect() org.apache.spark.sql.AnalysisException: cannot resolve `nvl("0", 1)` due to data type mismatch: input to function coalesce should all be the same type, but it's [string, int]; line 1 pos 7 ``` **After** ```scala scala> sql("select nvl('0', 1)").collect() res0: Array[org.apache.spark.sql.Row] = Array([0]) ``` ## How was this patch tested? Pass the Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14251 from dongjoon-hyun/SPARK-16602. (cherry picked from commit 162d04a30e38bb83d35865679145f8ea80b84c26) Signed-off-by: Reynold Xin <rxin@databricks.com> 19 July 2016, 17:28:24 UTC
6ca1d94 [SPARK-16620][CORE] Add back the tokenization process in `RDD.pipe(command: String)` ## What changes were proposed in this pull request? Currently `RDD.pipe(command: String)`: - works only when the command is specified without any options, such as `RDD.pipe("wc")` - does NOT work when the command is specified with some options, such as `RDD.pipe("wc -l")` This is a regression from Spark 1.6. This patch adds back the tokenization process in `RDD.pipe(command: String)` to fix this regression. ## How was this patch tested? Added a test which: - would pass in `1.6` - _[prior to this patch]_ would fail in `master` - _[after this patch]_ would pass in `master` Author: Liwei Lin <lwlin7@gmail.com> Closes #14256 from lw-lin/rdd-pipe. (cherry picked from commit 0bd76e872b60cb80295fc12654e370cf22390056) Signed-off-by: Reynold Xin <rxin@databricks.com> 19 July 2016, 17:25:24 UTC
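As an illustration of the restored behaviour, a command string with options now tokenizes again; a minimal sketch, assuming a spark-shell session where `sc` is in scope:
```scala
val rdd = sc.parallelize(Seq("hello world", "foo bar baz"), 2)
// Before this fix, passing options in a single string failed; "wc" alone worked, "wc -l" did not.
val lineCounts = rdd.pipe("wc -l").collect() // one count per partition
```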
2c74b6d [SPARK-16600][MLLIB] fix some latex formula syntax error ## What changes were proposed in this pull request? `\partial\x` ==> `\partial x` `har{x_i}` ==> `hat{x_i}` ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #14246 from WeichenXu123/fix_formular_err. (cherry picked from commit 8310c0741c0ca805ec74c1a78ba4a0f18e82d459) Signed-off-by: Sean Owen <sowen@cloudera.com> 19 July 2016, 11:07:49 UTC
929fa28 [MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar ## What changes were proposed in this pull request? Minor fixes correcting some typos, punctuations, grammar. Adding more anchors for easy navigation. Fixing minor issues with code snippets. ## How was this patch tested? `jekyll serve` Author: Ahmed Mahran <ahmed.mahran@mashin.io> Closes #14234 from ahmed-mahran/b-struct-streaming-docs. (cherry picked from commit 6caa22050e221cf14e2db0544fd2766dd1102bda) Signed-off-by: Sean Owen <sowen@cloudera.com> 19 July 2016, 11:06:26 UTC
eb1c20f [MINOR][BUILD] Fix Java Linter `LineLength` errors This PR fixes four java linter `LineLength` errors. Those are all `LineLength` errors, but we had better remove all java linter errors before release. Tested by passing Jenkins and running `./dev/lint-java`. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14255 from dongjoon-hyun/minor_java_linter. (cherry picked from commit 556a9437ac7b55079f5a8a91e669dcc36ca02696) Signed-off-by: Sean Owen <sowen@cloudera.com> 19 July 2016, 10:57:37 UTC
504aa6f [DOC] improve python doc for rdd.histogram and dataframe.join ## What changes were proposed in this pull request? doc change only ## How was this patch tested? doc change only Author: Mortada Mehyar <mortada.mehyar@gmail.com> Closes #14253 from mortada/histogram_typos. (cherry picked from commit 6ee40d2cc5f467c78be662c1639fc3d5b7f796cf) Signed-off-by: Reynold Xin <rxin@databricks.com> 19 July 2016, 06:50:01 UTC
ef2a6f1 [SPARK-16303][DOCS][EXAMPLES] Minor Scala/Java example update ## What changes were proposed in this pull request? This PR moves one and the last hard-coded Scala example snippet from the SQL programming guide into `SparkSqlExample.scala`. It also renames all Scala/Java example files so that all "Sql" in the file names are updated to "SQL". ## How was this patch tested? Manually verified the generated HTML page. Author: Cheng Lian <lian@databricks.com> Closes #14245 from liancheng/minor-scala-example-update. (cherry picked from commit 1426a080528bdb470b5e81300d892af45dd188bf) Signed-off-by: Yin Huai <yhuai@databricks.com> 19 July 2016, 06:08:11 UTC
24ea875 [SPARK-16615][SQL] Expose sqlContext in SparkSession ## What changes were proposed in this pull request? This patch removes the private[spark] qualifier for SparkSession.sqlContext, as discussed in http://apache-spark-developers-list.1001551.n3.nabble.com/Re-transtition-SQLContext-to-SparkSession-td18342.html ## How was this patch tested? N/A - this is a visibility change. Author: Reynold Xin <rxin@databricks.com> Closes #14252 from rxin/SPARK-16615. (cherry picked from commit 69c773052acc627eb033614797de9b913dfa35c1) Signed-off-by: Reynold Xin <rxin@databricks.com> 19 July 2016, 01:03:42 UTC
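With the qualifier removed, the session's underlying SQLContext is reachable from user code; a minimal sketch, assuming a spark-shell session where `spark` is in scope:
```scala
// sqlContext is now public API on SparkSession, useful for code still written against the old entry point.
val sqlContext = spark.sqlContext
val df = sqlContext.range(5) // legacy-style calls operate on the same session state
```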
1dd1526 [HOTFIX] Fix Scala 2.10 compilation (cherry picked from commit c4524f5193e1b3ce1c56c5aed126f4121ce26d23) Signed-off-by: Reynold Xin <rxin@databricks.com> 19 July 2016, 00:57:10 UTC
aac8608 [SPARK-16590][SQL] Improve LogicalPlanToSQLSuite to check generated SQL directly ## What changes were proposed in this pull request? This PR improves `LogicalPlanToSQLSuite` to check the generated SQL directly by **structure**. So far, `LogicalPlanToSQLSuite` relies on `checkHiveQl` to ensure the **successful SQL generation** and **answer equality**. However, it does not guarantee the generated SQL is the same or will not be changed unnoticeably. ## How was this patch tested? Pass the Jenkins. This is only a testsuite change. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14235 from dongjoon-hyun/SPARK-16590. (cherry picked from commit ea78edb80bf46e925d53e2aec29666c4eeb66188) Signed-off-by: Reynold Xin <rxin@databricks.com> 19 July 2016, 00:17:44 UTC
7889585 [SPARKR][DOCS] minor code sample update in R programming guide ## What changes were proposed in this pull request? Fix code style from ad hoc review of RC4 doc ## How was this patch tested? manual shivaram Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14250 from felixcheung/rdocs2rc4. (cherry picked from commit 75f0efe74d0c9a7acb525339c5184b99fee4dafc) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 18 July 2016, 23:02:33 UTC
33d92f7 [SPARK-16515][SQL] set default record reader and writer for script transformation ## What changes were proposed in this pull request? In ScriptInputOutputSchema, we read the default RecordReader and RecordWriter from conf. Since Spark 2.0 has deleted those config keys from the hive conf, we have to set the default reader/writer class names ourselves. Otherwise we will get None for LazySimpleSerde, and the data written would not be able to be read by the script. The test case added worked fine with the previous version of Spark, but would fail now. ## How was this patch tested? Added a test case in SQLQuerySuite. Closes #14169 Author: Daoyuan Wang <daoyuan.wang@intel.com> Author: Yin Huai <yhuai@databricks.com> Closes #14249 from yhuai/scriptTransformation. (cherry picked from commit 96e9afaae93318250334211cc80ed0fee3d055b9) Signed-off-by: Yin Huai <yhuai@databricks.com> 18 July 2016, 20:58:56 UTC
085f3cc [SPARK-16055][SPARKR] warning added while using sparkPackages with spark-submit ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-16055 When the sparkPackages argument is passed and we detect that we are in R script mode, we should print a warning that the --packages flag should be used with spark-submit ## How was this patch tested? Tested locally on my system Author: krishnakalyan3 <krishnakalyan3@gmail.com> Closes #14179 from krishnakalyan3/spark-pkg. (cherry picked from commit 8ea3f4eaec65ee4277f9943063fcc9488d3fa924) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 18 July 2016, 16:46:47 UTC
2365d63 [MINOR][TYPO] fix fininsh typo ## What changes were proposed in this pull request? fininsh => finish ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #14238 from WeichenXu123/fix_fininsh_typo. (cherry picked from commit a529fc944209e7255ec5858b33490212884d6c60) Signed-off-by: Sean Owen <sowen@cloudera.com> 18 July 2016, 08:12:03 UTC
808d69a [SPARK-16588][SQL] Deprecate monotonicallyIncreasingId in Scala/Java This patch deprecates monotonicallyIncreasingId in Scala/Java, as done in Python. This patch was originally written by HyukjinKwon. Closes #14236. (cherry picked from commit 480c870644595a71102be6597146d80b1c0816e4) Signed-off-by: Reynold Xin <rxin@databricks.com> 18 July 2016, 05:49:27 UTC
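Callers of the deprecated name can switch to the snake_case function already used by Python; a minimal sketch:
```scala
import org.apache.spark.sql.functions.monotonically_increasing_id

// Preferred replacement for the now-deprecated monotonicallyIncreasingId.
val withIds = spark.range(3).toDF("value").withColumn("id", monotonically_increasing_id())
```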
a4bf13a [SPARK-16584][SQL] Move regexp unit tests to RegexpExpressionsSuite ## What changes were proposed in this pull request? This patch moves regexp related unit tests from StringExpressionsSuite to RegexpExpressionsSuite to match the file name for regexp expressions. ## How was this patch tested? This is a test only change. Author: Reynold Xin <rxin@databricks.com> Closes #14230 from rxin/SPARK-16584. (cherry picked from commit 7b84758034b9bceca1168438ef5d0beefd5b5273) Signed-off-by: Reynold Xin <rxin@databricks.com> 17 July 2016, 06:42:37 UTC
c527e9e [SPARK-16507][SPARKR] Add a CRAN checker, fix Rd aliases ## What changes were proposed in this pull request? Add a check-cran.sh script that runs `R CMD check` as CRAN. Also fixes a number of issues pointed out by the check. These include - Updating `DESCRIPTION` to be appropriate - Adding a .Rbuildignore to ignore lintr, src-native, html that are non-standard files / dirs - Adding aliases to all S4 methods in DataFrame, Column, GroupedData etc. This is required as stated in https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Documenting-S4-classes-and-methods - Other minor fixes ## How was this patch tested? SparkR unit tests, running the above mentioned script Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #14173 from shivaram/sparkr-cran-changes. (cherry picked from commit c33e4b0d96d424568963c7e716c20f02949c72d1) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 17 July 2016, 00:07:38 UTC
8c2ec44 [SPARK-16112][SPARKR] Programming guide for gapply/gapplyCollect ## What changes were proposed in this pull request? Updates programming guide for spark.gapply/spark.gapplyCollect. Similar to other examples I used `faithful` dataset to demonstrate gapply's functionality. Please, let me know if you prefer another example. ## How was this patch tested? Existing test cases in R Author: Narine Kokhlikyan <narine@slice.com> Closes #14090 from NarineK/gapplyProgGuide. (cherry picked from commit 416730483643a0a92dbd6ae4ad07e80ceb3c5285) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 16 July 2016, 23:56:24 UTC
cad4693 [SPARK-3359][DOCS] More changes to resolve javadoc 8 errors that will help unidoc/genjavadoc compatibility ## What changes were proposed in this pull request? These are yet more changes that resolve problems with unidoc/genjavadoc and Java 8. It does not fully resolve the problem, but gets rid of as many errors as we can from this end. ## How was this patch tested? Jenkins build of docs Author: Sean Owen <sowen@cloudera.com> Closes #14221 from srowen/SPARK-3359.3. (cherry picked from commit 5ec0d692b0789a1d06db35134ee6eac2ecce47c3) Signed-off-by: Reynold Xin <rxin@databricks.com> 16 July 2016, 20:27:03 UTC
5d49529 [SPARK-16582][SQL] Explicitly define isNull = false for non-nullable expressions ## What changes were proposed in this pull request? This patch is just a slightly safer way to fix the issue we encountered in https://github.com/apache/spark/pull/14168 should this pattern re-occur at other places in the code. ## How was this patch tested? Existing tests. Also, I manually tested that it fixes the problem in SPARK-16514 without having the proposed change in https://github.com/apache/spark/pull/14168 Author: Sameer Agarwal <sameerag@cs.berkeley.edu> Closes #14227 from sameeragarwal/codegen. (cherry picked from commit a1ffbada8a266a4130de6fffc4a5efd085a29ae4) Signed-off-by: Reynold Xin <rxin@databricks.com> 16 July 2016, 20:24:15 UTC
34ac45a [SPARK-16230][CORE] CoarseGrainedExecutorBackend to self kill if there is an exception while creating an Executor ## What changes were proposed in this pull request? With the fix from SPARK-13112, I see that `LaunchTask` is always processed after `RegisteredExecutor` is done, and so it gets a chance to do all retries to start up an executor. There is still a problem: if `Executor` creation itself fails with some exception, it goes unnoticed and the executor is killed when it tries to process the `LaunchTask` as `executor` is null: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L88 So if one looks at the logs, they do not tell that there was a problem during `Executor` creation and that's why it was killed. This PR explicitly catches exceptions in `Executor` creation, logs a proper message and then exits the JVM. Also, I have changed the `exitExecutor` method to accept a `reason` so that backends can use that reason and do things like logging to a DB to get an aggregate of such exits at a cluster level ## How was this patch tested? I am relying on existing tests Author: Tejas Patil <tejasp@fb.com> Closes #14202 from tejasapatil/exit_executor_failure. (cherry picked from commit b2f24f94591082d3ff82bd3db1760b14603b38aa) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 15 July 2016, 21:27:29 UTC
e833c90 [SPARK-16538][SPARKR] Add more tests for namespace call to SparkSession functions ## What changes were proposed in this pull request? More tests I don't think this is critical for Spark 2.0.0 RC, maybe Spark 2.0.1 or 2.1.0. ## How was this patch tested? unit tests shivaram dongjoon-hyun Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14206 from felixcheung/rroutetests. (cherry picked from commit 611a8ca5895357059f1e7c035d946e0718b26a5a) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 15 July 2016, 20:59:05 UTC
90686ab [SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide ## What changes were proposed in this pull request? Made DataFrame-based API primary * Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html * mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html * ml-guide.html includes a "maintenance mode" announcement about the RDD-based API * **Reviewers: please check this carefully** * (minor) Titles for DF API no longer include "- spark.ml" suffix. Titles for RDD API have "- RDD-based API" suffix * Moved migration guide to ml-guide from mllib-guide * Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides * **Reviewers**: I did not change any of the content of the migration guides. Reorganized DataFrame-based guide: * ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc. * Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html * **Reviewers**: I did not change the content of these guides, except some intro text. * Sidebar remains the same, but with pipeline and tuning sections added Other: * ml-classification-regression.html: Moved text about linear methods to new section in page ## How was this patch tested? Generated docs locally Author: Joseph K. Bradley <joseph@databricks.com> Closes #14213 from jkbradley/ml-guide-2.0. (cherry picked from commit 5ffd5d3838da40ad408a6f40071fe6f4dcacf2a1) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 15 July 2016, 20:38:33 UTC
c5f9355 [SPARK-16557][SQL] Remove stale doc in sql/README.md ## What changes were proposed in this pull request? Most of the documentation in https://github.com/apache/spark/blob/master/sql/README.md is stale. It would be useful to keep the list of projects to explain what's going on, and everything else should be removed. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #14211 from rxin/SPARK-16557. (cherry picked from commit 2e4075e2ece9574100c79558cab054485e25c2ee) Signed-off-by: Reynold Xin <rxin@databricks.com> 15 July 2016, 02:24:47 UTC
aa4690b [SPARK-16555] Work around Jekyll error-handling bug which led to silent failures If a custom Jekyll template tag throws Ruby's equivalent of a "file not found" exception, then Jekyll will stop the doc building process but will exit with a successful status, causing our doc publishing jobs to silently fail. This is caused by https://github.com/jekyll/jekyll/issues/5104, a case of bad error-handling logic in Jekyll. This patch works around this by updating our `include_example.rb` plugin to catch the exception and exit rather than allowing it to bubble up and be ignored by Jekyll. I tested this manually with ``` rm ./examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala cd docs SKIP_API=1 jekyll build echo $? ``` Author: Josh Rosen <joshrosen@databricks.com> Closes #14209 from JoshRosen/fix-doc-building. (cherry picked from commit 972673aca562b24c885801d2ac48e0df95cde9eb) Signed-off-by: Reynold Xin <rxin@databricks.com> 14 July 2016, 22:55:42 UTC
5c56bc0 [SPARK-16553][DOCS] Fix SQL example file name in docs ## What changes were proposed in this pull request? Fixes a typo in the sql programming guide ## How was this patch tested? Building docs locally (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #14208 from shivaram/spark-sql-doc-fix. (cherry picked from commit 01c4c1fa539a6c601ea0d8960363e895c17a8f76) Signed-off-by: Reynold Xin <rxin@databricks.com> 14 July 2016, 21:19:35 UTC
1fe0bcd [SPARK-16540][YARN][CORE] Avoid adding jars twice for Spark running on yarn ## What changes were proposed in this pull request? Currently when running spark on yarn, jars specified with --jars, --packages will be added twice, one is Spark's own file server, another is yarn's distributed cache, this can be seen from log: for example: ``` ./bin/spark-shell --master yarn-client --jars examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar ``` If specified the jar to be added is scopt jar, it will added twice: ``` ... 16/07/14 15:06:48 INFO Server: Started 5603ms 16/07/14 15:06:48 INFO Utils: Successfully started service 'SparkUI' on port 4040. 16/07/14 15:06:48 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.102:4040 16/07/14 15:06:48 INFO SparkContext: Added JAR file:/Users/sshao/projects/apache-spark/examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar at spark://192.168.0.102:63996/jars/scopt_2.11-3.3.0.jar with timestamp 1468480008637 16/07/14 15:06:49 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 16/07/14 15:06:49 INFO Client: Requesting a new application from cluster with 1 NodeManagers 16/07/14 15:06:49 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container) 16/07/14 15:06:49 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead 16/07/14 15:06:49 INFO Client: Setting up container launch context for our AM 16/07/14 15:06:49 INFO Client: Setting up the launch environment for our AM container 16/07/14 15:06:49 INFO Client: Preparing resources for our AM container 16/07/14 15:06:49 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. 16/07/14 15:06:50 INFO Client: Uploading resource file:/private/var/folders/tb/8pw1511s2q78mj7plnq8p9g40000gn/T/spark-a446300b-84bf-43ff-bfb1-3adfb0571a42/__spark_libs__6486179704064718817.zip -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/__spark_libs__6486179704064718817.zip 16/07/14 15:06:51 INFO Client: Uploading resource file:/Users/sshao/projects/apache-spark/examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/scopt_2.11-3.3.0.jar 16/07/14 15:06:51 INFO Client: Uploading resource file:/private/var/folders/tb/8pw1511s2q78mj7plnq8p9g40000gn/T/spark-a446300b-84bf-43ff-bfb1-3adfb0571a42/__spark_conf__326416236462420861.zip -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/__spark_conf__.zip ... ``` So here try to avoid adding jars to Spark's fileserver unnecessarily. ## How was this patch tested? Manually verified both in yarn client and cluster mode, also in standalone mode. Author: jerryshao <sshao@hortonworks.com> Closes #14196 from jerryshao/SPARK-16540. (cherry picked from commit 91575cac32e470d7079a55fb86d66332aba599d0) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 14 July 2016, 17:41:17 UTC
23e1ab9 [SPARK-16528][SQL] Fix NPE problem in HiveClientImpl ## What changes were proposed in this pull request? There are some calls to methods or fields (getParameters, properties) which are then passed to Java/Scala collection converters. Unfortunately those fields can be null in some cases and then the conversions throws NPE. We fix it by wrapping calls to those fields and methods with option and then do the conversion. ## How was this patch tested? Manually tested with a custom Hive metastore. Author: Jacek Lewandowski <lewandowski.jacek@gmail.com> Closes #14200 from jacek-lewandowski/SPARK-16528. (cherry picked from commit 31ca741aef9dd138529e064785c8e58b86140ff5) Signed-off-by: Reynold Xin <rxin@databricks.com> 14 July 2016, 17:19:01 UTC
7418019 [SPARK-16529][SQL][TEST] `withTempDatabase` should set `default` database before dropping ## What changes were proposed in this pull request? `SQLTestUtils.withTempDatabase` is a frequently used test harness to set up a temporary table and finally clean up. This issue improves it as follows for usability. ```scala - try f(dbName) finally spark.sql(s"DROP DATABASE $dbName CASCADE") + try f(dbName) finally { + if (spark.catalog.currentDatabase == dbName) { + spark.sql(s"USE ${DEFAULT_DATABASE}") + } + spark.sql(s"DROP DATABASE $dbName CASCADE") + } ``` In case of forgetting to reset the database, `withTempDatabase` will not raise an Exception. ## How was this patch tested? This improves the test harness. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14184 from dongjoon-hyun/SPARK-16529. (cherry picked from commit c576f9fb90853cce2e8e5dcc32a536a0f49cbbd8) Signed-off-by: Cheng Lian <lian@databricks.com> 14 July 2016, 16:51:56 UTC
0a651aa Preparing development version 2.0.1-SNAPSHOT 14 July 2016, 16:50:16 UTC
e5f8c11 Preparing Spark release v2.0.0-rc4 14 July 2016, 16:50:07 UTC
29281bc [SPARK-16538][SPARKR] fix R call with namespace operator on SparkSession functions ## What changes were proposed in this pull request? Fix function routing to work with and without namespace operator `SparkR::createDataFrame` ## How was this patch tested? manual, unit tests shivaram Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14195 from felixcheung/rroutedefault. (cherry picked from commit 12005c88fb24168d57b577cff73eddcd9d8963fc) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 14 July 2016, 16:45:39 UTC
4e9080f [SPARK-16509][SPARKR] Rename window.partitionBy and window.orderBy to windowPartitionBy and windowOrderBy. ## What changes were proposed in this pull request? Rename window.partitionBy and window.orderBy to windowPartitionBy and windowOrderBy to pass CRAN package check. ## How was this patch tested? SparkR unit tests. Author: Sun Rui <sunrui2016@gmail.com> Closes #14192 from sun-rui/SPARK-16509. (cherry picked from commit 093ebbc628699b40f091b5b7083c119fffa9314b) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 14 July 2016, 16:38:51 UTC
240c42b [SPARK-16500][ML][MLLIB][OPTIMIZER] add LBFGS convergence warning for all used places in MLlib ## What changes were proposed in this pull request? Add a warning for the following cases when LBFGS training does not actually converge: 1) LogisticRegression 2) AFTSurvivalRegression 3) LBFGS algorithm wrapper in mllib package ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #14157 from WeichenXu123/add_lbfgs_convergence_warning_for_all_used_place. (cherry picked from commit 252d4f27f23b547777892bcea25a2cea62d8cbab) Signed-off-by: Sean Owen <sowen@cloudera.com> 14 July 2016, 08:11:14 UTC
b3ebecb Preparing development version 2.0.1-SNAPSHOT 14 July 2016, 05:32:55 UTC
48d1fa3 Preparing Spark release v2.0.0-rc3 14 July 2016, 05:32:45 UTC
f6eda6b [SPARK-16503] SparkSession should provide Spark version ## What changes were proposed in this pull request? This patch enables SparkSession to provide spark version. ## How was this patch tested? Manual test: ``` scala> sc.version res0: String = 2.1.0-SNAPSHOT scala> spark.version res1: String = 2.1.0-SNAPSHOT ``` ``` >>> sc.version u'2.1.0-SNAPSHOT' >>> spark.version u'2.1.0-SNAPSHOT' ``` Author: Liwei Lin <lwlin7@gmail.com> Closes #14165 from lw-lin/add-version. (cherry picked from commit 39c836e976fcae51568bed5ebab28e148383b5d4) Signed-off-by: Reynold Xin <rxin@databricks.com> 14 July 2016, 05:30:52 UTC
5244f86 Preparing development version 2.0.1-SNAPSHOT 14 July 2016, 05:27:15 UTC
47eb9a6 Preparing Spark release v2.0.0-rc3 14 July 2016, 05:27:07 UTC
abb8023 [SPARK-16485][ML][DOC] Fix privacy of GLM members, rename sqlDataTypes for ML, doc fixes ## What changes were proposed in this pull request? Fixing issues found during 2.0 API checks: * GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should not be exposed * sqlDataTypes: name does not follow conventions. Do we need to expose it? * Evaluator: inconsistent doc between evaluate and isLargerBetter * MinMaxScaler: math rendering --> hard to make it great, but I'll change it a little * GeneralizedLinearRegressionSummary: aic doc is incorrect --> will change to use more common name ## How was this patch tested? Existing unit tests. Docs generated locally. (MinMaxScaler is improved a tiny bit.) Author: Joseph K. Bradley <joseph@databricks.com> Closes #14187 from jkbradley/final-api-check-2.0. (cherry picked from commit a5f51e21627c1bcfc62829a3a962707abf41a452) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 13 July 2016, 22:40:53 UTC
550d0e7 [SPARK-16482][SQL] Describe Table Command for Tables Requiring Runtime Inferred Schema #### What changes were proposed in this pull request? If we create a table pointing to a parquet/json dataset without specifying the schema, the describe table command does not show the schema at all. It only shows `# Schema of this table is inferred at runtime`. In 1.6, describe table does show the schema of such a table. ~~For data source tables, to infer the schema, we need to load the data source tables at runtime. Thus, this PR calls the function `lookupRelation`.~~ For data source tables, we infer the schema before table creation. Thus, this PR sets the inferred schema as the table schema at table creation. #### How was this patch tested? Added test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #14148 from gatorsmile/describeSchema. (cherry picked from commit c5ec879828369ec1d21acd7f18a792306634ff74) Signed-off-by: Yin Huai <yhuai@databricks.com> 13 July 2016, 22:23:59 UTC
9e3a598 [SPARKR][DOCS][MINOR] R programming guide to include csv data source example ## What changes were proposed in this pull request? Minor documentation update for code example, code style, and missed reference to "sparkR.init" ## How was this patch tested? manual shivaram Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14178 from felixcheung/rcsvprogrammingguide. (cherry picked from commit fb2e8eeb0b1e56bea535165f7a3bec6558b3f4a3) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 13 July 2016, 22:09:31 UTC
18255a9 [SPARKR][MINOR] R examples and test updates ## What changes were proposed in this pull request? Minor example updates ## How was this patch tested? manual shivaram Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14171 from felixcheung/rexample. (cherry picked from commit b4baf086ca380a46d953f2710184ad9eee3a045e) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 13 July 2016, 20:33:47 UTC
86adc5c [SPARK-16114][SQL] updated structured streaming guide ## What changes were proposed in this pull request? Updated structured streaming programming guide with new windowed example. ## How was this patch tested? Docs Author: James Thomas <jamesjoethomas@gmail.com> Closes #14183 from jjthomas/ss_docs_update. (cherry picked from commit 51a6706b1339bb761602e33276a469f71be2cd90) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 13 July 2016, 20:26:34 UTC
7de183d [SPARK-16531][SQL][TEST] Remove timezone setting from DataFrameTimeWindowingSuite ## What changes were proposed in this pull request? It's unnecessary. `QueryTest` already sets it. Author: Burak Yavuz <brkyvz@gmail.com> Closes #14170 from brkyvz/test-tz. (cherry picked from commit 0744d84c91d6e494dea77a35e6410bc4b1849e71) Signed-off-by: Michael Armbrust <michael@databricks.com> 13 July 2016, 19:55:11 UTC
2e97f3a [SPARK-14812][ML][MLLIB][PYTHON] Experimental, DeveloperApi annotation audit for ML ## What changes were proposed in this pull request? General decisions to follow, except where noted: * spark.mllib, pyspark.mllib: Remove all Experimental annotations. Leave DeveloperApi annotations alone. * spark.ml, pyspark.ml ** Annotate Estimator-Model pairs of classes and companion objects the same way. ** For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation. ** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation. * DeveloperApi annotations are left alone, except where noted. * No changes to which types are sealed. Exceptions where I am leaving items Experimental in spark.ml, pyspark.ml, mainly because the items are new: * Model Summary classes * MLWriter, MLReader, MLWritable, MLReadable * Evaluator and subclasses: There is discussion of changes around evaluating multiple metrics at once for efficiency. * RFormula: Its behavior may need to change slightly to match R in edge cases. * AFTSurvivalRegression * MultilayerPerceptronClassifier DeveloperApi changes: * ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi ## How was this patch tested? N/A Note to reviewers: * spark.ml.clustering.LDA underwent significant changes (additional methods), so let me know if you want me to leave it Experimental. * Be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature. I did not find such cases, but please verify. Author: Joseph K. Bradley <joseph@databricks.com> Closes #14147 from jkbradley/experimental-audit. (cherry picked from commit 01f09b161217193b797c8c85969d17054c958615) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 13 July 2016, 19:34:15 UTC
90f0e81 [SPARK-16435][YARN][MINOR] Add warning log if initialExecutors is less than minExecutors ## What changes were proposed in this pull request? Currently, if `spark.dynamicAllocation.initialExecutors` is less than `spark.dynamicAllocation.minExecutors`, Spark will automatically pick minExecutors without any warning, while in 1.6 Spark would throw an exception if configured like this. So here I propose to add a warning log if these parameters are configured invalidly. ## How was this patch tested? A unit test was added to verify the scenario. Author: jerryshao <sshao@hortonworks.com> Closes #14149 from jerryshao/SPARK-16435. (cherry picked from commit d8220c1e5e94abbdb9643672b918f0d748206db9) Signed-off-by: Tom Graves <tgraves@yahoo-inc.com> 13 July 2016, 18:25:05 UTC
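The warning targets configurations like the following, where the initial value undercuts the minimum; a minimal sketch (values are hypothetical):
```scala
import org.apache.spark.SparkConf

// initialExecutors below minExecutors: Spark falls back to 5 and, after this patch, logs a warning.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "5")
  .set("spark.dynamicAllocation.initialExecutors", "2")
```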
7d9bd95 [SPARK-16469] enhanced simulate multiply ## What changes were proposed in this pull request? We have a use case of multiplying very big sparse matrices: about 1000x1000 distributed block matrix multiplications, and the simulate multiply goes like O(n^4) (n being 1000); it takes about 1.5 hours. We modified it slightly with a classical hashmap and it now runs in about 30 seconds, O(n^2). ## How was this patch tested? We have added a performance test and verified the reduced time. Author: oraviv <oraviv@paypal.com> Closes #14068 from uzadude/master. (cherry picked from commit ea06e4ef34c860219a9aeec81816ef53ada96253) Signed-off-by: Sean Owen <sowen@cloudera.com> 13 July 2016, 13:47:47 UTC
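For context, the code path in question is MLlib's distributed BlockMatrix multiply; a minimal sketch of exercising it, assuming a spark-shell session where `sc` is in scope (sizes and entries are hypothetical):
```scala
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

val entries = sc.parallelize(Seq(MatrixEntry(0, 0, 1.0), MatrixEntry(1, 1, 2.0)))
val a = new CoordinateMatrix(entries).toBlockMatrix(1024, 1024).cache()
// multiply() first simulates which block pairs must be shipped between partitions;
// that simulation step is what this patch speeds up for large, sparse block matrices.
val product = a.multiply(a.transpose)
```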
5a71a05 [SPARK-16440][MLLIB] Undeleted broadcast variables in Word2Vec causing OoM for long runs ## What changes were proposed in this pull request? Unpersist broadcasted vars in Word2Vec.fit for more timely / reliable resource cleanup ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #14153 from srowen/SPARK-16440. (cherry picked from commit 51ade51a9fd64fc2fe651c505a286e6f29f59d40) Signed-off-by: Sean Owen <sowen@cloudera.com> 13 July 2016, 10:39:39 UTC
74ad486 [MINOR][YARN] Fix code error in yarn-cluster unit test ## What changes were proposed in this pull request? Fix code error in yarn-cluster unit test. ## How was this patch tested? Use exist tests Author: sharkd <sharkd.tu@gmail.com> Closes #14166 from sharkdtu/master. (cherry picked from commit 3d6f679cfe5945a9f72841727342af39e9410e0a) Signed-off-by: Sean Owen <sowen@cloudera.com> 13 July 2016, 10:36:09 UTC
a34a544 [SPARK-16438] Add Asynchronous Actions documentation ## What changes were proposed in this pull request? Add Asynchronous Actions documentation inside action of programming guide ## How was this patch tested? check the documentation indentation and formatting with md preview. Author: sandy <phalodi@gmail.com> Closes #14104 from phalodi/SPARK-16438. (cherry picked from commit bf107f1e6522f9138d454b0723089c24626e775a) Signed-off-by: Sean Owen <sowen@cloudera.com> 13 July 2016, 10:33:54 UTC
38787ec [SPARK-16439] Fix number formatting in SQL UI ## What changes were proposed in this pull request? The Spark SQL UI displays numbers greater than 1000 with \u00A0 (no-breaking space) as the grouping separator. The problem exists when the server locale uses a no-breaking space as the separator (for example pl_PL). This patch turns off grouping and removes this separator. The problem started with this PR: https://github.com/apache/spark/pull/12425/files#diff-803f475b01acfae1c5c96807c2ea9ddcR125 ## How was this patch tested? Manual UI tests. Screenshot attached. ![image](https://cloud.githubusercontent.com/assets/4006010/16749556/5cb5a372-47cb-11e6-9a95-67fd3f9d1c71.png) Author: Maciej Brynski <maciej.brynski@adpilot.pl> Closes #14142 from maver1ck/master. (cherry picked from commit 83879ebc5850b74369a5b066c65fa9929bbdb21c) Signed-off-by: Sean Owen <sowen@cloudera.com> 13 July 2016, 09:50:34 UTC
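To illustrate the locale behaviour behind the bug (this is not the UI code itself, just a sketch of the JDK formatting involved):
```scala
import java.text.NumberFormat
import java.util.Locale

val nf = NumberFormat.getIntegerInstance(new Locale("pl", "PL"))
nf.format(1000000)        // digits grouped with U+00A0 in this locale, which is what leaked into the UI
nf.setGroupingUsed(false) // the fix disables grouping so no separator is emitted
nf.format(1000000)        // "1000000"
```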
934e2aa [MINOR] Fix Java style errors and remove unused imports Fix Java style errors and remove unused imports, which are randomly found Tested on my local machine. Author: Xin Ren <iamshrek@126.com> Closes #14161 from keypointt/SPARK-16437. (cherry picked from commit f73891e0b9640e14455bdbfd999a8ff10b78a819) Signed-off-by: Sean Owen <sowen@cloudera.com> 13 July 2016, 09:48:39 UTC
5301efc [SPARK-16375][WEB UI] Fixed misassigned var: numCompletedTasks was assigned to numSkippedTasks ## What changes were proposed in this pull request? I fixed a misassigned var, numCompletedTasks was assigned to numSkippedTasks in the convertJobData method ## How was this patch tested? dev/run-tests Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #14141 from ajbozarth/spark16375. (cherry picked from commit f156136dae5df38f73a25cf3fb48f98f417ef059) Signed-off-by: Sean Owen <sowen@cloudera.com> 13 July 2016, 09:45:14 UTC
4b93a83 [SPARK-15889][STREAMING] Follow-up fix to erroneous condition in StreamTest ## What changes were proposed in this pull request? A second form of AssertQuery now actually invokes the condition; avoids a build warning too ## How was this patch tested? Jenkins; running StreamTest Author: Sean Owen <sowen@cloudera.com> Closes #14133 from srowen/SPARK-15889.2. (cherry picked from commit c190d89bd3cf677400c49238498207b87da9ee78) Signed-off-by: Sean Owen <sowen@cloudera.com> 13 July 2016, 09:44:15 UTC
5173f84 [SPARK-16303][DOCS][EXAMPLES] Updated SQL programming guide and examples - Hard-coded Spark SQL sample snippets were moved into source files under examples sub-project. - Removed the inconsistency between Scala and Java Spark SQL examples - Scala and Java Spark SQL examples were updated The work is still in progress. All involved examples were tested manually. An additional round of testing will be done after the code review. ![image](https://cloud.githubusercontent.com/assets/6235869/16710314/51851606-462a-11e6-9fbe-0818daef65e4.png) Author: aokolnychyi <okolnychyyanton@gmail.com> Closes #14119 from aokolnychyi/spark_16303. (cherry picked from commit 772c213ec702c80d0f25aa6f30b2dffebfbe2d0d) Signed-off-by: Cheng Lian <lian@databricks.com> 13 July 2016, 08:12:51 UTC
41df62c [SPARK-16514][SQL] Fix various regex codegen bugs ## What changes were proposed in this pull request? RegexExtract and RegexReplace currently crash on non-nullable input due use of a hard-coded local variable name (e.g. compiles fail with `java.lang.Exception: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 85, Column 26: Redefinition of local variable "m" `). This changes those variables to use fresh names, and also in a few other places. ## How was this patch tested? Unit tests. rxin Author: Eric Liang <ekl@databricks.com> Closes #14168 from ericl/sc-3906. (cherry picked from commit 1c58fa905b6543d366d00b2e5394dfd633987f6d) Signed-off-by: Reynold Xin <rxin@databricks.com> 13 July 2016, 06:09:08 UTC
4303d29 [SPARK-16284][SQL] Implement reflect SQL function ## What changes were proposed in this pull request? This patch implements reflect SQL function, which can be used to invoke a Java method in SQL. Slightly different from Hive, this implementation requires the class name and the method name to be literals. This implementation also supports only a smaller number of data types, and requires the function to be static, as suggested by rxin in #13969. java_method is an alias for reflect, so this should also resolve SPARK-16277. ## How was this patch tested? Added expression unit tests and an end-to-end test. Author: petermaxlee <petermaxlee@gmail.com> Closes #14138 from petermaxlee/reflect-static. (cherry picked from commit 56bd399a86c4e92be412d151200cb5e4a5f6a48a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 13 July 2016, 00:05:52 UTC
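A minimal usage sketch of the new function, invoking a static JDK method with literal class and method names:
```scala
// reflect(className, methodName, args...) calls a static Java method from SQL; the result is returned as a string.
spark.sql("SELECT reflect('java.lang.Math', 'max', 1, 2)").show()        // "2"
// java_method is an alias for reflect.
spark.sql("SELECT java_method('java.lang.String', 'valueOf', 1)").show() // "1"
```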
2f47b37 [SPARK-16414][YARN] Fix bugs for "Can not get user config when calling SparkHadoopUtil.get.conf on yarn cluster mode" ## What changes were proposed in this pull request? The `SparkHadoopUtil` singleton was instantiated before `ApplicationMaster` in `ApplicationMaster.main` when deploying Spark in yarn cluster mode, so the `conf` in the `SparkHadoopUtil` singleton didn't include the user's configuration. So, we should load the properties file with the Spark configuration and set entries as system properties before `SparkHadoopUtil` is first instantiated. ## How was this patch tested? Added a test case Author: sharkd <sharkd.tu@gmail.com> Author: sharkdtu <sharkdtu@tencent.com> Closes #14088 from sharkdtu/master. (cherry picked from commit d513c99c19e229f72d03006e251725a43c13fefd) 12 July 2016, 17:14:41 UTC
f419476 [SPARK-16489][SQL] Guard against variable reuse mistakes in expression code generation In code generation, it is incorrect for expressions to reuse variable names across different instances of itself. As an example, SPARK-16488 reports a bug in which pmod expression reuses variable name "r". This patch updates ExpressionEvalHelper test harness to always project two instances of the same expression, which will help us catch variable reuse problems in expression unit tests. This patch also fixes the bug in crc32 expression. This is a test harness change, but I also created a new test suite for testing the test harness. Author: Reynold Xin <rxin@databricks.com> Closes #14146 from rxin/SPARK-16489. (cherry picked from commit c377e49e38a290e5c4fbc178278069788674dfb7) Signed-off-by: Reynold Xin <rxin@databricks.com> 12 July 2016, 17:08:07 UTC
7b63e7d [SPARK-16470][ML][OPTIMIZER] Check whether linear regression training actually reaches convergence and add a warning if not ## What changes were proposed in this pull request? `ml.regression.LinearRegression` uses the breeze `LBFGS` and `OWLQN` optimizers for training, but does not check whether the result returned by breeze's optimizer actually reached convergence. The `LBFGS` and `OWLQN` optimizers in breeze may finish iterating in the following situations: 1) reached the max iteration number 2) function value convergence reached 3) objective function stopped improving 4) gradient convergence reached 5) search failed (due to some internal numerical error) I add warning-printing code so that if the iteration result is (1), (3) or (5) above, it will print a warning with the respective reason string. ## How was this patch tested? Manual. Author: WeichenXu <WeichenXu123@outlook.com> Closes #14122 from WeichenXu123/add_lr_not_convergence_warn. (cherry picked from commit 6cb75db9ab1a4f227069bec2763b89546b88b0ee) Signed-off-by: Sean Owen <sowen@cloudera.com> 12 July 2016, 12:04:42 UTC
9e0d2e2 [MINOR][ML] update comment that is inconsistent with code in ml.regression.LinearRegression ## What changes were proposed in this pull request? In the `train` method of `ml.regression.LinearRegression`, when handling the situation `std(label) == 0`, the code replaces `std(label)` with `mean(label)`, but the related comment is inconsistent; I update it. ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #14121 from WeichenXu123/update_lr_comment. (cherry picked from commit fc11c509e234c5414687f7fbd13af113a1f52f10) Signed-off-by: Sean Owen <sowen@cloudera.com> 12 July 2016, 08:24:09 UTC
6892614 [SPARK-16488] Fix codegen variable namespace collision in pmod and partitionBy This patch fixes a variable namespace collision bug in pmod and partitionBy Regression test for one possible occurrence. A more general fix in `ExpressionEvalHelper.checkEvaluation` will be in a subsequent PR. Author: Sameer Agarwal <sameer@databricks.com> Closes #14144 from sameeragarwal/codegen-bug. (cherry picked from commit 9cc74f95edb6e4f56151966139cd0dc24e377949) Signed-off-by: Reynold Xin <rxin@databricks.com> 12 July 2016, 03:29:35 UTC
b37177c [SPARK-16430][SQL][STREAMING] Fixed bug in the maxFilesPerTrigger in FileStreamSource ## What changes were proposed in this pull request? Incorrect list of files were being allocated to a batch. This caused a file to read multiple times in the multiple batches. ## How was this patch tested? Added unit tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #14143 from tdas/SPARK-16430-1. (cherry picked from commit e50efd53f073890d789a8448f850cc219cca7708) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 12 July 2016, 01:41:45 UTC
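The option in question caps how many new files each micro-batch picks up; a minimal sketch, with a hypothetical input path:
```scala
// With the fix, each file is assigned to exactly one batch even when maxFilesPerTrigger is set.
val lines = spark.readStream
  .format("text")
  .option("maxFilesPerTrigger", "1") // at most one new file per trigger
  .load("/tmp/streaming-input")      // hypothetical directory
```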
b716e10 [SPARK-16433][SQL] Improve StreamingQuery.explain when no data arrives ## What changes were proposed in this pull request? Display `No physical plan. Waiting for data.` instead of `N/A` for StreamingQuery.explain when no data arrives because `N/A` doesn't provide meaningful information. ## How was this patch tested? Existing unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #14100 from zsxwing/SPARK-16433. (cherry picked from commit 91a443b849e4d1ccc50a32b25fdd2bb502cf9b84) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 12 July 2016, 01:11:17 UTC
81d7f48 [MINOR][STREAMING][DOCS] Minor changes on kinesis integration ## What changes were proposed in this pull request? Some minor changes for documentation page "Spark Streaming + Kinesis Integration". Moved "streaming-kinesis-arch.png" before the bullet list, not in between the bullets. ## How was this patch tested? Tested manually, on my local machine. Author: Xin Ren <iamshrek@126.com> Closes #14097 from keypointt/kinesisDoc. (cherry picked from commit 05d7151ccbccdd977ec2f2301d5b12566018c988) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 12 July 2016, 01:09:27 UTC
02d584c [SPARK-16114][SQL] structured streaming event time window example ## What changes were proposed in this pull request? A structured streaming example with event time windowing. ## How was this patch tested? Run locally Author: James Thomas <jamesjoethomas@gmail.com> Closes #13957 from jjthomas/current. (cherry picked from commit 9e2c763dbb5ac6fc5d2eb0759402504d4b9073a4) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 12 July 2016, 00:58:00 UTC
cb463b6 [SPARK-16144][SPARKR] update R API doc for mllib ## What changes were proposed in this pull request? From SPARK-16140/PR #13921 - the issue is we left write.ml doc empty: ![image](https://cloud.githubusercontent.com/assets/8969467/16481934/856dd0ea-3e62-11e6-9474-e4d57d1ca001.png) Here's what I meant as the fix: ![image](https://cloud.githubusercontent.com/assets/8969467/16481943/911f02ec-3e62-11e6-9d68-17363a9f5628.png) ![image](https://cloud.githubusercontent.com/assets/8969467/16481950/9bc057aa-3e62-11e6-8127-54870701c4b1.png) I didn't realize there was already a JIRA on this. mengxr yanboliang ## How was this patch tested? check doc generated. Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #13993 from felixcheung/rmllibdoc. (cherry picked from commit 7f38b9d5f469b2550bc481cbf9adb9acc3779712) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 11 July 2016, 21:34:57 UTC
b938ca7 [SPARKR][DOC] SparkR ML user guides update for 2.0 ## What changes were proposed in this pull request? * Update SparkR ML section to make them consistent with SparkR API docs. * Since #13972 adds labelling support for the ```include_example``` Jekyll plugin, so that we can split the single ```ml.R``` example file into multiple line blocks with different labels, and include them in different algorithms/models in the generated HTML page. ## How was this patch tested? Only docs update, manually check the generated docs. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14011 from yanboliang/r-user-guide-update. (cherry picked from commit 2ad031be67c7a0f0c4895c084c891330a9ec935e) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 11 July 2016, 21:31:18 UTC
aea33bf [SPARK-16458][SQL] SessionCatalog should support `listColumns` for temporary tables ## What changes were proposed in this pull request? Temporary tables are used frequently, but `spark.catalog.listColumns` does not support those tables. This PR make `SessionCatalog` supports temporary table column listing. **Before** ```scala scala> spark.range(10).createOrReplaceTempView("t1") scala> spark.catalog.listTables().collect() res1: Array[org.apache.spark.sql.catalog.Table] = Array(Table[name=`t1`, tableType=`TEMPORARY`, isTemporary=`true`]) scala> spark.catalog.listColumns("t1").collect() org.apache.spark.sql.AnalysisException: Table `t1` does not exist in database `default`.; ``` **After** ``` scala> spark.catalog.listColumns("t1").collect() res2: Array[org.apache.spark.sql.catalog.Column] = Array(Column[name='id', description='id', dataType='bigint', nullable='false', isPartition='false', isBucket='false']) ``` ## How was this patch tested? Pass the Jenkins tests including a new testcase. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14114 from dongjoon-hyun/SPARK-16458. (cherry picked from commit 840853ed06d63694bf98b21a889a960aac6ac0ac) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 11 July 2016, 20:45:49 UTC
72cf743 [SPARK-16318][SQL] Implement all remaining xpath functions (branch-2.0) ## What changes were proposed in this pull request? This patch implements all remaining xpath functions that Hive supports and not natively supported in Spark: xpath_int, xpath_short, xpath_long, xpath_float, xpath_double, xpath_string, and xpath. This is based on https://github.com/apache/spark/pull/13991 but for branch-2.0. ## How was this patch tested? Added unit tests and end-to-end tests. Author: petermaxlee <petermaxlee@gmail.com> Closes #14131 from petermaxlee/xpath-branch-2.0. 11 July 2016, 19:42:43 UTC
f97dd8a [SPARK-16459][SQL] Prevent dropping current database This PR prevents dropping the current database to avoid errors like the followings. ```scala scala> sql("create database delete_db") scala> sql("use delete_db") scala> sql("drop database delete_db") scala> sql("create table t as select 1") org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database `delete_db` not found; ``` Pass the Jenkins tests including an updated testcase. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14115 from dongjoon-hyun/SPARK-16459. (cherry picked from commit 7ac79da0e4607f7f89a3617edf53c2b174b378e8) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 11 July 2016, 17:02:01 UTC
7e4ba66 [SPARK-16381][SQL][SPARKR] Update SQL examples and programming guide for R language binding https://issues.apache.org/jira/browse/SPARK-16381 ## What changes were proposed in this pull request? Update SQL examples and programming guide for R language binding. Here I just follow example https://github.com/apache/spark/compare/master...liancheng:example-snippet-extraction, created a separate R file to store all the example code. ## How was this patch tested? Manual test on my local machine. Screenshot as below: ![screen shot 2016-07-06 at 4 52 25 pm](https://cloud.githubusercontent.com/assets/3925641/16638180/13925a58-439a-11e6-8d57-8451a63dcae9.png) Author: Xin Ren <iamshrek@126.com> Closes #14082 from keypointt/SPARK-16381. (cherry picked from commit 9cb1eb7af779e74165552977002158a7dad9bb09) Signed-off-by: Cheng Lian <lian@databricks.com> 11 July 2016, 12:05:41 UTC
aa8cbcd [SPARK-16355][SPARK-16354][SQL] Fix Bugs When LIMIT/TABLESAMPLE is Non-foldable, Zero or Negative #### What changes were proposed in this pull request? **Issue 1:** When a query containing LIMIT/TABLESAMPLE 0, the statistics could be zero. Results are correct but it could cause a huge performance regression. For example, ```Scala Seq(("one", 1), ("two", 2), ("three", 3), ("four", 4)).toDF("k", "v") .createOrReplaceTempView("test") val df1 = spark.table("test") val df2 = spark.table("test").limit(0) val df = df1.join(df2, Seq("k"), "left") ``` The statistics of both `df` and `df2` are zero. The statistics values should never be zero; otherwise `sizeInBytes` of `BinaryNode` will also be zero (product of children). This PR is to increase it to `1` when the num of rows is equal to 0. **Issue 2:** When a query containing negative LIMIT/TABLESAMPLE, we should issue exceptions. Negative values could break the implementation assumption of multiple parts. For example, statistics calculation. Below is the example query. ```SQL SELECT * FROM testData TABLESAMPLE (-1 rows) SELECT * FROM testData LIMIT -1 ``` This PR is to issue an appropriate exception in this case. **Issue 3:** Spark SQL follows the restriction of LIMIT clause in Hive. The argument to the LIMIT clause must evaluate to a constant value. It can be a numeric literal, or another kind of numeric expression involving operators, casts, and function return values. You cannot refer to a column or use a subquery. Currently, we do not detect whether the expression in LIMIT clause is foldable or not. If non-foldable, we might issue a strange error message. For example, ```SQL SELECT * FROM testData LIMIT rand() > 0.2 ``` Then, a misleading error message is issued, like ``` assertion failed: No plan for GlobalLimit (_nondeterministic#203 > 0.2) +- Project [key#11, value#12, rand(-1441968339187861415) AS _nondeterministic#203] +- LocalLimit (_nondeterministic#202 > 0.2) +- Project [key#11, value#12, rand(-1308350387169017676) AS _nondeterministic#202] +- LogicalRDD [key#11, value#12] java.lang.AssertionError: assertion failed: No plan for GlobalLimit (_nondeterministic#203 > 0.2) +- Project [key#11, value#12, rand(-1441968339187861415) AS _nondeterministic#203] +- LocalLimit (_nondeterministic#202 > 0.2) +- Project [key#11, value#12, rand(-1308350387169017676) AS _nondeterministic#202] +- LogicalRDD [key#11, value#12] ``` This PR detects it and then issues a meaningful error message. #### How was this patch tested? Added test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #14034 from gatorsmile/limit. (cherry picked from commit e22627894126dceb7491300b63f1fe028b1e2e2c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 11 July 2016, 08:27:51 UTC
139d5ea [SPARK-16476] Restructure MimaExcludes for easier union excludes ## What changes were proposed in this pull request? It is currently fairly difficult to have proper mima excludes when we cut a version branch. I'm proposing a small change to take the exclude list out of the exclude function, and put it in a variable so we can easily union excludes. After this change, we can bump pom.xml version to 2.1.0-SNAPSHOT, without bumping the diff base version. Note that I also deleted all the exclude rules for version 1.x, to cut down the size of the file. ## How was this patch tested? N/A - this is a build infra change. Author: Reynold Xin <rxin@databricks.com> Closes #14128 from rxin/SPARK-16476. (cherry picked from commit 52b5bb0b7fabe6cc949f514c548f9fbc6a4fa181) Signed-off-by: Reynold Xin <rxin@databricks.com> 11 July 2016, 05:05:22 UTC
a33643c [SPARK-16401][SQL] Data Source API: Enable Extending RelationProvider and CreatableRelationProvider without Extending SchemaRelationProvider #### What changes were proposed in this pull request? When users try to implement a data source API with extending only `RelationProvider` and `CreatableRelationProvider`, they will hit an error when resolving the relation. ```Scala spark.read .format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema") .load() .write. format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema") .save() ``` The error they hit is like ``` org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema does not allow user-specified schemas.; org.apache.spark.sql.AnalysisException: org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema does not allow user-specified schemas.; at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:319) at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:494) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211) ``` Actually, the bug fix is simple. [`DataSource.createRelation(sparkSession.sqlContext, mode, options, data)`](https://github.com/gatorsmile/spark/blob/dd644f8117e889cebd6caca58702a7c7e3d88bef/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L429) already returns a BaseRelation. We should not assign schema to `userSpecifiedSchema`. That schema assignment only makes sense for the data sources that extend `FileFormat`. #### How was this patch tested? Added a test case. Author: gatorsmile <gatorsmile@gmail.com> Closes #14075 from gatorsmile/dataSource. (cherry picked from commit 7374e518e2641fddfe57003340db410224b37581) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 09 July 2016, 12:37:06 UTC
50d7002 [SPARK-11857][MESOS] Deprecate fine grained ## What changes were proposed in this pull request? Documentation changes to indicate that fine-grained mode is now deprecated. No code changes were made, and all fine-grained mode instructions were left in place. We can remove all of that once the deprecation cycle completes (Does Spark have a standard deprecation cycle? One major version?) Blocked on https://github.com/apache/spark/pull/14059 ## How was this patch tested? Viewed in Github Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #14078 from mgummelt/deprecate-fine-grained. (cherry picked from commit b1db26acc51003e68e4e8d7d324cf74e3aa03cfd) Signed-off-by: Reynold Xin <rxin@databricks.com> 09 July 2016, 03:20:48 UTC
5024c4c [SPARK-16432] Empty blocks fail to serialize due to assert in ChunkedByteBuffer ## What changes were proposed in this pull request? It's possible to also change the callers to not pass in empty chunks, but it seems cleaner to just allow `ChunkedByteBuffer` to handle empty arrays. cc JoshRosen ## How was this patch tested? Unit tests, also checked that the original reproduction case in https://github.com/apache/spark/pull/11748#issuecomment-230760283 is resolved. Author: Eric Liang <ekl@databricks.com> Closes #14099 from ericl/spark-16432. (cherry picked from commit d8b06f18dc3e35938d15099beac98221d6f528b5) Signed-off-by: Reynold Xin <rxin@databricks.com> 09 July 2016, 03:18:54 UTC
16202ba [SPARK-16376][WEBUI][SPARK WEB UI][APP-ID] HTTP ERROR 500 when using rest api "/applications//jobs" if array "stageIds" is empty ## What changes were proposed in this pull request? Avoid error finding max of empty Seq when stageIds is empty. It does fix the immediate problem; I don't know if it results in meaningful output, but not an error at least. ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #14105 from srowen/SPARK-16376. (cherry picked from commit 6cef0183c0f0392dad78fec54635afdb9341b7f3) Signed-off-by: Reynold Xin <rxin@databricks.com> 09 July 2016, 03:17:58 UTC
c425230 [SPARK-13569][STREAMING][KAFKA] pattern based topic subscription ## What changes were proposed in this pull request? Allow for kafka topic subscriptions based on a regex pattern. ## How was this patch tested? Unit tests, manual tests Author: cody koeninger <cody@koeninger.org> Closes #14026 from koeninger/SPARK-13569. (cherry picked from commit fd6e8f0e2269a2e7f24f79d5c2041816ea308c86) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 09 July 2016, 00:48:38 UTC
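A hedged sketch of subscribing by topic pattern with the 0-10 consumer strategies; the broker address, group id, and regex are placeholders.

```scala
import java.util.regex.Pattern

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val ssc = new StreamingContext(new SparkConf().setAppName("pattern-subscribe"), Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",           // placeholder broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group")

// Every topic matching the regex is consumed, instead of a fixed topic list.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.SubscribePattern[String, String](Pattern.compile("events-.*"), kafkaParams))
```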
463cbf7 [SPARK-16387][SQL] JDBC Writer should use dialect to quote field names. ## What changes were proposed in this pull request? Currently, the JDBC Writer uses dialects to get datatypes, but doesn't use them to quote field names. This PR uses dialects to quote the field names, too. **Reported Error Scenario (MySQL case)** ```scala scala> val url="jdbc:mysql://localhost:3306/temp" scala> val prop = new java.util.Properties scala> prop.setProperty("user","root") scala> val df = spark.createDataset(Seq("a","b","c")).toDF("order") scala> df.write.mode("overwrite").jdbc(url, "temptable", prop) ...MySQLSyntaxErrorException: ... near 'order TEXT ) ``` ## How was this patch tested? Pass the Jenkins tests and manually verified the above case. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14107 from dongjoon-hyun/SPARK-16387. (cherry picked from commit 3b22291b5f0317609cd71ce7af78e4c5063d66e8) Signed-off-by: Reynold Xin <rxin@databricks.com> 08 July 2016, 23:07:19 UTC
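A hedged illustration of the behavior this enables, assuming the `JdbcDialect.quoteIdentifier` hook the writer now goes through; the connection URL and credentials are placeholders and `spark` is an existing session (e.g. the spark-shell).

```scala
import java.util.Properties
import org.apache.spark.sql.jdbc.JdbcDialects
import spark.implicits._

val url = "jdbc:mysql://localhost:3306/temp"        // placeholder URL
val prop = new Properties()
prop.setProperty("user", "root")

// The dialect chosen for the URL decides how identifiers are quoted;
// for MySQL this is expected to wrap the name in backticks.
println(JdbcDialects.get(url).quoteIdentifier("order"))

// A column named after a SQL keyword should now round-trip through the writer.
val df = spark.createDataset(Seq("a", "b", "c")).toDF("order")
df.write.mode("overwrite").jdbc(url, "temptable", prop)
```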
07f562f [SPARK-16453][BUILD] release-build.sh is missing hive-thriftserver for scala 2.10 ## What changes were proposed in this pull request? This PR adds hive-thriftserver profile to scala 2.10 build created by release-build.sh. Author: Yin Huai <yhuai@databricks.com> Closes #14108 from yhuai/SPARK-16453. (cherry picked from commit 60ba436b7010436c77dfe5219a9662accc25bffa) Signed-off-by: Yin Huai <yhuai@databricks.com> 08 July 2016, 22:57:10 UTC
e3424fd [SPARK-16281][SQL] Implement parse_url SQL function ## What changes were proposed in this pull request? This PR adds the parse_url SQL function in order to remove the Hive fallback. A new implementation of #13999 ## How was this patch tested? Pass the existing tests, including new testcases. Author: wujian <jan.chou.wu@gmail.com> Closes #14008 from janplus/SPARK-16281. (cherry picked from commit f5fef69143b2a83bb8b168b7417e92659af0c72c) Signed-off-by: Reynold Xin <rxin@databricks.com> 08 July 2016, 21:38:11 UTC
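A hedged usage sketch from an existing SparkSession; the part names (`HOST`, `QUERY`, ...) follow Hive's parse_url, and the URL is illustrative.

```scala
// parse_url(url, partToExtract[, key]) extracts a component of a URL.
spark.sql("SELECT parse_url('http://spark.apache.org/path?query=1', 'HOST')").show()
// expected value: spark.apache.org

// The optional third argument pulls out a single query parameter.
spark.sql("SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY', 'query')").show()
// expected value: 1
```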
0e9333b [SPARK-16420] Ensure compression streams are closed. ## What changes were proposed in this pull request? This uses the try/finally pattern to ensure streams are closed after use. `UnsafeShuffleWriter` wasn't closing compression streams, causing them to leak resources until garbage collected. This was causing a problem with codecs that use off-heap memory. ## How was this patch tested? Current tests are sufficient. This should not change behavior. Author: Ryan Blue <blue@apache.org> Closes #14093 from rdblue/SPARK-16420-unsafe-shuffle-writer-leak. (cherry picked from commit 67e085ef6dd62774095f3187844c091db1a6a72c) Signed-off-by: Reynold Xin <rxin@databricks.com> 08 July 2016, 19:37:32 UTC
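Not the internal `UnsafeShuffleWriter` code, but a hedged sketch of the try/finally shape the fix applies, using a standard-library codec for illustration.

```scala
import java.io.{ByteArrayOutputStream, OutputStream}
import java.util.zip.GZIPOutputStream

// Close the compression stream even when writing throws, so any off-heap or
// native resources held by the codec are released instead of waiting for GC.
def writeCompressed(bytes: Array[Byte], sink: OutputStream): Unit = {
  val compressed = new GZIPOutputStream(sink)
  try {
    compressed.write(bytes)
  } finally {
    compressed.close()
  }
}

writeCompressed("payload".getBytes("UTF-8"), new ByteArrayOutputStream())
```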
8dee2ec [SPARK-13638][SQL] Add quoteAll option to CSV DataFrameWriter ## What changes were proposed in this pull request? Adds a quoteAll option for writing CSV which will quote all fields. See https://issues.apache.org/jira/browse/SPARK-13638 ## How was this patch tested? Added a test to verify that all columns in the output are quoted for the DataFrame Author: Jurriaan Pruis <email@jurriaanpruis.nl> Closes #13374 from jurriaan/csv-quote-all. (cherry picked from commit 38cf8f2a50068f80350740ac28e31c8accd20634) Signed-off-by: Reynold Xin <rxin@databricks.com> 08 July 2016, 18:46:01 UTC
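A hedged usage example of the new writer option; the output path is a placeholder and `spark` is an existing session.

```scala
import spark.implicits._

// With quoteAll set, every field is wrapped in the quote character,
// not just fields containing separators or quotes.
Seq(("a", 1), ("b", 2)).toDF("name", "count")
  .write
  .option("quoteAll", "true")
  .csv("/tmp/quoted-output")   // placeholder path
```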
8c81806 [SPARK-16369][MLLIB] tallSkinnyQR of RowMatrix should be aware of empty partitions ## What changes were proposed in this pull request? tallSkinnyQR of RowMatrix should be aware of empty partitions, which could otherwise cause an exception from the Breeze qr decomposition. See the [archived dev mail](https://mail-archives.apache.org/mod_mbox/spark-dev/201510.mbox/%3CCAF7ADNrycvPL3qX-VZJhq4OYmiUUhoscut_tkOm63Cm18iK1tQmail.gmail.com%3E) for more details. ## How was this patch tested? Scala unit test. Author: Xusen Yin <yinxusen@gmail.com> Closes #14049 from yinxusen/SPARK-16369. (cherry picked from commit 255d74fe4a0db2cc842177ec735bbde07c7c8732) Signed-off-by: Sean Owen <sowen@cloudera.com> 08 July 2016, 13:24:07 UTC
221a4a7 [SPARK-16285][SQL] Implement the sentences SQL function ## What changes were proposed in this pull request? This PR implements the `sentences` SQL function. ## How was this patch tested? Pass the Jenkins tests with a new testcase. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14004 from dongjoon-hyun/SPARK_16285. (cherry picked from commit a54438cb23c80f7c7fc35da273677c39317cb1a5) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 08 July 2016, 09:07:27 UTC
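A hedged usage sketch against an existing SparkSession; the output shape, an array of word arrays, follows Hive's sentences function.

```scala
// sentences(text[, lang, country]) splits text into sentences, each as an array of words.
spark.sql("SELECT sentences('Hi there! How are you?')").show(false)
// expected shape: [[Hi, there], [How, are, you]]
```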
18ace01 [SPARK-16430][SQL][STREAMING] Add option maxFilesPerTrigger ## What changes were proposed in this pull request? An option that limits the file stream source to reading one file at a time enables rate limiting. It has the additional convenience that a static set of files can be used like a stream for testing, as this allows those files to be considered one at a time. This PR adds the option `maxFilesPerTrigger`. ## How was this patch tested? New unit test Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #14094 from tdas/SPARK-16430. (cherry picked from commit 5bce4580939c27876f11cd75f0dc2190fb9fa908) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 08 July 2016, 06:19:59 UTC
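A hedged sketch of the option on a file stream source; the format and directory are illustrative.

```scala
// Consider at most one new file per trigger, which throttles the stream and
// lets a static directory of files be replayed one file at a time in tests.
val fileStream = spark.readStream
  .format("text")
  .option("maxFilesPerTrigger", 1)
  .load("/tmp/incoming")            // placeholder directory
```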
565e18c [SPARK-16286][SQL] Implement stack table generating function This PR implements the `stack` table generating function. Pass the Jenkins tests, including new testcases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14033 from dongjoon-hyun/SPARK-16286. (cherry picked from commit d0d28507cacfca5919dbfb4269892d58b62e8662) Signed-off-by: Reynold Xin <rxin@databricks.com> 08 July 2016, 04:09:09 UTC
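A hedged usage example: `stack(n, v1, ..., vk)` spreads the k values over n rows, as in Hive.

```scala
spark.sql("SELECT stack(2, 1, 2, 3, 4)").show()
// expected rows: (1, 2) and (3, 4)
```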
e32c29d [SPARK-16288][SQL] Implement inline table generating function This PR implements the `inline` table generating function. Pass the Jenkins tests with a new testcase. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13976 from dongjoon-hyun/SPARK-16288. (cherry picked from commit 88134e736829f5f93a82879c08cb191f175ff8af) Signed-off-by: Reynold Xin <rxin@databricks.com> 08 July 2016, 04:08:45 UTC
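A hedged usage example: `inline` expands an array of structs into one row per struct, as in Hive.

```scala
spark.sql("SELECT inline(array(struct(1, 'a'), struct(2, 'b')))").show()
// expected rows: (1, a) and (2, b)
```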
bb4b041 [SPARK-16274][SQL] Implement xpath_boolean This patch implements the xpath_boolean expression for Spark SQL, an xpath function that returns true or false. The implementation is modelled after Hive's xpath_boolean, except for how the expression handles null inputs. Hive throws a NullPointerException at runtime if either of the inputs is null. This implementation returns null if either of the inputs is null. Created two new test suites. One for unit tests covering the expression, and the other for end-to-end tests in SQL. Author: petermaxlee <petermaxlee@gmail.com> Closes #13964 from petermaxlee/SPARK-16274. (cherry picked from commit d3af6731fa270842818ed91d6b4d14708ddae2db) Signed-off-by: Reynold Xin <rxin@databricks.com> 08 July 2016, 04:07:33 UTC
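A hedged usage example reflecting the null semantics described above.

```scala
spark.sql("SELECT xpath_boolean('<a><b>1</b></a>', 'a/b')").show()
// expected: true, because the path matches

// A null input yields null rather than Hive's runtime NullPointerException.
spark.sql("SELECT xpath_boolean(CAST(NULL AS STRING), 'a/b')").show()
```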
144aa84 [SPARK-16271][SQL] Implement Hive's UDFXPathUtil This patch ports Hive's UDFXPathUtil over to Spark, which can be used to implement xpath functionality in Spark in the near future. Added two new test suites UDFXPathUtilSuite and ReusableStringReaderSuite. They have been ported over from Hive (but rewritten in Scala in order to leverage ScalaTest). Author: petermaxlee <petermaxlee@gmail.com> Closes #13961 from petermaxlee/xpath. (cherry picked from commit 153c2f9ac12846367a09684fd875c496d350a603) Signed-off-by: Reynold Xin <rxin@databricks.com> 08 July 2016, 04:07:03 UTC
a049754 [SPARK-16289][SQL] Implement posexplode table generating function This PR implements the `posexplode` table generating function. Currently, the master branch raises the following exception for a `map` argument, which differs from Hive. **Before** ```scala scala> sql("select posexplode(map('a', 1, 'b', 2))").show org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7 ``` **After** ```scala scala> sql("select posexplode(map('a', 1, 'b', 2))").show +---+---+-----+ |pos|key|value| +---+---+-----+ | 0| a| 1| | 1| b| 2| +---+---+-----+ ``` For an `array` argument, `after` is the same as `before`. ``` scala> sql("select posexplode(array(1, 2, 3))").show +---+---+ |pos|col| +---+---+ | 0| 1| | 1| 2| | 2| 3| +---+---+ ``` Pass the Jenkins tests with newly added testcases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13971 from dongjoon-hyun/SPARK-16289. (cherry picked from commit 46395db80e3304e3f3a1ebdc8aadb8f2819b48b4) Signed-off-by: Reynold Xin <rxin@databricks.com> 08 July 2016, 04:05:31 UTC
7ef1d1c [SPARK-16278][SPARK-16279][SQL] Implement map_keys/map_values SQL functions This PR adds `map_keys` and `map_values` SQL functions in order to remove Hive fallback. Pass the Jenkins tests including new testcases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13967 from dongjoon-hyun/SPARK-16278. (cherry picked from commit 54b27c1797fcd32b3f3e9d44e1a149ae396a61e6) Signed-off-by: Reynold Xin <rxin@databricks.com> 08 July 2016, 04:02:50 UTC
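A hedged usage example of the two new functions.

```scala
spark.sql("SELECT map_keys(map('a', 1, 'b', 2)), map_values(map('a', 1, 'b', 2))").show()
// expected: keys [a, b] and values [1, 2]
```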
88603bd [SPARK-16276][SQL] Implement elt SQL function This patch implements the elt function, as it is implemented in Hive. Added expression unit test in StringExpressionsSuite and end-to-end test in StringFunctionsSuite. Author: petermaxlee <petermaxlee@gmail.com> Closes #13966 from petermaxlee/SPARK-16276. (cherry picked from commit 85f2303ecadd9bf6d9694a2743dda075654c5ccf) Signed-off-by: Reynold Xin <rxin@databricks.com> 08 July 2016, 04:00:53 UTC
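A hedged usage example: `elt(n, s1, s2, ...)` returns the n-th string argument (1-based), as in Hive.

```scala
spark.sql("SELECT elt(2, 'hello', 'world')").show()
// expected: world
```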
73c764a [SPARK-16425][R] `describe()` should not fail with non-numeric columns ## What changes were proposed in this pull request? This PR prevents ERRORs when `summary(df)` is called for a `SparkDataFrame` with non-numeric columns. This failure happens only in `SparkR`. **Before** ```r > df <- createDataFrame(faithful) > df <- withColumn(df, "boolean", df$waiting==79) > summary(df) 16/07/07 14:15:16 ERROR RBackendHandler: describe on 34 failed Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : org.apache.spark.sql.AnalysisException: cannot resolve 'avg(`boolean`)' due to data type mismatch: function average requires numeric types, not BooleanType; ``` **After** ```r > df <- createDataFrame(faithful) > df <- withColumn(df, "boolean", df$waiting==79) > summary(df) SparkDataFrame[summary:string, eruptions:string, waiting:string] ``` ## How was this patch tested? Pass the Jenkins tests with an updated testcase. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14096 from dongjoon-hyun/SPARK-16425. (cherry picked from commit 6aa7d09f4e126f42e41085dec169c813379ed354) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 08 July 2016, 00:47:38 UTC
5828da4 [SPARK-16310][SPARKR] R na.strings-like default for csv source ## What changes were proposed in this pull request? Apply "NA" as the default null string for R, like the R read.csv na.strings parameter. https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html na.strings = "NA" A user passing a csv file with NA values should get the same behavior with SparkR read.df(... source = "csv") (couldn't open JIRA, will do that later) ## How was this patch tested? unit tests shivaram Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #13984 from felixcheung/rcsvnastring. (cherry picked from commit f4767bcc7a9d1bdd301f054776aa45e7c9f344a7) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 07 July 2016, 22:22:06 UTC
30cb3f1 [SPARK-16415][SQL] fix catalog string error ## What changes were proposed in this pull request? In #13537 we truncate `simpleString` if it is a long `StructType`. But sometimes we need `catalogString` to reconstruct `TypeInfo`, for example in the description of [SPARK-16415](https://issues.apache.org/jira/browse/SPARK-16415). So we need to keep the implementation of `catalogString` unaffected by the truncation. ## How was this patch tested? Added a test case. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #14089 from adrian-wang/catalogstring. (cherry picked from commit 28710b42b0d18a55bd64d597558649537259b127) Signed-off-by: Reynold Xin <rxin@databricks.com> 07 July 2016, 18:08:12 UTC