https://github.com/apache/spark

15de51c Preparing Spark release v1.6.1-rc1 27 February 2016, 04:09:04 UTC
eb6f6da Update CHANGES.txt and spark-ec2 and R package versions for 1.6.1 This patch updates a few more 1.6.0 version numbers for the 1.6.1 release candidate. Verified this by running ``` git grep "1\.6\.0" | grep -v since | grep -v deprecated | grep -v Since | grep -v versionadded | grep 1.6.0 ``` and inspecting the output. Author: Josh Rosen <joshrosen@databricks.com> Closes #11407 from JoshRosen/version-number-updates. 27 February 2016, 04:05:44 UTC
8a43c3b [SPARK-13474][PROJECT INFRA] Update packaging scripts to push artifacts to home.apache.org Due to the people.apache.org -> home.apache.org migration, we need to update our packaging scripts to publish artifacts to the new server. Because the new server only supports sftp instead of ssh, we need to update the scripts to use lftp instead of ssh + rsync. Author: Josh Rosen <joshrosen@databricks.com> Closes #11350 from JoshRosen/update-release-scripts-for-apache-home. (cherry picked from commit f77dc4e1e202942aa8393fb5d8f492863973fe17) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 27 February 2016, 02:40:23 UTC
a57f87e [SPARK-13454][SQL] Allow users to drop a table with a name starting with an underscore. ## What changes were proposed in this pull request? This change adds a workaround to allow users to drop a table with a name starting with an underscore. Without this patch, we can create such a table, but we cannot drop it. The reason is that Hive's parser unquotes a quoted identifier (see https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/parse/HiveLexer.g#L453). So, when we issue a drop table command to Hive, a table name starting with an underscore is actually not quoted. Hive then complains because it does not support a table name starting with an underscore unless backticks are used (underscores are allowed as long as the underscore is not the first character). ## How was this patch tested? Added a test to make sure we can drop a table with a name starting with an underscore. https://issues.apache.org/jira/browse/SPARK-13454 Author: Yin Huai <yhuai@databricks.com> Closes #11349 from yhuai/fixDropTable. 26 February 2016, 20:34:03 UTC
abe8f99 [SPARK-12874][ML] ML StringIndexer does not protect itself from column name duplication ## What changes were proposed in this pull request? ML StringIndexer does not protect itself from column name duplication. We should still improve the way the schemas of `StringIndexer` and `StringIndexerModel` are validated, but it would be better to address that in a separate issue. ## How was this patch tested? Unit test. Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #11370 from yu-iskw/SPARK-12874. (cherry picked from commit 14e2700de29d06460179a94cc9816bcd37344cf7) Signed-off-by: Xiangrui Meng <meng@databricks.com> 25 February 2016, 21:23:44 UTC
d59a08f Revert "[SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on large DataFrames" This reverts commit cb869a143d338985c3d99ef388dd78b1e3d90a73. 25 February 2016, 20:28:03 UTC
5f7440b [SPARK-12316] Wait a minute to avoid cyclic calling. When the application ends, the AM cleans the staging dir. But if the driver then triggers a delegation token update, it cannot find the right token file and ends up calling the method 'updateCredentialsIfRequired' in an endless cycle, which leads to a driver StackOverflowError. https://issues.apache.org/jira/browse/SPARK-12316 Author: huangzhaowei <carlmartinmax@gmail.com> Closes #10475 from SaintBacchus/SPARK-12316. (cherry picked from commit 5fcf4c2bfce4b7e3543815c8e49ffdec8072c9a2) Signed-off-by: Tom Graves <tgraves@yahoo-inc.com> 25 February 2016, 15:14:36 UTC
e3802a7 [SPARK-13439][MESOS] Document that spark.mesos.uris is comma-separated Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #11311 from mgummelt/document_csv. (cherry picked from commit c98a93ded36db5da2f3ebd519aa391de90927688) Signed-off-by: Sean Owen <sowen@cloudera.com> 25 February 2016, 13:32:50 UTC
1f03163 [SPARK-13441][YARN] Fix NPE in yarn Client.createConfArchive method ## What changes were proposed in this pull request? Instead of using the result of File.listFiles() directly, which may be null and cause an NPE, check for null first. If it is null, log a warning instead. ## How was this patch tested? Ran ./dev/run-tests locally and tested manually on a cluster. Author: Terence Yim <terence@cask.co> Closes #11337 from chtyim/fixes/SPARK-13441-null-check. (cherry picked from commit fae88af18445c5a88212b4644e121de4b30ce027) Signed-off-by: Sean Owen <sowen@cloudera.com> 25 February 2016, 13:29:41 UTC
cb869a1 [SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on large DataFrames Change line 113 of QuantileDiscretizer.scala to `val requiredSamples = math.max(numBins * numBins, 10000.0)` so that `requiredSamples` is a `Double`. This fixes the division in line 114, which currently results in zero if `requiredSamples < dataset.count`. Manual tests. I was having problems using QuantileDiscretizer with my dataset, and after making this change QuantileDiscretizer behaves as expected. Author: Oliver Pierson <ocp@gatech.edu> Author: Oliver Pierson <opierson@umd.edu> Closes #11319 from oliverpierson/SPARK-13444. (cherry picked from commit 6f8e835c68dff6fcf97326dc617132a41ff9d043) Signed-off-by: Sean Owen <sowen@cloudera.com> 25 February 2016, 13:27:10 UTC
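The pitfall in the commit above is plain integer division: with `requiredSamples` as an `Int`, dividing by a large row count truncates the sampling fraction to zero. A minimal sketch of the before/after behaviour, with illustrative numbers (not the actual QuantileDiscretizer source):

```scala
// Illustrative only: the constants below are made up, not taken from QuantileDiscretizer.
val numBins = 100
val datasetCount = 1000000000L                                   // a large DataFrame

val requiredSamplesInt = math.max(numBins * numBins, 10000)      // Int
val badFraction  = requiredSamplesInt / datasetCount             // integer division -> 0

val requiredSamples = math.max(numBins * numBins, 10000.0)       // Double, as in the fix
val goodFraction = requiredSamples / datasetCount                // ~1.0e-5

println(s"bad = $badFraction, good = $goodFraction")
```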
3cc938a [SPARK-13473][SQL] Don't push predicate through project with nondeterministic field(s) ## What changes were proposed in this pull request? Predicates shouldn't be pushed through project with nondeterministic field(s). See https://github.com/graphframes/graphframes/pull/23 and SPARK-13473 for more details. This PR targets master, branch-1.6, and branch-1.5. ## How was this patch tested? A test case is added in `FilterPushdownSuite`. It constructs a query plan where a filter is over a project with a nondeterministic field. Optimized query plan shouldn't change in this case. Author: Cheng Lian <lian@databricks.com> Closes #11348 from liancheng/spark-13473-no-ppd-through-nondeterministic-project-field. (cherry picked from commit 3fa6491be66dad690ca5329dd32e7c82037ae8c1) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 February 2016, 12:45:18 UTC
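To see why that pushdown is unsafe, consider a filter sitting above a projection that computes `rand()`: pushing the filter below the project would evaluate the nondeterministic expression a second time, so the filtered rows would no longer match the returned values. A hedged sketch against the public 1.6 DataFrame API (assumes a spark-shell `sqlContext`; column names are illustrative):

```scala
// Sketch only: the optimizer must keep the filter above the projection here.
import org.apache.spark.sql.functions.{col, rand}

val df = sqlContext.range(0, 100)                       // single column "id"
val projected = df.select(col("id"), rand(42).as("r"))  // nondeterministic field
val filtered  = projected.filter(col("r") > 0.5)        // must not be pushed below the project
filtered.explain(true)
```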
8975996 [SPARK-13482][MINOR][CONFIGURATION] Make the configuration naming in TransportConf consistent. `spark.storage.memoryMapThreshold` has two kinds of values: one is 2*1024*1024 as an integer and the other is '2m' as a string. "2m" is the form recommended in the documentation, but it goes wrong if the code goes into `TransportConf#memoryMapBytes`. [Jira](https://issues.apache.org/jira/browse/SPARK-13482) Author: huangzhaowei <carlmartinmax@gmail.com> Closes #11360 from SaintBacchus/SPARK-13482. (cherry picked from commit 264533b553be806b6c45457201952e83c028ec78) Signed-off-by: Reynold Xin <rxin@databricks.com> 25 February 2016, 07:52:23 UTC
fe71cab [SPARK-13475][TESTS][SQL] HiveCompatibilitySuite should still run in PR builder even if a PR only changes sql/core ## What changes were proposed in this pull request? `HiveCompatibilitySuite` should still run in PR build even if a PR only changes sql/core. So, I am going to remove `ExtendedHiveTest` annotation from `HiveCompatibilitySuite`. https://issues.apache.org/jira/browse/SPARK-13475 Author: Yin Huai <yhuai@databricks.com> Closes #11351 from yhuai/SPARK-13475. (cherry picked from commit bc353805bd98243872d520e05fa6659da57170bf) Signed-off-by: Yin Huai <yhuai@databricks.com> 24 February 2016, 21:35:53 UTC
06f4fce [SPARK-13390][SQL][BRANCH-1.6] Fix the issue that Iterator.map().toSeq is not Serializable ## What changes were proposed in this pull request? `scala.collection.Iterator`'s methods (e.g., map, filter) will return an `AbstractIterator` which is not Serializable. E.g., ```Scala scala> val iter = Array(1, 2, 3).iterator.map(_ + 1) iter: Iterator[Int] = non-empty iterator scala> println(iter.isInstanceOf[Serializable]) false ``` If we call something like `Iterator.map(...).toSeq`, it will create a `Stream` that contains a non-serializable `AbstractIterator` field, making the `Stream` non-serializable. This PR uses `toArray` instead of `toSeq` to fix this issue in `def createDataFrame(data: java.util.List[_], beanClass: Class[_]): DataFrame`. ## How was this patch tested? Jenkins tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #11334 from zsxwing/SPARK-13390. 24 February 2016, 13:35:36 UTC
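For reference, a plain-Scala sketch (no Spark required) contrasting the two materializations described above; the `toSeq` result is a `Stream` whose lazy tail still references the non-serializable iterator, while `toArray` fully materializes the data:

```scala
// Sketch: toSeq on an iterator yields a Stream, toArray yields a plain Array.
val asSeq   = Array(1, 2, 3).iterator.map(_ + 1).toSeq    // Stream; lazy tail captures the iterator
val asArray = Array(1, 2, 3).iterator.map(_ + 1).toArray  // fully materialized

println(asSeq.getClass.getName)   // scala.collection.immutable.Stream$Cons
println(asArray.toList)           // List(2, 3, 4) -- safe to serialize and ship
```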
573a2c9 [SPARK-13410][SQL] Support unionAll for DataFrames with UDT columns. ## What changes were proposed in this pull request? This PR adds equality operators to UDT classes so that they can be correctly tested for dataType equality during union operations. This was previously causing `"AnalysisException: u"unresolved operator 'Union;""` when trying to unionAll two DataFrames with UDT columns, as below. ``` from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT from pyspark.sql import types schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)]) a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema) b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema) c = a.unionAll(b) ``` ## How was this patch tested? Tested using two unit tests in sql/test.py and the DataFrameSuite. Additional information here: https://issues.apache.org/jira/browse/SPARK-13410 rxin Author: Franklyn D'souza <franklynd@gmail.com> Closes #11333 from damnMeddlingKid/udt-union-patch. 23 February 2016, 23:34:04 UTC
0784e02 [SPARK-13355][MLLIB] replace GraphImpl.fromExistingRDDs by Graph.apply `GraphImpl.fromExistingRDDs` expects preprocessed vertex RDD as input. We call it in LDA without validating this requirement. So it might introduce errors. Replacing it by `Graph.apply` would be safer and more proper because it is a public API. The tests still pass. So maybe it is safe to use `fromExistingRDDs` here (though it doesn't seem so based on the implementation) or the test cases are special. jkbradley ankurdave Author: Xiangrui Meng <meng@databricks.com> Closes #11226 from mengxr/SPARK-13355. (cherry picked from commit 764ca18037b6b1884fbc4be9a011714a81495020) Signed-off-by: Xiangrui Meng <meng@databricks.com> 23 February 2016, 07:54:29 UTC
d31854d [SPARK-12746][ML] ArrayType(_, true) should also accept ArrayType(_, false) fix for branch-1.6 https://issues.apache.org/jira/browse/SPARK-13359 Author: Earthson Lu <Earthson.Lu@gmail.com> Closes #11237 from Earthson/SPARK-13359. 23 February 2016, 07:40:36 UTC
2902798 Preparing development version 1.6.1-SNAPSHOT 23 February 2016, 02:30:30 UTC
152252f Preparing Spark release v1.6.1-rc1 23 February 2016, 02:30:24 UTC
40d11d0 Update branch-1.6 for 1.6.1 release 23 February 2016, 02:25:48 UTC
f7898f9 [SPARK-11624][SPARK-11972][SQL] fix commands that need hive to exec In SparkSQLCLI, we have created a `CliSessionState`, but then we call `SparkSQLEnv.init()`, which will start another `SessionState`. This would lead to exception because `processCmd` need to get the `CliSessionState` instance by calling `SessionState.get()`, but the return value would be a instance of `SessionState`. See the exception below. spark-sql> !echo "test"; Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.hive.ql.session.SessionState cannot be cast to org.apache.hadoop.hive.cli.CliSessionState at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:112) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:301) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:242) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:691) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #9589 from adrian-wang/clicommand. (cherry picked from commit 5d80fac58f837933b5359a8057676f45539e53af) Signed-off-by: Michael Armbrust <michael@databricks.com> Conflicts: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala 23 February 2016, 02:20:06 UTC
85e6a22 [SPARK-13298][CORE][UI] Escape "label" to avoid DAG being broken by some special character ## What changes were proposed in this pull request? When there are some special characters (e.g., `"`, `\`) in `label`, the DAG will be broken. This patch just escapes `label` to avoid the DAG being broken by some special characters. ## How was this patch tested? Jenkins tests Author: Shixiong Zhu <shixiong@databricks.com> Closes #11309 from zsxwing/SPARK-13298. (cherry picked from commit a11b3995190cb4a983adcc8667f7b316cce18d24) Signed-off-by: Andrew Or <andrew@databricks.com> 23 February 2016, 01:42:37 UTC
699644c [SPARK-12546][SQL] Change default number of open parquet files A common problem that users encounter with Spark 1.6.0 is that writing to a partitioned parquet table OOMs. The root cause is that parquet allocates a significant amount of memory that is not accounted for by our own mechanisms. As a workaround, we can ensure that only a single file is open per task unless the user explicitly asks for more. Author: Michael Armbrust <michael@databricks.com> Closes #11308 from marmbrus/parquetWriteOOM. (cherry picked from commit 173aa949c309ff7a7a03e9d762b9108542219a95) Signed-off-by: Michael Armbrust <michael@databricks.com> 22 February 2016, 23:27:41 UTC
16f35c4 [SPARK-13371][CORE][STRING] TaskSetManager.dequeueSpeculativeTask compares Option and String directly. ## What changes were proposed in this pull request? Fix some comparisons between unequal types that cause IJ warnings and, in at least one case, a likely bug (TaskSetManager). ## How was this patch tested? Running Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #11253 from srowen/SPARK-13371. (cherry picked from commit 78562535feb6e214520b29e0bbdd4b1302f01e93) Signed-off-by: Andrew Or <andrew@databricks.com> 18 February 2016, 20:14:41 UTC
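The bug class is easy to reproduce: comparing an `Option[String]` to a `String` with `==` compiles (with a warning) but is always false, so such a check can never match. A tiny sketch with an illustrative variable name:

```scala
// Sketch of the unequal-type comparison the patch removes.
val preferredHost: Option[String] = Some("host1")

println(preferredHost == "host1")              // always false: Option compared to String
println(preferredHost.exists(_ == "host1"))    // true: compare the wrapped value
println(preferredHost == Some("host1"))        // true: compare like types
```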
66106a6 [SPARK-13350][DOCS] Config doc updated to state that PYSPARK_PYTHON's default is "python2.7" Author: Christopher C. Aycock <chris@chrisaycock.com> Closes #11239 from chrisaycock/master. (cherry picked from commit a7c74d7563926573c01baf613708a0f105a03e57) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 17 February 2016, 19:24:51 UTC
98354ca [SPARK-13279] Remove O(n^2) operation from scheduler. This commit removes an unnecessary duplicate check in addPendingTask that meant that scheduling a task set took time proportional to (# tasks)^2. Author: Sital Kedia <skedia@fb.com> Closes #11175 from sitalkedia/fix_stuck_driver. (cherry picked from commit 1e1e31e03df14f2e7a9654e640fb2796cf059fe0) Signed-off-by: Kay Ousterhout <kayousterhout@gmail.com> 17 February 2016, 06:28:32 UTC
d950891 Correct SparseVector.parse documentation There's a small typo in the SparseVector.parse docstring (which says that it returns a DenseVector rather than a SparseVector), which seems to be incorrect. Author: Miles Yucht <miles@databricks.com> Closes #11213 from mgyucht/fix-sparsevector-docs. (cherry picked from commit 827ed1c06785692d14857bd41f1fd94a0853874a) Signed-off-by: Sean Owen <sowen@cloudera.com> 16 February 2016, 13:01:37 UTC
71f53ed [SPARK-13312][MLLIB] Update java train-validation-split example in ml-guide Response to JIRA https://issues.apache.org/jira/browse/SPARK-13312. This contribution is my original work and I license the work to this project. Author: JeremyNixon <jnixon2@gmail.com> Closes #11199 from JeremyNixon/update_train_val_split_example. (cherry picked from commit adb548365012552e991d51740bfd3c25abf0adec) Signed-off-by: Sean Owen <sowen@cloudera.com> 15 February 2016, 09:25:22 UTC
ec40c5a [SPARK-13300][DOCUMENTATION] Added pygments.rb dependency Looks like the pygments.rb gem is also required for the jekyll build to work. At least on Ubuntu/RHEL I could not do the build without this dependency, so I added it to the steps. Author: Amit Dev <amitdev@gmail.com> Closes #11180 from amitdev/master. (cherry picked from commit 331293c30242dc43e54a25171ca51a1c9330ae44) Signed-off-by: Sean Owen <sowen@cloudera.com> 14 February 2016, 11:41:37 UTC
107290c [SPARK-12363][MLLIB] Remove setRuns and fix failing PowerIterationClustering test JIRA: https://issues.apache.org/jira/browse/SPARK-12363 This issue was pointed out by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests fails. I found that some `dstAttr`s of the normalized graph are not the correct values but 0.0. Setting `TripletFields.All` in `mapTriplets` makes it work. Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #10539 from viirya/fix-poweriter. (cherry picked from commit e3441e3f68923224d5b576e6112917cf1fe1f89a) Signed-off-by: Xiangrui Meng <meng@databricks.com> 13 February 2016, 23:56:31 UTC
93a55f3 [SPARK-13142][WEB UI] Problem accessing Web UI /logPage/ on Microsoft Windows Due to being on a Windows platform I have been unable to run the tests as described in the "Contributing to Spark" instructions. As the change is only to two lines of code in the Web UI, which I have manually built and tested, I am submitting this pull request anyway. I hope this is OK. Is it worth considering also including this fix in any future 1.5.x releases (if any)? I confirm this is my own original work and license it to the Spark project under its open source license. Author: markpavey <mark.pavey@thefilter.com> Closes #11135 from markpavey/JIRA_SPARK-13142_WindowsWebUILogFix. (cherry picked from commit 374c4b2869fc50570a68819cf0ece9b43ddeb34b) Signed-off-by: Sean Owen <sowen@cloudera.com> 13 February 2016, 08:39:55 UTC
18661a2 [SPARK-13153][PYSPARK] ML persistence fails when handling a parameter with no default value Fix this defect by checking whether a default value exists or not. yanboliang Please help to review. Author: Tommy YU <tummyyu@163.com> Closes #11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue. (cherry picked from commit d3e2e202994e063856c192e9fdd0541777b88e0e) Signed-off-by: Xiangrui Meng <meng@databricks.com> 12 February 2016, 02:39:08 UTC
9d45ec4 [SPARK-13047][PYSPARK][ML] Pyspark Params.hasParam should not throw an error Pyspark Params class has a method `hasParam(paramName)` which returns `True` if the class has a parameter by that name, but throws an `AttributeError` otherwise. There is not currently a way of getting a Boolean to indicate if a class has a parameter. With Spark 2.0 we could modify the existing behavior of `hasParam` or add an additional method with this functionality. In Python: ```python from pyspark.ml.classification import NaiveBayes nb = NaiveBayes() print nb.hasParam("smoothing") print nb.hasParam("notAParam") ``` produces: > True > AttributeError: 'NaiveBayes' object has no attribute 'notAParam' However, in Scala: ```scala import org.apache.spark.ml.classification.NaiveBayes val nb = new NaiveBayes() nb.hasParam("smoothing") nb.hasParam("notAParam") ``` produces: > true > false cc holdenk Author: sethah <seth.hendrickson16@gmail.com> Closes #10962 from sethah/SPARK-13047. (cherry picked from commit b35467388612167f0bc3d17142c21a406f6c620d) Signed-off-by: Xiangrui Meng <meng@databricks.com> 12 February 2016, 00:42:52 UTC
91a5ca5 [SPARK-13265][ML] Refactoring of basic ML import/export for other file system besides HDFS jkbradley I tried to improve the function to export a model. When I tried to export a model to S3 under Spark 1.6, we couldn't do that. So, it should offer S3 besides HDFS. Can you review it when you have time? Thanks! Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #11151 from yu-iskw/SPARK-13265. (cherry picked from commit efb65e09bcfa4542348f5cd37fe5c14047b862e5) Signed-off-by: Xiangrui Meng <meng@databricks.com> 11 February 2016, 23:00:32 UTC
b57fac5 [SPARK-13274] Fix Aggregator Links on GroupedDataset Scala API Update Aggregator links to point to #org.apache.spark.sql.expressions.Aggregator Author: raela <raela@databricks.com> Closes #11158 from raelawang/master. (cherry picked from commit 719973b05ef6d6b9fbb83d76aebac6454ae84fad) Signed-off-by: Reynold Xin <rxin@databricks.com> 11 February 2016, 01:01:15 UTC
93f1d91 [SPARK-12921] Fix another non-reflective TaskAttemptContext access in SpecificParquetRecordReaderBase This is a minor followup to #10843 to fix one remaining place where we forgot to use reflective access of TaskAttemptContext methods. Author: Josh Rosen <joshrosen@databricks.com> Closes #11131 from JoshRosen/SPARK-12921-take-2. 10 February 2016, 19:02:41 UTC
89818cb [SPARK-10524][ML] Use the soft prediction to order categories' bins JIRA: https://issues.apache.org/jira/browse/SPARK-10524 Currently we use the hard prediction (`ImpurityCalculator.predict`) to order categories' bins. But we should use the soft prediction. Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Liang-Chi Hsieh <viirya@appier.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #8734 from viirya/dt-soft-centroids. (cherry picked from commit 9267bc68fab65c6a798e065a1dbe0f5171df3077) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 10 February 2016, 01:12:18 UTC
82fa864 [SPARK-12807][YARN] Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3 Patch to 1. Shade jackson 2.x in spark-yarn-shuffle JAR: core, databind, annotation 2. Use maven antrun to verify the JAR has the renamed classes Being Maven-based, I don't know if the verification phase kicks in on an SBT/jenkins build. It will on a `mvn install` Author: Steve Loughran <stevel@hortonworks.com> Closes #10780 from steveloughran/stevel/patches/SPARK-12807-master-shuffle. (cherry picked from commit 34d0b70b309f16af263eb4e6d7c36e2ea170bc67) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 09 February 2016, 19:02:13 UTC
9b30096 [SPARK-13210][SQL] Catch OOM when allocating memory and expanding the array There is a bug when we try to grow the buffer: the OOM is wrongly ignored (the assert is also skipped by the JVM), then we try to grow the array again, which triggers spilling and frees the current page, so the record we just inserted becomes invalid. The root cause is that the JVM has less free memory than the MemoryManager thought, so it OOMs when allocating a page without triggering spilling. We should catch the OOM and acquire memory again to trigger spilling. Also, we should not grow the array in `insertRecord` of `InMemorySorter` (it was there just for easy testing). Author: Davies Liu <davies@databricks.com> Closes #11095 from davies/fix_expand. 08 February 2016, 20:11:37 UTC
3ca5dc3 [SPARK-13214][DOCS] update dynamicAllocation documentation Author: Bill Chambers <bill@databricks.com> Closes #11094 from anabranch/dynamic-docs. (cherry picked from commit 66e1383de2650a0f06929db8109a02e32c5eaf6b) Signed-off-by: Andrew Or <andrew@databricks.com> 05 February 2016, 22:35:48 UTC
a907c7c [SPARK-13195][STREAMING] Fix NoSuchElementException when a state is not set but timeoutThreshold is defined Check the state Existence before calling get. Author: Shixiong Zhu <shixiong@databricks.com> Closes #11081 from zsxwing/SPARK-13195. (cherry picked from commit 8e2f296306131e6c7c2f06d6672995d3ff8ab021) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 04 February 2016, 20:43:25 UTC
2f390d3 [ML][DOC] fix wrong api link in ml onevsrest minor fix for api link in ml onevsrest Author: Yuhao Yang <hhbyyh@gmail.com> Closes #11068 from hhbyyh/onevsrestDoc. (cherry picked from commit c2c956bcd1a75fd01868ee9ad2939a6d3de52bc2) Signed-off-by: Xiangrui Meng <meng@databricks.com> 04 February 2016, 05:20:22 UTC
cdfb2a1 [SPARK-13101][SQL][BRANCH-1.6] nullability of array type element should not fail analysis of encoder nullability should only be considered as an optimization rather than part of the type system, so instead of failing analysis for mismatch nullability, we should pass analysis and add runtime null check. backport https://github.com/apache/spark/pull/11035 to 1.6 Author: Wenchen Fan <wenchen@databricks.com> Closes #11042 from cloud-fan/branch-1.6. 04 February 2016, 00:13:23 UTC
5fe8796 [SPARK-12739][STREAMING] Details of batch in Streaming tab uses two Duration columns I have clearly prefixed the two 'Duration' columns in the 'Details of Batch' Streaming tab as 'Output Op Duration' and 'Job Duration'. Author: Mario Briggs <mario.briggs@in.ibm.com> Author: mariobriggs <mariobriggs@in.ibm.com> Closes #11022 from mariobriggs/spark-12739. (cherry picked from commit e9eb248edfa81d75f99c9afc2063e6b3d9ee7392) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 03 February 2016, 17:50:47 UTC
2f8abb4 [SPARK-13122] Fix race condition in MemoryStore.unrollSafely() https://issues.apache.org/jira/browse/SPARK-13122 A race condition can occur in MemoryStore's unrollSafely() method if two threads that return the same value for currentTaskAttemptId() execute this method concurrently. This change makes the operation of reading the initial amount of unroll memory used, performing the unroll, and updating the associated memory maps atomic in order to avoid this race condition. The initial proposed fix wraps all of unrollSafely() in a memoryManager.synchronized { } block. A cleaner approach might be to introduce a mechanism that synchronizes based on task attempt ID. An alternative option might be to track unroll/pending unroll memory based on block ID rather than task attempt ID. Author: Adam Budde <budde@amazon.com> Closes #11012 from budde/master. (cherry picked from commit ff71261b651a7b289ea2312abd6075da8b838ed9) Signed-off-by: Andrew Or <andrew@databricks.com> Conflicts: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala 03 February 2016, 03:36:52 UTC
e81333b [DOCS] Update StructType.scala The example will throw error like <console>:20: error: not found: value StructType Need to add this line: import org.apache.spark.sql.types._ Author: Kevin (Sangwoo) Kim <sangwookim.me@gmail.com> Closes #10141 from swkimme/patch-1. (cherry picked from commit b377b03531d21b1d02a8f58b3791348962e1f31b) Signed-off-by: Michael Armbrust <michael@databricks.com> 02 February 2016, 21:24:37 UTC
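For context, a minimal sketch of the kind of shell snippet the doc fix is about: without the types import, `StructType` and friends are not in scope. The schema below is illustrative, not taken from the Spark documentation:

```scala
// Illustrative schema; the import line is the one the doc fix adds.
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = false)
))
println(schema.treeString)
```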
3c92333 [SPARK-13056][SQL] map column would throw NPE if value is null Jira: https://issues.apache.org/jira/browse/SPARK-13056 Create a map like { "a": "somestring", "b": null }. For a query like SELECT col["b"] FROM t1, an NPE would be thrown. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #10964 from adrian-wang/npewriter. (cherry picked from commit 358300c795025735c3b2f96c5447b1b227d4abc1) Signed-off-by: Michael Armbrust <michael@databricks.com> Conflicts: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 02 February 2016, 19:31:04 UTC
9c0cf22 [SPARK-12711][ML] ML StopWordsRemover does not protect itself from column name duplication Fixes problem and verifies fix by test suite. Also - adds optional parameter: nullable (Boolean) to: SchemaUtils.appendColumn and deduplicates SchemaUtils.appendColumn functions. Author: Grzegorz Chilkiewicz <grzegorz.chilkiewicz@codilime.com> Closes #10741 from grzegorz-chilkiewicz/master. (cherry picked from commit b1835d727234fdff42aa8cadd17ddcf43b0bed15) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 02 February 2016, 19:16:44 UTC
4c28b4c [SPARK-13121][STREAMING] java mapWithState mishandles scala Option The Java mapWithState with Function3 has a wrong conversion of Java `Optional` to Scala `Option`; the fixed code uses the same conversion used in the mapWithState call that takes Function4 as input. `Optional.fromNullable(v.get)` fails if v is `None`; it is better to use `JavaUtils.optionToOptional(v)` instead. Author: Gabriele Nizzoli <mail@nizzoli.net> Closes #11007 from gabrielenizzoli/branch-1.6. 02 February 2016, 18:57:18 UTC
53f518a [SPARK-12629][SPARKR] Fixes for DataFrame saveAsTable method I've tried to solve some of the issues mentioned in: https://issues.apache.org/jira/browse/SPARK-12629 Please, let me know what do you think. Thanks! Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com> Closes #10580 from NarineK/sparkrSavaAsRable. (cherry picked from commit 8a88e121283472c26e70563a4e04c109e9b183b3) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 02 February 2016, 18:53:24 UTC
9a3d1bd [SPARK-12780][ML][PYTHON][BACKPORT] Inconsistency returning value of ML python models' properties Backport of [SPARK-12780] for branch-1.6 Original PR for master: https://github.com/apache/spark/pull/10724 This fixes StringIndexerModel.labels in pyspark. Author: Xusen Yin <yinxusen@gmail.com> Closes #10950 from jkbradley/yinxusen-spark-12780-backport. 02 February 2016, 18:21:21 UTC
99594b2 [SPARK-13094][SQL] Add encoders for seq/array of primitives Author: Michael Armbrust <michael@databricks.com> Closes #11014 from marmbrus/seqEncoders. (cherry picked from commit 29d92181d0c49988c387d34e4a71b1afe02c29e2) Signed-off-by: Michael Armbrust <michael@databricks.com> 02 February 2016, 18:15:53 UTC
bd8efba [SPARK-13087][SQL] Fix group by function for sort based aggregation It is not valid to call `toAttribute` on a `NamedExpression` unless we know for sure that the child produced that `NamedExpression`. The current code worked fine when the grouping expressions were simple, but when they were a derived value this blew up at execution time. Author: Michael Armbrust <michael@databricks.com> Closes #11011 from marmbrus/groupByFunction. 02 February 2016, 08:51:07 UTC
70fcbf6 [SPARK-11780][SQL] Add catalyst type aliases backwards compatibility Changed a target at branch-1.6 from #10635. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #10915 from maropu/pr9935-v3. 01 February 2016, 20:13:17 UTC
215d5d8 [DOCS] Fix the jar location of datanucleus in sql-programming-guid.md ISTM `lib` is better because `datanucleus` jars are located in `lib` for release builds. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #10901 from maropu/DocFix. (cherry picked from commit da9146c91a33577ff81378ca7e7c38a4b1917876) Signed-off-by: Michael Armbrust <michael@databricks.com> 01 February 2016, 20:02:28 UTC
9a5b25d [SPARK-12989][SQL] Delaying Alias Cleanup after ExtractWindowExpressions JIRA: https://issues.apache.org/jira/browse/SPARK-12989 In the rule `ExtractWindowExpressions`, we simply replace alias by the corresponding attribute. However, this will cause an issue exposed by the following case: ```scala val data = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num") .withColumn("Data", struct("A", "B", "C")) .drop("A") .drop("B") .drop("C") val winSpec = Window.partitionBy("Data.A", "Data.B").orderBy($"num".desc) data.select($"*", max("num").over(winSpec) as "max").explain(true) ``` In this case, both `Data.A` and `Data.B` are `alias` in `WindowSpecDefinition`. If we replace these alias expression by their alias names, we are unable to know what they are since they will not be put in `missingExpr` too. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #10963 from gatorsmile/seletStarAfterColDrop. (cherry picked from commit 33c8a490f7f64320c53530a57bd8d34916e3607c) Signed-off-by: Michael Armbrust <michael@databricks.com> 01 February 2016, 19:22:26 UTC
ddb9633 [SPARK-12231][SQL] create a combineFilters' projection when we call buildPartitionedTableScan Hello Michael & All: We have some issues to submit the new codes in the other PR(#10299), so we closed that PR and open this one with the fix. The reason for the previous failure is that the projection for the scan when there is a filter that is not pushed down (the "left-over" filter) could be different, in elements or ordering, from the original projection. With this new codes, the approach to solve this problem is: Insert a new Project if the "left-over" filter is nonempty and (the original projection is not empty and the projection for the scan has more than one elements which could otherwise cause different ordering in projection). We create 3 test cases to cover the otherwise failure cases. Author: Kevin Yu <qyu@us.ibm.com> Closes #10388 from kevinyu98/spark-12231. (cherry picked from commit fd50df413fbb3b7528cdff311cc040a6212340b9) Signed-off-by: Cheng Lian <lian@databricks.com> 01 February 2016, 00:16:34 UTC
bb01cbe [SPARK-13088] Fix DAG viz in latest version of chrome Apparently chrome removed `SVGElement.prototype.getTransformToElement`, which is used by our JS library dagre-d3 when creating edges. The real diff can be found here: https://github.com/andrewor14/dagre-d3/commit/7d6c0002e4c74b82a02c5917876576f71e215590, which is taken from the fix in the main repo: https://github.com/cpettitt/dagre-d3/commit/1ef067f1c6ad2e0980f6f0ca471bce998784b7b2 Upstream issue: https://github.com/cpettitt/dagre-d3/issues/202 Author: Andrew Or <andrew@databricks.com> Closes #10986 from andrewor14/fix-dag-viz. (cherry picked from commit 70e69fc4dd619654f5d24b8b84f6a94f7705c59b) Signed-off-by: Andrew Or <andrew@databricks.com> 30 January 2016, 02:00:57 UTC
84dab72 [SPARK-13082][PYSPARK] Backport the fix of 'read.json(rdd)' in #10559 to branch-1.6 SPARK-13082 actually fixed by #10559. However, it's a big PR and not backported to 1.6. This PR just backported the fix of 'read.json(rdd)' to branch-1.6. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10988 from zsxwing/json-rdd. 29 January 2016, 21:53:11 UTC
96e32db [SPARK-10847][SQL][PYSPARK] Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure The error message is now changed from "Do not support type class scala.Tuple2." to "Do not support type class org.json4s.JsonAST$JNull$" to be more informative about what is not supported. Also, StructType metadata now handles JNull correctly, i.e., {'a': None}. test_metadata_null is added to tests.py to show the fix works. Author: Jason Lee <cjlee@us.ibm.com> Closes #8969 from jasoncl/SPARK-10847. (cherry picked from commit edd473751b59b55fa3daede5ed7bc19ea8bd7170) Signed-off-by: Yin Huai <yhuai@databricks.com> 27 January 2016, 17:55:24 UTC
17d1071 [SPARK-12834][ML][PYTHON][BACKPORT] Change ser/de of JavaArray and JavaList Backport of SPARK-12834 for branch-1.6 Original PR: https://github.com/apache/spark/pull/10772 Original commit message: We use `SerDe.dumps()` to serialize `JavaArray` and `JavaList` in `PythonMLLibAPI`, then deserialize them with `PickleSerializer` in Python side. However, there is no need to transform them in such an inefficient way. Instead of it, we can use type conversion to convert them, e.g. `list(JavaArray)` or `list(JavaList)`. What's more, there is an issue to Ser/De Scala Array as I said in https://issues.apache.org/jira/browse/SPARK-12780 Author: Xusen Yin <yinxusen@gmail.com> Closes #10941 from jkbradley/yinxusen-SPARK-12834-1.6. 27 January 2016, 08:32:52 UTC
85518ed [SPARK-12611][SQL][PYSPARK][TESTS] Fix test_infer_schema_to_local Previously (when the PR was first created) not specifying b= explicitly was fine (and treated as default null) - instead be explicit about b being None in the test. Author: Holden Karau <holden@us.ibm.com> Closes #10564 from holdenk/SPARK-12611-fix-test-infer-schema-local. (cherry picked from commit 13dab9c3862cc454094cd9ba7b4504a2d095028f) Signed-off-by: Yin Huai <yhuai@databricks.com> 26 January 2016, 20:05:09 UTC
6ce3dd9 [SPARK-12682][SQL][HOT-FIX] Fix test compilation Author: Yin Huai <yhuai@databricks.com> Closes #10925 from yhuai/branch-1.6-hot-fix. 26 January 2016, 16:34:10 UTC
f0c98a6 [SPARK-12682][SQL] Add support for (optionally) not storing tables in hive metadata format This PR adds a new table option (`skip_hive_metadata`) that'd allow the user to skip storing the table metadata in hive metadata format. While this could be useful in general, the specific use-case for this change is that Hive doesn't handle wide schemas well (see https://issues.apache.org/jira/browse/SPARK-12682 and https://issues.apache.org/jira/browse/SPARK-6024) which in turn prevents such tables from being queried in SparkSQL. Author: Sameer Agarwal <sameer@databricks.com> Closes #10826 from sameeragarwal/skip-hive-metadata. (cherry picked from commit 08c781ca672820be9ba32838bbe40d2643c4bde4) Signed-off-by: Yin Huai <yhuai@databricks.com> 26 January 2016, 15:50:58 UTC
572bc39 [SPARK-12961][CORE] Prevent snappy-java memory leak JIRA: https://issues.apache.org/jira/browse/SPARK-12961 To prevent a memory leak in snappy-java, just call the method once and cache the result. After the library releases a new version, we can remove this object. JoshRosen Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10875 from viirya/prevent-snappy-memory-leak. (cherry picked from commit 5936bf9fa85ccf7f0216145356140161c2801682) Signed-off-by: Sean Owen <sowen@cloudera.com> 26 January 2016, 11:36:12 UTC
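The pattern described ("call the method once and cache the result") can be sketched as a lazily initialized, JVM-wide cache; this is illustrative only, not the actual SnappyCompressionCodec change, and the object name below is made up:

```scala
// Illustrative pattern: evaluate the leaky call at most once per JVM and reuse the result.
object SnappyVersionHolder {
  // lazy val: computed on first access, cached afterwards
  lazy val version: String = org.xerial.snappy.Snappy.getNativeLibraryVersion
}

println(SnappyVersionHolder.version)  // subsequent callers reuse the cached string
```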
b40e58c [SPARK-12755][CORE] Stop the event logger before the DAG scheduler [SPARK-12755][CORE] Stop the event logger before the DAG scheduler to avoid a race condition where the standalone master attempts to build the app's history UI before the event log is stopped. This contribution is my original work, and I license this work to the Spark project under the project's open source license. Author: Michael Allman <michael@videoamp.com> Closes #10700 from mallman/stop_event_logger_first. (cherry picked from commit 4ee8191e57cb823a23ceca17908af86e70354554) Signed-off-by: Sean Owen <sowen@cloudera.com> 25 January 2016, 09:51:51 UTC
88114d3 [SPARK-12932][JAVA API] improved error message for java type inference failure Author: Andy Grove <andygrove73@gmail.com> Closes #10865 from andygrove/SPARK-12932. (cherry picked from commit d8e480521e362bc6bc5d8ebcea9b2d50f72a71b9) Signed-off-by: Sean Owen <sowen@cloudera.com> 25 January 2016, 09:22:19 UTC
88614dd [SPARK-12624][PYSPARK] Checks row length when converting Java arrays to Python rows When actual row length doesn't conform to specified schema field length, we should give a better error message instead of throwing an unintuitive `ArrayOutOfBoundsException`. Author: Cheng Lian <lian@databricks.com> Closes #10886 from liancheng/spark-12624. (cherry picked from commit 3327fd28170b549516fee1972dc6f4c32541591b) Signed-off-by: Yin Huai <yhuai@databricks.com> 25 January 2016, 03:40:47 UTC
f913f7e [SPARK-12120][PYSPARK] Improve exception message when failing to init… …ialize HiveContext in PySpark davies Mind to review ? This is the error message after this PR ``` 15/12/03 16:59:53 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException /Users/jzhang/github/spark/python/pyspark/sql/context.py:689: UserWarning: You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly warnings.warn("You must build Spark with Hive. " Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 663, in read return DataFrameReader(self) File "/Users/jzhang/github/spark/python/pyspark/sql/readwriter.py", line 56, in __init__ self._jreader = sqlContext._ssql_ctx.read() File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 692, in _ssql_ctx raise e py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. : java.lang.RuntimeException: java.net.ConnectException: Call From jzhangMBPr.local/127.0.0.1 to 0.0.0.0:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522) at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:194) at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238) at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218) at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208) at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462) at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461) at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:40) at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:330) at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90) at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) at py4j.Gateway.invoke(Gateway.java:214) at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79) at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68) at py4j.GatewayConnection.run(GatewayConnection.java:209) at java.lang.Thread.run(Thread.java:745) ``` Author: Jeff Zhang <zjffdu@apache.org> Closes #10126 from zjffdu/SPARK-12120. (cherry picked from commit e789b1d2c1eab6187f54424ed92697ca200c3101) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 24 January 2016, 20:31:16 UTC
f13a3d1 [SPARK-12760][DOCS] inaccurate description for difference between local vs cluster mode in closure handling Clarify that modifying a driver local variable won't have the desired effect in cluster modes, and may or may not work as intended in local mode Author: Sean Owen <sowen@cloudera.com> Closes #10866 from srowen/SPARK-12760. (cherry picked from commit aca2a0165405b9eba27ac5e4739e36a618b96676) Signed-off-by: Sean Owen <sowen@cloudera.com> 23 January 2016, 11:45:21 UTC
e8ae242 [SPARK-12760][DOCS] invalid lambda expression in python example for local vs cluster srowen thanks for the PR at https://github.com/apache/spark/pull/10866! sorry it took me a while. This is related to https://github.com/apache/spark/pull/10866, basically the assignment in the lambda expression in the python example is actually invalid ``` In [1]: data = [1, 2, 3, 4, 5] In [2]: counter = 0 In [3]: rdd = sc.parallelize(data) In [4]: rdd.foreach(lambda x: counter += x) File "<ipython-input-4-fcb86c182bad>", line 1 rdd.foreach(lambda x: counter += x) ^ SyntaxError: invalid syntax ``` Author: Mortada Mehyar <mortada.mehyar@gmail.com> Closes #10867 from mortada/doc_python_fix. (cherry picked from commit 56f57f894eafeda48ce118eec16ecb88dbd1b9dc) Signed-off-by: Sean Owen <sowen@cloudera.com> 23 January 2016, 11:36:43 UTC
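For completeness, the supported way to do what the broken example attempts is an accumulator rather than a mutated driver-local variable; a hedged Scala sketch using the 1.x accumulator API (assumes a spark-shell `sc`):

```scala
// Sketch: accumulate on the executors, read the result back on the driver.
val data = Seq(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data)

val counter = sc.accumulator(0, "counter")   // Spark 1.x accumulator API
rdd.foreach(x => counter += x)

println(counter.value)                       // 15, visible on the driver
```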
dca238a [SPARK-12859][STREAMING][WEB UI] Names of input streams with receivers don't fit in Streaming page Added CSS style to force names of input streams with receivers to wrap Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #10873 from ajbozarth/spark12859. (cherry picked from commit 358a33bbff549826b2336c317afc7274bdd30fdb) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp> 23 January 2016, 11:20:18 UTC
b5d7dbe [SPARK-12747][SQL] Use correct type name for Postgres JDBC's real array https://issues.apache.org/jira/browse/SPARK-12747 Postgres JDBC driver uses "FLOAT4" or "FLOAT8" not "real". Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10695 from viirya/fix-postgres-jdbc. (cherry picked from commit 55c7dd031b8a58976922e469626469aa4aff1391) Signed-off-by: Reynold Xin <rxin@databricks.com> 22 January 2016, 02:55:39 UTC
40fa218 [SPARK-12921] Use SparkHadoopUtil reflection in SpecificParquetRecordReaderBase It looks like there's one place left in the codebase, SpecificParquetRecordReaderBase, where we didn't use SparkHadoopUtil's reflective accesses of TaskAttemptContext methods, which could create problems when using a single Spark artifact with both Hadoop 1.x and 2.x. Author: Josh Rosen <joshrosen@databricks.com> Closes #10843 from JoshRosen/SPARK-12921. 21 January 2016, 00:10:28 UTC
962e618 [MLLIB] Fix CholeskyDecomposition assertion's message Change the assertion's message so it's consistent with the code. The old message says that the invoked method was lapack.dports, whereas in fact it was the lapack.dppsv method. Author: Wojciech Jurczyk <wojtek.jurczyk@gmail.com> Closes #10818 from wjur/wjur/rename_error_message. (cherry picked from commit ebd9ce0f1f55f7d2d3bd3b92c4b0a495c51ac6fd) Signed-off-by: Sean Owen <sowen@cloudera.com> 19 January 2016, 09:36:55 UTC
30f55e5 [SQL][MINOR] Fix one little mismatched comment according to the codes in interface.scala Author: proflin <proflin.me@gmail.com> Closes #10824 from proflin/master. (cherry picked from commit c00744e60f77edb238aff1e30b450dca65451e91) Signed-off-by: Reynold Xin <rxin@databricks.com> 19 January 2016, 08:15:50 UTC
68265ac [SPARK-12841][SQL][BRANCH-1.6] fix cast in filter In SPARK-10743 we wrap cast with `UnresolvedAlias` to give `Cast` a better alias if possible. However, for cases like filter, the `UnresolvedAlias` can't be resolved and actually we don't need a better alias for this case. This PR move the cast wrapping logic to `Column.named` so that we will only do it when we need a alias name. backport https://github.com/apache/spark/pull/10781 to 1.6 Author: Wenchen Fan <wenchen@databricks.com> Closes #10819 from cloud-fan/bug. 19 January 2016, 05:20:19 UTC
d43704d [SPARK-12894][DOCUMENT] Add deploy instructions for Python in Kinesis integration doc This PR added instructions to get Kinesis assembly jar for Python users in the Kinesis integration page like Kafka doc. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10822 from zsxwing/kinesis-doc. (cherry picked from commit 721845c1b64fd6e3b911bd77c94e01dc4e5fd102) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 19 January 2016, 00:50:36 UTC
7482c7b [SPARK-12814][DOCUMENT] Add deploy instructions for Python in flume integration doc This PR added instructions to get flume assembly jar for Python users in the flume integration page like Kafka doc. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10746 from zsxwing/flume-doc. (cherry picked from commit a973f483f6b819ed4ecac27ff5c064ea13a8dd71) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 18 January 2016, 23:38:16 UTC
8c2b67f [SPARK-12346][ML] Missing attribute names in GLM for vector-type features Currently `summary()` fails on a GLM model fitted over a vector feature missing ML attrs, since the output feature attrs will also have no name. We can avoid this situation by forcing `VectorAssembler` to make up suitable names when inputs are missing names. cc mengxr Author: Eric Liang <ekl@databricks.com> Closes #10323 from ericl/spark-12346. (cherry picked from commit 5e492e9d5bc0992cbcffe64a9aaf3b334b173d2c) Signed-off-by: Xiangrui Meng <meng@databricks.com> 18 January 2016, 20:51:06 UTC
53184ce [SPARK-12558][FOLLOW-UP] AnalysisException when multiple functions applied in GROUP BY clause Addresses the comments from Yin. https://github.com/apache/spark/pull/10520 Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #10758 from dilipbiswal/spark-12558-followup. (cherry picked from commit db9a860589bfc4f80d6cdf174a577ca538b82e6d) Signed-off-by: Yin Huai <yhuai@databricks.com> Conflicts: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala 18 January 2016, 18:29:15 UTC
5803fce [SPARK-12722][DOCS] Fixed typo in Pipeline example http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline ``` val sameModel = Pipeline.load("/tmp/spark-logistic-regression-model") ``` should be ``` val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model") ``` cc: jkbradley Author: Jeff Lam <sha0lin@alumni.carnegiemellon.edu> Closes #10769 from Agent007/SPARK-12722. (cherry picked from commit 86972fa52152d2149b88ba75be048a6986006285) Signed-off-by: Sean Owen <sowen@cloudera.com> 16 January 2016, 10:41:50 UTC
7733668 [SPARK-12701][CORE] FileAppender should use join to ensure writing thread completion Changed Logging FileAppender to use join in `awaitTermination` to ensure that thread is properly finished before returning. Author: Bryan Cutler <cutlerb@gmail.com> Closes #10654 from BryanCutler/fileAppender-join-thread-SPARK-12701. (cherry picked from commit ea104b8f1ce8aa109d1b16b696a61a47df6283b2) Signed-off-by: Sean Owen <sowen@cloudera.com> 15 January 2016, 20:11:31 UTC
5a00528 [SPARK-11031][SPARKR] Method str() on a DataFrame Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com> Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu> Author: Oscar D. Lara Yejas <oscar.lara.yejas@us.ibm.com> Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net> Closes #9613 from olarayej/SPARK-11031. (cherry picked from commit ba4a641902f95c5a9b3a6bebcaa56039eca2720d) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 15 January 2016, 15:38:18 UTC
d23e57d [SPARK-12708][UI] Sorting task error in Stages Page in yarn mode. If the sort column contains a slash (e.g. "Executor ID / Host") in yarn mode, sorting fails with the following message. ![spark-12708](https://cloud.githubusercontent.com/assets/6679275/12193320/80814f8c-b62a-11e5-9914-7bf3907029df.png) It's similar to SPARK-4313. Author: root <root@R520T1.(none)> Author: Koyo Yoshida <koyo0615@gmail.com> Closes #10663 from yoshidakuy/SPARK-12708. (cherry picked from commit 32cca933546b4aaf0fc040b9cfd1a5968171b423) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp> 15 January 2016, 04:35:41 UTC
d1855ad [SPARK-12784][UI] Fix Spark UI IndexOutOfBoundsException with dynamic allocation Add `listener.synchronized` to get `storageStatusList` and `execInfo` atomically. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10728 from zsxwing/SPARK-12784. (cherry picked from commit 501e99ef0fbd2f2165095548fe67a3447ccbfc91) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 14 January 2016, 17:51:07 UTC
0c67993 [SPARK-9844][CORE] File appender race condition during shutdown When an Executor process is destroyed, the FileAppender that is asynchronously reading the stderr stream of the process can throw an IOException during read because the stream is closed. Before the ExecutorRunner destroys the process, the FileAppender thread is flagged to stop. This PR wraps the inputStream.read call of the FileAppender in a try/catch block so that if an IOException is thrown and the thread has been flagged to stop, it will safely ignore the exception. Additionally, the FileAppender thread was changed to use Utils.tryWithSafeFinally to better log any exception that do occur. Added unit tests to verify a IOException is thrown and logged if FileAppender is not flagged to stop, and that no IOException when the flag is set. Author: Bryan Cutler <cutlerb@gmail.com> Closes #10714 from BryanCutler/file-appender-read-ioexception-SPARK-9844. (cherry picked from commit 56cdbd654d54bf07a063a03a5c34c4165818eeb2) Signed-off-by: Sean Owen <sowen@cloudera.com> 14 January 2016, 11:05:09 UTC
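The general shape of the fix, sketched loosely (not the actual FileAppender code): wrap the blocking read in try/catch and ignore an IOException only when the appender has already been asked to stop. Names and structure below are made up for the sketch:

```scala
// Illustrative pattern only; not the FileAppender implementation.
import java.io.{IOException, InputStream, OutputStream}

def copyUntilStopped(in: InputStream, out: OutputStream, stopped: () => Boolean): Unit = {
  val buf = new Array[Byte](8192)
  var n = 0
  try {
    while (n != -1 && !stopped()) {
      n = in.read(buf)                       // may throw once the process stream is closed
      if (n > 0) out.write(buf, 0, n)
    }
  } catch {
    // Swallow the error only if we were already told to stop; otherwise let it surface.
    case _: IOException if stopped() => ()
  }
}
```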
a490787 [SPARK-12026][MLLIB] ChiSqTest gets slower and slower over time when number of features is large jira: https://issues.apache.org/jira/browse/SPARK-12026 The issue is valid as features.toArray.view.zipWithIndex.slice(startCol, endCol) becomes slower as startCol gets larger. I tested on local and the change can improve the performance and the running time was stable. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #10146 from hhbyyh/chiSq. (cherry picked from commit 021dafc6a05a31dc22c9f9110dedb47a1f913087) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 14 January 2016, 01:43:38 UTC
26f13fa [SPARK-12690][CORE] Fix NPE in UnsafeInMemorySorter.free() I hit the exception below. The `UnsafeKVExternalSorter` does pass `null` as the consumer when creating an `UnsafeInMemorySorter`. Normally the NPE doesn't occur because the `inMemSorter` is set to null later and the `free()` method is not called. It happens when there is another exception like OOM thrown before setting `inMemSorter` to null. Anyway, we can add the null check to avoid it. ``` ERROR spark.TaskContextImpl: Error in TaskCompletionListener java.lang.NullPointerException at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.free(UnsafeInMemorySorter.java:110) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.cleanupResources(UnsafeExternalSorter.java:288) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$1.onTaskCompletion(UnsafeExternalSorter.java:141) at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:79) at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:77) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:77) at org.apache.spark.scheduler.Task.run(Task.scala:91) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) ``` Author: Carson Wang <carson.wang@intel.com> Closes #10637 from carsonwang/FixNPE. (cherry picked from commit eabc7b8ee7e809bab05361ed154f87bff467bd88) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 13 January 2016, 21:29:18 UTC
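The fix itself is a plain null guard; a hedged Scala sketch of the shape (the real class is Java and the names below are made up):

```scala
// Illustrative null-guard pattern mirroring the fix described above.
class ArrayHolder { def freeArray(): Unit = println("freed") }

class Sorter(var consumer: ArrayHolder) {
  def free(): Unit = {
    if (consumer != null) {   // added guard: consumer can legitimately be null here
      consumer.freeArray()
    }
  }
}

new Sorter(null).free()              // no longer throws a NullPointerException
new Sorter(new ArrayHolder).free()   // prints "freed"
```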
cf6d506 [SPARK-12268][PYSPARK] Make pyspark shell pythonstartup work under python3 This replaces the `execfile` used for running custom python shell scripts with explicit open, compile and exec (as recommended by 2to3). The reason for this change is to make the pythonstartup option compatible with python3. Author: Erik Selin <erik.selin@gmail.com> Closes #10255 from tyro89/pythonstartup-python3. (cherry picked from commit e4e0b3f7b2945aae5ec7c3d68296010bbc5160cf) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 13 January 2016, 20:22:21 UTC
364f799 [SPARK-12685][MLLIB][BACKPORT TO 1.4] word2vec trainWordsCount gets overflow jira: https://issues.apache.org/jira/browse/SPARK-12685 master PR: https://github.com/apache/spark/pull/10627 the log of word2vec reports trainWordsCount = -785727483 during computation over a large dataset. Update the priority as it will affect the computation process. alpha = learningRate * (1 - numPartitions * wordCount.toDouble / (trainWordsCount + 1)) Author: Yuhao Yang <hhbyyh@gmail.com> Closes #10721 from hhbyyh/branch-1.4. (cherry picked from commit 7bd2564192f51f6229cf759a2bafc22134479955) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 13 January 2016, 19:54:02 UTC
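The overflow symptom is easy to reproduce in isolation: accumulating a large word count in an `Int` wraps negative past 2^31, while a `Long` does not. A sketch with made-up numbers:

```scala
// Illustrative numbers only; not taken from the actual Word2Vec run.
val wordsPerPass = 300000000   // 3e8 words
val passes = 8

val asInt: Int   = (1 to passes).map(_ => wordsPerPass).sum          // wraps negative
val asLong: Long = (1 to passes).map(_ => wordsPerPass.toLong).sum   // 2400000000

println(s"Int total  = $asInt")    // -1894967296: the same class of negative count as the report
println(s"Long total = $asLong")
```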
f9ecd3a [SPARK-12805][MESOS] Fixes documentation on Mesos run modes The default run has changed, but the documentation didn't fully reflect the change. Author: Luc Bourlier <luc.bourlier@typesafe.com> Closes #10740 from skyluc/issue/mesos-modes-doc. (cherry picked from commit cc91e21879e031bcd05316eabb856e67a51b191d) Signed-off-by: Reynold Xin <rxin@databricks.com> 13 January 2016, 19:45:30 UTC
dcdc864 [SPARK-12558][SQL] AnalysisException when multiple functions applied in GROUP BY clause cloud-fan Can you please take a look ? In this case, we are failing during check analysis while validating the aggregation expression. I have added a semanticEquals for HiveGenericUDF to fix this. Please let me know if this is the right way to address this issue. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #10520 from dilipbiswal/spark-12558. (cherry picked from commit dc7b3870fcfc2723319dbb8c53d721211a8116be) Signed-off-by: Yin Huai <yhuai@databricks.com> Conflicts: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala 13 January 2016, 05:45:16 UTC
f71e5cc [HOT-FIX] bypass hive test when parse logical plan to json https://github.com/apache/spark/pull/10311 introduces some rare, non-deterministic flakiness for hive udf tests, see https://github.com/apache/spark/pull/10311#issuecomment-166548851 I can't reproduce it locally, and may need more time to investigate, a quick solution is: bypass hive tests for json serialization. Author: Wenchen Fan <wenchen@databricks.com> Closes #10430 from cloud-fan/hot-fix. (cherry picked from commit 8543997f2daa60dfa0509f149fab207de98145a0) Signed-off-by: Michael Armbrust <michael@databricks.com> 13 January 2016, 02:53:25 UTC
03e523e Revert "[SPARK-12645][SPARKR] SparkR support hash function" This reverts commit 8b5f23043322254c725c703c618ba3d3cc4a4240. 12 January 2016, 23:15:10 UTC
94b39f7 [SPARK-7615][MLLIB] MLLIB Word2Vec wordVectors divided by Euclidean Norm equals to zero Cosine similarity with 0 vector should be 0 Related to https://github.com/apache/spark/pull/10152 Author: Sean Owen <sowen@cloudera.com> Closes #10696 from srowen/SPARK-7615. (cherry picked from commit c48f2a3a5fd714ad2ff19b29337e55583988431e) Signed-off-by: Sean Owen <sowen@cloudera.com> 12 January 2016, 13:27:44 UTC
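The guarded computation is simple to sketch: if either norm is zero, return 0 instead of dividing by it. This is illustrative, not the actual Word2VecModel code:

```scala
// Illustrative cosine similarity with a zero-norm guard.
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same length")
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

println(cosineSimilarity(Array(1.0, 0.0), Array(0.0, 0.0)))  // 0.0 rather than NaN
```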
4c67d55 [SPARK-5273][MLLIB][DOCS] Improve documentation examples for LinearRegression Use a much smaller step size in LinearRegressionWithSGD MLlib examples to achieve a reasonable RMSE. Our training folks hit this exact same issue when concocting an example and had the same solution. Author: Sean Owen <sowen@cloudera.com> Closes #10675 from srowen/SPARK-5273. (cherry picked from commit 9c7f34af37ef328149c1d66b4689d80a1589e1cc) Signed-off-by: Sean Owen <sowen@cloudera.com> 12 January 2016, 13:26:37 UTC
3221a7d [SPARK-12582][TEST] IndexShuffleBlockResolverSuite fails in Windows * IndexShuffleBlockResolverSuite fails in Windows because a file is not closed. * Move IndexShuffleBlockResolverSuite.scala from "test/java" to "test/scala". https://issues.apache.org/jira/browse/SPARK-12582 Author: Yucai Yu <yucai.yu@intel.com> Closes #10526 from yucai/master. (cherry picked from commit 7e15044d9d9f9839c8d422bae71f27e855d559b4) Signed-off-by: Sean Owen <sowen@cloudera.com> 12 January 2016, 13:23:34 UTC
46fc7a1 [SPARK-12638][API DOC] Parameter explanation not very accurate for rdd function "aggregate" Currently, the RDD function aggregate's parameters are not explained well, especially the parameter "zeroValue". It's helpful to let junior Scala users know that "zeroValue" participates in both the "seqOp" and "combOp" phases. Author: Tommy YU <tummyyu@163.com> Closes #10587 from Wenpei/rdd_aggregate_doc. (cherry picked from commit 9f0995bb0d0bbe5d9b15a1ca9fa18e246ff90d66) Signed-off-by: Sean Owen <sowen@cloudera.com> 12 January 2016, 13:20:29 UTC
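A hedged spark-shell illustration of the point the doc fix makes: `zeroValue` seeds each partition's `seqOp` fold and also the final `combOp` merge (assumes `sc` is available; the aggregation shown is a made-up example):

```scala
// Sketch: (sum, count) aggregation where zeroValue = (0, 0) is used in both phases.
val rdd = sc.parallelize(1 to 4, numSlices = 2)

val (sum, count) = rdd.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),        // seqOp: fold elements within a partition
  (a, b)   => (a._1 + b._1, a._2 + b._2)       // combOp: merge per-partition results
)
println(s"sum=$sum count=$count")              // sum=10 count=4
```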
a6c9c68 [SPARK-11823] Ignores HiveThriftBinaryServerSuite's test jdbc cancel https://issues.apache.org/jira/browse/SPARK-11823 This test often hangs and times out, leaving hanging processes. Let's ignore it for now and improve the test. Author: Yin Huai <yhuai@databricks.com> Closes #10715 from yhuai/SPARK-11823-ignore. (cherry picked from commit aaa2c3b628319178ca1f3f68966ff253c2de49cb) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 12 January 2016, 03:59:37 UTC