https://github.com/apache/spark

1e86074 Preparing Spark release v1.6.3-rc2 02 November 2016, 21:45:51 UTC
82e98f1 [SPARK-16078][SQL] Backport: from_utc_timestamp/to_utc_timestamp should not depend on local timezone ## What changes were proposed in this pull request? Back-port of https://github.com/apache/spark/pull/13784 to `branch-1.6` ## How was this patch tested? Existing tests. Author: Davies Liu <davies@databricks.com> Closes #15554 from srowen/SPARK-16078. 20 October 2016, 05:55:30 UTC
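For context, a hedged illustration of the functions touched by this backport; the query is not taken from the patch, and the particular input values are arbitrary. The point is that before the fix the result could shift with the JVM's default timezone:

```scala
// Illustrative only: after the fix, these results no longer depend on the
// driver/executor local timezone, only on the arguments.
sqlContext.sql("SELECT from_utc_timestamp('2016-08-01 00:00:00', 'PST') AS pst_time").show()
sqlContext.sql("SELECT to_utc_timestamp('2016-08-01 00:00:00', 'PST') AS utc_time").show()
```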
b95ac0d Preparing development version 1.6.4-SNAPSHOT 17 October 2016, 05:23:31 UTC
7375bb0 Preparing Spark release v1.6.3 17 October 2016, 05:23:21 UTC
0f57785 Prepare branch-1.6 for 1.6.3 release. 17 October 2016, 05:21:04 UTC
745c5e7 [SPARK-17884][SQL] Resolve NullPointerException when casting from empty string to interval type ## What changes were proposed in this pull request? This change adds a check in the castToInterval method of the Cast expression, such that if the converted value is null, then the isNull variable is set to true. Earlier, the expression Cast(Literal(), CalendarIntervalType) threw a NullPointerException for the reason above. ## How was this patch tested? Added a test case in CastSuite.scala. JIRA entry for detail: https://issues.apache.org/jira/browse/SPARK-17884 Author: prigarg <prigarg@adobe.com> Closes #15479 from priyankagargnitk/cast_empty_string_bug. 14 October 2016, 18:28:16 UTC
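A minimal sketch of the Catalyst-level reproducer described above, assuming the 1.6 expression APIs; it is not copied verbatim from the patch's test:

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.CalendarIntervalType

// Casting an empty string to an interval threw a NullPointerException during
// eval() before the fix; afterwards it simply evaluates to null.
val result = Cast(Literal(""), CalendarIntervalType).eval()
assert(result == null)
```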
18b173c [SPARK-17678][REPL][BRANCH-1.6] Honor spark.replClassServer.port in scala-2.11 repl ## What changes were proposed in this pull request? Spark 1.6 Scala-2.11 repl doesn't honor "spark.replClassServer.port" configuration, so user cannot set a fixed port number through "spark.replClassServer.port". ## How was this patch tested? N/A Author: jerryshao <sshao@hortonworks.com> Closes #15253 from jerryshao/SPARK-17678. 13 October 2016, 23:49:11 UTC
585c565 [SPARK-17850][CORE] Add a flag to ignore corrupt files (branch 1.6) ## What changes were proposed in this pull request? This is the patch for 1.6. It only adds Spark conf `spark.files.ignoreCorruptFiles` because SQL just uses HadoopRDD directly in 1.6. `spark.files.ignoreCorruptFiles` is `true` by default. ## How was this patch tested? The added test. Author: Shixiong Zhu <shixiong@databricks.com> Closes #15454 from zsxwing/SPARK-17850-1.6. 13 October 2016, 07:33:00 UTC
d3890de [SPARK-15062][SQL] Backport fix list type infer serializer issue This backports https://github.com/apache/spark/commit/733cbaa3c0ff617a630a9d6937699db37ad2943b to Branch 1.6. It's a pretty simple patch, and would be nice to have for Spark 1.6.3. Unit tests Author: Burak Yavuz <brkyvz@gmail.com> Closes #15380 from brkyvz/bp-SPARK-15062. Signed-off-by: Michael Armbrust <michael@databricks.com> 06 October 2016, 20:48:02 UTC
376545e [SPARK-17721][MLLIB][BACKPORT] Fix for multiplying transposed SparseMatrix with SparseVector Backport PR of changes relevant to mllib only, but otherwise identical to #15296 jkbradley Author: Bjarne Fruergaard <bwahlgreen@gmail.com> Closes #15311 from bwahlgreen/bugfix-spark-17721-1.6. 02 October 2016, 02:28:51 UTC
b999fa4 [SPARK-17696][SPARK-12330][CORE] Partial backport of to branch-1.6. From the original commit message: This PR also fixes a regression caused by [SPARK-10987] whereby submitting a shutdown causes a race between the local shutdown procedure and the notification of the scheduler driver disconnection. If the scheduler driver disconnection wins the race, the coarse executor incorrectly exits with status 1 (instead of the proper status 0) Author: Charles Allen <charlesallen-net.com> (cherry picked from commit 2eaeafe8a2aa31be9b230b8d53d3baccd32535b1) Author: Charles Allen <charles@allen-net.com> Closes #15270 from vanzin/SPARK-17696. 28 September 2016, 21:39:50 UTC
e2ce0ca [SPARK-17618] Fix invalid comparisons between UnsafeRow and other row formats ## What changes were proposed in this pull request? This patch addresses a correctness bug in Spark 1.6.x where `coalesce()` declares that it can process `UnsafeRows` but mis-declares that it always outputs safe rows. If UnsafeRow and other Row types are compared for equality then we will get spurious `false` comparisons, leading to wrong answers in operators which perform whole-row comparison (such as `distinct()` or `except()`). An example of a query impacted by this bug is given in the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-17618). The problem is that the validity of our row format conversion rules depends on operators which handle `unsafeRows` (signalled by overriding `canProcessUnsafeRows`) correctly reporting their output row format (which is done by overriding `outputsUnsafeRows`). In #9024, we overrode `canProcessUnsafeRows` but forgot to override `outputsUnsafeRows`, leading to the incorrect `equals()` comparison. Our interface design is flawed because correctness depends on operators correctly overriding multiple methods; this problem could have been prevented by a design which coupled row format methods / metadata into a single method / class so that all three methods had to be overridden at the same time. This patch addresses this issue by adding missing `outputsUnsafeRows` overrides. In order to ensure that bugs in this logic are uncovered sooner, I have modified `UnsafeRow.equals()` to throw an `IllegalArgumentException` if it is called with an object that is not an `UnsafeRow`. ## How was this patch tested? I believe that the stronger misuse-checking in `UnsafeRow.equals()` is sufficient to detect and prevent this class of bug. Author: Josh Rosen <joshrosen@databricks.com> Closes #15185 from JoshRosen/SPARK-17618. 27 September 2016, 17:57:15 UTC
7aded55 [SPARK-17649][CORE] Log how many Spark events got dropped in AsynchronousListenerBus (branch 1.6) ## What changes were proposed in this pull request? Backport #15220 to 1.6. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #15226 from zsxwing/SPARK-17649-branch-1.6. 26 September 2016, 18:03:05 UTC
94524ce [SPARK-17485] Prevent failed remote reads of cached blocks from failing entire job (branch-1.6 backport) This patch is a branch-1.6 backport of #15037: ## What changes were proposed in this pull request? In Spark's `RDD.getOrCompute` we first try to read a local copy of a cached RDD block, then a remote copy, and only fall back to recomputing the block if no cached copy (local or remote) can be read. This logic works correctly in the case where no remote copies of the block exist, but if there _are_ remote copies and reads of those copies fail (due to network issues or internal Spark bugs) then the BlockManager will throw a `BlockFetchException` that will fail the task (and which could possibly fail the whole job if the read failures keep occurring). In the cases of TorrentBroadcast and task result fetching we really do want to fail the entire job in case no remote blocks can be fetched, but this logic is inappropriate for reads of cached RDD blocks because those can/should be recomputed in case cached blocks are unavailable. Therefore, I think that the `BlockManager.getRemoteBytes()` method should never throw on remote fetch errors and, instead, should handle failures by returning `None`. ## How was this patch tested? Block manager changes should be covered by modified tests in `BlockManagerSuite`: the old tests expected exceptions to be thrown on failed remote reads, while the modified tests now expect `None` to be returned from the `getRemote*` method. I also manually inspected all usages of `BlockManager.getRemoteValues()`, `getRemoteBytes()`, and `get()` to verify that they correctly pattern-match on the result and handle `None`. Note that these `None` branches are already exercised because the old `getRemoteBytes` returned `None` when no remote locations for the block could be found (which could occur if an executor died and its block manager de-registered with the master). Author: Josh Rosen <joshrosen@databricks.com> Closes #15186 from JoshRosen/SPARK-17485-branch-1.6-backport. 22 September 2016, 18:05:35 UTC
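A hedged sketch of the caller-side pattern described above; `deserialize` and `recomputeBlock` are hypothetical stand-ins for the real logic in `RDD.getOrCompute`:

```scala
// A failed or impossible remote read now surfaces as None rather than an
// exception, so the caller falls through to recomputation instead of failing the task.
blockManager.getRemoteBytes(blockId) match {
  case Some(bytes) => deserialize(bytes)      // a cached remote copy was fetched
  case None        => recomputeBlock(blockId) // no readable remote copy: recompute
}
```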
ce0a222 [SPARK-17418] Prevent kinesis-asl-assembly artifacts from being published This patch updates the `kinesis-asl-assembly` build to prevent that module from being published as part of Maven releases and snapshot builds. The `kinesis-asl-assembly` includes classes from the Kinesis Client Library (KCL) and Kinesis Producer Library (KPL), both of which are licensed under the Amazon Software License and are therefore prohibited from being distributed in Apache releases. Author: Josh Rosen <joshrosen@databricks.com> Closes #15167 from JoshRosen/stop-publishing-kinesis-assembly. 21 September 2016, 18:42:48 UTC
8f88412 [SPARK-17617][SQL] Remainder(%) expression.eval returns incorrect result on double value ## What changes were proposed in this pull request? Remainder(%) expression's `eval()` returns incorrect result when the dividend is a big double. The reason is that Remainder converts the double dividend to decimal to do "%", and that lose precision. This bug only affects the `eval()` that is used by constant folding, the codegen path is not impacted. ### Before change ``` scala> -5083676433652386516D % 10 res2: Double = -6.0 scala> spark.sql("select -5083676433652386516D % 10 as a").show +---+ | a| +---+ |0.0| +---+ ``` ### After change ``` scala> spark.sql("select -5083676433652386516D % 10 as a").show +----+ | a| +----+ |-6.0| +----+ ``` ## How was this patch tested? Unit test. Author: Sean Zhong <seanzhong@databricks.com> Closes #15171 from clockfly/SPARK-17617. (cherry picked from commit 3977223a3268aaf6913a325ee459139a4a302b1c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 September 2016, 08:57:44 UTC
8646b84 [SPARK-17547] Ensure temp shuffle data file is cleaned up after error SPARK-8029 (#9610) modified shuffle writers to first stage their data to a temporary file in the same directory as the final destination file and then to atomically rename this temporary file at the end of the write job. However, this change introduced the potential for the temporary output file to be leaked if an exception occurs during the write because the shuffle writers' existing error cleanup code doesn't handle deletion of the temp file. This patch avoids this potential cause of disk-space leaks by adding `finally` blocks to ensure that temp files are always deleted if they haven't been renamed. Author: Josh Rosen <joshrosen@databricks.com> Closes #15104 from JoshRosen/cleanup-tmp-data-file-in-shuffle-writer. (cherry picked from commit 5b8f7377d54f83b93ef2bfc2a01ca65fae6d3032) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 15 September 2016, 18:24:00 UTC
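A hedged sketch of the cleanup pattern described above; `outputFile` and `writePartitionedData` are hypothetical names, not the shuffle writer's actual API:

```scala
import java.io.File

val tmp = new File(outputFile.getParent, outputFile.getName + ".tmp")
var success = false
try {
  writePartitionedData(tmp)            // stage the shuffle output in a temp file
  success = tmp.renameTo(outputFile)   // atomically promote it on success
} finally {
  if (!success && tmp.exists()) {
    tmp.delete()                       // never leak the temp file after an error
  }
}
```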
a447cd8 [SPARK-17465][SPARK CORE] Inappropriate memory management in `org.apache.spark.storage.MemoryStore` may lead to memory leak ## What changes were proposed in this pull request? The expression `if (memoryMap(taskAttemptId) == 0) memoryMap.remove(taskAttemptId)` in the methods `releaseUnrollMemoryForThisTask` and `releasePendingUnrollMemoryForThisTask` should be evaluated after the release-memory operation, whether `memoryToRelease` is > 0 or not. If the memory of a task has been set to 0 when calling a `releaseUnrollMemoryForThisTask` or a `releasePendingUnrollMemoryForThisTask` method, the key in the memory map corresponding to that task will never be removed from the hash map. See the details in [SPARK-17465](https://issues.apache.org/jira/browse/SPARK-17465). Author: Xing SHI <shi-kou@indetail.co.jp> Closes #15022 from saturday-shi/SPARK-17465. 14 September 2016, 20:46:46 UTC
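A simplified, hedged sketch of the corrected bookkeeping described above; `memory`, `memoryMap`, and `releaseMemory` mirror the description rather than the exact MemoryStore code:

```scala
// memoryMap: taskAttemptId -> unroll memory reserved by that task;
// memory: the amount requested to be released (both simplified here).
val memoryToRelease = math.min(memory, memoryMap(taskAttemptId))
if (memoryToRelease > 0) {
  memoryMap(taskAttemptId) -= memoryToRelease
  releaseMemory(memoryToRelease)   // hypothetical stand-in for the real accounting call
}
// This check must run even when memoryToRelease == 0, otherwise the entry
// for a finished task stays in the map forever and leaks.
if (memoryMap(taskAttemptId) == 0) {
  memoryMap.remove(taskAttemptId)
}
```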
bf3f6d2 [SPARK-17531][BACKPORT] Don't initialize Hive Listeners for the Execution Client ## What changes were proposed in this pull request? If a user provides listeners inside the Hive Conf, the configuration for these listeners are passed to the Hive Execution Client as well. This may cause issues for two reasons: 1. The Execution Client will actually generate garbage 2. The listener class needs to be both in the Spark Classpath and Hive Classpath This PR empties the listener configurations in HiveUtils.newTemporaryConfiguration so that the execution client will not contain the listener confs, but the metadata client will. ## How was this patch tested? Unit tests Author: Burak Yavuz <brkyvz@gmail.com> Closes #15087 from brkyvz/overwrite-hive-listeners. 13 September 2016, 23:15:44 UTC
047bc3f [SPARK-17245][SQL][BRANCH-1.6] Do not rely on Hive's session state to retrieve HiveConf ## What changes were proposed in this pull request? Right now, we rely on Hive's `SessionState.get()` to retrieve the HiveConf used by ClientWrapper. However, this conf is actually the HiveConf set with the `state`. There is a small chance that we are trying to use the Hive client in a new thread while the global client has not been created yet. In this case, `SessionState.get()` will return a `null`, which causes an NPE when we call `SessionState.get().getConf`. Since the conf that we want is actually the conf we set to `state`, I am changing the code to just call `state.getConf` (this is also what Spark 2.0 does). ## How was this patch tested? I have not figured out a good way to reproduce this. Author: Yin Huai <yhuai@databricks.com> Closes #14816 from yhuai/SPARK-17245. 07 September 2016, 13:55:08 UTC
69fa945 [SPARK-17378][HOTFIX] Upgrade snappy-java to 1.1.2.6 -- fix Hadoop 1 deps ## What changes were proposed in this pull request? Also update Hadoop 1 deps file to reflect Snappy 1.1.2.6 ## How was this patch tested? N/A Author: Sean Owen <sowen@cloudera.com> Closes #14992 from srowen/SPARK-17378.2. 07 September 2016, 11:12:32 UTC
3f797dd [SPARK-17316][CORE] Fix the 'ask' type parameter in 'removeExecutor' ## What changes were proposed in this pull request? Fix the 'ask' type parameter in 'removeExecutor' to eliminate a lot of error logs `Cannot cast java.lang.Boolean to scala.runtime.Nothing$` ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #14983 from zsxwing/SPARK-17316-3. (cherry picked from commit 175b4344112b376cbbbd05265125ed0e1b87d507) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 06 September 2016, 23:49:30 UTC
af8e097 [SPARK-17378][BUILD] Upgrade snappy-java to 1.1.2.6 Upgrades the Snappy version to 1.1.2.6 from 1.1.2.4, release notes: https://github.com/xerial/snappy-java/blob/master/Milestone.md mention "Fix a bug in SnappyInputStream when reading compressed data that happened to have the same first byte with the stream magic header (#142)" Existing unit tests using the latest IBM Java 8 on Intel, Power and Z architectures (little and big-endian) Author: Adam Roberts <aroberts@uk.ibm.com> Closes #14958 from a-roberts/master. (cherry picked from commit 6c08dbf683875ff1ba724447e0531f673bcff8ba) Signed-off-by: Sean Owen <sowen@cloudera.com> 06 September 2016, 21:15:58 UTC
e6480a6 [SPARK-17356][SQL][1.6] Fix out of memory issue when generating JSON for TreeNode This is a backport of PR https://github.com/apache/spark/pull/14915 to branch 1.6. ## What changes were proposed in this pull request? The class `org.apache.spark.sql.types.Metadata` is widely used in mllib to store some ml attributes. `Metadata` is commonly stored in the `Alias` expression. ``` case class Alias(child: Expression, name: String)( val exprId: ExprId = NamedExpression.newExprId, val qualifier: Option[String] = None, val explicitMetadata: Option[Metadata] = None, override val isGenerated: java.lang.Boolean = false) ``` The `Metadata` can have a big memory footprint since the number of attributes is big (on the scale of millions). When `toJSON` is called on an `Alias` expression, the `Metadata` will also be converted to a big JSON string. If a plan contains many such `Alias` expressions, it may trigger an out-of-memory error when `toJSON` is called, since converting all `Metadata` references to JSON takes a huge amount of memory. With this PR, we will skip scanning Metadata when doing JSON conversion. For a reproducer of the OOM, and analysis, please look at jira https://issues.apache.org/jira/browse/SPARK-17356. ## How was this patch tested? Existing tests. Author: Sean Zhong <seanzhong@databricks.com> Closes #14973 from clockfly/json_oom_1.6. 06 September 2016, 12:07:44 UTC
958039a [SPARK-11301][SQL] Fix case sensitivity for filter on partitioned col… ## What changes were proposed in this pull request? `DataSourceStrategy` does not consider `SQLConf` in `Context` and always match column names. For instance, `HiveContext` uses case insensitive configuration, but it's ignored in `DataSourceStrategy`. This issue was originally registered at SPARK-11301 against 1.6.0 and seemed to be fixed at that time, but Apache Spark 1.6.2 still handles **partitioned column name** in a case-sensitive way always. This is incorrect like the following. ```scala scala> sql("CREATE TABLE t(a int) PARTITIONED BY (b string) STORED AS PARQUET") scala> sql("INSERT INTO TABLE t PARTITION(b='P') SELECT * FROM (SELECT 1) t") scala> sql("INSERT INTO TABLE t PARTITION(b='Q') SELECT * FROM (SELECT 2) t") scala> sql("SELECT * FROM T WHERE B='P'").show +---+---+ | a| b| +---+---+ | 1| P| | 2| Q| +---+---+ ``` The result is the same with `set spark.sql.caseSensitive=false`. Here is the result in [Databricks CE](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6660119172909095/3421754458488607/5162191866050912/latest.html) . This PR reads the configuration and handle the column name comparison accordingly. ## How was this patch tested? Pass the Jenkins test with a modified test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14970 from dongjoon-hyun/SPARK-11301. 06 September 2016, 11:36:12 UTC
21be94b [SPARK-15091][SPARKR] Fix warnings and a failure in SparkR test cases with testthat version 1.0.1 Fix warnings and a failure in SparkR test cases with testthat version 1.0.1 SparkR unit test cases. Author: Sun Rui <sunrui2016@gmail.com> Closes #12867 from sun-rui/SPARK-15091. (cherry picked from commit 8b6491fc0b49b4e363887ae4b452ba69fe0290d5) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 05 September 2016, 22:59:37 UTC
b84a92c [SPARK-17316][CORE] Make CoarseGrainedSchedulerBackend.removeExecutor non-blocking ## What changes were proposed in this pull request? StandaloneSchedulerBackend.executorRemoved is a blocking call right now. It may cause some deadlock since it's called inside StandaloneAppClient.ClientEndpoint. This PR just changed CoarseGrainedSchedulerBackend.removeExecutor to be non-blocking. It's safe since the only two usages (StandaloneSchedulerBackend and YarnSchedulerEndpoint) don't need the return value. ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #14882 from zsxwing/SPARK-17316. 02 September 2016, 19:45:06 UTC
412b0e8 [SPARK-17038][STREAMING] fix metrics retrieval source of 'lastReceivedBatch' https://issues.apache.org/jira/browse/SPARK-17038 ## What changes were proposed in this pull request? StreamingSource's lastReceivedBatch_submissionTime, lastReceivedBatch_processingTimeStart, and lastReceivedBatch_processingTimeEnd all use data from lastCompletedBatch instead of lastReceivedBatch. In particular, this makes it impossible to match lastReceivedBatch_records with a batchID/submission time. This is apparent when looking at StreamingSource.scala, lines 89-94. ## How was this patch tested? Manually running unit tests on local laptop Author: Xin Ren <iamshrek@126.com> Closes #14681 from keypointt/SPARK-17038. (cherry picked from commit e6bef7d52f0e19ec771fb0f3e96c7ddbd1a6a19b) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 17 August 2016, 23:32:01 UTC
60de30f [SPARK-17102][SQL] bypass UserDefinedGenerator for json format check We use reflection to convert `TreeNode` to a JSON string, and currently don't support arbitrary objects. `UserDefinedGenerator` takes a function object, so we should skip the JSON format test for it, or the tests can be flaky, e.g. `DataFrameSuite.simple explode`; this test always fails with Scala 2.10 (branch 1.6 builds with Scala 2.10 by default), but passes with Scala 2.11 (the master branch builds with Scala 2.11 by default). N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #14679 from cloud-fan/json. (cherry picked from commit 928ca1c6d12b23d84f9b6205e22d2e756311f072) Signed-off-by: Yin Huai <yhuai@databricks.com> 17 August 2016, 16:35:33 UTC
5c34029 [SPARK-16656][SQL][BRANCH-1.6] Try to make CreateTableAsSelectSuite more stable ## What changes were proposed in this pull request? This PR backports #14289 to branch 1.6 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62593/testReport/junit/org.apache.spark.sql.sources/CreateTableAsSelectSuite/create_a_table__drop_it_and_create_another_one_with_the_same_name/ shows that `create a table, drop it and create another one with the same name` failed. But other runs were good. Seems it is a flaky test. This PR tries to make this test more stable. Author: Yin Huai <yhuai@databricks.com> Closes #14668 from yhuai/SPARK-16656-branch1.6. 16 August 2016, 20:42:58 UTC
4d64c7f Revert "[SPARK-17027][ML] Avoid integer overflow in PolynomialExpansion.getPolySize" This reverts commit b54a586af4b8ca7e8b97311bf5e75e00797de899. 14 August 2016, 11:18:30 UTC
b54a586 [SPARK-17027][ML] Avoid integer overflow in PolynomialExpansion.getPolySize Replaces custom choose function with o.a.commons.math3.CombinatoricsUtils.binomialCoefficient Spark unit tests Author: zero323 <zero323@users.noreply.github.com> Closes #14614 from zero323/SPARK-17027. (cherry picked from commit 0ebf7c1bff736cf54ec47957d71394d5b75b47a7) Signed-off-by: Sean Owen <sowen@cloudera.com> 14 August 2016, 11:01:26 UTC
8a2b8fc Change check for particular missing file message to accommodate the message that would occur, it seems, only in Hadoop 1.x (and therefore in Spark 1.x) 13 August 2016, 15:40:49 UTC
909231d [SPARK-17003][BUILD][BRANCH-1.6] release-build.sh is missing hive-thriftserver for scala 2.11 ## What changes were proposed in this pull request? hive-thriftserver works with Scala 2.11 (https://issues.apache.org/jira/browse/SPARK-8013). So, let's publish Scala 2.11 artifacts with the `-Phive-thriftserver` flag. I am also fixing the doc. Author: Yin Huai <yhuai@databricks.com> Closes #14586 from yhuai/SPARK-16453-branch-1.6. 12 August 2016, 17:29:05 UTC
b3ecff6 Revert "[SPARK-16831][PYTHON] Fixed bug in CrossValidator.avgMetrics" This reverts commit 92ee6fbf5d5096245d9f1a84cd3a8e66062dd945. 11 August 2016, 15:59:54 UTC
ace458f [SPARK-16956] Make ApplicationState.MAX_NUM_RETRY configurable ## What changes were proposed in this pull request? This patch introduces a new configuration, `spark.deploy.maxExecutorRetries`, to let users configure an obscure behavior in the standalone master where the master will kill Spark applications which have experienced too many back-to-back executor failures. The current setting is a hardcoded constant (10); this patch replaces that with a new cluster-wide configuration. **Background:** This application-killing was added in 6b5980da796e0204a7735a31fb454f312bc9daac (from September 2012) and I believe that it was designed to prevent a faulty application whose executors could never launch from DOS'ing the Spark cluster via an infinite series of executor launch attempts. In a subsequent patch (#1360), this feature was refined to prevent applications which have running executors from being killed by this code path. **Motivation for making this configurable:** Previously, if a Spark Standalone application experienced more than `ApplicationState.MAX_NUM_RETRY` executor failures and was left with no executors running then the Spark master would kill that application, but this behavior is problematic in environments where the Spark executors run on unstable infrastructure and can all simultaneously die. For instance, if your Spark driver runs on an on-demand EC2 instance while all workers run on ephemeral spot instances then it's possible for all executors to die at the same time while the driver stays alive. In this case, it may be desirable to keep the Spark application alive so that it can recover once new workers and executors are available. In order to accommodate this use-case, this patch modifies the Master to never kill faulty applications if `spark.deploy.maxExecutorRetries` is negative. I'd like to merge this patch into master, branch-2.0, and branch-1.6. ## How was this patch tested? I tested this manually using `spark-shell` and `local-cluster` mode. This is a tricky feature to unit test and historically this code has not changed very often, so I'd prefer to skip the additional effort of adding a testing framework and would rather rely on manual tests and review for now. Author: Josh Rosen <joshrosen@databricks.com> Closes #14544 from JoshRosen/add-setting-for-max-executor-failures. (cherry picked from commit b89b3a5c8e391fcaebe7ef3c77ef16bb9431d6ab) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 09 August 2016, 18:22:41 UTC
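A hedged usage sketch of the new setting described above (the application name is hypothetical); per the description, a negative value disables the master's application-killing behaviour:

```scala
import org.apache.spark.SparkConf

// Keep the application alive even if all executors die simultaneously
// (e.g. on spot instances), so it can recover once new workers appear.
val conf = new SparkConf()
  .setAppName("spot-instance-job")
  .set("spark.deploy.maxExecutorRetries", "-1")
```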
a3b06ae [SPARK-16939][SQL] Fix build error by using `Tuple1` explicitly in StringFunctionsSuite ## What changes were proposed in this pull request? This PR aims to fix a build error on branch 1.6 at https://github.com/apache/spark/commit/8d8725208771a8815a60160a5a30dc6ea87a7e6a, but I think we had better have this consistently in master branch, too. It's because there exist other ongoing PR (https://github.com/apache/spark/pull/14525) about this. https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-1.6-compile-maven-with-yarn-2.3/286/console ```scala [error] /home/jenkins/workspace/spark-branch-1.6-compile-maven-with-yarn-2.3/sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala:82: value toDF is not a member of Seq[String] [error] val df = Seq("aaaac").toDF("s") [error] ^ ``` ## How was this patch tested? After passing Jenkins, run compilation test on branch 1.6. ``` build/mvn -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14526 from dongjoon-hyun/SPARK-16939. (cherry picked from commit a16983c97b4c6539f97e5d26f163fed49872df2b) Signed-off-by: Sean Owen <sowen@cloudera.com> 07 August 2016, 19:54:23 UTC
1a5e762 [SPARK-16409][SQL] regexp_extract with optional groups causes NPE ## What changes were proposed in this pull request? regexp_extract actually returns null when it shouldn't when a regex matches but the requested optional group did not. This makes it return an empty string, as apparently designed. ## How was this patch tested? Additional unit test Author: Sean Owen <sowen@cloudera.com> Closes #14504 from srowen/SPARK-16409. (cherry picked from commit 8d8725208771a8815a60160a5a30dc6ea87a7e6a) Signed-off-by: Sean Owen <sowen@cloudera.com> 07 August 2016, 11:20:27 UTC
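A hedged illustration of the behaviour described above, borrowing the pattern from the regression scenario referenced in the build-fix commit just above; group 2 is optional and never matches:

```scala
// Before the fix the unmatched optional group produced a NullPointerException / null;
// after the fix it returns an empty string.
sqlContext.sql("SELECT regexp_extract('aaaac', '(a+)(b)?(c)', 2) AS g").show()
```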
c162886 [SPARK-16925] Master should call schedule() after all executor exit events, not only failures This patch fixes a bug in Spark's standalone Master which could cause applications to hang if tasks cause executors to exit with zero exit codes. As an example of the bug, run ``` sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) } ``` on a standalone cluster which has a single Spark application. This will cause all executors to die but those executors won't be replaced unless another Spark application or worker joins or leaves the cluster (or if an executor exits with a non-zero exit code). This behavior is caused by a bug in how the Master handles the `ExecutorStateChanged` event: the current implementation calls `schedule()` only if the executor exited with a non-zero exit code, so a task which causes a JVM to unexpectedly exit "cleanly" will skip the `schedule()` call. This patch addresses this by modifying the `ExecutorStateChanged` to always unconditionally call `schedule()`. This should be safe because it should always be safe to call `schedule()`; adding extra `schedule()` calls can only affect performance and should not introduce correctness bugs. I added a regression test in `DistributedSuite`. Author: Josh Rosen <joshrosen@databricks.com> Closes #14510 from JoshRosen/SPARK-16925. (cherry picked from commit 4f5f9b670e1f1783f43feb22490613e72dcff852) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 07 August 2016, 02:40:34 UTC
d2518ac [SPARK-16873][CORE] Fix SpillReader NPE when spillFile has no data ## What changes were proposed in this pull request? SpillReader NPE when spillFile has no data. See follow logs: 16/07/31 20:54:04 INFO collection.ExternalSorter: spill memory to file:/data4/yarnenv/local/usercache/tesla/appcache/application_1465785263942_56138/blockmgr-db5f46c3-d7a4-4f93-8b77-565e469696fb/09/temp_shuffle_ec3ece08-4569-4197-893a-4a5dfcbbf9fa, fileSize:0.0 B 16/07/31 20:54:04 WARN memory.TaskMemoryManager: leak 164.3 MB memory from org.apache.spark.util.collection.ExternalSorter3db4b52d 16/07/31 20:54:04 ERROR executor.Executor: Managed memory leak detected; size = 190458101 bytes, TID = 2358516/07/31 20:54:04 ERROR executor.Executor: Exception in task 1013.0 in stage 18.0 (TID 23585) java.lang.NullPointerException at org.apache.spark.util.collection.ExternalSorter$SpillReader.cleanup(ExternalSorter.scala:624) at org.apache.spark.util.collection.ExternalSorter$SpillReader.nextBatchStream(ExternalSorter.scala:539) at org.apache.spark.util.collection.ExternalSorter$SpillReader.<init>(ExternalSorter.scala:507) at org.apache.spark.util.collection.ExternalSorter$SpillableIterator.spill(ExternalSorter.scala:816) at org.apache.spark.util.collection.ExternalSorter.forceSpill(ExternalSorter.scala:251) at org.apache.spark.util.collection.Spillable.spill(Spillable.scala:109) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:154) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249) at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 16/07/31 20:54:30 INFO executor.Executor: Executor is trying to kill task 1090.1 in stage 18.0 (TID 23793) 16/07/31 20:54:30 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown ## How was this patch tested? Manual test. Author: sharkd <sharkd.tu@gmail.com> Author: sharkdtu <sharkdtu@tencent.com> Closes #14479 from sharkdtu/master. (cherry picked from commit 583d91a1957f4258a64184cc6b9007588791d332) Signed-off-by: Reynold Xin <rxin@databricks.com> 04 August 2016, 02:21:16 UTC
52d8837 [SPARK-16796][WEB UI] Visible passwords on Spark environment page ## What changes were proposed in this pull request? Mask spark.ssl.keyPassword, spark.ssl.keyStorePassword, spark.ssl.trustStorePassword in Web UI environment page. (Changes their values to ***** in env. page) ## How was this patch tested? I've built spark, run spark shell and checked that this values have been masked with *****. Also run tests: ./dev/run-tests [info] ScalaTest [info] Run completed in 1 hour, 9 minutes, 5 seconds. [info] Total number of tests run: 2166 [info] Suites: completed 65, aborted 0 [info] Tests: succeeded 2166, failed 0, canceled 0, ignored 590, pending 0 [info] All tests passed. ![mask](https://cloud.githubusercontent.com/assets/15244468/17262154/7641e132-55e2-11e6-8a6c-30ead77c7372.png) Author: Artur Sukhenko <artur.sukhenko@gmail.com> Closes #14409 from Devian-ua/maskpass. (cherry picked from commit 3861273771c2631e88e1f37a498c644ad45ac1c0) Signed-off-by: Sean Owen <sowen@cloudera.com> 03 August 2016, 13:14:53 UTC
92ee6fb [SPARK-16831][PYTHON] Fixed bug in CrossValidator.avgMetrics avgMetrics was summed, not averaged, across folds Author: =^_^= <maxmoroz@gmail.com> Closes #14456 from pkch/pkch-patch-1. (cherry picked from commit 639df046a250873c26446a037cb832ab28cb5272) Signed-off-by: Sean Owen <sowen@cloudera.com> 03 August 2016, 11:19:36 UTC
797e758 [SPARK-15541] Casting ConcurrentHashMap to ConcurrentMap (branch-1.6) ## What changes were proposed in this pull request? Casting ConcurrentHashMap to ConcurrentMap allows code compiled with Java 8 to run on Java 7 ## How was this patch tested? Compilation. Existing automatic tests Author: Maciej Brynski <maciej.brynski@adpilot.pl> Closes #14390 from maver1ck/spark-15541. 02 August 2016, 23:07:35 UTC
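A hedged sketch of the compatibility issue behind this change (the map and its contents are hypothetical): compiling against the concrete ConcurrentHashMap on Java 8 can bind calls such as keySet() to Java-8-only covariant return types, which then fail to link on a Java 7 JRE, so the code declares the interface type instead.

```scala
import java.util.concurrent.{ConcurrentHashMap, ConcurrentMap}

// Declaring through the ConcurrentMap interface keeps the compiled call sites
// free of Java-8-only signatures, so the same bytecode also runs on Java 7.
val registry: ConcurrentMap[String, String] = new ConcurrentHashMap[String, String]()
registry.putIfAbsent("app-1", "RUNNING")
```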
8a22275 [SPARK-15541] Casting ConcurrentHashMap to ConcurrentMap (master branch) Casting ConcurrentHashMap to ConcurrentMap allows code compiled with Java 8 to run on Java 7 Compilation. Existing automatic tests Author: Maciej Brynski <maciej.brynski@adpilot.pl> Closes #14459 from maver1ck/spark-15541-master. (cherry picked from commit 511dede1118f20a7756f614acb6fc88af52c9de9) Signed-off-by: Sean Owen <sowen@cloudera.com> 02 August 2016, 15:46:20 UTC
1b2e6f6 [SPARK-16664][SQL] Fix persist call on Data frames with more than 200… ## What changes were proposed in this pull request? Cherry-pick from d1d5069aa3744d46abd3889abab5f15e9067382a and fix the test case ## How was this patch tested? Test updated Author: Wesley Tang <tangmingjun@mininglamp.com> Closes #14404 from breakdawn/branch-1.6. 29 July 2016, 20:25:33 UTC
03913af [SPARK-16751][HOTFIX] Also update hadoop-1 deps file to reflect derby 10.12.1.1 security fix ## What changes were proposed in this pull request? See https://github.com/apache/spark/pull/14379 ; I failed to note in back-porting to 1.6 that an additional Hadoop 1 deps file would need to be updated. This makes that change. ## How was this patch tested? Jenkins tests. Author: Sean Owen <sowen@cloudera.com> Closes #14403 from srowen/SPARK-16751.2. 29 July 2016, 16:06:34 UTC
f445cce Revert "[SPARK-16664][SQL] Fix persist call on Data frames with more than 200…" This reverts commit 15abbf9d26fd80ae44d6aaee4b435ec4dc08aa95. 29 July 2016, 12:40:58 UTC
b6f6075 [SPARK-16751] Upgrade derby to 10.12.1.1 Version of derby upgraded based on important security info at VersionEye. Test scope added so we don't include it in our final package anyway. NB: I think this should be backported to all previous releases as it is a security problem https://www.versioneye.com/java/org.apache.derby:derby/10.11.1.1 The CVE number is 2015-1832. I also suggest we add a SECURITY tag for JIRAs Existing tests with the change making sure that we see no new failures. I checked derby 10.12.x and not derby 10.11.x is downloaded to our ~/.m2 folder. I then used dev/make-distribution.sh and checked the dist/jars folder for Spark 2.0: no derby jar is present. I don't know if this would also remove it from the assembly jar in our 1.x branches. Author: Adam Roberts <aroberts@uk.ibm.com> Closes #14379 from a-roberts/patch-4. (cherry picked from commit 04a2c072d94874f3f7ae9dd94c026e8826a75ccd) Signed-off-by: Sean Owen <sowen@cloudera.com> 29 July 2016, 11:46:24 UTC
15abbf9 [SPARK-16664][SQL] Fix persist call on Data frames with more than 200… f12f11e578169b47e3f8b18b299948c0670ba585 introduced this bug, missed foreach as map Test added Author: Wesley Tang <tangmingjun@mininglamp.com> Closes #14324 from breakdawn/master. (cherry picked from commit d1d5069aa3744d46abd3889abab5f15e9067382a) Signed-off-by: Sean Owen <sowen@cloudera.com> 29 July 2016, 11:27:54 UTC
4ff9892 [MINOR][ML] Fix some mistake in LinearRegression formula. ## What changes were proposed in this pull request? Fix some mistake in ```LinearRegression``` formula. ## How was this patch tested? Documents change, no tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14369 from yanboliang/LiR-formula. (cherry picked from commit 3c3371bbd6361011b138cce88f6396a2aa4e2cb9) Signed-off-by: Sean Owen <sowen@cloudera.com> 27 July 2016, 10:24:58 UTC
f6e0c17 [SPARK-16440][MLLIB] Destroy broadcasted variables even on driver ## What changes were proposed in this pull request? Forgotten broadcasted variables were unpersisted in a previous PR (#14153). This PR turns those `unpersist()` calls into `destroy()` so that memory is freed even on the driver. ## How was this patch tested? Unit Tests in Word2VecSuite were run locally. This contribution is done on behalf of Criteo, according to the terms of the Apache license 2.0. Author: Anthony Truchet <a.truchet@criteo.com> Closes #14268 from AnthonyTruchet/SPARK-16440. (cherry picked from commit 0dc79ffd1cbb45e69a35e3f5334c9a13290037a0) Signed-off-by: Sean Owen <sowen@cloudera.com> 20 July 2016, 09:40:25 UTC
6ea7d4b [SPARK-16313][SQL][BRANCH-1.6] Spark should not silently drop exceptions in file listing ## What changes were proposed in this pull request? Spark silently drops exceptions during file listing. This is a very bad behavior because it can mask legitimate errors and the resulting plan will silently have 0 rows. This patch changes it to not silently drop the errors. After making partition discovery not silently drop exceptions, HiveMetastoreCatalog can trigger partition discovery on empty tables, which cause FileNotFoundExceptions (these Exceptions were dropped by partition discovery silently). To address this issue, this PR introduces two **hacks** to workaround the issues. These two hacks try to avoid of triggering partition discovery on empty tables in HiveMetastoreCatalog. ## How was this patch tested? Manually tested. **Note: This is a backport of https://github.com/apache/spark/pull/13987** Author: Yin Huai <yhuai@databricks.com> Closes #14139 from yhuai/SPARK-16313-branch-1.6. 14 July 2016, 19:00:31 UTC
4381e21 [SPARK-16440][MLLIB] Undeleted broadcast variables in Word2Vec causing OoM for long runs ## What changes were proposed in this pull request? Unpersist broadcasted vars in Word2Vec.fit for more timely / reliable resource cleanup ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #14153 from srowen/SPARK-16440. (cherry picked from commit 51ade51a9fd64fc2fe651c505a286e6f29f59d40) Signed-off-by: Sean Owen <sowen@cloudera.com> 13 July 2016, 10:39:49 UTC
fb09336 [SPARK-16375][WEB UI] Fixed misassigned var: numCompletedTasks was assigned to numSkippedTasks ## What changes were proposed in this pull request? I fixed a misassigned var, numCompletedTasks was assigned to numSkippedTasks in the convertJobData method ## How was this patch tested? dev/run-tests Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #14141 from ajbozarth/spark16375. (cherry picked from commit f156136dae5df38f73a25cf3fb48f98f417ef059) Signed-off-by: Sean Owen <sowen@cloudera.com> 13 July 2016, 09:45:24 UTC
980db2b [HOTFIX] Fix build break. 13 July 2016, 06:40:37 UTC
7c8a399 [SPARK-16489][SQL] Guard against variable reuse mistakes in expression code generation In code generation, it is incorrect for expressions to reuse variable names across different instances of itself. As an example, SPARK-16488 reports a bug in which pmod expression reuses variable name "r". This patch updates ExpressionEvalHelper test harness to always project two instances of the same expression, which will help us catch variable reuse problems in expression unit tests. This patch also fixes the bug in crc32 expression. This is a test harness change, but I also created a new test suite for testing the test harness. Author: Reynold Xin <rxin@databricks.com> Closes #14146 from rxin/SPARK-16489. (cherry picked from commit c377e49e38a290e5c4fbc178278069788674dfb7) Signed-off-by: Reynold Xin <rxin@databricks.com> 13 July 2016, 06:14:17 UTC
d1c992f [SPARK-16488] Fix codegen variable namespace collision in pmod and partitionBy This patch fixes a variable namespace collision bug in pmod and partitionBy Regression test for one possible occurrence. A more general fix in `ExpressionEvalHelper.checkEvaluation` will be in a subsequent PR. Author: Sameer Agarwal <sameer@databricks.com> Closes #14144 from sameeragarwal/codegen-bug. (cherry picked from commit 9cc74f95edb6e4f56151966139cd0dc24e377949) Signed-off-by: Reynold Xin <rxin@databricks.com> (cherry picked from commit 689261465ad1dd443ebf764ad837243418b986ef) Signed-off-by: Reynold Xin <rxin@databricks.com> 13 July 2016, 06:13:19 UTC
9808735 [SPARK-16514][SQL] Fix various regex codegen bugs ## What changes were proposed in this pull request? RegexExtract and RegexReplace currently crash on non-nullable input due to the use of a hard-coded local variable name (e.g. compilation fails with `java.lang.Exception: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 85, Column 26: Redefinition of local variable "m"`). This patch changes those variables to use fresh names, here and in a few other places. ## How was this patch tested? Unit tests. rxin Author: Eric Liang <ekl@databricks.com> Closes #14168 from ericl/sc-3906. (cherry picked from commit 1c58fa905b6543d366d00b2e5394dfd633987f6d) Signed-off-by: Reynold Xin <rxin@databricks.com> 13 July 2016, 06:09:31 UTC
702178d [SPARK-16385][CORE] Catch correct exception when calling method via reflection. Using "Method.invoke" causes an exception to be thrown, not an error, so Utils.waitForProcess() was always throwing an exception when run on Java 7. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #14056 from vanzin/SPARK-16385. (cherry picked from commit 59f9c1bd1adfea7069e769fb68351c228c37c8fc) Signed-off-by: Sean Owen <sowen@cloudera.com> 11 July 2016, 21:03:43 UTC
bb92788 Revert "[SPARK-16372][MLLIB] Retag RDD to tallSkinnyQR of RowMatrix" This reverts commit 45dda92214191310a56333a2085e2343eba170cd. 07 July 2016, 17:34:50 UTC
45dda92 [SPARK-16372][MLLIB] Retag RDD to tallSkinnyQR of RowMatrix ## What changes were proposed in this pull request? The following Java code because of type erasing: ```Java JavaRDD<Vector> rows = jsc.parallelize(...); RowMatrix mat = new RowMatrix(rows.rdd()); QRDecomposition<RowMatrix, Matrix> result = mat.tallSkinnyQR(true); ``` We should use retag to restore the type to prevent the following exception: ```Java java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.mllib.linalg.Vector; ``` ## How was this patch tested? Java unit test Author: Xusen Yin <yinxusen@gmail.com> Closes #14051 from yinxusen/SPARK-16372. (cherry picked from commit 4c6f00d09c016dfc1d2de6e694dff219c9027fa0) Signed-off-by: Sean Owen <sowen@cloudera.com> 07 July 2016, 10:28:29 UTC
2588776 [MINOR][CORE][1.6-BACKPORT] Fix display wrong free memory size in the log ## What changes were proposed in this pull request? Free memory size displayed in the log is wrong (used memory), fix to make it correct. Backported to 1.6. ## How was this patch tested? N/A Author: jerryshao <sshao@hortonworks.com> Closes #14043 from jerryshao/memory-log-fix-1.6-backport. 06 July 2016, 13:49:21 UTC
7678195 [MINOR][BUILD] Download Maven 3.3.9 instead of 3.3.3 because the latter is no longer published on Apache mirrors ## What changes were proposed in this pull request? Download Maven 3.3.9 instead of 3.3.3 because the latter is no longer published on Apache mirrors ## How was this patch tested? Jenkins Author: Sean Owen <sowen@cloudera.com> Closes #14066 from srowen/Maven339Branch16. 06 July 2016, 11:27:17 UTC
4fcb888 [SPARK-16353][BUILD][DOC] Missing javadoc options for java unidoc Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-16353 ## What changes were proposed in this pull request? The javadoc options for the java unidoc generation are ignored when generating the java unidoc. For example, the generated `index.html` has the wrong HTML page title. This can be seen at http://spark.apache.org/docs/latest/api/java/index.html. I changed the relevant setting scope from `doc` to `(JavaUnidoc, unidoc)`. ## How was this patch tested? I ran `docs/jekyll build` and verified that the java unidoc `index.html` has the correct HTML page title. Author: Michael Allman <michael@videoamp.com> Closes #14031 from mallman/spark-16353. (cherry picked from commit 7dbffcdd6dc76b8e8d6a9cd6eeb24323a6b740c3) Signed-off-by: Sean Owen <sowen@cloudera.com> 04 July 2016, 20:16:41 UTC
c25aa8f [SPARK-16329][SQL][BACKPORT-1.6] Star Expansion over Table Containing No Column #14040 #### What changes were proposed in this pull request? Star expansion over a table containing zero column does not work since 1.6. However, it works in Spark 1.5.1. This PR is to fix the issue in the master branch. For example, ```scala val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => Row.empty) val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty)) dfNoCols.registerTempTable("temp_table_no_cols") sqlContext.sql("select * from temp_table_no_cols").show ``` Without the fix, users will get the following the exception: ``` java.lang.IllegalArgumentException: requirement failed at scala.Predef$.require(Predef.scala:221) at org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199) ``` #### How was this patch tested? Tests are added Author: gatorsmile <gatorsmile@gmail.com> Closes #14042 from gatorsmile/starExpansionEmpty. 04 July 2016, 16:40:08 UTC
1026aba [SPARK-15761][MLLIB][PYSPARK] Load ipython when default python is Python3 ## What changes were proposed in this pull request? I would like to use IPython with Python 3.5. It is annoying when it fails with "IPython requires Python 2.7+; please install python2.7 or set PYSPARK_PYTHON" when I have a version greater than 2.7. ## How was this patch tested? It now works with IPython and Python 3. Author: MechCoder <mks542@nyu.edu> Closes #13503 from MechCoder/spark-15761. (cherry picked from commit 66283ee0b25de2a5daaa21d50a05a7fadec1de77) Signed-off-by: Sean Owen <sowen@cloudera.com> 01 July 2016, 08:27:54 UTC
83f8604 [SPARK-16182][CORE] Utils.scala -- terminateProcess() should call Process.destroyForcibly() if and only if Process.destroy() fails ## What changes were proposed in this pull request? Utils.terminateProcess should `destroy()` first and only fall back to `destroyForcibly()` if it fails. It's kind of bad that we're force-killing executors -- and only in Java 8. See the JIRA for an example of the impact: no shutdown. While here: `Utils.waitForProcess` should use the Java 8 method if available instead of a custom implementation. ## How was this patch tested? Existing tests, which cover the force-kill case, and Amplab tests, which will cover both Java 7 and Java 8 eventually. However I tested locally on Java 8 and the PR builder will try Java 7 here. Author: Sean Owen <sowen@cloudera.com> Closes #13973 from srowen/SPARK-16182. (cherry picked from commit 2075bf8ef6035fd7606bcf20dc2cd7d7b9cda446) Signed-off-by: Sean Owen <sowen@cloudera.com> 01 July 2016, 08:25:02 UTC
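A hedged sketch of the termination order described above, assuming a `java.lang.Process` handle in scope and an arbitrary grace period:

```scala
import java.util.concurrent.TimeUnit

// Try a graceful destroy() first; only escalate to the Java 8 destroyForcibly()
// if the process is still alive after the (assumed) grace period.
process.destroy()
if (!process.waitFor(10, TimeUnit.SECONDS)) {
  process.destroyForcibly()
}
```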
ccc7fa3 [SPARK-16257][BUILD] Update spark_ec2.py to support Spark 1.6.2 and 1.6.3. ## What changes were proposed in this pull request? - Adds 1.6.2 and 1.6.3 as supported Spark versions within the bundled spark-ec2 script. - Makes the default Spark version 1.6.3 to keep in sync with the upcoming release. - Does not touch the newer spark-ec2 scripts in the separate amplabs repository. ## How was this patch tested? - Manual script execution: export AWS_SECRET_ACCESS_KEY=_snip_ export AWS_ACCESS_KEY_ID=_snip_ $SPARK_HOME/ec2/spark-ec2 \ --key-pair=_snip_ \ --identity-file=_snip_ \ --region=us-east-1 \ --vpc-id=_snip_ \ --slaves=1 \ --instance-type=t1.micro \ --spark-version=1.6.2 \ --hadoop-major-version=yarn \ launch test-cluster - Result: Successful creation of a 1.6.2-based Spark cluster. This contribution is my original work and I license the work to the project under the project's open source license. Author: Brian Uri <brian.uri@novetta.com> Closes #13947 from briuri/branch-1.6-bug-spark-16257. 30 June 2016, 06:52:28 UTC
1ac830a [SPARK-16044][SQL] Backport input_file_name() for data source based on NewHadoopRDD to branch 1.6 ## What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/13759. (`SqlNewHadoopRDDState` was renamed to `InputFileNameHolder` and `spark` API does not exist in branch 1.6) ## How was this patch tested? Unit tests in `ColumnExpressionSuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #13806 from HyukjinKwon/backport-SPARK-16044. 29 June 2016, 20:11:56 UTC
0cb06c9 [SPARK-16148][SCHEDULER] Allow for underscores in TaskLocation in the Executor ID ## What changes were proposed in this pull request? Previously, the TaskLocation implementation would not allow for executor ids which include underscores. This tweaks the string split used to get the hostname and executor id, allowing for underscores in the executor id. This addresses the JIRA found here: https://issues.apache.org/jira/browse/SPARK-16148 This is moved over from a previous PR against branch-1.6: https://github.com/apache/spark/pull/13857 ## How was this patch tested? Ran existing unit tests for core and streaming. Manually ran a simple streaming job with an executor whose id contained underscores and confirmed that the job ran successfully. This is my original work and I license the work to the project under the project's open source license. Author: Tom Magrino <tmagrino@fb.com> Closes #13858 from tmagrino/fixtasklocation. (cherry picked from commit ae14f362355b131fcb3e3633da7bb14bdd2b6893) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 28 June 2016, 20:38:59 UTC
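A hedged sketch of the parsing tweak described above, assuming the `executor_<host>_<executorId>` string format; the value shown is hypothetical:

```scala
// Splitting with a limit keeps an executor id that itself contains
// underscores (e.g. a YARN container id) in one piece.
val loc = "executor_host1.example.com_container_e07_000123_01"
val Array(_, host, executorId) = loc.split("_", 3)
// host == "host1.example.com", executorId == "container_e07_000123_01"
```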
4a67541 [SPARK-13023][PROJECT INFRA][FOLLOWUP][BRANCH-1.6] Unable to check `root` module, ending up in Python test failures ## What changes were proposed in this pull request? This PR fixes incorrect checking for the `root` module (meaning all tests). I realised that https://github.com/apache/spark/pull/13806 is failing due to this one. The PR corrects two files in `sql` and `core`, and fixing the `core` module seems to trigger all tests via the `root` value from `determine_modules_for_files`. So, `changed_modules` becomes as below: ``` ['root', 'sql'] ``` and `module.dependent_modules` becomes as below: ``` ['pyspark-mllib', 'pyspark-ml', 'hive-thriftserver', 'sparkr', 'mllib', 'examples', 'pyspark-sql'] ``` Now, `modules_to_test` does not include `root` and this check is skipped, but then both `changed_modules` and `modules_to_test` are merged after that, so `root` is included in the modules to test. This ends up failing with the message below (e.g. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60990/consoleFull): ``` Error: unrecognized module 'root'. Supported modules: pyspark-core, pyspark-sql, pyspark-streaming, pyspark-ml, pyspark-mllib ``` ## How was this patch tested? N/A Author: hyukjinkwon <gurwls223@gmail.com> Closes #13845 from HyukjinKwon/fix-build-1.6. 28 June 2016, 00:40:37 UTC
22a496d [SPARK-16214][EXAMPLES] fix the denominator of SparkPi ## What changes were proposed in this pull request? reduce the denominator of SparkPi by 1 ## How was this patch tested? integration tests Author: 杨浩 <yanghaogn@163.com> Closes #13910 from yanghaogn/patch-1. (cherry picked from commit b452026324da20f76f7d8b78e5ba1c007712e585) Signed-off-by: Sean Owen <sowen@cloudera.com> 27 June 2016, 07:32:12 UTC
60e095b [SPARK-16193][TESTS] Address flaky ExternalAppendOnlyMapSuite spilling tests ## What changes were proposed in this pull request? Make spill tests wait until job has completed before returning the number of stages that spilled ## How was this patch tested? Existing Jenkins tests. Author: Sean Owen <sowen@cloudera.com> Closes #13896 from srowen/SPARK-16193. (cherry picked from commit e87741589a24821b5fe73e5d9ee2164247998580) Signed-off-by: Sean Owen <sowen@cloudera.com> 25 June 2016, 11:14:40 UTC
24d59fb [MLLIB] org.apache.spark.mllib.util.SVMDataGenerator generates ArrayIndexOutOfBoundsException. I have found the bug and tested the solution. ## What changes were proposed in this pull request? Just adjust the size of an array in line 58 so it does not cause an ArrayOutOfBoundsException in line 66. ## How was this patch tested? Manual tests. I have recompiled the entire project with the fix, it has been built successfully and I have run the code, also with good results. line 66: val yD = blas.ddot(trueWeights.length, x, 1, trueWeights, 1) + rnd.nextGaussian() * 0.1 crashes because trueWeights has length "nfeatures + 1" while "x" has length "features", and they should have the same length. To fix this just make trueWeights be the same length as x. I have recompiled the project with the change and it is working now: [spark-1.6.1]$ spark-submit --master local[*] --class org.apache.spark.mllib.util.SVMDataGenerator mllib/target/spark-mllib_2.11-1.6.1.jar local /home/user/test And it generates the data successfully now in the specified folder. Author: José Antonio <joseanmunoz@gmail.com> Closes #13895 from j4munoz/patch-2. (cherry picked from commit a3c7b4187bad00dad87df7e3b5929a44d29568ed) Signed-off-by: Sean Owen <sowen@cloudera.com> 25 June 2016, 08:11:47 UTC
b7acc1b [SPARK-16173] [SQL] Can't join describe() of DataFrame in Scala 2.10 ## What changes were proposed in this pull request? This PR fixes `DataFrame.describe()` by forcing materialization to make the `Seq` serializable. Currently, `describe()` of `DataFrame` throws `Task not serializable` Spark exceptions when joining in Scala 2.10. ## How was this patch tested? Manual. (After building with Scala 2.10, test on bin/spark-shell and bin/pyspark.) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13902 from dongjoon-hyun/SPARK-16173-branch-1.6. 25 June 2016, 05:30:52 UTC
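A hedged reproducer of the scenario described above, assuming two DataFrames `df1` and `df2` are already in scope (Scala 2.10 build):

```scala
// describe() returns a summary DataFrame with a "summary" column; joining two of
// them previously threw "Task not serializable" until the summary rows were materialized.
val d1 = df1.describe()
val d2 = df2.describe()
d1.join(d2, d1("summary") === d2("summary")).show()
```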
d7223bb [SPARK-16077] [PYSPARK] catch the exception from pickle.whichmodule() ## What changes were proposed in this pull request? In the case that we don't know which module an object came from, we call pickle.whichmodule() to go through all the loaded modules to find the object, which could fail because of some modules, for example six; see https://bitbucket.org/gutworth/six/issues/63/importing-six-breaks-pickling We should ignore the exception here and use `__main__` as the module name (it means we can't find the module). ## How was this patch tested? Manually tested. Can't have a unit test for this. Author: Davies Liu <davies@databricks.com> Closes #13788 from davies/whichmodule. (cherry picked from commit d48935400ca47275f677b527c636976af09332c8) Signed-off-by: Davies Liu <davies.liu@gmail.com> 24 June 2016, 21:35:51 UTC
4fdac3c [SPARK-6005][TESTS] Fix flaky test: o.a.s.streaming.kafka.DirectKafkaStreamSuite.offset recovery ## What changes were proposed in this pull request? Because this test extracts data from `DStream.generatedRDDs` before stopping, it may get data before checkpointing. Then after recovering from the checkpoint, `recoveredOffsetRanges` may contain something not in `offsetRangesBeforeStop`, which will fail the test. Adding `Thread.sleep(1000)` before `ssc.stop()` will reproduce this failure. This PR just moves the logic of `offsetRangesBeforeStop` (also renamed to `offsetRangesAfterStop`) after `ssc.stop()` to fix the flaky test. ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #12903 from zsxwing/SPARK-6005. (cherry picked from commit 9533f5390a3ad7ab96a7bea01cdb6aed89503a51) Signed-off-by: Sean Owen <sowen@cloudera.com> 22 June 2016, 13:10:50 UTC
d98fb19 [SPARK-15606][CORE] Use non-blocking removeExecutor call to avoid deadlocks ## What changes were proposed in this pull request? Set minimum number of dispatcher threads to 3 to avoid deadlocks on machines with only 2 cores ## How was this patch tested? Spark test builds Author: Pete Robbins <robbinspg@gmail.com> Closes #13355 from robbinspg/SPARK-13906. 21 June 2016, 21:21:51 UTC
abe36c5 [SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6) ## What changes were proposed in this pull request? Fix the bug for Python UDF that does not have any arguments. ## How was this patch tested? Added regression tests. Author: Davies Liu <davies.liu@gmail.com> Closes #13793 from davies/fix_no_arguments. 21 June 2016, 03:50:30 UTC
db86e7f [SPARK-15613] [SQL] Fix incorrect days to millis conversion due to Daylight Saving Time Internally, we use Int to represent a date (the days since 1970-01-01). When we convert that into a unix timestamp (milliseconds since epoch in UTC), we get the offset of a timezone using local millis (the milliseconds since 1970-01-01 in that timezone), but TimeZone.getOffset() expects a unix timestamp, so the result could be off by one hour (in Daylight Saving Time (DST) or not). This PR changes to use a best-effort approximation of the posix timestamp to look up the offset. In the event of a DST change, some times are not defined (for example, 2016-03-13 02:00:00 PST) or could map to multiple valid results in UTC (for example, 2016-11-06 01:00:00); this best-effort approximation should be enough in practice. Added regression tests. Author: Davies Liu <davies@databricks.com> Closes #13652 from davies/fix_timezone. 20 June 2016, 20:38:16 UTC
16b7f1d [SPARK-14391][LAUNCHER] Fix launcher communication test, take 2. There's actually a race here: the state of the handler was changed before the connection was set, so the test code could be notified of the state change, wake up, and still see the connection as null, triggering the assert. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #12785 from vanzin/SPARK-14391. (cherry picked from commit 73c20bf32524c2232febc8c4b12d5fa228347163) 20 June 2016, 16:55:06 UTC
2083485 Preparing development version 1.6.3-SNAPSHOT 19 June 2016, 21:06:28 UTC
54b1121 Preparing Spark release v1.6.2-rc2 19 June 2016, 21:06:21 UTC
3d569d9 Revert "[SPARK-15613] [SQL] Fix incorrect days to millis conversion due to Daylight Saving Time" This reverts commit 41efd2091781b31118c6d37be59e4f0f4ae2bf66. 19 June 2016, 16:30:59 UTC
41efd20 [SPARK-15613] [SQL] Fix incorrect days to millis conversion due to Daylight Saving Time ## What changes were proposed in this pull request? Internally, we use an Int to represent a date (the days since 1970-01-01). When we convert that into a unix timestamp (milliseconds since the epoch in UTC), we get the offset of a timezone using local millis (the milliseconds since 1970-01-01 in that timezone), but TimeZone.getOffset() expects a unix timestamp, so the result can be off by one hour (whether in Daylight Saving Time (DST) or not). This PR changes the conversion to use a best-effort approximation of the posix timestamp to look up the offset. Around a DST transition, some local times are not defined (for example, 2016-03-13 02:00:00 PST) or map to multiple valid results in UTC (for example, 2016-11-06 01:00:00); this best-effort approximation should be enough in practice. ## How was this patch tested? Added regression tests. Author: Davies Liu <davies@databricks.com> Closes #13652 from davies/fix_timezone. (cherry picked from commit 001a58960311b07fe80e2f01e473f4987948d06e) Signed-off-by: Davies Liu <davies.liu@gmail.com> 19 June 2016, 07:35:17 UTC
3f1d730 [SPARK-16035][PYSPARK] Fix SparseVector parser assertion for end parenthesis ## What changes were proposed in this pull request? The check on the end parenthesis of the expression to parse was using the wrong variable. I corrected that. ## How was this patch tested? Manual test Author: andreapasqua <andrea@radius.com> Closes #13750 from andreapasqua/sparse-vector-parser-assertion-fix. (cherry picked from commit 4c64e88d5ba4c36cbdbc903376492f0f43401e4e) Signed-off-by: Xiangrui Meng <meng@databricks.com> 18 June 2016, 05:41:28 UTC
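For context, the string form being parsed looks like `(3,[0,2],[1.0,3.0])`. A simplified parser sketch (not the actual pyspark.mllib code) showing the kind of end-parenthesis check the fix corrects, with the assertion applied to the string actually being parsed:

```python
def parse_sparse(s):
    # Simplified sketch of parsing "(size,[indices],[values])", e.g.
    # "(3,[0,2],[1.0,3.0])". The point of the upstream fix was that the
    # end-parenthesis assertion must inspect the string actually being
    # parsed; this is not the real pyspark.mllib implementation.
    s = s.strip()
    if not (s.startswith("(") and s.endswith(")")):  # check `s`, not another variable
        raise ValueError("expected '(...)' but got %r" % s)
    size_str, rest = s[1:-1].split(",", 1)
    indices_str, values_str = rest.split("],[")
    indices = [int(x) for x in indices_str.lstrip("[").split(",")]
    values = [float(x) for x in values_str.rstrip("]").split(",")]
    return int(size_str), indices, values


print(parse_sparse("(3,[0,2],[1.0,3.0])"))  # (3, [0, 2], [1.0, 3.0])
```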
fd05389 [SPARK-15892][ML] Backport correctly merging AFTAggregators to branch 1.6 ## What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/13619. The original test added in branch-2.0 failed in branch-1.6. This seems to be because the behaviour was changed in https://github.com/apache/spark/commit/101663f1ae222a919fc40510aa4f2bad22d1be6f. The failure happened while calculating Euler's number, which ends up as infinity regardless of this patch. So, I brought the dataset from `AFTSurvivalRegressionExample` to make sure this is working and then wrote the test. I ran the test before/after creating empty partitions: `model.scale` becomes `1.0` with empty partitions and `1.547` without them. After this patch, it is always `1.547`. ## How was this patch tested? Unit test in `AFTSurvivalRegressionSuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #13725 from HyukjinKwon/SPARK-15892-1-6. 18 June 2016, 04:04:24 UTC
e530823 Revert "[SPARK-15395][CORE] Use getHostString to create RpcAddress (backport for 1.6)" This reverts commit 7ad82b663092615b02bef3991fb1a21af77d2358. See SPARK-16017. 17 June 2016, 20:33:43 UTC
4621fe9 Preparing development version 1.6.3-SNAPSHOT 16 June 2016, 23:40:26 UTC
4168d9c Preparing Spark release v1.6.2-rc1 16 June 2016, 23:40:19 UTC
b8f380f Preparing development version 1.6.3-SNAPSHOT 16 June 2016, 23:35:51 UTC
f166493 Preparing Spark release v1.6.2 16 June 2016, 23:35:44 UTC
a4485c3 Update branch-1.6 for 1.6.2 release. 16 June 2016, 23:30:18 UTC
0a8ada5 [SPARK-15975] Fix improper Popen retcode handling in dev/run-tests In the `dev/run-tests.py` script we check a `Popen.retcode` for success using `retcode > 0`, but this is subtly wrong because Popen's return code will be negative if the child process was terminated by a signal: https://docs.python.org/2/library/subprocess.html#subprocess.Popen.returncode In order to properly handle signals, we should change this to check `retcode != 0` instead. Author: Josh Rosen <joshrosen@databricks.com> Closes #13692 from JoshRosen/dev-run-tests-return-code-handling. (cherry picked from commit acef843f67e770f0a2709fb3fbd1a53c200b2bc5) Signed-off-by: Andrew Or <andrew@databricks.com> 16 June 2016, 21:19:19 UTC
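A minimal sketch of the corrected check (hypothetical helper, not the actual dev/run-tests.py code):

```python
import subprocess


def run_step(cmd):
    # A child killed by a signal reports a *negative* returncode (e.g. -9 for
    # SIGKILL), so `retcode > 0` would treat it as success; `retcode != 0`
    # catches both ordinary failures and signal deaths.
    retcode = subprocess.Popen(cmd).wait()
    if retcode != 0:
        raise SystemExit("command %s failed with exit code %d" % (cmd, retcode))


run_step(["true"])    # passes on a Unix machine
# run_step(["false"]) # would exit with "failed with exit code 1"
```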
cffc080 [SPARK-15915][SQL] Logical plans should use subqueries eliminated plan when override sameResult. ## What changes were proposed in this pull request? This pr is a backport of #13638 for `branch-1.6`. ## How was this patch tested? Added the same test as #13638 modified for `branch-1.6`. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #13668 from ueshin/issues/SPARK-15915_1.6. 15 June 2016, 17:05:19 UTC
2f3e327 Revert "[SPARK-15892][ML] Incorrectly merged AFTAggregator with zero total count" This reverts commit be3c41b2633215ff6f20885c04f288aab25a1712. 14 June 2016, 21:08:33 UTC
be3c41b [SPARK-15892][ML] Incorrectly merged AFTAggregator with zero total count ## What changes were proposed in this pull request? Currently, `AFTAggregator` is not being merged correctly. For example, if there is any single empty partition in the data, this creates an `AFTAggregator` with zero total count which causes the exception below: ``` IllegalArgumentException: u'requirement failed: The number of instances should be greater than 0.0, but got 0.' ``` Please see [AFTSurvivalRegression.scala#L573-L575](https://github.com/apache/spark/blob/6ecedf39b44c9acd58cdddf1a31cf11e8e24428c/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala#L573-L575) as well. Just to be clear, the Python example `aft_survival_regression.py` uses only 5 rows, so if there are more than 5 partitions, some of them are empty and the merge produces an incorrect `AFTAggregator`, which throws the exception above. Executing `bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py` on a machine with more than 5 CPU cores therefore fails with default configurations, because it creates tasks with some empty partitions (AFAIK, the parallelism level is set to the number of CPU cores). ## How was this patch tested? A unit test in `AFTSurvivalRegressionSuite.scala` and manual testing via `bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py`. Author: hyukjinkwon <gurwls223@gmail.com> Author: Hyukjin Kwon <gurwls223@gmail.com> Closes #13619 from HyukjinKwon/SPARK-15892. (cherry picked from commit e3554605b36bdce63ac180cc66dbdee5c1528ec7) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 12 June 2016, 21:27:20 UTC
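A toy Python sketch of the merge guard the fix adds (not Spark's Scala `AFTAggregator`): an aggregator built from an empty partition has a zero count and must be treated as a no-op partner during the merge:

```python
class ToyAggregator:
    # Toy stand-in for Spark's AFTAggregator (not the real class), reduced to
    # a count and a running sum, to show the shape of the fix: skip merging
    # partners that saw no data, i.e. came from empty partitions.
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def add(self, value):
        self.count += 1
        self.total += value
        return self

    def merge(self, other):
        if other.count == 0:  # partner came from an empty partition: skip it
            return self
        self.count += other.count
        self.total += other.total
        return self


merged = ToyAggregator()
for part in [[1.0, 2.0], [], [3.0]]:  # one empty partition, as in the report
    agg = ToyAggregator()
    for v in part:
        agg.add(v)
    merged.merge(agg)
print(merged.count, merged.total)  # 3 6.0
```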
393f4ba [DOCUMENTATION] fixed groupby aggregation example for pyspark ## What changes were proposed in this pull request? Fixing the documentation for the groupby/agg example in Python. ## How was this patch tested? The existing example in the documentation does not contain valid syntax (missing parenthesis) and does not use `Column` in the expression for `agg()`. After the fix, here's how I tested it: ``` In [1]: from pyspark.sql import Row In [2]: import pyspark.sql.functions as func In [3]: %cpaste Pasting code; enter '--' alone on the line to stop or use Ctrl-D. :records = [{'age': 19, 'department': 1, 'expense': 100}, : {'age': 20, 'department': 1, 'expense': 200}, : {'age': 21, 'department': 2, 'expense': 300}, : {'age': 22, 'department': 2, 'expense': 300}, : {'age': 23, 'department': 3, 'expense': 300}] :-- In [4]: df = sqlContext.createDataFrame([Row(**d) for d in records]) In [5]: df.groupBy("department").agg(df["department"], func.max("age"), func.sum("expense")).show() +----------+----------+--------+------------+ |department|department|max(age)|sum(expense)| +----------+----------+--------+------------+ | 1| 1| 20| 300| | 2| 2| 22| 600| | 3| 3| 23| 300| +----------+----------+--------+------------+ ``` Author: Mortada Mehyar <mortada.mehyar@gmail.com> Closes #13587 from mortada/groupby_agg_doc_fix. (cherry picked from commit 675a73715d3c8adb9d9a9dce5f76a2db5106790c) Signed-off-by: Reynold Xin <rxin@databricks.com> 10 June 2016, 07:23:49 UTC
739d992 [SPARK-15827][BUILD] Publish Spark's forked sbt-pom-reader to Maven Central Spark's SBT build currently uses a fork of the sbt-pom-reader plugin but depends on that fork via an SBT subproject which is cloned from https://github.com/scrapcodes/sbt-pom-reader/tree/ignore_artifact_id. This unnecessarily slows down the initial build on fresh machines and also risks a build breakage if that GitHub repository ever changes or is deleted. In order to address these issues, I have published a pre-built binary of our forked sbt-pom-reader plugin to Maven Central under the `org.spark-project` namespace and have updated Spark's build to use that artifact. This published artifact was built from https://github.com/JoshRosen/sbt-pom-reader/tree/v1.0.0-spark, which contains the contents of ScrapCodes's branch plus an additional patch to configure the build for artifact publication. /cc srowen ScrapCodes for review. Author: Josh Rosen <joshrosen@databricks.com> Closes #13564 from JoshRosen/use-published-fork-of-pom-reader. (cherry picked from commit f74b77713e17960dddb7459eabfdc19f08f4024b) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 09 June 2016, 18:08:22 UTC
bb917fc [SPARK-12712] Fix failure in ./dev/test-dependencies when run against empty .m2 cache This patch fixes a bug in `./dev/test-dependencies.sh` which caused spurious failures when the script was run on a machine with an empty `.m2` cache. The problem was that extra log output from the dependency download was conflicting with the grep / regex used to identify the classpath in the Maven output. This patch fixes this issue by adjusting the regex pattern. Tested manually with the following reproduction of the bug: ``` rm -rf ~/.m2/repository/org/apache/commons/ ./dev/test-dependencies.sh ``` Author: Josh Rosen <joshrosen@databricks.com> Closes #13568 from JoshRosen/SPARK-12712. (cherry picked from commit 921fa40b14082bfd1094fa49fb3b0c46a79c1aaa) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 09 June 2016, 07:52:43 UTC
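A hedged sketch of the idea behind the regex adjustment (hypothetical helper, not the script's actual grep/regex): keep only output lines that look like a colon-separated list of jar paths, so interleaved download-progress lines cannot be mistaken for the classpath:

```python
import re

_CLASSPATH_RE = re.compile(r"(/[^:\s]+\.jar)(:/[^:\s]+\.jar)*")


def extract_classpath(maven_output):
    # `mvn dependency:build-classpath` output may contain download progress
    # lines when the local .m2 cache is empty; keep only lines that look like
    # a colon-separated list of absolute jar paths and take the last one.
    candidates = [line.strip() for line in maven_output.splitlines()
                  if _CLASSPATH_RE.fullmatch(line.strip())]
    return candidates[-1] if candidates else None


sample = "Downloading: https://repo1.maven.org/...\n/home/user/.m2/a.jar:/home/user/.m2/b.jar\n"
print(extract_classpath(sample))  # /home/user/.m2/a.jar:/home/user/.m2/b.jar
```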