https://github.com/apache/spark

Revision Author Date Message Commit Date
41116ab Preparing Spark release v2.2.1-rc1 13 November 2017, 19:04:27 UTC
c68b4c5 [MINOR][CORE] Using bufferedInputStream for dataDeserializeStream ## What changes were proposed in this pull request? Small fix. Using bufferedInputStream for dataDeserializeStream. ## How was this patch tested? Existing UT. Author: Xianyang Liu <xianyang.liu@intel.com> Closes #19735 from ConeyLiu/smallfix. (cherry picked from commit 176ae4d53e0269cfc2cfa62d3a2991e28f5a9182) Signed-off-by: Sean Owen <sowen@cloudera.com> 13 November 2017, 12:19:21 UTC
2f6dece [SPARK-22442][SQL][BRANCH-2.2][FOLLOWUP] ScalaReflection should produce correct field names for special characters ## What changes were proposed in this pull request? `val TermName: TermNameExtractor` is new in scala 2.11. For 2.10, we should use deprecated `newTermName`. ## How was this patch tested? Build locally with scala 2.10. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19736 from viirya/SPARK-22442-2.2-followup. 13 November 2017, 11:41:42 UTC
f736377 [SPARK-22442][SQL][BRANCH-2.2] ScalaReflection should produce correct field names for special characters ## What changes were proposed in this pull request? For a class with field name of special characters, e.g.: ```scala case class MyType(`field.1`: String, `field 2`: String) ``` Although we can manipulate DataFrame/Dataset, the field names are encoded: ```scala scala> val df = Seq(MyType("a", "b"), MyType("c", "d")).toDF df: org.apache.spark.sql.DataFrame = [field$u002E1: string, field$u00202: string] scala> df.as[MyType].collect res7: Array[MyType] = Array(MyType(a,b), MyType(c,d)) ``` It causes resolving problem when we try to convert the data with non-encoded field names: ```scala spark.read.json(path).as[MyType] ... [info] org.apache.spark.sql.AnalysisException: cannot resolve '`field$u002E1`' given input columns: [field 2, fie ld.1]; [info] at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) ... ``` We should use decoded field name in Dataset schema. ## How was this patch tested? Added tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19734 from viirya/SPARK-22442-2.2. 13 November 2017, 05:19:15 UTC
8acd02f [SPARK-21694][R][ML] Reduce max iterations in Linear SVM test in R to speed up AppVeyor build This PR proposes to reduce the max iterations in the Linear SVM test in SparkR. This particular test takes roughly 5 mins on my Mac and over 20 mins on Windows. The root cause appears to be that the default of 100 max iterations triggers roughly 2500 jobs. In Linux, `daemon.R` is forked, but on Windows another process is launched for each job, which is extremely slow; the many non-forked processes explain the difference in elapsed time on Windows. After reducing the max iterations to 10, the total number of jobs in this single test drops to about 550; reducing it to 5 drops it to about 360. Manually tested the elapsed times. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19722 from HyukjinKwon/SPARK-21693-test. (cherry picked from commit 3d90b2cb384affe8ceac9398615e9e21b8c8e0b0) Signed-off-by: Felix Cheung <felixcheung@apache.org> 12 November 2017, 22:44:36 UTC
2a04cfa [SPARK-19606][BUILD][BACKPORT-2.2][MESOS] fix mesos break ## What changes were proposed in this pull request? Fix break from cherry pick ## How was this patch tested? Jenkins Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #19732 from felixcheung/fixmesosdriverconstraint. 12 November 2017, 22:27:49 UTC
95981fa [SPARK-22464][BACKPORT-2.2][SQL] No pushdown for Hive metastore partition predicates containing null-safe equality ## What changes were proposed in this pull request? `<=>` is not supported by Hive metastore partition predicate pushdown. We should not push it down to the Hive metastore when it is used in partition predicates. ## How was this patch tested? Added a test case Author: gatorsmile <gatorsmile@gmail.com> Closes #19724 from gatorsmile/backportSPARK-22464. 12 November 2017, 22:17:06 UTC
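A minimal sketch of the kind of query this backport affects (table and partition column names are assumptions): the null-safe equality operator on a partition column is no longer handed to the Hive metastore as a partition filter.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical Hive table `events` partitioned by `dt`; after this fix the
// `<=>` predicate is evaluated by Spark instead of being pushed to the metastore.
val spark = SparkSession.builder()
  .appName("null-safe-partition-predicate")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SELECT COUNT(*) FROM events WHERE dt <=> '2017-11-01'").show()
```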
00cb9d0 [SPARK-22488][BACKPORT-2.2][SQL] Fix the view resolution issue in the SparkSession internal table() API ## What changes were proposed in this pull request? The current internal `table()` API of `SparkSession` bypasses the Analyzer and directly calls `sessionState.catalog.lookupRelation` API. This skips the view resolution logics in our Analyzer rule `ResolveRelations`. This internal API is widely used by various DDL commands, public and internal APIs. Users might get the strange error caused by view resolution when the default database is different. ``` Table or view not found: t1; line 1 pos 14 org.apache.spark.sql.AnalysisException: Table or view not found: t1; line 1 pos 14 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) ``` This PR is to fix it by enforcing it to use `ResolveRelations` to resolve the table. ## How was this patch tested? Added a test case and modified the existing test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #19723 from gatorsmile/backport22488. 12 November 2017, 22:15:58 UTC
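A hedged sketch of the scenario the backport fixes (database, table, and view names are made up, and the exact command is illustrative): a persistent view referenced through a DDL command while the current database differs from the one the view was created in. The point is that the internal table() API now resolves views through the Analyzer's `ResolveRelations` rule.

```scala
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("view-resolution").enableHiveSupport().getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS db1")
spark.sql("USE db1")
spark.sql("CREATE TABLE IF NOT EXISTS t1 (id INT) USING parquet")
spark.sql("CREATE VIEW IF NOT EXISTS v1 AS SELECT * FROM t1")
spark.sql("USE default")
// Previously this kind of lookup could fail with "Table or view not found: t1"
// because the view body was resolved against the wrong default database.
spark.sql("DESCRIBE TABLE db1.v1").show(false)
```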
114dc42 [SPARK-21720][SQL] Fix 64KB JVM bytecode limit problem with AND or OR This PR changes `AND` and `OR` code generation to place the condition and the expressions' generated code into separate methods when their size could be large. When such a method is generated, variables for `isNull` and `value` are declared as instance variables to pass these values (e.g. `isNull1409` and `value1409`) to the callers of the generated method. This PR resolved two cases: * large code size of the left expression * large code size of the right expression Added a new test case to `CodeGenerationSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18972 from kiszk/SPARK-21720. (cherry picked from commit 9bf696dbece6b1993880efba24a6d32c54c4d11c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 November 2017, 21:47:54 UTC
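A hedged way to picture the trigger (column name and widths are arbitrary) is a filter built from very many chained `OR` comparisons, whose generated evaluation code could previously land in a single oversized method.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("wide-or").master("local[*]").getOrCreate()
import spark.implicits._

val df = (1 to 100).toDF("id")

// 500 ORed equality checks chained together; before the fix the codegen for a
// predicate like this could exceed the JVM's 64KB bytecode limit per method.
val widePredicate = (1 to 500).map(i => col("id") === i).reduce(_ || _)

df.filter(widePredicate).count()
```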
f6ee3d9 [SPARK-19606][MESOS] Support constraints in spark-dispatcher As discussed in SPARK-19606, this adds a new config property named "spark.mesos.constraints.driver" for constraining drivers running on a Mesos cluster. A corresponding unit test was added; also tested locally on a Mesos cluster. Author: Paul Mackles <pmackles@adobe.com> Closes #19543 from pmackles/SPARK-19606. (cherry picked from commit b3f9dbf48ec0938ff5c98833bb6b6855c620ef57) Signed-off-by: Felix Cheung <felixcheung@apache.org> 12 November 2017, 19:40:42 UTC
4ef0bef [SPARK-21667][STREAMING] ConsoleSink should not fail streaming query with checkpointLocation option ## What changes were proposed in this pull request? Fix to allow recovery with the console sink and avoid the checkpoint exception. ## How was this patch tested? Existing tests and manual tests (replicating the error and confirming no checkpoint error after the fix). Author: Rekha Joshi <rekhajoshm@gmail.com> Author: rjoshi2 <rekhajoshm@gmail.com> Closes #19407 from rekhajoshm/SPARK-21667. (cherry picked from commit 808e886b9638ab2981dac676b594f09cda9722fe) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 10 November 2017, 23:18:35 UTC
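A small sketch of the combination this fix makes work (the checkpoint path and rate-source settings are assumptions): a console sink started with an explicit checkpointLocation, which previously could fail on recovery.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("console-checkpoint").master("local[*]").getOrCreate()

val query = spark.readStream
  .format("rate")                       // built-in test source
  .option("rowsPerSecond", "1")
  .load()
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/console-sink-checkpoint")  // illustrative path
  .start()

query.awaitTermination(10000)           // wait up to 10 seconds
query.stop()
```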
8b7f72e [SPARK-19644][SQL] Clean up Scala reflection garbage after creating Encoder (branch-2.2) ## What changes were proposed in this pull request? Backport #19687 to branch-2.2. The major difference is `cleanUpReflectionObjects` is protected by `ScalaReflectionLock.synchronized` in this PR for Scala 2.10. ## How was this patch tested? Jenkins Author: Shixiong Zhu <zsxwing@gmail.com> Closes #19718 from zsxwing/SPARK-19644-2.2. 10 November 2017, 22:14:47 UTC
6b4ec22 [SPARK-22284][SQL] Fix 64KB JVM bytecode limit problem in calculating hash for nested structs ## What changes were proposed in this pull request? This PR avoids to generate a huge method for calculating a murmur3 hash for nested structs. This PR splits a huge method (e.g. `apply_4`) into multiple smaller methods. Sample program ``` val structOfString = new StructType().add("str", StringType) var inner = new StructType() for (_ <- 0 until 800) { inner = inner1.add("structOfString", structOfString) } var schema = new StructType() for (_ <- 0 until 50) { schema = schema.add("structOfStructOfStrings", inner) } GenerateMutableProjection.generate(Seq(Murmur3Hash(exprs, 42))) ``` Without this PR ``` /* 005 */ class SpecificMutableProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection { /* 006 */ /* 007 */ private Object[] references; /* 008 */ private InternalRow mutableRow; /* 009 */ private int value; /* 010 */ private int value_0; ... /* 034 */ public java.lang.Object apply(java.lang.Object _i) { /* 035 */ InternalRow i = (InternalRow) _i; /* 036 */ /* 037 */ /* 038 */ /* 039 */ value = 42; /* 040 */ apply_0(i); /* 041 */ apply_1(i); /* 042 */ apply_2(i); /* 043 */ apply_3(i); /* 044 */ apply_4(i); /* 045 */ nestedClassInstance.apply_5(i); ... /* 089 */ nestedClassInstance8.apply_49(i); /* 090 */ value_0 = value; /* 091 */ /* 092 */ // copy all the results into MutableRow /* 093 */ mutableRow.setInt(0, value_0); /* 094 */ return mutableRow; /* 095 */ } /* 096 */ /* 097 */ /* 098 */ private void apply_4(InternalRow i) { /* 099 */ /* 100 */ boolean isNull5 = i.isNullAt(4); /* 101 */ InternalRow value5 = isNull5 ? null : (i.getStruct(4, 800)); /* 102 */ if (!isNull5) { /* 103 */ /* 104 */ if (!value5.isNullAt(0)) { /* 105 */ /* 106 */ final InternalRow element6400 = value5.getStruct(0, 1); /* 107 */ /* 108 */ if (!element6400.isNullAt(0)) { /* 109 */ /* 110 */ final UTF8String element6401 = element6400.getUTF8String(0); /* 111 */ value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6401.getBaseObject(), element6401.getBaseOffset(), element6401.numBytes(), value); /* 112 */ /* 113 */ } /* 114 */ /* 115 */ /* 116 */ } /* 117 */ /* 118 */ /* 119 */ if (!value5.isNullAt(1)) { /* 120 */ /* 121 */ final InternalRow element6402 = value5.getStruct(1, 1); /* 122 */ /* 123 */ if (!element6402.isNullAt(0)) { /* 124 */ /* 125 */ final UTF8String element6403 = element6402.getUTF8String(0); /* 126 */ value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6403.getBaseObject(), element6403.getBaseOffset(), element6403.numBytes(), value); /* 127 */ /* 128 */ } /* 128 */ } /* 129 */ /* 130 */ /* 131 */ } /* 132 */ /* 133 */ /* 134 */ if (!value5.isNullAt(2)) { /* 135 */ /* 136 */ final InternalRow element6404 = value5.getStruct(2, 1); /* 137 */ /* 138 */ if (!element6404.isNullAt(0)) { /* 139 */ /* 140 */ final UTF8String element6405 = element6404.getUTF8String(0); /* 141 */ value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6405.getBaseObject(), element6405.getBaseOffset(), element6405.numBytes(), value); /* 142 */ /* 143 */ } /* 144 */ /* 145 */ /* 146 */ } /* 147 */ ... 
/* 12074 */ if (!value5.isNullAt(798)) { /* 12075 */ /* 12076 */ final InternalRow element7996 = value5.getStruct(798, 1); /* 12077 */ /* 12078 */ if (!element7996.isNullAt(0)) { /* 12079 */ /* 12080 */ final UTF8String element7997 = element7996.getUTF8String(0); /* 12083 */ } /* 12084 */ /* 12085 */ /* 12086 */ } /* 12087 */ /* 12088 */ /* 12089 */ if (!value5.isNullAt(799)) { /* 12090 */ /* 12091 */ final InternalRow element7998 = value5.getStruct(799, 1); /* 12092 */ /* 12093 */ if (!element7998.isNullAt(0)) { /* 12094 */ /* 12095 */ final UTF8String element7999 = element7998.getUTF8String(0); /* 12096 */ value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element7999.getBaseObject(), element7999.getBaseOffset(), element7999.numBytes(), value); /* 12097 */ /* 12098 */ } /* 12099 */ /* 12100 */ /* 12101 */ } /* 12102 */ /* 12103 */ } /* 12104 */ /* 12105 */ } /* 12106 */ /* 12106 */ /* 12107 */ /* 12108 */ private void apply_1(InternalRow i) { ... ``` With this PR ``` /* 005 */ class SpecificMutableProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection { /* 006 */ /* 007 */ private Object[] references; /* 008 */ private InternalRow mutableRow; /* 009 */ private int value; /* 010 */ private int value_0; /* 011 */ ... /* 034 */ public java.lang.Object apply(java.lang.Object _i) { /* 035 */ InternalRow i = (InternalRow) _i; /* 036 */ /* 037 */ /* 038 */ /* 039 */ value = 42; /* 040 */ nestedClassInstance11.apply50_0(i); /* 041 */ nestedClassInstance11.apply50_1(i); ... /* 088 */ nestedClassInstance11.apply50_48(i); /* 089 */ nestedClassInstance11.apply50_49(i); /* 090 */ value_0 = value; /* 091 */ /* 092 */ // copy all the results into MutableRow /* 093 */ mutableRow.setInt(0, value_0); /* 094 */ return mutableRow; /* 095 */ } /* 096 */ ... /* 37717 */ private void apply4_0(InternalRow value5, InternalRow i) { /* 37718 */ /* 37719 */ if (!value5.isNullAt(0)) { /* 37720 */ /* 37721 */ final InternalRow element6400 = value5.getStruct(0, 1); /* 37722 */ /* 37723 */ if (!element6400.isNullAt(0)) { /* 37724 */ /* 37725 */ final UTF8String element6401 = element6400.getUTF8String(0); /* 37726 */ value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6401.getBaseObject(), element6401.getBaseOffset(), element6401.numBytes(), value); /* 37727 */ /* 37728 */ } /* 37729 */ /* 37730 */ /* 37731 */ } /* 37732 */ /* 37733 */ if (!value5.isNullAt(1)) { /* 37734 */ /* 37735 */ final InternalRow element6402 = value5.getStruct(1, 1); /* 37736 */ /* 37737 */ if (!element6402.isNullAt(0)) { /* 37738 */ /* 37739 */ final UTF8String element6403 = element6402.getUTF8String(0); /* 37740 */ value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6403.getBaseObject(), element6403.getBaseOffset(), element6403.numBytes(), value); /* 37741 */ /* 37742 */ } /* 37743 */ /* 37744 */ /* 37745 */ } /* 37746 */ /* 37747 */ if (!value5.isNullAt(2)) { /* 37748 */ /* 37749 */ final InternalRow element6404 = value5.getStruct(2, 1); /* 37750 */ /* 37751 */ if (!element6404.isNullAt(0)) { /* 37752 */ /* 37753 */ final UTF8String element6405 = element6404.getUTF8String(0); /* 37754 */ value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6405.getBaseObject(), element6405.getBaseOffset(), element6405.numBytes(), value); /* 37755 */ /* 37756 */ } /* 37757 */ /* 37758 */ /* 37759 */ } /* 37760 */ /* 37761 */ } ... 
/* 218470 */ /* 218471 */ private void apply50_4(InternalRow i) { /* 218472 */ /* 218473 */ boolean isNull5 = i.isNullAt(4); /* 218474 */ InternalRow value5 = isNull5 ? null : (i.getStruct(4, 800)); /* 218475 */ if (!isNull5) { /* 218476 */ apply4_0(value5, i); /* 218477 */ apply4_1(value5, i); /* 218478 */ apply4_2(value5, i); ... /* 218742 */ nestedClassInstance.apply4_266(value5, i); /* 218743 */ } /* 218744 */ /* 218745 */ } ``` ## How was this patch tested? Added new test to `HashExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19563 from kiszk/SPARK-22284. (cherry picked from commit f2da738c76810131045e6c32533a2d13526cdaf6) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 November 2017, 20:18:14 UTC
371be22 [SPARK-22294][DEPLOY] Reset spark.driver.bindAddress when starting a Checkpoint ## What changes were proposed in this pull request? It seems that recovering from a checkpoint can replace the old driver and executor IP addresses, as the workload can now be taking place in a different cluster configuration. It follows that the bindAddress for the master may also have changed. Thus we should not keep the old one; instead it should be added to the list of properties to reset and recreate from the new environment. ## How was this patch tested? This patch was tested via manual testing on AWS, using the experimental (not yet merged) Kubernetes scheduler, which uses bindAddress to bind to a Kubernetes service (which is how I first encountered the bug), but this is not a code path related to the scheduler, and the issue may have slipped through when merging SPARK-4563. Author: Santiago Saavedra <ssaavedra@openshine.com> Closes #19427 from ssaavedra/fix-checkpointing-master. (cherry picked from commit 5ebdcd185f2108a90e37a1aa4214c3b6c69a97a4) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 10 November 2017, 18:58:10 UTC
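For context, a hedged sketch of the recovery path involved (checkpoint directory, host, and port are placeholders): on restart, `StreamingContext.getOrCreate` rebuilds the context from the checkpoint, and with this change `spark.driver.bindAddress` comes from the new environment rather than from the serialized checkpoint.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // Minimal pipeline so there is something to checkpoint and recover.
  ssc.socketTextStream("stream-host", 9999).count().print()
  ssc
}

// On a restart in a new cluster environment, the recovered context should pick
// up the current bindAddress instead of the one stored in the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```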
eb49c32 [SPARK-22243][DSTREAM] spark.yarn.jars should reload from config when checkpoint recovery ## What changes were proposed in this pull request? The previous [PR](https://github.com/apache/spark/pull/19469) was deleted by mistake. The solution is straightforward: add "spark.yarn.jars" to propertiesToReload so this property is reloaded from the config. ## How was this patch tested? manual tests Author: ZouChenjun <zouchenjun@youzan.com> Closes #19637 from ChenjunZou/checkpoint-yarn-jars. 10 November 2017, 18:57:22 UTC
0568f28 [SPARK-22344][SPARKR] clean up install dir if running test as source package ## What changes were proposed in this pull request? remove spark if spark downloaded & installed ## How was this patch tested? manually by building package Jenkins, AppVeyor Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #19657 from felixcheung/rinstalldir. (cherry picked from commit b70aa9e08b4476746e912c2c2a8b7bdd102305e8) Signed-off-by: Felix Cheung <felixcheung@apache.org> 10 November 2017, 18:22:56 UTC
7551524 [SPARK-22472][SQL] add null check for top-level primitive values ## What changes were proposed in this pull request? One powerful feature of `Dataset` is, we can easily map SQL rows to Scala/Java objects and do runtime null check automatically. For example, let's say we have a parquet file with schema `<a: int, b: string>`, and we have a `case class Data(a: Int, b: String)`. Users can easily read this parquet file into `Data` objects, and Spark will throw NPE if column `a` has null values. However the null checking is left behind for top-level primitive values. For example, let's say we have a parquet file with schema `<a: Int>`, and we read it into Scala `Int`. If column `a` has null values, we will get some weird results. ``` scala> val ds = spark.read.parquet(...).as[Int] scala> ds.show() +----+ |v | +----+ |null| |1 | +----+ scala> ds.collect res0: Array[Long] = Array(0, 1) scala> ds.map(_ * 2).show +-----+ |value| +-----+ |-2 | |2 | +-----+ ``` This is because internally Spark use some special default values for primitive types, but never expect users to see/operate these default value directly. This PR adds null check for top-level primitive values ## How was this patch tested? new test Author: Wenchen Fan <wenchen@databricks.com> Closes #19707 from cloud-fan/bug. (cherry picked from commit 0025ddeb1dd4fd6951ecd8456457f6b94124f84e) Signed-off-by: gatorsmile <gatorsmile@gmail.com> # Conflicts: # sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala # sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala 10 November 2017, 06:00:12 UTC
1b70c66 [SPARK-22287][MESOS] SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher ## What changes were proposed in this pull request? Allow JVM max heap size to be controlled for MesosClusterDispatcher via SPARK_DAEMON_MEMORY environment variable. ## How was this patch tested? Tested on local Mesos cluster Author: Paul Mackles <pmackles@adobe.com> Closes #19515 from pmackles/SPARK-22287. (cherry picked from commit f5fe63f7b8546b0102d7bfaf3dde77379f58a4d1) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 10 November 2017, 00:42:45 UTC
ede0e1a [SPARK-22403][SS] Add optional checkpointLocation argument to StructuredKafkaWordCount example ## What changes were proposed in this pull request? When run in YARN cluster mode, the StructuredKafkaWordCount example fails because Spark tries to create a temporary checkpoint location in a subdirectory of the path given by java.io.tmpdir, and YARN sets java.io.tmpdir to a path in the local filesystem that usually does not correspond to an existing path in the distributed filesystem. Add an optional checkpointLocation argument to the StructuredKafkaWordCount example so that users can specify the checkpoint location and avoid this issue. ## How was this patch tested? Built and ran the example manually on YARN client and cluster mode. Author: Wing Yew Poon <wypoon@cloudera.com> Closes #19703 from wypoon/SPARK-22403. (cherry picked from commit 11c4021044f3a302449a2ea76811e73f5c99a26a) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 10 November 2017, 00:21:06 UTC
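A condensed sketch of the pattern the example now supports (broker, topic, and checkpoint path are assumptions, and the spark-sql-kafka package must be on the classpath): supplying an explicit checkpointLocation on a distributed filesystem instead of relying on a temporary directory under java.io.tmpdir.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StructuredKafkaWordCount").getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder broker
  .option("subscribe", "words")                       // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING)")
  .as[String]

val counts = lines.flatMap(_.split(" ")).groupBy("value").count()

counts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "hdfs:///user/spark/checkpoints/kafka-wordcount")
  .start()
  .awaitTermination()
```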
0e97c8e [SPARK-22417][PYTHON][FOLLOWUP][BRANCH-2.2] Fix for createDataFrame from pandas.DataFrame with timestamp ## What changes were proposed in this pull request? This is a follow-up of #19646 for branch-2.2. The original pr breaks branch-2.2 because the cherry-picked patch doesn't include some code which exists in master. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #19704 from ueshin/issues/SPARK-22417_2.2/fup1. 09 November 2017, 07:56:50 UTC
efaf73f [SPARK-22211][SQL][FOLLOWUP] Fix bad merge for tests ## What changes were proposed in this pull request? The merge of SPARK-22211 to branch-2.2 dropped a couple of important lines that made sure the tests that compared plans did so with both plans having been analyzed. Fix by reintroducing the correct analysis statements. ## How was this patch tested? Re-ran LimitPushdownSuite. All tests passed. Author: Henry Robinson <henry@apache.org> Closes #19701 from henryr/branch-2.2. 09 November 2017, 07:34:22 UTC
73a2ca0 [SPARK-22281][SPARKR] Handle R method breaking signature changes ## What changes were proposed in this pull request? This is to fix the code for the latest R changes in R-devel, when running CRAN check ``` checking for code/documentation mismatches ... WARNING Codoc mismatches from documentation object 'attach': attach Code: function(what, pos = 2L, name = deparse(substitute(what), backtick = FALSE), warn.conflicts = TRUE) Docs: function(what, pos = 2L, name = deparse(substitute(what)), warn.conflicts = TRUE) Mismatches in argument default values: Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs: deparse(substitute(what)) Codoc mismatches from documentation object 'glm': glm Code: function(formula, family = gaussian, data, weights, subset, na.action, start = NULL, etastart, mustart, offset, control = list(...), model = TRUE, method = "glm.fit", x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = NULL, ...) Docs: function(formula, family = gaussian, data, weights, subset, na.action, start = NULL, etastart, mustart, offset, control = list(...), model = TRUE, method = "glm.fit", x = FALSE, y = TRUE, contrasts = NULL, ...) Argument names in code not in docs: singular.ok Mismatches in argument names: Position: 16 Code: singular.ok Docs: contrasts Position: 17 Code: contrasts Docs: ... ``` With attach, it's pulling in the function definition from base::attach. We need to disable that but we would still need a function signature for roxygen2 to build with. With glm it's pulling in the function definition (ie. "usage") from the stats::glm function. Since this is "compiled in" when we build the source package into the .Rd file, when it changes at runtime or in CRAN check it won't match the latest signature. The solution is not to pull in from stats::glm since there isn't much value in doing that (none of the param we actually use, the ones we do use we have explicitly documented them) Also with attach we are changing to call dynamically. ## How was this patch tested? Manually. - [x] check documentation output - yes - [x] check help `?attach` `?glm` - yes - [x] check on other platforms, r-hub, on r-devel etc.. Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #19557 from felixcheung/rattachglmdocerror. (cherry picked from commit 2ca5aae47a25dc6bc9e333fb592025ff14824501) Signed-off-by: Felix Cheung <felixcheung@apache.org> 08 November 2017, 05:03:10 UTC
5c9035b [SPARK-22327][SPARKR][TEST][BACKPORT-2.2] check for version warning ## What changes were proposed in this pull request? Will need to port this to branch-~~1.6~~, -2.0, -2.1, -2.2 ## How was this patch tested? manually; Jenkins, AppVeyor Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #19619 from felixcheung/rcranversioncheck22. 08 November 2017, 04:58:29 UTC
161ba18 [SPARK-22417][PYTHON] Fix for createDataFrame from pandas.DataFrame with timestamp Currently, a pandas.DataFrame that contains a timestamp of type 'datetime64[ns]' when converted to a Spark DataFrame with `createDataFrame` will interpret the values as LongType. This fix will check for a timestamp type and convert it to microseconds which will allow Spark to read as TimestampType. Added unit test to verify Spark schema is expected for TimestampType and DateType when created from pandas Author: Bryan Cutler <cutlerb@gmail.com> Closes #19646 from BryanCutler/pyspark-non-arrow-createDataFrame-ts-fix-SPARK-22417. (cherry picked from commit 1d341042d6948e636643183da9bf532268592c6a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 07 November 2017, 20:38:54 UTC
2695b92 [SPARK-22315][SPARKR] Warn if SparkR package version doesn't match SparkContext ## What changes were proposed in this pull request? This PR adds a check between the R package version used and the version reported by SparkContext running in the JVM. The goal here is to warn users when they have a R package downloaded from CRAN and are using that to connect to an existing Spark cluster. This is raised as a warning rather than an error as users might want to use patch versions interchangeably (e.g., 2.1.3 with 2.1.2 etc.) ## How was this patch tested? Manually by changing the `DESCRIPTION` file Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #19624 from shivaram/sparkr-version-check. (cherry picked from commit 65a8bf6036fe41a53b4b1e4298fa35d7fa4e9970) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 06 November 2017, 16:58:55 UTC
e35c53a [SPARK-22429][STREAMING] Streaming checkpointing code does not retry after failure ## What changes were proposed in this pull request? SPARK-14930/SPARK-13693 put in a change to set the fs object to null after a failure, however the retry loop does not include initialization. Moved fs initialization inside the retry while loop to aid recoverability. ## How was this patch tested? Passes all existing unit tests. Author: Tristan Stevens <tristan@cloudera.com> Closes #19645 from tmgstevens/SPARK-22429. (cherry picked from commit fe258a7963361c1f31bc3dc3a2a2ee4a5834bb58) Signed-off-by: Sean Owen <sowen@cloudera.com> 05 November 2017, 09:10:51 UTC
5e38373 [SPARK-22211][SQL] Remove incorrect FOJ limit pushdown It's not safe in all cases to push down a LIMIT below a FULL OUTER JOIN. If the limit is pushed to one side of the FOJ, the physical join operator can not tell if a row in the non-limited side would have a match in the other side. *If* the join operator guarantees that unmatched tuples from the limited side are emitted before any unmatched tuples from the other side, pushing down the limit is safe. But this is impractical for some join implementations, e.g. SortMergeJoin. For now, disable limit pushdown through a FULL OUTER JOIN, and we can evaluate whether a more complicated solution is necessary in the future. Ran org.apache.spark.sql.* tests. Altered full outer join tests in LimitPushdownSuite. Author: Henry Robinson <henry@cloudera.com> Closes #19647 from henryr/spark-22211. (cherry picked from commit 6c6626614e59b2e8d66ca853a74638d3d6267d73) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 05 November 2017, 05:54:51 UTC
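A minimal sketch of the affected query shape (data and column names are arbitrary): a LIMIT sitting above a FULL OUTER JOIN, which the optimizer now keeps above the join instead of pushing into one side.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("foj-limit").master("local[*]").getOrCreate()
import spark.implicits._

val left  = Seq(1, 2, 3).toDF("a")
val right = Seq(2, 3, 4).toDF("a")

val joined = left.join(right, left("a") === right("a"), "full_outer").limit(2)
joined.explain()   // the limit stays above the full outer join after this fix
joined.show()
```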
4074ed2 [SPARK-22306][SQL][2.2] alter table schema should not erase the bucketing metadata at hive side ## What changes were proposed in this pull request? When we alter table schema, we set the new schema to spark `CatalogTable`, convert it to hive table, and finally call `hive.alterTable`. This causes a problem in Spark 2.2, because hive bucketing metadata is not recognized by Spark, which means a Spark `CatalogTable` representing a hive table is always non-bucketed, and when we convert it to hive table and call `hive.alterTable`, the original hive bucketing metadata will be removed. To fix this bug, we should read out the raw hive table metadata, update its schema, and call `hive.alterTable`. By doing this we can guarantee only the schema is changed, and nothing else. Note that this bug doesn't exist in the master branch, because we've added hive bucketing support and the hive bucketing metadata can be recognized by Spark. I think we should merge this PR to master too, for code cleanup and to reduce the difference between master and the 2.2 branch for backporting. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #19622 from cloud-fan/infer. 02 November 2017, 11:37:52 UTC
c311c5e [MINOR][DOC] automatic type inference supports also Date and Timestamp ## What changes were proposed in this pull request? Easy fix in the documentation, which is reporting that only numeric types and string are supported in type inference for partition columns, while Date and Timestamp are supported too since 2.1.0, thanks to SPARK-17388. ## How was this patch tested? n/a Author: Marco Gaido <mgaido@hortonworks.com> Closes #19628 from mgaido91/SPARK-22398. (cherry picked from commit b04eefae49b96e2ef5a8d75334db29ef4e19ce58) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 02 November 2017, 00:30:20 UTC
ab87a92 [SPARK-22333][SQL][BACKPORT-2.2] timeFunctionCall(CURRENT_DATE, CURRENT_TIMESTAMP) has conflicts with columnReference ## What changes were proposed in this pull request? This is a backport pr of https://github.com/apache/spark/pull/19559 for branch-2.2 ## How was this patch tested? unit tests Author: donnyzone <wellfengzhu@gmail.com> Closes #19606 from DonnyZone/branch-2.2. 31 October 2017, 17:37:27 UTC
dd69ac6 [SPARK-19611][SQL][FOLLOWUP] set dataSchema correctly in HiveMetastoreCatalog.convertToLogicalRelation We made a mistake in https://github.com/apache/spark/pull/16944 . In `HiveMetastoreCatalog#inferIfNeeded` we infer the data schema, merge with full schema, and return the new full schema. At caller side we treat the full schema as data schema and set it to `HadoopFsRelation`. This doesn't cause any problem because both parquet and orc can work with a wrong data schema that has extra columns, but it's better to fix this mistake. N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #19615 from cloud-fan/infer. (cherry picked from commit 4d9ebf3835dde1abbf9cff29a55675d9f4227620) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 31 October 2017, 10:36:52 UTC
7f8236c [SPARK-22291][SQL] Conversion error when transforming array types of uuid, inet and cidr to StringType in PostgreSQL ## What changes were proposed in this pull request? This PR fixes the conversion error when transforming array types of `uuid`, `inet` and `cidr` to `StringType` in PostgreSQL. ## How was this patch tested? Added test in `PostgresIntegrationSuite`. Author: Jen-Ming Chung <jenmingisme@gmail.com> Closes #19604 from jmchung/SPARK-22291-FOLLOWUP. 30 October 2017, 08:09:11 UTC
f973587 [SPARK-22344][SPARKR] Set java.io.tmpdir for SparkR tests This PR sets the java.io.tmpdir for CRAN checks and also disables the hsperfdata for the JVM when running CRAN checks. Together this prevents files from being left behind in `/tmp` ## How was this patch tested? Tested manually on a clean EC2 machine Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #19589 from shivaram/sparkr-tmpdir-clean. (cherry picked from commit 1fe27612d7bcb8b6478a36bc16ddd4802e4ee2fc) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 30 October 2017, 01:54:00 UTC
cac6506 [SPARK-19727][SQL][FOLLOWUP] Fix for round function that modifies original column ## What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/17075 , to fix the bug in codegen path. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #19576 from cloud-fan/bug. (cherry picked from commit 7fdacbc77bbcf98c2c045a1873e749129769dcc0) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 29 October 2017, 01:24:32 UTC
cb54f29 [SPARK-22356][SQL] data source table should support overlapped columns between data and partition schema This is a regression introduced by #14207. After Spark 2.1, we store the inferred schema when creating the table, to avoid inferring the schema again at the read path. However, there is one special case: overlapped columns between data and partition. For this case, it breaks the assumption of the table schema that there is no overlap between data and partition schema, and that partition columns should be at the end. The result is, for Spark 2.1, the table scan has an incorrect schema that puts partition columns at the end. For Spark 2.2, we add a check in CatalogTable to validate the table schema, which fails in this case. To fix this issue, a simple and safe approach is to fall back to the old behavior when overlapped columns are detected, i.e. store an empty schema in the metastore. new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #19579 from cloud-fan/bug2. 27 October 2017, 00:56:29 UTC
2839280 [SPARK-22355][SQL] Dataset.collect is not threadsafe It's possible that users create a `Dataset` and call `collect` on it from many threads at the same time. Currently `Dataset#collect` just calls `encoder.fromRow` to convert spark rows to objects of type T, and this encoder is per-dataset. This means `Dataset#collect` is not thread-safe, because the encoder uses a projection to output the object to a re-usable row. This PR fixes this problem by creating a new projection when calling `Dataset#collect`, so that the re-usable row is per method call instead of per Dataset. N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #19577 from cloud-fan/encoder. (cherry picked from commit 5c3a1f3fad695317c2fff1243cdb9b3ceb25c317) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 27 October 2017, 00:52:26 UTC
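A sketch of the usage this fix makes safe (the data and the number of threads are arbitrary): several threads calling `collect` on the same Dataset concurrently, which previously shared one per-Dataset encoder and its reusable output row.

```scala
import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val spark = SparkSession.builder().appName("concurrent-collect").master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq((1, "a"), (2, "b"), (3, "c")).toDS()

// Eight concurrent collect() calls on the same Dataset; with the fix each call
// gets its own projection, so every thread sees consistent results.
val futures = (1 to 8).map(_ => Future(ds.collect().toSeq))
val results = Await.result(Future.sequence(futures), 1.minute)
assert(results.forall(_ == results.head))
```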
a607ddc [SPARK-22328][CORE] ClosureCleaner should not miss referenced superclass fields When the given closure uses fields defined in a superclass, `ClosureCleaner` can't find them and doesn't set them properly, so those fields end up as null values. Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19556 from viirya/SPARK-22328. (cherry picked from commit 4f8dc6b01ea787243a38678ea8199fbb0814cffc) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 October 2017, 20:44:17 UTC
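A hedged illustration of the pattern that used to break (class and field names are made up): a closure passed to an RDD operation that reads a field declared in a superclass of the enclosing class.

```scala
import org.apache.spark.sql.SparkSession

abstract class Base extends Serializable {
  val prefix: String = "row-"            // field defined in the superclass
}

class Job extends Base {
  def run(spark: SparkSession): Array[String] = {
    val rdd = spark.sparkContext.parallelize(1 to 5)
    // `prefix` lives in Base; ClosureCleaner previously missed such fields, so
    // they could arrive on executors as null. After the fix it is captured correctly.
    rdd.map(i => prefix + i).collect()
  }
}

val spark = SparkSession.builder().appName("closure-superclass").master("local[*]").getOrCreate()
new Job().run(spark).foreach(println)
```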
24fe7cc [SPARK-17902][R] Revive stringsAsFactors option for collect() in SparkR ## What changes were proposed in this pull request? This PR proposes to revive the `stringsAsFactors` option in the collect API, which was mistakenly removed in https://github.com/apache/spark/commit/71a138cd0e0a14e8426f97877e3b52a562bbd02c. Simply, it casts `character` to `factor` if it meets the condition `stringsAsFactors && is.character(vec)` in primitive type conversion. ## How was this patch tested? Unit test in `R/pkg/tests/fulltests/test_sparkSQL.R`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19551 from HyukjinKwon/SPARK-17902. (cherry picked from commit a83d8d5adcb4e0061e43105767242ba9770dda96) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 26 October 2017, 11:55:00 UTC
d2dc175 [SPARK-21991][LAUNCHER][FOLLOWUP] Fix java lint ## What changes were proposed in this pull request? Fix java lint ## How was this patch tested? Run `./dev/lint-java` Author: Andrew Ash <andrew@andrewash.com> Closes #19574 from ash211/aash/fix-java-lint. (cherry picked from commit 5433be44caecaeef45ed1fdae10b223c698a9d14) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 October 2017, 21:41:17 UTC
35725f7 [SPARK-22332][ML][TEST] Fix occasionally failing NaiveBayes unit test (caused by non-deterministic test dataset) ## What changes were proposed in this pull request? Fix the occasional NaiveBayes unit test failure: set the seed for `BrzMultinomial.sample` so that `generateNaiveBayesInput` outputs a deterministic dataset. (If we do not set the seed, the generated dataset is random, and the model may exceed the tolerance in the test, which triggers this failure.) ## How was this patch tested? Manually ran the tests multiple times and checked that the output models contain the same values each time. Author: WeichenXu <weichen.xu@databricks.com> Closes #19558 from WeichenXu123/fix_nb_test_seed. (cherry picked from commit 841f1d776f420424c20d99cf7110d06c73f9ca20) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 25 October 2017, 21:31:49 UTC
9ed6404 [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now tolerates temp files Prior to this commit getAllBlocks implicitly assumed that the directories managed by the DiskBlockManager contain only the files corresponding to valid block IDs. In reality, this assumption was violated during shuffle, which produces temporary files in the same directory as the resulting blocks. As a result, calls to getAllBlocks during shuffle were unreliable. The fix could be made more efficient, but this is probably good enough. `DiskBlockManagerSuite` Author: Sergei Lebedev <s.lebedev@criteo.com> Closes #19458 from superbobry/block-id-option. (cherry picked from commit b377ef133cdc38d49b460b2cc6ece0b5892804cc) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 October 2017, 21:17:40 UTC
4c1a868 [SPARK-21991][LAUNCHER] Fix race condition in LauncherServer#acceptConnections ## What changes were proposed in this pull request? This patch changes the order in which _acceptConnections_ starts the client thread and schedules the client timeout action ensuring that the latter has been scheduled before the former get a chance to cancel it. ## How was this patch tested? Due to the non-deterministic nature of the patch I wasn't able to add a new test for this issue. Author: Andrea zito <andrea.zito@u-hopper.com> Closes #19217 from nivox/SPARK-21991. (cherry picked from commit 6ea8a56ca26a7e02e6574f5f763bb91059119a80) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 October 2017, 17:10:37 UTC
daa838b [SPARK-21936][SQL][FOLLOW-UP] backward compatibility test framework for HiveExternalCatalog ## What changes were proposed in this pull request? Adjust Spark download in test to use Apache mirrors and respect its load balancer, and use Spark 2.1.2. This follows on a recent PMC list thread about removing the cloudfront download rather than update it further. ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #19564 from srowen/SPARK-21936.2. (cherry picked from commit 8beeaed66bde0ace44495b38dc967816e16b3464) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 October 2017, 12:56:31 UTC
bf8163f [SPARK-22319][CORE][BACKPORT-2.2] call loginUserFromKeytab before accessing hdfs In SparkSubmit, call loginUserFromKeytab before attempting to make RPC calls to the NameNode. Same as #https://github.com/apache/spark/pull/19540, but for branch-2.2. Manually tested for master as described in https://github.com/apache/spark/pull/19540. Author: Steven Rand <srand@palantir.com> Closes #19554 from sjrand/SPARK-22319-branch-2.2. Change-Id: Ic550a818fd6a3f38b356ac48029942d463738458 23 October 2017, 06:26:03 UTC
f8c83fd [SPARK-21551][PYTHON] Increase timeout for PythonRDD.serveIterator Backport of https://github.com/apache/spark/pull/18752 (https://issues.apache.org/jira/browse/SPARK-21551) (cherry picked from commit 9d3c6640f56e3e4fd195d3ad8cead09df67a72c7) Author: peay <peay@protonmail.com> Closes #19512 from FRosner/branch-2.2. 19 October 2017, 04:07:04 UTC
010b50c [SPARK-22249][FOLLOWUP][SQL] Check if list of value for IN is empty in the optimizer ## What changes were proposed in this pull request? This PR addresses the comments by gatorsmile on [the previous PR](https://github.com/apache/spark/pull/19494). ## How was this patch tested? Previous UT and added UT. Author: Marco Gaido <marcogaido91@gmail.com> Closes #19522 from mgaido91/SPARK-22249_FOLLOWUP. (cherry picked from commit 1f25d8683a84a479fd7fc77b5a1ea980289b681b) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 18 October 2017, 16:14:57 UTC
c423091 [SPARK-22271][SQL] mean overflows and returns null for some decimal variables ## What changes were proposed in this pull request? In Average.scala, it has ``` override lazy val evaluateExpression = child.dataType match { case DecimalType.Fixed(p, s) => // increase the precision and scale to prevent precision loss val dt = DecimalType.bounded(p + 14, s + 4) Cast(Cast(sum, dt) / Cast(count, dt), resultType) case _ => Cast(sum, resultType) / Cast(count, resultType) } def setChild (newchild: Expression) = { child = newchild } ``` It is possible that `Cast(Cast(sum, dt) / Cast(count, dt), resultType)` will make the precision of the decimal number bigger than 38, and this causes overflow. Since count is an integer and doesn't need a scale, I will cast it using DecimalType.bounded(38, 0). ## How was this patch tested? A test case is added in DataFrameSuite. Author: Huaxin Gao <huaxing@us.ibm.com> Closes #19496 from huaxingao/spark-22271. (cherry picked from commit 28f9f3f22511e9f2f900764d9bd5b90d2eeee773) Signed-off-by: gatorsmile <gatorsmile@gmail.com> # Conflicts: # sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 17 October 2017, 19:53:14 UTC
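A sketch of the symptom (the values are illustrative, not from the original report): averaging a high-precision decimal column, where the internal division previously pushed the result past the 38-digit precision cap and came back null.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.avg
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

val spark = SparkSession.builder().appName("decimal-mean").master("local[*]").getOrCreate()

val schema = StructType(Seq(StructField("d", DecimalType(38, 2))))
val rows = spark.sparkContext.parallelize(Seq(
  Row(new java.math.BigDecimal("123456789012345678901234567890123456.78")),
  Row(new java.math.BigDecimal("123456789012345678901234567890123456.78"))
))

// Before the fix, avg on a decimal this wide could overflow internally and
// return null; afterwards it returns the expected mean.
spark.createDataFrame(rows, schema).agg(avg("d")).show(false)
```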
71d1cb6 [SPARK-22249][SQL] isin with empty list throws exception on cached DataFrame ## What changes were proposed in this pull request? As pointed out in the JIRA, there is a bug which causes an exception to be thrown if `isin` is called with an empty list on a cached DataFrame. The PR fixes it. ## How was this patch tested? Added UT. Author: Marco Gaido <marcogaido91@gmail.com> Closes #19494 from mgaido91/SPARK-22249. (cherry picked from commit 8148f19ca1f0e0375603cb4f180c1bad8b0b8042) Signed-off-by: Sean Owen <sowen@cloudera.com> 17 October 2017, 07:41:42 UTC
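The reported case is easy to sketch (the data is arbitrary): an empty IN list applied to a cached DataFrame, which should simply select no rows.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("empty-isin").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(1, 2, 3).toDF("a").cache()
df.count()                           // materialize the cached plan

// isin() with no arguments used to throw on a cached DataFrame; it now
// evaluates to an empty result.
df.filter(col("a").isin()).show()
```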
0f060a2 [SPARK-22223][SQL] ObjectHashAggregate should not introduce unnecessary shuffle `ObjectHashAggregateExec` should override `outputPartitioning` in order to avoid unnecessary shuffle. Added Jenkins test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19501 from viirya/SPARK-22223. (cherry picked from commit 0ae96495dedb54b3b6bae0bd55560820c5ca29a2) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 October 2017, 05:47:59 UTC
6b6761e [SPARK-21549][CORE] Respect OutputFormats with no/invalid output directory provided ## What changes were proposed in this pull request? PR #19294 added support for null's - but spark 2.1 handled other error cases where path argument can be invalid. Namely: * empty string * URI parse exception while creating Path This is resubmission of PR #19487, which I messed up while updating my repo. ## How was this patch tested? Enhanced test to cover new support added. Author: Mridul Muralidharan <mridul@gmail.com> Closes #19497 from mridulm/master. (cherry picked from commit 13c1559587d0eb533c94f5a492390f81b048b347) Signed-off-by: Mridul Muralidharan <mridul@gmail.com> 16 October 2017, 01:41:47 UTC
acbad83 [SPARK-22273][SQL] Fix key/value schema field names in HashMapGenerators. ## What changes were proposed in this pull request? When fixing schema field names using escape characters with `addReferenceMinorObj()` at [SPARK-18952](https://issues.apache.org/jira/browse/SPARK-18952) (#16361), double-quotes around the names were remained and the names become something like `"((java.lang.String) references[1])"`. ```java /* 055 */ private int maxSteps = 2; /* 056 */ private int numRows = 0; /* 057 */ private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add("((java.lang.String) references[1])", org.apache.spark.sql.types.DataTypes.StringType); /* 058 */ private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add("((java.lang.String) references[2])", org.apache.spark.sql.types.DataTypes.LongType); /* 059 */ private Object emptyVBase; ``` We should remove the double-quotes to refer the values in `references` properly: ```java /* 055 */ private int maxSteps = 2; /* 056 */ private int numRows = 0; /* 057 */ private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add(((java.lang.String) references[1]), org.apache.spark.sql.types.DataTypes.StringType); /* 058 */ private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add(((java.lang.String) references[2]), org.apache.spark.sql.types.DataTypes.LongType); /* 059 */ private Object emptyVBase; ``` ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #19491 from ueshin/issues/SPARK-22273. (cherry picked from commit e0503a7223410289d01bc4b20da3a451730577da) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 14 October 2017, 06:24:49 UTC
30d5c9f [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark schema to read ORC table instead of ORC file schema Before Hive 2.0, ORC File schema has invalid column names like `_col1` and `_col2`. This is a well-known limitation and there are several Apache Spark issues with `spark.sql.hive.convertMetastoreOrc=true`. This PR ignores ORC File schema and use Spark schema. Pass the newly added test case. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19470 from dongjoon-hyun/SPARK-18355. (cherry picked from commit e6e36004afc3f9fc8abea98542248e9de11b4435) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 13 October 2017, 15:11:50 UTC
c9187db [SPARK-22252][SQL][2.2] FileFormatWriter should respect the input query schema ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/18386 fixes SPARK-21165 but breaks SPARK-22252. This PR reverts https://github.com/apache/spark/pull/18386 and picks the patch from https://github.com/apache/spark/pull/19483 to fix SPARK-21165. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #19484 from cloud-fan/bug. 13 October 2017, 04:54:00 UTC
cfc04e0 [SPARK-22217][SQL] ParquetFileFormat to support arbitrary OutputCommitters ## What changes were proposed in this pull request? `ParquetFileFormat` to relax its requirement of output committer class from `org.apache.parquet.hadoop.ParquetOutputCommitter` or subclass thereof (and so implicitly Hadoop `FileOutputCommitter`) to any committer implementing `org.apache.hadoop.mapreduce.OutputCommitter` This enables output committers which don't write to the filesystem the way `FileOutputCommitter` does to save parquet data from a dataframe: at present you cannot do this. Before a committer which isn't a subclass of `ParquetOutputCommitter`, it checks to see if the context has requested summary metadata by setting `parquet.enable.summary-metadata`. If true, and the committer class isn't a parquet committer, it raises a RuntimeException with an error message. (It could downgrade, of course, but raising an exception makes it clear there won't be an summary. It also makes the behaviour testable.) Note that `SQLConf` already states that any `OutputCommitter` can be used, but that typically it's a subclass of ParquetOutputCommitter. That's not currently true. This patch will make the code consistent with the docs, adding tests to verify, ## How was this patch tested? The patch includes a test suite, `ParquetCommitterSuite`, with a new committer, `MarkingFileOutputCommitter` which extends `FileOutputCommitter` and writes a marker file in the destination directory. The presence of the marker file can be used to verify the new committer was used. The tests then try the combinations of Parquet committer summary/no-summary and marking committer summary/no-summary. | committer | summary | outcome | |-----------|---------|---------| | parquet | true | success | | parquet | false | success | | marking | false | success with marker | | marking | true | exception | All tests are happy. Author: Steve Loughran <stevel@hortonworks.com> Closes #19448 from steveloughran/cloud/SPARK-22217-committer. 13 October 2017, 01:00:33 UTC
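A hedged configuration sketch of how a non-Parquet committer could be plugged in after this change (the committer class name and output path are placeholders; a real committer must implement org.apache.hadoop.mapreduce.OutputCommitter, and summary metadata has to stay disabled for it, as described above).

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("custom-parquet-committer")
  .master("local[*]")
  // Placeholder class name standing in for a committer that is not a
  // ParquetOutputCommitter subclass.
  .config("spark.sql.parquet.output.committer.class", "com.example.CloudOutputCommitter")
  // Required for non-Parquet committers, per the commit message above.
  .config("spark.hadoop.parquet.enable.summary-metadata", "false")
  .getOrCreate()

spark.range(10).write.parquet("/tmp/custom-committer-demo")
```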
cd51e2c [SPARK-21907][CORE][BACKPORT 2.2] oom during spill back-port #19181 to branch-2.2. ## What changes were proposed in this pull request? 1. a test reproducing [SPARK-21907](https://issues.apache.org/jira/browse/SPARK-21907) 2. a fix for the root cause of the issue. `org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill` calls `org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset` which may trigger another spill, when this happens the `array` member is already de-allocated but still referenced by the code, this causes the nested spill to fail with an NPE in `org.apache.spark.memory.TaskMemoryManager.getPage`. This patch introduces a reproduction in a test case and a fix, the fix simply sets the in-mem sorter's array member to an empty array before actually performing the allocation. This prevents the spilling code from 'touching' the de-allocated array. ## How was this patch tested? introduced a new test case: `org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite#testOOMDuringSpill`. Author: Eyal Farago <eyal@nrgene.com> Closes #19481 from eyalfa/SPARK-21907__oom_during_spill__BACKPORT-2.2. 12 October 2017, 12:56:04 UTC
c5889b5 [SPARK-22218] spark shuffle service fails to update secret on app re-attempts This patch fixes application re-attempts when running spark on yarn using the external shuffle service with security on. Currently executors will fail to launch on any application re-attempt when launched on a nodemanager that had an executor from the first attempt. The reason is that we aren't updating the secret key after the first application attempt. The fix here is to just remove the containsKey check to see if it already exists. In this way, we always add it and make sure it's the most recent secret. Similarly, remove the containsKey check on the remove path, since it's just an extra check that isn't really needed. Note this worked before spark 2.2 because the check used to be contains (which was looking for the value) rather than containsKey, so it never matched and the new secret was always added. The patch was tested on a 10 node cluster as well as with the added unit test. The test run was a wordcount where the output directory already existed. With the bug present, the application attempt failed with the max number of executor failures, which were all saslExceptions. With the fix present, the application re-attempts fail with "directory already exists", or, when you remove the directory between attempts, the re-attempts succeed. Author: Thomas Graves <tgraves@unharmedunarmed.corp.ne1.yahoo.com> Closes #19450 from tgravescs/SPARK-22218. (cherry picked from commit a74ec6d7bbfe185ba995dcb02d69e90a089c293e) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 09 October 2017, 19:56:46 UTC
0d3f166 [SPARK-21549][CORE] Respect OutputFormats with no output directory provided ## What changes were proposed in this pull request? Fix for https://issues.apache.org/jira/browse/SPARK-21549 JIRA issue. Since version 2.2 Spark does not respect OutputFormat with no output paths provided. The examples of such formats are [Cassandra OutputFormat](https://github.com/finn-no/cassandra-hadoop/blob/08dfa3a7ac727bb87269f27a1c82ece54e3f67e6/src/main/java/org/apache/cassandra/hadoop2/AbstractColumnFamilyOutputFormat.java), [Aerospike OutputFormat](https://github.com/aerospike/aerospike-hadoop/blob/master/mapreduce/src/main/java/com/aerospike/hadoop/mapreduce/AerospikeOutputFormat.java), etc. which do not have an ability to rollback the results written to an external systems on job failure. Provided output directory is required by Spark to allows files to be committed to an absolute output location, that is not the case for output formats which write data to external systems. This pull request prevents accessing `absPathStagingDir` method that causes the error described in SPARK-21549 unless there are files to rename in `addedAbsPathFiles`. ## How was this patch tested? Unit tests Author: Sergey Zhemzhitsky <szhemzhitski@gmail.com> Closes #19294 from szhem/SPARK-21549-abs-output-commits. (cherry picked from commit 2030f19511f656e9534f3fd692e622e45f9a074e) Signed-off-by: Mridul Muralidharan <mridul@gmail.com> 07 October 2017, 03:44:47 UTC
8a4e7dd [SPARK-22206][SQL][SPARKR] gapply in R can't work on empty grouping columns ## What changes were proposed in this pull request? Looks like `FlatMapGroupsInRExec.requiredChildDistribution` didn't consider empty grouping attributes. It should be a problem when running `EnsureRequirements` and `gapply` in R can't work on empty grouping columns. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19436 from viirya/fix-flatmapinr-distribution. (cherry picked from commit ae61f187aa0471242c046fdeac6ed55b9b98a3f6) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 05 October 2017, 14:36:56 UTC
81232ce [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration throws NPE ## What changes were proposed in this pull request? Fix for SPARK-20466, full description of the issue in the JIRA. To summarize, `HadoopRDD` uses a metadata cache to cache `JobConf` objects. The cache uses soft-references, which means the JVM can delete entries from the cache whenever there is GC pressure. `HadoopRDD#getJobConf` had a bug where it would check if the cache contained the `JobConf`, if it did it would get the `JobConf` from the cache and return it. This doesn't work when soft-references are used as the JVM can delete the entry between the existence check and the get call. ## How was this patch tested? Haven't thought of a good way to test this yet given the issue only occurs sometimes, and happens during high GC pressure. Was thinking of using mocks to verify `#getJobConf` is doing the right thing. I deleted the method `HadoopRDD#containsCachedMetadata` so that we don't hit this issue again. Author: Sahil Takiar <stakiar@cloudera.com> Closes #19413 from sahilTakiar/master. (cherry picked from commit e36ec38d89472df0dfe12222b6af54cd6eea8e98) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 03 October 2017, 23:53:43 UTC
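A generic sketch of the check-then-get hazard described above (this is not Spark's actual cache code, just the pattern): with softly referenced values, what a "contains" check just saw can be collected before the follow-up fetch, so the lookup needs to be a single fetch whose cleared result is treated as a miss.

```scala
import java.lang.ref.SoftReference
import scala.collection.concurrent.TrieMap

val cache = TrieMap.empty[String, SoftReference[AnyRef]]

// Racy: GC may clear the referent between the contains check and the fetch,
// handing the caller a null it does not expect.
def lookupRacy(key: String): AnyRef =
  if (cache.contains(key)) cache(key).get() else null

// Safer: fetch once and treat a cleared reference like a cache miss, so the
// caller can rebuild the value (as the fixed getJobConf falls back to creating
// a new JobConf).
def lookupSafe(key: String): Option[AnyRef] =
  cache.get(key).flatMap(ref => Option(ref.get()))
```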
5e3f254 [SPARK-22178][SQL] Refresh Persistent Views by REFRESH TABLE Command ## What changes were proposed in this pull request? The underlying tables of persistent views are not refreshed when users issue the REFRESH TABLE command against the persistent views. ## How was this patch tested? Added a test case Author: gatorsmile <gatorsmile@gmail.com> Closes #19405 from gatorsmile/refreshView. (cherry picked from commit e65b6b7ca1a7cff1b91ad2262bb7941e6bf057cd) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 03 October 2017, 19:40:36 UTC
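A short sketch of the behavior this enables (the database objects and the external change are hypothetical): issuing REFRESH TABLE against a persistent view now also refreshes the tables the view reads from.

```scala
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("refresh-view").enableHiveSupport().getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS base_events (id INT) USING parquet")
spark.sql("CREATE VIEW IF NOT EXISTS events_view AS SELECT * FROM base_events")

// ... data files behind base_events change outside of this Spark session ...

spark.sql("REFRESH TABLE events_view")   // now also refreshes base_events
spark.sql("SELECT COUNT(*) FROM events_view").show()
```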
3c30be5 [SPARK-22158][SQL][BRANCH-2.2] convertMetastore should not ignore table property ## What changes were proposed in this pull request? From the beginning, **convertMetastoreOrc** ignores table properties and uses an empty map instead. This PR fixes that. **convertMetastoreParquet** also ignores them. ```scala val options = Map[String, String]() ``` - [SPARK-14070: HiveMetastoreCatalog.scala](https://github.com/apache/spark/pull/11891/files#diff-ee66e11b56c21364760a5ed2b783f863R650) - [Master branch: HiveStrategies.scala](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L197) ## How was this patch tested? Passes Jenkins with an updated test suite. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19417 from dongjoon-hyun/SPARK-22158-BRANCH-2.2. 03 October 2017, 18:42:55 UTC
b9adddb [SPARK-22167][R][BUILD] sparkr packaging issue allow zinc ## What changes were proposed in this pull request? When zinc is running, the pwd might be in the root of the project. A quick solution is to not go a level up in case we are already in the root rather than in root/core/. If we are in the root everything works fine; if we are in core, a script is added which goes up a level and runs from there. ## How was this patch tested? set -x in the SparkR install scripts. Author: Holden Karau <holden@us.ibm.com> Closes #19402 from holdenk/SPARK-22167-sparkr-packaging-issue-allow-zinc. (cherry picked from commit 8fab7995d36c7bc4524393b20a4e524dbf6bbf62) Signed-off-by: Holden Karau <holden@us.ibm.com> 02 October 2017, 18:47:11 UTC
7bf25e0 [SPARK-22146] FileNotFoundException while reading ORC files containing special characters ## What changes were proposed in this pull request? Reading ORC files containing special characters like '%' fails with a FileNotFoundException. This PR aims to fix the problem. ## How was this patch tested? Added UT. Author: Marco Gaido <marcogaido91@gmail.com> Author: Marco Gaido <mgaido@hortonworks.com> Closes #19368 from mgaido91/SPARK-22146. 29 September 2017, 16:05:15 UTC
ac9a0f6 [SPARK-22161][SQL] Add Impala-modified TPC-DS queries ## What changes were proposed in this pull request? Added IMPALA-modified TPCDS queries to TPC-DS query suites. - Ref: https://github.com/cloudera/impala-tpcds-kit/tree/master/queries ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #19386 from gatorsmile/addImpalaQueries. (cherry picked from commit 9ed7394a68315126b2dd00e53a444cc65b5a62ea) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 29 September 2017, 16:00:15 UTC
8b2d838 [SPARK-22129][SPARK-22138] Release script improvements ## What changes were proposed in this pull request? Use the GPG_KEY param, fix lsof to non-hardcoded path, remove version swap since it wasn't really needed. Use EXPORT on JAVA_HOME for downstream scripts as well. ## How was this patch tested? Rolled 2.1.2 RC2 Author: Holden Karau <holden@us.ibm.com> Closes #19359 from holdenk/SPARK-22129-fix-signing. (cherry picked from commit ecbe416ab5001b32737966c5a2407597a1dafc32) Signed-off-by: Holden Karau <holden@us.ibm.com> 29 September 2017, 15:04:26 UTC
8c5ab4e [SPARK-22143][SQL][BRANCH-2.2] Fix memory leak in OffHeapColumnVector This is a backport of https://github.com/apache/spark/commit/02bb0682e68a2ce81f3b98d33649d368da7f2b3d. ## What changes were proposed in this pull request? `WriteableColumnVector` does not close its child column vectors. This can create memory leaks for `OffHeapColumnVector`, where we do not clean up the memory allocated by a vector's children. This can be especially bad for string columns (which use a child byte column vector). ## How was this patch tested? I have updated the existing tests to always use both on-heap and off-heap vectors. Testing and diagnosis was done locally. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #19378 from hvanhovell/SPARK-22143-2.2. 28 September 2017, 15:51:40 UTC
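A schematic of the ownership rule the fix enforces, using made-up types rather than Spark's ColumnVector API:

```scala
// Sketch: a vector that owns child vectors must close them in close(), otherwise
// memory allocated for the children (e.g. the byte child backing a string column)
// is leaked when the parent is off-heap.
trait SimpleVector extends AutoCloseable

class ParentVector(children: Seq[SimpleVector], freeOwnMemory: () => Unit) extends SimpleVector {
  override def close(): Unit = {
    children.foreach(_.close())  // the step that was missing before SPARK-22143
    freeOwnMemory()              // then release the parent's own allocation
  }
}
```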
12a74e3 [SPARK-22135][MESOS] metrics in spark-dispatcher not being registered properly ## What changes were proposed in this pull request? Fix a trivial bug with how metrics are registered in the mesos dispatcher. Bug resulted in creating a new registry each time the metricRegistry() method was called. ## How was this patch tested? Verified manually on local mesos setup Author: Paul Mackles <pmackles@adobe.com> Closes #19358 from pmackles/SPARK-22135. (cherry picked from commit f20be4d70bf321f377020d1bde761a43e5c72f0a) Signed-off-by: jerryshao <sshao@hortonworks.com> 28 September 2017, 06:43:53 UTC
42e1727 [SPARK-22140] Add TPCDSQuerySuite ## What changes were proposed in this pull request? Now, we are not running TPC-DS queries as regular test cases. Thus, we need to add a test suite using empty tables for ensuring the new code changes will not break them. For example, optimizer/analyzer batches should not exceed the max iteration. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #19361 from gatorsmile/tpcdsQuerySuite. (cherry picked from commit 9244957b500cb2b458c32db2c63293a1444690d7) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 28 September 2017, 00:04:34 UTC
a406473 [SPARK-22141][BACKPORT][SQL] Propagate empty relation before checking Cartesian products Backport https://github.com/apache/spark/pull/19362 to branch-2.2 ## What changes were proposed in this pull request? When inferring constraints from children, Join's condition can be simplified to None. For example, ``` val testRelation = LocalRelation('a.int) val x = testRelation.as("x") val y = testRelation.where($"a" === 2 && !($"a" === 2)).as("y") x.join.where($"x.a" === $"y.a") ``` The plan will become ``` Join Inner :- LocalRelation <empty>, [a#23] +- LocalRelation <empty>, [a#224] ``` and the Cartesian products check will throw an exception for the above plan. Propagating the empty relation before checking for Cartesian products resolves the issue. ## How was this patch tested? Unit test Author: Wang Gengliang <ltnwgl@gmail.com> Closes #19366 from gengliangwang/branch-2.2. 27 September 2017, 15:40:31 UTC
b0f30b5 [SPARK-22120][SQL] TestHiveSparkSession.reset() should clean out Hive warehouse directory ## What changes were proposed in this pull request? During TestHiveSparkSession.reset(), which is called after each TestHiveSingleton suite, we now delete and recreate the Hive warehouse directory. ## How was this patch tested? Ran full suite of tests locally, verified that they pass. Author: Greg Owen <greg@databricks.com> Closes #19341 from GregOwen/SPARK-22120. (cherry picked from commit ce204780ee2434ff6bae50428ae37083835798d3) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 25 September 2017, 21:16:25 UTC
9836ea1 [SPARK-22083][CORE] Release locks in MemoryStore.evictBlocksToFreeSpace ## What changes were proposed in this pull request? MemoryStore.evictBlocksToFreeSpace acquires write locks for all the blocks it intends to evict up front. If there is a failure to evict blocks (e.g., some failure dropping a block to disk), then we have to release those locks. Otherwise the locks are never released and an executor trying to acquire one will wait forever. ## How was this patch tested? Added a unit test. Author: Imran Rashid <irashid@cloudera.com> Closes #19311 from squito/SPARK-22083. (cherry picked from commit 2c5b9b1173c23f6ca8890817a9a35dc7557b0776) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 September 2017, 19:02:40 UTC
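A minimal sketch of the locking discipline described above, using caller-supplied lock/unlock/drop functions instead of the real MemoryStore internals:

```scala
import scala.collection.mutable.ArrayBuffer

// Sketch only: acquire all write locks up front, and if dropping any block fails,
// release every lock still held so other tasks are not blocked forever.
def evictBlocks(blocks: Seq[String],
                lock: String => Unit,
                unlock: String => Unit,
                drop: String => Unit): Unit = {
  val stillLocked = ArrayBuffer.empty[String]
  try {
    blocks.foreach { b => lock(b); stillLocked += b }
    blocks.foreach { b => drop(b); stillLocked -= b }  // drop may throw, e.g. on disk failure
  } finally {
    stillLocked.foreach(unlock)  // empty on success; releases the remainder on failure
  }
}
```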
8acce00 [SPARK-22107] Change as to alias in python quickstart ## What changes were proposed in this pull request? Updated docs so that a line of python in the quick start guide executes. Closes #19283 ## How was this patch tested? Existing tests. Author: John O'Leary <jgoleary@gmail.com> Closes #19326 from jgoleary/issues/22107. (cherry picked from commit 20adf9aa1f42353432d356117e655e799ea1290b) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 25 September 2017, 00:16:46 UTC
211d81b [SPARK-22109][SQL][BRANCH-2.2] Resolves type conflicts between strings and timestamps in partition column ## What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/commit/04975a68b583a6175f93da52374108e5d4754d9a into branch-2.2. ## How was this patch tested? Unit tests in `ParquetPartitionDiscoverySuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19333 from HyukjinKwon/SPARK-22109-backport-2.2. 23 September 2017, 17:51:04 UTC
1a829df [SPARK-22092] Reallocation in OffHeapColumnVector.reserveInternal corrupts struct and array data `OffHeapColumnVector.reserveInternal()` will only copy already inserted values during reallocation if `data != null`. For vectors containing arrays or structs this is incorrect, since the field `data` is not used at all for them. We need to check `nulls` instead. Adds new tests to `ColumnVectorSuite` that reproduce the errors. Author: Ala Luszczak <ala@databricks.com> Closes #19323 from ala/port-vector-realloc. 23 September 2017, 14:09:47 UTC
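A schematic of the guard change (field names follow the description above; the code is illustrative, not the actual OffHeapColumnVector):

```scala
// Sketch: array/struct vectors never allocate the `data` buffer, so guarding the
// copy-on-reallocation step on `data` silently skips it and loses their contents.
// Every vector allocates `nulls`, so that is the right buffer to test.
def copyExistingValuesIfAny(nullsPtr: Long, dataPtr: Long, copyOld: () => Unit): Unit = {
  // Before: if (dataPtr != 0) copyOld()
  if (nullsPtr != 0) copyOld()
}
```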
c0a34a9 [SPARK-18136] Fix SPARK_JARS_DIR for Python pip install on Windows ## What changes were proposed in this pull request? Fix for setup of `SPARK_JARS_DIR` on Windows as it looks for `%SPARK_HOME%\RELEASE` file instead of `%SPARK_HOME%\jars` as it should. RELEASE file is not included in the `pip` build of PySpark. ## How was this patch tested? Local install of PySpark on Anaconda 4.4.0 (Python 3.6.1). Author: Jakub Nowacki <j.s.nowacki@gmail.com> Closes #19310 from jsnowacki/master. (cherry picked from commit c11f24a94007bbaad0835645843e776507094071) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 23 September 2017, 12:04:26 UTC
de6274a [SPARK-22072][SPARK-22071][BUILD] Improve release build scripts ## What changes were proposed in this pull request? Check JDK version (with javac) and use SPARK_VERSION for publish-release ## How was this patch tested? Manually tried local build with wrong JDK / JAVA_HOME & built a local release (LFTP disabled) Author: Holden Karau <holden@us.ibm.com> Closes #19312 from holdenk/improve-release-scripts-r2. (cherry picked from commit 8f130ad40178e35fecb3f2ba4a61ad23e6a90e3d) Signed-off-by: Holden Karau <holden@us.ibm.com> 22 September 2017, 07:15:12 UTC
090b987 [SPARK-22094][SS] processAllAvailable should check the query state `processAllAvailable` should also check the query state and return if the query has been stopped. Tested with a new unit test. Author: Shixiong Zhu <zsxwing@gmail.com> Closes #19314 from zsxwing/SPARK-22094. (cherry picked from commit fedf6961be4e99139eb7ab08d5e6e29187ea5ccf) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 22 September 2017, 05:08:45 UTC
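A hedged sketch of the waiting loop's shape (polling is used here only for brevity; the real implementation waits on a lock/condition):

```scala
// Sketch: besides waiting until all available data is processed, the loop must
// also exit once the query is no longer active, or callers can block forever.
def awaitAllAvailable(isActive: () => Boolean, allDataProcessed: () => Boolean): Unit = {
  while (isActive() && !allDataProcessed()) {
    Thread.sleep(10L)
  }
}
```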
765fd92 [SPARK-21928][CORE] Set classloader on SerializerManager's private kryo ## What changes were proposed in this pull request? We have to make sure that SerializerManager's private instance of kryo also uses the right classloader, regardless of the current thread's classloader. In particular, this fixes serde during remote cache fetches, as those occur in netty threads. ## How was this patch tested? Manual tests and the existing suite via Jenkins. I haven't been able to reproduce this in a unit test, because when a remote RDD partition cannot be fetched, there is a warning message and then the partition is just recomputed locally. I manually verified the warning message is no longer present. Author: Imran Rashid <irashid@cloudera.com> Closes #19280 from squito/SPARK-21928_ser_classloader. (cherry picked from commit b75bd1777496ce0354458bf85603a8087a6a0ff8) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 21 September 2017, 17:20:28 UTC
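A schematic of the intent, assuming a Kryo instance held privately by some component (the helper is illustrative, not the actual SerializerManager code):

```scala
import com.esotericsoftware.kryo.Kryo

// Sketch: explicitly hand the application classloader to the privately held Kryo
// instance, since netty fetch threads do not have it as their context classloader.
def withAppClassLoader(kryo: Kryo, appClassLoader: ClassLoader): Kryo = {
  kryo.setClassLoader(appClassLoader)
  kryo
}
```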
401ac20 [SPARK-21384][YARN] Spark + YARN fails with LocalFileSystem as default FS ## What changes were proposed in this pull request? When the file system of the libraries temp directory (i.e. the __spark_libs__*.zip dir) and the file system of the staging dir (the destination) are the same, the __spark_libs__*.zip is not copied to the staging directory. But after making this decision, the libraries zip file is deleted immediately and becomes unavailable for the Node Manager's localization. With this change, the client always copies the files to the remote destination when the source scheme is "file". ## How was this patch tested? Verified manually in yarn/cluster and yarn/client modes with HDFS and local file systems. Author: Devaraj K <devaraj@apache.org> Closes #19141 from devaraj-kavali/SPARK-21384. (cherry picked from commit 55d5fa79db883e4d93a9c102a94713c9d2d1fb55) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 20 September 2017, 23:22:46 UTC
5d10586 [SPARK-22076][SQL] Expand.projections should not be a Stream ## What changes were proposed in this pull request? Spark with Scala 2.10 fails with a group by cube: ``` spark.range(1).select($"id" as "a", $"id" as "b").write.partitionBy("a").mode("overwrite").saveAsTable("rollup_bug") spark.sql("select 1 from rollup_bug group by rollup ()").show ``` It can be traced back to https://github.com/apache/spark/pull/15484 , which made `Expand.projections` a lazy `Stream` for group by cube. In scala 2.10 `Stream` captures a lot of stuff, and in this case it captures the entire query plan which has some un-serializable parts. This change is also good for master branch, to reduce the serialized size of `Expand.projections`. ## How was this patch tested? manually verified with Spark with Scala 2.10. Author: Wenchen Fan <wenchen@databricks.com> Closes #19289 from cloud-fan/bug. (cherry picked from commit ce6a71e013c403d0a3690cf823934530ce0ea5ef) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 20 September 2017, 16:01:25 UTC
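A tiny REPL-style illustration of why a strict collection is safer here than a `Stream` (plain Scala, unrelated to the actual Expand code):

```scala
// A Stream keeps its unevaluated tail, and the tail closes over whatever is needed
// to produce the next element -- state that then gets dragged into serialization.
// Forcing a List evaluates everything up front and captures nothing extra.
val lazyProjections: Stream[Int] = Stream.from(0).take(3)        // tail still holds the generator
val strictProjections: List[Int] = Stream.from(0).take(3).toList // fully materialized
```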
6764408 [SPARK-22052] Incorrect Metric assigned in MetricsReporter.scala The current implementation of processingRate-total uses the wrong metric: it mistakenly uses inputRowsPerSecond instead of processedRowsPerSecond. ## What changes were proposed in this pull request? Switch processingRate-total from inputRowsPerSecond to processedRowsPerSecond. ## How was this patch tested? Built Spark from source with the proposed change and checked the output. Before the change, the CSV metrics files for inputRate-total and processingRate-total displayed the same values due to the error. After changing MetricsReporter.scala, the processingRate-total CSV file displayed the correct metric. <img width="963" alt="processed rows per second" src="https://user-images.githubusercontent.com/32072374/30554340-82eea12c-9ca4-11e7-8370-8168526ff9a2.png"> Author: Taaffy <32072374+Taaffy@users.noreply.github.com> Closes #19268 from Taaffy/patch-1. (cherry picked from commit 1bc17a6b8add02772a8a0a1048ac6a01d045baf4) Signed-off-by: Sean Owen <sowen@cloudera.com> 19 September 2017, 09:20:14 UTC
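A sketch of the gauge registration with Dropwizard metrics; the snapshot type and helper function are illustrative, and only the choice of field mirrors the fix:

```scala
import com.codahale.metrics.{Gauge, MetricRegistry}

// Illustrative types: the point is only which field feeds "processingRate-total".
case class ProgressSnapshot(inputRowsPerSecond: Double, processedRowsPerSecond: Double)

def registerProcessingRate(registry: MetricRegistry, latest: () => ProgressSnapshot): Unit = {
  registry.register("processingRate-total", new Gauge[Double] {
    // Before the fix this gauge read inputRowsPerSecond, duplicating inputRate-total.
    override def getValue: Double = latest().processedRowsPerSecond
  })
}
```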
d0234eb [SPARK-22047][FLAKY TEST] HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? This PR tries to download Spark for each test run, to make sure each test run is absolutely isolated. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #19265 from cloud-fan/test. (cherry picked from commit 10f45b3c84ff7b3f1765dc6384a563c33d26548b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 September 2017, 03:54:03 UTC
48d6aef [SPARK-22047][TEST] ignore HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? As reported in https://issues.apache.org/jira/browse/SPARK-22047 , HiveExternalCatalogVersionsSuite is failing frequently, let's disable this test suite to unblock other PRs, I'm looking into the root cause. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #19264 from cloud-fan/test. (cherry picked from commit 894a7561de2c2ff01fe7fcc5268378161e9e5643) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 September 2017, 08:42:32 UTC
a86831d [SPARK-22043][PYTHON] Improves error message for show_profiles and dump_profiles ## What changes were proposed in this pull request? This PR proposes to improve error message from: ``` >>> sc.show_profiles() Traceback (most recent call last): File "<stdin>", line 1, in <module> File ".../spark/python/pyspark/context.py", line 1000, in show_profiles self.profiler_collector.show_profiles() AttributeError: 'NoneType' object has no attribute 'show_profiles' >>> sc.dump_profiles("/tmp/abc") Traceback (most recent call last): File "<stdin>", line 1, in <module> File ".../spark/python/pyspark/context.py", line 1005, in dump_profiles self.profiler_collector.dump_profiles(path) AttributeError: 'NoneType' object has no attribute 'dump_profiles' ``` to ``` >>> sc.show_profiles() Traceback (most recent call last): File "<stdin>", line 1, in <module> File ".../spark/python/pyspark/context.py", line 1003, in show_profiles raise RuntimeError("'spark.python.profile' configuration must be set " RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python profile. >>> sc.dump_profiles("/tmp/abc") Traceback (most recent call last): File "<stdin>", line 1, in <module> File ".../spark/python/pyspark/context.py", line 1012, in dump_profiles raise RuntimeError("'spark.python.profile' configuration must be set " RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python profile. ``` ## How was this patch tested? Unit tests added in `python/pyspark/tests.py` and manual tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19260 from HyukjinKwon/profile-errors. (cherry picked from commit 7c7266208a3be984ac1ce53747dc0c3640f4ecac) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 18 September 2017, 04:20:29 UTC
309c401 [SPARK-21953] Show both memory and disk bytes spilled if either is present As written now, there must be both memory and disk bytes spilled to show either of them. If there is only one of those types of spill recorded, it will be hidden. Author: Andrew Ash <andrew@andrewash.com> Closes #19164 from ash211/patch-3. (cherry picked from commit 6308c65f08b507408033da1f1658144ea8c1491f) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 September 2017, 02:42:41 UTC
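The condition change itself is tiny; a hedged sketch with a made-up function name:

```scala
// Show the spill metrics when either kind of spill occurred, not only when both did.
def hasBytesSpilled(memoryBytesSpilled: Long, diskBytesSpilled: Long): Boolean =
  memoryBytesSpilled > 0 || diskBytesSpilled > 0  // previously this was an && check
```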
42852bb [SPARK-21985][PYSPARK] PairDeserializer is broken for double-zipped RDDs ## What changes were proposed in this pull request? (edited) Fixes a bug introduced in #16121 In PairDeserializer convert each batch of keys and values to lists (if they do not have `__len__` already) so that we can check that they are the same size. Normally they already are lists so this should not have a performance impact, but this is needed when repeated `zip`'s are done. ## How was this patch tested? Additional unit test Author: Andrew Ray <ray.andrew@gmail.com> Closes #19226 from aray/SPARK-21985. (cherry picked from commit 6adf67dd14b0ece342bb91adf800df0a7101e038) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 17 September 2017, 17:46:47 UTC
51e5a82 [SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpark OneVsRest. ## What changes were proposed in this pull request? #19197 fixed double caching for MLlib algorithms, but missed PySpark ```OneVsRest```, this PR fixed it. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #19220 from yanboliang/SPARK-18608. (cherry picked from commit c76153cc7dd25b8de5266fe119095066be7f78f5) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 14 September 2017, 06:10:10 UTC
3a692e3 [SPARK-21980][SQL] References in grouping functions should be indexed with semanticEquals ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-21980 This PR fixes the issue in ResolveGroupingAnalytics rule, which indexes the column references in grouping functions without considering case sensitive configurations. The problem can be reproduced by: `val df = spark.createDataFrame(Seq((1, 1), (2, 1), (2, 2))).toDF("a", "b") df.cube("a").agg(grouping("A")).show()` ## How was this patch tested? unit tests Author: donnyzone <wellfengzhu@gmail.com> Closes #19202 from DonnyZone/ResolveGroupingAnalytics. (cherry picked from commit 21c4450fb24635fab6481a3756fefa9c6f6d6235) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 13 September 2017, 17:10:59 UTC
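A schematic of the lookup the rule needs, using Catalyst's `Expression.semanticEquals`; the helper itself is illustrative, not the real ResolveGroupingAnalytics code:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression

// Sketch: resolve a grouping("A") reference against the group-by expressions with
// semanticEquals, which respects case-insensitive resolution, instead of plain ==.
def groupingIndexOf(ref: Expression, groupByExprs: Seq[Expression]): Int =
  groupByExprs.indexWhere(_.semanticEquals(ref))
```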
b606dc1 [SPARK-18608][ML] Fix double caching ## What changes were proposed in this pull request? `df.rdd.getStorageLevel` => `df.storageLevel` using cmd `find . -name '*.scala' | xargs -i bash -c 'egrep -in "\.rdd\.getStorageLevel" {} && echo {}'` to make sure all algs involved in this issue are fixed. Previous discussion in other PRs: https://github.com/apache/spark/pull/19107, https://github.com/apache/spark/pull/17014 ## How was this patch tested? existing tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #19197 from zhengruifeng/double_caching. (cherry picked from commit c5f9b89dda40ffaa4622a7ba2b3d0605dbe815c0) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 12 September 2017, 18:37:22 UTC
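A spark-shell style illustration of the difference the fix relies on (assumes a running session `spark`; values shown in comments are what the checks evaluate to):

```scala
import org.apache.spark.storage.StorageLevel

// Dataset.storageLevel reports the Dataset's own cache status; dataset.rdd derives a
// fresh RDD whose level is NONE even when the Dataset is cached -- which is why the
// old check led to caching twice.
val df = spark.range(10).toDF("id").cache()
df.count()                                                        // materialize the cache
val cachedProperly = df.storageLevel != StorageLevel.NONE         // true
val misleadingCheck = df.rdd.getStorageLevel != StorageLevel.NONE // false despite the cache
```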
63098dc [DOCS] Fix unreachable links in the document ## What changes were proposed in this pull request? Recently, I found two unreachable links in the documentation and fixed them. Because these are small documentation-only changes, I didn't file a JIRA issue, but please let me know if you think one is needed. ## How was this patch tested? Tested manually. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #19195 from sarutak/fix-unreachable-link. (cherry picked from commit 957558235b7537c706c6ab4779655aa57838ebac) Signed-off-by: Sean Owen <sowen@cloudera.com> 12 September 2017, 14:07:21 UTC
10c6836 [SPARK-21976][DOC] Fix wrong documentation for Mean Absolute Error. ## What changes were proposed in this pull request? Fixed incorrect documentation for Mean Absolute Error. The code itself is correct for the MAE: ```scala Since("1.2.0") def meanAbsoluteError: Double = { summary.normL1(1) / summary.count } ``` but in the documentation the division by N was missing. ## How was this patch tested? All of the Spark tests were run. Author: FavioVazquez <favio.vazquezp@gmail.com> Author: faviovazquez <favio.vazquezp@gmail.com> Author: Favio André Vázquez <favio.vazquezp@gmail.com> Closes #19190 from FavioVazquez/mae-fix. (cherry picked from commit e2ac2f1c71a0f8b03743d0d916dc0ef28482a393) Signed-off-by: Sean Owen <sowen@cloudera.com> 12 September 2017, 09:33:43 UTC
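For reference, the definition the documentation should state, with N the number of instances (the L1 norm in the code above is the sum of absolute residuals):

```
MAE = (1/N) * sum_{i=1..N} |label_i - prediction_i|
```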
b1b5a7f [SPARK-20098][PYSPARK] dataType's typeName fix ## What changes were proposed in this pull request? The `typeName` classmethod has been fixed by using a type -> typeName map. ## How was this patch tested? Local build. Author: Peter Szalai <szalaipeti.vagyok@gmail.com> Closes #17435 from szalai1/datatype-gettype-fix. (cherry picked from commit 520d92a191c3148498087d751aeeddd683055622) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 10 September 2017, 08:48:00 UTC
182478e [SPARK-21954][SQL] JacksonUtils should verify MapType's value type instead of key type ## What changes were proposed in this pull request? `JacksonUtils.verifySchema` verifies if a data type can be converted to JSON. For `MapType`, it now verifies the key type. However, in `JacksonGenerator`, when converting a map to JSON, we only care about its values and create a writer for the values. The keys in a map are treated as strings by calling `toString` on the keys. Thus, we should change `JacksonUtils.verifySchema` to verify the value type of `MapType`. ## How was this patch tested? Added tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19167 from viirya/test-jacksonutils. (cherry picked from commit 6b45d7e941eba8a36be26116787322d9e3ae25d0) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 09 September 2017, 10:11:28 UTC
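A schematic verifier showing the branch that changed (only a few atomic types are listed for brevity; this is not the real JacksonUtils):

```scala
import org.apache.spark.sql.types._

// Sketch: map keys are written via toString, so only the value type of a MapType
// needs to be JSON-convertible; before the fix the key type was checked instead.
def verifyJsonWritable(dt: DataType): Unit = dt match {
  case ArrayType(elementType, _) => verifyJsonWritable(elementType)
  case StructType(fields)        => fields.foreach(f => verifyJsonWritable(f.dataType))
  case MapType(_, valueType, _)  => verifyJsonWritable(valueType)
  case StringType | IntegerType | LongType | DoubleType | BooleanType => ()
  case other => throw new UnsupportedOperationException(s"Unable to convert $other to JSON")
}
```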
9876821 [SPARK-21128][R][BACKPORT-2.2] Remove both "spark-warehouse" and "metastore_db" before listing files in R tests ## What changes were proposed in this pull request? This PR proposes to list the files in test _after_ removing both "spark-warehouse" and "metastore_db" so that the next run of R tests pass fine. This is sometimes a bit annoying. ## How was this patch tested? Manually running multiple times R tests via `./R/run-tests.sh`. **Before** Second run: ``` SparkSQL functions: Spark package found in SPARK_HOME: .../spark ............................................................................................................................................................... ............................................................................................................................................................... ............................................................................................................................................................... ............................................................................................................................................................... ............................................................................................................................................................... ....................................................................................................1234....................... Failed ------------------------------------------------------------------------- 1. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384) length(list1) not equal to length(list2). 1/1 mismatches [1] 25 - 23 == 2 2. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384) sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE). 10/25 mismatches x[16]: "metastore_db" y[16]: "pkg" x[17]: "pkg" y[17]: "R" x[18]: "R" y[18]: "README.md" x[19]: "README.md" y[19]: "run-tests.sh" x[20]: "run-tests.sh" y[20]: "SparkR_2.2.0.tar.gz" x[21]: "metastore_db" y[21]: "pkg" x[22]: "pkg" y[22]: "R" x[23]: "R" y[23]: "README.md" x[24]: "README.md" y[24]: "run-tests.sh" x[25]: "run-tests.sh" y[25]: "SparkR_2.2.0.tar.gz" 3. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388) length(list1) not equal to length(list2). 1/1 mismatches [1] 25 - 23 == 2 4. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388) sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE). 10/25 mismatches x[16]: "metastore_db" y[16]: "pkg" x[17]: "pkg" y[17]: "R" x[18]: "R" y[18]: "README.md" x[19]: "README.md" y[19]: "run-tests.sh" x[20]: "run-tests.sh" y[20]: "SparkR_2.2.0.tar.gz" x[21]: "metastore_db" y[21]: "pkg" x[22]: "pkg" y[22]: "R" x[23]: "R" y[23]: "README.md" x[24]: "README.md" y[24]: "run-tests.sh" x[25]: "run-tests.sh" y[25]: "SparkR_2.2.0.tar.gz" DONE =========================================================================== ``` **After** Second run: ``` SparkSQL functions: Spark package found in SPARK_HOME: .../spark ............................................................................................................................................................... ............................................................................................................................................................... 
............................................................................................................................................................... ............................................................................................................................................................... ............................................................................................................................................................... ............................................................................................................................... ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #18335 from HyukjinKwon/SPARK-21128. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19166 from felixcheung/rbackport21128. 08 September 2017, 16:47:45 UTC
9ae7c96 [SPARK-21946][TEST] fix flaky test: "alter table: rename cached table" in InMemoryCatalogedDDLSuite ## What changes were proposed in this pull request? This PR fixes the flaky test `InMemoryCatalogedDDLSuite "alter table: rename cached table"`. Since this test validates a distributed DataFrame, the result should be checked with `checkAnswer`. The original version used the `df.collect().Seq` method, which does not guarantee the order of the elements in the result. ## How was this patch tested? Uses the existing test case. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19159 from kiszk/SPARK-21946. (cherry picked from commit 8a4f228dc0afed7992695486ecab6bc522f1e392) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 08 September 2017, 16:39:32 UTC
08cb06a [SPARK-21936][SQL][2.2] backward compatibility test framework for HiveExternalCatalog backport https://github.com/apache/spark/pull/19148 to 2.2 Author: Wenchen Fan <wenchen@databricks.com> Closes #19163 from cloud-fan/test. 08 September 2017, 16:35:41 UTC
781a1f8 [SPARK-21915][ML][PYSPARK] Model 1 and Model 2 ParamMaps Missing dongjoon-hyun HyukjinKwon Error in PySpark example code: /examples/src/main/python/ml/estimator_transformer_param_example.py The original Scala code says println("Model 2 was fit using parameters: " + model2.parent.extractParamMap) The parent is lr There is no method for accessing parent as is done in Scala. This code has been tested in Python, and returns values consistent with Scala ## What changes were proposed in this pull request? Proposing to call the lr variable instead of model1 or model2 ## How was this patch tested? This patch was tested with Spark 2.1.0 comparing the Scala and PySpark results. Pyspark returns nothing at present for those two print lines. The output for model2 in PySpark should be {Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0).'): 1e-06, Param(parent='LogisticRegression_4187be538f744d5a9090', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_4187be538f744d5a9090', name='predictionCol', doc='prediction column name.'): 'prediction', Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', doc='features column name.'): 'features', Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', doc='label column name.'): 'label', Param(parent='LogisticRegression_4187be538f744d5a9090', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.'): 'myProbability', Param(parent='LogisticRegression_4187be538f744d5a9090', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.'): 'rawPrediction', Param(parent='LogisticRegression_4187be538f744d5a9090', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial'): 'auto', Param(parent='LogisticRegression_4187be538f744d5a9090', name='fitIntercept', doc='whether to fit an intercept term.'): True, Param(parent='LogisticRegression_4187be538f744d5a9090', name='threshold', doc='Threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match.e.g. if threshold is p, then thresholds must be equal to [1-p, p].'): 0.55, Param(parent='LogisticRegression_4187be538f744d5a9090', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2, Param(parent='LogisticRegression_4187be538f744d5a9090', name='maxIter', doc='max number of iterations (>= 0).'): 30, Param(parent='LogisticRegression_4187be538f744d5a9090', name='regParam', doc='regularization parameter (>= 0).'): 0.1, Param(parent='LogisticRegression_4187be538f744d5a9090', name='standardization', doc='whether to standardize the training features before fitting the model.'): True} Please review http://spark.apache.org/contributing.html before opening a pull request. Author: MarkTab marktab.net <marktab@users.noreply.github.com> Closes #19152 from marktab/branch-2.2. 08 September 2017, 07:08:09 UTC
4304d0b [SPARK-21950][SQL][PYTHON][TEST] pyspark.sql.tests.SQLTests2 should stop SparkContext. ## What changes were proposed in this pull request? `pyspark.sql.tests.SQLTests2` doesn't stop newly created spark context in the test and it might affect the following tests. This pr makes `pyspark.sql.tests.SQLTests2` stop `SparkContext`. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #19158 from ueshin/issues/SPARK-21950. (cherry picked from commit 57bc1e9eb452284cbed090dbd5008eb2062f1b36) Signed-off-by: Takuya UESHIN <ueshin@databricks.com> 08 September 2017, 05:26:23 UTC
0848df1 [SPARK-21890] Credentials not being passed to add the tokens ## What changes were proposed in this pull request? I observed this while running an Oozie job trying to connect to HBase via Spark. It looks like the credentials are not being passed in https://github.com/apache/spark/blob/branch-2.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/security/HadoopFSCredentialProvider.scala#L53 for the 2.2 release. More info on why it fails on a secure grid: the Oozie client gets the necessary tokens the application needs before launching. It passes those tokens along to the Oozie launcher job (an MR job), which then calls the Spark client to launch the Spark app and passes the tokens along. The Oozie launcher job cannot get any more tokens because all it has is tokens (you can't get tokens with tokens; you need a TGT or keytab). The error occurs because the launcher job runs the Spark client to submit the Spark job, but the Spark client doesn't see that it already has the HDFS tokens, so it tries to get more, which ends with the exception. SPARK-19021 generalized the HDFS credentials provider and changed it so that the existing credentials are no longer passed into the call that obtains tokens, so it doesn't realize it already has the necessary tokens. https://issues.apache.org/jira/browse/SPARK-21890 Modified to pass the credentials when getting delegation tokens. ## How was this patch tested? Manual testing on our secure cluster Author: Sanket Chintapalli <schintap@yahoo-inc.com> Closes #19103 from redsanket/SPARK-21890. 07 September 2017, 17:20:39 UTC
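For context, the relevant Hadoop call is `FileSystem.addDelegationTokens(renewer, credentials)`, which only fetches tokens that are not already in the passed-in `Credentials` — which is why handing over the existing credentials matters. The wrapper function below is illustrative:

```scala
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.Credentials

// Sketch: pass the credentials the launcher already holds so tokens that are
// already present are not re-requested (re-requesting fails when only tokens,
// not a TGT or keytab, are available).
def obtainHadoopFsTokens(fs: FileSystem, renewer: String, existing: Credentials): Unit = {
  fs.addDelegationTokens(renewer, existing)
}
```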
49968de Fixed pandoc dependency issue in python/setup.py ## Problem Description When pyspark is listed as a dependency of another package, installing the other package will cause an install failure in pyspark. When the other package is being installed, pyspark's setup_requires requirements are installed including pypandoc. Thus, the exception handling on setup.py:152 does not work because the pypandoc module is indeed available. However, the pypandoc.convert() function fails if pandoc itself is not installed (in our use cases it is not). This raises an OSError that is not handled, and setup fails. The following is a sample failure: ``` $ which pandoc $ pip freeze | grep pypandoc pypandoc==1.4 $ pip install pyspark Collecting pyspark Downloading pyspark-2.2.0.post0.tar.gz (188.3MB) 100% |████████████████████████████████| 188.3MB 16.8MB/s Complete output from command python setup.py egg_info: Maybe try: sudo apt-get install pandoc See http://johnmacfarlane.net/pandoc/installing.html for installation options --------------------------------------------------------------- Traceback (most recent call last): File "<string>", line 1, in <module> File "/tmp/pip-build-mfnizcwa/pyspark/setup.py", line 151, in <module> long_description = pypandoc.convert('README.md', 'rst') File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 69, in convert outputfile=outputfile, filters=filters) File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 260, in _convert_input _ensure_pandoc_path() File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 544, in _ensure_pandoc_path raise OSError("No pandoc was found: either install pandoc and add it\n" OSError: No pandoc was found: either install pandoc and add it to your PATH or or call pypandoc.download_pandoc(...) or install pypandoc wheels with included pandoc. ---------------------------------------- Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-mfnizcwa/pyspark/ ``` ## What changes were proposed in this pull request? This change simply adds an additional exception handler for the OSError that is raised. This allows pyspark to be installed client-side without requiring pandoc to be installed. ## How was this patch tested? I tested this by building a wheel package of pyspark with the change applied. Then, in a clean virtual environment with pypandoc installed but pandoc not available on the system, I installed pyspark from the wheel. Here is the output ``` $ pip freeze | grep pypandoc pypandoc==1.4 $ which pandoc $ pip install --no-cache-dir ../spark/python/dist/pyspark-2.3.0.dev0-py2.py3-none-any.whl Processing /home/tbeck/work/spark/python/dist/pyspark-2.3.0.dev0-py2.py3-none-any.whl Requirement already satisfied: py4j==0.10.6 in /home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages (from pyspark==2.3.0.dev0) Installing collected packages: pyspark Successfully installed pyspark-2.3.0.dev0 ``` Author: Tucker Beck <tucker.beck@rentrakmail.com> Closes #18981 from dusktreader/dusktreader/fix-pandoc-dependency-issue-in-setup_py. (cherry picked from commit aad2125475dcdeb4a0410392b6706511db17bac4) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 07 September 2017, 00:38:21 UTC