https://github.com/apache/spark

e30e269 Preparing Spark release v2.2.1-rc2 24 November 2017, 21:11:35 UTC
c3b5df2 fix typo 24 November 2017, 20:06:57 UTC
b606cc2 [SPARK-22495] Fix setup of SPARK_HOME variable on Windows ## What changes were proposed in this pull request? This is a cherry-pick of the original PR 19370 onto branch-2.2, as suggested in https://github.com/apache/spark/pull/19370#issuecomment-346526920. It fixes how `SPARK_HOME` is resolved on Windows. While the previous version worked with the built release download, the set of directories changed slightly for the PySpark `pip` or `conda` install. This had been reflected in the Linux `bin` files but not in the Windows `cmd` files. The first fix improves how the `jars` directory is found, as this was stopping the Windows `pip/conda` install from working; JARs were not found on Session/Context setup. The second fix adds a `find-spark-home.cmd` script, which, like the Linux version, uses the `find_spark_home.py` script to resolve `SPARK_HOME`. It is based on the `find-spark-home` bash script, though some operations are done in a different order due to limitations of the `cmd` script language. If the environment variable is already set, the Python script `find_spark_home.py` will not be run. The process can fail if Python is not installed, but this path is mostly used when PySpark is installed via `pip/conda`, in which case Python is present on the system. ## How was this patch tested? Tested on local installation. Author: Jakub Nowacki <j.s.nowacki@gmail.com> Closes #19807 from jsnowacki/fix_spark_cmds_2. 24 November 2017, 20:05:57 UTC
ad57141 [SPARK-22595][SQL] fix flaky test: CastSuite.SPARK-22500: cast for struct should not generate codes beyond 64KB This PR reduces the number of fields in the test case of `CastSuite` to fix an issue that is pointed at [here](https://github.com/apache/spark/pull/19800#issuecomment-346634950). ``` java.lang.OutOfMemoryError: GC overhead limit exceeded java.lang.OutOfMemoryError: GC overhead limit exceeded at org.codehaus.janino.UnitCompiler.findClass(UnitCompiler.java:10971) at org.codehaus.janino.UnitCompiler.findTypeByName(UnitCompiler.java:7607) at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5758) at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5732) at org.codehaus.janino.UnitCompiler.access$13200(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$18.visitReferenceType(UnitCompiler.java:5668) at org.codehaus.janino.UnitCompiler$18.visitReferenceType(UnitCompiler.java:5660) at org.codehaus.janino.Java$ReferenceType.accept(Java.java:3356) at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5660) at org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2892) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2764) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) ... ``` Used existing test case Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19806 from kiszk/SPARK-22595. (cherry picked from commit 554adc77d24c411a6df6d38c596aa33cdf68f3c1) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 November 2017, 11:10:32 UTC
f4c457a [SPARK-22591][SQL] GenerateOrdering shouldn't change CodegenContext.INPUT_ROW ## What changes were proposed in this pull request? When I played with codegen in developing another PR, I found the value of `CodegenContext.INPUT_ROW` is not reliable. Under wholestage codegen, it is assigned to null first and then suddenly changed to `i`. The reason is `GenerateOrdering` changes `CodegenContext.INPUT_ROW` but doesn't restore it back. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19800 from viirya/SPARK-22591. (cherry picked from commit 62a826f17c549ed93300bdce562db56bddd5d959) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 November 2017, 10:47:10 UTC
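As a rough illustration of the fix described above — not Spark's actual `CodegenContext` API, all names here are hypothetical — the following Scala sketch shows the save-and-restore pattern: a code generator that mutates shared context state should put the previous value back before returning, so callers see an unchanged context.

```scala
// Hypothetical sketch of the save-and-restore pattern; FakeCodegenContext is a stand-in.
case class FakeCodegenContext(var inputRow: String)

def withInputRow[T](ctx: FakeCodegenContext, row: String)(body: => T): T = {
  val saved = ctx.inputRow          // remember the caller's value
  ctx.inputRow = row
  try body
  finally ctx.inputRow = saved      // restore it even if body throws
}

val ctx = FakeCodegenContext(inputRow = "null")
withInputRow(ctx, "i") { println(s"generating ordering code with INPUT_ROW = ${ctx.inputRow}") }
println(s"INPUT_ROW restored to ${ctx.inputRow}")   // prints: INPUT_ROW restored to null
```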
f8e73d0 [SPARK-17920][SQL] [FOLLOWUP] Backport PR 19779 to branch-2.2 ## What changes were proposed in this pull request? A followup of https://github.com/apache/spark/pull/19795 , to simplify the file creation. ## How was this patch tested? Only test case is updated Author: vinodkc <vinod.kc.in@gmail.com> Closes #19809 from vinodkc/br_FollowupSPARK-17920_branch-2.2. 24 November 2017, 10:42:47 UTC
b17f406 [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Backport PR 19779 to branch-2.2 - Support writing to Hive table which uses Avro schema url 'avro.schema.url' ## What changes were proposed in this pull request? > Backport https://github.com/apache/spark/pull/19779 to branch-2.2 SPARK-19580 Support for avro.schema.url while writing to hive table SPARK-19878 Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala SPARK-17920 HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url Support writing to Hive table which uses Avro schema url 'avro.schema.url' For ex: create external table avro_in (a string) stored as avro location '/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc'); create external table avro_out (a string) stored as avro location '/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc'); insert overwrite table avro_out select * from avro_in; // fails with java.lang.NullPointerException WARN AvroSerDe: Encountered exception determining schema. Returning signal schema to indicate problem java.lang.NullPointerException at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174) ## Changes proposed in this fix Currently 'null' value is passed to serializer, which causes NPE during insert operation, instead pass Hadoop configuration object ## How was this patch tested? Added new test case in VersionsSuite Author: vinodkc <vinod.kc.in@gmail.com> Closes #19795 from vinodkc/br_Fix_SPARK-17920_branch-2.2. 22 November 2017, 17:21:26 UTC
df9228b [SPARK-22548][SQL] Incorrect nested AND expression pushed down to JDBC data source ## What changes were proposed in this pull request? Let’s say I have a nested AND expression shown below and p2 can not be pushed down, (p1 AND p2) OR p3 In current Spark code, during data source filter translation, (p1 AND p2) is returned as p1 only and p2 is simply lost. This issue occurs with JDBC data source and is similar to [SPARK-12218](https://github.com/apache/spark/pull/10362) for Parquet. When we have AND nested below another expression, we should either push both legs or nothing. Note that: - The current Spark code will always split conjunctive predicate before it determines if a predicate can be pushed down or not - If I have (p1 AND p2) AND p3, it will be split into p1, p2, p3. There won't be nested AND expression. - The current Spark code logic for OR is OK. It either pushes both legs or nothing. The same translation method is also called by Data Source V2. ## How was this patch tested? Added new unit test cases to JDBCSuite gatorsmile Author: Jia Li <jiali@us.ibm.com> Closes #19776 from jliwork/spark-22548. (cherry picked from commit 881c5c807304a305ef96e805d51afbde097f7f4f) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 22 November 2017, 01:30:15 UTC
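The rule this commit describes — push both legs of an AND or nothing — can be sketched with a toy predicate translator. This is an illustrative Scala model only, not Spark's actual `DataSourceStrategy` code; the type names are made up.

```scala
// Toy model of "push both legs or nothing" when translating predicates into data source filters.
sealed trait Pred
case class Leaf(sql: String, pushable: Boolean) extends Pred
case class And(l: Pred, r: Pred) extends Pred
case class Or(l: Pred, r: Pred) extends Pred

def translate(p: Pred): Option[String] = p match {
  case Leaf(sql, true)  => Some(sql)
  case Leaf(_, false)   => None
  // Both sides must translate; the bug was returning only the translatable
  // side of an AND and silently dropping the other.
  case And(l, r) => for (a <- translate(l); b <- translate(r)) yield s"($a AND $b)"
  case Or(l, r)  => for (a <- translate(l); b <- translate(r)) yield s"($a OR $b)"
}

// (p1 AND p2) OR p3 where p2 is not pushable => push nothing, rather than a wrong filter.
println(translate(Or(And(Leaf("p1", true), Leaf("p2", false)), Leaf("p3", true))))  // None
```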
11a599b [SPARK-22500][SQL] Fix 64KB JVM bytecode limit problem with cast This PR changes `cast` code generation to place the generated code for the fields of a struct into separate methods when the combined size could be large. Added new test cases into `CastSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19730 from kiszk/SPARK-22500. (cherry picked from commit ac10171bea2fc027d6691393b385b3fc0ef3293d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 November 2017, 21:45:40 UTC
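Several of the 64KB-limit commits in this list follow the same general pattern: rather than emitting one huge generated method, the per-field or per-argument code snippets are grouped into smaller helper methods. Below is a minimal, hypothetical Scala sketch of that idea, not Spark's actual splitting logic.

```scala
// Hypothetical sketch: group code snippets into helper methods so no single
// generated method grows past the JVM's 64KB bytecode limit.
def splitIntoMethods(snippets: Seq[String], maxPerMethod: Int): String = {
  val helpers = snippets.grouped(maxPerMethod).zipWithIndex.map { case (group, idx) =>
    s"""private void apply_$idx(InternalRow row) {
       |  ${group.mkString("\n  ")}
       |}""".stripMargin
  }.toSeq
  val calls = helpers.indices.map(idx => s"apply_$idx(row);").mkString("\n")
  (calls +: helpers).mkString("\n\n")
}

// Six per-field snippets become three helpers of two snippets each.
println(splitIntoMethods((0 until 6).map(n => s"// cast field $n"), maxPerMethod = 2))
```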
94f9227 [SPARK-22550][SQL] Fix 64KB JVM bytecode limit problem with elt This PR changes `elt` code generation to place the generated code for its arguments into separate methods when the combined size could be large. This resolves the case of `elt` with a lot of arguments. Added new test cases into `StringExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19778 from kiszk/SPARK-22550. (cherry picked from commit 9bdff0bcd83e730aba8dc1253da24a905ba07ae3) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 November 2017, 11:53:55 UTC
23eb4d7 [SPARK-22508][SQL] Fix 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create() ## What changes were proposed in this pull request? This PR changes `GenerateUnsafeRowJoiner.create()` code generation to place the generated statements that operate on the bitmap and offsets into separate methods when the combined size could be large. ## How was this patch tested? Added a new test case into `GenerateUnsafeRowJoinerSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19737 from kiszk/SPARK-22508. (cherry picked from commit c9577148069d2215dc79cbf828a378591b4fba5d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 November 2017, 11:17:06 UTC
ca02575 [SPARK-22549][SQL] Fix 64KB JVM bytecode limit problem with concat_ws ## What changes were proposed in this pull request? This PR changes `concat_ws` code generation to place the generated code for its arguments into separate methods when the combined size could be large. This resolves the case of `concat_ws` with a lot of arguments. ## How was this patch tested? Added new test cases into `StringExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19777 from kiszk/SPARK-22549. (cherry picked from commit 41c6f36018eb086477f21574aacd71616513bd8e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 November 2017, 00:42:23 UTC
710d618 [SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem with concat ## What changes were proposed in this pull request? This PR changes `concat` code generation to place the generated code for its arguments into separate methods when the combined size could be large. This resolves the case of `concat` with a lot of arguments. ## How was this patch tested? Added new test cases into `StringExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19728 from kiszk/SPARK-22498. (cherry picked from commit d54bfec2e07f2eb934185402f915558fe27b9312) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 November 2017, 18:40:18 UTC
53a6076 [SPARK-22544][SS] FileStreamSource should use its own hadoop conf to call globPathIfNecessary ## What changes were proposed in this pull request? Pass the FileSystem created using the correct Hadoop conf into `globPathIfNecessary` so that it can pick up user's hadoop configurations, such as credentials. ## How was this patch tested? Jenkins Author: Shixiong Zhu <zsxwing@gmail.com> Closes #19771 from zsxwing/fix-file-stream-conf. (cherry picked from commit bf0c0ae2dcc7fd1ce92cd0fb4809bb3d65b2e309) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 17 November 2017, 23:35:36 UTC
3bc37e5 [SPARK-22538][ML] SQLTransformer should not unpersist possibly cached input dataset ## What changes were proposed in this pull request? `SQLTransformer.transform` unpersists input dataset when dropping temporary view. We should not change input dataset's cache status. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19772 from viirya/SPARK-22538. (cherry picked from commit fccb337f9d1e44a83cfcc00ce33eae1fad367695) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 November 2017, 16:44:05 UTC
ef7ccc1 [SPARK-22540][SQL] Ensure HighlyCompressedMapStatus calculates correct avgSize ## What changes were proposed in this pull request? Ensure HighlyCompressedMapStatus calculates correct avgSize ## How was this patch tested? New unit test added. Author: yucai <yucai.yu@intel.com> Closes #19765 from yucai/avgsize. (cherry picked from commit d00b55d4b25ba0bf92983ff1bb47d8528e943737) Signed-off-by: Sean Owen <sowen@cloudera.com> 17 November 2017, 13:54:08 UTC
be68f86 [SPARK-22535][PYSPARK] Sleep before killing the Python worker in PythonRunner.MonitorThread (branch-2.2) ## What changes were proposed in this pull request? Backport #19762 to 2.2 ## How was this patch tested? Jenkins Author: Shixiong Zhu <zsxwing@gmail.com> Closes #19768 from zsxwing/SPARK-22535-2.2. 16 November 2017, 22:41:05 UTC
0b51fd3 [SPARK-22501][SQL] Fix 64KB JVM bytecode limit problem with In ## What changes were proposed in this pull request? This PR changes `In` code generation to place the generated code for its list expressions into separate methods when the combined size could be large. ## How was this patch tested? Added new test cases into `PredicateSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19733 from kiszk/SPARK-22501. (cherry picked from commit 7f2e62ee6b9d1f32772a18d626fb9fd907aa7733) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 November 2017, 17:25:01 UTC
52b05b6 [SPARK-22494][SQL] Fix 64KB limit exception with Coalesce and AtLeastNNonNulls ## What changes were proposed in this pull request? Both `Coalesce` and `AtLeastNNonNulls` can cause the 64KB limit exception when used with a lot of arguments and/or complex expressions. This PR splits their expressions in order to avoid the issue. ## How was this patch tested? Added UTs Author: Marco Gaido <marcogaido91@gmail.com> Author: Marco Gaido <mgaido@hortonworks.com> Closes #19720 from mgaido91/SPARK-22494. (cherry picked from commit 4e7f07e2550fa995cc37406173a937033135cf3b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 November 2017, 17:19:26 UTC
17ba7b9 [SPARK-22499][SQL] Fix 64KB JVM bytecode limit problem with least and greatest ## What changes were proposed in this pull request? This PR changes `least` and `greatest` code generation to place the generated code for their arguments into separate methods when the combined size could be large. This resolves two cases: * `least` with a lot of arguments * `greatest` with a lot of arguments ## How was this patch tested? Added a new test case into `ArithmeticExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19729 from kiszk/SPARK-22499. (cherry picked from commit ed885e7a6504c439ffb6730e6963efbd050d43dd) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 November 2017, 16:56:33 UTC
b17ba35 [SPARK-22479][SQL][BRANCH-2.2] Exclude credentials from SaveIntoDataSourceCommand.simpleString ## What changes were proposed in this pull request? Do not include JDBC properties, which may contain credentials, when logging a logical plan with `SaveIntoDataSourceCommand` in it. ## How was this patch tested? New tests Author: osatici <osatici@palantir.com> Closes #19761 from onursatici/os/redact-jdbc-creds-2.2. 16 November 2017, 11:49:29 UTC
3ae187b [SPARK-22469][SQL] Accuracy problem in comparison with string and numeric This fixes a problem caused by #15880 `select '1.5' > 0.5; // Result is NULL in Spark but is true in Hive. ` When comparing a string with a numeric value, cast both to double, as Hive does. Author: liutang123 <liutang123@yeah.net> Closes #19692 from liutang123/SPARK-22469. (cherry picked from commit bc0848b4c1ab84ccef047363a70fd11df240dbbf) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 November 2017, 11:47:22 UTC
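A minimal sketch of the behaviour described above, assuming both sides of a string-vs-numeric comparison are coerced to double as in Hive; this is a plain Scala illustration, not Spark's actual type-coercion rule.

```scala
// Illustrative only: coerce the string side to double before comparing.
def looseGreaterThan(s: String, n: Double): Option[Boolean] =
  scala.util.Try(s.toDouble).toOption.map(_ > n)

println(looseGreaterThan("1.5", 0.5))   // Some(true), matching Hive's result
println(looseGreaterThan("abc", 0.5))   // None: the string is not numeric
```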
3cefdde [SPARK-22490][DOC] Add PySpark doc for SparkSession.builder ## What changes were proposed in this pull request? In PySpark API Document, [SparkSession.build](http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html) is not documented and shows default value description. ``` SparkSession.builder = <pyspark.sql.session.Builder object ... ``` This PR adds the doc. ![screen](https://user-images.githubusercontent.com/9700541/32705514-1bdcafaa-c7ca-11e7-88bf-05566fea42de.png) The following is the diff of the generated result. ``` $ diff old.html new.html 95a96,101 > <dl class="attribute"> > <dt id="pyspark.sql.SparkSession.builder"> > <code class="descname">builder</code><a class="headerlink" href="#pyspark.sql.SparkSession.builder" title="Permalink to this definition">¶</a></dt> > <dd><p>A class attribute having a <a class="reference internal" href="#pyspark.sql.SparkSession.Builder" title="pyspark.sql.SparkSession.Builder"><code class="xref py py-class docutils literal"><span class="pre">Builder</span></code></a> to construct <a class="reference internal" href="#pyspark.sql.SparkSession" title="pyspark.sql.SparkSession"><code class="xref py py-class docutils literal"><span class="pre">SparkSession</span></code></a> instances</p> > </dd></dl> > 212,216d217 < <dt id="pyspark.sql.SparkSession.builder"> < <code class="descname">builder</code><em class="property"> = &lt;pyspark.sql.session.SparkSession.Builder object&gt;</em><a class="headerlink" href="#pyspark.sql.SparkSession.builder" title="Permalink to this definition">¶</a></dt> < <dd></dd></dl> < < <dl class="attribute"> ``` ## How was this patch tested? Manual. ``` cd python/docs make html open _build/html/pyspark.sql.html ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19726 from dongjoon-hyun/SPARK-22490. (cherry picked from commit aa88b8dbbb7e71b282f31ae775140c783e83b4d6) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 15 November 2017, 17:00:39 UTC
210f292 [SPARK-22511][BUILD] Update maven central repo address Use repo.maven.apache.org repo address; use latest ASF parent POM version 18 Existing tests; no functional change Author: Sean Owen <sowen@cloudera.com> Closes #19742 from srowen/SPARK-22511. (cherry picked from commit b009722591d2635698233c84f6e7e6cde7177019) Signed-off-by: Sean Owen <sowen@cloudera.com> 14 November 2017, 23:59:37 UTC
3ea6fd0 [SPARK-22377][BUILD] Use /usr/sbin/lsof if lsof does not exists in release-build.sh ## What changes were proposed in this pull request? This PR proposes to use `/usr/sbin/lsof` if `lsof` is missing in the path to fix nightly snapshot jenkins jobs. Please refer https://github.com/apache/spark/pull/19359#issuecomment-340139557: > Looks like some of the snapshot builds are having lsof issues: > > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.1-maven-snapshots/182/console > >https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.2-maven-snapshots/134/console > >spark-build/dev/create-release/release-build.sh: line 344: lsof: command not found >usage: kill [ -s signal | -p ] [ -a ] pid ... >kill -l [ signal ] Up to my knowledge, the full path of `lsof` is required for non-root user in few OSs. ## How was this patch tested? Manually tested as below: ```bash #!/usr/bin/env bash LSOF=lsof if ! hash $LSOF 2>/dev/null; then echo "a" LSOF=/usr/sbin/lsof fi $LSOF -P | grep "a" ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #19695 from HyukjinKwon/SPARK-22377. (cherry picked from commit c8b7f97b8a58bf4a9f6e3a07dd6e5b0f646d8d99) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 13 November 2017, 23:28:28 UTC
d905e85 [SPARK-22471][SQL] SQLListener consumes much memory causing OutOfMemoryError ## What changes were proposed in this pull request? This PR addresses the issue [SPARK-22471](https://issues.apache.org/jira/browse/SPARK-22471). The modified version of `SQLListener` respects the setting `spark.ui.retainedStages` and keeps the number of the tracked stages within the specified limit. The hash map `_stageIdToStageMetrics` does not outgrow the limit, hence overall memory consumption does not grow with time anymore. A 2.2-compatible fix. Maybe incompatible with 2.3 due to #19681. ## How was this patch tested? A new unit test covers this fix - see `SQLListenerMemorySuite.scala`. Author: Arseniy Tashoyan <tashoyan@gmail.com> Closes #19711 from tashoyan/SPARK-22471-branch-2.2. 13 November 2017, 21:50:12 UTC
af0b185 Preparing development version 2.2.2-SNAPSHOT 13 November 2017, 19:04:34 UTC
41116ab Preparing Spark release v2.2.1-rc1 13 November 2017, 19:04:27 UTC
c68b4c5 [MINOR][CORE] Using bufferedInputStream for dataDeserializeStream ## What changes were proposed in this pull request? Small fix. Using bufferedInputStream for dataDeserializeStream. ## How was this patch tested? Existing UT. Author: Xianyang Liu <xianyang.liu@intel.com> Closes #19735 from ConeyLiu/smallfix. (cherry picked from commit 176ae4d53e0269cfc2cfa62d3a2991e28f5a9182) Signed-off-by: Sean Owen <sowen@cloudera.com> 13 November 2017, 12:19:21 UTC
2f6dece [SPARK-22442][SQL][BRANCH-2.2][FOLLOWUP] ScalaReflection should produce correct field names for special characters ## What changes were proposed in this pull request? `val TermName: TermNameExtractor` is new in scala 2.11. For 2.10, we should use deprecated `newTermName`. ## How was this patch tested? Build locally with scala 2.10. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19736 from viirya/SPARK-22442-2.2-followup. 13 November 2017, 11:41:42 UTC
f736377 [SPARK-22442][SQL][BRANCH-2.2] ScalaReflection should produce correct field names for special characters ## What changes were proposed in this pull request? For a class with field name of special characters, e.g.: ```scala case class MyType(`field.1`: String, `field 2`: String) ``` Although we can manipulate DataFrame/Dataset, the field names are encoded: ```scala scala> val df = Seq(MyType("a", "b"), MyType("c", "d")).toDF df: org.apache.spark.sql.DataFrame = [field$u002E1: string, field$u00202: string] scala> df.as[MyType].collect res7: Array[MyType] = Array(MyType(a,b), MyType(c,d)) ``` It causes resolving problem when we try to convert the data with non-encoded field names: ```scala spark.read.json(path).as[MyType] ... [info] org.apache.spark.sql.AnalysisException: cannot resolve '`field$u002E1`' given input columns: [field 2, fie ld.1]; [info] at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) ... ``` We should use decoded field name in Dataset schema. ## How was this patch tested? Added tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19734 from viirya/SPARK-22442-2.2. 13 November 2017, 05:19:15 UTC
8acd02f [SPARK-21694][R][ML] Reduce max iterations in Linear SVM test in R to speed up AppVeyor build This PR proposes to reduce the max iterations in the Linear SVM test in SparkR. This particular test takes roughly 5 minutes on my Mac and over 20 minutes on Windows. The root cause appears to be that it triggers roughly 2,500 jobs with the default of 100 max iterations. In Linux, `daemon.R` is forked, but on Windows another process is launched, which is extremely slow. So, per my observation, many processes (not forked) run on Windows, which accounts for the difference in elapsed time. After reducing the max iterations to 10, the total number of jobs in this single test drops to about 550. After reducing the max iterations to 5, it drops to about 360. Manually tested the elapsed times. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19722 from HyukjinKwon/SPARK-21693-test. (cherry picked from commit 3d90b2cb384affe8ceac9398615e9e21b8c8e0b0) Signed-off-by: Felix Cheung <felixcheung@apache.org> 12 November 2017, 22:44:36 UTC
2a04cfa [SPARK-19606][BUILD][BACKPORT-2.2][MESOS] fix mesos break ## What changes were proposed in this pull request? Fix break from cherry pick ## How was this patch tested? Jenkins Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #19732 from felixcheung/fixmesosdriverconstraint. 12 November 2017, 22:27:49 UTC
95981fa [SPARK-22464][BACKPORT-2.2][SQL] No pushdown for Hive metastore partition predicates containing null-safe equality ## What changes were proposed in this pull request? `<=>` is not supported by Hive metastore partition predicate pushdown. We should not push it down to the Hive metastore when it is used in partition predicates. ## How was this patch tested? Added a test case Author: gatorsmile <gatorsmile@gmail.com> Closes #19724 from gatorsmile/backportSPARK-22464. 12 November 2017, 22:17:06 UTC
00cb9d0 [SPARK-22488][BACKPORT-2.2][SQL] Fix the view resolution issue in the SparkSession internal table() API ## What changes were proposed in this pull request? The current internal `table()` API of `SparkSession` bypasses the Analyzer and directly calls `sessionState.catalog.lookupRelation` API. This skips the view resolution logics in our Analyzer rule `ResolveRelations`. This internal API is widely used by various DDL commands, public and internal APIs. Users might get the strange error caused by view resolution when the default database is different. ``` Table or view not found: t1; line 1 pos 14 org.apache.spark.sql.AnalysisException: Table or view not found: t1; line 1 pos 14 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) ``` This PR is to fix it by enforcing it to use `ResolveRelations` to resolve the table. ## How was this patch tested? Added a test case and modified the existing test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #19723 from gatorsmile/backport22488. 12 November 2017, 22:15:58 UTC
114dc42 [SPARK-21720][SQL] Fix 64KB JVM bytecode limit problem with AND or OR This PR changes `AND` and `OR` code generation to place the generated code for the left and right expressions into separate methods when the combined size could be large. When a method is newly generated, variables for `isNull` and `value` are declared as instance variables to pass these values (e.g. `isNull1409` and `value1409`) to the callers of the generated method. This resolves two cases: * large code size of the left expression * large code size of the right expression Added a new test case into `CodeGenerationSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18972 from kiszk/SPARK-21720. (cherry picked from commit 9bf696dbece6b1993880efba24a6d32c54c4d11c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 November 2017, 21:47:54 UTC
f6ee3d9 [SPARK-19606][MESOS] Support constraints in spark-dispatcher A discussed in SPARK-19606, the addition of a new config property named "spark.mesos.constraints.driver" for constraining drivers running on a Mesos cluster Corresponding unit test added also tested locally on a Mesos cluster Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Paul Mackles <pmackles@adobe.com> Closes #19543 from pmackles/SPARK-19606. (cherry picked from commit b3f9dbf48ec0938ff5c98833bb6b6855c620ef57) Signed-off-by: Felix Cheung <felixcheung@apache.org> 12 November 2017, 19:40:42 UTC
4ef0bef [SPARK-21667][STREAMING] ConsoleSink should not fail streaming query with checkpointLocation option ## What changes were proposed in this pull request? Fix to allow recovery with the console sink and avoid the checkpoint exception. ## How was this patch tested? Existing tests and manual tests (replicating the error and confirming no checkpoint error after the fix). Author: Rekha Joshi <rekhajoshm@gmail.com> Author: rjoshi2 <rekhajoshm@gmail.com> Closes #19407 from rekhajoshm/SPARK-21667. (cherry picked from commit 808e886b9638ab2981dac676b594f09cda9722fe) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 10 November 2017, 23:18:35 UTC
8b7f72e [SPARK-19644][SQL] Clean up Scala reflection garbage after creating Encoder (branch-2.2) ## What changes were proposed in this pull request? Backport #19687 to branch-2.2. The major difference is `cleanUpReflectionObjects` is protected by `ScalaReflectionLock.synchronized` in this PR for Scala 2.10. ## How was this patch tested? Jenkins Author: Shixiong Zhu <zsxwing@gmail.com> Closes #19718 from zsxwing/SPARK-19644-2.2. 10 November 2017, 22:14:47 UTC
6b4ec22 [SPARK-22284][SQL] Fix 64KB JVM bytecode limit problem in calculating hash for nested structs ## What changes were proposed in this pull request? This PR avoids to generate a huge method for calculating a murmur3 hash for nested structs. This PR splits a huge method (e.g. `apply_4`) into multiple smaller methods. Sample program ``` val structOfString = new StructType().add("str", StringType) var inner = new StructType() for (_ <- 0 until 800) { inner = inner1.add("structOfString", structOfString) } var schema = new StructType() for (_ <- 0 until 50) { schema = schema.add("structOfStructOfStrings", inner) } GenerateMutableProjection.generate(Seq(Murmur3Hash(exprs, 42))) ``` Without this PR ``` /* 005 */ class SpecificMutableProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection { /* 006 */ /* 007 */ private Object[] references; /* 008 */ private InternalRow mutableRow; /* 009 */ private int value; /* 010 */ private int value_0; ... /* 034 */ public java.lang.Object apply(java.lang.Object _i) { /* 035 */ InternalRow i = (InternalRow) _i; /* 036 */ /* 037 */ /* 038 */ /* 039 */ value = 42; /* 040 */ apply_0(i); /* 041 */ apply_1(i); /* 042 */ apply_2(i); /* 043 */ apply_3(i); /* 044 */ apply_4(i); /* 045 */ nestedClassInstance.apply_5(i); ... /* 089 */ nestedClassInstance8.apply_49(i); /* 090 */ value_0 = value; /* 091 */ /* 092 */ // copy all the results into MutableRow /* 093 */ mutableRow.setInt(0, value_0); /* 094 */ return mutableRow; /* 095 */ } /* 096 */ /* 097 */ /* 098 */ private void apply_4(InternalRow i) { /* 099 */ /* 100 */ boolean isNull5 = i.isNullAt(4); /* 101 */ InternalRow value5 = isNull5 ? null : (i.getStruct(4, 800)); /* 102 */ if (!isNull5) { /* 103 */ /* 104 */ if (!value5.isNullAt(0)) { /* 105 */ /* 106 */ final InternalRow element6400 = value5.getStruct(0, 1); /* 107 */ /* 108 */ if (!element6400.isNullAt(0)) { /* 109 */ /* 110 */ final UTF8String element6401 = element6400.getUTF8String(0); /* 111 */ value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6401.getBaseObject(), element6401.getBaseOffset(), element6401.numBytes(), value); /* 112 */ /* 113 */ } /* 114 */ /* 115 */ /* 116 */ } /* 117 */ /* 118 */ /* 119 */ if (!value5.isNullAt(1)) { /* 120 */ /* 121 */ final InternalRow element6402 = value5.getStruct(1, 1); /* 122 */ /* 123 */ if (!element6402.isNullAt(0)) { /* 124 */ /* 125 */ final UTF8String element6403 = element6402.getUTF8String(0); /* 126 */ value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6403.getBaseObject(), element6403.getBaseOffset(), element6403.numBytes(), value); /* 127 */ /* 128 */ } /* 128 */ } /* 129 */ /* 130 */ /* 131 */ } /* 132 */ /* 133 */ /* 134 */ if (!value5.isNullAt(2)) { /* 135 */ /* 136 */ final InternalRow element6404 = value5.getStruct(2, 1); /* 137 */ /* 138 */ if (!element6404.isNullAt(0)) { /* 139 */ /* 140 */ final UTF8String element6405 = element6404.getUTF8String(0); /* 141 */ value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6405.getBaseObject(), element6405.getBaseOffset(), element6405.numBytes(), value); /* 142 */ /* 143 */ } /* 144 */ /* 145 */ /* 146 */ } /* 147 */ ... 
/* 12074 */ if (!value5.isNullAt(798)) { /* 12075 */ /* 12076 */ final InternalRow element7996 = value5.getStruct(798, 1); /* 12077 */ /* 12078 */ if (!element7996.isNullAt(0)) { /* 12079 */ /* 12080 */ final UTF8String element7997 = element7996.getUTF8String(0); /* 12083 */ } /* 12084 */ /* 12085 */ /* 12086 */ } /* 12087 */ /* 12088 */ /* 12089 */ if (!value5.isNullAt(799)) { /* 12090 */ /* 12091 */ final InternalRow element7998 = value5.getStruct(799, 1); /* 12092 */ /* 12093 */ if (!element7998.isNullAt(0)) { /* 12094 */ /* 12095 */ final UTF8String element7999 = element7998.getUTF8String(0); /* 12096 */ value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element7999.getBaseObject(), element7999.getBaseOffset(), element7999.numBytes(), value); /* 12097 */ /* 12098 */ } /* 12099 */ /* 12100 */ /* 12101 */ } /* 12102 */ /* 12103 */ } /* 12104 */ /* 12105 */ } /* 12106 */ /* 12106 */ /* 12107 */ /* 12108 */ private void apply_1(InternalRow i) { ... ``` With this PR ``` /* 005 */ class SpecificMutableProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection { /* 006 */ /* 007 */ private Object[] references; /* 008 */ private InternalRow mutableRow; /* 009 */ private int value; /* 010 */ private int value_0; /* 011 */ ... /* 034 */ public java.lang.Object apply(java.lang.Object _i) { /* 035 */ InternalRow i = (InternalRow) _i; /* 036 */ /* 037 */ /* 038 */ /* 039 */ value = 42; /* 040 */ nestedClassInstance11.apply50_0(i); /* 041 */ nestedClassInstance11.apply50_1(i); ... /* 088 */ nestedClassInstance11.apply50_48(i); /* 089 */ nestedClassInstance11.apply50_49(i); /* 090 */ value_0 = value; /* 091 */ /* 092 */ // copy all the results into MutableRow /* 093 */ mutableRow.setInt(0, value_0); /* 094 */ return mutableRow; /* 095 */ } /* 096 */ ... /* 37717 */ private void apply4_0(InternalRow value5, InternalRow i) { /* 37718 */ /* 37719 */ if (!value5.isNullAt(0)) { /* 37720 */ /* 37721 */ final InternalRow element6400 = value5.getStruct(0, 1); /* 37722 */ /* 37723 */ if (!element6400.isNullAt(0)) { /* 37724 */ /* 37725 */ final UTF8String element6401 = element6400.getUTF8String(0); /* 37726 */ value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6401.getBaseObject(), element6401.getBaseOffset(), element6401.numBytes(), value); /* 37727 */ /* 37728 */ } /* 37729 */ /* 37730 */ /* 37731 */ } /* 37732 */ /* 37733 */ if (!value5.isNullAt(1)) { /* 37734 */ /* 37735 */ final InternalRow element6402 = value5.getStruct(1, 1); /* 37736 */ /* 37737 */ if (!element6402.isNullAt(0)) { /* 37738 */ /* 37739 */ final UTF8String element6403 = element6402.getUTF8String(0); /* 37740 */ value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6403.getBaseObject(), element6403.getBaseOffset(), element6403.numBytes(), value); /* 37741 */ /* 37742 */ } /* 37743 */ /* 37744 */ /* 37745 */ } /* 37746 */ /* 37747 */ if (!value5.isNullAt(2)) { /* 37748 */ /* 37749 */ final InternalRow element6404 = value5.getStruct(2, 1); /* 37750 */ /* 37751 */ if (!element6404.isNullAt(0)) { /* 37752 */ /* 37753 */ final UTF8String element6405 = element6404.getUTF8String(0); /* 37754 */ value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6405.getBaseObject(), element6405.getBaseOffset(), element6405.numBytes(), value); /* 37755 */ /* 37756 */ } /* 37757 */ /* 37758 */ /* 37759 */ } /* 37760 */ /* 37761 */ } ... 
/* 218470 */ /* 218471 */ private void apply50_4(InternalRow i) { /* 218472 */ /* 218473 */ boolean isNull5 = i.isNullAt(4); /* 218474 */ InternalRow value5 = isNull5 ? null : (i.getStruct(4, 800)); /* 218475 */ if (!isNull5) { /* 218476 */ apply4_0(value5, i); /* 218477 */ apply4_1(value5, i); /* 218478 */ apply4_2(value5, i); ... /* 218742 */ nestedClassInstance.apply4_266(value5, i); /* 218743 */ } /* 218744 */ /* 218745 */ } ``` ## How was this patch tested? Added new test to `HashExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19563 from kiszk/SPARK-22284. (cherry picked from commit f2da738c76810131045e6c32533a2d13526cdaf6) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 November 2017, 20:18:14 UTC
371be22 [SPARK-22294][DEPLOY] Reset spark.driver.bindAddress when starting a Checkpoint ## What changes were proposed in this pull request? It seems that recovering from a checkpoint can replace the old driver and executor IP addresses, as the workload can now be taking place in a different cluster configuration. It follows that the bindAddress for the master may also have changed. Thus we should not keep the old one; instead, it should be added to the list of properties to reset and recreate from the new environment. ## How was this patch tested? This patch was tested via manual testing on AWS, using the experimental (not yet merged) Kubernetes scheduler, which uses bindAddress to bind to a Kubernetes service (and thus is how I first encountered the bug), but this is not a code path related to the scheduler and it may have slipped through when merging SPARK-4563. Author: Santiago Saavedra <ssaavedra@openshine.com> Closes #19427 from ssaavedra/fix-checkpointing-master. (cherry picked from commit 5ebdcd185f2108a90e37a1aa4214c3b6c69a97a4) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 10 November 2017, 18:58:10 UTC
eb49c32 [SPARK-22243][DSTREAM] spark.yarn.jars should be reloaded from config on checkpoint recovery ## What changes were proposed in this pull request? The previous [PR](https://github.com/apache/spark/pull/19469) was deleted by mistake. The solution is straightforward: add "spark.yarn.jars" to propertiesToReload so this property is reloaded from the config. ## How was this patch tested? Manual tests Author: ZouChenjun <zouchenjun@youzan.com> Closes #19637 from ChenjunZou/checkpoint-yarn-jars. 10 November 2017, 18:57:22 UTC
0568f28 [SPARK-22344][SPARKR] clean up install dir if running test as source package ## What changes were proposed in this pull request? remove spark if spark downloaded & installed ## How was this patch tested? manually by building package Jenkins, AppVeyor Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #19657 from felixcheung/rinstalldir. (cherry picked from commit b70aa9e08b4476746e912c2c2a8b7bdd102305e8) Signed-off-by: Felix Cheung <felixcheung@apache.org> 10 November 2017, 18:22:56 UTC
7551524 [SPARK-22472][SQL] add null check for top-level primitive values ## What changes were proposed in this pull request? One powerful feature of `Dataset` is, we can easily map SQL rows to Scala/Java objects and do runtime null check automatically. For example, let's say we have a parquet file with schema `<a: int, b: string>`, and we have a `case class Data(a: Int, b: String)`. Users can easily read this parquet file into `Data` objects, and Spark will throw NPE if column `a` has null values. However the null checking is left behind for top-level primitive values. For example, let's say we have a parquet file with schema `<a: Int>`, and we read it into Scala `Int`. If column `a` has null values, we will get some weird results. ``` scala> val ds = spark.read.parquet(...).as[Int] scala> ds.show() +----+ |v | +----+ |null| |1 | +----+ scala> ds.collect res0: Array[Long] = Array(0, 1) scala> ds.map(_ * 2).show +-----+ |value| +-----+ |-2 | |2 | +-----+ ``` This is because internally Spark use some special default values for primitive types, but never expect users to see/operate these default value directly. This PR adds null check for top-level primitive values ## How was this patch tested? new test Author: Wenchen Fan <wenchen@databricks.com> Closes #19707 from cloud-fan/bug. (cherry picked from commit 0025ddeb1dd4fd6951ecd8456457f6b94124f84e) Signed-off-by: gatorsmile <gatorsmile@gmail.com> # Conflicts: # sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala # sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala 10 November 2017, 06:00:12 UTC
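A hypothetical sketch of the kind of guard this change adds: when decoding a SQL value into a top-level primitive such as `Int`, a NULL cannot be represented, so decoding should fail loudly instead of silently substituting the internal default. The helper below is illustrative only, not Spark's generated deserializer code.

```scala
// Illustrative guard only: a top-level Int cannot hold NULL, so fail loudly
// rather than silently yielding the internal default value (0).
def decodeTopLevelInt(value: java.lang.Integer): Int = {
  if (value == null) {
    throw new NullPointerException(
      "Null value appeared in non-nullable field of type Int")
  }
  value.intValue()
}

println(decodeTopLevelInt(1))   // 1
// decodeTopLevelInt(null)      // throws instead of silently yielding 0
```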
1b70c66 [SPARK-22287][MESOS] SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher ## What changes were proposed in this pull request? Allow the JVM max heap size to be controlled for MesosClusterDispatcher via the SPARK_DAEMON_MEMORY environment variable. ## How was this patch tested? Tested on a local Mesos cluster Author: Paul Mackles <pmackles@adobe.com> Closes #19515 from pmackles/SPARK-22287. (cherry picked from commit f5fe63f7b8546b0102d7bfaf3dde77379f58a4d1) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 10 November 2017, 00:42:45 UTC
ede0e1a [SPARK-22403][SS] Add optional checkpointLocation argument to StructuredKafkaWordCount example ## What changes were proposed in this pull request? When run in YARN cluster mode, the StructuredKafkaWordCount example fails because Spark tries to create a temporary checkpoint location in a subdirectory of the path given by java.io.tmpdir, and YARN sets java.io.tmpdir to a path in the local filesystem that usually does not correspond to an existing path in the distributed filesystem. Add an optional checkpointLocation argument to the StructuredKafkaWordCount example so that users can specify the checkpoint location and avoid this issue. ## How was this patch tested? Built and ran the example manually on YARN client and cluster mode. Author: Wing Yew Poon <wypoon@cloudera.com> Closes #19703 from wypoon/SPARK-22403. (cherry picked from commit 11c4021044f3a302449a2ea76811e73f5c99a26a) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 10 November 2017, 00:21:06 UTC
0e97c8e [SPARK-22417][PYTHON][FOLLOWUP][BRANCH-2.2] Fix for createDataFrame from pandas.DataFrame with timestamp ## What changes were proposed in this pull request? This is a follow-up of #19646 for branch-2.2. The original pr breaks branch-2.2 because the cherry-picked patch doesn't include some code which exists in master. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #19704 from ueshin/issues/SPARK-22417_2.2/fup1. 09 November 2017, 07:56:50 UTC
efaf73f [SPARK-22211][SQL][FOLLOWUP] Fix bad merge for tests ## What changes were proposed in this pull request? The merge of SPARK-22211 to branch-2.2 dropped a couple of important lines that made sure the tests that compared plans did so with both plans having been analyzed. Fix by reintroducing the correct analysis statements. ## How was this patch tested? Re-ran LimitPushdownSuite. All tests passed. Author: Henry Robinson <henry@apache.org> Closes #19701 from henryr/branch-2.2. 09 November 2017, 07:34:22 UTC
73a2ca0 [SPARK-22281][SPARKR] Handle R method breaking signature changes ## What changes were proposed in this pull request? This is to fix the code for the latest R changes in R-devel, when running CRAN check ``` checking for code/documentation mismatches ... WARNING Codoc mismatches from documentation object 'attach': attach Code: function(what, pos = 2L, name = deparse(substitute(what), backtick = FALSE), warn.conflicts = TRUE) Docs: function(what, pos = 2L, name = deparse(substitute(what)), warn.conflicts = TRUE) Mismatches in argument default values: Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs: deparse(substitute(what)) Codoc mismatches from documentation object 'glm': glm Code: function(formula, family = gaussian, data, weights, subset, na.action, start = NULL, etastart, mustart, offset, control = list(...), model = TRUE, method = "glm.fit", x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = NULL, ...) Docs: function(formula, family = gaussian, data, weights, subset, na.action, start = NULL, etastart, mustart, offset, control = list(...), model = TRUE, method = "glm.fit", x = FALSE, y = TRUE, contrasts = NULL, ...) Argument names in code not in docs: singular.ok Mismatches in argument names: Position: 16 Code: singular.ok Docs: contrasts Position: 17 Code: contrasts Docs: ... ``` With attach, it's pulling in the function definition from base::attach. We need to disable that but we would still need a function signature for roxygen2 to build with. With glm it's pulling in the function definition (ie. "usage") from the stats::glm function. Since this is "compiled in" when we build the source package into the .Rd file, when it changes at runtime or in CRAN check it won't match the latest signature. The solution is not to pull in from stats::glm since there isn't much value in doing that (none of the param we actually use, the ones we do use we have explicitly documented them) Also with attach we are changing to call dynamically. ## How was this patch tested? Manually. - [x] check documentation output - yes - [x] check help `?attach` `?glm` - yes - [x] check on other platforms, r-hub, on r-devel etc.. Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #19557 from felixcheung/rattachglmdocerror. (cherry picked from commit 2ca5aae47a25dc6bc9e333fb592025ff14824501) Signed-off-by: Felix Cheung <felixcheung@apache.org> 08 November 2017, 05:03:10 UTC
5c9035b [SPARK-22327][SPARKR][TEST][BACKPORT-2.2] check for version warning ## What changes were proposed in this pull request? Will need to port this to branch-~~1.6~~, -2.0, -2.1, -2.2 ## How was this patch tested? Manually; Jenkins, AppVeyor Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #19619 from felixcheung/rcranversioncheck22. 08 November 2017, 04:58:29 UTC
161ba18 [SPARK-22417][PYTHON] Fix for createDataFrame from pandas.DataFrame with timestamp Currently, a pandas.DataFrame that contains a timestamp of type 'datetime64[ns]' when converted to a Spark DataFrame with `createDataFrame` will interpret the values as LongType. This fix will check for a timestamp type and convert it to microseconds which will allow Spark to read as TimestampType. Added unit test to verify Spark schema is expected for TimestampType and DateType when created from pandas Author: Bryan Cutler <cutlerb@gmail.com> Closes #19646 from BryanCutler/pyspark-non-arrow-createDataFrame-ts-fix-SPARK-22417. (cherry picked from commit 1d341042d6948e636643183da9bf532268592c6a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 07 November 2017, 20:38:54 UTC
2695b92 [SPARK-22315][SPARKR] Warn if SparkR package version doesn't match SparkContext ## What changes were proposed in this pull request? This PR adds a check between the R package version used and the version reported by SparkContext running in the JVM. The goal here is to warn users when they have a R package downloaded from CRAN and are using that to connect to an existing Spark cluster. This is raised as a warning rather than an error as users might want to use patch versions interchangeably (e.g., 2.1.3 with 2.1.2 etc.) ## How was this patch tested? Manually by changing the `DESCRIPTION` file Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #19624 from shivaram/sparkr-version-check. (cherry picked from commit 65a8bf6036fe41a53b4b1e4298fa35d7fa4e9970) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 06 November 2017, 16:58:55 UTC
e35c53a [SPARK-22429][STREAMING] Streaming checkpointing code does not retry after failure ## What changes were proposed in this pull request? SPARK-14930/SPARK-13693 put in a change to set the fs object to null after a failure, however the retry loop does not include initialization. Moved fs initialization inside the retry while loop to aid recoverability. ## How was this patch tested? Passes all existing unit tests. Author: Tristan Stevens <tristan@cloudera.com> Closes #19645 from tmgstevens/SPARK-22429. (cherry picked from commit fe258a7963361c1f31bc3dc3a2a2ee4a5834bb58) Signed-off-by: Sean Owen <sowen@cloudera.com> 05 November 2017, 09:10:51 UTC
5e38373 [SPARK-22211][SQL] Remove incorrect FOJ limit pushdown It's not safe in all cases to push down a LIMIT below a FULL OUTER JOIN. If the limit is pushed to one side of the FOJ, the physical join operator can not tell if a row in the non-limited side would have a match in the other side. *If* the join operator guarantees that unmatched tuples from the limited side are emitted before any unmatched tuples from the other side, pushing down the limit is safe. But this is impractical for some join implementations, e.g. SortMergeJoin. For now, disable limit pushdown through a FULL OUTER JOIN, and we can evaluate whether a more complicated solution is necessary in the future. Ran org.apache.spark.sql.* tests. Altered full outer join tests in LimitPushdownSuite. Author: Henry Robinson <henry@cloudera.com> Closes #19647 from henryr/spark-22211. (cherry picked from commit 6c6626614e59b2e8d66ca853a74638d3d6267d73) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 05 November 2017, 05:54:51 UTC
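Why this pushdown is unsafe can be seen with a tiny full-outer-join model built on plain Scala collections (not Spark): limiting one side before the join can manufacture an "unmatched" row that the original query could never return.

```scala
// Tiny full-outer-join model, for illustration only.
def fullOuter(l: Seq[Int], r: Seq[Int]): Seq[(Option[Int], Option[Int])] =
  l.filter(r.contains).map(k => (Option(k), Option(k))) ++      // matched rows
    l.filterNot(r.contains).map(k => (Option(k), None)) ++      // left-only rows
    r.filterNot(l.contains).map(k => (None, Option(k)))         // right-only rows

val left  = Seq(1, 2)
val right = Seq(2)

println(fullOuter(left, right))
// List((Some(2),Some(2)), (Some(1),None)) -- key 2 is matched, never paired with NULL

println(fullOuter(left.take(1), right))
// List((Some(1),None), (None,Some(2))) -- pushing LIMIT 1 into the left side
// manufactures (None, Some(2)), a row the original query could never return
```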
4074ed2 [SPARK-22306][SQL][2.2] alter table schema should not erase the bucketing metadata at hive side ## What changes were proposed in this pull request? When we alter a table's schema, we set the new schema on the Spark `CatalogTable`, convert it to a Hive table, and finally call `hive.alterTable`. This causes a problem in Spark 2.2, because Hive bucketing metadata is not recognized by Spark, which means a Spark `CatalogTable` representing a Hive table is always non-bucketed, and when we convert it to a Hive table and call `hive.alterTable`, the original Hive bucketing metadata will be removed. To fix this bug, we should read out the raw Hive table metadata, update its schema, and call `hive.alterTable`. By doing this we can guarantee that only the schema is changed, and nothing else. Note that this bug doesn't exist in the master branch, because we've added Hive bucketing support and the Hive bucketing metadata can be recognized by Spark. I think we should merge this PR to master too, for code cleanup and to reduce the difference between master and the 2.2 branch for backporting. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #19622 from cloud-fan/infer. 02 November 2017, 11:37:52 UTC
c311c5e [MINOR][DOC] automatic type inference supports also Date and Timestamp ## What changes were proposed in this pull request? Easy fix in the documentation, which is reporting that only numeric types and string are supported in type inference for partition columns, while Date and Timestamp are supported too since 2.1.0, thanks to SPARK-17388. ## How was this patch tested? n/a Author: Marco Gaido <mgaido@hortonworks.com> Closes #19628 from mgaido91/SPARK-22398. (cherry picked from commit b04eefae49b96e2ef5a8d75334db29ef4e19ce58) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 02 November 2017, 00:30:20 UTC
ab87a92 [SPARK-22333][SQL][BACKPORT-2.2] timeFunctionCall(CURRENT_DATE, CURRENT_TIMESTAMP) has conflicts with columnReference ## What changes were proposed in this pull request? This is a backport pr of https://github.com/apache/spark/pull/19559 for branch-2.2 ## How was this patch tested? unit tests Author: donnyzone <wellfengzhu@gmail.com> Closes #19606 from DonnyZone/branch-2.2. 31 October 2017, 17:37:27 UTC
dd69ac6 [SPARK-19611][SQL][FOLLOWUP] set dataSchema correctly in HiveMetastoreCatalog.convertToLogicalRelation We made a mistake in https://github.com/apache/spark/pull/16944 . In `HiveMetastoreCatalog#inferIfNeeded` we infer the data schema, merge with full schema, and return the new full schema. At caller side we treat the full schema as data schema and set it to `HadoopFsRelation`. This doesn't cause any problem because both parquet and orc can work with a wrong data schema that has extra columns, but it's better to fix this mistake. N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #19615 from cloud-fan/infer. (cherry picked from commit 4d9ebf3835dde1abbf9cff29a55675d9f4227620) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 31 October 2017, 10:36:52 UTC
7f8236c [SPARK-22291][SQL] Conversion error when transforming array types of uuid, inet and cidr to StringType in PostgreSQL ## What changes were proposed in this pull request? This PR fixes the conversion error when transforming array types of `uuid`, `inet` and `cidr` to `StringType` in PostgreSQL. ## How was this patch tested? Added test in `PostgresIntegrationSuite`. Author: Jen-Ming Chung <jenmingisme@gmail.com> Closes #19604 from jmchung/SPARK-22291-FOLLOWUP. 30 October 2017, 08:09:11 UTC
f973587 [SPARK-22344][SPARKR] Set java.io.tmpdir for SparkR tests This PR sets the java.io.tmpdir for CRAN checks and also disables the hsperfdata for the JVM when running CRAN checks. Together this prevents files from being left behind in `/tmp` ## How was this patch tested? Tested manually on a clean EC2 machine Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #19589 from shivaram/sparkr-tmpdir-clean. (cherry picked from commit 1fe27612d7bcb8b6478a36bc16ddd4802e4ee2fc) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 30 October 2017, 01:54:00 UTC
cac6506 [SPARK-19727][SQL][FOLLOWUP] Fix for round function that modifies original column ## What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/17075 , to fix the bug in codegen path. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #19576 from cloud-fan/bug. (cherry picked from commit 7fdacbc77bbcf98c2c045a1873e749129769dcc0) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 29 October 2017, 01:24:32 UTC
cb54f29 [SPARK-22356][SQL] data source table should support overlapped columns between data and partition schema This is a regression introduced by #14207. After Spark 2.1, we store the inferred schema when creating the table, to avoid inferring the schema again at the read path. However, there is one special case: overlapped columns between data and partition. This case breaks the assumption about the table schema that there is no overlap between the data and partition schema, and that partition columns should be at the end. The result is that for Spark 2.1, the table scan has an incorrect schema that puts partition columns at the end. For Spark 2.2, we add a check in CatalogTable to validate the table schema, which fails in this case. To fix this issue, a simple and safe approach is to fall back to the old behavior when overlapped columns are detected, i.e. store an empty schema in the metastore. new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #19579 from cloud-fan/bug2. 27 October 2017, 00:56:29 UTC
2839280 [SPARK-22355][SQL] Dataset.collect is not threadsafe It's possible that users create a `Dataset`, and call `collect` of this `Dataset` in many threads at the same time. Currently `Dataset#collect` just call `encoder.fromRow` to convert spark rows to objects of type T, and this encoder is per-dataset. This means `Dataset#collect` is not thread-safe, because the encoder uses a projection to output the object to a re-usable row. This PR fixes this problem, by creating a new projection when calling `Dataset#collect`, so that we have the re-usable row for each method call, instead of each Dataset. N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #19577 from cloud-fan/encoder. (cherry picked from commit 5c3a1f3fad695317c2fff1243cdb9b3ceb25c317) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 27 October 2017, 00:52:26 UTC
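A minimal illustration of the problem described above, using made-up class names rather than Spark's encoder API: a decoder that writes every result into one shared, reusable buffer is racy under concurrent calls, while producing a fresh output per call is safe to share across threads.

```scala
// Illustration only (not Spark's encoder/projection classes).
class SharedBufferDecoder {
  private val buffer = new Array[Int](1)                       // reused across calls
  def decode(v: Int): Array[Int] = { buffer(0) = v; buffer }   // concurrent callers can overwrite each other
}

class PerCallDecoder {
  def decode(v: Int): Array[Int] = Array(v)                    // fresh array per call: thread-safe
}
```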
a607ddc [SPARK-22328][CORE] ClosureCleaner should not miss referenced superclass fields When the given closure uses fields defined in a superclass, `ClosureCleaner` can't find them and doesn't set them properly, so those fields end up null. Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19556 from viirya/SPARK-22328. (cherry picked from commit 4f8dc6b01ea787243a38678ea8199fbb0814cffc) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 October 2017, 20:44:17 UTC
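A small, self-contained example of the scenario the fix addresses (this illustrates the closure pattern, not the ClosureCleaner code itself): a closure defined in a subclass can capture a field declared on its superclass, so a cleaner that only inspects fields declared on the concrete class would miss it.

```scala
// The closure below captures `base`, a field declared on the superclass.
class Base { val base = "captured-from-superclass" }

class Sub extends Base {
  def makeClosure(): () => String = () => base   // references the inherited field
}

println(new Sub().makeClosure()())               // prints: captured-from-superclass
```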
24fe7cc [SPARK-17902][R] Revive stringsAsFactors option for collect() in SparkR ## What changes were proposed in this pull request? This PR proposes to revive the `stringsAsFactors` option in the collect API, which was mistakenly removed in https://github.com/apache/spark/commit/71a138cd0e0a14e8426f97877e3b52a562bbd02c. Simply, it casts `character` to `factor` when the condition `stringsAsFactors && is.character(vec)` is met during primitive type conversion. ## How was this patch tested? Unit test in `R/pkg/tests/fulltests/test_sparkSQL.R`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19551 from HyukjinKwon/SPARK-17902. (cherry picked from commit a83d8d5adcb4e0061e43105767242ba9770dda96) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 26 October 2017, 11:55:00 UTC
d2dc175 [SPARK-21991][LAUNCHER][FOLLOWUP] Fix java lint ## What changes were proposed in this pull request? Fix java lint ## How was this patch tested? Run `./dev/lint-java` Author: Andrew Ash <andrew@andrewash.com> Closes #19574 from ash211/aash/fix-java-lint. (cherry picked from commit 5433be44caecaeef45ed1fdae10b223c698a9d14) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 October 2017, 21:41:17 UTC
35725f7 [SPARK-22332][ML][TEST] Fix occasionally failing NaiveBayes unit test (caused by a non-deterministic test dataset) ## What changes were proposed in this pull request? Fix the occasionally failing NaiveBayes unit test: set a seed for `BrzMultinomial.sample` so that `generateNaiveBayesInput` outputs a deterministic dataset. (If we do not set a seed, the generated dataset is random, and the model may exceed the tolerance in the test, which triggers this failure.) ## How was this patch tested? Manually ran the tests multiple times and checked that the output models contain the same values each time. Author: WeichenXu <weichen.xu@databricks.com> Closes #19558 from WeichenXu123/fix_nb_test_seed. (cherry picked from commit 841f1d776f420424c20d99cf7110d06c73f9ca20) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 25 October 2017, 21:31:49 UTC
9ed6404 [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now tolerates temp files Prior to this commit getAllBlocks implicitly assumed that the directories managed by the DiskBlockManager contain only the files corresponding to valid block IDs. In reality, this assumption was violated during shuffle, which produces temporary files in the same directory as the resulting blocks. As a result, calls to getAllBlocks during shuffle were unreliable. The fix could be made more efficient, but this is probably good enough. `DiskBlockManagerSuite` Author: Sergei Lebedev <s.lebedev@criteo.com> Closes #19458 from superbobry/block-id-option. (cherry picked from commit b377ef133cdc38d49b460b2cc6ece0b5892804cc) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 October 2017, 21:17:40 UTC
4c1a868 [SPARK-21991][LAUNCHER] Fix race condition in LauncherServer#acceptConnections ## What changes were proposed in this pull request? This patch changes the order in which _acceptConnections_ starts the client thread and schedules the client timeout action, ensuring that the latter has been scheduled before the former gets a chance to cancel it. ## How was this patch tested? Due to the non-deterministic nature of the issue I wasn't able to add a new test for it. Author: Andrea zito <andrea.zito@u-hopper.com> Closes #19217 from nivox/SPARK-21991. (cherry picked from commit 6ea8a56ca26a7e02e6574f5f763bb91059119a80) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 October 2017, 17:10:37 UTC
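A rough sketch, with hypothetical names and plain `java.util.Timer`, of the ordering the patch enforces: the timeout task is scheduled before the client thread starts, so the thread can never race ahead and try to cancel a timeout that has not been scheduled yet.

```scala
import java.util.{Timer, TimerTask}

// Hypothetical simplification of the accept path: schedule first, start second.
class ClientRegistration(timer: Timer, timeoutMs: Long) {
  @volatile private var timeout: TimerTask = _

  def accept(clientThread: Thread): Unit = {
    timeout = new TimerTask {
      override def run(): Unit = clientThread.interrupt() // give up on a silent client
    }
    timer.schedule(timeout, timeoutMs) // scheduled before the thread can run
    clientThread.start()               // the thread may now safely cancel the timeout
  }

  // Called by the client thread once the handshake succeeds.
  def onHandshake(): Unit = timeout.cancel()
}
```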
daa838b [SPARK-21936][SQL][FOLLOW-UP] backward compatibility test framework for HiveExternalCatalog ## What changes were proposed in this pull request? Adjust Spark download in test to use Apache mirrors and respect its load balancer, and use Spark 2.1.2. This follows on a recent PMC list thread about removing the cloudfront download rather than update it further. ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #19564 from srowen/SPARK-21936.2. (cherry picked from commit 8beeaed66bde0ace44495b38dc967816e16b3464) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 October 2017, 12:56:31 UTC
bf8163f [SPARK-22319][CORE][BACKPORT-2.2] call loginUserFromKeytab before accessing hdfs In SparkSubmit, call loginUserFromKeytab before attempting to make RPC calls to the NameNode. Same as https://github.com/apache/spark/pull/19540, but for branch-2.2. Manually tested for master as described in https://github.com/apache/spark/pull/19540. Author: Steven Rand <srand@palantir.com> Closes #19554 from sjrand/SPARK-22319-branch-2.2. Change-Id: Ic550a818fd6a3f38b356ac48029942d463738458 23 October 2017, 06:26:03 UTC
f8c83fd [SPARK-21551][PYTHON] Increase timeout for PythonRDD.serveIterator Backport of https://github.com/apache/spark/pull/18752 (https://issues.apache.org/jira/browse/SPARK-21551) (cherry picked from commit 9d3c6640f56e3e4fd195d3ad8cead09df67a72c7) Author: peay <peay@protonmail.com> Closes #19512 from FRosner/branch-2.2. 19 October 2017, 04:07:04 UTC
010b50c [SPARK-22249][FOLLOWUP][SQL] Check if list of value for IN is empty in the optimizer ## What changes were proposed in this pull request? This PR addresses the comments by gatorsmile on [the previous PR](https://github.com/apache/spark/pull/19494). ## How was this patch tested? Previous UT and added UT. Author: Marco Gaido <marcogaido91@gmail.com> Closes #19522 from mgaido91/SPARK-22249_FOLLOWUP. (cherry picked from commit 1f25d8683a84a479fd7fc77b5a1ea980289b681b) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 18 October 2017, 16:14:57 UTC
c423091 [SPARK-22271][SQL] mean overflows and returns null for some decimal variables ## What changes were proposed in this pull request? In Average.scala, it has ``` override lazy val evaluateExpression = child.dataType match { case DecimalType.Fixed(p, s) => // increase the precision and scale to prevent precision loss val dt = DecimalType.bounded(p + 14, s + 4) Cast(Cast(sum, dt) / Cast(count, dt), resultType) case _ => Cast(sum, resultType) / Cast(count, resultType) } def setChild (newchild: Expression) = { child = newchild } ``` It is possible that `Cast(Cast(sum, dt) / Cast(count, dt), resultType)` will make the precision of the decimal number bigger than 38, and this causes overflow. Since count is an integer and doesn't need a scale, I will cast it using DecimalType.bounded(38, 0). ## How was this patch tested? In DataFrameSuite, I will add a test case. Author: Huaxin Gao <huaxing@us.ibm.com> Closes #19496 from huaxingao/spark-22271. (cherry picked from commit 28f9f3f22511e9f2f900764d9bd5b90d2eeee773) Signed-off-by: gatorsmile <gatorsmile@gmail.com> # Conflicts: # sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 17 October 2017, 19:53:14 UTC
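A minimal sketch of the reported symptom (values chosen to sit near the 38-digit precision limit): before the fix, averaging such a decimal column could return null instead of the expected value.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.avg
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

object DecimalAvgCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("decimal-avg").getOrCreate()

    // decimal(38, 2) values: 36 digits before the point, 2 after.
    val big = BigDecimal("100000000000000000000000000000000000.90")
    val schema = StructType(Seq(StructField("d", DecimalType(38, 2))))
    val rows = spark.sparkContext.parallelize(Seq(Row(big), Row(big)))

    // Expected: the same value back, not null.
    spark.createDataFrame(rows, schema).agg(avg("d")).show(truncate = false)
    spark.stop()
  }
}
```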
71d1cb6 [SPARK-22249][SQL] isin with empty list throws exception on cached DataFrame ## What changes were proposed in this pull request? As pointed out in the JIRA, there is a bug which causes an exception to be thrown if `isin` is called with an empty list on a cached DataFrame. The PR fixes it. ## How was this patch tested? Added UT. Author: Marco Gaido <marcogaido91@gmail.com> Closes #19494 from mgaido91/SPARK-22249. (cherry picked from commit 8148f19ca1f0e0375603cb4f180c1bad8b0b8042) Signed-off-by: Sean Owen <sowen@cloudera.com> 17 October 2017, 07:41:42 UTC
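A minimal sketch of the failing case described above: an empty `isin` on a cached DataFrame, which should simply return no rows.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object EmptyIsinCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("empty-isin").getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3).toDF("a").cache()
    df.count() // materialize the cache first

    // isin() with no values: previously threw an exception on a cached DataFrame,
    // now it evaluates to an empty result.
    df.where(col("a").isin()).show()
    spark.stop()
  }
}
```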
0f060a2 [SPARK-22223][SQL] ObjectHashAggregate should not introduce unnecessary shuffle `ObjectHashAggregateExec` should override `outputPartitioning` in order to avoid unnecessary shuffle. Added Jenkins test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19501 from viirya/SPARK-22223. (cherry picked from commit 0ae96495dedb54b3b6bae0bd55560820c5ca29a2) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 October 2017, 05:47:59 UTC
6b6761e [SPARK-21549][CORE] Respect OutputFormats with no/invalid output directory provided ## What changes were proposed in this pull request? PR #19294 added support for nulls, but Spark 2.1 handled other error cases where the path argument can be invalid, namely: * empty string * URI parse exception while creating Path. This is a resubmission of PR #19487, which I messed up while updating my repo. ## How was this patch tested? Enhanced the test to cover the newly added support. Author: Mridul Muralidharan <mridul@gmail.com> Closes #19497 from mridulm/master. (cherry picked from commit 13c1559587d0eb533c94f5a492390f81b048b347) Signed-off-by: Mridul Muralidharan <mridul@gmail.com> 16 October 2017, 01:41:47 UTC
acbad83 [SPARK-22273][SQL] Fix key/value schema field names in HashMapGenerators. ## What changes were proposed in this pull request? When fixing schema field names using escape characters with `addReferenceMinorObj()` at [SPARK-18952](https://issues.apache.org/jira/browse/SPARK-18952) (#16361), the double-quotes around the names remained and the names became something like `"((java.lang.String) references[1])"`. ```java /* 055 */ private int maxSteps = 2; /* 056 */ private int numRows = 0; /* 057 */ private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add("((java.lang.String) references[1])", org.apache.spark.sql.types.DataTypes.StringType); /* 058 */ private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add("((java.lang.String) references[2])", org.apache.spark.sql.types.DataTypes.LongType); /* 059 */ private Object emptyVBase; ``` We should remove the double-quotes to refer to the values in `references` properly: ```java /* 055 */ private int maxSteps = 2; /* 056 */ private int numRows = 0; /* 057 */ private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add(((java.lang.String) references[1]), org.apache.spark.sql.types.DataTypes.StringType); /* 058 */ private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add(((java.lang.String) references[2]), org.apache.spark.sql.types.DataTypes.LongType); /* 059 */ private Object emptyVBase; ``` ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #19491 from ueshin/issues/SPARK-22273. (cherry picked from commit e0503a7223410289d01bc4b20da3a451730577da) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 14 October 2017, 06:24:49 UTC
30d5c9f [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark schema to read ORC table instead of ORC file schema Before Hive 2.0, the ORC file schema has invalid column names like `_col1` and `_col2`. This is a well-known limitation and there are several Apache Spark issues with `spark.sql.hive.convertMetastoreOrc=true`. This PR ignores the ORC file schema and uses the Spark schema. Pass the newly added test case. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19470 from dongjoon-hyun/SPARK-18355. (cherry picked from commit e6e36004afc3f9fc8abea98542248e9de11b4435) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 13 October 2017, 15:11:50 UTC
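A short sketch (the table name is hypothetical) of the configuration this change concerns: with `spark.sql.hive.convertMetastoreOrc` enabled, the metastore table is read with Spark's own schema rather than the physical `_colN` names inside old ORC files.

```scala
import org.apache.spark.sql.SparkSession

object OrcMetastoreRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("orc-metastore-read")
      .config("spark.sql.hive.convertMetastoreOrc", "true") // use Spark's native ORC read path
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical Hive table whose ORC files were written before Hive 2.0 and
    // therefore carry _col0, _col1, ... as physical column names.
    spark.sql("SELECT * FROM legacy_orc_table").printSchema()
    spark.stop()
  }
}
```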
c9187db [SPARK-22252][SQL][2.2] FileFormatWriter should respect the input query schema ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/18386 fixes SPARK-21165 but breaks SPARK-22252. This PR reverts https://github.com/apache/spark/pull/18386 and picks the patch from https://github.com/apache/spark/pull/19483 to fix SPARK-21165. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #19484 from cloud-fan/bug. 13 October 2017, 04:54:00 UTC
cfc04e0 [SPARK-22217][SQL] ParquetFileFormat to support arbitrary OutputCommitters ## What changes were proposed in this pull request? `ParquetFileFormat` to relax its requirement of output committer class from `org.apache.parquet.hadoop.ParquetOutputCommitter` or subclass thereof (and so implicitly Hadoop `FileOutputCommitter`) to any committer implementing `org.apache.hadoop.mapreduce.OutputCommitter`. This enables output committers which don't write to the filesystem the way `FileOutputCommitter` does to save parquet data from a dataframe: at present you cannot do this. Before using a committer which isn't a subclass of `ParquetOutputCommitter`, it checks to see if the context has requested summary metadata by setting `parquet.enable.summary-metadata`. If true, and the committer class isn't a parquet committer, it raises a RuntimeException with an error message. (It could downgrade, of course, but raising an exception makes it clear there won't be a summary. It also makes the behaviour testable.) Note that `SQLConf` already states that any `OutputCommitter` can be used, but that typically it's a subclass of ParquetOutputCommitter. That's not currently true. This patch will make the code consistent with the docs, adding tests to verify. ## How was this patch tested? The patch includes a test suite, `ParquetCommitterSuite`, with a new committer, `MarkingFileOutputCommitter`, which extends `FileOutputCommitter` and writes a marker file in the destination directory. The presence of the marker file can be used to verify the new committer was used. The tests then try the combinations of Parquet committer summary/no-summary and marking committer summary/no-summary. | committer | summary | outcome | |-----------|---------|---------| | parquet | true | success | | parquet | false | success | | marking | false | success with marker | | marking | true | exception | All tests are happy. Author: Steve Loughran <stevel@hortonworks.com> Closes #19448 from steveloughran/cloud/SPARK-22217-committer. 13 October 2017, 01:00:33 UTC
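A sketch of how the relaxed requirement would be used; the committer class name below is hypothetical, while `spark.sql.parquet.output.committer.class` and `parquet.enable.summary-metadata` are the settings named in the description above.

```scala
import org.apache.spark.sql.SparkSession

object CustomParquetCommitter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("parquet-committer").getOrCreate()
    import spark.implicits._

    // Hypothetical committer that implements org.apache.hadoop.mapreduce.OutputCommitter
    // but is not a ParquetOutputCommitter.
    spark.conf.set("spark.sql.parquet.output.committer.class", "com.example.CloudCommitter")
    // Summary metadata must stay off for such a committer, otherwise the write is rejected.
    spark.conf.set("parquet.enable.summary-metadata", "false")

    Seq((1, "a"), (2, "b")).toDF("id", "value")
      .write.mode("overwrite").parquet("/tmp/parquet-committer-demo")
    spark.stop()
  }
}
```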
cd51e2c [SPARK-21907][CORE][BACKPORT 2.2] oom during spill back-port #19181 to branch-2.2. ## What changes were proposed in this pull request? 1. a test reproducing [SPARK-21907](https://issues.apache.org/jira/browse/SPARK-21907) 2. a fix for the root cause of the issue. `org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill` calls `org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset`, which may trigger another spill; when this happens the `array` member is already de-allocated but still referenced by the code, which causes the nested spill to fail with an NPE in `org.apache.spark.memory.TaskMemoryManager.getPage`. This patch introduces a reproduction in a test case and a fix: the fix simply sets the in-memory sorter's array member to an empty array before actually performing the allocation. This prevents the spilling code from 'touching' the de-allocated array. ## How was this patch tested? introduced a new test case: `org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite#testOOMDuringSpill`. Author: Eyal Farago <eyal@nrgene.com> Closes #19481 from eyalfa/SPARK-21907__oom_during_spill__BACKPORT-2.2. 12 October 2017, 12:56:04 UTC
c5889b5 [SPARK-22218] Spark shuffle service fails to update secret on app re-attempts This patch fixes application re-attempts when running Spark on YARN using the external shuffle service with security on. Currently executors will fail to launch on any application re-attempt when launched on a NodeManager that had an executor from the first attempt. The reason for this is that we aren't updating the secret key after the first application attempt. The fix here is to just remove the containsKey check to see if it already exists. In this way, we always add it and make sure it's the most recent secret. Similarly, remove the containsKey check on the remove path, since it just adds an extra check that isn't really needed. Note this worked before Spark 2.2 because the check used to be contains (which was looking for the value) rather than containsKey, so that never matched and it was just always adding the new secret. The patch was tested on a 10-node cluster, and the unit test was added. The test run was a wordcount where the output directory already existed. With the bug present, the application attempt failed with the max number of executor failures, which were all SaslExceptions. With the fix present, the application re-attempts fail with 'directory already exists', or, when you remove the directory between attempts, the re-attempts succeed. Author: Thomas Graves <tgraves@unharmedunarmed.corp.ne1.yahoo.com> Closes #19450 from tgravescs/SPARK-22218. (cherry picked from commit a74ec6d7bbfe185ba995dcb02d69e90a089c293e) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 09 October 2017, 19:56:46 UTC
0d3f166 [SPARK-21549][CORE] Respect OutputFormats with no output directory provided ## What changes were proposed in this pull request? Fix for the https://issues.apache.org/jira/browse/SPARK-21549 JIRA issue. Since version 2.2, Spark does not respect OutputFormats with no output paths provided. Examples of such formats are [Cassandra OutputFormat](https://github.com/finn-no/cassandra-hadoop/blob/08dfa3a7ac727bb87269f27a1c82ece54e3f67e6/src/main/java/org/apache/cassandra/hadoop2/AbstractColumnFamilyOutputFormat.java), [Aerospike OutputFormat](https://github.com/aerospike/aerospike-hadoop/blob/master/mapreduce/src/main/java/com/aerospike/hadoop/mapreduce/AerospikeOutputFormat.java), etc., which do not have the ability to roll back results written to external systems on job failure. A provided output directory is required by Spark to allow files to be committed to an absolute output location; that is not the case for output formats which write data to external systems. This pull request prevents accessing the `absPathStagingDir` method, which causes the error described in SPARK-21549, unless there are files to rename in `addedAbsPathFiles`. ## How was this patch tested? Unit tests Author: Sergey Zhemzhitsky <szhemzhitski@gmail.com> Closes #19294 from szhem/SPARK-21549-abs-output-commits. (cherry picked from commit 2030f19511f656e9534f3fd692e622e45f9a074e) Signed-off-by: Mridul Muralidharan <mridul@gmail.com> 07 October 2017, 03:44:47 UTC
8a4e7dd [SPARK-22206][SQL][SPARKR] gapply in R can't work on empty grouping columns ## What changes were proposed in this pull request? Looks like `FlatMapGroupsInRExec.requiredChildDistribution` didn't consider empty grouping attributes. This is a problem when running `EnsureRequirements`, and `gapply` in R can't work on empty grouping columns. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19436 from viirya/fix-flatmapinr-distribution. (cherry picked from commit ae61f187aa0471242c046fdeac6ed55b9b98a3f6) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 05 October 2017, 14:36:56 UTC
81232ce [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration throws NPE ## What changes were proposed in this pull request? Fix for SPARK-20466, full description of the issue in the JIRA. To summarize, `HadoopRDD` uses a metadata cache to cache `JobConf` objects. The cache uses soft-references, which means the JVM can delete entries from the cache whenever there is GC pressure. `HadoopRDD#getJobConf` had a bug where it would check if the cache contained the `JobConf`, if it did it would get the `JobConf` from the cache and return it. This doesn't work when soft-references are used as the JVM can delete the entry between the existence check and the get call. ## How was this patch tested? Haven't thought of a good way to test this yet given the issue only occurs sometimes, and happens during high GC pressure. Was thinking of using mocks to verify `#getJobConf` is doing the right thing. I deleted the method `HadoopRDD#containsCachedMetadata` so that we don't hit this issue again. Author: Sahil Takiar <stakiar@cloudera.com> Closes #19413 from sahilTakiar/master. (cherry picked from commit e36ec38d89472df0dfe12222b6af54cd6eea8e98) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 03 October 2017, 23:53:43 UTC
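A simplified sketch (not the actual HadoopRDD code) of the check-then-get race with soft references that the description outlines, next to the single-read pattern that avoids it.

```scala
import java.lang.ref.SoftReference
import scala.collection.concurrent.TrieMap

object SoftRefCache {
  private val cache = new TrieMap[String, SoftReference[AnyRef]]()

  // Racy: the GC may clear the referent between the contains check and the read,
  // so this can return null even though the check succeeded.
  def getRacy(key: String): AnyRef =
    if (cache.contains(key)) cache(key).get() else null

  // Safe: read once and treat a cleared referent as a cache miss.
  def getSafe(key: String): Option[AnyRef] =
    cache.get(key).flatMap(ref => Option(ref.get()))

  def put(key: String, value: AnyRef): Unit =
    cache.put(key, new SoftReference[AnyRef](value))
}
```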
5e3f254 [SPARK-22178][SQL] Refresh Persistent Views by REFRESH TABLE Command ## What changes were proposed in this pull request? The underlying tables of persistent views are not refreshed when users issue the REFRESH TABLE command against the persistent views. ## How was this patch tested? Added a test case Author: gatorsmile <gatorsmile@gmail.com> Closes #19405 from gatorsmile/refreshView. (cherry picked from commit e65b6b7ca1a7cff1b91ad2262bb7941e6bf057cd) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 03 October 2017, 19:40:36 UTC
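A short sketch (table and view names are hypothetical) of the behaviour being fixed: issuing REFRESH TABLE against a persistent view should now also refresh its underlying tables.

```scala
import org.apache.spark.sql.SparkSession

object RefreshViewDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("refresh-view").getOrCreate()

    spark.sql("CREATE TABLE base_tbl (id INT) USING parquet")
    spark.sql("CREATE VIEW base_view AS SELECT id FROM base_tbl")

    // After files under base_tbl change outside of Spark, refreshing the view
    // should invalidate the cached metadata of base_tbl as well.
    spark.sql("REFRESH TABLE base_view")
    spark.stop()
  }
}
```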
3c30be5 [SPARK-22158][SQL][BRANCH-2.2] convertMetastore should not ignore table property ## What changes were proposed in this pull request? From the beginning, **convertMetastoreOrc** ignores table properties and uses an empty map instead. This PR fixes that. **convertMetastoreParquet** also ignores them. ```scala val options = Map[String, String]() ``` - [SPARK-14070: HiveMetastoreCatalog.scala](https://github.com/apache/spark/pull/11891/files#diff-ee66e11b56c21364760a5ed2b783f863R650) - [Master branch: HiveStrategies.scala](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L197) ## How was this patch tested? Passes Jenkins with an updated test suite. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19417 from dongjoon-hyun/SPARK-22158-BRANCH-2.2. 03 October 2017, 18:42:55 UTC
b9adddb [SPARK-22167][R][BUILD] sparkr packaging issue allow zinc ## What changes were proposed in this pull request? When zinc is running, the pwd might be the root of the project rather than root/core/. A quick solution is to not go a level up in case we are already in the root. If we are in the root, everything works fine; if we are in core, we add a script which goes a level up and runs from there. ## How was this patch tested? set -x in the SparkR install scripts. Author: Holden Karau <holden@us.ibm.com> Closes #19402 from holdenk/SPARK-22167-sparkr-packaging-issue-allow-zinc. (cherry picked from commit 8fab7995d36c7bc4524393b20a4e524dbf6bbf62) Signed-off-by: Holden Karau <holden@us.ibm.com> 02 October 2017, 18:47:11 UTC
7bf25e0 [SPARK-22146] FileNotFoundException while reading ORC files containing special characters ## What changes were proposed in this pull request? Reading ORC files containing special characters like '%' fails with a FileNotFoundException. This PR aims to fix the problem. ## How was this patch tested? Added UT. Author: Marco Gaido <marcogaido91@gmail.com> Author: Marco Gaido <mgaido@hortonworks.com> Closes #19368 from mgaido91/SPARK-22146. 29 September 2017, 16:05:15 UTC
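A minimal sketch (the path is illustrative; the ORC source on branch-2.2 requires Hive support) of the round trip that previously failed when the directory name contains a character such as '%'.

```scala
import org.apache.spark.sql.SparkSession

object OrcSpecialCharPath {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("orc-special-chars")
      .enableHiveSupport() // the ORC data source needs Hive support on branch-2.2
      .getOrCreate()
    import spark.implicits._

    val path = "/tmp/orc%20data" // '%' in the directory name triggered the bug
    Seq((1, "a"), (2, "b")).toDF("id", "value").write.mode("overwrite").orc(path)
    spark.read.orc(path).show() // previously failed with FileNotFoundException
    spark.stop()
  }
}
```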
ac9a0f6 [SPARK-22161][SQL] Add Impala-modified TPC-DS queries ## What changes were proposed in this pull request? Added IMPALA-modified TPCDS queries to TPC-DS query suites. - Ref: https://github.com/cloudera/impala-tpcds-kit/tree/master/queries ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #19386 from gatorsmile/addImpalaQueries. (cherry picked from commit 9ed7394a68315126b2dd00e53a444cc65b5a62ea) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 29 September 2017, 16:00:15 UTC
8b2d838 [SPARK-22129][SPARK-22138] Release script improvements ## What changes were proposed in this pull request? Use the GPG_KEY param, fix lsof to non-hardcoded path, remove version swap since it wasn't really needed. Use EXPORT on JAVA_HOME for downstream scripts as well. ## How was this patch tested? Rolled 2.1.2 RC2 Author: Holden Karau <holden@us.ibm.com> Closes #19359 from holdenk/SPARK-22129-fix-signing. (cherry picked from commit ecbe416ab5001b32737966c5a2407597a1dafc32) Signed-off-by: Holden Karau <holden@us.ibm.com> 29 September 2017, 15:04:26 UTC
8c5ab4e [SPARK-22143][SQL][BRANCH-2.2] Fix memory leak in OffHeapColumnVector This is a backport of https://github.com/apache/spark/commit/02bb0682e68a2ce81f3b98d33649d368da7f2b3d. ## What changes were proposed in this pull request? `WriteableColumnVector` does not close its child column vectors. This can create memory leaks for `OffHeapColumnVector`, where we do not clean up the memory allocated by a vector's children. This can be especially bad for string columns (which use a child byte column vector). ## How was this patch tested? I have updated the existing tests to always use both on-heap and off-heap vectors. Testing and diagnosis was done locally. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #19378 from hvanhovell/SPARK-22143-2.2. 28 September 2017, 15:51:40 UTC
12a74e3 [SPARK-22135][MESOS] metrics in spark-dispatcher not being registered properly ## What changes were proposed in this pull request? Fix a trivial bug with how metrics are registered in the mesos dispatcher. Bug resulted in creating a new registry each time the metricRegistry() method was called. ## How was this patch tested? Verified manually on local mesos setup Author: Paul Mackles <pmackles@adobe.com> Closes #19358 from pmackles/SPARK-22135. (cherry picked from commit f20be4d70bf321f377020d1bde761a43e5c72f0a) Signed-off-by: jerryshao <sshao@hortonworks.com> 28 September 2017, 06:43:53 UTC
42e1727 [SPARK-22140] Add TPCDSQuerySuite ## What changes were proposed in this pull request? Now, we are not running TPC-DS queries as regular test cases. Thus, we need to add a test suite using empty tables for ensuring the new code changes will not break them. For example, optimizer/analyzer batches should not exceed the max iteration. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #19361 from gatorsmile/tpcdsQuerySuite. (cherry picked from commit 9244957b500cb2b458c32db2c63293a1444690d7) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 28 September 2017, 00:04:34 UTC
a406473 [SPARK-22141][BACKPORT][SQL] Propagate empty relation before checking Cartesian products Back port https://github.com/apache/spark/pull/19362 to branch-2.2 ## What changes were proposed in this pull request? When inferring constraints from children, Join's condition can be simplified as None. For example, ``` val testRelation = LocalRelation('a.int) val x = testRelation.as("x") val y = testRelation.where($"a" === 2 && !($"a" === 2)).as("y") x.join.where($"x.a" === $"y.a") ``` The plan will become ``` Join Inner :- LocalRelation <empty>, [a#23] +- LocalRelation <empty>, [a#224] ``` And the Cartesian products check will throw exception for above plan. Propagate empty relation before checking Cartesian products, and the issue is resolved. ## How was this patch tested? Unit test Author: Wang Gengliang <ltnwgl@gmail.com> Closes #19366 from gengliangwang/branch-2.2. 27 September 2017, 15:40:31 UTC
b0f30b5 [SPARK-22120][SQL] TestHiveSparkSession.reset() should clean out Hive warehouse directory ## What changes were proposed in this pull request? During TestHiveSparkSession.reset(), which is called after each TestHiveSingleton suite, we now delete and recreate the Hive warehouse directory. ## How was this patch tested? Ran full suite of tests locally, verified that they pass. Author: Greg Owen <greg@databricks.com> Closes #19341 from GregOwen/SPARK-22120. (cherry picked from commit ce204780ee2434ff6bae50428ae37083835798d3) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 25 September 2017, 21:16:25 UTC
9836ea1 [SPARK-22083][CORE] Release locks in MemoryStore.evictBlocksToFreeSpace ## What changes were proposed in this pull request? MemoryStore.evictBlocksToFreeSpace acquires write locks for all the blocks it intends to evict up front. If there is a failure to evict blocks (eg., some failure dropping a block to disk), then we have to release the lock. Otherwise the lock is never released and an executor trying to get the lock will wait forever. ## How was this patch tested? Added unit test. Author: Imran Rashid <irashid@cloudera.com> Closes #19311 from squito/SPARK-22083. (cherry picked from commit 2c5b9b1173c23f6ca8890817a9a35dc7557b0776) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 September 2017, 19:02:40 UTC
8acce00 [SPARK-22107] Change `as` to `alias` in Python quickstart ## What changes were proposed in this pull request? Updated the docs so that a line of Python in the quick start guide executes. Closes #19283 ## How was this patch tested? Existing tests. Author: John O'Leary <jgoleary@gmail.com> Closes #19326 from jgoleary/issues/22107. (cherry picked from commit 20adf9aa1f42353432d356117e655e799ea1290b) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 25 September 2017, 00:16:46 UTC
211d81b [SPARK-22109][SQL][BRANCH-2.2] Resolves type conflicts between strings and timestamps in partition column ## What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/commit/04975a68b583a6175f93da52374108e5d4754d9a into branch-2.2. ## How was this patch tested? Unit tests in `ParquetPartitionDiscoverySuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19333 from HyukjinKwon/SPARK-22109-backport-2.2. 23 September 2017, 17:51:04 UTC