https://github.com/apache/spark

89f6fcb Preparing Spark release v2.3.0-rc3 12 February 2018, 19:08:28 UTC
d31c4ae [SPARK-23391][CORE] Some integer multiplications may overflow ## What changes were proposed in this pull request? In `getBlockData`, `blockId.reduceId` is an `Int`; when it is greater than 2^28, `blockId.reduceId * 8` overflows. In `decompress0`, `len` and `unitSize` are also `Int`s, so `len * unitSize` may overflow. ## How was this patch tested? N/A Author: liuxian <liu.xian3@zte.com.cn> Closes #20581 from 10110346/overflow2. (cherry picked from commit 4a4dd4f36f65410ef5c87f7b61a960373f044e61) Signed-off-by: Sean Owen <sowen@cloudera.com> 12 February 2018, 14:49:52 UTC
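A minimal Scala sketch of the overflow pattern this commit describes (the values are illustrative, not the actual shuffle code): widening one operand to `Long` avoids the wrap-around.

```scala
object OverflowSketch {
  def main(args: Array[String]): Unit = {
    val reduceId: Int = (1 << 28) + 1          // just above 2^28
    val wrong: Int    = reduceId * 8           // Int * Int wraps around to -2147483640
    val right: Long   = reduceId * 8L          // widen one operand to Long first
    println(s"overflowed: $wrong, correct: $right")
  }
}
```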
1e3118c [SPARK-22977][SQL] fix web UI SQL tab for CTAS ## What changes were proposed in this pull request? This is a regression in Spark 2.3. In Spark 2.2, we had fragile UI support for SQL data writing commands. We only track the input query plan of `FileFormatWriter` and display its metrics. This is not ideal because we don't know who triggered the writing (it can be table insertion, CTAS, etc.), but it's still useful to see the metrics of the input query. In Spark 2.3, we introduced a new mechanism, `DataWritingCommand`, to fix the UI issue entirely. Now these writing commands have real children, and we don't need to hack into `FileFormatWriter` for the UI. This also helps with `explain`: now it can show the physical plan of the input query, while in 2.2 the physical writing plan is simply `ExecutedCommandExec` with no child. However, there is a regression for CTAS. CTAS commands don't extend `DataWritingCommand`, and we don't have the UI hack in `FileFormatWriter` anymore, so the UI for CTAS is just an empty node. See https://issues.apache.org/jira/browse/SPARK-22977 for more information about this UI issue. To fix it, we should apply the `DataWritingCommand` mechanism to CTAS commands. TODO: In the future, we should refactor this part and create some physical-layer code pieces for data writing, and reuse them in different writing commands. We should have different logical nodes for different operators, even if some of them share the same logic, e.g. CTAS, CREATE TABLE, INSERT TABLE. Internally we can share the same physical logic. ## How was this patch tested? Manually tested. For data source table <img width="644" alt="1" src="https://user-images.githubusercontent.com/3182036/35874155-bdffab28-0ba6-11e8-94a8-e32e106ba069.png"> For hive table <img width="666" alt="2" src="https://user-images.githubusercontent.com/3182036/35874161-c437e2a8-0ba6-11e8-98ed-7930f01432c5.png"> Author: Wenchen Fan <wenchen@databricks.com> Closes #20521 from cloud-fan/UI. (cherry picked from commit 0e2c266de7189473177f45aa68ea6a45c7e47ec3) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 February 2018, 14:08:16 UTC
79e8650 [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7 ## What changes were proposed in this pull request? This test only fails with sbt on Hadoop 2.7; I can't reproduce it locally, but here is my speculation from looking at the code: 1. FileSystem.delete doesn't delete the directory entirely, and somehow we can still open the file as a 0-length empty file (just speculation). 2. ORC intentionally allows empty files, and the reader fails during reading without closing the file stream. This PR improves the test to make sure all files are deleted and can't be opened. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #20584 from cloud-fan/flaky-test. (cherry picked from commit 6efd5d117e98074d1b16a5c991fbd38df9aa196e) Signed-off-by: Sameer Agarwal <sameerag@apache.org> 12 February 2018, 07:46:43 UTC
7e2a2b3 [SPARK-23376][SQL] creating UnsafeKVExternalSorter with BytesToBytesMap may fail ## What changes were proposed in this pull request? This is a long-standing bug in `UnsafeKVExternalSorter` and was reported on the dev list multiple times. When creating `UnsafeKVExternalSorter` with `BytesToBytesMap`, we need to create an `UnsafeInMemorySorter` to sort the data in the `BytesToBytesMap`. The data format of the sorter and the map is the same, so no data movement is required. However, both the sorter and the map need a pointer array for some bookkeeping work. There is an optimization in `UnsafeKVExternalSorter`: reuse the pointer array between the sorter and the map, to avoid an extra memory allocation. This sounds like a reasonable optimization: the length of the `BytesToBytesMap` pointer array is at least 4 times the number of keys (to avoid hash collisions, the hash table size should be at least 2 times the number of keys, and each key occupies 2 slots), and `UnsafeInMemorySorter` needs the pointer array size to be 4 times the number of entries, so we are safe to reuse the pointer array. However, the number of keys in the map doesn't equal the number of entries, because `BytesToBytesMap` supports duplicated keys. This breaks the assumption of the above optimization, and we may run out of space when inserting data into the sorter and hit an error ``` java.lang.IllegalStateException: There is no space for new record at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:239) at org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:149) ... ``` This PR fixes the bug by creating a new pointer array if the existing one is not big enough. ## How was this patch tested? a new test Author: Wenchen Fan <wenchen@databricks.com> Closes #20561 from cloud-fan/bug. (cherry picked from commit 4bbd7443ebb005f81ed6bc39849940ac8db3b3cc) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 11 February 2018, 16:04:14 UTC
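A hedged sketch of the capacity check the fix boils down to (plain arithmetic only, none of the actual Spark internals): the map's pointer array is sized from the number of distinct keys, but the sorter needs four slots per record, so the array can only be reused when it is large enough for every record.

```scala
object PointerArraySketch {
  // Reuse the map's pointer array only if it can hold 4 slots per record;
  // otherwise a fresh array must be allocated for the in-memory sorter.
  def canReuse(pointerArrayLength: Long, numRecords: Long): Boolean =
    pointerArrayLength >= numRecords * 4
}
```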
8875e47 [SPARK-23387][SQL][PYTHON][TEST][BRANCH-2.3] Backport assertPandasEqual to branch-2.3. ## What changes were proposed in this pull request? When backporting a pr with tests using `assertPandasEqual` from master to branch-2.3, the tests fail because `assertPandasEqual` doesn't exist in branch-2.3. We should backport `assertPandasEqual` to branch-2.3 to avoid the failures. ## How was this patch tested? Modified tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #20577 from ueshin/issues/SPARK-23387/branch-2.3. 11 February 2018, 13:16:47 UTC
9fa7b0e [SPARK-23314][PYTHON] Add ambiguous=False when localizing tz-naive timestamps in Arrow codepath to deal with dst ## What changes were proposed in this pull request? When tz_localize is called on a tz-naive timestamp, pandas will throw an exception if the timestamp falls in an ambiguous daylight saving time period, e.g., `2015-11-01 01:30:00`. This PR fixes the issue by setting `ambiguous=False` when calling tz_localize, which matches the default behavior of pytz. ## How was this patch tested? Added `test_timestamp_dst`. Author: Li Jin <ice.xelloss@gmail.com> Closes #20537 from icexelloss/SPARK-23314. (cherry picked from commit a34fce19bc0ee5a7e36c6ecba75d2aeb70fdcbc7) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 11 February 2018, 08:31:48 UTC
b7571b9 [SPARK-23360][SQL][PYTHON] Get local timezone from environment via pytz, or dateutil. ## What changes were proposed in this pull request? Currently we use `tzlocal()` to get the Python local timezone, but it sometimes causes unexpected behavior. I changed the way we get the Python local timezone: use pytz if the timezone is specified in an environment variable, or the timezone file via dateutil. ## How was this patch tested? Added a test and existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #20559 from ueshin/issues/SPARK-23360/master. (cherry picked from commit 97a224a855c4410b2dfb9c0bcc6aae583bd28e92) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 10 February 2018, 16:08:16 UTC
f3a9a7f [SPARK-23275][SQL] fix the thread leaking in hive/tests ## What changes were proposed in this pull request? This is a follow up of https://github.com/apache/spark/pull/20441. The two lines can actually trigger the hive metastore bug: https://issues.apache.org/jira/browse/HIVE-16844 The two configs are not in the default `ObjectStore` properties, so any Hive commands run after these two lines will set the `propsChanged` flag in `ObjectStore.setConf` and then cause thread leaks. I don't think the two lines are very useful. They can be removed safely. ## How was this patch tested? N/A Author: Feng Liu <fengliu@databricks.com> Closes #20562 from liufengdb/fix-omm. (cherry picked from commit 6d7c38330e68c7beb10f54eee8b4f607ee3c4136) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 10 February 2018, 00:22:14 UTC
49771ac [MINOR][HIVE] Typo fixes ## What changes were proposed in this pull request? Typo fixes (with expanding a Hive property) ## How was this patch tested? local build. Awaiting Jenkins Author: Jacek Laskowski <jacek@japila.pl> Closes #20550 from jaceklaskowski/hiveutils-typos. (cherry picked from commit 557938e2839afce26a10a849a2a4be8fc4580427) Signed-off-by: Sean Owen <sowen@cloudera.com> 10 February 2018, 00:18:38 UTC
08eb95f [SPARK-23358][CORE] When the number of partitions is greater than 2^28, it will result in an incorrect result ## What changes were proposed in this pull request? In `checkIndexAndDataFile`, `blocks` is an `Int`; when it is greater than 2^28, `blocks * 8` will overflow, which leads to an incorrect result. In fact, `blocks` is actually the number of partitions. ## How was this patch tested? Manual test Author: liuxian <liu.xian3@zte.com.cn> Closes #20544 from 10110346/overflow. (cherry picked from commit f77270b8811bbd8956d0c08fa556265d2c5ee20e) Signed-off-by: Sean Owen <sowen@cloudera.com> 09 February 2018, 14:45:15 UTC
196304a [SPARK-23328][PYTHON] Disallow default value None in na.replace/replace when 'to_replace' is not a dictionary ## What changes were proposed in this pull request? This PR proposes to disallow default value None when 'to_replace' is not a dictionary. It seems weird we set the default value of `value` to `None` and we ended up allowing the case as below: ```python >>> df.show() ``` ``` +----+------+-----+ | age|height| name| +----+------+-----+ | 10| 80|Alice| ... ``` ```python >>> df.na.replace('Alice').show() ``` ``` +----+------+----+ | age|height|name| +----+------+----+ | 10| 80|null| ... ``` **After** This PR targets to disallow the case above: ```python >>> df.na.replace('Alice').show() ``` ``` ... TypeError: value is required when to_replace is not a dictionary. ``` while we still allow when `to_replace` is a dictionary: ```python >>> df.na.replace({'Alice': None}).show() ``` ``` +----+------+----+ | age|height|name| +----+------+----+ | 10| 80|null| ... ``` ## How was this patch tested? Manually tested, tests were added in `python/pyspark/sql/tests.py` and doctests were fixed. Author: hyukjinkwon <gurwls223@gmail.com> Closes #20499 from HyukjinKwon/SPARK-19454-followup. (cherry picked from commit 4b4ee2601079f12f8f410a38d2081793cbdedc14) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 09 February 2018, 06:21:32 UTC
dfb1614 [SPARK-23186][SQL] Initialize DriverManager first before loading JDBC Drivers ## What changes were proposed in this pull request? Since some JDBC Drivers have class initialization code to call `DriverManager`, we need to initialize `DriverManager` first in order to avoid potential executor-side **deadlock** situations like the following (or [STORM-2527](https://issues.apache.org/jira/browse/STORM-2527)). ``` Thread 9587: (state = BLOCKED) - sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor, java.lang.Object[]) bci=0 (Compiled frame; information may be imprecise) - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) bci=85, line=62 (Compiled frame) - sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) bci=5, line=45 (Compiled frame) - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) bci=79, line=423 (Compiled frame) - java.lang.Class.newInstance() bci=138, line=442 (Compiled frame) - java.util.ServiceLoader$LazyIterator.nextService() bci=119, line=380 (Interpreted frame) - java.util.ServiceLoader$LazyIterator.next() bci=11, line=404 (Interpreted frame) - java.util.ServiceLoader$1.next() bci=37, line=480 (Interpreted frame) - java.sql.DriverManager$2.run() bci=21, line=603 (Interpreted frame) - java.sql.DriverManager$2.run() bci=1, line=583 (Interpreted frame) - java.security.AccessController.doPrivileged(java.security.PrivilegedAction) bci=0 (Compiled frame) - java.sql.DriverManager.loadInitialDrivers() bci=27, line=583 (Interpreted frame) - java.sql.DriverManager.<clinit>() bci=32, line=101 (Interpreted frame) - org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(java.lang.String, java.lang.Integer, java.lang.String, java.util.Properties) bci=12, line=98 (Interpreted frame) - org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(org.apache.hadoop.conf.Configuration, java.util.Properties) bci=22, line=57 (Interpreted frame) - org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(org.apache.hadoop.mapreduce.JobContext, org.apache.hadoop.conf.Configuration) bci=61, line=116 (Interpreted frame) - org.apache.phoenix.mapreduce.PhoenixInputFormat.createRecordReader(org.apache.hadoop.mapreduce.InputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext) bci=10, line=71 (Interpreted frame) - org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(org.apache.spark.rdd.NewHadoopRDD, org.apache.spark.Partition, org.apache.spark.TaskContext) bci=233, line=156 (Interpreted frame) Thread 9170: (state = BLOCKED) - org.apache.phoenix.jdbc.PhoenixDriver.<clinit>() bci=35, line=125 (Interpreted frame) - sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor, java.lang.Object[]) bci=0 (Compiled frame) - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) bci=85, line=62 (Compiled frame) - sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) bci=5, line=45 (Compiled frame) - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) bci=79, line=423 (Compiled frame) - java.lang.Class.newInstance() bci=138, line=442 (Compiled frame) - org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(java.lang.String) bci=89, line=46 (Interpreted frame) - org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply() bci=7, line=53 (Interpreted frame) - org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply() bci=1, 
line=52 (Interpreted frame) - org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD, org.apache.spark.Partition, org.apache.spark.TaskContext) bci=81, line=347 (Interpreted frame) - org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(org.apache.spark.Partition, org.apache.spark.TaskContext) bci=7, line=339 (Interpreted frame) ``` ## How was this patch tested? N/A Author: Dongjoon Hyun <dongjoon@apache.org> Closes #20359 from dongjoon-hyun/SPARK-23186. (cherry picked from commit 8cbcc33876c773722163b2259644037bbb259bd1) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 09 February 2018, 04:55:17 UTC
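A hedged Scala sketch of the ordering idea behind this fix (the object and method names are made up; only the `DriverManager` call is real): touch `java.sql.DriverManager` before reflectively loading any JDBC driver class, so a driver whose static initializer calls back into `DriverManager` cannot deadlock against it.

```scala
import java.sql.DriverManager

object DriverRegistrySketch {
  // Force DriverManager's static initialization (loadInitialDrivers) up front.
  DriverManager.getDrivers()

  // Only afterwards load driver classes reflectively, as a JDBC data source would.
  def register(driverClassName: String): Unit = {
    Class.forName(driverClassName)
  }
}
```

The point is simply that `DriverManager.<clinit>` runs to completion before any driver's own static initializer gets a chance to call into it.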
68f3a07 [SPARK-23268][SQL][FOLLOWUP] Reorganize packages in data source V2 ## What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/20435. While reorganizing the packages for streaming data source v2, the top level stream read/write support interfaces should not be in the reader/writer package, but should be in the `sources.v2` package, to follow the `ReadSupport`, `WriteSupport`, etc. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #20509 from cloud-fan/followup. (cherry picked from commit a75f927173632eee1316879447cb62c8cf30ae37) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 08 February 2018, 11:20:33 UTC
0c2a210 [SPARK-23348][SQL] append data using saveAsTable should adjust the data types ## What changes were proposed in this pull request? For inserting/appending data to an existing table, Spark should adjust the data types of the input query according to the table schema, or fail fast if it's uncastable. There are several ways to insert/append data: SQL API, `DataFrameWriter.insertInto`, `DataFrameWriter.saveAsTable`. The first 2 ways create `InsertIntoTable` plan, and the last way creates `CreateTable` plan. However, we only adjust input query data types for `InsertIntoTable`, and users may hit weird errors when appending data using `saveAsTable`. See the JIRA for the error case. This PR fixes this bug by adjusting data types for `CreateTable` too. ## How was this patch tested? new test. Author: Wenchen Fan <wenchen@databricks.com> Closes #20527 from cloud-fan/saveAsTable. (cherry picked from commit 7f5f5fb1296275a38da0adfa05125dd8ebf729ff) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 08 February 2018, 08:09:03 UTC
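A hedged end-to-end sketch of the append scenario this commit targets (table and column names are invented, and the exact error in the JIRA may differ): the second write's `Int` column should now be adjusted to the table's `Long` column instead of producing a confusing failure.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object SaveAsTableAppendSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    import spark.implicits._

    Seq(1L, 2L).toDF("id").write.saveAsTable("t")                       // table column: bigint
    Seq(3, 4).toDF("id").write.mode(SaveMode.Append).saveAsTable("t")   // input column: int
    spark.table("t").show()
    spark.stop()
  }
}
```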
0538302 [SPARK-23319][TESTS][BRANCH-2.3] Explicitly specify Pandas and PyArrow versions in PySpark tests (to skip or test) This PR backports https://github.com/apache/spark/pull/20487 to branch-2.3. Author: hyukjinkwon <gurwls223@gmail.com> Author: Takuya UESHIN <ueshin@databricks.com> Closes #20534 from HyukjinKwon/PR_TOOL_PICK_PR_20487_BRANCH-2.3. 08 February 2018, 07:47:12 UTC
db59e55 Revert [SPARK-22279][SQL] Turn on spark.sql.hive.convertMetastoreOrc by default ## What changes were proposed in this pull request? This is to revert the changes made in https://github.com/apache/spark/pull/19499 , because this causes a regression. We should not ignore the table-specific compression conf when the Hive serde tables are converted to the data source tables. ## How was this patch tested? The existing tests. Author: gatorsmile <gatorsmile@gmail.com> Closes #20536 from gatorsmile/revert22279. (cherry picked from commit 3473fda6dc77bdfd84b3de95d2082856ad4f8626) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 08 February 2018, 04:21:31 UTC
2ba07d5 [SPARK-23300][TESTS][BRANCH-2.3] Prints out if Pandas and PyArrow are installed or not in PySpark SQL tests This PR backports https://github.com/apache/spark/pull/20473 to branch-2.3. Author: hyukjinkwon <gurwls223@gmail.com> Closes #20533 from HyukjinKwon/backport-20473. 08 February 2018, 00:29:31 UTC
05239af [SPARK-23345][SQL] Remove open stream record even if closing it fails ## What changes were proposed in this pull request? When `DebugFilesystem` closes an opened stream, if any exception occurs, we still need to remove the open stream record from `DebugFilesystem`. Otherwise, it will report a leaked filesystem connection. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #20524 from viirya/SPARK-23345. (cherry picked from commit 9841ae0313cbee1f083f131f9446808c90ed5a7b) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 07 February 2018, 17:48:57 UTC
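A minimal sketch of the bookkeeping pattern described above, assuming a tracking map similar to (but not literally) the one in `DebugFilesystem`: the record is removed in a `finally` block so it is dropped even when `close()` throws.

```scala
import java.io.{FilterInputStream, InputStream}
import scala.collection.mutable

class TrackedStream(underlying: InputStream,
                    openStreams: mutable.Map[InputStream, Throwable])
    extends FilterInputStream(underlying) {

  override def close(): Unit = {
    try {
      super.close()
    } finally {
      openStreams.remove(underlying)  // always drop the open-stream record
    }
  }
}
```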
cb22e83 [SPARK-23122][PYSPARK][FOLLOWUP] Replace registerTempTable by createOrReplaceTempView ## What changes were proposed in this pull request? Replace `registerTempTable` by `createOrReplaceTempView`. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #20523 from gatorsmile/updateExamples. (cherry picked from commit 9775df67f924663598d51723a878557ddafb8cfd) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 07 February 2018, 14:24:30 UTC
874d3f8 [SPARK-23327][SQL] Update the description and tests of three external API or functions ## What changes were proposed in this pull request? Update the description and tests of three external API or functions `createFunction `, `length` and `repartitionByRange ` ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #20495 from gatorsmile/updateFunc. (cherry picked from commit c36fecc3b416c38002779c3cf40b6a665ac4bf13) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 07 February 2018, 00:46:55 UTC
f9c9132 [SPARK-23315][SQL] failed to get output from canonicalized data source v2 related plans ## What changes were proposed in this pull request? `DataSourceV2Relation` keeps a `fullOutput` and resolves the real output on demand by column name lookup. i.e. ``` lazy val output: Seq[Attribute] = reader.readSchema().map(_.name).map { name => fullOutput.find(_.name == name).get } ``` This will be broken after we canonicalize the plan, because all attribute names become "None", see https://github.com/apache/spark/blob/v2.3.0-rc1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L42 To fix this, `DataSourceV2Relation` should just keep `output`, and update the `output` when doing column pruning. ## How was this patch tested? a new test case Author: Wenchen Fan <wenchen@databricks.com> Closes #20485 from cloud-fan/canonicalize. (cherry picked from commit b96a083b1c6ff0d2c588be9499b456e1adce97dc) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 06 February 2018, 20:43:53 UTC
77cccc5 [MINOR][TEST] Fix class name for Pandas UDF tests In https://github.com/apache/spark/commit/b2ce17b4c9fea58140a57ca1846b2689b15c0d61, I mistakenly renamed `VectorizedUDFTests` to `ScalarPandasUDF`. This PR fixes the mistake. Existing tests. Author: Li Jin <ice.xelloss@gmail.com> Closes #20489 from icexelloss/fix-scalar-udf-tests. (cherry picked from commit caf30445632de6aec810309293499199e7a20892) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 06 February 2018, 20:37:25 UTC
036a04b [SPARK-23312][SQL][FOLLOWUP] add a config to turn off vectorized cache reader ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/20483 tried to provide a way to turn off the new columnar cache reader, to restore the behavior in 2.2. However, even if we turn off that config, the behavior is still different from 2.2: if the output data are rows, we still enable whole-stage codegen for the scan node, which differs from 2.2, so we should fix that as well. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #20513 from cloud-fan/cache. (cherry picked from commit ac7454cac04a1d9252b3856360eda5c3e8bcb8da) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 06 February 2018, 20:27:52 UTC
7782fd0 [SPARK-23310][CORE][FOLLOWUP] Fix Java style check issues. ## What changes were proposed in this pull request? This is a follow-up of #20492 which broke lint-java checks. This pr fixes the lint-java issues. ``` [ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java:[79] (sizes) LineLength: Line is longer than 100 characters (found 114). ``` ## How was this patch tested? Checked manually in my local environment. Author: Takuya UESHIN <ueshin@databricks.com> Closes #20514 from ueshin/issues/SPARK-23310/fup1. (cherry picked from commit 7db9979babe52d15828967c86eb77e3fb2791579) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 06 February 2018, 18:47:04 UTC
a511544 [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() to handle str type properly in Python 2. ## What changes were proposed in this pull request? In Python 2, when `pandas_udf` tries to return string type value created in the udf with `".."`, the execution fails. E.g., ```python from pyspark.sql.functions import pandas_udf, col import pandas as pd df = spark.range(10) str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string") df.select(str_f(col('id'))).show() ``` raises the following exception: ``` ... java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: expected StringType, got BinaryType at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:93) ... ``` Seems like pyarrow ignores `type` parameter for `pa.Array.from_pandas()` and consider it as binary type when the type is string type and the string values are `str` instead of `unicode` in Python 2. This pr adds a workaround for the case. ## How was this patch tested? Added a test and existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #20507 from ueshin/issues/SPARK-23334. (cherry picked from commit 63c5bf13ce5cd3b8d7e7fb88de881ed207fde720) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 06 February 2018, 09:31:06 UTC
4493303 [SPARK-23290][SQL][PYTHON][BACKPORT-2.3] Use datetime.date for date type when converting Spark DataFrame to Pandas DataFrame. ## What changes were proposed in this pull request? This is a backport of #20506. In #18664, there was a change in how `DateType` is being returned to users ([line 1968 in dataframe.py](https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968)). This can cause client code which works in Spark 2.2 to fail. See [SPARK-23290](https://issues.apache.org/jira/browse/SPARK-23290?focusedCommentId=16350917&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16350917) for an example. This pr modifies to use `datetime.date` for date type as Spark 2.2 does. ## How was this patch tested? Tests modified to fit the new behavior and existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #20515 from ueshin/issues/SPARK-23290_2.3. 06 February 2018, 09:29:37 UTC
521494d [SPARK-23326][WEBUI] schedulerDelay should return 0 when the task is running ## What changes were proposed in this pull request? When a task is still running, metrics like executorRunTime are not available. Then `schedulerDelay` will be almost the same as `duration` and that's confusing. This PR makes `schedulerDelay` return 0 when the task is running which is the same behavior as 2.2. ## How was this patch tested? `AppStatusUtilsSuite.schedulerDelay` Author: Shixiong Zhu <zsxwing@gmail.com> Closes #20493 from zsxwing/SPARK-23326. (cherry picked from commit f3f1e14bb73dfdd2927d95b12d7d61d22de8a0ac) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 February 2018, 06:43:22 UTC
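A hedged sketch of the guard being restored (the parameter names are illustrative, not the actual `AppStatusUtils` signature): while a task is still running its executor-side metrics are all zero, so the subtraction would just echo `duration`; returning 0 instead matches the 2.2 behavior.

```scala
object SchedulerDelaySketch {
  def schedulerDelay(finished: Boolean,
                     duration: Long,
                     runTime: Long,
                     deserializeTime: Long,
                     resultSerializeTime: Long,
                     gettingResultTime: Long): Long = {
    if (!finished) {
      0L  // executor metrics are not populated yet
    } else {
      math.max(0L, duration - runTime - deserializeTime - resultSerializeTime - gettingResultTime)
    }
  }
}
```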
4aa9aaf [SPARK-23330][WEBUI] Spark UI SQL executions page throws NPE ## What changes were proposed in this pull request? Spark SQL executions page throws the following error and the page crashes: ``` HTTP ERROR 500 Problem accessing /SQL/. Reason: Server Error Caused by: java.lang.NullPointerException at scala.collection.immutable.StringOps$.length$extension(StringOps.scala:47) at scala.collection.immutable.StringOps.length(StringOps.scala:47) at scala.collection.IndexedSeqOptimized$class.isEmpty(IndexedSeqOptimized.scala:27) at scala.collection.immutable.StringOps.isEmpty(StringOps.scala:29) at scala.collection.TraversableOnce$class.nonEmpty(TraversableOnce.scala:111) at scala.collection.immutable.StringOps.nonEmpty(StringOps.scala:29) at org.apache.spark.sql.execution.ui.ExecutionTable.descriptionCell(AllExecutionsPage.scala:182) at org.apache.spark.sql.execution.ui.ExecutionTable.row(AllExecutionsPage.scala:155) at org.apache.spark.sql.execution.ui.ExecutionTable$$anonfun$8.apply(AllExecutionsPage.scala:204) at org.apache.spark.sql.execution.ui.ExecutionTable$$anonfun$8.apply(AllExecutionsPage.scala:204) at org.apache.spark.ui.UIUtils$$anonfun$listingTable$2.apply(UIUtils.scala:339) at org.apache.spark.ui.UIUtils$$anonfun$listingTable$2.apply(UIUtils.scala:339) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.ui.UIUtils$.listingTable(UIUtils.scala:339) at org.apache.spark.sql.execution.ui.ExecutionTable.toNodeSeq(AllExecutionsPage.scala:203) at org.apache.spark.sql.execution.ui.AllExecutionsPage.render(AllExecutionsPage.scala:67) at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) at org.eclipse.jetty.server.Server.handle(Server.java:534) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) at java.lang.Thread.run(Thread.java:748) ``` One possible reason this page fails is that the `SparkListenerSQLExecutionStart` event gets dropped before it is processed, so the execution description and details don't get updated. This was not an issue in 2.2 because it would ignore any job start event that arrives before the corresponding execution start event, which doesn't sound like a good decision. We should try to handle the null values on the front-end side, that is, give a default value when `execution.details` or `execution.description` is null. Another possible approach is not to spill the `LiveExecutionData` in `SQLAppStatusListener.update(exec: LiveExecutionData)` if `exec.details` is null. This is not ideal because then you will not see the execution if the `SparkListenerSQLExecutionStart` event is lost, since `AllExecutionsPage` only reads executions from the KVStore. ## How was this patch tested? After the change, the page shows the following: ![image](https://user-images.githubusercontent.com/4784782/35775480-28cc5fde-093e-11e8-8ccc-f58c2ef4a514.png) Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #20502 from jiangxb1987/executionPage. (cherry picked from commit c2766b07b4b9ed976931966a79c65043e81cf694) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 05 February 2018, 22:17:23 UTC
173449c [SPARK-23310][CORE] Turn off read ahead input stream for unsafe shuffle reader To fix a regression for TPC-DS queries Author: Sital Kedia <skedia@fb.com> Closes #20492 from sitalkedia/turn_off_async_inputstream. (cherry picked from commit 03b7e120dd7ff7848c936c7a23644da5bd7219ab) Signed-off-by: Sameer Agarwal <sameerag@apache.org> 05 February 2018, 18:20:02 UTC
e688ffe [SPARK-23307][WEBUI] Sort jobs/stages/tasks/queries with the completed timestamp before cleaning up them ## What changes were proposed in this pull request? Sort jobs/stages/tasks/queries with the completed timestamp before cleaning up them to make the behavior consistent with 2.2. ## How was this patch tested? - Jenkins. - Manually ran the following codes and checked the UI for jobs/stages/tasks/queries. ``` spark.ui.retainedJobs 10 spark.ui.retainedStages 10 spark.sql.ui.retainedExecutions 10 spark.ui.retainedTasks 10 ``` ``` new Thread() { override def run() { spark.range(1, 2).foreach { i => Thread.sleep(10000) } } }.start() Thread.sleep(5000) for (_ <- 1 to 20) { new Thread() { override def run() { spark.range(1, 2).foreach { i => } } }.start() } Thread.sleep(15000) spark.range(1, 2).foreach { i => } sc.makeRDD(1 to 100, 100).foreach { i => } ``` Author: Shixiong Zhu <zsxwing@gmail.com> Closes #20481 from zsxwing/SPARK-23307. (cherry picked from commit a6bf3db20773ba65cbc4f2775db7bd215e78829a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 February 2018, 10:42:09 UTC
430025c [SPARK-22036][SQL][FOLLOWUP] Fix decimalArithmeticOperations.sql ## What changes were proposed in this pull request? Fix decimalArithmeticOperations.sql test ## How was this patch tested? N/A Author: Yuming Wang <wgyumg@gmail.com> Author: wangyum <wgyumg@gmail.com> Author: Yuming Wang <yumwang@ebay.com> Closes #20498 from wangyum/SPARK-22036. (cherry picked from commit 6fb3fd15365d43733aefdb396db205d7ccf57f75) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 04 February 2018, 17:16:06 UTC
45f0f4f [SPARK-21658][SQL][PYSPARK] Revert "[] Add default None for value in na.replace in PySpark" This reverts commit 0fcde87aadc9a92e138f11583119465ca4b5c518. See the discussion in [SPARK-21658](https://issues.apache.org/jira/browse/SPARK-21658), [SPARK-19454](https://issues.apache.org/jira/browse/SPARK-19454) and https://github.com/apache/spark/pull/16793 Author: hyukjinkwon <gurwls223@gmail.com> Closes #20496 from HyukjinKwon/revert-SPARK-21658. (cherry picked from commit 551dff2bccb65e9b3f77b986f167aec90d9a6016) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 03 February 2018, 18:40:29 UTC
be3de87 [MINOR][DOC] Use raw triple double quotes around docstrings where there are occurrences of backslashes. From [PEP 257](https://www.python.org/dev/peps/pep-0257/): > For consistency, always use """triple double quotes""" around docstrings. Use r"""raw triple double quotes""" if you use any backslashes in your docstrings. For Unicode docstrings, use u"""Unicode triple-quoted strings""". For example, this is what help (kafka_wordcount) shows: ``` DESCRIPTION Counts words in UTF8 encoded, ' ' delimited text received from the network every second. Usage: kafka_wordcount.py <zk> <topic> To run this on your local machine, you need to setup Kafka and create a producer first, see http://kafka.apache.org/documentation.html#quickstart and then run the example `$ bin/spark-submit --jars external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar examples/src/main/python/streaming/kafka_wordcount.py localhost:2181 test` ``` This is what it shows, after the fix: ``` DESCRIPTION Counts words in UTF8 encoded, '\n' delimited text received from the network every second. Usage: kafka_wordcount.py <zk> <topic> To run this on your local machine, you need to setup Kafka and create a producer first, see http://kafka.apache.org/documentation.html#quickstart and then run the example `$ bin/spark-submit --jars \ external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar \ examples/src/main/python/streaming/kafka_wordcount.py \ localhost:2181 test` ``` The thing worth noticing is no linebreak here in the help. ## What changes were proposed in this pull request? Change triple double quotes to raw triple double quotes when there are occurrences of backslashes in docstrings. ## How was this patch tested? Manually as this is a doc fix. Author: Shashwat Anand <me@shashwat.me> Closes #20497 from ashashwat/docstring-fixes. (cherry picked from commit 4aaa7d40bf495317e740b6d6f9c2a55dfd03521b) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 03 February 2018, 18:31:17 UTC
4de2061 [SPARK-23305][SQL][TEST] Test `spark.sql.files.ignoreMissingFiles` for all file-based data sources ## What changes were proposed in this pull request? Like Parquet, all file-based data source handles `spark.sql.files.ignoreMissingFiles` correctly. We had better have a test coverage for feature parity and in order to prevent future accidental regression for all data sources. ## How was this patch tested? Pass Jenkins with a newly added test case. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #20479 from dongjoon-hyun/SPARK-23305. (cherry picked from commit 522e0b1866a0298669c83de5a47ba380dc0b7c84) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 03 February 2018, 08:04:08 UTC
1bcb372 [SPARK-23311][SQL][TEST] add FilterFunction test case for CombineTypedFilters ## What changes were proposed in this pull request? The current test cases for CombineTypedFilters lack a test of FilterFunction, so let's add one. In addition, let's extract a common LocalRelation in TypedFilterOptimizationSuite's existing test cases. ## How was this patch tested? Added new test cases. Author: caoxuewen <cao.xuewen@zte.com.cn> Closes #20482 from heary-cao/TypedFilterOptimizationSuite. (cherry picked from commit 63b49fa2e599080c2ba7d5189f9dde20a2e01fb4) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 03 February 2018, 08:02:11 UTC
b614c08 [SPARK-23317][SQL] rename ContinuousReader.setOffset to setStartOffset ## What changes were proposed in this pull request? In the document of `ContinuousReader.setOffset`, we say this method is used to specify the start offset. We also have a `ContinuousReader.getStartOffset` to get the value back. I think it makes more sense to rename `ContinuousReader.setOffset` to `setStartOffset`. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #20486 from cloud-fan/rename. (cherry picked from commit fe73cb4b439169f16cc24cd851a11fd398ce7edf) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 03 February 2018, 04:49:17 UTC
dcd0af4 [SQL] Minor doc update: Add an example in DataFrameReader.schema ## What changes were proposed in this pull request? This patch adds a small example to the schema string definition of schema function. It isn't obvious how to use it, so an example would be useful. ## How was this patch tested? N/A - doc only. Author: Reynold Xin <rxin@databricks.com> Closes #20491 from rxin/schema-doc. (cherry picked from commit 3ff83ad43a704cc3354ef9783e711c065e2a1a22) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 03 February 2018, 04:36:37 UTC
56eb9a3 [SPARK-23064][SS][DOCS] Stream-stream joins Documentation - follow up ## What changes were proposed in this pull request? Further clarification of caveats in using stream-stream outer joins. ## How was this patch tested? N/A Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #20494 from tdas/SPARK-23064-2. (cherry picked from commit eaf35de2471fac4337dd2920026836d52b1ec847) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 03 February 2018, 01:38:07 UTC
e5e9f9a [SPARK-23312][SQL] add a config to turn off vectorized cache reader ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-23309 reported a performance regression about cached table in Spark 2.3. While the investigating is still going on, this PR adds a conf to turn off the vectorized cache reader, to unblock the 2.3 release. ## How was this patch tested? a new test Author: Wenchen Fan <wenchen@databricks.com> Closes #20483 from cloud-fan/cache. (cherry picked from commit b9503fcbb3f4a3ce263164d1f11a8e99b9ca5710) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 February 2018, 14:43:51 UTC
2b07452 [SPARK-23301][SQL] data source column pruning should work for arbitrary expressions This PR fixes a mistake in the `PushDownOperatorsToDataSource` rule, the column pruning logic is incorrect about `Project`. a new test case for column pruning with arbitrary expressions, and improve the existing tests to make sure the `PushDownOperatorsToDataSource` really works. Author: Wenchen Fan <wenchen@databricks.com> Closes #20476 from cloud-fan/push-down. (cherry picked from commit 19c7c7ebdef6c1c7a02ebac9af6a24f521b52c37) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 02 February 2018, 04:46:09 UTC
7baae3a [SPARK-23284][SQL] Document the behavior of several ColumnVector's get APIs when accessing null slot ## What changes were proposed in this pull request? For some ColumnVector get APIs such as getDecimal, getBinary, getStruct, getArray, getInterval, getUTF8String, we should clearly document their behaviors when accessing null slot. They should return null in this case. Then we can remove null checks from the places using above APIs. For the APIs of primitive values like getInt, getInts, etc., this also documents their behaviors when accessing null slots. Their returning values are undefined and can be anything. ## How was this patch tested? Added tests into `ColumnarBatchSuite`. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #20455 from viirya/SPARK-23272-followup. (cherry picked from commit 90848d507457d30abb36e3ba07618dfc87c34cd6) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 February 2018, 02:18:49 UTC
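A usage sketch of the documented contract from a caller's side (precision and scale are arbitrary here, and this assumes the getters behave exactly as described above): guard with `isNullAt` so primitive getters are never read on a null slot.

```scala
import org.apache.spark.sql.vectorized.ColumnVector

object NullSlotSketch {
  def decimalOrNull(v: ColumnVector, rowId: Int): java.math.BigDecimal = {
    if (v.isNullAt(rowId)) {
      null  // complex getters return null for null slots; primitive ones are undefined there
    } else {
      v.getDecimal(rowId, 38, 18).toJavaBigDecimal
    }
  }
}
```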
ab23785 [SPARK-23296][YARN] Include stacktrace in YARN-app diagnostic ## What changes were proposed in this pull request? Include stacktrace in the diagnostics message upon abnormal unregister from RM ## How was this patch tested? Tested with a failing job, and confirmed a stacktrace in the client output and YARN webUI. Author: Gera Shegalov <gera@apache.org> Closes #20470 from gerashegalov/gera/stacktrace-diagnostics. (cherry picked from commit 032c11b83f0d276bf8085992229b8c598f02798a) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 01 February 2018, 23:27:09 UTC
07a8f4d [SPARK-23293][SQL] fix data source v2 self join `DataSourceV2Relation` should extend `MultiInstanceRelation`, to take care of self-join. a new test Author: Wenchen Fan <wenchen@databricks.com> Closes #20466 from cloud-fan/dsv2-selfjoin. (cherry picked from commit 73da3b6968630d9e2cafc742ccb6d4eb54957df4) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 01 February 2018, 18:50:44 UTC
2db7e49 [SPARK-13983][SQL] Fix HiveThriftServer2 can not get "--hiveconf" and ''--hivevar" variables since 2.0 ## What changes were proposed in this pull request? `--hiveconf` and `--hivevar` variables no longer work since Spark 2.0. The `spark-sql` client has fixed by [SPARK-15730](https://issues.apache.org/jira/browse/SPARK-15730) and [SPARK-18086](https://issues.apache.org/jira/browse/SPARK-18086). but `beeline`/[`Spark SQL HiveThriftServer2`](https://github.com/apache/spark/blob/v2.1.1/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala) is still broken. This pull request fix it. This pull request works for both `JDBC client` and `beeline`. ## How was this patch tested? unit tests for `JDBC client` manual tests for `beeline`: ``` git checkout origin/pr/17886 dev/make-distribution.sh --mvn mvn --tgz -Phive -Phive-thriftserver -Phadoop-2.6 -DskipTests tar -zxf spark-2.3.0-SNAPSHOT-bin-2.6.5.tgz && cd spark-2.3.0-SNAPSHOT-bin-2.6.5 sbin/start-thriftserver.sh ``` ``` cat <<EOF > test.sql select '\${a}', '\${b}'; EOF beeline -u jdbc:hive2://localhost:10000 --hiveconf a=avalue --hivevar b=bvalue -f test.sql ``` Author: Yuming Wang <wgyumg@gmail.com> Closes #17886 from wangyum/SPARK-13983-dev. (cherry picked from commit f051f834036e63d5e480d86440ce39924f979e82) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 01 February 2018, 18:36:47 UTC
2549bea [SPARK-23289][CORE] OneForOneBlockFetcher.DownloadCallback.onData should write the buffer fully ## What changes were proposed in this pull request? `channel.write(buf)` may not write the whole buffer since the underlying channel is a FileChannel, we should retry until the whole buffer is written. ## How was this patch tested? Jenkins Author: Shixiong Zhu <zsxwing@gmail.com> Closes #20461 from zsxwing/SPARK-23289. (cherry picked from commit ec63e2d0743a4f75e1cce21d0fe2b54407a86a4a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 February 2018, 13:02:54 UTC
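A minimal sketch of the retry loop this fix describes (a generic `WritableByteChannel` helper, not the actual `DownloadCallback` code): `write` on a file channel may be partial, so loop until the buffer is drained.

```scala
import java.nio.ByteBuffer
import java.nio.channels.WritableByteChannel

object WriteFullySketch {
  def writeFully(channel: WritableByteChannel, buf: ByteBuffer): Unit = {
    while (buf.hasRemaining) {
      channel.write(buf)  // may write only part of the buffer; keep going
    }
  }
}
```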
3aa780e [SPARK-23280][SQL][FOLLOWUP] Enable `MutableColumnarRow.getMap()`. ## What changes were proposed in this pull request? This is a followup pr of #20450. We should've enabled `MutableColumnarRow.getMap()` as well. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #20471 from ueshin/issues/SPARK-23280/fup2. (cherry picked from commit 89e8d556b93d1bf1b28fe153fd284f154045b0ee) Signed-off-by: Takuya UESHIN <ueshin@databricks.com> 01 February 2018, 12:29:19 UTC
6b6bc9c [SPARK-23280][SQL][FOLLOWUP] Fix Java style check issues. ## What changes were proposed in this pull request? This is a follow-up of #20450 which broke lint-java checks. This pr fixes the lint-java issues. ``` [ERROR] src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java:[20,8] (imports) UnusedImports: Unused import - org.apache.spark.sql.catalyst.util.MapData. [ERROR] src/main/java/org/apache/spark/sql/vectorized/ColumnarArray.java:[21,8] (imports) UnusedImports: Unused import - org.apache.spark.sql.catalyst.util.MapData. [ERROR] src/main/java/org/apache/spark/sql/vectorized/ColumnarRow.java:[22,8] (imports) UnusedImports: Unused import - org.apache.spark.sql.catalyst.util.MapData. ``` ## How was this patch tested? Checked manually in my local environment. Author: Takuya UESHIN <ueshin@databricks.com> Closes #20468 from ueshin/issues/SPARK-23280/fup1. (cherry picked from commit 8bb70b068ea782e799e45238fcb093a6acb0fc9f) Signed-off-by: Takuya UESHIN <ueshin@databricks.com> 01 February 2018, 12:25:19 UTC
205bce9 [SPARK-23107][ML] ML 2.3 QA: New Scala APIs, docs. ## What changes were proposed in this pull request? Audit new APIs and docs in 2.3.0. ## How was this patch tested? No test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #20459 from yanboliang/SPARK-23107. (cherry picked from commit e15da5b14c8d845028365a609c0c66731d024ee7) Signed-off-by: Nick Pentreath <nickp@za.ibm.com> 01 February 2018, 09:25:14 UTC
871fd48 [SPARK-21396][SQL] Fixes MatchError when UDTs are passed through Hive Thriftserver Signed-off-by: Atallah Hezbor <atallahhezborgmail.com> ## What changes were proposed in this pull request? This PR proposes modifying the match statement that gets the columns of a row in HiveThriftServer. There was previously no case for `UserDefinedType`, so querying a table that contained them would throw a match error. The changes catch that case and return the string representation. ## How was this patch tested? While I would have liked to add a unit test, I couldn't easily incorporate UDTs into the ``HiveThriftServer2Suites`` pipeline. With some guidance I would be happy to push a commit with tests. Instead I did a manual test by loading a `DataFrame` with Point UDT in a spark shell with a HiveThriftServer. Then in beeline, connecting to the server and querying that table. Here is the result before the change ``` 0: jdbc:hive2://localhost:10000> select * from chicago; Error: scala.MatchError: org.apache.spark.sql.PointUDT2d980dc3 (of class org.apache.spark.sql.PointUDT) (state=,code=0) ``` And after the change: ``` 0: jdbc:hive2://localhost:10000> select * from chicago; +---------------------------------------+--------------+------------------------+---------------------+--+ | __fid__ | case_number | dtg | geom | +---------------------------------------+--------------+------------------------+---------------------+--+ | 109602f9-54f8-414b-8c6f-42b1a337643e | 2 | 2016-01-01 19:00:00.0 | POINT (-77 38) | | 709602f9-fcff-4429-8027-55649b6fd7ed | 1 | 2015-12-31 19:00:00.0 | POINT (-76.5 38.5) | | 009602f9-fcb5-45b1-a867-eb8ba10cab40 | 3 | 2016-01-02 19:00:00.0 | POINT (-78 39) | +---------------------------------------+--------------+------------------------+---------------------+--+ ``` Author: Atallah Hezbor <atallahhezbor@gmail.com> Closes #20385 from atallahhezbor/udts_over_hive. (cherry picked from commit b2e7677f4d3d8f47f5f148680af39d38f2b558f0) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 01 February 2018, 04:46:03 UTC
59e89a2 [SPARK-23268][SQL] Reorganize packages in data source V2 ## What changes were proposed in this pull request? 1. create a new package for partitioning/distribution related classes. As Spark will add new concrete implementations of `Distribution` in new releases, it is good to have a new package for partitioning/distribution related classes. 2. move streaming related class to package `org.apache.spark.sql.sources.v2.reader/writer.streaming`, instead of `org.apache.spark.sql.sources.v2.streaming.reader/writer`. So that the there won't be package reader/writer inside package streaming, which is quite confusing. Before change: ``` v2 ├── reader ├── streaming │   ├── reader │   └── writer └── writer ``` After change: ``` v2 ├── reader │   └── streaming └── writer └── streaming ``` ## How was this patch tested? Unit test. Author: Wang Gengliang <ltnwgl@gmail.com> Closes #20435 from gengliangwang/new_pkg. (cherry picked from commit 56ae32657e9e5d1e30b62afe77d9e14eb07cf4fb) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 01 February 2018, 04:34:01 UTC
0d0f579 [SPARK-23280][SQL] add map type support to ColumnVector ## What changes were proposed in this pull request? Fill the last missing piece of `ColumnVector`: the map type support. The idea is similar to the array type support. A map is basically 2 arrays: keys and values. We ask the implementations to provide a key array, a value array, and an offset and length to specify the range of this map in the key/value array. In `WritableColumnVector`, we put the key array in first child vector, and value array in second child vector, and offsets and lengths in the current vector, which is very similar to how array type is implemented here. ## How was this patch tested? a new test Author: Wenchen Fan <wenchen@databricks.com> Closes #20450 from cloud-fan/map. (cherry picked from commit 52e00f70663a87b5837235bdf72a3e6f84e11411) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 February 2018, 03:56:38 UTC
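A hedged reader-side sketch of the layout described above (assuming the `getMap` accessor behaves as this commit describes): a map slot is just a range over the key and value child arrays.

```scala
import org.apache.spark.sql.vectorized.ColumnVector

object MapVectorSketch {
  // Sums the Int values of the map stored at `rowId`, skipping null entries.
  def sumIntValues(v: ColumnVector, rowId: Int): Long = {
    if (v.isNullAt(rowId)) return 0L
    val map = v.getMap(rowId)    // backed by the key/value child vectors plus offset/length
    val values = map.valueArray()
    var i = 0
    var sum = 0L
    while (i < map.numElements()) {
      if (!values.isNullAt(i)) sum += values.getInt(i)
      i += 1
    }
    sum
  }
}
```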
7ccfc75 [SPARK-23157][SQL][FOLLOW-UP] DataFrame -> SparkDataFrame in R comment Author: Henry Robinson <henry@cloudera.com> Closes #20443 from henryr/SPARK-23157. (cherry picked from commit f470df2fcf14e6234c577dc1bdfac27d49b441f5) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 01 February 2018, 02:15:32 UTC
8ee3a71 [SPARK-23281][SQL] Query produces results in incorrect order when a composite order by clause refers to both original columns and aliases ## What changes were proposed in this pull request? Here is the test snippet. ```scala scala> Seq[(Integer, Integer)]( | (1, 1), | (1, 3), | (2, 3), | (3, 3), | (4, null), | (5, null) | ).toDF("key", "value").createOrReplaceTempView("src") scala> sql( | """ | |SELECT MAX(value) as value, key as col2 | |FROM src | |GROUP BY key | |ORDER BY value desc, key | """.stripMargin).show +-----+----+ |value|col2| +-----+----+ | 3| 3| | 3| 2| | 3| 1| | null| 5| | null| 4| +-----+----+ ``` Here is the explain output: ``` == Parsed Logical Plan == 'Sort ['value DESC NULLS LAST, 'key ASC NULLS FIRST], true +- 'Aggregate ['key], ['MAX('value) AS value#9, 'key AS col2#10] +- 'UnresolvedRelation `src` == Analyzed Logical Plan == value: int, col2: int Project [value#9, col2#10] +- Sort [value#9 DESC NULLS LAST, col2#10 DESC NULLS LAST], true +- Aggregate [key#5], [max(value#6) AS value#9, key#5 AS col2#10] +- SubqueryAlias src +- Project [_1#2 AS key#5, _2#3 AS value#6] +- LocalRelation [_1#2, _2#3] ``` The sort direction is being wrongly changed from ASC to DESC while resolving `Sort` in resolveAggregateFunctions. The above test case models TPCDS-Q71, so we have the same issue in Q71 as well. ## How was this patch tested? A few tests are added in SQLQuerySuite. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #20453 from dilipbiswal/local_spark. 31 January 2018, 21:56:51 UTC
f5f21e8 [SPARK-23249][SQL] Improved block merging logic for partitions ## What changes were proposed in this pull request? Change DataSourceScanExec so that when grouping blocks together into partitions, it also checks the end of the sorted list of splits to more efficiently fill out partitions. ## How was this patch tested? Updated an old test to reflect the new logic, which causes the # of partitions to drop from 4 to 3. Also, a current test exists to test large non-splittable files at https://github.com/glentakahashi/spark/blob/c575977a5952bf50b605be8079c9be1e30f3bd36/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala#L346 ## Rationale The current bin-packing method of next-fit descending for blocks into partitions is sub-optimal in a lot of cases and will result in extra partitions, uneven distribution of block counts across partitions, and uneven distribution of partition sizes. As an example, 128 files ranging from 1MB, 2MB, ..., 127MB, 128MB will result in 82 partitions with the current algorithm, but only 64 using this algorithm. Also in this example, the max # of blocks per partition with NFD is 13, while with this algorithm it is 2. More generally, running a simulation of 1000 runs using a 128MB block size, with between 1-1000 normally distributed file sizes between 1-500MB, you can see an improvement of approximately 5% reduction in partition counts, and a large reduction in the standard deviation of blocks per partition. This algorithm also runs in O(n) time as NFD does, and in every case gives strictly better results than NFD. Overall, the more even distribution of blocks across partitions and therefore reduced partition counts should result in a small but significant performance increase across the board. Author: Glen Takahashi <gtakahashi@palantir.com> Closes #20372 from glentakahashi/feature/improved-block-merging. (cherry picked from commit 8c21170decfb9ca4d3233e1ea13bd1b6e3199ed9) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 31 January 2018, 17:14:16 UTC
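A hedged, self-contained sketch of the packing idea (it ignores `openCostInBytes` and every other detail of `DataSourceScanExec`, so the partition counts only roughly match the numbers quoted above): after sorting file sizes descending, each new partition starts with the largest remaining file and is then topped up with the smallest remaining files that still fit.

```scala
import scala.collection.mutable

object BlockMergingSketch {
  def pack(sizes: Seq[Long], maxPartitionBytes: Long): Seq[Seq[Long]] = {
    val remaining = mutable.ArrayBuffer(sizes.sortBy(-_): _*)
    val partitions = mutable.ArrayBuffer.empty[Seq[Long]]
    while (remaining.nonEmpty) {
      val current = mutable.ArrayBuffer(remaining.remove(0))   // largest file first
      var budget = maxPartitionBytes - current.head
      // also look at the tail (smallest files) to fill up the partition
      while (remaining.nonEmpty && remaining.last <= budget) {
        budget -= remaining.last
        current += remaining.remove(remaining.length - 1)
      }
      partitions += current.toList
    }
    partitions.toList
  }
}
```

On the 1MB-128MB example above, this sketch produces about 65 partitions with at most two files each, close to the 64 quoted for the real implementation.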
33f17b2 revert [SPARK-22785][SQL] remove ColumnVector.anyNullsSet ## What changes were proposed in this pull request? In https://github.com/apache/spark/pull/19980 , we thought `anyNullsSet` could be implemented simply as `numNulls() > 0`. This is logically true, but may have performance problems. `OrcColumnVector` is an example: it doesn't have a `numNulls` property, only a `noNulls` property, so we would lose a lot of performance if we used `numNulls() > 0` to check for nulls. This PR simply reverts #19980, renaming the method to `hasNull`. Better name suggestions are welcome, e.g. `nullable`? ## How was this patch tested? existing test Author: Wenchen Fan <wenchen@databricks.com> Closes #20452 from cloud-fan/null. (cherry picked from commit 48dd6a4c79e33a8f2dba8349b58aa07e4796a925) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 31 January 2018, 16:24:58 UTC
c83246c [SPARK-23112][DOC] Update ML migration guide with breaking and behavior changes. Add breaking changes, as well as update behavior changes, to `2.3` ML migration guide. ## How was this patch tested? Doc only Author: Nick Pentreath <nickp@za.ibm.com> Closes #20421 from MLnick/SPARK-23112-ml-guide. (cherry picked from commit 161a3f2ae324271a601500e3d2900db9359ee2ef) Signed-off-by: Nick Pentreath <nickp@za.ibm.com> 31 January 2018, 08:37:52 UTC
7ec8ad7 [SPARK-23272][SQL] add calendar interval type support to ColumnVector ## What changes were proposed in this pull request? `ColumnVector` is aimed at supporting all the data types, but `CalendarIntervalType` is missing. Actually, we do support the interval type for inner fields, e.g. `ColumnarRow` and `ColumnarArray` both support it, so it's weird if we don't support the interval type at the top level. This PR adds the interval type support. This PR also adjusts the visibility of `ColumnVector.getChild`, because `MutableColumnarRow.getInterval` needs it. Now the interval implementation is in `ColumnVector.getInterval`. ## How was this patch tested? a new test. Author: Wenchen Fan <wenchen@databricks.com> Closes #20438 from cloud-fan/interval. (cherry picked from commit 695f7146bca342a0ee192d8c7f5ec48d4d8577a8) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 31 January 2018, 07:13:41 UTC
ab5a510 [SPARK-23279][SS] Avoid triggering distributed job for Console sink ## What changes were proposed in this pull request? The Console sink will redistribute collected local data and trigger a distributed job in each batch, which is not necessary, so this changes it to a local job. ## How was this patch tested? Existing UT and manual verification. Author: jerryshao <sshao@hortonworks.com> Closes #20447 from jerryshao/console-minor. (cherry picked from commit 8c6a9c90a36a938372f28ee8be72178192fbc313) Signed-off-by: jerryshao <sshao@hortonworks.com> 31 January 2018, 05:59:36 UTC
b877832 [SPARK-23274][SQL] Fix ReplaceExceptWithFilter when the right's Filter contains the references that are not in the left output ## What changes were proposed in this pull request? This PR fixes the `ReplaceExceptWithFilter` rule for the case where the right side's Filter contains references that are not in the left output. Before this PR, we got an error like ``` java.util.NoSuchElementException: key not found: a at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:59) at scala.collection.MapLike$class.apply(MapLike.scala:141) at scala.collection.AbstractMap.apply(Map.scala:59) ``` After this PR, `ReplaceExceptWithFilter` will not take effect in this case. ## How was this patch tested? Added tests Author: gatorsmile <gatorsmile@gmail.com> Closes #20444 from gatorsmile/fixReplaceExceptWithFilter. (cherry picked from commit ca04c3ff2387bf0a4308a4b010154e6761827278) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 31 January 2018, 04:06:08 UTC
6ed0d57 [SPARK-23276][SQL][TEST] Enable UDT tests in (Hive)OrcHadoopFsRelationSuite ## What changes were proposed in this pull request? Like Parquet, ORC test suites should enable UDT tests. ## How was this patch tested? Pass the Jenkins with newly enabled test cases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #20440 from dongjoon-hyun/SPARK-23276. (cherry picked from commit 77866167330a665e174ae08a2f8902ef9dc3438b) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 31 January 2018, 01:14:27 UTC
7b9fe08 [SPARK-23261][PYSPARK][BACKPORT-2.3] Rename Pandas UDFs This PR is to backport https://github.com/apache/spark/pull/20428 to Spark 2.3 without adding the changes regarding `GROUPED AGG PANDAS UDF` --- ## What changes were proposed in this pull request? Rename the public APIs and names of pandas udfs. - `PANDAS SCALAR UDF` -> `SCALAR PANDAS UDF` - `PANDAS GROUP MAP UDF` -> `GROUPED MAP PANDAS UDF` ## How was this patch tested? The existing tests Author: gatorsmile <gatorsmile@gmail.com> Closes #20439 from gatorsmile/backport2.3. 31 January 2018, 01:08:30 UTC
f4802dc [SPARK-23275][SQL] hive/tests have been failing when run locally on the laptop (Mac) with OOM ## What changes were proposed in this pull request? Hive tests have been failing when run locally (macOS) after a recent change in trunk. After the tests run for some time, they fail with an OOM: `Error: unable to create new native thread`. I noticed the thread count goes all the way up to 2000+, after which we start getting these OOM errors. Most of the threads seem to be related to the connection pool in the Hive metastore (BoneCP-xxxxx-xxxx). This behaviour change started after we made the following change to HiveClientImpl.reset() ```scala def reset(): Unit = withHiveState { try { // code } finally { runSqlHive("USE default") // ===> this is causing the issue } } ``` I am proposing to temporarily back out part of the fix made for SPARK-23000 to resolve this issue while we work out the exact reason for this sudden increase in thread counts. ## How was this patch tested? Ran hive/test multiple times on different machines. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #20441 from dilipbiswal/hive_tests. (cherry picked from commit 58fcb5a95ee0b91300138cd23f3ce2165fab597f) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 30 January 2018, 22:11:15 UTC
2e0c1e5 [SPARK-23267][SQL] Increase spark.sql.codegen.hugeMethodLimit to 65535 ## What changes were proposed in this pull request? We still saw the performance regression introduced by `spark.sql.codegen.hugeMethodLimit` in our internal workloads. There are two major issues in the current solution. - The size of the compiled bytecode is not identical to the bytecode size of the method, so the detection is still not accurate. - The bytecode size of a single operator (e.g., `SerializeFromObject`) could still exceed the 8K limit; we saw the performance regression in that scenario. Since we are close to the 2.3 release, we decided to increase the limit to 64K to avoid the perf regression. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #20434 from gatorsmile/revertConf. (cherry picked from commit 31c00ad8b090d7eddc4622e73dc4440cd32624de) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 30 January 2018, 19:33:37 UTC
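For reference, a minimal sketch of setting the new default explicitly, assuming an existing `SparkSession` named `spark`:

```scala
// 65535 matches the JVM's 64KB method-size limit; plans whose generated
// method exceeds this threshold fall back to non-whole-stage execution.
spark.conf.set("spark.sql.codegen.hugeMethodLimit", 65535)
```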
7d96dc1 [SPARK-23222][SQL] Make DataFrameRangeSuite not flaky ## What changes were proposed in this pull request? It is reported that the test `Cancelling stage in a query with Range` in `DataFrameRangeSuite` has failed a few times in unrelated PRs, and I saw it in my own PR as well. The test is not highly flaky; it only fails occasionally. Based on how the test works, I guess that is because `range` finishes before the listener calls `cancelStage`. This patch increases the range from `1000000000L` to `100000000000L`, counts the range in one partition, and reduces the `interval` for checking the stage id. Hopefully this makes the test no longer flaky. ## How was this patch tested? The modified tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #20431 from viirya/SPARK-23222. (cherry picked from commit 84bcf9dc88ffeae6fba4cfad9455ad75bed6e6f6) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 January 2018, 13:00:44 UTC
d3e623b [SPARK-23260][SPARK-23262][SQL] several data source v2 naming cleanup ## What changes were proposed in this pull request? All other classes in the reader/writer package don't have `V2` in their names, and the streaming reader/writer don't have `V2` either. It's more consistent to remove `V2` from `DataSourceV2Reader` and `DataSourceV2Writer`. Also rename `DataSourceV2Option` to remove the `V2`; we should only have `V2` in the root interface, `DataSourceV2`. This PR also fixes some places where a mix-in interface doesn't extend the interface it is meant to mix into. ## How was this patch tested? Existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #20427 from cloud-fan/ds-v2. (cherry picked from commit 0a9ac0248b6514a1e83ff7e4c522424f01b8b78d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 January 2018, 11:43:42 UTC
107d4e2 [SPARK-23138][ML][DOC] Multiclass logistic regression summary example and user guide ## What changes were proposed in this pull request? User guide and examples are updated to reflect multiclass logistic regression summary which was added in [SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139). I did not make a separate summary example, but added the summary code to the multiclass example that already existed. I don't see the need for a separate example for the summary. ## How was this patch tested? Docs and examples only. Ran all examples locally using spark-submit. Author: sethah <shendrickson@cloudera.com> Closes #20332 from sethah/multiclass_summary_example. (cherry picked from commit 5056877e8bea56dd0f4dc9e3385669e1e78b2925) Signed-off-by: Nick Pentreath <nickp@za.ibm.com> 30 January 2018, 07:02:31 UTC
bb7502f [SPARK-23157][SQL] Explain restriction on column expression in withColumn() ## What changes were proposed in this pull request? It's not obvious from the comments that any added column must be a function of the dataset that we are adding it to. Add a comment to that effect to Scala, Python and R Data* methods. Author: Henry Robinson <henry@cloudera.com> Closes #20429 from henryr/SPARK-23157. (cherry picked from commit 8b983243e45dfe2617c043a3229a7d87f4c4b44b) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 30 January 2018, 06:20:09 UTC
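A small sketch of the restriction being documented, assuming a `SparkSession` named `spark` (the DataFrame names are illustrative): the column expression passed to `withColumn` must be built from the Dataset it is being added to (or from literals), not from a different Dataset.

```scala
import org.apache.spark.sql.functions.lit

val df1 = spark.range(3).toDF("id")
val df2 = spark.range(3).toDF("other")

df1.withColumn("id_plus_one", df1("id") + 1)   // OK: expression references df1 itself
df1.withColumn("answer", lit(42))              // OK: literal, no attribute references
// df1.withColumn("bad", df2("other"))         // not supported: references a different Dataset
```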
a81ace1 [SPARK-23207][SQL][FOLLOW-UP] Don't perform local sort for DataFrame.repartition(1) ## What changes were proposed in this pull request? In `ShuffleExchangeExec`, we don't need to insert extra local sort before round-robin partitioning, if the new partitioning has only 1 partition, because under that case all output rows go to the same partition. ## How was this patch tested? The existing test cases. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #20426 from jiangxb1987/repartition1. (cherry picked from commit b375397b1678b7fe20a0b7f87a7e8b37ae5646ef) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 January 2018, 03:41:21 UTC
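In other words (a tiny sketch, assuming a `SparkSession` named `spark`): when everything lands in a single partition, row ordering cannot affect which partition a row goes to, so the defensive local sort added for SPARK-23207 is skipped.

```scala
// All rows go to the one and only output partition, so no local sort is needed.
val single = spark.range(100).repartition(1)
single.rdd.getNumPartitions   // 1
```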
2858eaa [SPARK-22221][SQL][FOLLOWUP] Externalize spark.sql.execution.arrow.maxRecordsPerBatch ## What changes were proposed in this pull request? This is a followup to #19575 which added a section on setting max Arrow record batches and this will externalize the conf that was referenced in the docs. ## How was this patch tested? NA Author: Bryan Cutler <cutlerb@gmail.com> Closes #20423 from BryanCutler/arrow-user-doc-externalize-maxRecordsPerBatch-SPARK-22221. (cherry picked from commit f235df66a4754cbb64d5b7b5cfd5a52bdd243b8a) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 30 January 2018, 01:38:14 UTC
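A short sketch of using the now-externalized conf, assuming an existing `SparkSession` named `spark` (the value 5000 is an arbitrary example):

```scala
// Limit how many rows are written into each Arrow record batch when
// converting Spark data to Arrow (e.g. for toPandas in PySpark).
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 5000)
```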
75131ee [SPARK-23209][core] Allow credential manager to work when Hive not available. The JVM seems to be doing early binding of classes that the Hive provider depends on, causing an error to be thrown before it was caught by the code in the class. The fix wraps the creation of the provider in a try..catch so that the provider can be ignored when dependencies are missing. Added a unit test (which fails without the fix), and also tested that getting tokens still works in a real cluster. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20399 from vanzin/SPARK-23209. (cherry picked from commit b834446ec1338349f6d974afd96f677db3e8fd1a) Signed-off-by: Imran Rashid <irashid@cloudera.com> 29 January 2018, 22:09:28 UTC
4386310 [SPARK-22916][SQL][FOLLOW-UP] Update the Description of Join Selection ## What changes were proposed in this pull request? This PR is to update the description of the join algorithm changes. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #20420 from gatorsmile/followUp22916. (cherry picked from commit e30b34f7bd9a687eb43d636fffeb98fe235fcbf4) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 29 January 2018, 18:29:50 UTC
6588e00 [SPARK-22221][DOCS] Adding User Documentation for Arrow ## What changes were proposed in this pull request? Adding user facing documentation for working with Arrow in Spark Author: Bryan Cutler <cutlerb@gmail.com> Author: Li Jin <ice.xelloss@gmail.com> Author: hyukjinkwon <gurwls223@gmail.com> Closes #19575 from BryanCutler/arrow-user-docs-SPARK-2221. (cherry picked from commit 0d60b3213fe9a7ae5e9b208639f92011fdb2ca32) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 29 January 2018, 18:25:54 UTC
d68198d [SPARK-23223][SQL] Make stacking dataset transforms more performant ## What changes were proposed in this pull request? It is a common pattern to apply multiple transforms to a `Dataset` (using `Dataset.withColumn`, for example). This is currently quite expensive because we run `CheckAnalysis` on the full plan and create an encoder for each intermediate `Dataset`. This PR extends the usage of the `AnalysisBarrier` to include `CheckAnalysis`. By doing this we hide the already analyzed plan from `CheckAnalysis`, because the barrier is a `LeafNode`. The `AnalysisBarrier` is in the `FinishAnalysis` phase of the optimizer. We also make binding the `Dataset` encoder lazy. The bound encoder is only needed when we materialize the dataset. ## How was this patch tested? Existing tests should cover this. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #20402 from hvanhovell/SPARK-23223. (cherry picked from commit 2d903cf9d3a827e54217dfc9f1e4be99d8204387) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 29 January 2018, 17:01:13 UTC
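The pattern this change targets looks roughly like the sketch below (assuming a `SparkSession` named `spark`); before the change, each `withColumn` call re-ran `CheckAnalysis` over the whole growing plan and built a fresh encoder for the intermediate Dataset.

```scala
import spark.implicits._

val base = spark.range(1000).toDF("c0")
// 100 stacked transforms; with the AnalysisBarrier extension, the already
// analyzed prefix of the plan is hidden from CheckAnalysis on each step.
val stacked = (1 to 100).foldLeft(base) { (df, i) =>
  df.withColumn(s"c$i", $"c0" + i)
}
stacked.count()
```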
4059454 [SPARK-23199][SQL] improved Removes repetition from group expressions in Aggregate ## What changes were proposed in this pull request? Currently, all Aggregate operations go through RemoveRepetitionFromGroupExpressions, but when there are no group expressions, or no duplicates among the group expressions, we do not need to copy the logical plan. ## How was this patch tested? The existing test cases. Author: caoxuewen <cao.xuewen@zte.com.cn> Closes #20375 from heary-cao/RepetitionGroupExpressions. (cherry picked from commit 54dd7cf4ef921bc9dc12f99cfb90d1da57939901) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 29 January 2018, 16:56:52 UTC
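A brief sketch of the kind of query the rule rewrites versus the common case that no longer needs a plan copy (assuming a `SparkSession` named `spark`):

```scala
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("k", "v")
df.groupBy($"k", $"k").sum("v")   // duplicate grouping key: the rule removes the repetition
df.groupBy($"k").sum("v")         // already distinct keys: with this change, no plan copy
```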
de66aba [SPARK-23219][SQL] Rename ReadTask to DataReaderFactory in data source v2 ## What changes were proposed in this pull request? Currently we have `ReadTask` in the data source v2 reader, while the writer has `DataWriterFactory`. To make the naming consistent, this PR renames `ReadTask` to `DataReaderFactory`. ## How was this patch tested? Unit tests Author: Wang Gengliang <ltnwgl@gmail.com> Closes #20397 from gengliangwang/rename. (cherry picked from commit badf0d0e0d1d9aa169ed655176ce9ae684d3905d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 29 January 2018, 16:51:14 UTC
8229e15 [SPARK-23238][SQL] Externalize SQLConf configurations exposed in documentation ## What changes were proposed in this pull request? This PR proposes to expose a few internal configurations that are already referenced in the documentation. It also fixes the description for `spark.sql.execution.arrow.enabled`. It's quite self-explanatory. ## How was this patch tested? N/A Author: hyukjinkwon <gurwls223@gmail.com> Closes #20403 from HyukjinKwon/minor-doc-arrow. (cherry picked from commit 39d2c6b03488895a0acb1dd3c46329db00fdd357) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 29 January 2018, 12:10:21 UTC
5dda5db [SPARK-23020] Ignore Flaky Test: SparkLauncherSuite.testInProcessLauncher in Spark 2.3 29 January 2018, 07:16:53 UTC
588b969 [SPARK-23196] Unify continuous and microbatch V2 sinks ## What changes were proposed in this pull request? Replace streaming V2 sinks with a unified StreamWriteSupport interface, with a shim to use it with microbatch execution. Add a new SQL config to use for disabling V2 sinks, falling back to the V1 sink implementation. ## How was this patch tested? Existing tests, which in the case of Kafka (the only existing continuous V2 sink) now use V2 for microbatch. Author: Jose Torres <jose@databricks.com> Closes #20369 from jose-torres/streaming-sink. (cherry picked from commit 49b0207dc9327989c72700b4d04d2a714c92e159) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 29 January 2018, 05:10:51 UTC
7ca2cd4 [SPARK-23250][DOCS] Typo in JavaDoc/ScalaDoc for DataFrameWriter ## What changes were proposed in this pull request? Fix a typo in the ScalaDoc for DataFrameWriter - it originally stated "This is applicable for all file-based data sources (e.g. Parquet, JSON) staring Spark 2.1.0", which should be "starting with Spark 2.1.0". ## How was this patch tested? Checked the corrected spelling in the ScalaDoc. Author: CCInCharge <charles.l.chen.clc@gmail.com> Closes #20417 from CCInCharge/master. (cherry picked from commit 686a622c93207564635569f054e1e6c921624e96) Signed-off-by: Sean Owen <sowen@cloudera.com> 28 January 2018, 20:55:49 UTC
8ff0cc4 [SPARK-23248][PYTHON][EXAMPLES] Relocate module docstrings to the top in PySpark examples ## What changes were proposed in this pull request? This PR proposes to relocate the docstrings in modules of examples to the top. Seems these are mistakes. So, for example, the below codes ```python >>> help(aft_survival_regression) ``` shows the module docstrings for examples as below: **Before** ``` Help on module aft_survival_regression: NAME aft_survival_regression ... DESCRIPTION # Licensed to the Apache Software Foundation (ASF) under one or more # contributor license agreements. See the NOTICE file distributed with # this work for additional information regarding copyright ownership. # The ASF licenses this file to You under the Apache License, Version 2.0 # (the "License"); you may not use this file except in compliance with # the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. # ... (END) ``` **After** ``` Help on module aft_survival_regression: NAME aft_survival_regression ... DESCRIPTION An example demonstrating aft survival regression. Run with: bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py (END) ``` ## How was this patch tested? Manually checked. Author: hyukjinkwon <gurwls223@gmail.com> Closes #20416 from HyukjinKwon/module-docstring-example. (cherry picked from commit b8c32dc57368e49baaacf660b7e8836eedab2df7) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 28 January 2018, 01:33:24 UTC
3b6fc28 [SPARK-23233][PYTHON] Reset the cache in asNondeterministic to set deterministic properly ## What changes were proposed in this pull request? Reproducer: ```python from pyspark.sql.functions import udf f = udf(lambda x: x) spark.range(1).select(f("id")) # cache JVM UDF instance. f = f.asNondeterministic() spark.range(1).select(f("id"))._jdf.logicalPlan().projectList().head().deterministic() ``` It should return `False` but the current master returns `True`. This seems to be because we cache the JVM UDF instance and keep reusing it even after `deterministic` is disabled, once it has been called. ## How was this patch tested? Manually tested. I am not sure whether to add a test that makes many JVM accesses to internal APIs; let me know if anyone feels it is needed and I will add it. Author: hyukjinkwon <gurwls223@gmail.com> Closes #20409 from HyukjinKwon/SPARK-23233. (cherry picked from commit 3227d14feb1a65e95a2bf326cff6ac95615cc5ac) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 27 January 2018, 19:26:28 UTC
65600bf [SPARK-23245][SS][TESTS] Don't access `lastExecution.executedPlan` in StreamTest ## What changes were proposed in this pull request? `lastExecution.executedPlan` is lazy val so accessing it in StreamTest may need to acquire the lock of `lastExecution`. It may be waiting forever when the streaming thread is holding it and running a continuous Spark job. This PR changes to check if `s.lastExecution` is null to avoid accessing `lastExecution.executedPlan`. ## How was this patch tested? Jenkins Author: Jose Torres <jose@databricks.com> Closes #20413 from zsxwing/SPARK-23245. (cherry picked from commit 6328868e524121bd00595959d6d059f74e038a6b) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 27 January 2018, 07:06:11 UTC
234c854 [MINOR][SS][DOC] Fix `Trigger` Scala/Java doc examples ## What changes were proposed in this pull request? This PR fixes Scala/Java doc examples in `Trigger.java`. ## How was this patch tested? N/A. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #20401 from dongjoon-hyun/SPARK-TRIGGER. (cherry picked from commit e7bc9f0524822a08d857c3a5ba57119644ceae85) Signed-off-by: Sean Owen <sowen@cloudera.com> 27 January 2018, 00:57:45 UTC
20c0efe [SPARK-23214][SQL] cached data should not carry extra hint info ## What changes were proposed in this pull request? This is a regression introduced by https://github.com/apache/spark/pull/19864. When we look up the cache, we should not carry over the hint info: the cache entry might have been added by a plan with hint info, while the input plan for this lookup may have no hint info, or different hint info. ## How was this patch tested? A new test. Author: Wenchen Fan <wenchen@databricks.com> Closes #20394 from cloud-fan/cache. (cherry picked from commit 5b5447c68ac79715e2256e487e1212861cdab1fc) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 27 January 2018, 00:47:01 UTC
7aaf23c [SPARK-23242][SS][TESTS] Don't run tests in KafkaSourceSuiteBase twice ## What changes were proposed in this pull request? KafkaSourceSuiteBase should be an abstract class; otherwise KafkaSourceSuiteBase itself will also run. ## How was this patch tested? Jenkins Author: Shixiong Zhu <zsxwing@gmail.com> Closes #20412 from zsxwing/SPARK-23242. (cherry picked from commit 073744985f439ca90afb9bd0bbc1332c53f7b4bb) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 27 January 2018, 00:10:04 UTC
30d16e1 [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame could lead to incorrect answers ## What changes were proposed in this pull request? Currently, shuffle repartition uses RoundRobinPartitioning, and the generated result is nondeterministic because the sequence of input rows is not determined. The bug can be triggered when there is a repartition call following a shuffle (which would lead to non-deterministic row ordering), as the pattern shows below: upstream stage -> repartition stage -> result stage (-> indicates a shuffle) When one of the executor processes goes down, some tasks on the repartition stage will be retried and generate inconsistent ordering, and some tasks of the result stage will be retried generating different data. The following code returns 931532, instead of 1000000: ``` import scala.sys.process._ import org.apache.spark.TaskContext val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x => x }.repartition(200).map { x => if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) { throw new Exception("pkill -f java".!!) } x } res.distinct().count() ``` In this PR, we propose the most straightforward way to fix this problem: performing a local sort before partitioning. Once the input row ordering is deterministic, the function from rows to partitions is fully deterministic too. The downside of the approach is that with the extra local sort inserted, the performance of repartition() will go down, so we add a new config named `spark.sql.execution.sortBeforeRepartition` to control whether this patch is applied. The patch is enabled by default to be safe, but users may choose to turn it off manually to avoid the performance regression. This patch also changes the output row ordering of repartition(), which leads to a number of test-case failures because they compare results directly. ## How was this patch tested? Added a unit test in ExchangeSuite. With this patch (and `spark.sql.execution.sortBeforeRepartition` set to true), the following query returns 1000000: ``` import scala.sys.process._ import org.apache.spark.TaskContext spark.conf.set("spark.sql.execution.sortBeforeRepartition", "true") val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x => x }.repartition(200).map { x => if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) { throw new Exception("pkill -f java".!!) } x } res.distinct().count() res7: Long = 1000000 ``` Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #20393 from jiangxb1987/shuffle-repartition. (cherry picked from commit 94c67a76ec1fda908a671a47a2a1fa63b3ab1b06) Signed-off-by: Sameer Agarwal <sameerag@apache.org> 26 January 2018, 23:01:15 UTC
f5911d4 Revert "[SPARK-22797][PYSPARK] Bucketizer support multi-column" This reverts commit ab1b5d921b395cb7df3a3a2c4a7e5778d98e6f01. 26 January 2018, 21:17:45 UTC
ca3613b [SPARK-23218][SQL] simplify ColumnVector.getArray ## What changes were proposed in this pull request? `ColumnVector` is very flexible about how to implement array type. As a result `ColumnVector` has 3 abstract methods for array type: `arrayData`, `getArrayOffset`, `getArrayLength`. For example, in `WritableColumnVector` we use the first child vector as the array data vector, and store offsets and lengths in 2 arrays in the parent vector. `ArrowColumnVector` has a different implementation. This PR simplifies `ColumnVector` by using only one abstract method for array type: `getArray`. ## How was this patch tested? existing tests. rerun `ColumnarBatchBenchmark`, there is no performance regression. Author: Wenchen Fan <wenchen@databricks.com> Closes #20395 from cloud-fan/vector. (cherry picked from commit dd8e257d1ccf20f4383dd7f30d634010b176f0d3) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 26 January 2018, 17:17:14 UTC
ab1b5d9 [SPARK-22797][PYSPARK] Bucketizer support multi-column ## What changes were proposed in this pull request? Support multi-column Bucketizer on the Python side. ## How was this patch tested? Existing tests and added tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #19892 from zhengruifeng/20542_py. (cherry picked from commit c22eaa94e85aaac649566495dcf763a5de3c8d06) Signed-off-by: Nick Pentreath <nickp@za.ibm.com> 26 January 2018, 10:28:39 UTC
d6cdc69 [SPARK-22799][ML] Bucketizer should throw exception if single- and multi-column params are both set ## What changes were proposed in this pull request? Currently there is a mixed situation when both single- and multi-column are supported. In some cases exceptions are thrown, in others only a warning log is emitted. In this discussion https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049, the decision was to throw an exception. The PR throws an exception in `Bucketizer`, instead of logging a warning. ## How was this patch tested? modified UT Author: Marco Gaido <marcogaido91@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #19993 from mgaido91/SPARK-22799. (cherry picked from commit cd3956df0f96dd416b6161bf7ce2962e06d0a62e) Signed-off-by: Nick Pentreath <nickp@za.ibm.com> 26 January 2018, 10:23:27 UTC
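A hedged sketch of the behavior being enforced, assuming the 2.3 multi-column params (`inputCols`, `outputCols`, `splitsArray`); mixing them with the single-column `inputCol` is now expected to raise an exception rather than just log a warning.

```scala
import org.apache.spark.ml.feature.Bucketizer

val bucketizer = new Bucketizer()
  .setInputCol("x")                    // single-column param ...
  .setInputCols(Array("x", "y"))       // ... combined with the multi-column params
  .setOutputCols(Array("xb", "yb"))
  .setSplitsArray(Array(Array(0.0, 1.0, 2.0), Array(0.0, 5.0, 10.0)))

// bucketizer.transform(df)            // expected to throw: both column sets are set
```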
fdf140e [SPARK-23020][CORE] Fix race in SparkAppHandle cleanup, again. Third time is the charm? There was still a race that was left in previous attempts. If the handle closes the connection, the close() implementation would clean up state that would prevent the thread from waiting on the connection thread to finish. That could cause the race causing the test flakiness reported in the bug. The fix is to move the "wait for connection thread" code to a separate close method that is used by the handle; that also simplifies the code a bit and makes it also easier to follow. I included an unrelated, but correct, change to a YARN test so that it triggers when the PR is built. Tested by inserting a sleep in the connection thread to mimic the race; test failed reliably with the sleep, passes now. (Sleep not included in the patch.) Also ran YARN tests to make sure. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20388 from vanzin/SPARK-23020. (cherry picked from commit 70a68b328b856c17eb22cc86fee0ebe8d64f8825) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 January 2018, 03:58:37 UTC
87d128f [SPARK-23205][ML] Update ImageSchema.readImages to correctly set alpha values for four-channel images ## What changes were proposed in this pull request? When parsing raw image data in ImageSchema.decode(), we use a [java.awt.Color](https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)) constructor that sets alpha = 255, even for four-channel images (which may have different alpha values). This PR fixes this issue & adds a unit test to verify correctness of reading four-channel images. ## How was this patch tested? Updates an existing unit test ("readImages pixel values test" in `ImageSchemaSuite`) to also verify correctness when reading a four-channel image. Author: Sid Murching <sid.murching@databricks.com> Closes #20389 from smurching/image-schema-bugfix. (cherry picked from commit 7bd46d9871567597216cc02e1dc72ff5806ecdf8) Signed-off-by: Sean Owen <sowen@cloudera.com> 26 January 2018, 00:15:36 UTC
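For context, a small standalone illustration of the two `java.awt.Color` constructors involved (not the actual `ImageSchema` code): the single-argument constructor discards the alpha byte, while the two-argument form preserves it.

```scala
import java.awt.Color

val argb = 0x80FF0000                   // half-transparent red in ARGB layout
val dropsAlpha = new Color(argb)        // alpha forced to 255
val keepsAlpha = new Color(argb, true)  // hasalpha = true preserves alpha = 0x80
// dropsAlpha.getAlpha == 255, keepsAlpha.getAlpha == 128
```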
26a8b4e [SPARK-23032][SQL] Add a per-query codegenStageId to WholeStageCodegenExec ## What changes were proposed in this pull request? **Proposal** Add a per-query ID to the codegen stages as represented by `WholeStageCodegenExec` operators. This ID will be used in - the explain output of the physical plan, and in - the generated class name. Specifically, this ID will be stable within a query, counting up from 1 in depth-first post-order for all the `WholeStageCodegenExec` inserted into a plan. The ID value 0 is reserved for "free-floating" `WholeStageCodegenExec` objects, which may have been created for one-off purposes, e.g. for fallback handling of codegen stages that failed to codegen the whole stage and wishes to codegen a subset of the children operators (as seen in `org.apache.spark.sql.execution.FileSourceScanExec#doExecute`). Example: for the following query: ```scala scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1) scala> val df1 = spark.range(10).select('id as 'x, 'id + 1 as 'y).orderBy('x).select('x + 1 as 'z, 'y) df1: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint] scala> val df2 = spark.range(5) df2: org.apache.spark.sql.Dataset[Long] = [id: bigint] scala> val query = df1.join(df2, 'z === 'id) query: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint ... 1 more field] ``` The explain output before the change is: ```scala scala> query.explain == Physical Plan == *SortMergeJoin [z#9L], [id#13L], Inner :- *Sort [z#9L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(z#9L, 200) : +- *Project [(x#3L + 1) AS z#9L, y#4L] : +- *Sort [x#3L ASC NULLS FIRST], true, 0 : +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200) : +- *Project [id#0L AS x#3L, (id#0L + 1) AS y#4L] : +- *Range (0, 10, step=1, splits=8) +- *Sort [id#13L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(id#13L, 200) +- *Range (0, 5, step=1, splits=8) ``` Note how codegen'd operators are annotated with a prefix `"*"`. See how the `SortMergeJoin` operator and its direct children `Sort` operators are adjacent and all annotated with the `"*"`, so it's hard to tell they're actually in separate codegen stages. and after this change it'll be: ```scala scala> query.explain == Physical Plan == *(6) SortMergeJoin [z#9L], [id#13L], Inner :- *(3) Sort [z#9L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(z#9L, 200) : +- *(2) Project [(x#3L + 1) AS z#9L, y#4L] : +- *(2) Sort [x#3L ASC NULLS FIRST], true, 0 : +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200) : +- *(1) Project [id#0L AS x#3L, (id#0L + 1) AS y#4L] : +- *(1) Range (0, 10, step=1, splits=8) +- *(5) Sort [id#13L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(id#13L, 200) +- *(4) Range (0, 5, step=1, splits=8) ``` Note that the annotated prefix becomes `"*(id) "`. See how the `SortMergeJoin` operator and its direct children `Sort` operators have different codegen stage IDs. It'll also show up in the name of the generated class, as a suffix in the format of `GeneratedClass$GeneratedIterator$id`. 
For example, note how `GeneratedClass$GeneratedIteratorForCodegenStage3` and `GeneratedClass$GeneratedIteratorForCodegenStage6` in the following stack trace corresponds to the IDs shown in the explain output above: ``` "Executor task launch worker for task 42412957" daemon prio=5 tid=0x58 nid=NA runnable java.lang.Thread.State: RUNNABLE at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.sort_addToSorter$(generated.java:32) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(generated.java:41) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$9$$anon$1.hasNext(WholeStageCodegenExec.scala:494) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.findNextInnerJoinRows$(generated.java:42) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(generated.java:101) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$2.hasNext(WholeStageCodegenExec.scala:513) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:828) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:828) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) ``` **Rationale** Right now, the codegen from Spark SQL lacks the means to differentiate between a couple of things: 1. It's hard to tell which physical operators are in the same WholeStageCodegen stage. Note that this "stage" is a separate notion from Spark's RDD execution stages; this one is only to delineate codegen units. There can be adjacent physical operators that are both codegen'd but are in separate codegen stages. Some of this is due to hacky implementation details, such as the case with `SortMergeJoin` and its `Sort` inputs -- they're hard coded to be split into separate stages although both are codegen'd. When printing out the explain output of the physical plan, you'd only see the codegen'd physical operators annotated with a preceding star (`'*'`) but would have no way to figure out if they're in the same stage. 2. Performance/error diagnosis The generated code has class/method names that are hard to differentiate between queries or even between codegen stages within the same query. If we use a Java-level profiler to collect profiles, or if we encounter a Java-level exception with a stack trace in it, it's really hard to tell which part of a query it's at. 
By introducing a per-query codegen stage ID, we'd at least be able to know in which codegen stage (and in turn, which group of physical operators) a profile tick or an exception happened. This proposal uses a per-query ID because it's stable within a query, so that multiple runs of the same query will see the same resulting IDs. This both helps understandability for users and plays well with the codegen cache in Spark SQL, which uses the generated source code as the key. The downside to using per-query IDs as opposed to a per-session or globally incrementing ID is of course that we can't tell different query runs apart with this ID alone. But for now I believe this is a good enough tradeoff. ## How was this patch tested? Existing tests. This PR does not involve any runtime behavior changes other than some name changes. The SQL query test suites that compare explain outputs have been updated to ignore the newly added `codegenStageId`. Author: Kris Mok <kris.mok@databricks.com> Closes #20224 from rednaxelafx/wsc-codegenstageid. (cherry picked from commit e57f394818b0a62f99609e1032fede7e981f306f) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 26 January 2018, 00:11:42 UTC
2f65c20 [SPARK-23081][PYTHON] Add colRegex API to PySpark ## What changes were proposed in this pull request? Add colRegex API to PySpark ## How was this patch tested? add a test in sql/tests.py Author: Huaxin Gao <huaxing@us.ibm.com> Closes #20390 from huaxingao/spark-23081. (cherry picked from commit 8480c0c57698b7dcccec5483d67b17cf2c7527ed) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 25 January 2018, 22:51:01 UTC
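The PySpark API mirrors `Dataset.colRegex`, added to the Scala API in 2.3; a minimal sketch of the semantics (assuming a `SparkSession` named `spark`; the backtick-quoted pattern form follows the API docs):

```scala
import spark.implicits._

val df = Seq((1, 2, 3)).toDF("Col1", "Col2", "Col3")
// A backtick-quoted name is treated as a regular expression over column names.
df.select(df.colRegex("`Col[13]`")).show()   // selects Col1 and Col3
```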
8866f9c [SPARK-23112][DOC] Add highlights and migration guide for 2.3 ## What changes were proposed in this pull request? Update ML user guide with highlights and migration guide for `2.3`. ## How was this patch tested? Doc only. Author: Nick Pentreath <nickp@za.ibm.com> Closes #20363 from MLnick/SPARK-23112-ml-guide. (cherry picked from commit 8532e26f335b67b74c976712ad82c20ea6dbbf80) Signed-off-by: Nick Pentreath <nickp@za.ibm.com> 25 January 2018, 13:01:50 UTC
c79e771 [SPARK-21717][SQL] Decouple consume functions of physical operators in whole-stage codegen ## What changes were proposed in this pull request? It has been observed in SPARK-21603 that whole-stage codegen suffers performance degradation if the generated functions are too long to be optimized by the JIT. We basically produce a single function incorporating the generated code from all physical operators in the whole stage. Thus, the generated function can grow beyond the threshold above which JIT optimization is no longer applied. This patch decouples the row-consuming logic of physical operators to avoid one giant row-processing function. ## How was this patch tested? Added tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18931 from viirya/SPARK-21717. (cherry picked from commit d20bbc2d87ae6bd56d236a7c3d036b52c5f20ff5) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 January 2018, 11:50:19 UTC
e66c66c [SPARK-23163][DOC][PYTHON] Sync ML Python API with Scala ## What changes were proposed in this pull request? This syncs the ML Python API with Scala for differences found after the 2.3 QA audit. ## How was this patch tested? NA Author: Bryan Cutler <cutlerb@gmail.com> Closes #20354 from BryanCutler/pyspark-ml-doc-sync-23163. (cherry picked from commit 39ee2acf96f1e1496cff8e4d2614d27fca76d43b) Signed-off-by: Felix Cheung <felixcheung@apache.org> 25 January 2018, 09:48:34 UTC
abd3e1b [SPARK-23208][SQL] Fix code generation for complex create array (related) expressions ## What changes were proposed in this pull request? The `GenArrayData.genCodeToCreateArrayData` produces illegal java code when code splitting is enabled. This is used in `CreateArray` and `CreateMap` expressions for complex object arrays. This issue is caused by a typo. ## How was this patch tested? Added a regression test in `complexTypesSuite`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #20391 from hvanhovell/SPARK-23208. (cherry picked from commit e29b08add92462a6505fef966629e74ba30e994e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 January 2018, 08:41:12 UTC
0126952 [SPARK-23129][CORE] Make deserializeStream of DiskMapIterator init lazily ## What changes were proposed in this pull request? Currently, the deserializeStream in ExternalAppendOnlyMap#DiskMapIterator is initialized when the DiskMapIterator instance is created. This causes memory overhead when ExternalAppendOnlyMap spills many times. We can avoid this by initializing deserializeStream the first time it is used. This patch makes deserializeStream initialize lazily. ## How was this patch tested? Existing tests Author: zhoukang <zhoukang199191@gmail.com> Closes #20292 from caneGuy/zhoukang/lay-diskmapiterator. (cherry picked from commit 45b4bbfddc18a77011c3bc1bfd71b2cd3466443c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 January 2018, 07:25:46 UTC
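A minimal sketch of the lazy-initialization idea (class and field names are illustrative, not the actual `ExternalAppendOnlyMap` code): the stream and its buffers are only allocated the first time the iterator actually reads from the spill file.

```scala
import java.io.{BufferedInputStream, File, FileInputStream, InputStream}

class DiskIterator(spillFile: File) {
  // With eager initialization, every spilled file holds stream buffers from the
  // moment the iterator is constructed; `lazy val` defers that allocation until
  // the first read actually touches this spill.
  private lazy val deserializeStream: InputStream =
    new BufferedInputStream(new FileInputStream(spillFile))

  def readByte(): Int = deserializeStream.read()
}
```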