https://github.com/apache/spark

a44880b Preparing Spark release v2.4.2-rc1 18 April 2019, 13:24:38 UTC
2d276c0 [SPARK-27403][SQL] Fix `updateTableStats` to update table stats always with new stats or None ## What changes were proposed in this pull request? The system should update the table stats automatically if the user sets spark.sql.statistics.size.autoUpdate.enabled to true; currently this property has no effect whether it is enabled or disabled. This feature is similar to Hive's auto-gather feature, where statistics are automatically computed by default when that feature is enabled. Reference: https://cwiki.apache.org/confluence/display/Hive/StatsDev As part of the fix, the autoSizeUpdateEnabled validation is done up front so that the system calculates the table size for the user automatically and records it in the metastore as the user expects. ## How was this patch tested? A UT was written and the change was manually verified in a cluster (unit tests plus some internal tests on a real cluster). Before fix: ![image](https://user-images.githubusercontent.com/12999161/55688682-cd8d4780-5998-11e9-85da-e1a4e34419f6.png) After fix: ![image](https://user-images.githubusercontent.com/12999161/55688654-7d15ea00-5998-11e9-973f-1f4cee27018f.png) Closes #24315 from sujith71955/master_autoupdate. Authored-by: s71955 <sujithchacko.2010@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 239082d9667a4fa4198bd9524d63c739df147e0e) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 April 2019, 16:22:57 UTC
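A hedged illustration of the behavior this entry describes (the table name and values below are made up, not taken from the patch): with auto-update enabled, a write should refresh the size statistics recorded for the table.

```scala
// Hypothetical walkthrough, assuming a Hive-backed metastore.
spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", "true")
spark.sql("CREATE TABLE stats_demo (id INT) USING parquet")
spark.sql("INSERT INTO stats_demo VALUES (1), (2), (3)")
// After the fix, the Statistics row printed here should reflect the new table size.
spark.sql("DESCRIBE TABLE EXTENDED stats_demo").show(100, truncate = false)
```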
fb47b9b [SPARK-27479][BUILD] Hide API docs for org.apache.spark.util.kvstore ## What changes were proposed in this pull request? The API docs should not include the "org.apache.spark.util.kvstore" package because they are internal private APIs. See the doc link: https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/kvstore/LevelDB.html ## How was this patch tested? N/A Closes #24386 from gatorsmile/rmDoc. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit 61feb1635217ef1d4ebceebc1e7c8829c5c11994) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 17 April 2019, 02:53:20 UTC
df9a506 [SPARK-27453] Pass partitionBy as options in DataFrameWriter Pass partitionBy columns as options and feature-flag this behavior. A new unit test. Closes #24365 from liwensun/partitionby. Authored-by: liwensun <liwen.sun@databricks.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> (cherry picked from commit 26ed65f4150db1fa37f8bfab24ac0873d2e42936) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 16 April 2019, 22:37:43 UTC
40668c5 [SPARK-27351][SQL] Wrong outputRows estimation after AggregateEstimation wit… ## What changes were proposed in this pull request? The upper bound of the group-by row count is computed by multiplying the distinct counts of the group-by columns. However, a column containing only null values has a distinct count of 0, which drives the estimated output row count to 0, which is incorrect. Ex: col1 (distinct: 2, rowCount 2), col2 (distinct: 0, rowCount 2) => group by col1, col2. Actual: output rows: 0. Expected: output rows: 2. ## How was this patch tested? A corresponding unit test has been added, plus manual tests have been done in our TPC-DS benchmark environment. Closes #24286 from pengbo/master. Lead-authored-by: pengbo <bo.peng1019@gmail.com> Co-authored-by: mingbo_pb <mingbo.pb@alibaba-inc.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit c58a4fed8d79aff9fbac9f9a33141b2edbfb0cea) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 15 April 2019, 22:37:21 UTC
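A minimal sketch of the estimation issue described above (the helper names are illustrative, not Spark's actual AggregateEstimation code): a naive product of distinct counts collapses to zero when any group-by column holds only nulls, while a bounded variant matches the expected result.

```scala
// Illustrative only: distinctCounts holds one distinct count per group-by column;
// a null-only column reports 0.
def naiveGroupByRows(distinctCounts: Seq[BigInt]): BigInt =
  distinctCounts.product

// Treat a null-only column as contributing one grouping value (NULL) and cap by rowCount.
def boundedGroupByRows(distinctCounts: Seq[BigInt], rowCount: BigInt): BigInt =
  distinctCounts.map(d => if (d == 0) BigInt(1) else d).product.min(rowCount)

// naiveGroupByRows(Seq(2, 0))       == 0  -- the wrong estimate from the report
// boundedGroupByRows(Seq(2, 0), 2)  == 2  -- matches the expected output rows
```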
ede02b6 Revert "[SPARK-23433][SPARK-25250][CORE] Later created TaskSet should learn about the finished partitions" This reverts commit db86ccb11821231d85b727fb889dec1d58b39e4d. 14 April 2019, 09:01:02 UTC
a8a2ba1 [SPARK-27394][WEBUI] Flush LiveEntity if necessary when receiving SparkListenerExecutorMetricsUpdate (backport 2.4) ## What changes were proposed in this pull request? This PR backports #24303 to 2.4. ## How was this patch tested? Jenkins Closes #24328 from zsxwing/SPARK-27394-2.4. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 10 April 2019, 22:17:04 UTC
3352803 [SPARK-27406][SQL] UnsafeArrayData serialization breaks when two machi… This PR is the branch-2.4 version for https://github.com/apache/spark/pull/24317 Closes #24324 from pengbo/SPARK-27406-branch-2.4. Authored-by: mingbo_pb <mingbo.pb@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 April 2019, 09:05:11 UTC
53658ab [SPARK-27419][CORE] Avoid casting heartbeat interval to seconds (2.4) ## What changes were proposed in this pull request? Right now, because we cast the heartbeat interval to seconds, any value less than 1 second is truncated to 0. This PR just backports the heartbeat interval changes in https://github.com/apache/spark/pull/22473 from master. ## How was this patch tested? Jenkins Closes #24329 from zsxwing/SPARK-27419. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 April 2019, 01:41:36 UTC
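A hedged sketch of the truncation described in this entry (the values are illustrative):

```scala
import java.util.concurrent.TimeUnit

// Illustrative only: converting a sub-second interval to whole seconds truncates to 0,
// which effectively disables a heartbeat interval such as 500 ms.
val intervalMs = 500L
val truncated  = TimeUnit.MILLISECONDS.toSeconds(intervalMs) // 0   -- the buggy behavior
val preserved  = intervalMs                                  // 500 -- keep milliseconds instead
```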
baadfc8 [SPARK-27391][SS] Don't initialize a lazy val in ContinuousExecution job. ## What changes were proposed in this pull request? Fix a potential deadlock in ContinuousExecution by not initializing the toRDD lazy val. Closes #24301 from jose-torres/deadlock. Authored-by: Jose Torres <torres.joseph.f+github@gmail.com> Signed-off-by: Jose Torres <torres.joseph.f+github@gmail.com> (cherry picked from commit 4a5768b2a2adf87e3df278655918f72558f0b3b9) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 April 2019, 23:37:23 UTC
ecb0109 [SPARK-27390][CORE][SQL][TEST] Fix package name mismatch This PR aims to clean up package name mismatches. Pass the Jenkins. Closes #24300 from dongjoon-hyun/SPARK-27390. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 982c4c8e3cfc25822e0d755d8d1daa324e6399b8) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 April 2019, 18:51:39 UTC
5499902 [SPARK-27358][UI] Update jquery to 1.12.x to pick up security fixes Update jquery -> 1.12.4, datatables -> 1.10.18, mustache -> 2.3.12. Add missing mustache license I manually tested the UI locally with the javascript console open and didn't observe any problems or JS errors. The only 'risky' change seems to be mustache, but on reading its release notes, don't think the changes from 0.8.1 to 2.x would affect Spark's simple usage. Closes #24288 from srowen/SPARK-27358. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 23bde447976d0bb33cae67124bac476994634f04) Signed-off-by: Sean Owen <sean.owen@databricks.com> 05 April 2019, 17:59:38 UTC
835dd6a [MINOR][DOC] Fix html tag broken in configuration.md ## What changes were proposed in this pull request? This patch fixes wrong HTML tag in configuration.md which breaks the table tag. This is originally reported in dev mailing list: https://lists.apache.org/thread.html/744bdc83b3935776c8d91bf48fdf80d9a3fed3858391e60e343206f9%3Cdev.spark.apache.org%3E ## How was this patch tested? This change is one-liner and pretty obvious so I guess we may be able to skip testing. Closes #24304 from HeartSaVioR/MINOR-configuration-doc-html-tag-error. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit a840b99daf97de06c9b1b66efed0567244ec4a01) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 April 2019, 15:41:33 UTC
1a72c15 [SPARK-27216][CORE][BACKPORT-2.4] Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue ## What changes were proposed in this pull request? Back-port of #24264 to branch-2.4. HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks. But RoaringBitmap couldn't be ser/deser with unsafe KryoSerializer. It's a bug of RoaringBitmap-0.5.11 and fixed in latest version. ## How was this patch tested? Add a UT Closes #24290 from LantaoJin/SPARK-27216_BACKPORT-2.4. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> 04 April 2019, 23:22:57 UTC
bcd56d0 [SPARK-27382][SQL][TEST] Update Spark 2.4.x testing in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? Since Apache Spark 2.4.1 vote passed and is distributed into mirrors, we need to test 2.4.1. This should land on both `master` and `branch-2.4`. ## How was this patch tested? Pass the Jenkins. Closes #24292 from dongjoon-hyun/SPARK-27382. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 938d95437526e1108e6c09f09cec96e9800b2143) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 April 2019, 20:50:10 UTC
af0a4bb [SPARK-27338][CORE][FOLLOWUP] remove trailing space ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/24265 breaks the lint check, because it has trailing space. (not sure why it passed jenkins). This PR fixes it. ## How was this patch tested? N/A Closes #24289 from cloud-fan/fix. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 April 2019, 06:55:52 UTC
93c14c6 [SPARK-27338][CORE] Fix deadlock in UnsafeExternalSorter.SpillableIterator when locking both UnsafeExternalSorter.SpillableIterator and TaskMemoryManager ## What changes were proposed in this pull request? `UnsafeExternalSorter.SpillableIterator#loadNext()` takes a lock on the `UnsafeExternalSorter` and calls `freePage` once the `lastPage` is consumed, which needs to take a lock on `TaskMemoryManager`. At the same time, another MemoryConsumer using the `UnsafeExternalSorter` as part of sorting can call `allocatePage`, which needs the lock on `TaskMemoryManager` and can trigger a spill, which in turn requires the lock on `UnsafeExternalSorter` again, causing a deadlock. This is a classic deadlock situation similar to SPARK-26265. To fix this, we can move the `freePage` call in `loadNext` outside of the `synchronized` block, similar to the fix in SPARK-26265. ## How was this patch tested? Manual tests were done; a unit test will also be added. Closes #24265 from venkata91/deadlock-sorter. Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@qubole.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 6c4552c65045cfe82ed95212ee7cff684e44288b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 April 2019, 01:58:49 UTC
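The lock-ordering fix pattern described above can be sketched as follows (a hedged, simplified illustration; the class and lock names are placeholders, not the actual Spark internals):

```scala
// Illustrative sketch: never call into code that needs lockB while holding lockA,
// when another thread may take lockB first and then wait on lockA.
class SpillableIteratorSketch(sorterLock: AnyRef, memoryManagerLock: AnyRef) {
  private var pageToFree: Option[String] = None

  def loadNext(): Unit = {
    sorterLock.synchronized {
      // Advance the iterator and remember which page is exhausted,
      // but do NOT free it while still holding sorterLock.
      pageToFree = Some("lastPage")
    }
    // Free the page after sorterLock is released, so taking the memory-manager
    // lock here cannot deadlock against a concurrent spill.
    pageToFree.foreach(p => memoryManagerLock.synchronized { /* free p */ })
    pageToFree = None
  }
}
```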
ed3ffda [MINOR][DOC][SQL] Remove out-of-date doc about ORC in DataFrameReader and Writer ## What changes were proposed in this pull request? According to current status, `orc` is available even Hive support isn't enabled. This is a minor doc change to reflect it. ## How was this patch tested? Doc only change. Closes #24280 from viirya/fix-orc-doc. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit d04a7371daec4af046a35066f9664c5011162baa) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 April 2019, 16:11:27 UTC
cf6bf0f [SPARK-27346][SQL] Loosen the newline assert condition on 'examples' field in ExpressionInfo ## What changes were proposed in this pull request? I haven't tested by myself on Windows and I am not 100% sure if this is going to cause an actual problem. However, this one line: https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionInfo.java#L82 made me to investigate a lot today. Given my speculation, if Spark is built in Linux and it's executed on Windows, it looks possible for multiline strings, like, https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L146-L150 to throw an exception because the newline in the binary is `\n` but `System.lineSeparator` returns `\r\n`. I think this is not yet found because this particular codes are not released yet (see SPARK-26426). Looks just better to loosen the condition and forget about this stuff. This should be backported into branch-2.4 as well. ## How was this patch tested? N/A Closes #24274 from HyukjinKwon/SPARK-27346. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 949d71283932ba4ce50aa6b329665e0f8be7ecf1) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 April 2019, 23:28:00 UTC
55e6f7a [SPARK-26998][CORE] Remove SSL configuration from executors ## What changes were proposed in this pull request? Different SSL passwords shown up as command line argument on executor side in standalone mode: * keyStorePassword * keyPassword * trustStorePassword In this PR I've removed SSL configurations from executors. ## How was this patch tested? Existing + additional unit tests. Additionally tested with standalone mode and checked the command line arguments: ``` [gaborsomogyi:~/spark] SPARK-26998(+4/-0,3)+ ± jps 94803 CoarseGrainedExecutorBackend 94818 Jps 90149 RemoteMavenServer 91925 Nailgun 94793 SparkSubmit 94680 Worker 94556 Master 398 [gaborsomogyi:~/spark] SPARK-26998(+4/-1,3)+ ± ps -ef | egrep "94556|94680|94793|94803" 502 94556 1 0 2:02PM ttys007 0:07.39 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host gsomogyi-MBP.local --port 7077 --webui-port 8080 --properties-file conf/spark-defaults.conf 502 94680 1 0 2:02PM ttys007 0:07.27 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 --properties-file conf/spark-defaults.conf spark://gsomogyi-MBP.local:7077 502 94793 94782 0 2:02PM ttys007 0:35.52 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Dscala.usejavacp=true -Xmx1g org.apache.spark.deploy.SparkSubmit --master spark://gsomogyi-MBP.local:7077 --class org.apache.spark.repl.Main --name Spark shell spark-shell 502 94803 94680 0 2:03PM ttys007 0:05.20 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1024M -Dspark.ssl.ui.port=0 -Dspark.driver.port=60902 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler172.30.65.186:60902 --executor-id 0 --hostname 172.30.65.186 --cores 8 --app-id app-20190326140311-0000 --worker-url spark://Worker172.30.65.186:60899 502 94910 57352 0 2:05PM ttys008 0:00.00 egrep 94556|94680|94793|94803 ``` Closes #24170 from gaborgsomogyi/SPARK-26998. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 57aff93886ac7d02b88294672ce0d2495b0942b8) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 02 April 2019, 16:19:13 UTC
66dfece [SPARK-27244][CORE][TEST][FOLLOWUP] toDebugString redacts sensitive information ## What changes were proposed in this pull request? This PR is a FollowUp of https://github.com/apache/spark/pull/24196. It improves the test case by using the parameters that are being used in the actual scenarios. ## How was this patch tested? N/A Closes #24257 from gatorsmile/followupSPARK-27244. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 92b6f86f6d25abbc2abbf374e77c0b70cd1779c7) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 31 March 2019, 05:58:45 UTC
de1238a [MINOR][R] fix R project description update as per this NOTE when running CRAN check ``` The Title field should be in title case, current version then in title case: ‘R Front end for 'Apache Spark'’ ‘R Front End for 'Apache Spark'’ ``` Closes #24255 from felixcheung/rdesc. Authored-by: Felix Cheung <felixcheung_m@hotmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit fa0f791d4d9f083a45ab631a2e9f88a6b749e416) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 31 March 2019, 04:01:55 UTC
b5c099e [SPARK-27267][FOLLOWUP][BRANCH-2.4] Update hadoop-2.6 dependency manifest ## What changes were proposed in this pull request? This updates `hadoop-2.6` dependency manifest in `branch-2.4`, too. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.4-test-sbt-hadoop-2.7/351/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.4-test-sbt-hadoop-2.6/345/ ## How was this patch tested? Pass the Jenkins. Or, `dev/test-dependencies.sh`. Closes #24254 from dongjoon-hyun/SPARK-27267. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 31 March 2019, 00:46:33 UTC
0ed87bf [SPARK-27267][CORE] Update snappy to avoid error when decompressing empty serialized data (See JIRA for problem statement) Update snappy 1.1.7.1 -> 1.1.7.3 to pick up an empty-stream and Java 9 fix. There appear to be no other changes of consequence: https://github.com/xerial/snappy-java/blob/master/Milestone.md Existing tests Closes #24242 from srowen/SPARK-27267. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 2ec650d84316d5820795d44dbc3574885b358698) Signed-off-by: Sean Owen <sean.owen@databricks.com> 30 March 2019, 07:42:52 UTC
b41f912 [SPARK-27301][DSTREAM] Shorten the FileSystem cached life cycle to the cleanup method inner scope ## What changes were proposed in this pull request? The cached FileSystem's token will expire if no tokens explicitly are add into it. ```scala 19/03/28 13:40:16 INFO storage.BlockManager: Removing RDD 83189 19/03/28 13:40:16 INFO rdd.MapPartitionsRDD: Removing RDD 82860 from persistence list 19/03/28 13:40:16 INFO spark.ContextCleaner: Cleaned shuffle 6005 19/03/28 13:40:16 INFO storage.BlockManager: Removing RDD 82860 19/03/28 13:40:16 INFO scheduler.ReceivedBlockTracker: Deleting batches: 19/03/28 13:40:16 INFO scheduler.InputInfoTracker: remove old batch metadata: 1553750250000 ms 19/03/28 13:40:17 WARN security.UserGroupInformation: PriviledgedActionException as:ursHADOOP.HZ.NETEASE.COM (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 53240500 for urs) is expired, current time: 2019-03-28 13:40:17,010+0800 expected renewal time: 2019-03-28 13:39:48,523+0800 19/03/28 13:40:17 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 53240500 for urs) is expired, current time: 2019-03-28 13:40:17,010+0800 expected renewal time: 2019-03-28 13:39:48,523+0800 19/03/28 13:40:17 WARN security.UserGroupInformation: PriviledgedActionException as:ursHADOOP.HZ.NETEASE.COM (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 53240500 for urs) is expired, current time: 2019-03-28 13:40:17,010+0800 expected renewal time: 2019-03-28 13:39:48,523+0800 19/03/28 13:40:17 WARN hdfs.LeaseRenewer: Failed to renew lease for [DFSClient_NONMAPREDUCE_-1396157959_1] for 53 seconds. Will retry shortly ... 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 53240500 for urs) is expired, current time: 2019-03-28 13:40:17,010+0800 expected renewal time: 2019-03-28 13:39:48,523+0800 at org.apache.hadoop.ipc.Client.call(Client.java:1468) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy11.renewLease(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:571) at sun.reflect.GeneratedMethodAccessor40.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy12.renewLease(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:878) at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442) at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298) at java.lang.Thread.run(Thread.java:748) ``` This PR shorten the FileSystem cached life cycle to the cleanup method inner scope in case of token expiry. ## How was this patch tested? existing ut Closes #24235 from yaooqinn/SPARK-27301. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit f4c73b7c685b901dd69950e4929c65e3b8dd3a55) Signed-off-by: Sean Owen <sean.owen@databricks.com> 30 March 2019, 07:36:16 UTC
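The pattern described in the SPARK-27301 entry above, obtaining the FileSystem inside the cleanup method rather than caching one long-lived handle, can be sketched as follows (a hedged illustration; the method name and parameters are placeholders):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Resolve the FileSystem per cleanup call, so a long-lived handle cannot
// outlive its delegation token.
def cleanupOldFiles(dir: String, conf: Configuration, cutoffTimeMs: Long): Unit = {
  val path = new Path(dir)
  val fs = path.getFileSystem(conf) // obtained inside the cleanup scope, not cached
  fs.listStatus(path)
    .filter(_.getModificationTime < cutoffTimeMs)
    .foreach(status => fs.delete(status.getPath, true))
}
```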
4edc535 [SPARK-27244][CORE] Redact Passwords While Using Option logConf=true ## What changes were proposed in this pull request? When logConf is set to true, config keys that contain password were printed in cleartext in driver log. This change uses the already present redact method in Utils, to redact all the passwords based on redact pattern in SparkConf and then print the conf to driver log thus ensuring that sensitive information like passwords is not printed in clear text. ## How was this patch tested? This patch was tested through `SparkConfSuite` & then entire unit test through sbt Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #24196 from ninadingole/SPARK-27244. Authored-by: Ninad Ingole <robert.wallis@example.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit dbc7ce18b934fbfd0743b1348fc1265778f07027) Signed-off-by: Sean Owen <sean.owen@databricks.com> 29 March 2019, 19:17:32 UTC
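A minimal sketch of the redaction idea described above (not Spark's actual `Utils.redact` implementation; the pattern and replacement string are illustrative):

```scala
// Mask values whose keys match a redaction pattern before logging the configuration.
val redactionPattern = "(?i)password|secret|token".r

def redactConf(conf: Seq[(String, String)]): Seq[(String, String)] =
  conf.map { case (k, v) =>
    if (redactionPattern.findFirstIn(k).isDefined) (k, "*********(redacted)") else (k, v)
  }

// redactConf(Seq("spark.ssl.keyStorePassword" -> "hunter2"))
//   => Seq(("spark.ssl.keyStorePassword", "*********(redacted)"))
```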
298e4fa [SPARK-27275][CORE] Fix potential corruption in EncryptedMessage.transferTo (2.4) ## What changes were proposed in this pull request? Backport https://github.com/apache/spark/pull/24211 to 2.4 ## How was this patch tested? Jenkins Closes #24229 from zsxwing/SPARK-27275-2.4. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 28 March 2019, 18:13:11 UTC
f0a3e89 Preparing development version 2.4.2-SNAPSHOT 26 March 2019, 04:38:38 UTC
5830101 Preparing Spark release v2.4.1-rc9 26 March 2019, 04:38:19 UTC
c7fd233 [SPARK-26961][CORE] Enable parallel classloading capability ## What changes were proposed in this pull request? As per https://docs.oracle.com/javase/8/docs/api/java/lang/ClassLoader.html: "Class loaders that support concurrent loading of classes are known as parallel capable class loaders and are required to register themselves at their class initialization time by invoking the ClassLoader.registerAsParallelCapable method. Note that the ClassLoader class is registered as parallel capable by default. However, its subclasses still need to register themselves if they are parallel capable." That is, we can get finer-grained class-loading locks by registering our classloaders as parallel capable. (Refer to the deadlock due to the macro lock, https://issues.apache.org/jira/browse/SPARK-26961.) All the classloaders we have are wrappers of URLClassLoader, which is itself parallel capable, but this cannot be achieved from Scala code due to the static registration requirement; see https://github.com/scala/bug/issues/11429. ## How was this patch tested? All existing UTs must pass. Closes #24126 from ajithme/driverlock. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit b61dce23d2ee7ca95770bc7c390029aae8c65f7e) Signed-off-by: Sean Owen <sean.owen@databricks.com> 26 March 2019, 00:07:51 UTC
6e4cd88 [SPARK-27274][DOCS] Fix references to scala 2.11 in 2.4.1+ docs; Note 2.11 support is deprecated in 2.4.1+ ## What changes were proposed in this pull request? Fix references to scala 2.11 in 2.4.x docs; should default to 2.12. Note 2.11 support is deprecated in 2.4.x. Note that this change isn't needed in master as it's already on 2.12 in docs by default. ## How was this patch tested? Docs build. Closes #24210 from srowen/Scala212docs24. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> 26 March 2019, 00:06:17 UTC
f27c951 [SPARK-27198][CORE] Heartbeat interval mismatch in driver and executor ## What changes were proposed in this pull request? When the heartbeat interval is configured via spark.executor.heartbeatInterval without specifying units, there is a time-unit mismatch between the driver (which interprets the value in seconds) and the executor (which interprets it as milliseconds). ## How was this patch tested? Will add UTs. Closes #24140 from ajithme/intervalissue. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> 25 March 2019, 20:38:07 UTC
704b75d [SPARK-27094][YARN][BRANCH-2.4] Work around RackResolver swallowing thread interrupt. To avoid the case where the YARN libraries would swallow the exception and prevent YarnAllocator from shutting down, call the offending code in a separate thread, so that the parent thread can respond appropriately to the shut down. As a safeguard, also explicitly stop the executor launch thread pool when shutting down the application, to prevent new executors from coming up after the application started its shutdown. Tested with unit tests + some internal tests on real cluster. Closes #24206 from vanzin/SPARK-27094-2.4. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: DB Tsai <d_tsai@apple.com> 25 March 2019, 18:13:20 UTC
0faf828 Revert "Revert "[SPARK-26606][CORE] Handle driver options properly when submitting to standalone cluster mode via legacy Client"" This reverts commits 3fc626d874d0201ada8387a7e5806672c79cd6b3. Closes #24192 from HeartSaVioR/WIP-testing-SPARK-26606-in-branch-2.4. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 March 2019, 23:46:36 UTC
0cfefa7 [SPARK-24935][SQL] fix Hive UDAF with two aggregation buffers ## What changes were proposed in this pull request? A Hive UDAF knows the aggregation mode when creating the aggregation buffer, so that it can create different buffers for different inputs: the original data or the aggregation buffer. Please see an example in the [sketches library](https://github.com/DataSketches/sketches-hive/blob/7f9e76e9e03807277146291beb2c7bec40e8672b/src/main/java/com/yahoo/sketches/hive/cpc/DataToSketchUDAF.java#L107). However, the Hive UDAF adapter in Spark always creates the buffer with partial1 mode, which can only deal with one input: the original data. This PR fixes it. All credit goes to pgandhi999, who investigated the problem, studied the Hive UDAF behaviors, and wrote the tests. Closes https://github.com/apache/spark/pull/23778 ## How was this patch tested? A new test. Closes #24144 from cloud-fan/hive. Lead-authored-by: pgandhi <pgandhi@verizonmedia.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit a6c207c9c0c7aa057cfa27d16fe882b396440113) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 24 March 2019, 23:08:39 UTC
3fc626d Revert "[SPARK-26606][CORE] Handle driver options properly when submitting to standalone cluster mode via legacy Client" This reverts commit 6f1a8d8bfdd8dccc9af2d144ea5ad644ddc63a81. 23 March 2019, 19:19:34 UTC
f3ba73a [SPARK-27160][SQL] Fix DecimalType when building orc filters A DecimalType literal should not be cast to Long. E.g. for `df.filter("x < 3.14")`, assuming df (with x in DecimalType) reads from an ORC table and uses the native ORC reader with predicate push down enabled, we push down the `x < 3.14` predicate to the ORC reader via a SearchArgument. OrcFilters constructs the SearchArgument but does not handle the DecimalType correctly: the previous implementation would construct `x < 3` from `x < 3.14`. ``` $ sbt > sql/testOnly *OrcFilterSuite > sql/testOnly *OrcQuerySuite -- -z "27160" ``` Closes #24092 from sadhen/spark27160. Authored-by: Darcy Shen <sadhen@zoho.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 March 2019, 17:33:22 UTC
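A hedged repro sketch of the scenario described above (the path and schema are illustrative, not from the patch):

```scala
// Assumes an ORC table at /tmp/decimal_table with a column x: decimal(4,2).
spark.conf.set("spark.sql.orc.impl", "native")         // use the native ORC reader
spark.conf.set("spark.sql.orc.filterPushdown", "true") // enable predicate pushdown

val df = spark.read.orc("/tmp/decimal_table")
// The pushed-down SearchArgument must keep the decimal literal 3.14,
// not a truncated integer 3.
df.filter("x < 3.14").show()
```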
6f1a8d8 [SPARK-26606][CORE] Handle driver options properly when submitting to standalone cluster mode via legacy Client This patch fixes the issue that ClientEndpoint in a standalone cluster doesn't recognize driver options that are passed via SparkConf instead of system properties. When `Client` is executed via the CLI they should be provided as system properties, but with `spark-submit` they can be provided via SparkConf. (SparkSubmit will call `ClientApp.start` with a SparkConf that contains these options.) Manually tested via the following steps: 1) set up a standalone cluster (launch master and worker via `./sbin/start-all.sh`) 2) submit one of the example apps in standalone cluster mode ``` ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master "spark://localhost:7077" --conf "spark.driver.extraJavaOptions=-Dfoo=BAR" --deploy-mode "cluster" --num-executors 1 --driver-memory 512m --executor-memory 512m --executor-cores 1 examples/jars/spark-examples*.jar 10 ``` 3) check whether `foo=BAR` is provided in the system properties in the Spark UI <img width="877" alt="Screen Shot 2019-03-21 at 8 18 04 AM" src="https://user-images.githubusercontent.com/1317309/54728501-97db1700-4bc1-11e9-89da-078445c71e9b.png"> Closes #24163 from HeartSaVioR/SPARK-26606. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 8a9eb05137cd4c665f39a54c30d46c0c4eb7d20b) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 22 March 2019, 22:12:35 UTC
95e73b3 [SPARK-27112][CORE] Create a resource ordering between threads to resolve the deadlocks encountered when trying to kill executors either due to dynamic allocation or blacklisting Closes #24072 from pgandhi999/SPARK-27112-2. Authored-by: pgandhi <pgandhi@verizonmedia.com> Signed-off-by: Imran Rashid <irashid@cloudera.com> ## What changes were proposed in this pull request? There are two deadlocks as a result of the interplay between three different threads: **task-result-getter thread** **spark-dynamic-executor-allocation thread** **dispatcher-event-loop thread (makeOffers())** The fix enforces an ordering constraint by acquiring the lock on `TaskSchedulerImpl` before acquiring the lock on `CoarseGrainedSchedulerBackend` in the `makeOffers()` as well as the `killExecutors()` method. This ensures resource ordering between the threads and thus fixes the deadlocks. ## How was this patch tested? Manual tests. Closes #24134 from pgandhi999/branch-2.4-SPARK-27112. Authored-by: pgandhi <pgandhi@verizonmedia.com> Signed-off-by: Imran Rashid <irashid@cloudera.com> 19 March 2019, 21:22:40 UTC
342e91f [SPARK-27178][K8S][BRANCH-2.4] adding nss package to fix tests ## What changes were proposed in this pull request? see also: https://github.com/apache/spark/pull/24111 while performing some tests on our existing minikube and k8s infrastructure, i noticed that the integration tests were failing. i dug in and discovered the following message buried at the end of the stacktrace: ``` Caused by: java.io.FileNotFoundException: /usr/lib/libnss3.so at sun.security.pkcs11.Secmod.initialize(Secmod.java:193) at sun.security.pkcs11.SunPKCS11.<init>(SunPKCS11.java:218) ... 81 more ``` after i added the `nss` package to `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile`, everything worked. this is also impacting current builds. see: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/8959/console ## How was this patch tested? i tested locally before pushing, and the build system will test the rest. Closes #24137 from shaneknapp/add-nss-package. Authored-by: shane knapp <incomplete@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 18 March 2019, 23:51:57 UTC
361c942 [SPARK-27134][SQL] array_distinct function does not work correctly with columns containing array of array ## What changes were proposed in this pull request? Correct the logic to compute the distinct. Below is a small repro snippet. ``` scala> val df = Seq(Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))).toDF("array_col") df: org.apache.spark.sql.DataFrame = [array_col: array<array<int>>] scala> val distinctDF = df.select(array_distinct(col("array_col"))) distinctDF: org.apache.spark.sql.DataFrame = [array_distinct(array_col): array<array<int>>] scala> df.show(false) +----------------------------------------+ |array_col | +----------------------------------------+ |[[1, 2], [1, 2], [1, 2], [3, 4], [4, 5]]| +----------------------------------------+ ``` Error ``` scala> distinctDF.show(false) +-------------------------+ |array_distinct(array_col)| +-------------------------+ |[[1, 2], [1, 2], [1, 2]] | +-------------------------+ ``` Expected result ``` scala> distinctDF.show(false) +-------------------------+ |array_distinct(array_col)| +-------------------------+ |[[1, 2], [3, 4], [4, 5]] | +-------------------------+ ``` ## How was this patch tested? Added an additional test. Closes #24073 from dilipbiswal/SPARK-27134. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit aea9a574c44768d1d93ee7e8069729383859292c) Signed-off-by: Sean Owen <sean.owen@databricks.com> 16 March 2019, 19:31:17 UTC
a2f9684 [SPARK-27165][SPARK-27107][BRANCH-2.4][BUILD][SQL] Upgrade Apache ORC to 1.5.5 ## What changes were proposed in this pull request? This PR aims to update Apache ORC dependency to fix [SPARK-27107](https://issues.apache.org/jira/browse/SPARK-27107) . ``` [ORC-452] Support converting MAP column from JSON to ORC Improvement [ORC-447] Change the docker scripts to keep a persistent m2 cache [ORC-463] Add `version` command [ORC-475] ORC reader should lazily get filesystem [ORC-476] Make SearchAgument kryo buffer size configurable ``` ## How was this patch tested? Pass the Jenkins with the existing tests. Closes #24097 from dongjoon-hyun/SPARK-27165-2.4. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 15 March 2019, 03:13:21 UTC
2d4e9cf [SPARK-26742][K8S][BRANCH-2.4] Update k8s client version to 4.1.2 ## What changes were proposed in this pull request? Updates client version and fixes some related issues. ## How was this patch tested? Tested with the latest minikube version and k8s 1.13. KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example - Run PySpark with memory customization - Run in client mode. Run completed in 4 minutes, 20 seconds. Total number of tests run: 14 Suites: completed 2, aborted 0 Tests: succeeded 14, failed 0, canceled 0, ignored 0, pending 0 All tests passed. [INFO] ------------------------------------------------------------------------ [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM 2.4.2-SNAPSHOT ............ SUCCESS [ 2.980 s] [INFO] Spark Project Tags ................................. SUCCESS [ 2.880 s] [INFO] Spark Project Local DB ............................. SUCCESS [ 1.954 s] [INFO] Spark Project Networking ........................... SUCCESS [ 3.369 s] [INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 1.791 s] [INFO] Spark Project Unsafe ............................... SUCCESS [ 1.845 s] [INFO] Spark Project Launcher ............................. SUCCESS [ 3.725 s] [INFO] Spark Project Core ................................. SUCCESS [ 23.572 s] [INFO] Spark Project Kubernetes Integration Tests 2.4.2-SNAPSHOT SUCCESS [04:25 min] [INFO] ------------------------------------------------------------------------ [INFO] BUILD SUCCESS [INFO] ------------------------------------------------------------------------ [INFO] Total time: 05:08 min [INFO] Finished at: 2019-03-06T18:03:55Z [INFO] ------------------------------------------------------------------------ Closes #23993 from skonto/fix-k8s-version. Authored-by: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 14 March 2019, 16:29:52 UTC
7f5bdd7 [MINOR][CORE] Use https for bintray spark-packages repository ## What changes were proposed in this pull request? This patch changes the schema of url from http to https for bintray spark-packages repository. Looks like we already changed the schema of repository url for pom.xml but missed inside the code. ## How was this patch tested? Manually ran the `--package` via `./bin/spark-shell --verbose --packages "RedisLabs:spark-redis:0.3.2"` ``` ... Ivy Default Cache set to: /Users/jlim/.ivy2/cache The jars for the packages stored in: /Users/jlim/.ivy2/jars :: loading settings :: url = jar:file:/Users/jlim/WorkArea/ScalaProjects/spark/dist/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml RedisLabs#spark-redis added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent-2fee2e18-7832-4a4d-9e97-7b3d0fef766d;1.0 confs: [default] found RedisLabs#spark-redis;0.3.2 in spark-packages found redis.clients#jedis;2.7.2 in central found org.apache.commons#commons-pool2;2.3 in central downloading https://dl.bintray.com/spark-packages/maven/RedisLabs/spark-redis/0.3.2/spark-redis-0.3.2.jar ... [SUCCESSFUL ] RedisLabs#spark-redis;0.3.2!spark-redis.jar (824ms) downloading https://repo1.maven.org/maven2/redis/clients/jedis/2.7.2/jedis-2.7.2.jar ... [SUCCESSFUL ] redis.clients#jedis;2.7.2!jedis.jar (576ms) downloading https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.3/commons-pool2-2.3.jar ... [SUCCESSFUL ] org.apache.commons#commons-pool2;2.3!commons-pool2.jar (150ms) :: resolution report :: resolve 4586ms :: artifacts dl 1555ms :: modules in use: RedisLabs#spark-redis;0.3.2 from spark-packages in [default] org.apache.commons#commons-pool2;2.3 from central in [default] redis.clients#jedis;2.7.2 from central in [default] --------------------------------------------------------------------- | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| --------------------------------------------------------------------- | default | 3 | 3 | 3 | 0 || 3 | 3 | --------------------------------------------------------------------- ``` Closes #24061 from HeartSaVioR/MINOR-use-https-to-bintray-repository. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit f57af2286f85bf67706e14fecfbfd9ef034c2927) Signed-off-by: Sean Owen <sean.owen@databricks.com> 12 March 2019, 23:01:43 UTC
432ea69 [SPARK-26927][CORE] Ensure executor is active when processing events in dynamic allocation manager. There is a race condition in the `ExecutorAllocationManager`: the `SparkListenerExecutorRemoved` event can be posted before the `SparkListenerTaskStart` event, which leads to an incorrect `executorIds` set. Then, when some executor idles, real executors will be removed even when the actual executor count equals `minNumExecutors`, due to the incorrect computation of `newExecutorTotal` (which may be greater than `minNumExecutors`), possibly ending up with zero available executors while a wrong positive number of executorIds is kept in memory. What's more, even the `SparkListenerTaskEnd` event cannot release the stale `executorIds`, because a later idle event for those executors cannot trigger their real removal: they are already removed and no longer exist in the `executorDataMap` of `CoarseGrainedSchedulerBackend`, so the `onExecutorRemoved` method will never be called again. For details see https://issues.apache.org/jira/browse/SPARK-26927 This PR fixes this problem. Existing UTs plus an added UT. Closes #23842 from liupc/Fix-race-condition-that-casues-dyanmic-allocation-not-working. Lead-authored-by: Liupengcheng <liupengcheng@xiaomi.com> Co-authored-by: liupengcheng <liupengcheng@xiaomi.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit d5cfe08fdc7ad07e948f329c0bdeeca5c2574a18) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 12 March 2019, 21:13:20 UTC
dba5bac Preparing development version 2.4.2-SNAPSHOT 10 March 2019, 06:34:15 UTC
746b3dd Preparing Spark release v2.4.1-rc8 10 March 2019, 06:33:54 UTC
a017a1c [SPARK-27097][CHERRY-PICK 2.4] Avoid embedding platform-dependent offsets literally in whole-stage generated code ## What changes were proposed in this pull request? Spark SQL performs whole-stage code generation to speed up query execution. There are two steps to it: - Java source code is generated from the physical query plan on the driver. A single version of the source code is generated from a query plan, and sent to all executors. - It's compiled to bytecode on the driver to catch compilation errors before sending to executors, but currently only the generated source code gets sent to the executors. The bytecode compilation is for fail-fast only. - Executors receive the generated source code and compile to bytecode, then the query runs like a hand-written Java program. In this model, there's an implicit assumption about the driver and executors being run on similar platforms. Some code paths accidentally embedded platform-dependent object layout information into the generated code, such as: ```java Platform.putLong(buffer, /* offset */ 24, /* value */ 1); ``` This code expects a field to be at offset +24 of the `buffer` object, and sets a value to that field. But whole-stage code generation generally uses platform-dependent information from the driver. If the object layout is significantly different on the driver and executors, the generated code can be reading/writing to wrong offsets on the executors, causing all kinds of data corruption. One code pattern that leads to such problem is the use of `Platform.XXX` constants in generated code, e.g. `Platform.BYTE_ARRAY_OFFSET`. Bad: ```scala val baseOffset = Platform.BYTE_ARRAY_OFFSET // codegen template: s"Platform.putLong($buffer, $baseOffset, $value);" ``` This will embed the value of `Platform.BYTE_ARRAY_OFFSET` on the driver into the generated code. Good: ```scala val baseOffset = "Platform.BYTE_ARRAY_OFFSET" // codegen template: s"Platform.putLong($buffer, $baseOffset, $value);" ``` This will generate the offset symbolically -- `Platform.putLong(buffer, Platform.BYTE_ARRAY_OFFSET, value)`, which will be able to pick up the correct value on the executors. Caveat: these offset constants are declared as runtime-initialized `static final` in Java, so they're not compile-time constants from the Java language's perspective. It does lead to a slightly increased size of the generated code, but this is necessary for correctness. NOTE: there can be other patterns that generate platform-dependent code on the driver which is invalid on the executors. e.g. if the endianness is different between the driver and the executors, and if some generated code makes strong assumption about endianness, it would also be problematic. ## How was this patch tested? Added a new test suite `WholeStageCodegenSparkSubmitSuite`. This test suite needs to set the driver's extraJavaOptions to force the driver and executor use different Java object layouts, so it's run as an actual SparkSubmit job. Authored-by: Kris Mok <kris.mokdatabricks.com> Closes #24032 from gatorsmile/testFailure. Lead-authored-by: Kris Mok <kris.mok@databricks.com> Co-authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: DB Tsai <d_tsai@apple.com> 10 March 2019, 06:00:36 UTC
53590f2 [SPARK-27111][SS] Fix a race that a continuous query may fail with InterruptedException Before a Kafka consumer gets assigned with partitions, its offset will contain 0 partitions. However, runContinuous will still run and launch a Spark job having 0 partitions. In this case, there is a race that epoch may interrupt the query execution thread after `lastExecution.toRdd`, and either `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` or the next `runContinuous` will get interrupted unintentionally. To handle this case, this PR has the following changes: - Clean up the resources in `queryExecutionThread.runUninterruptibly`. This may increase the waiting time of `stop` but should be minor because the operations here are very fast (just sending an RPC message in the same process and stopping a very simple thread). - Clear the interrupted status at the end so that it won't impact the `runContinuous` call. We may clear the interrupted status set by `stop`, but it doesn't affect the query termination because `runActivatedStream` will check `state` and exit accordingly. I also updated the clean up codes to make sure exceptions thrown from `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` won't stop the clean up. Jenkins Closes #24034 from zsxwing/SPARK-27111. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> (cherry picked from commit 6e1c0827ece1cdc615196e60cb11c76b917b8eeb) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 09 March 2019, 22:33:34 UTC
c1b6fe4 [SPARK-27080][SQL] bug fix: mergeWithMetastoreSchema with uniform lower case comparison When reading Parquet files and merging the metastore schema with the file schema, we should compare field names using uniform case. The current implementation uses lowercase, but with one omission; this patch fixes it. Unit test. Closes #24001 from codeborui/mergeSchemaBugFix. Authored-by: CodeGod <> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit a29df5fa02111f57965be2ab5e208f5c815265fe) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 09 March 2019, 13:30:55 UTC
0297ff5 Preparing development version 2.4.2-SNAPSHOT 08 March 2019, 20:38:02 UTC
e87fe15 Preparing Spark release v2.4.1-rc7 08 March 2019, 20:37:43 UTC
216eeec [SPARK-26604][CORE][BACKPORT-2.4] Clean up channel registration for StreamManager ## What changes were proposed in this pull request? This is mostly a clean backport of https://github.com/apache/spark/pull/23521 to branch-2.4 ## How was this patch tested? I've tested this with a hack in `TransportRequestHandler` to force `ChunkFetchRequest` to get dropped. Then making a number of `ExternalShuffleClient.fetchChunk` requests (which `OpenBlocks` then `ChunkFetchRequest`) and closing out of my test harness. A heap dump later reveals that the `StreamState` references are unreachable. I haven't run this through the unit test suite, but doing that now. Wanted to get this up as I think folks are waiting for it for 2.4.1 Closes #24013 from abellina/SPARK-26604_cherry_pick_2_4. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Alessandro Bellina <abellina@yahoo-inc.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 08 March 2019, 03:48:20 UTC
f7ad4ff [SPARK-25863][SPARK-21871][SQL] Check if code size statistics is empty or not in updateAndGetCompilationStats ## What changes were proposed in this pull request? `CodeGenerator.updateAndGetCompilationStats` throws an unsupported exception for empty code size statistics. This PR adds code to check whether the statistics are empty. ## How was this patch tested? Pass Jenkins. Closes #23947 from maropu/SPARK-21871-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 07 March 2019, 08:38:48 UTC
9702915 [SPARK-27078][SQL] Fix NoSuchFieldError when read Hive materialized views ## What changes were proposed in this pull request? This pr fix `NoSuchFieldError` when reading Hive materialized views from Hive 2.3.4. How to reproduce: Hive side: ```sql CREATE TABLE materialized_view_tbl (key INT); CREATE MATERIALIZED VIEW view_1 DISABLE REWRITE AS SELECT * FROM materialized_view_tbl; ``` Spark side: ```java bin/spark-sql --conf spark.sql.hive.metastore.version=2.3.4 --conf spark.sql.hive.metastore.jars=maven spark-sql> select * from view_1; 19/03/05 19:55:37 ERROR SparkSQLDriver: Failed in [select * from view_1] java.lang.NoSuchFieldError: INDEX_TABLE at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTableOption$3(HiveClientImpl.scala:438) at scala.Option.map(Option.scala:163) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTableOption$1(HiveClientImpl.scala:370) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:277) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:215) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:214) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:260) at org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:368) ``` ## How was this patch tested? unit tests Closes #23984 from wangyum/SPARK-24360. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 32848eecc55946ad91e62e231d2e310a0270a63d) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 07 March 2019, 00:58:03 UTC
35381dd [SPARK-27019][SQL][WEBUI] onJobStart happens after onExecutionEnd shouldn't overwrite kvstore ## What changes were proposed in this pull request? Currently, when the event reordering happens, especially onJobStart event come after onExecutionEnd event, SQL page in the UI displays weirdly.(for eg:test mentioned in JIRA and also this issue randomly occurs when the TPCDS query fails due to broadcast timeout etc.) The reason is that, In the SQLAppstatusListener, we remove the liveExecutions entry once the execution ends. So, if a jobStart event come after that, then we create a new liveExecution entry corresponding to the execId. Eventually this will overwrite the kvstore and UI displays confusing entries. ## How was this patch tested? Added UT, Also manually tested with the eventLog, provided in the jira, of the failed query. Before fix: ![screenshot from 2019-03-03 03-05-52](https://user-images.githubusercontent.com/23054875/53687929-53e2b800-3d61-11e9-9dca-620fa41e605c.png) After fix: ![screenshot from 2019-03-03 02-40-18](https://user-images.githubusercontent.com/23054875/53687928-4f1e0400-3d61-11e9-86aa-584646ac68f9.png) Closes #23939 from shahidki31/SPARK-27019. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 62fd133f744ab2d1aa3c409165914b5940e4d328) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 06 March 2019, 22:02:45 UTC
7df5aa6 [SPARK-27065][CORE] avoid more than one active task set managers for a stage ## What changes were proposed in this pull request? This is another attempt to fix the more-than-one-active-task-set-managers bug. https://github.com/apache/spark/pull/17208 is the first attempt. It marks the TSM as zombie before sending a task completion event to DAGScheduler. This is necessary, because when the DAGScheduler gets the task completion event, and it's for the last partition, then the stage is finished. However, if it's a shuffle stage and it has missing map outputs, DAGScheduler will resubmit it(see the [code](https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1416-L1422)) and create a new TSM for this stage. This leads to more than one active TSM of a stage and fail. This fix has a hole: Let's say a stage has 10 partitions and 2 task set managers: TSM1(zombie) and TSM2(active). TSM1 has a running task for partition 10 and it completes. TSM2 finishes tasks for partitions 1-9, and thinks he is still active because he hasn't finished partition 10 yet. However, DAGScheduler gets task completion events for all the 10 partitions and thinks the stage is finished. Then the same problem occurs: DAGScheduler may resubmit the stage and cause more than one actice TSM error. https://github.com/apache/spark/pull/21131 fixed this hole by notifying all the task set managers when a task finishes. For the above case, TSM2 will know that partition 10 is already completed, so he can mark himself as zombie after partitions 1-9 are completed. However, #21131 still has a hole: TSM2 may be created after the task from TSM1 is completed. Then TSM2 can't get notified about the task completion, and leads to the more than one active TSM error. #22806 and #23871 are created to fix this hole. However the fix is complicated and there are still ongoing discussions. This PR proposes a simple fix, which can be easy to backport: mark all existing task set managers as zombie when trying to create a new task set manager. After this PR, #21131 is still necessary, to avoid launching unnecessary tasks and fix [SPARK-25250](https://issues.apache.org/jira/browse/SPARK-25250 ). #22806 and #23871 are its followups to fix the hole. ## How was this patch tested? existing tests. Closes #23927 from cloud-fan/scheduler. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Imran Rashid <irashid@cloudera.com> (cherry picked from commit cb20fbc43e7f54af1ed30b9eb6d76ca50b4eb750) Signed-off-by: Imran Rashid <irashid@cloudera.com> 06 March 2019, 18:01:07 UTC
db86ccb [SPARK-23433][SPARK-25250][CORE] Later created TaskSet should learn about the finished partitions ## What changes were proposed in this pull request? This is an optional solution for #22806. #21131 first implemented that a previously completed task from a zombie TaskSetManager could also mark the partition as finished in the active TaskSetManager, based on the assumption that an active TaskSetManager always exists for that stage when this happens. But that's not always true, as an active TaskSetManager may not have been created yet when a previous task succeeds, which is why #22806 hit the issue. This PR extends #21131's behavior by adding `stageIdToFinishedPartitions` to TaskSchedulerImpl, which records the finished partitions whenever a task (from a zombie or active TaskSetManager) succeeds. Thus, a later created active TaskSetManager can also learn about the finished partitions by looking into `stageIdToFinishedPartitions` and won't launch any duplicate tasks. ## How was this patch tested? Tests added. Closes #23871 from Ngone51/dev-23433-25250. Lead-authored-by: wuyi <ngone_5451@163.com> Co-authored-by: Ngone51 <ngone_5451@163.com> Signed-off-by: Imran Rashid <irashid@cloudera.com> (cherry picked from commit e5c61436a5720f13eb6d530ebf80635522bd64c6) Signed-off-by: Imran Rashid <irashid@cloudera.com> 06 March 2019, 17:53:39 UTC
5ec4563 [SPARK-24669][SQL] Invalidate tables in case of DROP DATABASE CASCADE ## What changes were proposed in this pull request? Before dropping a database, refresh its tables so that all cached entries associated with those tables are invalidated. We follow the same approach when dropping a table. A UT is added. Closes #23905 from Udbhav30/SPARK-24669. Authored-by: Udbhav30 <u.agrawal30@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 9bddf7180e9e76e1cabc580eee23962dd66f84c3) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 06 March 2019, 17:07:57 UTC
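An illustrative sequence for the behavior this entry targets (the database and table names are made up):

```scala
// Before the fix, cached entries for tables under the dropped database could linger;
// the patch refreshes the database's tables as part of the cascade.
spark.sql("CREATE DATABASE demo_db")
spark.sql("CREATE TABLE demo_db.t (id INT) USING parquet")
spark.sql("CACHE TABLE demo_db.t")
spark.sql("DROP DATABASE demo_db CASCADE") // should also invalidate cached entries for demo_db.t
```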
b583bfe [SPARK-26932][DOC] Add a warning for Hive 2.1.1 ORC reader issue By default, Hive 2.1.1 cannot read ORC tables created by Spark 2.4.0; this adds that information to sql-migration-guide-upgrade.md. For details see [SPARK-26932](https://issues.apache.org/jira/browse/SPARK-26932). Tested with a doc build. Closes #23944 from haiboself/SPARK-26932. Authored-by: Bo Hai <haibo-self@163.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit c27caead43423d1f994f42502496d57ea8389dc0) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 March 2019, 20:08:07 UTC
498fb70 [MINOR][DOCS] Clarify that Spark apps should mark Spark as a 'provided' dependency, not package it ## What changes were proposed in this pull request? Spark apps do not need to package Spark. In fact it can cause problems in some cases. Our examples should show depending on Spark as a 'provided' dependency. Packaging Spark makes the app much bigger by tens of megabytes. It can also bring in conflicting dependencies that wouldn't otherwise be a problem. https://issues.apache.org/jira/browse/SPARK-26146 was what reminded me of this. ## How was this patch tested? Doc build Closes #23938 from srowen/Provided. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 39092236819da097e9c8a3b2fa975105f08ae5b9) Signed-off-by: Sean Owen <sean.owen@databricks.com> 05 March 2019, 14:27:03 UTC
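For example, an sbt build would mark the Spark artifacts as `provided` roughly like this (the version shown is only illustrative):

```scala
// build.sbt sketch: Spark is supplied by the cluster at runtime, so it is not
// packaged into the application jar.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.1" % "provided"
)
```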
ae462b1 [SPARK-27046][DSTREAMS] Remove SPARK-19185 related references from documentation ## What changes were proposed in this pull request? SPARK-19185 is resolved so the reference can be removed from the documentation. ## How was this patch tested? cd docs/ SKIP_API=1 jekyll build Manual webpage check. Closes #23959 from gaborgsomogyi/SPARK-27046. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 5252d8b9872cbf200651b0bb7b8c6edd649ebb58) Signed-off-by: Sean Owen <sean.owen@databricks.com> 04 March 2019, 15:32:19 UTC
3336a21 [SPARK-26990][SQL][BACKPORT-2.4] FileIndex: use user specified field names if possible ## What changes were proposed in this pull request? Back-port of #23894 to branch-2.4. With the following file structure: ``` /tmp/data └── a=5 ``` In the previous release: ``` scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema root |-- ID: long (nullable = true) |-- A: integer (nullable = true) ``` While in current code: ``` scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema root |-- ID: long (nullable = true) |-- a: integer (nullable = true) ``` We can see that the partition column name `a` differs from `A` as the user specified. This PR fixes the case and makes it more user-friendly. Closes #23894 from gengliangwang/fileIndexSchema. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> ## How was this patch tested? Unit test Closes #23909 from bersprockets/backport-SPARK-26990. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 28 February 2019, 01:37:07 UTC
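A sketch of the scenario from the description; the path is hypothetical and an active `SparkSession` named `spark` is assumed.

```scala
import org.apache.spark.sql.functions.lit

// Write a table partitioned by a lower-case column, then read it back with an
// upper-case, user-specified schema.
spark.range(1).withColumn("a", lit(5)).write.partitionBy("a").parquet("/tmp/data")
spark.read.schema("A int, ID long").parquet("/tmp/data").printSchema()
// With the fix, the partition column is reported as "A", as the user specified.
```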
b031f4a [MINOR][BUILD] Update all checkstyle dtd to use "https://checkstyle.org" ## What changes were proposed in this pull request? The build below failed the Java checkstyle test, but instead of a violation it shows a FileNotFound error for the dtd file. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102751/ It looks like the dtd link `http://www.puppycrawl.com/dtds/configuration_1_3.dtd` is dead. This patch updates the dtd link to "https://checkstyle.org/dtds/", given that the checkstyle repository also updated the URL path. https://github.com/checkstyle/checkstyle/issues/5601 ## How was this patch tested? Checked the new links. Closes #23887 from HeartSaVioR/java-checkstyle-dtd-change-url. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit c5de804093540509929f6de211dbbe644b33e6db) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 February 2019, 19:26:17 UTC
073c47b Preparing development version 2.4.2-SNAPSHOT 22 February 2019, 22:54:37 UTC
eb2af24 Preparing Spark release v2.4.1-rc5 22 February 2019, 22:54:15 UTC
ef67be3 [SPARK-26950][SQL][TEST] Make RandomDataGenerator use Float.NaN or Double.NaN for all NaN values ## What changes were proposed in this pull request? Apache Spark uses the predefined `Float.NaN` and `Double.NaN` for NaN values, but there exist more NaN values with different binary representations. ```scala scala> java.nio.ByteBuffer.allocate(4).putFloat(Float.NaN).array res1: Array[Byte] = Array(127, -64, 0, 0) scala> val x = java.lang.Float.intBitsToFloat(-6966608) x: Float = NaN scala> java.nio.ByteBuffer.allocate(4).putFloat(x).array res2: Array[Byte] = Array(-1, -107, -78, -80) ``` Since users can have these values, `RandomDataGenerator` generates these NaN values. However, this causes `checkEvaluationWithUnsafeProjection` failures due to differences in the `UnsafeRow` binary representation. The following is an instance of the UT failure. This PR aims to fix this UT flakiness. - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102528/testReport/ ## How was this patch tested? Pass the Jenkins with the newly added test cases. Closes #23851 from dongjoon-hyun/SPARK-26950. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ffef3d40741b0be321421aa52a6e17a26d89f541) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 February 2019, 04:27:17 UTC
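A small sketch of why canonicalization matters (pure JVM code, no Spark APIs):

```scala
// Many bit patterns encode NaN; only the raw bits differ, not the "NaN-ness".
val altNaN = java.lang.Float.intBitsToFloat(-6966608)      // non-canonical NaN from the description
val sameRawBits = java.lang.Float.floatToRawIntBits(altNaN) ==
  java.lang.Float.floatToRawIntBits(Float.NaN)             // false: different binary representation

val canonical = if (altNaN.isNaN) Float.NaN else altNaN    // canonicalize before binary comparison
val nowEqual = java.lang.Float.floatToRawIntBits(canonical) ==
  java.lang.Float.floatToRawIntBits(Float.NaN)             // true
```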
b403612 Revert "[R][BACKPORT-2.3] update package description" This reverts commit 8d68d54f2e2cbbe55a4bb87c2216cff896add517. 22 February 2019, 02:14:56 UTC
8d68d54 [R][BACKPORT-2.3] update package description doesn't port cleanly to 2.3. we need this in branch-2.4 and branch-2.3 Closes #23861 from felixcheung/2.3rdesc. Authored-by: Felix Cheung <felixcheung_m@hotmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 36db45d5b90ddc3ce54febff2ed41cd29c0a8a04) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 February 2019, 02:13:38 UTC
3282544 Preparing development version 2.4.2-SNAPSHOT 21 February 2019, 23:02:17 UTC
79c1f7e Preparing Spark release v2.4.1-rc4 21 February 2019, 23:01:58 UTC
d857630 [R][BACKPORT-2.4] update package description #23852 doesn't port cleanly to 2.4. we need this in branch-2.4 and branch-2.3 Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #23860 from felixcheung/2.4rdesc. 21 February 2019, 16:42:15 UTC
0926f49 Preparing development version 2.4.2-SNAPSHOT 21 February 2019, 00:46:07 UTC
061185b Preparing Spark release v2.4.1-rc3 21 February 2019, 00:45:49 UTC
274142b [SPARK-26859][SQL] Fix field writer index bug in non-vectorized ORC deserializer ## What changes were proposed in this pull request? This happens in a schema evolution use case only when a user specifies the schema manually and uses the non-vectorized ORC deserializer code path. There is a bug in `OrcDeserializer.scala` that results in `null`s being set at the wrong column position, and in state from previous records remaining uncleared in subsequent records. More details on when exactly the bug gets triggered and what the outcome is are in the [JIRA issue](https://jira.apache.org/jira/browse/SPARK-26859). The high-level summary is that this bug results in severe data correctness issues, but fortunately the set of conditions needed to expose the bug is complicated, which keeps the surface area somewhat small. This change fixes the problem and adds a corresponding test. ## How was this patch tested? Pass the Jenkins with the newly added test cases. Closes #23766 from IvanVergiliev/fix-orc-deserializer. Lead-authored-by: Ivan Vergiliev <ivan.vergiliev@gmail.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 096552ae4d6fcef5e20c54384a2687db41ba2fa1) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 February 2019, 13:53:40 UTC
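A sketch of the code path the fix affects, with a hypothetical path and schema and assuming an active `SparkSession` named `spark`: a user-specified schema combined with the non-vectorized native ORC reader.

```scala
// Disable the vectorized reader so the OrcDeserializer (the fixed code path) is used.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")

// Reading with a manually specified schema is the schema-evolution scenario
// in which the field-writer index bug could surface.
val df = spark.read.schema("a INT, b STRING, c DOUBLE").orc("/tmp/orc_table")
df.show()
```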
4c60056 Preparing development version 2.4.2-SNAPSHOT 19 February 2019, 21:54:45 UTC
229ad52 Preparing Spark release v2.4.1-rc2 19 February 2019, 21:54:26 UTC
383b662 [MINOR][DOCS] Fix the update rule in StreamingKMeansModel documentation ## What changes were proposed in this pull request? The formatting for the update rule (in the documentation) now appears as ![image](https://user-images.githubusercontent.com/14948437/52933807-5a0c7980-3309-11e9-8573-642a73e77c26.png) instead of ![image](https://user-images.githubusercontent.com/14948437/52933897-a8ba1380-3309-11e9-8e16-e47c27b4a044.png) Closes #23819 from joelgenter/patch-1. Authored-by: joelgenter <joelgenter@outlook.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 885aa553c5e8f478b370f8a733102b67f6cd2d99) Signed-off-by: Sean Owen <sean.owen@databricks.com> 19 February 2019, 14:41:29 UTC
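For reference, the update rule being formatted is, as far as I recall from the MLlib guide (c_t: previous center, n_t: points assigned so far, x_t: the new batch's center, m_t: points in the new batch, α: decay factor):

```latex
\begin{aligned}
c_{t+1} &= \frac{c_t n_t \alpha + x_t m_t}{n_t \alpha + m_t} \\
n_{t+1} &= n_t + m_t
\end{aligned}
```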
633de74 [SPARK-26740][SQL][BRANCH-2.4] Read timestamp/date column stats written by Spark 3.0 ## What changes were proposed in this pull request? - Backport of #23662 to `branch-2.4` - Added `Timestamp`/`DateFormatter` - Set version of column stats to `1` to keep backward compatibility with previous versions ## How was this patch tested? The changes were tested by `StatisticsCollectionSuite` and by `StatisticsSuite`. Closes #23809 from MaxGekk/column-stats-time-date-2.4. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 February 2019, 03:46:42 UTC
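Column statistics of this kind are produced with Spark's ANALYZE command; a minimal sketch with hypothetical table and column names, assuming an active `SparkSession` named `spark`:

```scala
// Compute per-column statistics (including timestamp/date columns) that are
// stored in the metastore and later read back as described above.
spark.sql("ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS event_ts, event_date")
```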
094cabc [SPARK-26897][SQL][TEST][FOLLOW-UP] Remove workaround for 2.2.0 and 2.1.x in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? This PR removes the workaround for 2.2.0 and 2.1.x in HiveExternalCatalogVersionsSuite. ## How was this patch tested? Pass the Jenkins. Closes #23817 from maropu/SPARK-26607-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit e2b8cc65cd579374ddbd70b93c9fcefe9b8873d9) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 18 February 2019, 03:25:16 UTC
dfda97a [SPARK-26897][SQL][TEST] Update Spark 2.3.x testing from HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? The vote for the maintenance release of `branch-2.3` (v2.3.3) passed, so this updates PROCESS_TABLES.testingVersions in HiveExternalCatalogVersionsSuite. ## How was this patch tested? Pass the Jenkins. Closes #23807 from maropu/SPARK-26897. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit dcdbd06b687fafbf29df504949db0a5f77608c8e) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 17 February 2019, 23:06:15 UTC
7270283 [SPARK-26864][SQL][BACKPORT-2.4] Query may return incorrect result when a Python UDF is used as a join condition and the UDF uses attributes from both legs of a left semi join ## What changes were proposed in this pull request? In SPARK-25314, we supported the scenario of having a Python UDF that refers to attributes from both legs of a join condition by rewriting the plan to convert an inner join or left semi join to a filter over a cross join. In the case of left semi join, this transformation may cause incorrect results when the right leg of the join condition produces duplicate rows based on the join condition. This fix disallows the rewrite for left semi join and raises an error in this case, as we do for other join types. In the future, we should have a separate rule in the optimizer to convert left semi join to inner join (I am aware of one case where we could do it if we leverage informational constraints, i.e. when we know the right side does not produce duplicates). **Python** ```python >>> from pyspark import SparkContext >>> from pyspark.sql import SparkSession, Column, Row >>> from pyspark.sql.functions import UserDefinedFunction, udf >>> from pyspark.sql.types import * >>> from pyspark.sql.utils import AnalysisException >>> >>> spark.conf.set("spark.sql.crossJoin.enabled", "True") >>> left = spark.createDataFrame([Row(lc1=1, lc2=1), Row(lc1=2, lc2=2)]) >>> right = spark.createDataFrame([Row(rc1=1, rc2=1), Row(rc1=1, rc2=1)]) >>> func = udf(lambda a, b: a == b, BooleanType()) >>> df = left.join(right, func("lc1", "rc1"), "leftsemi").show() 19/02/12 16:07:10 WARN PullOutPythonUDFInJoinCondition: The join condition:<lambda>(lc1#0L, rc1#4L) of the join plan contains PythonUDF only, it will be moved out and the join plan will be turned to cross join. +---+---+ |lc1|lc2| +---+---+ | 1| 1| | 1| 1| +---+---+ ``` **Scala** ```scala scala> val left = Seq((1, 1), (2, 2)).toDF("lc1", "lc2") left: org.apache.spark.sql.DataFrame = [lc1: int, lc2: int] scala> val right = Seq((1, 1), (1, 1)).toDF("rc1", "rc2") right: org.apache.spark.sql.DataFrame = [rc1: int, rc2: int] scala> val equal = udf((p1: Integer, p2: Integer) => { | p1 == p2 | }) equal: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$2141/11016292394666f1b5,BooleanType,List(Some(Schema(IntegerType,true)), Some(Schema(IntegerType,true))),None,false,true) scala> val df = left.join(right, equal(col("lc1"), col("rc1")), "leftsemi") df: org.apache.spark.sql.DataFrame = [lc1: int, lc2: int] scala> df.show() +---+---+ |lc1|lc2| +---+---+ | 1| 1| +---+---+ ``` ## How was this patch tested? Modified existing tests. Closes #23780 from dilipbiswal/dkb_python_udf_2.4_2. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 February 2019, 09:05:10 UTC
fccc6d3 [SPARK-25922][K8S] Spark Driver/Executor "spark-app-selector" label mismatch (branch-2.4) In K8S cluster mode, the algorithm that generates the spark-app-selector/spark.app.id of the Spark driver differs from that of the Spark executor. This patch makes sure the Spark driver and executor use the same spark-app-selector/spark.app.id if spark.app.id is set; otherwise the superclass applicationId is used. In K8S client mode, spark-app-selector/spark.app.id for executors uses the superclass applicationId. Manually tested. Closes #23779 from vanzin/SPARK-25922. Authored-by: suxingfate <suxingfate@163.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 15 February 2019, 18:08:33 UTC
bc1e960 [SPARK-26873][SQL] Use a consistent timestamp to build Hadoop Job IDs. ## What changes were proposed in this pull request? Updates FileFormatWriter to create a consistent Hadoop Job ID for a write. ## How was this patch tested? Existing tests for regressions. Closes #23777 from rdblue/SPARK-26873-fix-file-format-writer-job-ids. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 33334e2728f8d2e4cf7d542049435b589ed05a5e) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 14 February 2019, 16:25:48 UTC
61b4787 [SPARK-26572][SQL] fix aggregate codegen result evaluation This PR is a correctness fix in `HashAggregateExec` code generation. It forces evaluation of result expressions before calling `consume()` to avoid multiple executions. This PR fixes a use case where an aggregate is nested into a broadcast join and appears on the "stream" side. The issue is that broadcast join generates its own loop, and without forcing evaluation of `HashAggregateExec`'s `resultExpressions` before the join's loop, these expressions can be executed multiple times, giving incorrect results. A new UT was added. Closes #23731 from peter-toth/SPARK-26572. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 2228ee51ce3550d7e6740a1833aae21ab8596764) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 14 February 2019, 15:08:06 UTC
455a57d [MINOR][DOCS] Fix for contradiction in condition formula of keeping intermediate state of window in structured streaming docs This change resolves a contradiction in the structured streaming documentation's formula for deciding whether a specific window's intermediate state is kept: the state is cleared when (max event time seen by the engine - late threshold) > T, and kept otherwise (the doc phrases this as "until"). Judging by the later examples, "T" is the end of the window, not the start as the documentation first suggests. For more information please take a look at my question on Stack Overflow: https://stackoverflow.com/questions/54599594/understanding-window-with-watermark-in-apache-spark-structured-streaming Can be tested by building the documentation. Closes #23765 from vitektarasenko/master. Authored-by: Viktor Tarasenko <v.tarasenko@vezet.ru> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 5894f767d1f159fc05e11d77d61089efcd0c50b4) Signed-off-by: Sean Owen <sean.owen@databricks.com> 13 February 2019, 14:01:49 UTC
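A small structured streaming sketch of the rule as clarified; column names are hypothetical, and a streaming DataFrame `events` with an `eventTime` column plus the session's implicits are assumed. State for a window ending at T is dropped once (max event time seen - late threshold) > T.

```scala
import org.apache.spark.sql.functions.window

// 10-minute late threshold; 5-minute tumbling windows keyed by word.
// A window's state is cleared only after the watermark passes the window's *end*.
val windowedCounts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window($"eventTime", "5 minutes"), $"word")
  .count()
```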
351b44d Preparing development version 2.4.2-SNAPSHOT 12 February 2019, 18:45:14 UTC
50eba0e Preparing Spark release v2.4.1-rc1 12 February 2019, 18:45:06 UTC
af3c711 [SPARK-26082][MESOS][FOLLOWUP][BRANCH-2.4] Add UT on fetcher cache option on MesosClusterScheduler ## What changes were proposed in this pull request? This patch adds a UT for SPARK-26082 to avoid a regression. While #23743 reduces the possibility of making a similar mistake, the lines of code needed for the tests are not that many, so it seems worth adding them. ## How was this patch tested? Newly added UTs. The test "supports setting fetcher cache" fails when #23734 is not applied and succeeds when #23734 is applied. Closes #23753 from HeartSaVioR/SPARK-26082-branch-2.4. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 10 February 2019, 22:49:37 UTC
f691b2c Revert "[SPARK-26082][MESOS][FOLLOWUP] Add UT on fetcher cache option on MesosClusterScheduler" This reverts commit e645743ad57978823adac57d95fe02fa6f45dad0. 09 February 2019, 03:51:25 UTC
e645743 [SPARK-26082][MESOS][FOLLOWUP] Add UT on fetcher cache option on MesosClusterScheduler ## What changes were proposed in this pull request? This patch adds a UT for SPARK-26082 to avoid a regression. While #23743 reduces the possibility of making a similar mistake, the lines of code needed for the tests are not that many, so it seems worth adding them. ## How was this patch tested? Newly added UTs. The test "supports setting fetcher cache" fails when #23743 is not applied and succeeds when #23743 is applied. Closes #23744 from HeartSaVioR/SPARK-26082-add-unit-test. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit b4e1d145135445eeed85784dab0c2c088930dd26) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 07 February 2019, 16:52:11 UTC
c41a5e1 [SPARK-26082][MESOS] Fix mesos fetch cache config name ## What changes were proposed in this pull request? * change MesosClusterScheduler to use correct argument name for Mesos fetch cache (spark.mesos.fetchCache.enable -> spark.mesos.fetcherCache.enable) ## How was this patch tested? Not sure this requires a test, since it's just a string change. Closes #23734 from mwlon/SPARK-26082. Authored-by: mwlon <mloncaric@hmc.edu> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit c0811e8b4d11892f60b7032ba4c8e3adc40fe82f) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 07 February 2019, 09:22:15 UTC
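Illustrative usage of the corrected key (a sketch, not taken from the patch itself):

```scala
import org.apache.spark.SparkConf

// The correct key is "fetcherCache", not "fetchCache".
val conf = new SparkConf()
  .set("spark.mesos.fetcherCache.enable", "true")
```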
9b2eedc [SPARK-26734][STREAMING] Fix StackOverflowError with large block queue ## What changes were proposed in this pull request? SPARK-23991 introduced a bug in `ReceivedBlockTracker#allocateBlocksToBatch`: when more than a few thousand blocks are in the queue, serializing the queue throws a StackOverflowError. This change just adds `dequeueAll` to the new `clone` operation on the queue, so that the fix in SPARK-23991 is preserved but the serialized data comes from an ArrayBuffer, which doesn't have the serialization problems that mutable.Queue has. ## How was this patch tested? A unit test was added. Closes #23716 from rlodge/SPARK-26734. Authored-by: Ross Lodge <rlodge@concentricsky.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 8427e9ba5cae28233d1bdc54208b46889b83a821) Signed-off-by: Sean Owen <sean.owen@databricks.com> 06 February 2019, 16:44:25 UTC
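A simplified, generic sketch of the idea; this is not Spark's ReceivedBlockTracker, and the names are made up: copy the linked queue into an array-backed buffer before it is serialized.

```scala
import scala.collection.mutable

// Draining a clone into an ArrayBuffer yields an array-backed collection,
// avoiding the element-by-element recursion that can overflow the stack when
// a large linked Queue is serialized.
def snapshotForSerialization[T](queue: mutable.Queue[T]): mutable.ArrayBuffer[T] = {
  val copy = queue.clone()                      // leave the live queue untouched
  mutable.ArrayBuffer(copy.dequeueAll(_ => true): _*)
}
```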
570edc6 [SPARK-26677][FOLLOWUP][BRANCH-2.4] Update Parquet manifest with Hadoop-2.6 ## What changes were proposed in this pull request? During merging Parquet upgrade PR, `hadoop-2.6` profile dependency manifest is missed. ## How was this patch tested? Manual. ``` ./dev/test-dependencies.sh ``` Also, this will recover `branch-2.4` with `hadoop-2.6` build. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.4-test-sbt-hadoop-2.6/281/ Closes #23738 from dongjoon-hyun/SPARK-26677-2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 06 February 2019, 01:22:33 UTC
7187c01 [SPARK-26758][CORE] Idle executors are not getting killed after the spark.dynamicAllocation.executorIdleTimeout value ## What changes were proposed in this pull request? The **updateAndSyncNumExecutorsTarget** API should be called after the **initializing** flag is unset. ## How was this patch tested? Added a UT and also tested manually. After the fix: ![afterfix](https://user-images.githubusercontent.com/35216143/51983136-ed4a5000-24bd-11e9-90c8-c4a562c17a4b.png) Closes #23697 from sandeep-katta/executorIssue. Authored-by: sandeep-katta <sandeep.katta2007@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 1dd7419702c5bc7e36fee9fa1eec06b66f25806e) Signed-off-by: Sean Owen <sean.owen@databricks.com> 05 February 2019, 04:13:53 UTC
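Illustrative configuration for the behavior being fixed (values are examples only):

```scala
import org.apache.spark.SparkConf

// Executors idle longer than the timeout should be released once the
// initial target has been synced.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .set("spark.shuffle.service.enabled", "true")  // required for dynamic allocation
```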
3d4aa5b [SPARK-26751][SQL] Fix memory leak when a statement runs in the background and throws an exception which is not a HiveSQLException ## What changes were proposed in this pull request? When a statement runs in the background and we get an exception which is not a HiveSQLException, we may encounter a memory leak since the handleToOperation entry will not be removed correctly. The reason is below: 1. When calling operation.run() in HiveSessionImpl#executeStatementInternal we throw an exception which is not a HiveSQLException 2. Then the opHandle generated by SparkSQLOperationManager will not be added into the opHandleSet of HiveSessionImpl, and operationManager.closeOperation(opHandle) will not be called 3. When we close the session, operationManager.closeOperation(opHandle) will not be called either, since we did not add this opHandle into the opHandleSet. For the reasons above, the opHandle will always remain in SparkSQLOperationManager#handleToOperation, which causes the memory leak. More details and a test case are attached to https://issues.apache.org/jira/browse/SPARK-26751 This patch always throws a HiveSQLException when running in the background ## How was this patch tested? Existing UTs Closes #23673 from caneGuy/zhoukang/fix-hivesessionimpl-leak. Authored-by: zhoukang <zhoukang199191@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 255faaf3436e1f41838062ed460f801bb0be40ec) Signed-off-by: Sean Owen <sean.owen@databricks.com> 03 February 2019, 14:46:24 UTC
190e48c [SPARK-26677][BUILD] Update Parquet to 1.10.1 with notEq pushdown fix. ## What changes were proposed in this pull request? Update to Parquet Java 1.10.1. ## How was this patch tested? Added a test from HyukjinKwon that validates the notEq case from SPARK-26677. Closes #23704 from rdblue/SPARK-26677-fix-noteq-parquet-bug. Lead-authored-by: Ryan Blue <blue@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Ryan Blue <rdblue@users.noreply.github.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit f72d2177882dc47b043fdc7dec9a46fe65df4ee9) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 02 February 2019, 17:18:08 UTC
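A sketch of the kind of pushed-down not-equal filter the upgrade affects; the path is hypothetical, an active `SparkSession` with implicits imported is assumed, and whether this exact data triggered the pre-fix bug is my assumption based on the issue description.

```scala
import org.apache.spark.sql.functions.not

// Write a column containing a null, then apply a negated null-safe equality
// filter that can be pushed down to Parquet.
Seq(Some("a"), None).toDF("c").write.parquet("/tmp/parquet_noteq")
spark.read.parquet("/tmp/parquet_noteq").where(not($"c".eqNullSafe("a"))).show()
```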
bd4ce51 [SPARK-26806][SS] EventTimeStats.merge should handle zeros correctly ## What changes were proposed in this pull request? Right now, EventTimeStats.merge doesn't handle `zero.merge(zero)` correctly, which makes `avg` become `NaN`. Whatever then gets merged with the result of `zero.merge(zero)`, `avg` will still be `NaN`. Finally, we call `NaN.toLong` and get `0`, and the user sees the following incorrect report: ``` "eventTime" : { "avg" : "1970-01-01T00:00:00.000Z", "max" : "2019-01-31T12:57:00.000Z", "min" : "2019-01-30T18:44:04.000Z", "watermark" : "1970-01-01T00:00:00.000Z" } ``` This issue was reported by liancheng. This PR fixes the above issue. ## How was this patch tested? The new unit tests. Closes #23718 from zsxwing/merge-zero. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> (cherry picked from commit 03a928cbecaf38bbbab3e6b957fcbb542771cfbd) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 01 February 2019, 19:15:47 UTC
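A simplified model of the fix; this is a hypothetical class, not Spark's EventTimeStats: merging must short-circuit when either side is empty so the average never becomes NaN.

```scala
// count == 0 marks the "zero" element; merging with it returns the other side
// unchanged instead of dividing by a zero total.
case class TimeStats(max: Long, min: Long, avg: Double, count: Long) {
  def merge(that: TimeStats): TimeStats = {
    if (that.count == 0) this
    else if (this.count == 0) that
    else {
      val total = count + that.count
      TimeStats(
        math.max(max, that.max),
        math.min(min, that.min),
        avg + (that.avg - avg) * that.count / total,  // incremental mean, never NaN
        total)
    }
  }
}
```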
2a83431 [SPARK-26745][SPARK-24959][SQL][BRANCH-2.4] Revert count optimization in JSON datasource by ## What changes were proposed in this pull request? This PR reverts the JSON count optimization part of #21909. We cannot distinguish the cases below without parsing: ``` [{...}, {...}] ``` ``` [] ``` ``` {...} ``` ```bash # empty string ``` when we `count()`. One line (input: IN) can be 0 records, 1 record, or multiple records, depending on the input. See also https://github.com/apache/spark/pull/23665#discussion_r251276720. ## How was this patch tested? Manually tested. Closes #23708 from HyukjinKwon/SPARK-26745-backport. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 01 February 2019, 02:22:05 UTC
2b5e033 [SPARK-26757][GRAPHX] Return 0 for `count` on empty Edge/Vertex RDDs ## What changes were proposed in this pull request? Previously a "java.lang.UnsupportedOperationException: empty collection" exception would be thrown due to using `reduce`, rather than `fold` or similar that can tolerate empty RDDs. This behaviour has existed for the Vertex RDDs since it was introduced in b30e0ae0351be1cbc0b1cf179293587b466ee026. It seems this behaviour was inherited by the Edge RDDs via copy-paste in ee29ef3800438501e0ff207feb00a28973fc0769. ## How was this patch tested? Two new unit tests. Closes #23681 from huonw/empty-graphx. Authored-by: Huon Wilson <Huon.Wilson@data61.csiro.au> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit da526985c7574dccdcc0cca7452e2e999a5b3012) Signed-off-by: Sean Owen <sean.owen@databricks.com> 31 January 2019, 23:27:46 UTC
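The underlying RDD behavior, in a short sketch assuming an active `SparkContext` named `sc`:

```scala
val empty = sc.emptyRDD[Long]

// empty.reduce(_ + _)             // throws java.lang.UnsupportedOperationException: empty collection
val total = empty.fold(0L)(_ + _)  // returns 0, which is what count-like logic needs
```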