https://github.com/apache/spark

e2484f6 Preparing Spark release v3.4.0-rc1 21 February 2023, 10:39:21 UTC
f394322 [SPARK-42507][SQL][TESTS] Simplify ORC schema merging conflict error check ### What changes were proposed in this pull request? This PR aims to simplify ORC schema merging conflict error check. ### Why are the changes needed? Currently, `branch-3.4` CI is broken because the order of partitions. - https://github.com/apache/spark/runs/11463120795 - https://github.com/apache/spark/runs/11463886897 - https://github.com/apache/spark/runs/11467827738 - https://github.com/apache/spark/runs/11471484144 - https://github.com/apache/spark/runs/11471507531 - https://github.com/apache/spark/runs/11474764316 ![Screenshot 2023-02-20 at 12 30 19 PM](https://user-images.githubusercontent.com/9700541/220193503-6d6ce2ce-3fd6-4b01-b91c-bc1ec1f41c03.png) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the CIs. Closes #40101 from dongjoon-hyun/SPARK-42507. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Xinrong Meng <xinrong@apache.org> (cherry picked from commit 0c20263dcd0c394f8bfd6fa2bfc62031135de06a) Signed-off-by: Xinrong Meng <xinrong@apache.org> 21 February 2023, 09:48:35 UTC
1dfa58d Preparing development version 3.4.1-SNAPSHOT 21 February 2023, 02:43:10 UTC
81d39dc Preparing Spark release v3.4.0-rc1 21 February 2023, 02:43:05 UTC
4560d4c [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160 ### What changes were proposed in this pull request? SPARK-41952 was raised for a while, but unfortunately, the Parquet community does not publish the patched version yet, as a workaround, we can fix the issue on the Spark side first. We encountered this memory issue when migrating data from parquet/snappy to parquet/zstd, Spark executors always occupy unreasonable off-heap memory and have a high risk of being killed by NM. See more discussions at https://github.com/apache/parquet-mr/pull/982 and https://github.com/apache/iceberg/pull/5681 ### Why are the changes needed? The issue is fixed in the parquet community [PARQUET-2160](https://issues.apache.org/jira/browse/PARQUET-2160), but the patched version is not available yet. ### Does this PR introduce _any_ user-facing change? Yes, it's bug fix. ### How was this patch tested? The existing UT should cover the correctness check, I also verified this patch by scanning a large parquet/zstd table. ``` spark-shell --executor-cores 4 --executor-memory 6g --conf spark.executor.memoryOverhead=2g ``` ``` spark.sql("select sum(hash(*)) from parquet_zstd_table ").show(false) ``` - before this patch All executors get killed by NM quickly. ``` ERROR YarnScheduler: Lost executor 1 on hadoop-xxxx.****.org: Container killed by YARN for exceeding physical memory limits. 8.2 GB of 8 GB physical memory used. Consider boosting spark.executor.memoryOverhead. ``` <img width="1872" alt="image" src="https://user-images.githubusercontent.com/26535726/220031678-e9060244-5586-4f0c-8fe7-55bb4e20a580.png"> - after this patch Query runs well, no executor gets killed. <img width="1881" alt="image" src="https://user-images.githubusercontent.com/26535726/220031917-4fe38c07-b38f-49c6-a982-2091a6c2a8ed.png"> Closes #40091 from pan3793/SPARK-41952. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Chao Sun <sunchao@apple.com> 20 February 2023, 17:44:57 UTC
18d5d5e [SPARK-42423][SQL] Add metadata column file block start and length ### What changes were proposed in this pull request? Support `_metadata.file_block_start` and `_metadata.file_block_length` for datasource file metadata columns. Note that data filters are not supported on them, since block start and length are only known after splitting files. ### Why are the changes needed? To improve observability. Currently, we have a built-in function `InputFileBlockStart` which has some issues, e.g. it does not work for joins. It's better to encourage people to switch to the metadata column. File block length is also important information: people can see how Spark splits large files. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Improved existing tests and added new tests. Closes #39996 from ulysses-you/SPARK-42423. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ae97131f1afa5deac2bd183872cedd8829024efa) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 February 2023, 14:33:55 UTC
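A minimal PySpark sketch of reading the new block metadata fields; the path is a placeholder, while `_metadata.file_block_start` and `_metadata.file_block_length` are the column names introduced by this change.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical Parquet path; any file-based source exposes the hidden _metadata column.
df = spark.read.parquet("/tmp/events")

df.select(
    "*",
    "_metadata.file_path",
    "_metadata.file_block_start",   # byte offset where this task's file split begins
    "_metadata.file_block_length",  # length of the split in bytes
).show(truncate=False)
```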
ad0607a [SPARK-42476][CONNECT][DOCS] Complete Spark Connect API reference ### What changes were proposed in this pull request? This PR proposes to complete missing API Reference for Spark Connect. Built API docs should include "Changed in version" for Spark Connect when it's implemented as below: <img width="814" alt="Screen Shot 2023-02-20 at 9 49 09 AM" src="https://user-images.githubusercontent.com/44108233/219986313-374e0959-b8c5-44f6-942c-bba1c0407909.png"> ### Why are the changes needed? Improving usability for Spark Connect. ### Does this PR introduce _any_ user-facing change? No, it's documentation. ### How was this patch tested? Manually built docs, confirmed each function and class one by one. Closes #40067 from itholic/SPARK-42476. Lead-authored-by: itholic <haejoon.lee@databricks.com> Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit e6c201df33b123c3bfc632012abeaa0db6c417bc) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 20 February 2023, 14:24:05 UTC
014c60f [SPARK-42477][CONNECT][PYTHON] accept user_agent in spark connect's connection string ### What changes were proposed in this pull request? Currently, the Spark Connect service's `client_type` attribute (which is really [user agent]) is set to `_SPARK_CONNECT_PYTHON` to signify PySpark. With this change, the connection for the Spark Connect remote accepts an optional `user_agent` parameter which is then passed down to the service. [user agent]: https://www.w3.org/WAI/UA/work/wiki/Definition_of_User_Agent ### Why are the changes needed? This enables partners using Spark Connect to set their application as the user agent, which then allows visibility and measurement of integrations and usages of spark connect. ### Does this PR introduce _any_ user-facing change? A new optional `user_agent` parameter is now recognized as part of the Spark Connect connection string. ### How was this patch tested? - unit tests attached - manually running the `pyspark` binary with the `user_agent` connection string set and verifying the payload sent to the server. Similar testing for the default. Closes #40054 from nija-at/user-agent. Authored-by: Niranjan Jayakar <nija@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit b887d3de954ae5b2482087fe08affcc4ac60c669) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 20 February 2023, 14:02:19 UTC
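A hedged sketch of passing the new parameter in a Spark Connect connection string; the host, port, and agent value are placeholders, while `user_agent` is the key added by this change.

```python
from pyspark.sql import SparkSession

# Connection-string parameters follow the path separator as key=value pairs.
spark = (
    SparkSession.builder
    .remote("sc://localhost:15002/;user_agent=my_partner_app")  # placeholder endpoint and agent
    .getOrCreate()
)

spark.range(5).show()
```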
ee4cee0 [SPARK-41741][SQL] Encode the string using the UTF_8 charset in ParquetFilters This PR encodes strings using the `UTF_8` charset in `ParquetFilters`, fixing a data correctness issue when the JVM default charset is not `UTF_8`. No user-facing change. Manually tested. Closes #40090 from wangyum/SPARK-41741. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit d5fa41efe2b1aa0aa41f558c1bef048b4632cf5c) Signed-off-by: Yuming Wang <yumwang@ebay.com> 20 February 2023, 11:19:52 UTC
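For context, a small sketch of the kind of query affected: an equality filter on a non-ASCII string literal that gets pushed down to the Parquet reader (path and values are made up).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Before this fix, the pushed-down literal was encoded with the JVM default
# charset, which could drop matching rows when that charset was not UTF-8.
df = spark.read.parquet("/tmp/users")        # hypothetical path
df.filter(F.col("city") == "café").count()
```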
5d617e3 [SPARK-41959][SQL] Improve v1 writes with empty2null ### What changes were proposed in this pull request? Cleanup some unnecessary `Empty2Null` related code ### Why are the changes needed? V1Writes checked idempotency using WriteFiles, so it's unnecessary to check if empty2null if exists again. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? pass CI Closes #39475 from ulysses-you/SPARK-41959. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 547737b82dfee7e800930fd91bf2761263f68881) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 February 2023, 08:41:24 UTC
0c24adf [SPARK-42398][SQL] Refine default column value DS v2 interface ### What changes were proposed in this pull request? The current default value DS V2 API is a bit inconsistent. The `createTable` API only takes `StructType`, so implementations must know the special metadata key of the default value to access it. The `TableChange` API has the default value as an individual field. This API adds a new `Column` interface, which holds both current default (as a SQL string) and exist default (as a v2 literal). `createTable` API now takes `Column`. This avoids the need of special metadata key and is also more extensible when adding more special cols like generated cols. This is also type-safe and makes sure the exist default is literal. The implementation is free to decide how to encode and store default values. Note: backward compatibility is taken care of. ### Why are the changes needed? better DS v2 API for default value ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #40049 from cloud-fan/table2. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 70a098c83da4cff2bdc8d15a5a8b513a32564dbc) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 February 2023, 08:31:05 UTC
ca268ef [SPARK-42427][SQL][TESTS][FOLLOW-UP] Remove duplicate overflow test for conv This PR proposes to remove a duplicate test (see https://github.com/apache/spark/commit/cb463fb40e8f663b7e3019c8d8560a3490c241d0). The test fails with ANSI mode on https://github.com/apache/spark/actions/runs/4213931226/jobs/7314033662, and since it is a duplicate it is better to remove it. No user-facing change, test-only. Manually tested. Closes #40088 from HyukjinKwon/SPARK-42427-followup. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 4a816ee0925d9cbfe2bd73e512137379f3334a77) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 20 February 2023, 04:25:20 UTC
7090212 [SPARK-42048][PYTHON][CONNECT] Fix the alias name for numpy literals ### What changes were proposed in this pull request? Fixes the alias name for numpy literals. Also fixes `F.lit` in Spark Connect to support `np.bool_` objects. ### Why are the changes needed? Currently the alias name for literals created from numpy scalars contains something like `CAST(` ... `AS <type>)`; the cast should be removed so that only the value string is returned, the same as for literals from Python numbers. ### Does this PR introduce _any_ user-facing change? The alias name will be changed. ### How was this patch tested? Modified/enabled related tests. Closes #40076 from ueshin/issues/SPARK-42048/lit. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit db9dbd90d8edff222636bebf25df2fb96adef534) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 20 February 2023, 00:08:20 UTC
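A small illustrative sketch, assuming numpy scalar literals are accepted by `lit` as the commit describes; the naming behavior in the comment is the change being made, not reproduced output.

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# After the fix, a literal built from a numpy scalar is named by its value
# (e.g. "1"), not wrapped in a CAST(... AS BIGINT) alias.
spark.range(1).select(F.lit(np.int64(1)), F.lit(np.bool_(True))).printSchema()
```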
9249488 [SPARK-41818][CONNECT][PYTHON][FOLLOWUP][TEST] Enable a doctest for DataFrame.write ### What changes were proposed in this pull request? Enables a doctest for `DataFrame.write`. ### Why are the changes needed? Now that `DataFrame.write.saveAsTable` has been fixed, we can enable the doctest. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enabled the doctest. Closes #40071 from ueshin/issues/SPARK-41818/doctest. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit b056e59e1cb246682b4f4cf178dca8b5e555f018) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 20 February 2023, 00:07:08 UTC
3bca65a [SPARK-42482][CONNECT] Scala Client Write API V1 ### What changes were proposed in this pull request? Implemented the basic Dataset#write API to allow users to write a DataFrame to tables, CSV files, and other formats. ### Why are the changes needed? Basic write operation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Integration tests. Closes #40061 from zhenlineo/write. Authored-by: Zhen Li <zhenlineo@users.noreply.github.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit ede1a541182e043b1f79a9ffbfc4a7fa97604078) Signed-off-by: Herman van Hovell <herman@databricks.com> 19 February 2023, 17:04:53 UTC
fdbc57a Preparing development version 3.4.1-SNAPSHOT 18 February 2023, 12:12:49 UTC
96cff93 Preparing Spark release v3.4.0-rc1 18 February 2023, 12:11:25 UTC
2b54f07 [SPARK-42430][DOC][FOLLOW-UP] Revise the java doc for TimestampNTZ & ANSI interval types ### What changes were proposed in this pull request? As https://github.com/apache/spark/pull/40005#pullrequestreview-1299089504 pointed out, the java doc for data type recommends using factory methods provided in org.apache.spark.sql.types.DataTypes. Since the ANSI interval types missed the `DataTypes` as well, this PR also revise their doc. ### Why are the changes needed? Unify the data type doc ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Local preview <img width="826" alt="image" src="https://user-images.githubusercontent.com/1097932/219821685-321c2fd1-6248-4930-9c61-eec68f0dcb50.png"> Closes #40074 from gengliangwang/reviseNTZDoc. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 8cfd5bf1ca4042541232ef1787349ddb876adcfa) Signed-off-by: Max Gekk <max.gekk@gmail.com> 18 February 2023, 07:30:42 UTC
681559e [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum} ### What changes were proposed in this pull request? Adding more API to `agg` including max,min,mean,count,avg,sum. ### Why are the changes needed? API coverage ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes #40070 from amaliujia/rw-agg2. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 74f53b8d008b8fd570439d5cc56a0c0753ff4910) Signed-off-by: Herman van Hovell <herman@databricks.com> 18 February 2023, 00:49:52 UTC
e4a6e58 Preparing development version 3.4.1-SNAPSHOT 17 February 2023, 21:33:40 UTC
09c2a32 Preparing Spark release v3.4.0-rc1 17 February 2023, 21:33:33 UTC
b38c2e1 [SPARK-42461][CONNECT] Scala Client implement first batch of functions ### What changes were proposed in this pull request? This PR adds the following functions to the Spark Connect Scala Client: - Sort Functions - Aggregate Functions - Misc Functions - Math Functions ### Why are the changes needed? We want the Spark Connect Scala Client to reach parity with the original functions API. ### Does this PR introduce _any_ user-facing change? Yes, it adds a lot of functions. ### How was this patch tested? Added tests for all functions and their significant variations. Closes #40050 from hvanhovell/SPARK-42461. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit e6a84fef2fa43b880c89f2f54a5fa78033021fb1) Signed-off-by: Herman van Hovell <herman@databricks.com> 17 February 2023, 12:19:17 UTC
5949e6d [SPARK-42474][CORE][K8S] Add extraJVMOptions JVM GC option K8s test cases ### What changes were proposed in this pull request? This PR aims to add JVM GC option test coverage to K8s Integration Suite. To reuse the existing code, `isG1GC` variable is moved from `MemoryManager` to `Utils`. ### Why are the changes needed? To provide more test coverage for JVM Options. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs ``` [info] KubernetesSuite: [info] - SPARK-42190: Run SparkPi with local[*] (4 seconds, 990 milliseconds) [info] - Run SparkPi with no resources (7 seconds, 101 milliseconds) [info] - Run SparkPi with no resources & statefulset allocation (7 seconds, 27 milliseconds) [info] - Run SparkPi with a very long application name. (7 seconds, 100 milliseconds) [info] - Use SparkLauncher.NO_RESOURCE (7 seconds, 947 milliseconds) [info] - Run SparkPi with a master URL without a scheme. (6 seconds, 932 milliseconds) [info] - Run SparkPi with an argument. (9 seconds, 47 milliseconds) [info] - Run SparkPi with custom labels, annotations, and environment variables. (6 seconds, 969 milliseconds) [info] - All pods have the same service account by default (6 seconds, 916 milliseconds) [info] - Run extraJVMOptions check on driver (3 seconds, 964 milliseconds) [info] - Run extraJVMOptions JVM GC option check - G1GC (3 seconds, 948 milliseconds) [info] - Run extraJVMOptions JVM GC option check - Other GC (4 seconds, 51 milliseconds) ... ``` Closes #40062 from dongjoon-hyun/SPARK-42474. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit ba8abdda3703ce9d60e26678290739d080020418) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 17 February 2023, 09:11:28 UTC
c4d3b0e [SPARK-39851][SQL] Improve join stats estimation if one side can keep uniqueness ### What changes were proposed in this pull request? This PR improves join stats estimation if one side can keep uniqueness(The distinct keys of the children of the join are a subset of the join keys). A common case is: ```sql SELECT i_item_sk ss_item_sk FROM item, (SELECT DISTINCT iss.i_brand_id brand_id, iss.i_class_id class_id, iss.i_category_id category_id FROM item iss) x WHERE i_brand_id = brand_id AND i_class_id = class_id AND i_category_id = category_id ``` In this case, the row count of the join will definitely not expand. Before this PR: ``` == Optimized Logical Plan == Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=370.8 MiB, rowCount=3.24E+7) +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = class_id#52)) AND (i_category_id#15 = category_id#53)), Statistics(sizeInBytes=1112.3 MiB, rowCount=3.24E+7) :- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5) : +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, rowCount=2.02E+5) : +- Relation spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) +- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, rowCount=1.37E+5) +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, rowCount=2.02E+5) +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, rowCount=2.02E+5) +- Relation spark_catalog.default.item[i_item_sk#55,i_item_id#56,i_rec_start_date#57,i_rec_end_date#58,i_item_desc#59,i_current_price#60,i_wholesale_cost#61,i_brand_id#62,i_brand#63,i_class_id#64,i_class#65,i_category_id#66,i_category#67,i_manufact_id#68,i_manufact#69,i_size#70,i_formulation#71,i_color#72,i_units#73,i_container#74,i_manager_id#75,i_product_name#76] parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) ``` After this PR: ``` == Optimized Logical Plan == Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=2.3 MiB, rowCount=2.02E+5) +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = class_id#52)) AND (i_category_id#15 = category_id#53)), Statistics(sizeInBytes=7.0 MiB, rowCount=2.02E+5) :- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5) : +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, rowCount=2.02E+5) : +- Relation spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) 
+- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, rowCount=1.37E+5) +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, rowCount=2.02E+5) +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, rowCount=2.02E+5) +- Relation spark_catalog.default.item[i_item_sk#55,i_item_id#56,i_rec_start_date#57,i_rec_end_date#58,i_item_desc#59,i_current_price#60,i_wholesale_cost#61,i_brand_id#62,i_brand#63,i_class_id#64,i_class#65,i_category_id#66,i_category#67,i_manufact_id#68,i_manufact#69,i_size#70,i_formulation#71,i_color#72,i_units#73,i_container#74,i_manager_id#75,i_product_name#76] parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) ``` ### Why are the changes needed? Plan more broadcast joins to improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test and TPC-DS benchmark test. SQL Before this PR(Seconds) After this PR(Seconds) q14a 187  164 Closes #39923 from wankunde/SPARK-39851. Lead-authored-by: wankunde <wankunde@163.com> Co-authored-by: Kun Wan <wankun@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 1c0bd1f9f813a341bbfdecb2c5ccde7fbc1bac2d) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 17 February 2023, 04:48:08 UTC
ca811db [SPARK-39859][SQL] Support v2 DESCRIBE TABLE EXTENDED for columns ### What changes were proposed in this pull request? Support v2 DESCRIBE TABLE EXTENDED for columns ### Why are the changes needed? DS v1/v2 command parity ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #40058 from huaxingao/describe_col. Authored-by: huaxingao <huaxin_gao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit ebab0ef7c8572e1dac41474c5991f482dbe9d253) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 17 February 2023, 04:41:04 UTC
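A quick sketch of the column-level form that now also works for v2 tables; the catalog, table, and column identifiers are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Column-level DESCRIBE; "testcat.ns.tbl" and "price" are hypothetical names.
spark.sql("DESCRIBE TABLE EXTENDED testcat.ns.tbl price").show(truncate=False)
```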
9f22acd Preparing development version 3.4.1-SNAPSHOT 17 February 2023, 03:54:24 UTC
d89d6ab Preparing Spark release v3.4.0-rc1 17 February 2023, 03:54:15 UTC
7c1c8be [SPARK-42468][CONNECT] Implement agg by (String, String)* ### What changes were proposed in this pull request? Starting to support basic aggregation in Scala client. The first step is to support aggregation by strings. ### Why are the changes needed? API coverage ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #40057 from amaliujia/rw-agg. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit cc471a52d162d0e4d4063372253ed06a62f5cb19) Signed-off-by: Herman van Hovell <herman@databricks.com> 17 February 2023, 03:03:03 UTC
d3a841d [SPARK-42002][CONNECT][PYTHON][FOLLOWUP] Enable tests in ReadwriterV2ParityTests ### What changes were proposed in this pull request? Enables tests in `ReadwriterV2ParityTests`. ### Why are the changes needed? Now that `DataFrameWriterV2` for Spark Connect is implemented, we can enable tests in `ReadwriterV2ParityTests`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enabled tests. Closes #40060 from ueshin/issues/SPARK-42002/tests. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit 961d5bd909744fea24e2391cd1a7aea3c96c418d) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 17 February 2023, 01:26:37 UTC
0efbfd5 [SPARK-39904][SQL][FOLLOW-UP] Rename CSV option `prefersDate` as `preferDate` ### What changes were proposed in this pull request? Rename the CSV data source option `prefersDate` to `preferDate`. ### Why are the changes needed? None of the other CSV data source options put an `s` on the verb in the name; for example, `inferSchema`, `ignoreLeadingWhiteSpace` and `ignoreTrailingWhiteSpace`. The renaming keeps the naming consistent. Also, the title of JIRA https://issues.apache.org/jira/browse/SPARK-39904 uses `preferDate` as well. ### Does this PR introduce _any_ user-facing change? No, the data source option is not released yet. ### How was this patch tested? Existing UT Closes #40043 from gengliangwang/renameCSVOption. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 6ead12e4ac08cef6c1346df3d380e85e5937a842) Signed-off-by: Gengliang Wang <gengliang@apache.org> 17 February 2023, 01:23:30 UTC
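A minimal sketch using the renamed option (the input path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With schema inference on, preferDate lets columns containing only dates be
# inferred as DateType rather than falling back to timestamp or string.
df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .option("preferDate", True)
    .csv("/tmp/input.csv")  # hypothetical path
)
df.printSchema()
```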
cb1fb7a [SPARK-42465][CONNECT] ProtoToPlanTestSuite should use analyzed plans instead of parsed plans ### What changes were proposed in this pull request? This makes `ProtoToPlanTestSuite` use analyzed plans instead of parsed plans. ### Why are the changes needed? This is to increase the fidelity of the `ProtoToPlanTestSuite`, especially since we are going to be adding functions. Functions are special because the spark connect planner leaves them unresolved, the actual binding only happens in the analyzer. Without running the analyzer we would not know if the bindings are correct. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? It is a test change. Closes #40056 from hvanhovell/SPARK-42465. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 64aa84b21c7664205a060933d8d9c3067da8218b) Signed-off-by: Herman van Hovell <herman@databricks.com> 16 February 2023, 22:33:22 UTC
7e2642d [SPARK-42326][SQL] Integrate `_LEGACY_ERROR_TEMP_2099` into `UNSUPPORTED_DATATYPE` ### What changes were proposed in this pull request? This PR proposes to integrate `_LEGACY_ERROR_TEMP_2099` into `UNSUPPORTED_DATATYPE`. And also introduce new error class `UNSUPPORTED_ARROWTYPE`. ### Why are the changes needed? We should assign proper name for LEGACY errors. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated UT. Closes #39979 from itholic/LEGACY_2099. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 9855b137032bf9504dff96eb5bb9951accacac0f) Signed-off-by: Max Gekk <max.gekk@gmail.com> 16 February 2023, 19:04:35 UTC
1c70f5f [SPARK-42464][CONNECT] Fix ProtoToPlanTestSuite for Scala 2.13 ### What changes were proposed in this pull request? The `ProtoToPlanTestSuite` tests were broken for Scala 2.13. This was caused by two problems: - Explain output differs between 2.12 and 2.13 because we render collection implementations as well. For this I changed the rendering of the offending classes to be version agnostic. - UDF code had deserialization issues, which was always a risk. I have removed those tests; we will work on improving UDF coverage in a follow-up. ### Why are the changes needed? We want to test Scala 2.13 as well. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test only change. Closes #40055 from hvanhovell/SPARK-42464. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 229807efd06562fc07fce9a257062d8a66068761) Signed-off-by: Herman van Hovell <herman@databricks.com> 16 February 2023, 18:47:16 UTC
130c82c [SPARK-42287][CONNECT][BUILD] Fix shading so that the JVM client jar can include all 3rd-party dependencies in the runtime ### What changes were proposed in this pull request? Fix the JVM client dependencies and shading result. The common jar should not be shaded. The shading should be done in client or server. The common jar shall depends on minimal dependency if possible. The client jar should provides all compiled dependencies out of the box, including netty etc. The client sbt and mvn shall produce the same shading result. The current client dependency summary: ``` [INFO] --- maven-dependency-plugin:3.3.0:tree (default-cli) spark-connect-client-jvm_2.12 --- [INFO] org.apache.spark:spark-connect-client-jvm_2.12:jar:3.5.0-SNAPSHOT [INFO] +- org.apache.spark:spark-connect-common_2.12:jar:3.5.0-SNAPSHOT:compile [INFO] | +- org.scala-lang:scala-library:jar:2.12.17:compile [INFO] | +- io.grpc:grpc-netty:jar:1.47.0:compile [INFO] | | +- io.grpc:grpc-core:jar:1.47.0:compile [INFO] | | | +- com.google.code.gson:gson:jar:2.9.0:runtime [INFO] | | | +- com.google.android:annotations:jar:4.1.1.4:runtime [INFO] | | | \- org.codehaus.mojo:animal-sniffer-annotations:jar:1.19:runtime [INFO] | | \- io.perfmark:perfmark-api:jar:0.25.0:runtime [INFO] | +- io.grpc:grpc-protobuf:jar:1.47.0:compile [INFO] | | +- io.grpc:grpc-api:jar:1.47.0:compile [INFO] | | | \- io.grpc:grpc-context:jar:1.47.0:compile [INFO] | | +- com.google.api.grpc:proto-google-common-protos:jar:2.0.1:compile [INFO] | | \- io.grpc:grpc-protobuf-lite:jar:1.47.0:compile [INFO] | +- io.grpc:grpc-services:jar:1.47.0:compile [INFO] | | \- com.google.protobuf:protobuf-java-util:jar:3.19.2:runtime [INFO] | \- io.grpc:grpc-stub:jar:1.47.0:compile [INFO] +- com.google.protobuf:protobuf-java:jar:3.21.12:compile [INFO] +- com.google.guava:guava:jar:31.0.1-jre:compile [INFO] | +- com.google.guava:failureaccess:jar:1.0.1:compile [INFO] | +- com.google.guava:listenablefuture:jar:9999.0-empty-to-avoid-conflict-with-guava:compile [INFO] | +- com.google.code.findbugs:jsr305:jar:3.0.0:compile [INFO] | +- org.checkerframework:checker-qual:jar:3.12.0:compile [INFO] | +- com.google.errorprone:error_prone_annotations:jar:2.7.1:compile [INFO] | \- com.google.j2objc:j2objc-annotations:jar:1.3:compile [INFO] +- io.netty:netty-codec-http2:jar:4.1.87.Final:compile [INFO] | +- io.netty:netty-common:jar:4.1.87.Final:compile [INFO] | +- io.netty:netty-buffer:jar:4.1.87.Final:compile [INFO] | +- io.netty:netty-transport:jar:4.1.87.Final:compile [INFO] | | \- io.netty:netty-resolver:jar:4.1.87.Final:compile [INFO] | +- io.netty:netty-codec:jar:4.1.87.Final:compile [INFO] | +- io.netty:netty-handler:jar:4.1.87.Final:compile [INFO] | \- io.netty:netty-codec-http:jar:4.1.87.Final:compile [INFO] +- io.netty:netty-handler-proxy:jar:4.1.87.Final:compile [INFO] | \- io.netty:netty-codec-socks:jar:4.1.87.Final:compile [INFO] +- io.netty:netty-transport-native-unix-common:jar:4.1.87.Final:compile [INFO] +- org.spark-project.spark:unused:jar:1.0.0:compile ``` ### Why are the changes needed? Fix the client jar package. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #39866 from zhenlineo/fix-jars. Authored-by: Zhen Li <zhenlineo@users.noreply.github.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 49af23aa87d7a566e16c81afbcfb49c0e2064536) Signed-off-by: Herman van Hovell <herman@databricks.com> 16 February 2023, 18:26:11 UTC
53f11b1 [SPARK-42457][CONNECT] Adding SparkSession#read ### What changes were proposed in this pull request? Add SparkSession Read API to read data into Spark via Scala Client: ``` DataFrameReader.format(…).option(“key”, “value”).schema(…).load() ``` The following methods are skipped by the Scala Client on purpose: ``` [info] deprecated method json(org.apache.spark.api.java.JavaRDD)org.apache.spark.sql.Dataset in class org.apache.spark.sql.DataFrameReader does not have a correspondent in client version [info] deprecated method json(org.apache.spark.rdd.RDD)org.apache.spark.sql.Dataset in class org.apache.spark.sql.DataFrameReader does not have a correspondent in client version [info] method json(org.apache.spark.sql.Dataset)org.apache.spark.sql.Dataset in class org.apache.spark.sql.DataFrameReader does not have a correspondent in client version [info] method csv(org.apache.spark.sql.Dataset)org.apache.spark.sql.Dataset in class org.apache.spark.sql.DataFrameReader does not have a correspondent in client version ``` ### Why are the changes needed? To read data from csv etc. format. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? E2E, Golden tests. Closes #40025 from zhenlineo/session-read. Authored-by: Zhen Li <zhenlineo@users.noreply.github.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 8d863e306cb105b715c8b9206a2bdd944dafa90b) Signed-off-by: Herman van Hovell <herman@databricks.com> 16 February 2023, 18:23:02 UTC
04b7cdf [SPARK-40943][SQL] Make `MSCK` keyword optional in `REPAIR TABLE` syntax Make the `MSCK` keyword optional in `MSCK REPAIR TABLE` commands. The use of the keyword `MSCK`, meaning metastore check, is arcane and does not add value for the command. Removing it makes the meaning of `REPAIR TABLE` commands more clear. The [Spark documentation page](https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-repair-table.html) for this command is titled "REPAIR TABLE". It only mentions the MSCK keyword in the content of the page, indicating that this keyword is not necessary for the semantic understanding of the command. Additionally, here is a reference from [MySQL](https://dev.mysql.com/doc/refman/8.0/en/repair-table.html), which completely omits the MSCK keyword. Yes: previously, it was not possible to specify `REPAIR TABLE` without `MSCK`; now it is possible. This PR does not change the existing behaviour of the original `MSCK REPAIR TABLE` syntax. Unit tests. Closes #38433 from ben-zhang/SPARK-40943. Authored-by: Ben Zhang <ben.zhang@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 64aef23d9a8c4f9222d6cf9994545157487f78b1) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 16 February 2023, 07:02:46 UTC
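Both spellings now parse; a quick sketch via `spark.sql` (the table name is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Previously only the first form was accepted; the MSCK keyword is now optional.
spark.sql("MSCK REPAIR TABLE my_partitioned_table")
spark.sql("REPAIR TABLE my_partitioned_table")
```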
48e185d [SPARK-42460][CONNECT] Clean-up results in ClientE2ETestSuite ### What changes were proposed in this pull request? Clean-up results in `ClientE2ETestSuite`. ### Why are the changes needed? `ClientE2ETestSuite` is very noisy because we do not clean-up results. This makes testing a bit annoying. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? It is a test. Closes #40048 from hvanhovell/SPARK-42460. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 9626aef10c29c1e405602326e481058a41211dba) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 16 February 2023, 06:51:19 UTC
771b867 [SPARK-42462][K8S] Prevent `docker-image-tool.sh` from publishing OCI manifests ### What changes were proposed in this pull request? This is found during Apache Spark 3.3.2 docker image publishing. It's not an Apache Spark but important for `docker-image-tool.sh` to provide backward compatibility during cross-building. This PR targets for all **future releases**, Apache Spark 3.4.0/3.3.3/3.2.4. ### Why are the changes needed? Docker `buildx` v0.10.0 publishes OCI Manifests by default which is not supported by `docker manifest` command like the following. https://github.com/docker/buildx/issues/1509 ``` $ docker manifest inspect apache/spark:v3.3.2 no such manifest: docker.io/apache/spark:v3.3.2 ``` Note that the published images are working on both AMD64/ARM64 machines, but `docker manifest` cannot be used. For example, we cannot create `latest` tag. ### Does this PR introduce _any_ user-facing change? This will fix the regression of Docker `buildx`. ### How was this patch tested? Manually builds the multi-arch image and check `manifest`. ``` $ docker manifest inspect apache/spark:v3.3.2 { "schemaVersion": 2, "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json", "manifests": [ { "mediaType": "application/vnd.docker.distribution.manifest.v2+json", "size": 3444, "digest": "sha256:30ae5023fc384ae3b68d2fb83adde44b1ece05f926cfceecac44204cdc9e79cb", "platform": { "architecture": "amd64", "os": "linux" } }, { "mediaType": "application/vnd.docker.distribution.manifest.v2+json", "size": 3444, "digest": "sha256:aac13b5b5a681aefa91036d2acae91d30a743c2e78087c6df79af4de46a16e1b", "platform": { "architecture": "arm64", "os": "linux" } } ] } ``` Closes #40051 from dongjoon-hyun/SPARK-42462. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 2ac70ae5381333aa899d82f6cd4c3bbae524e1c2) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 16 February 2023, 05:52:33 UTC
3035ebb [SPARK-42459][CONNECT] Create pyspark.sql.connect.utils to keep common codes ### What changes were proposed in this pull request? This PR proposes to create `pyspark.sql.connect.utils` to keep common code, especially code related to dependencies. ### Why are the changes needed? For example, [SPARK-41457](https://issues.apache.org/jira/browse/SPARK-41457) added `require_minimum_grpc_version` in `pyspark.sql.pandas.utils`, which is actually unrelated to the connect module. We should move all of this to a separate utils directory for better readability and maintenance. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Existing tests should cover this. Closes #40047 from HyukjinKwon/refactor-utils. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 7ee8a32077b09cb847b6ac41cdc5067cf7bd83e9) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 February 2023, 03:04:33 UTC
513709e [SPARK-41817][CONNECT][PYTHON][TEST] Enable the doctest pyspark.sql.connect.readwriter.DataFrameReader.option ### What changes were proposed in this pull request? Enables the doctest `pyspark.sql.connect.readwriter.DataFrameReader.option`. ### Why are the changes needed? The issue described [SPARK-41817](https://issues.apache.org/jira/browse/SPARK-41817) was already fixed at #39545. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The enabled test. Closes #40046 from ueshin/issues/SPARK-41817/read. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit e3d98a7e07596c647d974703fce4cd7d41ad736e) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 February 2023, 02:32:10 UTC
42337fe [SPARK-42453][CONNECT] Implement function max in Scala client ### What changes were proposed in this pull request? This PR tries to implement functions.max for Scala client. ### Why are the changes needed? API coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing scala client testing framework. Closes #40041 from amaliujia/rw-scala-count. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 5d81a1cf91a53fd3fe7051f25fdeedf389413f09) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 February 2023, 01:43:28 UTC
e9f109a [SPARK-42456][DOCS][PYTHON] Consolidating the PySpark version upgrade note pages into a single page to make it easier to read ### What changes were proposed in this pull request? Creating a new PySpark migration guide sub page and consolidating the existing 9 separate pages into this one new page. This makes it easier to take a look across multiple version upgrades by simply scrolling on the page instead of having to navigate back and forth. Note that this is similar to the Spark Core Migration Guide page here: https://spark.apache.org/docs/latest/core-migration-guide.html Also updating the existing main Migration Guide page to point to this new sub page and making some minor language updates to help readers. ### Why are the changes needed? To improve usability of the PySpark doc site. ### Does this PR introduce _any_ user-facing change? Yes, the user facing PySpark documentation is updated. ### How was this patch tested? Built and tested the PySpark documentation web site locally. Closes #40044 from allanf-db/pyspark_doc_updates. Authored-by: Allan Folting <allan.folting@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 10c0528cf71a092bd78bfec2c7cfe69a3ebd99ec) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 February 2023, 01:35:11 UTC
545abcf [SPARK-42426][CONNECT] Fix DataFrameWriter.insertInto to call the corresponding method instead of saveAsTable ### What changes were proposed in this pull request? Fixes `DataFrameWriter.insertInto` to call the corresponding method instead of `saveAsTable`. ### Why are the changes needed? Currently `SparkConnectPlanner` calls `saveAsTable` instead of `insertInto` even for `DataFrameWriter.insertInto` in Spark Connect, but they have different logic internally, so we should use the corresponding method. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enabled related tests. Closes #40024 from ueshin/issues/SPARK-42426/insertInto. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 3cbc900d2c2947c85447ef2bd8c1f385ca6e1c49) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 February 2023, 00:29:44 UTC
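For context, a small sketch of the two write paths whose semantics differ (the table name is a placeholder): `saveAsTable` can create the table and matches columns by name, while `insertInto` appends into an existing table by position.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).withColumnRenamed("id", "value")

# Creates or replaces the managed table, matching columns by name.
df.write.mode("overwrite").saveAsTable("demo_tbl")

# Appends into the existing table, matching columns by position -- the method
# the planner now actually calls for DataFrameWriter.insertInto.
df.write.insertInto("demo_tbl")
```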
d3a57df [SPARK-42455][SQL] Rename JDBC option inferTimestampNTZType as preferTimestampNTZ Similar with https://github.com/apache/spark/pull/37327, this PR renames the JDBC data source option `inferTimestampNTZType` as `preferTimestampNTZ` It is simpler and more straightforward. Also, it is consistent with the CSV data source option introduced in https://github.com/apache/spark/pull/37327, No, the TimestampNTZ project is not released yet. UT Closes #40042 from gengliangwang/inferNTZOption. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 8194522225240a2192b9132858d0a324c0e94eb2) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 February 2023, 00:29:02 UTC
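A hedged sketch of the renamed JDBC option; the URL and table are placeholders, and a matching JDBC driver is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With preferTimestampNTZ=true, timestamp-without-time-zone columns are read
# as TIMESTAMP_NTZ instead of the session-local TIMESTAMP_LTZ.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/testdb")  # placeholder
    .option("dbtable", "events")                               # placeholder
    .option("preferTimestampNTZ", True)
    .load()
)
df.printSchema()
```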
cffd4c7 [SPARK-42384][SQL] Check for null input in generated code for mask function ### What changes were proposed in this pull request? When generating code for the mask function, call `ctx.nullSafeExec` to produce null safe code. This change assumes that the mask function returns null only when the input is null (which appears to be the case, from reading the code of `Mask.transformInput`). ### Why are the changes needed? The following query fails with a `NullPointerException`: ``` create or replace temp view v1 as select * from values (null), ('AbCD123-$#') as data(col1); cache table v1; select mask(col1) from v1; 23/02/07 16:36:06 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3) java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) ``` The generated code calls `UnsafeWriter.write(0, value_0)` regardless of whether `Mask.transformInput` returns null or not. The `UnsafeWriter.write` method for `UTF8String` does not expect a null pointer. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit tests. Closes #39945 from bersprockets/mask_npe_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 7ff8ba29f2cb03686fb79c44e82223485e480ea4) Signed-off-by: Gengliang Wang <gengliang@apache.org> 16 February 2023, 00:26:35 UTC
e724f1e [SPARK-42441][CONNECT] Scala Client add Column APIs ### What changes were proposed in this pull request? This PR adds most Column APIs for the Spark Connect Scala Client. ### Why are the changes needed? We want the Scala Client to have API parity with the existing SparkSession/Dataset APIs. ### How was this patch tested? Golden files, and I added a test for local behavior. Closes #40027 from hvanhovell/SPARK-42441. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit d4a5beb45b8a6fbde3b7d252fcd83755cef97aa3) Signed-off-by: Herman van Hovell <herman@databricks.com> 15 February 2023, 18:43:08 UTC
abf40a8 [SPARK-42445][R] Fix SparkR `install.spark` function ### What changes were proposed in this pull request? This PR fixes `SparkR` `install.spark` method. ``` $ curl -LO https://dist.apache.org/repos/dist/dev/spark/v3.3.2-rc1-bin/SparkR_3.3.2.tar.gz $ R CMD INSTALL SparkR_3.3.2.tar.gz $ R R version 4.2.1 (2022-06-23) -- "Funny-Looking Kid" Copyright (C) 2022 The R Foundation for Statistical Computing Platform: aarch64-apple-darwin20 (64-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. > library(SparkR) Attaching package: ‘SparkR’ The following objects are masked from ‘package:stats’: cov, filter, lag, na.omit, predict, sd, var, window The following objects are masked from ‘package:base’: as.data.frame, colnames, colnames<-, drop, endsWith, intersect, rank, rbind, sample, startsWith, subset, summary, transform, union > install.spark() Spark not found in the cache directory. Installation will start. MirrorUrl not provided. Looking for preferred site from apache website... Preferred mirror site found: https://dlcdn.apache.org/spark Downloading spark-3.3.2 for Hadoop 2.7 from: - https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop2.7.tgz trying URL 'https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop2.7.tgz' simpleWarning in download.file(remotePath, localPath): downloaded length 0 != reported length 196 > install.spark(hadoopVersion="3") Spark not found in the cache directory. Installation will start. MirrorUrl not provided. Looking for preferred site from apache website... Preferred mirror site found: https://dlcdn.apache.org/spark Downloading spark-3.3.2 for Hadoop 3 from: - https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-3.tgz trying URL 'https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-3.tgz' simpleWarning in download.file(remotePath, localPath): downloaded length 0 != reported length 196 ``` Note that this is a regression at Spark 3.3.0 and not a blocker for on-going Spark 3.3.2 RC vote. ### Why are the changes needed? https://spark.apache.org/docs/latest/api/R/reference/install.spark.html#ref-usage ![Screenshot 2023-02-14 at 10 07 49 PM](https://user-images.githubusercontent.com/9700541/218946460-ab7eab1b-65ae-4cb2-bc7c-5810ad359ac9.png) First, the existing Spark 2.0.0 link is broken. - https://spark.apache.org/docs/latest/api/R/reference/install.spark.html#details - http://apache.osuosl.org/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz (Broken) Second, Spark 3.3.0 changed the Hadoop postfix pattern from the distribution files so that the function raises errors as described before. - http://archive.apache.org/dist/spark/spark-3.2.3/spark-3.2.3-bin-hadoop2.7.tgz (Old Pattern) - http://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz (New Pattern) ### Does this PR introduce _any_ user-facing change? No, this fixes a bug like Spark 3.2.3 and older versions. ### How was this patch tested? Pass the CI and manual testing. Please note that the link pattern is correct although it fails because 3.5.0 is not published yet. 
``` $ NO_MANUAL=1 ./dev/make-distribution.sh --r $ R CMD INSTALL R/SparkR_3.5.0-SNAPSHOT.tar.gz $ R > library(SparkR) > install.spark() Spark not found in the cache directory. Installation will start. MirrorUrl not provided. Looking for preferred site from apache website... Preferred mirror site found: https://dlcdn.apache.org/spark Downloading spark-3.5.0 for Hadoop 3 from: - https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz trying URL 'https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz' simpleWarning in download.file(remotePath, localPath): downloaded length 0 != reported length 196 ``` Closes #40031 from dongjoon-hyun/SPARK-42445. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit b47d29a5299620bf4a87e33bb2de4db81a572edf) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 15 February 2023, 17:31:25 UTC
a46278f [SPARK-42440][CONNECT] Initial set of Dataframe APIs for Scala Client ### What changes were proposed in this pull request? Add a lot of the existing Dataframe APIs to the Spark Connect Scala Client. This PR does not contain: - Typed APIs - Aggregation - Streaming (not supported by connect just yet) - NA/Stats functions - TempView registration. ### Why are the changes needed? We want the Scala Client Dataset to reach parity with the existing Dataset. ### How was this patch tested? Added a lot of golden tests. Added a number of test cases to the E2E suite for the functionality that requires server interaction. Closes #40019 from hvanhovell/SPARK-42440. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 9843c7c9a259250f94678e0502fa21f58c80d27c) Signed-off-by: Herman van Hovell <herman@databricks.com> 15 February 2023, 14:54:19 UTC
ac0667b [MINOR][PYTHON][DOCS] Add `applyInPandasWithState` to API references ### What changes were proposed in this pull request? Add `applyInPandasWithState` to the API references ### Why are the changes needed? It is missing from the docs. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? CI Closes #40037 from zhengruifeng/py_doc_applyInPandasWithState. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit fd490d5946c66e19e3bbd17a72feff46bf4efc35) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 February 2023, 14:08:24 UTC
d351569 [SPARK-42405][SQL] Improve array insert documentation ### What changes were proposed in this pull request? Part of cleanup from existing PR https://github.com/apache/spark/pull/38867 - documentation on the scala class ArrayInsert should match the python array_insert function. See comment here: https://github.com/apache/spark/pull/38867#discussion_r1097054656. ### Why are the changes needed? See https://github.com/apache/spark/pull/38867#discussion_r1097054656. ### Does this PR introduce _any_ user-facing change? Yes- better documentation of the array_insert function ### How was this patch tested? Not applicable/ standard unit testing. Closes #39975 from Daniel-Davies/ddavies/SPARK-42405. Authored-by: Daniel Davies <ddavies@palantir.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit a14c6bb2710cb7d43538e9754ca536f0269eb3c4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 15 February 2023, 13:00:38 UTC
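A short PySpark sketch of the function whose documentation is being aligned; positions are 1-based and the sample data is arbitrary.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(["a", "b", "c"],)], ["arr"])

# Insert "x" at 1-based position 2, producing ["a", "x", "b", "c"].
df.select(F.array_insert(F.col("arr"), 2, F.lit("x")).alias("out")).show()
```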
ca49136 [SPARK-42436][SQL] Improve multiTransform to generate alternatives dynamically ### What changes were proposed in this pull request? This PR improves `TreeNode.multiTransform()` to generate the alternative sequences only if needed and fully dynamically. Consider the following simplified example: ``` (a + b).multiTransform { case a => Stream(1, 2) case b => Stream(10, 20) } ``` the result is the cartesian product: `Stream(1 + 10, 2 + 10, 1 + 20, 2 + 20)`. Currently `multiTransform` calculates the 2 alternative streams for `a` and `b` **before** start building building the cartesian product stream using `+`. So kind of caches the "inner" `Stream(1, 2)` in the beginning and when the "outer" stream (`Stream(10, 20)`) iterates from `10` to `20` reuses the cache. Although this caching is sometimes useful it has 2 drawbacks: - If the "outer" (`b` alternatives) stream returns `Seq.emtpy` (to indicate pruning) the alternatives for the `a` are unecessary calculated and will be discarded. - The "inner" stream transformation can't depend on the current "outer" stream alternative. E.g. let's see the above `a + b` example but we want to transform both `a` and `b` to `1` and `2`, and we want to have only those alternatives where these 2 are transformed equal (`Stream(1 + 1, 2 + 2)`). This is currently it is not possible with a single `multiTransform` call due to the inner stream alternatives are calculated in advance and cached. But, if `multiTransform` would be dynamic and the "inner" alternatives stream would be recalculated when the "outer" alternatives stream iterates then this would be possible: ``` // Cache var a_or_b = None (a + b).multiTransform { case a | b => // Return alternatives from cache if this is not the first encounter a_or_b.getOrElse( // Besides returning the alternatives for the first encounter, also set up a mechanism to // update the cache when the new alternatives are requested. Stream(Literal(1), Literal(2)).map { x => a_or_b = Some(Seq(x)) x }.append { a_or_b = None Seq.empty }) } ``` Please note: - that this is a simplified example and we could have run 2 simple `transforms` to get the exprected 2 expressions, but `multiTransform` can do other orthogonal transformations in the same run (e.g. `c` -> `Seq(100, 200)`) and `multiTransform` has the advantage of returning the results lazlily as a stream. - the original behaviour of caching "inner" alternative streams is still doable and actually our current usecases in `AliasAwareOutputExpression` and in `BroadcastHashJoinExec` still do it as they store the alternatives in advance in maps and the `multiTransform` call just gets the alternatives from those maps when needed. ### Why are the changes needed? Improvement to make `multiTransform` more versatile. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added new UTs. Closes #40016 from peter-toth/SPARK-42436-multitransform-generate-alternatives-dynamically. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 403e3d219fd8771a3cab3f4f58331896ebe16747) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 15 February 2023, 12:57:14 UTC
ad183e7 [SPARK-42442][SQL] Use spark.sql.timestampType for data source inference With the configuration `spark.sql.timestampType`, TIMESTAMP in Spark is a user-specified alias associated with one of the TIMESTAMP_LTZ and TIMESTAMP_NTZ variations. This is quite complicated to Spark users. There is another option `spark.sql.sources.timestampNTZTypeInference.enabled` for schema inference. I would like to introduce it in https://github.com/apache/spark/pull/40005 but having two flags seems too much. After thoughts, I decide to merge `spark.sql.sources.timestampNTZTypeInference.enabled` into `spark.sql.timestampType` and let `spark.sql.timestampType` control the schema inference behavior. We can have followups to add data source options "inferTimestampNTZType" for CSV/JSON/partiton column like JDBC data source did. Make the new feature simpler. No, the feature is not released yet. Existing UT I also tried ``` git grep spark.sql.sources.timestampNTZTypeInference.enabled git grep INFER_TIMESTAMP_NTZ_IN_DATA_SOURCES ``` to make sure the flag INFER_TIMESTAMP_NTZ_IN_DATA_SOURCES is removed. Closes #40022 from gengliangwang/unifyInference. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 46226c2f14db185e95e8f83783a70fa86741b2eb) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 February 2023, 11:40:05 UTC
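A minimal sketch of the single flag that now also drives data source schema inference; the CSV path is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# TIMESTAMP becomes an alias for TIMESTAMP_NTZ, and schema inference in file
# sources follows the same setting instead of a separate flag.
spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")

df = spark.read.option("inferSchema", True).csv("/tmp/ts.csv")  # hypothetical path
df.printSchema()
```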
9e85162 [SPARK-42401][SQL][FOLLOWUP] Always set `containsNull=true` for `array_insert` ### What changes were proposed in this pull request? Always set `containsNull=true` in the data type returned by `ArrayInsert#dataType`. ### Why are the changes needed? PR #39970 fixed an issue where the data type for `array_insert` did not always have `containsNull=true` when the user was explicitly inserting a nullable value into the array. However, that fix does not handle the case where `array_insert` implicitly inserts null values into the array (e.g., when the insertion position is out-of-range): ``` spark-sql> select array_insert(array('1', '2', '3', '4'), -6, '5'); 23/02/14 16:10:19 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) ``` Because we can't know at planning time whether the insertion position will be out of range, we should always set `containsNull=true` on the data type for `array_insert`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit tests. Closes #40026 from bersprockets/array_insert_null_anytime. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit fb5b44f46b35562330e5e89133a0bca8e0bee36b) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 February 2023, 11:38:36 UTC
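A minimal sketch of the rule the follow-up above applies, expressed with the public `ArrayType` from `org.apache.spark.sql.types`; this is illustrative and not the actual `ArrayInsert` code. Because an out-of-range position pads the result with nulls at runtime, the declared result type always allows nulls.

```
import org.apache.spark.sql.types.ArrayType

// Illustrative only: whatever the input's nullability, the declared result type of
// array_insert must admit nulls, since out-of-range positions are padded with null.
def arrayInsertResultType(inputType: ArrayType): ArrayType =
  inputType.copy(containsNull = true)
```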
ef821b9 [SPARK-42002][CONNECT][FOLLOWUP] Remove unused imports ### What changes were proposed in this pull request? Remove unused imports ### Why are the changes needed? clean up ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing UT Closes #40030 from zhengruifeng/connect_42002_followup. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 7eb1d2ed4c0bbf6acce1cc72a999f9f66b6e35f1) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 February 2023, 11:37:25 UTC
559dee3 [SPARK-42446][DOCS][PYTHON] Updating PySpark documentation to enhance usability ### What changes were proposed in this pull request? Updates to the PySpark documentation web site: - Fixing typo on the Getting Started page (Version => Versions) - Capitalizing "In/Out" in the DataFrame Quick Start notebook - Adding "(Legacy)" to the Spark Streaming heading on the Spark Streaming page - Reorganizing the User Guide page to list PySpark guides first, minor language updates, and removing links to legacy streaming and RDD programming guides to not promote these as prominently and focus on the recommended APIs ### Why are the changes needed? To improve usability of the PySpark doc website by adding guidance (calling out legacy APIs), fixing a few language issues, and making PySpark content more prominent. ### Does this PR introduce _any_ user-facing change? Yes, the user facing PySpark documentation is updated. ### How was this patch tested? Built and manually reviewed/tested the PySpark documentation web site locally. Closes #40032 from allanf-db/pyspark_docs. Authored-by: Allan Folting <allan.folting@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 00dc3d8533e518cea1bdd3cf9439e4ef0a14d600) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 February 2023, 11:36:21 UTC
835f9b7 [SPARK-42431][CONNECT][FOLLOWUP] Use `Distinct` to delay analysis for `Union` ### What changes were proposed in this pull request? use `Distinct` instead of `Deduplicate` ### Why are the changes needed? to delay analysis, see https://github.com/apache/spark/pull/40008#discussion_r1106529796 ### Does this PR introduce _any_ user-facing change? plan shown in `explain` may change ### How was this patch tested? updated UT Closes #40029 from zhengruifeng/connect_analyze_distinct. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 28153bed56a198882fbfc9068b6bba6fe3be338a) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 February 2023, 11:31:39 UTC
d18ee4d [SPARK-42427][SQL] ANSI MODE: Conv should return an error if the internal conversion overflows ### What changes were proposed in this pull request? In ANSI SQL mode, function Conv() should return an error if the internal conversion overflows For example, before the change: ``` > select conv('fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff', 16, 10) 18446744073709551615 ``` After the change ``` > select conv('fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff', 16, 10) org.apache.spark.SparkArithmeticException: [ARITHMETIC_OVERFLOW] Overflow in function conv(). If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. == SQL(line 1, position 8) == select conv('fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff', 16, 10) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ``` ### Why are the changes needed? Similar to the other SQL functions, this PR shows the overflow errors of `conv()` to users under ANSI SQL mode, instead of returning an unexpected number. ### Does this PR introduce _any_ user-facing change? Yes, function `conv()` will return an error if the internal conversion overflows ### How was this patch tested? UTs Closes #40001 from gengliangwang/fixConv. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit cb463fb40e8f663b7e3019c8d8560a3490c241d0) Signed-off-by: Gengliang Wang <gengliang@apache.org> 15 February 2023, 05:57:24 UTC
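A rough sketch of the kind of unsigned-overflow check described above, in plain Scala; the function name, parameters, and the plain `ArithmeticException` are illustrative and do not mirror the internal `NumberConverter` code.

```
// Accumulate one digit into an unsigned 64-bit value, detecting overflow.
// Overflow iff current * radix + digit > 2^64 - 1, i.e. current > (2^64 - 1 - digit) / radix.
def accumulateDigit(current: Long, digit: Int, radix: Int, ansiEnabled: Boolean): Long = {
  val limit = java.lang.Long.divideUnsigned(-1L - digit, radix)
  if (java.lang.Long.compareUnsigned(current, limit) > 0) {
    if (ansiEnabled) {
      throw new ArithmeticException("[ARITHMETIC_OVERFLOW] Overflow in function conv()")
    } else {
      -1L // interpreted as the unsigned max 18446744073709551615, the pre-ANSI result
    }
  } else {
    current * radix + digit
  }
}
```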
cb0f9fe [SPARK-42342][PYTHON][CONNECT][TEST] Fix FunctionsParityTests.test_raise_error to call the proper test ### What changes were proposed in this pull request? This is a follow-up of #39882. Fixes `FunctionsParityTests.test_raise_error` to call the proper test. ### Why are the changes needed? `FunctionsParityTests.test_raise_error` should've called `check_raise_error`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The fixed test. Closes #40021 from ueshin/issues/SPARK-42342/test. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 3ed1b9513867e4d5bcf048f3042a91ffd3524771) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 February 2023, 00:23:01 UTC
f70e274 [SPARK-41775][PYTHON][FOLLOW-UP] Updating docs for readability ### What changes were proposed in this pull request? Added some changes to the docstrings to make the docs more readable. <img width="752" alt="image" src="https://user-images.githubusercontent.com/81988348/218838122-c28fcba7-dff8-498d-9be2-57b8efc498cd.png"> <img width="725" alt="image" src="https://user-images.githubusercontent.com/81988348/218882816-729508fb-afc7-49b5-8ff8-92d42390fa2d.png"> ### Why are the changes needed? Just for readability's sake. ### Does this PR introduce _any_ user-facing change? No? ### How was this patch tested? Not Needed. Closes #40020 from rithwik-db/docs-update. Authored-by: Rithwik Ediga Lakhamsani <rithwik.ediga@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit ebf419a31db2f2730ae14cf1ac8573de8c552d7f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 February 2023, 00:22:23 UTC
8f6008a [SPARK-42406][PROTOBUF] Fix recursive depth setting for Protobuf functions ### What changes were proposed in this pull request? This fixes how the setting for limiting recursive depth is handled in Protobuf functions (`recursive.fields.max.depth`). The original documentation says that setting it to '0' removes the recursive field, but we never did that; we always allow at least one level. E.g. the schema for the recursive message 'EventPerson' does not change between the settings '0' and '1'. This fixes it by requiring the max depth to be at least 1. It also fixes how the recursion is enforced. Updated the tests and added an extra test with the new protobuf 'EventPersonWrapper'. I will annotate the diff inline pointing to the main fixes. ### Why are the changes needed? This fixes a bug with enforcing `recursive.fields.max.depth` and clarifies the documentation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Unit tests Closes #40011 from rangadi/recursive-depth. Authored-by: Raghu Angadi <raghu.angadi@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 96516c8e3129929518ae4c9877983868e75dc4a4) Signed-off-by: Gengliang Wang <gengliang@apache.org> 14 February 2023, 23:12:28 UTC
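A hypothetical usage of the option discussed above, sketched with the Scala `from_protobuf` API; `df`, the `payload` column, the message name, and the descriptor path are placeholders. With this fix, `recursive.fields.max.depth` must be at least 1.

```
import scala.collection.JavaConverters._
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.protobuf.functions.from_protobuf

// df is assumed to be a DataFrame with a binary protobuf column named "payload".
val options = Map("recursive.fields.max.depth" -> "2").asJava
val parsed = df.select(
  from_protobuf(col("payload"), "EventPerson", "/path/to/descriptors.desc", options)
    .as("person"))
```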
ca0346b [SPARK-42430][SQL][DOC] Add documentation for TimestampNTZ type ### What changes were proposed in this pull request? Add documentation for TimestampNTZ type ### Why are the changes needed? Add documentation for the new data type TimestampNTZ. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Build docs and preview: <img width="782" alt="image" src="https://user-images.githubusercontent.com/1097932/218656254-096df429-851d-4046-8a6f-f368819c405b.png"> <img width="777" alt="image" src="https://user-images.githubusercontent.com/1097932/218656277-e8cfe747-2c45-476d-b70f-83c654e0b0f2.png"> Closes #40005 from gengliangwang/ntzDoc. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 46a234125d3f125ba1f9ccd6af0ec1ba61016c1e) Signed-off-by: Gengliang Wang <gengliang@apache.org> 14 February 2023, 19:25:09 UTC
25bbb5c [SPARK-42418][DOCS][PYTHON] PySpark documentation updates to improve discoverability and add more guidance ### What changes were proposed in this pull request? Updates to the PySpark documentation web pages that help users choose which API to use when and make it easier to discover relevant content and navigate the documentation pages. This is the first of a series of updates. ### Why are the changes needed? The PySpark documentation web site does not do enough to help users choose which API to use and it is not easy to navigate. ### Does this PR introduce _any_ user-facing change? Yes, the user facing PySpark documentation is updated. ### How was this patch tested? Built and tested the PySpark documentation web site locally. Closes #39992 from allanf-db/pyspark_doc_updates. Authored-by: Allan Folting <allan.folting@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit fa7c13add487984ae84527ecd254b5c396ba9fee) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 14 February 2023, 13:00:18 UTC
4a61ce2 [SPARK-42433][PYTHON][CONNECT] Add `array_insert` to Connect ### What changes were proposed in this pull request? 1, make `array_insert` accept int `pos` and `Any` value 2, add it to connect ### Why are the changes needed? to be consistent with other pyspark functions ### Does this PR introduce _any_ user-facing change? yes ### How was this patch tested? added tests Closes #40010 from zhengruifeng/py_array_insert. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 14cc5a5341ad4f50c041c3a721d5f46586c83fd1) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 14 February 2023, 12:59:13 UTC
5db48ef [SPARK-42431][CONNECT] Avoid calling `LogicalPlan.output` before analysis in `Union` ### What changes were proposed in this pull request? Avoid calling `output` before analysis; avoid applying optimizer rules before analysis; remove the usage of the optimizer rule `CombineUnions`, since it may discard the `PLAN_ID_TAG`. ### Why are the changes needed? It is not expected to call `output` or apply optimizer rules before analysis. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Updated UT Closes #40008 from zhengruifeng/connect_analyze. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit cbcea0e304340cae5b7a822f468388e03dd4d46f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 14 February 2023, 12:58:29 UTC
3d07f5d [SPARK-42434][PYTHON][CONNECT] `array_append` should accept `Any` value ### What changes were proposed in this pull request? `array_append` should accept `Any` value ### Why are the changes needed? Make `array_append` consistent with other array functions in PySpark ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Added UT Closes #40012 from zhengruifeng/py_array_append. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 6e090eba4fd0f4f841177ffeefc3a376242075a2) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 14 February 2023, 12:57:49 UTC
de31a0f [SPARK-42428][CONNECT][PYTHON] Standardize __repr__ of CommonInlineUserDefinedFunction ### What changes were proposed in this pull request? Standardize __repr__ of CommonInlineUserDefinedFunction. ### Why are the changes needed? To reach parity with vanilla PySpark. ### Does this PR introduce _any_ user-facing change? The column representation reaches parity with the vanilla PySpark's now, as shown below. Before ``` >>> udf(lambda x : x + 1)(df.id) Column<'<lambda>(id), True, "string", 100, b'\x80\x05\x95\xe1\x01\x00\x00\x00\x00\x00\x00\x8c\x1fpyspark.cloudpickle.cloudpickle\x94\x8c\x0e_make_function\x94\x93\x94(h\x00\x8c\r_builtin_type\x94\x93\x94\x8c\x08CodeType\x94\x85\x94R\x94(K\x01K\x00K\x00K\x01K\x02KCC\x08|\x00d\x01\x17\x00S\x00\x94NK\x01\x86\x94)\x8c\x01x\x94\x85\x94\x8c\x07<stdin>\x94\x8c\x08<lambda>\x94K\x01C\x00\x94))t\x94R\x94}\x94(\x8c\x0b__package__\x94N\x8c\x08__name__\x94\x8c\x08__main__\x94uNNNt\x94R\x94\x8c$pyspark.cloudpickle.cloudpickle_fast\x94\x8c\x12_function_setstate\x94\x93\x94h\x16}\x94}\x94(h\x13h\r\x8c\x0c__qualname__\x94h\r\x8c\x0f__annotations__\x94}\x94\x8c\x0e__kwdefaults__\x94N\x8c\x0c__defaults__\x94N\x8c\n__module__\x94h\x14\x8c\x07__doc__\x94N\x8c\x0b__closure__\x94N\x8c\x17_cloudpickle_submodules\x94]\x94\x8c\x0b__globals__\x94}\x94u\x86\x94\x86R0\x8c\x11pyspark.sql.types\x94\x8c\nStringType\x94\x93\x94)\x81\x94\x86\x94.', f3.9'> ``` Now ``` >>> udf(lambda x : x + 1)(df.id) Column<'<lambda>(id)'> ``` ### How was this patch tested? Existing tests. Closes #40003 from xinrong-meng/udf_repr. Authored-by: Xinrong Meng <xinrong@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit 676332a1f4d6a522f009fa91e593904588cfe9f2) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 14 February 2023, 09:05:33 UTC
7f1b6fe [SPARK-42429][BUILD] Fix an sbt compile error for `getArgument` when using IntelliJ ### What changes were proposed in this pull request? When building with IntelliJ, I hit the following error from time to time: ``` spark/sql/core/src/test/scala/org/apache/spark/sql/execution/command/PlanResolutionSuite.scala:149:18 value getArgument is not a member of org.mockito.invocation.InvocationOnMock invocation.getArgument[Identifier](0).name match { ``` Sometimes the error can be solved by rebuilding with sbt or maven, but the best option might be to simply avoid using the method that causes this compilation error. ### Why are the changes needed? Make life easier for IDE users. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual and existing tests Closes #40004 from zhenlineo/fix-ide-old-errors. Authored-by: Zhen Li <zhenlineo@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 23dd2c97ab00f13dfc855045047e4902112db793) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 14 February 2023, 07:56:06 UTC
c7066b8 [SPARK-41818][SPARK-42000][CONNECT][FOLLOWUP] Fix leaked test case ### What changes were proposed in this pull request? After https://github.com/apache/spark/pull/40000, the connect framework made `format` optional but missed updating some test cases. ### Why are the changes needed? Recover CI <img width="1052" alt="image" src="https://user-images.githubusercontent.com/12025282/218662255-fd8de2f9-f64e-43c6-8e19-0f2268882b44.png"> ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass CI Closes #40006 from ulysses-you/followup. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit a5cf0189a9fcf54a57aed1305b09523784166c0c) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 14 February 2023, 07:38:38 UTC
ea0f50b [SPARK-42263][CONNECT][PYTHON] Implement `spark.catalog.registerFunction` ### What changes were proposed in this pull request? Implement `spark.catalog.registerFunction`. ### Why are the changes needed? To reach parity with vanilla PySpark. ### Does this PR introduce _any_ user-facing change? Yes. `spark.catalog.registerFunction` is supported, as shown below. ```py >>> udf ... def f(): ... return 'hi' ... >>> spark.catalog.registerFunction('HI', f) <function f at 0x7fcdd8341dc0> >>> spark.sql("SELECT HI()").collect() [Row(HI()='hi')] ``` ### How was this patch tested? Unit tests. Closes #39984 from xinrong-meng/catalog_register. Authored-by: Xinrong Meng <xinrong@apache.org> Signed-off-by: Xinrong Meng <xinrong@apache.org> (cherry picked from commit c619a402451df9ae5b305e5a48eb244c9ffd2eb6) Signed-off-by: Xinrong Meng <xinrong@apache.org> 14 February 2023, 05:56:38 UTC
4f29e2e [SPARK-42324][SQL] Assign name to _LEGACY_ERROR_TEMP_1001 ### What changes were proposed in this pull request? This PR proposes to assign name to _LEGACY_ERROR_TEMP_1001, "UNRESOLVED_USING_COLUMN_FOR_JOIN". ### Why are the changes needed? We should assign proper name to _LEGACY_ERROR_TEMP_* ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? `./build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite*` Closes #39963 from itholic/LEGACY_1001. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 57ae4f49702d200585c56e3f000cca1d729f6f64) Signed-off-by: Max Gekk <max.gekk@gmail.com> 14 February 2023, 05:22:47 UTC
c176000 [SPARK-41999][CONNECT][PYTHON] Fix bucketBy/sortBy to properly use the first column name ### What changes were proposed in this pull request? Fixes `DataFrameWriter.bucketBy` and `sortBy` to properly use the first column name. ### Why are the changes needed? Currently `DataFrameWriter.bucketBy` and `sortBy` mistakenly drop the first column name, which ends up with a `NoSuchElementException`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enabled related tests. Closes #40002 from ueshin/issues/SPARK-41999/bucketBy. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 18aa6e381bb6440ee5bc655edfd9aeddcedd30f8) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 14 February 2023, 03:55:23 UTC
0469479 [SPARK-41818][SPARK-42000][CONNECT] Fix saveAsTable to find the default source ### What changes were proposed in this pull request? Fixes `DataFrameWriter.saveAsTable` to find the default source. ### Why are the changes needed? Currently `DataFrameWriter.saveAsTable` fails when `format` is not specified because protobuf defines `source` as required and it will be an empty string instead of `null`, then `DataFrameWriter` tries to find the data source `""`. The `source` field should be optional to let Spark decide the default source. ### Does this PR introduce _any_ user-facing change? Users can call `saveAsTable` without `format`. ### How was this patch tested? Enabled related tests. Closes #40000 from ueshin/issues/SPARK-42000/saveAsTable. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c1f242eb4c514d7fba8e0c47c96e40cba82a39ad) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 14 February 2023, 00:06:03 UTC
76f9594 [SPARK-41775][PYTHON][FOLLOW-UP] Updating error message for training using PyTorch functions ### What changes were proposed in this pull request? Replaced an uninsightful `FileNotFoundError` with a better `RuntimeError` exception when training fails. ### Why are the changes needed? Improve user experience. ### Does this PR introduce _any_ user-facing change? Just the message that is shown to the user. ### How was this patch tested? N/A Closes #39987 from rithwik-db/error-bug-fix. Authored-by: Rithwik Ediga Lakhamsani <rithwik.ediga@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 889889e993ad1621f73440ff287dfd0a54c0ea4f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 13 February 2023, 19:03:10 UTC
83f8ddd [SPARK-42416][SQL] Dataset operations should not resolve the analyzed logical plan again ### What changes were proposed in this pull request? For the following query: ``` sql( """ |CREATE TABLE app ( | uid STRING, | st TIMESTAMP, | ds INT |) USING parquet PARTITIONED BY (ds); |""".stripMargin) sql( """ |create or replace temporary view view1 as WITH new_app AS ( | SELECT a.* FROM app a) |SELECT | uid, | 20230208 AS ds | FROM | new_app | GROUP BY | 1, | 2 |""".stripMargin) val df = sql("select uid from view1") df.show() ``` Spark will throw the following error: ``` [GROUP_BY_POS_OUT_OF_RANGE] GROUP BY position 20230208 is not in select list (valid range is [1, 2]).; line 9 pos 4 ``` This is because the logical plan in `df` is not set as analyzed after changes in https://github.com/apache/spark/commit/6adda258e5155761a861a96af4f5410b8a7f304d#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR126, which sets the result of `inlineCTE(plan)` as analyzed instead. Resolving ordinals is not idempotent. In `df`, the `GROUP BY 1, 2` is resolved as `GROUP BY uid, 20230208`. `Dataset.show()` will resolve the logical plan in `df` again and treat the 20230208 in `GROUP BY uid, 20230208` as an ordinal again. Thus the error happens. ### Why are the changes needed? Bug fix ### Does this PR introduce _any_ user-facing change? No, the regression is not released yet. ### How was this patch tested? New UT Closes #39988 from gengliangwang/group_by_error. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 5267ad82d992ad80dccda8f59d99d84c3e85a32c) Signed-off-by: Gengliang Wang <gengliang@apache.org> 13 February 2023, 18:58:05 UTC
8baa84f [SPARK-42327][SQL] Assign name to `_LEGACY_ERROR_TEMP_2177` ### What changes were proposed in this pull request? This PR proposes to assign the name `MALFORMED_RECORD_IN_PARSING` to `_LEGACY_ERROR_TEMP_2177` and improve the error message. ### Why are the changes needed? We should assign proper name to LEGACY errors, and show actionable error messages. ### Does this PR introduce _any_ user-facing change? No, but error message improvements. ### How was this patch tested? Updated UTs. Closes #39980 from itholic/LEGACY_2177. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 6f47ac43e55c332f63876cf4f8ecf1b41b277651) Signed-off-by: Max Gekk <max.gekk@gmail.com> 13 February 2023, 16:34:07 UTC
804c7f1 [SPARK-42420][CORE] Register WriteTaskResult, BasicWriteTaskStats, and ExecutedWriteSummary to KryoSerializer ### What changes were proposed in this pull request? This PR aims to register `WriteTaskResult`, `BasicWriteTaskStats`, and `ExecutedWriteSummary` to `KryoSerializer`. ### Why are the changes needed? This is used during the writing operation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #39993 from williamhyun/kryowrite. Authored-by: William Hyun <william@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit de0052ded40617f817bd71464c73ff7ee516b115) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 13 February 2023, 15:32:34 UTC
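For context, this is roughly what users had to do themselves before these classes were registered by default; a sketch using the standard Kryo configuration keys, where the fully-qualified class names are assumptions about the package paths and may need adjusting.

```
import org.apache.spark.SparkConf

// Illustrative only: register the write-path result classes explicitly.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.classesToRegister",
    "org.apache.spark.sql.execution.datasources.WriteTaskResult," +
    "org.apache.spark.sql.execution.datasources.BasicWriteTaskStats," +
    "org.apache.spark.sql.execution.datasources.ExecutedWriteSummary")
```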
4a73f25 [SPARK-42422][BUILD] Upgrade `maven-shade-plugin` to 3.4.1 ### What changes were proposed in this pull request? This pr aims upgrade `maven-shade-plugin` from 3.2.4 to 3.4.1 ### Why are the changes needed? The `maven-shade-plugin` was [built by Java 8](https://github.com/apache/maven-shade-plugin/commit/33273411d30377773bc866bba46ec5f2fc39e60b) from 3.4.1, all other changes as follows: - https://github.com/apache/maven-shade-plugin/releases/tag/maven-shade-plugin-3.3.0 - https://github.com/apache/maven-shade-plugin/compare/maven-shade-plugin-3.3.0...maven-shade-plugin-3.4.1 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GitHub Actions - Manual check: There are 6 modules actually use shade function, checked the maven compilation logs manually: 1. spark-core Before ``` [INFO] --- maven-shade-plugin:3.2.4:shade (default) spark-core_2.12 --- [INFO] Including org.eclipse.jetty:jetty-plus:jar:9.4.50.v20221201 in the shaded jar. [INFO] Including org.eclipse.jetty:jetty-security:jar:9.4.50.v20221201 in the shaded jar. [INFO] Including org.eclipse.jetty:jetty-util:jar:9.4.50.v20221201 in the shaded jar. [INFO] Including org.eclipse.jetty:jetty-server:jar:9.4.50.v20221201 in the shaded jar. [INFO] Including org.eclipse.jetty:jetty-io:jar:9.4.50.v20221201 in the shaded jar. [INFO] Including org.eclipse.jetty:jetty-http:jar:9.4.50.v20221201 in the shaded jar. [INFO] Including org.eclipse.jetty:jetty-continuation:jar:9.4.50.v20221201 in the shaded jar. [INFO] Including org.eclipse.jetty:jetty-servlet:jar:9.4.50.v20221201 in the shaded jar. [INFO] Including org.eclipse.jetty:jetty-proxy:jar:9.4.50.v20221201 in the shaded jar. [INFO] Including org.eclipse.jetty:jetty-client:jar:9.4.50.v20221201 in the shaded jar. [INFO] Including org.eclipse.jetty:jetty-servlets:jar:9.4.50.v20221201 in the shaded jar. [INFO] Including com.google.protobuf:protobuf-java:jar:3.21.12 in the shaded jar. [INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded jar. ``` After ``` [INFO] --- maven-shade-plugin:3.4.1:shade (default) spark-core_2.12 --- [INFO] Including org.eclipse.jetty:jetty-plus:jar:9.4.50.v20221201 in the shaded jar. [INFO] Including org.eclipse.jetty:jetty-security:jar:9.4.50.v20221201 in the shaded jar. [INFO] Including org.eclipse.jetty:jetty-util:jar:9.4.50.v20221201 in the shaded jar. [INFO] Including org.eclipse.jetty:jetty-server:jar:9.4.50.v20221201 in the shaded jar. [INFO] Including org.eclipse.jetty:jetty-io:jar:9.4.50.v20221201 in the shaded jar. [INFO] Including org.eclipse.jetty:jetty-http:jar:9.4.50.v20221201 in the shaded jar. [INFO] Including org.eclipse.jetty:jetty-continuation:jar:9.4.50.v20221201 in the shaded jar. [INFO] Including org.eclipse.jetty:jetty-servlet:jar:9.4.50.v20221201 in the shaded jar. [INFO] Including org.eclipse.jetty:jetty-proxy:jar:9.4.50.v20221201 in the shaded jar. [INFO] Including org.eclipse.jetty:jetty-client:jar:9.4.50.v20221201 in the shaded jar. [INFO] Including org.eclipse.jetty:jetty-servlets:jar:9.4.50.v20221201 in the shaded jar. [INFO] Including com.google.protobuf:protobuf-java:jar:3.21.12 in the shaded jar. [INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded jar. ``` 2. spark-network-yarn Before ``` [INFO] --- maven-shade-plugin:3.2.4:shade (default) spark-network-yarn_2.12 --- [INFO] Including org.apache.spark:spark-network-shuffle_2.12:jar:3.5.0-SNAPSHOT in the shaded jar. 
[INFO] Including org.apache.spark:spark-network-common_2.12:jar:3.5.0-SNAPSHOT in the shaded jar. [INFO] Including io.netty:netty-all:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-buffer:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-codec:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-codec-http:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-codec-http2:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-codec-socks:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-common:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-handler:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-transport-native-unix-common:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-handler-proxy:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-resolver:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-transport:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-transport-classes-epoll:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-transport-classes-kqueue:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-transport-native-epoll:jar:linux-x86_64:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-transport-native-epoll:jar:linux-aarch_64:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-transport-native-kqueue:jar:osx-aarch_64:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-transport-native-kqueue:jar:osx-x86_64:4.1.87.Final in the shaded jar. [INFO] Including org.apache.commons:commons-lang3:jar:3.12.0 in the shaded jar. [INFO] Including org.fusesource.leveldbjni:leveldbjni-all:jar:1.8 in the shaded jar. [INFO] Including org.rocksdb:rocksdbjni:jar:7.9.2 in the shaded jar. [INFO] Including com.fasterxml.jackson.core:jackson-databind:jar:2.14.2 in the shaded jar. [INFO] Including com.fasterxml.jackson.core:jackson-core:jar:2.14.2 in the shaded jar. [INFO] Including com.fasterxml.jackson.core:jackson-annotations:jar:2.14.2 in the shaded jar. [INFO] Including org.apache.commons:commons-crypto:jar:1.1.0 in the shaded jar. [INFO] Including com.google.crypto.tink:tink:jar:1.7.0 in the shaded jar. [INFO] Including com.google.code.gson:gson:jar:2.8.9 in the shaded jar. [INFO] Including io.dropwizard.metrics:metrics-core:jar:4.2.15 in the shaded jar. [INFO] Including org.roaringbitmap:RoaringBitmap:jar:0.9.39 in the shaded jar. [INFO] Including org.roaringbitmap:shims:jar:0.9.39 in the shaded jar. [INFO] Including com.google.code.findbugs:jsr305:jar:3.0.0 in the shaded jar. [INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded jar. ``` After ``` [INFO] --- maven-shade-plugin:3.4.1:shade (default) spark-network-yarn_2.12 --- [INFO] Including org.apache.spark:spark-network-shuffle_2.12:jar:3.5.0-SNAPSHOT in the shaded jar. [INFO] Including org.apache.spark:spark-network-common_2.12:jar:3.5.0-SNAPSHOT in the shaded jar. [INFO] Including io.netty:netty-all:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-buffer:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-codec:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-codec-http:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-codec-http2:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-codec-socks:jar:4.1.87.Final in the shaded jar. 
[INFO] Including io.netty:netty-common:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-handler:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-transport-native-unix-common:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-handler-proxy:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-resolver:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-transport:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-transport-classes-epoll:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-transport-classes-kqueue:jar:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-transport-native-epoll:jar:linux-x86_64:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-transport-native-epoll:jar:linux-aarch_64:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-transport-native-kqueue:jar:osx-aarch_64:4.1.87.Final in the shaded jar. [INFO] Including io.netty:netty-transport-native-kqueue:jar:osx-x86_64:4.1.87.Final in the shaded jar. [INFO] Including org.apache.commons:commons-lang3:jar:3.12.0 in the shaded jar. [INFO] Including org.fusesource.leveldbjni:leveldbjni-all:jar:1.8 in the shaded jar. [INFO] Including org.rocksdb:rocksdbjni:jar:7.9.2 in the shaded jar. [INFO] Including com.fasterxml.jackson.core:jackson-databind:jar:2.14.2 in the shaded jar. [INFO] Including com.fasterxml.jackson.core:jackson-core:jar:2.14.2 in the shaded jar. [INFO] Including com.fasterxml.jackson.core:jackson-annotations:jar:2.14.2 in the shaded jar. [INFO] Including org.apache.commons:commons-crypto:jar:1.1.0 in the shaded jar. [INFO] Including com.google.crypto.tink:tink:jar:1.7.0 in the shaded jar. [INFO] Including com.google.code.gson:gson:jar:2.8.9 in the shaded jar. [INFO] Including io.dropwizard.metrics:metrics-core:jar:4.2.15 in the shaded jar. [INFO] Including org.roaringbitmap:RoaringBitmap:jar:0.9.39 in the shaded jar. [INFO] Including org.roaringbitmap:shims:jar:0.9.39 in the shaded jar. [INFO] Including com.google.code.findbugs:jsr305:jar:3.0.0 in the shaded jar. [INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded jar. ``` 3. spark-protobuf Before ``` [INFO] --- maven-shade-plugin:3.2.4:shade (default) spark-protobuf_2.12 --- [INFO] Including com.google.protobuf:protobuf-java:jar:3.21.12 in the shaded jar. ``` After ``` [INFO] --- maven-shade-plugin:3.4.1:shade (default) spark-protobuf_2.12 --- [INFO] Including com.google.protobuf:protobuf-java:jar:3.21.12 in the shaded jar. ``` 4. spark-connect-common Before ``` [INFO] --- maven-shade-plugin:3.2.4:shade (default) spark-connect-common_2.12 --- [INFO] Including com.google.guava:guava:jar:31.0.1-jre in the shaded jar. [INFO] Including com.google.guava:listenablefuture:jar:9999.0-empty-to-avoid-conflict-with-guava in the shaded jar. [INFO] Including com.google.guava:failureaccess:jar:1.0.1 in the shaded jar. ``` After ``` [INFO] --- maven-shade-plugin:3.4.1:shade (default) spark-connect-common_2.12 --- [INFO] Including com.google.guava:guava:jar:31.0.1-jre in the shaded jar. [INFO] Including com.google.guava:listenablefuture:jar:9999.0-empty-to-avoid-conflict-with-guava in the shaded jar. [INFO] Including com.google.guava:failureaccess:jar:1.0.1 in the shaded jar. ``` 5. spark-connect Before ``` [INFO] --- maven-shade-plugin:3.2.4:shade (default) spark-connect_2.12 --- [INFO] Including org.apache.spark:spark-connect-common_2.12:jar:3.5.0-SNAPSHOT in the shaded jar. 
[INFO] Including com.google.guava:guava:jar:31.0.1-jre in the shaded jar. [INFO] Including com.google.guava:listenablefuture:jar:9999.0-empty-to-avoid-conflict-with-guava in the shaded jar. [INFO] Including org.checkerframework:checker-qual:jar:3.12.0 in the shaded jar. [INFO] Including com.google.errorprone:error_prone_annotations:jar:2.7.1 in the shaded jar. [INFO] Including com.google.j2objc:j2objc-annotations:jar:1.3 in the shaded jar. [INFO] Including com.google.guava:failureaccess:jar:1.0.1 in the shaded jar. [INFO] Including com.google.protobuf:protobuf-java:jar:3.21.12 in the shaded jar. [INFO] Including io.grpc:grpc-netty:jar:1.47.0 in the shaded jar. [INFO] Including io.grpc:grpc-core:jar:1.47.0 in the shaded jar. [INFO] Including com.google.code.gson:gson:jar:2.9.0 in the shaded jar. [INFO] Including com.google.android:annotations:jar:4.1.1.4 in the shaded jar. [INFO] Including org.codehaus.mojo:animal-sniffer-annotations:jar:1.19 in the shaded jar. [INFO] Including io.perfmark:perfmark-api:jar:0.25.0 in the shaded jar. [INFO] Including io.grpc:grpc-protobuf:jar:1.47.0 in the shaded jar. [INFO] Including io.grpc:grpc-api:jar:1.47.0 in the shaded jar. [INFO] Including io.grpc:grpc-context:jar:1.47.0 in the shaded jar. [INFO] Including com.google.api.grpc:proto-google-common-protos:jar:2.0.1 in the shaded jar. [INFO] Including io.grpc:grpc-protobuf-lite:jar:1.47.0 in the shaded jar. [INFO] Including io.grpc:grpc-services:jar:1.47.0 in the shaded jar. [INFO] Including com.google.protobuf:protobuf-java-util:jar:3.19.2 in the shaded jar. [INFO] Including io.grpc:grpc-stub:jar:1.47.0 in the shaded jar. ``` After ``` [INFO] --- maven-shade-plugin:3.4.1:shade (default) spark-connect_2.12 --- [INFO] Including org.apache.spark:spark-connect-common_2.12:jar:3.5.0-SNAPSHOT in the shaded jar. [INFO] Including com.google.guava:guava:jar:31.0.1-jre in the shaded jar. [INFO] Including com.google.guava:listenablefuture:jar:9999.0-empty-to-avoid-conflict-with-guava in the shaded jar. [INFO] Including org.checkerframework:checker-qual:jar:3.12.0 in the shaded jar. [INFO] Including com.google.errorprone:error_prone_annotations:jar:2.7.1 in the shaded jar. [INFO] Including com.google.j2objc:j2objc-annotations:jar:1.3 in the shaded jar. [INFO] Including com.google.guava:failureaccess:jar:1.0.1 in the shaded jar. [INFO] Including com.google.protobuf:protobuf-java:jar:3.21.12 in the shaded jar. [INFO] Including io.grpc:grpc-netty:jar:1.47.0 in the shaded jar. [INFO] Including io.grpc:grpc-core:jar:1.47.0 in the shaded jar. [INFO] Including com.google.code.gson:gson:jar:2.9.0 in the shaded jar. [INFO] Including com.google.android:annotations:jar:4.1.1.4 in the shaded jar. [INFO] Including org.codehaus.mojo:animal-sniffer-annotations:jar:1.19 in the shaded jar. [INFO] Including io.perfmark:perfmark-api:jar:0.25.0 in the shaded jar. [INFO] Including io.grpc:grpc-protobuf:jar:1.47.0 in the shaded jar. [INFO] Including io.grpc:grpc-api:jar:1.47.0 in the shaded jar. [INFO] Including io.grpc:grpc-context:jar:1.47.0 in the shaded jar. [INFO] Including com.google.api.grpc:proto-google-common-protos:jar:2.0.1 in the shaded jar. [INFO] Including io.grpc:grpc-protobuf-lite:jar:1.47.0 in the shaded jar. [INFO] Including io.grpc:grpc-services:jar:1.47.0 in the shaded jar. [INFO] Including com.google.protobuf:protobuf-java-util:jar:3.19.2 in the shaded jar. [INFO] Including io.grpc:grpc-stub:jar:1.47.0 in the shaded jar. ``` 6. 
spark-connect-client-jvm Before ``` [INFO] --- maven-shade-plugin:3.2.4:shade (default) spark-connect-client-jvm_2.12 --- [INFO] Including org.apache.spark:spark-connect-common_2.12:jar:3.5.0-SNAPSHOT in the shaded jar. [INFO] Including com.google.guava:failureaccess:jar:1.0.1 in the shaded jar. [INFO] Including io.grpc:grpc-netty:jar:1.47.0 in the shaded jar. [INFO] Including io.grpc:grpc-core:jar:1.47.0 in the shaded jar. [INFO] Including io.grpc:grpc-protobuf:jar:1.47.0 in the shaded jar. [INFO] Including io.grpc:grpc-api:jar:1.47.0 in the shaded jar. [INFO] Including io.grpc:grpc-context:jar:1.47.0 in the shaded jar. [INFO] Including io.grpc:grpc-protobuf-lite:jar:1.47.0 in the shaded jar. [INFO] Including io.grpc:grpc-services:jar:1.47.0 in the shaded jar. [INFO] Including com.google.protobuf:protobuf-java-util:jar:3.19.2 in the shaded jar. [INFO] Including io.grpc:grpc-stub:jar:1.47.0 in the shaded jar. [INFO] Including com.google.protobuf:protobuf-java:jar:3.21.12 in the shaded jar. [INFO] Including com.google.guava:guava:jar:31.0.1-jre in the shaded jar. [INFO] Including com.google.guava:listenablefuture:jar:9999.0-empty-to-avoid-conflict-with-guava in the shaded jar. ``` After ``` [INFO] --- maven-shade-plugin:3.4.1:shade (default) spark-connect-client-jvm_2.12 --- [INFO] Including org.apache.spark:spark-connect-common_2.12:jar:3.5.0-SNAPSHOT in the shaded jar. [INFO] Including com.google.guava:failureaccess:jar:1.0.1 in the shaded jar. [INFO] Including io.grpc:grpc-netty:jar:1.47.0 in the shaded jar. [INFO] Including io.grpc:grpc-core:jar:1.47.0 in the shaded jar. [INFO] Including io.grpc:grpc-protobuf:jar:1.47.0 in the shaded jar. [INFO] Including io.grpc:grpc-api:jar:1.47.0 in the shaded jar. [INFO] Including io.grpc:grpc-context:jar:1.47.0 in the shaded jar. [INFO] Including io.grpc:grpc-protobuf-lite:jar:1.47.0 in the shaded jar. [INFO] Including io.grpc:grpc-services:jar:1.47.0 in the shaded jar. [INFO] Including com.google.protobuf:protobuf-java-util:jar:3.19.2 in the shaded jar. [INFO] Including io.grpc:grpc-stub:jar:1.47.0 in the shaded jar. [INFO] Including com.google.protobuf:protobuf-java:jar:3.21.12 in the shaded jar. [INFO] Including com.google.guava:guava:jar:31.0.1-jre in the shaded jar. [INFO] Including com.google.guava:listenablefuture:jar:9999.0-empty-to-avoid-conflict-with-guava in the shaded jar. ``` Also checked the file numbers and file names in jars after decompression, the result is the same. Closes #39994 from LuciferYang/SPARK-42422. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 9b625fdc7ddfb9eb29e0cc48ebbd0693a8f52f23) Signed-off-by: Sean Owen <srowen@gmail.com> 13 February 2023, 14:27:36 UTC
5e92386 [SPARK-42313][SQL] Assign name to `_LEGACY_ERROR_TEMP_1152` ### What changes were proposed in this pull request? This PR proposes to integrate _LEGACY_ERROR_TEMP_1152, "PATH_ALREADY_EXISTS". ### Why are the changes needed? We should assign proper name to _LEGACY_ERROR_TEMP_* ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? `./build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite*` Closes #39953 from itholic/LEGACY_1152. Lead-authored-by: itholic <haejoon.lee@databricks.com> Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 39dbcd7edd3edc9ef68c41d8190e2e9e74f4cedd) Signed-off-by: Max Gekk <max.gekk@gmail.com> 13 February 2023, 11:06:24 UTC
f17549e [SPARK-41812][SPARK-41823][CONNECT][SQL][PYTHON] Resolve ambiguous columns issue in `Join` ### What changes were proposed in this pull request? In Python Client - generate `plan_id` for each proto plan (It's up to the Client to guarantee the uniqueness); - attach `plan_id` to the column created by `DataFrame[col_name]` or `DataFrame.col_name`; - Note that `F.col(col_name)` doesn't have `plan_id`; In Connect Planner: - attach `plan_id` to `UnresolvedAttribute`s and `LogicalPlan `s via `TreeNodeTag` In Analyzer: - for an `UnresolvedAttribute` with `plan_id`, search the matching node in the plan, and resolve it with the found node if possible **Out of scope:** - resolve `self-join` - add a `DetectAmbiguousSelfJoin`-like rule for detection ### Why are the changes needed? Fix bug, before this PR: ``` df1.join(df2, df1["value"] == df2["value"]) <- fail due to can not resolve `value` df1.join(df2, df1["value"] == df2["value"]).select(df1.value) <- fail due to can not resolve `value` df1.select(df2.value) <- should fail, but run as `df1.select(df1.value)` and return the incorrect results ``` ### Does this PR introduce _any_ user-facing change? yes ### How was this patch tested? added tests, enabled tests Closes #39925 from zhengruifeng/connect_plan_id. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit 167bbca49c1c12ccd349d4330862c136b38d4522) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 13 February 2023, 08:22:39 UTC
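A small sketch of the tagging mechanism described above; the tag name and the id value are illustrative. `TreeNodeTag` and `setTagValue` are the existing `TreeNode` APIs used to carry such metadata, while the actual wiring lives in the Connect planner and analyzer.

```
import org.apache.spark.sql.catalyst.trees.TreeNodeTag

// The same id is attached to the logical plan and to every unresolved attribute the
// client created from that plan, so the analyzer can match them up later.
val PLAN_ID_TAG = TreeNodeTag[Long]("plan_id")
// unresolvedAttribute.setTagValue(PLAN_ID_TAG, 42L)
// logicalPlan.setTagValue(PLAN_ID_TAG, 42L)
```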
a71aed9 [SPARK-42401][SQL] Set `containsNull` correctly in the data type for array_insert/array_append ### What changes were proposed in this pull request? In the `DataType` instance returned by `ArrayInsert#dataType` and `ArrayAppend#dataType`, set `containsNull` to true if either - the input array has `containsNull` set to true - the expression to be inserted/appended is nullable. ### Why are the changes needed? The following two queries return the wrong answer: ``` spark-sql> select array_insert(array(1, 2, 3, 4), 5, cast(null as int)); [1,2,3,4,0] <== should be [1,2,3,4,null] Time taken: 3.879 seconds, Fetched 1 row(s) spark-sql> select array_append(array(1, 2, 3, 4), cast(null as int)); [1,2,3,4,0] <== should be [1,2,3,4,null] Time taken: 0.068 seconds, Fetched 1 row(s) spark-sql> ``` The following two queries throw a `NullPointerException`: ``` spark-sql> select array_insert(array('1', '2', '3', '4'), 5, cast(null as string)); 23/02/10 11:24:59 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ... spark-sql> select array_append(array('1', '2', '3', '4'), cast(null as string)); 23/02/10 11:25:10 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3) java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ... spark-sql> ``` The bug arises because both `ArrayInsert` and `ArrayAppend` use the first child's data type as the function's data type. That is, it uses the first child's `containsNull` setting, regardless of whether the insert/append operation might produce an array containing a null value. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit tests. Closes #39970 from bersprockets/array_insert_anomaly. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 718b6b7ed8277d5f6577367ab0d49f60f9777df7) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 13 February 2023, 06:49:52 UTC
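A minimal sketch (not the actual `ArrayInsert`/`ArrayAppend` code) of the nullability rule this fix implements, expressed with the public `ArrayType` from `org.apache.spark.sql.types`:

```
import org.apache.spark.sql.types.ArrayType

// The result type allows nulls if the input array already does, or if the value
// being inserted/appended is itself nullable.
def resultType(inputType: ArrayType, valueNullable: Boolean): ArrayType =
  inputType.copy(containsNull = inputType.containsNull || valueNullable)
```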
95919e2 [SPARK-42269][CONNECT][PYTHON] Support complex return types in DDL strings ### What changes were proposed in this pull request? Support complex return types in DDL strings. ### Why are the changes needed? Parity with vanilla PySpark. ### Does this PR introduce _any_ user-facing change? Yes. ```py # BEFORE >>> spark.range(2).select(udf(lambda x: (x, x), "struct<x:integer, y:integer>")("id")) ... AssertionError: returnType should be singular >>> spark.udf.register('f', lambda x: (x, x), "struct<x:integer, y:integer>") ... AssertionError: returnType should be singular # AFTER >>> spark.range(2).select(udf(lambda x: (x, x), "struct<x:integer, y:integer>")("id")) DataFrame[<lambda>(id): struct<x:int,y:int>] >>> spark.udf.register('f', lambda x: (x, x), "struct<x:integer, y:integer>") <function <lambda> at 0x7faee0eaaca0> ``` ### How was this patch tested? Unit tests. Closes #39964 from xinrong-meng/collection_ret_type. Authored-by: Xinrong Meng <xinrong@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 3985b91633f5e49c8c97433651f81604dad193e9) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 13 February 2023, 05:10:41 UTC
85ec8d3 [SPARK-42331][SQL] Fix metadata col cannot be resolved ### What changes were proposed in this pull request? This PR makes the metadata output consistent during analysis by checking the output and reusing it if it exists. This PR also deduplicates the metadata output when merging it into the output. ### Why are the changes needed? Let's say this is the process of resolving metadata: ``` Project (_metadata.file_size) File (_metadata.file_size > 0) Relation ``` 1. `ResolveReferences` resolves _metadata.file_size for `Filter` 2. `ResolveReferences` cannot resolve _metadata.file_size for `Project`, because the `Filter` is not resolved (data type does not match) 3. then `AddMetadataColumns` will merge the metadata output into the output 4. the next round of `ResolveReferences` cannot resolve _metadata.file_size for `Project` since we filter out the conflicting names (the output already contains the metadata output), see the code: ``` def isOutputColumn(col: MetadataColumn): Boolean = { outputNames.exists(name => resolve(col.name, name)) } // filter out metadata columns that have names conflicting with output columns. if the table // has a column "line" and the table can produce a metadata column called "line", then the // data column should be returned, not the metadata column. hasMeta.metadataColumns.filterNot(isOutputColumn).toAttributes ``` And we also cannot skip metadata columns when filtering conflicting names, otherwise the newly generated metadata attribute will have a different expr id from the previous one. One failing example: ```scala SELECT _metadata.row_index FROM t WHERE _metadata.row_index >= 0; ``` ### Does this PR introduce _any_ user-facing change? Yes, bug fix ### How was this patch tested? Added tests for v1, v2 and streaming relations Closes #39870 from ulysses-you/SPARK-42331. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 5705436f70e6c6d5a127db7773d3627c8e3d695a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 13 February 2023, 04:43:49 UTC
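For reference, the failing example above written against the DataFrame API; the table and column names follow the example, and `spark` is assumed to be an active `SparkSession`.

```
import org.apache.spark.sql.functions.col

// Selecting a metadata field that is also used in a filter, which previously
// failed to resolve on the second analysis round.
val rows = spark.read.table("t")
  .where(col("_metadata.row_index") >= 0)
  .select(col("_metadata.row_index"))
rows.show()
```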
4dd2ef9 [SPARK-42410][CONNECT][TESTS][FOLLOWUP] Fix `PlanGenerationTestSuite` together ### What changes were proposed in this pull request? This is a follow-up of #39982 to fix `PlanGenerationTestSuite` together. ### Why are the changes needed? SPARK-42377 originally added two test suites which fail on Scala 2.13, but SPARK-42410 missed `PlanGenerationTestSuite` while fixing `ProtoToParsedPlanTestSuite`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ``` $ dev/change-scala-version.sh 2.13 $ build/sbt "connect-client-jvm/testOnly org.apache.spark.sql.PlanGenerationTestSuite" -Pscala-2.13 [info] PlanGenerationTestSuite: ... [info] - function udf 2.13 (514 milliseconds) ... $ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "connect-client-jvm/testOnly org.apache.spark.sql.PlanGenerationTestSuite" -Pscala-2.13 ... [info] PlanGenerationTestSuite: ... [info] - function udf 2.13 (574 milliseconds) ``` Closes #39986 from dongjoon-hyun/SPARK-42410-2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit f8fda8aa1356b3ac902570e0a5536d9c00838490) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 13 February 2023, 02:57:58 UTC
d3b59aa [SPARK-42410][CONNECT][TESTS] Support Scala 2.12/2.13 tests in `connect` module ### What changes were proposed in this pull request? This PR aims to support both Scala 2.12/13 tests in `connect` module by splitting the golden files. ### Why are the changes needed? After #39933, Scala 2.13 CIs are broken . - **master**: https://github.com/apache/spark/actions/runs/4157806226/jobs/7192575602 - **branch-3.4**: https://github.com/apache/spark/actions/runs/4155578848/jobs/7188777977 ``` [info] - function_udf *** FAILED *** (29 milliseconds) [info] java.io.InvalidClassException: org.apache.spark.sql.TestUDFs$$anon$1; local class incompatible: stream classdesc serialVersionUID = 505010451380771093, local class serialVersionUID = 643575318841761245 ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs and do Scala 2.13 tests manually. ``` $ dev/change-scala-version.sh 2.13 $ build/sbt -Dscala.version=2.13.8 -Pscala-2.13 -Phadoop-3 assembly/package "connect/test" ... [info] - function_udf_2.13 (19 milliseconds) [info] Run completed in 8 seconds, 964 milliseconds. [info] Total number of tests run: 119 [info] Suites: completed 8, aborted 0 [info] Tests: succeeded 119, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 18 s, completed Feb 12, 2023, 3:33:40 PM ``` Closes #39982 from dongjoon-hyun/SPARK-42410. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit a5db9b976d8f37d22f8a91f461c05cbb20601d8a) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 13 February 2023, 01:31:33 UTC
92d8269 [SPARK-41963][CONNECT] Fix DataFrame.unpivot to raise the same error class when the `values` argument is empty ### What changes were proposed in this pull request? Fixes `DataFrame.unpivot` to raise the same error class when the `values` argument is an empty list/tuple. ### Why are the changes needed? Currently `DataFrame.unpivot` raises a different error class, `UNPIVOT_REQUIRES_VALUE_COLUMNS` for PySpark vs. `UNPIVOT_VALUE_DATA_TYPE_MISMATCH` for Spark Connect. In `Unpivot`, an empty list/tuple as `values` argument is different from `None`. It should handle them differently. ### Does this PR introduce _any_ user-facing change? `DataFrame.unpivot` on Spark Connect will raise the same error class as PySpark. ### How was this patch tested? Enabled `DataFrameParityTests.test_unpivot_negative`. Closes #39960 from ueshin/issues/SPARK-41963/unpivot. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 633a486c65067b483524b079810b5590ac482a48) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 13 February 2023, 00:45:11 UTC
35cd461 [SPARK-40453][SPARK-41715][CONNECT][TESTS][FOLLOWUP] Skip freqItems doctest due to Scala 2.13 failure ### What changes were proposed in this pull request? This is a follow-up of #39947 to skip the `freqItems` doctest again. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #39983 from dongjoon-hyun/SPARK-40453. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit d703808347ca61ed05541e7252a0e881b1be7431) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 13 February 2023, 00:43:20 UTC
2729e12 [SPARK-42312][SQL] Assign name to _LEGACY_ERROR_TEMP_0042 ### What changes were proposed in this pull request? This PR proposes to assign name to _LEGACY_ERROR_TEMP_0042, "INVALID_SET_SYNTAX". ### Why are the changes needed? We should assign proper name to _LEGACY_ERROR_TEMP_* ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? `./build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite*` Closes #39951 from itholic/LEGACY_0042. Lead-authored-by: itholic <haejoon.lee@databricks.com> Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 4a27c604eef6f06672a3d2aaa5e40285e15bacab) Signed-off-by: Max Gekk <max.gekk@gmail.com> 12 February 2023, 14:13:19 UTC
cff23c0 [SPARK-42408][CORE] Register DoubleType to KryoSerializer ### What changes were proposed in this pull request? This PR aims to register `DoubleType` to `KryoSerializer` ### Why are the changes needed? There was an exception when running a TPCDS 3TB test. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Test manually. Closes #39978 from williamhyun/double. Authored-by: William Hyun <william@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 559c1f8e9f3b9d15934aff9f80785177dc5624c5) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 12 February 2023, 06:37:29 UTC
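A minimal sketch of what this registration amounts to (the actual change lives inside `KryoSerializer`): registering the `DoubleType` singleton's class lets Kryo write a small numeric id instead of the full class name.

```
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.sql.types.DoubleType

// Illustrative only: register the DoubleType case object's class with a Kryo instance.
def registerSqlTypes(kryo: Kryo): Unit = {
  kryo.register(DoubleType.getClass)
}
```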
6f4f78d [SPARK-42377][CONNECT][TESTS] Test framework for Spark Connect Scala Client ### What changes were proposed in this pull request? This PR adds a test framework for the Spark Connect Scala Client. This framework consists of two tests: - `PlanGenerationTestSuite`: This is the client side testing. Tests in this suite generate queries (relations), and compare the generated query plan to a golden file. In order to facilitate debugging and review the json of the proto file is also generated when the golden file is generated. The goal of this test is to validate whether the generated plan matches expectations, and to catch changes in the generated queries. **This is the only test suite where tests need to be added.** - `ProtoToParsedPlanTestSuite`: This is server side testing. This test uses the protos generated by the `PlanGenerationTestSuite` and tries to convert them to a catalyst parsed plan (unresolved `LogicaPlan`). The string representation of the generated parsed plan is compared to a golden file. When adding tests using this framework, both the `PlanGenerationTestSuite` and `ProtoToParsedPlanTestSuite` need to be run, and in this order. ### Why are the changes needed? This makes it easier to test new features for the scala client. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? It is a test framework. I have added tests for most existing features. Closes #39933 from hvanhovell/SPARK-42377. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit b5bf3326ded9b548d49045eb6280dd347f5413fe) Signed-off-by: Herman van Hovell <herman@databricks.com> 12 February 2023, 04:25:57 UTC
312bc36 [SPARK-42390][CONNECT][BUILD] Upgrade buf from 1.13.1 to 1.14.0 ### What changes were proposed in this pull request? This PR aims to upgrade buf from 1.13.1 to 1.14.0. ### Why are the changes needed? <img width="805" alt="image" src="https://user-images.githubusercontent.com/15246973/217976923-c5a8bdd9-de4f-4824-b4c5-4b22b5eafe16.png"> Release Notes: https://github.com/bufbuild/buf/releases https://github.com/bufbuild/buf/compare/v1.13.1...v1.14.0 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GitHub Actions. Closes #39959 from panbingkun/SPARK-42390. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit eaba93a3243811907215a92f17f8f418c23ff8c8) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 February 2023, 00:12:01 UTC
ec5314f [SPARK-42310][SQL] Assign name to _LEGACY_ERROR_TEMP_1289 ### What changes were proposed in this pull request? This PR proposes to assign the name "INVALID_COLUMN_NAME" to _LEGACY_ERROR_TEMP_1289. ### Why are the changes needed? We should assign a proper name to each _LEGACY_ERROR_TEMP_* error class. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? `./build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite*"` Closes #39946 from itholic/LEGACY_1289. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 8111115997659ad7b8854c2af77d4515a06b7407) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 February 2023, 00:11:13 UTC
02cb227 [SPARK-42265][SPARK-41820][CONNECT] Fix createTempView and its variations to work with not analyzed plans Fixes `createTempView` and its variations to work with plans that have not been analyzed yet: - `createTempView` - `createOrReplaceTempView` - `createGlobalTempView` - `createOrReplaceGlobalTempView` Currently `SparkConnectPlanner` creates `CreateViewCommand` with `isAnalyzed = true`, but the child plan may not have been analyzed yet. Users can now run `createTempView` and its variations with plans that are not analyzed yet. Enabled the related tests. Closes #39968 from ueshin/issues/SPARK-41279/createTempView. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 90be059509a687172188f268eb3f9663819740a3) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 February 2023, 00:06:25 UTC
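A sketch of the call pattern the fix above covers, shown with the standard `SparkSession` API as the reference behavior the Connect planner is being aligned with; the view names and the transformation are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Register session-scoped and global temp views from a derived plan
// (not just a bare table scan) and query them by name.
object TempViewExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("temp-view-example").master("local[*]").getOrCreate()
    import spark.implicits._

    val evens = spark.range(100).filter($"id" % 2 === 0).withColumnRenamed("id", "even_id")

    evens.createOrReplaceTempView("evens")            // session-scoped view
    spark.sql("SELECT count(*) FROM evens").show()

    evens.createOrReplaceGlobalTempView("evens_g")    // global variant, same shape
    spark.sql("SELECT max(even_id) FROM global_temp.evens_g").show()

    spark.stop()
  }
}
```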
3b22267 [SPARK-42402][CONNECT] Support parameterized SQL by `sql()` ### What changes were proposed in this pull request? Supports parameterized SQL by `sql()`. Note: `SparkSession.sql` in PySpark also supports string formatter, but it will be handled separately. ### Why are the changes needed? Currently `SparkSession.sql` in Spark Connect doesn't support parameterized SQL. ### Does this PR introduce _any_ user-facing change? The parameterized SQL will be available. For example: ```py >>> spark.sql("SELECT * FROM range(10) WHERE id > :minId", args = {"minId" : "7"}).toPandas() id 0 8 1 9 ``` ### How was this patch tested? Added a test. Closes #39971 from ueshin/issues/SPARK-42402/parameterized_sql. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 03227e18793c6902a816bbacdce78031ce37b14a) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 February 2023, 00:05:12 UTC
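For comparison with the Python snippet above, a sketch of the same named-parameter binding through the regular (non-Connect) `SparkSession.sql` overload in 3.4, where the bound values are passed as SQL literal strings; treat the exact signature as an assumption if you are on a different branch.

```scala
import org.apache.spark.sql.SparkSession

// Bind the named parameter :minId to the literal 7 (assumed 3.4-era
// Map[String, String] form of the parameterized sql() overload).
object ParameterizedSqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parameterized-sql").master("local[*]").getOrCreate()

    spark.sql(
      "SELECT * FROM range(10) WHERE id > :minId",
      Map("minId" -> "7")
    ).show()

    spark.stop()
  }
}
```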
42e7d66 [SPARK-42403][CORE] JsonProtocol should handle null JSON strings ### What changes were proposed in this pull request? This PR fixes a regression introduced by #36885 which broke JsonProtocol's ability to parse `null` string values: the old Json4S-based parser would correctly parse null literals, whereas the new code rejects them via an overly-strict type check. This PR solves this problem by relaxing the type checking in `extractString` so that `null` literals in JSON can be parsed as `null` strings. ### Why are the changes needed? Fix a regression which prevents the history server from parsing certain types of event logs which contain null strings, including stacktraces containing generated code frames and ExceptionFailure messages where the exception message is `null`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added new unit test in JsonProtocolSuite. Closes #39973 from JoshRosen/SPARK-42403-handle-null-strings-in-json-protocol-read-path. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 84ddd409c11e4da769c5b1f496f2b61c3d928c07) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 February 2023, 05:54:35 UTC
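An illustrative sketch of the relaxed check described above, using Jackson directly rather than the actual `JsonProtocol` code: a JSON `null` literal (or an absent field) maps to a null `String` instead of failing a strict type assertion.

```scala
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

// Not the real JsonProtocol implementation: a string extractor that tolerates
// JSON null and missing fields, and only rejects genuinely non-string values.
object NullTolerantJson {
  private val mapper = new ObjectMapper()

  def extractString(node: JsonNode, field: String): String = {
    val value = node.get(field)
    if (value == null || value.isNull) null            // missing field or JSON null
    else if (value.isTextual) value.asText()           // the normal case
    else throw new IllegalArgumentException(s"Field '$field' is not a string: $value")
  }

  def main(args: Array[String]): Unit = {
    val event = mapper.readTree("""{"Message": null, "Reason": "ExceptionFailure"}""")
    assert(extractString(event, "Message") == null)
    assert(extractString(event, "Reason") == "ExceptionFailure")
  }
}
```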
18b1cb1 [SPARK-42162] Introduce MultiCommutativeOp expression as a memory optimization for canonicalizing large trees of commutative expressions ### What changes were proposed in this pull request? - This PR introduces a new expression called `MultiCommutativeOp` which is used by the commutative expressions (e.g., `Add`, `Multiply`, `And`, `Or`, `BitwiseOr`, `BitwiseAnd`, `BitwiseXor`) during canonicalization. - During canonicalization, when there is a list of consecutive commutative expressions, we now create a `MultiCommutativeOp` expression with references to the original operands, instead of creating new objects. - This new expression is added as a memory optimization to reduce generating a large number of intermediate objects during canonicalization. ### Why are the changes needed? - With the [recent changes](https://github.com/apache/spark/pull/37851) in the expression canonicalization, a complex query with a large number of commutative operations could end up consuming significantly more (sometimes > 10X) memory on the executors. - In our case, this issue happens for a specific complex query that has a huge expression tree containing Add operators interleaved with non-Add operators. - The issue is related to canonicalization; it causes problems on the executors because the codegen component relies on expression canonicalization to deduplicate expressions. - When we have a large number of Adds interleaved with non-Add operators, [this line](https://github.com/apache/spark/pull/37851/files#diff-7278f2db37934522ee7c74b71525153234cff245cefaf996957e4a9ff3dbaacdR1171) ends up materializing a new canonicalized expression tree at every non-Add operator. - In our case, analyzing the executor heap histogram shows that the additional memory is consumed by a large number of Add objects. - The high memory usage causes the executors to lose heartbeat signals and results in task failures. - The proposed `MultiCommutativeOp` expression avoids generating new Add expressions and keeps the extra memory usage to a minimum. ### Does this PR introduce _any_ user-facing change? - No ### How was this patch tested? - Existing unit tests and new unit tests. Closes #39722 from db-scnakandala/SPARK-42162. Authored-by: Supun Nakandala <supun.nakandala@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 99431e28f950bb25c421abd51888a3f9f4b46685) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 February 2023, 15:56:54 UTC
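A toy sketch of the idea behind the commit above, with names and structure simplified and not the actual Catalyst classes: a run of the same commutative operator canonicalizes to a single n-ary node that references the operands it already has, instead of allocating a fresh binary tree at every step.

```scala
// Simplified model: Add flattens its consecutive run of operands and
// canonicalizes to one MultiCommutativeOp node with a stable operand order.
sealed trait Expr { def canonical: Expr }

final case class Leaf(name: String) extends Expr { def canonical: Expr = this }

final case class Add(left: Expr, right: Expr) extends Expr {
  // Collect the operands of the consecutive Add run into a flat list.
  private def gather(e: Expr): Seq[Expr] = e match {
    case Add(l, r) => gather(l) ++ gather(r)
    case other     => Seq(other)
  }
  // One n-ary node replaces the reordered chain of freshly allocated Adds.
  def canonical: Expr =
    MultiCommutativeOp("add", gather(this).map(_.canonical).sortBy(_.hashCode()))
}

final case class MultiCommutativeOp(op: String, operands: Seq[Expr]) extends Expr {
  def canonical: Expr = this
}

object CanonicalizeDemo extends App {
  val (a, b, c) = (Leaf("a"), Leaf("b"), Leaf("c"))
  // (a + b) + c and c + (b + a) canonicalize to the same n-ary node.
  println(Add(Add(a, b), c).canonical == Add(c, Add(b, a)).canonical) // true
}
```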
0a07a54 [SPARK-42394][SQL] Fix the usage information of bin/spark-sql --help ### What changes were proposed in this pull request? #36786 made the SQL shell exit w/ hive session state cleanup all the time. The cleanup step will initialize a hive ms client if absent, which is not necessary for `--help` option and may cause errors like the following ```java CLI options: 23/02/10 11:01:13 WARN Utils: Your hostname, hulk.local resolves to a loopback address: 127.0.0.1; using 10.221.102.180 instead (on interface en0) 23/02/10 11:01:13 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 23/02/10 11:01:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 23/02/10 11:01:14 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist 23/02/10 11:01:14 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist 23/02/10 11:01:14 ERROR Datastore: Exception thrown creating StoreManager. See the nested exception Error creating transactional connection factory org.datanucleus.exceptions.NucleusException: Error creating transactional connection factory at org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:214) at org.datanucleus.store.AbstractStoreManager.<init>(AbstractStoreManager.java:162) at org.datanucleus.store.rdbms.RDBMSStoreManager.<init>(RDBMSStoreManager.java:285) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606) at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) at org.datanucleus.NucleusContextHelper.createStoreManagerForProperties(NucleusContextHelper.java:133) at org.datanucleus.PersistenceNucleusContextImpl.initialise(PersistenceNucleusContextImpl.java:422) at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:817) at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:334) at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:213) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ``` and make the help information unreadable. This PR extracts a simple function for usage print ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? built a local distribution and verified manually. Closes #39966 from yaooqinn/SPARK-42394. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 70633e7158f90bb1a7005b3c65fe10751e08e7bf) Signed-off-by: Kent Yao <yao@apache.org> 10 February 2023, 11:25:25 UTC
96517aa [SPARK-42276][BUILD][CONNECT] Add `ServicesResourceTransformer` rule to connect server module shade configuration ### What changes were proposed in this pull request? This PR aims to add the `ServicesResourceTransformer` rule to the connect server module's shade configuration, to make sure `grpc.ManagedChannelProvider` and `grpc.ServerProvider` can be used on the server side. ### Why are the changes needed? This keeps `grpc.ManagedChannelProvider`, `grpc.ServerProvider`, and other SPI classes usable after gRPC is shaded. The sbt build doesn't need to be fixed because `sbt-assembly` does this by default. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GitHub Actions - Manual test: build a Spark client and do as follows: ``` bin/spark-shell --jars spark-connect_2.12-3.5.0-SNAPSHOT.jar --driver-class-path spark-connect_2.12-3.5.0-SNAPSHOT.jar --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin --conf spark.connect.grpc.binding.port=15102 ``` then run some code in spark-shell Before ```scala 23/02/01 20:44:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Spark context Web UI available at http://localhost:4040 Spark context available as 'sc' (master = local[*], app id = local-1675255501816). Spark session available as 'spark'. Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.5.0-SNAPSHOT /_/ Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 11.0.17) Type in expressions to have them evaluated. Type :help for more information. scala> org.sparkproject.connect.grpc.ServerProvider.provider org.sparkproject.connect.grpc.ManagedChannelProvider$ProviderNotFoundException: No functional server found. Try adding a dependency on the grpc-netty or grpc-netty-shaded artifact at org.sparkproject.connect.grpc.ServerProvider.provider(ServerProvider.java:44) ... 47 elided scala> org.sparkproject.connect.grpc.ManagedChannelProvider.provider org.sparkproject.connect.grpc.ManagedChannelProvider$ProviderNotFoundException: No functional channel service provider found. Try adding a dependency on the grpc-okhttp, grpc-netty, or grpc-netty-shaded artifact at org.sparkproject.connect.grpc.ManagedChannelProvider.provider(ManagedChannelProvider.java:45) ... 47 elided ``` After ```scala 23/02/01 21:00:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Spark context Web UI available at http://localhost:4040 Spark context available as 'sc' (master = local[*], app id = local-1675256417224). Spark session available as 'spark'. Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.5.0-SNAPSHOT /_/ Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 11.0.17) Type in expressions to have them evaluated. Type :help for more information.
scala> org.sparkproject.connect.grpc.ManagedChannelProvider.provider res0: org.sparkproject.connect.grpc.ManagedChannelProvider = org.sparkproject.connect.grpc.netty.NettyChannelProvider68aa505b scala> org.sparkproject.connect.grpc.ServerProvider.provider res2: org.sparkproject.connect.grpc.ServerProvider = org.sparkproject.connect.grpc.netty.NettyServerProvider4a5d8ae4 ``` Closes #39848 from LuciferYang/SPARK-42276. Lead-authored-by: yangjie01 <yangjie01@baidu.com> Co-authored-by: YangJie <yangjie01@baidu.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit af50b47e12040f86c4f81ff84407ad820cb252c1) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 10 February 2023, 02:58:26 UTC
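A small sketch of the JDK SPI mechanism the shade rule above exists for, with an illustrative interface rather than the real gRPC one: `ServiceLoader` resolves implementations from `META-INF/services/<interface-name>` files, so when the shade plugin relocates classes those files must be rewritten too, which is what `ServicesResourceTransformer` handles; otherwise the lookup comes back empty and gRPC fails with `ProviderNotFoundException`, as in the "Before" transcript.

```scala
import java.util.ServiceLoader

// Illustrative SPI lookup: Provider stands in for the gRPC provider interfaces.
// If the services file still lists pre-relocation class names, this finds nothing.
object SpiLookupDemo {
  trait Provider { def name: String }

  def main(args: Array[String]): Unit = {
    val it = ServiceLoader.load(classOf[Provider]).iterator()
    if (!it.hasNext) {
      println("No provider found: the services file does not list the (relocated) class names")
    } else {
      while (it.hasNext) println(s"Loaded provider: ${it.next().name}")
    }
  }
}
```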
f11338a [MINOR][SS] Fix setTimeoutTimestamp doc ### What changes were proposed in this pull request? This patch updates the API doc of `setTimeoutTimestamp` of `GroupState`. ### Why are the changes needed? Update incorrect API doc. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Doc change only. Closes #39958 from viirya/fix_group_state. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit a180e67d3859a4e145beaf671c1221fb4d6cbda7) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> 10 February 2023, 02:17:36 UTC
f8656ff [SPARK-40453][SPARK-41715][CONNECT] Take super class into account when throwing an exception ### What changes were proposed in this pull request? This PR proposes to take the super classes into account when throwing an exception from the server to the Python side, by adding more metadata about the classes, causes and traceback from the JVM. In addition, this PR matches the exceptions being thrown to the regular PySpark exceptions defined at: https://github.com/apache/spark/blob/04550edd49ee587656d215e59d6a072772d7d5ec/python/pyspark/errors/exceptions/captured.py#L108-L147 ### Why are the changes needed? Right now, many exceptions (e.g., `NoSuchDatabaseException`, which inherits from `AnalysisException`) cannot be handled properly on the Python side. ### Does this PR introduce _any_ user-facing change? No to end users. Yes, it matches the exceptions to the regular PySpark exceptions. ### How was this patch tested? Unit tests fixed. Closes #39947 from HyukjinKwon/SPARK-41715. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c5230e485d781ecfa996a674443709b0ce261f36) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 10 February 2023, 01:50:08 UTC
b2602c7 [SPARK-42338][CONNECT] Add details to non-fatal errors to raise a proper exception in the Python client ### What changes were proposed in this pull request? Adds details to non-fatal errors so that a proper exception is raised in the Python client, which makes `df.sample` raise `IllegalArgumentException` the same way PySpark does, except that the error surfaces later, when an action is called. ### Why are the changes needed? Currently `SparkConnectService` does not add details for `NonFatal` exceptions to the RPC status, so the Python client can't detect the exception properly and raises `SparkConnectGrpcException` instead. These errors should also carry the details the Python client needs. ### Does this PR introduce _any_ user-facing change? Users will see a proper exception when they call `df.sample` with illegal arguments, but at a different time. ### How was this patch tested? Enabled `DataFrameParityTests.test_sample`. Closes #39957 from ueshin/issues/SPARK-42338/sample. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit ced675071f780f75435bef5d72b115ffa783e19e) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 09 February 2023, 23:48:55 UTC