Revision - db03653 - [SPARK-48910][SQL] Use HashSet/HashMap to avoid linear searches [...]

Revision db036537ab5593b2520742b3b1a1028bb0fcc7fa authored by Vladimir Golubev on 29 July 2024, 14:07:02 UTC, committed by Wenchen Fan on 29 July 2024, 14:07:02 UTC

[SPARK-48910][SQL] Use HashSet/HashMap to avoid linear searches in PreprocessTableCreation

### What changes were proposed in this pull request?

Use `HashSet`/`HashMap` lookups instead of linear searches over a `Seq`. With thousands of partitions, this significantly improves performance.

### Why are the changes needed?

To avoid the O(n*m) passes in `PreprocessTableCreation`.
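As an illustration only, here is a minimal sketch of the general technique, not the actual Spark code. The object name `LookupSketch`, its method names, and the use of plain `String` column names are assumptions made for this example; it also ignores case sensitivity.

```scala
import scala.collection.immutable.{HashMap, HashSet}

// Hypothetical sketch: none of these names come from the Spark source.
object LookupSketch {
  // O(n*m): each of the m partition columns scans all n schema fields.
  def missingLinear(schemaFields: Seq[String], partitionCols: Seq[String]): Seq[String] =
    partitionCols.filterNot(col => schemaFields.contains(col))

  // Expected O(n + m): build the hash set once, then do constant-time lookups.
  def missingHashed(schemaFields: Seq[String], partitionCols: Seq[String]): Seq[String] = {
    val fieldSet = HashSet(schemaFields: _*)
    partitionCols.filterNot(fieldSet.contains)
  }

  // HashMap variant: resolve each column to its position in the schema.
  // If a field name is duplicated, the last index wins.
  def indicesHashed(schemaFields: Seq[String], partitionCols: Seq[String]): Seq[Option[Int]] = {
    val indexByName = HashMap(schemaFields.zipWithIndex: _*)
    partitionCols.map(indexByName.get)
  }
}
```

Both `missing*` methods return the same result; the hashed version pays a one-time cost to build the set and then avoids rescanning the `Seq` for every column.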

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47484 from vladimirg-db/vladimirg-db/get-rid-of-linear-searches-preprocess-table-creation.

Authored-by: Vladimir Golubev <vladimir.golubev@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

1 parent efc6a75
