https://github.com/apache/spark
Revision 7658f77a613c91364c4b6c986e1861c7bd5487db authored by Tigran Manasyan on 08 February 2024, 12:29:09 UTC, committed by Wenchen Fan on 08 February 2024, 12:30:05 UTC
In the current version, `DataSource#checkAndGlobPathIfNecessary` qualifies paths via `Path#makeQualified`, while `PartitioningAwareFileIndex` qualifies them via `FileSystem#makeQualified`. Most `FileSystem` implementations simply delegate to `Path#makeQualified`, but others, such as `HarFileSystem`, contain fs-specific logic that can produce a different result. Such inconsistencies can lead to a situation where Spark can't find the partitions of the source file, because the qualified paths built by `Path` and by `FileSystem` differ. Therefore, for uniformity, `DataSource#checkAndGlobPathIfNecessary` should also use the `FileSystem` path qualification.
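The divergence can be sketched with a toy model. The classes below are hypothetical stand-ins, not the real Hadoop `Path`/`FileSystem` API: they only illustrate the delegation pattern in which most filesystems reuse the path's own qualification, while an archive-style filesystem (analogous to `HarFileSystem`) overrides it and yields a different qualified path.

```java
// Hypothetical sketch (not the actual Hadoop classes): shows why qualifying
// a path through the FileSystem can differ from qualifying it through the
// Path itself once a filesystem overrides the qualification logic.
public class PathQualificationSketch {
    // Stand-in for Path#makeQualified: prefix the default scheme.
    static String pathMakeQualified(String scheme, String raw) {
        return scheme + "://" + raw;
    }

    // Stand-in for FileSystem#makeQualified. Most implementations simply
    // delegate to the path-level qualification, as the default method does.
    interface SimpleFs {
        String scheme();
        default String makeQualified(String raw) {
            return pathMakeQualified(scheme(), raw);
        }
    }

    // An archive-style filesystem with fs-specific qualification logic
    // (analogous to HarFileSystem), which rewrites the authority.
    static class ArchiveFs implements SimpleFs {
        public String scheme() { return "har"; }
        @Override
        public String makeQualified(String raw) {
            return scheme() + "://archive/" + raw;
        }
    }

    public static void main(String[] args) {
        SimpleFs plain = () -> "file";
        String raw = "data/part-0";
        // For a delegating fs, both qualification routes agree...
        System.out.println(plain.makeQualified(raw)
            .equals(pathMakeQualified("file", raw))); // true
        // ...but for the archive fs they diverge, which is the root cause
        // of the missing-partitions bug this commit fixes.
        System.out.println(new ArchiveFs().makeQualified(raw)
            .equals(pathMakeQualified("har", raw))); // false
    }
}
```

In the real code, routing `DataSource#checkAndGlobPathIfNecessary` through `FileSystem#makeQualified` means both components always see the same qualified path, regardless of which `FileSystem` implementation is in use.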

Allow users to read files from hadoop archives (.har) using DataFrameReader API

No

New tests were added in `DataSourceSuite` and `DataFrameReaderWriterSuite`

No

Closes #43463 from tigrulya-exe/SPARK-39910-use-fs-path-qualification.

Authored-by: Tigran Manasyan <t.manasyan@arenadata.io>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b7edc5fac0f4e479cbc869d54a9490c553ba2613)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
[SPARK-39910][SQL] Delegate path qualification to filesystem during DataSource file path globbing