Revision - 02f32ee - [SPARK-41375][SS] Avoid empty latest KafkaSourceOffset - origin: https://github.com/apache/spark

https://github.com/apache/spark

06 August 2024, 02:29:59 UTC

Revision 02f32ee358cc0a398aa7321bc5613cb92b306f6f authored by wecharyu on 08 December 2022, 08:12:30 UTC, committed by Jungtaek Lim on 08 December 2022, 08:12:45 UTC

[SPARK-41375][SS] Avoid empty latest KafkaSourceOffset

### What changes were proposed in this pull request?

Add an empty-offset filter to `latestOffset()` in the Kafka source, so that the offset remains unchanged if Kafka returns no topic partitions during a fetch.
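As a rough illustration of the idea, the filter can be modeled as discarding an empty fetch result so that the previously known offsets stay in effect. This is a toy sketch in Python, not the Spark code (which is Scala); the names `latest_offset`, `next_committed`, and `fetch_latest` are hypothetical and do not appear in the patch.

```python
from typing import Callable, Dict, Optional, Tuple

TopicPartition = Tuple[str, int]               # (topic, partition)
PartitionOffsetMap = Dict[TopicPartition, int]


def latest_offset(
    fetch_latest: Callable[[], PartitionOffsetMap],
) -> Optional[PartitionOffsetMap]:
    """Return the fetched offsets, or None if the fetch came back empty.

    Hypothetical stand-in for the empty-offset filter: an empty map is
    treated as "no latest offset available" rather than as a valid offset.
    """
    fetched = fetch_latest()
    return fetched if fetched else None


def next_committed(
    previous: PartitionOffsetMap,
    fetch_latest: Callable[[], PartitionOffsetMap],
) -> PartitionOffsetMap:
    """Keep the previous offsets when no usable latest offset is reported."""
    latest = latest_offset(fetch_latest)
    return previous if latest is None else latest
```

With this model, an empty fetch leaves the stored offsets untouched, while a normal fetch advances them as usual.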

### Why are the changes needed?

In some extreme cases, such as fetching partitions while the Kafka cluster is reassigning them, `KafkaOffsetReader` may return no partitions. This produces an empty `PartitionOffsetMap` (even though the topic-partitions themselves are unchanged), which is then stored in `committedOffsets` after `runBatch()`.

In the next batch, partitions are fetched normally and the actual offsets are returned. However, when the data for that batch is fetched in `KafkaOffsetReaderAdmin#getOffsetRangesFromResolvedOffsets()`, every partition in `endOffsets` is treated as a new partition because `startOffsets` is empty. These "new partitions" then start from their earliest offsets, which causes data duplication.
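The duplication described above can be reproduced with a small toy model. This is a simplified Python sketch under stated assumptions, not the logic of `getOffsetRangesFromResolvedOffsets()` itself: the function `offset_ranges`, the topic name `events`, and all offset values are made up for illustration.

```python
from typing import Dict, List, Tuple

TopicPartition = Tuple[str, int]               # (topic, partition)
PartitionOffsetMap = Dict[TopicPartition, int]
OffsetRange = Tuple[TopicPartition, int, int]  # (partition, from, until)


def offset_ranges(
    start: PartitionOffsetMap,
    end: PartitionOffsetMap,
    earliest: PartitionOffsetMap,
) -> List[OffsetRange]:
    """Compute the range to read for each partition in `end`.

    A partition missing from `start` is assumed to be new, so reading
    begins at its earliest available offset.
    """
    ranges = []
    for tp, until in sorted(end.items()):
        begin = start[tp] if tp in start else earliest[tp]
        ranges.append((tp, begin, until))
    return ranges


earliest = {("events", 0): 0}
committed = {("events", 0): 100}   # offsets 0..99 were already processed
end = {("events", 0): 120}

# Normal case: only the 20 new records are read.
normal = offset_ranges(committed, end, earliest)

# After an empty offset map was committed: the partition looks "new",
# so offsets 0..99 are read a second time.
after_empty = offset_ranges({}, end, earliest)
```

In this model the normal batch covers offsets 100 to 120, while the batch that follows an empty committed offset covers 0 to 120 and therefore re-reads the first 100 records.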

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a unit test.

Closes #38898 from wecharyu/SPARK-41375.

Authored-by: wecharyu <yuwq1996@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 043475a87844f11c252fb0ebab469148ae6985d7)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>

1 parent 5a91b21

File	Mode	Size
.github
.idea
R
assembly
bin
binder
build
common
conf
core
data
dev
docs
examples
external
graphx
hadoop-cloud
launcher
licenses
licenses-binary
mllib
mllib-local
project
python
repl
resource-managers
sbin
sql
streaming
tools
.asf.yaml	-rw-r--r--	1.1 KB
.gitattributes	-rw-r--r--	130 bytes
.gitignore	-rw-r--r--	2.0 KB
CONTRIBUTING.md	-rw-r--r--	997 bytes
LICENSE	-rw-r--r--	13.1 KB
LICENSE-binary	-rw-r--r--	22.4 KB
NOTICE	-rw-r--r--	2.0 KB
NOTICE-binary	-rw-r--r--	56.5 KB
README.md	-rw-r--r--	4.4 KB
appveyor.yml	-rw-r--r--	2.7 KB
pom.xml	-rw-r--r--	137.5 KB
scalastyle-config.xml	-rw-r--r--	22.0 KB
