Revision - 2839280 - [SPARK-22355][SQL] Dataset.collect is not threadsafe - origin: https://github.com/apache/spark

https://github.com/apache/spark

05 April 2024, 20:24:39 UTC

Revision 2839280adc930593c64a74892fec79dcc666d468 authored by Wenchen Fan on 27 October 2017, 00:51:16 UTC, committed by gatorsmile on 27 October 2017, 00:52:26 UTC

[SPARK-22355][SQL] Dataset.collect is not threadsafe

Users may create a `Dataset` and call `collect` on it from many threads at the same time. Currently `Dataset#collect` just calls `encoder.fromRow` to convert Spark rows to objects of type T, and this encoder is per-Dataset. This means `Dataset#collect` is not thread-safe, because the encoder uses a projection that writes each object into a reusable row, and that row is shared by every concurrent call.

This PR fixes the problem by creating a new projection on each call to `Dataset#collect`, so that there is one reusable row per method call instead of one per Dataset.
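The hazard and the fix can be illustrated with a toy model. This is only a sketch in plain Python, not Spark code: the `Projection` and `Encoder` classes and the `interleaved_collect` helper are hypothetical stand-ins, and the thread interleaving is simulated deterministically by having both calls project before either one reads its result.

```python
class Projection:
    """Hypothetical stand-in for a projection that writes into one reusable row."""

    def __init__(self):
        self.reusable_row = [None]  # single mutable buffer, reused on every call

    def apply(self, value):
        self.reusable_row[0] = value
        return self.reusable_row  # the same list object is returned every time


class Encoder:
    """Hypothetical per-Dataset encoder; not Spark's actual encoder class."""

    def __init__(self):
        # Before the fix: one projection (and one row) per Dataset.
        self.shared_projection = Projection()

    def new_projection(self):
        # After the fix: a fresh projection (and row) per method call.
        return Projection()


def interleaved_collect(proj_a, proj_b):
    """Simulate two collect calls whose steps interleave:
    both write to their output row before either reads it back."""
    row_a = proj_a.apply("from call A")
    row_b = proj_b.apply("from call B")
    return row_a[0], row_b[0]


encoder = Encoder()

# Shared projection: call A reads the value that call B just wrote.
unsafe = interleaved_collect(encoder.shared_projection, encoder.shared_projection)

# One projection per call: each call reads back its own value.
safe = interleaved_collect(encoder.new_projection(), encoder.new_projection())

print(unsafe)  # ('from call B', 'from call B')
print(safe)    # ('from call A', 'from call B')
```

In the shared case both calls hold a reference to the same row, so the later write overwrites the earlier one. Giving each call its own projection removes the shared mutable state without needing a lock.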

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19577 from cloud-fan/encoder.

(cherry picked from commit 5c3a1f3fad695317c2fff1243cdb9b3ceb25c317)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

1 parent a607ddc

File	Mode	Size
.github
R
assembly
bin
build
common
conf
core
data
dev
docs
examples
external
graphx
launcher
licenses
mllib
mllib-local
project
python
repl
resource-managers
sbin
sql
streaming
tools
.gitattributes	-rw-r--r--	40 bytes
.gitignore	-rw-r--r--	1.2 KB
.travis.yml	-rw-r--r--	1.7 KB
CONTRIBUTING.md	-rw-r--r--	995 bytes
LICENSE	-rw-r--r--	17.5 KB
NOTICE	-rw-r--r--	24.1 KB
README.md	-rw-r--r--	3.7 KB
appveyor.yml	-rw-r--r--	1.9 KB
pom.xml	-rw-r--r--	94.8 KB
scalastyle-config.xml	-rw-r--r--	17.4 KB
