Revision 11ee9d191e26a41a44ff0ca8730a129934942ee7 authored by Yuhao Yang on 18 November 2015, 21:25:15 UTC, committed by Xiangrui Meng on 18 November 2015, 21:26:39 UTC
jira: https://issues.apache.org/jira/browse/SPARK-11813

I found this problem while training on a large corpus. Avoiding serialization of vocab in Word2Vec has two benefits:
1. A performance improvement from less serialization.
2. A large increase in the capacity of Word2Vec.

Currently, in the fit of Word2Vec, the closure mainly includes the serialized Word2Vec instance and the two global tables.
The main part of Word2Vec is the vocab, whose size is roughly vocabSize * 40 * 2 * 4 = 320 * vocabSize bytes (each entry holds two Int arrays of length up to 40).
The two global tables take vocabSize * vectorSize * 8 bytes; if vectorSize = 20, that's 160 * vocabSize bytes.

Their sum cannot exceed Int.MaxValue due to the restriction of ByteArrayOutputStream. In any case, avoiding serialization of vocab decreases the size of the closure serialization, especially when vectorSize is small, and thus allows a larger vocabulary.
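
For a sense of scale, here is a back-of-the-envelope check in Scala using the per-word costs above (the 5-million-word vocabulary is an illustrative figure, not taken from the patch):

    // Estimated closure size: ~320 bytes of vocab per word plus
    // 8 * vectorSize bytes per word for the two global Float tables.
    def estimatedClosureBytes(vocabSize: Long, vectorSize: Int): Long =
      vocabSize * (320L + 8L * vectorSize)

    val bytes = estimatedClosureBytes(5000000L, 20)  // 2,400,000,000 bytes
    // Exceeds Int.MaxValue (2,147,483,647), so ByteArrayOutputStream fails.
    println(bytes > Int.MaxValue)                    // true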

Actually, there's another possible fix: make local copies of the fields to avoid including Word2Vec in the closure, as in the sketch below. Let me know if that's preferred.
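
For reference, that alternative follows a standard Spark pattern: copy the needed fields into local vals so the task closure captures those small values instead of `this`. A minimal sketch of the idea (the class, field names, and placeholder map body below are illustrative, not the actual code in Word2Vec.scala):

    import org.apache.spark.rdd.RDD

    class Word2VecSketch(vectorSize: Int, window: Int) extends Serializable {
      // Stand-in for the large vocab field we want to keep out of the closure.
      private val vocab: Array[String] = Array.fill(1000000)("word")

      def fit(sentences: RDD[Array[Int]]): RDD[Int] = {
        // Local copies: the closure below captures two Ints, not `this`,
        // so vocab is never serialized into the task.
        val localVectorSize = vectorSize
        val localWindow = window
        sentences.map(s => s.length * localVectorSize + localWindow)
      }
    }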

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9803 from hhbyyh/w2vVocab.

(cherry picked from commit e391abdf2cb6098a35347bd123b815ee9ac5b689)
Signed-off-by: Xiangrui Meng <meng@databricks.com>