https://github.com/facebookresearch/pythia
Revision c6712145f5f4de20d2cfa2a1be38d9f1de6f1758 authored by Aapo Kyrola on 28 April 2021, 01:28:31 UTC, committed by Facebook GitHub Bot on 28 April 2021, 01:29:36 UTC
Summary:
[feat]
Pull Request resolved: https://github.com/facebookresearch/mmf/pull/902

Before adding a batch to the reporter, sync with all workers and validate that all have the same batch size. If not, skip the batch. At the end, report how many batches were skipped.
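
The check described above can be sketched roughly as follows. This is not the code from the commit: `SkipCountingReporter`, `all_workers_agree`, and the `gather` callable are hypothetical names, and `gather` stands in for a real collective such as `torch.distributed.all_gather`, which would return every worker's batch size.

```python
from typing import Callable, List, Sequence


def all_workers_agree(local_size: int, gather: Callable[[int], List[int]]) -> bool:
    """Return True if every worker reports the same batch size.

    `gather` is a hypothetical stand-in for a collective operation: it takes
    this worker's batch size and returns the sizes seen on all workers.
    """
    sizes = gather(local_size)
    return all(size == sizes[0] for size in sizes)


class SkipCountingReporter:
    """Hypothetical reporter that drops batches whose sizes differ across workers."""

    def __init__(self, gather: Callable[[int], List[int]]):
        self.gather = gather
        self.batches: List[Sequence] = []
        self.skipped = 0  # reported once at the end of the run

    def add_batch(self, batch: Sequence) -> bool:
        # Every worker must call this for every batch, otherwise the
        # collective itself would hang.
        if not all_workers_agree(len(batch), self.gather):
            self.skipped += 1
            return False
        self.batches.append(batch)
        return True
```

Because all workers see the same gathered sizes, they all reach the same skip decision, so no worker proceeds into a later collective alone.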

This should add negligible overhead. Once the underlying problem is fixed (probably by constructing the dataloader with 'drop_incomplete', though I am not sure where to add this), the check could be removed, but it is perhaps better to just keep it.
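
For illustration, dropping an incomplete final batch looks like this in plain Python. PyTorch's `DataLoader` exposes the same behavior through its `drop_last` argument, which is presumably what 'drop_incomplete' refers to here. The `batches` helper below is a hypothetical sketch, not part of mmf.

```python
from typing import Iterator, List, Sequence


def batches(items: Sequence, batch_size: int, drop_last: bool = False) -> Iterator[List]:
    """Yield consecutive batches; optionally drop a short final batch."""
    for start in range(0, len(items), batch_size):
        chunk = list(items[start:start + batch_size])
        if drop_last and len(chunk) < batch_size:
            return  # the final chunk is incomplete, so stop without yielding it
        yield chunk


# Ten items with batch size 4: the last batch has only two items unless dropped.
full = list(batches(range(10), 4))
trimmed = list(batches(range(10), 4, drop_last=True))
```

With `drop_last=True`, every batch that is yielded has exactly `batch_size` items.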

In addition:
- remove master's logging to a file and instead log to stderr
- add CompleteInTimeOrDie() so that we get better traces than those produced by NCCL deadlocks.
- some tiny changes.
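
A guard like CompleteInTimeOrDie() could be sketched as a context manager that arms a timer on entry and, if the guarded block has not finished in time, dumps all thread stacks and exits. This is an assumption about the general idea, not the implementation in this commit. The constructor arguments and the injectable `on_timeout` callback are hypothetical.

```python
import faulthandler
import os
import sys
import threading


class CompleteInTimeOrDie:
    """Hypothetical sketch: abort the process if the guarded block runs too long."""

    def __init__(self, timeout_secs, on_timeout=None):
        self.timeout_secs = timeout_secs
        # Default action: print every thread's stack to stderr, then hard-exit.
        # The callback is injectable so the guard can be exercised safely.
        self.on_timeout = on_timeout or self._dump_and_exit
        self._timer = None

    @staticmethod
    def _dump_and_exit():
        faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
        os._exit(1)  # skip cleanup handlers that could block on a stuck collective

    def __enter__(self):
        self._timer = threading.Timer(self.timeout_secs, self.on_timeout)
        self._timer.daemon = True  # never keep the interpreter alive on its own
        self._timer.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._timer.cancel()  # block finished (or raised) in time
        return False  # do not swallow exceptions
```

A call site would wrap a collective operation, for example `with CompleteInTimeOrDie(600): ...`, so that a hang yields a Python stack trace pointing at the stuck call rather than a silent deadlock.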

Reviewed By: apsdehal

Differential Revision: D27954629

fbshipit-source-id: cf000586772f4510b717153ec66e1f9c6c269ac3
1 parent c29b3cd
[mmf] validate batch sizes are same + debugging improvements (#902)
File Mode Size
.circleci
.github
docs
mmf
mmf_cli
projects
tests
tools
website
.editorconfig -rw-r--r-- 191 bytes
.flake8 -rw-r--r-- 187 bytes
.gitignore -rw-r--r-- 267 bytes
.pre-commit-config.yaml -rw-r--r-- 1.1 KB
LICENSE -rw-r--r-- 1.5 KB
MANIFEST.in -rw-r--r-- 130 bytes
NOTICES -rw-r--r-- 4.5 KB
README.md -rw-r--r-- 2.2 KB
pyproject.toml -rw-r--r-- 925 bytes
requirements.txt -rw-r--r-- 379 bytes
setup.py -rw-r--r-- 5.1 KB

