https://gitlab.com/mcoavoux/mtgpy-release-findings-2021.git
Tip revision: c9972219cd75049269d26632d2bb79619d661298 authored by mcoavoux on 20 May 2021, 13:04:44 UTC
This is the code release for:
BERT-Proof Syntactic Structures: Investigating Errors in Discontinuous Constituency Parsing
Maximin Coavoux
Findings of ACL 2021
For the test suite, please see the supplementary materials of the paper on the ACL Anthology.
## Install dependencies
```bash
conda create --name mtgpy python=3.6 --file conda-requirements.txt
conda activate mtgpy
pip install -r requirements.txt
```
If the `disco-dop` installation fails, see the instructions in the [original repo](https://github.com/andreasvc/disco-dop/).
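To check whether the installation succeeded, something like the following minimal sketch can be run inside the activated `mtgpy` environment. It assumes that `disco-dop` installs a Python module named `discodop`; the helper name `is_installed` is only illustrative and is not part of this repository.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: disco-dop is importable under the module name `discodop`.
print("discodop importable:", is_installed("discodop"))
```

If this prints `False`, the `disco-dop` build probably failed and the instructions in the original repo apply.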
## Pretrained models
I release 12 pretrained models (4 training corpora × 3 training configurations).
The training corpora are the discontinuous Penn Treebank (English, `dptb`), the Negra corpus (German, `negra`),
the Tiger corpus (German, `tiger_spmrl`), and a version of Tiger from which
the sentences in [`discosuite`](https://www.phil-fak.uni-duesseldorf.de/beyond-cfg/resources/discosuite/) have been removed (so that the models can be evaluated on them).
The three training configurations are `supervised`, `bert` (fine-tuning), and frozen `fast-text` embeddings.
See the last table in the paper's Appendix for all results.
## Parse with pretrained models
Command-line examples for parsing with pretrained models:
```bash
# Usage: python src/mtg.py eval <model path> <input file> <output file> [--eval-batchsize <int>] [--gpu 0]
# --gpu is None by default; use --gpu 0 to run on the first GPU device
python src/mtg.py eval models/negra_fast_text/ sample_data/german_sample.tokens german_parsed1.discbracket
python src/mtg.py eval models/tiger_spmrl_bert/ sample_data/german_sample.tokens german_parsed2.discbracket
python src/mtg.py eval models/dptb_bert/ sample_data/train_sample.tokens english_parsed.discbracket
```
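When parsing several files, it may be convenient to build these command lines from Python. The sketch below only mirrors the usage line above; the helper names `build_eval_command` and `parse` are hypothetical and not part of `mtgpy`, and the script is assumed to be run from the repository root.

```python
import subprocess
import sys


def build_eval_command(model_dir, input_file, output_file, batch_size=None, gpu=None):
    """Build the argument list for `src/mtg.py eval` (hypothetical helper)."""
    # sys.executable is used so that the active conda environment's Python runs the parser.
    cmd = [sys.executable, "src/mtg.py", "eval", model_dir, input_file, output_file]
    if batch_size is not None:
        cmd += ["--eval-batchsize", str(batch_size)]
    if gpu is not None:
        cmd += ["--gpu", str(gpu)]
    return cmd


def parse(model_dir, input_file, output_file, **options):
    """Run the parser and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_eval_command(model_dir, input_file, output_file, **options), check=True)


# Show the command that would be run, without running it.
print(" ".join(build_eval_command(
    "models/dptb_bert/", "sample_data/train_sample.tokens", "english_parsed.discbracket", gpu=0)))
```

Calling `parse("models/dptb_bert/", "sample_data/train_sample.tokens", "english_parsed.discbracket", gpu=0)` would then actually run the parser on the first GPU device.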
## Train
To train your own models, see the command lines in `models/<any model>/commandline`.
Currently, the only supported format for the input treebanks is discbracket.
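For readers unfamiliar with discbracket: it is the bracketed format defined by `disco-dop`, in which each leaf is written as `<index>=<token>`, so that constituents can cover non-adjacent positions. The following minimal sketch recovers the tokens of a tree in sentence order. The example tree and the function name `discbracket_tokens` are made up for illustration, and the regular expression does not handle escaped parentheses inside tokens.

```python
import re

# A discbracket leaf has the form "<index>=<token>", e.g. "0=is".
LEAF = re.compile(r"(\d+)=([^\s()]+)")


def discbracket_tokens(tree_line):
    """Return the tokens of one discbracket tree, sorted by sentence position."""
    leaves = [(int(index), token) for index, token in LEAF.findall(tree_line)]
    return [token for _, token in sorted(leaves)]


# Illustrative tree: the VP covers positions 0 and 2, so it is discontinuous.
example = "(S (VP (VB 0=is) (JJ 2=rich)) (NP 1=John) (? 3=?))"
print(" ".join(discbracket_tokens(example)))  # is John rich ?
```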