Tip revision: a7cf5a5127a961b208c7882ebaf9a32d6bdcb674 authored by Marco Antonio Valenzuela Escárcega on 18 October 2016, 00:45:29 UTC
change test aggregation
[![Build Status](https://travis-ci.org/clulab/reach.svg?branch=master)](https://travis-ci.org/clulab/reach)
[![Gitter](https://badges.gitter.im/clulab/reach.svg)](https://gitter.im/clulab/reach?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge)
[![Maven Central](https://maven-badges.herokuapp.com/maven-central/org.clulab/reach_2.11/badge.svg)](https://maven-badges.herokuapp.com/maven-central/org.clulab/reach_2.11)

Reach
=====

# What is it?

Reach stands for **Re**ading and **A**ssembling **C**ontextual and **H**olistic Mechanisms from Text. In plain English, Reach is an information extraction system for the biomedical domain, which aims to read scientific literature and extract cancer signaling pathways. Reach implements a fairly complete extraction pipeline: recognition of biochemical entities (proteins, chemicals, etc.); grounding of those entities to known knowledge bases such as UniProt; extraction of BioPAX-like interactions (e.g., phosphorylation, complex assembly, and positive/negative regulation); and coreference resolution for both entities and interactions.

Reach is developed using [Odin](https://github.com/clulab/processors/wiki/ODIN-(Open-Domain-INformer)), our open-domain information extraction framework, which is released within our [`processors`](https://github.com/clulab/processors) repository.

Please scroll down to the bottom of this page for additional resources, including a Reach output visualizer, REST API, and datasets created with Reach.

# Licensing
All our own code is licensed under Apache License Version 2.0. **However, some of the libraries used here, most notably CoreNLP, are GPL v2.** If `BioNLPProcessor` is not removed from this package, technically our whole code becomes GPL v2 since `BioNLPProcessor` builds on Stanford's `CoreNLP` functionality. Soon, we will split the code into multiple components, so licensing becomes less ambiguous.

# Changes
+ **1.3.3** - Sub-project split into main, assembly, export.
+ **1.3.3** - Uses bioresources 1.1.15 and processors 5.9.6.  Introduces [`json` serialization/deserialization of `CorefMention` (including grounding, modifications, etc.)](https://gist.github.com/myedibleenso/8383af789b37ba598ff64ddd12c8b35b).
+ [more...](CHANGES.md)

# Authors  

Reach was created by the following members of the [CLU lab at the University of Arizona](http://clulab.cs.arizona.edu/):

+ [Marco Valenzuela](https://github.com/marcovzla)  
+ [Gus Hahn-Powell](https://github.com/myedibleenso)  
+ [Dane Bell](https://github.com/danebell)  
+ [Tom Hicks](https://github.com/hickst)  
+ [Enrique Noriega](https://github.com/enoriega)  
+ [Mihai Surdeanu](https://github.com/MihaiSurdeanu)  

# Citations

If you use Reach, please cite this paper:

```
@inproceedings{Valenzuela+:2015aa,
  author    = {Valenzuela-Esc\'{a}rcega, Marco A. and Gustave Hahn-Powell and Thomas Hicks and Mihai Surdeanu},
  title     = {A Domain-independent Rule-based Framework for Event Extraction},
  organization = {ACL-IJCNLP 2015},
  booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing: Software Demonstrations (ACL-IJCNLP)},
  url = {http://www.aclweb.org/anthology/P/P15/P15-4022.pdf},
  year      = {2015},
  pages = {127--132},
  Note = {Paper available at \url{http://www.aclweb.org/anthology/P/P15/P15-4022.pdf}},
}
```

More publications from the Reach project are available [here](https://github.com/clulab/reach/wiki/Publications).

# Installation

This software requires Java 1.8, Scala 2.11, and CoreNLP 3.x or higher.

The `jar` is available on Maven Central. To use, simply add the following dependency to your `pom.xml`:

```xml
<dependency>
   <groupId>org.clulab</groupId>
   <artifactId>reach_2.11</artifactId>
   <version>1.3.2</version>
</dependency>
```

The equivalent SBT dependency is:

```scala
libraryDependencies ++= Seq(
    "org.clulab" %% "reach" % "1.3.2"
)
```

# How to compile the source code

This is a standard sbt project, so use the usual commands (e.g., `sbt compile`, `sbt assembly`) to compile.
Add the generated jar files under `target/` to your `$CLASSPATH`, along with the other necessary dependency jars. Take a look at `build.sbt` to see which dependencies are necessary at runtime.
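
As a rough illustration of the classpath step, the sketch below gathers every jar found under a directory into a colon-separated string. The helper name `build_classpath` is hypothetical and not part of Reach, and the sketch covers only the jars under `target/`; the remaining dependency jars listed in `build.sbt` still have to be added separately.

```shell
# Sketch only: collect all jars under a directory into a colon-separated list.
# build_classpath is a hypothetical helper, not something shipped with Reach.
build_classpath() {
    # find the jars, sort for a stable order, then join the lines with ':'
    find "$1" -type f -name '*.jar' | sort | paste -sd: -
}

# Example usage (run from the project root after building):
#   export CLASSPATH="$(build_classpath target):$CLASSPATH"
```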

# Running things

## Processing a directory of `.nxml` papers

The most common use of Reach is to parse a directory containing one or more papers in `.nxml`, `.csv`, or `.tsv` format.
In order to run the system on such a directory of papers, you must create a `.conf` file.  See `src/main/resources/application.conf` for an example configuration file.  The directory containing the files to be processed should be specified using the `papersDir` variable.

```scala
sbt "runMain org.clulab.reach.ReachCLI /path/to/yourapplication.conf"
```

If the configuration file is omitted, Reach uses the default `.conf`. That is, the command:

```scala
sbt "runMain org.clulab.reach.ReachCLI"
```

will run the system using the `.conf` file under `src/main/resources/application.conf`.

## The interactive shell for rule debugging

```scala
sbt "runMain org.clulab.reach.ReachShell"
```

Enter `:help` to get a list of available commands.

## The sieve-based assembly system
Reach now provides a sieve-based system for assembly of event mentions.  While still under development, the system currently has support for (1) exact deduplication of both entity and event mentions, (2) unification of mentions through coreference resolution, (3) the reporting of intra- and inter-sentence causal precedence relations (e.g., A causally precedes B) using linguistic features, and (4) a feature-based classifier for causal precedence.  Future versions will include additional sieves for causal precedence and improved approximate deduplication.

For more details on the sieve-based assembly system, please refer to the following paper:

```
@inproceedings{GHP+:2016aa,
  author       = {Gus Hahn-Powell and Dane Bell and Marco A. Valenzuela-Esc\'{a}rcega and Mihai Surdeanu},
  title        = {This before That: Causal Precedence in the Biomedical Domain},
  booktitle    = {Proceedings of the 2016 Workshop on Biomedical Natural Language Processing},
  organization = {Association for Computational Linguistics},
  year         = {2016},
  Note         = {Paper available at \url{https://arxiv.org/abs/1606.08089}}
}
```

The sieve-based assembly system can be run over a directory of `.nxml` and/or `.csv` files:

```scala
sbt "runMain org.clulab.reach.ReachCLI"
```

In `src/main/resources/application.conf`, you will need to...

1. set `outputType` to `"assembly-tsv"`
2. set your input directory of papers via `papersDir`
3. set your output directory via `outDir`
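
If you would rather not edit the bundled file, one possible approach is to copy it and append the three settings to the copy. The sketch below assumes that `outputType`, `papersDir`, and `outDir` are top-level keys in `application.conf`, and relies on the HOCON rule that a later assignment to a key overrides an earlier one. The helper name `make_reach_conf` and all paths are hypothetical.

```shell
# Sketch only: derive an assembly config from the default one.
# Assumption: outputType, papersDir and outDir are top-level keys, so the
# values appended at the end of the copy override the defaults above them.
# make_reach_conf BASE_CONF NEW_CONF PAPERS_DIR OUT_DIR  (hypothetical helper)
make_reach_conf() {
    base_conf=$1; new_conf=$2; papers_dir=$3; out_dir=$4
    # start from an unmodified copy of the default configuration
    cp "$base_conf" "$new_conf" || return 1
    # append the overrides; values are quoted so they stay single strings
    cat >> "$new_conf" <<EOF

# --- local overrides for assembly output ---
outputType = "assembly-tsv"
papersDir = "$papers_dir"
outDir = "$out_dir"
EOF
}

# Example usage (all paths are placeholders):
#   make_reach_conf src/main/resources/application.conf my-assembly.conf /data/papers /data/output
#   sbt "runMain org.clulab.reach.ReachCLI my-assembly.conf"
```

Keeping the overrides in a separate file leaves the default `application.conf` untouched, which makes it easier to compare against upstream changes.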

Currently, two `.tsv` files are produced for assembly results **within** each paper:

1. results meeting [MITRE's (March 2016) requirements](https://github.com/clulab/reach/blob/3d4f82c87f1b4c7299ff2ceae8adc352212bd430/src/main/scala/org/clulab/assembly/AssemblyExporter.scala#L337-L352)
2. results without MITRE's constraints

Two additional output files are produced for assembly results **across** all papers:  

1. results meeting [MITRE's (March 2016) requirements](https://github.com/clulab/reach/blob/3d4f82c87f1b4c7299ff2ceae8adc352212bd430/src/main/scala/org/clulab/assembly/AssemblyExporter.scala#L337-L352)  
2. results without MITRE's constraints

### The interactive Assembly shell

You can interactively explore assembly output for various snippets of text using the assembly shell:

```scala
sbt "runMain org.clulab.assembly.AssemblyShell"
```

# Modifying the code
Reach builds upon our Odin event extraction framework. If you want to modify event and entity grammars, please refer to [Odin's Wiki](https://github.com/sistanlp/processors/wiki/ODIN-(Open-Domain-INformer)) page for details. Please read the included Odin manual for details on the rule language and the Odin API.

# Reach web services

We have developed a series of web services on top of the Reach library. All are freely available [here](http://agathon.sista.arizona.edu:8080/odinweb/).

# Reach datasets

We have generated multiple datasets by reading publications from the [open-access PubMed subset](http://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/) using Reach. All datasets are freely available [here](https://github.com/clulab/reach/wiki/Datasets).

# Funding

The development of Reach was funded by the [DARPA Big Mechanism program](http://www.darpa.mil/program/big-mechanism) under ARO contract W911NF-14-1-0395.