Revision 08ab6bb3e30dbc8838087f428b7d7554b6bd9bfb authored by Pierre Romera on 17 May 2022, 08:29:45 UTC, committed by Pierre Romera on 17 May 2022, 08:29:45 UTC
# Datashare


## Download

## Documentation

Datashare's user guide can be found here:

## Follow new updates and features

@ICIJorg publishes video tweets of new features with the hashtag #ICIJDatashare.

## Frontend

This repository is only the backend part of Datashare.

Please find the frontend here :

## Description

Datashare is a free, open-source desktop application developed by the non-profit International Consortium of Investigative Journalists (ICIJ).

Datashare allows investigative journalists to:
- access all their documents in one place, locally on their computer, while securing them from potential third-party interference
- search PDFs, images, texts, spreadsheets, slides and any other files simultaneously
- automatically detect and filter by people, organizations and locations

## Translation of the interface

You're welcome to suggest translations on Datashare's Crowdin. Please contact us if you would like to add a language.

## Installing and using

### Using with Elasticsearch

You can download the script at

To access the web GUI, go to your documents folder and launch `path/to/`, then connect to Datashare at http://localhost:8080

### Using only Named Entity Recognition

You can use the Datashare Docker container solely for its HTTP-exposed name-finding API.

Just run:

    docker run -ti -p 8080:8080 -v /path/to/dist/:/home/datashare/dist icij/datashare:0.10 -m NER

A bit of explanation:
- `-p 8080:8080` maps container port 8080 to host port 8080, so you can access Datashare at localhost:8080 (to access it at localhost:8081, change this to `-p 8081:8080`)
- `-m NER` runs Datashare in a stateless mode, without any index
- `-v /path/to/dist:/home/datashare/dist` maps the directory from which the NLP models are read (and into which they are downloaded if they don't exist)

Then query the server with curl:

    curl -i localhost:8080/ner/findNames/CORENLP --data-binary @path/to/a/file.txt

The last path segment (CORENLP) selects the NLP framework. You can choose among CORENLP, IXAPIPE, MITIE or OPENNLP.
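
The same request can also be sent from a script. The following is a minimal sketch in Python using only the standard library, not part of Datashare itself. The helper names `ner_url` and `find_names` are hypothetical, and the response is returned as raw text because its format is not described here. It assumes the container is running as shown above.

```python
import urllib.request

# Framework names accepted as the last path segment, as listed above.
FRAMEWORKS = ("CORENLP", "IXAPIPE", "MITIE", "OPENNLP")


def ner_url(framework, host="localhost", port=8080):
    """Build the findNames URL for one framework (hypothetical helper)."""
    name = framework.upper()
    if name not in FRAMEWORKS:
        raise ValueError(f"unknown framework {framework!r}; expected one of {FRAMEWORKS}")
    return f"http://{host}:{port}/ner/findNames/{name}"


def find_names(path, framework="CORENLP", host="localhost", port=8080):
    """POST the unmodified bytes of a file, as curl --data-binary does, and return the body as text."""
    with open(path, "rb") as f:
        payload = f.read()
    request = urllib.request.Request(
        ner_url(framework, host, port), data=payload, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, `print(find_names("path/to/a/file.txt", "OPENNLP"))` would send the file to the OPENNLP framework. Validating the framework name before the request turns a typo into a local error rather than a failed HTTP call.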

### **Extract Text from Files** 
  - TikaDocument from ICIJ/extract 
    Apache Tika v1.18 (Apache Licence v2.0)
    with Tesseract v4.0 alpha 

  Tika File Formats

### **Extract Persons, Organizations or Locations from Text** 
  - `org.icij.datashare.text.nlp.corenlp.CorenlpPipeline` 
    Stanford CoreNLP v3.8.0, 
    (Conditional Random Fields), 
    *Composite GPL v3+* 

  - `org.icij.datashare.text.nlp.ixapipe.IxapipePipeline` 
    Ixa Pipes Nerc v1.6.1, 
    *Apache Licence v2.0*

  - `org.icij.datashare.text.nlp.mitie.MitiePipeline` 
    MIT Information Extraction v0.8, 
    (Structural Support Vector Machines), 
    *Boost Software License v1.0*

  - `org.icij.datashare.text.nlp.opennlp.OpennlpPipeline` 
    Apache OpenNLP v1.6.0, 
    (Maximum Entropy), 
    *Apache Licence v2.0*

*Natural Language Processing Stages Support*

| `NlpStage`       |
|------------------|
| `TOKEN`          |
| `SENTENCE`       |
| `POS`            |
| `NER`            |

*Named Entity Recognition Language Support*

| *`NlpStage.NER`*           | `ENGLISH`  | `SPANISH`  | `GERMAN`  | `FRENCH`  | `CHINESE` |
|----------------------------|------------|------------|-----------|-----------|-----------|
| `NlpPipeline.Type.CORENLP` |     X      |      X     |      X    |  (w/ EN)  |     X     |
| `NlpPipeline.Type.OPENNLP` |     X      |      X     |      -    |     X     |     -     |
| `NlpPipeline.Type.IXAPIPE` |     X      |      X     |      X    |     -     |     -     |
| `NlpPipeline.Type.MITIE`   |     X      |      X     |      X    |     -     |     -     |
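
When picking a framework for a given document language, the table above can be encoded as a small lookup. The following is an illustrative Python sketch, not part of Datashare; the names `NER_SUPPORT`, `NOTES` and `pipelines_for` are hypothetical. The CORENLP French cell is marked "(w/ EN)" rather than with a plain X, so it is counted as supported here and its note is recorded separately, which is an assumption about how to read that cell.

```python
# NER language support per pipeline, transcribed from the table above.
NER_SUPPORT = {
    "CORENLP": {"ENGLISH", "SPANISH", "GERMAN", "FRENCH", "CHINESE"},
    "OPENNLP": {"ENGLISH", "SPANISH", "FRENCH"},
    "IXAPIPE": {"ENGLISH", "SPANISH", "GERMAN"},
    "MITIE": {"ENGLISH", "SPANISH", "GERMAN"},
}

# Cells that carry a note instead of a plain X (assumption: still counted as supported).
NOTES = {("CORENLP", "FRENCH"): "w/ EN"}


def pipelines_for(language):
    """Return the pipelines whose row has an entry for the language, sorted by name."""
    lang = language.upper()
    return sorted(name for name, langs in NER_SUPPORT.items() if lang in langs)
```

For example, `pipelines_for("CHINESE")` returns `["CORENLP"]`, and a language absent from the table returns an empty list.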

*Named Entity Categories Support*

| `NamedEntity.Category` |
|----------------------  |
| `ORGANIZATION`         |
| `PERSON`               |
| `LOCATION`             |

*Parts-of-Speech Language Support*

| *`NlpStage.POS`*           | `ENGLISH`  | `SPANISH`  | `GERMAN`  | `FRENCH`  |
|----------------------------|------------|------------|-----------|-----------|
| `NlpPipeline.Type.CORENLP` |     X      |      X     |     X     |     X     |
| `NlpPipeline.Type.OPENNLP` |     X      |      X     |     X     |     X     |
| `NlpPipeline.Type.IXAPIPE` |     X      |      X     |     X     |     X     |
| `NlpPipeline.Type.MITIE`   |     -      |      -     |     -     |     -     |

### **Store and Search Documents and Named Entities**

 - `org.icij.datashare.text.indexing.elasticsearch.ElasticsearchIndexer`
   Elasticsearch v7.9.1, *Apache Licence v2.0*

## Compilation / Build

Building requires JDK 11, Maven 3 and a running PostgreSQL database (hostname `postgres`)
with two databases, `datashare` and `test`, with write access for user `test` / password `test`. You'll also need a running
Elasticsearch instance with `elasticsearch` as hostname, and a Redis server with `redis` as hostname.

    mvn validate
    mvn -pl commons-test -am install
    mvn -pl datashare-db liquibase:update
    mvn test
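
Because the build expects the hostnames `postgres`, `elasticsearch` and `redis` to be reachable, it can help to check that they resolve before running Maven. The following is a minimal Python sketch under that assumption; `unresolved_hosts` is a hypothetical helper, not part of the Datashare build. It only tests name resolution, so it does not prove that the services themselves are running.

```python
import socket

# Hostnames the build expects, as stated in the prerequisites above.
REQUIRED_HOSTS = ("postgres", "elasticsearch", "redis")


def unresolved_hosts(hosts=REQUIRED_HOSTS, resolve=socket.gethostbyname):
    """Return the hostnames that fail name resolution, in the order given."""
    missing = []
    for host in hosts:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(host)
    return missing
```

An empty list means all three names resolve. The `resolve` parameter exists only so the check can be exercised without real DNS.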

## Keeping the development environment up to date

It is important to keep `datashare` and `datashare-client` up to date by pulling from each repository's master branch. 

To ensure that updates are registered, `make clean dist` must be run locally from each repository. 

If dependencies have been updated on `datashare-client`, run `yarn` **before** `make clean dist`.

If the database models have changed within `datashare`, run the following commands **before** `make clean dist`:

    sh datashare-db/scr/
    mvn -pl commons-test -am install
    mvn -pl datashare-db liquibase:update
    mvn test

## License

Datashare is released under the GNU Affero General Public License.

## Bug report, comment or (pull) request

We welcome feedback as well as contributions!

For any bug, question, comment or (pull) request, please contact us at