Raw File
README.md
# pofatu-data

This repository is used for curating the Pofatu dataset, which is browsable online at
https://pofatu.clld.org and [published on Zenodo](https://doi.org/10.5281/zenodo.3634435).

[Released versions](https://github.com/pofatu/pofatu-data/releases) of this repository
provide analysis-friendly formats of the data in the [`dist`](dist/) directory, in particular:

- a set of CSV files, described by [metadata](dist/metadata.json), following the 
  [CSV on the Web - CSVW](https://www.w3.org/TR/tabular-data-primer/) standard.
- an [SQLite](https://sqlite.org/index.html) database file.

See also

> Hermann, A., Forkel, R., McAlister, A. et al. Pofatu, a curated and open-access database for geochemical sourcing of archaeological materials. Sci Data 7, 141 (2020). DOI: [10.1038/s41597-020-0485-8](https://doi.org/10.1038/s41597-020-0485-8)

In the following, we assume that:
- you have the Pofatu data available locally, either
through downloading a [released versions](https://github.com/pofatu/pofatu-data/releases) or via cloning the repository using `git`,
- you have navigated in the shell to the root of the repository, i.e. the directory
  where this README.md is located.


## Querying the CSVW data

While somewhat tedious, the CSVW data can be queried almost "manually", i.e. with very little
tooling. In the following example we will use command line tools available in the [csvkit](https://csvkit.readthedocs.io/en/1.0.3/cli.html)
package, but scientific computing environments like R (with R-Studio) or Python (with Pandas)
will provide similar (or better) functionality.

From `dist/metadata.json` we learn that `dist/samples.csv` contains a column `location_region`,
which specifies a rough region in which a sample was collected.

```shell
$ csvcut -c location_region dist/samples.csv | sort | uniq
AUSTRAL
...
VANUATU
```

If we are interested in samples from Vanuatu, we can then list `ID`s of these:

```shell
$ csvgrep -c location_region -m VANUATU dist/samples.csv | csvcut -c ID
ID
reepmeyer2008_ANU9001
reepmeyer2008_ANU9002
...
```

Once we have identified a sample we are interested in, we can list all measurements recorded in
Pofatu about this sample:

```shell
$ csvcut -c Sample_ID,parameter,value_string dist/measurements.csv | csvgrep -c Sample_ID -m"reepmeyer2008_ANU9001" | csvcut -c parameter,value_string
parameter,value_string
SiO2 [%],70.98
TiO2 [%],0.36
Al2O3 [%],14.02
FeO [%],3.02
CaO [%],0.99
MgO [%],0.19
MnO [%],0.17
K2O [%],5.67
Na2O [%],4.27
P [ppm],285.3549225066772
Sc [ppm],9.78581002071119
Ti [ppm],1756.8461601308043
V [ppm],1.2513331781460335
...
```


## Querying the SQLite data

Of course, querying interrelated data from multiple tables is a lot more convenient using a 
relational database. To retrieve **all** measurements for samples from Vanuatu at once, we only
need to run the SQL query
```sql
select
    s.id, m.parameter, m.value_string 
from 
    "samples.csv" as s, "measurements.csv" as m 
where 
    m.sample_id = s.id and s.location_region == 'VANUATU';
```

We can do so using the [SQLite command line program](https://www.sqlite.org/download.html):
```shell script
$ sqlite3 dist/pofatu.sqlite "select s.id, m.parameter, m.value_string from \"samples.csv\" as s, \"measurements.csv\" as m where m.sample_id = s.id and s.location_region == 'VANUATU'"
reepmeyer2008_ANU9001|SiO2 [%]|70.98
reepmeyer2008_ANU9001|TiO2 [%]|0.36
reepmeyer2008_ANU9001|Al2O3 [%]|14.02
reepmeyer2008_ANU9001|FeO [%]|3.02
reepmeyer2008_ANU9001|CaO [%]|0.99
reepmeyer2008_ANU9001|MgO [%]|0.19
reepmeyer2008_ANU9001|MnO [%]|0.17
reepmeyer2008_ANU9001|K2O [%]|5.67
reepmeyer2008_ANU9001|Na2O [%]|4.27
reepmeyer2008_ANU9001|P [ppm]|285.3549225066772
...
```

If you installed the Python package [`pypofatu`](https://pypi.org/project/pypofatu/), you can run
the query using the `pofatu query` subcommand:
```shell script
$ pofatu query "select s.id, m.parameter, m.value_string from \"samples.csv\" as s, \"measurements.csv\" as m where m.sample_id = s.id and s.location_region == 'VANUATU' limit 10"
INFO    SQLite database at dist/pofatu.sqlite
ID                         parameter      value_string
reepmeyer2008_ANU9001  SiO2 [%]              70.98
reepmeyer2008_ANU9001  TiO2 [%]               0.36
reepmeyer2008_ANU9001  Al2O3 [%]             14.02
reepmeyer2008_ANU9001  FeO [%]                3.02
reepmeyer2008_ANU9001  CaO [%]                0.99
reepmeyer2008_ANU9001  MgO [%]                0.19
reepmeyer2008_ANU9001  MnO [%]                0.17
reepmeyer2008_ANU9001  K2O [%]                5.67
reepmeyer2008_ANU9001  Na2O [%]               4.27
reepmeyer2008_ANU9001  P [ppm]              285.35
```

You can also explore the data using the [datasette](https://datasette.readthedocs.io/en/stable/installation.html#install-using-pip)
tool, which provides a user interface to browse the databases in your browser.

For a more real-life example of using Pofatu data, see [the cookbook](doc/cookbook.md)

Exporting the results of a query to CSV is simple. See the relevant 
[section of the SQLite documentation](https://sqlite.org/cli.html#csv_export)
for details.
back to top