https://github.com/TIBHannover/2018-07-09-FAIR-Data-and-Software
Tip revision: fb7e21bcd1e1518210ae3f1703575eeb05a22221 authored by Katrin Leinweber on 12 March 2020, 12:03:19 UTC
Include citation hint also on workshop website
---
source: Rmd
title: "Reusing datasets: from the abstract to the technical details"
output: markdown_document
teaching: 20
exercises: 10
questions:
- "Which properties help us reuse datasets?"
objectives:
- "Finding both data and software for reuse."
- "Visualising data from two different sets."
keypoints:
- "The FAIR-er a dataset, the easier its reuse in answering a new research question."
bibliography: references.bib
---

```{r setup, include=FALSE}
library(knitr)
opts_chunk$set(echo = TRUE)
opts_chunk$set(fig.path = "../fig/")
```

We want to find out whether the Arctic and Antarctic ice core records show
different temperature curves in the past. To do that, we can analyse ice core
data from two different projects:

- [NGRIP (Arctic)](https://doi.pangaea.de/10.1594/PANGAEA.586886)
- [DomeC (Antarctic)](https://doi.pangaea.de/10.1594/PANGAEA.57629)

Please read both datasets' abstracts now [@northgreenlandicecoreprojectmembers2007ymoo; @lorius1979icrf].

---

- Noticed the DOIs & metadata? Both datasets are **F**indable.
- Noticed the `Download` links over `https`? Easily **a**ccessible.
- The list of parameters in both cases shows that `Age` and `d18O H2O` were
  measured. Looks **i**nteroperable.
- [CC-BY-licensed](https://creativecommons.org/licenses/by/3.0/) means we are
  allowed to **r**euse the material, if we "give appropriate credit, provide a 
  link to the license, and indicate if changes were made." :-) 

However, will the data also be FAIR on the technical level, where we actually work?
Will it be _machine-reusable_? We're not going to use Excel for 8k datapoints, right?!

Let's plan backwards from the desired outcome: comparing the temperature proxy
measurements in a diagram. In order to get there, we need to:

1. combine and/or align the x- and y-axes of both datasets,
1. find out whether we need to convert the values and/or units,
1. extract the values from the dataset,
1. know the datasets' structures,
1. download the datasets, and
1. do all that in a reproducible manner ;-)

The last point makes it clear that we will work in a script file (`.R` or `.Rmd`).

> ## Challenge: How do we best download the datasets?
>
> We could for example:
>
> 1. `Download dataset as tab-delimited text` manually, save the files, then read them in.
> 1. Write our own little download function, e.g. with a vector of dataset IDs as input (`c(586886, 57629)`).
>
> > ## Which one do you prefer?
> >
> > There is a third option ;-) Looking for an R package or a Python module
> > related to the data repository. Search
> > [CRAN.R-project.org/web/packages/available_packages_by_name.html
> > ](https://cran.r-project.org/web/packages/available_packages_by_name.html)
> >  or [ROpenSci.org/packages](https://ropensci.org/packages/) for
> > "PANGAEA" [@pangaear].
> {: .solution}
{: .challenge}
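
For comparison, option 2 could look roughly like the following. This is only a sketch: `pangaea_url()` and `download_pangaea()` are hypothetical helpers, and the `?format=textfile` URL pattern is an assumption that should be checked against the `Download` link on the dataset pages.

```{r download_sketch}
# Build one download URL per PANGAEA dataset ID.
# ASSUMPTION: the "?format=textfile" suffix yields the tab-delimited text.
pangaea_url <- function(ids) {
  sprintf("https://doi.pangaea.de/10.1594/PANGAEA.%d?format=textfile",
          as.integer(ids))
}

# Download each dataset into `dir` and return the local file paths.
# Hypothetical helper, not part of any package.
download_pangaea <- function(ids, dir = tempdir()) {
  urls  <- pangaea_url(ids)
  paths <- file.path(dir, sprintf("PANGAEA_%d.tab", as.integer(ids)))
  for (i in seq_along(urls)) {
    download.file(urls[i], destfile = paths[i], quiet = TRUE)
  }
  paths
}

pangaea_url(c(586886, 57629))
# files <- download_pangaea(c(586886, 57629))  # needs a network connection
```

Even with such a function, we would still have to separate the metadata header from the measurements ourselves, which is one more reason to look for an existing package first.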


> ## Challenge: Before installing your search result, check whether it seems useful.
>
> How would you go about this in case of an R package or Python module?
>
> > ## Solution:
> >
> > [ROpenSci.github.io/pangaear/reference](https://ropensci.github.io/pangaear/reference/index.html)
> > gives an overview of its functions. `pg_data()` sounds like what we need.
> {: .solution}
{: .challenge}
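
Once a trial installation exists (for example in a throwaway R library), a similar overview is available from within R. The following is a minimal sketch using only base R functions; it falls back to a message if `pangaear` is missing, and the `"^pg_"` prefix filter is just a guess based on the function name mentioned above.

```{r check_pkg}
pkg <- "pangaear"

if (requireNamespace(pkg, quietly = TRUE)) {
  # one-line summary from the package's DESCRIPTION file
  print(utils::packageDescription(pkg)$Title)
  # all functions and objects the package exports
  exported <- sort(getNamespaceExports(pkg))
} else {
  message(pkg, " is not installed; read its online reference instead.")
  exported <- character(0)
}

# candidates whose names start with "pg_", such as pg_data()
grep("^pg_", exported, value = TRUE)
```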


```{r download, warning=FALSE, message=FALSE}
# install.packages("pangaear")
library(pangaear)
NGRIP <- pg_data(doi = "10.1594/PANGAEA.586886") # Arctic
DomeC <- pg_data(doi = "10.1594/PANGAEA.57629")  # Antarctic
```

Before analysing any data, we should get an overview of the R objects that the downloads just created:

```{r str1}
str(NGRIP)
```

```{r str2}
str(DomeC)
```

Both lists contain some metadata and the actual `data` as a `tbl_df`. What is that? [Hint: It's from the tidyverse](https://tibble.tidyverse.org/).
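
To get a feeling for this kind of nesting without downloading anything, we can build a small stand-in by hand. This is only a sketch: all values are invented, the element names other than `data` are assumptions, and a plain `data.frame` replaces the tibble so that no extra package is needed.

```{r mock_structure}
# Hand-built stand-in for a downloaded object; every value is invented.
mock_download <- list(
  list(
    doi      = "10.1594/PANGAEA.000000",  # placeholder, not a real dataset
    metadata = list(license = "CC-BY-3.0"),
    data     = data.frame(`Age [ka BP]` = c(0.05, 0.10),
                          `d18O H2O [per mil SMOW]` = c(-40, -41),
                          check.names = FALSE) # keep spaces and brackets
  )
)

str(mock_download, max.level = 2)

# The measurements sit one level down, inside the first list element.
mock_data <- mock_download[[1]]$data
class(mock_data)
```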

In order to answer our research question (see above) we need to be able to combine
both the `Age` and the `d18O`.

To verify that both variables are really labelled in exactly the same way, we extract both tibbles and compare their variable names.

```{r vars}
NGRIP <- NGRIP[[1]]$data
DomeC <- DomeC[[1]]$data
intersect(names(NGRIP), names(DomeC))

# less elegant, but also possible: check each dataset separately
# vars <- c("Age [ka BP]", "d18O H2O [per mil SMOW]")
# vars %in% names(NGRIP) & vars %in% names(DomeC)
#> [1] TRUE TRUE
```

We get exactly two variable names. This is great, because if there had been even the slightest difference in a name or unit, or a spelling mistake, we would have seen fewer names in the output: the `names` wouldn't have `intersect`-ed.
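
How sensitive `intersect()` is can be tried out on plain character vectors. In this small sketch, `"Depth [m]"` and the deviating `"Age [kyr BP]"` label are hypothetical and only serve to illustrate a mismatch:

```{r intersect_demo}
names_a <- c("Depth [m]", "Age [ka BP]", "d18O H2O [per mil SMOW]")
names_b <- c("Age [ka BP]", "d18O H2O [per mil SMOW]")
names_c <- c("Age [kyr BP]", "d18O H2O [per mil SMOW]") # unit spelled differently

intersect(names_a, names_b)
#> [1] "Age [ka BP]"             "d18O H2O [per mil SMOW]"

intersect(names_a, names_c) # the Age columns no longer match
#> [1] "d18O H2O [per mil SMOW]"

setdiff(names_c, names_a)   # shows which name needs harmonising
#> [1] "Age [kyr BP]"
```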

We can now plot both datasets' `d18O H2O` values against the same `Age` axis. Because the variable names contain spaces and brackets, we have to use the "back tick" character (`` ` ``) around both. `snake_case` column names would have saved us from this, but hey!
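
The difference between the two naming styles shows up even outside of plotting. Here is a minimal sketch with invented toy values; the `snake_case` names are hypothetical and do not come from the datasets:

```{r backtick_demo}
# check.names = FALSE keeps the spaces and brackets in the column names
toy <- data.frame(`Age [ka BP]` = c(0.05, 0.10),
                  `d18O H2O [per mil SMOW]` = c(-40, -41),
                  check.names = FALSE)

toy$`Age [ka BP]`                 # back ticks needed after `$`
toy[["d18O H2O [per mil SMOW]"]]  # a quoted string works inside `[[ ]]`

# With snake_case names, no quoting is needed at all.
toy_snake <- setNames(toy, c("age_ka_bp", "d18o_h2o_per_mil_smow"))
toy_snake$age_ka_bp
```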

```{r plot}
library(ggplot2)
ggplot(data = NGRIP,
       mapping = aes(x = `Age [ka BP]`, y = `d18O H2O [per mil SMOW]`)) +
  geom_point(color = "dark green", alpha = 0.1) + # inherits data & mapping from above
  geom_point(data = DomeC, alpha = 0.1) +         # overrides the data, but inherits x & y
  scale_x_reverse() # because Age means the past
```

Incidentally, the `Dome C` core (black, East Antarctica) and the `NGRIP` core
(*Green*land) captured clearly different `d18O` levels. Read together with the
relationship between `d18O` and temperature [@EPSTEIN1953213], it seems that
the Southern Hemisphere was _cooler_ than the North.
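
Because the variable names match, the two tables could also be stacked into one, with an extra column recording the core. The following sketch uses invented toy values instead of the downloaded tibbles; the `core` column and the `toy_*` objects are hypothetical. Mapping `color = core` then gives us a legend for free.

```{r combine_sketch}
# Invented toy values standing in for the two downloaded tibbles.
toy_ngrip <- data.frame(`Age [ka BP]` = c(0.05, 0.10, 0.15),
                        `d18O H2O [per mil SMOW]` = c(-40, -41, -42),
                        check.names = FALSE)
toy_domec <- data.frame(`Age [ka BP]` = c(0.05, 0.10, 0.15),
                        `d18O H2O [per mil SMOW]` = c(-43, -44, -45),
                        check.names = FALSE)

# Keep only the shared variables, then stack the rows and label their origin.
shared <- intersect(names(toy_ngrip), names(toy_domec))
both <- rbind(
  data.frame(core = "NGRIP",  toy_ngrip[shared], check.names = FALSE),
  data.frame(core = "Dome C", toy_domec[shared], check.names = FALSE)
)
table(both$core)

# One layer is now enough; the colour is taken from the `core` column.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(
    ggplot(both, aes(x = `Age [ka BP]`, y = `d18O H2O [per mil SMOW]`,
                     color = core)) +
      geom_point() +
      scale_x_reverse()
  )
}
```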


## Conclusion

Integrating the two datasets with so little code was possible because both variables are named exactly the same and are thus easily machine-readable.

Granted, **i**nteroperability encompasses a [few qualities](https://blogs.tib.eu/wp/tib/2017/09/12/the-fair-data-principles-for-research-data/#i) besides uniform variable names. However, the datasets were well **f**indable & **a**ccessible, and community-standard variable naming and the [CC BY 3.0 license](https://creativecommons.org/licenses/by/3.0/) meant good **r**eusability.

**This is the power of FAIR Data combined with usable software: Reducing the burden of finding, downloading, cleaning up datasets, and actually using them.**


## Supplement: Harmonising different variable names

If the variable names had differed between the two datasets, we could

a) either rename one set between downloading and plotting, or

```{r str_diff, eval=FALSE}
# rename the deviating NGRIP columns to the names used in DomeC
names(NGRIP)[names(NGRIP) == "some other Age variable's name"] <- "Age [ka BP]"
names(NGRIP)[names(NGRIP) == "some other d18O variable's name"] <- "d18O H2O [per mil SMOW]"
```

b) specify in the second `geom_point` which `x` and `y` should be plotted.

```{r plot2, eval=FALSE}
ggplot(data = DomeC,
       mapping = aes(x = `Age [ka BP]`, y = `d18O H2O [per mil SMOW]`)) +
  geom_point() +
  geom_point(data = NGRIP,
             mapping = aes(x = `some other Age variable's name`,
                           y = `some other d18O variable's name`),
             color = "dark green")
```


## References