Skip to main content
  • Home
  • Development
  • Documentation
  • Donate
  • Operational login
  • Browse the archive

swh logo
SoftwareHeritage
Software
Heritage
Archive
Features
  • Search

  • Downloads

  • Save code now

  • Add forge now

  • Help

https://doi.org/10.5281/zenodo.16605839
30 July 2025, 11:03:17 UTC
  • Code
  • Branches (0)
  • Releases (1)
  • Visits
    • Branches
    • Releases
      • 1
      • 1
    • 0f5f153
    • /
    • gfkpth-nominalperson_cldf-1601a59
    • /
    • README.md
    Raw File Download

    To reference or cite the objects present in the Software Heritage archive, permalinks based on SoftWare Hash IDentifiers (SWHIDs) must be used.
    Select below a type of object currently browsed in order to display its associated SWHID and permalink.

    • content
    • directory
    • snapshot
    • release
    origin badgecontent badge
    swh:1:cnt:e562774fc0b6f0ce2a68b0a665faa99e2ada85ab
    origin badgedirectory badge
    swh:1:dir:a70f143ba7d223dbd670faaa0f4e97519defa8c5
    origin badgesnapshot badge
    swh:1:snp:d0cfea9b1ffa98ff771c88515bb3e457a41969b8
    origin badgerelease badge
    swh:1:rel:0d000184e0d7a882416678f7dfd31b4bdab46463

    This interface enables to generate software citations, provided that the root directory of browsed objects contains a citation.cff or codemeta.json file.
    Select below a type of object currently browsed in order to generate citations for them.

    • content
    • directory
    • snapshot
    • release
    (requires biblatex-software package)
    Generating citation ...
    (requires biblatex-software package)
    Generating citation ...
    (requires biblatex-software package)
    Generating citation ...
    (requires biblatex-software package)
    Generating citation ...
    README.md
    # Overview
    
    This repository contains a database with syntactic data on >100 languages with a focus on properties relating to the phenomenon of (ad)nominal person. The code here creates a [CLDF](https://github.com/cldf/cldf/)-conformant database, a format designed for linguistic applications, from older (and somewhat unorderly) csv-files. The format also allows for the automatic generation of an SQLite database.
    
    A previous implementation of the data as a relational (SQL) database can be found [here](https://github.com/gfkpth/nominal_person). That repo also contains the a jupyter notebook extracting example sentences for nominal person from a LaTeX-file, see  [here](https://github.com/gfkpth/nominal_person/tree/main/db-creation-notes/CLDF).
    
    ## Background: (Ad-)nominal person
    
    One common form of nominal person are personal pronouns forming co-constituents of a co-referring nominal expression as in English *we linguists*.
    More detailed information on relevant phenomena and the range of cross-linguistic variation can be found in [Höhn (2020)](https://doi.org/10.5334/gjgl.1121) and [Höhn (2024)](https://doi.org/10.1515/lingty-2023-0080) as well as [my dissertation](https://ling.auf.net/lingbuzz/003618). If you use this data, I'd appreciate it if you'd let me know. If you are a linguist interested in (ad)nominal person and struggle with using this database, feel free to get in touch.
    
    
    ## Dataset Properties
    
    | Property | Value |
    |----------|-------|
    | dc:conformsTo | http://cldf.clld.org/v1.0/terms.rdf#StructureDataset |
    | dc:identifier | https://github.com/gfkpth/nominalperson_cldf/ |
    | dc:license | CC-BY |
    | dc:source | sources.bib |
    | dc:title | Adnominal person database |
    | dcat:accessURL | https://github.com/gfkpth/nominalperson_cldf |
    | prov:wasDerivedFrom | [{'rdf:about': 'https://github.com/gfkpth/nominalperson_cldf', 'rdf:type': 'prov:Entity', 'dc:created': 'ef41cd2', 'dc:title': 'Repository'}, {'rdf:about': 'https://github.com/glottolog/glottolog', 'rdf:type': 'prov:Entity', 'dc:created': 'v5.1-17-g06771a4e45', 'dc:title': 'Glottolog'}] |
    | prov:wasGeneratedBy | [{'dc:title': 'python', 'dc:description': '3.13.5'}, {'dc:title': 'python-packages', 'dc:relation': 'requirements.txt'}] |
    | rdf:ID | nominalperson_cldf |
    | rdf:type | http://www.w3.org/ns/dcat#Distribution |
    
    ## Dataset Components
    
    | File | Type | Rows |
    |------|------|------|
    | values.csv | ValueTable | 1800 |
    | languages.csv | LanguageTable | 134 |
    | examples.csv | ExampleTable | 158 |
    | codes.csv | CodeTable | 62 |
    | parameters.csv | ParameterTable | 17 |
    | sources.bib | Sources | 178 |
    
    
    ## Files
    
    - [cldf/](cldf/): contains the CLDF-version of the database
    - [raw/](raw/): contains the raw files used to generate the CLDF database
      - when using the data in here to regenerate the database, make sure to run `cldfbench download nominalperson_cldf.py` first
    - [etc/](etc/): not currently used
    - [nominalperson.sqlite](nominalperson.sqlite): an SQLite-version of the database
    - [cldfbench_nominalperson_cldf.py](cldfbench_nominalperson_cldf.py): controls the generation of the CLDF database from the data in `raw/`
    
    
    ## SQLite database
    
    The file [nominalperson.sqlite](nominalperson.sqlite) contains an SQLite-version of the database. The database schema is visualised below:
    
    ![Diagram of nominalperson.sqlite](assets/nominalperson-sqlite.png)
    
    
    # Feature description
    
    For a description of most of the linguistically relevant properties used you can refer to
    
    - the [parameter-codes.json](raw/parameter-codes.json)
    - the list in [this previous project](https://github.com/gfkpth/nominal_person?tab=readme-ov-file#db-scheme).
    
    To be extended
    
    
    # Notes on the setup of a new CLDF
    
    For the original tutorial see here: <https://github.com/cldf/cldfbench/blob/master/doc/tutorial.md>. The information in the README files of [`cldfbench`](https://github.com/cldf/cldfbench) and [`pycldf`](https://github.com/cldf/pycldf) is also going to be helpful.
    
    In this section, I aim to document the concrete steps needed for my slightly more complex dataset.
    
    
    ## Preliminaries
    
    0. Ensure that the python program `cldfbench` is installed, for instructions see [here](https://github.com/cldf/cldfbench/blob/master/README.md). Also install [`pycldf`](https://github.com/cldf/pycldf).
    
    Using a virtual environment is recommended, I am using conda to create one inside the folder I want to use for the cldf.
    
    ```shell
    conda create -p .venv python==3.12
    conda activate .venv
    pip install cldfbench pycldf
    ```
    
    1. Create a new cldf structure. For consistency, make sure to provide as ID the name of the folder that you want to use.
    
    ```shell
    cd ..
    cldfbench new
    ```
    
    I am moving one level up in the file structure because I created my virtual environment inside the target folder. If you're using a system-installed cldfbench (or create your virtual environment in a different location), you can just run `cldfbench new` in the parent folder of where you want your cldf folder to sit.
    
    2. Insert any raw data that should be integrated into the dataset into the `raw/` Folder.
    
    3. Run catconfig to install [glottolog](https://github.com/glottolog/glottolog) into the local system and create a catalog.ini file pointing to it. To make the API for glottolog available you also need to install the package `pyglottolog`, which I suggest installing before if it is not available in your system/virtual environment.
    
    ```shell
    pip install pyglottolog
    cldfbench catconfig
    ```
    
    This prompts you to install several auxiliary datasets. For the current use case, I want glottolog, so I pick yes to that and decline the concepticon and clts.
    Note that on Linux this clones <https://github.com/glottolog/glottolog> into ~/.config/cldf/glottolog, with a weight of about 1.5GB.
    
    Alternatively, you could download/`git clone https://github.com/glottolog/glottolog` into some other location and create a ~/.config/cldf/catalog.ini file with the appropriate path as follows (I'm not sure if this has any downsides to directly following the catconfig flow in use cases I haven't encountered yet):
    
    ```shell
    # ~/.config/cldf/catalog.ini
    [clones]
    glottolog = /path/of/your/local/glottolog/repository
    ```
    
    
    ## Setting up the cldfbench_[projectname].py
    
    As described in the available tutorials, `cldfbench new` creates a general template on which to build. The crucial code for converting my csv-data into CLDF needs to go to `cldfbench-[projectname].py`. 
    
    The following code generates the CLDF-conformant dataset in the `cldf/` folder:
    
    ```shell
    cldfbench makecldf cldfbench_nominalperson_cldf.py
    ```
    
    The following code validates the generated dataset and displays errors or warnings encountered during validation:
    
    ```shell
    cldf validate cldf/
    ```
    
    ## Generating an SQLite database
    
    The following code generates an SQLite database from the CLDF
    
    ```shell
    cldf createdb cldf nominalperson.sqlite
    ```
    
    # Versions
    
    ## v1.0 (2025-07-30)
    
    - first usable version
    
    
    # To Do:
    
    ## Content
    
    - automatic association of examples to specific parameter values is rather coarse at the moment, consider manual mapping
    
    ## Documentation
    
    - adding more detail to README
    - write-up for usage, especially regarding parameter-codes.json
    - details for tutorial-like guide
      - What are possible values for `args.writer.objects['ValueTable']`?
      - How does the following part skipped over in the tutorial work when needed?
        - "Because we only create a single CLDF dataset here, we do not need to call with self.cldf_writer(...) as ds: explicitly. Instead, an initialized cldfbench.cldf.CLDFWriter instance is available as args.writer."
      - sources and examples need to be supplied to CLDF as list (not as string, even though the output CSV files list them as strings)
      - add note to fix reference to actual metadata file in .github/workflows/cldf-validation.yml (last line is hard-coded to `cldf/cldf-metadata.json`, adapt this to the name of your metadata file in order to get automatic testing to work correctly)
    
    
    # Acknowledgements
    
    I am very grateful to Robert Forkel for help with fixing issues and getting a better understanding of the way CLDF works. 
    
    Parts of the data reported here were collected while working in a project funded through the European Research Council Advanced Grant No. 269752 “Rethinking Comparative Syntax” (PI Ian Roberts).
    
    # License
    
    Shield: [![CC BY 4.0][cc-by-shield]][cc-by]
    
    This work is licensed under a
    [Creative Commons Attribution 4.0 International License][cc-by].
    
    [![CC BY 4.0][cc-by-image]][cc-by]
    
    [cc-by]: http://creativecommons.org/licenses/by/4.0/
    [cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
    [cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg
    

    back to top

    Software Heritage — Copyright (C) 2015–2026, The Software Heritage developers. License: GNU AGPLv3+.
    The source code of Software Heritage itself is available on our development forge.
    The source code files archived by Software Heritage are available under their own copyright and licenses.
    Terms of use: Archive access, API— Content policy— Contact— JavaScript license information— Web API