Archived by Software Heritage from https://github.com/hongtaoh/32vis (snapshot taken 14 October 2025, 02:07:18 UTC). File: `workflow/README.md`, revision 9960413711b0efb1f51ff7cce3548d259be8d8cb ("Update README.md"), authored by Hongtao Hao on 24 May 2025, 20:13:11 UTC.
# Workflow

The folder contains three main items:

  - `Snakefile`. We use [Snakemake](https://github.com/hongtaoh/snakemake-tutorial) as our workflow manager to generate and process data. This Snakefile is where rules and data directories are defined. 
  - `scripts`. This folder contains all the necessary scripts that we ran to get the data in the `data` folder. 
  - `notebooks`. This folder contains necessary validation notebooks. You'll see why they are needed below. 
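The division of labor above can be sketched as a minimal Snakemake rule. The rule name, paths, and script below are illustrative, not copied from the real Snakefile:

```
# Hypothetical sketch of the kind of rule the Snakefile defines:
# a rule names its output file and the script that produces it,
# and Snakemake runs the script whenever the output is missing or stale.
rule get_titles_2021:
    output:
        "data/processed/titles_2021.csv"
    script:
        "scripts/get_titles_2021.py"
```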

In the following, I'll explain, step by step, how we produced the data in the `data` folder, in the hope that anyone can reproduce our results. 

## Steps

### 1. Get raw data

The only raw data we had is [Vispubdata](https://docs.google.com/spreadsheets/u/1/d/1xgoOPu28dQSSGPIp_HHQs0uvvcyLNdkMF9XtRajhhxU/edit?usp=sharing). I downloaded it as a csv file and stored it as `data/raw/vispubdata.csv`. 

### 2. Get titles and dois of VIS 2021 papers

NOTE: Because the CrossRef API is unstable and yields different results from time to time, we decided to get the DOIs of VIS 2021 papers ourselves and put the data into the `raw` data folder. The initial script we used, i.e., `get_dois_2021.py`, can be found in the [`scripts/deprecated`](https://github.com/hongtaoh/32vis/tree/master/workflow/scripts/deprecated) folder. 

Vispubdata does not contain data for VIS 2021 papers. We had to obtain the data on our own. 

I obtained the titles of VIS 2021 papers using the script `get_titles_2021.py`. The output of this step is `processed/titles_2021.csv`, which contains 170 paper titles. 

Then, I used the script `get_dois_2021.py` to get the DOIs for the paper titles in `processed/titles_2021.csv` through the CrossRef API (via "habanero"). 23 papers did not have a valid IEEE DOI prefix ('10.1109'). Among the papers with an IEEE DOI prefix, there are 4 whose CrossRef title does not match the title shown on IEEE VIS 2021. Therefore, there are in total 23 + 4 = 27 papers whose query results were wrong. For these 27 papers, I manually collected their DOIs from IEEE Xplore. The output file is `processed/dois_2021.csv`. I then merged this file with VISPUBDATA to get `vispubdata_plus.csv`, which contains data for VIS papers from 1990 to 2021. 

Note that in this step, I have already validated the match results; how I did it is documented in the notebook `01-obtain-2021-paper-doi-from-crossref-and-ieee.ipynb`. Papers whose DOIs contain '10.1109' AND whose CrossRef titles match the titles I scraped are correct results, so I did not inspect them. I then filtered out papers whose titles do not match AND whose DOIs do not contain '10.1109'. Papers whose DOI does not contain '10.1109' are clearly wrong results. 
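The acceptance rule described above (trust a CrossRef hit only if its DOI carries the IEEE prefix and its normalized title matches the scraped one) can be sketched in Python. The function names are mine, not the actual script's:

```python
def normalize(title):
    """Lowercase and strip non-alphanumerics so cosmetic differences
    (punctuation, casing, extra spaces) don't break the comparison."""
    return "".join(ch for ch in title.lower() if ch.isalnum())

def needs_manual_check(crossref_doi, crossref_title, scraped_title):
    """A query result is trusted only when the DOI has the IEEE
    prefix AND the titles match after normalization; everything
    else goes to manual inspection."""
    has_ieee_prefix = crossref_doi.startswith("10.1109")
    titles_match = normalize(crossref_title) == normalize(scraped_title)
    return not (has_ieee_prefix and titles_match)
```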

### 3. Validating DOIs of VISPUBDATA

I excluded papers with the paper type "M" (posters, keynote files, panels) from my analysis. I found only one invalid DOI ('10.0000/00000001'). I checked the titles and found no duplicates. See `workflow/notebooks/03-inspection-of-dois-of-vispubdata-only-j-and-c.ipynb` for details. 

### 4. Validate DOIs for IEEE VIS 2021 papers

In the notebook `workflow/04-inspection-of-dois-of-2021-papers.ipynb`, I manually checked the DOIs of IEEE VIS 2021 papers. Nothing unusual was found. 

### 6. Getting "vispd plus Good Papers"

I then ran `get_vispd_plus_good_papers.py` to get `vispd_plus_good_papers.txt`. 

What `get_vispd_plus_good_papers.py` does is exclude papers with the paper type "M", as well as that single invalid DOI ('10.0000/00000001'), from VISPUBDATA_PLUS. 

### 7. VISPUBDATA-OpenAlex Match-1

For each paper in `vispd_plus_good_papers.txt`, I obtained the associated information from OpenAlex, for example, publication year/date, DOI, URL, title, venue ID and name, etc. I did this with the script `get_vispd_openalex_match_1.py`, employing a combination of title query and DOI query.

With the notebook `workflow/notebooks/05-Checking-no-matching-and-no-result-titles-of_vispd_openalex_match_1.ipynb`, I checked:
  - How many papers are in `title_query_empty_doi_query_404_1` (i.e., those that cannot be found on OpenAlex via a combination of title and DOI query), and how many of them can be identified via title modification (i.e., slightly modifying the title used in the query to get results from OpenAlex; this is needed because titles on OpenAlex are not necessarily exactly the same as those on VISPUBDATA, even for the exact same paper).
  - For papers with successful query results (whether through title query or DOI query), are they the same papers on OpenAlex and on VISPUBDATA? I checked this because OpenAlex might return multiple results for a title query, and in the script `get_vispd_openalex_match_1.py` I only used the first result. 

    Therefore, even for papers with successful query results (i.e., the title query has results, or the title query is unsuccessful but the DOI query is successful), I needed to manually check whether their results are the same papers as those on VISPUBDATA. 

The conclusion of my manual validation is that only one paper had no result from a successful title query, and this paper indeed does not exist in the OpenAlex database. 

There are 72 papers whose title queries were successful (and not empty) but whose title AND DOI do not match the information on VISPUBDATA. I then tried DOI queries for these 72 papers. The results showed that DOI queries worked: only two papers could not be identified through DOI query. Of these two, one could be identified by using a different index in the title query. 

In sum, only two papers were not identifiable in the OpenAlex database. 

### 8. VISPUBDATA-OpenAlex Match-2

Based on the result of VISPUBDATA-OpenAlex Match-1, I created the script of `get_vispd_openalex_match_2.py`. What this script does is: 

  1. Use DOI query for 71 papers. 
  2. Use a different index for title query for one paper.  

The output file is `vispd_openalex_match_2.csv`. 

I manually checked the results of `vispd_openalex_match_2.csv` in the notebook `workflow/notebooks/06-Checking-no-matching-and-no-result-titles-of_vispd_openalex_match_2.ipynb`. The no_result group has 1 paper; it simply does not exist on OpenAlex. I then checked how many papers' title AND DOI do not match VISPUBDATA, and found only one such paper. 

In sum, a total of 1 + 1 = 2 papers among all the 3,242 IEEE VIS papers do not exist on OpenAlex. 

### 9. GET PAPERS_TO_STUDY

`get_papers_to_study.py`: this step simply removes the two papers inaccessible on OpenAlex from `vispd_plus_good_papers`. This results in a total of 3,240 DOIs. 
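This filtering step amounts to an order-preserving set difference; a minimal sketch (function name hypothetical):

```python
def filter_papers(all_dois, missing_dois):
    """Drop the DOIs that could not be found on OpenAlex,
    keeping the original order of the remaining DOIs."""
    missing = set(missing_dois)
    return [doi for doi in all_dois if doi not in missing]
```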

### 10. GET OPENALEX DFS

`get_openalex_dfs.py`: this step is very similar to, and also based on, VISPUBDATA-OpenAlex Match-2. What I did in `get_openalex_dfs.py` is take all the DOIs in `vispd_plus_good_papers.txt` and query them (by title or DOI) on OpenAlex. There are four main outputs: paper_df, author_df, reference_df, and concept_df. 
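Assuming the script uses the public OpenAlex REST API, the two query styles can be sketched as URL builders. This is a sketch of the endpoints, not the script's actual code:

```python
from urllib.parse import quote

OPENALEX = "https://api.openalex.org/works"

def doi_query_url(doi):
    # OpenAlex resolves a work addressed directly by its DOI URL.
    return f"{OPENALEX}/https://doi.org/{doi}"

def title_query_url(title):
    # Full-text title search; a script would keep the first hit.
    return f"{OPENALEX}?filter=title.search:{quote(title)}"
```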

After the script was run and the dfs obtained, I used the notebook `workflow/notebooks/07-Checking-no-matching-and-no-result-titles-of_openalex_paper_df.ipynb` to check the results. It turns out the results are good: no_matching, no_result, and failed_doi are all empty. I then checked the rows where both the title and DOI do not match those of VISPUBDATA_PLUS; it turns out all of them are identical papers. This means that my script `get_openalex_dfs.py` has the output I desired. 

### 11. GET CITATION DFs

For every paper that cited a paper in `papers_to_study.txt`, I collected its metadata, author info, and concept info. 

In `openalex_paper_df.csv`, for every paper ("A"), there is a URL pointing to data about all the papers that have cited "A". I simply opened this URL and collected all the information (paper metadata, authors, concepts, etc.). 

### 12. GET REFERENCE DFs

I then ran `get_openalex_reference_dfs.py` to get all the information (paper metadata, authors, concepts, etc.) on all papers cited by the 3,274 VIS papers. For citing papers, OpenAlex provided a URL that lists all of them; for cited papers (i.e., reference papers), this is not the case, so I had to collect all cited papers' info one by one. Since there are many duplicates among cited papers, I first deduplicated them and then obtained their information. Then, I merged the cited papers' info with `data/processed/openalex_reference_df.csv`. When generating outputs, I produced both the full datasets (i.e., VIS-cited pairs) and "unique" datasets that only include cited papers' information. 
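The deduplicate-then-reattach strategy can be sketched as follows (the function names and the pair representation are mine):

```python
def dedup_references(pairs):
    """pairs: (vis_doi, cited_doi) rows. Return the unique cited DOIs
    to fetch, preserving first-seen order, so each reference is
    queried only once."""
    seen = set()
    unique = []
    for _, cited in pairs:
        if cited not in seen:
            seen.add(cited)
            unique.append(cited)
    return unique

def expand(pairs, info_by_doi):
    """Re-attach the fetched metadata to every VIS-cited pair,
    producing the 'full' dataset."""
    return [(vis, cited, info_by_doi.get(cited)) for vis, cited in pairs]
```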

### 13. Get IEEE author and paper title

`get_ieee_author_and_paper_title.py` simply scrapes author and paper title information for 3,233 VIS papers. 

### 14. Get merged author df 

`get_merged_author_df.py`:

In this script, I first compared the number of authors on IEEE and on OpenAlex. I checked PDFs and confirmed that IEEE was wrong in one case and missed data in another (the one that directed me to computer.org). I corrected the wrong one in the IEEE data and filled in the missing one. 

Later on, I merged the two author datasets. After merging, I compared the merged dataset with the DBLP data from Vispubdata and found that four papers' author data were incorrect in my merged data; I corrected them. The next step was to fill in author affiliation data: around 50 authorships lacked affiliation info, which I filled in manually. While doing that, I also manually updated the affiliation name, country of origin, affiliation type, and sometimes author names in both the IEEE and OpenAlex data. I had to manually collect 15 authors' affiliations. 

After that, I merged the two datasets and changed affiliation type to binary types. 
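A minimal sketch of collapsing affiliation types into two buckets; the actual category names and mapping used in `get_merged_author_df.py` may differ:

```python
def to_binary_type(affiliation_type):
    """Collapse fine-grained affiliation types into two buckets.
    The set of 'academic' labels here is illustrative, not the
    script's real mapping."""
    academic = {"education", "university", "college"}
    return "academia" if affiliation_type.lower() in academic else "industry"
```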

More details of this procedure can be found in the comments of `get_merged_author_df.py`.

### 15. Get awards info

`scrape_award_papers.py`: this script scrapes award info, specifically year, DOI, award, track, title, and author info. The data source is `http://ieeevis.org/year/2022/info/history/best-paper-award`.

### 16. Get Google Scholar data

`get_gscholar_data.py` obtains citation counts on Google Scholar in early March of 2022. 

### 17. Try to get Web of Science ID for 3,233 VIS papers

`get_wos_id.py`: this step simply reports statistics for the Methods section of the Supplementary Material. I wanted to see how many VIS papers were readily available on Web of Science through a simple DOI query. 

### 18. Get classification models

`CLASS_country.py` and `CLASS_type.py` train classification models with logistic regression for country codes and affiliation types, respectively. 
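The scripts presumably rely on a standard library implementation; purely as an illustration of the underlying idea, here is a minimal logistic-regression classifier trained with stochastic gradient descent (all names hypothetical):

```python
import math

def train_logreg(X, y, lr=0.5, epochs=2000):
    """Fit weights w and bias b by stochastic gradient descent
    on the logistic loss. X: list of feature lists; y: 0/1 labels."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            err = p - yi                      # gradient of logistic loss wrt z
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    """Return the 0/1 class by thresholding the sigmoid at 0.5."""
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return 1 if 1.0 / (1.0 + math.exp(-z)) >= 0.5 else 0
```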

### 19. Get cleaned author and cleaned paper data 

`get_HT_cleaned_author_df.py` and `get_HT_cleaned_paper_df.py` obtained the cleaned versions of the author and paper metadata used in our data analyses. 

### 20. Generate data for plots

  - `plot_data_author_chord_diagram_data.py` generates data for [Cross Country Collaboration Chord Diagram](https://observablehq.com/@hongtaoh/chord-speed-diagram)

  - `plot_top_concepts_trends.py` generates data for [Concepts popularity trends tiny charts](https://observablehq.com/@hongtaoh/concepts-popularity-trends-tiny-charts)

  - `plot_sankey_data.py` generates data for [Sankey diagram of citation flows](https://observablehq.com/@hongtaoh/sankey-diagram-of-citation-flows)

  - `plot_vis_concepts_cooccurance_data.py` generates data for [IEEE VIS paper concepts cooccurance chord diagram](https://observablehq.com/@hongtaoh/ieee-vis-paper-concepts-cooccurance-chord-diagram)
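As an illustration of the kind of input a co-occurrence chord diagram needs, here is a sketch that builds a symmetric concept co-occurrence matrix (names hypothetical; the real script's output format may differ):

```python
from itertools import combinations

def cooccurrence_matrix(papers, concepts):
    """papers: one list of concept names per paper.
    Returns a symmetric matrix m where m[i][j] counts the papers in
    which concepts[i] and concepts[j] appear together."""
    index = {c: i for i, c in enumerate(concepts)}
    n = len(concepts)
    m = [[0] * n for _ in range(n)]
    for paper in papers:
        # Unique, known concepts only, so a repeated tag is counted once.
        present = sorted({index[c] for c in paper if c in index})
        for i, j in combinations(present, 2):
            m[i][j] += 1
            m[j][i] += 1
    return m
```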


Software Heritage — Copyright (C) 2015–2025, The Software Heritage developers. License: GNU AGPLv3+.
The source code of Software Heritage itself is available on our development forge.
The source code files archived by Software Heritage are available under their own copyright and licenses.
Terms of use: Archive access, API— Content policy— Contact— JavaScript license information— Web API