https://github.com/hongtaoh/32vis
Tip revision: 7369347f01be67cf1389f2ebbf19bf9ffbc797e7 authored by Hongtao Hao on 28 April 2023, 02:39:24 UTC
Rename 2023-04-27-check-cited-papers-venues.ipynb to 2023-04-27-check-citing-papers-venues.ipynb
README.md
# Data

## 1. Raw

### `vispubdata.csv`

Taken from https://docs.google.com/spreadsheets/d/1xgoOPu28dQSSGPIp_HHQs0uvvcyLNdkMF9XtRajhhxU/edit

## 2. Interim

### Checking

The data files in this subfolder are used to verify that the papers I identified on OpenAlex correctly correspond to actual VIS papers.

#### `title_query_empty_doi_query_404_1.txt`, `title_query_404_1.txt`

These are the outputs of `get_vispd_openalex_match_1.py`. 

#### `title_query_404_2.txt`, `doi_query_404_2.txt`, `title_query_empty_doi_query_404_2.txt`

These are the outputs of `get_vispd_openalex_match_2.py`.

#### `title_query_empty_doi_query_404_dfs.txt`, `title_query_404_dfs.txt`, `doi_query_404_dfs.txt`

These are the outputs of `get_openalex_dfs.py`.

### methods_reporting

The data files in this folder contain the data I need to report statistics in the Methods section.

#### `crossref.csv`

I queried the DOIs of VIS papers on the Crossref API and obtained their reference counts and first-author affiliations.
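As an illustration only, the kind of lookup described above could be sketched as follows. This is not the script used in this repository: the function names are hypothetical, and the sketch assumes the public Crossref REST endpoint `https://api.crossref.org/works/{doi}`, whose response carries a `message` object with `reference-count` and `author` fields.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Public Crossref REST API endpoint for a single work (assumed here).
CROSSREF_WORKS = "https://api.crossref.org/works/"


def summarize_crossref_record(record):
    """Pull the reference count and first-author affiliations
    out of a Crossref work record (the 'message' object)."""
    authors = record.get("author", [])
    # Prefer the author flagged as "first"; fall back to the first listed one.
    fallback = authors[0] if authors else {}
    first = next((a for a in authors if a.get("sequence") == "first"), fallback)
    affiliations = [aff.get("name", "") for aff in first.get("affiliation", [])]
    return {
        "doi": record.get("DOI"),
        "reference_count": record.get("reference-count"),
        "first_author_affiliations": affiliations,
    }


def fetch_crossref_record(doi):
    """Hypothetical helper: download one work record from Crossref."""
    with urlopen(CROSSREF_WORKS + quote(doi)) as resp:
        return json.load(resp)["message"]
```

Keeping the parsing separate from the network call makes it possible to check the parsing logic on a saved response without hitting the API.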

#### `ieee_citation_metrics.csv`

I randomly selected 100 VIS papers and obtained their citation counts (on Crossref, Scopus, and Web of Science) as displayed on IEEE Xplore.

#### `wod_id.csv`

I queried VIS papers on Web of Science by DOI and created a table that maps each paper's DOI to its corresponding Web of Science ID. The goal is to see how many VIS papers can be identified via a DOI query on Web of Science.

### `vispd_openalex_match_1.csv`, `vispd_openalex_match_2.csv`, `openalex_author_df.csv`

These are outputs of `get_vispd_openalex_match_1.py`, `get_vispd_openalex_match_2.py`, and `get_openalex_dfs.py`.

### `ieee_author_df.csv`

This is author data scraped from IEEE Xplore. 

### `award_paper_df.csv`

This is data on award-winning VIS papers.  

## 3. Processed

### `titles_2021.csv`

Paper titles for IEEE VIS 2021.

### `dois_2021.csv`

DOIs for IEEE VIS 2021 papers.

### `vispubdata_plus.csv`

I appended the IEEE VIS 2021 data to the raw dataset `vispubdata.csv`, which contains data for papers published in 1990-2020.

### `vispd_plus_good_papers.txt`

This contains a list of paper DOIs. From `vispubdata_plus.csv`, I excluded papers whose paper type is "M" as well as the entry with the invalid DOI ('10.0000/00000001').
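A minimal sketch of this filter, using only the standard library. The column names `DOI` and `PaperType` are assumptions about the CSV header, and the sample rows are made-up placeholders, not real papers.

```python
import csv
import io

# The placeholder DOI that the README singles out as invalid.
INVALID_DOI = "10.0000/00000001"


def select_good_dois(rows, type_col="PaperType", doi_col="DOI"):
    """Keep DOIs of rows that are not type "M" and do not carry the invalid DOI."""
    return [
        row[doi_col]
        for row in rows
        if row[type_col] != "M" and row[doi_col] != INVALID_DOI
    ]


# Made-up sample standing in for vispubdata_plus.csv.
sample_csv = io.StringIO(
    "DOI,PaperType\n"
    "10.1109/a,J\n"
    "10.1109/b,M\n"
    "10.0000/00000001,C\n"
)
good_dois = select_good_dois(csv.DictReader(sample_csv))
print(good_dois)  # ['10.1109/a']
```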

### `papers_to_study.txt`

This is a list of DOIs. This list is **very important** because it contains the DOIs for papers we will include in the analysis of the present study. 

`papers_to_study.txt` is different from `vispd_plus_good_papers.txt` because it excludes the nine papers inaccessible in OpenAlex. 
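The relationship between the two lists can be sketched as an order-preserving set difference. The function name and the case-insensitive comparison are assumptions made for illustration; the DOIs below are placeholders.

```python
def exclude_dois(good_dois, inaccessible):
    """Return good_dois without any DOI listed in inaccessible, keeping order.

    DOIs are compared case-insensitively after trimming whitespace
    (an assumption; DOIs are case-insensitive by specification).
    """
    blocked = {doi.strip().lower() for doi in inaccessible}
    return [doi for doi in good_dois if doi.strip().lower() not in blocked]


# Placeholder data: three "good" DOIs, one of them not found in OpenAlex.
good = ["10.1109/a", "10.1109/b", "10.1109/c"]
not_in_openalex = ["10.1109/B"]
papers_to_study = exclude_dois(good, not_in_openalex)
print(papers_to_study)  # ['10.1109/a', '10.1109/c']
```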

### `openalex_paper_df.csv`, `openalex_author_df.csv`, `openalex_reference_df.csv`, `openalex_concept_df.csv`

These four datasets are the outputs of `get_openalex_dfs.py`. They contain data for the papers listed in `papers_to_study.txt`. The four datasets cover paper metadata, author information, reference lists, and fields of study, respectively.

### `large` folder

The large folder contains four large datasets:

  1. `openalex_citation_author_df.csv`
  2. `openalex_citation_concept_df.csv`
  3. `openalex_reference_author_df.csv`
  4. `openalex_reference_concept_df.csv`

These datasets describe the authors and concepts of the papers that VIS papers cite (i.e., their references) and of the papers that cite VIS papers.

### `openalex_reference_paper_df.csv`

This dataset is essentially the same kind of data as the files in the `large` folder. I put it directly in `processed` because it is below 50 MB and could be uploaded to GitHub directly.

### `openalex_reference_author_df_unique.csv`, `openalex_reference_concept_df_unique.csv`, `openalex_reference_paper_df_unique.csv`

These files contain author, concept, and paper metadata for the papers cited by VIS papers. I added the word "unique" because these three files contain no duplicated cited papers. By contrast, a file such as `openalex_reference_author_df.csv` does contain duplicated cited papers, because one cited paper might be cited by many different VIS papers.
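One way to picture the deduplication, as a sketch under stated assumptions rather than the procedure actually used: drop the column identifying the citing VIS paper, then keep each remaining row once. The column names (`citing_paper_doi`, `cited_paper_id`, `author`) and the sample rows are hypothetical.

```python
def unique_cited_rows(rows, citing_col="citing_paper_doi"):
    """Drop the citing-paper column, then keep the first copy of each remaining row."""
    seen = set()
    unique = []
    for row in rows:
        # Everything except the citing paper identifies the cited-paper record.
        rest = tuple(sorted((k, v) for k, v in row.items() if k != citing_col))
        if rest not in seen:
            seen.add(rest)
            unique.append(dict(rest))
    return unique


# Hypothetical rows: cited paper W1 is referenced by two different VIS papers.
rows = [
    {"citing_paper_doi": "10.1109/a", "cited_paper_id": "W1", "author": "X"},
    {"citing_paper_doi": "10.1109/a", "cited_paper_id": "W1", "author": "Y"},
    {"citing_paper_doi": "10.1109/b", "cited_paper_id": "W1", "author": "X"},
    {"citing_paper_doi": "10.1109/b", "cited_paper_id": "W2", "author": "Z"},
]
unique_rows = unique_cited_rows(rows)
print(len(unique_rows))  # 3
```

Deduplicating on the whole remaining row, not only on the cited-paper ID, keeps every author of a multi-author cited paper.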

### `ieee_paper_df.csv`

This is paper metadata of VIS papers scraped from IEEE Xplore. 

### `gscholar_data.csv`

This is citation data of VIS papers scraped from Google Scholar.

### `merged_author_df.csv`

I merged the author datasets `ieee_author_df.csv` and `openalex_author_df.csv`, compared the merged result with Vispubdata, corrected incorrect information, and manually filled in missing author affiliation information.

## 4. ht_class

### `merged_aff_type_predicted.csv` and `merged_country_predicted.csv`

These are the outputs of applying classifiers to the merged author dataset.

### `ht_cleaned_author_df.csv` and `ht_cleaned_paper_df.csv`

These two are the cleaned versions of the author and paper datasets.

## 5. plots

  - `author cord` contains data for [Cross Country Collaboration Chord Diagram](https://observablehq.com/@hongtaoh/chord-speed-diagram)

  - `top_concepts_trends_df.csv` is data for [Concepts popularity trends tiny charts](https://observablehq.com/@hongtaoh/concepts-popularity-trends-tiny-charts)

  - `sankey` contains data for [Sankey diagram of citation flows](https://observablehq.com/@hongtaoh/sankey-diagram-of-citation-flows)

  - `cooccurance` contains data for [IEEE VIS paper concepts cooccurance chord diagram](https://observablehq.com/@hongtaoh/ieee-vis-paper-concepts-cooccurance-chord-diagram)