https://github.com/hongtaoh/32vis
Tip revision: 7369347f01be67cf1389f2ebbf19bf9ffbc797e7 authored by Hongtao Hao on 28 April 2023, 02:39:24 UTC
Rename 2023-04-27-check-cited-papers-venues.ipynb to 2023-04-27-check-citing-papers-venues.ipynb
Rename 2023-04-27-check-cited-papers-venues.ipynb to 2023-04-27-check-citing-papers-venues.ipynb
Tip revision: 7369347
README.md
# Data
## 1. Raw
### `vispubdata.csv`
Taken from https://docs.google.com/spreadsheets/d/1xgoOPu28dQSSGPIp_HHQs0uvvcyLNdkMF9XtRajhhxU/edit
## 2. Interim
### Checking
data files in this subfolder is to make sure the papers I identified on OpenAlex correctly correspond to actual VIS papers.
#### `title_query_empty_doi_query_404_1.txt`, `title_query_404_1.txt`
These are the outputs of `get_vispd_openalex_match_1.py`.
#### `title_query_404_2.txt`, `doi_query_404_2.txt`, `title_query_empty_doi_query_404_2.txt`
These are the output of `get_vispd_openalex_match_2.py`.
#### `title_query_empty_doi_query_404_dfs.txt`, `title_query_404_dfs.txt`, `doi_query_404_dfs.txt`
These are the output of `get_openalex_dfs.py`
### methods_reporting
data files in this folder contains data I need to report statistics in the Methods section.
#### `crossref.csv`
I queried VIS paper DOIs on Crossref API and obtained the reference counts and first author affilations.
#### `ieee_citation_metrics.csv`
I randomly selected 100 VIS papers and obtained their citation counts (on Crossref, Scopus, and Web of Science) as displayed on IEEE Xplore
#### `wod_id.csv`
I queried VIS papers on Web of Science by DOI and created a table containing a paper's DOI and its corresponding Web of Science ID. What I want to do is to see how many VIS papers can be identified via DOI query on Web of Science.
### `vispd_openalex_match_1.csv`, `vispd_openalex_match_2.csv`, `openalex_author_df.csv`
These are outputs of `get_vispd_openalex_match_1.py`, `get_vispd_openalex_match_2.py`, and `get_openalex_dfs.py`.
### `ieee_author_df.csv`
This is author data scraped from IEEE Xplore.
### `award_paper_df.csv`
This is data on award-winning VIS papers.
## 3. Processed
### `titles_2021.csv`
paper titles for IEEE VIS 2021.
### `dois_2021.csv`
DOIS for IEEE VIS 2021 papers.
### `vispubdata_plus.csv`
I appended data of IEEE VIS 2021 to the raw dataset of `vispubdata.csv` which contains data for papers published in 1990-2020.
### `vispd_plus_good_papers.txt`
This contains a list of paper DOIs. I excluded papers with paper type of "M" and that invalid DOI ('10.0000/00000001') in `VISPUBDATA_PLUS.csv`.
### `papers_to_study.txt`
This is a list of DOIs. This list is **very important** because it contains the DOIs for papers we will include in the analysis of the present study.
`papers_to_study.txt` is different from `vispd_plus_good_papers.txt` because it excludes the nine papers inaccessible in OpenAlex.
### `openalex_paper_df.csv`, `openalex_author_df.csv`, `openalex_reference_df.csv`, `openalex_concept_df.csv`
These four datasets are the outputs of `get_openalex_dfs.py`. They contain data for papers included in `papers_to_study.txt`. The four datasets are about paper meta info, author info, reference lists, and field of study, respectively.
### `large` folder
The large folder contains four large datasets:
1. `openalex_citation_author_df.csv`
2. `openalex_citation_concept_df.csv`
3. `openalex_reference_author_df.csv`
4. `openalex_reference_concept_df.csv`
These data are about author and concepts in VIS's cited (i.e., references) and citing papers (i.e., those who were citing VIS).
### `openalex_reference_paper_df.csv`
This is basically the same as those in `large` folder. I put it `processed` directly because it is below 50M and could be uploaded to GitHub directly.
### `openalex_reference_author_df_unique.csv`, `openalex_reference_concept_df_unique.csv`, `openalex_reference_paper_df_unique.csv`
These are about author, concept, and paper metadata for VIS's cited papers. I added the word "unique" because there are no duplicated cited papers in these three files. In, for example, `openalex_reference_author_df.csv`, there are duplicated cited papers because one cited paper might be cited by many different VIS papers.
### `ieee_paper_df.csv`
This is paper metadata of VIS papers scraped from IEEE Xplore.
### `gscholar_data.csv`
This is citation data of VIS papers scraped from Google Scholar.
### `merged_author_df.csv`
I merged author `ieee_author_df.csv` and `openalex_author_df`, compared the merged with Vispubdata, corrected incorrect information, and manually filled in missing author affiliation information.
## 4. ht_class
### `merged_aff_type_predicted.csv` and `merged_country_predicted.csv`
These are the outputs of applying classifiers to merged author dataset.
### `ht_cleaned_author_df.csv` and `ht_cleaned_paper_df.csv`
These two are the cleaned version of author and paper datasets.
## 5. plots
- `author cord` contains data for [Cross Country Collaboration Chord Diagram](https://observablehq.com/@hongtaoh/chord-speed-diagram)
- `top_concepts_trends_df.csv` is data for [Concepts popularity trends tiny charts](https://observablehq.com/@hongtaoh/concepts-popularity-trends-tiny-charts)
- `sankey` contains data for [Sankey diagram of citation flows](https://observablehq.com/@hongtaoh/sankey-diagram-of-citation-flows)
- `cooccurance` contains data for [IEEE VIS paper concepts cooccurance chord diagram](https://observablehq.com/@hongtaoh/ieee-vis-paper-concepts-cooccurance-chord-diagram)