mag

Measuring Scholarly Productivity of Sociologists Using Microsoft Academic Graph

This repo is part of the broader soc_of_soc project. The scripts in this repository complete several steps toward analyzing individual academic productivity using the Microsoft Academic Graph corpus. The primary objective is to link the network data to the publication data through probabilistic matching, along with some network analysis of journals.

The code in this repo does the following things:

  1. Fuzzy matching between the names of faculty members from the network data and the names of authors in the Microsoft Academic Graph, using the fuzzymatcher library. This generates many candidate matches (see the first sketch after this list).

  2. Filtering the full Microsoft Academic Graph corpus down to only the probable matches, using Dask (second sketch below).

  3. Generating a network of journals in the corpus that are linked by shared authors. We use this network to help filter non-sociologists out of the dataset: we take eigenvector centrality scores of the journals and then score authors by how much they publish in high-centrality journals, which are more likely to be sociology journals (third sketch below).
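
As a rough illustration of step 1, the matching can be done with fuzzymatcher's link_table, which returns every candidate pairing with a score. The file and column names below are placeholders, not the ones fuzzy_matches.py actually uses:

```python
import pandas as pd
import fuzzymatcher

# Hypothetical inputs: faculty names from the ASA network data and a slice of
# the MAG author table. Column names are illustrative only.
faculty = pd.read_csv("faculty_names.csv")         # column: faculty_name
mag_authors = pd.read_csv("mag_author_names.csv")  # column: author_name

# link_table returns all candidate pairings with a match score, rather than
# only the single best match, so the cutoff can be chosen downstream.
candidates = fuzzymatcher.link_table(
    faculty, mag_authors,
    left_on="faculty_name", right_on="author_name",
)

candidates.to_csv("matches/candidate_matches.csv", index=False)
```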
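
Step 2 follows a standard Dask pattern: read the large MAG tables lazily and keep only the rows that belong to candidate authors. A minimal sketch, with placeholder paths and column names rather than the real MAG schema:

```python
import dask.dataframe as dd
import pandas as pd

# Author IDs that survived fuzzy matching (hypothetical file and column names).
keep_ids = pd.read_csv("authors.csv")["AuthorId"].astype(str).tolist()

# Read the very large paper-author table lazily; Dask only materializes it
# when .compute() is called, so the full corpus never sits in memory at once.
paper_authors = dd.read_csv("PaperAuthorAffiliations.csv", dtype=str)

# Keep only rows for candidate authors, then write the reduced table out.
filtered = paper_authors[paper_authors["AuthorId"].isin(keep_ids)]
filtered.compute().to_csv("authors2papers.csv", index=False)
```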
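
For step 3, the journal scoring could look roughly like the following, using networkx for illustration (the repo itself may not use networkx, and the file and column names are assumptions):

```python
import networkx as nx
import pandas as pd

# Hypothetical edge list: journals linked by the number of authors they share.
edges = pd.read_csv("journal2journal_edges.csv")  # columns: source, target, weight
G = nx.from_pandas_edgelist(edges, "source", "target", edge_attr="weight")

# Eigenvector centrality rewards journals that share authors with other
# well-connected journals, which should pick out the sociology core.
centrality = nx.eigenvector_centrality(G, weight="weight")

# Score each author by the mean centrality of the journals they publish in;
# low-scoring authors are the likely non-sociologists to filter out.
author_journals = pd.read_csv("authors2journals.csv")  # columns: AuthorId, JournalId
author_journals["centrality"] = author_journals["JournalId"].map(centrality)
author_scores = author_journals.groupby("AuthorId")["centrality"].mean()
print(author_scores.sort_values(ascending=False).head())
```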

Data

This code requires data that lives in several spots:

Network Data from the ASA Guide to Graduate Departments

The data from the ASA Guide to Graduate Departments lives locally on my computer but is uploaded to two other locations: the Midway3 servers of the Research Computing Center at the University of Chicago and the Cronus server of Social Science Computing Services, also at UChicago.

Microsoft Academic Graph Data

The Microsoft Academic Graph (MAG) lives on the Midway3 servers, and all scripts that parse that data should be run on Midway3, typically via sbatch scripts that specify the partitions, nodes, and cores to use.

The table below shows which scripts to run, where, and in what order, with a short description of each:

| Script | Type | Where To Run | Description | Output |
| --- | --- | --- | --- | --- |
| fuzzy_matches.py | Data | Midway3 | Performs the fuzzy matching; produces many CSVs | /matches |
| get_author_candidates.py | Data | Midway3 | Filters the MAG corpus to get the authors found by fuzzy matching and saves a key from faculty to author names | key_faculty2authors.csv, authors.csv |
| filter_mag_corpus.py | Data | Midway3 | Filters the complete MAG data down to the names we feed it | authors2papers.csv, papers.csv |
| filter_journals.py | Data | Midway3 | Does the same, but for journals | journals.csv |
| filtered_cited | Data | Midway3 | Filters papers to get only the ones citing our authors | papers citing.csv |
| net_project.py | Data | Midway3 | Projects the two-mode network to a one-mode, journal-to-journal network (see the sketch below) | journal2journal_mat.csv, authors2journals_mat.csv |
| journal_net.R | Analysis | Local | Performs first analyses on the journal-to-journal network | None |
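
For reference, the two-mode to one-mode projection that net_project.py performs can be expressed as a sparse matrix product: if B is the author-by-journal incidence matrix, then B^T B is the journal-by-journal matrix whose (i, j) entry counts the authors shared by journals i and j. A sketch with placeholder file and column names:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical author-journal pairs produced upstream.
aj = pd.read_csv("authors2journals.csv").drop_duplicates()  # columns: AuthorId, JournalId

authors = aj["AuthorId"].astype("category")
journals = aj["JournalId"].astype("category")

# Sparse author x journal incidence matrix B (1 where the author published in the journal).
B = csr_matrix(
    (np.ones(len(aj)), (authors.cat.codes, journals.cat.codes)),
    shape=(len(authors.cat.categories), len(journals.cat.categories)),
)

# Project the two-mode network: (B^T B)[i, j] counts authors shared by journals i and j.
journal2journal = B.T @ B

pd.DataFrame(
    journal2journal.toarray(),
    index=journals.cat.categories,
    columns=journals.cat.categories,
).to_csv("journal2journal_mat.csv")
```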
