arXiv.org AI/ML Analysis


U.C. Berkeley, Masters in Information & Data Science program - datascience@berkeley
Summer 2020, W209 - Data Visualization - Andrew Reagan, PhD - Section 4


Description

This repo contains the draft work for the visualization of AI/ML research papers catalogued on arXiv.org for calendar years 1993 through 2019. The categories under consideration are limited to:

  • Computer Science: Artificial Intelligence [cs: AI]
  • Computer Science: Machine Learning [cs: LG]
  • Statistics: Machine Learning [stat: ML]
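
As a rough illustration of the scope described above, the category and year restriction can be sketched in pandas. This is a minimal sketch and not the repository's actual code: the column names `categories` and `created`, the space-separated category format, and the function name `filter_papers` are assumptions made for the example. The dotted identifiers `cs.AI`, `cs.LG`, and `stat.ML` are the standard arXiv forms of the three categories listed.

```python
import pandas as pd

# Categories and year range named in this README.
CATEGORIES = {"cs.AI", "cs.LG", "stat.ML"}
FIRST_YEAR, LAST_YEAR = 1993, 2019


def filter_papers(df: pd.DataFrame) -> pd.DataFrame:
    """Keep papers in the year range that list at least one target category.

    Assumed (hypothetical) schema: 'categories' holds a space-separated
    string of arXiv category identifiers, and 'created' holds a date.
    """
    years = pd.to_datetime(df["created"]).dt.year
    in_range = years.between(FIRST_YEAR, LAST_YEAR)  # inclusive on both ends
    in_scope = df["categories"].str.split().apply(
        lambda cats: bool(CATEGORIES & set(cats))
    )
    return df[in_range & in_scope]


# Tiny made-up sample, for illustration only.
sample = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "categories": ["cs.AI cs.CL", "math.PR", "stat.ML", "cs.LG"],
    "created": ["2001-05-01", "2010-01-01", "2020-03-01", "1993-01-15"],
})
filtered = filter_papers(sample)
print(filtered["id"].tolist())
```

Matching on any listed category, rather than only the primary one, keeps cross-listed papers in scope. Whether the project treats cross-listings this way is not stated here.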

There are two external visuals for this project:


This project leverages the following visualization frameworks:


Highlight of key files included in this repository:

| File | Description |
| --- | --- |
| ArXiV AI & ML Analytics - Midterm.pptx | Midterm presentation for arXiv AI & ML Analysis solution |
| w209_assignment_2__cris_benge.pdf | Assignment 2, covering thorough review and initial hypothesis testing of arXiv data. |
| load_base_data.py | Processes the base arXiv categories data, storing the output into a single Pandas.DataFrame (HDF5 file) |
| refine_data_for_analysis.py | Processes the consolidated (but raw) arXiv categories data, generating the final analysis output data (CSV file) |
| utils/preprocessing.py | Utility class; used for loading the raw arXiv data and generating the processed analysis dataset |
| plot_utils.py | Utility class; used for generating various plots in EDA |
| Exploratory Data Analysis.ipynb | Jupyter Notebook demonstrating the basic exploratory data analysis performed |
| Clustering and Topic Modeling.ipynb | Jupyter Notebook containing the walk-through for clustering and topic modeling of the arXiv dataset |

Visualization Samples




References

Data was collected from the tremendous work provided by the arxiv_archive repo, and all due credit goes to:

Geiger, R. Stuart (2020). ArXiV Archive: A Tidy and Complete Archive of Metadata for Papers on arxiv.org. doi | url.


License

Licensed under the MIT License. See LICENSE file for more details.