Authors : Rahul Kulkarni | Anu Yadav | Cristopher Benge
U.C. Berkeley, Masters in Information & Data Science program - datascience@berkeley
Summer 2020, W209 - Data Visualization - Andrew Reagan, PhD - Section 4
This repo contains the draft work for the visualization of AI/ML research papers catalogued on arXiv.org for calendar years 1993 through 2019. Categories under consideration have been limited to:
- Computer Science: Artificial Intelligence [cs: AI]
- Computer Science: Machine Learning [cs: LG]
- Statistics: Machine Learning [stat: ML]
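The scope described above (three categories, calendar years 1993 through 2019) can be sketched as a simple record filter. This is only an illustration using the standard library, not the code in `load_base_data.py` or `refine_data_for_analysis.py`. The field names `categories` and `year`, the space-separated category string, and the sample records are assumptions made for the example; the category codes are written in arXiv's dotted form (`cs.AI`, `cs.LG`, `stat.ML`).

```python
# Hypothetical scope filter; field names and sample records are assumptions.
TARGET_CATEGORIES = {"cs.AI", "cs.LG", "stat.ML"}  # arXiv's dotted category codes
FIRST_YEAR, LAST_YEAR = 1993, 2019                 # inclusive range stated above


def in_scope(record: dict) -> bool:
    """Return True if the record has a target category and an in-range year."""
    # Assumes categories arrive as one space-separated string, e.g. "cs.AI cs.CL".
    categories = set(record["categories"].split())
    has_target = bool(categories & TARGET_CATEGORIES)
    in_years = FIRST_YEAR <= int(record["year"]) <= LAST_YEAR
    return has_target and in_years


# Made-up sample records, for illustration only.
papers = [
    {"id": "a", "categories": "cs.AI cs.CL", "year": "2015"},
    {"id": "b", "categories": "math.CO", "year": "2015"},
    {"id": "c", "categories": "stat.ML", "year": "2020"},
    {"id": "d", "categories": "cs.LG stat.ML", "year": "1993"},
]

selected = [p["id"] for p in papers if in_scope(p)]
print(selected)
```

A paper cross-listed in several categories is kept as long as at least one of them is in the target set, which is one reasonable reading of "limited to" here.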
There are two external visuals for this project:
- Bokeh ArXiV Paper Clustering Visual (hosted in Azure Web App)
- Power BI ArXiV Paper Dashboard (requires access grant and Power BI login)
This project leverages the following visualization frameworks:
File | Description |
---|---|
ArXiV AI & ML Analytics - Midterm.pptx | Midterm presentation for arXiv AI & ML Analysis solution |
w209_assignment_2__cris_benge.pdf | Assignment 2, covering thorough review and initial hypothesis testing of arXiv data. |
load_base_data.py | Processes the base arXiv categories data, storing the output as a single pandas DataFrame (HDF5 file) |
refine_data_for_analysis.py | Processes the consolidated (but raw) arXiv categories data, generating the final analysis output data (CSV file) |
utils/preprocessing.py | Utility class; used for loading the raw arXiv data and generating the processed analysis dataset |
plot_utils.py | Utility class; used for generating various plots in EDA |
Exploratory Data Analysis.ipynb | Jupyter Notebook demonstrating the basic exploratory data analysis performed |
Clustering and Topic Modeling.ipynb | Jupyter Notebook containing the walk-through for clustering and topic modeling of the arXiv dataset |
Data was sourced from the arxiv_archive repo, a tremendous piece of work; all credit is due to: Geiger, R. Stuart (2020). ArXiV Archive: A Tidy and Complete Archive of Metadata for Papers on arxiv.org. doi | url.
Licensed under the MIT License. See LICENSE file for more details.