This repository contains the key resources for the Boa dataset of GitHub Python projects.

boalang/MSR19-DataShowcase

MSR19-DataShowcase

This repository contains the key resources for the Boa dataset of GitHub Python projects. The dataset supports analysis of top-rated data science projects on GitHub. The work was published at MSR 2019, Montreal, Canada.

Boa Meets Python: A Boa Dataset of Data Science Software in Python Language

By: Sumon Biswas, Md Johirul Islam, Yijia Huang, and Hridesh Rajan

Abstract

The popularity of the Python programming language has surged in recent years due to its increasing usage in Data Science. The availability of Python repositories on GitHub presents an opportunity for mining software repository research, e.g., suggesting best practices for developing Data Science applications, identifying bug patterns, and recommending code enhancements. To enable this research, we have created a new dataset that includes 1,558 mature GitHub projects that develop Python software for Data Science tasks. By analyzing the metadata and code, we have included projects that use a diverse set of machine learning libraries and are managed by a variety of users and organizations. The dataset is made publicly available through the Boa infrastructure both as a collection of raw projects and in a processed form that can be used for large-scale analysis in the Boa language. We also present two initial applications to demonstrate the potential of the dataset that could be leveraged by the community.

ACM Reference

Biswas, S. et al. 2019. Boa Meets Python: A Boa Dataset of Data Science Software in Python Language. MSR’19: 16th International Conference on Mining Software Repositories (May 2019).

Important Links

The full paper is available here: http://design.cs.iastate.edu/papers/MSR-19/msr19.pdf

Further instructions for using the dataset are available here: http://design.cs.iastate.edu/papers/MSR-19/

The example Boa queries for the dataset:

The open-source code for the Boa compiler and the Python dataset generation is available in the pydatagen branch of the Boa compiler repository: https://github.com/boalang/compiler/tree/pydatagen
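To work with that branch locally, one option is to clone only the pydatagen branch. The following is a minimal sketch, not part of the original repository: it derives a `git clone` command from the branch URL above. The helper name `clone_command` and the destination directory `boa-compiler` are hypothetical names chosen for illustration.

```python
from urllib.parse import urlparse

# Branch URL taken from this README.
TREE_URL = "https://github.com/boalang/compiler/tree/pydatagen"


def clone_command(tree_url, dest="boa-compiler"):
    """Turn a GitHub /tree/<branch> URL into a `git clone` argument list.

    `dest` is a hypothetical local directory name, not something the
    Boa project prescribes.
    """
    parts = urlparse(tree_url)
    # Path looks like /<owner>/<repo>/tree/<branch>
    owner, repo, marker, branch = parts.path.strip("/").split("/", 3)
    if marker != "tree":
        raise ValueError(f"not a branch URL: {tree_url}")
    repo_url = f"{parts.scheme}://{parts.netloc}/{owner}/{repo}.git"
    return ["git", "clone", "--branch", branch, "--single-branch", repo_url, dest]


if __name__ == "__main__":
    cmd = clone_command(TREE_URL)
    print(" ".join(cmd))
    # To actually clone (requires git and network access), pass the list to
    # subprocess.run(cmd, check=True) from the standard library.
```

The script only prints the command, so it can be inspected before anything is downloaded.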
