GitHub - QortalSeth/GithubAnalyzer: Scrapes Github to determine the current state of machine learning projects on the platform.

Github is one of the most popular version control systems in use today, with over 100 million projects available to users. Because of this, it is one of the best sources to check on the current state of Computer Science. My Github Analyzer application scrapes thousands of machine learning projects in order to determine which machine learning libraries are most commonly used, and analyze various statistics about machine learning projects as a whole.

Data Collection

The Python library Scrapy was used to get data from Github. The web scraping consisted of two spiders:

GithubLinksSpider.py - Use search terms to get project links
GithubProjectsSpider.py - Use project links to get data for each project

Each search term on Github allows the user to view 100 pages of information with 10 project links per page. I used 5 search terms classified by language to find projects:

Python
Jupyter Notebook
Java
Javascript
All languages

Each search term produced 1000 links, although the final result produced around 4235 unique results. Getting project links was difficult due to Github servers timing out requests unless the rate of requests per second was low. Fortunately, the scraping for individual projects was much more forgiving in that regard.

Results

The following data was obtained by analyzing the readme of each project and searching for references to each library, as well as their common aliases.

This plot shows the popularity of each library. It demonstrates that most libraries aren’t used much, and the top 5 libraries have by far the most usage.

These 2 plots demonstrate the number of total commits the scraped projects have for each library. They demonstrate that the libraries with the most commits are very different than their base popularity would suggest. The Histogram shows that most libraries have under 10000 commits, although there are significant outliers that may influence the results of the bar plot.

This plot demonstrates that the libraries whose projects have the most stars are Pyevolve and NuPIC, while the rest have very few stars.

This plot groups the projects by license. It reveals that the vast majority of projects have no license, although mit, apache, and gpl have a few uses.

This plot graphs the relationship between commits and releases. It shows that a lot of projects have 0 or 1 releases, but once the number of releases is greater than 1, there seems to be a positive correlation between commits and releases.

Conclusion

The common pattern in the data is that the vast majority of projects on Github are small and don’t have any significant number of commits, stars, or other indicators of influence. The same seems to be true for different libraries, which have a few that most people use, but the rest have very little use. As a whole, Github seems to have a few big projects that get most of the activity from users, and the rest are small and inconsequential.

Future work

One of the difficulties of using Scrapy on Github is that Github uses Javascript rendering on some html tags, such as the number of contributors and the date of each project. I attempted to use scrapy-splash to render these tags, but it didn’t work. If I were to redo this project, I would use Selenium instead of Scrapy, since it was built with Javascript support as a primary feature.

Another feature I would add is to scrape the commit history for each project in order to see how machine learning projects have changed over time and which contributors are the most active.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
GithubAnalyzer		GithubAnalyzer
Images		Images
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Collection

Results

Conclusion

Future work

About

Releases

Packages

Languages

QortalSeth/GithubAnalyzer

Folders and files

Latest commit

History

Repository files navigation

Data Collection

Results

Conclusion

Future work

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages