MIT OpenCourseWare Crawler

Crawl Output

Last updated: November 27, 2023

Description

This is a simple crawler that saves the available courses on MIT OpenCourseWare. It exports the courses with video lectures as a CSV file.

You can crawl for courses other than those with video lectures by changing @start_urls in crawler.rb.
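
For reference, a Kimurai spider declares its start URLs as a class-level setting. Here is a minimal sketch of what that might look like; the class name, engine, and example URL are placeholders, not the actual contents of crawler.rb:

# Illustrative sketch only; crawler.rb may be structured differently.
require "kimurai"

class OcwSpider < Kimurai::Base
  @name = "ocw_spider"
  @engine = :selenium_firefox   # matches the geckodriver/Firefox requirement below
  # Swap this placeholder URL for another OCW listing page to crawl other course types.
  @start_urls = ["https://ocw.mit.edu/search/"]

  def parse(response, url:, data: {})
    # Extract course titles and links here, then append them to results.csv.
  end
end

OcwSpider.crawl!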

Docker Run (Recommended)

This is the simplest way to run the crawler. The container saves the results to results.csv on the host via a Docker volume.

$ docker build -t ocw-crawl:1.0 .
$ docker run --volume $(pwd)/results.csv:/app/results.csv \
             --rm \
             --name ocw-crawl \
             ocw-crawl:1.0

Manually Run

To run the crawler without Docker, you'll need to install an older version of Ruby that's compatible with kimurai. You'll also need geckodriver and Firefox. Read more about setting up kimurai here if you run into trouble.

Setup

Install Ruby 2.5.0 and run bundle install.

$ asdf install ruby 2.5.0
$ asdf global ruby 2.5.0
$ gem install bundler
$ bundle install # install dependencies

Run

$ ruby crawler.rb
...

Possible Improvements

  • Use OCW Sitemaps to crawl all courses (see the sketch after this list)
  • Get more information about each course from the sitemap
    • Course materials often follow these patterns:
      • Syllabus: /pages/syllabus/
      • Course download: /download/
      • Resources: /resources/*/
        • PDFs, slides, lectures notes, etc.
      • Course pages: /pages/*/
        • Readings: /pages/readings/
  • Turn the data into an app or API
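
The sitemap idea could be prototyped with plain Ruby before wiring it into the Kimurai spider. The sketch below assumes the sitemap is served at https://ocw.mit.edu/sitemap.xml (it may in fact be a sitemap index pointing at per-course sitemaps) and simply buckets URLs by the path patterns listed above:

# Rough sketch, not part of the crawler; the sitemap URL is an assumption.
require "net/http"
require "uri"
require "nokogiri"

xml  = Net::HTTP.get(URI("https://ocw.mit.edu/sitemap.xml"))
urls = Nokogiri::XML(xml).remove_namespaces!.xpath("//url/loc").map(&:text)

# Bucket URLs by the course-material path patterns listed above.
patterns = {
  syllabus:  %r{/pages/syllabus/},
  download:  %r{/download/},
  resources: %r{/resources/[^/]+/},
  readings:  %r{/pages/readings/}
}

grouped = Hash.new { |h, k| h[k] = [] }
urls.each do |url|
  kind = patterns.find { |_, re| url.match?(re) }&.first || :other
  grouped[kind] << url
end

grouped.each { |kind, list| puts "#{kind}: #{list.size} URLs" }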