Public Data Scraper for Parliament Data for the EU and other Parliaments
- Install git (if not present already)
- Clone the project using
git clone https://github.com/sampritipanda/simple_app.git
- Install Ruby (version >= 2.1) and Bundler
- Run
bundle install
to install the required gems
- Run the script using
ruby eu_scraper.rb
or
./eu_scraper.rb
- Find the scraped questions in the docs/ folder
###Technologies Used in Ruby Scraper:
- Ruby - The Language
- Nokogiri - For HTML Parsing
##Scala-based Asynchronous crawler Setup
- Install sbt, git, and the latest version of Scala (sbt will update Scala for you)
- Clone the project using
git clone https://github.com/DengYiping/parliament-scaper.git
- Run
sbt run
- sbt first downloads the necessary dependencies automatically, then runs the script.
###Technologies Used in Scala crawler:
- Scala: a functional programming language on the JVM
- Akka: an effective framework for asynchronous, non-blocking, event-driven programming in Scala
- Spray-client: a lightweight HTTP client based on the Akka actor model
##Python Based Crawler Setup
- Install the requirements for this crawler
pip install -r requirements.txt
- Run
$ python eu_scraper.py
###Technologies Used in Python Crawler:
- Requests library
- lxml library for DOM traversal
##Python-async parser setup
- Create a virtual environment inside the python-async folder with
virtualenv --python=python3.4 venv
- Activate your virtual environment with
source venv/bin/activate
- Install all appropriate requirements with
pip install -r requirements.txt
- Run the parser with
$ python parser.py
Changing the parser behavior
- Change YEARS_TO_PARSE to parse data from different years.
- Change FOLDER_TO_DOWNLOAD to change the name of the folder the data is downloaded into.
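As a rough illustration of the two settings above, the sketch below shows how constants like these could determine where downloaded files end up. Only the names YEARS_TO_PARSE and FOLDER_TO_DOWNLOAD come from this README; the example values and the helpers `output_dir_for` and `planned_dirs` are hypothetical and may not match what parser.py actually does.

```python
import os

# Names taken from the README; the values here are illustrative assumptions.
YEARS_TO_PARSE = [2014, 2015]
FOLDER_TO_DOWNLOAD = "downloads"


def output_dir_for(year, base=FOLDER_TO_DOWNLOAD):
    """Hypothetical helper: build the per-year output directory path."""
    return os.path.join(base, str(year))


def planned_dirs(years=YEARS_TO_PARSE, base=FOLDER_TO_DOWNLOAD):
    """Return the directories a run would write to, one per configured year."""
    return [output_dir_for(year, base) for year in years]


if __name__ == "__main__":
    for path in planned_dirs():
        print(path)
```

Under these assumptions, editing the two constants at the top is the only change needed to target other years or another folder.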
###Technologies Used in Python-async parser:
- Requests + requests-futures for async requests
- threading for async downloading
- beautifulsoup4 for DOM parsing
- tqdm for progress bar
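The combination listed above (requests-futures plus threads) amounts to submitting many downloads to a thread pool and collecting the results as they finish. The sketch below shows that pattern using only the standard library's `concurrent.futures`, as a stand-in for requests-futures rather than the parser's actual code. The names `download_all` and `fetch`, and the example URLs, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=8):
    """Fetch every URL concurrently and return {url: result-or-exception}.

    `fetch` is any callable that takes a URL and returns its content, for
    example a thin wrapper around requests.get (an assumption, not the
    parser's real function).
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all downloads up front so they run in parallel threads.
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one download fails
                results[url] = exc
    return results


if __name__ == "__main__":
    # Stub fetcher so the sketch runs without network access.
    def fake_fetch(url):
        return "<html>%s</html>" % url

    pages = download_all(["http://example.org/a", "http://example.org/b"], fake_fetch)
    print(sorted(pages))
```

In a real run, `fetch` could wrap `requests.get(url).text`, and the loop over `as_completed` would be a natural place to advance a tqdm progress bar.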
##Python-Based Scraper (pol's scraper)
This scraper uses the BeautifulSoup package to parse and extract data from parliament's site. The script can also calculate how many pages it has to download based on the number of questions to be scraped.
- Install the requirements
pip install -r requirements.txt
- Run
$ python scraper.py
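The page-count calculation mentioned above is essentially a ceiling division: the number of questions divided by the number of results per page, rounded up. A minimal sketch follows; the function name `pages_needed` and the page size of 10 are assumptions for illustration, not values taken from scraper.py.

```python
def pages_needed(total_questions, per_page=10):
    """Return how many result pages cover `total_questions` items.

    `per_page` is an assumed page size; the real site may use another value.
    """
    if total_questions < 0:
        raise ValueError("total_questions must be non-negative")
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    # Integer ceiling division without floating point.
    return (total_questions + per_page - 1) // per_page


if __name__ == "__main__":
    print(pages_needed(95))  # 95 questions at 10 per page -> 10 pages
```

Rounding up matters because a partially filled last page still has to be downloaded.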
##Scrape it all - Generic Scraper (pol's scraper 2)
This scraper uses the BeautifulSoup package to parse and extract data from parliament's site. The script can also calculate how many pages it has to download based on the number of documents to be scraped.
It is a generic scraper: it covers all years and all languages, and scrapes the entire database.
- Install the requirements
pip install -r requirements.txt
- Run
$ python scrape_it_all.py