Skip to content

Latest commit

 

History

History
44 lines (32 loc) · 1.5 KB

README.md

File metadata and controls

44 lines (32 loc) · 1.5 KB

scraper

Web Scraper used to scrape Amazon Product Pages and avoid capchas.

Author: Julius Remigio

Required Libraries

Install required libraries:

pip install -r requirements.txt

Middlewares used:

See scrapy documentation:

Proxies

The proxies in proxy.txt was created using free public proxies that were available at the time of scraping. It should be updated regularly with working proxies to increase rate of success.

List of public proxies: http://proxylist.hidemyass.com/

Starting a scaping session:

Sessions are started using scrapy CLI utility. Custom parameters are passed using the -a parameter. Custom Parameters:

  • file - csv file with column header 'asin' (list of amazon products to scrape)
  • html - folder to store html of scraped products

Example:

scrapy crawl product -a html=./../../html -a file=./../reviews_Women.csv.gz -o ./../reviews_Women.jl --logfile ./../reviews_Women.csv.log

Settings.py

Used to changing scraping behavior such as retries and middleware configuration

Spiders

spider directory contains all spider classes. Currently there is only a products spider for scraping amazon product pages.

Notebooks

Notebooks are used for transforming the data and preparing it for model consumption.