# JobCrawler - Scrapy Web Crawler

JobCrawler is a full ETL pipeline that collects job listings from Indeed.com; parses, cleans, and transforms the results; and sends them to an API for further validation and storage.

This project was originally built to collect information about job listings, companies, and their posting habits for JobStat, a web application serving as a live dashboard and research tool for job seekers.

At this time, the project only collects Data Analytics and Software Development (Python) job listings.
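The clean/transform and send-to-API steps described above could be sketched as a Scrapy item pipeline. The field names, normalization rules, and API URL below are assumptions for illustration, not taken from the project:

```python
import json
import re
from urllib import request


class JobCleaningPipeline:
    """Cleans a scraped job listing, then posts it to the API as JSON.

    A minimal sketch: the field names and endpoint are hypothetical.
    """

    API_URL = "https://example.com/api/jobs"  # hypothetical endpoint

    def process_item(self, item, spider=None):
        # Normalize the text fields before handing the record to the API.
        return {
            "title": self._normalize(item.get("title", "")),
            "company": self._normalize(item.get("company", "")),
            "location": self._normalize(item.get("location", "")),
        }

    @staticmethod
    def _normalize(text):
        # Collapse runs of whitespace and strip leading/trailing space.
        return re.sub(r"\s+", " ", text).strip()

    def send(self, cleaned):
        # POST the cleaned record as JSON (shown for shape only).
        req = request.Request(
            self.API_URL,
            data=json.dumps(cleaned).encode(),
            headers={"Content-Type": "application/json"},
        )
        return request.urlopen(req)
```

Scrapy item pipelines are plain classes, so the cleaning step above can be unit-tested without a running crawler.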

## Tech Stack

- Python 3.10
  - Scrapy
- Bash
  - Shell scripts for database backups to Amazon S3
- AWS
  - EC2 instance as the deployment server
  - S3 for database backup storage
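The backup scripts mentioned above might look something like the following sketch. The database name, S3 bucket, and use of Postgres are assumptions, not details from the project:

```shell
#!/usr/bin/env bash
# Sketch of a database backup to Amazon S3 (db name, bucket, and
# Postgres are assumptions; adapt to the actual data store).
set -euo pipefail

DB_NAME="${DB_NAME:-jobcrawler}"
S3_BUCKET="${S3_BUCKET:-s3://jobcrawler-backups}"  # hypothetical bucket

# Build a timestamped backup filename, e.g. jobcrawler_20240101_020000.sql.gz
backup_filename() {
  printf '%s_%s.sql.gz' "$1" "$(date +%Y%m%d_%H%M%S)"
}

run_backup() {
  local file
  file="$(backup_filename "$DB_NAME")"
  # Dump, compress, and stream straight to S3 without a local temp file.
  pg_dump "$DB_NAME" | gzip | aws s3 cp - "${S3_BUCKET}/${file}"
}

# Invoke run_backup from cron or a systemd timer on the EC2 instance.
```

Streaming the dump through `gzip` into `aws s3 cp -` avoids keeping a local copy on the EC2 instance's disk.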

## Deployment
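One way to schedule the crawl and the backup on the EC2 instance is cron. The schedule, paths, and spider name below are assumptions for illustration:

```cron
# Run the Indeed spider nightly at 02:00, then back up the database.
0 2 * * *   cd /home/ubuntu/jobcrawler && scrapy crawl indeed >> crawl.log 2>&1
30 2 * * *  /home/ubuntu/jobcrawler/scripts/backup_to_s3.sh
```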


## To Dos

1. [ ] Implement CI/CD
2. [ ] Finalize documentation
3. [ ] Convert the data store to send to the API
   - [ ] Send JSON to the API
   - [ ] Handle server responses appropriately
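The "handle server responses" item could start from a sketch like this: classify the API's status code into an action, and retry only transient failures. The status ranges and retry policy are assumptions, not project decisions:

```python
import time


def classify_response(status: int) -> str:
    """Map an HTTP status code to a crawler action.

    2xx: item stored; 5xx: transient server error, retry later;
    anything else (e.g. a 400 validation failure): drop and log.
    """
    if 200 <= status < 300:
        return "ok"
    if 500 <= status < 600:
        return "retry"
    return "drop"


def deliver(post, payload, attempts=3, delay=0.0):
    """Try to POST payload, retrying transient (5xx) failures.

    `post` is any callable taking the payload and returning a status code.
    """
    for _ in range(attempts):
        action = classify_response(post(payload))
        if action != "retry":
            return action
        time.sleep(delay)
    return "retry"
```

Keeping the classification separate from the delivery loop makes the policy easy to test without a live API.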