JobCrawler - Scrapy Web Crawler

JobCrawler is a full ETL pipeline that collects job listings from Indeed.com, parses, cleans, and transforms the results, and sends them to an API for further validation and storage.

The project was originally intended to collect data on job listings, companies, and their posting habits for JobStat, a web application that serves as a live dashboard and research tool for job seekers.

Currently, the project collects only Data Analytics and Software Development (Python) job listings.
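
As a rough illustration of the clean-and-transform stage described above, the sketch below normalizes one scraped listing into a JSON-ready dict. It is not the project's actual code: the field names (`title`, `company`, `location`, `posted`), the relative-date label format, and the helper names are all assumptions made for the example, and it uses only the Python standard library rather than Scrapy.

```python
import json
import re
from datetime import date, timedelta


def clean_text(value):
    """Collapse runs of whitespace and strip the ends; None becomes ''."""
    return re.sub(r"\s+", " ", value or "").strip()


def parse_posted(raw, today):
    """Turn an assumed relative label such as 'Posted 3 days ago' into an ISO date.

    Returns None when the label carries no day count that can be read.
    """
    label = clean_text(raw).lower()
    if "today" in label or "just posted" in label:
        return today.isoformat()
    match = re.search(r"(\d+)\+?\s+days?\s+ago", label)
    if match:
        return (today - timedelta(days=int(match.group(1)))).isoformat()
    return None


def transform_listing(raw, today=None):
    """Map a raw scraped dict (hypothetical field names) to a cleaned record."""
    today = today or date.today()
    return {
        "title": clean_text(raw.get("title")),
        "company": clean_text(raw.get("company")),
        "location": clean_text(raw.get("location")),
        "posted_on": parse_posted(raw.get("posted"), today),
    }


if __name__ == "__main__":
    raw = {
        "title": "  Data\n Analyst ",
        "company": "Acme  Corp",
        "location": None,
        "posted": "Posted 3 days ago",
    }
    # Serialize the cleaned record the way it might be handed to the API.
    print(json.dumps(transform_listing(raw), indent=2))
```

Passing `today` explicitly keeps the date arithmetic deterministic, which makes the transform easy to unit test.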

Tech Stack

Python 3.10
- Scrapy
Bash
- Shell scripts for database backups to Amazon S3
AWS
- EC2 instance as a deployment server
- S3 for database backup storage

Deployment

To Dos

1. Implement CI/CD
2. Finalize documentation
3. Convert data store to send to API
3a. [ ] Send JSON to API
3b. [ ] Handle server responses appropriately
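
One possible shape for items 3a and 3b is sketched below, using only the Python standard library. Everything specific here is an assumption: the endpoint URL is a placeholder, and the `stored` / `retry` / `rejected` outcomes are a hypothetical policy for illustration, not the behavior of the real API.

```python
import json
import urllib.error
import urllib.request

# Placeholder endpoint; the real API URL is not given in this README.
API_URL = "https://api.example.com/jobs"


def build_request(listing, url=API_URL):
    """Wrap one cleaned listing in a JSON POST request."""
    body = json.dumps(listing).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify_status(status):
    """Map an HTTP status code to an assumed follow-up action."""
    if 200 <= status < 300:
        return "stored"
    if status == 429 or status >= 500:
        return "retry"      # rate limiting or server trouble: try again later
    return "rejected"       # other 4xx: the payload itself needs fixing


def send_listing(listing, url=API_URL, timeout=10):
    """POST one listing and report what the caller should do next."""
    request = build_request(listing, url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses.
        return classify_status(err.code)
    except urllib.error.URLError:
        # Network-level failure (DNS, refused connection, timeout).
        return "retry"
```

Keeping `classify_status` separate from the network call means the response-handling policy can be tested without a live server.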

DResthal/JobCrawler