Hiking

Hiking is a job crawler that collects job information from job websites.

Source code is made available under the Apache 2.0 license.

Installation

Chrome

Hiking uses Chrome as the driver to crawl web pages, so please make sure that Chrome is installed on your computer.

brew install chromedriver
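
For reference, driving Chrome from Python via Selenium typically boils down to the minimal sketch below; Selenium usage and the URL here are assumptions for illustration, not Hiking's actual code.

# Minimal Selenium sketch (assumed dependency); chromedriver must be on PATH.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com/jobs')  # placeholder URL
print(driver.title)
driver.quit()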

PhantomJS

Hiking can also run without a GUI, which means it can crawl pages in headless browser mode.

To support this, you need to download the PhantomJS WebKit binary and extract it on your computer.

If you extract PhantomJS into the ~/phantomjs-2.1.1 directory, run the following command to check that it is the right version.

~/phantomjs-2.1.1/bin/phantomjs -v

Note: On Ubuntu Server, PhantomJS still relies on Fontconfig (the fontconfig or libfontconfig package, depending on the distribution). Use the following commands to install these dependencies:

sudo apt-get update
sudo apt-get install fontconfig-config
sudo apt-get install fontconfig
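
Once PhantomJS is extracted, a Selenium session can point at the binary explicitly. The sketch below is illustrative only; the path matches the example directory above, and note that Selenium's PhantomJS driver is deprecated in recent Selenium releases.

import os
from selenium import webdriver

# Point Selenium at the extracted PhantomJS binary from the example above.
phantomjs = os.path.expanduser('~/phantomjs-2.1.1/bin/phantomjs')
driver = webdriver.PhantomJS(executable_path=phantomjs)
driver.get('https://example.com/jobs')  # placeholder URL
print(driver.page_source[:200])
driver.quit()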

Install MongoDB

  • Update Homebrew’s package database
brew update
  • Install MongoDB
brew install mongodb
  • Start MongoDB
brew services start mongodb
  • Export crawling results
mongoexport --db jobs_db --collection jobs --out jobs_161012.csv --type=csv --fields dummy,dummy,industry,company,location,deadline,dummy,dummy,dummy,job_name,url
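
To sanity-check the stored documents and the field names used in the export, you can query the collection from the mongo shell:

mongo jobs_db --eval 'db.jobs.findOne()'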

Install python environment

Now we need to install the Python environment and all dependent packages. All the steps to build the environment have been scripted in install_env.

So you can set up the environment simply by running the script.

./install_env
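
The script itself is the source of truth; as a rough idea only, an environment setup of this kind usually amounts to something like the following (the virtualenv name and requirements.txt file are assumptions, not necessarily what install_env does):

# Hypothetical equivalent of install_env.
virtualenv .env
source .env/bin/activate
pip install -r requirements.txt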

Run

You can run the sample code code/job_app.py to crawl the test job website.

code/job_app.py

and the crawling results will be printed to the console.

If you want to store the data in a MongoDB database, you can run the command with the -o and -d arguments.

code/job_app.py -o mongodb -d jobs_db:jobs

and the results will be saved in the jobs collection of the jobs_db database.
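
For reference, writing to that collection is equivalent to a pymongo insert. The sketch below shows the idea; the document shape is illustrative, though some field names follow the mongoexport example above.

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
collection = client['jobs_db']['jobs']

# Illustrative document; the real fields come from the crawler.
collection.insert_one({
    'company': 'Example Inc.',
    'job_name': 'Backend Engineer',
    'location': 'Shanghai',
    'url': 'https://example.com/jobs/1',
})
print(collection.count_documents({}))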

If you want to start the crawler in headless browser mode, you can run the command with the --headless argument to specify the location of the PhantomJS binary.

code/job_app.py --headless=[path to the phantomjs binary]

Try the -h argument for more information.

code/job_app.py -h

Extension

If you want to crawl another website or another part of a website, you need to define a RunConfig in the code/run_configs directory.

The customized config should be an instance of RunConfig, defined in code/hiking/core/run_config.py.

For more information, please refer to code/run_configs/qiaobutang_top20.py and code/run_configs/shixiseng.py.

After you create the RunConfig, you need to create a new application that runs with it. For more information, please refer to code/job_app.py.
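
As a rough illustration only, a custom config plus a minimal launcher might look like the sketch below; every parameter name here is hypothetical, so check code/hiking/core/run_config.py and the bundled configs for the real constructor signature.

# Hypothetical example; see code/run_configs/qiaobutang_top20.py for real usage.
from hiking.core.run_config import RunConfig

config = RunConfig(
    start_url='https://example.com/jobs',  # hypothetical parameter
    item_selector='.job-listing',          # hypothetical parameter
)

# A new application would hand this config to the crawler,
# in the same way code/job_app.py does for the bundled configs.
print(config)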
