Salas' (leVirve@Github)
- Familiar with basic Python syntax
- Knowledge about HTTP
A crawler can be separated into two parts: crawling and parsing.
- Crawl: Fetch raw data through a URI
  - Make an HTTP request
  - Receive the response, and pass it to Parse
- Parse:
  - Parse structured data out of the response (HTML or something else)
  - Get the information of interest from the structured data.
- (Optional) Store or serve the information for other applications.
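The two stages above can be sketched as two small functions. This is only a minimal sketch: the names `crawl`, `parse`, and `fake_fetch`, the URL, and the sample payload are made up for illustration. The fetcher is passed in as a parameter so the example runs without network access.

```python
import json


def crawl(url, fetch):
    """Crawl step: fetch the raw response body for a URL.

    `fetch` is any callable that takes a URL and returns text,
    e.g. `lambda u: requests.get(u).text` in a real crawler.
    """
    return fetch(url)


def parse(raw):
    """Parse step: turn raw text into structured data, then keep
    only the fields we are interested in."""
    records = json.loads(raw)                       # raw text -> structured data
    return [record["title"] for record in records]  # structured data -> target info


def fake_fetch(url):
    """Stand-in for a real HTTP request (hypothetical sample data)."""
    return '[{"title": "first", "id": 1}, {"title": "second", "id": 2}]'


titles = parse(crawl("https://example.invalid/items", fake_fetch))
print(titles)
```

Keeping the two steps separate means the parsing logic can be tested on saved responses without making any requests.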
We use Python for the implementation.
- Crawl:
  - Python has a built-in library for HTTP requests.
  - However, `requests` is the solution recommended by the official Python documentation.
    Python documentation: "The Requests package is recommended for a higher-level HTTP client interface."
- Parse:
  - Python has its own `HTMLParser`, which serves as the basis for parsing text files formatted in HTML or XHTML.
  - However, we may choose other libraries for the job, such as:
    - `lxml` for better HTML text processing
    - `BeautifulSoup`, which offers a better interface for parse-tree operations
    - `Selenium`, which acts as a real browser for complicated interactions.
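To give a feel for the built-in `HTMLParser`, here is a minimal sketch that collects link targets from an HTML string. The class name `LinkCollector` and the sample HTML are assumptions made for illustration; a real page would come from the crawl step.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Hypothetical sample document
html = '<p><a href="/about">About</a> and <a href="https://example.com">home</a></p>'

collector = LinkCollector()
collector.feed(html)
print(collector.links)
```

The event-handler style works, but it gets verbose for nested queries, which is one reason the third-party libraries listed above are often preferred.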
pip install requests
Use Python and `requests` to fetch the raw response. See the `requests` quickstart for more operations.
Sample of the power of `requests` (doc):
>>> import requests
>>> # Use keyword parameters for advanced usage
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> # access data in `dict` way
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> # get the encoding of response text
>>> r.encoding
'utf-8'
>>> # raw text string
>>> r.text
'{"type":"User"...'
>>> # make response to json type
>>> r.json()
{'private_gists': 419, 'total_private_repos': 77, ...}
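The same keyword-argument style also covers query strings and headers. The sketch below builds a request with `requests.Request` and prepares it without sending it, so the effect of each keyword can be inspected offline. The `per_page` value and the `Accept` header are illustrative choices, not something this guide requires.

```python
import requests

# Build the request without sending it, so nothing touches the network.
request = requests.Request(
    "GET",
    "https://api.github.com/events",
    params={"per_page": 5},                             # becomes the query string
    headers={"Accept": "application/vnd.github+json"},  # extra request header
)
prepared = request.prepare()

print(prepared.method)
print(prepared.url)
print(prepared.headers["Accept"])
```

Passing the same `params=` and `headers=` keywords to `requests.get` sends the equivalent request directly.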
Python code for crawling.
import requests
url = 'https://api.github.com/events'
response = requests.get(url)
That's it. You get a response from the server.
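In practice a request can hang or come back with an error status, so a slightly more defensive version may be useful. This is a sketch under assumptions: the helper name `fetch_text` and the 10-second default timeout are arbitrary choices for illustration.

```python
import requests


def fetch_text(url, timeout=10):
    """Return the response body as text, or raise on failure."""
    response = requests.get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()                    # raise requests.HTTPError on 4xx/5xx
    return response.text
```

Calling `fetch_text('https://api.github.com/events')` inside a `try` / `except requests.RequestException` block lets the crawler log the failure and move on instead of crashing.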
We got the HTTP response in the previous section. Now we are going to parse it.
- If the response is well-formatted data:
  - json: `response.json()` will turn the JSON data into a Python `dict` (or a `list`, if the JSON is an array).
  - Access it directly to get the target information.
- Not
"""
response object
- content: binary content (bytes)
- text: decoded text string
- json(): method returning the parsed JSON data
"""
data = response.json()
text = response.text
process(text)  # `process` is a placeholder for your own parsing routine
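Once the JSON is decoded, getting the target information is ordinary indexing into dicts and lists. The sketch below uses a hand-written, heavily trimmed sample instead of a live response; the field names follow the general shape of GitHub event objects, but the values are invented.

```python
import json

# Hypothetical, heavily trimmed sample of what an events endpoint might return.
sample_text = '''
[
  {"type": "PushEvent",  "actor": {"login": "alice"}, "repo": {"name": "alice/site"}},
  {"type": "WatchEvent", "actor": {"login": "bob"},   "repo": {"name": "carol/tool"}}
]
'''

events = json.loads(sample_text)  # same kind of result as response.json()

# Access the nested structure directly to pull out the fields of interest.
summary = [(e["type"], e["actor"]["login"], e["repo"]["name"]) for e in events]
for row in summary:
    print(row)
```

With a real response, `events = response.json()` would replace the `json.loads` line and the rest would stay the same.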