Framework:
Everything begins with the Spiders, which yield two kinds of output:
data--->Item Pipeline
requests--->Scrapy Engine (which schedules them for downloading)
1.Begin the program
scrapy startproject name(mySpider)
2.items.py
define the data that needs to be stored
3.enter xxx/spiders----create a spider
scrapy genspider name scope(xxx.com)
4.run the spider
scrapy crawl name
optionally with output: scrapy crawl name -o xxx.csv / xxx.json
check the project structure (a sample layout is sketched below):
tree
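For reference, a freshly generated project (assuming the name mySpider from step 1) looks roughly like this:

mySpider/
    scrapy.cfg            # project entry configuration
    mySpider/
        __init__.py
        items.py          # item models
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider code lives here
            __init__.py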
Normally, we only need to modify the Item Pipeline and spiders.
scrapy.cfg: points to the basic settings--->functions such as:
concurrency settings
pipeline configuration
distributed crawling
cleaning data
items.py: the model of the items--->declare each field with scrapy.Field()
---->the item is then used like a dict (a sketch follows)
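A minimal items.py sketch; the class name TeacherItem and the three fields are illustrative, taken from the teacher example later in these notes:

import scrapy

class TeacherItem(scrapy.Item):
    # each field is declared with scrapy.Field(); the item is then used like a dict
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()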
pipelines.py
the process_item method has a fixed signature and is required (a pipeline sketch appears at the end of these notes)
settings.py (a sketch follows this list):
1.CONCURRENT_REQUESTS: 16 by default
2.DOWNLOAD_DELAY
3.COOKIES_ENABLED: controls whether cookies are sent, i.e. whether you can be traced
4.DEFAULT_REQUEST_HEADERS: if set here, the headers need not be set again in the spider
5.SPIDER_MIDDLEWARES: each entry has a priority (matters when there is more than one)
*6.ITEM_PIPELINES: each entry has a priority, and you can register your own pipelines
#7.REDIS_HOST and related options can also be set here (for a distributed setup)
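A settings.py sketch covering the options above; the concrete values and the mySpider.* class paths are illustrative assumptions, not required defaults:

# mySpider/settings.py (illustrative values)
CONCURRENT_REQUESTS = 16            # default
DOWNLOAD_DELAY = 2                  # seconds between requests
COOKIES_ENABLED = False             # do not send cookies, harder to trace
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0',
}
SPIDER_MIDDLEWARES = {
    'mySpider.middlewares.MySpiderMiddleware': 543,   # number = priority
}
ITEM_PIPELINES = {
    'mySpider.pipelines.TeacherJsonPipeline': 300,    # your own pipeline class
}
# for a distributed setup with scrapy-redis (optional):
# REDIS_HOST = 'localhost'
# REDIS_PORT = 6379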
generate the spider:
enter the spiders directory:
use the command: scrapy genspider name(myspider) scope(incast.cn)
At the beginning it helps to note the XPath conventions, e.g.
// matches nodes anywhere in the document
./ matches relative to the current node
teacher_list = response.xpath('//div[@class="li_txt"]')
for each in teacher_list:
    # name
    name = each.xpath('./h3/text()')
    # title
    title = each.xpath('./h4/text()')
    # info
    info = each.xpath('./p/text()')
then the program proper begins
1.set up items.py:
import the library
define the items: inherit from scrapy.Item (see the items.py sketch above)
2.write the spider (a full sketch follows this section):
1.name
2.allowed_domains
3.start_urls (there can be more than one)
*4.parse
1.the whole page
response.body----special case
2.a specific part
use xpath to match
convert the result into plain text-->
xxx.xpath('./p/text()').extract()
store the result into the item:
import xxx
item['xxx']=name[0] (optionally .encode('gbk'))
...
3.decide whether the data is an item or a request:
item: yield item --- goes to pipelines.py
request: yield scrapy.Request(self.url,callback=self.parse)---two parameters
the request is sent back to the scheduler's queue,
then dequeued and sent to the downloader
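Putting steps 2 and 3 together, a spider sketch under the assumptions used above; the class name, the start URL and the TeacherItem import are illustrative:

import scrapy
from mySpider.items import TeacherItem   # assumes the items.py sketch above

class TeacherSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['incast.cn']
    start_urls = ['http://www.incast.cn/']   # illustrative URL

    def parse(self, response):
        for each in response.xpath('//div[@class="li_txt"]'):
            item = TeacherItem()
            # extract() converts the selectors into plain strings
            item['name'] = each.xpath('./h3/text()').extract()[0]
            item['title'] = each.xpath('./h4/text()').extract()[0]
            item['info'] = each.xpath('./p/text()').extract()[0]
            # an item goes to the pipelines
            yield item

        # a request goes back to the scheduler, e.g. to follow another page:
        # yield scrapy.Request(next_url, callback=self.parse)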
'''
Running Scrapy from PyCharm:
*set up a start script in "Edit Configurations" --- add a Python configuration ---- set the
script path ---- to start.py (a start.py sketch follows)
'''
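A common start.py sketch for this; the spider name 'myspider' is assumed from the genspider command above:

# start.py -- placed next to scrapy.cfg so PyCharm can run it as a plain script
from scrapy import cmdline

cmdline.execute('scrapy crawl myspider'.split())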
set up the pipeline (a sketch follows):
use yield in myspider to hand over each item
the item is passed to the pipeline---set up the pipeline:
1.make sure ITEM_PIPELINES is defined in settings.py
the key is the dotted path to a class, which is defined in pipelines.py
2.in pipelines.py:
process_item (three arguments: self, item, spider) is mandatory.
__init__ (one argument: self) and close_spider (two arguments: self, spider) are optional
(1)__init__ --- open a json file
(2)process_item:
dict(item)---convert the item into a dict
json.dumps()----convert the dict into json data, two parameters:
dict(item), ensure_ascii=False (optional)
(3)close_spider --- close the file
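A pipeline sketch matching the steps above; the class name and the teacher.json filename are illustrative (the class path must match the ITEM_PIPELINES entry in settings.py):

import json

class TeacherJsonPipeline(object):
    def __init__(self):
        # (1) open a json file
        self.f = open('teacher.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # (2) dict(item) converts the item; json.dumps turns the dict into json text
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.f.write(line)
        return item

    def close_spider(self, spider):
        # (3) close the file
        self.f.close()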