1. Initialize the Docker container:
The container is created from the image mongoDB_wb and is used as the spider's database.
2. Create the Weibo spider:
sudo chmod 755 ./init/init_wb.sh
./init/init_wb.sh
init_wb.sh creates the necessary configuration and mapped directories for mongoDB_wb in the Docker container. The data is stored in Home/mongoDB_wb.
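For illustration, here is roughly what that setup looks like when expressed with the Docker SDK for Python (docker-py); the image and container names come from this guide, while the port mapping and mount paths are assumptions about what init_wb.sh configures:

```python
from pathlib import Path
import docker

client = docker.from_env()

# Create the database container from the mongoDB_wb image and map a host
# directory in for the database files; the port and the container-side
# path are assumptions about what init_wb.sh configures.
data_dir = Path.home() / "mongoDB_wb"   # "Home/mongoDB_wb" from the text
client.containers.run(
    "mongoDB_wb",
    name="wb_spider",
    ports={"27018/tcp": 27018},         # MongoDB is reached on 127.0.0.1:27018
    volumes={str(data_dir): {"bind": "/data/db", "mode": "rw"}},
    detach=True,
)
```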
3. Initialize the Weibo database:
Then, following the hint printed by init_wb.sh, execute the following command to run the script db_init_wb.js, which initializes the database:
sudo docker exec -it wb_spider mongo 127.0.0.1:27018 /etc/resource/db_init_wb.js
db_init_wb.js creates two users, admin (administrator) and weibo (regular user), as well as the collections user, user_post, tag_post, review, and error_log.
NB: You will be asked to input your own passwords when the admin and weibo users are created.
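For reference, here is a rough pymongo sketch of what db_init_wb.js presumably does; the exact role assignments below are assumptions, not taken from the script itself:

```python
from pymongo import MongoClient

client = MongoClient("127.0.0.1", 27018)

# Administrator on the admin database (role is an assumption).
client.admin.command(
    "createUser", "admin",
    pwd="your_admin_password",
    roles=[{"role": "userAdminAnyDatabase", "db": "admin"}],
)

# Regular user on the weibo database (role is an assumption).
db = client["weibo"]
db.command(
    "createUser", "weibo",
    pwd="your_weibo_password",
    roles=[{"role": "readWrite", "db": "weibo"}],
)

# Collections used by the spider.
for name in ("user", "user_post", "tag_post", "review", "error_log"):
    db.create_collection(name)
```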
4. Modify the params:
Edit ./wb_spider/database/DBConnector.py and set mongo_pass_wd in the __init__ function to your own password, which the spider uses to connect to the database.
def __init__(self):
    self.mongo_uri = "127.0.0.1:27018"  # address used to connect to the Docker container
    self.mongo_database = "weibo"       # database created by db_init_wb.js
    self.mongo_user_name = "weibo"      # the user in database 'weibo'
    self.mongo_pass_wd = "Your password."
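You can sanity-check these parameters with a short pymongo snippet. This is a minimal sketch assuming DBConnector ultimately connects through pymongo; it is illustrative, not the project's actual implementation:

```python
from pymongo import MongoClient

# Values matching DBConnector.__init__ above.
mongo_uri = "127.0.0.1:27018"
mongo_database = "weibo"
mongo_user_name = "weibo"
mongo_pass_wd = "your_password"  # the password you set during initialization

# authSource is 'weibo' because the user was created on that database.
client = MongoClient(
    f"mongodb://{mongo_user_name}:{mongo_pass_wd}@{mongo_uri}/"
    f"?authSource={mongo_database}"
)
db = client[mongo_database]
print(db.list_collection_names())  # should include user, user_post, ...
```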
If you initialized the spider following the instructions above, you can skip this guide on starting Docker.
If you initialized your spider earlier and want to use it directly, run the following commands. (If it has been a long time and you have forgotten the name of your spider container, use docker ps -a to list all containers and find the one you want.)
docker start wb_spider # start your spider container; here wb_spider is the container name
docker exec -it wb_spider /bin/bash # enter the bash command line of your spider container
After you enter the bash shell of your spider container, run the command below to open the mongo shell and connect to the database service:
mongo [IP address] # connect to MongoDB, for example: mongo 127.0.0.1:27018
Then specify the database you want to use, and authenticate with the username and password that were created when initializing the spider.
use weibo # use the database 'weibo'
db.auth('weibo', '123456') # db.auth( <username>, <password> )
Congratulations, you have connected to your database successfully! Now you can inspect and modify the data you crawled from the website. Here are some common MongoDB shell commands:
show dbs # list the databases
show collections # list the collections in the current database
db.createCollection(name) # create a collection
db.COLLECTION_NAME.drop() # drop a collection
db.COLLECTION_NAME.find(condition) # query documents in a collection
db.COLLECTION_NAME.remove(condition) # delete documents matching the condition
For more MongoDB shell commands, please refer to the MongoDB tutorial.
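If you prefer to work from Python rather than the mongo shell, the same operations are available through pymongo; a minimal sketch, reconnecting with the same parameters as in DBConnector above:

```python
from pymongo import MongoClient

# Same connection parameters as in DBConnector (see above).
client = MongoClient("mongodb://weibo:your_password@127.0.0.1:27018/?authSource=weibo")
db = client["weibo"]

# pymongo equivalents of the shell commands above.
print(client.list_database_names())   # show dbs (may require admin privileges)
print(db.list_collection_names())     # show collections
db.create_collection("tmp")           # db.createCollection(name)
db["tmp"].drop()                      # db.COLLECTION_NAME.drop()
docs = list(db["user"].find({}))      # db.COLLECTION_NAME.find(condition)
db["error_log"].delete_many({})       # db.COLLECTION_NAME.remove(condition)
```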
There are 4 spiders available now; the corresponding commands are as follows:
Spider Name | CMD | Function |
---|---|---|
wb_spider | scrapy crawl wb_spider -a uid=xxx\|xxx | Collects the target users' information and all their posts. The -a uid=xxx\|xxx parameter (the target users' UIDs, separated by \|) is required. |
user_spider | scrapy crawl user_spider -a uid=xxx\|xxx | Collects the target users' information; parameters are the same as wb_spider. |
user_post_spider | scrapy crawl user_post_spider -a uid=xxx\|xxx | Collects all posts of the target users; parameters are the same as wb_spider. |
tag_post_spider | scrapy crawl tag_post_spider -a uid=xxx\|xxx | Collects all posts under the target hashtag and the reviews of each post. |
NB: When you use tag_post_spider, you need to replace # with %23; in other words, the uid parameter should be %23[keyword], for example %23陕西 (the complete command is scrapy crawl tag_post_spider -a uid="%23陕西").
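If you build the command programmatically, the percent-encoding can be produced with Python's standard library; a small sketch, using the 陕西 example from above:

```python
from urllib.parse import quote

keyword = "陕西"                # example hashtag keyword
uid = quote("#") + keyword      # quote('#') yields '%23', so uid == '%23陕西'
print(f'scrapy crawl tag_post_spider -a uid="{uid}"')
```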
Using Scrapy's command-line tooling, we can execute the crawl command from Python to invoke the spider:
from scrapy.cmdline import execute

if __name__ == '__main__':
    # Build the crawl command and hand it to Scrapy's command-line entry point.
    spider_cmd = "scrapy crawl wb_spider -a uid=user1_id|user2_id|...."
    execute(spider_cmd.split())
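Note that execute() does not return; it exits the process when the crawl finishes. If you need to run a spider inside a larger program, Scrapy's CrawlerProcess is an alternative; a minimal sketch, assuming it is run from the project directory so that get_project_settings() can find settings.py:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == '__main__':
    process = CrawlerProcess(get_project_settings())
    # Spider arguments (-a key=value) are passed as keyword arguments.
    process.crawl('wb_spider', uid='user1_id|user2_id')
    process.start()  # blocks until the crawl has finished
```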
📦wb_spider
┣ 📂base
┃ ┣ 📜BaseConfig.py
┃ ┣ 📜BaseSpider.py
┃ ┣ 📜Pipeline.py
┃ ┗ 📜__init__.py
┣ 📂config
┃ ┣ 📜ReviewConfig.py
┃ ┣ 📜TagPostConfig.py
┃ ┣ 📜UserConfig.py
┃ ┣ 📜UserPostConfig.py
┃ ┗ 📜__init__.py
┣ 📂database
┃ ┣ 📜DBConnector.py
┃ ┗ 📜__init__.py
┣ 📂items
┃ ┣ 📜ErrorItem.py
┃ ┣ 📜LongtextItem.py
┃ ┣ 📜ReviewItem.py
┃ ┣ 📜TagPostItem.py
┃ ┣ 📜UserItem.py
┃ ┣ 📜UserPostItem.py
┃ ┗ 📜__init__.py
┣ 📂middlewares
┃ ┣ 📜DepthMiddleware.py
┃ ┣ 📜FakeUserAgentMiddleware.py
┃ ┣ 📜InitialMiddleware.py
┃ ┣ 📜ProxyMiddleware.py
┃ ┣ 📜RetryMiddleware.py
┃ ┗ 📜__init__.py
┣ 📂pipelines
┃ ┣ 📜ErrorPipeline.py
┃ ┣ 📜LongtextPipeline.py
┃ ┣ 📜ReviewPipeline.py
┃ ┣ 📜TagPostPipeline.py
┃ ┣ 📜UserPipeline.py
┃ ┣ 📜UserPostPipeline.py
┃ ┗ 📜__init__.py
┣ 📂resource
┃ ┗ 📜0.1.11.json
┣ 📂spiders
┃ ┣ 📂_spider
┃ ┃ ┣ 📜__init__.py
┃ ┃ ┣ 📜tag_post_spider.py
┃ ┃ ┣ 📜user_info_spider.py
┃ ┃ ┗ 📜user_post_spider.py
┃ ┣ 📜__init__.py
┃ ┗ 📜wb_spider.py
┣ 📜__init__.py
┗ 📜settings.py