© HSE University
Tested with Python 3.9 in a virtual environment:
$ python3.9 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
The spider scrapes news posts from the official HSE website: https://www.hse.ru/news/
Start the spider from its root directory with the following command:
$ scrapy crawl news
To save the scraped data to a file:
$ scrapy crawl news -o file_name.csv # or file_name.json
CSS selectors are used to extract data from web pages.
Sample log:
To collect the number of views per post, uncomment the relevant part of the spider's code and set "ROBOTSTXT_OBEY" to False in the spider settings. Whether to use it is up to you, but note that this is disallowed by the HSE robots.txt.
PostgreSQL and SQLAlchemy are used in this project.
The database is designed so that its tables have one-to-many and many-to-many relationships.
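As a rough illustration of the two kinds of relationship, here is a minimal sketch. It deliberately uses the standard library's sqlite3 instead of the project's PostgreSQL + SQLAlchemy stack so that it runs anywhere, and every table and column name in it (author, post, tag, post_tag) is hypothetical rather than taken from the project's models.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE author (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE post (
    id        INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    author_id INTEGER NOT NULL REFERENCES author(id)  -- one author, many posts
);
CREATE TABLE tag (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- Association table: a post can have many tags, a tag can belong to many posts.
CREATE TABLE post_tag (
    post_id INTEGER NOT NULL REFERENCES post(id),
    tag_id  INTEGER NOT NULL REFERENCES tag(id),
    PRIMARY KEY (post_id, tag_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up rows, just enough to exercise both relationships.
conn.execute("INSERT INTO author (id, name) VALUES (1, 'A. Author')")
conn.executemany(
    "INSERT INTO post (id, title, author_id) VALUES (?, ?, 1)",
    [(1, "First post"), (2, "Second post")],
)
conn.executemany(
    "INSERT INTO tag (id, name) VALUES (?, ?)",
    [(1, "science"), (2, "education")],
)
conn.executemany(
    "INSERT INTO post_tag (post_id, tag_id) VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 1)],
)

# One-to-many: count the posts that belong to a single author.
posts_by_author = conn.execute(
    "SELECT COUNT(*) FROM post WHERE author_id = 1"
).fetchone()[0]

# Many-to-many: list the tags of one post through the association table.
tags_of_first_post = [
    row[0]
    for row in conn.execute(
        "SELECT tag.name FROM tag "
        "JOIN post_tag ON tag.id = post_tag.tag_id "
        "WHERE post_tag.post_id = 1 ORDER BY tag.name"
    )
]
```

In SQLAlchemy the same shape would typically be expressed with a foreign key column for the one-to-many side and an association table for the many-to-many side.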
Configure your database settings in a separate file (secrets.py):
postgresql = "{dialect}+{driver}://{user}:{password}@{host}:{port}/{db_name}".format(
    dialect="xxxx",
    driver="xxxx",
    user="xxxx",
    password="xxxx",
    host="xxxx",
    port="xxxx",
    db_name="xxxx",
)
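For orientation, a filled-in version of the template might look like the sketch below. All values are placeholders chosen for illustration (a local PostgreSQL server reached through the psycopg2 driver, with a made-up user, password and database name); none of them come from the project, so substitute your own.

```python
# Hypothetical values for illustration only; replace them with your own settings.
postgresql = "{dialect}+{driver}://{user}:{password}@{host}:{port}/{db_name}".format(
    dialect="postgresql",
    driver="psycopg2",
    user="scraper",
    password="change-me",
    host="localhost",
    port="5432",
    db_name="hse_news",
)

print(postgresql)
# prints postgresql+psycopg2://scraper:change-me@localhost:5432/hse_news
```

The resulting string has the form SQLAlchemy expects for a database URL.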
Apart from saving scraped items to the DB, pipelines take care of dropping duplicates and posts without any text.
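A pipeline of this kind could look roughly like the sketch below. It is not the project's actual pipeline: the class name, the "url" and "text" item fields, and the choice of the URL as the duplicate key are all assumptions. The fallback DropItem class exists only so the sketch runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch runs without Scrapy installed
    class DropItem(Exception):
        """Stand-in for scrapy.exceptions.DropItem."""


class CleanPostsPipeline:
    """Hypothetical pipeline: drops posts with no text and repeated URLs."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        # Treat missing, empty and whitespace-only text the same way.
        text = (item.get("text") or "").strip()
        if not text:
            raise DropItem("post has no text")
        # Assumption: the post URL identifies a post uniquely.
        url = item.get("url")
        if url in self.seen_urls:
            raise DropItem(f"duplicate post: {url}")
        self.seen_urls.add(url)
        return item


def run_pipeline(items):
    """Feed plain dicts through the pipeline, as Scrapy would do per item."""
    pipeline = CleanPostsPipeline()
    kept = []
    for item in items:
        try:
            kept.append(pipeline.process_item(item, spider=None))
        except DropItem:
            pass  # Scrapy would log the drop and move on
    return kept


sample = [
    {"url": "a", "text": "first"},
    {"url": "a", "text": "same post again"},
    {"url": "b", "text": "   "},
    {"url": "c", "text": "third"},
]
kept_urls = [item["url"] for item in run_pipeline(sample)]
```

Here only the posts with URLs "a" and "c" survive: the second item is a duplicate and the third has no text.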
The data set I use is a CSV spreadsheet with 558 scraped posts.
To build a social graph of the people mentioned in HSE news posts, we'll use spaCy for named-entity recognition (NER) and NetworkX to connect nodes with edges.
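One possible way to derive edges is sketched below, under the assumption that two people are linked whenever they are mentioned in the same post, with the number of shared posts as the edge weight. The per-post name lists stand in for spaCy NER output and are made up; the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(names_per_post):
    """Count how often each pair of people is mentioned in the same post."""
    edges = Counter()
    for names in names_per_post:
        # sorted(set(...)) removes repeats within a post and fixes the pair order
        for a, b in combinations(sorted(set(names)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical NER output: one list of person names per post.
names_per_post = [
    ["Person A", "Person B", "Person C"],
    ["Person B", "Person A", "Person A"],
    ["Person C"],
]

edges = cooccurrence_edges(names_per_post)
# (source, target, weight) triples, the shape NetworkX accepts for weighted edges
weighted = [(a, b, w) for (a, b), w in edges.items()]
```

With NetworkX installed, the triples can be loaded with `networkx.Graph().add_weighted_edges_from(weighted)`.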
Display top-mentioned names:
Top 10 mentioned people:
('Ярослав Кузьминов', 62)
('Ярослав Кузьминова', 26)
('Исак Фрумин', 16)
('Сергей Рощин', 14)
('Лилия Овчаров', 14)
('Алексей Иванов', 13)
('Владимир Путин', 13)
('Леонид Гохберг', 12)
('Исиэз Ниу', 11)
('Валерий Касамара', 11)
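A listing in this (name, count) form can be produced with collections.Counter. The sketch below uses a made-up list of mentions rather than the project's data and only shows the counting step.

```python
from collections import Counter

# Hypothetical flat list of extracted person names, one entry per mention.
mentions = ["Person A", "Person B", "Person A", "Person C", "Person A", "Person B"]

# most_common(n) returns the n most frequent names as (name, count) tuples.
top = Counter(mentions).most_common(2)

for entry in top:
    print(entry)
# prints:
# ('Person A', 3)
# ('Person B', 2)
```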
For reference, see:
- Scrapy - https://docs.scrapy.org/
- SQLAlchemy - https://docs.sqlalchemy.org/
- spaCy - https://spacy.io/usage/spacy-101
- NetworkX - https://networkx.org/