Scrapy template: Minor code update and docs improvement (#236)
Showing 9 changed files with 84 additions and 45 deletions.
templates/python-scrapy/src/pipelines.py

@@ -1,14 +1,23 @@
-# Define your item pipelines here
-#
-# See the Scrapy documentation: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
-
-import scrapy
+"""
+Scrapy item pipelines module
+
+This module defines Scrapy item pipelines for scraped data. Item pipelines are processing components
+that handle the scraped items, typically used for cleaning, validating, and persisting data.
+
+For detailed information on creating and utilizing item pipelines, refer to the official documentation:
+http://doc.scrapy.org/en/latest/topics/item-pipeline.html
+"""
+
+from scrapy import Spider
 
 from .items import TitleItem
 
 
 class TitleItemPipeline:
+    """
+    This item pipeline defines processing steps for TitleItem objects scraped by spiders.
+    """
 
-    def process_item(self, item: TitleItem, spider: scrapy.Spider) -> TitleItem:
+    def process_item(self, item: TitleItem, spider: Spider) -> TitleItem:
         # Do something with the item here, such as cleaning it or persisting it to a database
         return item
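As a concrete illustration of what such a pipeline might do, here is a hedged sketch (not part of this commit) that cleans and validates the scraped title before returning the item. It assumes `TitleItem` declares `url` and `title` fields; `DropItem` is Scrapy's standard exception for discarding invalid items.

```python
# Illustrative sketch only -- not part of this commit. It assumes TitleItem
# declares `url` and `title` fields, as in the template's items module.
from scrapy import Spider
from scrapy.exceptions import DropItem

from .items import TitleItem


class TitleItemPipeline:
    """Clean and validate TitleItem objects before they are persisted."""

    def process_item(self, item: TitleItem, spider: Spider) -> TitleItem:
        # Normalize whitespace in the scraped title.
        item['title'] = (item.get('title') or '').strip()

        # Discard items without a usable title; Scrapy then skips the remaining pipelines.
        if not item['title']:
            raise DropItem(f'Missing title for {item.get("url")}')

        return item
```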
templates/python-scrapy/src/settings.py

@@ -1,16 +1,18 @@
-# Scrapy settings for this project
-#
-# For simplicity, this file contains only settings considered important or commonly used.
-#
-# You can find more settings consulting the documentation: http://doc.scrapy.org/en/latest/topics/settings.html
-# Do not change this since it would break the Scrapy <-> Apify interaction
-TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
+"""
+Scrapy settings module
+
+This module contains Scrapy settings for the project, defining various configurations and options.
+
+For more comprehensive details on Scrapy settings, refer to the official documentation:
+http://doc.scrapy.org/en/latest/topics/settings.html
+"""
 
-# The following settings can be updated by the user
+# You can update these options and add new ones
 BOT_NAME = 'titlebot'
-SPIDER_MODULES = ['src.spiders']
 DEPTH_LIMIT = 1  # This will be overridden by the `max_depth` option from Actor input if running using Apify
+ITEM_PIPELINES = {'src.pipelines.TitleItemPipeline': 123}
 LOG_LEVEL = 'INFO'
+NEWSPIDER_MODULE = 'src.spiders'
 REQUEST_FINGERPRINTER_IMPLEMENTATION = '2.7'
-ROBOTSTXT_OBEY = True  # obey robots.txt rules
-ITEM_PIPELINES = {'src.pipelines.TitleItemPipeline': 123}
+ROBOTSTXT_OBEY = True
+SPIDER_MODULES = ['src.spiders']
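The `DEPTH_LIMIT` comment notes that the value is overridden by the `max_depth` option from the Actor input when the spider runs on Apify. How the template wires this up is not shown in this diff; the following is a minimal sketch of one way to apply such an override with Scrapy's settings API, assuming an integer `max_depth` field in the Actor input.

```python
# Illustrative sketch only -- the template's actual Apify <-> Scrapy wiring may differ.
# It assumes the Actor input schema defines an integer `max_depth` field.
from apify import Actor
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings


async def get_settings_with_input_overrides() -> Settings:
    """Load settings.py and apply overrides taken from the Actor input.

    Call this from inside an `async with Actor:` block so the Actor is initialized.
    """
    actor_input = await Actor.get_input() or {}

    settings = get_project_settings()
    # DEPTH_LIMIT = 1 in settings.py is only a default; the `max_depth` input
    # takes precedence when the spider runs as an Apify Actor.
    settings.set('DEPTH_LIMIT', int(actor_input.get('max_depth', 1)), priority='cmdline')
    return settings
```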
templates/python-scrapy/src/spiders/__init__.py (14 changes: 9 additions & 5 deletions; file mode 100755 → 100644)
@@ -1,5 +1,9 @@
-# This package will contain the spiders of your Scrapy project
-#
-# Please refer to the Scrapy documentation for information on how to create and manage your spiders.
-#
-# https://docs.scrapy.org/en/latest/topics/spiders.html
+"""
+Scrapy spiders package
+
+This package contains the spiders for your Scrapy project. Spiders are the classes that define how to scrape
+and process data from websites.
+
+For detailed information on creating and utilizing spiders, refer to the official documentation:
+https://docs.scrapy.org/en/latest/topics/spiders.html
+"""