This repository contains a Python-based web crawler designed to extract URLs based on specified parameters. The primary purpose is to provide a convenient tool for gathering URLs related to medical content, with the ability to filter by categories, geography, and date range.
- Search for relevant URLs on Google.
- Filter URLs based on primary category, secondary category, geography, and date range.
- Output the results to a CSV file.
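As a sketch of how the filters above could be combined into a single search query (the function name and query format are illustrative assumptions, not the crawler's actual implementation):

```python
def build_query(params: dict) -> str:
    """Combine the filter parameters into one Google search string.

    The keys mirror the JSON example in this README; the flat
    space-joined query format is an assumption for illustration.
    """
    parts = [
        params.get("primary_category", ""),
        params.get("secondary_category", ""),
        params.get("geography", ""),
        params.get("date_range", ""),
    ]
    # Skip any filters that were left empty.
    return " ".join(p for p in parts if p)

# build_query({"primary_category": "Medical Journal", "geography": "India"})
# returns "Medical Journal India"
```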
Make sure you have the following installed:
- Python (version X.X.X)
- Any additional dependencies listed in `requirements.txt`
- Clone the repository:

  ```shell
  git clone https://github.com/your-username/web-crawler.git
  ```

- Navigate to the project directory:

  ```shell
  cd web-crawler
  ```

- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```
To run the web crawler, execute the following command:

```shell
python crawler.py --parameters parameters.json
```
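A minimal sketch of how `crawler.py` might handle the `--parameters` flag shown above, using the standard `argparse` and `json` modules (the function name and help text are assumptions; the actual script may differ):

```python
import argparse
import json


def load_parameters() -> dict:
    """Parse the --parameters flag and load the JSON file it points to."""
    parser = argparse.ArgumentParser(description="URL crawler")
    parser.add_argument(
        "--parameters",
        required=True,
        help="Path to a JSON file of search filters",
    )
    args = parser.parse_args()
    with open(args.parameters, encoding="utf-8") as f:
        return json.load(f)
```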
Provide the input parameters as a JSON object. Here's an example:

```json
{
  "primary_category": "Medical Journal",
  "secondary_category": "Orthopedic",
  "geography": "India",
  "date_range": "2022"
}
```
Adjust the parameters according to your specific requirements.
The crawler will generate a CSV file containing the extracted URLs; additional relevant fields may be included alongside each URL.
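The CSV output described above could be written with Python's standard `csv` module along these lines (the row structure and file name are assumptions; check the generated file for the actual layout):

```python
import csv


def write_results(rows: list[dict], path: str = "results.csv") -> None:
    """Write extracted URLs (plus any extra fields) to a CSV file.

    Each row is a dict such as {"url": ..., "title": ...}; the header
    is derived from the keys of the first row.
    """
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```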