
Web Crawler for URL Extraction

Overview

This repository contains a Python-based web crawler designed to extract URLs based on specified parameters. The primary purpose is to provide a convenient tool for gathering URLs related to medical content, with the ability to filter by categories, geography, and date range.

Table of Contents

  • Features
  • Getting Started
  • Usage
  • Parameters
  • Output

Features

  • Search for relevant URLs on Google (see the sketch after this list).
  • Filter URLs based on primary category, secondary category, geography, and date range.
  • Output the results to a CSV file.
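As a rough illustration of the search step, the snippet below fetches candidate URLs for a composed query. It assumes the third-party googlesearch-python package (pip install googlesearch-python); the repository's actual search mechanism may differ.

# Minimal sketch, assuming the googlesearch-python package.
from googlesearch import search

# Fetch candidate result URLs for a query built from the filter parameters.
for url in search("Medical Journal Orthopedic India 2022", num_results=10):
    print(url)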

Getting Started

Prerequisites

Make sure you have the following installed:

  • Python (version X.X.X)
  • The packages listed in requirements.txt (installed in step 3 below)

Installation

  1. Clone the repository:

    git clone https://github.com/Swrnv-qc/web-crawler.git
  2. Navigate to the project directory:

    cd web-crawler
  3. Install dependencies:

    pip install -r requirements.txt

Usage

To run the web crawler, execute the following command:

python crawler.py --parameters parameters.json

Parameters

Provide the input parameters as a JSON object. Here's an example:

{
  "primary_category": "Medical Journal",
  "secondary_category": "Orthopedic",
  "geography": "India",
  "date_range": "2022"
}

Adjust the parameters according to your specific requirements.
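The snippet below is a minimal sketch of how crawler.py might read this file and turn the filters into a single search query. The function names here are assumptions for illustration, not the repository's actual API.

# Minimal sketch: load parameters.json and compose a query string.
import argparse
import json

def load_parameters(path):
    # Read the JSON parameter file shown above.
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def build_query(params):
    # Join the non-empty filter fields into one search query.
    fields = ["primary_category", "secondary_category", "geography", "date_range"]
    return " ".join(str(params[f]) for f in fields if params.get(f))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--parameters", required=True, help="Path to the JSON parameter file")
    args = parser.parse_args()
    print(build_query(load_parameters(args.parameters)))  # e.g. "Medical Journal Orthopedic India 2022"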

Output

The crawler writes the extracted URLs to a CSV file; additional relevant fields may be included in the output where appropriate.
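As a sketch of that last step, assuming the crawler collects (url, title) pairs; the actual column set may differ:

# Minimal sketch of the CSV output step; the column names are assumptions.
import csv

def write_results(rows, path="results.csv"):
    # Write one row per extracted URL.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "title"])
        writer.writeheader()
        writer.writerows(rows)

write_results([{"url": "https://example.com/article", "title": "Example article"}])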
