Multithreaded Web Crawler in C++

This project implements a multithreaded web crawler in C++. It downloads web pages, extracts hyperlinks, and recursively processes them up to a specified depth, crawling multiple seed URLs in parallel with C++ threads.


Features

  • HTML Downloading: Uses libcurl to fetch and save web pages locally.
  • Hyperlink Extraction: Extracts links from HTML files using regular expressions.
  • Link Validation: Validates hyperlinks to ensure they are well-formed.
  • Recursive Crawling: Processes web pages up to a depth of 4.
  • Multithreading: Handles multiple URLs concurrently using C++ threads.
  • Performance Metrics: Reports time taken for processing each web page.
  • Rate Limiting: Introduces delays to prevent server overload.

Requirements

  • C++17 or later
  • libcurl development library (e.g., libcurl4-openssl-dev)

Installation

  1. Clone the repository:

    git clone <your-repo-url>
    cd <your-repo-folder>
  2. Install libcurl:

    sudo apt-get update
    sudo apt-get install libcurl4-openssl-dev
  3. Compile the program:

    g++ -std=c++17 -pthread -o web_crawler main.cpp -lcurl

Usage

  1. Run the program:

    ./web_crawler
  2. Example Output:

    Time taken to generate thread 1 is: 2.3 seconds
    Thread_id: 1    Link: https://www.iitism.ac.in/
    
    Time taken to generate thread 2 is: 1.7 seconds
    Thread_id: 2    Link: https://en.wikipedia.org/wiki/Main_page
    
    Time taken to generate thread 3 is: 2.1 seconds
    Thread_id: 3    Link: https://codeforces.com/
    

Code Overview

get_page:

Fetches and saves the HTML content of a URL to a file using libcurl.
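
The snippet below is a minimal sketch of how such a fetch typically looks with libcurl's easy interface; the function name get_page comes from the overview above, but the exact signature and error handling in main.cpp may differ.

    #include <cstdio>
    #include <string>
    #include <curl/curl.h>

    // libcurl write callback: append each received chunk to the open file.
    static size_t write_to_file(char *data, size_t size, size_t nmemb, void *userp) {
        return std::fwrite(data, size, nmemb, static_cast<std::FILE *>(userp));
    }

    // Fetch `url` and save the response body to `filename`.
    bool get_page(const std::string &url, const std::string &filename) {
        CURL *curl = curl_easy_init();
        if (!curl) return false;

        std::FILE *out = std::fopen(filename.c_str(), "wb");
        if (!out) { curl_easy_cleanup(curl); return false; }

        curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
        curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);   // follow redirects
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_to_file);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);

        CURLcode res = curl_easy_perform(curl);
        std::fclose(out);
        curl_easy_cleanup(curl);
        return res == CURLE_OK;   // true only if the transfer succeeded
    }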

extract_hyperlinks:

Extracts all hyperlinks from the saved HTML file using regular expressions.
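
As a rough illustration, the sketch below pulls href attribute values out of the saved file with std::regex; the actual pattern and return type used in main.cpp are assumptions here.

    #include <fstream>
    #include <regex>
    #include <sstream>
    #include <string>
    #include <vector>

    // Return every href="..." value found in the saved HTML file.
    std::vector<std::string> extract_hyperlinks(const std::string &filename) {
        std::ifstream in(filename);
        std::stringstream buffer;
        buffer << in.rdbuf();                  // read the whole file into memory
        const std::string html = buffer.str();

        // Simple attribute pattern, not a full HTML parser.
        static const std::regex href_re("href\\s*=\\s*\"([^\"]+)\"", std::regex::icase);

        std::vector<std::string> links;
        for (auto it = std::sregex_iterator(html.begin(), html.end(), href_re);
             it != std::sregex_iterator(); ++it)
            links.push_back((*it)[1].str());   // capture group 1 is the URL itself
        return links;
    }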

cleanUp:

Validates and cleans up extracted links.
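
A possible take on this step, assuming it trims whitespace, keeps only absolute http(s) URLs, and drops duplicates; the real rules in main.cpp may be stricter or looser.

    #include <algorithm>
    #include <string>
    #include <vector>

    // Keep only well-formed absolute URLs, without duplicates.
    std::vector<std::string> cleanUp(const std::vector<std::string> &links) {
        std::vector<std::string> valid;
        for (std::string link : links) {
            // Trim surrounding whitespace.
            link.erase(0, link.find_first_not_of(" \t\r\n"));
            link.erase(link.find_last_not_of(" \t\r\n") + 1);

            const bool is_http = link.rfind("http://", 0) == 0 ||
                                 link.rfind("https://", 0) == 0;
            if (is_http && std::find(valid.begin(), valid.end(), link) == valid.end())
                valid.push_back(link);
        }
        return valid;
    }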

dfs_crawler:

Recursively crawls web pages to extract and process hyperlinks. Includes depth control and multithreading support.
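
Sketched below is one way the recursion and depth limit could fit together, reusing the helpers sketched above; the visited-set bookkeeping, file naming, and the 500 ms delay are assumptions, not the project's exact logic.

    #include <chrono>
    #include <functional>
    #include <mutex>
    #include <set>
    #include <string>
    #include <thread>
    #include <vector>

    // Helpers sketched in the previous sections.
    bool get_page(const std::string &url, const std::string &filename);
    std::vector<std::string> extract_hyperlinks(const std::string &filename);
    std::vector<std::string> cleanUp(const std::vector<std::string> &links);

    const int MAX_DEPTH = 4;                // depth limit mentioned under Features
    std::set<std::string> visited;          // shared across crawler threads
    std::mutex visited_mtx;

    void dfs_crawler(const std::string &url, int depth) {
        if (depth > MAX_DEPTH) return;

        {   // Skip URLs that any thread has already crawled.
            std::lock_guard<std::mutex> lock(visited_mtx);
            if (!visited.insert(url).second) return;
        }

        // One file per URL, named by hash to avoid collisions between threads.
        const std::string filename =
            std::to_string(std::hash<std::string>{}(url)) + ".html";
        if (!get_page(url, filename)) return;

        // Basic rate limiting so servers are not hammered.
        std::this_thread::sleep_for(std::chrono::milliseconds(500));

        for (const std::string &link : cleanUp(extract_hyperlinks(filename)))
            dfs_crawler(link, depth + 1);
    }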

main:

Initializes threads for crawling multiple URLs concurrently.
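
As a sketch of that setup, main might look roughly like this; the seed URLs are taken from the example output above, and the per-thread timing printed by the real program is omitted here.

    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <thread>
    #include <vector>
    #include <curl/curl.h>

    void dfs_crawler(const std::string &url, int depth);   // sketched above

    int main() {
        curl_global_init(CURL_GLOBAL_DEFAULT);   // initialise libcurl once, before any thread starts

        const std::vector<std::string> seeds = {
            "https://www.iitism.ac.in/",
            "https://en.wikipedia.org/wiki/Main_page",
            "https://codeforces.com/"
        };

        std::vector<std::thread> threads;
        for (std::size_t i = 0; i < seeds.size(); ++i) {
            threads.emplace_back(dfs_crawler, seeds[i], 1);   // one crawler thread per seed URL
            std::cout << "Thread_id: " << i + 1 << "\tLink: " << seeds[i] << "\n";
        }
        for (auto &t : threads) t.join();        // wait for all crawls to finish

        curl_global_cleanup();
        return 0;
    }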


Project Structure

|-- main.cpp          # Main program file
|-- README.md         # Project documentation

To-Do

  • Add error handling for network issues.
  • Store extracted links in a file or database for further processing.
  • Introduce dynamic depth control via user input.
  • Improve rate-limiting logic.

Contributing

Contributions are welcome! If you have ideas for improvements, please submit a pull request or open an issue.


Contact

If you have any questions or suggestions, feel free to reach out!
