This project implements a multithreaded web crawler using C++. It downloads web pages, extracts hyperlinks, and recursively processes them up to a specified depth. The program is designed to efficiently handle multiple URLs in parallel using multithreading.
- HTML Downloading: Uses `libcurl` to fetch and save web pages locally.
- Hyperlink Extraction: Extracts links from HTML files using regular expressions.
- Link Validation: Validates hyperlinks to ensure they are well-formed.
- Recursive Crawling: Processes web pages up to a depth of 4.
- Multithreading: Handles multiple URLs concurrently using C++ threads.
- Performance Metrics: Reports time taken for processing each web page.
- Rate Limiting: Introduces delays to prevent server overload.
- C++17 or later
- `libcurl` library installed
- Clone the repository:

  ```bash
  git clone <your-repo-url>
  cd <your-repo-folder>
  ```
- Install `libcurl`:

  ```bash
  sudo apt-get update
  sudo apt-get install libcurl4-openssl-dev
  ```
- Compile the program:

  ```bash
  g++ -std=c++17 -pthread -o web_crawler main.cpp -lcurl
  ```
- Run the program:

  ```bash
  ./web_crawler
  ```
- Example Output:

  ```
  Time taken to generate thread 1 is: 2.3 seconds
  Thread_id: 1 Link: https://www.iitism.ac.in/
  Time taken to generate thread 2 is: 1.7 seconds
  Thread_id: 2 Link: https://en.wikipedia.org/wiki/Main_page
  Time taken to generate thread 3 is: 2.1 seconds
  Thread_id: 3 Link: https://codeforces.com/
  ```
Fetches and saves the HTML content of a URL to a file using `libcurl`.
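A minimal sketch of how such a download step can be written with the `libcurl` easy interface; the function name `download_html` and the write callback are illustrative, not necessarily the names used in `main.cpp`:

```cpp
#include <curl/curl.h>
#include <cstdio>
#include <string>

// Write callback: append the bytes received by libcurl to an open file handle.
static size_t write_to_file(char* ptr, size_t size, size_t nmemb, void* userdata) {
    std::FILE* fp = static_cast<std::FILE*>(userdata);
    return std::fwrite(ptr, size, nmemb, fp);
}

// Download `url` and save the raw HTML to `filename`. Returns true on success.
bool download_html(const std::string& url, const std::string& filename) {
    CURL* curl = curl_easy_init();
    if (!curl) return false;

    std::FILE* fp = std::fopen(filename.c_str(), "wb");
    if (!fp) { curl_easy_cleanup(curl); return false; }

    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_to_file);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, fp);
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);  // follow redirects

    CURLcode res = curl_easy_perform(curl);

    std::fclose(fp);
    curl_easy_cleanup(curl);
    return res == CURLE_OK;
}
```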
Extracts all hyperlinks from the saved HTML file using regular expressions.
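A possible implementation using `std::regex`, assuming the page has been saved to a local file; the pattern below only matches double-quoted `href` attributes and is a simplification rather than the project's exact expression:

```cpp
#include <fstream>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Read the saved HTML file and collect everything inside href="...".
std::vector<std::string> extract_links(const std::string& filename) {
    std::ifstream in(filename);
    std::stringstream buffer;
    buffer << in.rdbuf();
    const std::string html = buffer.str();

    std::vector<std::string> links;
    // Simple pattern for double-quoted href attributes; not a full HTML parser.
    std::regex href_re("href\\s*=\\s*\"([^\"]+)\"", std::regex::icase);
    for (auto it = std::sregex_iterator(html.begin(), html.end(), href_re);
         it != std::sregex_iterator(); ++it) {
        links.push_back((*it)[1].str());
    }
    return links;
}
```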
Validates and cleans up extracted links.
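One way this validation might look, assuming the crawler only follows absolute `http`/`https` URLs (the helper name `is_valid_link` is illustrative):

```cpp
#include <string>

// Keep only absolute http/https links; reject fragments, mailto:, javascript:, etc.
bool is_valid_link(const std::string& link) {
    if (link.rfind("http://", 0) != 0 && link.rfind("https://", 0) != 0)
        return false;
    if (link.find(' ') != std::string::npos)  // drop obviously malformed URLs
        return false;
    return true;
}
```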
Recursively crawls web pages to extract and process hyperlinks. Includes depth control and multithreading support.
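A rough sketch of the recursive step, reusing the hypothetical helpers above; the depth cap of 4 matches the feature list, and a short sleep stands in for the rate limiter:

```cpp
#include <chrono>
#include <set>
#include <string>
#include <thread>

// Hypothetical recursive crawler: download the page, extract and validate its
// links, then recurse on each link until max_depth is reached.
void crawl(const std::string& url, int depth, int max_depth,
           std::set<std::string>& visited) {
    if (depth > max_depth || visited.count(url))
        return;
    visited.insert(url);

    // Simplistic file naming for illustration; a real run needs a unique name per page.
    const std::string file = "page_depth" + std::to_string(depth) + ".html";
    if (!download_html(url, file))
        return;

    std::this_thread::sleep_for(std::chrono::milliseconds(500));  // rate limiting

    for (const auto& link : extract_links(file)) {
        if (is_valid_link(link))
            crawl(link, depth + 1, max_depth, visited);
    }
}
```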
Initializes threads for crawling multiple URLs concurrently.
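The thread setup could look roughly like this, with one `std::thread` per seed URL and a join at the end; the seed list, timing, and output format are illustrative and rely on the `crawl` sketch above:

```cpp
#include <chrono>
#include <iostream>
#include <set>
#include <string>
#include <thread>
#include <vector>

int main() {
    // Seed URLs, one per worker thread (illustrative).
    std::vector<std::string> seeds = {
        "https://www.iitism.ac.in/",
        "https://en.wikipedia.org/wiki/Main_page",
        "https://codeforces.com/"
    };

    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < seeds.size(); ++i) {
        workers.emplace_back([url = seeds[i], i]() {
            auto start = std::chrono::steady_clock::now();
            std::set<std::string> visited;   // per-thread visited set
            crawl(url, 0, 4, visited);       // depth limit of 4
            std::chrono::duration<double> elapsed =
                std::chrono::steady_clock::now() - start;
            // Note: concurrent writes to std::cout may interleave.
            std::cout << "Thread_id: " << i + 1 << " Link: " << url
                      << " finished in " << elapsed.count() << " seconds\n";
        });
    }

    for (auto& t : workers) t.join();  // wait for all crawls to complete
}
```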
```
|-- main.cpp    # Main program file
|-- README.md   # Project documentation
```
- Add error handling for network issues.
- Store extracted links in a file or database for further processing.
- Introduce dynamic depth control via user input.
- Improve rate-limiting logic.
Contributions are welcome! If you have ideas for improvements, please submit a pull request or open an issue.
If you have any questions or suggestions, feel free to reach out!