PDF Comparison Tool Documentation

Overview

This Python script compares two PDF documents containing resolutions from the Conselho Superior de Ensino, Pesquisa e Extensão (Consepe) of the Universidade Federal da Paraíba (UFPB). The two documents differ in content and purpose; the script identifies the changes between them and generates an HTML diff view. It uses natural language processing techniques to compare text segments and visualize the differences.

Dependencies

PyMuPDF (fitz)
Transformers
PyTorch
scikit-learn
NumPy
difflib (Python standard library)
re (Python standard library)
html (Python standard library)

Main Components

1. Device Configuration

def define_device():
    # ... (function implementation)

This function determines the available hardware (CUDA GPU, Apple Silicon, or CPU) for processing.
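
The selection order can be sketched as follows. This is a minimal illustration, not the script's actual implementation: the name define_device_sketch and the choice to return a plain string are assumptions, and the fallback to "cpu" when PyTorch is not installed exists only so the snippet runs anywhere.

```python
def define_device_sketch():
    """Return "cuda", "mps", or "cpu" depending on what is available.

    Hypothetical helper: the real define_device() may return a
    torch.device object or check hardware in a different order.
    """
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate; use the CPU.
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU

    # The MPS backend (Apple Silicon) only exists in newer PyTorch builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


if __name__ == "__main__":
    print(define_device_sketch())
```

Checking CUDA first and falling back to the CPU last means the same script can run unchanged on a GPU server, a recent Mac, or a machine with no accelerator.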

2. Model Loading

def load_model(model_name='sentence-transformers/all-MiniLM-L6-v2', device=None):
    # ... (function implementation)

Loads a pre-trained language model and tokenizer for text embedding.

3. PDF Text Extraction

def extract_text_from_pdf(pdf_path):
    # ... (function implementation)

Extracts text content from a PDF file.

4. Text Segmentation

def segment_text(text):
    # ... (function implementation)

Segments the extracted text into chunks based on article numbers.
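
One way to split on article numbers is sketched below. It assumes that each article starts on its own line with "Art." followed by a number (for example "Art. 1º"), which is common in Brazilian resolutions but is not confirmed by this README; the function name segment_text_sketch and the regular expression are illustrative only.

```python
import re

# Assumed article marker: "Art." plus a number at the start of a line.
ARTICLE_START = re.compile(r"^[ \t]*Art\.\s*\d+", re.MULTILINE)


def segment_text_sketch(text):
    """Split text into a preamble (if any) plus one chunk per article."""
    starts = [m.start() for m in ARTICLE_START.finditer(text)]
    if not starts:
        # No article markers found: keep the whole text as one chunk.
        stripped = text.strip()
        return [stripped] if stripped else []

    chunks = []
    preamble = text[:starts[0]].strip()
    if preamble:
        chunks.append(preamble)

    # Each article runs from its marker to the next marker (or end of text).
    bounds = starts + [len(text)]
    for begin, end in zip(bounds, bounds[1:]):
        chunks.append(text[begin:end].strip())
    return chunks


if __name__ == "__main__":
    sample = "RESOLUTION 01/2020\nArt. 1 First rule.\ncontinued.\nArt. 2 Second rule."
    for chunk in segment_text_sketch(sample):
        print(repr(chunk))
```

Because the marker must appear at the start of a line, a reference such as "see Art. 5" in the middle of a sentence does not start a new chunk.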

5. Embedding Computation

def compute_embeddings(text_chunks, tokenizer, model, device):
    # ... (function implementation)

Computes embeddings for each text chunk using the loaded model.

6. Similarity Calculation

def compute_similarity_matrix(embeddings1, embeddings2):
    # ... (function implementation)

Computes the cosine similarity between embeddings of original and updated text chunks.
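
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The dependency-free sketch below shows the shape of the result: one row per original chunk and one column per updated chunk. The script itself lists scikit-learn and NumPy, so it presumably relies on those instead; the function name and the plain-list representation here are assumptions made for illustration.

```python
import math


def cosine_similarity_matrix_sketch(embeddings1, embeddings2):
    """Return matrix[i][j] = cosine similarity of embeddings1[i] and embeddings2[j]."""

    def norm(vector):
        return math.sqrt(sum(x * x for x in vector))

    matrix = []
    for a in embeddings1:
        norm_a = norm(a)
        row = []
        for b in embeddings2:
            norm_b = norm(b)
            dot = sum(x * y for x, y in zip(a, b))
            # Treat a zero vector as having no similarity to anything.
            row.append(dot / (norm_a * norm_b) if norm_a and norm_b else 0.0)
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    original = [[1.0, 0.0], [1.0, 1.0]]
    updated = [[1.0, 0.0], [0.0, 1.0]]
    for row in cosine_similarity_matrix_sketch(original, updated):
        print(row)
```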

7. Match Finding

def find_matches(similarity_matrix, threshold=0.9):
    # ... (function implementation)

Identifies matching chunks based on similarity scores.
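
A simple matching rule is sketched below: for each original chunk, take the updated chunk with the highest similarity and accept it only if the score reaches the threshold. This is one plausible reading of the description, not the confirmed behavior of find_matches(); the function name and the (original_index, updated_index or None, score) tuple format are assumptions.

```python
def find_matches_sketch(similarity_matrix, threshold=0.9):
    """Pair each original chunk with its best-scoring updated chunk.

    Returns one (original_index, updated_index, score) tuple per row.
    updated_index is None when no score reaches the threshold, which
    would correspond to a removed chunk.
    """
    matches = []
    for i, row in enumerate(similarity_matrix):
        if not row:
            # No updated chunks at all: nothing can match.
            matches.append((i, None, 0.0))
            continue
        best_j = max(range(len(row)), key=row.__getitem__)
        score = row[best_j]
        if score >= threshold:
            matches.append((i, best_j, score))
        else:
            matches.append((i, None, score))
    return matches


if __name__ == "__main__":
    matrix = [[0.95, 0.10], [0.20, 0.50]]
    print(find_matches_sketch(matrix))
```

Note that this greedy rule allows two original chunks to match the same updated chunk; a stricter variant would enforce one-to-one assignment.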

8. HTML Diff Generation

def generate_diff_html(original_chunks, updated_chunks, matches, output_path='diff_view.html'):
    # ... (function implementation)

Generates an HTML file visualizing the differences between the original and updated PDFs.
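
The line-by-line part of such a view can be sketched with the standard library alone, using difflib.ndiff to classify lines and html.escape to keep the PDF text from being interpreted as markup. The function name, tag choices, and colors below are illustrative assumptions; the markup that generate_diff_html() actually produces is not described in this README.

```python
import difflib
import html


def diff_chunk_to_html_sketch(original, updated):
    """Render a line-level diff of two text chunks as an HTML fragment."""
    rendered = []
    for line in difflib.ndiff(original.splitlines(), updated.splitlines()):
        marker, text = line[:2], html.escape(line[2:])
        if marker == "- ":
            # Line present only in the original chunk.
            rendered.append(f'<del style="background:#fdd">{text}</del>')
        elif marker == "+ ":
            # Line present only in the updated chunk.
            rendered.append(f'<ins style="background:#dfd">{text}</ins>')
        elif marker == "  ":
            # Line common to both chunks.
            rendered.append(f"<span>{text}</span>")
        # Lines starting with "? " are ndiff's intraline hints; skip them.
    return "<br>\n".join(rendered)


if __name__ == "__main__":
    old = "Art. 1 The limit is a < b.\nUnchanged line."
    new = "Art. 1 The limit is a > b.\nUnchanged line."
    print(diff_chunk_to_html_sketch(old, new))
```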

Main Execution Flow

The main() function orchestrates the entire process:

1. Configure the device
2. Load the model and tokenizer
3. Extract text from both PDFs
4. Segment the extracted text
5. Compute embeddings for both texts
6. Calculate similarity matrix
7. Find matches based on similarity
8. Generate and save the HTML diff view

Usage

Install the Python dependencies:

pip install -r requirements.txt

Then run the script:

python main.py

Output

The script generates an HTML file named "diff_view.html" which provides a visual representation of the differences between the two PDFs. The output includes:

Removed chunks (highlighted in red)
Updated chunks (with line-by-line differences)
Unchanged chunks
Similarity scores for each chunk
Summary of changes (lines added/removed)

This HTML file can be opened in any web browser for easy viewing and analysis of the differences between the two PDF documents.
