This Python script compares two PDF documents containing resolutions from the Conselho Superior de Ensino, Pesquisa e Extensão (Consepe) at the Universidade Federal da Paraíba (UFPB). The two documents differ in content, so the script identifies the changes between them and generates an HTML diff view. It uses natural language processing techniques to compare text segments and visualize the differences.
The script relies on the following libraries and standard-library modules:
- PyMuPDF (fitz)
- Transformers
- PyTorch
- scikit-learn
- NumPy
- difflib
- re
- html
def define_device():
# ... (function implementation)
This function determines the available hardware (CUDA GPU, Apple Silicon, or CPU) for processing.
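A minimal sketch of how this kind of device detection could look with PyTorch, which is listed in the dependencies. Returning a device name string and falling back to "cpu" when PyTorch cannot be imported are assumptions made for illustration, not necessarily what the script does.

```python
try:
    import torch
except ImportError:  # assumption: keep the sketch usable without PyTorch installed
    torch = None


def define_device():
    """Return the name of the best available device: 'cuda', 'mps', or 'cpu'."""
    if torch is None:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Apple Silicon GPUs are exposed through the MPS backend in recent PyTorch versions.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(define_device())
```

The returned name can be passed to `torch.device(...)` or to a model's `.to(...)` call.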
def load_model(model_name='sentence-transformers/all-MiniLM-L6-v2', device=None):
# ... (function implementation)
Loads a pre-trained language model and tokenizer for text embedding.
def extract_text_from_pdf(pdf_path):
# ... (function implementation)
Extracts text content from a PDF file.
def segment_text(text):
# ... (function implementation)
Segments the extracted text into chunks based on article numbers.
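A rough sketch of article-based segmentation using the `re` module. The heading pattern (a line starting with "Art." followed by a number) and the choice to keep any preamble as its own chunk are assumptions; the script's actual pattern may differ.

```python
import re

# Assumed heading pattern: "Art. 1º", "Art. 2°", "Art. 10." at the start of a line.
ARTICLE_PATTERN = re.compile(r"(?m)^[ \t]*Art\.\s*\d+")


def segment_text(text):
    """Split text into chunks, each starting at an article heading."""
    starts = [m.start() for m in ARTICLE_PATTERN.finditer(text)]
    if not starts:
        stripped = text.strip()
        return [stripped] if stripped else []
    chunks = []
    # Keep any text before the first article (title, preamble) as its own chunk.
    preamble = text[:starts[0]].strip()
    if preamble:
        chunks.append(preamble)
    # Each chunk runs from one heading to the next (or to the end of the text).
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        chunks.append(text[begin:end].strip())
    return chunks


# Made-up sample text, for illustration only.
sample = "TITLE LINE\nArt. 1º First article.\nArt. 2º Second article,\nwith two lines."
for chunk in segment_text(sample):
    print(repr(chunk))
```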
def compute_embeddings(text_chunks, tokenizer, model, device):
# ... (function implementation)
Computes embeddings for each text chunk using the loaded model.
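When a model such as all-MiniLM-L6-v2 is used through Transformers directly, the token-level outputs are typically reduced to one vector per chunk by mean pooling over the attention mask. The sketch below shows only that pooling step, with NumPy arrays standing in for the model output; the helper name `mean_pool` is hypothetical, and whether the script pools this way is an assumption.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per chunk, ignoring padding positions.

    token_embeddings: array of shape (batch, tokens, dim)
    attention_mask:   array of shape (batch, tokens), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


# Toy stand-in for a model output: 1 chunk, 3 token slots (the last is padding), dim 2.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(tokens, mask))
```

The padded position is excluded from the average, so only the two real tokens contribute to the chunk vector.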
def compute_similarity_matrix(embeddings1, embeddings2):
# ... (function implementation)
Computes the cosine similarity between embeddings of original and updated text chunks.
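As an illustration, the cosine similarity matrix can be computed with NumPy by normalizing each embedding to unit length and taking a matrix product. This is a sketch under that assumption; the script may instead call scikit-learn's `cosine_similarity`, which computes the same quantity.

```python
import numpy as np


def compute_similarity_matrix(embeddings1, embeddings2):
    """Return S where S[i, j] is the cosine similarity between
    row i of embeddings1 and row j of embeddings2."""
    a = np.asarray(embeddings1, dtype=float)
    b = np.asarray(embeddings2, dtype=float)
    # Normalize rows to unit length; the epsilon guards against zero vectors.
    a = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1e-12)
    b = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), 1e-12)
    return a @ b.T


# Toy vectors standing in for chunk embeddings.
original = [[1.0, 0.0], [0.0, 1.0]]
updated = [[2.0, 0.0], [1.0, 1.0], [0.0, -3.0]]
print(compute_similarity_matrix(original, updated))
```

The result has one row per original chunk and one column per updated chunk.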
def find_matches(similarity_matrix, threshold=0.9):
# ... (function implementation)
Identifies matching chunks based on similarity scores.
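One simple way to do this, sketched below, is to take the best-scoring updated chunk for each original chunk and accept it only if the score reaches the threshold. The return format (a list of `(original_index, updated_index_or_None, score)` tuples) is an assumption for illustration and may not match the script.

```python
def find_matches(similarity_matrix, threshold=0.9):
    """For each original chunk, return (index, best updated index or None, best score)."""
    matches = []
    for i, row in enumerate(similarity_matrix):
        row = list(row)
        if not row:
            # No updated chunks to compare against.
            matches.append((i, None, 0.0))
            continue
        j = max(range(len(row)), key=row.__getitem__)
        score = float(row[j])
        # Below the threshold, the original chunk is treated as having no counterpart.
        matches.append((i, j if score >= threshold else None, score))
    return matches


print(find_matches([[0.95, 0.10], [0.20, 0.40]]))
```

This greedy approach can assign the same updated chunk to more than one original chunk; a one-to-one assignment would need an extra step.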
def generate_diff_html(original_chunks, updated_chunks, matches, output_path='diff_view.html'):
# ... (function implementation)
Generates an HTML file visualizing the differences between the original and updated PDFs.
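The line-by-line part of such a view can be built with `difflib` and `html` from the standard library. The sketch below renders the diff of a single pair of chunks as an HTML fragment; the helper name `diff_chunk_html` and the inline colors are hypothetical, not taken from the script.

```python
import difflib
import html


def diff_chunk_html(original, updated):
    """Return an HTML fragment with a line-by-line diff of two text chunks."""
    rows = []
    for line in difflib.ndiff(original.splitlines(), updated.splitlines()):
        tag, text = line[:2], html.escape(line[2:])  # escape so chunk text cannot inject markup
        if tag == "- ":
            rows.append(f'<div style="background:#fdd">- {text}</div>')
        elif tag == "+ ":
            rows.append(f'<div style="background:#dfd">+ {text}</div>')
        elif tag == "  ":
            rows.append(f"<div>&nbsp; {text}</div>")
        # Lines starting with "? " are ndiff hint lines and are skipped.
    return "\n".join(rows)


print(diff_chunk_html("first line\nold <b>text</b>", "first line\nbrand new wording"))
```

A full page would wrap one such fragment per matched pair in an HTML document and write it to the output path.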
The main() function orchestrates the entire process:
- Configure the device
- Load the model and tokenizer
- Extract text from both PDFs
- Segment the extracted text
- Compute embeddings for both texts
- Calculate similarity matrix
- Find matches based on similarity
- Generate and save the HTML diff view
Install the Python dependencies:
pip install -r requirements.txt
Then run the script:
python main.py
The script generates an HTML file named "diff_view.html" which provides a visual representation of the differences between the two PDFs. The output includes:
- Removed chunks (highlighted in red)
- Updated chunks (with line-by-line differences)
- Unchanged chunks
- Similarity scores for each chunk
- Summary of changes (lines added/removed)
This HTML file can be opened in any web browser for easy viewing and analysis of the differences between the two PDF documents.