Smarter PDF Comparison: How to Catch Real Changes, Not Formatting Noise
comparison of PDFs by extracting content, detecting key changes, and producing a highlighted PDF and a detailed report.

Code version control is fixed. PDF version control? Still a mess.
Whether legal contracts, board reports, or policy documents are under review, one small edit in a sentence, chart, or figure can affect big decisions. But PDFs don't support tracking changes. They are static, layout-intense, and structure-blind. Identifying significant differences between two versions typically involves tedious, line-by-line checks or expensive proprietary software.
Imagine receiving two versions of a 30-page document and being asked to report what has changed. Perhaps a paragraph moved, a sentence had new wording or figures within a table slightly changed. Perhaps it might look the same upon first sight but the meaning of those changes can be huge.
The issue is, most software either drowns you in noise (marking every trivial layout change) or entirely skips significant content changes.
That's because PDFs are visual documents, not structured. PDFs capture how things appear, not the way things relate. Text might be scattered across floating containers, images embedded without any logical metadata, and reading order broken.
This problem needed to be approached differently.
What if we could just pull out the actual content i.e. text and images and grasp it structurally, compare it sensibly, and just ignore the rest?
This blog post steps through how we constructed a solid system to do just that.
Why PDF Comparisons are Difficult?
At first glance, comparing two PDFs might seem like merely determining whether they "look" different. Beneath the surface, however, PDFs formulate multiple profound challenges.
First, PDFs are layout-first, not content-first. They were designed to preserve appearance across devices and printers, but not to encode logical meaning. A paragraph may be split across several floating boxes. A tiny font size change could make an otherwise identical sentence look entirely different at the file level. Conventional diffing tools, which assume structured text, get confused easily.
Second, images such as tables, charts, and diagrams contain important information but PDFs tend to handle them as in-place images without metadata. Looking at two diagrams isn't comparing every pixel, it's determining if the content has significantly changed.
Lastly, real-world edits are not insertions or deletions. A paragraph may shift from page two to page four. A table may have an additional row. Such semantic changes are subtle but critical and traditional methods miss important updates or mark pointless noise.
To compare PDFs correctly is to do more than detect differences. It is to comprehend the document's logical content and structure and to overlook superficial formatting noise.
How to Compare PDFs Effectively?
To address the issue of comparing PDFs correctly, the process is divided into layers; addressing text and images separately and then merging the results.
Extracting Text and Images
The first step involves separating the content into two parts: text and images.
- For text, pdfminer is used to extract text precisely from the PDF, not just scrape the entire page.
- For images, Pillow and OpenCV (cv2) are used to extract and preprocess the images in the document.
This separation allows the focus to remain on the content’s meaning, rather than its visual layout.
Comparing Text Semantically
When comparing text, the focus is on the meaning of changes, not the formatting. difflib.SequenceMatcher is used to identify actual content changes such as insertions, deletions, and moved text.
Minor formatting issues like line breaks or small shifts in margins are ignored, ensuring that a reflowed paragraph isn’t flagged as a change, but a changed sentence is.
Comparing Images Structurally
For visuals, the goal is to detect real differences in images and diagrams without being distracted by pixel shifts or compression noise. skimage.metrics.structural_similarity (SSIM) is used to detect meaningful visual changes, such as an updated graph or a new data point, without flagging irrelevant changes caused by compression artifacts.
Merging the Diffs
Finally, both text and image results are merged to provide the complete comparison. The output consists of:
- A new highlighted PDF, clearly showing where changes occurred while preserving the original document structure.
- A separate differences report, listing all the modifications in an easy-to-review format.
The end result is a content-aware and noise-resistant system that detects meaningful changes in PDFs, distinguishing between real edits and superficial formatting shifts.
What Needed to Be Built?
Building the system required more than just writing scripts. The goal was to create an actual system that could:
- Handle PDF uploads, selections, and metadata management
- Extract and preprocess PDFs reliably
- Analyze and compare extracted content across multiple layers (text, images)
- Generate user-friendly outputs, such as highlighted PDFs and structured difference reports
- Orchestrate all processing reliably, with error handling and retries
The system was designed with a modular architecture, using modern, lightweight tools to ensure it is open, portable, and easy to extend.
Data Flow: How It All Fits Together?
To compare different versions of PDF documents effectively, the system follows a structured process. This approach ensures that all changes, whether textual or visual are identified and clearly presented.
Document Upload:
- The user uploads a PDF.
- The system processes the file by extracting metadata and storing both the file and its metadata in the workspace’s file system and the documents database.
Document Selection & Processing:
- The user selects different versions of the document for comparison.
- The system retrieves the selected PDFs and extracts their pages.
- Text and images are extracted for further processing.
Comparison & Differences Detection:
- The system analyzes the extracted content to detect replacements, insertions, deletions, and position changes.
- The detected differences are then processed and formatted for visualization.
Generating Output:
- The system generates two outputs:
- Highlighted PDFs, with changes visually marked.
- A differences report in PDF format, listing all detected modifications.
- Both documents are made available to the user for review.
This data flow illustrates the clear and systematic process used to compare PDF versions, ensuring that meaningful changes are easily identified and presented.Below is an image that illustrates the entire data flow:
Open Source Libraries That Powered Everything
The system is built using open, tested libraries that ensure precision and control:
- pdfminer: Low-level text extraction from PDFs
- Pillow (PIL): Image processing toolkit
- OpenCV (cv2): Image manipulation and preprocessing
- skimage: Structural similarity for smart image comparison
- difflib: Robust sequence matching for text diffs
- pypdf: Read, modify, and generate PDFs with change highlights
- FastAPI: Lightweight backend for handling API requests
- Svelte + Vite: Modern frontend stack for UI
- Prefect: Workflow orchestration and task management
Each tool is chosen for its focus on precision and control, avoiding black-box solutions and proprietary frameworks.
Code: The Approach
APIs served as the glue, orchestrating the different parts of the system. All heavy processing was handled asynchronously, ensuring fast and reliable comparisons even for large documents.
For more details on the code approach, check out this technical discussion. The project used a modular and scalable design. The backend service, developed using FastAPI, was responsible for uploads, selection, and orchestration.
The extraction, comparison, and result generation were handled by a Prefect-powered processing pipeline. This facilitated convenient workflows and clear task management.
Frontend elements, implemented with Svelte, offered a user-friendly interface for document upload, selection, and downloading.
For clean, independent logic, text and image comparison modules were separated. This provided an effortless way to add functionality in the future, such as OCR or sophisticated image diffing capabilities.
The APIs acted as glue, controlling the various components of the system. Everything that did heavy processing was done asynchronously, allowing for quick and sure comparisons even on big documents.
For additional information regarding the code solution, see this technical discussion.
Pros and Cons of the PDF Comparison System
This system has several advantages that make it efficient and cost-effective for comparing PDF documents, but there are also some challenges to consider. Below is a breakdown of the pros and cons:
Conclusion
This PDF comparison tool offers a real-world solution for effectively determining differences among versions of PDF documents. It solves the key issues related to comparing text and images, providing a simplified mechanism for updating documents and enhancing version control accuracy.
While the existing system is effective and operational, there is room for improvement, such as improving text analysis or offering users greater levels of customization. As improvements continue to be made, this method will become increasingly robust, with more flexibility and efficiency in document comparison operations..