Smarter PDF Comparison: How to Catch Real Changes, Not Formatting Noise

Code version control is a solved problem. PDF version control? Still a mess.

Whether the documents under review are legal contracts, board reports, or policy documents, one small edit to a sentence, chart, or figure can affect big decisions. But PDFs don't support change tracking. They are static, layout-heavy, and structure-blind. Identifying significant differences between two versions typically involves tedious, line-by-line checks or expensive proprietary software.

Imagine receiving two versions of a 30-page document and being asked to report what has changed. Perhaps a paragraph moved, a sentence was reworded, or a figure in a table changed slightly. The two versions might look the same at first glance, yet the changes can carry significant meaning.

The issue is that most software either drowns you in noise (flagging every trivial layout change) or misses significant content changes entirely.

That's because PDFs are visual documents, not structured ones. They capture how things appear, not how they relate. Text might be scattered across floating containers, images embedded without any logical metadata, and the reading order broken.

This problem needed to be approached differently.

What if we could pull out just the actual content (text and images), understand its structure, compare it sensibly, and ignore the rest?

This blog post walks through how we built a robust system to do just that.

Why Are PDF Comparisons Difficult?

At first glance, comparing two PDFs might seem like merely determining whether they "look" different. Beneath the surface, however, PDFs pose several deep challenges.

First, PDFs are layout-first, not content-first. They were designed to preserve appearance across devices and printers, but not to encode logical meaning. A paragraph may be split across several floating boxes. A tiny font size change could make an otherwise identical sentence look entirely different at the file level. Conventional diffing tools, which assume structured text, get confused easily.

Second, visuals such as tables, charts, and diagrams carry important information, but PDFs tend to store them as embedded images without metadata. Comparing two diagrams isn't a matter of comparing every pixel; it's a matter of determining whether the content has changed significantly.

Lastly, real-world edits are not just insertions or deletions. A paragraph may shift from page two to page four. A table may gain an additional row. Such structural changes are subtle but critical, and traditional methods either miss important updates or flag pointless noise.

Comparing PDFs correctly means more than detecting differences. It means understanding the document's logical content and structure while overlooking superficial formatting noise.

How to Compare PDFs Effectively?

To compare PDFs correctly, the process is divided into layers: text and images are handled separately, and the results are then merged.

Extracting Text and Images

The first step involves separating the content into two parts: text and images.

For text, pdfminer is used to extract text precisely from the PDF, not just scrape the entire page.
For images, Pillow and OpenCV (cv2) are used to extract and preprocess the images in the document.

This separation allows the focus to remain on the content’s meaning, rather than its visual layout.
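As a minimal sketch of the text side of this step, assuming the pdfminer.six distribution of pdfminer: the snippet below collects each text container with its page number and position, then sorts the blocks into a top-to-bottom reading order. The names TextBlock, iter_text_blocks, and in_reading_order are hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class TextBlock:
    page: int   # 1-based page number
    x0: float   # left edge, in PDF points
    y1: float   # top edge, in PDF points (the PDF origin is bottom-left)
    text: str


def iter_text_blocks(pdf_path: str) -> Iterator[TextBlock]:
    """Yield one TextBlock per text container that pdfminer.six finds."""
    # Imported lazily so the sorting helper works without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_no, layout in enumerate(extract_pages(pdf_path), start=1):
        for element in layout:
            if isinstance(element, LTTextContainer):
                text = element.get_text().strip()
                if text:
                    yield TextBlock(page_no, element.x0, element.y1, text)


def in_reading_order(blocks: List[TextBlock]) -> List[TextBlock]:
    """Sort by page, then top-to-bottom, then left-to-right."""
    # y grows upward in PDF space, so a larger y1 means higher on the page.
    return sorted(blocks, key=lambda b: (b.page, -b.y1, b.x0))
```

Sorting by position is a simple heuristic and will not handle multi-column layouts correctly; it only illustrates why extraction keeps coordinates alongside the text.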

Comparing Text Semantically

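When comparing text, the focus is on the meaning of changes, not the formatting. difflib.SequenceMatcher is used to identify actual content changes such as insertions, deletions, and replacements. It has no dedicated opcode for moved text, so a move surfaces as a deletion paired with a matching insertion elsewhere.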

Minor formatting issues like line breaks or small shifts in margins are ignored, ensuring that a reflowed paragraph isn’t flagged as a change, but a changed sentence is.
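The idea can be sketched as follows, under the assumption that whitespace is collapsed and the text is split into sentences before diffing. The helper names and the naive sentence-splitting regex are illustrative assumptions, not the project's actual code.

```python
import difflib
import re
from typing import List, Tuple

Change = Tuple[str, List[str], List[str]]  # (opcode, old sentences, new sentences)


def normalize(text: str) -> List[str]:
    """Collapse all whitespace, then split into sentences (naive splitter)."""
    flat = re.sub(r"\s+", " ", text).strip()
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def diff_sentences(old: str, new: str) -> List[Change]:
    """Return only the non-equal opcodes between two normalized texts."""
    a, b = normalize(old), normalize(new)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Because line breaks are removed before comparison, a paragraph that was only reflowed produces an empty diff, while a sentence whose wording changed is reported as a replace.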

Comparing Images Structurally

For visuals, the goal is to detect real differences in images and diagrams without being distracted by pixel shifts or compression noise. skimage.metrics.structural_similarity (SSIM) is used to detect meaningful visual changes, such as an updated graph or a new data point, without flagging irrelevant changes caused by compression artifacts.
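A minimal sketch of this comparison, assuming both images are converted to grayscale and resized to a common shape before scoring. The 0.95 threshold and the function names are assumptions for illustration; a suitable cut-off would need tuning on real documents.

```python
from typing import Tuple

# Assumed cut-off: scores below this are treated as a meaningful change.
SSIM_THRESHOLD = 0.95


def load_gray(path: str, size: Tuple[int, int] = (512, 512)):
    """Read an image as grayscale and resize it to a common (width, height)."""
    import cv2  # imported lazily; requires opencv-python

    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        raise FileNotFoundError(f"could not read image: {path}")
    return cv2.resize(image, size, interpolation=cv2.INTER_AREA)


def ssim_score(path_a: str, path_b: str) -> float:
    """Structural similarity of two image files; 1.0 means identical."""
    from skimage.metrics import structural_similarity  # requires scikit-image

    a, b = load_gray(path_a), load_gray(path_b)
    return float(structural_similarity(a, b, data_range=255))


def has_changed(score: float, threshold: float = SSIM_THRESHOLD) -> bool:
    """Flag the pair as changed when similarity falls below the threshold."""
    return score < threshold
```

SSIM compares local luminance, contrast, and structure rather than raw pixel values, which is why mild compression artifacts tend to lower the score only slightly.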

Merging the Diffs

Finally, both text and image results are merged to provide the complete comparison. The output consists of:

A new highlighted PDF, clearly showing where changes occurred while preserving the original document structure.
A separate differences report, listing all the modifications in an easy-to-review format.
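The highlighted output could be produced along these lines, assuming pypdf's annotation support and assuming the bounding boxes of changed regions are already known from the extraction step. The function names and the shape of the changes mapping are hypothetical.

```python
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def quad_points(rect: Rect) -> List[float]:
    """Return the eight corner coordinates a highlight annotation expects."""
    x0, y0, x1, y1 = rect
    return [x0, y0, x1, y0, x0, y1, x1, y1]


def highlight_changes(src_path: str, dst_path: str,
                      changes: Dict[int, List[Rect]]) -> None:
    """Copy a PDF and add one highlight per changed rectangle.

    `changes` maps a 0-based page index to the rectangles to highlight.
    """
    # Imported lazily; requires a pypdf version that ships pypdf.annotations.
    from pypdf import PdfReader, PdfWriter
    from pypdf.annotations import Highlight
    from pypdf.generic import ArrayObject, FloatObject

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)

    for page_index, rects in changes.items():
        for rect in rects:
            annotation = Highlight(
                rect=rect,
                quad_points=ArrayObject(
                    [FloatObject(v) for v in quad_points(rect)]
                ),
            )
            writer.add_annotation(page_number=page_index, annotation=annotation)

    with open(dst_path, "wb") as fh:
        writer.write(fh)
```

Adding annotations on top of the copied pages leaves the original content untouched, which matches the goal of preserving the document structure.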

The end result is a content-aware and noise-resistant system that detects meaningful changes in PDFs, distinguishing between real edits and superficial formatting shifts.

What Needed to Be Built?

Building the system required more than just writing scripts. The goal was to create an actual system that could:

Handle PDF uploads, selections, and metadata management
Extract and preprocess PDFs reliably
Analyze and compare extracted content across multiple layers (text, images)
Generate user-friendly outputs, such as highlighted PDFs and structured difference reports
Orchestrate all processing reliably, with error handling and retries

The system was designed with a modular architecture, using modern, lightweight tools to ensure it is open, portable, and easy to extend.

Data Flow: How It All Fits Together?

To compare different versions of PDF documents effectively, the system follows a structured process. This approach ensures that all changes, whether textual or visual, are identified and clearly presented.

Document Upload:

The user uploads a PDF.
The system processes the file by extracting metadata and storing both the file and its metadata in the workspace’s file system and the documents database.
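The upload step might look roughly like this, using only the standard library. SQLite as the documents database, the table layout, and the content-hash file naming are all assumptions made for the sketch; the post does not say which database or schema the system uses.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def store_upload(db: sqlite3.Connection, storage_dir: Path,
                 filename: str, data: bytes) -> str:
    """Save the file under its SHA-256 digest and record its metadata."""
    digest = hashlib.sha256(data).hexdigest()

    # File system: one file per distinct content hash.
    storage_dir.mkdir(parents=True, exist_ok=True)
    (storage_dir / f"{digest}.pdf").write_bytes(data)

    # Database: a minimal, hypothetical "documents" table.
    db.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "sha256 TEXT PRIMARY KEY, filename TEXT, "
        "size_bytes INTEGER, uploaded_at TEXT)"
    )
    db.execute(
        "INSERT OR IGNORE INTO documents VALUES (?, ?, ?, ?)",
        (digest, filename, len(data), datetime.now(timezone.utc).isoformat()),
    )
    db.commit()
    return digest
```

Keying files by content hash means uploading the same version twice stores it only once, which is convenient when users later pick versions to compare.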

Document Selection & Processing:

The user selects different versions of the document for comparison.
The system retrieves the selected PDFs and extracts their pages.
Text and images are extracted for further processing.

Comparison & Differences Detection:

The system analyzes the extracted content to detect replacements, insertions, deletions, and position changes.
The detected differences are then processed and formatted for visualization.
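One way to separate position changes from true insertions and deletions is sketched below, assuming the content has already been split into paragraphs. Here a paragraph that is deleted in one place and inserted unchanged in another is treated as a move. The function name and result layout are hypothetical, and the heuristic misses moves that difflib happens to fold into a replace opcode.

```python
import difflib
from typing import Dict, List


def classify_changes(old: List[str], new: List[str]) -> Dict[str, list]:
    """Group paragraph-level opcodes into moved/deleted/inserted/replaced."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    deleted: List[str] = []
    inserted: List[str] = []
    replaced: list = []

    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            deleted.extend(old[i1:i2])
        elif tag == "insert":
            inserted.extend(new[j1:j2])
        elif tag == "replace":
            replaced.append((old[i1:i2], new[j1:j2]))

    # Deleted here and inserted verbatim elsewhere: report it as a move.
    moved = [p for p in deleted if p in inserted]
    return {
        "moved": moved,
        "deleted": [p for p in deleted if p not in moved],
        "inserted": [p for p in inserted if p not in moved],
        "replaced": replaced,
    }
```

For example, moving an introduction below two other paragraphs and appending an appendix yields one moved paragraph and one inserted paragraph, rather than a deletion plus two insertions.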

Generating Output:

The system generates two outputs:
1. Highlighted PDFs, with changes visually marked.
2. A differences report in PDF format, listing all detected modifications.
Both documents are made available to the user for review.

This data flow illustrates the clear and systematic process used to compare PDF versions, ensuring that meaningful changes are easily identified and presented. Below is an image that illustrates the entire data flow:

[Figure: diagram of the entire data flow]

Open Source Libraries That Powered Everything

The system is built using open, tested libraries that ensure precision and control:

pdfminer: Low-level text extraction from PDFs
Pillow (PIL): Image processing toolkit
OpenCV (cv2): Image manipulation and preprocessing
skimage: Structural similarity for smart image comparison
difflib: Robust sequence matching for text diffs
pypdf: Read, modify, and generate PDFs with change highlights
FastAPI: Lightweight backend for handling API requests
Svelte + Vite: Modern frontend stack for UI
Prefect: Workflow orchestration and task management

Each tool is chosen for its focus on precision and control, avoiding black-box solutions and proprietary frameworks.

Code: The Approach

APIs served as the glue, orchestrating the different parts of the system. All heavy processing was handled asynchronously, ensuring fast and reliable comparisons even for large documents.

The project used a modular and scalable design. The backend service, developed using FastAPI, was responsible for uploads, selection, and orchestration. For more details on the code approach, check out this technical discussion.

The extraction, comparison, and result generation were handled by a Prefect-powered processing pipeline. This facilitated convenient workflows and clear task management.
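A rough sketch of how such a pipeline could be wired, assuming Prefect 2 or later. The step functions are passed in as parameters because their real implementations are not shown in this post; the retry counts, delays, and all names here are illustrative assumptions.

```python
from typing import Any, Callable, List

Extract = Callable[[str], List[str]]
Compare = Callable[[List[str], List[str]], List[Any]]
Report = Callable[[List[Any]], str]


def run_comparison(old_path: str, new_path: str,
                   extract: Extract, compare: Compare, report: Report) -> str:
    """Plain-Python pipeline: extract both versions, diff them, report."""
    old_content = extract(old_path)
    new_content = extract(new_path)
    return report(compare(old_content, new_content))


def as_prefect_flow(extract: Extract, compare: Compare, report: Report):
    """Wrap the same steps as Prefect tasks so failed steps are retried."""
    from prefect import flow, task  # imported lazily; requires Prefect

    extract_task = task(retries=2, retry_delay_seconds=5)(extract)
    compare_task = task(compare)
    report_task = task(retries=2, retry_delay_seconds=5)(report)

    @flow(name="compare-pdfs")
    def compare_pdfs(old_path: str, new_path: str) -> str:
        old_content = extract_task(old_path)
        new_content = extract_task(new_path)
        return report_task(compare_task(old_content, new_content))

    return compare_pdfs
```

Keeping the steps as ordinary functions means they can be unit-tested without an orchestrator, while the Prefect wrapper adds retries and run tracking on top.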

Frontend elements, implemented with Svelte, offered a user-friendly interface for document upload, selection, and downloading.

To keep the logic clean and independent, the text and image comparison modules were separated. This makes it straightforward to add functionality later, such as OCR or more sophisticated image diffing.


Pros and Cons of the PDF Comparison System

This system has several advantages that make it efficient and cost-effective for comparing PDF documents, but there are also some challenges to consider. Below is a breakdown of the pros and cons:

Pros	Cons
Scalable & Modular: Easy to expand	Complex Setup: Time-consuming setup
Efficient: Asynchronous processing	Limited Diff Types: Only text & image
Flexible: User-friendly web interface	Accuracy Issues: Complex layouts may not compare well
Clean Design: Separate comparison modules	Performance: May slow with large docs
Open Source: Uses existing libraries	Customization: Advanced features need extra work
Affordable: Reduces manual effort	External Dependencies: Updates may cause issues

Conclusion

This PDF comparison tool offers a practical way to identify differences between versions of PDF documents. It addresses the key issues in comparing text and images, simplifying document review and making version control more accurate.

While the existing system is effective and operational, there is room for improvement, such as better text analysis or more customization options for users. As improvements continue, this method will become more robust, flexible, and efficient.
