Document objects, finds each text chunk inside the PDF using a three-tier matching strategy, and returns a new PDF binary with every matched passage highlighted in yellow — giving your users a direct visual connection between retrieved chunks and their original source.
Key Features
3-Tier Text Matching
Locates each chunk using a cascading strategy: exact match first, then sentence-level matching, then collapsed-whitespace matching — so even lightly reformatted text is found reliably.
Stateless & Async
Every request is fully self-contained. The service holds no session state between calls, making it trivial to scale horizontally or deploy behind a load balancer.
Python Library Mode
Import and call the highlighter directly in your own Python code without running a server. Install once with pip and integrate it into any existing RAG workflow.
Docker Ready
Ship a production container in minutes. The service listens on a configurable port and has no external runtime dependencies beyond its Python packages.
How It Works
RAG PDF Highlighter follows a straightforward three-step process on every request:
- Download the PDF: The service fetches the PDF from the URL you provide using an async HTTP client, so your application never needs to handle the raw file transfer itself.
- Locate each chunk: For every Document in your list, the highlighter searches the corresponding page (or the full document if no page is specified) using exact matching, falling back to sentence-level and then collapsed-whitespace matching until a location is found.
- Return the annotated PDF: The service writes yellow highlight annotations over every matched passage and streams the modified PDF binary back in the response, ready to save or serve directly to your users.
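The matching cascade in the "locate each chunk" step can be sketched in plain Python. This is a minimal illustration that works on already-extracted page text, not the library's actual implementation. The function names (locate_chunk, find_exact, find_sentences, find_collapsed), the sentence-splitting regex, and the use of character offsets instead of PDF coordinates are all assumptions made for the example.

```python
import re

def find_exact(page_text, chunk):
    """Tier 1: the chunk appears verbatim in the page text."""
    start = page_text.find(chunk)
    return [(start, start + len(chunk))] if start != -1 else []

def find_sentences(page_text, chunk):
    """Tier 2: split the chunk into sentences and locate each one separately.

    The naive split on ., ! or ? followed by whitespace is an assumption.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]
    spans = []
    for sentence in sentences:
        start = page_text.find(sentence)
        if start != -1:
            spans.append((start, start + len(sentence)))
    return spans

def find_collapsed(page_text, chunk):
    """Tier 3: treat any run of whitespace (spaces, newlines) as equivalent."""
    tokens = chunk.split()
    if not tokens:
        return []
    pattern = r"\s+".join(re.escape(token) for token in tokens)
    match = re.search(pattern, page_text)
    return [match.span()] if match else []

def locate_chunk(page_text, chunk):
    """Try each tier in order; return (tier_name, spans) for the first that hits."""
    if not chunk.strip():
        return None, []
    tiers = (
        ("exact", find_exact),
        ("sentence", find_sentences),
        ("collapsed_whitespace", find_collapsed),
    )
    for name, finder in tiers:
        spans = finder(page_text, chunk)
        if spans:
            return name, spans
    return None, []
```

Each tier returns a list of (start, end) character offsets into the page text. A real highlighter would translate matches into rectangles on the PDF page before drawing annotations, and that step is outside the scope of this sketch.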
RAG PDF Highlighter requires no authentication. All endpoints are open by default, so make sure you deploy behind an appropriate network boundary or API gateway if you need access control.