PDF Text Matching Strategies in RAG PDF Highlighter

PDFs can encode the same visible text in many different ways. A scanned document converted to PDF may store a space between every pair of characters. A document exported from a word processor may use ligatures or special Unicode spaces. To reliably locate your text chunks regardless of how the underlying text layer is structured, RAG PDF Highlighter applies three matching strategies in sequence, falling back to the next strategy whenever the current one finds no match.

Matching strategies

Strategy 1: Exact Match

The first strategy normalises both the query text and the page text by collapsing every run of consecutive whitespace into a single space, then searches for the normalised chunk verbatim on the target page.

This approach is the fastest and works correctly for the majority of well-structured PDFs, for example documents exported directly from LaTeX, Word, or Google Docs, where the text layer faithfully represents what is displayed.

When it works best: Clean, machine-generated PDFs where text is stored as continuous strings without unexpected whitespace artefacts.
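The normalise-then-search step can be sketched in a few lines of Python. This is a minimal illustration that works on plain strings, not the service's actual code: the function names are hypothetical, trimming leading and trailing whitespace is an added assumption, and the real service also resolves the match to bounding boxes, which is omitted here.

```python
import re


def normalise_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space.

    Trimming the ends is an assumption made for this sketch.
    """
    return re.sub(r"\s+", " ", text).strip()


def exact_match(chunk: str, page_text: str) -> int:
    """Return the offset of the chunk in the normalised page text, or -1."""
    return normalise_whitespace(page_text).find(normalise_whitespace(chunk))


page = "The  quick\nbrown   fox jumps."
offset = exact_match("quick brown\tfox", page)
```

Because both sides go through the same normalisation, a chunk containing a tab or a line break still matches page text that uses different whitespace in the same places.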

Strategy 2: Sentence-Level Match

If the exact match finds nothing, the service splits the chunk into individual sentence fragments and searches for each fragment separately on the target page. Only fragments with 20 or more characters are used, because shorter fragments are too ambiguous and could match unintended regions.

Each matched fragment produces its own set of bounding boxes, and all boxes are collected and highlighted independently. This means a single chunk that spans a visual line break or column boundary will still be fully highlighted, with separate highlight rectangles covering each sentence-length portion.

When it works best: Chunks produced by text splitters that combine multiple sentences, or chunks whose text crosses a heading, table caption, or other visual break that causes the exact string to not appear contiguously in the text layer.
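A rough sketch of the fragment search follows. The 20-character minimum comes from the description above; everything else is an assumption made for illustration. In particular, the sentence splitter (a regular expression that breaks after ".", "!" or "?") and the function names are hypothetical, and character spans stand in for the bounding boxes the service returns.

```python
import re
from typing import List, Tuple

MIN_FRAGMENT_LENGTH = 20  # threshold stated in the description above


def sentence_fragments(chunk: str) -> List[str]:
    """Split a chunk after sentence-ending punctuation (assumed splitter)."""
    parts = re.split(r"(?<=[.!?])\s+", chunk)
    return [p.strip() for p in parts if len(p.strip()) >= MIN_FRAGMENT_LENGTH]


def sentence_level_match(chunk: str, page_text: str) -> List[Tuple[int, int]]:
    """Return one (start, end) span per fragment found on the page."""
    spans = []
    for fragment in sentence_fragments(chunk):
        start = page_text.find(fragment)
        if start != -1:
            spans.append((start, start + len(fragment)))
    return spans


chunk = (
    "Results improved significantly. See Table 2. "
    "The baseline model was retrained twice."
)
page = (
    "Results improved significantly.\n"
    "Table 2: Accuracy\n"
    "The baseline model was retrained twice."
)
spans = sentence_level_match(chunk, page)
```

In this example the chunk does not appear contiguously on the page because a table caption sits in the middle, yet the two long sentences are each found on their own. The short fragment "See Table 2." is dropped by the length filter.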

Strategy 3: Collapsed-Whitespace Match

If neither of the previous strategies finds a match, the service removes all whitespace characters entirely from both the query and the page text, then uses a sliding-window character search to locate the query within the collapsed page string. When a match is found, the character positions are mapped back to the original text to recover the correct bounding boxes.

This strategy handles a common artefact in scanned or re-exported PDFs where the text layer stores a space character between every individual character, for example "T h e q u i c k" instead of "The quick". Collapsing whitespace on both sides makes these documents searchable without any special pre-processing on your part.

When it works best: Scanned PDFs that have been OCR-processed, PDFs exported from certain design tools, or any document where the text layer contains unexpected inter-character spacing.
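The key detail is the mapping from positions in the collapsed string back to positions in the original text. The sketch below shows one way to keep that mapping, assuming plain strings; the function names are hypothetical, and Python's built-in `str.find` stands in for the sliding-window search described above.

```python
from typing import List, Optional, Tuple


def collapse_with_index(text: str) -> Tuple[str, List[int]]:
    """Drop all whitespace and remember where each kept character came from."""
    kept_chars: List[str] = []
    original_index: List[int] = []
    for i, ch in enumerate(text):
        if not ch.isspace():
            kept_chars.append(ch)
            original_index.append(i)
    return "".join(kept_chars), original_index


def collapsed_match(chunk: str, page_text: str) -> Optional[Tuple[int, int]]:
    """Return the (start, end) span in the ORIGINAL page text, or None."""
    needle, _ = collapse_with_index(chunk)
    haystack, original_index = collapse_with_index(page_text)
    if not needle:
        return None
    pos = haystack.find(needle)  # stand-in for the sliding-window search
    if pos == -1:
        return None
    start = original_index[pos]
    end = original_index[pos + len(needle) - 1] + 1
    return start, end


page = "Intro: T h e  q u i c k fox"
span = collapsed_match("The quick", page)
```

The returned span covers the spaced-out characters in the original page text, which is what a highlighter would need in order to look up the matching bounding boxes.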

Strategies are tried in order — exact match first, then sentence-level, then collapsed-whitespace — and the service stops and returns results as soon as any strategy finds at least one match. A chunk will never be searched by all three strategies if an earlier strategy succeeds.
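The stop-at-first-success behaviour can be sketched as a simple loop over an ordered list of strategies. The names here are hypothetical, and the two toy strategies exist only so the example runs on its own; they are not the service's matchers.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Span = Tuple[int, int]
Strategy = Callable[[str, str], List[Span]]


def first_successful(
    strategies: Sequence[Tuple[str, Strategy]], chunk: str, page_text: str
) -> Optional[Tuple[str, List[Span]]]:
    """Try each strategy in order and return the first non-empty result."""
    for name, strategy in strategies:
        spans = strategy(chunk, page_text)
        if spans:
            return name, spans  # later strategies are never called
    return None  # nothing matched


# Toy stand-ins for demonstration only.
def never_matches(chunk: str, page_text: str) -> List[Span]:
    return []


def find_verbatim(chunk: str, page_text: str) -> List[Span]:
    start = page_text.find(chunk)
    return [] if start == -1 else [(start, start + len(chunk))]


def fail_if_called(chunk: str, page_text: str) -> List[Span]:
    raise AssertionError("reached a strategy after an earlier one succeeded")


strategies = [
    ("exact", never_matches),
    ("sentence", find_verbatim),
    ("collapsed", fail_if_called),
]
result = first_successful(strategies, "quick", "The quick fox")
```

Here the second strategy succeeds, so the third is never invoked. A `None` result means every strategy came back empty.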

What happens when no strategy matches

If all three strategies fail to find the chunk on its target page, the chunk is silently skipped. The service continues processing the remaining documents and returns a valid annotated PDF. No error is raised, and no placeholder annotation is inserted for the unmatched chunk.

Tips for improving match rates

Verify page numbers are correct and 0-indexed. The metadata.page field uses 0-based indexing, so page 1 of the PDF must be 0. An off-by-one error here is the most common cause of missed highlights.
Keep chunks at a reasonable size. Single words are too ambiguous and may appear many times on a page. Full-page chunks are unlikely to match as a single contiguous string. Chunks produced by a sentence or paragraph splitter tend to work best.
Extract text with the same tool used to produce the chunks. If you extracted the PDF text with PyMuPDF and split it with LangChain, the chunks will reflect PyMuPDF’s text layer exactly. Switching extraction tools mid-pipeline can introduce subtle whitespace or encoding differences that prevent matching.
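If your pipeline produces 1-based page numbers, a small conversion step before calling the service avoids the off-by-one problem described in the first tip. The sketch below assumes each chunk is a dictionary with a `metadata.page` field, as described above; the `text` key, the function name, and the `page_count` parameter are hypothetical.

```python
from typing import Any, Dict, List


def to_zero_based(
    chunks: List[Dict[str, Any]], page_count: int
) -> List[Dict[str, Any]]:
    """Return copies of the chunks with metadata.page shifted from 1-based to 0-based."""
    converted = []
    for chunk in chunks:
        page = chunk["metadata"]["page"] - 1
        if not 0 <= page < page_count:
            raise ValueError(f"page {page + 1} is outside 1..{page_count}")
        # Copy rather than mutate so the caller's data is left untouched.
        converted.append({**chunk, "metadata": {**chunk["metadata"], "page": page}})
    return converted


chunks = [{"text": "The quick brown fox.", "metadata": {"page": 1}}]
fixed = to_zero_based(chunks, page_count=3)
```

Raising on an out-of-range page surfaces indexing mistakes early, which is useful because the service itself skips unmatched chunks without reporting an error.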
