Matching strategies
Strategy 1: Exact Match
The first strategy normalises both the query text and the page text by collapsing each run of consecutive whitespace into a single space, then searches for the normalised chunk verbatim on the target page.

This approach is the fastest and works correctly for the majority of well-structured PDFs, for example documents exported directly from LaTeX, Word, or Google Docs, where the text layer faithfully represents what is displayed.

When it works best: Clean, machine-generated PDFs where text is stored as continuous strings without unexpected whitespace artefacts.
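The exact-match idea can be sketched in a few lines of Python. This is an illustration only, not the service's actual code: the function names `normalise_whitespace` and `exact_match` are hypothetical, and trimming leading and trailing whitespace is an added assumption.

```python
import re


def normalise_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space.

    Trimming the ends is an assumption made for this sketch.
    """
    return re.sub(r"\s+", " ", text).strip()


def exact_match(chunk: str, page_text: str) -> bool:
    """Return True if the normalised chunk appears verbatim in the normalised page text."""
    return normalise_whitespace(chunk) in normalise_whitespace(page_text)


page_text = "The quick  brown\nfox jumps over\tthe lazy dog."
print(exact_match("brown fox   jumps", page_text))  # differing whitespace still matches
print(exact_match("brown cat", page_text))          # text not on the page does not match
```

A real implementation would also need to recover bounding boxes for the matched text; the sketch only shows the normalise-then-search step.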
Strategy 2: Sentence-Level Match
If the exact match finds nothing, the service splits the chunk into individual sentence fragments and searches for each fragment separately on the target page. Only fragments with 20 or more characters are used, because shorter fragments are too ambiguous and could match unintended regions.

Each matched fragment produces its own set of bounding boxes, and all boxes are collected and highlighted independently. This means a single chunk that spans a visual line break or column boundary will still be fully highlighted, with separate highlight rectangles covering each sentence-length portion.

When it works best: Chunks produced by text splitters that combine multiple sentences, or chunks whose text crosses a heading, table caption, or other visual break that causes the exact string to not appear contiguously in the text layer.
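As a rough sketch of the sentence-level fallback, the Python below splits a chunk into fragments, drops those shorter than 20 characters, and checks each one independently. The splitting rule (sentence-ending punctuation followed by whitespace) and the names `sentence_fragments` and `matched_fragments` are assumptions for illustration; the service may split sentences differently.

```python
import re
from typing import List

MIN_FRAGMENT_CHARS = 20  # threshold stated in the text above


def sentence_fragments(chunk: str) -> List[str]:
    """Split a chunk into sentence fragments and keep the long ones.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    """
    parts = re.split(r"(?<=[.!?])\s+", chunk.strip())
    return [p.strip() for p in parts if len(p.strip()) >= MIN_FRAGMENT_CHARS]


def matched_fragments(chunk: str, page_text: str) -> List[str]:
    """Return the fragments that occur on the page after whitespace normalisation."""
    page = " ".join(page_text.split())
    return [f for f in sentence_fragments(chunk) if " ".join(f.split()) in page]


chunk = (
    "Short one. This sentence is long enough to be searched. "
    "Ok. Another sufficiently long sentence appears here."
)
print(sentence_fragments(chunk))
```

In this sketch each surviving fragment is searched on its own, which mirrors the description of one set of highlight boxes per matched fragment.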
Strategy 3: Collapsed-Whitespace Match
If neither of the previous strategies finds a match, the service removes all whitespace characters from both the query and the page text, then uses a sliding-window character search to locate the query within the collapsed page string. When a match is found, the character positions are mapped back to the original text to recover the correct bounding boxes.

This strategy handles a common artefact in scanned or re-exported PDFs where the text layer stores a space character between every individual character, for example "T h e q u i c k" instead of "The quick". Collapsing whitespace on both sides makes these documents searchable without any special pre-processing on your part.

When it works best: Scanned PDFs that have been OCR-processed, PDFs exported from certain design tools, or any document where the text layer contains unexpected inter-character spacing.

Strategies are tried in order: exact match first, then sentence-level, then collapsed-whitespace. The service stops and returns results as soon as any strategy finds at least one match, so a later strategy is never run on a chunk once an earlier one has succeeded.
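The collapsed-whitespace strategy (Strategy 3) relies on mapping positions in the collapsed string back to the original text. The sketch below shows one way to do that in Python. It is illustrative only: the names `collapse` and `collapsed_match` are hypothetical, and `str.find` stands in for the sliding-window search described above.

```python
from typing import List, Optional, Tuple


def collapse(text: str) -> Tuple[str, List[int]]:
    """Remove all whitespace and record each kept character's original index."""
    chars: List[str] = []
    index_map: List[int] = []
    for i, ch in enumerate(text):
        if not ch.isspace():
            chars.append(ch)
            index_map.append(i)
    return "".join(chars), index_map


def collapsed_match(chunk: str, page_text: str) -> Optional[Tuple[int, int]]:
    """Return the (start, end) span of the chunk in the original page text, or None."""
    query, _ = collapse(chunk)
    page, index_map = collapse(page_text)
    if not query:
        return None
    pos = page.find(query)  # stand-in for a sliding-window search
    if pos == -1:
        return None
    start = index_map[pos]
    end = index_map[pos + len(query) - 1] + 1  # exclusive end in the original text
    return start, end


page_text = "T h e q u i c k b r o w n"
span = collapsed_match("The quick", page_text)
print(span)                          # (0, 15)
print(page_text[span[0]:span[1]])    # T h e q u i c k
```

The recovered span indexes the original page text, which is the information a real implementation would need in order to look up bounding boxes.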
What happens when no strategy matches
If all three strategies fail to find the chunk on its target page, the chunk is silently skipped. The service continues processing the remaining documents and returns a valid annotated PDF. No error is raised, and no placeholder annotation is inserted for the unmatched chunk.
Tips for improving match rates
- Verify page numbers are correct and 0-indexed. The `metadata.page` field uses 0-based indexing, so page 1 of the PDF must be `0`. An off-by-one error here is the most common cause of missed highlights.
- Keep chunks at a reasonable size. Single words are too ambiguous and may appear many times on a page. Full-page chunks are unlikely to match as a single contiguous string. Chunks produced by a sentence or paragraph splitter tend to work best.
- Extract text with the same tool used to produce the chunks. If you extracted the PDF text with PyMuPDF and split it with LangChain, the chunks will reflect PyMuPDF’s text layer exactly. Switching extraction tools mid-pipeline can introduce subtle whitespace or encoding differences that prevent matching.
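To guard against the off-by-one error mentioned in the first tip, a small conversion helper can be applied wherever page numbers enter your pipeline. The sketch below is a suggestion rather than part of the service: `to_zero_based` is a hypothetical helper, and the chunk dictionary shape (a `page_content` string plus a `metadata` mapping) is an assumption; only the `metadata.page` field comes from the text above.

```python
def to_zero_based(page_number: int) -> int:
    """Convert a human-readable page number (1, 2, 3, ...) to a 0-based index."""
    if page_number < 1:
        raise ValueError("page numbers shown in a PDF viewer start at 1")
    return page_number - 1


# Hypothetical chunk shape; only metadata.page is taken from the documentation.
chunk = {
    "page_content": "Example sentence taken from the first page of the PDF.",
    "metadata": {"page": to_zero_based(1)},
}
print(chunk["metadata"]["page"])  # 0
```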