Document Payload Format and Page Metadata in PDF Highlighter

Every text chunk you want highlighted must be sent as a document payload that tells the service what to find and where to look. The format is intentionally minimal, carrying just the text itself and the page number, so it maps directly onto the Document objects that LangChain text splitters produce.

DocumentPayload schema

Each item in the documents array must conform to the following structure:

{
  "page_content": "Text chunk to highlight in the PDF",
  "metadata": {
    "page": 0
  }
}

Fields

page_content (string, required)

The exact text to locate and highlight in the PDF. The service normalises whitespace before searching, so minor differences in spacing between your chunk and the PDF text layer are handled automatically. Do not truncate or paraphrase the chunk — pass the full string as produced by your text splitter.
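To get a feel for what whitespace normalisation tolerates, the sketch below collapses every run of whitespace into a single space before comparing two strings. It illustrates the general technique only and is an assumption, not the service's actual implementation; normalise_whitespace is a hypothetical helper name.

```python
def normalise_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    # Hypothetical helper: illustrates the idea only, not the service's code.
    return " ".join(text.split())


chunk = "The quick  brown\nfox jumps"
pdf_text = "The quick brown fox\tjumps"

# Both strings reduce to the same normalised form, so they compare equal.
print(normalise_whitespace(chunk) == normalise_whitespace(pdf_text))
```

Differences in wording, punctuation, or truncation are not whitespace differences, which is why the chunk should be passed exactly as the splitter produced it.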

metadata.page (integer, default: 0)

The 0-indexed page number of the PDF page where this chunk appears. The service uses this value to narrow its search to a single page; it does not scan the entire document for each chunk. If omitted, the service defaults to page 0. If the value is out of range for the given PDF, the chunk is silently skipped.

Page numbers are 0-indexed. Page 1 of the PDF corresponds to "page": 0, page 2 corresponds to "page": 1, and so on. This matches the convention used by PyMuPDF and LangChain’s PDF loaders. Passing 1-indexed page numbers is the most common cause of missed highlights.
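If your page numbers come from a source that counts from 1, such as a PDF viewer or a citation, convert them before building the payload. The helper below is a minimal sketch: make_payload and its page_count parameter are hypothetical names, and the range check is a client-side guard rather than part of the API.

```python
from typing import Any


def make_payload(text: str, page_number: int, page_count: int) -> dict[str, Any]:
    """Build a document payload from a 1-indexed page number.

    Hypothetical helper. page_count is the total number of pages in the PDF,
    which the caller is assumed to know already.
    """
    if not 1 <= page_number <= page_count:
        # The service would silently skip an out-of-range chunk, so fail early.
        raise ValueError(f"page {page_number} is outside 1..{page_count}")
    # Subtract 1 to convert the 1-indexed page to the 0-indexed value expected.
    return {"page_content": text, "metadata": {"page": page_number - 1}}


# Page 3 as shown in a PDF viewer becomes "page": 2 in the payload.
payload = make_payload("A relevant passage", page_number=3, page_count=10)
print(payload)
```

Raising on an out-of-range page surfaces the mistake on the client, where it is easier to notice than a highlight that never appears.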

LangChain integration

If you are using LangChain to load and split your PDF, the Document objects produced by a text splitter already contain the fields you need. You can pass them to the API with minimal conversion:

from langchain_core.documents import Document

# A LangChain Document produced by a text splitter
doc = Document(
    page_content="The quick brown fox jumps over the lazy dog",
    metadata={"page": 2}  # 0-indexed page number
)

# Convert to API payload
payload = {
    "page_content": doc.page_content,
    "metadata": doc.metadata
}

Most LangChain PDF loaders, including PyMuPDFLoader and PyPDFLoader, populate metadata["page"] automatically with 0-indexed values, so you can often pass the splitter output directly without any manual page-number handling.
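For a whole list of splitter outputs, the same conversion can be wrapped in a small function. The sketch below is one way to do it, not a prescribed method: to_payloads is a hypothetical helper, and SimpleNamespace objects stand in for LangChain Documents so that the example runs without LangChain installed. It relies only on the page_content and metadata attributes.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def to_payloads(docs: Iterable[Any]) -> list[dict[str, Any]]:
    """Convert Document-like objects into a list of API payloads."""
    payloads = []
    for doc in docs:
        payloads.append({
            "page_content": doc.page_content,
            # Copy so that later edits to the payload do not touch the Document.
            "metadata": dict(doc.metadata),
        })
    return payloads


# Stand-ins for LangChain Documents; real ones expose the same two attributes.
docs = [
    SimpleNamespace(page_content="First chunk", metadata={"page": 0}),
    SimpleNamespace(
        page_content="Second chunk",
        metadata={"page": 2, "source": "document.pdf"},
    ),
]
documents = to_payloads(docs)
print(documents)
```

The resulting list can be used as the documents array in a request body.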

Sending multiple documents

You will typically want to highlight several chunks from different pages in a single request. Pass all your payloads as a JSON array in the request body:

import httpx

chunks = [
    {"page_content": "First chunk on page one", "metadata": {"page": 0}},
    {"page_content": "A relevant passage on page three", "metadata": {"page": 2}},
    {"page_content": "Another excerpt further in the document", "metadata": {"page": 7}},
]

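response = httpx.post(
    "http://localhost:8000/highlight",
    json={
        "pdf_url": "https://example.com/document.pdf",
        "documents": chunks,
    },
)
# Stop on an HTTP error rather than writing an error body to disk.
response.raise_for_status()

# The response body is the annotated PDF.
with open("annotated.pdf", "wb") as f:
    f.write(response.content)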

Chunks from different pages can be mixed freely in the array. The service groups highlights by page internally before writing annotations, so order does not matter.
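To see why order does not matter, consider what grouping by page looks like. The sketch below is an illustration of the idea under stated assumptions, not the service's code: group_by_page is a hypothetical function, and it applies the documented default of page 0 when metadata.page is missing.

```python
from collections import defaultdict
from typing import Any


def group_by_page(documents: list[dict[str, Any]]) -> dict[int, list[str]]:
    """Collect chunk texts under their 0-indexed page number."""
    grouped: dict[int, list[str]] = defaultdict(list)
    for doc in documents:
        # A missing metadata.page falls back to 0, as the field's default.
        page = doc.get("metadata", {}).get("page", 0)
        grouped[page].append(doc["page_content"])
    return dict(grouped)


chunks = [
    {"page_content": "A relevant passage on page three", "metadata": {"page": 2}},
    {"page_content": "First chunk on page one", "metadata": {"page": 0}},
    {"page_content": "A second passage on page three", "metadata": {"page": 2}},
]
print(group_by_page(chunks))
```

However the input array is ordered, each page ends up with the same set of chunks.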

You can include any additional fields in metadata, such as source, chunk_id, score, or custom tags, and they will be passed through transparently. The service reads only metadata.page for highlighting purposes; all other metadata fields are ignored and do not affect the output.
