Integrate PDF Highlighting into Your LangChain RAG Pipeline

When your RAG pipeline answers a user’s question, it draws on specific passages from one or more source documents. RAG PDF Highlighter lets you close the loop visually: after retrieval, you send the retrieved chunks back to the service alongside the original PDF URL, and you get back an annotated PDF with every source passage marked in yellow. Users can see exactly which sentences informed the answer, making your application more transparent and trustworthy.

Integration Pattern

Set up your RAG retriever

Load your PDF, split it into chunks, and index those chunks in a vector store using a standard LangChain setup. This step happens once at startup or index-build time.

Run retrieval to get relevant Document chunks

For each user query, call your retriever to get the most relevant Document objects. Each Document carries the chunk text in page_content and a metadata dict that includes the page number (0-indexed).

Call POST /highlight with the PDF URL and retrieved chunks

Send the original PDF URL and the retrieved chunks, each serialized as its page_content and metadata, to the /highlight endpoint. The service downloads the PDF, locates each chunk on its page, and draws yellow highlights over the matching text.

Return the highlighted PDF to your user

Stream or serve the binary PDF response directly to your user’s browser or application. No further processing is needed.
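
Steps 3 and 4 above can be sketched as two small helpers. This is a minimal sketch, not a client library for the service: build_highlight_payload, ensure_pdf, and the Chunk stand-in class are hypothetical names, and the pdf_url and documents field names are taken from the end-to-end example on this page. The check for a %PDF- header is an assumption about what a successful response body looks like, based on the standard PDF file signature.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Chunk:
    """Stand-in for a LangChain Document: same two attributes, no dependency."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def build_highlight_payload(pdf_url: str, docs: list[Any]) -> dict[str, Any]:
    """Build the JSON body for POST /highlight from retrieved chunks.

    Only reads .page_content and .metadata, so any object exposing those
    two attributes works.
    """
    return {
        "pdf_url": pdf_url,
        "documents": [
            {"page_content": d.page_content, "metadata": dict(d.metadata)}
            for d in docs
        ],
    }


def ensure_pdf(body: bytes) -> bytes:
    """Return body unchanged if it starts with the PDF signature, else raise."""
    if not body.startswith(b"%PDF-"):
        raise ValueError("response body does not look like a PDF")
    return body


# Example with a single hypothetical chunk from page index 2.
payload = build_highlight_payload(
    "https://example.com/research-paper.pdf",
    [Chunk("The main finding is...", {"page": 2})],
)
```

Because build_highlight_payload only reads page_content and metadata, it should accept the Document objects returned by the retriever as they are. Passing response.content through ensure_pdf before writing or streaming it keeps an error body from being served to the user as a PDF.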

End-to-End Example

The example below shows a complete RAG pipeline, from loading the PDF to saving the highlighted result, using LangChain and the RAG PDF Highlighter API.

import requests
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# 1. Load and index the PDF
PDF_URL = "https://example.com/research-paper.pdf"
loader = PyPDFLoader(PDF_URL)
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())

# 2. Retrieve relevant chunks for a query
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
results = retriever.invoke("What are the main findings?")

# 3. Highlight the retrieved chunks
response = requests.post(
    "http://localhost:8000/highlight",
    json={
        "pdf_url": PDF_URL,
        "documents": [
            {"page_content": doc.page_content, "metadata": doc.metadata}
            for doc in results
        ],
    },
    timeout=60,
)
# Fail fast on an error response instead of writing it to disk as a PDF
response.raise_for_status()

# 4. Save the highlighted PDF
with open("highlighted_results.pdf", "wb") as f:
    f.write(response.content)

The chunks you send in the documents array must originate from the same PDF you specify in pdf_url. The service locates passages by searching for the exact chunk text on the given page of the downloaded PDF. If the chunks come from a different document, they will not match and no highlights will be applied.
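
If one vector store indexes several PDFs, a retriever can return chunks from more than one file. The following is a minimal sketch of a guard, assuming each chunk's metadata carries a source key identifying the file it was loaded from. PyPDFLoader sets one, but the stored value may be a local path rather than the URL, so compare against whatever your loader actually records. chunks_from_source is a hypothetical helper, and the chunks are shown as plain dicts in the request format.

```python
from typing import Any


def chunks_from_source(
    documents: list[dict[str, Any]], source: str
) -> list[dict[str, Any]]:
    """Keep only the chunks whose metadata["source"] equals the given source."""
    return [
        doc for doc in documents
        if doc.get("metadata", {}).get("source") == source
    ]


# Hypothetical retrieval result mixing chunks from two different PDFs.
retrieved = [
    {"page_content": "Finding A", "metadata": {"source": "paper.pdf", "page": 0}},
    {"page_content": "Unrelated", "metadata": {"source": "other.pdf", "page": 3}},
    {"page_content": "Finding B", "metadata": {"source": "paper.pdf", "page": 4}},
]

same_pdf = chunks_from_source(retrieved, "paper.pdf")
```

Sending only same_pdf in the documents array avoids requests where some chunks can never match the downloaded file.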

Use PyPDFLoader to load your PDFs when building the index. PyPDFLoader stores page numbers in metadata["page"] as 0-indexed integers, which is exactly the format RAG PDF Highlighter expects. Using a loader with a different page-numbering convention may result in highlights appearing on the wrong pages.
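
If your index was already built with a loader that numbers pages from 1, the page numbers can be shifted before the request is sent. This is a minimal sketch, assuming the page number lives under metadata["page"] as described above. to_zero_indexed is a hypothetical helper, and whether your loader is 1-indexed is something to confirm against its documentation.

```python
from typing import Any


def to_zero_indexed(documents: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return copies of the chunks with metadata["page"] shifted from 1-based to 0-based."""
    fixed = []
    for doc in documents:
        meta = dict(doc["metadata"])  # copy so the caller's dict is not mutated
        page = int(meta["page"])
        if page < 1:
            raise ValueError(f"expected a 1-indexed page number, got {page}")
        meta["page"] = page - 1
        fixed.append({"page_content": doc["page_content"], "metadata": meta})
    return fixed


# Hypothetical chunks from a loader that calls the first page "1".
one_based = [
    {"page_content": "Intro text", "metadata": {"page": 1}},
    {"page_content": "Results text", "metadata": {"page": 7}},
]

zero_based = to_zero_indexed(one_based)
```

Apply a shift like this only when you know the convention of the loader that built the index. Running it on chunks that are already 0-indexed would move every highlight one page too early.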
