You do not need to run the FastAPI server to use RAG PDF Highlighter. The package exposes its core utilities as importable functions, so you can embed PDF highlighting directly inside any Python application — a script, a Jupyter notebook, or a larger service — without spinning up a separate process.
Install the Package
pip install rag-pdf-highlighter
Python 3.10 or later is required.
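Before importing the package, it can help to confirm that the interpreter meets the version requirement above. The following is a minimal sketch rather than anything shipped with the package: check_environment is a hypothetical helper name, and the import name rag_pdf_highlighter is taken from the examples on this page.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the install notes


def check_environment(module_name: str = "rag_pdf_highlighter") -> dict[str, bool]:
    """Report whether the interpreter is new enough and the module is importable.

    Hypothetical helper for illustration; not part of the package.
    """
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        # find_spec returns None when a top-level module is not installed
        "package_found": importlib.util.find_spec(module_name) is not None,
    }


if __name__ == "__main__":
    print(check_environment())
```

Running this in the target environment before wiring the library into a larger service gives an early, readable failure instead of an ImportError deep inside application code.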
Complete Workflow Example
The snippet below shows the full end-to-end flow: download a PDF, apply highlights, read the result into memory, and clean up the temporary files.

    import asyncio

    from langchain_core.documents import Document
    from rag_pdf_highlighter.utils.pdf_helpers import (
        download_pdf,
        highlight_chunks_in_pdf,
        cleanup_file,
    )


    async def highlight_pdf(pdf_url: str, documents: list[Document]) -> bytes:
        pdf_path = None
        output_path = None
        try:
            # download_pdf is async and saves the PDF to a temporary file
            pdf_path = await download_pdf(pdf_url)
            # highlight_chunks_in_pdf is synchronous and writes a second temporary file
            output_path = highlight_chunks_in_pdf(pdf_path, documents)
            with open(output_path, "rb") as f:
                return f.read()
        finally:
            # Remove whichever temporary files were created, even after an error
            if pdf_path:
                cleanup_file(pdf_path)
            if output_path:
                cleanup_file(output_path)


    # Usage
    documents = [
        Document(
            page_content="The quick brown fox jumps over the lazy dog",
            metadata={"page": 0},
        )
    ]

    pdf_bytes = asyncio.run(
        highlight_pdf("https://example.com/document.pdf", documents)
    )

    with open("highlighted.pdf", "wb") as f:
        f.write(pdf_bytes)

Always call cleanup_file() in a finally block for both pdf_path and output_path. download_pdf and highlight_chunks_in_pdf each write a temporary file to disk. If an exception is raised and cleanup is skipped, those files accumulate and consume disk space. The finally pattern above guarantees cleanup whether or not an error occurs, and the if checks skip any path that was never assigned because an earlier step failed.
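If the same try/finally shape appears in several places, one option is to fold it into a context manager. The sketch below is an illustration under stated assumptions, not part of the package: tracked_temp_files is a hypothetical name, and _remove_if_exists is a local stand-in for cleanup_file so the example runs without the library installed.

```python
import contextlib
import os
import tempfile
from collections.abc import Callable, Iterator


def _remove_if_exists(path: str) -> None:
    """Local stand-in for cleanup_file so this sketch runs on its own."""
    with contextlib.suppress(FileNotFoundError):
        os.remove(path)


@contextlib.contextmanager
def tracked_temp_files(
    cleanup: Callable[[str], None] = _remove_if_exists,
) -> Iterator[list[str]]:
    """Yield a list; every path appended to it is cleaned up on exit."""
    paths: list[str] = []
    try:
        yield paths
    finally:
        # Runs on normal exit and when the with-body raises
        for path in paths:
            cleanup(path)


# Demonstration: an ordinary temp file plays the role of a downloaded PDF.
fd, demo_path = tempfile.mkstemp(suffix=".pdf")
os.close(fd)

try:
    with tracked_temp_files() as paths:
        paths.append(demo_path)
        raise RuntimeError("simulated failure after the file was created")
except RuntimeError:
    pass

still_exists = os.path.exists(demo_path)  # False: the file was removed on exit
```

In real use you would pass cleanup=cleanup_file and append pdf_path and output_path as soon as each function returns, which keeps the cleanup logic in one place.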
Error Handling
Import the exception classes to handle specific failure modes gracefully.
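
    from langchain_core.documents import Document
    from rag_pdf_highlighter.exceptions import (
        HighlightError,
        PDFDownloadError,
        PDFNotFoundError,
        NoDocumentsError,
    )
    from rag_pdf_highlighter.utils.pdf_helpers import (
        download_pdf,
        highlight_chunks_in_pdf,
        cleanup_file,
    )


    async def safe_highlight(pdf_url: str, documents: list[Document]) -> bytes | None:
        pdf_path = None
        output_path = None
        try:
            pdf_path = await download_pdf(pdf_url)
            output_path = highlight_chunks_in_pdf(pdf_path, documents)
            with open(output_path, "rb") as f:
                return f.read()
        # Specific exceptions come first; the HighlightError base class is the fallback
        except PDFDownloadError as e:
            print(f"Could not fetch the PDF: {e}")
        except PDFNotFoundError as e:
            print(f"PDF file not found: {e}")
        except NoDocumentsError as e:
            print(f"No documents provided: {e}")
        except HighlightError as e:
            print(f"Highlighting failed: {e}")
        finally:
            if pdf_path:
                cleanup_file(pdf_path)
            if output_path:
                cleanup_file(output_path)
        # Reached only when one of the handlers above ran
        return None
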
The exception hierarchy is:
| Exception | Cause |
|---|---|
| PDFDownloadError | The URL fetch failed (network error, non-200 response) |
| PDFNotFoundError | The local PDF file path does not exist |
| NoDocumentsError | An empty document list was passed |
| HighlightError | Base class; catch this to handle any highlighting error |
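
The order of except clauses matters when a base class is involved. The sketch below illustrates this with local stand-in classes that mirror the table. It assumes the three specific exceptions subclass HighlightError, which the table suggests but does not spell out, and describe_failure and caught_by_base are hypothetical helper names.

```python
class HighlightError(Exception):
    """Stand-in for rag_pdf_highlighter.exceptions.HighlightError."""


class PDFDownloadError(HighlightError):
    """Stand-in; assumed to derive from HighlightError."""


class PDFNotFoundError(HighlightError):
    """Stand-in; assumed to derive from HighlightError."""


class NoDocumentsError(HighlightError):
    """Stand-in; assumed to derive from HighlightError."""


def describe_failure(exc: Exception) -> str:
    """Map an exception to a short category label."""
    # Subclasses are checked before the base class; otherwise the
    # HighlightError branch would match every one of them first.
    if isinstance(exc, PDFDownloadError):
        return "download"
    if isinstance(exc, PDFNotFoundError):
        return "not_found"
    if isinstance(exc, NoDocumentsError):
        return "no_documents"
    if isinstance(exc, HighlightError):
        return "highlight"
    return "unexpected"


def caught_by_base(exc: Exception) -> bool:
    """Return True when a bare `except HighlightError` would catch exc."""
    try:
        raise exc
    except HighlightError:
        return True
    except Exception:
        return False
```

Under that assumption, a single except HighlightError clause is enough when you do not need to distinguish the causes, while unrelated errors such as ValueError still propagate.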
download_pdf is an async function and must be called with await. highlight_chunks_in_pdf is a regular synchronous function — call it directly without await. If you are calling from synchronous code, wrap the async parts with asyncio.run() as shown in the example above. If you are already inside an async context (e.g., a FastAPI route or an async test), use await for download_pdf directly.
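Inside an async context, a synchronous call runs on the event loop thread until it returns. One common way to keep the loop responsive is to hand the synchronous step to a worker thread with asyncio.to_thread. The sketch below shows that pattern with stand-in functions so it runs without the package. highlight_without_blocking, fake_download, and fake_highlight are hypothetical names, and whether highlight_chunks_in_pdf is safe to run in a worker thread is an assumption this page does not confirm. Cleanup is omitted to keep the example short.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any


async def highlight_without_blocking(
    pdf_url: str,
    documents: list[Any],
    download: Callable[[str], Awaitable[str]],
    highlight: Callable[[str, list[Any]], str],
) -> str:
    """Await the async download, then run the sync highlighter in a thread."""
    pdf_path = await download(pdf_url)
    # to_thread runs the blocking call in a worker thread and awaits its result
    return await asyncio.to_thread(highlight, pdf_path, documents)


# Stand-ins for download_pdf and highlight_chunks_in_pdf (not the real functions).
async def fake_download(url: str) -> str:
    await asyncio.sleep(0)
    return "/tmp/input.pdf"


def fake_highlight(path: str, documents: list[Any]) -> str:
    return path.replace("input", "highlighted")


result = asyncio.run(
    highlight_without_blocking(
        "https://example.com/document.pdf", [], fake_download, fake_highlight
    )
)
```

In an application you would pass download_pdf and highlight_chunks_in_pdf in place of the stand-ins and keep the try/finally cleanup from the earlier examples around both calls.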