What summarising a PDF actually means
A PDF is a layout file. It describes what text appears where, in what font, at what position on a page — but it doesn't preserve the document's logical structure the way HTML or Markdown does. A scientific paper PDF and a coffee-shop menu PDF look like the same kind of file to most software; they differ only in what's written and where.
Summarising a PDF means extracting the text content from the layout, feeding the text to a language model, and producing a condensed version. Two parts: extraction and summarisation. Both have failure modes. Both have edge cases. The combination produces useful summaries for a wide class of documents and useless summaries for a small class.
Our PDF summariser does the extraction in your browser. The file you upload is parsed by unpdf — a JavaScript PDF library — directly in the page, before any data leaves your device. The extracted text is then sent to our summary endpoint, which feeds it to Claude Sonnet 4.6 with the standard summary prompt. The PDF itself is never uploaded to our server; only the extracted text is.
This is intentional. A 15 MB PDF is more than three times Vercel's 4.5 MB request body limit, so uploading the file itself would require a different upload pipeline. Browser-side parsing avoids the limit and improves privacy as a side effect: your file stays on your device, and only the text content travels to the summary service.
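As a rough illustration of the size constraint, the sketch below serialises extracted text into a request body and checks it against the 4.5 MB limit before anything is sent. The request shape (`SummaryRequest`), the function name, and the choice of binary megabytes are assumptions made for the example, not the service's actual schema.

```typescript
// 4.5 MB body limit from the text; binary megabytes are an assumption.
const BODY_LIMIT_BYTES = 4.5 * 1024 * 1024;

// Hypothetical request shape, for illustration only.
interface SummaryRequest {
  text: string;
  style: "bullets" | "paragraph";
}

// Serialises the request and throws if the UTF-8 encoded body is too large.
function buildSummaryBody(text: string, style: SummaryRequest["style"]): string {
  const request: SummaryRequest = { text, style };
  const body = JSON.stringify(request);
  // Measure bytes, not characters: non-ASCII text takes more than one byte each.
  const bytes = new TextEncoder().encode(body).length;
  if (bytes > BODY_LIMIT_BYTES) {
    throw new Error(`Extracted text is ${bytes} bytes, over the body limit`);
  }
  return body;
}
```

Measuring the encoded byte length matters for non-Latin documents, where the character count understates the payload size.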
Why the file size cap is 15 MB
PDFs vary enormously in size. A 5-page text-only document might be 50 KB. A 200-page scanned book without OCR might be 200 MB. The 15 MB cap balances three considerations.
Browser memory. unpdf runs entirely in the browser tab. A 50 MB PDF can exhaust the tab's memory budget on lower-end devices, causing the page to freeze or crash. The 15 MB cap keeps the parser within safe memory limits on most consumer hardware.
Parse time. A 5 MB text-rich PDF parses in 1-2 seconds. A 15 MB PDF can take 10-15 seconds. Beyond 15 MB, parse times exceed the patience window users have for "instant" tools. The cap matches the user-experience threshold, not the technical maximum.
Useful text content. A 50 MB PDF is usually mostly images (scanned book, photo album, illustrated guide). The extractable text from a 15 MB PDF is already at the upper bound of what a single-pass summary can usefully condense. Beyond that size, the summary is forced to skim because the model's context window can't hold all the text anyway.
For PDFs above 15 MB, the practical workaround is to extract the relevant section first (Adobe Acrobat: File → Extract Pages; Preview on macOS: drag specific pages out) and summarise the smaller fragment.
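A pre-flight check for the cap might look like the sketch below. The function name, result type, and message wording are hypothetical, and the cap is assumed to be measured in binary megabytes.

```typescript
// 15 MB cap from the text; binary megabytes are an assumption.
const MAX_PDF_BYTES = 15 * 1024 * 1024;

type SizeCheck = { ok: true } | { ok: false; message: string };

// Rejects oversized files before any parsing work starts in the tab.
function checkPdfSize(sizeBytes: number): SizeCheck {
  if (sizeBytes <= MAX_PDF_BYTES) {
    return { ok: true };
  }
  const megabytes = (sizeBytes / (1024 * 1024)).toFixed(1);
  return {
    ok: false,
    message: `This PDF is ${megabytes} MB; the cap is 15 MB. Extract the pages you need and try again.`,
  };
}
```

In a browser, the input would typically be the `size` property of the `File` object the user selected.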
What the summariser handles well
Several PDF categories produce reliably useful summaries.
Research papers. Standard academic PDFs (5-30 pages, abstract + introduction + body + conclusion + references) are the platform's strongest use case. The summariser captures the abstract's claims, the methodology, the key findings, and the limitations. For literature-review workflows, one summary per paper is the right unit of intermediate organisation.
White papers and industry reports. Business-research documents (10-50 pages, executive summary + sections). The summariser extracts the headline findings so you don't have to read the entire body.
eBooks (text-based). Non-fiction eBooks distributed as PDF (rather than EPUB or MOBI). For chapters or whole books under the 15 MB cap, summarisation captures the high-level structure. Multi-hundred-page books usually need to be summarised chapter-by-chapter rather than whole.
Technical documentation. Software manuals, API references, user guides. The summariser captures the functional capabilities described without quoting the API signatures verbatim. Use the summary to decide whether to read the full doc.
Lecture slides exported as PDF. Presentation slides converted to PDF retain the bullet-point text that was on each slide. The summariser turns slide bullets into a coherent paragraph or bullet summary.
Business reports and meeting minutes. Internal documents (5-20 pages) where the structure is informal but the text is rich.
What the summariser cannot do well
Several PDF categories defeat the extraction-then-summarise pipeline.
Scanned image-only PDFs without OCR. A PDF where every page is a JPEG of scanned paper has no extractable text. unpdf returns empty strings; the summary has nothing to work with. Run an OCR pass first (Adobe Acrobat: Tools → Scan & OCR; tesseract for offline use; ABBYY for high-accuracy work). After OCR, the PDF becomes summarisable.
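One way to catch this case early is to inspect the per-page text before requesting a summary. The sketch below assumes the extraction step yields one string per page; the function name and both thresholds are illustrative guesses rather than values from the actual pipeline.

```typescript
// Illustrative thresholds (assumptions, not tuned values).
const MIN_CHARS_PER_PAGE = 20;
const MIN_TEXT_PAGE_RATIO = 0.5;

// Returns true when most pages contain little or no extractable text,
// which usually means the PDF is a scan that needs OCR first.
function looksScanned(pages: string[]): boolean {
  if (pages.length === 0) {
    return true;
  }
  const pagesWithText = pages.filter(
    (page) => page.replace(/\s+/g, "").length >= MIN_CHARS_PER_PAGE
  ).length;
  return pagesWithText / pages.length < MIN_TEXT_PAGE_RATIO;
}
```

Using a ratio rather than requiring text on every page tolerates blank pages and full-page figures in otherwise text-rich documents.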
Encrypted / password-protected PDFs. unpdf cannot decrypt; the summariser refuses the file. Remove the password (Adobe Acrobat: File → Properties → Security) before uploading.
PDFs where the visual layout is the message. Comic books, infographics, complex diagrams, brochures with heavy visual design. The text is fragments and labels; the summary captures fragments and labels. Look at the original.
Multi-column layouts with running headers. Newspaper PDFs and journals with two-column layouts and running headers / footers extract in non-linear order — a sentence might come from column 2 of page 5 in the middle of column 1 of page 4. The summariser tries to reorder coherently but sometimes produces disjointed summaries for these layouts.
Tables and figures. A PDF with a key chart or data table loses the chart entirely (charts are image objects, not extractable text). Tables sometimes extract as misaligned columns. For PDFs where a table or figure is the centerpiece, the summary describes only the surrounding prose.
Equations and mathematical notation. LaTeX typesets equations as individually positioned glyphs from math fonts, and some export tools embed them as images. Extraction returns either nothing or a scramble of symbols with the structure (fractions, superscripts, subscripts) lost. A summary of a math-heavy paper captures the prose around the equations but not the equations themselves.
Very long PDFs that exceed the model's context window. A 200-page text-rich book has more text than Claude can read in a single pass. Our pipeline truncates very long documents to the most relevant ~50,000 words. The summary captures the retained portion well; whatever was cut is missing.
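To make the effect of a word budget concrete, here is a minimal sketch that keeps the first N words and reports how many were dropped. It is a simplified stand-in: the relevance-based selection the pipeline uses is not described here, and the function and type names are hypothetical.

```typescript
// ~50,000 words, the budget named in the text.
const MAX_WORDS = 50000;

interface Truncation {
  text: string;
  keptWords: number;
  droppedWords: number;
}

// Keeps the first maxWords words. This is NOT relevance-based selection;
// it only illustrates what a fixed word budget does to a long document.
function truncateToWordBudget(text: string, maxWords: number = MAX_WORDS): Truncation {
  const words = text.split(/\s+/).filter((word) => word.length > 0);
  const kept = words.slice(0, maxWords);
  return {
    text: kept.join(" "),
    keptWords: kept.length,
    droppedWords: words.length - kept.length,
  };
}
```

Surfacing `droppedWords` lets a caller warn the reader that the summary covers only part of the document.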
Common gotchas
The first-page title may not match the document title. Some PDFs have a cover page (logo, blank, copyright notice) before the actual title page. The extracted "title" comes from the first non-empty text block on page 1, which sometimes is a copyright disclaimer rather than the actual document name.
Page numbers, headers, and footers contaminate the text. A PDF with "Confidential — page 5 of 40" repeated on every page injects that string into the extracted text many times. The summariser usually skips the boilerplate but occasionally surfaces it as a bullet point. Re-run if the summary's first bullet is page-header noise.
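A common heuristic for this kind of noise is to drop lines that recur on most pages once digits are masked. The sketch below assumes per-page text with newline-separated lines; the function names and the 60% threshold are assumptions, not a description of what the summariser does internally.

```typescript
// Masks digits so "page 5 of 40" and "page 6 of 40" compare as equal.
function normaliseLine(line: string): string {
  return line.trim().replace(/\d+/g, "#").toLowerCase();
}

// Removes lines whose normalised form appears on at least minRatio of pages.
function stripRepeatedLines(pages: string[], minRatio: number = 0.6): string[] {
  if (pages.length < 3) {
    return pages; // too few pages to tell boilerplate from content
  }
  const pageCounts = new Map<string, number>();
  for (const page of pages) {
    // Count each distinct line once per page.
    const seen = new Set<string>(
      page.split("\n").map(normaliseLine).filter((line) => line !== "")
    );
    seen.forEach((key) => pageCounts.set(key, (pageCounts.get(key) || 0) + 1));
  }
  const threshold = Math.ceil(pages.length * minRatio);
  return pages.map((page) =>
    page
      .split("\n")
      .filter((line) => {
        const key = normaliseLine(line);
        return key === "" || (pageCounts.get(key) || 0) < threshold;
      })
      .join("\n")
  );
}
```

The heuristic can misfire on documents that legitimately repeat a line on most pages, such as a form with the same field label throughout.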
Hyperlinks in the PDF appear as raw URLs. A clickable "see our website" link in the PDF extracts as "see our website" followed by a literal URL. The summariser preserves the URL as a fact when the URL is meaningful.
Footnote markers can confuse the prose flow. A sentence with three footnote references might extract as "the data shows X 12 13 14, suggesting Y," where 12, 13 and 14 are footnote numbers. The summary usually elides these, but some footnote-heavy academic papers produce slightly fragmented summaries.
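If you pre-clean extracted text yourself, a narrow regular expression can remove the most obvious marker runs. The sketch below is a lossy heuristic under stated assumptions: it only strips two or more short numbers that sit between a letter and punctuation, so single markers survive and a genuine run of small numbers would be removed. The function name is hypothetical.

```typescript
// Strips runs of two or more 1-3 digit numbers that directly follow a letter
// and directly precede punctuation, e.g. "X 12 13 14," becomes "X,".
function stripFootnoteRuns(text: string): string {
  return text.replace(/([A-Za-z])(?:\s\d{1,3}){2,}(?=[,.;:])/g, "$1");
}
```

Requiring at least two numbers keeps ordinary quantities such as "12 sites" intact, at the cost of missing lone footnote markers.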
Embedded fonts can break extraction. Some older PDFs use non-standard or embedded fonts that unpdf can't decode correctly, producing question marks or boxes for certain characters. The summary inherits the unreadable characters. Re-export the PDF from the source application (Word → PDF, LaTeX → PDF) to get a more standard encoding.
Non-Latin scripts (Arabic, CJK, Devanagari) extract reliably from modern PDFs, which handle Unicode cleanly. Older PDFs (pre-2010) with non-Latin scripts often have encoding issues that produce garbled text.
When a different tool fits better
For OCR-needed PDFs, run OCR first. Adobe Acrobat Pro has a built-in OCR pass (Tools → Scan & OCR → Recognise Text). Free alternatives include tesseract on the command line and Google Drive's OCR (upload a PDF, then "Open with Google Docs" runs OCR).
For PDFs with critical tables and charts, use a layout-preserving extraction tool. pdfplumber and Camelot (both Python libraries) extract tables. For chart data, manually re-create the chart from the source data rather than relying on extraction.
For very long PDFs (50+ pages of dense text), summarise chapter-by-chapter or section-by-section rather than whole-document. Our summariser truncates very long input; chapter-level summaries combine into whole-book understanding without losing any chapter.
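When a document has no clean chapter boundaries, splitting on paragraph breaks under a word budget is a reasonable fallback. The sketch below assumes paragraphs are separated by blank lines in the extracted text; the function name and the budget you pass in are up to you and are not part of the summariser.

```typescript
// Groups paragraphs into chunks of at most maxWords words each.
// A single paragraph longer than maxWords becomes its own oversized chunk.
function chunkByParagraphs(text: string, maxWords: number): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((paragraph) => paragraph.trim())
    .filter((paragraph) => paragraph !== "");
  const chunks: string[] = [];
  let current: string[] = [];
  let currentWords = 0;
  for (const paragraph of paragraphs) {
    const words = paragraph.split(/\s+/).length;
    // Close the current chunk if this paragraph would push it over budget.
    if (current.length > 0 && currentWords + words > maxWords) {
      chunks.push(current.join("\n\n"));
      current = [];
      currentWords = 0;
    }
    current.push(paragraph);
    currentWords += words;
  }
  if (current.length > 0) {
    chunks.push(current.join("\n\n"));
  }
  return chunks;
}
```

Each chunk can then be summarised separately, and the chunk summaries read in order give the whole-document picture.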
For presentations where the visual design carries meaning (sales decks, pitch decks), the PDF text extraction misses what the slides communicate visually. Watch the recording or review the deck visually.
For PDFs in languages our model handles weakly, extract the text first, run it through a translator (DeepL, Google Translate), then summarise the translation. It takes two steps but works for any source language.
A workflow for using PDF summaries
For research workflows where you need to triage a stack of papers:
- Upload each paper, run a bullet summary, save the bullets alongside the citation.
- Read the summaries to decide which papers warrant a full read.
- For papers that pass the triage, read in full with the summary as a navigation aid.
For business workflows where you receive a long document and need the headline points:
- Upload the PDF, run a key-takeaways summary (3-5 takeaways).
- Compare the takeaways to what the sender said the document was about.
- If the takeaways and the framing align, you're done — the document confirms the framing.
- If they diverge, read the relevant sections of the PDF in full.
For learning workflows (textbook chapters, technical guides):
- Summarise the chapter as bullets first.
- Use the bullets as a study outline.
- Read the chapter in full with the outline guiding your attention.
- Save the bullet summary as your study notes.
The summary is a navigation tool and a recall aid, not a replacement for the source document. For PDFs that matter, use the summary to scope the work before reading the original.