What is a PDF-to-Markdown conversion?
PDFs are a fixed-layout format designed for consistent visual presentation: text is positioned absolutely on the page, often split across columns, and the underlying structure that a word processor preserves (headings, lists, tables) is flattened during PDF generation. Converting a PDF to Markdown means extracting the readable text back into a structured, plain-text format that modern tools accept. What survives cleanly: paragraphs of prose, inline formatting that was encoded as text runs, and the order in which the PDF stores its text. What doesn't survive cleanly: multi-column layouts, complex tables, figure captions, footnotes, embedded images, and any content that was scanned rather than typed.
Our converter targets the common case: you have a text-based PDF — a research paper, a report, a memo, a chapter — and you want the content out. You get Markdown paragraphs with normalized whitespace, ready to paste into a blog, a docs site, a note-taking vault, or a messaging platform. Tables and images are not recovered (see gotchas below); structure beyond paragraph breaks is approximate.
Why extract text from PDFs into Markdown?
Three workflows drive most PDF-to-Markdown conversions. First, research notes: you download papers as PDFs, and you want to quote from them in your notes (Obsidian, Roam, Notion, Logseq). Keeping a PDF library is fine for archival, but unless you extract the text, the content is unsearchable in your knowledge graph and un-copyable into your drafts without painful hand-selection.
Second, content migration: a document was produced as a PDF (a white paper, an internal report, a book chapter) and you need the content on a blog, a static site, or a wiki. Turning the PDF back into Markdown closes the loop — the content can now live in version control, be edited collaboratively, and render on any platform that speaks GFM.
Third, AI preparation: language models generally handle Markdown more reliably than PDF. If you want to summarize, translate, or rewrite a PDF's content with an LLM, converting to Markdown first tends to produce more reliable results than feeding raw PDF or scraped HTML. The token count is usually lower, the structure is cleaner, and the model doesn't have to work around PDF layout artifacts.
Manual approach
Hand-extracting text from a PDF is possible but unpleasant. Open the PDF in a viewer, select all, copy, paste into a text editor. You get the text, usually with broken line wraps (a paragraph becomes 15 separate lines ending mid-sentence), stray page headers and footers repeated on every page, and multi-column content interleaved in an order that doesn't read as continuous prose.
Then you normalize: join broken lines, strip page headers and footers, break into paragraphs at blank lines. If you need headings, go back to the PDF, find the ones that were styled as headings, and add #/##/### in front. Links: the PDF may have had hyperlinks embedded but text-select usually drops them; you rebuild them by hand. Math equations: if the author used a symbol font, your paste got question marks or box characters. Tables: hopeless.
For a 10-page PDF, hand-extraction takes 30-60 minutes. The quality is mediocre — broken wraps, lost formatting, occasional mojibake. For a 200-page book, it's a weekend project. Nobody does this by hand more than once.
Automated approach (our tool)
Our converter extracts text in your browser using Mozilla's pdf.js (via the unpdf library). You drop a PDF, and the PDF file itself never leaves your machine: extraction happens client-side, and only the extracted plain text is sent to our server for post-processing. This matters for confidential documents: contracts, unpublished research, drafts, client reports. You get the convenience of a web tool without uploading the file itself, but keep in mind that the extracted text does reach our server.
The pipeline:
- Client-side text extraction via pdf.js — traverses each page's text runs in index order, strips position data, joins into a continuous stream.
- Per-page segmentation up to 50 pages (hard cap). Longer PDFs produce a truncation warning — the first 50 pages are processed, the rest skipped.
- 15-second timeout per file to protect against PDF bombs or malformed documents.
- Server-side post-processing: whitespace normalization, paragraph-break detection from blank lines, collapse of redundant line wraps.
- Output: clean Markdown paragraphs, ready to copy.
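The post-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual server code: the function name `postprocess`, the per-page input format, the choice to treat every page boundary as a paragraph break, and the warning strings are all assumptions made for the example.

```python
import re

MAX_PAGES = 50  # page cap described in the pipeline above


def postprocess(pages, max_pages=MAX_PAGES):
    """Turn per-page extracted text into Markdown paragraphs plus warnings.

    `pages` is a list of strings, one per page (assumed input shape).
    """
    warnings = []
    if len(pages) > max_pages:
        warnings.append(
            f"truncated: processed first {max_pages} of {len(pages)} pages"
        )
        pages = pages[:max_pages]

    # Assumption: a page boundary is treated as a paragraph break.
    text = "\n\n".join(pages).replace("\r\n", "\n").replace("\r", "\n")

    # Whitespace normalization: collapse runs of spaces/tabs, trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]

    # Paragraph detection: blank lines separate paragraphs.
    paragraphs, current = [], []
    for line in lines:
        if line:
            current.append(line)
        elif current:
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)

    # Collapse line wraps: lines inside one paragraph are joined by a space.
    joined = [" ".join(p) for p in paragraphs]
    if not joined:
        # Stand-in for the tool's empty-input warning.
        warnings.append("empty input: PDF may be image-only")
    return "\n\n".join(joined), warnings
```

Calling `postprocess(["First line\nwraps here.\n\nSecond para."])` returns the two paragraphs separated by a blank line and an empty warning list. A paragraph that continues across a page break ends up split in two under this sketch, which is one reason manual review is still needed.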
File size cap: 15 MB. This covers the vast majority of text-based PDFs. For image-heavy PDFs (scanned documents, photo archives), the size swells but our extractor will return empty output — see the OCR gotcha below.
Common gotchas
Scanned PDFs return empty text. A "scanned PDF" is a PDF where each page is a full-page image of the original document: no text runs, just pixels. Our extractor reads text runs; scanned PDFs have none, so the output is empty and we emit a warning: "empty input — PDF may be image-only (OCR not supported)." To process a scanned PDF, run OCR first. Recent macOS versions of Preview can recognize text in scanned pages and embed it when exporting to PDF (the option's name and availability depend on the macOS version); Adobe Acrobat has OCR built in; cloud OCR services (Google Docs OCR, ABBYY) work too. Run OCR, save as a text-based PDF, then convert.
Multi-column layouts interleave. Academic papers, magazines, and many reports use two-column layouts. Our extractor follows the PDF's internal text order, which is usually correct: the left column top to bottom, then the right column. But not always. Some PDFs store text in the order the author typed it, which for a multi-column layout can mean "row 1 left column, row 1 right column, row 2 left column, row 2 right column", and the result is garbled. Spot-check the output; if columns interleave, you may need to copy each column out separately or reconstruct the order manually.
Page headers and footers repeat. Every page has its header ("Proceedings of X Conference") and footer ("Page 5 of 120") extracted along with the body. Our post-processing doesn't detect and strip these — it would risk removing body content that happens to match page-header patterns. You'll see them in the output; delete by hand or use search-and-replace.
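If you would rather script the cleanup than search-and-replace by hand, one possible approach is to look for lines that repeat across most pages. The sketch below is not part of our tool. It assumes you have the text split per page (for example, pdftotext output split on form-feed characters), and the helper names `find_page_chrome` and `strip_page_chrome` are made up for this example. Digits are masked so that "Page 5 of 120" and "Page 6 of 120" count as the same line.

```python
import re
from collections import Counter


def normalize(line):
    """Mask digit runs so page numbers do not make each footer unique."""
    return re.sub(r"\d+", "#", line.strip())


def find_page_chrome(pages, min_share=0.6):
    """Return normalized lines that appear on at least min_share of pages."""
    counts = Counter()
    for page in pages:
        # A set counts each candidate line at most once per page.
        seen = {normalize(line) for line in page.splitlines() if line.strip()}
        counts.update(seen)
    threshold = max(2, int(min_share * len(pages)))
    return {line for line, n in counts.items() if n >= threshold}


def strip_page_chrome(pages, chrome):
    """Drop every line whose normalized form is in the chrome set."""
    cleaned = []
    for page in pages:
        kept = [l for l in page.splitlines() if normalize(l) not in chrome]
        cleaned.append("\n".join(kept))
    return cleaned
```

Review the set returned by `find_page_chrome` before stripping anything: a short body line that recurs on many pages would be flagged too, which is exactly the risk described above.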
Tables come through as flat text. Complex tables in PDFs are rendered as positioned text runs with visual alignment but no structural markup. Our extractor reads the text runs in order, which for a 3-column table means you get all of column 1, then column 2, then column 3 — or some similar jumble. No GFM pipe table is produced. If the PDF's tables matter, re-create them by hand, or use a dedicated PDF-table tool like Tabula (open source) for just the tables.
Images drop silently. Photos, diagrams, charts, logos: all lost. Our extractor returns text only. If figures in the PDF matter for your use case, extract them separately: macOS Preview, Adobe Acrobat, or pdfimages (CLI) can save embedded images. Re-embed into your Markdown with `![alt text](path/to/image.png)` after conversion.
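As a rough sketch of the extract-and-re-embed step: pdfimages (from poppler-utils) writes files named after a prefix you choose, and a few lines of Python can turn those file names into Markdown image references. The function names, the "Figure N" alt text, and the images/fig prefix are illustrative assumptions; you still have to move each reference to the right place in the text.

```python
from pathlib import Path


def pdfimages_command(pdf_path, out_prefix):
    """Build a pdfimages call that writes PNG files named <prefix>-NNN.png."""
    return ["pdfimages", "-png", str(pdf_path), str(out_prefix)]


def image_references(paths, alt_prefix="Figure"):
    """Return one Markdown image line per path, sorted by file name."""
    files = sorted(Path(p) for p in paths)
    return [
        f"![{alt_prefix} {i}]({f.as_posix()})"
        for i, f in enumerate(files, start=1)
    ]
```

Typical use would be to run the command with `subprocess.run(pdfimages_command("paper.pdf", "images/fig"), check=True)` and then call `image_references(Path("images").glob("*.png"))`. The numbering follows file order, which may not match the figure numbers printed in the document.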
Math equations degrade. If equations were typeset in LaTeX, the PDF stores them as individually positioned glyphs from math fonts, so you may see disjointed symbols with superscripts, subscripts, and fractions flattened onto one line. If equations were rendered as images, they're lost. If they used Unicode math characters, they survive but without LaTeX semantics. Our tool does not reconstruct $E=mc^2$ from a rendered equation image.
Hyperlinks keep their display text but usually lose the URL. If the PDF had a clickable in-text link, our extractor often recovers the display text but not the target URL. You'll need to add URLs back manually if link preservation matters.
When to use a different tool instead
For academic papers where you need structure (sections, subsections, citations preserved): Pandoc is often suggested, but it does not read PDF input (it can only write PDF), so it cannot help here. Better: find the paper's source TeX (many authors post preprints on arXiv) and use our /markdown/latex-to-markdown tool instead. LaTeX sources carry the structure; PDFs don't.
For scanned PDFs: OCR first via macOS Preview, Adobe, or a cloud OCR service (Google Docs, ABBYY FineReader). Save as text-based PDF, then use our tool.
For PDF tables specifically: Tabula (open-source, Java) or Camelot (open-source, Python). These detect table boundaries from the page layout and export to CSV and similar formats. Our tool won't give you usable tables from a PDF.
For high-volume batch conversion: pdftotext (part of poppler-utils) on the command line; Pandoc does not accept PDF input. Our tool is per-file in the browser; for hundreds of PDFs, script the local CLI path.
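A batch run over a folder could look something like the sketch below. It assumes pdftotext is installed and on your PATH; the function names and the (converted, failed) return shape are choices made for this example. The optional `last_page` argument maps to pdftotext's `-l` flag, which is handy if you want to mirror a page cap.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path, last_page=None):
    """Build the argument list for one pdftotext call."""
    cmd = ["pdftotext"]
    if last_page is not None:
        cmd += ["-l", str(last_page)]  # -l: last page to convert
    return cmd + [str(pdf_path), str(txt_path)]


def convert_folder(src_dir, dst_dir, last_page=None):
    """Convert every *.pdf in src_dir; return (converted, failed) name lists."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    converted, failed = [], []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        out = dst / (pdf.stem + ".txt")
        result = subprocess.run(
            pdftotext_command(pdf, out, last_page), capture_output=True
        )
        # A non-zero exit code marks the file as failed; the loop continues.
        (converted if result.returncode == 0 else failed).append(pdf.name)
    return converted, failed
```

The output is plain text, not Markdown, so the same cleanup applies afterwards: joining wrapped lines, removing page headers and footers, and adding headings by hand.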
For PDFs you want to summarize, not convert: /summary/pdf — VUST's AI summarization tool. Different intent: our tool gives you the full content as Markdown; /summary/pdf gives you a condensed summary via LLM.
Migration workflow
A practical workflow for moving PDF content into Markdown-based systems:
- Triage your PDF library. Categorize: text-based PDFs (most academic papers, reports, e-books), scanned PDFs (old archives, legal docs), and hybrid (some pages text, some images). Text-based works with our tool; scanned needs OCR first; hybrid works partially.
- Convert text-based PDFs first. For each file, drop into our tool, review output, copy to destination. For 10-20 files, this is a 20-minute batch. For longer libraries, script via pdftotext locally.
- Normalize page chrome. Walk the output, delete repeated headers and footers, fix broken line wraps where the extractor didn't catch them.
- Handle references and citations. Academic PDFs usually have a references section at the end. Our tool preserves it as flat text. If you're publishing the converted content, reformat references into your target style (footnotes, Markdown reference links, etc.).
- Add frontmatter. Target systems (Jekyll, Hugo, Obsidian) expect YAML frontmatter: title, date, author, tags. Extract these from the PDF metadata (mdls file.pdf on macOS, or exiftool on any platform) or add them by hand.
- Re-embed images. For figures that matter, extract them from the PDF separately, save to an images folder, and reference them with `![alt text](images/figure.png)` in the appropriate spots in the Markdown.
- Verify rendering. Preview the converted Markdown in your target platform. Common fixes needed: fence code blocks that weren't recognized, fix broken Markdown links, ensure math (if any) renders.
For a library of 50 medium-complexity PDFs, expect 2-4 hours across steps 1-7. The per-file conversion is fast; the per-file cleanup is where most of the time goes.
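The frontmatter step in the workflow above can be scripted once you have the metadata in hand. The sketch below is one possible approach, with a made-up function name and field set. It relies on the fact that JSON strings and arrays are also valid YAML flow syntax, so `json.dumps` takes care of quoting titles that contain colons or quotation marks.

```python
import json


def with_frontmatter(body, title, date, author=None, tags=()):
    """Prepend a YAML frontmatter block to a Markdown body."""
    fields = {"title": title, "date": date}
    if author:
        fields["author"] = author
    if tags:
        fields["tags"] = list(tags)
    # JSON scalars and arrays are valid YAML, so json.dumps handles escaping.
    lines = [
        f"{key}: {json.dumps(value, ensure_ascii=False)}"
        for key, value in fields.items()
    ]
    return "---\n" + "\n".join(lines) + "\n---\n\n" + body.lstrip("\n")
```

For example, `with_frontmatter(text, "Quarterly report", "2024-05-01", tags=["pdf", "notes"])` yields a block with quoted title and date plus a `tags` list. Dates are written as quoted strings here; check that your target system parses them the way you expect.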