How to Extract Text from a PDF
Copying text from a PDF can be surprisingly frustrating. Formatting breaks, columns get merged, and line breaks appear in the wrong places. A dedicated text extraction tool pulls the raw text content from the PDF structure, giving you clean plain text you can actually work with. A browser-based extractor handles the entire job locally without uploading your document to a server.
Text-based vs scanned PDFs
Before extracting text, it helps to understand what kind of PDF you have:
Text-based PDFs: created from Word documents, web pages, or other digital sources. The text is stored as data inside the PDF. You can select and highlight text when viewing these files. Text extraction works best with these.
Scanned PDFs: created by scanning a physical document. The PDF contains images of pages, not actual text data. You cannot select text in these files. Standard text extraction returns empty results; you need OCR software instead.
Hybrid PDFs: some PDFs contain a mix of digital text and scanned images. The extractor will capture the text portions but not the image-based content.
Searchable scanned PDFs: scanned PDFs that have been run through OCR, with the recognized text embedded as a layer behind the page images. Text extraction works on these because the OCR text is stored in the PDF. The accuracy depends on the OCR quality; OCR text often has typos from misrecognized characters.
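As a rough illustration of the categories above, the following sketch classifies a PDF from the text an extractor returned for each page. The function name `classify_pdf`, the `page_texts` input, and the `min_chars` threshold are assumptions made for this example, not part of any particular tool.

```python
def classify_pdf(page_texts, min_chars=20):
    """Classify a PDF from the text extracted from each of its pages.

    page_texts: list of strings, one per page (hypothetical input; produce it
    with whatever extractor you use).
    min_chars: assumed threshold below which a page counts as having no text.
    """
    if not page_texts:
        return "empty"
    pages_with_text = sum(1 for text in page_texts if len(text.strip()) >= min_chars)
    if pages_with_text == 0:
        return "scanned"   # no text layer found: OCR is needed
    if pages_with_text == len(page_texts):
        return "text"      # every page returned text
    return "hybrid"        # some pages returned text, others did not
```

A check like this cannot tell a text-based PDF from a searchable scanned PDF, since both return text, and it only detects hybrids where whole pages are images.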
How to extract text from a PDF
- Upload your PDF: select the file or drag and drop it. The tool accepts any standard PDF.
- Extract text: click the extract button. The tool processes all pages and displays the raw text.
- Copy or download: copy the text to your clipboard or download it as a TXT file.
A brief history of PDF text extraction
PDF was created by Adobe in 1993 to reproduce page layout exactly, and its internal structure reflects that goal. A PDF stores text as positioned glyphs (a character plus its x/y coordinates on the page), not as flowing prose. To extract readable text, a tool has to read these glyph positions and reconstruct paragraphs by inferring word boundaries, line breaks, and reading order.
The first widely used PDF text extractor was pdftotext (1996), part of the open-source xpdf project by Derek Noonburg. It used a simple algorithm: sort glyphs by Y then X, group by line, group lines into blocks. Most modern extractors still use a refined version of this approach.
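The sort-and-group idea can be sketched in a few lines. This is a simplified illustration, not pdftotext's actual code: the `Glyph` record and the `line_tol` and `space_ratio` thresholds are assumptions chosen for the example, and it stops at lines rather than grouping lines into blocks.

```python
from collections import namedtuple

# Hypothetical glyph record: character, left edge x, baseline y, advance width.
Glyph = namedtuple("Glyph", "char x y width")

def glyphs_to_text(glyphs, line_tol=2.0, space_ratio=0.3):
    """Rebuild lines of text from positioned glyphs.

    Assumes PDF coordinates, where y grows upward, so a larger y is higher
    on the page. line_tol and space_ratio are illustrative thresholds.
    """
    lines = []  # each entry is a list of glyphs on roughly the same baseline
    # Sort top of page first, then left to right.
    for g in sorted(glyphs, key=lambda g: (-g.y, g.x)):
        if lines and abs(lines[-1][0].y - g.y) <= line_tol:
            lines[-1].append(g)
        else:
            lines.append([g])

    out = []
    for line in lines:
        line.sort(key=lambda g: g.x)
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            # A gap wider than a fraction of the previous glyph's width
            # is treated as a word boundary.
            gap = cur.x - (prev.x + prev.width)
            if gap > space_ratio * prev.width:
                text += " "
            text += cur.char
        out.append(text)
    return "\n".join(out)
```

Real extractors add column detection and font metrics on top of this; a single global sort like the one above is exactly what interleaves the lines of a two-column page.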
PDF.js (Mozilla, 2011) brought PDF rendering to the browser without a plugin. It also exposed a text-extraction API that powers most browser-based extractors today, including this one. PDF.js can read every PDF feature the browser needs: text, images, forms, annotations, signatures, embedded fonts.
The main improvements over the years have been:
- Better column detection: distinguishing two-column layouts from single-column with wide margins
- Unicode normalization: handling ligatures (fi, fl), accented characters, RTL scripts
- Table awareness: detecting tabular layouts and preserving column structure
- Font-aware spacing: using font metrics to infer where words begin and end
Modern extraction works well for prose documents (books, articles, contracts). It still struggles with multi-column scientific papers, complex tables, and heavily formatted brochures.
When text extraction is useful
- Data migration: pulling content from PDFs into spreadsheets, databases, or other systems
- Content editing: extracting text to edit in a word processor before creating a new document
- Search and analysis: converting PDF content to plain text for searching, counting, or processing
- Accessibility: making PDF content available in formats that work better with screen readers
- Archiving: creating text backups of important documents
- LLM input: feeding PDF text into ChatGPT, Claude, or local LLMs for summarization or analysis
- Translation: pulling text out so a translator can work in their CAT tool
- Quote extraction: pulling specific passages from legal contracts or research papers for citation
- Citation management: extracting reference lists from PDF papers for Zotero or Mendeley
- Compliance and discovery: extracting text for keyword search in legal eDiscovery workflows
- Subtitle generation: extracting transcripts from PDF educational materials
- Indexing: feeding extracted text into local search systems (Elasticsearch, Meilisearch)
Output format options
Different uses need different output formats:
| Format | Best for | Limitations |
|---|---|---|
| Plain text (.txt) | Universal, no formatting | Loses headings, lists, tables |
| Markdown (.md) | Structured docs, headings preserved | Tables may need manual fix |
| HTML | Web display, preserves bold/italic | More complex than .txt |
| Word (.docx) | Editing in Microsoft Word | Loses some PDF-specific formatting |
| JSON | Per-page or per-block extraction | For developers, not direct reading |
| XML/EPUB | E-book conversion | Complex setup |
For most everyday extraction (copying a paragraph, feeding text to an LLM), plain text is the right choice. For long documents you intend to re-edit, PDF to Word is usually better.
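For the JSON row in the table, a per-page structure might look like the sketch below. It assumes the extracted text separates pages with form feed characters, which is how pdftotext separates pages by default; the function name `pages_to_json` and the field names are illustrative choices, not a standard schema.

```python
import json

def pages_to_json(text, source="document.pdf"):
    """Split extracted text into pages and return a per-page JSON string.

    Assumes pages are separated by form feed characters (U+000C).
    `source` is a placeholder file name used only as a label.
    """
    pages = text.split("\f")
    # A trailing form feed leaves one empty chunk at the end; drop it.
    if pages and not pages[-1].strip():
        pages.pop()
    records = [
        {"page": number, "text": page.strip()}
        for number, page in enumerate(pages, start=1)
    ]
    return json.dumps({"source": source, "pages": records},
                      ensure_ascii=False, indent=2)
```

Keeping the page number next to each chunk makes it possible to cite or re-check a passage against the original PDF later.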
Common pitfalls
- Reading order wrong in multi-column layouts: a two-column academic paper may extract left-column then right-column (correct) or interleave them line by line (scrambled). Verify reading order, especially for academic PDFs.
- Headers and footers in body text: page numbers, running headers, and footers get extracted as text on every page, breaking up the flow. Strip them by searching for the repeated text.
- Ligatures and special characters: "fi" stored as a single ligature glyph may extract as the single ligature character (U+FB01) or as the two letters "f" and "i", depending on the PDF. Older PDFs are worse for this.
- Hyphenation at line breaks: a word broken at the end of a line with a hyphen (compre-/hensive) extracts with the hyphen and newline. You may need to fix these manually or with a script.
- Tables fragmented: PDFs do not store tables structurally; extraction usually produces a flat list of cell text without row/column structure.
- OCR text quality: text layers behind scanned PDFs often contain OCR errors ("rn" reads as "m", "cl" reads as "d"). Spot-check before relying on the output.
- Encoding mojibake: a PDF that uses a non-standard font encoding may extract as gibberish. Try opening the PDF in Adobe Reader and copy-paste to see if it has the same issue.
- Form fields extracted out of context: fillable PDF forms have field labels and values that may appear scrambled when extracted.
- Annotations and comments: text in PDF annotations is separate from the page content. Some extractors include them, some do not.
- Right-to-left text: Arabic, Hebrew, Persian text may extract left-to-right or with characters in visual order rather than logical order.
- Vertical text: Japanese/Chinese tategaki (vertical writing) PDFs may extract with characters in wrong order.
- Watermarks: watermarks (CONFIDENTIAL, DRAFT) become part of the extracted text, repeated on every page.
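Two of the pitfalls above, ligatures and line-break hyphenation, can often be cleaned up with a short script. The sketch below is one possible approach using only the Python standard library; the function name `clean_extracted_text` is made up for this example.

```python
import re
import unicodedata

def clean_extracted_text(text):
    """Expand ligatures and rejoin words hyphenated at line breaks."""
    # NFKC maps compatibility characters such as U+FB01 (the fi ligature)
    # to their plain equivalents ("fi").
    text = unicodedata.normalize("NFKC", text)
    # Join "compre-\nhensive" into "comprehensive". This also joins genuine
    # compounds ("well-\nknown" becomes "wellknown"), so review the result.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return text
```

NFKC normalization also rewrites other compatibility characters (full-width letters, superscript digits, non-breaking spaces), which is usually welcome for search and analysis but worth knowing if you need the text byte-for-byte.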
Alternative approaches
If browser-based extraction does not work for your PDF:
- OCR for scanned PDFs: Tesseract (open-source), Adobe Acrobat Pro, Google Drive (uploads and runs OCR), or commercial services like ABBYY FineReader.
- Command-line tools and libraries: pdftotext (xpdf/poppler), pdfminer.six (Python), pdfplumber (Python, table-aware), pdf-parse (Node.js).
- Adobe Acrobat Pro: Export As > Text or Word. Generally accurate, but uses cloud services in some workflows.
- PDF-to-Word followed by save-as-text: gives you Word formatting plus the text.
- Print to a text file: some viewers can "print" to a text-only output, useful for awkward layouts.
- LLM-based extraction: ChatGPT/Claude can extract text from uploaded PDFs and even reformat tables; useful for one-offs but slower and limited by upload size.
For confidential PDFs that should not leave your machine, browser-based extraction (this tool) or local command-line tools (pdftotext) are the safest options.
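If you go the local command-line route, a small wrapper around pdftotext might look like this. It is a sketch that assumes the pdftotext binary is installed and on your PATH; the helper names `build_pdftotext_cmd` and `extract_text` are made up for this example, while `-enc`, `-layout`, `-f`, and `-l` are pdftotext options.

```python
import subprocess

def build_pdftotext_cmd(pdf_path, first_page=None, last_page=None, layout=False):
    """Build a pdftotext command line that writes UTF-8 text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")            # try to keep the physical page layout
    if first_page is not None:
        cmd += ["-f", str(first_page)]   # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]    # last page to convert
    cmd += [pdf_path, "-"]               # "-" as the output file means stdout
    return cmd

def extract_text(pdf_path, **options):
    """Run pdftotext locally; nothing is sent over the network."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, **options),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8")
```

For example, `extract_text("contract.pdf", first_page=3, last_page=3)` would return only page 3, which is one way to extract a specific page when a tool processes whole documents.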
Tips
- Check if your PDF has selectable text: open the PDF in any viewer and try to highlight text with your cursor. If you can select it, text extraction will work. If you cannot, it is a scanned document.
- Paragraph structure is preserved: the extractor maintains paragraph breaks, so the output follows the document's layout. However, complex layouts with multiple columns may need manual cleanup.
- Large files work fine: since processing happens in your browser, there is no upload size limit. Performance depends on your device, but documents with hundreds of pages are handled without issues.
- Use PDF to Word for formatting: if you need to preserve formatting (bold, headings, tables) rather than just plain text, use a PDF to Word converter instead.
- Use find-and-replace to clean up the output: common cleanup tasks (removing page numbers, joining hyphenated line breaks, removing repeated headers) are easy with regex find-and-replace.
- Pre-strip page numbers and headers: if the source PDF has obvious page numbers, removing them before processing speeds up downstream analysis.
- Combine with LLM for summarization: extract text, then paste into ChatGPT or Claude with a prompt like "Summarize the key points in 5 bullets." Works well for research papers and reports.
- Use specialized tools for tables: if you need just the tables from a PDF, use a tool like Tabula or PDF-to-Excel rather than general text extraction.
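The cleanup tip above (removing page numbers and repeated headers) can be scripted once you have the text split by page. The following is a minimal sketch; the function name `strip_repeated_lines` and the `min_share` threshold are assumptions for illustration.

```python
import re
from collections import Counter

def strip_repeated_lines(pages, min_share=0.6):
    """Remove page-number lines and lines repeated across most pages.

    pages: list of page strings.
    min_share: assumed threshold; a line counts as a header, footer, or
    watermark if it appears on at least this share of the pages.
    """
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, min_share * len(pages))
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [
            line for line in page.splitlines()
            if line.strip() not in repeated
            # Drop lines that are only a number, optionally prefixed by "Page".
            and not re.fullmatch(r"\s*(Page\s+)?\d+\s*", line)
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

The page-number pattern also removes any legitimate line that consists only of a number, such as an isolated table cell, so check the output on documents that contain tables.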
Privacy and confidential PDFs
The PDF text extractor runs entirely in your browser. The PDF you upload, intermediate processing, and the extracted text all stay on your device. Nothing is uploaded to a server, logged, or shared with anyone.
This matters because PDFs you extract text from are often very sensitive: contracts with embedded clauses you need to quote, medical records and lab reports, financial statements with account numbers, legal pleadings under attorney-client privilege, employment offer letters and salary details, internal corporate documents, research papers under embargo before publication, scanned IDs and passports, immigration documents. Cloud PDF extractors by design upload your files to their servers, often retain them for "service improvement," and have been involved in real data leaks where confidential contracts and medical records ended up indexed by search engines. A browser-based extractor has zero exposure: the PDF never leaves your machine.
Browser-based extraction also works offline once the page is loaded, useful for processing documents on airplanes, in secure facilities without internet access, or anywhere you cannot or should not upload a confidential document to a third party.
Frequently Asked Questions
Why did my PDF extraction return empty results?
The PDF is likely a scanned document: it contains images of text, not actual text data. Text extraction only works with PDFs that have embedded, selectable text. For scanned documents, you need OCR (optical character recognition) software.
Does this tool use OCR?
No. It extracts embedded text directly from the PDF structure. This is faster and more accurate than OCR for text-based PDFs, but it cannot read text from scanned images.
Is my PDF uploaded to a server?
No. All processing happens in your browser. Your PDF never leaves your device, making it safe for confidential documents.
Can I extract text from a specific page?
The tool processes all pages and returns the complete text. You can then copy or edit the specific sections you need from the output.