Free PDF to Text Converter

Extract all text content from PDF files instantly. Download as TXT or copy to clipboard. Your files never leave your device.

How It Works

  1. Upload PDF: Drop or select a single PDF file to extract text from.
  2. Configure Options: Choose page separator style and whether to include page numbers.
  3. Extract Text: Click "Extract Text" to process the PDF and display the content.
  4. Download or Copy: Copy extracted text to clipboard or download as a TXT file.

Why Extract PDF Text?

Converting PDF text to plain text is useful for processing document content, searching within PDFs, importing data into other applications, creating backups of text content, or analyzing document text. This tool extracts all text while preserving the reading order, making it perfect for reports, research papers, contracts, and other text-heavy documents.

Frequently Asked Questions

Can I extract text from scanned PDFs?

This tool extracts text from PDFs that contain selectable text. Scanned PDFs (image-based) don't contain extractable text and would require OCR (Optical Character Recognition), which this tool does not provide. For scanned documents, use an OCR tool first.

What's the file size limit?

Files up to 50 MB are supported. Larger files may work depending on your browser's available memory, but extraction will be slower.

Does the tool preserve formatting?

The extracted text is plain text, so formatting such as bold, italics, and colors is not preserved. However, text content and order are maintained as accurately as possible.

Can I extract text on mobile?

Yes. This tool works on desktop, tablet, and mobile browsers. Just tap to select a PDF file and extract text.

Is my PDF uploaded to a server?

No. All text extraction happens locally in your browser using PDF.js. Your PDF never leaves your device, ensuring complete privacy and security.

Can I extract text from password-protected PDFs?

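It depends on the type of password. If the PDF has a user password (one required to open the file), remove the password first using another tool, then extract text with this tool. If the PDF only has an owner password (which restricts actions such as printing or copying), extraction works without removing it.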

What is a PDF to text extractor?

A PDF to text extractor pulls the embedded text out of a PDF document into plain UTF-8 text that you can paste anywhere. The result is just the characters: no fonts, no colours, no layout. This is fundamentally different from OCR (Optical Character Recognition), which reads pixels from an image and guesses what letters they represent. Extraction reads the text directly from the PDF's content stream, so it is exact and instant; OCR is approximate and slow.

The reason extraction works is that most PDFs store text as positioned glyph operators (Tj for single text strings, TJ for arrays with adjustments) along with x and y coordinates on the page. The extractor walks the content stream of each page, collects every glyph operator with its position, and reassembles the reading order. For straight prose this is essentially perfect. For multi-column layouts, footnotes, and complex tables the extractor relies on heuristics that mostly work but can produce surprises.
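The reassembly step can be pictured with a short Python sketch. This is a simplified illustration, not the tool's actual code: the TextRun type, the reading_order function, and the line tolerance of 2 units are assumptions made for the example. It groups positioned runs into lines by their y coordinate (in PDF coordinates y grows upward, so a larger y is higher on the page) and then reads each line left to right.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    """A hypothetical positioned text run, as collected from Tj/TJ operators."""
    text: str
    x: float
    y: float  # PDF user space: y grows upward from the bottom of the page


def reading_order(runs, line_tolerance=2.0):
    """Group runs into lines by y, then read top to bottom, left to right."""
    lines = []  # each entry: [reference_y, [runs on that line]]
    for run in sorted(runs, key=lambda r: -r.y):
        if lines and abs(lines[-1][0] - run.y) <= line_tolerance:
            lines[-1][1].append(run)
        else:
            lines.append([run.y, [run]])
    return "\n".join(
        " ".join(r.text for r in sorted(line_runs, key=lambda r: r.x))
        for _, line_runs in lines
    )


# Runs arrive in content-stream order, which need not be reading order.
runs = [
    TextRun("world", x=110, y=700),
    TextRun("Second", x=72, y=686),
    TextRun("Hello", x=72, y=700.5),
    TextRun("line", x=115, y=686),
]
print(reading_order(runs))
```

A sort like this is enough for single-column prose; it is also the reason multi-column pages can come out interleaved, because two columns share the same y values.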

This tool uses pdf.js, the JavaScript PDF renderer Mozilla started in 2011 and ships with Firefox. Everything runs in your browser: the PDF file is loaded into memory, parsed locally, the text is extracted on your machine, and the result appears in a textarea you can copy or download. No file is uploaded to a server. The library handles PDF 1.0 through PDF 2.0 (ISO 32000-2) and most modern encryption schemes.

What is inside the tool

The top of the tool is a drop zone: click to pick a PDF file or drag one in from your file manager. The 50 MB cap is a comfortable browser-memory limit; pdf.js can handle larger files but extraction slows once the document goes past a few hundred pages. As soon as a file is loaded, an info panel shows the filename, page count, and file size so you can confirm you picked the right document.

Two extraction options sit below the file info. Include page numbers toggles whether each page's number is prepended to the extracted text. Page separator lets you choose how pages are divided: a labelled bar (--- Page 3 ---), a blank line, an explicit [PAGE BREAK] marker, or nothing at all. The blank line option is best for re-importing into a writing tool; the labelled bar is best for navigating long documents.

Click Extract Text and the tool loops through every page, pulls the text content, applies your separator setting, and dumps the result into the textarea below. Stats appear underneath: pages processed, total character count, total word count. Two buttons let you copy the result to the clipboard or download it as a .txt file. The output is plain UTF-8, ready to paste into a note, an email, a spreadsheet, or a code editor.
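As a rough illustration of how the separator options and the statistics fit together, here is a minimal Python sketch. It is not the tool's source: the function names, the option names ("bar", "blank", "marker", "none"), the placement of separators between pages only, and the way characters and words are counted are all assumptions made for the example.

```python
def join_pages(pages, separator="bar", include_page_numbers=False):
    """Join per-page text using one of the separator styles described above."""
    parts = []
    for number, text in enumerate(pages, start=1):
        if include_page_numbers:
            text = f"Page {number}\n{text}"
        if number > 1:
            if separator == "bar":
                parts.append(f"--- Page {number} ---")
            elif separator == "blank":
                parts.append("")
            elif separator == "marker":
                parts.append("[PAGE BREAK]")
            # separator == "none": nothing is added between pages
        parts.append(text)
    return "\n".join(parts)


def text_stats(pages):
    """Count pages, characters, and whitespace-separated words."""
    text = "\n".join(pages)
    return {
        "pages": len(pages),
        "characters": len(text),
        "words": len(text.split()),
    }


pages = ["Hello world", "Second page here"]
print(join_pages(pages, separator="bar"))
print(text_stats(pages))
```

Counting words by splitting on whitespace is the simplest definition; a real counter may treat hyphenated words or CJK text differently.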

History and background

PostScript and the printable-page problem (1982)

John Warnock and Chuck Geschke left Xerox PARC and founded Adobe in 1982. Their first product was PostScript, a page description language that could describe any printable page using a small set of drawing operators: move, line, curve, fill, place glyph. PostScript let any printer reproduce any page exactly, but it was designed for printing, not for viewing or editing. PostScript is the technical foundation that PDF was later built on.

PDF 1.0 and Acrobat (1993)

In 1991 Warnock circulated an internal Adobe paper called Camelot describing a portable document file format derived from PostScript but optimized for screen viewing and random page access. The first public release was Acrobat 1.0 and PDF 1.0 on 15 June 1993. Early adoption was slow: viewers cost money and files were large. Adobe made the Acrobat Reader free in 1994 and the format took off through the late 1990s for forms, technical manuals, and government documents.

PDF/A for long-term archives (2005)

PDF/A was published as ISO 19005-1 in October 2005. It is a restricted subset of PDF designed for archival: no external dependencies (all fonts embedded), no JavaScript, no encryption, no audio or video. The point is that a PDF/A file opened in 50 years will look exactly the same as today. Most national archives, courts, and corporate records systems require PDF/A for long-term storage. Text extraction from PDF/A is unusually reliable because fonts must be embedded and the stricter conformance levels (A and U) require Unicode mappings for all text.

PDF becomes an ISO standard (2008)

Adobe handed control of the PDF specification to the International Organization for Standardization in 2008. ISO 32000-1:2008 codified PDF 1.7 as an open international standard. From this point onwards anyone could implement a fully conformant PDF reader without licensing PDF from Adobe. ISO 32000-2 followed in 2017 (PDF 2.0), adding newer features such as improved digital signatures and 256-bit AES encryption.

pdf.js opens the in-browser PDF viewer (2011)

Andreas Gal at Mozilla launched pdf.js as an experimental project in mid-2011 to render PDF documents using only HTML5, JavaScript, and Canvas. Before pdf.js, viewing a PDF in a browser required a plugin (Adobe Reader plugin, Foxit, or similar). pdf.js made native browser-based PDF viewing possible. Mozilla bundled it into Firefox 19 in February 2013, removing the need for any PDF plugin. It is the library this extractor uses.

Chrome ships PDFium (2014)

Google open-sourced PDFium in May 2014. PDFium is a different PDF engine, derived from the commercial Foxit PDF SDK, and is what powers PDF rendering inside Chrome and Edge. PDFium is written in C++; pdf.js is written in JavaScript. From an extraction standpoint both engines produce similar text, but the PDF/A and form-handling support varies. This tool uses pdf.js because it runs natively in any browser without plugins or compiled binaries.

Practical workflows

Extracting quotes from a research paper

Drop the PDF in, click Extract, scroll to find the passage you want, and copy it into your notes or citation manager. Single-column papers come out cleanly. Two-column papers (typical of conference and journal style) may interleave text from left and right columns; in that case copy each column manually rather than relying on the global extraction. For long quotes, prefer the blank-line page separator so paragraph breaks survive.

Searching a contract for specific clauses

Legal contracts are often hundreds of pages and the PDF reader's built-in search misses context. Extract the full text, paste into a text editor, and use Find or grep with a wider context window (5 lines before and after). This is faster than scrolling and lets you write a regular expression for patterns like all clauses that mention liability or termination. Keep the labelled page separator so you can locate the original location in the PDF.
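If grep is not to hand, the same context search can be sketched in a few lines of Python. The function name find_with_context and the sample contract text are made up for the illustration; the behaviour is meant to resemble grep -C, returning each matching line with a number of lines before and after it.

```python
import re


def find_with_context(text, pattern, context=5):
    """Yield (line_number, snippet) for each line matching `pattern`,
    with `context` lines before and after, similar to `grep -C`."""
    lines = text.splitlines()
    regex = re.compile(pattern, re.IGNORECASE)
    for i, line in enumerate(lines):
        if regex.search(line):
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            yield i + 1, "\n".join(lines[start:end])


# Hypothetical extracted text with the labelled page separator kept.
sample = "\n".join([
    "--- Page 1 ---",
    "1. Definitions",
    "--- Page 2 ---",
    "7. Limitation of Liability",
    "The supplier's liability is capped.",
    "8. Termination",
    "Either party may terminate.",
])

for line_number, snippet in find_with_context(sample, r"liabilit|terminat", context=1):
    print(f"match at line {line_number}:\n{snippet}\n")
```

Because the page separator lines are part of the text, a wide enough context window usually includes the nearest page label, which points back to the page in the original PDF.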

Bulk text for a writing or translation project

When you need to translate, rewrite, or summarize a long PDF document, the first step is getting the raw text out. Extract once, save the .txt file, and work from there. Avoid copying directly from a PDF reader, which often introduces line breaks at the wrong places and breaks words across page boundaries. The blank-line separator works well as input to a translation tool or an LLM.

Pulling receipts into a spreadsheet

Modern receipts and invoices sent by email are often PDFs with embedded text rather than scans. Extract, then parse the totals with a regular expression. For repeated formats (one vendor that sends the same invoice layout every month), a five-line script can pull the date, total, and tax fields into a spreadsheet automatically. Scanned receipts won't work; those need OCR first.
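A script of that kind might look like the following Python sketch. The invoice layout (lines starting with "Invoice date:", "Tax:", and "Total:") is invented for the example, as are the function names; a real vendor's format will need its own patterns.

```python
import csv
import io
import re

# Patterns for a hypothetical invoice layout; adjust per vendor.
FIELDS = {
    "date": re.compile(r"^Invoice date:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE),
    "tax": re.compile(r"^Tax:\s*([\d.,]+)\s*$", re.MULTILINE),
    "total": re.compile(r"^Total:\s*([\d.,]+)\s*$", re.MULTILINE),
}


def parse_invoice(text):
    """Return one row of fields; a missing field becomes an empty string."""
    row = {}
    for name, regex in FIELDS.items():
        match = regex.search(text)
        row[name] = match.group(1) if match else ""
    return row


def to_csv(rows):
    """Serialize rows to CSV text that a spreadsheet can import."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(FIELDS))
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


extracted = "Invoice date: 2024-03-01\nSubtotal: 100.00\nTax: 20.00\nTotal: 120.00\n"
print(to_csv([parse_invoice(extracted)]))
```

Anchoring each pattern to the start of a line keeps "Total:" from matching inside "Subtotal:".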

Reading ebooks on the wrong device

PDF is a poor format for e-readers because the page size is fixed; the text doesn't reflow. Extract the text, paste into an EPUB converter, and now the book reflows on any screen. Page numbers and footnotes can be stripped manually before conversion. This trick is most useful for technical books and conference proceedings that publishers only release as PDF.

Sharing meeting minutes as plain text

When a colleague emails meeting minutes as a PDF and you want to paste a summary into Slack or a wiki, extract first. The text comes out clean and you can paste any portion without weird font artefacts or hidden formatting. For minutes with action items, the labelled-bar page separator helps locate the original document section if questions come up later.

Common pitfalls

Scanned PDFs produce empty output

If a PDF was created by scanning a paper document (a flatbed scan, a phone photo, or a copier output), it contains an image of the page, not the underlying text. The extractor walks the content stream looking for text operators and finds none, so the output is empty or contains only stray page numbers if those were typed manually. The fix is to run the PDF through OCR first (tools like Tesseract, Adobe Acrobat's Recognize Text, or ABBYY FineReader), which adds a hidden text layer that this tool can then extract.

Multi-column layouts can interleave text

Academic journals, magazines, and newspapers typically use two or three columns per page. pdf.js extracts each text run by its position on the page and uses heuristics to reconstruct reading order, but those heuristics assume single-column flow. The result for a multi-column page can be: first line of left column, first line of right column, second line of left column, and so on. For these layouts, extract one page at a time and select the columns by eye, or use a layout-aware tool such as the Python library pdfplumber.
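The interleaving effect, and one simple way around it, can be shown with a toy Python example. It assumes exactly two columns and a known gutter position, and the (text, x, y) tuples are invented sample data; layout-aware libraries use more robust methods than this.

```python
def naive_order(runs):
    """Sort top to bottom, then left to right (single-column assumption)."""
    return [text for text, x, y in sorted(runs, key=lambda r: (-r[2], r[1]))]


def two_column_order(runs, gutter_x):
    """Read the whole left column top to bottom, then the right column."""
    left = [r for r in runs if r[1] < gutter_x]
    right = [r for r in runs if r[1] >= gutter_x]
    ordered = sorted(left, key=lambda r: -r[2]) + sorted(right, key=lambda r: -r[2])
    return [text for text, x, y in ordered]


# (text, x, y) with y growing upward, as in PDF coordinates.
runs = [
    ("L1", 72, 700), ("R1", 320, 700),
    ("L2", 72, 686), ("R2", 320, 686),
]
print(naive_order(runs))
print(two_column_order(runs, gutter_x=300))
```

The first call interleaves the columns line by line; the second keeps each column together because it splits on the gutter before sorting.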

Custom font encodings produce gibberish

A PDF can use any font, and the font can map its glyph IDs to any character code the author chooses. PDF/A and most modern PDFs include a ToUnicode map that says glyph 5 means the letter A, but older or sloppy PDFs sometimes skip the map. Without ToUnicode, the extracted text is the raw glyph IDs (often appearing as boxes, numbers, or random letters), and there is no way to recover the original characters without OCR. If only specific words look wrong, the cause is usually a missing ToUnicode for a single embedded font.

Ligatures may extract as combined characters

Professional typography combines certain letter pairs (fi, fl, ff, ffi) into single glyphs called ligatures. The PDF may store the ligature as Unicode codepoint U+FB01 (the fi ligature) rather than the two letters f and i. The extracted text contains the ligature codepoint, which most editors render correctly but some text-processing tools choke on. If you are feeding output into a search index or natural-language tool, run a one-line replacement to normalize U+FB01 to fi and U+FB02 to fl.
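That replacement can be sketched in Python as below. The function name expand_ligatures is made up for the example, and the table covers the five common Latin f-ligatures in the Unicode Alphabetic Presentation Forms block.

```python
# Map each ligature codepoint to the plain letters it stands for.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Replace ligature codepoints with their component letters."""
    return text.translate(_TABLE)


print(expand_ligatures("\ufb01nal work\ufb02ow, o\ufb03ce"))
```

Unicode NFKC normalization (unicodedata.normalize("NFKC", text) in Python) also expands these ligatures, but it rewrites other compatibility characters as well, such as superscripts and full-width forms, so an explicit table is the more conservative choice.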

Headers and footers repeat on every page

Most PDFs have a running header (chapter title, document title) and footer (page number, copyright line) on every page. The extractor picks them up because they are real text on the page, and you end up with the same line repeating 200 times in a 200-page document. The fix is a simple deduplication script or a manual find-and-replace pass after extraction. For long documents, this is sometimes the biggest cleanup step.
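A deduplication pass might look like this Python sketch. It assumes the text has already been split into one string per page (for example on the page separator), and the function name and the 60 percent threshold are choices made for the illustration. Digits are ignored when comparing lines so that "Page 1" and "Page 2" count as the same footer.

```python
import re
from collections import Counter


def strip_repeated_lines(pages, min_share=0.6):
    """Drop lines that recur on at least `min_share` of pages (digits ignored)."""
    def key(line):
        return re.sub(r"\d+", "#", line.strip())

    counts = Counter()
    for page in pages:
        # Count each distinct non-blank line once per page.
        counts.update({key(line) for line in page.splitlines() if line.strip()})

    threshold = max(2, min_share * len(pages))
    repeated = {k for k, n in counts.items() if n >= threshold}
    return [
        "\n".join(line for line in page.splitlines() if key(line) not in repeated)
        for page in pages
    ]


pages = [
    "Annual Report 2023\nIntro text.\nPage 1",
    "Annual Report 2023\nMore text.\nPage 2",
    "Annual Report 2023\nFinal text.\nPage 3",
]
print(strip_repeated_lines(pages))
```

Check the result before discarding the original: a body line that genuinely repeats on most pages, such as a table heading, would be removed too.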

Math equations and formulas rarely extract cleanly

Math is positioned using individual glyphs from special symbol fonts (Computer Modern, STIX). The extractor reads the glyphs but loses the spatial relationships that make x squared different from x times 2. Inline equations like E equals mc squared come out garbled, and display equations come out as scrambled symbol sequences. For PDFs heavy in math, use a tool that preserves equation structure (MathPix snip, Adobe Acrobat Pro with equation reflow), or extract the equations as images.

Privacy and data handling

The PDF file you drop into the tool stays on your device the entire time. pdf.js is a JavaScript library that runs in your browser, not on a remote server. The file is loaded into memory by your browser, parsed page by page, and the extracted text appears in the textarea on the same page. We never upload the file, never log its contents, and never analyse it. This matters because PDFs often contain confidential information: contracts, medical records, legal correspondence, financial statements.

Once the page is loaded, the tool works offline. You can disconnect from the internet, drop a PDF, extract it, and copy the result without your data ever touching another machine. The extracted text only leaves your machine if you choose to paste or send it somewhere yourself. Many SaaS PDF extractors send your file to a cloud service for processing; for sensitive documents that is exactly what you want to avoid.

When not to use this tool

Scanned or image-only PDFs (need OCR first)

If your PDF is a scan of paper or a series of photos, there is no embedded text to extract; this tool returns empty results. Run the PDF through an OCR engine first to add a text layer: Tesseract (free, command-line, very good for English and Latin scripts), Adobe Acrobat Pro (paid, best layout retention), or ABBYY FineReader (paid, best for non-Latin scripts and complex documents). After OCR, this extractor will work normally.

Fillable PDF forms with field values

A PDF form stores field values (the text you typed into a name field, the checked state of a checkbox) separately from the static page text. This extractor only reads the static page text, so form values are missed. To extract form data, use a tool that reads the AcroForm dictionary (or the XFA data) directly: pdftk, Adobe Acrobat Export Data, or a Python library such as pdfplumber, whose documentation describes how to read AcroForm values.

When you need to preserve formatting

Plain text loses all formatting: bold, italics, lists, tables, headings, colours, fonts. If you need an editable document that preserves layout, use a PDF-to-Word converter instead (which builds a structured Word document with paragraph styles and tables), or PDF-to-HTML for web-friendly output. PDF-to-text is for the case where you genuinely only need the words.

Encrypted PDFs without the password

PDFs can be encrypted with a user password (required to open the file) or an owner password (restricts actions like printing or copying). pdf.js requires the user password to open an encrypted file; without it, no extraction is possible. Remove the password first with a PDF unlock tool (only on documents you have the right to access) and then extract. The owner password sometimes blocks copying inside Adobe Reader but does not block extraction here.

More questions

What is a PDF text layer?

A text layer is the part of a PDF that stores characters as machine-readable text (Tj and TJ operators in the content stream) rather than as pixels. Digital PDFs created by Word, LaTeX, or web-to-PDF tools almost always have a text layer. Scanned PDFs do not, until you add one with OCR. The text layer is what allows search, copy-paste, screen readers, and tools like this extractor to work.

Why is some of my extracted text scrambled or out of order?

PDFs do not store text in reading order; they store it as glyph operators at x and y positions on the page. The extractor reconstructs reading order by sorting top to bottom and left to right within rows. This works for single-column flow but can interleave columns, mix headers with body text, or split paragraphs at column breaks. For complex layouts, try copying page by page or use a layout-aware Python library like pdfplumber.

Can I extract text from a PDF that is hundreds of pages long?

Yes, but expect it to take longer and use more memory. Each page is parsed sequentially in JavaScript, which is single-threaded, so a 500-page book might take 20 to 60 seconds depending on your machine and the complexity of the pages. The browser's memory ceiling (a few GB for desktop Chrome, less for mobile) limits the total file size more than the page count. If a giant PDF hangs, try splitting it first with the PDF splitter tool and extracting in chunks.

What is PDF/A and why is its text easier to extract?

PDF/A is the archival subset of PDF defined by ISO 19005. It requires that all fonts be embedded, that all colour profiles be self-contained, and that no external resources be referenced. Its stricter conformance levels (level A, and level U from PDF/A-2 onwards) also require that all text map to Unicode. That Unicode requirement is what makes extraction reliable: every glyph in a conforming document maps back to a standard Unicode character. National archives, courts, and corporate records systems use PDF/A precisely so the text remains extractable decades later.

How accurate is the extraction compared to Adobe Acrobat?

For straightforward digital PDFs the output is typically identical character-for-character. Acrobat has more sophisticated heuristics for handling complex multi-column layouts and tables, so for those specific cases its output may be more readable. pdf.js (the engine behind this tool) has been actively developed since 2011 and is the PDF viewer built into Firefox. For typical office and research documents the difference is negligible.

Does the tool support non-Latin scripts (Chinese, Arabic, Cyrillic)?

Yes, provided the PDF has a proper ToUnicode map for those characters (which most modern PDFs do). The extracted text is UTF-8 and renders correctly in any modern editor. Right-to-left scripts like Arabic and Hebrew are extracted in logical order, not visual order, which is what you want for further processing. CJK (Chinese, Japanese, Korean) extraction is supported because pdf.js handles the CIDFont system that PDF uses for those scripts.
