>_ TrueFileSize.com

PDF Parsing and Text Extraction — A Practical Guide

PDF is everywhere — invoices, contracts, reports, scanned documents. Parsing it reliably in production is harder than it looks. This guide covers text extraction, metadata reading, and OCR for scanned files, with sample PDFs for every edge case.

The two kinds of PDFs

Before parsing, know which type you're dealing with:

  • Text-based PDFs — generated from Word, LaTeX, or web pages. Text is embedded and extractable.
  • Scanned PDFs — images of paper documents. Text must be recovered with OCR (Tesseract, Google Vision, AWS Textract).

Download both types from our sample PDFs to test your parser against each.

Text extraction with pdf-parse (Node.js)

import fs from 'node:fs';
import pdfParse from 'pdf-parse';

// Top-level await requires an ES module (.mjs or "type": "module").
const buffer = fs.readFileSync('sample-1mb.pdf');
const data = await pdfParse(buffer);

console.log('Pages:', data.numpages);
console.log('Text:', data.text.slice(0, 500));  // first 500 characters
console.log('Info:', data.info);  // Title, Author, CreationDate
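
data.info.CreationDate usually arrives as a raw PDF date string such as D:20230615103000+02'00' rather than a Date object. The helper below is a minimal sketch of converting it. parsePdfDate is a hypothetical name, not part of pdf-parse, and it assumes UTC when the string carries no timezone offset.

```javascript
// Hypothetical helper (not part of pdf-parse): convert a PDF date string
// such as "D:20230615103000+02'00'" into a JavaScript Date.
// Assumption: a string without a timezone offset is treated as UTC.
function parsePdfDate(raw) {
  const pattern =
    /^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?([Zz+-])?(?:(\d{2})'?(\d{2})?'?)?/;
  const m = pattern.exec(raw ?? '');
  if (!m) return null;

  // Every field after the year is optional; fall back to the earliest value.
  const [, y, mo = '01', d = '01', h = '00', mi = '00', s = '00', sign, oh = '00', om = '00'] = m;
  const utc = Date.UTC(+y, +mo - 1, +d, +h, +mi, +s);
  if (!sign || sign === 'Z' || sign === 'z') return new Date(utc);

  // "+02'00'" means local time is ahead of UTC, so subtract the offset.
  const offsetMs = (Number(oh) * 60 + Number(om)) * 60000;
  return new Date(sign === '+' ? utc - offsetMs : utc + offsetMs);
}

// parsePdfDate("D:20230615103000+02'00'").toISOString()
// → "2023-06-15T08:30:00.000Z"
```

The function returns null for missing or unrecognizable input, so it can be called directly on data.info.CreationDate without a separate existence check.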

Extraction in the browser with pdf.js

import * as pdfjs from 'pdfjs-dist';
pdfjs.GlobalWorkerOptions.workerSrc = '/pdf.worker.mjs';

const url = '/sample-1mb.pdf';  // hypothetical path; point this at your own file
const pdf = await pdfjs.getDocument(url).promise;
let fullText = '';
for (let i = 1; i <= pdf.numPages; i++) {  // pages are 1-indexed
  const page = await pdf.getPage(i);
  const content = await page.getTextContent();
  fullText += content.items.map((it) => it.str).join(' ') + '\n';
}

Handling scanned PDFs with OCR

If data.text is empty or mostly nonsense, the PDF is probably scanned (garbled text can also come from fonts without a Unicode mapping). Rasterize each page, then run OCR:

import { fromPath } from 'pdf2pic';
import Tesseract from 'tesseract.js';

// pdf2pic shells out to GraphicsMagick and Ghostscript; both must be installed.
const convert = fromPath('scanned.pdf', { density: 300, format: 'png' });
const pageImage = await convert(1);  // page 1 only; repeat for each page number
const { data: { text } } = await Tesseract.recognize(pageImage.path, 'eng');
console.log(text);
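
The "empty or nonsense" check that decides whether to fall back to OCR can be sketched as a small heuristic. looksScanned is a hypothetical helper, and both thresholds (50 characters per page, 70% readable characters) are arbitrary starting points to tune against your own files, not values from any library.

```javascript
// Hypothetical heuristic: decide whether extracted text is too sparse or too
// garbled to trust, in which case the PDF is probably scanned.
// Both thresholds are assumptions; tune them on real documents.
function looksScanned(text, numPages, minCharsPerPage = 50, minReadableRatio = 0.7) {
  // Ignore whitespace so blank pages full of newlines do not count as text.
  const compact = (text ?? '').replace(/\s+/g, '');
  if (compact.length < minCharsPerPage * Math.max(numPages, 1)) return true;

  // Count letters, digits, and punctuation in any script.
  const readable = compact.match(/[\p{L}\p{N}\p{P}]/gu)?.length ?? 0;
  return readable / compact.length < minReadableRatio;
}
```

With pdf-parse this would be called as looksScanned(data.text, data.numpages). A true result only suggests OCR is worth trying; a short but valid text PDF, such as a one-line cover page, will also trip the first threshold.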

Performance benchmarks

Test your pipeline against files of varying complexity:

  • 1MB PDF — typical contract, ~10 pages
  • 10MB PDF — report with embedded images
  • 100MB PDF — stress test — most parsers break here

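As a rough guide, pdf-parse handles the 1MB file in ~200ms and the 10MB file in ~2s on a modern laptop; exact numbers depend on page count, images, and fonts. At 100MB, loading the whole file into memory at once becomes a problem, so a streaming or page-by-page approach is often required.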
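
To reproduce timings like these on your own hardware, a small harness along the following lines may help. timeParse is a hypothetical helper; it accepts any async parse function, so the same code can compare pdf-parse against other parsers.

```javascript
// Hypothetical benchmark helper: run an async parse function several times
// and report the fastest, median, and slowest wall-clock times in ms.
async function timeParse(parseFn, input, runs = 5) {
  const samples = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await parseFn(input);
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  return {
    runs,
    minMs: samples[0],
    // Middle sample for odd run counts, upper-middle for even ones.
    medianMs: samples[Math.floor(runs / 2)],
    maxMs: samples[runs - 1],
  };
}
```

A call would look like timeParse(pdfParse, fs.readFileSync('sample-1mb.pdf')). The median is usually the more useful figure, because the first run often includes one-off warm-up cost.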

Common failure modes

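  • Encrypted PDFs: password-protected files make the parser throw instead of returning data (pdf.js raises a PasswordException); catch it and prompt for the password or skip the file. Note that data.info.IsAcroFormPresent only reports form fields, not encryption
  • Multi-column layouts: text order gets scrambled; sort text items by their coordinates instead
  • Embedded non-standard fonts: if a font lacks a usable Unicode mapping (ToUnicode CMap), extracted characters come out as question marks or other wrong characters
  • Corrupted PDFs: wrap parser calls in try/catch, then log and skip the file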
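
Two of these failure modes lend themselves to short sketches. orderByColumns re-sorts pdf.js text items using the x and y values stored in each item's transform array, and safeParse wraps any parser so that a bad file is logged and skipped. Both names are hypothetical. The column logic assumes equal-width columns, which is a simplification, and the 'PasswordException' name check assumes the underlying error comes from pdf.js.

```javascript
// Hypothetical helper: restore reading order on a multi-column page.
// Each pdf.js text item has transform = [a, b, c, d, x, y], with y measured
// from the bottom of the page. Assumption: columns have equal width.
function orderByColumns(items, pageWidth, columns = 2) {
  const colWidth = pageWidth / columns;
  const buckets = Array.from({ length: columns }, () => []);
  for (const it of items) {
    const x = it.transform[4];
    const idx = Math.min(columns - 1, Math.max(0, Math.floor(x / colWidth)));
    buckets[idx].push(it);
  }
  // Within each column read top to bottom (descending y), then left to right.
  return buckets.flatMap((col) =>
    col.sort((a, b) => b.transform[5] - a.transform[5] || a.transform[4] - b.transform[4])
  );
}

// Hypothetical wrapper: never let one bad file stop a batch.
async function safeParse(parseFn, buffer, label) {
  try {
    return { ok: true, data: await parseFn(buffer) };
  } catch (err) {
    // Assumption: pdf.js-based parsers surface encryption as PasswordException.
    const reason = err?.name === 'PasswordException' ? 'encrypted' : 'unreadable';
    console.warn(`Skipping ${label}: ${reason} (${err?.message})`);
    return { ok: false, reason };
  }
}
```

With pdf.js, the page width can be read from page.getViewport({ scale: 1 }).width, and the items come from page.getTextContent(). Real layouts with uneven columns or full-width headings need a smarter split than a fixed grid.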

Related

For other document formats, see Word (DOCX), Excel (XLSX), or plain text. For large file upload strategies, read our large file upload guide.