How to OCR a Scanned PDF and Make It Searchable

Learn how OCR (Optical Character Recognition) works, what affects accuracy, and how to convert a scanned PDF into a searchable, copy-pasteable document using LuraPDF's browser-based OCR tool.

LuraPDF Team

Editorial & Technical Team · May 4, 2026

A scanned PDF is a digital photograph of a document. The pages are images. You cannot select text, search for a word, copy a sentence, or feed the content to any text processing tool. For the purpose of information retrieval, a scanned PDF is essentially opaque.

OCR (Optical Character Recognition) solves this by analyzing those images and constructing a text layer that overlays the visual content. The result: a PDF that looks identical to the original scan but contains an invisible text layer that makes everything selectable, searchable, and copyable.

How OCR Works

LuraPDF uses Tesseract.js, a JavaScript and WebAssembly port of Tesseract, one of the most accurate open-source OCR engines. Tesseract was originally developed at Hewlett-Packard, and its development was later sponsored by Google. Its current recognition engine is an LSTM-based neural network, with trained models available for more than 100 languages.

The OCR pipeline:

Page rendering: Each PDF page is rendered to a canvas image at high resolution (300+ DPI for best accuracy)
Pre-processing: Image enhancement — binarization, noise reduction, deskewing (straightening rotated scans)
Layout analysis: Detecting text regions, columns, tables, and non-text elements
Character recognition: The neural network classifies each character from segmented text regions
Post-processing: Language model scoring to disambiguate similar characters (e.g., "l" vs "1", "O" vs "0")
PDF writing: Recognized text is embedded as an invisible text layer positioned precisely over the corresponding visual characters
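
To make the page-rendering step concrete: PDF page sizes are expressed in points (72 points per inch), so a target DPI translates into a render scale of DPI / 72. The sketch below illustrates that arithmetic only; it is not LuraPDF's actual code, and the names `renderScale` and `canvasSize` are hypothetical.

```typescript
// PDF user space is measured in points: 72 points = 1 inch.
const POINTS_PER_INCH = 72;

// Scale factor to apply when rendering a PDF page at a target DPI.
function renderScale(targetDpi: number): number {
  if (targetDpi <= 0) throw new RangeError("targetDpi must be positive");
  return targetDpi / POINTS_PER_INCH;
}

// Canvas size in pixels for a page whose size is given in points.
function canvasSize(
  widthPt: number,
  heightPt: number,
  targetDpi: number
): { width: number; height: number } {
  if (targetDpi <= 0) throw new RangeError("targetDpi must be positive");
  return {
    width: Math.round((widthPt * targetDpi) / POINTS_PER_INCH),
    height: Math.round((heightPt * targetDpi) / POINTS_PER_INCH),
  };
}

// US Letter is 612 x 792 points (8.5 x 11 inches).
console.log(canvasSize(612, 792, 300)); // { width: 2550, height: 3300 }
```

At 4 bytes per pixel, a 2550 x 3300 canvas holds about 33.7 million bytes of pixel data, which is one reason high-resolution OCR is demanding in a browser.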

The invisible text layer is what makes the result searchable. The visual page appearance remains the original scan image — you see exactly what you scanned, but the text underneath is now machine-readable.
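
Positioning the text layer involves converting each recognized word's bounding box from image pixels (origin at the top-left, y growing downward) to PDF points (origin at the bottom-left, y growing upward). The following is a minimal sketch of that conversion. It assumes boxes in `{x0, y0, x1, y1}` pixel form and a page with no rotation or crop offset; the function name `pixelBoxToPdfRect` is hypothetical and not taken from LuraPDF.

```typescript
// Bounding box in image pixels, origin at the top-left, y growing downward.
interface PixelBox { x0: number; y0: number; x1: number; y1: number; }

// Rectangle in PDF points, origin at the bottom-left, y growing upward.
interface PdfRect { x: number; y: number; width: number; height: number; }

// Assumes the image was rendered at renderDpi and covers the whole page.
function pixelBoxToPdfRect(
  box: PixelBox,
  renderDpi: number,
  pageHeightPt: number
): PdfRect {
  const ptPerPx = 72 / renderDpi; // 72 points per inch
  return {
    x: box.x0 * ptPerPx,
    // Flip the y axis: the bottom edge of the box (y1) becomes the PDF y.
    y: pageHeightPt - box.y1 * ptPerPx,
    width: (box.x1 - box.x0) * ptPerPx,
    height: (box.y1 - box.y0) * ptPerPx,
  };
}

// A word found 1 inch from the left and top of a US Letter page at 300 DPI.
const rect = pixelBoxToPdfRect({ x0: 300, y0: 300, x1: 900, y1: 360 }, 300, 792);
console.log(rect); // approximately { x: 72, y: 705.6, width: 144, height: 14.4 }
```

OCR text layers are commonly written with PDF text rendering mode 3, which neither fills nor strokes the glyphs, so the text can be selected and searched without being drawn.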

What Affects OCR Accuracy

Accuracy varies significantly with input quality:

Scan resolution

300 DPI is the minimum for reliable accuracy. Below 200 DPI, character recognition degrades substantially. If you're scanning documents for OCR, always scan at 300 DPI or higher.

Documents scanned at 150 DPI or less should be rescanned at higher resolution before OCR. Running OCR on low-resolution scans produces poor results no matter how good the engine.
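
If you are unsure what resolution a scan was made at, it can be estimated from the pixel width of the page image and the physical page width. The sketch below applies the thresholds from this section. The function names are hypothetical, and the "borderline" label for the 200–299 DPI band is an assumption of this example rather than a rule of any OCR engine.

```typescript
type ScanQuality = "rescan" | "degraded" | "borderline" | "ok";

// Effective DPI of a page image, given the page width in PDF points
// (72 points = 1 inch).
function effectiveDpi(imageWidthPx: number, pageWidthPt: number): number {
  if (pageWidthPt <= 0) throw new RangeError("pageWidthPt must be positive");
  return imageWidthPx / (pageWidthPt / 72);
}

// Thresholds follow the guidance above; "borderline" is this sketch's label.
function classifyScan(dpi: number): ScanQuality {
  if (dpi <= 150) return "rescan";     // rescan before OCR
  if (dpi < 200) return "degraded";    // recognition degrades substantially
  if (dpi < 300) return "borderline";  // below the 300 DPI minimum
  return "ok";
}

// A 1275-pixel-wide image of an 8.5-inch (612 pt) page is a 150 DPI scan.
const dpi = effectiveDpi(1275, 612);
console.log(dpi, classifyScan(dpi)); // 150 rescan
```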

Font and print quality

Printed text (laser printer output, typeset books): 98–99% accuracy with clean originals
High-quality handwriting with clear characters: 85–95%
Faint or faded text: 80–95% depending on contrast
Carbon copy paper: 60–85%
Old newspaper / typewriter: 90–95% with clean scans
Cursive handwriting: 40–70% — neural network OCR struggles with cursive
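
To put these percentages in perspective, OCR accuracy is often measured at the character level: one minus the edit distance between the OCR output and a hand-checked transcription, divided by the length of the transcription. The sketch below assumes that definition (the figures above may have been measured differently), and the function names are hypothetical.

```typescript
// Levenshtein edit distance using a single rolling row.
function editDistance(a: string, b: string): number {
  const row: number[] = [];
  for (let j = 0; j <= b.length; j++) row.push(j);
  for (let i = 1; i <= a.length; i++) {
    let diag = row[0]; // distance for prefixes (i-1, j-1)
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // distance for prefixes (i-1, j)
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diag + cost);
      diag = above;
    }
  }
  return row[b.length];
}

// Character accuracy in the range 0..1 relative to a trusted transcription.
function characterAccuracy(truth: string, ocr: string): number {
  if (truth.length === 0) return ocr.length === 0 ? 1 : 0;
  return Math.max(0, 1 - editDistance(truth, ocr) / truth.length);
}

// Two "1" characters misread as "l" in a 12-character string.
console.log(characterAccuracy("Invoice 1001", "Invoice l00l").toFixed(3)); // 0.833
```

Under this definition, 98% accuracy on a page of 2,000 characters still leaves about 40 character errors, so proofreading remains worthwhile for important documents.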

Page orientation

Severely tilted or rotated pages hurt accuracy. Most OCR engines, including Tesseract, detect and correct minor rotation automatically (up to ~10 degrees). Heavily rotated pages should be corrected manually first using Rotate PDF.

Language

Tesseract supports 100+ languages. LuraPDF's OCR tool defaults to English. For non-English documents, and especially for non-Latin scripts, selecting the correct language improves accuracy substantially.
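
Under the hood, Tesseract identifies languages by short model codes such as `eng`, `ara`, `chi_sim`, and `jpn`, and multiple languages are conventionally combined with `+` (for example `eng+ara`). The sketch below shows how a language picker might build such a string. The mapping table and the function name `tesseractLanguageString` are hypothetical and do not describe LuraPDF's implementation.

```typescript
// Display names mapped to Tesseract model codes (a small illustrative subset).
const LANGUAGE_CODES: Record<string, string> = {
  "English": "eng",
  "Arabic": "ara",
  "Chinese (Simplified)": "chi_sim",
  "Chinese (Traditional)": "chi_tra",
  "Japanese": "jpn",
};

// Builds a "+"-joined language string, defaulting to English when empty.
function tesseractLanguageString(names: string[]): string {
  if (names.length === 0) return "eng";
  const codes = names.map((name) => {
    const code = LANGUAGE_CODES[name];
    if (code === undefined) throw new Error(`Unknown language: ${name}`);
    return code;
  });
  // Drop duplicates while keeping the user's order.
  return codes.filter((c, i) => codes.indexOf(c) === i).join("+");
}

console.log(tesseractLanguageString(["Arabic", "English"])); // ara+eng
```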

How to OCR a PDF with LuraPDF

Open the OCR tool: Navigate to LuraPDF OCR PDF
Upload the scanned PDF: Drag and drop your file
Select language (if not English): Choose the primary language of the document
Click "Run OCR": Processing happens page by page in your browser. Time varies with document length and hardware; at roughly 3–8 seconds per page, a 20-page scan typically takes one to three minutes on a modern computer.
Download the searchable PDF: The output is a PDF with the original scan images plus an embedded text layer

Testing the Result

After OCR, verify accuracy:

Select text on the page — text should be selectable exactly over the printed characters
Search (Ctrl+F / Cmd+F) for a common word — it should be found
Copy a paragraph and paste into a text editor — the output should be readable

If accuracy is poor, check the quality of the input scan before trying other tools.

When to Run OCR Before Other Operations

OCR unlocks additional LuraPDF operations that don't work on pure-image PDFs:

Compress PDF after OCR: Once text is extracted, the image regions can sometimes be compressed more aggressively
PDF to Word after OCR: Converting an OCR'd PDF to Word gives editable text; converting a raw scan gives a Word file with embedded images
Redact PDF after OCR: Text-based redaction works properly on OCR'd documents
Search and extract: Find and copy specific information without retyping

Privacy: OCR Runs in Your Browser

Tesseract.js runs the entire OCR process locally using WebAssembly. Your scanned documents — which often contain medical records, financial statements, legal documents, or personally identifiable information — never leave your device. No remote server processes your file.

This is a significant advantage over cloud OCR services that necessarily receive a copy of everything you process.

Limitations of Browser-Based OCR

Processing time

Tesseract.js is slower than native desktop Tesseract or cloud OCR APIs. Expect approximately 3–8 seconds per page depending on your hardware. A 50-page document may take several minutes.
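
The per-page figure above gives a quick way to estimate total processing time. The sketch below simply multiplies the page count by the 3–8 second range; the function name `estimateOcrSeconds` is hypothetical, and real timings depend on your hardware and on how dense the pages are.

```typescript
// Estimated OCR time range in seconds, using the 3-8 s per page figure above.
function estimateOcrSeconds(
  pages: number,
  minPerPage: number = 3,
  maxPerPage: number = 8
): { min: number; max: number } {
  if (pages < 0 || Math.floor(pages) !== pages) {
    throw new RangeError("pages must be a non-negative integer");
  }
  return { min: pages * minPerPage, max: pages * maxPerPage };
}

const estimate = estimateOcrSeconds(50);
console.log(estimate); // { min: 150, max: 400 }
console.log(
  `${(estimate.min / 60).toFixed(1)} to ${(estimate.max / 60).toFixed(1)} minutes`
); // 2.5 to 6.7 minutes
```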

Tables

Tesseract recognizes table content but does not reconstruct table structure in the PDF's text layer — the text will be in reading order but the cell structure won't be preserved. For structured table extraction, convert the OCR'd PDF to Word and manually reformat the table.

Mathematical notation

Typeset equations and mathematical symbols are recognized less accurately, because Tesseract's models are optimized for natural-language text.

Handwriting

As noted, cursive handwriting accuracy is limited. Print handwriting fares better. For critical handwritten documents, verify each page manually.

Frequently Asked Questions

The OCR'd text doesn't line up with the characters — is that a bug? This can happen with severely skewed scans. The text positions are calculated from the detected character positions, but if the page geometry is non-standard, alignment may drift. Try rotating the PDF to correct the skew before running OCR.

Can I OCR specific pages only? LuraPDF processes all pages. If you only need OCR on specific pages, extract those pages first using Extract PDF Pages, run OCR, then optionally merge the results.

Does OCR change the visual appearance of my scanned document? No. The original scan images are preserved exactly. Only an invisible text layer is added.

Can I run OCR on a PDF that already has some text pages and some scanned pages? Yes. Image-based pages are processed and receive a text layer; pages that already have a text layer are left unchanged.

My document is in Arabic / Chinese / Japanese. Will OCR work? Yes, but select the correct language in the tool before running. Tesseract's accuracy for CJK and right-to-left languages is good, but it depends more heavily on scan quality than it does for Latin-script documents.

OCR transforms locked archives of scanned documents into accessible, searchable, processable information. A cabinet full of scanned contracts becomes a searchable database. A stack of medical records becomes a document you can actually navigate. The process takes seconds to minutes and runs entirely on your device.