How Browser-Based PDF Editing Works (And Why Privacy Matters)
A technical explainer of how LuraPDF processes PDFs entirely in your browser using pdf-lib, pdfjs-dist, Tesseract.js, and WebAssembly — with no uploads, no cloud, and no data exposure.

Editorial & Technical Team · May 6, 2026 · 9 min read
Every time you upload a document to a cloud-based PDF tool, that document exists on servers you don't control, processed by software you cannot inspect, stored by a company with its own data retention and security posture. For a short while, someone else's infrastructure holds your contracts, tax returns, medical records, or confidential business documents.
LuraPDF works differently. Everything happens in your browser tab, on your hardware, with your operating system's memory management. This article explains the technical architecture that makes this possible and the engineering tradeoffs involved.
The Browser as a Document Processing Platform
Modern browsers are no longer just HTML renderers. They are full application execution environments with:
- JavaScript engine (V8 in Chrome, SpiderMonkey in Firefox): JIT-compiles and executes application code at high speed
- WebAssembly (WASM): executes compiled C/C++ code at near-native speed within a sandboxed environment
- Canvas API: enables pixel-level image manipulation
- File System Access API: reads (and, with user permission, writes) local files directly — no upload involved
- Web Workers: run computation on background threads without freezing the UI
- ArrayBuffer / Blob: manage raw binary data (like PDF bytes) in memory
Together, these capabilities enable a full-featured document processing pipeline that would have required a backend server five years ago.
The Three Core Libraries
LuraPDF is built on three foundational open-source libraries:
pdf-lib
pdf-lib is a TypeScript library for creating and modifying PDF documents in any JavaScript environment. It implements a substantial portion of the PDF specification (ISO 32000), including:
- Page creation, modification, and deletion
- Font embedding (both standard 14 fonts and custom TTF/OTF fonts)
- Image embedding (JPEG, PNG)
- Text annotation placement
- Form field creation and manipulation
- Document metadata (title, author, created date, etc.)
- PDF encryption (AES-128 and AES-256)
- Cross-reference table management
When LuraPDF merges PDFs, it uses pdf-lib to read each document's page tree, normalize shared resources, and assemble a new PDF. When it compresses, it re-encodes image objects with lower quality factors. When it encrypts, it applies AES-256 to the document using the Web Crypto API.
pdf-lib operates entirely on ArrayBuffer objects — raw binary data in browser memory. No network calls are made.
pdfjs-dist
PDF.js is Mozilla's PDF rendering engine, distributed as pdfjs-dist on npm. It is the engine powering Firefox's built-in PDF viewer, tested against millions of real-world PDFs.
LuraPDF uses pdfjs-dist for:
- Page rendering: Converting PDF pages to Canvas ImageData for display
- Text extraction: Reading the text content streams from PDF pages
- Page metadata: Getting page dimensions, rotation values, and content type
When you see your PDF rendered in the LuraPDF interface — the visual preview of pages — that's pdfjs-dist rendering each page onto an HTML Canvas element.
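Text extraction with pdfjs-dist follows a similar pattern. The library calls below are shown as comments (they need pdfjs-dist at runtime and normally live in a Web Worker); the join helper is an illustrative simplification, since real extraction would use item positions to reconstruct line breaks:

```typescript
// Sketch of text extraction with pdfjs-dist:
//
// import * as pdfjs from 'pdfjs-dist';
// const pdf = await pdfjs.getDocument({ data: buffer }).promise;
// const page = await pdf.getPage(1);
// const content = await page.getTextContent();
// const text = joinTextItems(content.items as { str: string }[]);

// getTextContent() returns positioned text fragments, not flowing text.
// A minimal join — a real extractor would use each item's transform to
// decide where lines and words break:
function joinTextItems(items: { str: string }[]): string {
  return items.map((i) => i.str).join(' ');
}
```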
Tesseract.js
Tesseract.js is the WebAssembly-compiled version of Tesseract OCR. Tesseract is an open-source OCR engine whose development began at HP Labs in 1985; it was later open-sourced and sponsored by Google for many years. Since version 4 it has used an LSTM (Long Short-Term Memory) neural network model for character recognition.
Running Tesseract via WebAssembly in the browser means:
- The 22 MB WASM binary is loaded once and cached by the browser
- OCR processing runs on CPU via WebAssembly, not GPU
- Processing is slower than cloud OCR (which runs on GPU clusters), but recognition quality depends on the engine and the scan, not on where the engine runs
- Your document content never leaves your device
For a 20-page scan: browser OCR takes approximately 30–120 seconds. A cloud service might take 2–5 seconds. The tradeoff is privacy vs. speed.
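The 20-page figure above works out to roughly 1.5–6 seconds per page in the browser. A simple estimator makes the arithmetic explicit; the per-page constants are an extrapolation from that one figure, not a measured benchmark:

```typescript
// Rough in-browser OCR time estimate, extrapolated from the
// 30–120 s observed for a 20-page scan (about 1.5–6 s per page).
// These constants are illustrative, not guarantees.
const SECONDS_PER_PAGE_MIN = 1.5;
const SECONDS_PER_PAGE_MAX = 6;

function estimateOcrSeconds(pages: number): { min: number; max: number } {
  return {
    min: pages * SECONDS_PER_PAGE_MIN,
    max: pages * SECONDS_PER_PAGE_MAX,
  };
}
```

For 20 pages this reproduces the 30–120 second range; actual times vary with CPU, scan resolution, and language model.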
The Architecture: From File to Output
Here's what happens when you, say, compress a PDF in LuraPDF:
1. File selection: You drag a PDF onto the dropzone. The browser's File API reads the file into an ArrayBuffer — raw bytes in browser memory. No upload occurs.
2. Validation: A quick check that the magic bytes %PDF appear at the start of the ArrayBuffer.
3. Loading: pdf-lib's PDFDocument.load() parses the cross-reference table, reads the object tree, and builds an in-memory representation of the document.
4. Image enumeration: The library walks the page content streams to find image XObjects (embedded images). Each image's width, height, compression type, and byte data are extracted.
5. Re-encoding: For each image, the raw pixel data is drawn onto an HTML Canvas element, then re-exported at a lower JPEG quality factor using canvas.toBlob('image/jpeg', quality). The original image object is replaced with the smaller encoded version.
6. Document serialization: pdf-lib's save() method serializes the modified document back to bytes. Object streams are DEFLATE-compressed.
7. Download: The bytes are wrapped in a Blob object, a temporary object URL is created (URL.createObjectURL(blob)), and a programmatic <a> click triggers the browser's download mechanism.
Total data transmitted to any external server: zero bytes.
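The validation check in the pipeline above is only a few lines. A minimal sketch (the function name is ours):

```typescript
// Check that the buffer starts with the PDF magic bytes '%PDF'
// before handing it to the parser.
function looksLikePdf(buffer: ArrayBuffer): boolean {
  const head = new Uint8Array(buffer, 0, Math.min(4, buffer.byteLength));
  // '%PDF' as byte values: 0x25 0x50 0x44 0x46
  return (
    head.length === 4 &&
    head[0] === 0x25 && head[1] === 0x50 && head[2] === 0x44 && head[3] === 0x46
  );
}
```

Real-world PDFs occasionally carry junk bytes before the header, which parsers tolerate, so this is a fast pre-filter rather than full validation.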
Web Workers: Keeping the UI Responsive
PDF processing is CPU-intensive. Running it on the main JavaScript thread would freeze the browser tab — you couldn't scroll, click, or see progress updates while the file was being processed.
LuraPDF runs all heavy computation (OCR, large PDF parsing, image re-encoding) in Web Workers — separate JavaScript threads that run in the background. The main thread handles the UI, shows progress bars, and communicates with the worker via message passing (postMessage / onmessage).
When OCR is running, a progress event fires after each page is processed. The worker sends { page: 5, total: 20, confidence: 94 } to the main thread, which updates the progress bar. Your UI remains interactive throughout.
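The shape of that message protocol can be written down. The field names follow the example above; the worker wiring is shown as comments because it only runs in a browser:

```typescript
// Message sent from the OCR worker after each page is processed.
interface OcrProgress {
  page: number;
  total: number;
  confidence: number;
}

// Pure helper the main thread uses to update the progress bar.
function progressPercent(msg: OcrProgress): number {
  return Math.round((msg.page / msg.total) * 100);
}

// Main-thread wiring (browser only, illustrative):
// const worker = new Worker('ocr-worker.js');
// worker.onmessage = (e: MessageEvent<OcrProgress>) => {
//   progressBar.value = progressPercent(e.data);
// };
```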
Memory Management
Browser memory is finite. A 200 MB PDF loaded into memory, processed, and saved could require 400–600 MB of RAM (original bytes + intermediate Canvas data + output bytes). On systems with limited RAM, this can trigger memory pressure.
Strategies used:
- Streaming page processing: For multi-page operations, pages are processed and the intermediate Canvas elements are discarded after use
- URL.revokeObjectURL(): Temporary blob URLs are revoked after download to allow GC
- Worker termination: Web Workers are terminated after completing their task to reclaim the memory they used
On systems with 8+ GB of RAM, modern browsers handle source files over 100 MB without issue; older or memory-constrained machines may struggle with them.
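The rule of thumb from the paragraph above (peak usage of roughly 2–3x the file size: original bytes, intermediate Canvas data, output bytes) is easy to make explicit. The multipliers come from that back-of-envelope figure, not from profiling:

```typescript
// Rough peak-memory estimate for an in-browser processing operation,
// per the 2–3x rule of thumb (original + Canvas intermediates + output).
// Multipliers are illustrative, not measured.
function estimatePeakMemoryMB(fileSizeMB: number): { min: number; max: number } {
  return { min: fileSizeMB * 2, max: fileSizeMB * 3 };
}
```

For a 200 MB PDF this gives the 400–600 MB range quoted above, which is why very large files are the main practical limit of the in-browser model.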
The Privacy Architecture
The security model is enforced by the browser's same-origin policy and isolation architecture:
- No HTTP requests: No external API calls, no telemetry, no analytics calls during document processing
- Sandboxed execution: all processing code runs inside the browser sandbox in Web Workers, and contains no networking code paths
- No persistent storage: Processed files are not written to browser storage (localStorage, IndexedDB, or File System Access)
- Tab-scoped memory: When you close the tab, all ArrayBuffers and Blob objects associated with it are garbage collected
This architecture is verifiable: open the browser's Network tab in DevTools while processing a file. You will see zero requests during the processing operation.
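Beyond eyeballing the Network tab, the browser's Resource Timing API can check this programmatically. A hedged sketch; the filtering helper is ours, and the browser calls are shown as comments:

```typescript
// Entries as returned by performance.getEntriesByType('resource'),
// reduced to the two fields this check needs.
interface ResourceEntryLike {
  name: string;      // request URL
  startTime: number; // ms since page load
}

// Return the URLs of any requests that started after `processingStartMs`.
// If processing truly makes no network calls, the result is empty.
function requestsAfter(entries: ResourceEntryLike[], processingStartMs: number): string[] {
  return entries.filter((e) => e.startTime >= processingStartMs).map((e) => e.name);
}

// In the browser (illustrative):
// const t0 = performance.now();
// await processPdf(bytes);
// console.log(requestsAfter(performance.getEntriesByType('resource'), t0));
```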
Open Source Libraries: Inspectable, Not Black Box
All three core libraries — pdf-lib, pdfjs-dist, and Tesseract.js — are open source with public repositories on GitHub. The implementations are inspectable by anyone. The PDF processing code that runs in LuraPDF is not a proprietary black box; it is public code with real users, issues, and contributions.
This matters for trust. When a cloud service says "we process and delete your files," you are taking their word for it. When LuraPDF uses open-source libraries in a browser sandbox with no network calls, you can verify the claim by watching the network tab.
Tradeoffs vs. Cloud Processing
The browser-based model has real tradeoffs worth being honest about:
Speed: Cloud services have GPU clusters and dedicated infrastructure. Browser OCR on a large document takes 2–10x longer than a cloud equivalent. Compression, by contrast, stays fast because JPEG re-encoding is delegated to the browser's native Canvas encoder.
File size limits: Browser memory is the constraint. Very large files (>300 MB) may hit memory limits on some systems. Cloud services can throw server-grade hardware at the problem instead.
Processing power: WASM runs at 40–80% of native C++ speed. Cloud services run optimized native code. For most documents this difference is imperceptible; for 100+ page OCR jobs it's noticeable.
Features: Some advanced PDF features (PDF/X compliance checking, professional color management, CMYK workflows) require specialized server-side tools. Browser-based processing handles the 95% case.
For everyday document processing — compressing, merging, signing, converting, redacting — the browser-based model is fast enough, private by design, and requires no account or subscription.
Frequently Asked Questions
Does LuraPDF collect any telemetry or usage data? LuraPDF uses Google Analytics for aggregate page view statistics. Document processing operations generate no analytics events. The contents of your files are never transmitted.
Can I verify that no upload occurs? Yes. Open Developer Tools (F12), click the Network tab, filter by "XHR" or "Fetch." Process a file. Observe zero document-related network requests during processing.
Why does the OCR take so long compared to other services? Browser-based Tesseract.js runs on CPU via WebAssembly. Cloud OCR services run on GPU clusters that process characters orders of magnitude faster. The tradeoff is that your document never leaves your browser.
What happens if I close the tab while processing? Processing stops immediately. The browser tab's memory is garbage collected. Your original file on disk is unchanged.
Is the code open source? The core processing libraries (pdf-lib, pdfjs-dist, Tesseract.js) are open source. LuraPDF's interface code is not currently open source, but the processing pipeline uses only open-source components.
The shift to browser-based document processing is not a gimmick. It is a genuine architectural choice with measurable privacy and trust advantages — at the cost of speed on intensive operations. For documents you'd prefer didn't exist on someone else's server, it's the right tradeoff.