How Browser-Based PDF OCR Works
OCR (Optical Character Recognition) extracts text from image-based content. PDFScanner uses a two-step pipeline. First, PDF.js renders each PDF page onto an HTML5 Canvas at 3× scale, so the OCR engine receives high-resolution input. Then Tesseract.js (a JavaScript library built on a WebAssembly port of the open-source Tesseract OCR engine) processes the canvas images one at a time and extracts the text, sorting the recognised text by its bounding boxes.
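The two steps can be sketched in TypeScript as follows. This is a minimal illustration of the shape of such a pipeline, not PDFScanner's actual code: the names `OcrWord`, `RenderPage`, `Recognize`, `sortByBoundingBox`, `ocrDocument` and the `lineTolerance` value are all hypothetical, and the rendering and recognition steps are passed in as callbacks rather than calling PDF.js or Tesseract.js directly. The sort shown here (group words into lines by vertical position, then order each line left to right) is one plausible reading of "bounding-box sorting".

```typescript
// Hypothetical shapes for illustration; these are not PDF.js or Tesseract.js types.
interface Box { x0: number; y0: number; x1: number; y1: number; }
interface OcrWord { text: string; bbox: Box; }

type RenderPage<Img> = (pageNumber: number, scale: number) => Promise<Img>;
type Recognize<Img> = (image: Img) => Promise<OcrWord[]>;

// Group words into lines by vertical centre, then order each line left to right.
// `lineTolerance` (in pixels) is an assumed value, not taken from PDFScanner.
function sortByBoundingBox(words: OcrWord[], lineTolerance = 10): OcrWord[] {
  const centreY = (w: OcrWord): number => (w.bbox.y0 + w.bbox.y1) / 2;
  const byY = [...words].sort((a, b) => centreY(a) - centreY(b));

  const lines: { anchorY: number; words: OcrWord[] }[] = [];
  for (const word of byY) {
    const y = centreY(word);
    const last = lines.length > 0 ? lines[lines.length - 1] : undefined;
    if (last !== undefined && Math.abs(y - last.anchorY) <= lineTolerance) {
      last.words.push(word); // same line as the previous word
    } else {
      lines.push({ anchorY: y, words: [word] }); // start a new line
    }
  }

  const sorted: OcrWord[] = [];
  for (const line of lines) {
    line.words.sort((a, b) => a.bbox.x0 - b.bbox.x0);
    sorted.push(...line.words);
  }
  return sorted;
}

// Step 1: render each page to an image. Step 2: recognise it.
// Pages are handled one at a time, in order.
async function ocrDocument<Img>(
  pageCount: number,
  renderPage: RenderPage<Img>,
  recognize: Recognize<Img>,
  scale = 3,
): Promise<string[]> {
  const pages: string[] = [];
  for (let n = 1; n <= pageCount; n++) {
    const image = await renderPage(n, scale);
    const words = await recognize(image);
    pages.push(sortByBoundingBox(words).map((w) => w.text).join(" "));
  }
  return pages;
}
```

In a real browser build, `renderPage` would wrap the PDF.js canvas rendering and `recognize` would wrap a Tesseract.js worker. Keeping them as parameters lets the ordering logic be tested without either library.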
When to Use OCR vs Standard Text Extraction
If your PDF was created digitally (from Word, Excel, etc.), its text is already embedded and can usually be selected and copied directly in a PDF viewer, so OCR is not needed. OCR is for PDFs that are essentially images: documents scanned from paper, photographed pages, or PDFs exported from image-only sources.
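One way to automate this decision is to check how much text a page's embedded text layer already contains. The sketch below assumes the strings have already been collected (for example, from the `str` fields of the items returned by PDF.js `getTextContent()`). The function name `pageNeedsOcr` and the `minChars` threshold are hypothetical and would need tuning on real documents.

```typescript
// Returns true when a page's embedded text layer is empty or nearly empty,
// which suggests the page is an image and should go through OCR.
// `minChars` is an assumed threshold, not a value used by PDFScanner.
function pageNeedsOcr(textLayerStrings: string[], minChars = 20): boolean {
  let visibleChars = 0;
  for (const s of textLayerStrings) {
    // Count only non-whitespace characters so blank items do not add up.
    visibleChars += s.replace(/\s+/g, "").length;
  }
  return visibleChars < minChars;
}

// A scanned page usually has no text items; a digital page has plenty.
console.log(pageNeedsOcr([]));
console.log(pageNeedsOcr(["Quarterly report for the finance team"]));
```

A character-count threshold is a deliberately crude signal. A scanned page that carries only a short embedded stamp or page number would still be sent to OCR, which is usually the desired outcome.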
OCR Accuracy Tips
- Use the "High" resolution setting for best accuracy on dense text
- Clean, high-contrast scans produce significantly better results than blurry or low-contrast images
- Select the correct language for your document's content
- Handwritten text is not reliably recognised by Tesseract; OCR works best on printed text
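The contrast tip can be checked roughly before running OCR. The following sketch estimates how widely a page's brightness values are spread, working on raw RGBA bytes of the kind returned by a canvas `getImageData()` call. The function name `luminanceSpread`, the 5th/95th percentile cut-offs, and any threshold for "too low" are assumptions made for illustration, not part of PDFScanner.

```typescript
// Estimate contrast as the gap between the 5th and 95th percentile of
// luminance, scaled to 0..1. Values near 0 suggest a flat, low-contrast scan.
// Input is RGBA bytes (4 per pixel), e.g. from canvas getImageData().data.
function luminanceSpread(rgba: Uint8ClampedArray): number {
  const pixelCount = Math.floor(rgba.length / 4);
  if (pixelCount === 0) return 0;

  // Build a 256-bin histogram of per-pixel luminance (Rec. 601 weights).
  const histogram = new Uint32Array(256);
  for (let i = 0; i < pixelCount; i++) {
    const o = i * 4;
    const luma = Math.round(0.299 * rgba[o] + 0.587 * rgba[o + 1] + 0.114 * rgba[o + 2]);
    histogram[luma] += 1;
  }

  // Walk the histogram to find the 5th and 95th percentile levels.
  const lowTarget = pixelCount * 0.05;
  const highTarget = pixelCount * 0.95;
  let cumulative = 0;
  let low = -1;
  let high = 255;
  for (let level = 0; level < 256; level++) {
    cumulative += histogram[level];
    if (low < 0 && cumulative >= lowTarget) low = level;
    if (cumulative >= highTarget) {
      high = level;
      break;
    }
  }
  return (high - low) / 255;
}
```

A page made of pure black and pure white pixels scores 1 and a uniformly grey page scores 0. Using percentiles rather than the absolute minimum and maximum keeps a few stray specks from hiding an otherwise washed-out scan.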