Document Detection: The Step Everyone Skips Before OCR

Our OCR accuracy jumped four percentage points the day we stopped feeding rotated junk into the engine. Document detection is the cheapest, most-skipped accuracy win in the field.

Nupura Ughade

June 17, 2026

9 min read

Our OCR accuracy jumped four percentage points the day we stopped feeding rotated junk into the engine. Not by switching engines. Not by tuning models. We added a single pre-processing step. It took an afternoon to implement and recovered more accuracy than the previous quarter of model tuning.

This guide is about that step, document detection, and why almost every team skips it until it costs them too much to ignore.

What "Document Detection" Means in Plain English

Document detection is the work of looking at a page and figuring out what kind of object it is and how it is oriented before you try to read it.

A page might be:

A flat scanned page (good, OCR can read it directly)
A photo of a document on a desk (bad, there is desk around the edges, the document is slightly tilted, and shadows make some text harder to see)
A photo of a passport (different, needs different OCR than a contract)
An invoice (vs a bank statement, different downstream rules)
A page rotated 90 degrees (broken, OCR will produce gibberish unless the page is rotated upright first)
A multi-page document split incorrectly (broken, table on page 7 may continue on page 8)

Document detection answers these questions before OCR runs. The OCR engine then gets a clean, properly oriented page with metadata about what type of document it is. Accuracy goes up. Downstream rules get applied correctly. Bugs disappear.

If you are brand new to OCR, our optical character reader 2026 guide is the easier starting point. This article assumes you have run OCR and want to know why your accuracy is lower than the marketing claims.

The Four Document Detection Tasks That Matter Most

1. Page Boundary Detection

If the input is a photo, the document does not fill the frame. There is desk, hand, shadow. Page boundary detection finds the four corners of the actual document and crops to just that. OCR accuracy on a properly cropped document is dramatically higher than on the full photo.

How: edge detection (Canny in OpenCV is the classic), contour finding, four-point perspective transform. About 20 lines of code if you use OpenCV.

2. Rotation Correction

Some pages arrive scanned sideways or upside-down. OCR engines expect upright text, so a sideways page produces gibberish. Rotation correction detects which way is up and rotates the page before OCR.

How: a small classifier that takes a thumbnail and predicts orientation (0°, 90°, 180°, 270°). It trains fast. Free tools make this a one-flag fix: ocrmypdf's --rotate-pages relies on Tesseract's built-in orientation and script detection (OSD).
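If you would rather not train a classifier, one option is to read the orientation from Tesseract's OSD report (for example from `tesseract page.png stdout --psm 0`, or `pytesseract.image_to_osd`) and apply the quarter turns yourself. The sketch below assumes the report contains a `Rotate: N` line and that N is the clockwise correction in degrees; that is my reading of the output, so check it against a page whose orientation you know. The function names are illustrative.

```python
import re
import numpy as np

def parse_osd_rotation(osd_text):
    """Return the clockwise correction in degrees from an OSD report, or 0 if absent."""
    match = re.search(r"^Rotate:\s*(\d+)", osd_text, flags=re.MULTILINE)
    return int(match.group(1)) % 360 if match else 0

def rotate_upright(image, clockwise_degrees):
    """Rotate an image array clockwise by a multiple of 90 degrees."""
    if clockwise_degrees % 90 != 0:
        raise ValueError("rotation correction only handles multiples of 90 degrees")
    quarter_turns = (clockwise_degrees // 90) % 4
    # np.rot90 turns counter-clockwise for positive k, so negate for clockwise.
    return np.rot90(image, k=-quarter_turns)

# Sample report in the shape Tesseract prints for --psm 0 (values are made up).
sample_osd = """Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 21.27
Script: Latin
Script confidence: 4.14
"""
```

Anything that is not a clean multiple of 90 degrees is left to the deskew step, which is why the helper raises instead of guessing.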

3. Skew Correction (Deskew)

A page scanned at a 3-degree tilt. Looks fine to a person; OCR accuracy drops 5-10 points. Deskewing rotates the page to perfectly horizontal.

How: a Hough line transform finds the dominant text-line angle; then rotate the page by the negative of that angle. ocrmypdf --deskew applies a deskew step automatically.

4. Document Classification

Is this an invoice, a bank statement, a passport, a contract? Different document types need different OCR settings, different validation rules, different downstream routing.

How: small image classifier or layout-based heuristic. Production systems usually train a model on tens of thousands of labeled documents per class. Open source classifiers exist for common types.
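Before a trained model exists, a heuristic can be as simple as keyword rules over the text from a rough first OCR pass. Note that this swaps in a text-keyword heuristic rather than the image classifier or layout analysis mentioned above. The labels, keywords, and `min_hits` threshold below are hypothetical placeholders to replace with terms from your own documents.

```python
# Hypothetical keyword rules; tune these against real samples of each type.
RULES = {
    "invoice": ("invoice", "bill to", "amount due", "due date"),
    "bank_statement": ("statement period", "opening balance",
                       "closing balance", "account number"),
    "passport": ("passport", "nationality", "date of birth"),
    "contract": ("agreement", "hereinafter", "governing law", "party"),
}

def classify_document(text, min_hits=2):
    """Return the label with the most keyword hits, or 'unknown' below min_hits."""
    lowered = text.lower()
    scores = {label: sum(keyword in lowered for keyword in keywords)
              for label, keywords in RULES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else "unknown"
```

Returning "unknown" instead of the best weak guess matters for routing: an unclassified page can go to a generic OCR profile or a review queue, while a wrong label silently applies the wrong rules.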

Why Teams Skip Document Detection

Most teams skip document detection because they assume the OCR vendor handles it, the marketing budget is concentrated on the OCR engine itself, and clean demo documents hide the failures that real-world documents produce. The result is a pipeline that hits 85% accuracy on production data instead of 95%+, then spends months building manual review workflows to catch the gap that one pre-processing pass would have closed.

"The vendor said it was built in"

Some vendors include document detection. Some claim they do but it is shallow (only handles 0/90/180/270 rotation, not arbitrary skew, not page boundaries). Read the docs carefully.

"It's not where the magic is"

The OCR engine and the LLM get the marketing budget. Pre-processing gets ignored. So teams pour effort into model tuning and ignore the cheap fix in front of them.

"We thought our documents were clean"

Every team thinks their documents are clean until they look. Run a sample of your real production documents and you will find tilted pages, rotated pages, photos with desks around the edges, and faded scans. The documents in your inbox right now are not clean.

"We'll get to it later"

"Later" is when you have spent six months building review workflows to fix the errors you could have prevented with one afternoon of pre-processing. Add the pre-processing first.

The Pipeline I Run Today

This is the production pre-processing pipeline I run before every OCR call. Each step is cheap. Each step recovers accuracy. Together they recovered about 6 percentage points of accuracy from our pre-cleanup baseline.

File-type check. Is this actually a PDF/image? Confirm by magic bytes, not extension.
Page extraction. Convert PDF pages to images at 300 DPI.
Page boundary detection. If the image is a photo with margins, crop to the document.
Rotation correction. Detect orientation and rotate to upright.
Deskew. Correct for small tilts.
Adaptive thresholding. Boost contrast in low-contrast regions without crushing high-contrast text.
Document classification. Predict document type. Route to type-specific OCR settings.
OCR. Run the engine. Now it has a clean, oriented, classified document.
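
The file-type check in the first step can be sketched as a lookup of well-known file signatures. The signatures for PDF, PNG, JPEG, and TIFF are standard; the function names and the choice to read 16 bytes are illustrative.

```python
# Leading bytes ("magic numbers") for the formats this pipeline accepts.
SIGNATURES = (
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"II*\x00", "tiff"),   # little-endian TIFF
    (b"MM\x00*", "tiff"),   # big-endian TIFF
)

def sniff_file_type(header):
    """Return the detected type for the first bytes of a file, or None."""
    for signature, file_type in SIGNATURES:
        if header.startswith(signature):
            return file_type
    return None

def sniff_path(path):
    """Read only the first few bytes of a file and detect its type."""
    with open(path, "rb") as handle:
        return sniff_file_type(handle.read(16))
```

A file named scan.pdf that returns None or "jpeg" here gets rejected or rerouted before it reaches page extraction.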

The whole pre-processing pipeline runs in under a second per page on commodity hardware. The accuracy gain is permanent and applies to every downstream document. (For a detailed walk-through, see our guide built on lessons from 4M pages a month.)

The Five Failure Modes Document Detection Catches

These are the failures that document detection prevents. If you see any of these in your output, you are skipping detection somewhere:

Gibberish on specific pages. Rotated page. Add rotation correction.
Wrong document type routing. Bank statement routed as invoice. Add classification.
Margin text picked up as part of the document. Photo with desk around the edges. Add page boundary detection.
Lower accuracy on hand-photographed documents than scanned ones. Skew + photo edges. Add boundary detection and deskew.
Multi-page tables breaking on page boundaries. Page splitting logic ignored continuation tables. Add cross-page table linking.
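
To spot the first failure mode without reading every page, one rough heuristic is to measure how much of each page's OCR output looks like words. This is a sketch under loud assumptions: the token pattern only fits Latin-script text, the 0.5 cutoff is arbitrary, and number-heavy pages such as invoices and statements also score low, so calibrate on your own documents before trusting it.

```python
import re

# A token counts as "wordlike" if it is 2+ letters with optional trailing punctuation.
WORDLIKE = re.compile(r"[A-Za-z]{2,}[.,;:!?]?")

def wordlike_ratio(text):
    """Fraction of whitespace-separated tokens that look like plain words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(bool(WORDLIKE.fullmatch(token)) for token in tokens) / len(tokens)

def flag_suspect_pages(page_texts, min_ratio=0.5):
    """Return indexes of pages whose OCR text is mostly non-words."""
    return [index for index, text in enumerate(page_texts)
            if wordlike_ratio(text) < min_ratio]
```

Pages flagged this way are candidates for a rotation check, not proof of one; an empty or image-only page is flagged too.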

How to Add Document Detection in an Afternoon

If you have a Tesseract-based pipeline today and want to add detection cheaply:

brew install ocrmypdf
ocrmypdf --deskew --rotate-pages --remove-background input.pdf output.pdf

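Those three flags handle deskew, rotation, and background removal. Some recent ocrmypdf releases reject --remove-background as not implemented; if yours does, drop that flag and keep the other two. Page boundary detection requires a few extra lines of OpenCV. Document classification requires a small model or a heuristic ruleset.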
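
If the pipeline is driven from Python, the same command can be wrapped in a small helper. This is a sketch that shells out to the ocrmypdf CLI with the flags used above; the function names are illustrative, and --remove-background is off by default here because not every ocrmypdf release accepts it.

```python
import shutil
import subprocess

def build_ocrmypdf_command(input_pdf, output_pdf, remove_background=False):
    """Build the ocrmypdf argument list with deskew and rotation enabled."""
    command = ["ocrmypdf", "--deskew", "--rotate-pages"]
    if remove_background:
        # Optional: some ocrmypdf releases reject this flag.
        command.append("--remove-background")
    command += [str(input_pdf), str(output_pdf)]
    return command

def run_ocrmypdf(input_pdf, output_pdf, remove_background=False):
    """Run ocrmypdf and raise if it is missing or exits with an error."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(build_ocrmypdf_command(input_pdf, output_pdf, remove_background),
                   check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.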

If you are on a cloud OCR API, check whether detection is included. Most modern APIs (DocsAPI, AWS Textract, Google Document AI) handle the basic detection steps automatically. If yours doesn't, file a feature request; it is a baseline expectation in 2026.

The Way I Explain This to Non-Tech Folks

Imagine you ask a friend to read every label on a shelf in your basement. If you hand her a clean shelf with labels facing forward and lights on, she reads them all correctly. If you hand her a dusty shelf with some labels upside-down, some sideways, some in the shadows, she will misread some.

Document detection is the work you do before handing her the shelf. Brush off the dust. Turn the labels right-side up. Turn the lights on. The shelf is now an easy read. Your friend reads everything correctly.

The shelf is your PDF. The friend is the OCR engine. The dusting and turning is document detection. People skip it because it sounds boring. The accuracy gain from doing it is the biggest single improvement available.

The document detection techniques compared

Document detection is not one technique but a small set of pre-processing steps, each solving a different problem before OCR runs. Knowing which technique fixes which problem tells you what to prioritize for your document mix. The table below maps four core techniques to the problem each one solves and what happens if you skip it. It groups deskew and rotation into one row and adds document segmentation, which splits multi-document packets into separate documents.

Technique	Problem it solves	Impact if skipped
Page boundary detection	Photos include desk edges, backgrounds	OCR reads the background as text
Deskew and rotation	Documents captured at an angle	Character accuracy drops sharply past a few degrees
Document segmentation	Multi-document packets in one file	A 40-page packet is treated as one document
Document classification	Unknown document type	Wrong extraction template, low field accuracy

How document detection lifts downstream accuracy

The reason document detection matters is that OCR accuracy is capped by input quality, and detection is what fixes input quality before the expensive step runs. A page captured at a 15-degree angle can lose 10 or more points of character accuracy that no amount of OCR engine quality recovers, because the engine is reading distorted text. Deskewing that page first restores the accuracy at almost no cost. The same logic applies to page boundaries: an engine that reads the desk behind a phone-photographed receipt wastes effort and introduces garbage the downstream pipeline has to filter.

The compounding effect is what teams miss. Detection improvements multiply through every downstream step: cleaner input means higher OCR accuracy, which means higher extraction accuracy, which means fewer exceptions in the human review queue. A few hundred lines of detection code at the front of the pipeline often do more for end-to-end accuracy than switching to a more expensive OCR engine. That is exactly why skipping detection is the most common and most costly mistake in document pipelines.
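
The multiplication can be made concrete with a few lines of arithmetic. The stage accuracies below are hypothetical numbers chosen only to illustrate the effect, not measurements, and the model assumes errors at each stage are independent, which real pipelines only approximate.

```python
def end_to_end_accuracy(stage_accuracies):
    """Multiply per-stage accuracies, assuming stage errors are independent."""
    result = 1.0
    for accuracy in stage_accuracies:
        result *= accuracy
    return result

# Hypothetical stages: OCR, field extraction, routing.
without_detection = end_to_end_accuracy([0.88, 0.95, 0.98])
with_detection = end_to_end_accuracy([0.95, 0.95, 0.98])

print(f"without detection: {without_detection:.3f}")
print(f"with detection:    {with_detection:.3f}")
```

In this toy model the only thing that changed is the OCR stage, yet the whole pipeline moves with it, because every later stage inherits whatever the first stage gets wrong.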

What I'd Do Today

If you have a working OCR pipeline that disappoints: add deskew and rotation correction first. Two flags in ocrmypdf or a few lines of code in your own pipeline. The accuracy lift is typically 4-8 points.

If you are starting a new pipeline: do not write the OCR step until you have written the pre-processing. The pre-processing is what makes the OCR work.

If you are evaluating a vendor: ask what document detection they do. The answer should be specific: boundary detection, rotation, deskew, classification. Not just "yes". (I write about this gap a lot.)

Frequently Asked Questions

What is document detection in OCR?

Document detection is the pre-processing step that identifies and prepares pages before OCR runs. It includes page boundary detection, rotation correction, deskew, and document classification. Skipping it costs accuracy on real-world (non-clean) documents.

Is document detection included in modern OCR APIs?

Mostly yes. AWS Textract, Google Document AI, Azure Form Recognizer (now Azure AI Document Intelligence), and DocsAPI all include basic detection. Tesseract requires you to add it via flags or external tools. Always check the docs for which specific detection steps are included.

How much does document detection improve OCR accuracy?

Typical accuracy gain: 4-8 percentage points on real-world (non-clean) documents. The gain is larger on photos taken with a phone, scanned at angles, or faded. Clean scans benefit less because they were already easy.

Can I add document detection without an engineer?

Partially. Tools like ocrmypdf include detection flags you can run from the command line. For full pipeline integration (page boundaries, classification), an engineer is helpful. Many cloud OCR APIs handle detection automatically.

What is the difference between rotation correction and deskew?

Rotation correction handles turns in multiples of 90 degrees (sideways or upside-down pages). Deskew handles small tilts (a 3-degree angle off horizontal). Both are needed. Rotation correction is a discrete choice among four orientations (0°, 90°, 180°, 270°). Deskew is continuous (any small angle).

Does document detection help with handwritten content?

It helps with everything: typed, handwritten, and mixed. Handwriting benefits especially from page boundary detection (photos of handwritten notes often include desk edges) and deskew (handwritten pages are rarely perfectly aligned).

Nupura Ughade

Content Marketing Lead, DocsAPI

Nupura Ughade creates clear, insightful content on OCR, document AI, and fintech. She combines technical depth with real-world finance use cases to help engineers and operations leaders navigate digital transformation with confidence.
