Document Detection: The Step Everyone Skips Before OCR

Our OCR accuracy jumped four percentage points the day we stopped feeding rotated junk into the engine. Document detection is the cheapest, most-skipped accuracy win in the field.

Nupura Ughade | June 17, 2026 | 9 min read

Our OCR accuracy jumped four percentage points the day we stopped feeding rotated junk into the engine. Not by switching engines. Not by tuning models. We added a single pre-processing step. It took an afternoon to implement and recovered more accuracy than the previous quarter of model tuning.

This guide is about that step — document detection — and why almost every team skips it until it costs them too much to ignore.

What "Document Detection" Means in Plain English

Document detection is the work of looking at a page and figuring out what kind of object it is and how it is oriented before you try to read it.

A page might be:

  • A flat scanned page (good: OCR can read it directly)
  • A photo of a document on a desk (bad: there is desk around the edges, the document is slightly tilted, and shadows make some text harder to see)
  • A photo of a passport (different: it needs different OCR settings than a contract)
  • An invoice rather than a bank statement (different: each type gets its own downstream rules)
  • A page rotated 90 degrees (broken: OCR will read it as gibberish unless it is rotated first)
  • A multi-page document split incorrectly (broken: a table on page 7 may continue on page 8)

Document detection answers these questions before OCR runs. The OCR engine then gets a clean, properly oriented page with metadata about what type of document it is. Accuracy goes up. Downstream rules get applied correctly. Bugs disappear.

If you are brand new to OCR, our optical character reader 2026 guide is the easier starting point. This article assumes you have run OCR and want to know why your accuracy is lower than the marketing claims.

The Four Document Detection Tasks That Matter Most

1. Page Boundary Detection

If the input is a photo, the document does not fill the frame. There is desk, hand, shadow. Page boundary detection finds the four corners of the actual document and crops to just that. OCR accuracy on a properly cropped document is dramatically higher than on the full photo.

How: edge detection (Canny in OpenCV is the classic), contour finding, four-point perspective transform. About 20 lines of code if you use OpenCV.

2. Rotation Correction

Pages get scanned sideways or upside-down. OCR engines expect upright text, so a sideways page produces gibberish. Rotation correction detects which way is up and rotates the page before OCR.

How: small classifier that takes a thumbnail and predicts orientation (0°, 90°, 180°, 270°). Trains fast. Free tools make this a one-flag fix: ocrmypdf has --rotate-pages, which relies on Tesseract's orientation and script detection (OSD).
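Once something has predicted the orientation, applying it is the easy part. The sketch below assumes the page is a NumPy array and that the prediction arrives as "degrees to rotate clockwise to make the page upright" (my convention, so check what your detector actually reports). The looks_sideways helper is a cheap hypothetical stand-in for the classifier, not a replacement: it can tell sideways from upright on text-heavy pages, but it cannot tell upright from upside-down.

```python
import numpy as np


def rotate_upright(page, clockwise_degrees):
    """Rotate a page array clockwise by 0, 90, 180, or 270 degrees."""
    if clockwise_degrees not in (0, 90, 180, 270):
        raise ValueError("orientation must be one of 0, 90, 180, 270")
    # np.rot90 turns counter-clockwise for positive k, so negate for clockwise.
    return np.rot90(page, k=-(clockwise_degrees // 90))


def looks_sideways(gray_page, ink_threshold=128):
    """Heuristic: horizontal text lines make the row profile spikier than the column profile."""
    ink = gray_page < ink_threshold
    row_profile = ink.mean(axis=1)   # fraction of dark pixels in each row
    col_profile = ink.mean(axis=0)   # fraction of dark pixels in each column
    return bool(col_profile.var() > row_profile.var())
```

Because the four orientations are exact quarter turns, np.rot90 moves pixels without interpolation, so this step never blurs the text.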

3. Skew Correction (Deskew)

A page scanned at a 3-degree tilt looks fine to a person, but OCR accuracy can drop 5-10 points. Deskewing rotates the page so the text lines are horizontal.

How: Hough line transform finds the dominant text line angle, then rotate by the negative of that angle. ocrmypdf --deskew does this automatically.
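If you want to see the angle estimate itself, here is a sketch that swaps the Hough transform for a projection-profile search, which fits in plain NumPy: try a range of candidate angles and keep the one where the ink pixels collapse into the sharpest set of text lines. The function name, the ±10 degree search range, and the 0.25 degree step are assumptions for illustration.

```python
import numpy as np


def estimate_skew_degrees(gray_page, max_angle=10.0, step=0.25, ink_threshold=128):
    """Estimate text tilt in degrees; positive means lines drop toward the right of the image."""
    ys, xs = np.nonzero(gray_page < ink_threshold)
    if ys.size == 0:
        return 0.0
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step, step):
        theta = np.deg2rad(angle)
        # Project each ink pixel onto the axis perpendicular to a line at this angle.
        projected = ys * np.cos(theta) - xs * np.sin(theta)
        bins = np.arange(np.floor(projected.min()), np.ceil(projected.max()) + 2)
        histogram, _ = np.histogram(projected, bins=bins)
        # Sharp, well-separated text lines give a peaky histogram and a high score.
        score = float(np.sum(histogram.astype(np.float64) ** 2))
        if score > best_score:
            best_angle, best_score = float(angle), score
    return best_angle
```

To correct the page, rotate it by the opposite of the returned angle, and double-check your image library's sign convention, since some treat positive angles as counter-clockwise.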

4. Document Classification

Is this an invoice, a bank statement, a passport, a contract? Different document types need different OCR settings, different validation rules, different downstream routing.

How: small image classifier or layout-based heuristic. Production systems usually train a model on tens of thousands of labeled documents per class. Open source classifiers exist for common types.
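A trained image classifier will not fit in a blog snippet, but the heuristic route can be sketched. The version below is a keyword-based stand-in rather than a layout-based one: it assumes you already have some text for the page (an embedded PDF text layer or a quick low-resolution OCR pass), and the keyword lists and the min_hits cutoff are hypothetical examples, not a vetted ruleset.

```python
import re

# Hypothetical keyword lists; a real ruleset should come from your own documents.
KEYWORDS = {
    "invoice": ("invoice", "bill to", "due date", "subtotal"),
    "bank_statement": ("statement period", "opening balance", "closing balance", "account number"),
    "passport": ("passport", "nationality", "date of birth", "place of birth"),
    "contract": ("agreement", "hereinafter", "party", "governing law"),
}


def classify_document(text, min_hits=2):
    """Return (label, hits) for the best keyword match, or ("unknown", 0)."""
    normalized = re.sub(r"\s+", " ", text.lower())
    scores = {
        label: sum(1 for phrase in phrases if phrase in normalized)
        for label, phrases in KEYWORDS.items()
    }
    label = max(scores, key=scores.get)
    if scores[label] < min_hits:
        return "unknown", 0
    return label, scores[label]
```

Returning "unknown" instead of forcing a guess matters for routing: an unclassified page can go to a default path or a human, while a misclassified one gets the wrong rules applied silently.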

Why Teams Skip Document Detection

Honest answers from talking with dozens of teams:

"The vendor said it was built in"

Some vendors include document detection. Some claim to, but the support is shallow: it handles only 0/90/180/270 rotation, not arbitrary skew or page boundaries. Read the docs carefully.

"It's not where the magic is"

The OCR engine and the LLM get the marketing budget. Pre-processing gets ignored. So teams pour effort into model tuning and ignore the cheap fix in front of them.

"We thought our documents were clean"

Every team thinks their documents are clean until they look. Run a sample of your real production documents and you will find tilted pages, rotated pages, photos with desks around the edges, and faded scans. The documents in your inbox right now are not clean.

"We'll get to it later"

"Later" is when you have spent six months building review workflows to fix the errors you could have prevented with one afternoon of pre-processing. Add the pre-processing first.

The Pipeline I Run Today

This is the production pre-processing pipeline I run before every OCR call. Each step is cheap. Each step recovers accuracy. Together they recovered about 6 percentage points of accuracy from our pre-cleanup baseline.

  1. File-type check. Is this actually a PDF/image? Confirm by magic bytes, not extension.
  2. Page extraction. Convert PDF pages to images at 300 DPI.
  3. Page boundary detection. If the image is a photo with margins, crop to the document.
  4. Rotation correction. Detect orientation and rotate to upright.
  5. Deskew. Correct for small tilts.
  6. Adaptive thresholding. Boost contrast in low-contrast regions without crushing high-contrast text.
  7. Document classification. Predict document type. Route to type-specific OCR settings.
  8. OCR. Run the engine. Now it has a clean, oriented, classified document.
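Step 1 in the list above is the easiest to show. This is a minimal sketch of a magic-byte check covering PDF, PNG, JPEG, and TIFF; the function names are mine, and a real pipeline would likely need more formats.

```python
MAGIC_SIGNATURES = (
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"II*\x00", "tiff"),   # little-endian TIFF
    (b"MM\x00*", "tiff"),   # big-endian TIFF
)


def sniff_file_type(header):
    """Return a type label from the first bytes of a file, or None if unrecognized."""
    for signature, label in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return label
    return None


def sniff_path(path):
    """Read only the first 16 bytes of a file and sniff its type."""
    with open(path, "rb") as handle:
        return sniff_file_type(handle.read(16))
```

A file named scan.pdf that sniffs as "jpeg" should go straight to the image path instead of crashing the PDF page extractor.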

The whole pre-processing pipeline runs in under a second per page on commodity hardware. The accuracy gain is permanent and applies to every downstream document. (Detailed walk-through in our honest guide from 4M pages a month.)
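Of the steps in that pipeline, adaptive thresholding is the one people most often replace with a single global cutoff, so here is a sketch of the local version in plain NumPy. It compares each pixel with the mean of its neighborhood (computed with an integral image) instead of one page-wide threshold. The function name, the 31-pixel window, and the offset of 10 are illustrative assumptions; OpenCV users would normally reach for cv2.adaptiveThreshold instead.

```python
import numpy as np


def adaptive_threshold(gray_page, window=31, offset=10):
    """Binarize: a pixel becomes ink (0) if it is darker than its local mean minus offset."""
    gray = gray_page.astype(np.float64)
    half = window // 2
    size = 2 * half + 1
    # Pad by edge replication so border pixels still get a full window.
    padded = np.pad(gray, half, mode="edge")
    # Integral image with a leading row and column of zeros.
    integral = np.pad(padded, ((1, 0), (1, 0)), mode="constant").cumsum(axis=0).cumsum(axis=1)
    h, w = gray.shape
    window_sum = (
        integral[size:size + h, size:size + w]
        - integral[:h, size:size + w]
        - integral[size:size + h, :w]
        + integral[:h, :w]
    )
    local_mean = window_sum / (size * size)
    return np.where(gray < local_mean - offset, 0, 255).astype(np.uint8)
```

On a page with a shadow across one half, a global cutoff either loses the text in the shadow or turns the shadow itself black; the local comparison keeps both halves readable.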

The Five Failure Modes Document Detection Catches

These are the failures that document detection prevents. If you see any of these in your output, you are skipping detection somewhere:

  1. Gibberish on specific pages. Rotated page. Add rotation correction.
  2. Wrong document type routing. Bank statement routed as invoice. Add classification.
  3. Margin text picked up as part of the document. Photo with desk around the edges. Add page boundary detection.
  4. Lower accuracy on hand-photographed documents than scanned ones. Skew + photo edges. Add boundary detection and deskew.
  5. Multi-page tables breaking on page boundaries. Page splitting logic ignored continuation tables. Add cross-page table linking.

How to Add Document Detection in an Afternoon

If you have a Tesseract-based pipeline today and want to add detection cheaply:

brew install ocrmypdf
ocrmypdf --deskew --rotate-pages --remove-background input.pdf output.pdf

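Those three flags handle deskew, rotation, and background removal. Some recent ocrmypdf releases have disabled --remove-background, so if the command rejects that flag, drop it and keep the other two. Page boundary detection requires a few extra lines of OpenCV. Document classification requires a small model or a heuristic ruleset.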
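If you want to call this from a batch job rather than a terminal, a thin wrapper is enough. This sketch assumes ocrmypdf is installed and on PATH; the function names are hypothetical. It leaves out --remove-background because, as far as I know, that flag is not available in every ocrmypdf release.

```python
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, deskew=True, rotate_pages=True):
    """Build the ocrmypdf argument list for the deskew and rotation flags."""
    command = ["ocrmypdf"]
    if deskew:
        command.append("--deskew")
    if rotate_pages:
        command.append("--rotate-pages")
    command += [str(input_pdf), str(output_pdf)]
    return command


def preprocess_pdf(input_pdf, output_pdf):
    """Run ocrmypdf; raises CalledProcessError if it exits non-zero."""
    subprocess.run(build_ocrmypdf_command(input_pdf, output_pdf), check=True)
```

Passing the arguments as a list (not a shell string) means file names with spaces or quotes cannot break the command.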

If you are on a cloud OCR API, check whether detection is included. Most modern APIs (DocsAPI, AWS Textract, Google Document AI) handle the basic detection steps automatically. If yours doesn't, file a feature request — it is a baseline expectation in 2026.

The Way I Explain This to Non-Tech Folks

Imagine you ask a friend to read every label on a shelf in your basement. If you hand her a clean shelf with labels facing forward and lights on, she reads them all correctly. If you hand her a dusty shelf with some labels upside-down, some sideways, some in the shadows, she will misread some.

Document detection is the work you do before handing her the shelf. Brush off the dust. Turn the labels right-side up. Turn the lights on. The shelf is now an easy read. Your friend reads everything correctly.

The shelf is your PDF. The friend is the OCR engine. The dusting and turning is document detection. People skip it because it sounds boring. The accuracy gain from doing it is the biggest single improvement available.

What I'd Do Today

If you have a working OCR pipeline that disappoints: add deskew and rotation correction first. Two flags in ocrmypdf or a few lines of code in your own pipeline. The accuracy lift is typically 4-8 points.
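Do not take my numbers on faith; measure the lift on your own pages. A minimal sketch, assuming you have hand-checked reference text for a sample of pages: compute character accuracy (one minus the character error rate) before and after adding pre-processing. The function names are mine.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, one row at a time."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_char != hyp_char),   # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """1 - CER, floored at 0; reference is the hand-checked ground truth."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1.0 - edit_distance(reference, hypothesis) / len(reference))
```

Run it on the same sample with and without the pre-processing steps, and compare the averages; a "point" of accuracy in this article is one percentage point of that number.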

If you are starting a new pipeline: do not write the OCR step until you have written the pre-processing. The pre-processing is what makes the OCR work.

If you are evaluating a vendor: ask what document detection they do. The answer should be specific — boundary detection, rotation, deskew, classification. Not just "yes". (I write about this gap a lot.)

Frequently Asked Questions

What is document detection in OCR?

Document detection is the pre-processing step that identifies and prepares pages before OCR runs. It includes page boundary detection, rotation correction, deskew, and document classification. Skipping it costs accuracy on real-world (non-clean) documents.

Is document detection included in modern OCR APIs?

Mostly yes. AWS Textract, Google Document AI, Azure AI Document Intelligence (formerly Form Recognizer), and DocsAPI all include basic detection. Tesseract requires you to add it via flags or external tools. Always check the docs for which specific detection steps are included.

How much does document detection improve OCR accuracy?

Typical accuracy gain: 4-8 percentage points on real-world (non-clean) documents. The gain is larger on photos taken with a phone, scanned at angles, or faded. Clean scans benefit less because they were already easy.

Can I add document detection without an engineer?

Partially. Tools like ocrmypdf include detection flags you can run from the command line. For full pipeline integration (page boundaries, classification), an engineer is helpful. Many cloud OCR APIs handle detection automatically.

What is the difference between rotation correction and deskew?

Rotation correction handles 90-degree turns (sideways or upside-down pages). Deskew handles small tilts (a 3-degree angle off horizontal). Both are needed. Rotation correction is discrete (4 possible orientations). Deskew is continuous (any small angle).

Does document detection help with handwritten content?

It helps with everything — typed, handwritten, mixed. Handwriting benefits especially from page boundary detection (photos of handwritten notes often include desk edges) and deskew (handwritten pages are rarely perfectly aligned).

Nupura Ughade

Content Marketing Lead, DocsAPI

Nupura Ughade creates clear, insightful content on OCR, document AI, and fintech. She combines technical depth with real-world finance use cases to help engineers and operations leaders navigate digital transformation with confidence.
