Aspose.OCR Scanned PDF to Text for .NET
Aspose.OCR Scanned PDF to Text for .NET lets developers extract text from scanned PDF files or convert them into fully searchable documents. It handles a wide range of layouts and styles, detects the structure of text and tables, and keeps the original page images in the background so that no content is lost.
Installation and Setup
To get started, install the Aspose.OCR package into your .NET project via NuGet or from a locally downloaded file. For detailed steps, see the Installation guide. Before calling any OCR methods, configure metered licensing as described in the Metered Licensing documentation.
Features and Functionalities
Text Extraction from Scanned PDFs
- Reads bitmap-based pages and applies OCR to extract recognizable text.
- Supports both single-page and multi-page PDF input.
- Exposes text fragments along with their position, font attributes, and confidence scores.
OCR Accuracy and Layout Retention
- Leverages advanced OCR engines to maximize recognition accuracy on low-quality scans.
- Preserves document flow: paragraphs, columns, and line breaks remain consistent with the source layout.
- Provides detailed layout metadata so developers can reconstruct or reflow content.
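As an illustration of how position metadata can be used to reflow content, the sketch below groups fragments into lines and orders them left to right. It is plain C# and does not call Aspose.OCR: the `Fragment` record, its fields, and the half-height line tolerance are assumptions made for this example, so map them to whatever position data your recognition results actually expose.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical fragment type: text plus the top-left corner and height of its bounding box.
public record Fragment(string Text, int X, int Y, int Height);

public static class ReadingOrder
{
    // Groups fragments into lines by vertical proximity, then orders each line left to right.
    public static List<string> ToLines(IEnumerable<Fragment> fragments)
    {
        var lines = new List<List<Fragment>>();
        foreach (var f in fragments.OrderBy(f => f.Y))
        {
            var current = lines.LastOrDefault();
            // A fragment joins the current line when its top edge lies within
            // half the height of the fragment that started that line.
            if (current != null && Math.Abs(f.Y - current[0].Y) <= current[0].Height / 2)
                current.Add(f);
            else
                lines.Add(new List<Fragment> { f });
        }
        return lines
            .Select(line => string.Join(" ", line.OrderBy(f => f.X).Select(f => f.Text)))
            .ToList();
    }
}
```

This simple grouping assumes a single-column page; multi-column layouts would need the fragments split by column first.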
Table Recognition and Extraction
- Automatically detects table structures within scanned pages.
- Outputs table content as structured rows and cells with bounding box coordinates.
- Enables downstream export to CSV, Excel, or custom schemas.
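The downstream CSV export mentioned above could look roughly like this. The sketch is independent of Aspose.OCR: `TableCell` is a hypothetical type holding a zero-based row index, column index, and text, and you would populate it from the table data your OCR results provide.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical cell type: zero-based row and column indexes plus recognized text.
public record TableCell(int Row, int Column, string Text);

public static class TableCsv
{
    // Quotes a field (RFC 4180 style) when it contains a comma, quote, or line break.
    private static string Escape(string field) =>
        field.IndexOfAny(new[] { ',', '"', '\n', '\r' }) >= 0
            ? "\"" + field.Replace("\"", "\"\"") + "\""
            : field;

    // Builds CSV text from cells; positions with no cell become empty fields.
    public static string ToCsv(IReadOnlyCollection<TableCell> cells)
    {
        if (cells.Count == 0) return string.Empty;
        int rows = cells.Max(c => c.Row) + 1;
        int cols = cells.Max(c => c.Column) + 1;
        var grid = new string[rows, cols];
        foreach (var c in cells) grid[c.Row, c.Column] = c.Text;

        var sb = new StringBuilder();
        for (int r = 0; r < rows; r++)
        {
            var fields = Enumerable.Range(0, cols)
                .Select(col => Escape(grid[r, col] ?? string.Empty));
            sb.Append(string.Join(",", fields)).Append("\r\n");
        }
        return sb.ToString();
    }
}
```

Building a dense grid first keeps columns aligned even when OCR misses a cell, which matters when the CSV is loaded into a spreadsheet.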
Searchable Document Conversion
- Embeds recognized text back into PDFs as an invisible layer, making them searchable without changing appearance.
- Retains original scanned imagery to preserve visual fidelity.
Background Image Preservation
- Keeps scanned images intact in the background.
- Places recognized text overlays on top for seamless reading and printing.
Customizable Recognition Parameters
- Adjust segmentation modes for single/multi-column layouts.
- Configure character whitelist/blacklist for domain-specific recognition.
- Control resolution (DPI) and preprocessing filters (deskew, noise removal, thresholding).
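Where a field is known to contain only certain characters, a post-recognition filter can complement a recognition-time whitelist. Note that this is a different technique from configuring the OCR engine itself: the sketch below is plain C# applied to text that has already been recognized, and `CharacterFilter` is a hypothetical helper, not part of Aspose.OCR.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class CharacterFilter
{
    // Returns the input text with every character outside the allowed set removed.
    public static string KeepAllowed(string text, string allowed)
    {
        var set = new HashSet<char>(allowed);
        return new string(text.Where(set.Contains).ToArray());
    }
}
```

For example, `CharacterFilter.KeepAllowed("Qty: 42 pcs", "0123456789")` keeps only the digits of a quantity field.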
Multi-Language and Script Support
- Recognizes Latin, Cyrillic, Greek, Chinese, Hindi, and more.
- Allows dynamic loading of language packs.
- APIs let you specify primary and secondary recognition languages per page.
Performance and Resource Management
- Supports multi-page PDF processing.
- Async APIs enable parallel processing for batch workloads.
- Provides tuning options for thread usage and buffer sizes.
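For batch workloads, one common way to keep parallel OCR from exhausting CPU and memory is to cap the number of files processed at once. The following sketch uses only standard .NET types; `recognizeFile` is a hypothetical delegate that stands in for your own OCR call, and `maxParallel` is a value you would tune for your hardware.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BatchOcr
{
    // Runs recognizeFile over all paths with at most maxParallel calls in flight.
    // Results are returned in the same order as the input paths.
    public static async Task<IReadOnlyList<string>> RunAsync(
        IEnumerable<string> paths,
        Func<string, string> recognizeFile,
        int maxParallel)
    {
        using var gate = new SemaphoreSlim(maxParallel);
        var tasks = paths.Select(async path =>
        {
            await gate.WaitAsync();
            try
            {
                // OCR is CPU-bound, so run it on the thread pool.
                return await Task.Run(() => recognizeFile(path));
            }
            finally
            {
                gate.Release();
            }
        }).ToList();
        return await Task.WhenAll(tasks);
    }
}
```

A semaphore is used here rather than an unbounded `Task.WhenAll` so that a large batch does not start every recognition at the same time.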
Example: Extracting Text from Scanned PDFs
```csharp
using System;
using System.Collections.Generic;

// Apply the metered license before calling any OCR method
Aspose.OCR.Metered metered = new Aspose.OCR.Metered();
metered.SetMeteredKey("PublicKey", "PrivateKey");

Aspose.OCR.AsposeOcr recognitionEngine = new Aspose.OCR.AsposeOcr();
Aspose.OCR.OcrInput input = new Aspose.OCR.OcrInput(Aspose.OCR.InputType.PDF);
// Process selected pages from a PDF
input.Add("source1.pdf", 0, 3); // first 3 pages
// Process all pages from another PDF
input.Add("source2.pdf");

Aspose.OCR.RecognitionSettings recognitionSettings = new Aspose.OCR.RecognitionSettings();
recognitionSettings.Language = Aspose.OCR.Language.Latin;

List<Aspose.OCR.RecognitionResult> results = recognitionEngine.Recognize(input, recognitionSettings);
// Print the recognized text of every result
foreach (Aspose.OCR.RecognitionResult result in results)
{
    Console.WriteLine(result.RecognitionText);
}

// Save only the first result as plain text
results[0].Save("result.txt", Aspose.OCR.SaveFormat.Text);
// Save all results as a single multi-page PDF
Aspose.OCR.AsposeOcr.SaveMultipageDocument("result.pdf", Aspose.OCR.SaveFormat.Pdf, results);
```
Tips and Best Practices
- Preprocess scanned pages (deskew, despeckle, threshold) before recognition to improve accuracy.
- Use layout analysis to detect text and tables before extraction.
- Apply confidence thresholds to validate critical content.
- Limit concurrent OCR engines in batch jobs to prevent resource contention.
- Cache language packs and reuse OCR engine instances across multiple pages.
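The confidence-threshold tip can be sketched as a simple split between content that is accepted automatically and content routed to manual review. The code below is plain C# and makes assumptions: `ScoredText` is a hypothetical type, and the confidence is taken to be a value between 0 and 1, which may differ from the scale your OCR results use.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical scored fragment: recognized text and a confidence in the range 0..1.
public record ScoredText(string Text, double Confidence);

public static class ConfidenceGate
{
    // Splits fragments into an accepted list and a needs-review list
    // according to a minimum confidence.
    public static (List<ScoredText> Accepted, List<ScoredText> Review) Split(
        IEnumerable<ScoredText> fragments, double minConfidence)
    {
        var lookup = fragments.ToLookup(f => f.Confidence >= minConfidence);
        return (lookup[true].ToList(), lookup[false].ToList());
    }
}
```

For critical fields such as totals or identifiers, a stricter threshold than the one used for body text is usually a reasonable starting point.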
By combining OCR accuracy, table detection, and searchable PDF generation, Aspose.OCR Scanned PDF to Text for .NET provides a complete solution for digitizing and extracting text from scanned PDFs while preserving original layouts.