Aspose.PDF Text Extractor for .NET

Aspose.PDF Text Extractor for .NET is a focused plugin that allows developers to extract pure, raw, or plain text from PDF documents. It strips away formatting and graphical elements, providing clean textual content that can be indexed, analyzed, or transformed within .NET applications.

Getting Started

Installation and Setup

  1. Install Aspose.PDF via NuGet or download assemblies directly.
  2. Configure metered licensing before extraction (see Metered Licensing).

Features and Functionalities

Raw Text Extraction

  • Extracts the unaltered character stream from each page.
  • Preserves whitespace, line breaks, and hidden text.
  • Useful for indexing or bulk text dumps.

Plain Text Extraction

  • Normalizes whitespace and line breaks for readability.
  • Joins adjacent text runs so that words and sentences are not split across fragments.
  • Ignores fonts, graphics, and positioning.
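
The plugin performs this normalization internally. As a rough illustration of what "normalizing whitespace and line breaks" can mean, the sketch below applies comparable rules to a string that has already been extracted. `PlainTextSketch` and `Normalize` are hypothetical names used only for this example; they are not part of the Aspose API, and the plugin's actual rules may differ.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of the Aspose API.
public static class PlainTextSketch
{
    public static string Normalize(string raw)
    {
        if (string.IsNullOrEmpty(raw)) return string.Empty;

        // Unify Windows (\r\n) and bare \r line endings to \n.
        string text = raw.Replace("\r\n", "\n").Replace('\r', '\n');

        // Collapse runs of spaces and tabs into a single space.
        text = Regex.Replace(text, @"[ \t]+", " ");

        // Remove a space directly before or after a line break.
        text = Regex.Replace(text, @" ?\n ?", "\n");

        // Keep at most one blank line between paragraphs.
        text = Regex.Replace(text, @"\n{3,}", "\n\n");

        return text.Trim();
    }
}
```

For example, `PlainTextSketch.Normalize("Hello   world \r\n\r\n\r\n\r\nNext\tline  ")` returns `"Hello world\n\nNext line"`.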

Page and Range-Based Extraction

  • Extract text from entire documents or specific page ranges.
  • Reduces memory usage by limiting scope.

Region-Based Extraction

  • Specify rectangular regions (x, y, width, height).
  • Extract text from headers, footers, or columns.
  • Ideal for structured layouts.

Text Filtering and Cleanup

  • Remove control sequences, non-printable characters, and extra whitespace.
  • Optionally exclude text from annotations, fields, or hidden layers.
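
If further cleanup is needed after extraction, control characters can also be removed in application code. The following is a minimal sketch under that assumption; `CleanupSketch` and `StripControlChars` are hypothetical names, not plugin APIs. It deliberately keeps Unicode format characters (such as directional marks) so that right-to-left text is not damaged.

```csharp
using System.Text;

// Hypothetical post-processing helper; not part of the Aspose API.
public static class CleanupSketch
{
    public static string StripControlChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            // Keep line breaks and tabs; drop every other control
            // character (for example NUL or form feed).
            if (c == '\n' || c == '\r' || c == '\t' || !char.IsControl(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

For example, `CleanupSketch.StripControlChars("A\u0000B\fC\nD")` returns `"ABC\nD"`.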

Encrypted PDF Support

  • Open password-protected PDFs by supplying credentials.
  • Extraction APIs decrypt automatically during processing.

Unicode and Encoding

  • Output in UTF-8 or specified encodings.
  • Supports complex scripts, right-to-left languages, and Unicode glyphs.
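
Once text has been extracted into a .NET string, the output encoding can also be chosen when the text is written out. The sketch below uses only standard .NET calls; `EncodingSketch` and `Save` are hypothetical names for this example and say nothing about how the plugin itself exposes encoding settings.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper; uses only standard .NET APIs.
public static class EncodingSketch
{
    // Writes extracted text to disk in the given encoding.
    public static void Save(string text, string path, Encoding encoding)
    {
        File.WriteAllText(path, text, encoding);
    }

    // UTF-8 without a byte order mark. Encoding.UTF8 would add one
    // when used with File.WriteAllText.
    public static Encoding Utf8NoBom()
    {
        return new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
    }
}
```

A call such as `EncodingSketch.Save(text, "output.txt", EncodingSketch.Utf8NoBom())` writes UTF-8 without a byte order mark. On .NET Core and .NET 5+, legacy code pages are available only after calling `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)`.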

Performance and Concurrency

  • Stream-based extraction minimizes memory footprint.
  • Thread-safe APIs allow parallel processing of multiple PDFs.
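
One way to process several PDFs in parallel is to give each file its own unit of work and collect the results in a thread-safe collection. The sketch below assumes a caller-supplied delegate, `extractOne`, that creates its own extractor and options for a single file and returns the text. `ParallelExtractionSketch`, `ExtractAll`, and `extractOne` are hypothetical names, not plugin APIs.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper; the extraction itself is supplied by the caller.
public static class ParallelExtractionSketch
{
    public static IDictionary<string, string> ExtractAll(
        IEnumerable<string> pdfPaths,
        Func<string, string> extractOne,
        int maxDegreeOfParallelism = 4)
    {
        var results = new ConcurrentDictionary<string, string>();
        var parallelOptions = new ParallelOptions
        {
            MaxDegreeOfParallelism = maxDegreeOfParallelism
        };

        Parallel.ForEach(pdfPaths, parallelOptions, path =>
        {
            // Each iteration handles one file; the only shared state
            // is the thread-safe results dictionary.
            results[path] = extractOne(path);
        });

        return results;
    }
}
```

Capping `MaxDegreeOfParallelism` keeps memory use predictable when the input set is large; the value 4 is an arbitrary default for this sketch.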

Code Example: Extracting Text from PDF

using System;
using System.IO;
using Aspose.Pdf.Plugins;

// Define input file
var inputPath = Path.Combine(@"C:\Samples\", "sample.pdf");

// Create text extractor instance; it is disposed when it goes out of scope
using var extractor = new TextExtractor();

// Configure extraction options (formatting modes: Pure, Raw, Plain)
var options = new TextExtractorOptions(TextExtractorOptions.TextFormattingMode.Plain);

// Add input
options.AddInput(new FileDataSource(inputPath));

// Process extraction
var resultContainer = extractor.Process(options);

// Retrieve text result as a string
var textResult = resultContainer.ResultCollection[0].ToString();
Console.WriteLine(textResult);

Tips and Best Practices

  • Choose extraction mode based on needs: raw for indexing, plain for readability.
  • Limit extraction to ranges or regions to improve performance.
  • Apply filters early to simplify post-processing.
  • Cache decrypted instances when reusing secured PDFs.
  • Tune thread counts and buffer sizes for large-scale workflows.
  • Configure licensing at startup to avoid evaluation warnings.

Frequently Asked Questions

What modes of extraction are supported? Pure, raw, and plain text modes. Extraction can also be limited to page ranges or rectangular regions.

Can I extract text from password-protected PDFs? Yes. Supply the correct password when opening the document, and the text is decrypted and extracted as usual.

Does it support right-to-left and complex scripts? Yes. Unicode text, including right-to-left scripts such as Arabic and Hebrew, is fully supported.

How is this plugin different from the full Aspose.PDF library? This plugin is lightweight and optimized only for text extraction, while Aspose.PDF provides a full PDF manipulation API.

Is extraction thread-safe? Yes. Operations are thread-safe at the document level, so separate PDFs can be processed in parallel.