Excellent File Parsing And Data Extraction APIs For .NET & Java

GroupDocs parsing APIs enable developers to programmatically extract data and parse PDF, DOCX, XLSX, PPTX, RTF, ODT, TXT, EML, MSG, HTML, ZIP, and more file formats.


Powerful Document Parsing and Data Extraction Solutions

Parsing refers to the analysis of different types of data that may include symbols, text, and numbers. It helps in determining the structure of data by interpreting it in an easy-to-understand manner. Different data and information processing activities utilize parsing for seamless data interpretation. Parsing is essentially used in data extraction, mining, translation, and more while also assisting in better management of vast amounts of data by dividing it into smaller parts.
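
To make the idea of "dividing data into smaller parts" concrete, here is a minimal sketch in plain Java that does not use GroupDocs.Parser at all. It assumes a hypothetical input made of "key: value" lines and turns it into an ordered map; the class name KeyValueParser and the sample text are illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of any GroupDocs API.
class KeyValueParser {
    // Splits "key: value" lines into an ordered map.
    // Lines without a colon, or with an empty key, are skipped.
    static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String key = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (!key.isEmpty()) {
                fields.put(key, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String raw = "Name: Jane Doe\nModel: Sedan\nTime: 10:30\nnot a field";
        System.out.println(parse(raw));
    }
}
```

Only the first colon on a line separates the key from the value, so a value such as 10:30 stays intact.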

If you are a software or app developer looking for a powerful API to parse your documents programmatically, try the GroupDocs.Parser API for .NET and Java. It equips you with all you need to parse PDF, Word, Excel, PowerPoint, eBooks, emails, HTML, and an array of other file types. With this file parsing and data extraction API, you can extract data such as metadata, text, and images from the supported file formats across the .NET and Java platforms.

Getting Started

Please refer to the information given below to correctly install the .NET or Java version of the document parsing API for developers on your system.

GroupDocs.Parser for .NET installation

You can obtain the DLLs or MSI installer from the downloads section, or install the file parsing and data extraction API in your .NET application via NuGet:
PM> Install-Package GroupDocs.Parser 

GroupDocs.Parser for Java installation

Download the JAR file from the downloads section, or use the following repository and dependency configuration in your Maven-based Java applications:
<repository>
<id>groupdocs-artifacts-repository</id>
<name>GroupDocs Artifacts Repository</name>
<url>https://releases.groupdocs.com/java/repo/</url>
</repository>
<dependency>
<groupId>com.groupdocs</groupId>
<artifactId>groupdocs-parser</artifactId>
<version>22.6</version>
</dependency>
   

Real-world Use Cases for Document Parsing and Data Extraction

Once you have set up the desired version of the GroupDocs.Parser API, let's look at some real-world scenarios for parsing documents and extracting formatted text, images, and metadata from them.

Mastering Data Parsing from PDF Documents

Parsing PDF documents enables the convenient extraction of information from a PDF file and converting it into a structured format that is easier to interpret and process. You can extract data such as text, images, tables, hyperlinks, and other elements by parsing PDF files. GroupDocs.Parser for .NET and Java APIs let you effortlessly incorporate PDF parser functionality into your document processing apps. You can learn how to parse PDF files and extract data from them with the help of these excellent document parser APIs.

Develop PDF parser applications in .NET

Use the sample code below to parse PDF files in .NET:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

// Create an instance of Parser class
using (Parser parser = new Parser("filepath/sample.pdf"))
{
    // Extract data from PDF document
    DocumentData data = parser.ParseForm();
    // Check if form extraction is supported
    if (data == null)
    {
        Console.WriteLine("Form extraction isn't supported.");
        return;
    }

    // Create the preliminary record object
    PreliminaryRecord rec = new PreliminaryRecord();
    rec.Name = GetFieldText(data, "Name");
    rec.Model = GetFieldText(data, "Model");
    rec.Time = GetFieldText(data, "Time");
    rec.Description = GetFieldText(data, "Description");

    // We can save the preliminary record object to the database, 
    // send it as the web response or just print it to the console
    Console.WriteLine("Preliminary record");
    Console.WriteLine("Name: {0}", rec.Name);
    Console.WriteLine("Model: {0}", rec.Model);
    Console.WriteLine("Time: {0}", rec.Time);
    Console.WriteLine("Description: {0}", rec.Description);
}

private static string GetFieldText(DocumentData data, string fieldName)
{
    // Get the field from data collection
    FieldData fieldData = data.GetFieldsByName(fieldName).FirstOrDefault();
    // Check if the field data is not null (a field with the fieldName is contained in data collection)
    // and check if the field data contains the text
    return fieldData != null && fieldData.PageArea is PageTextArea
        ? (fieldData.PageArea as PageTextArea).Text
        : null;
}

// 
// Simple POCO object to store the extracted data.
// 
public class PreliminaryRecord
{
    public string Name { get; set; }
    public string Model { get; set; }
    public string Time { get; set; }
    public string Description { get; set; }
}    

PDF parsing in Java

To add PDF parser functionality in your Java apps, please use the following code:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleCarWashPdf)) {
    // Extract data from PDF document
    DocumentData data = parser.parseForm();
    // Check if form extraction is supported
    if (data == null) {
        System.out.println("Form extraction isn't supported.");
        return;
    }
    // Create the preliminary record object
    PreliminaryRecord rec = new PreliminaryRecord();
    rec.Name = getFieldText(data, "Name");
    rec.Model = getFieldText(data, "Model");
    rec.Time = getFieldText(data, "Time");
    rec.Description = getFieldText(data, "Description");
    // We can save the preliminary record object to the database,
    // send it as the web response or just print it to the console
    System.out.println("Preliminary record");
    System.out.println(String.format("Name: %s", rec.Name));
    System.out.println(String.format("Model: %s", rec.Model));
    System.out.println(String.format("Time: %s", rec.Time));
    System.out.println(String.format("Description: %s", rec.Description));
}

private static String getFieldText(DocumentData data, String fieldName) {
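    // Get the first field with this name, or null if there is none
    // (requires import java.util.List)
    List<FieldData> fields = data.getFieldsByName(fieldName);
    FieldData fieldData = fields.isEmpty() ? null : fields.get(0);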
    // Check if the field data is not null (a field with the fieldName is contained in data collection)
    // and check if the field data contains the text
    return fieldData != null && fieldData.getPageArea() instanceof PageTextArea
            ? ((PageTextArea) fieldData.getPageArea()).getText()
            : null;
}

/**
 * Simple POJO to store the extracted data.
 */
static class PreliminaryRecord {
    public String Name;
    public String Model;
    public String Time;
    public String Description;
}

Efficiently Extracting Data from Microsoft Office Documents

As businesses become increasingly digital, the need to parse Word, Excel, and PowerPoint documents has also gained importance. File parsing is a critical part of data analysis and business intelligence as it allows you to extract structured data easily. This data can then be used to automate business processes, uncover insights, and improve decision-making. GroupDocs.Parser APIs support parsing Word, Excel, and PowerPoint files enabling you to extract text, metadata, images, tables, and hyperlinks contained within these documents by building smart document parsing solutions for businesses.

Parse Microsoft Word documents to extract tables in .NET

The following code snippet lets you extract tables from DOCX files:
// Create an instance of Parser class
using (Parser parser = new Parser("filepath/sample.docx"))
{
    // Get the reader object for the document XML representation
    using (XmlReader reader = parser.GetStructure())
    {
        // Iterate over the document
        while (reader.Read())
        {
            // Check if this is the start of the table
            if (reader.IsStartElement() && reader.Name == "table")
            {
                // Process the table
                ProcessTable(reader);
            }
        }
    }
}
 
private static void ProcessTable(XmlReader reader)
{
    Console.WriteLine("table");
    // Create an instance of StringBuilder to store the cell value
    StringBuilder value = new StringBuilder();
    // Iterate over the table
    while (reader.Read())
    {
        // Check if the current tag is the end of the table
        bool isTableEnd = !reader.IsStartElement() && reader.Name == "table";
        // Check if the current tag is the start of the row or the cell
        bool isRowOrCellStart = reader.IsStartElement() && (reader.Name == "tr" || reader.Name == "td");
        // Print the cell value if this is the end of the table or the start of the row or the cell
        if ((isTableEnd || isRowOrCellStart) && value.Length > 0)
        {
            Console.Write("  ");
            Console.WriteLine(value.ToString());
            value = new StringBuilder();
        }
        // If this is the end of the table - return to the main function
        if (isTableEnd)
        {
            return;
        }
        // If this is the start of the row or the cell - print the tag name
        if (isRowOrCellStart)
        {
            Console.WriteLine(reader.Name);
            continue;
        }
        // If this code line is reached then this is the value of the cell
        value.Append(reader.Value);
    }
} 

Parse Word documents in Java and extract tables

This code sample helps you in extracting tables from Word files:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleDocx)) {
    // Extract the text structure as an XML document
    Document document = parser.getStructure();
    // Read XML document
    readNode(document.getDocumentElement());
}

private static void readNode(Node node) {
    NodeList nodes = node.getChildNodes();
    // Iterate over the child nodes
    for (int i = 0; i < nodes.getLength(); i++) {
        Node n = nodes.item(i);
        // If it's a table
        if ("table".equalsIgnoreCase(n.getNodeName())) {
            System.out.println("table");
            // Process node
            processNode(n);
        }
        readNode(n);
    }
}
private static void processNode(Node node) {
    NodeList nodes = node.getChildNodes();
    // Iterate over the child nodes
    for (int i = 0; i < nodes.getLength(); i++) {
        Node n = nodes.item(i);
        switch (n.getNodeName().toLowerCase()) {
            // In the case of a row or cell
            case "tr":
            case "td": {
                // Print the name
                System.out.println(n.getNodeName());
                // Process sub-nodes
                processNode(n);
                System.out.println();
                System.out.println("/" + n.getNodeName());
                break;
            }
            default:
                // Print the node value (if it's not null)
                String value = n.getNodeValue();
                if(value != null) {
                    System.out.print(value);
                }
                processNode(n);
                break;
        }
    }
}       
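
The same tree walk can be used to collect cell values into rows instead of printing them. The sketch below uses only the standard Java DOM API and a small inline XML string; it assumes the structure uses table, tr, and td element names as in the sample above, and the class name TableRows and the sample data are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; works on any DOM with table/tr/td elements.
class TableRows {
    // Returns one list of cell texts per <tr> found under the given table element.
    // Note: rows of nested tables are included too, because
    // getElementsByTagName searches all descendants.
    static List<List<String>> rows(Element table) {
        List<List<String>> result = new ArrayList<>();
        NodeList trs = table.getElementsByTagName("tr");
        for (int i = 0; i < trs.getLength(); i++) {
            List<String> cells = new ArrayList<>();
            NodeList children = trs.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() == Node.ELEMENT_NODE
                        && "td".equals(child.getNodeName())) {
                    cells.add(child.getTextContent().trim());
                }
            }
            result.add(cells);
        }
        return result;
    }

    // Parses an XML string into a DOM document, wrapping checked exceptions.
    static Document parseXml(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Invalid XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<document><table>"
                + "<tr><td>Item</td><td>Qty</td></tr>"
                + "<tr><td>Wax</td><td>2</td></tr>"
                + "</table></document>";
        Document doc = parseXml(xml);
        Element table = (Element) doc.getElementsByTagName("table").item(0);
        System.out.println(rows(table));
    }
}
```

Returning rows as lists keeps the extraction separate from whatever you do next, such as writing CSV or inserting into a database.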

Easily extract text from Excel spreadsheets in .NET

For text and data extraction from an Excel sheet, please use this sample code:
// Create an instance of Parser class
using(Parser parser = new Parser(filePath))
{
    // Get the document info
    IDocumentInfo documentInfo = parser.GetDocumentInfo();
   
    // Iterate over sheets
    for(int p = 0; p < documentInfo.PageCount; p++)
    {
        // Print a sheet number 
        Console.WriteLine(string.Format("Page {0}/{1}", p + 1, documentInfo.PageCount));
   
        // Extract a text into the reader
        using(TextReader reader = parser.GetText(p))
        {
            // Print a text from the spreadsheet sheet
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}

    

Extracting text from Microsoft Excel documents in Java

Similarly, you can extract text from spreadsheets using the following code:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleXlsx)) {
    // Get the spreadsheet info
    IDocumentInfo spreadsheetInfo = parser.getDocumentInfo();
    // Iterate over sheets
    for (int p = 0; p < spreadsheetInfo.getPageCount(); p++) {
        // Print a sheet number
        System.out.println(String.format("Sheet %d/%d", p + 1, spreadsheetInfo.getPageCount()));
        // Extract a text into the reader
        try (TextReader reader = parser.getText(p)) {
            // Print a text from the spreadsheet
            System.out.println(reader.readToEnd());
        }
    }
}    

Image Extraction from PDFs in .NET and Java

Extracting images from PDF documents could have various applications such as analysis of the content, creating a digital archive, or acquiring visual data to use in other applications. Owing to the growth of digital documents, image extraction is fast becoming a necessity and has many practical uses depending on the nature of the business. For instance, a recruiting agency might want to extract photos of the candidates from their resumes for record keeping. With GroupDocs.Parser APIs, you can extract all images from a PDF document, extract images from a particular document page, or from a specific page area.

Extract images from PDF files and save them in JPEG format in .NET

The following code highlights how to extract the images in a PDF document and then save them in JPEG format:
// Extract images from PDF using C#
using (Parser parser = new Parser("filepath/sample.pdf"))
{
    IEnumerable<PageImageArea> images = parser.GetImages();
    // Check if image extraction is supported
    if (images == null) 
    {
        Console.WriteLine("Images extraction isn't supported");
        return;
    }
    
    ImageOptions options = new ImageOptions(ImageFormat.Jpeg);
    int imageNumber = 0;
    
    // Iterate over retrieved images
    foreach (PageImageArea image in images)
    {
        // Save Images
        image.Save("imageFilePath/image-" + imageNumber.ToString() + ".jpeg", options);
        imageNumber++;
    }
}        

Extract and save images from PDF documents in PNG format in Java

For image extraction from PDF files and saving them to PNG format in Java, please use the code given below:
// Create an instance of the Parser class
try (Parser parser = new Parser("filepath/sample.pdf")) {
    // Extract images from document
    Iterable<PageImageArea> images = parser.getImages();
    // Create the options to save images in PNG format
    ImageOptions options = new ImageOptions(ImageFormat.Png);
    int imageNumber = 0;
    // Iterate over images
    for (PageImageArea image : images)
    {
        // Save the images to the PNG file
        image.save(Constants.getOutputFilePath(String.format("%d.png", imageNumber)), options);
        imageNumber++;
    }
}        

Metadata Extraction from PDFs, Office Files, Emails, and More

Metadata includes information such as the type and size of the file, author, date created, and other data associated with the file. Metadata extraction is an important process for many industries as it can provide valuable insight into the content of the document. It could be used in digital preservation, digital asset management, content management, and search engine optimization. Extracting document metadata is one of the features supported by GroupDocs.Parser for .NET and Java APIs. You can upgrade your existing document parser apps or develop new parsing solutions to extract metadata from PDF, Microsoft Word, Excel, PowerPoint, Email files, and eBooks using these APIs.
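
Some of this information, such as file size and timestamps, lives in the file system rather than inside the document, and can be read without any parsing library. The sketch below uses only java.nio from the standard library; the class name FileMetadata and the map keys are hypothetical. Document-level properties such as the author are stored inside the file itself and still require a parser.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; reads file-system metadata only.
class FileMetadata {
    // Collects basic file-system attributes into an ordered name/value map.
    static Map<String, String> read(Path file) {
        try {
            BasicFileAttributes attrs =
                    Files.readAttributes(file, BasicFileAttributes.class);
            Map<String, String> meta = new LinkedHashMap<>();
            meta.put("name", file.getFileName().toString());
            meta.put("size", Long.toString(attrs.size()));
            meta.put("created", attrs.creationTime().toString());
            meta.put("modified", attrs.lastModifiedTime().toString());
            return meta;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the given text to a new temporary file and returns its path.
    static Path writeTemp(String content) {
        try {
            Path tmp = Files.createTempFile("sample", ".txt");
            Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path tmp = writeTemp("hello");
        read(tmp).forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```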

Extracting metadata from PDF, DOCX, XLSX, and PPTX documents in .NET

Use the code below to extract metadata from PDF and other files in .NET. To extract metadata from a Word, Excel, or PowerPoint file, pass its path to the Parser constructor instead of the PDF path:
// Create an instance of the Parser class
using(Parser parser = new Parser("filepath/sample.pdf"))
{
    // Extract metadata from the document
    IEnumerable<MetadataItem> metadata = parser.GetMetadata();
  
    // Iterate over metadata items
    foreach(MetadataItem item in metadata)
    {
        // Print the item name and value
        Console.WriteLine(string.Format("{0}: {1}", item.Name, item.Value));
    }
}        

Extract metadata from PDF, word-processing, spreadsheet, and presentation documents in Java

To extract metadata from a Word file, use the following code snippet. To extract metadata from a PDF, Excel, or PowerPoint file, replace the DOCX source document with that file:
// Create an instance of Parser class
try (Parser parser = new Parser("filepath/sample.docx")) {
    // Extract metadata from the document
    Iterable<MetadataItem> metadata = parser.getMetadata();
    // Iterate over metadata items
    for (MetadataItem item : metadata) {
        // Print an item name and value
        System.out.println(String.format("%s: %s", item.getName(), item.getValue()));
    }
}   

Metadata extraction from Emails in .NET

You can also extract metadata from emails in .NET. Please use this sample code to do so:
// Create an instance of Parser class
using(Parser parser = new Parser("filepath/sample.msg"))
{
    // Extract metadata from the email
    IEnumerable<MetadataItem> metadata = parser.GetMetadata();
 
    // Iterate over metadata items
    foreach(MetadataItem item in metadata)
    {
        // Print the item name and value
        Console.WriteLine(string.Format("{0}: {1}", item.Name, item.Value));
    }
}        

eBook metadata extraction in Java

Extract metadata from eBooks (EPUB) in Java with the help of the code snippet shown below:
// Create an instance of Parser class
try (Parser parser = new Parser("filepath/sample.epub")) {
    // Extract metadata from the e-book
    Iterable<MetadataItem> metadata = parser.getMetadata();
    // Iterate over metadata items
    for (MetadataItem item : metadata) {
        // Print an item name and value
        System.out.println(String.format("%s: %s", item.getName(), item.getValue()));
    }
}        

We provide working examples on GitHub for the .NET and Java versions of GroupDocs.Parser APIs. Please be sure to check them out. Furthermore, if you want to parse PDF, DOCX, XLSX, PPTX, EPUB, MSG, and many other file types on the fly, please use our Free Online Document Parser and Data Extraction Apps.

Independently automate your document and image processing tasks

Why choose GroupDocs?

Unmatched file formats support

  • All popular file formats supported including documents, images, audio, videos, and ebooks.
  • PDF, DOC, DOCX, XLS, XLSX, PPT, PPTX, PUB, PNG, PSD, ODT, MSG, EML, MP3, MP4, and many more.

Extensively programmable libraries

  • Use GroupDocs APIs to build fully customizable .NET and Java apps.
  • Manipulate your business documents, spreadsheets, presentations, and images any way you like.

Hundreds of supported features

  • Convert Word or Excel to PDF, annotate PDFs, edit DOC and DOCX files, or watermark files.
  • Work with esignatures, tables, mail-merge, attachments, shapes, and much more.

Tailored to your needs

  • Free trials and different paid licensing options to choose from.
  • Well-suited to individual users, startups, as well as small and large enterprises.

APIs for Developers

  • Programmatically process your digital documents and images in .NET and Java platforms.
  • Document APIs designed specifically for .NET and Java application developers.

Trusted by users globally

  • Preferred by developers and businesses alike, our libraries are used globally.
  • Generate optimised documents easily in standalone and distributed environments.

Do more with your documents and images

  • Create, render, edit, convert, compare, digitally sign, watermark, and export your files.
  • Experience endless possibilities by creating multi-functional, high-performance apps.

Simple integration and convenient application

  • Enjoy greater flexibility by integrating with your existing software applications.
  • Get up and running using a few lines of code with our super-fast and reliable APIs.

Multiple support channels

  • Need help? Look no further than one of our developer-led support options.
  • Explore the API structure and documentation, or dive into the knowledge base.

Ready to get started?

Download Free Trial