Build file parsing apps to extract data in .NET and Java

Parse PDF, DOCX, XLSX, PPTX, ONE, RTF, ODT, TXT, EML, MSG, HTML, ZIP, and many other file formats. Extract metadata, images, and text from documents in your .NET and Java applications.

Advanced document parsing, text, image, and metadata extraction solutions

Parsing is the process of analyzing data such as words, numbers, and symbols to determine its structure and meaning. It can be used to extract data from documents and to interpret the extracted text. Parsing is a key component of many text-processing tasks, including information extraction, data mining, and machine translation. It is also used to break down larger chunks of data into smaller, more manageable pieces.
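
As a small illustration of the idea, independent of any particular library, the sketch below parses lines of "key: value" text into a map. The class name KeyValueParser and the input format are assumptions made for this example only; they are not part of GroupDocs.Parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class KeyValueParser {
    // Splits each line at its first colon and stores the trimmed key/value pair.
    // Lines without a colon or with an empty key are skipped.
    static Map<String, String> parse(String text) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue;
            }
            String key = line.substring(0, colon).trim();
            if (key.isEmpty()) {
                continue;
            }
            result.put(key, line.substring(colon + 1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Name: John Smith\nModel: Sedan\nTime: 10:30";
        // Prints the pairs in the order they appear in the input
        System.out.println(parse(text));
    }
}
```

Splitting at the first colon only keeps values such as "10:30" intact, which is the kind of structural decision a parser has to make for every format it supports.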

If you are a software or app developer looking for an API to parse documents programmatically, please try GroupDocs.Parser API for .NET and Java. It lets you parse PDF, Word, Excel, PowerPoint, eBooks, emails, HTML, and many other file types, and extract metadata, text, and images from the supported file formats on both platforms.

Getting Started

Please refer to the information given below to correctly install the .NET or Java version of GroupDocs.Parser API on your system.

GroupDocs.Parser for .NET installation

You can obtain the DLLs or MSI installer from the downloads section, or install the API in your .NET application via NuGet:
PM> Install-Package GroupDocs.Parser 

GroupDocs.Parser for Java installation

Please download the JAR file from the downloads section, or use the following repository and dependency configuration in your Maven-based Java applications:
<repository>
<id>groupdocs-artifacts-repository</id>
<name>GroupDocs Artifacts Repository</name>
<url>https://releases.groupdocs.com/java/repo/</url>
</repository>
<dependency>
<groupId>com.groupdocs</groupId>
<artifactId>groupdocs-parser</artifactId>
<version>22.6</version>
</dependency>
   

Document parsing and data extraction use cases

After setting up the desired version of GroupDocs.Parser API, we can look at some common use cases for parsing documents and extracting text, images, and metadata from the supported file types.

Learn to parse data and information from PDF documents

Parsing a PDF document is the process of extracting the data from a PDF file and converting it into a structured format that is easier to read and understand. By parsing a PDF document, you can extract information such as text, images, tables, hyperlinks, and other elements from it. GroupDocs.Parser for .NET and Java APIs let you incorporate PDF parser functionality into your document processing apps.

Develop PDF parser applications in .NET

Please use the below-given sample code for parsing PDF files in .NET:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

// Create an instance of Parser class
using (Parser parser = new Parser("filepath/sample.pdf"))
{
    // Extract data from PDF document
    DocumentData data = parser.ParseForm();
    // Check if form extraction is supported
    if (data == null)
    {
        Console.WriteLine("Form extraction isn't supported.");
        return;
    }

    // Create the preliminary record object
    PreliminaryRecord rec = new PreliminaryRecord();
    rec.Name = GetFieldText(data, "Name");
    rec.Model = GetFieldText(data, "Model");
    rec.Time = GetFieldText(data, "Time");
    rec.Description = GetFieldText(data, "Description");

    // We can save the preliminary record object to the database, 
    // send it as the web response or just print it to the console
    Console.WriteLine("Preliminary record");
    Console.WriteLine("Name: {0}", rec.Name);
    Console.WriteLine("Model: {0}", rec.Model);
    Console.WriteLine("Time: {0}", rec.Time);
    Console.WriteLine("Description: {0}", rec.Description);
}

private static string GetFieldText(DocumentData data, string fieldName)
{
    // Get the field from data collection
    FieldData fieldData = data.GetFieldsByName(fieldName).FirstOrDefault();
    // Check if the field data is not null (a field with the fieldName is contained in data collection)
    // and check if the field data contains the text
    return fieldData != null && fieldData.PageArea is PageTextArea
        ? (fieldData.PageArea as PageTextArea).Text
        : null;
}

// Simple POCO object to store the extracted data.
public class PreliminaryRecord
{
    public string Name { get; set; }
    public string Model { get; set; }
    public string Time { get; set; }
    public string Description { get; set; }
}    

Parse PDF documents in Java

To add PDF parser functionality in your Java apps, please use the following code:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleCarWashPdf)) {
    // Extract data from PDF document
    DocumentData data = parser.parseForm();
    // Check if form extraction is supported
    if (data == null) {
        System.out.println("Form extraction isn't supported.");
        return;
    }
    // Create the preliminary record object
    PreliminaryRecord rec = new PreliminaryRecord();
    rec.Name = getFieldText(data, "Name");
    rec.Model = getFieldText(data, "Model");
    rec.Time = getFieldText(data, "Time");
    rec.Description = getFieldText(data, "Description");
    // We can save the preliminary record object to the database,
    // send it as the web response or just print it to the console
    System.out.println("Preliminary record");
    System.out.println(String.format("Name: %s", rec.Name));
    System.out.println(String.format("Model: %s", rec.Model));
    System.out.println(String.format("Time: %s", rec.Time));
    System.out.println(String.format("Description: %s", rec.Description));
}

private static String getFieldText(DocumentData data, String fieldName) {
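    // Get the fields with the given name; the list is empty if the document has no such field
    java.util.List<FieldData> fields = data.getFieldsByName(fieldName);
    FieldData fieldData = fields.isEmpty() ? null : fields.get(0);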
    // Check if the field data is not null (a field with the fieldName is contained in data collection)
    // and check if the field data contains the text
    return fieldData != null && fieldData.getPageArea() instanceof PageTextArea
            ? ((PageTextArea) fieldData.getPageArea()).getText()
            : null;
}

/**
 * Simple POJO to store the extracted data.
 */
static class PreliminaryRecord {
    public String Name;
    public String Model;
    public String Time;
    public String Description;
}

Extracting data from Microsoft Word, Excel, and PowerPoint documents

As businesses become increasingly digital, the need to parse Word, Excel, and PowerPoint documents has also gained importance. Parsing files is a critical part of data analysis and business intelligence as it allows you to extract structured data easily. This data can then be used to automate processes, uncover insights, and improve decision making. GroupDocs.Parser APIs support parsing Word, Excel, and PowerPoint files enabling you to extract text, metadata, images, tables, and hyperlinks contained within these documents.

Extract tables from Microsoft Word documents in .NET

The following code snippet lets you extract tables from DOCX files:
// Create an instance of Parser class
using (Parser parser = new Parser("filepath/sample.docx"))
{
    // Get the reader object for the document XML representation
    using (XmlReader reader = parser.GetStructure())
    {
        // Iterate over the document
        while (reader.Read())
        {
            // Check if this is the start of the table
            if (reader.IsStartElement() && reader.Name == "table")
            {
                // Process the table
                ProcessTable(reader);
            }
        }
    }
}
 
private static void ProcessTable(XmlReader reader)
{
    Console.WriteLine("table");
    // Create an instance of StringBuilder to store the cell value
    StringBuilder value = new StringBuilder();
    // Iterate over the table
    while (reader.Read())
    {
        // Check if the current tag is the end of the table
        bool isTableEnd = !reader.IsStartElement() && reader.Name == "table";
        // Check if the current tag is the start of the row or the cell
        bool isRowOrCellStart = reader.IsStartElement() && (reader.Name == "tr" || reader.Name == "td");
        // Print the cell value if this is the end of the table or the start of the row or the cell
        if ((isTableEnd || isRowOrCellStart) && value.Length > 0)
        {
            Console.Write("  ");
            Console.WriteLine(value.ToString());
            value = new StringBuilder();
        }
        // If this is the end of the table - return to the main function
        if (isTableEnd)
        {
            return;
        }
        // If this is the start of the row or the cell - print the tag name
        if (isRowOrCellStart)
        {
            Console.WriteLine(reader.Name);
            continue;
        }
        // If this code line is reached then this is the value of the cell
        value.Append(reader.Value);
    }
} 

Learn to extract tables from Word documents in Java

This code sample helps you in extracting tables from Word files:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleDocx)) {
    // Extract text structure to the XML reader
    Document document = parser.getStructure();
    // Read XML document
    readNode(document.getDocumentElement());
}

private static void readNode(Node node) {
    NodeList nodes = node.getChildNodes();
    // Iterate over the child nodes
    for (int i = 0; i < nodes.getLength(); i++) {
        Node n = nodes.item(i);
        // If it's a table
        if ("table".equalsIgnoreCase(n.getNodeName())) {
            System.out.println("table");
            // Process node
            processNode(n);
        }
        readNode(n);
    }
}
private static void processNode(Node node) {
    NodeList nodes = node.getChildNodes();
    // Iterate over the child nodes
    for (int i = 0; i < nodes.getLength(); i++) {
        Node n = nodes.item(i);
        switch (n.getNodeName().toLowerCase()) {
            // In the case of a row or cell
            case "tr":
            case "td": {
                // Print the name
                System.out.println(n.getNodeName());
                // Process sub-nodes
                processNode(n);
                System.out.println();
                System.out.println("/" + n.getNodeName());
                break;
            }
            default:
                // Print the node value (if it's not null)
                String value = n.getNodeValue();
                if(value != null) {
                    System.out.print(value);
                }
                processNode(n);
                break;
        }
    }
}       
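
To try this kind of traversal without a document at hand, the following self-contained sketch walks a hand-written XML string using only the standard Java DOM API. The element names table, tr, and td follow the code above; the sample XML, the TableSketch class, and the extractRows helper are assumptions made for illustration and are not part of GroupDocs.Parser. The real structure returned by getStructure() may contain additional elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TableSketch {
    // Returns every <tr> in the XML as a list of the trimmed texts of its <td> cells.
    static List<List<String>> extractRows(String xml) {
        Document document;
        try {
            document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        List<List<String>> rows = new ArrayList<>();
        NodeList trNodes = document.getElementsByTagName("tr");
        for (int i = 0; i < trNodes.getLength(); i++) {
            Element tr = (Element) trNodes.item(i);
            List<String> cells = new ArrayList<>();
            NodeList tdNodes = tr.getElementsByTagName("td");
            for (int j = 0; j < tdNodes.getLength(); j++) {
                cells.add(tdNodes.item(j).getTextContent().trim());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        String xml = "<document><table>"
                + "<tr><td>Name</td><td>Model</td></tr>"
                + "<tr><td>John</td><td>Sedan</td></tr>"
                + "</table></document>";
        // Print one line per row, with cells separated by " | "
        for (List<String> row : extractRows(xml)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Collecting rows into lists instead of printing them directly makes the result easier to pass on to other code, for example to write CSV. Note that getElementsByTagName matches case-sensitively and includes nested descendants, so tables nested inside cells would need extra handling.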

Easily extract text from Excel spreadsheets in .NET

For extracting text from an Excel sheet, please use this sample code:
// Create an instance of Parser class
using(Parser parser = new Parser(filePath))
{
    // Get the document info
    IDocumentInfo documentInfo = parser.GetDocumentInfo();
   
    // Iterate over sheets
    for(int p = 0; p < documentInfo.PageCount; p++)
    {
        // Print a sheet number 
        Console.WriteLine(string.Format("Page {0}/{1}", p + 1, documentInfo.PageCount));
   
        // Extract a text into the reader
        using(TextReader reader = parser.GetText(p))
        {
            // Print a text from the spreadsheet sheet
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}

    

Extracting text from Microsoft Excel documents in Java

Similarly, you can extract text from a spreadsheet using the following code:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleXlsx)) {
    // Get the spreadsheet info
    IDocumentInfo spreadsheetInfo = parser.getDocumentInfo();
    // Iterate over sheets
    for (int p = 0; p < spreadsheetInfo.getPageCount(); p++) {
        // Print a sheet number
        System.out.println(String.format("Sheet %d/%d", p + 1, spreadsheetInfo.getPageCount()));
        // Extract a text into the reader
        try (TextReader reader = parser.getText(p)) {
            // Print a text from the spreadsheet
            System.out.println(reader.readToEnd());
        }
    }
}    

How to extract images from PDF documents in .NET and Java?

Extracting images from PDF documents could have various applications such as analysis of the content, creating a digital archive, or acquiring visual data to use in other applications. Owing to the growth of digital documents, image extraction is fast becoming a necessity and has many practical uses depending on the nature of the business. For instance, a recruiting agency might want to extract photos of the candidates from their resumes for record keeping. With GroupDocs.Parser APIs, you can extract all images from a PDF document, extract images from a particular document page, or from a specific page area.

Extract images from a PDF file and save them in JPEG format in .NET

The following code highlights how to extract the images in a PDF document and then save them in JPEG format:
// Extract images from PDF using C#
using (Parser parser = new Parser("filepath/sample.pdf"))
{
    IEnumerable<PageImageArea> images = parser.GetImages();
    // Check if image extraction is supported
    if (images == null) 
    {
        Console.WriteLine("Images extraction isn't supported");
        return;
    }
    
    ImageOptions options = new ImageOptions(ImageFormat.Jpeg);
    int imageNumber = 0;
    
    // Iterate over retrieved images
    foreach (PageImageArea image in images)
    {
        // Save Images
        image.Save("imageFilePath/image-" + imageNumber.ToString() + ".jpeg", options);
        imageNumber++;
    }
}        

Extract and save images from PDF documents in PNG format in Java

To extract images from PDF files in Java and save them in PNG format, please use the code given below:
// Create an instance of the Parser class
try (Parser parser = new Parser("filepath/sample.pdf")) {
    // Extract images from document
    Iterable<PageImageArea> images = parser.getImages();
    // Create the options to save images in PNG format
    ImageOptions options = new ImageOptions(ImageFormat.Png);
    int imageNumber = 0;
    // Iterate over images
    for (PageImageArea image : images)
    {
        // Save the images to the PNG file
        image.save(Constants.getOutputFilePath(String.format("%d.png", imageNumber)), options);
        imageNumber++;
    }
}        

Metadata extraction from PDF, Word, Excel, PowerPoint documents, Emails, and eBooks

Metadata includes information such as the type and size of file, author, date created, and other data associated with the file. Metadata extraction is an important process for many industries as it can provide valuable insight into the content of the document. It could be used in digital preservation, digital asset management, content management, and search engine optimization. Extracting document metadata is one of the features supported by GroupDocs.Parser for .NET and Java APIs. You can extract metadata from PDF, Microsoft Word, Excel, PowerPoint, Email files, and eBooks using these APIs.
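
To make the notion of metadata concrete, the sketch below reads a few file-system level properties (name, size, creation and modification times) with the standard java.nio.file API. It is only an illustration: the FileMetadataSketch class and its describe helper are hypothetical names, and document-level properties such as the author are stored inside the file itself, so they require a document parser such as the one described in this section.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.LinkedHashMap;
import java.util.Map;

class FileMetadataSketch {
    // Collects basic file-system metadata for the given path as name/value pairs.
    static Map<String, String> describe(Path file) {
        try {
            BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
            Map<String, String> metadata = new LinkedHashMap<>();
            metadata.put("name", file.getFileName().toString());
            metadata.put("size", Long.toString(attrs.size()));
            metadata.put("created", attrs.creationTime().toString());
            metadata.put("modified", attrs.lastModifiedTime().toString());
            return metadata;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Create a throwaway file so the example does not depend on existing data
        Path file = Files.createTempFile("sample", ".txt");
        Files.write(file, "hello".getBytes(StandardCharsets.UTF_8));
        describe(file).forEach((name, value) -> System.out.println(name + ": " + value));
        Files.delete(file);
    }
}
```

The values printed for the timestamps depend on the platform and file system; some file systems do not record a separate creation time.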

Extracting metadata from PDF, DOCX, XLSX, and PPTX documents in .NET

Please use the below-given code to extract metadata from PDF and other data files in .NET. You can use a Word, Excel, or PowerPoint file instead of PDF in the Parser class instance to extract its metadata:
// Create an instance of the Parser class
using(Parser parser = new Parser("filepath/sample.pdf"))
{
    // Extract metadata from the document
    IEnumerable<MetadataItem> metadata = parser.GetMetadata();
  
    // Iterate over metadata items
    foreach(MetadataItem item in metadata)
    {
        // Print the item name and value
        Console.WriteLine(string.Format("{0}: {1}", item.Name, item.Value));
    }
}        

Extract metadata in PDF, word-processing, spreadsheets, and presentations in Java

To extract metadata from a DOCX file, please use the following code snippet. To extract metadata from a PDF, Excel, or PowerPoint file instead, pass that file to the Parser constructor:
// Create an instance of Parser class
try (Parser parser = new Parser("filepath/sample.docx")) {
    // Extract metadata from the document
    Iterable<MetadataItem> metadata = parser.getMetadata();
    // Iterate over metadata items
    for (MetadataItem item : metadata) {
        // Print an item name and value
        System.out.println(String.format("%s: %s", item.getName(), item.getValue()));
    }
}   

Metadata extraction from Emails in .NET

You can also extract metadata from your emails. Please use this sample code to do so:
// Create an instance of Parser class
using(Parser parser = new Parser("filepath/sample.msg"))
{
    // Extract metadata from the email
    IEnumerable<MetadataItem> metadata = parser.GetMetadata();
 
    // Iterate over metadata items
    foreach(MetadataItem item in metadata)
    {
        // Print the item name and value
        Console.WriteLine(string.Format("{0}: {1}", item.Name, item.Value));
    }
}        

eBook metadata extraction in Java

Extract metadata from eBooks (EPUB) in Java with the help of the code snippet shown below:
// Create an instance of Parser class
try (Parser parser = new Parser("filepath/sample.epub")) {
    // Extract metadata from the e-book
    Iterable<MetadataItem> metadata = parser.getMetadata();
    // Iterate over metadata items
    for (MetadataItem item : metadata) {
        // Print an item name and value
        System.out.println(String.format("%s: %s", item.getName(), item.getValue()));
    }
}        

We provide working examples at GitHub for the .NET and Java versions of GroupDocs.Parser APIs. Please be sure to check them out. Furthermore, if you want to parse PDF, DOCX, XLSX, PPTX, EPUB, MSG, and many other file types on the fly, please use our Free Online Parsing and Data Extraction Apps.

Independently automate your document and image processing tasks

Why choose GroupDocs?

Unmatched file formats support

  • All popular file formats supported including documents, images, audio, videos, and ebooks.
  • PDF, DOC, DOCX, XLS, XLSX, PPT, PPTX, PUB, PNG, PSD, ODT, MSG, EML, MP3, MP4, and many more.

Extensively programmable libraries

  • Use GroupDocs APIs to build fully customizable .NET and Java apps.
  • Manipulate your business documents, spreadsheets, presentations, and images any way you like.

Hundreds of supported features

  • Convert Word or Excel to PDF, annotate PDFs, edit DOC, DOCX, or watermark files.
  • Work with esignatures, tables, mail-merge, attachments, shapes, and much more.

Tailored to your needs

  • Free trials and different paid licensing options to choose from.
  • Well-suited to individual users, startups, as well as small and large enterprises.

APIs for Developers

  • Programmatically process your digital documents and images in .NET and Java platforms.
  • Document APIs designed specifically for .NET and Java application developers.

Trusted by users globally

  • Preferred by developers and businesses alike, our libraries are used globally.
  • Generate optimised documents easily in standalone and distributed environments.

Do more with your documents and images

  • Create, render, edit, convert, compare, digitally sign, watermark, and export your files.
  • Experience endless possibilities by creating multi-functional, high-performance apps.

Simple integration and convenient application

  • Enjoy greater flexibility by integrating with your existing software applications.
  • Get up and running using a few lines of code with our super-fast and reliable APIs.

Multiple support channels

  • Need help? Look no further than one of our developer-led support options.
  • Explore the API structure and documentation, or dive into the knowledge base.

Ready to get started?

Download Free Trial