Parsing refers to the analysis of different types of data that may include symbols, text, and numbers. It helps in determining the structure of data by interpreting it in an easy-to-understand manner. Different data and information processing activities utilize parsing for seamless data interpretation. Parsing is essentially used in data extraction, mining, translation, and more while also assisting in better management of vast amounts of data by dividing it into smaller parts.
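To make the idea of dividing data into smaller parts concrete, here is a minimal sketch that is independent of any parsing library. It splits one line of text into named fields. The class name `FieldLineParser`, the `key: value; key: value` input layout, and the sample values are assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of any GroupDocs API.
final class FieldLineParser {

    // Splits "key: value; key: value" text into an ordered map of fields.
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String part : line.split(";")) {
            // Split on the first colon only, so values such as "10:30" stay intact
            int colon = part.indexOf(':');
            if (colon < 0) {
                continue; // skip fragments that are not key/value pairs
            }
            String key = part.substring(0, colon).trim();
            String value = part.substring(colon + 1).trim();
            if (!key.isEmpty()) {
                fields.put(key, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = parse("Name: Jane Doe; Model: Sedan; Time: 10:30");
        fields.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Real documents need far more than string splitting, which is where a dedicated parsing API comes in, but the principle is the same: unstructured input goes in and addressable fields come out.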
If you are a software or app developer looking for a powerful API to parse your documents programmatically, please try GroupDocs.Parser API for .NET and Java. It equips you with all you need to parse PDF, Word, Excel, PowerPoint, eBooks, Emails, HTML, and an array of other file types. With this file parsing and data extraction API, you can extract data such as metadata, text, and images from the supported file formats on both the .NET and Java platforms.
Please refer to the information given below to correctly install the .NET or Java version of the document parsing API for developers on your system.
Once you have set up the desired version of the GroupDocs.Parser API, we can look at some real-world scenarios for parsing documents and extracting formatted text, images, and metadata from them.
Parsing PDF documents enables the convenient extraction of information from a PDF file and converting it into a structured format that is easier to interpret and process. You can extract data such as text, images, tables, hyperlinks, and other elements by parsing PDF files. GroupDocs.Parser for .NET and Java APIs let you effortlessly incorporate PDF parser functionality into your document processing apps. You can learn how to parse PDF files and extract data from them with the help of these excellent document parser APIs.
// Parse a PDF form using C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
// Create an instance of Parser class
using (Parser parser = new Parser("filepath/sample.pdf"))
{
    // Extract data from PDF document
    DocumentData data = parser.ParseForm();
    // Check if form extraction is supported
    if (data == null)
    {
        Console.WriteLine("Form extraction isn't supported.");
        return;
    }
    // Create the preliminary record object
    PreliminaryRecord rec = new PreliminaryRecord();
    rec.Name = GetFieldText(data, "Name");
    rec.Model = GetFieldText(data, "Model");
    rec.Time = GetFieldText(data, "Time");
    rec.Description = GetFieldText(data, "Description");
    // We can save the preliminary record object to the database,
    // send it as the web response or just print it to the console
    Console.WriteLine("Preliminary record");
    Console.WriteLine("Name: {0}", rec.Name);
    Console.WriteLine("Model: {0}", rec.Model);
    Console.WriteLine("Time: {0}", rec.Time);
    Console.WriteLine("Description: {0}", rec.Description);
}
private static string GetFieldText(DocumentData data, string fieldName)
{
    // Get the first field with the given name (null if the collection has none)
    FieldData fieldData = data.GetFieldsByName(fieldName).FirstOrDefault();
    // Return the text only if the field exists and its page area contains text
    return fieldData != null && fieldData.PageArea is PageTextArea
        ? (fieldData.PageArea as PageTextArea).Text
        : null;
}
//
// Simple POCO object to store the extracted data.
//
public class PreliminaryRecord
{
    public string Name { get; set; }
    public string Model { get; set; }
    public string Time { get; set; }
    public string Description { get; set; }
}
// Parse a PDF form using Java
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleCarWashPdf)) {
    // Extract data from PDF document
    DocumentData data = parser.parseForm();
    // Check if form extraction is supported
    if (data == null) {
        System.out.println("Form extraction isn't supported.");
        return;
    }
    // Create the preliminary record object
    PreliminaryRecord rec = new PreliminaryRecord();
    rec.Name = getFieldText(data, "Name");
    rec.Model = getFieldText(data, "Model");
    rec.Time = getFieldText(data, "Time");
    rec.Description = getFieldText(data, "Description");
    // We can save the preliminary record object to the database,
    // send it as the web response or just print it to the console
    System.out.println("Preliminary record");
    System.out.println(String.format("Name: %s", rec.Name));
    System.out.println(String.format("Model: %s", rec.Model));
    System.out.println(String.format("Time: %s", rec.Time));
    System.out.println(String.format("Description: %s", rec.Description));
}
private static String getFieldText(DocumentData data, String fieldName) {
    // Get the fields with the given name from the data collection
    java.util.List<FieldData> fields = data.getFieldsByName(fieldName);
    // Take the first match; calling get(0) on an empty list would throw
    FieldData fieldData = fields.isEmpty() ? null : fields.get(0);
    // Return the text only if the field exists and its page area contains text
    return fieldData != null && fieldData.getPageArea() instanceof PageTextArea
        ? ((PageTextArea) fieldData.getPageArea()).getText()
        : null;
}
/**
 * Simple POJO to store the extracted data.
 */
static class PreliminaryRecord {
    public String Name;
    public String Model;
    public String Time;
    public String Description;
}
As businesses become increasingly digital, the need to parse Word, Excel, and PowerPoint documents has also gained importance. File parsing is a critical part of data analysis and business intelligence as it allows you to extract structured data easily. This data can then be used to automate business processes, uncover insights, and improve decision-making. GroupDocs.Parser APIs support parsing Word, Excel, and PowerPoint files, so you can build document parsing solutions that extract the text, metadata, images, tables, and hyperlinks these documents contain.
// Extract tables from a Word document using C#
// Create an instance of Parser class
using (Parser parser = new Parser("filepath/sample.docx"))
{
    // Get the reader object for the document XML representation
    using (XmlReader reader = parser.GetStructure())
    {
        // Iterate over the document
        while (reader.Read())
        {
            // Check if this is the start of the table
            if (reader.IsStartElement() && reader.Name == "table")
            {
                // Process the table
                ProcessTable(reader);
            }
        }
    }
}
private static void ProcessTable(XmlReader reader)
{
    Console.WriteLine("table");
    // Create an instance of StringBuilder to store the cell value
    StringBuilder value = new StringBuilder();
    // Iterate over the table
    while (reader.Read())
    {
        // Check if the current tag is the end of the table
        bool isTableEnd = !reader.IsStartElement() && reader.Name == "table";
        // Check if the current tag is the start of the row or the cell
        bool isRowOrCellStart = reader.IsStartElement() && (reader.Name == "tr" || reader.Name == "td");
        // Print the cell value if this is the end of the table or the start of the row or the cell
        if ((isTableEnd || isRowOrCellStart) && value.Length > 0)
        {
            Console.Write(" ");
            Console.WriteLine(value.ToString());
            value = new StringBuilder();
        }
        // If this is the end of the table - return to the main function
        if (isTableEnd)
        {
            return;
        }
        // If this is the start of the row or the cell - print the tag name
        if (isRowOrCellStart)
        {
            Console.WriteLine(reader.Name);
            continue;
        }
        // If this code line is reached then this is the value of the cell
        value.Append(reader.Value);
    }
}
// Extract tables from a Word document using Java
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleDocx)) {
    // Extract the text structure as an XML document
    Document document = parser.getStructure();
    // Read XML document
    readNode(document.getDocumentElement());
}
private static void readNode(Node node) {
    NodeList nodes = node.getChildNodes();
    // Iterate over the child nodes
    for (int i = 0; i < nodes.getLength(); i++) {
        Node n = nodes.item(i);
        // If it's a table (compare with equals; == compares references, not content)
        if ("table".equals(n.getNodeName().toLowerCase())) {
            System.out.println("table");
            // Process node
            processNode(n);
        }
        readNode(n);
    }
}
private static void processNode(Node node) {
    NodeList nodes = node.getChildNodes();
    // Iterate over the child nodes
    for (int i = 0; i < nodes.getLength(); i++) {
        Node n = nodes.item(i);
        switch (n.getNodeName().toLowerCase()) {
            // In the case of a row or cell
            case "tr":
            case "td": {
                // Print the name
                System.out.println(n.getNodeName());
                // Process sub-nodes
                processNode(n);
                System.out.println();
                System.out.println("/" + n.getNodeName());
                break;
            }
            default:
                // Print the node value (if it's not null)
                String value = n.getNodeValue();
                if (value != null) {
                    System.out.print(value);
                }
                processNode(n);
                break;
        }
    }
}
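The table walk above can also be tried without the library. What follows is a minimal sketch using only the JDK's DOM parser, under the assumption that the structure uses the same `table`, `tr`, and `td` element names as the example above. The class name `TableCellCollector` and the sample XML are hypothetical, and the real output of `getStructure()` may contain other elements and attributes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration; not part of any GroupDocs API.
final class TableCellCollector {

    // Parses an XML string and returns the text of every <td> found inside a <table>.
    static List<String> collectCells(String xml) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> cells = new ArrayList<>();
            walk(document.getDocumentElement(), false, cells);
            return cells;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse the XML input", e);
        }
    }

    private static void walk(Node node, boolean insideTable, List<String> cells) {
        String name = node.getNodeName();
        // Strings are compared by content here; == would compare object references
        boolean inTable = insideTable || "table".equalsIgnoreCase(name);
        if (inTable && "td".equalsIgnoreCase(name)) {
            cells.add(node.getTextContent().trim());
            return; // the cell text is captured, no need to descend further
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), inTable, cells);
        }
    }

    public static void main(String[] args) {
        String xml = "<document><p>Intro</p><table>"
                + "<tr><td>Name</td><td>Model</td></tr>"
                + "<tr><td>Jane</td><td>Sedan</td></tr>"
                + "</table></document>";
        System.out.println(collectCells(xml));
    }
}
```

Collecting the cells into a list rather than printing them during the walk keeps the traversal reusable, for example when the cell values need to be written to a database or a CSV file.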
// Extract text from each sheet of a spreadsheet using C#
// Create an instance of Parser class
using (Parser parser = new Parser(filePath))
{
    // Get the document info
    IDocumentInfo documentInfo = parser.GetDocumentInfo();
    // Iterate over sheets (each sheet is exposed as a page)
    for (int p = 0; p < documentInfo.PageCount; p++)
    {
        // Print the sheet number
        Console.WriteLine(string.Format("Page {0}/{1}", p + 1, documentInfo.PageCount));
        // Extract the sheet text into the reader
        using (TextReader reader = parser.GetText(p))
        {
            // Print the text of the sheet
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}
// Extract text from each sheet of a spreadsheet using Java
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleXlsx)) {
    // Get the spreadsheet info
    IDocumentInfo spreadsheetInfo = parser.getDocumentInfo();
    // Iterate over sheets
    for (int p = 0; p < spreadsheetInfo.getPageCount(); p++) {
        // Print the sheet number
        System.out.println(String.format("Sheet %d/%d", p + 1, spreadsheetInfo.getPageCount()));
        // Extract the sheet text into the reader
        try (TextReader reader = parser.getText(p)) {
            // Print the text of the sheet
            System.out.println(reader.readToEnd());
        }
    }
}
Extracting images from PDF documents can serve various purposes, such as analyzing content, creating a digital archive, or acquiring visual data for use in other applications. Owing to the growth of digital documents, image extraction is fast becoming a necessity and has many practical uses depending on the nature of the business. For instance, a recruiting agency might want to extract photos of the candidates from their resumes for record keeping. With GroupDocs.Parser APIs, you can extract all images from a PDF document, or only those from a particular page or a specific page area.
// Extract images from PDF using C#
using (Parser parser = new Parser("filepath/sample.pdf"))
{
    IEnumerable<PageImageArea> images = parser.GetImages();
    // Check if image extraction is supported
    if (images == null)
    {
        Console.WriteLine("Image extraction isn't supported");
        return;
    }
    // Create the options to save images in JPEG format
    ImageOptions options = new ImageOptions(ImageFormat.Jpeg);
    int imageNumber = 0;
    // Iterate over retrieved images
    foreach (PageImageArea image in images)
    {
        // Save each image to its own numbered file
        image.Save("imageFilePath/image-" + imageNumber.ToString() + ".jpeg", options);
        imageNumber++;
    }
}
// Extract images from PDF using Java
// Create an instance of the Parser class
try (Parser parser = new Parser("filepath/sample.pdf")) {
    // Extract images from document
    Iterable<PageImageArea> images = parser.getImages();
    // Check if image extraction is supported (mirrors the null check in the C# example)
    if (images == null) {
        System.out.println("Image extraction isn't supported");
        return;
    }
    // Create the options to save images in PNG format
    ImageOptions options = new ImageOptions(ImageFormat.Png);
    int imageNumber = 0;
    // Iterate over images
    for (PageImageArea image : images) {
        // Save each image to its own numbered PNG file
        image.save(Constants.getOutputFilePath(String.format("%d.png", imageNumber)), options);
        imageNumber++;
    }
}
Metadata includes information such as the type and size of the file, author, date created, and other data associated with the file. Metadata extraction is an important process for many industries as it can provide valuable insight into the content of the document. It can be used in digital preservation, digital asset management, content management, and search engine optimization. Extracting document metadata is one of the features supported by GroupDocs.Parser for .NET and Java APIs. You can upgrade your existing document parser apps or develop new parsing solutions to extract metadata from PDF, Microsoft Word, Excel, PowerPoint, Email files, and eBooks using these APIs.
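For a feel of what name and value metadata pairs look like, here is a minimal sketch that reads file-system-level metadata (size and timestamps) with the JDK alone. It does not use GroupDocs.Parser and cannot see document properties such as the author, which live inside the file and need a document parser. The class name `FileMetadataSketch` and the map keys are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of any GroupDocs API.
final class FileMetadataSketch {

    // Reads basic file-system attributes and returns them as name/value pairs.
    static Map<String, String> read(Path file) {
        try {
            BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
            Map<String, String> metadata = new LinkedHashMap<>();
            metadata.put("name", file.getFileName().toString());
            metadata.put("size", Long.toString(attrs.size()));
            metadata.put("created", attrs.creationTime().toString());
            metadata.put("modified", attrs.lastModifiedTime().toString());
            return metadata;
        } catch (IOException e) {
            throw new UncheckedIOException("Could not read attributes of " + file, e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Create a small temporary file so the sketch runs anywhere
        Path sample = Files.createTempFile("sample", ".txt");
        Files.write(sample, "hello".getBytes(StandardCharsets.UTF_8));
        read(sample).forEach((key, value) -> System.out.println(key + ": " + value));
        Files.delete(sample);
    }
}
```

The library examples that follow print metadata in the same name and value shape, with the difference that the items come from inside the document rather than from the file system.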
// Extract metadata from a PDF document using C#
// Create an instance of the Parser class
using (Parser parser = new Parser("filepath/sample.pdf"))
{
    // Extract metadata from the document
    IEnumerable<MetadataItem> metadata = parser.GetMetadata();
    // Iterate over metadata items
    foreach (MetadataItem item in metadata)
    {
        // Print the item name and value
        Console.WriteLine(string.Format("{0}: {1}", item.Name, item.Value));
    }
}
// Extract metadata from a Word document using Java
// Create an instance of Parser class
try (Parser parser = new Parser("filepath/sample.docx")) {
    // Extract metadata from the document
    Iterable<MetadataItem> metadata = parser.getMetadata();
    // Iterate over metadata items
    for (MetadataItem item : metadata) {
        // Print the item name and value
        System.out.println(String.format("%s: %s", item.getName(), item.getValue()));
    }
}
// Extract metadata from an email using C#
// Create an instance of Parser class
using (Parser parser = new Parser("filepath/sample.msg"))
{
    // Extract metadata from the email
    IEnumerable<MetadataItem> metadata = parser.GetMetadata();
    // Iterate over metadata items
    foreach (MetadataItem item in metadata)
    {
        // Print the item name and value
        Console.WriteLine(string.Format("{0}: {1}", item.Name, item.Value));
    }
}
// Extract metadata from an e-book using Java
// Create an instance of Parser class
try (Parser parser = new Parser("filepath/sample.epub")) {
    // Extract metadata from the e-book
    Iterable<MetadataItem> metadata = parser.getMetadata();
    // Iterate over metadata items
    for (MetadataItem item : metadata) {
        // Print the item name and value
        System.out.println(String.format("%s: %s", item.getName(), item.getValue()));
    }
}
We provide working examples on GitHub for the .NET and Java versions of GroupDocs.Parser APIs. Please be sure to check them out. Furthermore, if you want to parse PDF, DOCX, XLSX, PPTX, EPUB, MSG, and many other file types on the fly, please use our Free Online Document Parser and Data Extraction Apps.