Parsing is the process of analyzing and interpreting data or information, such as words, numbers, and symbols, to determine their structure and meaning. Parsing can be used to extract data from documents and interpret the meaning of the extracted text. It is a key component of many text-processing tasks, such as information extraction, data mining, and machine translation. It is also used to break down larger chunks of data into smaller, more manageable pieces.
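As a small illustration of breaking data into smaller, structured pieces, the sketch below splits a flat `key=value; key=value` string into named fields. The class name `KeyValueParser`, the input format, and the sample values are assumptions made for this example; none of them come from GroupDocs.Parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of GroupDocs.Parser.
class KeyValueParser {

    // Splits "key=value; key=value" input into an ordered map of fields.
    static Map<String, String> parse(String input) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : input.split(";")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // skip fragments that are not key=value pairs
            }
            String key = pair.substring(0, eq).trim();
            String value = pair.substring(eq + 1).trim();
            fields.put(key, value);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Assumed sample input
        Map<String, String> fields = parse("Name=John Doe; Model=Sedan; Time=10:30");
        fields.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Real documents need far more than string splitting, which is where a dedicated parsing library comes in.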
If you are a software or app developer and on the lookout for an API to programmatically parse your documents, please try GroupDocs.Parser API for .NET and Java. It equips you with all you need to parse PDF, Word, Excel, PowerPoint, eBooks, Emails, HTML, and an array of other file types. You can extract data such as metadata, text, and images from the supported file formats across .NET and Java platforms with the help of GroupDocs.Parser API.
Please refer to the information given below to correctly install the .NET or Java version of GroupDocs.Parser API on your system.
Once the desired version of GroupDocs.Parser API is set up, we can look at some common use cases for parsing documents and extracting text, images, and metadata from the supported file types.
Parsing a PDF document is the process of extracting the data from a PDF file and converting it into a structured format that is easier to read and understand. By parsing a PDF document, you can extract information such as text, images, tables, hyperlinks, and other elements from it. GroupDocs.Parser for .NET and Java APIs let you incorporate PDF parser functionality into your document processing apps.
// C#: extract form fields from a PDF document
using System;
using System.Linq; // needed for FirstOrDefault()
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

// Create an instance of Parser class
using (Parser parser = new Parser("filepath/sample.pdf"))
{
    // Extract data from PDF document
    DocumentData data = parser.ParseForm();
    // Check if form extraction is supported
    if (data == null)
    {
        Console.WriteLine("Form extraction isn't supported.");
        return;
    }

    // Create the preliminary record object
    PreliminaryRecord rec = new PreliminaryRecord();
    rec.Name = GetFieldText(data, "Name");
    rec.Model = GetFieldText(data, "Model");
    rec.Time = GetFieldText(data, "Time");
    rec.Description = GetFieldText(data, "Description");

    // We can save the preliminary record object to the database,
    // send it as the web response or just print it to the console
    Console.WriteLine("Preliminary record");
    Console.WriteLine("Name: {0}", rec.Name);
    Console.WriteLine("Model: {0}", rec.Model);
    Console.WriteLine("Time: {0}", rec.Time);
    Console.WriteLine("Description: {0}", rec.Description);
}

private static string GetFieldText(DocumentData data, string fieldName)
{
    // Get the first field with this name from the data collection (null if there is none)
    FieldData fieldData = data.GetFieldsByName(fieldName).FirstOrDefault();
    // Return the text only if the field exists and its page area contains text
    return fieldData != null && fieldData.PageArea is PageTextArea
        ? (fieldData.PageArea as PageTextArea).Text
        : null;
}

// Simple POCO object to store the extracted data.
public class PreliminaryRecord
{
    public string Name { get; set; }
    public string Model { get; set; }
    public string Time { get; set; }
    public string Description { get; set; }
}
// Java: extract form fields from a PDF document
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleCarWashPdf)) {
    // Extract data from PDF document
    DocumentData data = parser.parseForm();
    // Check if form extraction is supported
    if (data == null) {
        System.out.println("Form extraction isn't supported.");
        return;
    }

    // Create the preliminary record object
    PreliminaryRecord rec = new PreliminaryRecord();
    rec.Name = getFieldText(data, "Name");
    rec.Model = getFieldText(data, "Model");
    rec.Time = getFieldText(data, "Time");
    rec.Description = getFieldText(data, "Description");

    // We can save the preliminary record object to the database,
    // send it as the web response or just print it to the console
    System.out.println("Preliminary record");
    System.out.println(String.format("Name: %s", rec.Name));
    System.out.println(String.format("Model: %s", rec.Model));
    System.out.println(String.format("Time: %s", rec.Time));
    System.out.println(String.format("Description: %s", rec.Description));
}

private static String getFieldText(DocumentData data, String fieldName) {
    // Get the fields with this name from the data collection
    java.util.List<FieldData> fields = data.getFieldsByName(fieldName);
    // Guard against a missing field: get(0) on an empty list would throw
    FieldData fieldData = (fields == null || fields.isEmpty()) ? null : fields.get(0);
    // Return the text only if the field exists and its page area contains text
    return fieldData != null && fieldData.getPageArea() instanceof PageTextArea
            ? ((PageTextArea) fieldData.getPageArea()).getText()
            : null;
}

/**
 * Simple POJO to store the extracted data.
 */
static class PreliminaryRecord {
    public String Name;
    public String Model;
    public String Time;
    public String Description;
}
As businesses become increasingly digital, the need to parse Word, Excel, and PowerPoint documents has also gained importance. Parsing files is a critical part of data analysis and business intelligence as it allows you to extract structured data easily. This data can then be used to automate processes, uncover insights, and improve decision making. GroupDocs.Parser APIs support parsing Word, Excel, and PowerPoint files enabling you to extract text, metadata, images, tables, and hyperlinks contained within these documents.
// C#: extract tables from a Word document
using System;
using System.Text;
using System.Xml;
using GroupDocs.Parser;

// Create an instance of Parser class
using (Parser parser = new Parser("filepath/sample.docx"))
{
    // Get the reader object for the document XML representation
    using (XmlReader reader = parser.GetStructure())
    {
        // Iterate over the document
        while (reader.Read())
        {
            // Check if this is the start of the table
            if (reader.IsStartElement() && reader.Name == "table")
            {
                // Process the table
                ProcessTable(reader);
            }
        }
    }
}

private static void ProcessTable(XmlReader reader)
{
    Console.WriteLine("table");
    // Create an instance of StringBuilder to store the cell value
    StringBuilder value = new StringBuilder();
    // Iterate over the table
    while (reader.Read())
    {
        // Check if the current tag is the end of the table
        bool isTableEnd = !reader.IsStartElement() && reader.Name == "table";
        // Check if the current tag is the start of the row or the cell
        bool isRowOrCellStart = reader.IsStartElement() && (reader.Name == "tr" || reader.Name == "td");

        // Print the cell value if this is the end of the table or the start of the row or the cell
        if ((isTableEnd || isRowOrCellStart) && value.Length > 0)
        {
            Console.Write(" ");
            Console.WriteLine(value.ToString());
            value = new StringBuilder();
        }

        // If this is the end of the table, return to the caller
        if (isTableEnd)
        {
            return;
        }

        // If this is the start of the row or the cell, print the tag name
        if (isRowOrCellStart)
        {
            Console.WriteLine(reader.Name);
            continue;
        }

        // If this line is reached, the current node carries the value of the cell
        value.Append(reader.Value);
    }
}
// Java: extract tables from a Word document
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleDocx)) {
    // Extract the text structure as an XML document
    Document document = parser.getStructure();
    // Read XML document
    readNode(document.getDocumentElement());
}

private static void readNode(Node node) {
    NodeList nodes = node.getChildNodes();
    // Iterate over the child nodes
    for (int i = 0; i < nodes.getLength(); i++) {
        Node n = nodes.item(i);
        // If it's a table (compare strings with equals(), not ==)
        if ("table".equals(n.getNodeName().toLowerCase())) {
            System.out.println("table");
            // Process node
            processNode(n);
        }
        readNode(n);
    }
}

private static void processNode(Node node) {
    NodeList nodes = node.getChildNodes();
    // Iterate over the child nodes
    for (int i = 0; i < nodes.getLength(); i++) {
        Node n = nodes.item(i);
        switch (n.getNodeName().toLowerCase()) {
            // In the case of a row or cell
            case "tr":
            case "td": {
                // Print the name
                System.out.println(n.getNodeName());
                // Process sub-nodes
                processNode(n);
                System.out.println();
                System.out.println("/" + n.getNodeName());
                break;
            }
            default:
                // Print the node value (if it's not null)
                String value = n.getNodeValue();
                if (value != null) {
                    System.out.print(value);
                }
                processNode(n);
                break;
        }
    }
}
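The traversal above can be tried without the library by feeding a hand-written XML string to the JDK's own DOM parser. The following is a minimal sketch under stated assumptions: the element names `table`, `tr`, and `td` are taken from the code above, while the wrapping `document` element, the sample data, and the class name `TableCellSketch` are invented for illustration and may not match the XML that GroupDocs.Parser actually produces.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical class for illustration only; uses JDK XML APIs, not GroupDocs.Parser.
class TableCellSketch {

    // Recursively collects the trimmed text of every <td> element under the node.
    static void collectCells(Node node, List<String> cells) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if ("td".equalsIgnoreCase(child.getNodeName())) {
                cells.add(child.getTextContent().trim());
            } else {
                collectCells(child, cells);
            }
        }
    }

    // Parses the XML string and returns the cell texts in document order.
    static List<String> cellsOf(String xml) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> cells = new ArrayList<>();
            collectCells(document.getDocumentElement(), cells);
            return cells;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample structure
        String xml = "<document><table><tr><td>Name</td><td>Model</td></tr>"
                + "<tr><td>Alice</td><td>X1</td></tr></table></document>";
        System.out.println(cellsOf(xml));
    }
}
```

Collecting the cells into a list rather than printing them makes the result easier to pass on to a database or a report.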
// C#: extract text from each sheet of an Excel spreadsheet
// Create an instance of Parser class
using (Parser parser = new Parser(filePath))
{
    // Get the document info
    IDocumentInfo documentInfo = parser.GetDocumentInfo();

    // Iterate over sheets
    for (int p = 0; p < documentInfo.PageCount; p++)
    {
        // Print a sheet number
        Console.WriteLine(string.Format("Page {0}/{1}", p + 1, documentInfo.PageCount));

        // Extract a text into the reader
        using (TextReader reader = parser.GetText(p))
        {
            // Print a text from the spreadsheet sheet
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}
// Java: extract text from each sheet of an Excel spreadsheet
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleXlsx)) {
    // Get the spreadsheet info
    IDocumentInfo spreadsheetInfo = parser.getDocumentInfo();

    // Iterate over sheets
    for (int p = 0; p < spreadsheetInfo.getPageCount(); p++) {
        // Print a sheet number
        System.out.println(String.format("Sheet %d/%d", p + 1, spreadsheetInfo.getPageCount()));

        // Extract a text into the reader
        try (TextReader reader = parser.getText(p)) {
            // Print a text from the spreadsheet
            System.out.println(reader.readToEnd());
        }
    }
}
Extracting images from PDF documents has various applications, such as analyzing the content, creating a digital archive, or acquiring visual data for use in other applications. With the growth of digital documents, image extraction is fast becoming a necessity and has many practical uses depending on the nature of the business. For instance, a recruiting agency might want to extract candidates' photos from their resumes for record keeping. With GroupDocs.Parser APIs, you can extract all images from a PDF document, or extract images from a particular document page or from a specific page area.
// Extract images from PDF using C#
using (Parser parser = new Parser("filepath/sample.pdf"))
{
    IEnumerable<PageImageArea> images = parser.GetImages();
    // Check if image extraction is supported
    if (images == null)
    {
        Console.WriteLine("Images extraction isn't supported");
        return;
    }

    ImageOptions options = new ImageOptions(ImageFormat.Jpeg);
    int imageNumber = 0;

    // Iterate over retrieved images
    foreach (PageImageArea image in images)
    {
        // Save the image as a JPEG file
        image.Save("imageFilePath/image-" + imageNumber.ToString() + ".jpeg", options);
        imageNumber++;
    }
}
// Extract images from PDF using Java
// Create an instance of the Parser class
try (Parser parser = new Parser("filepath/sample.pdf")) {
    // Extract images from document
    Iterable<PageImageArea> images = parser.getImages();

    // Create the options to save images in PNG format
    ImageOptions options = new ImageOptions(ImageFormat.Png);
    int imageNumber = 0;

    // Iterate over images
    for (PageImageArea image : images) {
        // Save the image to a PNG file
        image.save(Constants.getOutputFilePath(String.format("%d.png", imageNumber)), options);
        imageNumber++;
    }
}
Metadata includes information such as the type and size of file, author, date created, and other data associated with the file. Metadata extraction is an important process for many industries as it can provide valuable insight into the content of the document. It could be used in digital preservation, digital asset management, content management, and search engine optimization. Extracting document metadata is one of the features supported by GroupDocs.Parser for .NET and Java APIs. You can extract metadata from PDF, Microsoft Word, Excel, PowerPoint, Email files, and eBooks using these APIs.
// C#: extract metadata from a PDF document
// Create an instance of the Parser class
using (Parser parser = new Parser("filepath/sample.pdf"))
{
    // Extract metadata from the document
    IEnumerable<MetadataItem> metadata = parser.GetMetadata();

    // Iterate over metadata items
    foreach (MetadataItem item in metadata)
    {
        // Print the item name and value
        Console.WriteLine(string.Format("{0}: {1}", item.Name, item.Value));
    }
}
// Java: extract metadata from a Word document
// Create an instance of Parser class
try (Parser parser = new Parser("filepath/sample.docx")) {
    // Extract metadata from the document
    Iterable<MetadataItem> metadata = parser.getMetadata();

    // Iterate over metadata items
    for (MetadataItem item : metadata) {
        // Print an item name and value
        System.out.println(String.format("%s: %s", item.getName(), item.getValue()));
    }
}
// C#: extract metadata from an email message
// Create an instance of Parser class
using (Parser parser = new Parser("filepath/sample.msg"))
{
    // Extract metadata from the email
    IEnumerable<MetadataItem> metadata = parser.GetMetadata();

    // Iterate over metadata items
    foreach (MetadataItem item in metadata)
    {
        // Print the item name and value
        Console.WriteLine(string.Format("{0}: {1}", item.Name, item.Value));
    }
}
// Java: extract metadata from an e-book
// Create an instance of Parser class
try (Parser parser = new Parser("filepath/sample.epub")) {
    // Extract metadata from the e-book
    Iterable<MetadataItem> metadata = parser.getMetadata();

    // Iterate over metadata items
    for (MetadataItem item : metadata) {
        // Print an item name and value
        System.out.println(String.format("%s: %s", item.getName(), item.getValue()));
    }
}
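For comparison, some of the metadata mentioned earlier, such as file size and creation date, lives in the file system rather than inside the document, and can be read with the JDK alone. The sketch below is only an illustration of that narrower case: the class name `FileMetadataSketch` and the temporary sample file are assumptions, and embedded properties such as the author still require a document parser.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class for illustration only; uses JDK APIs, not GroupDocs.Parser.
class FileMetadataSketch {

    // Reads basic file-system attributes into name/value pairs.
    static Map<String, String> describe(Path file) {
        try {
            BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
            Map<String, String> metadata = new LinkedHashMap<>();
            metadata.put("name", file.getFileName().toString());
            metadata.put("size", Long.toString(attrs.size()));
            metadata.put("created", attrs.creationTime().toString());
            metadata.put("modified", attrs.lastModifiedTime().toString());
            return metadata;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes a five-byte temporary file, describes it, then deletes it.
    static Map<String, String> demo() {
        try {
            Path file = Files.createTempFile("sample", ".txt");
            try {
                Files.write(file, "hello".getBytes(StandardCharsets.UTF_8));
                return describe(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        demo().forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```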
We provide working examples on GitHub for the .NET and Java versions of GroupDocs.Parser APIs. Please be sure to check them out. Furthermore, if you want to parse PDF, DOCX, XLSX, PPTX, EPUB, MSG, and many other file types on the fly, please use our Free Online Parsing and Data Extraction Apps.