Parse Html Document Get All input fields with ID and Value


Question

I have several thousand HTML invoices generated by ASP.NET (so the HTML is messy) that I'm trying to parse and save into a database.

Basically like:

 foreach(var htmlDoc in HtmlFolder)
 {
   foreach(var inputBox in htmlDoc)
   { 
      //Make Collection of ID and Values Insert to DB
   }
 }  

From all the other questions I've read, the best tool for this type of problem is the HtmlAgilityPack; however, for the life of me I can't get the documentation .chm file to work. Any ideas on how I could accomplish this, with or without the Agility Pack?

Thanks in advance

Accepted Answer

A newer alternative to HtmlAgilityPack is CsQuery. See this later question on its relative performance merits, but its use of CSS selectors can't be beat:

var doc = CQ.CreateDocumentFromFile(htmldoc); // load and parse the file
var fields = doc["input"]; // select all input fields with a CSS selector
// pair each field's id with its value (Id and Value are properties, not methods)
var pairs = fields.Select(node => new Tuple<string, string>(node.Id, node.Value));

Popular Answer

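To get the CHM to work, you probably need to open the file's properties in Windows Explorer and unblock it (click the "Unblock" button, or tick the "Unblock" checkbox on newer versions of Windows). Windows blocks the content of CHM files downloaded from the internet until you do.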

The HTML Agility Pack is quite easy to use once you know your way around LINQ to XML or XPath.

Basics you'll need to know:

//import the HtmlAgilityPack
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();

// Load your data
// -----------------------------
// Load doc from file:
doc.Load(pathToFile);

// OR

// Load doc from string:
doc.LoadHtml(contentsOfFile);
// -----------------------------

// Find what you're after
// -----------------------------
// Finding things using Linq
var nodes = doc.DocumentNode.DescendantsAndSelf("input")
    .Where(node => !string.IsNullOrWhiteSpace(node.Id)
        && node.Attributes["value"] != null
        && !string.IsNullOrWhiteSpace(node.Attributes["value"].Value));

// OR

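// Finding things using XPath (use this instead of the Linq query, not in addition to it)
// "@id != ''" is only true when the attribute exists and is non-empty,
// so inputs that lack an id or value attribute are skipped
var nodes = doc.DocumentNode
    .SelectNodes("//input[@id != '' and @value != '']");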
// -----------------------------


// looping through the nodes:
// the XPath interfaces can return null when no nodes are found
if (nodes != null) 
{ 
    foreach (var node in nodes)
    {
        var id = node.Id;
        var value = node.Attributes["value"].Value;
    }
}
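
Putting the pieces above together for the original problem (a folder of invoice files), here is a minimal sketch rather than a definitive implementation. The class name `InvoiceInputs`, its two methods, and the `*.html` file pattern are my own assumptions for illustration; only `HtmlDocument`, `LoadHtml`, `SelectNodes` and `GetAttributeValue` come from the HTML Agility Pack. Inputs with an id but no value attribute are kept with an empty value, which may or may not be what you want.

```csharp
using System.Collections.Generic;
using System.IO;
using HtmlAgilityPack;

public static class InvoiceInputs
{
    // Hypothetical helper: maps id -> value for every <input> with a non-empty id.
    public static Dictionary<string, string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new Dictionary<string, string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//input[@id != '']");
        if (nodes == null)
        {
            return result;
        }

        foreach (var node in nodes)
        {
            // GetAttributeValue returns the supplied default when the attribute is absent.
            var id = node.GetAttributeValue("id", "");
            var value = node.GetAttributeValue("value", "");

            // If an id is repeated in the document, the last occurrence wins.
            result[id] = value;
        }

        return result;
    }

    // Walks a folder and yields (file path, id -> value map) for each invoice.
    // The "*.html" pattern is an assumption; adjust it to your file names.
    public static IEnumerable<KeyValuePair<string, Dictionary<string, string>>> ExtractFolder(string folder)
    {
        foreach (var path in Directory.EnumerateFiles(folder, "*.html"))
        {
            yield return new KeyValuePair<string, Dictionary<string, string>>(
                path, Extract(File.ReadAllText(path)));
        }
    }
}
```

You would then loop over `InvoiceInputs.ExtractFolder(someFolder)` and insert each dictionary into your database. Attribute values come back as they are written in the markup, so if your invoices contain entities such as `&amp;` you may need to run the values through `HtmlEntity.DeEntitize` first.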

The easiest way to add the HTML Agility Pack to your project is with NuGet:

PM> Install-Package HtmlAgilityPack





Licensed under: CC-BY-SA