Parse Html Document Get All input fields with ID and Value

c# csquery html-agility-pack

Question

I'm trying to parse thousands of HTML-generated invoices (ASP.NET output, so nasty HTML) and store them in a database.

Basically, something like this:

 foreach(var htmlDoc in HtmlFolder)
 {
   foreach(var inputBox in htmlDoc)
   { 
      //Make Collection of ID and Values Insert to DB
   }
 }  

From all the other questions I've read, HtmlAgilityPack seems to be the best tool for this kind of problem, but for the life of me I can't get the documentation .chm file to open. Any suggestions on how to go about this, with or without the Agility Pack?

Thanks in advance.

5/23/2017 12:00:04 PM

Accepted Answer

CsQuery is a newer alternative to HtmlAgilityPack. See this later question for a comparison of their relative performance, but its CSS selector support is hard to beat:

using System;
using System.Linq;
using CsQuery;

var doc = CQ.CreateDocumentFromFile(htmldoc); // load and parse the file
var fields = doc["input"]; // select all input elements with a CSS selector
var pairs = fields.Select(node => new Tuple<string, string>(node.Id, node.Value)) // (id, value) pairs
    .ToList();
5/23/2017 12:33:45 PM

Popular Answer

To get the CHM to work, you most likely need to open the file's Properties in Windows Explorer and use the "Unblock" option. Windows blocks the content of CHM files downloaded from the internet.

If you are familiar with XPath or LINQ to XML, the HTML Agility Pack is quite easy to use.

The basics you'll need to know:

// import the HtmlAgilityPack (System.Linq is needed for the Where() call further down)
using System.Linq;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();

// Load your data
// -----------------------------
// Load doc from file:
doc.Load(pathToFile);

// OR

// Load doc from string:
doc.LoadHtml(contentsOfFile);
// -----------------------------

// Find what you're after
// -----------------------------
// Finding things using Linq
var nodes = doc.DocumentNode.DescendantsAndSelf("input")
    .Where(node => !string.IsNullOrWhiteSpace(node.Id)
        && node.Attributes["value"] != null
        && !string.IsNullOrWhiteSpace(node.Attributes["value"].Value));

// OR

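// Finding things using XPath
// (normalize-space(@x) != '' is only true when the attribute exists and is
// not blank, so inputs with a missing id or value are skipped)
var nodes = doc.DocumentNode
    .SelectNodes("//input[normalize-space(@id) != '' and normalize-space(@value) != '']");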
// -----------------------------


// looping through the nodes:
// the XPath interfaces can return null when no nodes are found
if (nodes != null) 
{ 
    foreach (var node in nodes)
    {
        var id = node.Id;
        var value = node.Attributes["value"].Value;
    }
}
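
For the folder-of-invoices case in the question, the pieces above could be combined roughly as follows. This is only a sketch: the class name InvoiceInputs, the helper ExtractFromHtml, the *.html file pattern, and the choice to record a missing value attribute as an empty string are my own assumptions, and the Console.WriteLine call is a placeholder for whatever database insert you use.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class InvoiceInputs
{
    // Hypothetical helper: returns (id, value) pairs for every <input>
    // that has a non-empty id. A missing value attribute becomes "".
    public static List<KeyValuePair<string, string>> ExtractFromHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var pairs = new List<KeyValuePair<string, string>>();
        var nodes = doc.DocumentNode.SelectNodes("//input[@id != '']");
        if (nodes == null)
        {
            return pairs; // SelectNodes returns null when nothing matches
        }

        foreach (var node in nodes)
        {
            // GetAttributeValue falls back to the default instead of throwing
            var value = node.GetAttributeValue("value", string.Empty);
            pairs.Add(new KeyValuePair<string, string>(node.Id, value));
        }
        return pairs;
    }

    public static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.Error.WriteLine("usage: InvoiceInputs <folder>");
            return;
        }

        // Assumed layout: one invoice per *.html file in the given folder.
        foreach (var path in Directory.EnumerateFiles(args[0], "*.html"))
        {
            foreach (var pair in ExtractFromHtml(File.ReadAllText(path)))
            {
                // Placeholder: replace with your own database insert.
                Console.WriteLine($"{Path.GetFileName(path)}\t{pair.Key}\t{pair.Value}");
            }
        }
    }
}
```

Keeping the parsing in a method that takes a string makes it easy to try on a small snippet before pointing it at the whole folder. If the files use mixed encodings, calling doc.Load(path) instead of File.ReadAllText may be the safer choice.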

The quickest way to get started is to install it from the NuGet Package Manager Console:

PM> Install-Package HtmlAgilityPack



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow