Using HTML Agility Pack to extract data from an HTML form

c# html html-agility-pack winforms

Question

I'm trying to list every node in my HTML form dynamically using HTML Agility Pack, without knowing the names of the inputs or their attributes in advance. The problem comes when I want to get the label that goes with each input.

<form name="input" action="html_form_action.asp" method="get">
Username: <input type="text" name="user" />
<input type="submit" value="Submit" />
</form>

In this example it is fairly obvious that the text Username belongs to the input that follows it, but sometimes hidden inputs or other tags sit between the label and its input.

Another example:

   <input type=hidden name="startDate">

      <TR>  <TD bgColor=#008088 colSpan=2 class="headfont">

        <FONT color=#FFFFFF>  <B>* Enter ur username and password</B> </FONT>

      </TD></TR>

      <TR>

       <TD bgColor=#9ccdcd class="datafont"><FONT color=black>Username</FONT></TD>

            <TD bgColor=#9ccdcd class="datafont">

            <INPUT tabIndex=1 name=stuNum 

              autocomplete="off" size="20"></TD></TR>

          <TR>

My project uses C# WinForms.

I'm new to HTML Agility Pack. I have a few ideas of my own, but they would take a lot of time, so I assume there is a method or shortcut for getting the label. Any recommendations?

11/9/2011 1:06:29 PM

Popular Answer

Something like this should work.

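// Requires the HtmlAgilityPack package.
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// For each label text in fields, yields the label together with the input node that follows it.
static IEnumerable<Tuple<string, HtmlNode>> GetInputNodes(HtmlDocument doc, params string[] fields)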
{
    var form = doc.DocumentNode.SelectSingleNode("//form");
    foreach (var field in fields)
    {
        var fieldNode = form.ChildNodes
            .OfType<HtmlTextNode>()
            .Where(node => node.Text.Trim().StartsWith(field, StringComparison.OrdinalIgnoreCase))
            .SingleOrDefault();
        if (fieldNode == null)
            continue;

        var input = FindCorrespondingInputNode(fieldNode);
        if (input != null)
            yield return Tuple.Create(field, input);
    }
}

static HtmlNode FindCorrespondingInputNode(HtmlTextNode fieldNode)
{
    for (var currentNode = fieldNode.NextSibling;
         currentNode != null && currentNode.NodeType != HtmlNodeType.Text;
         currentNode = currentNode.NextSibling)
    {
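        // An input with no type attribute defaults to "text", so read the attribute
        // with a fallback instead of dereferencing a possibly missing one.
        if (currentNode.Name == "input"
         && !currentNode.GetAttributeValue("type", "text").Contains("hidden"))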
        {
            return currentNode;
        }
    }
    return null;
}

To use it, pass the label text of each field whose input node you want to get.

GetInputNodes(doc, "username");

One caveat: HtmlAgilityPack doesn't seem to close the form element properly. You have to tell it that form elements are closed before you load the HTML; otherwise HAP will not find any child nodes in the form. The setting lives in a static dictionary, so it applies to every document loaded afterwards.

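var doc = new HtmlDocument();
// ElementsFlags is static, so set it before any document is loaded.
HtmlNode.ElementsFlags["form"] = HtmlElementFlag.Closed;
// Load takes a file path or stream; for a web address, use new HtmlWeb().Load(url) instead.
doc.Load(url);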
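
GetInputNodes needs the label text up front, while the question says the labels are not known in advance. For that case, here is a minimal sketch of the reverse lookup: walk the document in order, remember the last non-empty piece of text, and pair it with the next visible input. The class and method names (FormLabelSketch, GetLabelledInputs) are made up for illustration, and treating "the closest preceding text" as the label is an assumption that will not hold for every page layout (for example, text inside script blocks would also be picked up).

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class FormLabelSketch
{
    // Hypothetical helper: pairs each visible input with the closest
    // non-empty text that precedes it in document order.
    public static IEnumerable<Tuple<string, HtmlNode>> GetLabelledInputs(HtmlDocument doc)
    {
        string lastText = null;

        // Descendants() enumerates nodes in document order (parents before children).
        foreach (var node in doc.DocumentNode.Descendants())
        {
            if (node.NodeType == HtmlNodeType.Text)
            {
                // Remember the text only if it is more than whitespace.
                var text = HtmlEntity.DeEntitize(node.InnerText).Trim();
                if (text.Length > 0)
                    lastText = text;
                continue;
            }

            if (node.NodeType != HtmlNodeType.Element || node.Name != "input")
                continue;

            // A missing type attribute means a plain text input.
            var type = node.GetAttributeValue("type", "text").ToLowerInvariant();
            if (type == "hidden" || type == "submit")
                continue; // skipped inputs do not consume the pending label

            yield return Tuple.Create(lastText, node);
            lastText = null; // assumption: one label belongs to one input
        }
    }
}
```

Each returned tuple holds the guessed label (or null when no text was found) and the input node, so the input's name can be read with pair.Item2.GetAttributeValue("name", ""). Because the walk starts at doc.DocumentNode rather than at the form node, it does not depend on how the form's children end up nested.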
11/8/2011 7:30:01 PM



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow