Extract data from an HTML form using HTML Agility Pack

c# html html-agility-pack winforms

Question

I'm trying to dynamically list all the nodes in an HTML form using HTML Agility Pack, meaning that I don't know the attribute names or the input names in advance. The problem comes when I want to get the label corresponding to each input.

<form name="input" action="html_form_action.asp" method="get">
Username: <input type="text" name="user" />
<input type="submit" value="Submit" />
</form>

So here I want to output "Username" and then the input. It seems obvious in this example, but sometimes the label and the input are not direct siblings, and there may be many hidden inputs or other tags in between.

Another example:

   <input type=hidden name="startDate">

      <TR>  <TD bgColor=#008088 colSpan=2 class="headfont">

        <FONT color=#FFFFFF>  <B>* Enter ur username and password</B> </FONT>

      </TD></TR>

      <TR>

       <TD bgColor=#9ccdcd class="datafont"><FONT color=black>Username</FONT></TD>

            <TD bgColor=#9ccdcd class="datafont">

            <INPUT tabIndex=1 name=stuNum 

              autocomplete="off" size="20"></TD></TR>

          <TR>

I'm using C# WinForms in my project.

I have a few ideas, but they would take a lot of time. Since I'm new to HTML Agility Pack, I thought there might be a built-in way or a shortcut to do this. Any suggestions?

Popular Answer

Something like this should work.

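// Requires: using System; using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
static IEnumerable<Tuple<string, HtmlNode>> GetInputNodes(HtmlDocument doc, params string[] fields)
{
    var form = doc.DocumentNode.SelectSingleNode("//form");
    if (form == null)
        yield break; // no form in the document

    foreach (var field in fields)
    {
        // Find the text node directly inside the form that starts with the label text.
        // FirstOrDefault avoids an exception when several text nodes match.
        var fieldNode = form.ChildNodes
            .OfType<HtmlTextNode>()
            .FirstOrDefault(node => node.Text.Trim().StartsWith(field, StringComparison.OrdinalIgnoreCase));
        if (fieldNode == null)
            continue;

        var input = FindCorrespondingInputNode(fieldNode);
        if (input != null)
            yield return Tuple.Create(field, input);
    }
}

static HtmlNode FindCorrespondingInputNode(HtmlTextNode fieldNode)
{
    // Walk the following siblings until the next label (a non-blank text node).
    for (var currentNode = fieldNode.NextSibling;
         currentNode != null;
         currentNode = currentNode.NextSibling)
    {
        if (currentNode.NodeType == HtmlNodeType.Text)
        {
            if (currentNode.InnerText.Trim().Length > 0)
                break; // reached the next label
            continue;  // whitespace between tags
        }

        // GetAttributeValue supplies a default, so inputs with no type attribute do not throw.
        if (currentNode.Name == "input"
         && !currentNode.GetAttributeValue("type", "text").Equals("hidden", StringComparison.OrdinalIgnoreCase))
        {
            return currentNode;
        }
    }
    return null;
}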

Then to use it, just pass in the names of the fields you want to get the input elements for.

GetInputNodes(doc, "username");

One warning: by default, HtmlAgilityPack does not seem to close off the form element the way it probably should, so it will not recognize that the form has any child nodes. To work around this, specify that form elements should be treated as closed before loading the HTML.

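HtmlNode.ElementsFlags["form"] = HtmlElementFlag.Closed; // static setting, so apply it before loading
var doc = new HtmlDocument();
doc.Load(path); // Load reads a local file or stream; for a web address, new HtmlWeb().Load(url) returns the document instead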
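
The code above only finds labels that are text nodes sitting directly inside the form. For table layouts like the second example in the question, where the label lives in a neighboring cell, one possible fallback is to walk the whole document in order and pair each visible input with the last non-blank text seen before it. This is only a rough sketch: `LabelFinder` and `GetLabeledInputs` are made-up names, and "nearest preceding text" is a heuristic that can pick up unrelated text such as headings or script content.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class LabelFinder
{
    // Hypothetical helper (not part of HtmlAgilityPack): pairs each visible
    // input with the nearest non-blank text that precedes it in the document.
    public static IEnumerable<Tuple<string, HtmlNode>> GetLabeledInputs(HtmlDocument doc)
    {
        string lastText = null;

        // Descendants() walks the tree in document order.
        foreach (var node in doc.DocumentNode.Descendants())
        {
            if (node.NodeType == HtmlNodeType.Text)
            {
                // Remember the most recent text that is not just whitespace.
                var text = HtmlEntity.DeEntitize(node.InnerText).Trim();
                if (text.Length > 0)
                    lastText = text;
                continue;
            }

            if (node.Name != "input")
                continue;

            // Inputs without a type attribute are treated as text inputs.
            var type = node.GetAttributeValue("type", "text").ToLowerInvariant();
            if (type == "hidden" || type == "submit")
                continue;

            if (lastText != null)
                yield return Tuple.Create(lastText, node);
        }
    }
}
```

Because this walks every descendant of the document instead of the form's child nodes, it should not depend on the ElementsFlags setting. For the table markup in the question, it should pair the "Username" text with the stuNum input.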



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow