Parsing unclosed tags from a web page with HtmlAgilityPack

c# html-agility-pack linq

Question

I am trying to parse the list of stations from the NOAA web site (weather.noaa.gov). If you look at the source of a page such as Belarus Stations, you can see the list of available stations is presented as:

<select name="cccc">
    <option selected>Select a location
    <OPTION VALUE="UMBB"> Brest
    <OPTION VALUE="UMGG"> Gomel'
    <OPTION VALUE="UMMG"> Grodno
    <OPTION VALUE="UMMM"> Loshitsa / Minsk International 1
    <OPTION VALUE="UMMS"> Minsk
    <OPTION VALUE="UMII"> Vitebsk
</select>

You can see that the 'OPTION' tags are not closed. With its default options, HtmlAgilityPack closes the tags like so:

<select name="cccc">
    <option selected>Select a location
    <OPTION VALUE="UMBB"> Brest
    <OPTION VALUE="UMGG"> Gomel'
    <OPTION VALUE="UMMG"> Grodno
    <OPTION VALUE="UMMM"> Loshitsa / Minsk International 1
    <OPTION VALUE="UMMS"> Minsk
    <OPTION VALUE="UMII"> Vitebsk
    </OPTION></OPTION></OPTION></OPTION></OPTION></OPTION></OPTION>
</select>

This makes the result a pain to parse or traverse. I came up with the following method, which recurses into each tag, but I wonder if there is a more elegant way, perhaps using LINQ?

My method:

private static void GetStations(HtmlNode node, ref Dictionary<string, string> stations)
{
    // the HTML is malformed, such that the <option> elements are
    // not properly closed, so we have to parse manually
    string name = node.GetAttributeValue("value", string.Empty).Trim();

    // the station name is the first line of the inner HTML; fall back to
    // the whole string when there is no newline (IndexOf returns -1)
    string html = node.InnerHtml;
    int newline = html.IndexOf('\n');
    string value = (newline >= 0 ? html.Substring(0, newline) : html).Trim();

    if (!string.IsNullOrEmpty(name) &&
        name.Length == 4 &&
        char.IsUpper(name[0]))
    {
        stations.Add(name, value);
    }
    // due to not closing the <option> elements
    // we have to recurse into child nodes until
    // we get them all
    if (node.HasChildNodes)
    {
        GetStations(node.LastChild, ref stations);
    }
}

Which is called like so:

Dictionary<string, string> sites = new Dictionary<string, string>();
...
foreach (HtmlNode option in select.ChildNodes)
{
    if ((option.Name == "option") && (option.HasAttributes))
    {
        GetStations(option, ref sites);
    }
}

I feel like I am using a brute force method to get the list of stations, and I might be missing some of the power of the HtmlAgilityPack library. Is there a better way? Are there settings that might make this a non-issue? Can LINQ handle this more easily?

I am trying XPath, as it seems the simplest mechanism to get a subset of tags. However, because the tags are not closed, I am getting every option tag on the page, while I only want the ones inside the 'select' tag. One qualifier, as you can see, is that the 'option' tags I want have a @value='XXXX' where 'XXXX' is a 4-character, upper-case station id. Is there a way to specify that I want only the option tags in the document that have an attribute named 'value' with an upper-case, 4-character value? Can I pass a comparison function to an XPath statement?

Accepted Answer

Thanks for all the pointers. I did more searches for xpath syntax, and found this that works:

//select[@name='cccc']/descendant::option[@value]

This gives me all the 'option' tags under the 'select' tag that has the attribute @name='cccc', keeping only the 'option' tags that have a @value attribute.
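
The question also asked whether the 4-character, upper-case rule can be expressed in the XPath itself. As far as I know, XPath 1.0 (which is what HtmlAgilityPack evaluates, to my understanding) has no regular expressions and no way to pass in a comparison function, but the rule can be approximated with the standard string-length() and translate() functions. The sketch below is only an illustration: the class name StationXPath, its members, and the ASCII-only definition of "upper case" are my own assumptions, not part of the original answer.

```csharp
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
static class StationXPath
{
    // Letters that count as "upper case" for a station id (ASCII only, an assumption).
    private const string Upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // translate(@value, Upper, '') deletes every upper-case letter from the
    // attribute value, so an empty result means it contained nothing else.
    // Combined with string-length(@value)=4 this keeps ids such as "UMBB".
    public const string StationOptions =
        "//select[@name='cccc']/descendant::option" +
        "[string-length(@value)=4 and translate(@value,'" + Upper + "','')='']";

    public static HtmlNodeCollection Select(HtmlDocument doc)
    {
        // SelectNodes may return null when nothing matches, so callers should check.
        return doc.DocumentNode.SelectNodes(StationOptions);
    }
}
```

If the rule ever gets more complicated than this, it is probably simpler to keep the XPath from above as it is and filter the resulting nodes in C#.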

Much less work than what I was doing. Now to refactor all my other code that loops through the DOM using HAP and see how XPATH can make my life easier!


Popular Answer

HtmlAgilityPack can fix the missing closing tags automatically, though maybe not exactly the way you expect:

HtmlNode.ElementsFlags["option"] = HtmlElementFlag.Closed;
var doc = new HtmlDocument();
doc.LoadHtml(html);

Even so, at this point you can still select the text that is supposed to be inside each <option> tag by using the XPath step following-sibling::text()[1], for example:

var optionTexts = doc.DocumentNode.SelectNodes("//select[@name='cccc']/option/following-sibling::text()[1]");
foreach (HtmlNode node in optionTexts)
{
    Console.WriteLine(node.InnerText);
}
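
To get back to the dictionary of station ids and names that the question was building, the same idea can be combined with a small LINQ filter. This is a rough sketch rather than a tested solution: the StationList class and its Parse method are hypothetical names, and the fallback between InnerText and NextSibling is there because I am not certain where every HtmlAgilityPack version places the text of an unclosed option.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
static class StationList
{
    public static Dictionary<string, string> Parse(string html)
    {
        // Static, process-wide setting, as in the snippet above.
        HtmlNode.ElementsFlags["option"] = HtmlElementFlag.Closed;

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var stations = new Dictionary<string, string>();
        var options = doc.DocumentNode.SelectNodes("//select[@name='cccc']//option[@value]");
        if (options == null) return stations; // no matching nodes

        foreach (HtmlNode option in options)
        {
            // Keep only 4-character, all-upper-case ids such as "UMBB".
            string id = option.GetAttributeValue("value", string.Empty).Trim();
            if (id.Length != 4 || !id.All(char.IsUpper)) continue;

            // The station name is either inside the option or in the text
            // node that follows it, depending on how the parser closed the tag.
            string raw = option.InnerText;
            if (string.IsNullOrWhiteSpace(raw) && option.NextSibling != null)
                raw = option.NextSibling.InnerText;

            // Use the first non-blank line as the name.
            string name = raw.Split('\n')
                             .Select(line => line.Trim())
                             .FirstOrDefault(line => line.Length > 0) ?? string.Empty;

            stations[id] = name; // indexer: a repeated id overwrites instead of throwing
        }
        return stations;
    }
}
```

Calling StationList.Parse(html) on the page source should then give one entry per station, though the exact names depend on how the page lays out its whitespace.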



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow