Parsing unclosed tags from a web page with HtmlAgilityPack

c# html-agility-pack linq

Question

I am trying to parse the list of stations from the NOAA web site (weather.noaa.gov). If you look at the source of a page such as Belarus Stations, you can see the list of available stations is presented as:

<select name="cccc">
    <option selected>Select a location
    <OPTION VALUE="UMBB"> Brest
    <OPTION VALUE="UMGG"> Gomel'
    <OPTION VALUE="UMMG"> Grodno
    <OPTION VALUE="UMMM"> Loshitsa / Minsk International 1
    <OPTION VALUE="UMMS"> Minsk
    <OPTION VALUE="UMII"> Vitebsk
</select>

You can see that the 'OPTION' tags are not closed. With its default options, HtmlAgilityPack closes the tags like so:

<select name="cccc">
    <option selected>Select a location
    <OPTION VALUE="UMBB"> Brest
    <OPTION VALUE="UMGG"> Gomel'
    <OPTION VALUE="UMMG"> Grodno
    <OPTION VALUE="UMMM"> Loshitsa / Minsk International 1
    <OPTION VALUE="UMMS"> Minsk
    <OPTION VALUE="UMII"> Vitebsk
    </OPTION></OPTION></OPTION></OPTION></OPTION></OPTION></OPTION>
</select>

This makes the result a pain to parse or traverse. I came up with the following method to recurse into each tag, but I wonder if there is a more elegant way, perhaps using LINQ?

My method:

private static void GetStations(HtmlNode node, ref Dictionary<string, string> stations)
{
    // the HTML is malformed, such that the <option> elements are
    // not properly closed, so we have to parse manually
    string name = node.GetAttributeValue("value", string.Empty).Trim();
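    // take the text up to the first line break; if there is no line
    // break, IndexOf returns -1, so fall back to the whole string
    string html = node.InnerHtml;
    int newline = html.IndexOf('\n');
    string value = (newline >= 0 ? html.Substring(0, newline) : html).Trim();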

    if (!string.IsNullOrEmpty(name) &&
             name.Length == 4 &&
            char.IsUpper(name[0]))
    {
        stations.Add(name, value);
    }
    // due to not closing the <option> elements
    // we have to recurse into child nodes until
    // we get them all
    if (node.HasChildNodes)
    {
        GetStations(node.LastChild, ref stations);
    }
}

Which is called like so:

Dictionary<string, string> sites = new Dictionary<string, string>();
...
foreach (HtmlNode option in select.ChildNodes)
{
    if ((option.Name == "option") && (option.HasAttributes))
    {
        GetStations(option, ref sites);
    }
}

I feel like I am using a brute force method to get the list of stations, and I might be missing some of the power of the HtmlAgilityPack library. Is there a better way? Are there settings that might make this a non-issue? Can LINQ handle this more easily?

I am trying XPath, as it seems the simplest mechanism to get a subset of tags. However, because the tags are not closed, I am getting every option tag on the page, while I only want the ones inside the 'select' tag. One qualifier, as you can see, is that the 'option' tags I want have a @value='XXXX', where 'XXXX' is a 4-character, uppercase station id. Is there a way to specify that I want only the option tags in the document that have an attribute named 'value' with an uppercase 4-character value? Can I pass a comparison function to an XPath statement?

7/17/2014 3:26:46 PM

Accepted Answer

Thanks for all the pointers. I searched further for XPath syntax and found this, which works:

//select[@name='cccc']/descendant::option[@value]

This gives me all the 'option' tags under the 'select' tag that has the attribute @name='cccc', where the 'option' tag has a @value attribute.

Much less work than what I was doing. Now to refactor all my other code that loops through the DOM using HAP and see how XPath can make my life easier!
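
For illustration, here is a minimal sketch of how that expression might be used to build the station dictionary, with the "four uppercase letters" check done as a plain C# predicate instead of inside the XPath. The names StationParser, Parse and IsStationId are hypothetical and not part of HtmlAgilityPack. The sketch also assumes the station name is either the first text node inside the option or the text node right after it, since that seems to depend on how the parser treats the unclosed tags.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class StationParser
{
    // Hypothetical helper: a station id is assumed to be exactly
    // four uppercase ASCII letters, e.g. "UMBB".
    static bool IsStationId(string s)
    {
        return s.Length == 4 && s.All(c => c >= 'A' && c <= 'Z');
    }

    public static Dictionary<string, string> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var stations = new Dictionary<string, string>();

        // SelectNodes returns null (not an empty collection) when nothing matches
        var options = doc.DocumentNode.SelectNodes(
            "//select[@name='cccc']/descendant::option[@value]");
        if (options == null) return stations;

        foreach (HtmlNode option in options)
        {
            string id = option.GetAttributeValue("value", string.Empty).Trim();
            if (!IsStationId(id)) continue;

            // Assumption: the label is the first text node inside the option,
            // or, if the option was parsed as empty, the text node right after it.
            HtmlNode text = option.ChildNodes
                .FirstOrDefault(n => n.NodeType == HtmlNodeType.Text);
            if (text == null && option.NextSibling != null
                && option.NextSibling.NodeType == HtmlNodeType.Text)
            {
                text = option.NextSibling;
            }

            string label = text == null ? string.Empty : text.InnerText;
            // the indexer overwrites duplicates instead of throwing like Add
            stations[id] = HtmlEntity.DeEntitize(label).Trim();
        }
        return stations;
    }
}
```

Since HAP evaluates XPath 1.0, the same filter can probably also be written in the expression itself, for example with option[string-length(@value)=4 and translate(@value,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','')='']. The C# predicate is easier to read and change, though.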

7/17/2014 3:32:25 PM

Popular Answer

HtmlAgilityPack can fix the missing closing tags automatically, though maybe not exactly the way you expect:

HtmlNode.ElementsFlags["option"] = HtmlElementFlag.Closed;
var doc = new HtmlDocument();
doc.LoadHtml(html);

Either way, at this point you can still select the text that is supposed to be inside each <option> tag using the XPath following-sibling::text()[1], for example:

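var optionTexts = doc.DocumentNode.SelectNodes("//select[@name='cccc']/option/following-sibling::text()[1]");
// SelectNodes returns null (not an empty collection) when nothing matches
if (optionTexts != null)
{
    foreach (HtmlNode node in optionTexts)
    {
        Console.WriteLine(node.InnerText);
    }
}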
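
Selecting only the text nodes loses the station id that goes with each name. One way to keep the two together might be to select the option elements first and then run the same following-sibling XPath relative to each one. This is only a sketch: OptionPairs and Read are hypothetical names, and it assumes the options end up as siblings with their text right after them, as described above.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

static class OptionPairs
{
    public static List<KeyValuePair<string, string>> Read(string html)
    {
        // ElementsFlags is static, so this affects every document
        // loaded afterwards in the same process.
        HtmlNode.ElementsFlags["option"] = HtmlElementFlag.Closed;

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var pairs = new List<KeyValuePair<string, string>>();
        var options = doc.DocumentNode.SelectNodes(
            "//select[@name='cccc']/option[@value]");
        if (options == null) return pairs;   // nothing matched

        foreach (HtmlNode option in options)
        {
            string id = option.GetAttributeValue("value", string.Empty);

            // relative XPath: the first text node after this option
            HtmlNode text = option.SelectSingleNode("following-sibling::text()[1]");
            string name = text == null ? string.Empty : text.InnerText.Trim();

            pairs.Add(new KeyValuePair<string, string>(id, name));
        }
        return pairs;
    }
}
```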


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow