Parsing unclosed tags from a web page with HtmlAgilityPack

c# html-agility-pack linq

Question

I'm attempting to parse the NOAA website's station list (weather.noaa.gov). If you look at the source of a page such as Russia Stations, you can see that the currently available stations are listed as follows:

<select name="cccc">
    <option selected>Select a location
    <OPTION VALUE="UMBB"> Brest
    <OPTION VALUE="UMGG"> Gomel'
    <OPTION VALUE="UMMG"> Grodno
    <OPTION VALUE="UMMM"> Loshitsa / Minsk International 1
    <OPTION VALUE="UMMS"> Minsk
    <OPTION VALUE="UMII"> Vitebsk
</select>

The 'OPTION' tags are not closed, as you can see. The HtmlAgilityPack default settings close the tags as follows:

<select name="cccc">
    <option selected>Select a location
    <OPTION VALUE="UMBB"> Brest
    <OPTION VALUE="UMGG"> Gomel'
    <OPTION VALUE="UMMG"> Grodno
    <OPTION VALUE="UMMM"> Loshitsa / Minsk International 1
    <OPTION VALUE="UMMS"> Minsk
    <OPTION VALUE="UMII"> Vitebsk
    </OPTION></OPTION></OPTION></OPTION></OPTION></OPTION></OPTION>
</select>

This makes parsing or traversing it difficult. To recurse through each tag, I came up with the following technique, but I wonder if there is a more elegant approach, possibly one that makes use of LINQ?

My approach

private static void GetStations(HtmlNode node, ref Dictionary<string, string> stations)
{
    // the HTML is malformed, such that the <option> elements are
    // not properly closed, so we have to parse manually
    string name = node.GetAttributeValue("value", string.Empty).Trim();
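    // take the first line of the inner HTML; fall back to the whole
    // string when there is no line break, so Substring cannot throw
    int lineEnd = node.InnerHtml.IndexOf('\n');
    string value = (lineEnd < 0 ? node.InnerHtml : node.InnerHtml.Substring(0, lineEnd)).Trim();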

    if (!string.IsNullOrEmpty(name) &&
             name.Length == 4 &&
            char.IsUpper(name[0]))
    {
        stations.Add(name, value);
    }
    // due to not closing the <option> elements
    // we have to recurse into child nodes until
    // we get them all
    if (node.HasChildNodes)
    {
        GetStations(node.LastChild, ref stations);
    }
}

It is called like this:

Dictionary<string, string> sites = new Dictionary<string, string>();
...
foreach (HtmlNode option in select.ChildNodes)
{
    if ((option.Name == "option") && (option.HasAttributes))
    {
        GetStations(option, ref sites);
    }
}

I believe I am obtaining the list of stations with a brute-force approach, and I may not be using the HtmlAgilityPack library to its full potential. Is there a better approach? Are there any options that could eliminate this problem? Does LINQ make this easier to handle?

I'm experimenting with XPath since it looks like the easiest way to get a subset of tags. I only want the options that are within the "select" tag, but I am receiving every option tag on the page since the tags are not closed. The 'option' tags I want have a @value='XXXX', where 'XXXX' is a 4-character, upper-case station id, as you can see. That is one criterion. Is it possible to indicate that I only want option tags in the page with an attribute called "value" and a 4-character value in upper case? Can I pass a comparison function into an XPath statement?

7/17/2014 3:26:46 PM

Accepted Answer

I appreciate all the advice. I continued looking for working xpath syntax and came upon the following:

//select[@name='cccc']/descendant::option[@value]

This gives me all the 'option' tags under the 'select' tag that has the attribute @name='cccc', where the 'option' tag has a @value attribute.
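
As a rough sketch of how that XPath could be combined with LINQ to keep only 4-character, upper-case station ids (the filter the question asks about), something like the following might work. It assumes the HtmlAgilityPack NuGet package is referenced; the class name `StationIds` and the helpers `IsStationId` and `Extract` are made up for illustration and are not part of the library. It only collects the ids, since how the station name is attached to each node depends on how the parser closes the tags.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class StationIds
{
    // Hypothetical helper: true for exactly four upper-case letters, e.g. "UMBB".
    public static bool IsStationId(string value) =>
        value != null && value.Length == 4 && value.All(char.IsUpper);

    // Returns the station ids found under <select name="cccc">.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection options = doc.DocumentNode.SelectNodes(
            "//select[@name='cccc']/descendant::option[@value]");
        if (options == null)
        {
            return new List<string>();
        }

        // The XPath narrows the node set; LINQ applies the comparison logic.
        return options
            .Select(o => o.GetAttributeValue("value", string.Empty).Trim())
            .Where(IsStationId)
            .ToList();
    }
}
```

Doing the length and case check in LINQ rather than in the XPath expression keeps the expression short and lets the comparison be any C# predicate.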

This is far less work than what I was doing. Now I need to rethink all of my other HAP-based DOM looping code and see whether XPath can simplify things for me.

7/17/2014 3:32:25 PM

Popular Answer

The closing tag can be fixed automatically by HTML Agility Pack, but maybe not exactly as you anticipate:

HtmlNode.ElementsFlags["option"] = HtmlElementFlag.Closed;
var doc = new HtmlDocument();
doc.LoadHtml(html);

However, you can still select the text that is supposed to be inside the <option> tag using the XPath following-sibling::text()[1], for instance:

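var optionTexts = doc.DocumentNode.SelectNodes("//select[@name='cccc']/option/following-sibling::text()[1]");
// SelectNodes returns null (not an empty collection) when nothing matches
if (optionTexts != null)
{
    foreach (HtmlNode node in optionTexts)
    {
        Console.WriteLine(node.InnerText.Trim());
    }
}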
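
To pair each station id with its name, the two pieces above could perhaps be combined along these lines. This is only a sketch: it assumes the HtmlAgilityPack NuGet package is referenced and that, with the `Closed` flag set, the station name ends up as the text node right after each option, as described above. The class name `StationParser` and the method `Parse` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class StationParser
{
    // Maps each option's value attribute to the text that follows the option.
    public static Dictionary<string, string> Parse(string html)
    {
        // Note: ElementsFlags is static, so this affects every document
        // parsed afterwards in the same process.
        HtmlNode.ElementsFlags["option"] = HtmlElementFlag.Closed;

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var stations = new Dictionary<string, string>();
        HtmlNodeCollection options = doc.DocumentNode.SelectNodes(
            "//select[@name='cccc']/option[@value]");
        if (options == null)
        {
            return stations; // nothing matched
        }

        foreach (HtmlNode option in options)
        {
            string code = option.GetAttributeValue("value", string.Empty).Trim();
            // Relative XPath, evaluated with the current option as context node.
            HtmlNode text = option.SelectSingleNode("following-sibling::text()[1]");
            if (code.Length == 0 || text == null)
            {
                continue;
            }
            // Indexer assignment avoids the exception Add throws on duplicate keys.
            stations[code] = text.InnerText.Trim();
        }
        return stations;
    }
}
```

Calling `StationParser.Parse(html)` on the page source would then return a dictionary keyed by station id, which can be filtered further with LINQ if needed.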

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow