XHTML Parsing with HTMLAgilityPack

c# html-agility-pack

Question

I have a list of elements like the following inside an element that I located with HtmlAgilityPack.

<option value="67"><span style="color: #cc0000;">Horde</span> Leveling / Dailies & Event Guide ($50.00)</option>

What I need is all the text out of the tag, without the nested markup. I've tried (seemingly!) everything, but the text always comes out split across lines, like this:

Horde
Leveling / Dailies & Event Guide ($50.00)

and sometimes like:

Horde
Leveling
/ Dailies & Event Guide ($50.00)

and a couple of other variations like that. I've even gone so far as to print out each character in the string as a byte, and I haven't found any line breaks or line feeds, only what I expected: normal letters and spaces. Here's the full source of the HTML for reference, copied straight from the page.

<option value="13"><span style="color: #0000ff;">Alliance</span> Leveling Guide ($30.00)</option>


<option value="12"><span style="color: #cc0000;">Horde</span> Leveling Guide ($30.00)</option>

<option value="46"><span style="color: #cc0000;">Horde</span> Dailies & Events Guide ($25.00)</option>

<option value="67"><span style="color: #cc0000;">Horde</span> Leveling / Dailies & Event Guide ($50.00)</option>


<option value="11"><span style="color: #0000ff;">Alliance</span> &amp; <span style="color: #cc0000;">Horde</span> Leveling Guide ($50.00)</option>

<option value="97"><span style="color: #0000ff;">Alliance</span> Achievements & Professions Guide ($20.00)</option>

<option value="98"><span style="color: #cc0000;">Horde</span> Achievements & Professions Guide ($20.00)</option>


<option value="99"><span style="color: #0000ff;">Alliance</span> &amp; <span style="color: #cc0000;">Horde</span> Achievements & Professions Guide ($30.00)</option>

Popular Answer

By default, Html Agility Pack treats the <OPTION> tag as an "Empty" element, which means it does not need a closing </OPTION>. The parser therefore does not treat the text and the <span> as children of the <OPTION> node, which is why they are hard to reach with XPath in this case. You can change this behavior through the static HtmlNode.ElementsFlags collection.

Here is some code that should do what you want:

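HtmlDocument doc = new HtmlDocument();
// Treat <option> as a normal container element; this must run before LoadHtml.
HtmlNode.ElementsFlags.Remove("option");
doc.LoadHtml(yourHtml);
// Note: SelectNodes returns null when no node matches the XPath expression.
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//option"))
{
    // InnerText includes the text of the nested <span> and the text after it.
    Console.WriteLine(node.InnerText);
}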
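
For a fuller picture, here is a minimal sketch of the same approach as a complete program. It is not part of the original answer: the OptionText class and its Normalize helper are hypothetical names introduced for illustration, the sample markup is adapted from the question (with single-quoted attributes and an assumed wrapping <select>), and the entity decoding and whitespace collapsing are optional extras you may not need once the flag is removed.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class OptionText
{
    // Hypothetical helper: decodes HTML entities such as "&amp;" and
    // collapses every run of whitespace into a single space.
    public static string Normalize(string raw)
    {
        string decoded = WebUtility.HtmlDecode(raw);
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        // Sample markup adapted from the question; the <select> wrapper is an assumption.
        string html =
            "<select>" +
            "<option value='67'><span style='color: #cc0000;'>Horde</span> Leveling / Dailies & Event Guide ($50.00)</option>" +
            "<option value='11'><span style='color: #0000ff;'>Alliance</span> &amp; <span style='color: #cc0000;'>Horde</span> Leveling Guide ($50.00)</option>" +
            "</select>";

        // Must run before LoadHtml so that <option> is parsed as a container.
        HtmlNode.ElementsFlags.Remove("option");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection options = doc.DocumentNode.SelectNodes("//option");
        if (options == null)
        {
            Console.WriteLine("No <option> elements found.");
            return;
        }

        foreach (HtmlNode option in options)
        {
            // GetAttributeValue falls back to the second argument if "value" is missing.
            string value = option.GetAttributeValue("value", "");
            Console.WriteLine(value + ": " + Normalize(option.InnerText));
        }
    }
}
```

The null check matters because iterating over a null SelectNodes result throws a NullReferenceException. WebUtility.HtmlDecode comes from the .NET base library; Html Agility Pack also ships HtmlEntity.DeEntitize, which could be used in its place.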



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow