Ignoring &nbsp; when parsing with HtmlAgilityPack

c# html-agility-pack

Question

I'm parsing an HTML table in C# using Html Agility Pack, and the table contains non-breaking spaces.

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(page);

Here page is a string containing a table with the special character &nbsp; within the text.

<td>&#160;test</td>
<td>number =&#160;123&#160;</td>

Using SelectSingleNode(".//td").InnerText returns text that contains these special characters, but I want to ignore them.

Is there an elegant way to ignore them (with or without the help of Html Agility Pack) without modifying the source table?

Accepted Answer

You could use HttpUtility.HtmlDecode (in the System.Web namespace):

string foo = HttpUtility.HtmlDecode("Special char: &#160;");

This decodes the entity into an actual non-breaking space character (U+00A0), giving you the string:

Special char:
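
Note that decoding alone leaves a real U+00A0 character in the string rather than removing it. A minimal sketch of decoding and then normalizing, assuming System.Net.WebUtility is used in place of HttpUtility (so no System.Web reference is needed); the class and method names here are hypothetical:

```csharp
using System;
using System.Net;

static class NbspDemo
{
    // Hypothetical helper: decode HTML entities, map U+00A0 to an
    // ordinary space, then trim leading and trailing whitespace.
    public static string DecodeAndNormalize(string html)
    {
        string decoded = WebUtility.HtmlDecode(html); // "&#160;" becomes '\u00A0'
        return decoded.Replace('\u00A0', ' ').Trim();
    }

    static void Main()
    {
        string decoded = WebUtility.HtmlDecode("number =&#160;123&#160;");
        Console.WriteLine(decoded.Contains("\u00A0"));                    // True
        Console.WriteLine(DecodeAndNormalize("number =&#160;123&#160;")); // number = 123
    }
}
```

Trim() on its own already strips a leading or trailing non-breaking space, since .NET classifies U+00A0 as white space; the Replace call is what handles the ones in the middle of the text.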


Popular Answer

The "Special Character" non-breaking-space of which you speak is a valid character which can perfectly legitimately appear in text, just as "fancy quotes", em-dashes, etc. can.

Often we want to treat certain characters as being equivalent.

  • So you might want to treat an em-dash, en-dash and minus sign/dash as being the same.
  • Or fancy quotes as the same as straight quotes.
  • Or the non-breaking-space as an ordinary space.

However, this is not something Html Agility Pack can help with. You need to use something like string.Replace or your own canonicalization function to do this.

I would suggest something like:

static string CleanupStringForMyApp(string s){
    // replace characters with their equivalents
    s = s.Replace('\u00A0', ' '); // U+00A0 (character code 160) is the non-breaking space
    // Add any more replacements you want to do here
    return s;
}
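
As a rough usage sketch, the cleanup function could be applied to each cell of the table from the question. This assumes the HtmlAgilityPack NuGet package is referenced and that InnerText may still contain undecoded entities, hence the call to HtmlEntity.DeEntitize first; the class name and the inline sample HTML are made up for illustration:

```csharp
using System;
using HtmlAgilityPack;

static class TableCellDemo
{
    // Same idea as the function above: map U+00A0 to an ordinary space.
    public static string CleanupStringForMyApp(string s)
    {
        return s.Replace('\u00A0', ' ');
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<table><tr><td>&#160;test</td><td>number =&#160;123&#160;</td></tr></table>");

        foreach (HtmlNode cell in doc.DocumentNode.SelectNodes("//td"))
        {
            // Turn any remaining entities (such as &#160;) into characters,
            // then normalize and trim the result.
            string text = HtmlEntity.DeEntitize(cell.InnerText);
            Console.WriteLine("[" + CleanupStringForMyApp(text).Trim() + "]");
        }
    }
}
```

The source table is left untouched; only the extracted strings are normalized.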



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow