When parsing with HtmlAgilityPack, remove whitespace and newlines.

asp.net c# html-agility-pack trim

Question

I am parsing HTML with the HtmlAgilityPack like this:

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(xhtmlString);

Unfortunately, the xhtmlString includes extra whitespace and newline characters, which results in the following _text in the htmlDoc:

<html xmlns=\"http://www.w3.org/1999/xhtml\">\n\t<head></head>\n\t<body>\n\n<p>Alle Auktionen<br /></p>\n\n\t</body>\n</html>

This causes problems when I work with the child nodes of the body.

What is the simplest approach to get rid of these extra characters?

Does the HTMLAgilityPack have a feature that removes newlines and tabs from HTML?

1/5/2012 1:37:49 PM

Popular Answer

These are not extra whitespace and newline characters; they are the document's indentation.
I don't see how this could be a problem, but why can't you simply replace the special characters such as "\t" and "\n"?
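A minimal sketch of that replacement idea, using only the .NET base library. The class and method names (HtmlWhitespace, StripTabsAndNewlines) are made up for illustration and are not part of HtmlAgilityPack; the sketch also assumes that none of the tabs or newlines in the markup are significant.

```csharp
using System;

public static class HtmlWhitespace
{
    // Hypothetical helper (not an HtmlAgilityPack API): removes every tab,
    // carriage return, and line feed from the markup before it is parsed.
    public static string StripTabsAndNewlines(string html)
    {
        if (html == null) throw new ArgumentNullException(nameof(html));

        return html.Replace("\t", string.Empty)
                   .Replace("\r", string.Empty)
                   .Replace("\n", string.Empty);
    }
}
```

It would be called before loading, for example htmlDoc.LoadHtml(HtmlWhitespace.StripTabsAndNewlines(xhtmlString)). Note that this is a blunt tool: it also removes line breaks inside text content and pre elements, so words separated only by a newline would be joined.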

After a quick search, I discovered this: HTML Agility Pack: tidy up your code
Perhaps setting some of the options mentioned there to false would help.
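If the real problem is that the indentation shows up as whitespace-only text nodes among the children of the body, another option is to leave the string untouched and skip those nodes while iterating. The following is only a sketch under that assumption; the class and method names (BodyChildren, MeaningfulChildNames) are invented for illustration, and it requires the HtmlAgilityPack package.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class BodyChildren
{
    // Hypothetical helper: returns the names of the body's child nodes,
    // ignoring text nodes that consist only of whitespace (the indentation).
    public static List<string> MeaningfulChildNames(string html)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        HtmlNode body = htmlDoc.DocumentNode.SelectSingleNode("//body");
        if (body == null) return new List<string>();

        return body.ChildNodes
            .Where(n => n.NodeType != HtmlNodeType.Text
                        || !string.IsNullOrWhiteSpace(n.InnerText))
            .Select(n => n.Name)
            .ToList();
    }
}
```

The same Where filter can be applied wherever the child nodes are processed, so text nodes with real content are kept and only the indentation is ignored.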

5/23/2017 12:33:21 PM


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow