Prevent HTMLAgilityPack from connecting words when using InnerText

c# html-agility-pack

Question

I'm trying to do a simple task of getting text from HTML document. So I'm using HTMLdoc.DocumentNode.InnerText for that. The problem is that on some sites the don't put spaces between words when they are in a different tags. In those cases the DocumentNode.InnerText connect those word into one and it became useless.

for example, I'm trying to read a site contain that line

<span>Ä°stanbul</span><ul><li><a href="i1.htm">Adana</a></li>

I'm getting "Ä°stanbulAdana" which is meaningless.

I couldn't find any solution at HTMLAgilityPack documentation nor Google

Do I missing something?

Thanks,

Accepted Answer

That should be rather easy to do.

const string html = @"<span>Ä°stanbul</span><ul><li><a href=""i1.htm"">Adana</a></li>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
string result = string.Join(" ", doc.DocumentNode.Descendants()
  .Where(n => !n.HasChildNodes && !string.IsNullOrWhiteSpace(n.InnerText))
  .Select(n => n.InnerText));
Console.WriteLine(result); // prints "Ä°stanbul Adana"

Popular Answer

Well, the code snippet hangs for this example:

const string html = @"<td><font size=""2"">abc </font><font size=""2"">(</font><font size=""2"">abc</font><font size=""2"">) </font><a href=""?query=abc"">abc</a>, abc<br><font size=""2"">abc </font>abc, <a href=""?query=abc"">abc</a>, abc, <a href=""?query=abc"">abc</a><br><font size=""2"">abc </font>abc abc, abc abc<br></td>";

It doesn't hang without the join-clause (but it doesn't put spaces correctly neither).




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why