InnerText=InnerHtml - How to extract readable text with HtmlAgilityPack

html html-agility-pack innerhtml innertext vb.net

Question

I need to extract text from some very messy HTML.

I'm trying to do this using VB.NET and HtmlAgilityPack.

For the tag that I need to parse, InnerText and InnerHtml are identical; both return:

Name:<!--b>&#61;</b--> Albert E<!--span-->instein  s<!--i>&#89;</i-->ection: 3 room: -

While debugging I can read it using the "Html viewer", which shows:

Name: Albert Einstein section: 3 room: -

How can I get this into a string variable?

EDIT:

I use this code to get the node:

Dim ElePs As HtmlNodeCollection = _
    mWPage.DocumentNode.SelectNodes("//div[@id='div_main']//p")
For Each EleP As HtmlNode In ElePs
    'Here I need to get EleP.InnerText "normalized"
Next

Accepted Answer

If you look closely, this mess is actually just HTML comments, and they should be ignored. Selecting only the text nodes and concatenating them with String.Join is enough:

C#

var text = string.Join("", htmlDoc.DocumentNode
    .SelectNodes("//text()[normalize-space()]")
    .Select(t => t.InnerText));

VB.NET

Dim text = String.Join("", From t In htmlDoc.DocumentNode.SelectNodes("//text()[normalize-space()]")
                           Select t.InnerText)

The HTML is valid, nothing bad about it; it's just written by someone without a soul.
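To try the idea in isolation, here is a minimal console sketch. It assumes the HtmlAgilityPack package is referenced; the module name and the `<p>` wrapper around the question's sample string are made up for illustration. It also guards against SelectNodes returning Nothing, which it does when no node matches.

```vbnet
Imports System
Imports System.Linq
Imports HtmlAgilityPack

Module CommentStrippingDemo
    Sub Main()
        ' Sample markup from the question, wrapped in a <p> for illustration only.
        Dim html As String =
            "<p>Name:<!--b>&#61;</b--> Albert E<!--span-->instein  s<!--i>&#89;</i-->ection: 3 room: -</p>"

        Dim htmlDoc As New HtmlDocument()
        htmlDoc.LoadHtml(html)

        ' text() matches text nodes only, so the comment nodes are skipped.
        ' [normalize-space()] drops text nodes that contain only whitespace.
        Dim textNodes = htmlDoc.DocumentNode.SelectNodes("//text()[normalize-space()]")

        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim text As String = ""
        If textNodes IsNot Nothing Then
            text = String.Join("", textNodes.Select(Function(t) t.InnerText)).Trim()
        End If

        Console.WriteLine(text)
    End Sub
End Module
```

The printed line should read like the "Html viewer" text from the question. Runs of spaces inside a text node are kept as they are, and character entities in real text are not decoded by InnerText; if that matters, passing the result through HtmlEntity.DeEntitize is one option.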

Based on your update, this should do:

Dim ElePs As HtmlNodeCollection = mWPage.DocumentNode.SelectNodes("//div[@id='div_main']//p")
For Each EleP As HtmlNode In ElePs
    ' Join only the text nodes under this <p>; the comment nodes are skipped.
    Dim text = String.Join("", From t In EleP.SelectNodes(".//text()[normalize-space()]")
                               Select t.InnerText).Trim()
Next

Note the .// prefix: it restricts the search to the descendants of the current node, unlike //, which always starts from the top node.
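The difference is easy to check with a small sketch. The two-paragraph document and the module name below are invented for the demonstration, and HtmlAgilityPack is again assumed to be referenced.

```vbnet
Imports System
Imports HtmlAgilityPack

Module RelativeXPathDemo
    Sub Main()
        ' Hypothetical document: two paragraphs inside the div from the question.
        Dim htmlDoc As New HtmlDocument()
        htmlDoc.LoadHtml("<div id='div_main'><p>first</p><p>second</p></div>")

        ' Pick the first <p> as the context node.
        Dim firstP As HtmlNode =
            htmlDoc.DocumentNode.SelectSingleNode("//div[@id='div_main']//p")

        ' Relative path: searches only beneath firstP.
        Dim relativeNodes = firstP.SelectNodes(".//text()")

        ' Absolute path: searches from the top of the document,
        ' even though the call is made on firstP.
        Dim absoluteNodes = firstP.SelectNodes("//text()")

        Console.WriteLine("relative: " & relativeNodes.Count)
        Console.WriteLine("absolute: " & absoluteNodes.Count)
    End Sub
End Module
```

The absolute query should report more text nodes than the relative one, because it also picks up the second paragraph. Using // inside the For Each loop would therefore return the same document-wide text for every paragraph.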




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why