I need to extract text from some very messy HTML.
I'm trying to do this with VB.NET and HtmlAgilityPack.
For the tag I need to parse, InnerText and InnerHtml are identical; both contain:
Name:<!--b>=</b--> Albert E<!--span-->instein s<!--i>Y</i-->ection: 3 room: -
While debugging, I can read it with the "HTML viewer" visualizer, which shows:
Name: Albert Einstein section: 3 room: -
How can I get this into a string variable?
EDIT:
I use this code to get the node:
Dim ElePs As HtmlNodeCollection = _
mWPage.DocumentNode.SelectNodes("//div[@id='div_main']//p")
For Each EleP As HtmlNode In ElePs
'Here I need to get EleP.InnerText "normalized"
Next
If you look closely, the mess consists only of HTML comments, which should be ignored. Selecting just the text nodes and concatenating them with string.Join is enough:
C#
var text = string.Join("",htmlDoc.DocumentNode.SelectNodes("//text()[normalize-space()]").
Select(t=>t.InnerText));
VB.net
Dim text = String.Join("", From t In htmlDoc.DocumentNode.SelectNodes("//text()[normalize-space()]")
Select t.InnerText)
The HTML is valid; there is nothing wrong with it. It was just written by someone without a soul.
Based on your update, this should do:
Dim ElePs As HtmlNodeCollection = mWPage.DocumentNode.SelectNodes("//div[@id='div_main']//p")
For Each EleP As HtmlNode In ElePs
'Join only the text nodes under EleP; HTML comments are not text nodes, so they are skipped
Dim text = String.Join("", From t In EleP.SelectNodes(".//text()[normalize-space()]")
Select t.InnerText).Trim()
Next
Note the leading .// in the XPath: it searches only the descendants of the current node, unlike //, which always starts from the document root, even when called on a child node.
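For reference, here is a minimal self-contained sketch of the same idea, run against the sample markup from the question. The module name Demo and the helper VisibleText are hypothetical names introduced only for illustration, and the div/p wrapper around the sample is an assumption based on the XPath in the question. The sketch also guards against SelectNodes returning Nothing, which HtmlAgilityPack does when no node matches.

```vbnet
Imports System
Imports System.Linq
Imports HtmlAgilityPack

Module Demo
    ' Hypothetical helper: concatenates the text nodes below a node.
    ' HTML comments are comment nodes, not text nodes, so text() skips them.
    Function VisibleText(node As HtmlNode) As String
        ' ".//" restricts the search to descendants of this node.
        Dim textNodes = node.SelectNodes(".//text()[normalize-space()]")
        ' SelectNodes returns Nothing (not an empty collection) on no match.
        If textNodes Is Nothing Then Return String.Empty
        Return String.Join("", textNodes.Select(Function(t) t.InnerText)).Trim()
    End Function

    Sub Main()
        Dim doc As New HtmlDocument()
        ' Sample markup from the question, wrapped in the assumed div/p structure.
        doc.LoadHtml("<div id='div_main'><p>Name:<!--b>=</b--> Albert E<!--span-->instein s<!--i>Y</i-->ection: 3 room: -</p></div>")

        Dim paragraphs = doc.DocumentNode.SelectNodes("//div[@id='div_main']//p")
        If paragraphs Is Nothing Then Return

        For Each p As HtmlNode In paragraphs
            Console.WriteLine(VisibleText(p))
        Next
    End Sub
End Module
```

Joining with an empty separator matters here: the comments split words in the middle ("E" + "instein"), so inserting a space between text nodes would break them apart.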