I have a requirement to extract all the text that is present in the
<body> of the html. Sample Html input :-
<html> <title>title</title> <body> <h1> This is a big title.</h1> How are doing you? <h3> I am fine </h3> <img src="abc.jpg"/> </body> </html>
The output should be :-
This is a big title. How are doing you? I am fine
I want to use only HtmlAgility for this purpose. No regular expressions please.
I know how to load HtmlDocument and then using xquery like '//body' we can get body contents. But how do I strip the html as I have shown in output?
Thanks in advance :)
You can use the body's
string html = @" <html> <title>title</title> <body> <h1> This is a big title.</h1> How are doing you? <h3> I am fine </h3> <img src=""abc.jpg""/> </body> </html>"; HtmlDocument doc = new HtmlDocument(); doc.LoadHtml(html); string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;
Next, you may want to collapse spaces and new lines:
text = Regex.Replace(text, @"\s+", " ").Trim();
Note, however, that while it is working in this case, markup such as
hello<i>world</i> will be converted by
helloworld - removing the tags. It is difficult to solve that issue, as display is ofter determined by the CSS, not just by the markup.
How about using the XPath expression
'//body//text()' to select all text nodes?