I need to extract all of the text contained in the <body> of a page that uses HTML. Example HTML input:
<html> <title>title</title> <body> <h1> This is a big title.</h1> How are doing you? <h3> I am fine </h3> <img src="abc.jpg"/> </body> </html>
The final result should be:
This is a big title. How are doing you? I am fine
I want to use only HtmlAgilityPack for this. No regular expressions, please.
I know how to load an HtmlDocument and then get the contents of the body with an XPath query. But how do I strip the HTML, as I have done in the output above?
Thank you in advance.
You can use the body's InnerText:
    // requires: using HtmlAgilityPack;
    string html = @"
    <html>
      <title>title</title>
      <body>
        <h1> This is a big title.</h1>
        How are doing you?
        <h3> I am fine </h3>
        <img src=""abc.jpg""/>
      </body>
    </html>";

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    // InnerText concatenates every text node under <body>, dropping the tags.
    string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;
Then, you may wish to collapse new lines and spaces:
    // requires: using System.Text.RegularExpressions;
    text = Regex.Replace(text, @"\s+", " ").Trim();
However, keep in mind that although this works here, markup like hello<br>world or hello<i>world</i> will be turned into helloworld by InnerText, which simply removes the tags. That problem is hard to solve in general, since how text is displayed is often determined by CSS as well as by the markup.
How about using the XPath expression //body//text() to select every text node?
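
A minimal sketch of the text-node approach, assuming the XPath //body//text() and the HtmlAgilityPack NuGet package. The class and method names (BodyText, Extract) are hypothetical and chosen only for illustration. It joins the text nodes with a space and collapses whitespace with String.Split rather than a regular expression:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class BodyText
{
    // Hypothetical helper (not part of HtmlAgilityPack): returns the text
    // under <body> with whitespace collapsed, using no regular expressions.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var textNodes = doc.DocumentNode.SelectNodes("//body//text()");
        if (textNodes == null)
            return string.Empty;

        // Join the text nodes with a space so that "hello<br>world" does not
        // collapse into "helloworld"; DeEntitize decodes entities such as &amp;.
        string joined = string.Join(
            " ", textNodes.Select(n => HtmlEntity.DeEntitize(n.InnerText)));

        // A null separator array makes Split break on any whitespace;
        // RemoveEmptyEntries drops the gaps left by runs of spaces and newlines.
        string[] words = joined.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }
}
```

Calling BodyText.Extract on the sample HTML from the question should give "This is a big title. How are doing you? I am fine". The trade-off is that inline markup such as hello<i>world</i> also gets a space inserted, and any text inside <script> or <style> elements within the body is included, so this is a starting point rather than a full rendering-aware solution.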