How can I take just the text from an HTML page?

c# html-agility-pack

Question

I need to extract all the text contained in the <body> of an HTML document. Example HTML input:

<html>
    <title>title</title>
    <body>
           <h1> This is a big title.</h1>
           How are doing you?
           <h3> I am fine </h3>
           <img src="abc.jpg"/>
    </body>
</html>

The final result should be:

This is a big title. How are doing you? I am fine

I want to use only HTML Agility Pack for this. No regular expressions, please.

I know how to load an HTML document and then get the contents of the body with an XPath query. But how do I strip the HTML, as I have done in the output above?

Thank you in advance.

5/1/2011 9:46:55 AM

Accepted Answer

Use the body's InnerText:

string html = @"
<html>
    <title>title</title>
    <body>
           <h1> This is a big title.</h1>
           How are doing you?
           <h3> I am fine </h3>
           <img src=""abc.jpg""/>
    </body>
</html>";

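// Requires: using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// InnerText concatenates all the text beneath <body> and drops the tags.
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;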

You may then want to collapse newlines and runs of spaces:

// Requires: using System.Text.RegularExpressions;
text = Regex.Replace(text, @"\s+", " ").Trim();
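
Since the question asks to avoid regular expressions, the same whitespace collapsing can be sketched with plain string methods. This is only an illustration: the class and method names (Whitespace, Collapse) are hypothetical and not part of HTML Agility Pack.

```csharp
using System;

public static class Whitespace
{
    // Hypothetical helper: collapses every run of whitespace into one space
    // and removes leading and trailing whitespace.
    public static string Collapse(string text)
    {
        // A null separator array makes Split treat all whitespace characters
        // as delimiters; RemoveEmptyEntries drops the empty pieces that
        // appear between consecutive delimiters.
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }
}
```

Calling Whitespace.Collapse(text) on the InnerText result should then give the same single-line string as the Regex version.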

However, keep in mind that although this works for the example, markup such as hello<br>world or hello<i>world</i> will be converted by InnerText to helloworld, with the tags simply removed. That problem is hard to solve in general, because how text is displayed often depends on CSS as well as on the markup.

5/1/2011 9:59:22 AM

Popular Answer

How about using the XPath expression '//body//text()' to select all the text nodes?
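
The XPath suggestion could be sketched roughly as follows. This is an illustrative sketch, not code from either answer: the class and method names (BodyText, Extract) are hypothetical, and joining the text nodes with a single space is an assumption about the desired output.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class BodyText
{
    // Hypothetical helper: collects the non-blank text nodes under <body>
    // and joins them with single spaces.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//body//text()");
        if (nodes == null) return string.Empty;

        var parts = new List<string>();
        foreach (HtmlNode node in nodes)
        {
            // Trim each node and skip the whitespace-only nodes that sit between tags.
            string piece = node.InnerText.Trim();
            if (piece.Length > 0) parts.Add(piece);
        }
        return string.Join(" ", parts);
    }
}
```

With this approach hello<br>world becomes "hello world", but hello<i>world</i> also becomes "hello world", so it trades one inaccuracy for another. Whitespace inside a single text node is left as-is and may still need collapsing.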



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow