How do I convert a website into plain text?

c# html-agility-pack regex

Question

I'm trying to convert a web page into plain text. However, whenever the page contains a table, the td and tr tags also end up in the output. If I strip those table tags with a regex, I lose part of the content.

The code is below.

string s = Regex.Replace(htmldoc, "<script.*?</script>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
s = Regex.Replace(s, "<!--.*?-->", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
s = Regex.Replace(s, "<style.*?style>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
s = Regex.Replace(s, "<a.*?a>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
s = Regex.Replace(s, "<img.*?img>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
s = Regex.Replace(s, "<table.*?table>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s);
s = doc.DocumentNode.SelectSingleNode("//body").InnerText.Trim();

Please take a look and tell me how I can get the table's content without the td and tr tags.

11/28/2017 6:49:32 PM

Accepted Answer

If you are already parsing the document with HTML Agility Pack, you do not need a regex to remove the table's HTML tags. There are several good examples here on SO of parsing tables with HTML Agility Pack, such as "Table parsing in the HTML Agility Pack".
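
As a rough illustration of that approach, here is a minimal sketch that reads the cell text straight from the parsed document instead of deleting the table with a regex. The class name TableText, the method name Extract, and the tab-separated output are assumptions made for this example and are not taken from the linked question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TableText
{
    // Hypothetical helper: returns one line per table row,
    // with the text of each cell separated by a tab.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//table//tr");
        if (rows == null) return string.Empty;

        var lines = new List<string>();
        foreach (var row in rows)
        {
            // Only the direct th/td children of this row.
            var cells = row.SelectNodes("th|td");
            if (cells == null) continue;

            // InnerText drops the tags; DeEntitize decodes entities such as &amp;.
            lines.Add(string.Join("\t",
                cells.Select(c => HtmlEntity.DeEntitize(c.InnerText).Trim())));
        }
        return string.Join(Environment.NewLine, lines);
    }
}
```

Calling TableText.Extract(htmldoc) on the raw HTML (before any regex has removed the table) should give the cell contents without td or tr tags. Note that the //table//tr expression also matches rows of nested tables, so a page with nested tables may need a narrower XPath.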

5/23/2017 11:51:59 AM

Popular Answer

Use the body's InnerText:

string html = @"
<html>
    <title>title</title>
    <body>
           <h1> The wheel.</h1>
           Stop reinventing the wheel ! Use powerful APIs 
           for manipulating html docs !
           <h3> I am fine </h3>
           <img src=""da_wheel_in_my_mind.png""/>
    </body>
</html>";

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;

You may then want to collapse newlines and runs of whitespace:

text = Regex.Replace(text, @"\s+", " ").Trim();

However, keep in mind that although this works here, InnerText turns markup like hello<br>world or hello<i>world</i> into helloworld: the tags are removed and no whitespace is put in their place. That problem is hard to solve in general, because presentation is often decided by CSS in combination with the markup.
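
One partial workaround for the missing whitespace described above is to walk the node tree yourself and insert a space around elements that usually start a new line or cell. The sketch below assumes a hand-picked, incomplete list of such tags (BreakTags) and ignores CSS entirely; the names PlainText, FromHtml, and Append are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class PlainText
{
    // Assumption: a small, incomplete set of tags treated as separators.
    private static readonly HashSet<string> BreakTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "br", "p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
            "li", "tr", "td", "th"
        };

    public static string FromHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Fall back to the whole document if there is no body element.
        var root = doc.DocumentNode.SelectSingleNode("//body") ?? doc.DocumentNode;

        var sb = new StringBuilder();
        Append(root, sb);

        // Collapse newlines and repeated spaces, as in the answer above.
        return Regex.Replace(sb.ToString(), @"\s+", " ").Trim();
    }

    private static void Append(HtmlNode node, StringBuilder sb)
    {
        foreach (var child in node.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Comment) continue;

            if (child.NodeType == HtmlNodeType.Text)
            {
                sb.Append(HtmlEntity.DeEntitize(child.InnerText));
                continue;
            }

            // Skip content that is never shown as page text.
            if (child.Name == "script" || child.Name == "style") continue;

            bool isBreak = BreakTags.Contains(child.Name);
            if (isBreak) sb.Append(' ');
            Append(child, sb);   // recurse into the element's children
            if (isBreak) sb.Append(' ');
        }
    }
}
```

With this sketch, hello<br>world should come out as hello world, while hello<i>world</i> stays helloworld because i is not in the list. Since td, th, and tr are in the list, table cells are separated by spaces as well. Pages that use CSS to change how elements are displayed will still not be handled correctly.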



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow