Parse Complete Web Page

c# html-agility-pack parsing

Question

How can I parse a complete HTML web page, rather than specific nodes, using HTML Agility Pack or any other technique?

I am using the code below, but it only parses one kind of node (the links). I need to parse the complete page and get neat, clean content.

List<string> list = new List<string>();
string url = "https://www.google.com";
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
   list.Add(node.InnerText);
}

Accepted Answer

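To get all descendant text nodes, use something like this (Select requires using System.Linq):

var textNodes = doc.DocumentNode
                   .SelectNodes("//text()")
                   .Select(t => t.InnerText);

To get only the non-empty descendant text nodes:

var textNodes = doc.DocumentNode
                   .SelectNodes("//text()[normalize-space()]")
                   .Select(t => t.InnerText);

Note that SelectNodes returns null rather than an empty collection when nothing matches, so check the result before calling Select.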
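Putting the text-node query together, a minimal sketch of a "whole page to clean text" helper might look like the following. The class name PageText, the method Extract, and the sample markup are made up for illustration. Skipping text inside script and style elements and decoding entities with HtmlEntity.DeEntitize are assumptions about what "neat and clear contents" means, not something the answer itself prescribes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class PageText
{
    // Hypothetical helper: returns the non-empty text fragments of an HTML
    // document in document order, skipping script and style content.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//text()[normalize-space()]");
        if (nodes == null)
            return new List<string>();

        return nodes
            // Assumption: script/style text is not wanted in the output.
            .Where(t => t.ParentNode.Name != "script" && t.ParentNode.Name != "style")
            // Decode entities such as &amp; and trim surrounding whitespace.
            .Select(t => HtmlEntity.DeEntitize(t.InnerText).Trim())
            .Where(s => s.Length > 0)
            .ToList();
    }

    public static void Main()
    {
        // Inline sample markup so the sketch runs without network access.
        const string html =
            "<html><head><style>p{}</style></head>" +
            "<body><h1>Title</h1><p>Fish &amp; chips</p></body></html>";

        foreach (var line in Extract(html))
            Console.WriteLine(line);
    }
}
```

For a live page, the document loaded with web.Load(url) in the question can be queried the same way. Only the LoadHtml step differs.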

Popular Answer

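Use SelectNodes("//*"). The asterisk (*) is the XPath wildcard and matches any element: "//*" selects every element on the page, whereas a bare "*" selects only the element children of the node it is called on.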
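As a rough way to see what a wildcard expression matches, the sketch below counts the nodes returned for * and for //* on a small inline document. The sample markup and the Count helper are made up for illustration. In XPath, * matches the element children of the context node, while //* matches elements at any depth.

```csharp
using System;
using HtmlAgilityPack;

public static class WildcardDemo
{
    // Hypothetical helper: number of nodes an XPath expression selects
    // when evaluated from the document root.
    public static int Count(HtmlDocument doc, string xpath)
    {
        // SelectNodes returns null when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><div><p>one</p><p>two</p></div></body></html>");

        // Compare the two wildcard forms on the same document.
        Console.WriteLine("*   : " + Count(doc, "*"));
        Console.WriteLine("//* : " + Count(doc, "//*"));
    }
}
```

Either form returns elements rather than text, so InnerText on a parent element repeats the text of its children. The text() queries avoid that duplication.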




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow