How to get dynamically loaded content using HtmlAgilityPack

c# html-agility-pack

Question

I am trying to extract some HTML from our central bank's website using HtmlAgilityPack.

Here is a Weekly Account. The second half of the statement, "An Account pursuant to the Bangladesh Bank Order 1972 .....", contains the line "A. Gold Coin and Bullion".

I've tried the following code:

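using System;
using System.IO;
using HtmlAgilityPack;

var get = new HtmlWeb();
for (int i = 1; i < 8284; i++)
{
    // Download and parse the statement page for this prId.
    var dat = get.Load("https://www.bb.org.bd/pub/weekly/staffair/state_affairs.php?prId=" + i);
    var htm = dat.DocumentNode.InnerHtml;
    // Save the page only if it contains the line of interest.
    if (htm.Contains("Gold Coin and Bullion"))
    {
        File.WriteAllText(Path.Combine(@"C:\Test", i + ".txt"), htm);
        Console.WriteLine(i + " written");
    }
}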

If I right-click the page and choose "View source", I do not see the line "A. Gold Coin and Bullion". dat.DocumentNode.InnerHtml returns the same markup, so no file is written to the Test folder. However, I can see all the information if I choose "Inspect element" instead of "View source".

How can I get that line using HtmlAgilityPack?

Accepted Answer

You cannot see it in the source because the data you're looking for is loaded via JavaScript (XHR) in your browser after the initial page download. HtmlAgilityPack is just an HTML parser; it does not run JavaScript or load additional resources. There are other ways to do this, but you would need to use another tool. This is probably a good place to start:

Load a DOM and Execute javascript, server side, with .Net
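
As one possible illustration of "another tool", here is a minimal sketch that lets a headless browser run the page's JavaScript and then hands the rendered markup to HtmlAgilityPack. Selenium WebDriver is my choice for the example, not something the linked answer prescribes. The sketch assumes the Selenium.WebDriver, Selenium.Support and HtmlAgilityPack NuGet packages and a local Chrome installation. The prId value, the 15 second timeout and the XPath are placeholders, and it also assumes the text is injected into the main document rather than into an iframe.

```csharp
// Sketch only: Selenium is an assumed tool choice, not part of the original answer.
using System;
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

class RenderedPageExample
{
    static void Main()
    {
        // prId=1 is an arbitrary sample value taken from the question's loop range.
        var url = "https://www.bb.org.bd/pub/weekly/staffair/state_affairs.php?prId=1";

        var options = new ChromeOptions();
        options.AddArgument("--headless"); // run Chrome without a visible window

        using (var driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl(url);

            try
            {
                // Poll until the script-loaded text shows up in the rendered DOM.
                // The 15 second limit is an assumption; tune it for the site.
                var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(15));
                wait.Until(d => d.PageSource.Contains("Gold Coin and Bullion"));
            }
            catch (WebDriverTimeoutException)
            {
                Console.WriteLine("Text did not appear before the timeout.");
                return;
            }

            // PageSource now holds the DOM after JavaScript ran, so
            // HtmlAgilityPack can parse it like any other HTML string.
            var doc = new HtmlDocument();
            doc.LoadHtml(driver.PageSource);

            // Hypothetical query: any element whose own text contains the phrase.
            var node = doc.DocumentNode.SelectSingleNode(
                "//*[contains(text(), 'Gold Coin and Bullion')]");
            Console.WriteLine(node?.InnerText.Trim() ?? "Node not found");
        }
    }
}
```

Starting a browser for each of several thousand pages is slow, so if you go this route it is probably worth reusing one driver instance across the loop. If the browser's network tab shows the XHR request that fetches the data, calling that URL directly may be a lighter alternative.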




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow