How to get dynamically loaded content using HtmlAgilityPack

c# html-agility-pack

Question

Using HtmlAgilityPack, I am trying to extract some HTML from our central bank's website.

The page is a weekly account, here. The line "A. Gold Coin and Bullion" appears in the second part of the statement, "An Account according to the Bangladesh Bank Order 1972....."

I tried the code below:

using System;
using System.IO;
using HtmlAgilityPack;

var get = new HtmlWeb();
for (int i = 1; i < 8284; i++)
{
    // Download and parse the statement page for this prId.
    var dat = get.Load("https://www.bb.org.bd/pub/weekly/staffair/state_affairs.php?prId=" + i);
    var htm = dat.DocumentNode.InnerHtml;
    // Save the page only if it contains the line I am looking for.
    if (htm.Contains("Gold Coin and Bullion"))
    {
        File.WriteAllText(@"C:\Test\" + i + ".txt", htm);
        Console.WriteLine(i + " written");
    }
}

I cannot see the line "A. Gold Coin and Bullion" if I right-click the page and choose "View source". dat.DocumentNode.InnerHtml gives the same result, so no file is written to the Test folder. But I can see all the information if I choose "Inspect element" instead of "View source".

How can I get that line with HtmlAgilityPack?

2/1/2017 3:31:36 PM

Accepted Answer

The information you're looking for is loaded by JavaScript (XHR) in your browser after the initial page download, which is why you can't see it in the page source. HtmlAgilityPack is just an HTML parser; it does not load extra resources or execute JavaScript. There are alternative approaches, but you'd need a different tool. I think a decent place to start is:

Load a DOM and use .NET to run server-side JavaScript
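
To see why the parser alone cannot return that line, here is a minimal sketch. The markup is a made-up stand-in for the real page (the id "statement" and the inline script are assumptions for illustration, not what the bank's site actually serves): the div is empty in the downloaded source, and only a browser running the script would fill it.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class StaticParseDemo
{
    // Hypothetical markup standing in for the real page: the <div> is empty in
    // the downloaded source and is only filled in by the script in a browser.
    private const string Html = @"<html><body>
<div id=""statement""></div>
<script>
  document.getElementById('statement').innerHTML = '<p>A. Gold Coin and Bullion</p>';
</script>
</body></html>";

    // Returns true when the parsed <div> has no content, i.e. the script was not run.
    public static bool PlaceholderIsEmpty()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html); // parses the string only; nothing is executed

        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div[@id='statement']");
        return div != null && div.InnerHtml.Trim().Length == 0;
    }

    public static void Main()
    {
        Console.WriteLine("Placeholder empty after parsing: " + PlaceholderIsEmpty());
    }
}
```

This is the same situation as "View source" versus "Inspect element": the parser sees the document as downloaded, while the browser's inspector shows it after the scripts have run. To get the generated content you would need something that executes the JavaScript, or you would have to request the data the script loads yourself.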

5/23/2017 12:33:34 PM


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow