How to get URLs on page with HTMLAgilityPack, when the Source does not contain the URLs?

c# html html-agility-pack

Question

I am trying to scrape the KB Urls from this page: https://support.microsoft.com/en-us/kb/894199

On the page, there are URLs such as: https://support.microsoft.com/kb/2976978

If you open up the developer tools in Chrome, it shows that data is contained like this:

<div class="indent">
<a id="kb-link-142" href="https://support.microsoft.com/kb/2976978" target="_self">https://support.microsoft.com/kb/2976978</a>
</div>

Now, based on the above HTML, I believe I should be able to scrape the URLs from the href attribute like this:

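// SelectNodes returns null (not an empty collection) when nothing matches.
var links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
   foreach (HtmlNode link in links)
      list.Add(link.GetAttributeValue("href", string.Empty));
}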

The problem I am running into, though, is that the HTML source I download differs from what the developer tools show. Even though the developer tools show the above HTML on the page, if you right-click the page and choose View source, the HTML shown at that point is totally different and does not contain any of the URLs that the rendered page displays.

My theory is that there's some kind of file reference where the HTML loads a file somewhere and the file contains the details of the page that is rendered. So how can I use HTMLAgilityPack to get the URLs that are on the rendered page, since the source doesn't seem to contain them?

Also - I realize my question Title may be really confusing. If there is a technical term for what this page is doing/how it works, let me know and I can update the title so it is more logical and others can search it out in the future.

Popular Answer

Okay, I see the problem now. This page uses AngularJS directives and bindings, and the hrefs are filled in after the page has loaded. The HTML we download is the page as the server sends it, before a browser has parsed it or executed any scripts. This means that changes made afterwards by DOM manipulation, JavaScript, or AJAX are not included in the HtmlDocument we get back. I think the way to go about this would be to make the request the way a web browser would, let the JavaScript and AJAX execute completely, and then fetch the content, as advised here. Hope this helps!
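To make the "let the browser run the scripts first" idea concrete, here is a minimal sketch of one way it might look. It is not necessarily what the linked advice does: it assumes a headless Chrome driven through Selenium WebDriver (the Selenium.WebDriver and Selenium.Support NuGet packages, plus a local Chrome install) and then hands the rendered markup to HtmlAgilityPack. The class and method names (RenderedPageScraper, ExtractHrefs, GetRenderedHrefs), the 30-second timeout, and the "kb-link-" id prefix used as the "page is ready" signal are illustrative assumptions; the prefix is taken from the snippet in the question and the live page may no longer use it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

// Hypothetical helper class, not part of HtmlAgilityPack or Selenium.
public static class RenderedPageScraper
{
    // Parses an HTML string and returns every non-empty href on an <a> element.
    public static List<string> ExtractHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when no node matches the XPath.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return new List<string>();
        }

        return anchors
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .Where(href => href.Length > 0)
            .ToList();
    }

    // Loads the URL in headless Chrome, waits for the script-generated
    // links to appear, then parses the rendered DOM.
    public static List<string> GetRenderedHrefs(string url)
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless");

        using (IWebDriver driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl(url);

            // Assumption: the generated links have ids starting with "kb-link-",
            // as in the question. Until throws if none appear within 30 seconds.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(30));
            wait.Until(d => d.FindElements(By.CssSelector("a[id^='kb-link-']")).Count > 0);

            // PageSource reflects the DOM after the scripts have run.
            return ExtractHrefs(driver.PageSource);
        }
    }

    public static void Main()
    {
        foreach (var href in GetRenderedHrefs("https://support.microsoft.com/en-us/kb/894199"))
        {
            Console.WriteLine(href);
        }
    }
}
```

The parsing step is the same XPath query as in the question; the only real change is where the HTML comes from. If launching a browser is too heavy, another option is to watch the Network tab in the developer tools for the request that actually delivers the link data and call that endpoint directly, though whether this page exposes one is something you would have to check.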



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why