How to get URLs on a page with HTMLAgilityPack when the page source does not contain the URLs?

c# html html-agility-pack

Question

I'm trying to scrape the KB URLs from this page: https://support.microsoft.com/en-us/kb/894199

There are URLs like https://support.microsoft.com/kb/2976978 on the page.

The developer tools in Chrome show that data is organized as follows:

<div class="indent">
<a id="kb-link-142" href="https://support.microsoft.com/kb/2976978" target="_self">https://support.microsoft.com/kb/2976978</a>
</div>

Based on the HTML shown above, I think I should be able to scrape the URLs from the href attribute like this:

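// SelectNodes returns null (not an empty collection) when nothing matches.
var links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
    foreach (HtmlNode link in links)
    {
       list.Add(link.GetAttributeValue("href", string.Empty));
    }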

The issue I'm having, however, is that the content is different when I download the HTML source. Although the developer tools show the HTML above as being present on the page, if you right-click the page and choose to view its source, you get a completely different HTML document that contains none of the URLs the rendered page displays.

My guess is that the HTML loads a file from elsewhere, and that file supplies the data used to build the page. Since the source doesn't seem to include the URLs that appear on the rendered page, how can I use HTMLAgilityPack to get them?

I am aware that the title of my question may be unclear. If there is a technical term for what this website is doing or how it operates, please let me know so I can edit the title to make it clearer and easier for people to find in the future.

2/21/2016 7:54:11 PM

Popular Answer

Okay, I see the issue now. This page uses AngularJS directives and bindings, so the hrefs are filled in after the initial page load. What we download is the raw response from the server, without any script processing or execution. The HtmlDocument will not include changes made to the page by DOM manipulation, JavaScript, or Ajax calls. I believe the best course of action is to mimic a web browser request, let the JavaScript and Ajax calls run to completion, and then read the resulting content, as suggested here. Hope this is useful!
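To make the "mimic a web browser" idea concrete, here is a minimal sketch of one way it could be done. It assumes Selenium WebDriver with Chrome as the browser automation tool (the answer does not name one, so this is my substitution), plus the Selenium.WebDriver, Selenium.Support and HtmlAgilityPack NuGet packages and a local Chrome install. The class and method names (KbLinkScraper, ExtractLinks, ScrapeRendered), the "div.indent a[href]" wait condition and the 30-second timeout are all made up for illustration, based on the markup shown in the question.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

// Hypothetical helper class; not part of HtmlAgilityPack or Selenium.
public static class KbLinkScraper
{
    // Parsing step: pull every href out of HTML that has already been rendered.
    public static List<string> ExtractLinks(string renderedHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(renderedHtml);

        var urls = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return urls;
        }

        foreach (HtmlNode anchor in anchors)
        {
            urls.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return urls;
    }

    // Rendering step: let a real browser run the page's JavaScript first,
    // then hand the resulting DOM to HtmlAgilityPack.
    public static List<string> ScrapeRendered(string pageUrl)
    {
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl(pageUrl);

            // Assumption: the KB links show up inside <div class="indent">,
            // as in the question's markup. The 30 s timeout is arbitrary.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(30));
            wait.Until(d => d.FindElements(By.CssSelector("div.indent a[href]")).Count > 0);

            // PageSource reflects the DOM after the scripts have run.
            return ExtractLinks(driver.PageSource);
        }
    }
}
```

Calling KbLinkScraper.ScrapeRendered("https://support.microsoft.com/en-us/kb/894199") would then return every href on the rendered page, so you would probably still want to filter the list down to the /kb/ links. Keeping the parsing in its own method means the XPath part can be checked against a saved HTML string without launching a browser.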

5/23/2017 12:31:08 PM


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow