HtmlAgilityPack & Selenium Webdriver returns random results

c# html-agility-pack selenium-webdriver web-crawler web-scraping

Question

I'm trying to scrape product names from a website. Oddly, I seem to only scrape random 12 items. I've tried both HtmlAgilityPack and with HTTPClient and I get the same random results. Here's my code for HtmlAgilityPack:

using HtmlAgilityPack;
using System.Net.Http;

var url = @"http://www.roots.com/ca/en/men/tops/shirts-and-polos/";
HtmlWeb web = new HtmlWeb();
var doc = web.Load(url, "GET", proxy, new NetworkCredential(PROXY_UID, PROXY_PWD, PROXY_DMN));
var nodes = doc.DocumentNode.Descendants("div")
            .Where(div => div.GetAttributeValue("class", string.Empty) == "product-name")
            .Select(div => div.InnerText.Trim())
            ;

[UPDATE 1] @CodingKuma suggested I try Selenium Webdriver. Here's my code using Selenium Webdriver:

IWebDriver chromeDriver = new ChromeDriver(@"C:\TEMP\Projects\Chrome\chromedriver_win32");
chromeDriver.Url = "http://www.roots.com/ca/en/men/tops/shirts-and-polos/";
var items = chromeDriver.FindElements(By.ClassName("product-name"));
items.Count().Dump();
chromeDriver.Quit();

I tried this code but still no luck. There are over 20 items on that page, but I seem to only get a random 12. How can I scrape all items on that site?

Expert Answer

Since the v1.5.0-beta92,

HtmlAgilityPack has a FromBrowser method that allows you to wait until all elements you want are ready.

Documentation: http://html-agility-pack.net/from-browser

string url = "http://html-agility-pack/from-browser";

var web1 = new HtmlWeb();
var doc1 = web1.LoadFromBrowser(url, o =>
{
    var webBrowser = (WebBrowser) o;

    // WAIT until the dynamic text is set
    return !string.IsNullOrEmpty(webBrowser.Document.GetElementById("uiDynamicText").InnerText);
});
var t1 = doc1.DocumentNode.SelectSingleNode("//div[@id='uiDynamicText']").InnerText

var web2 = new HtmlWeb();
var doc2 = web2.LoadFromBrowser(url, html =>
{
    // WAIT until the dynamic text is set
    return !html.Contains("<div id=\"uiDynamicText\"></div>");
});
var t2 = doc2.DocumentNode.SelectSingleNode("//div[@id='uiDynamicText']").InnerText

Console.WriteLine("Text 1: " + t1);
Console.WriteLine("Text 2: " + t2);

The trick here is to find something that tells you when the page is ready since it's impossible for the library to know.


Popular Answer

So there are a couple issues that prevent the count from being correct.

  1. The page has a lazy loader. You have to scroll down to trigger the load of the items over 12.

  2. The page uses AJAX calls to load the items over 12.

So, you need to navigate to the page, scroll to the bottom of the page, wait for AJAX to complete, and then scrape the page. The code below is tested and returns 20 items.

The script

String url = "http://www.roots.com/ca/en/men/tops/shirts-and-polos/";
driver.navigate().to(url);
JavascriptExecutor js = ((JavascriptExecutor) driver);
int height = 1;
int lastHeight = 0;
while (lastHeight != height)
{
    lastHeight = height;
    js.executeScript("window.scrollTo(0, document.body.scrollHeight);");
    height = (int) (long) js.executeScript("return document.body.scrollHeight;");
}

waitForJSandJQueryToLoad(10);

List<WebElement> products = driver.findElements(By.cssSelector("div.product-name"));
System.out.println(products.size());
for (WebElement e : products)
{
    System.out.println(e.getText());
}

Support function

public boolean waitForJSandJQueryToLoad(int timeOut)
{
    WebDriverWait wait = new WebDriverWait(driver, timeOut);

    ExpectedCondition<Boolean> jQueryIsLoaded = new ExpectedCondition<Boolean>()
    {
        @Override
        public Boolean apply(WebDriver driver)
        {
            return (Boolean) ((JavascriptExecutor) driver).executeScript("return (window.jQuery != null) && (jQuery.active === 0);");
        }
    };

    ExpectedCondition<Boolean> jsIsLoaded = new ExpectedCondition<Boolean>()
    {
        @Override
        public Boolean apply(WebDriver driver)
        {
            return (Boolean) ((JavascriptExecutor) driver).executeScript("return document.readyState == 'complete'");
        }
    };

    return wait.until(jQueryIsLoaded) && wait.until(jsIsLoaded);
}

Output

20
Rideau Flannel Shirt
Westridge Denim Shirt
Rideau Flannel Shirt
Riverside Plaid Shirt
Riverside Plaid Shirt
Heritage Peppered Polo
Heritage Peppered Polo
Heritage Peppered Polo
Cedar Jersey Polo
Cedar Jersey Polo
Hope River Shirt
Hawthorne Surplus Shacket
Acadian Linen Shirt
Camp Short Sleeve Shirt
Foxley Short Sleeve Shirt
Heritage Peppered Polo
Foxley Short Sleeve Shirt
Waterway Indigo Shirt
Waterway Indigo Shirt
Resolute Flannel Shirt



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why