HtmlAgilityPack .Load: connection is closed on some sites

.net html-agility-pack vb.net

Question

I have the following code, which works on some of the sites I have tried scraping, but fails on one particular site with the error "The underlying connection was closed: The connection was closed unexpectedly." Why would this work on some sites and not others? On siteA, for example, I get the last hyperlink in the div "wrapper", but on another site I just get the closed connection error. Please help.

Private Function getText() As String
    Dim web = New HtmlWeb()
    Dim html As HtmlDocument

    html = web.Load("http://some-website.com")
    ' Parentheses make last() apply to the whole node set, not to each parent's last <a>.
    Dim lastLink = html.DocumentNode.SelectSingleNode("(//div[@id='wrapper']//a)[last()]")

    If lastLink IsNot Nothing Then
        Return lastLink.InnerHtml
    Else
        Return "nothing found"
    End If

End Function

Protected Sub Page_Load(sender As Object, e As EventArgs) Handles Me.Load        
    label4.Text = getText()
End Sub

Accepted Answer

There are many possible reasons for that. Deferred JavaScript execution comes to mind, or an archaic kind of browser switch (the server responding differently depending on which client it thinks it is talking to). It might be useful to compare your browser's request headers with the ones HtmlAgilityPack sends.

The first thing I'd do is send the same user agent string as your browser:

Private Function getText() As String
    Dim web = New HtmlWeb()
    web.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36"
    Dim html As HtmlDocument

    html = web.Load("http://some-website.com")
    ' Parentheses make last() apply to the whole node set, not to each parent's last <a>.
    Dim lastLink = html.DocumentNode.SelectSingleNode("(//div[@id='wrapper']//a)[last()]")

    If lastLink IsNot Nothing Then
        Return lastLink.InnerHtml
    Else
        Return "nothing found"
    End If

End Function

Protected Sub Page_Load(sender As Object, e As EventArgs) Handles Me.Load        
    label4.Text = getText()
End Sub
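
If the user agent alone is not enough, the other request headers can be aligned with the browser as well. The following is only a sketch, assuming your HtmlAgilityPack version exposes the HtmlWeb.PreRequest hook (a callback that receives the HttpWebRequest before it is sent). The module name, the helper names AddBrowserHeaders and LoadLikeBrowser, and the Accept and Accept-Language values are illustrative assumptions; copy the real values from your own browser.

```vbnet
Imports System.Net
Imports HtmlAgilityPack

Module BrowserLikeRequest

    ' Hypothetical helper: copies a few browser-style headers onto the request.
    ' The values below are assumptions; replace them with what your browser sends.
    Public Function AddBrowserHeaders(request As HttpWebRequest) As Boolean
        request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
        request.Headers.Add("Accept-Language", "en-US,en;q=0.8")
        ' Returning True lets HtmlWeb continue with the request.
        Return True
    End Function

    ' Hypothetical helper: loads a page with the user agent from the answer
    ' plus the extra headers set in AddBrowserHeaders.
    Public Function LoadLikeBrowser(url As String) As HtmlDocument
        Dim web = New HtmlWeb()
        web.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36"
        web.PreRequest = AddressOf AddBrowserHeaders
        Return web.Load(url)
    End Function

End Module
```

In getText() this would replace the web.Load call with LoadLikeBrowser("http://some-website.com"). Whether any of these headers matters depends entirely on the target site.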

Your browser's developer tools (e.g. Chrome Developer Tools, Firebug) will show you the actual request headers it sends. A quick way to compare both clients is to fetch http://www.mybrowserinfo.com/ with each of them. If you have your own web server, just look at its logs. If none of this helps, dumping the network traffic is the brute-force option.
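
For a rough look at the .NET side before reaching for a traffic dump, the headers set on a request can be printed and compared by eye with what the browser tools show. This is a minimal sketch using only System.Net and the classic HttpWebRequest API; the module and function names (HeaderDump, DescribeHeaders) are made up for illustration. Headers that .NET adds only at send time, such as Host, will not appear in the output, so a real traffic dump remains the authoritative source.

```vbnet
Imports System
Imports System.Net
Imports System.Text

Module HeaderDump

    ' Formats the headers currently set on a request, one "Name: value" per line.
    ' Headers added by the framework at send time (e.g. Host) are not included.
    Public Function DescribeHeaders(request As HttpWebRequest) As String
        Dim sb = New StringBuilder()
        For Each name As String In request.Headers.AllKeys
            sb.AppendLine(name & ": " & request.Headers(name))
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        ' Creating the request object does not contact the server yet.
        Dim request = DirectCast(WebRequest.Create("http://some-website.com"), HttpWebRequest)
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36"
        Console.Write(DescribeHeaders(request))
    End Sub

End Module
```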



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow