HTMLAgilityPack .load connection is closed on some sites

.net html-agility-pack vb.net

Question

I've been scraping information from a few websites using the code below, but on one site in particular it fails with the error message "The underlying connection was closed: The connection was closed abruptly." Why does this work on some websites but not on others? For instance, when I run it against siteA, I get the last hyperlink in the div "wrapper". On another website, though, I only get a closed connection. Any help would be appreciated.

Private Function getText() As String
    Dim web = New HtmlWeb()
    Dim html As HtmlDocument

    html = web.Load("http://some-website.com")
    Dim lastLink = html.DocumentNode.SelectSingleNode("//div[@id='wrapper']//a[last()]")

    If lastLink IsNot Nothing Then
        Return lastLink.InnerHtml
    Else
        Return "nothing found"
    End If

End Function

Protected Sub Page_Load(sender As Object, e As EventArgs) Handles Me.Load        
    label4.Text = getText()
End Sub
7/28/2017 7:01:45 PM

Accepted Answer

There are several possible explanations. Deferred JavaScript execution or an old-fashioned browser check (the server responding differently depending on which client it thinks is calling) come to mind. Comparing the request headers your browser sends with those sent by HtmlAgilityPack may help.

I'd start by using the same user agent string:

Private Function getText() As String
    Dim web = New HtmlWeb()
    web.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36"
    Dim html As HtmlDocument

    html = web.Load("http://some-website.com")
    Dim lastLink = html.DocumentNode.SelectSingleNode("//div[@id='wrapper']//a[last()]")

    If lastLink IsNot Nothing Then
        Return lastLink.InnerHtml
    Else
        Return "nothing found"
    End If

End Function

Protected Sub Page_Load(sender As Object, e As EventArgs) Handles Me.Load        
    label4.Text = getText()
End Sub

Your browser's developer tools (e.g. Chrome Developer Tools, Firebug) should show you the request headers it sends. A simple way to compare the two clients is to fetch http://www.mybrowserinfo.com/ with each of them. If you run your own web server, check its logs instead. If none of this helps, capturing the network traffic is the brute-force approach.
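If the user agent alone is not enough, the other headers your browser sends can be copied onto the request as well. The following is a minimal sketch, not a confirmed fix: it assumes HtmlAgilityPack's HtmlWeb.PreRequest hook, the names HeaderSketch, AddBrowserHeaders and LoadLikeBrowser are made up for illustration, and the Accept and Accept-Language values are placeholders that you would replace with whatever your own browser's developer tools show.

```vbnet
Imports System.Net
Imports HtmlAgilityPack

Module HeaderSketch

    ' Hypothetical helper: copies a few browser-like headers onto the request.
    ' The header values are placeholders; use the ones your browser really sends.
    Function AddBrowserHeaders(request As HttpWebRequest) As Boolean
        request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
        request.Headers.Add("Accept-Language", "en-US,en;q=0.8")
        ' Returning True tells HtmlWeb to go ahead and send the request.
        Return True
    End Function

    ' Hypothetical helper: loads a page with the user agent from the answer above
    ' plus the extra headers set in AddBrowserHeaders.
    Function LoadLikeBrowser(url As String) As HtmlDocument
        Dim web = New HtmlWeb()
        web.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36"
        ' PreRequest is invoked just before HtmlWeb sends the HTTP request.
        web.PreRequest = AddressOf AddBrowserHeaders
        Return web.Load(url)
    End Function

End Module
```

In getText() you would then call LoadLikeBrowser("http://some-website.com") instead of web.Load(...). Adding the headers one at a time makes it easier to see which one, if any, the server actually cares about.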

5/23/2014 9:06:45 PM



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow