The following code works on some of the sites I have tried scraping, but on one particular site it fails with the error "The underlying connection was closed: The connection was closed unexpectedly." Why would this work on some sites and not others? On siteA, for example, I get the last hyperlink in the div "wrapper", but on another site I just get the closed connection. Please help.
    Private Function getText() As String
        Dim web = New HtmlWeb()
        Dim html As HtmlDocument
        html = web.Load("http://some-website.com")
        Dim lastLink = html.DocumentNode.SelectSingleNode("//div[@id='wrapper']//a[last()]")
        If lastLink IsNot Nothing Then
            Return lastLink.InnerHtml
        Else
            Return "nothing found"
        End If
    End Function

    Protected Sub Page_Load(sender As Object, e As EventArgs) Handles Me.Load
        label4.Text = getText()
    End Sub
There are many possible reasons for that. Deferred JavaScript execution comes to mind, or an archaic kind of browser switch. It might be useful to compare your browser's request headers with the ones sent by HtmlAgilityPack.
The first thing I'd do is send the same user agent string as your browser:
    Private Function getText() As String
        Dim web = New HtmlWeb()
        web.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36"
        Dim html As HtmlDocument
        html = web.Load("http://some-website.com")
        Dim lastLink = html.DocumentNode.SelectSingleNode("//div[@id='wrapper']//a[last()]")
        If lastLink IsNot Nothing Then
            Return lastLink.InnerHtml
        Else
            Return "nothing found"
        End If
    End Function

    Protected Sub Page_Load(sender As Object, e As EventArgs) Handles Me.Load
        label4.Text = getText()
    End Sub
Your browser can show you the actual request headers it sends (e.g. via Chrome Developer Tools or Firebug). A quick way to compare the two clients is to fetch http://www.mybrowserinfo.com/ with each of them. If you have your own web server, just look at its logs. If that doesn't help, dumping the traffic would be the brute-force option.
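If the user agent alone is not enough, one way to experiment with further headers is to make the request yourself with HttpWebRequest and hand the response stream to HtmlAgilityPack. The following is only a rough sketch: the helper name LoadWithBrowserHeaders is made up for illustration, the header values are placeholders you would replace with what your own browser sends, and there is no guarantee that this particular site is rejecting you because of headers at all.

```vbnet
Imports System
Imports System.Diagnostics
Imports System.Net
Imports HtmlAgilityPack

Module HeaderCheck
    ' Hypothetical helper (not part of HtmlAgilityPack): fetches a page with
    ' browser-like headers and returns the parsed document, or Nothing if
    ' the request fails.
    Function LoadWithBrowserHeaders(url As String) As HtmlDocument
        Dim request = DirectCast(WebRequest.Create(url), HttpWebRequest)

        ' Placeholder values: copy the real ones from your browser's developer tools.
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36"
        request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
        request.Headers.Add("Accept-Language", "en-US,en;q=0.8")

        ' Let the framework decompress gzip/deflate responses transparently.
        request.AutomaticDecompression = DecompressionMethods.GZip Or DecompressionMethods.Deflate

        Try
            Using response = DirectCast(request.GetResponse(), HttpWebResponse)
                Using stream = response.GetResponseStream()
                    Dim doc = New HtmlDocument()
                    doc.Load(stream)
                    Return doc
                End Using
            End Using
        Catch ex As WebException
            ' ex.Status tells a dropped connection (ConnectionClosed) apart from
            ' an HTTP error status (ProtocolError), which narrows down the cause.
            Debug.WriteLine(String.Format("Request failed: {0} ({1})", ex.Status, ex.Message))
            Return Nothing
        End Try
    End Function
End Module
```

In getText() you would then call LoadWithBrowserHeaders("http://some-website.com") in place of web.Load(...), check the result for Nothing, and run the same SelectSingleNode query on its DocumentNode. Adding headers one at a time makes it easier to see which one, if any, the site cares about.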