VB.net: extract links from a Google search using HtmlAgilityPack

google-search html-agility-pack vb.net

I have now updated my code. As a test, I want to list every URL that contains index.php, but the output also includes other content.

Here is my working code:

Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click

    Dim webClient As New System.Net.WebClient
    Dim WebSource As String = webClient.DownloadString("http://www.google.com/search?lr=&cr=countryCA&newwindow=1&hl=fil&as_qdr=all&biw=1366&bih=667&tbs=ctr%3AcountryCA&q=index.php&oq=index.php&gs_l=serp.12..0l10.520034.522335.0.525032.9.9.0.0.0.0.497.3073.1j1j2j0j5.9.0....0...1c.1.25.serp..5.4.884.J4smY262XgY")
    RichTextBox1.Text = WebSource

    ListBox1.Items.Clear()


    Dim htmlDoc As New HtmlAgilityPack.HtmlDocument()
    htmlDoc.LoadHtml(WebSource)

    For Each link As HtmlNode In htmlDoc.DocumentNode.SelectNodes("//cite")

        If link.InnerText.Contains("index.php") Then
            ListBox1.Items.Add(link.InnerText)
        End If

    Next

End Sub

The expected output should contain only the sites that have index.php in their address.

The problem is that each result stops at index.php and the rest of the link is left out. Instead of the full URL, the program shows only the part up to index.php, or an address that is broken up by dots.

Accepted answer

I would use Html Agility Pack to extract the links, as follows:

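' Requires "Imports HtmlAgilityPack" at the top of the file.
Dim links As New List(Of String)()
Dim htmlDoc As New HtmlAgilityPack.HtmlDocument()
htmlDoc.LoadHtml(WebSource)

' SelectNodes returns Nothing when no node matches, so check before looping.
Dim anchors As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//a[@href]")
If anchors IsNot Nothing Then
    For Each link As HtmlNode In anchors
        Dim att As HtmlAttribute = link.Attributes("href")
        If att.Value.Contains("/forums/") Then
            links.Add(att.Value)
        End If
    Next
End If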

For Google search results, try the same loop with the XPath expression and the filter string adjusted to the markup of the results page.
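One possible way to get the full address, rather than shortened display text, is to work from the href values that the loop over "//a[@href]" collects. The sketch below assumes that Google's plain HTML results wrap each target in a redirect of the form /url?q=<target>&... ; that markup is an assumption and may differ or change. ExtractTargetUrl is a hypothetical helper name, not part of HtmlAgilityPack, and the sample href is made up for illustration. It uses only the .NET base class library.

```vbnet
Imports System

Module GoogleHrefSketch

    ' Hypothetical helper: returns the target URL for an href of the assumed
    ' form "/url?q=<target>&...", or the href unchanged when it does not
    ' match that form.
    Function ExtractTargetUrl(href As String) As String
        Const prefix As String = "/url?"
        If String.IsNullOrEmpty(href) OrElse Not href.StartsWith(prefix, StringComparison.Ordinal) Then
            Return href
        End If

        ' Attribute values read from raw HTML may carry "&amp;" instead of "&".
        Dim query As String = href.Substring(prefix.Length).Replace("&amp;", "&")

        For Each pair As String In query.Split("&"c)
            If pair.StartsWith("q=", StringComparison.Ordinal) Then
                ' Undo the percent-encoding applied to the target URL.
                Return Uri.UnescapeDataString(pair.Substring(2))
            End If
        Next

        ' No q parameter found: fall back to the original value.
        Return href
    End Function

    Sub Main()
        ' Made-up sample in the assumed redirect format.
        Dim sample As String = "/url?q=http://example.com/index.php%3Fid%3D1&amp;sa=U"
        Console.WriteLine(ExtractTargetUrl(sample))
        ' prints http://example.com/index.php?id=1

        ' An href that is not a redirect is returned as-is.
        Console.WriteLine(ExtractTargetUrl("http://example.com/forums/"))
        ' prints http://example.com/forums/
    End Sub

End Module
```

Inside the loop, the filter could then be applied to ExtractTargetUrl(att.Value) instead of att.Value, so that a check such as Contains("index.php") runs against the decoded address.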



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow