VB.NET: extracting links from a Google search using HtmlAgilityPack

google-search html-agility-pack vb.net

Question

I have now updated my code as a test. I want to list all the URLs that contain the word index.php, but it shows other things as well.

Here is my working code:

Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click

    Dim webClient As New System.Net.WebClient
    Dim WebSource As String = webClient.DownloadString("http://www.google.com/search?lr=&cr=countryCA&newwindow=1&hl=fil&as_qdr=all&biw=1366&bih=667&tbs=ctr%3AcountryCA&q=index.php&oq=index.php&gs_l=serp.12..0l10.520034.522335.0.525032.9.9.0.0.0.0.497.3073.1j1j2j0j5.9.0....0...1c.1.25.serp..5.4.884.J4smY262XgY")
    RichTextBox1.Text = WebSource

    ListBox1.Items.Clear()


    Dim htmlDoc As New HtmlAgilityPack.HtmlDocument()
    htmlDoc.LoadHtml(WebSource)

    For Each link As HtmlNode In htmlDoc.DocumentNode.SelectNodes("//cite")

        If link.InnerText.Contains("index.php") Then
            ListBox1.Items.Add(link.InnerText)
        End If

    Next

End Sub

The expected output is only the URLs that contain index.php.

The problem is that each entry stops at index.php, and the other parts of the link are not included. Where the complete URL continues after index.php, the program displays only the part up to index.php, or a URL broken up by dots.

Accepted answer

I would use Html Agility Pack to extract the links, as follows:

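' WebSource holds the downloaded HTML, as in the question.
Dim links As New List(Of String)()
Dim htmlDoc As New HtmlAgilityPack.HtmlDocument()
htmlDoc.LoadHtml(WebSource)
' SelectNodes returns Nothing when no node matches, so guard before looping.
Dim anchors As HtmlAgilityPack.HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//a[@href]")
If anchors IsNot Nothing Then
    For Each link As HtmlAgilityPack.HtmlNode In anchors
        Dim att As HtmlAgilityPack.HtmlAttribute = link.Attributes("href")
        ' Keep only the links whose target contains the wanted text.
        If att.Value.Contains("/forums/") Then
            links.Add(att.Value)
        End If
    Next
End If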

If the page is a Google search result, the same loop should work; replace the "/forums/" filter with the text you are looking for, such as index.php.
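The truncated or dot-broken entries described in the question most likely come from reading the text of the cite elements, which holds a display form of the URL rather than the link itself. The href attribute of each result's anchor is the place to look for the full address. The following is a minimal sketch of that idea, not a confirmed description of Google's markup: it assumes that result links have the form /url?q=<target>&..., which is not stated in the original post and may change at any time. ExtractTarget, ExtractLinks and the sample HTML are hypothetical names and data used only for illustration, and the project is assumed to reference the HtmlAgilityPack package.

```vb
Imports System
Imports System.Collections.Generic

Module GoogleLinkSketch

    ' Assumption: result links look like "/url?q=<target>&sa=..." (not confirmed by the source).
    ' Returns the decoded target URL, or Nothing if href does not have that shape.
    Function ExtractTarget(href As String) As String
        Const prefix As String = "/url?q="
        If String.IsNullOrEmpty(href) Then Return Nothing
        If Not href.StartsWith(prefix, StringComparison.Ordinal) Then Return Nothing

        ' Cut at the first "&" (it also starts "&amp;"), which ends the q parameter.
        Dim rest As String = href.Substring(prefix.Length)
        Dim amp As Integer = rest.IndexOf("&"c)
        If amp >= 0 Then rest = rest.Substring(0, amp)

        ' The target is percent-encoded inside the query string.
        Return Uri.UnescapeDataString(rest)
    End Function

    ' Collects the decoded targets of all anchors whose target contains keyword.
    Function ExtractLinks(html As String, keyword As String) As List(Of String)
        Dim links As New List(Of String)()
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(html)

        Dim anchors As HtmlAgilityPack.HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@href]")
        If anchors Is Nothing Then Return links ' no anchors found

        For Each a As HtmlAgilityPack.HtmlNode In anchors
            Dim target As String = ExtractTarget(a.GetAttributeValue("href", ""))
            If target IsNot Nothing AndAlso target.Contains(keyword) Then
                links.Add(target)
            End If
        Next
        Return links
    End Function

    Sub Main()
        ' Hypothetical sample markup standing in for the downloaded page.
        Dim sample As String = "<a href=""/url?q=http://example.com/index.php%3Fid%3D1&amp;sa=U"">x</a>"
        For Each url As String In ExtractLinks(sample, "index.php")
            Console.WriteLine(url)
        Next
    End Sub

End Module
```

Run against the sample string, this prints http://example.com/index.php?id=1. In the question's handler, ExtractLinks(WebSource, "index.php") could take the place of the //cite loop, with each returned string added to ListBox1.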



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow