使用Html Agility Pack獲取文本內容

html html-agility-pack vb.net

我會盡力去具體。基本上在vb.net中使用爬蟲,我更感興趣的是提取頁面的文本內容。我當前的應用程序使用Web瀏覽器控件在文本框中下載html源代碼的主體,如下所示:

Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs)   Handles Button1.Click
    Dim url As String = "<url>"
    WebBrowser1.Navigate(url)
End Sub

Private Sub WebBrowser1_DocumentCompleted(ByVal sender As System.Object, ByVal e As    System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
    TextBox2.Text = WebBrowser1.Document.Body.OuterHtml
End Sub

現在從這裡開始,textbox2由垃圾html組成,其中包含href,img,ads,script等,但我需要獲取所有這些元數據並獲取純文本。

我可以應用正則表達式屬性來獲取所有異常,但我認為HAP更適合html解析器。

在這裡搜索帶我到這個頁面,討論'Meltdown'提到的白名單技術的使用

HTML Agility Pack strip標籤不在白名單中

但是我如何在vb.net中應用它,因為它似乎是一個好主意?

請adivce guys ..........

編輯:我發現下面顯示的代碼的vb.net版本,但似乎有一個錯誤

If i IsNot DeletableNodesXpath.Count - 1 Then

錯誤:IsNot要求操作數具有引用類型,但此操作數的值類型為整數

這是代碼:

Public NotInheritable Class HtmlSanitizer Private Sub New()End Sub Private SharedOnly Whitelist As IDictionary(Of String,String())Private Shared DeletableNodesXpath As New List(Of String)()

Shared Sub New()
    Whitelist = New Dictionary(Of String, String())() From { _
        {"a", New () {"href"}}, _
        {"strong", Nothing}, _
        {"em", Nothing}, _
        {"blockquote", Nothing}, _
        {"b", Nothing}, _
        {"p", Nothing}, _
        {"ul", Nothing}, _
        {"ol", Nothing}, _
        {"li", Nothing}, _
        {"div", New () {"align"}}, _
        {"strike", Nothing}, _
        {"u", Nothing}, _
        {"sub", Nothing}, _
        {"sup", Nothing}, _
        {"table", Nothing}, _
        {"tr", Nothing}, _
        {"td", Nothing}, _
        {"th", Nothing} _
    }
End Sub

Public Shared Function Sanitize(input As String) As String
    If input.Trim().Length < 1 Then
        Return String.Empty
    End If
    Dim htmlDocument = New HtmlDocument()

    htmldocument.LoadHtml(input)
    SanitizeNode(htmldocument.DocumentNode)
    Dim xPath As String = HtmlSanitizer.CreateXPath()

    Return StripHtml(htmldocument.DocumentNode.WriteTo().Trim(), xPath)
End Function

Private Shared Sub SanitizeChildren(parentNode As HtmlNode)
    For i As Integer = parentNode.ChildNodes.Count - 1 To 0 Step -1
        SanitizeNode(parentNode.ChildNodes(i))
    Next
End Sub

Private Shared Sub SanitizeNode(node As HtmlNode)
    If node.NodeType = HtmlNodeType.Element Then
        If Not Whitelist.ContainsKey(node.Name) Then
            If Not DeletableNodesXpath.Contains(node.Name) Then
                'DeletableNodesXpath.Add(node.Name.Replace("?",""));
                node.Name = "removeableNode"
                DeletableNodesXpath.Add(node.Name)
            End If
            If node.HasChildNodes Then
                SanitizeChildren(node)
            End If

            Return
        End If

        If node.HasAttributes Then
            For i As Integer = node.Attributes.Count - 1 To 0 Step -1
                Dim currentAttribute As HtmlAttribute = node.Attributes(i)
                Dim allowedAttributes As String() = Whitelist(node.Name)
                If allowedAttributes IsNot Nothing Then
                    If Not allowedAttributes.Contains(currentAttribute.Name) Then
                        node.Attributes.Remove(currentAttribute)
                    End If
                Else
                    node.Attributes.Remove(currentAttribute)
                End If
            Next
        End If
    End If

    If node.HasChildNodes Then
        SanitizeChildren(node)
    End If
End Sub

Private Shared Function StripHtml(html As String, xPath As String) As String
    Dim htmlDoc As New HtmlDocument()
    htmlDoc.LoadHtml(html)
    If xPath.Length > 0 Then
        Dim invalidNodes As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes(xPath)
        For Each node As HtmlNode In invalidNodes
            node.ParentNode.RemoveChild(node, True)
        Next
    End If
    Return htmlDoc.DocumentNode.WriteContentTo()


End Function

Private Shared Function CreateXPath() As String
    Dim _xPath As String = String.Empty
    For i As Integer = 0 To DeletableNodesXpath.Count - 1
        If i IsNot DeletableNodesXpath.Count - 1 Then
            _xPath += String.Format("//{0}|", DeletableNodesXpath(i).ToString())
        Else
            _xPath += String.Format("//{0}", DeletableNodesXpath(i).ToString())
        End If
    Next
    Return _xPath
End Function
End Class

請有人幫忙??????

熱門答案

而不是使用IsNot ,只需使用<> 。正如您在基本上檢查一個整數的值不等於另一個整數的值 - 1。

我相信IsNot不能用於整數。

編輯:我剛剛注意到這是超級超級老。剛看到7月26日的日期!



許可下: CC-BY-SA with attribution
不隸屬於 Stack Overflow
這個KB合法嗎? 是的,了解原因
許可下: CC-BY-SA with attribution
不隸屬於 Stack Overflow
這個KB合法嗎? 是的,了解原因