使用Html Agility Pack获取文本内容

html html-agility-pack vb.net

我会尽力去具体。基本上在vb.net中使用爬虫,我更感兴趣的是提取页面的文本内容。我当前的应用程序使用Web浏览器控件在文本框中下载html源代码的主体,如下所示:

Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs)   Handles Button1.Click
    Dim url As String = "<url>"
    WebBrowser1.Navigate(url)
End Sub

Private Sub WebBrowser1_DocumentCompleted(ByVal sender As System.Object, ByVal e As    System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
    TextBox2.Text = WebBrowser1.Document.Body.OuterHtml
End Sub

现在从这里开始,textbox2由垃圾html组成,其中包含href,img,ads,script等,但我需要获取所有这些元数据并获取纯文本。

我可以应用正则表达式属性来获取所有异常,但我认为HAP更适合html解析器。

在这里搜索带我到这个页面,讨论'Meltdown'提到的白名单技术的使用

HTML Agility Pack strip标签不在白名单中

但是我如何在vb.net中应用它,因为它似乎是一个好主意?

请adivce guys ..........

编辑:我发现下面显示的代码的vb.net版本,但似乎有一个错误

If i IsNot DeletableNodesXpath.Count - 1 Then

错误:IsNot要求操作数具有引用类型,但此操作数的值类型为整数

这是代码:

Public NotInheritable Class HtmlSanitizer Private Sub New()End Sub Private SharedOnly Whitelist As IDictionary(Of String,String())Private Shared DeletableNodesXpath As New List(Of String)()

Shared Sub New()
    Whitelist = New Dictionary(Of String, String())() From { _
        {"a", New () {"href"}}, _
        {"strong", Nothing}, _
        {"em", Nothing}, _
        {"blockquote", Nothing}, _
        {"b", Nothing}, _
        {"p", Nothing}, _
        {"ul", Nothing}, _
        {"ol", Nothing}, _
        {"li", Nothing}, _
        {"div", New () {"align"}}, _
        {"strike", Nothing}, _
        {"u", Nothing}, _
        {"sub", Nothing}, _
        {"sup", Nothing}, _
        {"table", Nothing}, _
        {"tr", Nothing}, _
        {"td", Nothing}, _
        {"th", Nothing} _
    }
End Sub

Public Shared Function Sanitize(input As String) As String
    If input.Trim().Length < 1 Then
        Return String.Empty
    End If
    Dim htmlDocument = New HtmlDocument()

    htmldocument.LoadHtml(input)
    SanitizeNode(htmldocument.DocumentNode)
    Dim xPath As String = HtmlSanitizer.CreateXPath()

    Return StripHtml(htmldocument.DocumentNode.WriteTo().Trim(), xPath)
End Function

Private Shared Sub SanitizeChildren(parentNode As HtmlNode)
    For i As Integer = parentNode.ChildNodes.Count - 1 To 0 Step -1
        SanitizeNode(parentNode.ChildNodes(i))
    Next
End Sub

Private Shared Sub SanitizeNode(node As HtmlNode)
    If node.NodeType = HtmlNodeType.Element Then
        If Not Whitelist.ContainsKey(node.Name) Then
            If Not DeletableNodesXpath.Contains(node.Name) Then
                'DeletableNodesXpath.Add(node.Name.Replace("?",""));
                node.Name = "removeableNode"
                DeletableNodesXpath.Add(node.Name)
            End If
            If node.HasChildNodes Then
                SanitizeChildren(node)
            End If

            Return
        End If

        If node.HasAttributes Then
            For i As Integer = node.Attributes.Count - 1 To 0 Step -1
                Dim currentAttribute As HtmlAttribute = node.Attributes(i)
                Dim allowedAttributes As String() = Whitelist(node.Name)
                If allowedAttributes IsNot Nothing Then
                    If Not allowedAttributes.Contains(currentAttribute.Name) Then
                        node.Attributes.Remove(currentAttribute)
                    End If
                Else
                    node.Attributes.Remove(currentAttribute)
                End If
            Next
        End If
    End If

    If node.HasChildNodes Then
        SanitizeChildren(node)
    End If
End Sub

Private Shared Function StripHtml(html As String, xPath As String) As String
    Dim htmlDoc As New HtmlDocument()
    htmlDoc.LoadHtml(html)
    If xPath.Length > 0 Then
        Dim invalidNodes As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes(xPath)
        For Each node As HtmlNode In invalidNodes
            node.ParentNode.RemoveChild(node, True)
        Next
    End If
    Return htmlDoc.DocumentNode.WriteContentTo()


End Function

Private Shared Function CreateXPath() As String
    Dim _xPath As String = String.Empty
    For i As Integer = 0 To DeletableNodesXpath.Count - 1
        If i IsNot DeletableNodesXpath.Count - 1 Then
            _xPath += String.Format("//{0}|", DeletableNodesXpath(i).ToString())
        Else
            _xPath += String.Format("//{0}", DeletableNodesXpath(i).ToString())
        End If
    Next
    Return _xPath
End Function
End Class

请有人帮忙??????

热门答案

而不是使用IsNot ,只需使用<> 。正如您在基本上检查一个整数的值不等于另一个整数的值 - 1。

我相信IsNot不能用于整数。

编辑:我刚刚注意到这是超级超级老。刚看到7月26日的日期!




许可下: CC-BY-SA with attribution
不隶属于 Stack Overflow
这个KB合法吗? 是的,了解原因
许可下: CC-BY-SA with attribution
不隶属于 Stack Overflow
这个KB合法吗? 是的,了解原因