Grabbing text content using Html Agility Pack

html html-agility-pack vb.net

Question

I'll try to be precise. I'm working on a VB.NET crawler whose main job is to extract the text content of a page. My current program uses a WebBrowser control to load the page and copy the HTML of its body into a text box, as shown below:

Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    Dim url As String = "<url>"
    WebBrowser1.Navigate(url)
End Sub

Private Sub WebBrowser1_DocumentCompleted(ByVal sender As System.Object, ByVal e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
    TextBox2.Text = WebBrowser1.Document.Body.OuterHtml
End Sub

At this point TextBox2 contains messy HTML full of links, images, ads, scripts, and other elements. I need to strip all of that markup so that I'm left with just the plain text.

I could use regular expressions to strip out the unwanted parts, but I believe Html Agility Pack (HAP) is a much better option for parsing HTML.

Searching this site, I found the following question, which describes the whitelist approach posted by "Meltdown":

Strip tags from HTML Agility Pack NOT ON whitelist

It seems like a good concept, but how can I implement it in VB.NET?

Any advice would be appreciated.

EDIT: Below is the code in VB.NET, but it seems to have a bug at this line:

If i IsNot DeletableNodesXpath.Count - 1 Then

Error: 'IsNot' requires operands that have reference types, but this operand has the value type 'Integer'.

Here is the code:

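Public NotInheritable Class HtmlSanitizer
    ' Allowed tag names mapped to their allowed attributes (Nothing = no attributes allowed)
    Private Shared ReadOnly Whitelist As IDictionary(Of String, String())
    ' Names of the nodes that will be stripped via XPath
    Private Shared DeletableNodesXpath As New List(Of String)()

    ' Private constructor: the class only exposes shared members
    Private Sub New()
    End Sub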

Shared Sub New()
    Whitelist = New Dictionary(Of String, String())() From { _
        {"a", New String() {"href"}}, _
        {"strong", Nothing}, _
        {"em", Nothing}, _
        {"blockquote", Nothing}, _
        {"b", Nothing}, _
        {"p", Nothing}, _
        {"ul", Nothing}, _
        {"ol", Nothing}, _
        {"li", Nothing}, _
        {"div", New String() {"align"}}, _
        {"strike", Nothing}, _
        {"u", Nothing}, _
        {"sub", Nothing}, _
        {"sup", Nothing}, _
        {"table", Nothing}, _
        {"tr", Nothing}, _
        {"td", Nothing}, _
        {"th", Nothing} _
    }
End Sub

Public Shared Function Sanitize(input As String) As String
    If input.Trim().Length < 1 Then
        Return String.Empty
    End If
    Dim htmlDocument = New HtmlDocument()

    htmlDocument.LoadHtml(input)
    SanitizeNode(htmlDocument.DocumentNode)
    Dim xPath As String = HtmlSanitizer.CreateXPath()

    Return StripHtml(htmlDocument.DocumentNode.WriteTo().Trim(), xPath)
End Function

Private Shared Sub SanitizeChildren(parentNode As HtmlNode)
    For i As Integer = parentNode.ChildNodes.Count - 1 To 0 Step -1
        SanitizeNode(parentNode.ChildNodes(i))
    Next
End Sub

Private Shared Sub SanitizeNode(node As HtmlNode)
    If node.NodeType = HtmlNodeType.Element Then
        If Not Whitelist.ContainsKey(node.Name) Then
            If Not DeletableNodesXpath.Contains(node.Name) Then
                'DeletableNodesXpath.Add(node.Name.Replace("?",""));
                node.Name = "removeableNode"
                DeletableNodesXpath.Add(node.Name)
            End If
            If node.HasChildNodes Then
                SanitizeChildren(node)
            End If

            Return
        End If

        If node.HasAttributes Then
            For i As Integer = node.Attributes.Count - 1 To 0 Step -1
                Dim currentAttribute As HtmlAttribute = node.Attributes(i)
                Dim allowedAttributes As String() = Whitelist(node.Name)
                If allowedAttributes IsNot Nothing Then
                    If Not allowedAttributes.Contains(currentAttribute.Name) Then
                        node.Attributes.Remove(currentAttribute)
                    End If
                Else
                    node.Attributes.Remove(currentAttribute)
                End If
            Next
        End If
    End If

    If node.HasChildNodes Then
        SanitizeChildren(node)
    End If
End Sub

Private Shared Function StripHtml(html As String, xPath As String) As String
    Dim htmlDoc As New HtmlDocument()
    htmlDoc.LoadHtml(html)
    If xPath.Length > 0 Then
        Dim invalidNodes As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes(xPath)
        For Each node As HtmlNode In invalidNodes
            node.ParentNode.RemoveChild(node, True)
        Next
    End If
    Return htmlDoc.DocumentNode.WriteContentTo()
End Function

Private Shared Function CreateXPath() As String
    Dim _xPath As String = String.Empty
    For i As Integer = 0 To DeletableNodesXpath.Count - 1
        If i IsNot DeletableNodesXpath.Count - 1 Then
            _xPath += String.Format("//{0}|", DeletableNodesXpath(i).ToString())
        Else
            _xPath += String.Format("//{0}", DeletableNodesXpath(i).ToString())
        End If
    Next
    Return _xPath
End Function
End Class

Can someone please assist?

5/23/2017 10:34:20 AM

Popular Answer

Instead of using IsNot, just use <>. You are checking that one integer value does not equal another integer value minus 1.

IsNot can't be applied to integers: it compares object references, so it requires reference-type operands, and Integer is a value type.
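
As a rough illustration of the fix, here is a minimal, self-contained sketch of the XPath-building loop with <> in place of IsNot. The module name XPathSketch, the names parameter, and the JoinXPath helper are hypothetical and not part of the original code; JoinXPath is only an alternative way to get the same string, assuming .NET 4.0 or later.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module XPathSketch
    ' Same loop as CreateXPath in the question, but comparing the
    ' Integer index with <> (value comparison) instead of IsNot
    ' (reference comparison, which does not compile for Integer).
    Function CreateXPath(names As List(Of String)) As String
        Dim xPath As String = String.Empty
        For i As Integer = 0 To names.Count - 1
            If i <> names.Count - 1 Then
                xPath &= String.Format("//{0}|", names(i))
            Else
                xPath &= String.Format("//{0}", names(i))
            End If
        Next
        Return xPath
    End Function

    ' Hypothetical alternative: String.Join places the separator only
    ' between items, so no last-index check is needed at all.
    Function JoinXPath(names As List(Of String)) As String
        Return String.Join("|", names.Select(Function(n) "//" & n))
    End Function

    Sub Main()
        Dim names As New List(Of String) From {"script", "img"}
        Console.WriteLine(CreateXPath(names)) ' prints //script|//img
        Console.WriteLine(JoinXPath(names))   ' prints //script|//img
    End Sub
End Module
```

Both functions return an empty string for an empty list, which matches the xPath.Length > 0 check in StripHtml.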

Edit: I've just realized this question is quite old; I only just noticed the date, July 26.

7/26/2012 9:07:58 AM



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow