Retrieving attributes and spans using the HtmlAgilityPack library

.net html html-agility-pack html-parsing vb.net

In this HTML code:

<div class="item">

    <div class="thumb">
        <a href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866" rel="bookmark" lang="en" title="Wolf Eyes - Lower Demos album downloads">
        <img width="100" height="100" alt="Mp3 downloads Wolf Eyes - Lower Demos" title="Free mp3 downloads Wolf Eyes - Lower Demos" src="http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg" /></a>
    </div>

    <div class="release">
        <h3>Wolf Eyes</h3>
        <h4>
        <a href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866" title="Wolf Eyes - Lower Demos">Lower Demos</a>
        </h4>
        <script src="/ads/button.js"></script>
    </div>

    <div class="release-year">
        <p>Year</p>
        <span>2013</span>
    </div>

    <div class="genre">
        <p>Genre</p>
        <a href="http://www.mp3crank.com/genre/rock" rel="tag">Rock</a>
        <a href="http://www.mp3crank.com/genre/pop" rel="tag">Pop</a>
    </div>

</div>

I know how to parse it in other ways, but I would like to retrieve this information using the HtmlAgilityPack library:

Title : Wolf Eyes - Lower Demos
Cover : http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg
Year  : 2013
Genres: Rock, Pop
URL   : http://www.mp3crank.com/wolf-eyes/lower-demos-121866

These values come from the following parts of the HTML:

Title : title="Wolf Eyes - Lower Demos"
Cover : src="http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg"
Year  : <span>2013</span>
Genre1: <a href="http://www.mp3crank.com/genre/rock" rel="tag">Rock</a>
Genre2: <a href="http://www.mp3crank.com/genre/pop" rel="tag">Pop</a>
URL   : href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866" 

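This is what I am trying, but I always get an "object reference not set" exception when I try to select a single node. Sorry, I am new to HTML. I tried to follow the steps from this question: HtmlAgilityPack basic how to get title and link?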

Public Class Form1

    Private htmldoc As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument
    Private htmlnodes As HtmlAgilityPack.HtmlNodeCollection = Nothing

    Private Title As String = String.Empty
    Private Cover As String = String.Empty
    Private Genres As String() = {String.Empty}
    Private Year As Integer = -0
    Private URL as String = String.Empty

    Private Sub Test() Handles MyBase.Shown

        ' Load the html document.
        htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))

        ' Select the (10 items) nodes.
        htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")

        ' Loop through the nodes.
        For Each node As HtmlAgilityPack.HtmlNode In htmlnodes

            Title = node.SelectSingleNode("//div[@class='release']").Attributes("title").Value
            Cover = node.SelectSingleNode("//div[@class='thumb']").Attributes("src").Value
            Year = CInt(node.SelectSingleNode("//div[@class='release-year']").Attributes("span").Value)
            Genres = ¿select multiple nodes?
            URL = node.SelectSingleNode("//div[@class='release']").Attributes("href").Value

        Next

    End Sub

End Class

Accepted answer

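Your mistake here is trying to read a child node's attribute from the node you found.

When you call node.SelectSingleNode("//div[@class='release']") you get the correct div back, but calling .Attributes on it only returns the attributes of the div tag itself, not those of any inner HTML elements. That div has no title attribute, so .Attributes("title") returns Nothing, and reading .Value from it raises the "object reference not set" exception.

You can write XPath queries that select the child nodes directly, for example //div[@class='release']/h4/a (in this HTML the a element sits inside an h4). For more information on XPath, see http://www.w3schools.com/xpath/xpath_syntax.asp . Although the examples there are for XML, most of the principles should apply to HTML documents.

Another approach is to make further XPath calls on the node you have already found. I have modified your code to work this way: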

' Load the html document.
htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))

' Select the (10 items) nodes.
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")

' Loop through the nodes.
For Each node As HtmlAgilityPack.HtmlNode In htmlnodes

    Dim releaseNode = node.SelectSingleNode(".//div[@class='release']")
    ' Assumes the node is found and that it contains an a tag.
    Title = releaseNode.SelectSingleNode(".//a").Attributes("title").Value
    URL = releaseNode.SelectSingleNode(".//a").Attributes("href").Value

    Dim thumbNode = node.SelectSingleNode(".//div[@class='thumb']")
    Cover = thumbNode.SelectSingleNode(".//img").Attributes("src").Value

    Dim releaseYearNode = node.SelectSingleNode(".//div[@class='release-year']")
    Year = CInt(releaseYearNode.SelectSingleNode(".//span").InnerText)

    Dim genreNode = node.SelectSingleNode(".//div[@class='genre']")
    Dim genreLinks = genreNode.SelectNodes(".//a")
    Genres = (From n In genreLinks Select n.InnerText).ToArray()

    Console.WriteLine("Title : {0}", Title)
    Console.WriteLine("Cover : {0}", Cover)
    Console.WriteLine("Year  : {0}", Year)
    Console.WriteLine("Genres: {0}", String.Join(",", Genres))
    Console.WriteLine("URL   : {0}", URL)

Next

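Note that this code assumes the document is well-formed and that every node, element, and attribute is present and correct. You will probably want to add a good deal of error checking, for example If someNode Is Nothing Then ....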
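The error checking mentioned above could look roughly like the following. This is a minimal sketch, not the answer's original code: the helper AttributeOrNothing, the module name, and the inline sample HTML are hypothetical and exist only for illustration. It relies on SelectSingleNode and SelectNodes returning Nothing when nothing matches.

```vbnet
Imports System
Imports System.Linq
Imports HtmlAgilityPack

Module NullSafeExample

    ' Hypothetical helper: returns the attribute value, or Nothing when
    ' either the node or the attribute is missing.
    Function AttributeOrNothing(parent As HtmlNode, xpath As String, attributeName As String) As String
        Dim found As HtmlNode = parent.SelectSingleNode(xpath)
        If found Is Nothing Then Return Nothing
        Return found.GetAttributeValue(attributeName, CStr(Nothing))
    End Function

    Sub Main()
        Dim doc As New HtmlDocument()
        ' Assumed sample input; the release-year div is missing on purpose.
        doc.LoadHtml("<div class='item'><div class='release'><h4>" &
                     "<a href='/x' title='T'>X</a></h4></div></div>")

        Dim items As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div[@class='item']")
        ' SelectNodes returns Nothing, not an empty collection, when nothing matches.
        If items Is Nothing Then
            Console.WriteLine("No item nodes found.")
            Return
        End If

        For Each item As HtmlNode In items
            Dim title As String = AttributeOrNothing(item, ".//div[@class='release']//a", "title")
            Dim url As String = AttributeOrNothing(item, ".//div[@class='release']//a", "href")

            ' Parse the year defensively instead of calling CInt on a missing node.
            Dim yearNode As HtmlNode = item.SelectSingleNode(".//div[@class='release-year']/span")
            Dim year As Integer = 0
            If yearNode IsNot Nothing Then Integer.TryParse(yearNode.InnerText.Trim(), year)

            ' Fall back to an empty array when there are no genre links.
            Dim genreLinks As HtmlNodeCollection = item.SelectNodes(".//div[@class='genre']/a")
            Dim genres As String() = If(genreLinks Is Nothing,
                                        New String() {},
                                        genreLinks.Select(Function(n) n.InnerText).ToArray())

            Console.WriteLine("Title : {0}", title)
            Console.WriteLine("URL   : {0}", url)
            Console.WriteLine("Year  : {0}", year)
            Console.WriteLine("Genres: {0}", String.Join(", ", genres))
        Next
    End Sub

End Module
```

Whether a missing value should become Nothing, a default, or an error depends on how the scraped data is used afterwards; the defaults chosen here are arbitrary.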

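Edit: I modified the code above slightly so that every .SelectSingleNode call uses the ".//" prefix. This makes it work when there are several "item" nodes; without the prefix, the query selects the first match in the whole document rather than the match under the current node.

If you want a shorter XPath solution, here is the same code using that approach: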
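The difference between the two prefixes can be seen with a small experiment. The sketch below uses an assumed two-item document (the year 2014 and the module name are invented for illustration) and runs the same query with and without the leading dot.

```vbnet
Imports System
Imports HtmlAgilityPack

Module RelativeXPathExample

    Sub Main()
        Dim doc As New HtmlDocument()
        ' Assumed sample input: two items with different years.
        doc.LoadHtml("<div class='item'><div class='release-year'><span>2013</span></div></div>" &
                     "<div class='item'><div class='release-year'><span>2014</span></div></div>")

        For Each item As HtmlNode In doc.DocumentNode.SelectNodes("//div[@class='item']")
            ' "//" starts at the document root, even when called on a child node.
            Dim fromRoot As HtmlNode = item.SelectSingleNode("//div[@class='release-year']/span")
            ' ".//" starts at the current item node.
            Dim fromItem As HtmlNode = item.SelectSingleNode(".//div[@class='release-year']/span")

            Console.WriteLine("from root: {0}, from item: {1}", fromRoot.InnerText, fromItem.InnerText)
        Next
    End Sub

End Module
```

With this input, the root-anchored query should return the first span (2013) on both iterations, while the relative query follows the current item.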

' Load the html document.
htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))

' Select the (10 items) nodes.
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")

' Loop through the nodes.
For Each node As HtmlAgilityPack.HtmlNode In htmlnodes

    Title = node.SelectSingleNode(".//div[@class='release']/h4/a[@title]").Attributes("title").Value
    URL = node.SelectSingleNode(".//div[@class='release']/h4/a[@href]").Attributes("href").Value

    Cover = node.SelectSingleNode(".//div[@class='thumb']/a/img[@src]").Attributes("src").Value

    Year = CInt(node.SelectSingleNode(".//div[@class='release-year']/span").InnerText)

    Dim genreLinks = node.SelectNodes(".//div[@class='genre']/a")
    Genres = (From n In genreLinks Select n.InnerText).ToArray()

    Console.WriteLine("Title : {0}", Title)
    Console.WriteLine("Cover : {0}", Cover)
    Console.WriteLine("Year  : {0}", Year)
    Console.WriteLine("Genres: {0}", String.Join(",", Genres))
    Console.WriteLine("URL   : {0}", URL)
    Console.WriteLine()

Next

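Popular answer

You are not far from the solution. Two important notes:

  • // selects descendants recursively. It can have a serious performance impact, and it may select nodes you do not want, so I suggest using it only when the hierarchy is deep, complex, or variable and you do not want to spell out the whole path.
  • HtmlNode has a useful helper method named GetAttributeValue that returns an attribute's value even if the attribute does not exist (you have to specify a default value).

Here is a sample that seems to work: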
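To illustrate the second note, here is a minimal sketch that contrasts the Attributes indexer with GetAttributeValue on a node that lacks the attribute. The inline HTML, the module name, and the "(no title)" default are assumptions made for the example.

```vbnet
Imports System
Imports HtmlAgilityPack

Module GetAttributeValueExample

    Sub Main()
        Dim doc As New HtmlDocument()
        ' Assumed sample input, cut down from the question's markup.
        doc.LoadHtml("<div class='release'><h4>" &
                     "<a href='/wolf-eyes' title='Wolf Eyes - Lower Demos'>Lower Demos</a>" &
                     "</h4></div>")

        Dim releaseDiv As HtmlNode = doc.DocumentNode.SelectSingleNode("//div[@class='release']")

        ' The div has no title attribute, so the indexer yields Nothing;
        ' reading .Value from it would throw a NullReferenceException.
        Dim missing As HtmlAttribute = releaseDiv.Attributes("title")
        Console.WriteLine("Indexer returned Nothing: {0}", missing Is Nothing)

        ' GetAttributeValue returns the supplied default instead of failing.
        Console.WriteLine(releaseDiv.GetAttributeValue("title", "(no title)"))

        ' On the a element the attribute exists, so its real value is returned.
        Dim link As HtmlNode = releaseDiv.SelectSingleNode(".//a")
        Console.WriteLine(link.GetAttributeValue("title", "(no title)"))
    End Sub

End Module
```

GetAttributeValue only guards against a missing attribute; the node itself still has to be checked for Nothing before the call.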

' select the base/parent DIV (here we use a discriminant CLASS attribute)
' all select calls below will use this DIV element as a starting point
Dim node As HtmlNode = htmldoc.DocumentNode.SelectSingleNode("//div[@class='item']")

' get to the A tag which is a child or grand child (//) of a 'release' DIV
Console.WriteLine(("Title :" & node.SelectSingleNode("div[@class='release']//a").GetAttributeValue("title", CStr(Nothing))))

' get to the IMG tag which is a child or grand child (//) of a 'thumb' DIV
Console.WriteLine(("Cover :" & node.SelectSingleNode("div[@class='thumb']//img").GetAttributeValue("src", CStr(Nothing))))

' get to the SPAN tag which is a child or grand child (//) of a 'release-year' DIV
Console.WriteLine(("Year  :" & node.SelectSingleNode("div[@class='release-year']//span").InnerText))

' get all A elements which are child or grand child(//) of a 'genre' DIV
Dim nodes As HtmlNodeCollection = node.SelectNodes("div[@class='genre']//a")
Dim i As Integer
For i = 0 To nodes.Count - 1
    Console.WriteLine(String.Concat(New Object() { "Genre", (i + 1), ":", nodes.Item(i).InnerText }))
Next i

' get to the A tag which is a child or grand child (//) of a 'release' DIV
Console.WriteLine(("Url   :" & node.SelectSingleNode("div[@class='release']//a").GetAttributeValue("href", CStr(Nothing))))


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow