Issue parsing html using powershell and xpath

html html-agility-pack html-parsing powershell xpath

Question

This is a follow-up to a question I asked last week. I got past the original issue, but I am now running into a slightly different one.

Using the GetAttributeValue method, I can now get an attribute that sits directly on the item I'm interested in (here, data-pid). However, I am having trouble grabbing a value from a tag nested inside that item (in my code snippet, the date). I am using XPath and the HTML Agility Pack to parse the HTML, but in the example below the same date gets returned over and over.

Here is what the $item object looks like:

Attributes           : {class, data-pid}
ChildNodes           : {#text, a, #text, span...}
Closed               : True
ClosingAttributes    : {}
FirstChild           : HtmlAgilityPack.HtmlTextNode
HasAttributes        : True
HasChildNodes        : True
HasClosingAttributes : False
Id                   : 
InnerHtml            :  <a href="/mod/4175126893.html" class="i"><span class="price">$20</span></a> <span class="star"></span> <span class="pl"> <span class="date">Nov 
                       30</span>  <a href="/mod/4175126893.html">Unlock Any GSM Cell Phone Today!</a> </span> <span class="l2"> <span class="price">$20</span>  <span 
                       class="pnr"> <small> (Des Moines)</small> <span class="px"> <span class="p"> </span></span> </span>  <a class="gc" href="/mod/" 
                       data-cat="mod">cell phones - by dealer</a> </span> 
InnerText            :  $20   Nov 30  Unlock Any GSM Cell Phone Today!   $20    (Des Moines)      cell phones - by dealer  
LastChild            : HtmlAgilityPack.HtmlTextNode
Line                 : 305
LinePosition         : 5408
Name                 : p
NextSibling          : HtmlAgilityPack.HtmlTextNode
NodeType             : Element
OriginalName         : p
OuterHtml            : <p class="row" data-pid="4175126893"> <a href="/mod/4175126893.html" class="i"><span class="price">$20</span></a> <span class="star"></span> 
                       <span class="pl"> <span class="date">Nov 30</span>  <a href="/mod/4175126893.html">Unlock Any GSM Cell Phone Today!</a> </span> <span class="l2"> 
                       <span class="price">$20</span>  <span class="pnr"> <small> (Des Moines)</small> <span class="px"> <span class="p"> </span></span> </span>  <a 
                       class="gc" href="/mod/" data-cat="mod">cell phones - by dealer</a> </span> </p>
OwnerDocument        : HtmlAgilityPack.HtmlDocument
ParentNode           : HtmlAgilityPack.HtmlNode
PreviousSibling      : HtmlAgilityPack.HtmlTextNode
StreamPosition       : 18733
XPath                : /html[1]/body[1]/article[1]/section[1]/div[1]/div[2]/p[11]

I want to pull data out of the OuterHtml value, for example:

OuterHtml            : <p class="row" data-latitude="41.5937565437255" data-longitude="-93.6437636649079" data-pid="4184719674"> <a href="/mod/4184719674.html" class="i"></a> 
               <span class="star"></span> <span class="pl"> <span class="date">Nov 27</span>  <a href="/mod/4184719674.html">iPhone and other Cell Phone Unlocks</a> 
               </span> <span class="l2">   <span class="pnr"> <small> (Des Moines)</small> <span class="px"> <span class="p"> <a href="#" class="maptag" 
               data-pid="4184719674">map</a></span></span> </span>  <a class="gc" href="/mod/" data-cat="mod">cell phones - by dealer</a> </span> </p>

I can grab the data-pid no problem. Here is what the current code looks like:

ForEach ($item in $results) {

    # This is working
    $ID = $item.GetAttributeValue("data-pid", "")

    # This is looping over the same item
    $Date = $item.SelectSingleNode("//span[@class='date']").InnerText
}

What I want is to grab values from the different tags contained in the OuterHtml using my XPath statements, but I can't figure out how to do that. Is that the best way to approach the problem, or should I just use a regex to get the value I want?

Let me know what other details I need to post.

Accepted Answer

I haven't used the HTML Agility Pack, but AFAICS built-in tools should suffice anyway:

$url = 'http://www.example.com/path/to/some.html'

$html = (Invoke-Webrequest $url).ParsedHTML

$html.getElementsByTagName('p') | ? { $_.className -eq 'row' } | % {
  $ID   = $_.getAttributeNode('data-pid').value
  $Date = $_.getElementsByTagName('span') | ? { $_.className -eq 'date' } |
          % { $_.innerText }

  # do stuff with $ID and $Date
  "{0}: {1}" -f $ID, $Date
}

Note that Invoke-WebRequest requires PowerShell v3 or later. Use the Internet Explorer COM object if you're limited to PowerShell v2:

$ie = New-Object -COM InternetExplorer.Application
$ie.Navigate($url)
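# Poll until the page has finished loading (ReadyState 4 = complete).
# A bare "sleep 100" would pause for 100 seconds, so the unit is given explicitly.
while ($ie.ReadyState -ne 4) { Start-Sleep -Milliseconds 100 }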
$html = $ie.Document

If your HTML file is a local file, replace the Invoke-Webrequest line with something like this:

$htmlfile = 'C:\path\to\some.html'

$html = New-Object -COM HTMLFile
$html.write((Get-Content $htmlfile | Out-String))

Popular Answer

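I'm way too late, but here's your mistake: you've been using an absolute path. An XPath expression that starts with // searches from the document root no matter which node you call it on, so every $item returns the first date in the whole page. Prefix it with a dot (.//) to search only below the current node.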

ForEach ($item in $results) {

    # This is working
    $ID = $item.GetAttributeValue("data-pid", "")

    # This is looping over the same item
    $Date = $item.SelectSingleNode("//span[@class='date']").InnerText

    # This is looping over the different items (i.e. this is what you want)
    $Date = $item.SelectSingleNode(".//span[@class='date']").InnerText
}
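
To see the difference without the original page, here is a minimal sketch. Note the assumptions: it uses .NET's built-in System.Xml rather than the HTML Agility Pack (so it runs without loading the DLL), and the markup is a trimmed-down, well-formed stand-in for the rows in the question. The variable names $doc and $rows and the property names Absolute and Relative are made up for illustration. The XPath rule it demonstrates is the same one described above.

```powershell
# Illustration only: System.Xml stands in for the HTML Agility Pack here.
# The sample markup is a simplified version of the rows shown in the question.
[xml]$doc = @'
<div>
  <p class="row" data-pid="4175126893"><span class="pl"><span class="date">Nov 30</span></span></p>
  <p class="row" data-pid="4184719674"><span class="pl"><span class="date">Nov 27</span></span></p>
</div>
'@

$rows = foreach ($item in $doc.SelectNodes("//p[@class='row']")) {
    [pscustomobject]@{
        ID       = $item.GetAttribute('data-pid')
        # "//" starts at the document root, so this is the first date in the document every time
        Absolute = $item.SelectSingleNode("//span[@class='date']").InnerText
        # ".//" starts at the current <p>, so this is the date that belongs to this row
        Relative = $item.SelectSingleNode(".//span[@class='date']").InnerText
    }
}

$rows | Format-Table -AutoSize
```

In this sketch the Absolute column shows Nov 30 for both rows, while the Relative column shows Nov 30 for the first row and Nov 27 for the second.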



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why