HtmlAgilityPack extracts text from all divs in a page and not just from the one div specified in the code

c# html-agility-pack

Question

I'm seeing odd behavior when using HtmlAgilityPack with an XPath expression. I'm trying to extract every value from the div declared as <div class='cont'>. However, when I run the code below, I get the values inside both <div class='cont'> and <div class='button'>. Does anyone know why? Here is the complete code to reproduce it:

using System;
using System.Xml.XPath;
using HtmlAgilityPack;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            const string text1 = @"<div class=""cont"">
<h3>content</h3> 
<div style=""margin: 0cm 0cm 0pt"" class=""Normal"">content1</div><div style=""margin: 0cm 0cm 0pt"" class=""Normal""> content2</div>
<div style=""margin: 0cm 0cm 0pt"" class=""Normal"">content3 </div>
<div>content4 </div><strong>content5
<div>content6 </div><ul type=""disc"">    
<div>content7 </div>        
<div>content8 </div>    </ul>
<p class='margin10'><font size=""2"">
<div>
<p><span style=""font-family: Arial"">content9</span></p>
</div>
<div>content10</font><a href=""mailto:james@polis.com""><u><font color=""#0000ff"" size=""2""><font color=""#0000ff"" size=""2""> content11 </u></font></font></a><font size=""2""> content12
<div>content13</div>
</div>
</font>
</p>
</div>
<div class=""button"">
<span class=""applybtn""><a class=""buttonGlobal buttonAlpha"" href=""/uk/job/apply/(id)/608735"">content14</a></span>
</div>";
            foreach (XPathNavigator node in SearchInPage(text1, "//div[@class='cont']"))
            {
                Console.WriteLine("option " + node.Value);
            }

        }

        private static XPathNodeIterator SearchInPage(string text, string xpath)
        {
            HtmlDocument htmlDocument = new HtmlDocument();
            htmlDocument.LoadHtml(text);
            XPathNavigator xpathNavigator = htmlDocument.CreateNavigator();
            XPathNodeIterator nodes = xpathNavigator.Select(xpath);
            return nodes;
        }
    }
}

The code prints "content" and "content1" through "content13", plus "content14", which is inside <div class='button'>.

6/18/2012 9:25:50 PM

Popular Answer

If I understand correctly, you want the values of only the <div class="cont"> node's children?

Do this:

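HtmlDocument doc = new HtmlDocument();
// html holds the page markup as a string; use doc.Load(path) to read from a file instead.
doc.LoadHtml(html);
// SelectSingleNode returns null when nothing matches the expression.
HtmlNode node = doc.DocumentNode.SelectSingleNode(".//div[@class='cont']");

// HtmlNode is not enumerable itself; iterate its ChildNodes collection.
foreach (HtmlNode childNode in node.ChildNodes)
{
    // HtmlNode exposes its text through InnerText rather than a Value property.
    Console.WriteLine(childNode.InnerText);
}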

This should work, but I have no way to test it right now. The expression .//div[@class='cont'] should select only the specified node, so working from that node and its children ignores everything that sits outside it. The only things this relies on are Linq and HtmlAgilityPack. Keep in mind that HtmlAgilityPack provides its own XPath support, so look through the methods it offers before reaching for System.Xml.XPath directly. Also keep in mind that HTML and XML are two distinct markup languages, and what works for one won't always work for the other.
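To illustrate the point that the XPath expression only selects the named node, here is a minimal sketch that uses only System.Xml.XPath on a small, well-formed document. The markup, the XPathValueDemo class and the FirstValue helper are hypothetical stand-ins, not part of the question's code, and the sketch does not show how HtmlAgilityPack builds its tree from the question's overlapping tags. It shows only that, when the two divs really are siblings, the Value of the matched node is the concatenated text of its own descendants.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

class XPathValueDemo
{
    // Hypothetical helper: returns the string value of the first node
    // matched by xpath, or null when nothing matches.
    public static string FirstValue(string xml, string xpath)
    {
        var document = new XPathDocument(new StringReader(xml));
        XPathNavigator match = document.CreateNavigator().SelectSingleNode(xpath);
        return match?.Value;
    }

    static void Main()
    {
        // Well-formed stand-in for the page: 'cont' and 'button' are siblings.
        const string xml =
            "<root>" +
            "<div class='cont'><h3>content</h3><div>content1</div></div>" +
            "<div class='button'><a href='#'>content14</a></div>" +
            "</root>";

        // Value concatenates every text node below the matched element.
        Console.WriteLine(FirstValue(xml, "//div[@class='cont']")); // prints "contentcontent1"
    }
}
```

If the same expression still returns content14 for the real page, it may be worth checking whether the parsed tree has the button div nested inside the cont div, since the selection itself only ever returns the matched node and what lies beneath it.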

6/26/2012 8:15:23 PM


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow