HtmlAgilityPack extracts text from all divs in a page and not just from the one div specified in the code

c# html-agility-pack


I'm seeing strange behaviour with an XPath expression in HtmlAgilityPack. I'm trying to use HtmlAgilityPack to extract all the values within a div declared as <div class='cont'>. However, when I use the code below, I get all the values within <div class='cont'> AND <div class='button'>. Does anyone know why this is happening? Here is the full code to reproduce it:

using System;
using System.Xml.XPath;
using HtmlAgilityPack;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            const string text1 = @"<div class=""cont"">
<div style=""margin: 0cm 0cm 0pt"" class=""Normal"">content1</div><div style=""margin: 0cm 0cm 0pt"" class=""Normal""> content2</div>
<div style=""margin: 0cm 0cm 0pt"" class=""Normal"">content3 </div>
<div>content4 </div><strong>content5
<div>content6 </div><ul type=""disc"">    
<div>content7 </div>        
<div>content8 </div>    </ul>
<p class='margin10'><font size=""2"">
<p><span style=""font-family: Arial"">content9</span></p>
<div>content10</font><a href=""""><u><font color=""#0000ff"" size=""2""><font color=""#0000ff"" size=""2""> content11 </u></font></font></a><font size=""2""> content12
<div class=""button"">
<span class=""applybtn""><a class=""buttonGlobal buttonAlpha"" href=""/uk/job/apply/(id)/608735"">content14</a></span>";

            foreach (XPathNavigator node in SearchInPage(text1, "//div[@class='cont']"))
            {
                Console.WriteLine("option " + node.Value);
            }
        }

        private static XPathNodeIterator SearchInPage(string text, string xpath)
        {
            HtmlDocument htmlDocument = new HtmlDocument();
            htmlDocument.LoadHtml(text); // parse the markup before querying it
            XPathNavigator xpathNavigator = htmlDocument.CreateNavigator();
            XPathNodeIterator nodes = xpathNavigator.Select(xpath);
            return nodes;
        }
    }
}

The code returns 'content', 'content1-13' PLUS 'content14', which exists within <div class='button'>.

6/18/2012 9:25:50 PM

Popular Answer

So, if I'm understanding correctly, you want to find the values only for the child nodes of <div class="cont">?

Try this:

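HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text1); // text1 is the HTML string from the question
HtmlNode node = doc.DocumentNode.SelectSingleNode(".//div[@class='cont']");

foreach (HtmlNode childNode in node.ChildNodes)
{
    Console.WriteLine("option " + childNode.InnerText);
}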

I don't have a way to debug this in front of me, but this should work. The expression ".//div[@class='cont']" should select only the specified node and its children, and ignore anything that lives outside the specified node. The rest is just LINQ and HtmlAgilityPack. Remember that HtmlAgilityPack implements XPath, so make sure to look through HtmlAgilityPack's available methods before using XPath. Also remember that XML and HTML are different languages, and what works for one won't necessarily work for the other.
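
One thing worth keeping in mind: the string value of an element (XPathNavigator.Value, or InnerText in HtmlAgilityPack) is the concatenated text of all its descendants, so if the parser ends up nesting <div class='button'> inside <div class='cont'>, its text comes along too. If that is what is happening here, one option is to remove the nested div before reading the text. This is only a rough sketch, not something I've run against your page; the shortened markup and the ContTextDemo class name are made up for illustration:

```csharp
using System;
using HtmlAgilityPack;

class ContTextDemo
{
    static void Main()
    {
        // Hypothetical, simplified markup: the "button" div sits inside "cont".
        const string html =
            @"<div class=""cont""><div>content1</div>" +
            @"<div class=""button""><a href=""#"">content14</a></div></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode cont = doc.DocumentNode.SelectSingleNode("//div[@class='cont']");
        if (cont == null)
        {
            return; // no matching div in the document
        }

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection buttons = cont.SelectNodes(".//div[@class='button']");
        if (buttons != null)
        {
            foreach (HtmlNode button in buttons)
            {
                button.Remove(); // detach the nested div from the tree
            }
        }

        // InnerText now covers only what is left inside "cont".
        Console.WriteLine("option " + cont.InnerText);
    }
}
```

Note that Remove() changes the parsed document, so if you still need the button div afterwards, load the HTML into a second HtmlDocument first.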

6/26/2012 8:15:23 PM

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow