Parsing HTML with the HTML Agility Pack

c# html-agility-pack

Question

I'm attempting to use the HTML Agility Pack to parse the following HTML.

This is the portion of the whole file that my code returns:

<div class="story-body fnt-13 p20-b user-gen">
    <p>text here text here text </p>
    <p>text here text here text text here text here text text here text here text text here text here text </p>
    <div  class="gallery clr bdr aln-c js-no-shadow mod  cld">
        <div>
            <ol>
                <li class="fader-item aln-c ">
                    <div class="imageWrap m10-b">
                       &#8203;<img class="http://www.domain.com/picture.png| " src="http://www.domain.com/picture.png" alt="alt text" />
                    </div>
                    <p class="caption">caption text</p>
                </li>
            </ol>
        </div>
    </div >
    <p>text here text here text text here text here text text here text here text text here text here text </p>
    <p>text here text here text text here text here text text here text here text text here text here text text here text here text </p>
    <p>text here text here text text here text here text text here text here text text here text here text text here text here text </p>
</div>

This snippet is obtained using the following code (which is messy, I know):

string url = "http://www.domain.com/story.html";
var webGet = new HtmlWeb();
var document = webGet.Load(url);

var links = document.DocumentNode
        .Descendants("div")
        .Where(div => div.GetAttributeValue("class", "").Contains("story-body fnt-13 p20-b user-gen"))
        .SelectMany(div => div.Descendants("p"))
        .ToList();
int cn = links.Count;

HtmlAgilityPack.HtmlNodeCollection tl = document.DocumentNode.SelectNodes("/html[1]/body[1]/div[1]/div[2]/div[1]/div[1]/div[1]/div[2]/div[1]");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
    textBox1.AppendText(node.InnerText.Trim());
    textBox1.AppendText(System.Environment.NewLine);
}

The program loops over each p and (for the time being) appends it to a text box. Everything works properly except for the div tag with the class gallery clr bdr aln-c js-no-shadow mod cld. That piece of HTML results in the &#8203; and the caption text ending up in my output.

How can I keep that out of the results?

11/28/2011 8:00:37 PM

Accepted Answer

XPath is your friend. Try this instead of the awful XLinq syntax :-)

HtmlNodeCollection tl = document.DocumentNode.SelectNodes("//p[not(@*)]");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
    Console.WriteLine(node.InnerText.Trim());
}

This expression selects every p node in the document that has no attributes set, so the p with class="caption" is skipped. Other examples can be found here: XPath Syntax
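Since //p[not(@*)] matches attribute-less p elements anywhere in the page, it may also pick up paragraphs outside the story. Here is a minimal sketch of a narrower variant that only takes the p elements sitting directly under the story div. The class name ScopedXPathDemo, the method StoryParagraphs and the sample markup are made up for illustration, and the sketch assumes the HtmlAgilityPack NuGet package is referenced. As far as I know, SelectNodes returns null rather than an empty collection when nothing matches, so the sketch guards against that.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ScopedXPathDemo
{
    // Hypothetical sample markup modelled on the HTML in the question.
    public const string SampleHtml =
        "<html><body>" +
        "<p>site header</p>" +
        "<div class=\"story-body fnt-13 p20-b user-gen\">" +
        "<p>first paragraph</p>" +
        "<div class=\"gallery\"><p class=\"caption\">caption text</p></div>" +
        "<p>second paragraph</p>" +
        "</div>" +
        "</body></html>";

    // Returns the trimmed text of the attribute-less p elements that are
    // direct children of the div whose class contains "story-body".
    public static List<string> StoryParagraphs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(
            "//div[contains(@class, 'story-body')]/p[not(@*)]");

        var result = new List<string>();
        if (nodes == null)
        {
            // Nothing matched: avoid a NullReferenceException in the loop.
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            result.Add(node.InnerText.Trim());
        }
        return result;
    }

    public static void Main()
    {
        foreach (string text in StoryParagraphs(SampleHtml))
        {
            Console.WriteLine(text);
        }
    }
}
```

The /p step uses the child axis, so the caption nested inside the gallery div is never visited, and the header paragraph outside the story div is ignored as well.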

11/28/2011 9:54:30 PM

Popular Answer

Your question isn't really clear. I assume you're asking how to get only the immediate children of a certain div. If that's the case, use ChildNodes instead of Descendants. That is:

.SelectMany(div => div.ChildNodes.Where(n => n.Name == "p"))

The problem is that Descendants recursively walks the entire subtree under the div, so it also returns the p nested inside the gallery div.
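To make the difference concrete, here is a small sketch that runs both approaches against the same markup. The class DirectChildrenDemo, its methods and the sample HTML are invented for illustration, and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class DirectChildrenDemo
{
    // Hypothetical sample markup modelled on the HTML in the question.
    public const string SampleHtml =
        "<div class=\"story-body fnt-13 p20-b user-gen\">" +
        "<p>first paragraph</p>" +
        "<div class=\"gallery\"><p class=\"caption\">caption text</p></div>" +
        "<p>second paragraph</p>" +
        "</div>";

    // Locates the story div the same way the question's code does.
    private static HtmlNode FindStoryDiv(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode
            .Descendants("div")
            .First(d => d.GetAttributeValue("class", "").Contains("story-body"));
    }

    // Recursive: every p anywhere below the story div, caption included.
    public static List<string> AllParagraphs(string html)
    {
        return FindStoryDiv(html)
            .Descendants("p")
            .Select(p => p.InnerText.Trim())
            .ToList();
    }

    // Non-recursive: only p elements that are immediate children.
    public static List<string> DirectParagraphs(string html)
    {
        return FindStoryDiv(html)
            .ChildNodes
            .Where(n => n.Name == "p")
            .Select(p => p.InnerText.Trim())
            .ToList();
    }

    public static void Main()
    {
        Console.WriteLine("Descendants: " + string.Join(" | ", AllParagraphs(SampleHtml)));
        Console.WriteLine("ChildNodes:  " + string.Join(" | ", DirectParagraphs(SampleHtml)));
    }
}
```

Note that ChildNodes is a property, not a method, and that it also contains text and comment nodes, which is why the Name filter is needed. HtmlNode.Elements("p") should give the same result in a single call.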

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow