Select node based on sibling properties - HtmlAgilityPack - C#

c# html-agility-pack html-parsing

Question

I have a document in HTML that is set up as follows.

<ul class="beverageFacts">
<li>
    <span>Vintage</span> 
    <strong>2007&nbsp;</strong>
</li>
<li>
    <span>ABV</span> 
    <strong>13,0&nbsp;%</strong>
</li>
<li>
    <span>Sugar</span> 
    <strong>5&nbsp;gram/liter</strong>
</li>

I need to parse the values of the <strong> tags into the corresponding strings, based on the value of the <span> tag in the same list item.

I have these string variables:

String vintage;
String sugar;
String abv;

I'm currently looping over the child nodes of each beverageFacts node, checking their values to match them with the appropriate string. This is the code I have so far to get the "Vintage" value, but the result is always null.

HtmlNodeCollection childNodes = bevFactNode.ChildNodes;
foreach (HtmlNode subNode in childNodes)
{
    if (subNode.InnerText.TrimStart() == "Vintage")
        vintage = subNode.NextSibling.InnerText.Trim();
}

I think I chose the nodes incorrectly, but I can't seem to find out how to do it correctly and effectively.

Is there a simple method to do this?


Edit 2013-07-29

I have tried to strip the whitespace with the following code, as suggested by enricoariel in the comments.

        HtmlAgilityPack.HtmlDocument page = new HtmlWeb().Load("http://www.systembolaget.se/" + articleID);

        string cleanDoc = Regex.Replace(page.DocumentNode.OuterHtml, @"\s*(?<capture><(?<markUp>\w+)>.*<\/\k<markUp>>)\s*", "${capture}", RegexOptions.Singleline);

        HtmlDocument cleanPage = new HtmlDocument();
        cleanPage.LoadHtml(cleanDoc);

The outcome is still

 String vintage = null;
5/23/2017 11:57:32 AM

Accepted Answer

After looking at the HTML structure, I realized that I hadn't gone deep enough into the nodes: the <span> and <strong> elements are children of the <li> elements, not of the <ul>. Additionally, as enricoariel pointed out, I wasn't handling the whitespace correctly. The immediate sibling of the <span> is a whitespace text node, so I get the right result by skipping that sibling and moving on to the next one.

        foreach (HtmlNode bevFactNode in bevFactsNodes)
        {
            HtmlNodeCollection childNodes = bevFactNode.ChildNodes;
            foreach (HtmlNode node in childNodes)
            {
                foreach(HtmlNode subNode in node.ChildNodes)
                {
                    if (subNode.InnerText.Trim() == "Årgång")
                        // Decode first, then trim, so the decoded &nbsp; is removed as well
                        vintage = HttpUtility.HtmlDecode(subNode.NextSibling.NextSibling.InnerText).Trim();
                }
            }
        }
        Console.WriteLine("Vintage: " + vintage);

will produce

Vintage: 2007

I HTML-decoded the value so that the output is formatted properly.

Lesson learned!
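
The double NextSibling above depends on there being exactly one whitespace text node between the <span> and the <strong>. A minimal sketch of a more tolerant variant follows, assuming the markup from the question. The class BeverageFactReader and the methods NextElementSibling and GetFact are hypothetical names introduced for illustration; they are not part of HtmlAgilityPack.

```csharp
using System.Net;
using HtmlAgilityPack;

public static class BeverageFactReader
{
    // Hypothetical helper: walks NextSibling until an element node is found,
    // skipping whitespace text nodes and comments. Returns null if there is none.
    public static HtmlNode NextElementSibling(HtmlNode node)
    {
        HtmlNode sibling = node.NextSibling;
        while (sibling != null && sibling.NodeType != HtmlNodeType.Element)
        {
            sibling = sibling.NextSibling;
        }
        return sibling;
    }

    // Hypothetical helper: returns the decoded, trimmed text of the element
    // that follows the <span> whose text equals label, or null if not found.
    public static string GetFact(HtmlDocument doc, string label)
    {
        HtmlNodeCollection spans =
            doc.DocumentNode.SelectNodes("//ul[@class='beverageFacts']/li/span");
        if (spans == null)
        {
            return null; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode span in spans)
        {
            if (span.InnerText.Trim() != label)
            {
                continue;
            }

            HtmlNode value = NextElementSibling(span);
            // Decode before trimming so that a trailing &nbsp; is stripped too.
            return value == null ? null : WebUtility.HtmlDecode(value.InnerText).Trim();
        }
        return null;
    }
}
```

With the document from the question loaded into doc, BeverageFactReader.GetFact(doc, "Vintage") would be called once per label ("Vintage", "ABV", "Sugar", or "Årgång" on the live page).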

7/28/2013 10:39:25 PM

Popular Answer

To sum up, I believe the best solution is to use a regex to remove the whitespace between tags before reading the NextSibling value:

    string myHtml =
    @"
    <ul class='beverageFacts'>
    <li>
        <span>Vintage</span> 
        <strong>2007&nbsp;</strong>
    </li>
    <li>
        <span>ABV</span> 
        <strong>13,0&nbsp;%</strong>
    </li>
    <li>
        <span>Sugar</span> 
        <strong>5&nbsp;gram/liter</strong>
    </li>";
// Strip whitespace before each tag and collapse whitespace after each tag to a single space
myHtml = Regex.Replace(myHtml, @"\s+<", "<", RegexOptions.Multiline | RegexOptions.Compiled);
myHtml = Regex.Replace(myHtml, @">\s+", "> ", RegexOptions.Compiled | RegexOptions.Multiline);

HtmlDocument doc = new HtmlDocument();
// Parser options only take effect if they are set before LoadHtml
doc.OptionFixNestedTags = true;
// Remove any remaining line breaks ("\r", "\n") and double spaces
doc.LoadHtml(myHtml.Replace("\r", "").Replace("\n", "").Replace("  ", ""));

HtmlNodeCollection vals = doc.DocumentNode.SelectNodes("//ul[@class='beverageFacts']//span");

var myNodeContent = string.Empty;
foreach (HtmlNode val in vals)
{
    if (val.InnerText == "Vintage")
    {
        myNodeContent = val.NextSibling.InnerText;
    }
}

return myNodeContent;
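
As an alternative that leaves the HTML untouched, the sibling can be selected directly in XPath, since the following-sibling axis with an element name test skips text nodes. This is only a sketch under the assumption that the markup matches the sample above. SiblingLookup and ValueFor are hypothetical names, and the label is concatenated into the XPath expression, so it must not contain a single quote.

```csharp
using System.Net;
using HtmlAgilityPack;

public static class SiblingLookup
{
    // Hypothetical helper: returns the decoded, trimmed text of the first
    // <strong> that follows the <span> whose text equals label, or null.
    public static string ValueFor(string html, string label)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // following-sibling::strong[1] matches element nodes only, so the
        // whitespace text node between <span> and <strong> is ignored.
        // Assumption: label contains no single quote.
        string xpath = "//ul[@class='beverageFacts']/li"
                     + "/span[normalize-space(text())='" + label + "']"
                     + "/following-sibling::strong[1]";

        HtmlNode value = doc.DocumentNode.SelectSingleNode(xpath);

        // SelectSingleNode returns null when nothing matches.
        return value == null ? null : WebUtility.HtmlDecode(value.InnerText).Trim();
    }
}
```

Called as SiblingLookup.ValueFor(myHtml, "Vintage"), this needs neither the regex clean-up nor the loop over the <span> nodes.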



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow