I have an HTML document that is structured as follows:
<ul class="beverageFacts">
<li>
<span>Vintage</span>
<strong>2007 </strong>
</li>
<li>
<span>ABV</span>
<strong>13,0 %</strong>
</li>
<li>
<span>Sugar</span>
<strong>5 gram/liter</strong>
</li>
I need to parse the values of the <strong> tags into the corresponding strings, depending on what value the <span> tag has.
I have the following:
String vintage;
String sugar;
String abv;
As of now, I am looping through each child node of the beverageFacts node, checking its value so that I can assign it to the corresponding string.
The code I have so far to get the "Vintage" value is the following, though the result is always null:
HtmlNodeCollection childNodes = bevFactNode.ChildNodes;
foreach (HtmlNode subNode in childNodes)
{
if (subNode.InnerText.TrimStart() == "Vintage")
vintage = subNode.NextSibling.InnerText.Trim();
}
I believe my selection of the nodes is incorrect, but I cannot figure out how to properly do it in the most efficient way.
Is there an easy way to achieve this?
Edit 2013-07-29
I have tried to remove the whitespace, as suggested by enricoariel in the comments, using the following code:
HtmlAgilityPack.HtmlDocument page = new HtmlWeb().Load("http://www.systembolaget.se/" + articleID);
string cleanDoc = Regex.Replace(page.DocumentNode.OuterHtml, @"\s*(?<capture><(?<markUp>\w+)>.*<\/\k<markUp>>)\s*", "${capture}", RegexOptions.Singleline);
HtmlDocument cleanPage = new HtmlDocument();
cleanPage.LoadHtml(cleanDoc);
The result is still
String vintage = null;
Looking at the HTML markup, I realized I didn't go deep enough into the nodes. Also, as enricoariel pointed out, there is whitespace that I did not clean properly. By skipping the sibling that holds the whitespace and jumping to the one after it, I get the correct result:
foreach (HtmlNode bevFactNode in bevFactsNodes)
{
HtmlNodeCollection childNodes = bevFactNode.ChildNodes;
foreach (HtmlNode node in childNodes)
{
foreach(HtmlNode subNode in node.ChildNodes)
{
if (subNode.InnerText.Trim() == "Årgång")
vintage = HttpUtility.HtmlDecode(subNode.NextSibling.NextSibling.InnerText.Trim());
}
}
}
Console.WriteLine("Vintage: " + vintage);
This outputs:
Vintage: 2007
I decoded the HTML to get the result formatted correctly.
Lessons learned!
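The whitespace sibling described above can be made visible with a small check. This is only a sketch: the class and method names (SiblingDemo, Describe) are hypothetical, and it assumes the markup contains a line break between </span> and <strong>, as in the sample at the top.

```csharp
using System;
using HtmlAgilityPack;

public static class SiblingDemo
{
    // Hypothetical helper: reports what follows the first <span> in the markup.
    public static string Describe(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode span = doc.DocumentNode.SelectSingleNode("//span");
        HtmlNode next = span.NextSibling;       // the line break between the tags, kept as a text node
        HtmlNode afterNext = next.NextSibling;  // the element that follows that text node

        return next.NodeType + " -> " + afterNext.Name;
    }
}
```

For the fragment "<li>\n<span>Vintage</span>\n<strong>2007 </strong>\n</li>" this should report a Text node followed by strong, which is why a single NextSibling never reaches the value.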
To summarize, I think the best solution would be to strip all whitespace between the tags with a regex before retrieving the NextSibling value:
string myHtml =
@"
<ul class='beverageFacts'>
<li>
<span>Vintage</span>
<strong>2007 </strong>
</li>
<li>
<span>ABV</span>
<strong>13,0 %</strong>
</li>
<li>
<span>Sugar</span>
<strong>5 gram/liter</strong>
</li>";
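// Remove whitespace before each tag and collapse whitespace after each tag,
// so that no whitespace-only text node is left between <span> and <strong>.
myHtml = Regex.Replace(myHtml, @"\s+<", "<", RegexOptions.Multiline | RegexOptions.Compiled);
myHtml = Regex.Replace(myHtml, @">\s+", "> ", RegexOptions.Compiled | RegexOptions.Multiline);
HtmlDocument doc = new HtmlDocument();
// Set parser options before loading; setting them afterwards has no effect on the parsed tree.
doc.OptionFixNestedTags = true;
// Do not strip every space here: that would also break "<ul class=...>" and values such as "5 gram/liter".
doc.LoadHtml(myHtml);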
HtmlNodeCollection vals = doc.DocumentNode.SelectNodes("//ul[@class='beverageFacts']//span");
var myNodeContent = string.Empty;
foreach (HtmlNode val in vals)
{
if (val.InnerText == "Vintage")
{
myNodeContent = val.NextSibling.InnerText;
}
}
return myNodeContent;
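As a possible alternative that avoids both the regex cleanup and NextSibling, each <li> can be queried for its own <span> and <strong> via XPath, so whitespace text nodes never come into play. The following is a minimal sketch, not a tested drop-in: the class and method names (BeverageFacts, Read) are made up for illustration, and it assumes every <li> holds exactly one <span> label and one <strong> value, as in the sample markup.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class BeverageFacts
{
    // Hypothetical helper: maps each <span> label to the text of the <strong> in the same <li>.
    public static Dictionary<string, string> Read(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var facts = new Dictionary<string, string>();
        HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//ul[@class='beverageFacts']/li");
        if (items == null)
            return facts; // SelectNodes returns null when nothing matches

        foreach (HtmlNode item in items)
        {
            // Element-only XPath steps skip the whitespace text nodes between the tags.
            HtmlNode label = item.SelectSingleNode("./span");
            HtmlNode value = item.SelectSingleNode("./strong");
            if (label == null || value == null)
                continue;

            // Decode entities (e.g. in "Årgång") and trim surrounding whitespace.
            string key = HtmlEntity.DeEntitize(label.InnerText).Trim();
            facts[key] = HtmlEntity.DeEntitize(value.InnerText).Trim();
        }
        return facts;
    }
}
```

The three strings from the question could then be filled with lookups such as facts.TryGetValue("Vintage", out vintage), using whichever label text the page actually contains.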