HTML Agility Pack - Grab Text after a node

c# html html-agility-pack

Question

I have some HTML that I'm parsing using C#.

A sample is below; the same structure is repeated about 150 times with different records.

<strong>Title</strong>: Mr<br>
<strong>First name</strong>: Fake<br>
<strong>Surname</strong>: Guy<br>

I'm trying to get the text into an array that looks like this:

customerArray [0,0] = Title
customerArray [0,1] = Mr
customerArray [1,0] = First Name
customerArray [1,1] = Fake
customerArray [2,0] = Surname
customerArray [2,1] = Guy

I can get the text into the array, but I'm having trouble getting the text after the closing STRONG tag up until the BR tag, and then finding the next STRONG tag.

Any help would be appreciated.

Accepted Answer

You can use the XPath expression following-sibling::text()[1] to get the text node located directly after each strong element. Here is a minimal but complete example:

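using System;
using HtmlAgilityPack;

var raw = @"<div>
<strong>Title</strong>: Mr<br>
<strong>First name</strong>: Fake<br>
<strong>Surname</strong>: Guy<br>
        </div>";
var doc = new HtmlDocument();
doc.LoadHtml(raw);
// SelectNodes returns null when nothing matches, so this assumes at least one <strong>.
foreach(HtmlNode node in doc.DocumentNode.SelectNodes("//strong"))
{
    // First text node that follows this <strong> among its siblings
    var val = node.SelectSingleNode("following-sibling::text()[1]");
    Console.WriteLine(node.InnerText + ", " + val.InnerText);
}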

dotnetfiddle demo

Output:

Title, : Mr
First name, : Fake
Surname, : Guy

If needed, you can remove the leading ":" with simple string manipulation, for example val.InnerText.TrimStart(':', ' ').
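
As a minimal sketch of that string manipulation (the helper name StripSeparator is hypothetical and not part of HTML Agility Pack), trimming the leading colon and the surrounding whitespace from the text node leaves just the value. It assumes the value itself never starts with a colon:

```csharp
using System;

public static class FieldCleaner
{
    // Hypothetical helper: removes the leading ':' separator and the
    // whitespace around the value, e.g. ": Mr" becomes "Mr".
    public static string StripSeparator(string rawText)
    {
        if (rawText == null) return string.Empty;
        // TrimStart strips any mix of the listed characters from the front;
        // Trim then removes trailing whitespace such as a newline.
        return rawText.TrimStart(':', ' ', '\t', '\r', '\n').Trim();
    }

    public static void Main()
    {
        Console.WriteLine(StripSeparator(": Mr"));      // Mr
        Console.WriteLine(StripSeparator(" : Fake\n")); // Fake
    }
}
```

In the loop above this would be called as StripSeparator(val.InnerText).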


Popular Answer

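<strong> is a common tag, so here is something more specific to the sample format you provided: it only matches a <strong> that is directly followed by a non-empty text node and then a <br>.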

using System;
using System.Linq;
using HtmlAgilityPack;

var html = @"
<div>
<strong>First name</strong><em>italic</em>: Fake<br>
<strong>Bold</strong> <a href='#'>hyperlink</a><br>.
<strong>bold</strong>
<strong>bold</strong> <br>
text
</div>

<div>
<strong>Title</strong>: Mr<BR>
<strong>First name</strong>: Fake<br>
<strong>Surname</strong>: Guy<br>
</div>";

var document = new HtmlDocument();
document.LoadHtml(html);
// 1. <strong>
var strong = document.DocumentNode.SelectNodes("//strong");
if (strong != null)
{
    foreach (var node in strong.Where(
        // 2. followed by non-empty text node
        x => x.NextSibling is HtmlTextNode
        && !string.IsNullOrWhiteSpace(x.NextSibling.InnerText)
        // 3. followed by <br> (the type test also rejects a missing sibling)
        && x.NextSibling.NextSibling is HtmlNode
        && string.Equals(x.NextSibling.NextSibling.Name, "br", StringComparison.OrdinalIgnoreCase)))
    {
        Console.WriteLine("{0} {1}", node.InnerText, node.NextSibling.InnerText);
    }
}
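
The loop above only prints each pair. To store the pairs in the customerArray layout from the question, a sketch along these lines could work. The class and method names are hypothetical, and the hard-coded list stands in for the node.InnerText and node.NextSibling.InnerText values that the loop produces:

```csharp
using System;
using System.Collections.Generic;

public static class CustomerTable
{
    // Hypothetical helper: copies (label, value) pairs into a two-column
    // array so that row i holds the label in [i,0] and the value in [i,1].
    public static string[,] ToArray(IReadOnlyList<KeyValuePair<string, string>> pairs)
    {
        var table = new string[pairs.Count, 2];
        for (int i = 0; i < pairs.Count; i++)
        {
            table[i, 0] = pairs[i].Key.Trim();
            // Drop the ':' separator that precedes the value in the sample HTML.
            table[i, 1] = pairs[i].Value.TrimStart(':', ' ').Trim();
        }
        return table;
    }

    public static void Main()
    {
        // Stand-ins for node.InnerText and node.NextSibling.InnerText.
        var pairs = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("Title", ": Mr"),
            new KeyValuePair<string, string>("First name", ": Fake"),
            new KeyValuePair<string, string>("Surname", ": Guy"),
        };
        string[,] customerArray = ToArray(pairs);
        Console.WriteLine(customerArray[0, 0] + " = " + customerArray[0, 1]); // Title = Mr
    }
}
```

Inside the foreach, each match would be added to the list with pairs.Add(new KeyValuePair<string, string>(node.InnerText, node.NextSibling.InnerText)) before calling ToArray once at the end.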


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why