Suppose I have the following HTML:
<p id="definition">
<span class="hw">emolument</span> \ih-MOL-yuh-muhnt\, <i>noun</i>:
The wages or perquisites arising from office, employment, or labor
</p>
I want to extract each part separately using HtmlAgilityPack in C#.
I can get the word and word class easily enough:
var definition = doc.DocumentNode.Descendants()
    .Where(x => x.Name == "p" && x.Id == "definition")
    .FirstOrDefault();
string word = definition.Descendants()
    .Where(x => x.Name == "span")
    .FirstOrDefault().InnerText;
string word_class = definition.Descendants()
    .Where(x => x.Name == "i")
    .FirstOrDefault().InnerText;
But how do I get the pronunciation or actual definition? These fall between nodes, and if I use definition.InnerText I get the whole lot in one string. Is there a way to do this in XPath perhaps?
How do I select text between nodes in HtmlAgilityPack?
Is there a way to do this in XPath perhaps?
Yes - and quite an easy one.
The key concept you need to understand is how text and child element nodes are organized in XML/HTML - and thus XPath.
If the textual content of an element is punctuated by child elements, the pieces of text between them end up in separate text nodes. You can access individual text nodes by their position, counting from 1.
Simply using text() on any element retrieves all child text nodes. Applying //p/text() to the snippet you have shown yields (individual results separated by -------):
[EMPTY TEXT NODE, EXCEPT WHITESPACE]
-----------------------
\ih-MOL-yuh-muhnt\,
-----------------------
:
The wages or perquisites arising from office, employment, or labor
The first text node of this p element only contains whitespace, so that's probably not what you're after. //p/text()[2] retrieves
\ih-MOL-yuh-muhnt\,
and //p/text()[3] retrieves
:
The wages or perquisites arising from office, employment, or labor
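To connect the XPath above to HtmlAgilityPack, here is a minimal sketch. It assumes the markup looks exactly like the snippet in the question, so that the pronunciation is the second text node of the p element and the definition is the third. The class and method names (DefinitionParser, Extract) are made up for illustration, and stripping the trailing comma and leading colon is my own assumption about what counts as the "clean" value.

```csharp
using System;
using HtmlAgilityPack;

public static class DefinitionParser
{
    // Hypothetical helper: pulls out the two pieces of text that sit
    // between the child elements of <p id="definition">.
    public static (string Pronunciation, string Definition) Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // text()[1] is the whitespace before <span>, so start at position 2.
        HtmlNode pron = doc.DocumentNode.SelectSingleNode("//p[@id='definition']/text()[2]");
        HtmlNode def = doc.DocumentNode.SelectSingleNode("//p[@id='definition']/text()[3]");

        // SelectSingleNode returns null when nothing matches.
        // Assumption: the comma and colon are separators, not part of the values.
        string pronunciation = pron == null ? "" : pron.InnerText.Trim().TrimEnd(',');
        string definition = def == null ? "" : def.InnerText.Trim().TrimStart(':').Trim();

        return (pronunciation, definition);
    }
}
```

Calling DefinitionParser.Extract with the HTML from the question should give the pronunciation and the definition as two separate strings. If the real pages have extra inline elements inside the paragraph, the positions will shift, so treat the indexes 2 and 3 as specific to this snippet.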
HtmlNode text = doc.DocumentNode.Descendants().Where(x => x.Name == "p" && x.Id == "definition").FirstOrDefault();
foreach (HtmlNode node in text.SelectNodes(".//text()"))
{
    Console.WriteLine(node.InnerText.Trim());
}
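This prints every text node under the p element, one per call, including the text inside the span and i elements and the whitespace-only node at the start.
If you only want the second text node of the p element, \ih-MOL-yuh-muhnt\, you need this: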
HtmlNode a = text.SelectSingleNode(".//text()[2]");
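If you would rather stay with LINQ, as in the question, a rough equivalent is to filter the element's direct children by node type. This is only a sketch: DirectTextParts is a hypothetical helper name, and dropping whitespace-only nodes is an assumption about what you want back.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TextNodeHelper
{
    // Hypothetical helper: returns the trimmed text of each direct child
    // text node of the given element, skipping whitespace-only nodes.
    public static List<string> DirectTextParts(HtmlNode element)
    {
        return element.ChildNodes
            .Where(n => n.NodeType == HtmlNodeType.Text) // text nodes only, not span or i
            .Select(n => n.InnerText.Trim())
            .Where(t => t.Length > 0)                    // drop the whitespace-only node
            .ToList();
    }
}
```

For the p element in the question this should leave two entries, the pronunciation (still with its trailing comma) followed by the definition (still with its leading colon), so the whitespace node no longer affects the indexes.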