Getting text from between two html nodes using HtmlAgilityPack

c# html-agility-pack linq nodes xpath

Question

Suppose I have the following HTML:

<p id="definition">
    <span class="hw">emolument</span> \ih-MOL-yuh-muhnt\, <i>noun</i>:
    The wages or perquisites arising from office, employment, or labor
</p>

I want to extract each part separately using HtmlAgilityPack in C#.

I can get the word and word class easily enough:

var definition = doc.DocumentNode.Descendants()
    .Where(x => x.Name == "p" && x.Id == "definition")
    .FirstOrDefault();

string word = definition.Descendants()
    .Where(x => x.Name == "span")
    .FirstOrDefault().InnerText;

string word_class = definition.Descendants()
    .Where(x => x.Name == "i")
    .FirstOrDefault().InnerText;

But how do I get the pronunciation or the actual definition? These fall between nodes, and if I use definition.InnerText I get the whole lot in one string. Is there a way to do this in XPath perhaps?

How do I select text between nodes in HtmlAgilityPack?

Accepted Answer

Is there a way to do this in XPath perhaps?

Yes - and quite an easy one.

The key concept you need to understand is how text and child element nodes are organized in XML/HTML - and thus XPath.

If the textual content of an element is interrupted by child elements, the pieces of text end up in separate text nodes. You can access the individual text nodes by their position.

Simply using text() on any element retrieves all child text nodes. Applying //p/text() to the snippet you have shown yields (individual results separated by -------):

[EMPTY TEXT NODE, EXCEPT WHITESPACE]
-----------------------
\ih-MOL-yuh-muhnt\,
-----------------------
:
The wages or perquisites arising from office, employment, or labor

The first text node of this p element only contains whitespace, so that's probably not what you're after. //p/text()[2] retrieves

  \ih-MOL-yuh-muhnt\,

and //p/text()[3]:

:
The wages or perquisites arising from office, employment, or labor
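As a rough illustration, the positional XPath expressions above could be applied with HtmlAgilityPack along these lines. This is a minimal sketch, not a definitive implementation: the class name DefinitionXPathSketch, the helper TextNodeAt, and the trimming of the leading colon are assumptions made for this example, and it assumes the parser keeps the whitespace-only first text node as described above.

```csharp
using System;
using HtmlAgilityPack;

public static class DefinitionXPathSketch
{
    // The snippet from the question, kept verbatim (hypothetical constant name).
    public const string Html = @"<p id=""definition"">
    <span class=""hw"">emolument</span> \ih-MOL-yuh-muhnt\, <i>noun</i>:
    The wages or perquisites arising from office, employment, or labor
</p>";

    // Returns the trimmed text of the n-th (1-based) direct text node of
    // <p id="definition">, or null when the paragraph or the node is missing.
    public static string TextNodeAt(HtmlDocument doc, int position)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode(
            "//p[@id='definition']/text()[" + position + "]");
        return node == null ? null : node.InnerText.Trim();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);

        // text()[1] is the whitespace-only node, so the content starts at 2.
        string pronunciation = TextNodeAt(doc, 2);
        string meaning = TextNodeAt(doc, 3);

        // The third text node starts with the colon that follows </i>;
        // dropping it is an assumption about what the caller wants.
        if (meaning != null)
            meaning = meaning.TrimStart(':').Trim();

        Console.WriteLine(pronunciation);
        Console.WriteLine(meaning);
    }
}
```

The positions are tied to the exact markup: if the whitespace before the span disappears, the first text node disappears with it and every index shifts down by one.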

Popular Answer

HtmlNode text = doc.DocumentNode.Descendants()
    .FirstOrDefault(x => x.Name == "p" && x.Id == "definition");

foreach (HtmlNode node in text.SelectNodes(".//text()"))
{
    Console.WriteLine(node.InnerText.Trim());
}

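The output will be the following (a leading whitespace-only text node also prints an empty line first, and lines 4 and 5 come from a single text node):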

  1. emolument
  2. \ih-MOL-yuh-muhnt\,
  3. noun
  4. :
  5. The wages or perquisites arising from office, employment, or labor

If you only want the second result (\ih-MOL-yuh-muhnt\,), use this:

HtmlNode a = text.SelectNodes(".//text()[2]").FirstOrDefault();
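Since the question is also tagged linq, the same idea can be sketched without XPath by filtering the paragraph's direct children. This is an illustrative sketch under assumptions: the class name DefinitionLinqSketch and the helper DirectTextParts are made up for this example. Because whitespace-only nodes are discarded first, the indexes do not depend on how the markup is indented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class DefinitionLinqSketch
{
    // The snippet from the question, kept verbatim (hypothetical constant name).
    public const string Html = @"<p id=""definition"">
    <span class=""hw"">emolument</span> \ih-MOL-yuh-muhnt\, <i>noun</i>:
    The wages or perquisites arising from office, employment, or labor
</p>";

    // Collects the direct text children of a node, trimmed, skipping
    // nodes that contain nothing but whitespace.
    public static List<string> DirectTextParts(HtmlNode parent)
    {
        return parent.ChildNodes
            .Where(n => n.NodeType == HtmlNodeType.Text)
            .Select(n => n.InnerText.Trim())
            .Where(s => s.Length > 0)
            .ToList();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);

        HtmlNode definition = doc.DocumentNode
            .SelectSingleNode("//p[@id='definition']");
        if (definition == null)
            return; // nothing to extract

        // Text inside <span> and <i> is not included: only direct children count.
        foreach (string part in DirectTextParts(definition))
            Console.WriteLine(part);
    }
}
```

The first remaining part is the pronunciation (with its trailing comma) and the second is the colon followed by the definition; any further cleanup of punctuation is left to the caller.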


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow