Pardon me if it sounds too simple to be asked here but since this is my very first day with html-agility-pack, I am unable to sort out a way to select the inner text of a node which is the direct child of the node and ignoring inner text of the children nodes.
For example
<div id="div1">
<div class="h1"> this needs to be selected
<small> and not this</small>
</div>
</div>
currently I am trying this
HtmlDocument page = new HtmlWeb().Load(url);
var s = page.DocumentNode.SelectSingleNode("//div[@id='div1']//div[@class='h1']");
string selText = s.innerText;
which returns the whole text (e.g- this needs to be selected and not this). Any suggestions??
You can use the /text()
option to get all text nodes directly under a specific tag. If you only need the first one, add [1]
to it:
page.LoadHtml(text);
var s = page.DocumentNode.SelectSingleNode("//div[@id='div1']//div[@class='h1']/text()[1]");
string selText = s.InnerText;
The div
could possibly have multiple text nodes if there is text before and after its children. As I similarly indicated here, I think the best way to get all the direct text content of a node is to do something like:
HtmlDocument page = new HtmlWeb().Load(url);
var nodes = page.DocumentNode.SelectNodes("//div[@id='div1']//div[@class='h1']/text()");
StringBuilder sb = new StringBuilder();
foreach(var node in nodes)
{
sb.Append(node.InnerText);
}
string content = sb.ToString();