HtmlAgilityPack SelectNodes, Disposing

c# html-agility-pack web-scraping

Question

I am trying to do some screen scraping with HtmlAgilityPack, using SelectNodes and reading some values from each node it returns.

Here is the code:

private readonly HtmlDocument _document = new HtmlDocument();

public void ParseValues(string html)
{
    _document.LoadHtml(html);
    var tables = _document.DocumentNode.SelectNodes("//table");

    foreach (var table in tables)
    {
        _document.LoadHtml(table.OuterHtml);
        var value = _document.DocumentNode.SelectSingleNode("//tbody[1]/tr/td[0]");
    }
}

But I have noticed that when I try to select child nodes inside the foreach loop, the search actually starts from the document root, which is really annoying.

Questions:

  1. Is there a way to select the values from each table returned by SelectNodes without having to create a new HtmlDocument instance for each one?

  2. Is there a way to dispose of an HtmlDocument? I noticed that there is a memory leak every time I call _document.LoadHtml(html);

Popular Answer

(for a more detailed explanation, see Html Agility Pack - Problem selecting subnode)


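You don't have to create another HtmlDocument object or load more HTML into it. Call SelectSingleNode on each table node with a relative XPath:

foreach (var table in tables)
{
    // ".//" searches only below this table; XPath positions start at 1, so td[1] is the first cell
    var value = table.SelectSingleNode(".//tbody[1]/tr/td[1]");
}

The key is to use .//tbody instead of //tbody. An expression that starts with // is always evaluated from the document root, even when it is called on a child node; the leading dot makes it relative to the current node. Note also that XPath positions are 1-based, so td[0] never matches anything; td[1] selects the first cell.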
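
For completeness, here is a minimal self-contained sketch of the relative-XPath approach. The class name TableScraper, the method FirstCells, and the sample HTML in the usage note are hypothetical and only there for illustration. The sketch assumes the tables contain an explicit tbody element, as in the question, and it guards against SelectNodes returning null when nothing matches.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TableScraper
{
    // Hypothetical helper: collects the text of the first cell found in each table.
    public static List<string> FirstCells(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var results = new List<string>();

        // SelectNodes returns null (not an empty collection) when there is no match.
        var tables = document.DocumentNode.SelectNodes("//table");
        if (tables == null)
        {
            return results;
        }

        foreach (var table in tables)
        {
            // The leading "." keeps the search inside the current table.
            // XPath positions are 1-based, so td[1] is the first cell.
            var cell = table.SelectSingleNode(".//tbody[1]/tr/td[1]");
            if (cell != null)
            {
                results.Add(cell.InnerText.Trim());
            }
        }

        return results;
    }
}
```

Called with two tables whose first cells are A1 and B1, for example "<table><tbody><tr><td>A1</td><td>A2</td></tr></tbody></table><table><tbody><tr><td>B1</td></tr></tbody></table>", this should yield one value per table. The HTML is loaded only once, so the repeated LoadHtml calls from the question are no longer needed.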




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why