How can I get all content within a tag using the HTML Agility Pack?

c# html-agility-pack screen-scraping

Question

So I'm writing an application that will do a little screen scraping. I'm using the HTML Agility Pack to load an entire HTML page into an instance of HtmlDocument called doc. Now I want to parse that doc, looking for this:

<table border="0" cellspacing="3">
<tr><td>First rows stuff</td></tr>
<tr>
<td> 
The data I want is in here <br /> 
and it's seperated by these annoying <br /> 's.

No id's, classes, or even a single <p> tag. </p> Just a bunch of <br />  tags.
</td> 
</tr> 
</table> 

So I just need to get the data within the 2nd row. How can I do this? Should I use a regex or something else?

Update: Here is how I'm loading my doc

HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(Url);

Accepted Answer

Since you are already using the Html Agility Pack, I would suggest using the methods it provides to find the information you want. There are a few ways to navigate the document, but one of the most concise is XPath. In this case you could use something like this:

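// Requires: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load("input.html");
// SelectNodes returns null when nothing matches, in which case Single() throws.
HtmlNode node = doc.DocumentNode
                   .SelectNodes("//table[@cellspacing='3']/tr[2]/td")
                   .Single();
// InnerText returns the cell's text with the <br /> tags dropped.
string text = node.InnerText;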
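
If you want the individual pieces of text between the <br /> tags rather than one long string, something along these lines might work. This is only a sketch: `CellTextExample` and `GetCellLines` are names made up for illustration, the inline HTML is a shortened copy of the table from the question, and it assumes the target table is the only one with cellspacing="3". It uses SelectSingleNode so that a missing cell shows up as null rather than an exception.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class CellTextExample
{
    // Hypothetical helper: collects the non-empty text fragments found
    // inside the <td> of the table's second row.
    static List<string> GetCellLines(HtmlDocument doc)
    {
        HtmlNode cell = doc.DocumentNode
            .SelectSingleNode("//table[@cellspacing='3']/tr[2]/td");

        // SelectSingleNode returns null when the XPath matches nothing.
        if (cell == null)
            return new List<string>();

        return cell.DescendantNodes()
            .Where(n => n.NodeType == HtmlNodeType.Text)   // skip <br />, <p>, etc.
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(s => s.Length > 0)                      // drop whitespace-only nodes
            .ToList();
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        // Shortened version of the markup from the question.
        doc.LoadHtml(
            "<table border=\"0\" cellspacing=\"3\">" +
            "<tr><td>First rows stuff</td></tr>" +
            "<tr><td>The data I want is in here <br /> " +
            "and it's separated by these <br /> tags.</td></tr>" +
            "</table>");

        foreach (string line in GetCellLines(doc))
            Console.WriteLine(line);
    }
}
```

Walking the text nodes avoids running a regex over the raw markup, and it keeps working if the cell contains other inline tags besides <br />.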



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow