So I'm writing an application that will do a little screen scraping. I'm using the HTML Agility Pack to load an entire HTML page into an instance of HtmlDocument called doc. Now I want to parse that doc, looking for this:
    <table border="0" cellspacing="3">
      <tr><td>First rows stuff</td></tr>
      <tr>
        <td>
          The data I want is in here <br />
          and it's separated by these annoying <br />
          's. No id's, classes, or even a single <p> tag. </p>
          Just a bunch of <br /> tags.
        </td>
      </tr>
    </table>
So I just need to get the data within the 2nd row. How can I do this? Should I use a regex or something else?
Update: Here is how I'm loading my doc:
    HtmlWeb hw = new HtmlWeb();
    HtmlDocument doc = hw.Load(Url);
Since you are already using Html Agility Pack, I would suggest using the methods it provides to find the information you want. There are a few ways to navigate the document, but one of the most concise is XPath. In this case you could use something like this:
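    using System.Linq;       // for Single()
    using HtmlAgilityPack;

    HtmlDocument doc = new HtmlDocument();
    doc.Load("input.html");

    // tr[2] picks the second row; without it both rows' cells match and Single() throws
    HtmlNode node = doc.DocumentNode
        .SelectNodes("//table[@cellspacing='3']/tr[2]/td")
        .Single();
    string text = node.InnerText;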
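Since the question mentions that the data is separated by `<br />` tags, here is a rough sketch of one way to get the pieces individually rather than as one `InnerText` string. It assumes the table is still identified by `cellspacing='3'` and that the pieces are direct text children of the cell. `CellLines` and `Extract` are names made up for this example; they are not part of Html Agility Pack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class CellLines
{
    // Hypothetical helper: returns the text fragments found directly inside
    // the second row's cell, one entry per <br />-separated piece.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches.
        HtmlNode cell = doc.DocumentNode
            .SelectSingleNode("//table[@cellspacing='3']/tr[2]/td");
        if (cell == null)
            return new List<string>();

        return cell.ChildNodes
            .Where(n => n.NodeType == HtmlNodeType.Text)   // skip the <br /> elements
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(s => s.Length > 0)                      // drop whitespace-only nodes
            .ToList();
    }
}
```

You would call it as `CellLines.Extract(doc.DocumentNode.OuterHtml)`, or adapt it to take the `HtmlDocument` you already loaded. Note that it only looks at the cell's direct children, so text nested inside another element (like the `<p>` in your sample) would be skipped; iterating over `cell.Descendants()` instead of `cell.ChildNodes` should pick those up as well.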