HTML Agility Pack for HTML Scraping

ajax c# html-agility-pack web-scraping

Question

Can someone please advise me on the best approach to extract the information listed below from the HTML using HTMLAgilityPack?

I need to scrape the HTML shown below and read the "value" attributes of the inputs with the IDs "img", "x" and "y", because I want to reuse those values for a different purpose.

The relevant HTML is

<div id="values">
<input type="hidden" id="x" name="x" value='0' />
<input type="hidden" id="y" name="y" value='0' />
<input type="hidden" id="img" name="img" value="86932" />
<input type="hidden" id="source" name = "source" value="center" />

The JavaScript code shown below reads these values.

submitClick(document.getElementById("img").getAttribute("value"), 
              document.getElementById("x").getAttribute("value"), 
              document.getElementById("y").getAttribute("value"), 
              'tiled'  );

Could someone please advise me on how to proceed?

The code below is what I wrote to download the page's HTML.

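// Requires: using System.IO; using System.Net; using System.Text; using HtmlAgilityPack;
string result;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(Url);
request.Method = "GET";
// Dispose the response as well as the stream and the reader.
using (var response = request.GetResponse())
using (var stream = response.GetResponseStream())
using (var reader = new StreamReader(stream, Encoding.UTF8))
{
    result = reader.ReadToEnd();
}
// Parse the downloaded markup.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(result);
HtmlNode root = doc.DocumentNode;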

Now that I have the root node, how should I look up these parameters and then send them in a GET request?

11/30/2011 10:49:12 PM

Accepted Answer

Continuing from your example code above, you could simply obtain the values by doing something like this.

string imgValue = doc.DocumentNode.SelectSingleNode("//input[@id = \"img\"]").GetAttributeValue("value", "0");
string xValue = doc.DocumentNode.SelectSingleNode("//input[@id = \"x\"]").GetAttributeValue("value", "0");
string yValue = doc.DocumentNode.SelectSingleNode("//input[@id = \"y\"]").GetAttributeValue("value", "0");

The first line above essentially says: find the first input element whose "id" attribute equals "img", then give me the value of that element's "value" attribute, or "0" if the attribute is missing.
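One caveat with the lines above: SelectSingleNode returns null when nothing matches, so chaining GetAttributeValue directly onto it throws if an input is missing from the page. A minimal sketch of a null-safe variant follows. The class HiddenInputReader and the method GetInputValue are hypothetical names, not part of the HTML Agility Pack, and the sketch assumes the id contains no quote characters.

```csharp
using System;
using HtmlAgilityPack;

public static class HiddenInputReader
{
    // Hypothetical helper (not an HTML Agility Pack API): returns the "value"
    // attribute of the first <input> with the given id, or the fallback when
    // either the element or the attribute is missing.
    public static string GetInputValue(HtmlDocument doc, string id, string fallback)
    {
        // SelectSingleNode returns null when no node matches the XPath.
        // Assumes id contains no single quotes.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//input[@id='" + id + "']");
        return node == null ? fallback : node.GetAttributeValue("value", fallback);
    }

    public static void Main()
    {
        // A cut-down copy of the markup from the question.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div id=\"values\">" +
            "<input type=\"hidden\" id=\"x\" name=\"x\" value='0' />" +
            "<input type=\"hidden\" id=\"img\" name=\"img\" value=\"86932\" />" +
            "</div>");

        Console.WriteLine(GetInputValue(doc, "img", "0"));       // 86932
        Console.WriteLine(GetInputValue(doc, "missing", "n/a")); // n/a
    }
}
```

Whether to fall back to a default or to fail loudly when an input is absent depends on how much you trust the page layout to stay the same.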

After that, simply append the values to the destination URL as query-string parameters and make the GET request the same way you fetched the original HTML.
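As a rough sketch of that last step, the values could be appended like this. The base URL http://example.com/submit is a placeholder, and the parameter names img, x and y are an assumption that mirrors the input names in the question; the question does not show what submitClick actually requests, so check the real endpoint and parameter names in the page's script. SubmitUrlBuilder is a hypothetical helper, and it assumes the base URL has no query string yet.

```csharp
using System;

public static class SubmitUrlBuilder
{
    // Hypothetical helper: appends img, x and y to baseUrl as query parameters.
    // Assumes baseUrl does not already contain a query string.
    public static string Build(string baseUrl, string img, string x, string y)
    {
        // Uri.EscapeDataString percent-encodes characters that are not safe in a query value.
        return baseUrl
            + "?img=" + Uri.EscapeDataString(img)
            + "&x=" + Uri.EscapeDataString(x)
            + "&y=" + Uri.EscapeDataString(y);
    }

    public static void Main()
    {
        // Placeholder endpoint, not the real site.
        string url = Build("http://example.com/submit", "86932", "0", "0");
        Console.WriteLine(url); // http://example.com/submit?img=86932&x=0&y=0
    }
}
```

The resulting string can be passed to WebRequest.Create in place of Url in the download code from the question.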

12/2/2011 1:42:03 AM

Popular Answer

I wouldn't use the HTML Agility Pack for this, because I don't know of a way to make it send data back to the original website. I would use WatiN instead. WatiN is designed to drive a browser for testing, but I've found it quite helpful when I need to scrape websites that are not under my control (such as Facebook or Wal-Mart). The drawback is that it drives a real browser window, so it cannot be hidden from the user. The advantage is that you can simply simulate mouse clicks and text entry into form fields.



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow