Select Nodes in HTML Agility Pack

c# html-agility-pack html-parsing

Question

I'm trying to scrape some data from a website using the HTML Agility Pack. I'm having a lot of trouble using SelectNodes inside a foreach loop and exporting the results to a list or array.

Here is the code I have been using so far.

       string result = string.Empty;

        HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.amazon.com/gp/offer-listing/B002UYSHMM/");
        request.Method = "GET";

        using (var stream = request.GetResponse().GetResponseStream())
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            result = reader.ReadToEnd();
        }

        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.Load(new StringReader(result));
        HtmlNode root = doc.DocumentNode;

        string itemdesc = doc.DocumentNode.SelectSingleNode("//h1[@class='producttitle']").InnerText;  //this works perfectly to get the title of the item
        //HtmlNodeCollection sellers = doc.DocumentNode.SelectNodes("//id['bucketnew']/div/table/tbody/tr/td/ul/a/img/@alt");//this does not work at all in getting the alt attribute from the seller images
        HtmlNodeCollection prices = doc.DocumentNode.SelectNodes("//span[@class='price']"); //this works fine getting the prices
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@class='resultsset']/table/tbody[@class='result']/tr"); //this is the code I am working on to try to collect each tr in the result.  I then want to add each span.price from it to a list, and also add each alt attribute from the seller image to a list.  Once I get this working I will want to use an if statement for the case where there is text for the seller name instead of an image.

        List<string> sellers = new List<string>();
        List<string> prices = new List<string>();

        foreach (HtmlNode node in nodes)
        {
            HtmlNode seller = node.SelectSingleNode(".//img/@alt");  // I am not sure if this works
            sellers.Add(seller.SelectSingleNode("img").Attributes["alt"]); //this definitely does not work and will not compile.

        }

In the code above, I've included comments describing what works, what doesn't, and what I'm trying to do in general.

Any suggestions or recommended reading would be much appreciated. I've looked through forums and examples, but I haven't found anything useful.

10/21/2016 3:38:15 PM

Accepted Answer

The first issue is that the commented-out SelectNodes call doesn't work because "id" is an attribute name, not an element name. Your other expressions use the correct syntax for selecting on an attribute and comparing its value, e.g. //ElementName[@attributeName='value']. I believe even //*[@attributeName='value'] should work, although I haven't tried it.
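
A minimal sketch of that attribute syntax applied to the id lookup, assuming the HtmlAgilityPack NuGet package is referenced. The HTML fragment, the class name XPathIdDemo, and the method name GetSellerAlts are made up for illustration and are not taken from the real page:

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathIdDemo
{
    // Hypothetical markup standing in for the real page.
    public const string SampleHtml =
        "<div id='bucketnew'><ul>" +
        "<li><a href='#'><img alt='Seller A' src='a.gif' /></a></li>" +
        "<li><a href='#'><img alt='Seller B' src='b.gif' /></a></li>" +
        "</ul></div>";

    public static List<string> GetSellerAlts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "id" is an attribute, so it belongs in a predicate on an element (here: any element).
        HtmlNodeCollection imgs =
            doc.DocumentNode.SelectNodes("//*[@id='bucketnew']//img[@alt]");

        var alts = new List<string>();
        if (imgs == null) return alts; // SelectNodes returns null when nothing matches

        foreach (HtmlNode img in imgs)
        {
            // Read the attribute from the element rather than selecting it with /@alt.
            alts.Add(img.GetAttributeValue("alt", string.Empty));
        }
        return alts;
    }
}
```

Calling XPathIdDemo.GetSellerAlts(XPathIdDemo.SampleHtml) on this sample should yield the two alt texts in document order.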

The query language used inside SelectNodes is called XPath; reading up on it may help you.

The seller node you select comes from node, the row for the current iteration, which has an img with an alt attribute. However, I believe the right syntax for what you need is simply img[@alt].
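
Whether img[@alt] is enough depends on where the image sits relative to the node you call it on. The sketch below compares three forms, again on made-up markup and assuming the HtmlAgilityPack package; RelativeXPathDemo, Probe, and Describe are hypothetical names:

```csharp
using HtmlAgilityPack;

public static class RelativeXPathDemo
{
    // Hypothetical markup: each row wraps its image in a td and an a element.
    public const string SampleHtml =
        "<table>" +
        "<tr id='r1'><td><a href='#'><img alt='First' src='1.gif' /></a></td></tr>" +
        "<tr id='r2'><td><a href='#'><img alt='Second' src='2.gif' /></a></td></tr>" +
        "</table>";

    public static string[] Probe(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode secondRow = doc.DocumentNode.SelectSingleNode("//tr[@id='r2']");

        // Matches only img elements that are direct children of the row.
        HtmlNode child = secondRow.SelectSingleNode("img[@alt]");
        // Leading "." searches anywhere inside the current row.
        HtmlNode descendant = secondRow.SelectSingleNode(".//img[@alt]");
        // Leading "//" starts again from the document root, ignoring the row.
        HtmlNode absolute = secondRow.SelectSingleNode("//img[@alt]");

        return new[] { Describe(child), Describe(descendant), Describe(absolute) };
    }

    private static string Describe(HtmlNode n)
    {
        return n == null ? "(none)" : n.GetAttributeValue("alt", string.Empty);
    }
}
```

On this sample the child form finds nothing, the ".//" form finds the image in the second row, and the "//" form returns the first image in the whole document, which is why the leading dot matters inside a foreach.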

As for the compilation issue, check the error message; it will likely be complaining about argument types. Since sellers is declared as a List<string>, sellers.Add expects a string, but the expression inside the call returns an attribute object (an HtmlAttribute) rather than a string. I believe reading the attribute's Value property instead should fix it.
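
Putting those points together, one way the loop could look is sketched below. It assumes the HtmlAgilityPack package and uses invented markup; the 'sellername' class used for the text-only fallback, and the names OfferRowDemo and Collect, are assumptions for illustration rather than anything from the real page:

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class OfferRowDemo
{
    // Hypothetical markup: one seller shown as an image, one as plain text.
    public const string SampleHtml =
        "<table>" +
        "<tr><td><span class='price'>$10.00</span></td>" +
        "<td><img alt='Seller A' src='a.gif' /></td></tr>" +
        "<tr><td><span class='price'>$12.50</span></td>" +
        "<td><b class='sellername'>Seller B</b></td></tr>" +
        "</table>";

    public static void Collect(string html, List<string> sellers, List<string> prices)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//tr");
        if (rows == null) return; // SelectNodes returns null when nothing matches

        foreach (HtmlNode row in rows)
        {
            HtmlNode price = row.SelectSingleNode(".//span[@class='price']");
            if (price == null) continue; // skip rows without a price

            HtmlNode img = row.SelectSingleNode(".//img[@alt]");
            string seller;
            if (img != null)
            {
                // Attributes["alt"] is an HtmlAttribute; Value is the string the list needs.
                seller = img.Attributes["alt"].Value;
            }
            else
            {
                // Fallback for text-only sellers; the 'sellername' class is an assumption.
                HtmlNode name = row.SelectSingleNode(".//*[@class='sellername']");
                seller = name != null ? name.InnerText.Trim() : string.Empty;
            }

            sellers.Add(seller);
            prices.Add(price.InnerText.Trim());
        }
    }
}
```

Keeping the price and seller lookups inside the same row iteration means the two lists stay aligned by index, which separate document-wide SelectNodes calls would not guarantee.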

Also, take a look at the HTML Agility Pack documentation and at other questions about XPath syntax.

10/21/2016 3:38:36 PM


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow