I'm trying to get a list of images from website and also save them to hard disk but it doesn't work

c# html-agility-pack


I'm using HtmlAgilityPack.

In this function the imageNodes in the foreach count is 0

I don't understand why the list count is 0

The website contains many images. What I want is to get a list of the images from the site and show the list in the richTextBox1 and I also want to save all the images from the site on my hard disk.

How can I fix it ?

public void GetAllImages()
   // Bing Image Result for Cat, First Page
   string url = "http://www.bing.com/images/search?q=cat&go=&form=QB&qs=n";

   // For speed of dev, I use a WebClient
   WebClient client = new WebClient();
   string html = client.DownloadString(url);

   // Load the Html into the agility pack
   HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();

   // Now, using LINQ to get all Images
   List<HtmlNode> imageNodes = null;
   imageNodes = (from HtmlNode node in doc.DocumentNode.SelectNodes("//img")
                 where node.Name == "img"
                    && node.Attributes["class"] != null
                    && node.Attributes["class"].Value.StartsWith("img_")
                 select node).ToList();

   foreach (HtmlNode node in imageNodes)
      // Console.WriteLine(node.Attributes["src"].Value);
      richTextBox1.Text += node.Attributes["src"].Value + Environment.NewLine;

Accepted Answer

As I can see the correct class of the Bing images is sg_t. You can obtain those HtmlNodes with the following Linq query:

List<HtmlNode> imageNodes = doc.DocumentNode.Descendants("img")
    .Where(n=> n.Attributes["class"] != null && n.Attributes["class"].Value == "sg_t")

This list should be filled with all the img with class = 'sg_t'

Popular Answer

A quick look at that example page/URL in your code shows that the images you are after do not have a class type starting with "img_".

<img class="sg_t" src="http://ts2.mm.bing.net/images/thumbnail.aspx?q=4588327016989297&amp;id=db87e23954c9a0360784c0546cd1919c&amp;url=http%3a%2f%2factnowtraining.files.wordpress.com%2f2012%2f02%2fcat.jpg" style="height:133px;top:2px">

I notice your code is targetting the thumnails only. You also want the full size image URL, which are in the anchor surrounding each thumbnail. You will need to pull the final URL from a href that looks like this:

<a href="/images/search?q=cat&amp;view=detail&amp;id=89929E55C0136232A79DF760E3859B9952E22F69&amp;first=0&amp;FORM=IDFRIR" class="sg_tc" h="ID=API.images,18.1"><img class="sg_t" src="http://ts2.mm.bing.net/images/thumbnail.aspx?q=4588327016989297&amp;id=db87e23954c9a0360784c0546cd1919c&amp;url=http%3a%2f%2factnowtraining.files.wordpress.com%2f2012%2f02%2fcat.jpg" style="height:133px;top:2px"></a>

and decode the bit that look like: url=http%3a%2f%2factnowtraining.files.wordpress.com%2f2012%2f02%2fcat.jpg

which decodes to: http://actnowtraining.files.wordpress.com/2012/02/cat.jpg

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why