How can I get all the photos from a website using HTML Agility Pack?

c# html-agility-pack parsing

Question

I recently got the HTMLAgilityPack, but there are no examples in the docs.

I'm trying to figure out how to get every picture from a website. I don't need the actual image files, just the address strings.

<img src="blabalbalbal.jpeg" />

I need the src of each image tag. I mainly want to get a feel for what the library can do; everybody says it is the best tool for the job.

Edit

public void GetAllImages()
    {
        WebClient x = new WebClient();
        string source = x.DownloadString(@"http://www.google.com");

        HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
        document.Load(source);

                         //I can't use the Descendants method. It doesn't appear.
        var ImageURLS = document.desc
                   .Select(e => e.GetAttributeValue("src", null))
                   .Where(s => !String.IsNullOrEmpty(s));        
    }
8/9/2012 4:08:44 PM

Accepted Answer

LINQ can be used to accomplish this, as in:

var document = new HtmlWeb().Load(url);
var urls = document.DocumentNode.Descendants("img")
                                .Select(e => e.GetAttributeValue("src", null))
                                .Where(s => !String.IsNullOrEmpty(s));

EDIT: This code works now; I had forgotten to write document.DocumentNode.
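For the code in the question, where the page is already downloaded into a string, the same LINQ query can be wrapped up roughly as follows. This is only a sketch: the class name ImageSources and the method FromHtml are made up for illustration and are not part of HtmlAgilityPack. It assumes a using System.Linq directive (needed for Select and Where) and uses LoadHtml, since Load expects a file path or stream rather than markup. Descendants is a method on HtmlNode, which is why it does not show up on the HtmlDocument itself.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ImageSources
{
    // Hypothetical helper (not part of HtmlAgilityPack): returns the src
    // value of every <img> element found in an HTML string.
    public static List<string> FromHtml(string html)
    {
        var document = new HtmlDocument();

        // LoadHtml parses markup held in a string.
        document.LoadHtml(html);

        // Descendants is defined on HtmlNode, so start from DocumentNode.
        return document.DocumentNode
                       .Descendants("img")
                       .Select(e => e.GetAttributeValue("src", string.Empty))
                       .Where(s => !string.IsNullOrEmpty(s))
                       .ToList();
    }
}
```

Calling ImageSources.FromHtml(source) with the string returned by DownloadString should give a list of the src values, skipping any img tag that has no src attribute.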

1/22/2010 12:08:45 AM

Popular Answer

Starting from the one example they provide, but with a different XPath:

 HtmlDocument doc = new HtmlDocument();
 List<string> image_links = new List<string>();
 doc.Load("file.htm"); // Load reads the document from a file path
 // DocumentNode is the root node of the parsed document
 foreach(HtmlNode link in doc.DocumentNode.SelectNodes("//img"))
 {
    image_links.Add( link.GetAttributeValue("src", "") );
 }

I don't know this library well, so I'm not sure how you would write the list out somewhere else, but this at least gets you your data. (I've probably declared the list incorrectly as well. Sorry.)

Edit

Using your example:

public void GetAllImages()
    {
        WebClient x = new WebClient();
        string source = x.DownloadString(@"http://www.google.com");

        HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
        List<string> image_links = new List<string>();
        // source holds markup, not a file path, so use LoadHtml rather than Load
        document.LoadHtml(source);

        // DocumentNode is the root node of the parsed document
        foreach(HtmlNode link in document.DocumentNode.SelectNodes("//img"))
        {
            image_links.Add( link.GetAttributeValue("src", "") );
        }
    }
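One thing to watch with the XPath version: SelectNodes returns null rather than an empty collection when nothing matches, so the foreach above throws on a page with no images. A rough null-safe variant might look like the sketch below. The class name XPathImageSources and the method FromHtml are hypothetical names for illustration, and the XPath //img[@src] is a small change from the original so that only tags that actually carry a src attribute are selected.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathImageSources
{
    // Hypothetical helper (not part of HtmlAgilityPack): collects the src
    // of every <img> that has one, using an XPath query.
    public static List<string> FromHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var imageLinks = new List<string>();

        // SelectNodes yields null when no node matches the expression.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (nodes == null)
        {
            return imageLinks;
        }

        foreach (HtmlNode node in nodes)
        {
            imageLinks.Add(node.GetAttributeValue("src", string.Empty));
        }

        return imageLinks;
    }
}
```

With this shape the caller always gets a list back, possibly empty, and does not need its own null check.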



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow