How to download all images from a site on the C# + HtmlAgilityPack?

.net c# html-agility-pack parsing

Question

I use programs such as: Teleport, HTTrack, Offline Explorer, DownThemAll and others. All pictures are found only - DownThemAll. But I have a lot of pages, with which you want to download pictures of the goods. DownThemAll is not suitable.

I wrote the program on C# + HtmlAgilityPack, but she didn't find all the pictures of the goods.

Ideally, I'd like the following:

  1. The program loads the file URLS.txt. In which such references are:

http://www.onlinetrade.ru/catalogue/televizori-c181/

http://www.onlinetrade.ru/catalogue/3d_ochki-c130/

etc

  1. The program loads on these pages all the pictures of the goods.

What do you advise? Maybe I'm wrong to write the code on C#?

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
WebClient wc = new WebClient();
string url = wc.DownloadString("http://www.onlinetrade.ru/catalogue/televizori-c181/");
doc.LoadHtml(url);

HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@class='catalog__displayedItem__columnFotomainLnk']/img");

if (nodes != null)
            {
                foreach (HtmlNode node in nodes)
                {                    
                    listBox1.Items.Add(node.Attributes["src"].Value);
                }
            }

Accepted Answer

You were going well. In this solution I am using LINQ and TPL.

This site use pagination, so you must load all pages to be able to download all product's images.

  1. Load first page (HtmlNode)
  2. Discover how many pages this product catalogue have
  3. Load other pages (HtmlNode)

Then you have a collection of pages

  1. Load img nodes that you want to download
  2. Create a tuple with de Image url and new WebClient instance¹
  3. Download image
public class ImageDownloader
{
    public void DownloadImagesFromUrl(string url, string folderImagesPath)
    {
        var uri = new Uri(url + "/?per_page=50");
        var pages = new List<HtmlNode> { LoadHtmlDocument(uri) };

        pages.AddRange(LoadOtherPages(pages[0], url));

        pages.SelectMany(p => p.SelectNodes("//a[@class='catalog__displayedItem__columnFotomainLnk']/img"))
             .Select(node => Tuple.Create(new UriBuilder(uri.Scheme, uri.Host, uri.Port, node.Attributes["src"].Value).Uri, new WebClient()))
             .AsParallel()
             .ForAll(t => DownloadImage(folderImagesPath, t.Item1, t.Item2));
    }

    private static void DownloadImage(string folderImagesPath, Uri url, WebClient webClient)
    {
        try
        {
            webClient.DownloadFile(url, Path.Combine(folderImagesPath, Path.GetFileName(url.ToString())));
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
    }

    private static IEnumerable<HtmlNode> LoadOtherPages(HtmlNode firstPage, string url)
    {
        return Enumerable.Range(1, DiscoverTotalPages(firstPage))
                         .AsParallel()
                         .Select(i => LoadHtmlDocument(new Uri(url + "/?per_page=50&page=" + i)));
    }

    private static int DiscoverTotalPages(HtmlNode documentNode)
    {
        var totalItemsDescription = documentNode.SelectNodes("//div[@class='catalogItemList__numsInWiev']").First().InnerText.Trim();
        var totalItems = int.Parse(Regex.Match(totalItemsDescription, @"\d+$").ToString());
        var totalPages = (int)Math.Ceiling(totalItems / 50d);
        return totalPages;
    }

    private static HtmlNode LoadHtmlDocument(Uri uri)
    {
        var doc = new HtmlDocument();
        var wc = new WebClient();
        doc.LoadHtml(wc.DownloadString(uri));

        var documentNode = doc.DocumentNode;
        return documentNode;
    }
}

And you can use like that:

DownloadImagesFromUrl("http://www.onlinetrade.ru/catalogue/televizori-c181/", @"C:\temp\televizori-c181\images");

And then 178 images were downloaded.

When images are downloading, sometimes it can fail, so I suggest you to implement Retry pattern using Polly.

Obs¹: WebClient dont support parallel operation, so I create one for each image url.




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why