C# How to loop through a list, update the same list, and continue looping?

c# for-loop html-agility-pack loops web-scraping

Question

I am writing a webscraper that grabs specific urls and adds them to a list.

using HtmlAgilityPack;

List<string> mylist = new List<string>();
var firstUrl = "http://example.com";
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load(firstUrl);

HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//div[contains(@class,'Name')]/a");
            foreach (HtmlNode htmlNode in (IEnumerable<HtmlNode>)nodes)
            {
                if (!mylist.Contains(htmlNode.InnerText))
                {
                    mylist.Add(htmlNode.InnerText);

                }

            }

What I want to do at this point is to loop through 'mylist' and do the exact same thing and basically continue forever. The code should be taking newly parsed URLs and adding them to the list. What would be the easiest way to do this?

I tried creating a for loop right after the one above. But it does not seem to be updating the list. It will only continue to loop over the same items already in the list forever (since i will always be less than mylist.Count)

for (int i = 0; i < mylist.Count; i++)
            {
                //the items in mylist are added to the url
                var urls = "http://example.com" + mylist[i];

                HtmlWeb web = new HtmlWeb();
                HtmlDocument document = web.Load(urls);

                HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//div[contains(@class,'Name')]/a");
                foreach (HtmlNode htmlNode in (IEnumerable<HtmlNode>)nodes)
                {
                    if (!mylist.Contains(htmlNode.InnerText))
                    { 
                        mylist.Add(htmlNode.InnerText);
                    }

                }


            }

Thanks!

Accepted Answer

Queue fit for your requirement.

Queue<string> mylist = new Queue<string>();

First pass :

using HtmlAgilityPack;

Queue<string> mylist = new Queue<string>();
var firstUrl = "http://example.com";
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load(firstUrl);

HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//div[contains(@class,'Name')]/a");
            foreach (HtmlNode htmlNode in (IEnumerable<HtmlNode>)nodes)
            {
                if (!mylist.Contains(htmlNode.InnerText))
                {
                    mylist.Enqueue(htmlNode.InnerText);

                }
            }

Now the second pass

while (mylist.Count > 0)
            {
                var url = mylist..Dequeue();
                //the items in mylist are added to the url
                var urls = "http://example.com" + url;

                HtmlWeb web = new HtmlWeb();
                HtmlDocument document = web.Load(urls);

                HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//div[contains(@class,'Name')]/a");
                foreach (HtmlNode htmlNode in (IEnumerable<HtmlNode>)nodes)
                {
                    if (!mylist.Contains(htmlNode.InnerText))
                    { 
                        mylist.Enqueue(htmlNode.InnerText);
                    }

                }


            }

Popular Answer

Go NuGet "System.Interactive" and then do this:

var found = new HashSet<string>();

var urls =
    EnumerableEx
        .Expand(
            new[] { "http://example.com" },
            url =>
            {
                var web = new HtmlWeb();
                var document = web.Load(url);
                var nodes = document.DocumentNode.SelectNodes("//div[contains(@class,'Name')]/a");
                return
                    nodes
                        .Cast<HtmlNode>()
                        .Select(x => x.InnerText)
                        .Where(x => !found.Contains(x))
                        .Do(x => found.Add(x))
                        .Select(x => "http://example.com" + x);
            });



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why