Parsing an HTML page in WinForms, C#

c# html-agility-pack html-parsing linq-to-sql winforms

Question

I am using the Html Agility Pack to parse an HTML page. I am able to locate the section that holds the data I need: it is a table, and I have to parse its tr rows. I have two questions.

  1. Loading a page into the parser takes around 20-30 seconds, and there are around 4738 web pages to parse, so I want to reduce that time. Can I use a delegate to call the method in a loop to cut the delay, or is there a more efficient way to do this? Please guide me through it.

  2. I am getting my row as "\r\n\t\t\t\t<td style=\"width:20%;\">110001</td><td style=\"width:25%;\">New Delhi</td><td style=\"width:25%;\">Delhi</td><td style=\"width:30%;\">Baroda House</td>\r\n\t\t\t". From this I have to extract 110001, New Delhi, Delhi and Baroda House. I have a class Pincodes with the properties Pincode, Area, State and District, so I need a regex or some other way to put these values into the class.

Finally, I have to push these records to my database, for which I am using Linq2Sql. With all of that in mind, please suggest a solution. Any reference or link would be a great help.

My Code:

var url = @"http://www.eximguru.com/traderesources/pincode.aspx?&GridInfo=Pincode01";
var web = new HtmlWeb();
var doc = web.Load(url);
//doc.DocumentNode.SelectSingleNode("//*[@id=\"lst-ib\"]");//("/html/body/div[2]/form/div/div[2]/table/tbody/tr/td/table/tbody/tr/td/div/table/tbody/tr/td/table/tbody/tr/td[2]/div/input");
//System.Console.WriteLine(doc.DocumentNode.SelectSingleNode("//*[@id=\"lst-ib\"]").Id);
var htmlNode =
    doc.DocumentNode.SelectSingleNode(
        "//*[@id=\"ctl00_uxContentPlaceHolder_ResourceAndGuideUserControl1_ResourceAndGuideGrid_myGridView_mainGridView\"]");

Thanks in advance

Accepted Answer

It doesn't look like there's a pattern to the URLs, ids or anything else on that page; it was poorly designed. If there were a predictable pattern (such as a page number in the URL of each results page), the pages could be fetched in parallel. Since there isn't, you have to go through them sequentially, because the only reliable way (that I can see) to get the URL of the next page is to read the "Next" link on the current one.

using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

var url = "http://eximguru.com/traderesources/pincode.aspx?&GridInfo=Pincode01";
var web = new HtmlWeb();
var results = new List<Pincode>();

// Keep following the "Next" link until there is none left.
while (!String.IsNullOrWhiteSpace(url))
{
    var doc = web.Load(url);

    // Every row of the results grid; Skip(1) drops the header row.
    var query = doc.DocumentNode
        .SelectNodes("//div[@class='Search']/div[3]//tr")
        .Skip(1)
        .Select(row => row.SelectNodes("td"))
        .Select(row => new Pincode
        {
            PinCode = row[0].InnerText,
            District = row[1].InnerText,
            State = row[2].InnerText,
            Area = row[3].InnerText,
        });
    results.AddRange(query);

    // The last link in the footer is "Next" on every page except the final one.
    var next = doc.DocumentNode
        .SelectSingleNode("//div[@class='slistFooter']//a[last()]");
    if (next != null && next.InnerText == "Next")
    {
        url = next.Attributes["href"].Value;
    }
    else
    {
        url = null; // no more pages, so the loop ends
    }
}
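
The loop above uses a Pincode class without showing it, and it passes the href of the "Next" link straight back into web.Load. As a minimal sketch, the class could look like the one below. The property names come from the snippet, while ScrapeHelpers, CleanCell and ResolveLink are hypothetical helpers that are not part of the Html Agility Pack. They assume the cell text may carry surrounding whitespace or HTML entities, and that the href may be relative or contain &amp;. I have not checked either against the live page.

```csharp
using System;
using System.Net;

// Hypothetical shape of the Pincode class used by the loop above;
// the property names are taken from that snippet.
public class Pincode
{
    public string PinCode { get; set; }
    public string District { get; set; }
    public string State { get; set; }
    public string Area { get; set; }
}

// Hypothetical helpers, not part of the Html Agility Pack.
public static class ScrapeHelpers
{
    // Decodes HTML entities and trims the whitespace around a cell's text.
    public static string CleanCell(string innerText)
    {
        return WebUtility.HtmlDecode(innerText ?? string.Empty).Trim();
    }

    // Turns a raw href (possibly relative, possibly containing &amp;)
    // into an absolute URL, resolved against the page it was found on.
    public static string ResolveLink(string pageUrl, string href)
    {
        var decoded = WebUtility.HtmlDecode(href);
        return new Uri(new Uri(pageUrl), decoded).AbsoluteUri;
    }
}
```

If the assumptions hold, the loop would use them as PinCode = ScrapeHelpers.CleanCell(row[0].InnerText) and url = ScrapeHelpers.ResolveLink(url, next.Attributes["href"].Value). If the site already emits clean text and absolute links, both calls return their input essentially unchanged.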

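For the last part of the question (pushing the records to the database with Linq2Sql), here is a rough sketch that assumes a table named Pincodes with an identity Id column. The table name, the PincodeRow mapping class, the PincodeStore helper and the connection string are all assumptions for illustration; in a real project the entity class would normally come from your own .dbml designer file.

```csharp
using System.Collections.Generic;
using System.Data.Linq;
using System.Data.Linq.Mapping;

// Hypothetical mapping class; table and column names are assumptions.
[Table(Name = "Pincodes")]
public class PincodeRow
{
    [Column(IsPrimaryKey = true, IsDbGenerated = true)]
    public int Id { get; set; }

    [Column] public string PinCode { get; set; }
    [Column] public string District { get; set; }
    [Column] public string State { get; set; }
    [Column] public string Area { get; set; }
}

public static class PincodeStore
{
    // Queues all rows for insertion and sends them to the database
    // with a single SubmitChanges call.
    public static void Save(string connectionString, IEnumerable<PincodeRow> rows)
    {
        using (var db = new DataContext(connectionString))
        {
            db.GetTable<PincodeRow>().InsertAllOnSubmit(rows);
            db.SubmitChanges();
        }
    }
}
```

Saving once after the scraping loop finishes keeps the database work out of the page-loading loop. With several thousand pages you may prefer to call Save every few hundred rows, so that a failure late in the run does not lose everything collected so far.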


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow