How can I use Html Agility Pack to grab everything between <b> and <br>?

c# html-agility-pack html-parsing screen-scraping

Question

I asked about this same project poorly last week and didn't receive any suggestions, so I will try to be clearer. I am trying to work with data from the website www.gtin13.com. For example, if you enter peanut butter into the search, I am trying to grab the description: Nabisco Nutter Butter Sandwich Cookies Chocolate Peanut Butter 4 Ct; the size: 12 oz; the GTIN: 0044000003562; the EAN: 00-44000-00356-2; the UPC: 044000003562; and the UPC-A: 04400000356.

I have tried using nodeCollection with SelectNodes("<b>") and all I get are errors. Is it even possible with Html Agility Pack to grab the data between the <b> and <br> tags, and then split the values on the / characters? With my lack of experience I just can't make any headway on this. The returned page doesn't appear to have what I would consider true nodes. If Html Agility Pack can't do this, can anyone suggest a better approach?

Eventually I would like to send each piece of the data to a SQL table. I hope I have presented this in a way that makes better sense.

The page returns the information in this source format:

<b><a href="/product/nabisco+nutter+butter+sandwich+cookies+chocolate+peanut+butter+4+ct/">Nabisco Nutter Butter Sandwich Cookies Chocolate Peanut Butter 4 Ct</a></b><br />

Size: 12 oz<br />

GTIN/EAN-13: 0044000003562 / 00-44000-00356-2<br />

UPC-A: 044000003562 / 04400000356<br />



Tags:

<a href="/tag/chocolate/">Chocolate</a>, 

<a href="/tag/cookies/">Cookies</a>, 
 ..<br />

<br >
4/2/2011 10:06:46 PM

Accepted Answer

It's not that easy, because the original document is quite unstructured (it uses a flat layout rather than a hierarchical one), but here is how you can extract the main text fields with the Html Agility Pack:

        // Requires: using System; using HtmlAgilityPack;
        HtmlDocument doc = new HtmlDocument();
        doc.Load("yourDoc.Htm");

        // Get the A nodes that have an HREF attribute and sit directly inside a B.
        // SelectNodes can return null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//b/a[@href]");
        if (anchors == null)
        {
            Console.WriteLine("No product links found.");
            return;
        }

        foreach (HtmlNode node in anchors)
        {
            // This will contain the anchor's displayed text
            string title = node.InnerText;
            Console.WriteLine("title=" + title);

            // Get the 1st BR after the parent B, and then its next sibling of TEXT type.
            HtmlNode sizeNode = node.SelectSingleNode("../following-sibling::br[1]/following-sibling::text()");
            Console.WriteLine(" size=" + sizeNode?.InnerText.Trim());

            // Get the 2nd BR, and then its next sibling of TEXT type.
            HtmlNode eanNode = node.SelectSingleNode("../following-sibling::br[2]/following-sibling::text()");
            Console.WriteLine(" ean=" + eanNode?.InnerText.Trim());

            // Get the 3rd BR, and then its next sibling of TEXT type.
            HtmlNode upcNode = node.SelectSingleNode("../following-sibling::br[3]/following-sibling::text()");
            Console.WriteLine(" upc=" + upcNode?.InnerText.Trim());
        }

This will display:

title=Peanut Delight Peanut Butter & Grape Jelly
 size=Size: 18 oz
 ean=GTIN/EAN-13: 0041498143909 / 00-41498-14390-9
 upc=UPC-A: 041498143909 / 04149814390
title=Nabisco Nutter Butter Sandwich Cookie Bites Peanut Butter
 size=Size: 10 oz
 ean=GTIN/EAN-13: 0044000046118 / 00-44000-04611-8
 upc=UPC-A: 044000046118 / 04400004611
title=Nabisco Nutter Butter Sandwich Cookies Chocolate Peanut Butter 4 Ct
 size=Size: 12 oz
 ean=GTIN/EAN-13: 0044000003562 / 00-44000-00356-2
 upc=UPC-A: 044000003562 / 04400000356

etc...

NOTE: It's not 100% finished, as you'll still have to parse the size, ean and upc strings using standard string manipulation (IndexOf, Substring, etc.) or a Regex, but the HTML side of things is done.
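As a rough sketch of that remaining string step, the lines printed above could be split with IndexOf and Substring along these lines. The `ProductInfo` and `ProductLineParser` names are hypothetical (they are not part of Html Agility Pack), the field names simply follow the question's wording, and the code assumes every line has the "Label: value" or "Label: value / value" shape shown in the output.

```csharp
using System;

// Hypothetical holder for one product's parsed values (names follow the question).
public sealed class ProductInfo
{
    public string Size { get; set; } = "";
    public string Gtin { get; set; } = "";
    public string Ean { get; set; } = "";
    public string Upc { get; set; } = "";
    public string UpcA { get; set; } = "";
}

// Hypothetical helper; assumes lines shaped like "Label: value" or "Label: a / b".
public static class ProductLineParser
{
    // Returns the text after the first ':' (the end of the label), trimmed.
    public static string AfterLabel(string line)
    {
        int colon = line.IndexOf(':');
        return colon < 0 ? line.Trim() : line.Substring(colon + 1).Trim();
    }

    // Splits "a / b" into its two trimmed halves; the second is "" if there is no '/'.
    public static (string First, string Second) SplitPair(string line)
    {
        string value = AfterLabel(line);
        int slash = value.IndexOf('/');
        if (slash < 0) return (value, "");
        return (value.Substring(0, slash).Trim(), value.Substring(slash + 1).Trim());
    }

    // Combines the three text lines extracted for one product.
    public static ProductInfo Parse(string sizeLine, string eanLine, string upcLine)
    {
        var (gtin, ean) = SplitPair(eanLine);
        var (upc, upcA) = SplitPair(upcLine);
        return new ProductInfo
        {
            Size = AfterLabel(sizeLine),
            Gtin = gtin,
            Ean = ean,
            Upc = upc,
            UpcA = upcA
        };
    }
}
```

Inside the foreach loop you would pass the three trimmed InnerText values to `ProductLineParser.Parse`. The label is cut at the first ':' before splitting on '/', so the '/' inside "GTIN/EAN-13" does not interfere with the split.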

4/3/2011 10:22:54 AM

Popular Answer

Using HTQL, the query to extract the whole table from the page is:

<div (CLASS='BGC')>1.<div (CLASS='CON')>1.<div (CLASS='SC')>1.<div (ID='post-20')>1.<div (CLASS='PostContent')>1.<b sep>2-0 {
  title=<a>1:tx; 
  size=/'Size:'~'<br />'/;
  gtin=/'GTIN/EAN-13:'~'<br />'/;
  upc=/'UPC-A:'~'<br />'/;
  tags=/'Tags:'~'<br />'/;
}

If you only need to send the results to a SQL database, then I suggest you use the IRobotSoft web scraper.




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow