How can I use Html Agility Pack to grab everything between <b> and <br>?

c# html-agility-pack html-parsing screen-scraping

Question

I asked about this same project poorly last week and didn't receive any suggestions, so I will try to be clearer. I am trying to work with data from the website www.gtin13.com. For example, if you enter "peanut butter" into the search, I am trying to grab the description: Nabisco Nutter Butter Sandwich Cookies Chocolate Peanut Butter 4 Ct; the size: 12 oz; the GTIN: 0044000003562; the EAN: 00-44000-00356-2; the UPC: 044000003562; and the UPC-A: 04400000356. I have tried using a node collection with SelectNodes("<b>") and all I get are errors. Is it even possible to use Html Agility Pack to grab the data between the <b> and the <br>, and then parse between the slashes (/)? With my lack of experience I just can't make any headway on this. It doesn't appear that the returned page has what I would consider true nodes. If Html Agility Pack can't do this, can anyone suggest a better approach? Eventually I would like to send each piece of the data to a SQL table. I hope I have presented this in a way that makes better sense.

The page returns the information in this source format:

<b><a href="/product/nabisco+nutter+butter+sandwich+cookies+chocolate+peanut+butter+4+ct/">Nabisco Nutter Butter Sandwich Cookies Chocolate Peanut Butter 4 Ct</a></b><br />

Size: 12 oz<br />

GTIN/EAN-13: 0044000003562 / 00-44000-00356-2<br />

UPC-A: 044000003562 / 04400000356<br />



Tags:

<a href="/tag/chocolate/">Chocolate</a>, 

<a href="/tag/cookies/">Cookies</a>, 
 ..<br />

<br >

Accepted Answer

It's not that easy because the original document is quite unstructured (not using a hierarchical layout, but a flat one), but here is how you can extract the main text fields with the Html Agility Pack:

        HtmlDocument doc = new HtmlDocument();
        doc.Load("yourDoc.Htm");

        // Get A nodes (inside a B) that have an HREF attribute.
        // Note: SelectNodes returns null when nothing matches.
        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//b/a[@href]"))
        {
            // This will contain anchor's displayed text
            string title = node.InnerText;
            Console.WriteLine("title=" + title);

            // Get the 1st BR, and then its next sibling of TEXT type.
            HtmlNode sizeNode = node.SelectSingleNode("../following-sibling::br[1]/following-sibling::text()");
            Console.WriteLine(" size=" + sizeNode.InnerText.Trim());

            // Get the 2nd BR, and then its next sibling of TEXT type.
            HtmlNode eanNode = node.SelectSingleNode("../following-sibling::br[2]/following-sibling::text()");
            Console.WriteLine(" ean=" + eanNode.InnerText.Trim());

            // Get the 3rd BR, and then its next sibling of TEXT type.
            HtmlNode upcNode = node.SelectSingleNode("../following-sibling::br[3]/following-sibling::text()");
            Console.WriteLine(" upc=" + upcNode.InnerText.Trim());
        }

This will display:

title=Peanut Delight Peanut Butter & Grape Jelly
 size=Size: 18 oz
 ean=GTIN/EAN-13: 0041498143909 / 00-41498-14390-9
 upc=UPC-A: 041498143909 / 04149814390
title=Nabisco Nutter Butter Sandwich Cookie Bites Peanut Butter
 size=Size: 10 oz
 ean=GTIN/EAN-13: 0044000046118 / 00-44000-04611-8
 upc=UPC-A: 044000046118 / 04400004611
title=Nabisco Nutter Butter Sandwich Cookies Chocolate Peanut Butter 4 Ct
 size=Size: 12 oz
 ean=GTIN/EAN-13: 0044000003562 / 00-44000-00356-2
 upc=UPC-A: 044000003562 / 04400000356

etc...

NOTE: This is not 100% finished: you still have to parse the size, ean and upc values using standard string manipulation (IndexOf, Substring, etc.) or Regex, but the HTML side of things is done.
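
For the remaining string parsing, here is a minimal sketch of one way to do it with Regex. It assumes every field looks like "Label: value" or "Label: value / value", as in the output above. The class name ProductFieldParser and the method TryParse are hypothetical helpers made up for illustration; they are not part of the Html Agility Pack.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the Html Agility Pack): splits raw strings
// such as "GTIN/EAN-13: 0044000003562 / 00-44000-00356-2" into their parts.
public static class ProductFieldParser
{
    // Assumed field shape: "Label: first" or "Label: first / second".
    // The label may itself contain a slash (e.g. "GTIN/EAN-13"), so it is
    // matched as "everything up to the first colon".
    private static readonly Regex FieldPattern = new Regex(
        @"^\s*(?<label>[^:]+):\s*(?<first>[^/]+?)\s*(?:/\s*(?<second>.+?))?\s*$",
        RegexOptions.Singleline);

    // Returns false when the text does not have the assumed shape.
    // 'second' is null when the field has only one value (e.g. the size).
    public static bool TryParse(string raw, out string label, out string first, out string second)
    {
        label = null;
        first = null;
        second = null;
        if (raw == null)
        {
            return false;
        }

        Match m = FieldPattern.Match(raw);
        if (!m.Success)
        {
            return false;
        }

        label = m.Groups["label"].Value.Trim();
        first = m.Groups["first"].Value;
        second = m.Groups["second"].Success ? m.Groups["second"].Value : null;
        return true;
    }
}
```

Inside the loop you could then call ProductFieldParser.TryParse(eanNode.InnerText, out label, out gtin, out ean) and send the separate values on to your SQL table. If the site ever changes the field layout, TryParse simply returns false instead of throwing, so you can log and skip that product.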


Popular Answer

Using HTQL, the query to extract the whole table from the page is:

<div (CLASS='BGC')>1.<div (CLASS='CON')>1.<div (CLASS='SC')>1.<div (ID='post-20')>1.<div (CLASS='PostContent')>1.<b sep>2-0 {
  title=<a>1:tx; 
  size=/'Size:'~'<br />'/;
  gtin=/'GTIN/EAN-13:'~'<br />'/;
  upc=/'UPC-A:'~'<br />'/;
  tags=/'Tags:'~'<br />'/;
}

If you only need to send the results to a SQL database, then I suggest you use the IRobotSoft web scraper.



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow