Looping through nodes created by HtmlAgilityPack

c#-4.0 html-agility-pack xpath

Question

I need to use C# and HtmlAgilityPack to parse this HTML code. I know how to retrieve the div node with class="patent_bibdata", but I'm not sure how to loop over its child nodes.

There are six links in this sample. I need to divide the first four into two groups: inventors and classification. The last two don't interest me. The div may contain any number of links.

As you can see, each of the two groups is preceded by a text label describing its links.

Code example:

HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://www.google.com/patents/US3748943");
string xpath = "/html/body/table[@id='viewport_table']/tr/td[@id='viewport_td']/div[@class='vertical_module_list_row'][1]/div[@id='overview']/div[@id='overview_v']/table[@id='summarytable']/tr/td/div[@class='patent_bibdata']";
HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);

So, how would you go about doing this?

<div class="patent_bibdata">
    <b>Inventors</b>:&nbsp;
    <a href="http://www.google.com/search?tbo=p&amp;tbm=pts&amp;hl=en&amp;q=ininventor:%22Ronald+T.+Lashley%22">
    Ronald T. Lashley
    </a>, 
    <a href="http://www.google.com/search?tbo=p&amp;tbm=pts&amp;hl=en&amp;q=ininventor:%22Ronald+T.+Lashley%22">
    Ronald T. Lashley
    </a><br>
    <b>Current U.S. Classification</b>:&nbsp;
    <a href="http://www.google.com/url?id=3eF8AAAAEBAJ&amp;q=http://www.uspto.gov/web/patents/classification/uspc084/defs084.htm&amp;usg=AFQjCNEZRFtAyKTfNudgc-XVt2-VboD77Q#C084S31200P">84/312.00P</a>;
    <a href="http://www.google.com/url?id=3eF8AAAAEBAJ&amp;q=http://www.uspto.gov/web/patents/classification/uspc084/defs084.htm&amp;usg=AFQjCNEZRFtAyKTfNudgc-XVt2-VboD77Q#C084S31200R">84/312.00R</a><br>
    <br>
    <a href="http://www.google.com/url?id=3eF8AAAAEBAJ&q=http://patft.uspto.gov/netacgi/nph-Parser%3FSect2%3DPTO1%26Sect2%3DHITOFF%26p%3D1%26u%3D/netahtml/PTO/search-bool.html%26r%3D1%26f%3DG%26l%3D50%26d%3DPALL%26RefSrch%3Dyes%26Query%3DPN/3748943&usg=AFQjCNGKUic_9BaMHWdCZtCghtG5SYog-A">
    View patent at USPTO</a><br>
    <a href="http://www.google.com/url?id=3eF8AAAAEBAJ&q=http://assignments.uspto.gov/assignments/q%3Fdb%3Dpat%26pat%3D3748943&usg=AFQjCNGbD7fvsJjOib3GgdU1gCXKiVjQsw">
    Search USPTO Assignment Database
    </a><br>
</div>

Intended outcome:

AuthorGroup =

<a href="http://www.google.com/search?tbo=p&amp;tbm=pts&amp;hl=en&amp;q=ininventor:%22Ronald+T.+Lashley%22">
    Ronald T. Lashley
    </a>
    <a href="http://www.google.com/search?tbo=p&amp;tbm=pts&amp;hl=en&amp;q=ininventor:%22Ronald+T.+Lashley%22">
    Thomas R. Lashley
    </a>

ClassificationGroup =

<a href="http://www.google.com/url?id=3eF8AAAAEBAJ&amp;q=http://www.uspto.gov/web/patents/classification/uspc084/defs084.htm&amp;usg=AFQjCNEZRFtAyKTfNudgc-XVt2-VboD77Q#C084S31200P">84/312.00P</a>;
    <a href="http://www.google.com/url?id=3eF8AAAAEBAJ&amp;q=http://www.uspto.gov/web/patents/classification/uspc084/defs084.htm&amp;usg=AFQjCNEZRFtAyKTfNudgc-XVt2-VboD77Q#C084S31200R">84/312.00R</a>

I'm attempting to scrape the following page: http://www.google.com/patents/US3748943

A. Anders

PS: I am aware that the inventors' names are identical on this page, but on most other pages they are distinct!

8/8/2012 4:15:17 PM

Accepted Answer

XPath is on your side. You can get the inventors' names with something like this:

HtmlWeb w = new HtmlWeb();
HtmlDocument doc = w.Load("http://www.google.com/patents/US3748943");
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='patent_bibdata']/br[1]/preceding-sibling::a"))
{
    Console.WriteLine(node.InnerHtml);
}
8/8/2012 4:24:54 PM
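
The same idea can probably be extended to the classification group. The following is only a sketch, not part of the original answer: it assumes the classification links always sit between the first and second <br> of the div, as they do in the sample, and it parses a shortened inline copy of the markup (the href values are placeholders) instead of loading the live page. The class name ClassificationLinks is made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

class ClassificationLinks
{
    static void Main()
    {
        // Shortened copy of the question's markup; the href values are placeholders.
        const string html = @"<div class=""patent_bibdata"">
<b>Inventors</b>:&nbsp;<a href=""#inv1"">Ronald T. Lashley</a>,
<a href=""#inv2"">Ronald T. Lashley</a><br>
<b>Current U.S. Classification</b>:&nbsp;<a href=""#c1"">84/312.00P</a>;
<a href=""#c2"">84/312.00R</a><br>
<br>
<a href=""#uspto"">View patent at USPTO</a><br>
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Links that precede the second <br> and also have a <br> before them,
        // i.e. the links between the first and the second <br>.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes(
            "//div[@class='patent_bibdata']/br[2]/preceding-sibling::a[preceding-sibling::br]");

        // Assumption: SelectNodes returns null rather than an empty collection
        // when nothing matches, so guard before iterating.
        if (links == null)
        {
            Console.WriteLine("No classification links found.");
            return;
        }

        foreach (HtmlNode link in links)
        {
            Console.WriteLine(link.InnerText.Trim());
        }
    }
}
```

With the sample markup this should list the two classification codes. If a page puts extra <br> elements inside a group, the positional br[2] test would no longer line up and the expression would need adjusting.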

Popular Answer

Clearly I don't understand XPath (yet), so I came up with this answer instead. It may not be the best option, but it does the job!

A. Anders

List<string> inventorList = new List<string>();
List<string> classificationList = new List<string>();

string xpath = "/html/body/table[@id='viewport_table']/tr/td[@id='viewport_td']/div[@class='vertical_module_list_row'][1]/div[@id='overview']/div[@id='overview_v']/table[@id='summarytable']/tr/td/div[@class='patent_bibdata']";
HtmlNode nodes = m_doc.DocumentNode.SelectSingleNode(xpath);
bool bInventors = false;
bool bClassification = false;
for (int i = 0; i < nodes.ChildNodes.Count; i++)
{
    HtmlNode node = nodes.ChildNodes[i];
    string txt = node.InnerText;
    if (txt.IndexOf("Inventor") > -1)
    {
        bClassification = false;
        bInventors = true;
    }
    if (txt.IndexOf("Classification") > -1)
    {
        bClassification = true;
        bInventors = false;
    }
    if (txt.IndexOf("USPTO") > -1)
    {
        bClassification = false;
        bInventors = false;
    }
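    // Compare the tag name exactly; IndexOf("a") would also match
    // any other tag whose name merely contains an "a".
    if (node.Name == "a")
    {
        if (bInventors)
        {
            // Trim the whitespace that surrounds the link text in the markup.
            string inventor = node.InnerText.Trim();
            inventorList.Add(inventor);
        }
        if (bClassification)
        {
            string classification = node.InnerText.Trim();
            classificationList.Add(classification);
        }
    }
}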

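The flag-based loop above could also be written without hard-coding the label texts. The sketch below is one possible variation, not the author's code: it assumes every group starts with a <b> label and ends at the next <br>, as in the sample, so links that follow a <br> without a new label (the USPTO links) are skipped. The helper name GroupLinks and the shortened inline markup with placeholder href values are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class GroupLinksByLabel
{
    // Hypothetical helper: maps each <b> label to the texts of the links that follow it.
    static Dictionary<string, List<string>> GroupLinks(HtmlNode container)
    {
        var groups = new Dictionary<string, List<string>>();
        List<string> current = null;

        foreach (HtmlNode child in container.ChildNodes)
        {
            if (child.Name == "b")
            {
                // A <b> label opens a new group.
                current = new List<string>();
                groups[child.InnerText.Trim()] = current;
            }
            else if (child.Name == "br")
            {
                // A <br> closes the current group (assumption based on the sample markup).
                current = null;
            }
            else if (child.Name == "a" && current != null)
            {
                current.Add(child.InnerText.Trim());
            }
        }
        return groups;
    }

    static void Main()
    {
        // Shortened copy of the question's markup; the href values are placeholders.
        const string html = @"<div class=""patent_bibdata"">
<b>Inventors</b>:&nbsp;<a href=""#inv1"">Ronald T. Lashley</a>,
<a href=""#inv2"">Ronald T. Lashley</a><br>
<b>Current U.S. Classification</b>:&nbsp;<a href=""#c1"">84/312.00P</a>;
<a href=""#c2"">84/312.00R</a><br>
<br>
<a href=""#uspto"">View patent at USPTO</a><br>
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div[@class='patent_bibdata']");

        foreach (KeyValuePair<string, List<string>> group in GroupLinks(div))
        {
            Console.WriteLine(group.Key + ": " + string.Join(" | ", group.Value));
        }
    }
}
```

Keying the dictionary on the label text means a new kind of group on another page would be picked up without changing the loop, at the cost of relying on the <b> and <br> layout staying the same.
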
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow