HTML Agility Pack and LINQ

c# html-agility-pack linq web-scraping


I want to use HAP to scrape data from a table on a web page: loop over the rows, look for a column value that matches a predetermined string, and keep only the rows that match. The column headings will then serve as the dictionary keys, and the cell text of the selected row as the dictionary values.

Table example

<table id="Table3">
<tr>
<td>Last Name</td>
<td>First Name</td>
<td>Birth Date</td>
</tr>
<tr>
<td>&nbsp;DUNN          &nbsp;</td>
<td>&nbsp;JOE          &nbsp;</td>
</tr>
<tr>
<td>&nbsp;SMITH          &nbsp;</td>
<td>&nbsp;MARY          &nbsp;</td>
</tr>
<tr>
<td>&nbsp;ROCKFORD          &nbsp;</td>
<td>&nbsp;BILL          &nbsp;</td>
</tr>
</table>


For example, if my target date of birth is 20000320, I need all of the information for Bill.

Adding the header titles to the list is simple. I know the user-row part is written incorrectly; I am still trying to get a list of rows rather than a single row. Another problem with the user rows is that the inner text comes back with "&nbsp;" in it, so a plain Trim() is not enough for the comparison. I need a way to strip those out so the match works. Any ideas would be welcome, including better ways of doing all of this.

List<string> headerList = new List<string>();
List<string> userList = new List<string>();

var htmlRows = htmlDoc.DocumentNode.SelectNodes("//*[@id=\"Table3\"]/tbody/tr");
if (htmlRows != null)
{
    // Add first row, which contains the column headings
    htmlRows.First().Elements("td")
        .Select(td => td.InnerText.Trim())
        .ToList()
        .ForEach(header => headerList.Add(header));

    // Add user rows (not working yet: this selects the matching cells,
    // not the text of every cell in the matching row)
    htmlRows.Skip(1)
        .Select(tr => tr.Elements("td")
            .Where(td => td.InnerText.Trim() == dteDOB))
        .ToList()
        .ForEach(row => userList.Add(row));

    for (int i = 0; i < headerList.Count; i++)
        if (headerList.Count == userList.Count && userList[i] != null)
            dictValues.Add(headerList[i], userList[i]);
}
2/19/2013 6:42:16 PM

Accepted Answer

I think you can select the whole tr based on the value in one of its td cells:

//*[@id=\"Table3\"]/tbody/tr[td//text()[contains(., 'targetString')]]
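As a rough sketch of how that XPath might be used from HAP (not tested against the real page): the example drops the /tbody step, because as far as I know HAP does not add a tbody element that is missing from the markup, and it relies on contains() so the &nbsp; padding around the value does not matter for the match. The sample markup, the made-up birth dates for DUNN and SMITH, and the names RowLookup, FindRow and Clean are assumptions for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class RowLookup
{
    // Decode entities such as &nbsp; and trim the padding around the cell text.
    static string Clean(string raw) =>
        HtmlEntity.DeEntitize(raw).Replace('\u00A0', ' ').Trim();

    // Returns heading -> cell text for the first row that has a cell containing target.
    // Assumes target contains no single quote, since it is spliced into the XPath.
    public static Dictionary<string, string> FindRow(HtmlDocument doc, string target)
    {
        var result = new Dictionary<string, string>();
        var rows = doc.DocumentNode.SelectNodes("//table[@id='Table3']//tr");
        var match = doc.DocumentNode.SelectSingleNode(
            "//table[@id='Table3']//tr[td//text()[contains(., '" + target + "')]]");
        if (rows == null || match == null) return result; // SelectNodes returns null on no match

        // Assumes the first tr holds the column headings.
        var headers = rows[0].Elements("td").Select(td => Clean(td.InnerText)).ToList();
        var values = match.Elements("td").Select(td => Clean(td.InnerText)).ToList();
        for (int i = 0; i < Math.Min(headers.Count, values.Count); i++)
            result[headers[i]] = values[i];
        return result;
    }
}

public static class Program
{
    public static void Main()
    {
        // Sample markup; the birth dates for DUNN and SMITH are invented.
        const string html = @"<table id=""Table3"">
<tr><td>Last Name</td><td>First Name</td><td>Birth Date</td></tr>
<tr><td>&nbsp;DUNN &nbsp;</td><td>&nbsp;JOE &nbsp;</td><td>&nbsp;19990101&nbsp;</td></tr>
<tr><td>&nbsp;SMITH &nbsp;</td><td>&nbsp;MARY &nbsp;</td><td>&nbsp;19980202&nbsp;</td></tr>
<tr><td>&nbsp;ROCKFORD &nbsp;</td><td>&nbsp;BILL &nbsp;</td><td>&nbsp;20000320&nbsp;</td></tr>
</table>";
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach (var pair in RowLookup.FindRow(doc, "20000320"))
            Console.WriteLine(pair.Key + " = " + pair.Value);
    }
}
```

Because contains() is a substring test, a target such as 2000 would also match 20000320; an exact comparison needs the cell text normalized first.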

Take a look at this:

Using XPath, choose a table row where a cell has the required text.
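If several rows can match, or the comparison needs to be exact on one column, a LINQ-only variant may be easier to follow. This is a minimal sketch under some assumptions: the first tr holds unique column headings, every data row has one td per heading, and HtmlEntity.DeEntitize is used to turn &nbsp; into a character that Trim() can remove. TableFilter, MatchingRows and Clean are hypothetical names, and the sample markup is invented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TableFilter
{
    // Decode entities such as &nbsp; and trim the padding around the cell text.
    static string Clean(string raw) =>
        HtmlEntity.DeEntitize(raw).Replace('\u00A0', ' ').Trim();

    // Returns one heading -> value dictionary per row whose `column` cell equals `target`.
    public static List<Dictionary<string, string>> MatchingRows(
        HtmlDocument doc, string tableId, string column, string target)
    {
        var empty = new List<Dictionary<string, string>>();
        var rows = doc.DocumentNode.SelectNodes("//table[@id='" + tableId + "']//tr");
        if (rows == null || rows.Count == 0) return empty;

        // Assumes the first row contains the headings.
        var headers = rows[0].Elements("td").Select(td => Clean(td.InnerText)).ToList();
        int col = headers.IndexOf(column);
        if (col < 0) return empty;

        return rows.Skip(1)
            .Select(tr => tr.Elements("td").Select(td => Clean(td.InnerText)).ToList())
            // Skip rows whose cell count does not line up with the headings.
            .Where(cells => cells.Count == headers.Count && cells[col] == target)
            .Select(cells => headers
                .Zip(cells, (h, v) => new KeyValuePair<string, string>(h, v))
                .ToDictionary(p => p.Key, p => p.Value))
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        const string html = @"<table id=""Table3"">
<tr><td>Last Name</td><td>First Name</td><td>Birth Date</td></tr>
<tr><td>&nbsp;SMITH &nbsp;</td><td>&nbsp;MARY &nbsp;</td><td>&nbsp;19980202&nbsp;</td></tr>
<tr><td>&nbsp;ROCKFORD &nbsp;</td><td>&nbsp;BILL &nbsp;</td><td>&nbsp;20000320&nbsp;</td></tr>
</table>";
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach (var row in TableFilter.MatchingRows(doc, "Table3", "Birth Date", "20000320"))
            Console.WriteLine(string.Join(", ", row.Select(p => p.Key + " = " + p.Value)));
    }
}
```

Normalizing every cell once, before comparing, keeps the entity handling in a single place and lets the match be an exact equality rather than a substring test.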

5/23/2017 10:24:51 AM


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow