Parsing HTML with LINQ

c# html html-agility-pack linq

Question

I'm trying to get all the cells from an HTML table using Html Agility Pack and LINQ. I have loaded the HTML source into an HtmlAgilityPack.HtmlDocument and selected the tags with LINQ. However, when I iterate over the result with foreach, it crashes on the second record.

This is a fragment of the HTML source:

<tr>
    <td class='city'>New York</td>
    <td>Card 1</td>
</tr>
<tr>
    <td class='city'>London</td>
    <td>Card 2</td>
</tr>
<tr>
    <td class='city'>Tokyo</td>
    <td>Card 3</td>
</tr>
<tr>
    <td class='city'>Berlin</td>
    <td>Card 4</td>
</tr>

And this is what I made:

htmlDoc.LoadHtml(await msgRecived.Content.ReadAsStringAsync());

var tds =
    from td in htmlDoc.DocumentNode.Descendants("td")
    where td.Attributes["class"].Value == "city"
    select td.InnerText;

foreach (var td in tds)
{
    citiesText = citiesText + " " + td;
}

It only returns the first element. For example, if instead of using foreach I do:

citiesText = tds.ElementAt(0);

It returns New York, but if I try ElementAt(1) it crashes with "Object reference not set to an instance of an object."

Any help? Thanks

Accepted Answer

You need to make sure that Attributes["class"] is not null:

var tds =
    from td in doc.DocumentNode.Descendants("td")
    where td.Attributes["class"] != null && td.Attributes["class"].Value == "city"
    select td.InnerText;

The second <td> in each row has no class attribute, so Attributes["class"] returns null for it. Calling .Value on that null is what throws the exception.
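As for why ElementAt(0) still returns New York: LINQ queries are evaluated lazily, so the filter only runs on as many cells as it needs to produce the requested element. Here is a minimal sketch of that behaviour without Html Agility Pack. The Attr and Cell classes are hypothetical stand-ins I made up for illustration, not library types:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for an HTML attribute and a table cell.
public sealed class Attr { public string Value { get; set; } }
public sealed class Cell
{
    public Attr Class { get; set; }   // null models a <td> without a class attribute
    public string Text { get; set; }
}

public static class DeferredDemo
{
    public static IEnumerable<string> Cities()
    {
        var cells = new[]
        {
            new Cell { Class = new Attr { Value = "city" }, Text = "New York" },
            new Cell { Class = null, Text = "Card 1" },
            new Cell { Class = new Attr { Value = "city" }, Text = "London" },
        };

        // Nothing runs here: the filter is evaluated only while enumerating.
        return from c in cells
               where c.Class.Value == "city"   // throws when c.Class is null
               select c.Text;
    }

    public static void Main()
    {
        // Only the first cell is inspected before a match is returned.
        Console.WriteLine(Cities().ElementAt(0));   // prints "New York"

        try
        {
            // Reaching a second match means evaluating the filter on "Card 1".
            Console.WriteLine(Cities().ElementAt(1));
        }
        catch (NullReferenceException)
        {
            Console.WriteLine("NullReferenceException on the cell without a class");
        }
    }
}
```

The same thing happens in the foreach loop: the first iteration succeeds, and the exception is thrown when the loop asks for the next item.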

Alternatively, you could use GetAttributeValue, which returns the supplied default (here null) when the attribute is missing:

var tds =
    from td in doc.DocumentNode.Descendants("td")
    where td.GetAttributeValue("class", null) == "city"
    select td.InnerText;
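If you are on C# 6 or later, the null-conditional operator (?.) is one more way to write the same check. Below is a sketch of a complete program, assuming the HtmlAgilityPack NuGet package is referenced. The CityCells class, the Extract method and the shortened HTML string are names and data I made up for the example:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

public static class CityCells
{
    // Hypothetical helper: returns the text of every <td class='city'> cell.
    public static string[] Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode.Descendants("td")
            // ?. yields null instead of throwing when the attribute is missing,
            // and null == "city" is simply false.
            .Where(td => td.Attributes["class"]?.Value == "city")
            .Select(td => td.InnerText)
            .ToArray();
    }

    public static void Main()
    {
        // Shortened version of the fragment from the question.
        const string html =
            "<table>" +
            "<tr><td class='city'>New York</td><td>Card 1</td></tr>" +
            "<tr><td class='city'>London</td><td>Card 2</td></tr>" +
            "</table>";

        // string.Join avoids building the result by repeated concatenation.
        Console.WriteLine(string.Join(" ", Extract(html)));   // prints "New York London"
    }
}
```

Note that all three variants do an exact match on the attribute value, so a cell with class='city highlighted' would not be selected.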

Popular Answer

Just a guess, but you are probably only looking at the td on the first element. Maybe you need

htmlDoc.DocumentNode.Descendants("table") instead.



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow