How to scrape values from a web page using Html Agility Pack

c# html html-agility-pack web-scraping

Question

I need some values from a web page, so I am building a scraper using Html Agility Pack.

Below are the page's HTML and my C# code.

Page HTML:

  <div class="box-overflow">
    <div class="box-overflow__in">
      <table class="table-main js-tablebanner-t js-tablebanner-ntb">
        <tr>
          <th class="h-text-left" colspan="2">17. Round</th>

          <th class="h-text-center">1</th>

          <th class="h-text-center">X</th>

          <th class="h-text-center">2</th>

          <th>&nbsp;</th>
        </tr>

        <tr>
          <td class="h-text-left"><a href=
          "/soccer/poland/ekstraklasa/lechia-gdansk-leczna/Kjnscb6D/" class=
          "in-match"><span>Lechia Gdansk</span> - <span>Leczna</span></a></td>

          <td class="h-text-center"><a href=
          "/soccer/poland/ekstraklasa/lechia-gdansk-leczna/Kjnscb6D/">3:0</a></td>

          <td class="table-matches__odds colored"></td>

          <td class="table-matches__odds" data-odd="4.04"></td>

          <td class="table-matches__odds" data-odd="6.29"></td>

          <td class="h-text-right h-text-no-wrap">28.11.2016</td>
        </tr>

        <tr>
          <td class="h-text-left"><a href=
          "/soccer/poland/ekstraklasa/plock-piast-gliwice/KrhILsqE/" class=
          "in-match"><span>Plock</span> - <span>Piast Gliwice</span></a></td>

          <td class="h-text-center"><a href=
          "/soccer/poland/ekstraklasa/plock-piast-gliwice/KrhILsqE/">0:0</a></td>

          <td class="table-matches__odds" data-odd="2.05"></td>

          <td class="table-matches__odds colored"></td>

          <td class="table-matches__odds" data-odd="3.50"></td>

          <td class="h-text-right h-text-no-wrap">27.11.2016</td>
        </tr>

        <tr>
          <td class="h-text-left"><a href=
          "/soccer/poland/ekstraklasa/slask-wroclaw-legia/bZjMK1bK/" class=
          "in-match"><span>Slask Wroclaw</span> - <span>Legia</span></a></td>

          <td class="h-text-center"><a href=
          "/soccer/poland/ekstraklasa/slask-wroclaw-legia/bZjMK1bK/">0:4</a></td>

          <td class="table-matches__odds" data-odd="4.53"></td>

          <td class="table-matches__odds" data-odd="3.64"></td>

          <td class="table-matches__odds colored"></td>

          <td class="h-text-right h-text-no-wrap">27.11.2016</td>
        </tr>
      </table>
    </div>
  </div>

My C#:

        var url = "http://www.betexplorer.com/soccer/poland/ekstraklasa/results/";

        var web = new HtmlWeb();
        var doc = web.Load(url);

        Bets = new List<Bet>();



        // Read the rows
        var Rows = doc.DocumentNode.SelectNodes("//table");

        foreach (var row in Rows)
        {
            if (!row.GetAttributeValue("class", "").Contains("table-main js-tablebanner-t js-tablebanner-ntb"))
            {
                if (string.IsNullOrEmpty(row.InnerText))
                    continue;

                var rowBet = new Bet();
                foreach (var node in row.ChildNodes)
                {
                    var data_odd = node.GetAttributeValue("data-odd", "");

                    if (string.IsNullOrEmpty(data_odd))
                    {
                        if (node.GetAttributeValue("class", "").Contains("in-match"))
                        {
                            rowBet.Match = node.InnerText.Trim();
                            var matchTeam = rowBet.Match.Split(new[] { " - " }, StringSplitOptions.RemoveEmptyEntries);
                            rowBet.Home = matchTeam[0];
                            rowBet.Host = matchTeam[1];
                        }


                        if (node.GetAttributeValue("class", "").Contains("h-text-center"))
                        {
                            rowBet.Result = node.InnerText.Trim();
                            var matchPoints = rowBet.Result.Split(new[] { ':' }, StringSplitOptions.RemoveEmptyEntries);
                            int help;
                            if (int.TryParse(matchPoints[0], out help))
                            {
                                rowBet.HomePoints = help;
                            }
                            if (matchPoints.Length == 2 && int.TryParse(matchPoints[1], out help))
                            {
                                rowBet.HostPoints = help;
                            }

                        }


                        if (node.GetAttributeValue("class", "").Contains("h-text-right h-text-no-wrap"))
                            rowBet.Date = node.InnerText.Trim();

                    }
                    else
                    {
                        rowBet.Odds.Add(data_odd);
                    }
                }

                if (!string.IsNullOrEmpty(rowBet.Match))
                    Bets.Add(rowBet);
            }
        }

Some more details:

I need to extract the team names (e.g. Lechia Gdansk - Leczna),
the result (e.g. 3:0),
the data-odd values (e.g. 1.49, 4.04, 6.29),
and the match date (e.g. 28.11.2016).

If you need more information, just ask. Thanks.

Accepted Answer

I would do it like this:

// Find the results table, then project each row to the cells of interest.
var list = doc.DocumentNode.SelectSingleNode("//table[@class='table-main js-tablebanner-t js-tablebanner-ntb']")
                .Descendants("tr")
                .Select(x => new
                {
                    Val1 = x.SelectSingleNode("td[@class='h-text-left']")?.InnerText,   // teams
                    Val2 = x.SelectSingleNode("td[@class='h-text-center']")?.InnerText  // result
                })
                .Where(x => x.Val1 != null) // skip the header row, which only has <th> cells
                .ToList();
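
The query above only pulls the teams and the result. Here is a rough sketch of how the same idea might be extended to the data-odd values and the date that the question also asks for. `MatchRow`, `ResultsParser`, `HasCssClass` and `CleanText` are hypothetical names made up for this example, not part of Html Agility Pack, and the class names are taken from the HTML posted in the question, so the code assumes the live page still looks like that.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical container type for one match row (an assumption, not from the question).
public class MatchRow
{
    public string Teams { get; set; }
    public string Result { get; set; }
    public List<string> Odds { get; set; }
    public string Date { get; set; }
}

public static class ResultsParser
{
    // True when the node's class attribute contains the given class token.
    private static bool HasCssClass(HtmlNode node, string name)
    {
        return node.GetAttributeValue("class", "").Split(' ').Contains(name);
    }

    // Decoded, trimmed text of a node, or an empty string when the node is missing.
    private static string CleanText(HtmlNode node)
    {
        return node == null ? "" : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    public static List<MatchRow> Parse(HtmlDocument doc)
    {
        var rows = new List<MatchRow>();
        var table = doc.DocumentNode.SelectSingleNode(
            "//table[@class='table-main js-tablebanner-t js-tablebanner-ntb']");
        if (table == null)
            return rows; // table not found: return an empty list instead of throwing

        foreach (var tr in table.Descendants("tr"))
        {
            var cells = tr.Elements("td").ToList();
            var teams = cells.FirstOrDefault(td => HasCssClass(td, "h-text-left"));
            if (teams == null)
                continue; // header rows contain only <th> cells

            rows.Add(new MatchRow
            {
                Teams = CleanText(teams),
                Result = CleanText(cells.FirstOrDefault(td => HasCssClass(td, "h-text-center"))),
                // One entry per odds cell; a cell without data-odd yields an empty string.
                Odds = cells.Where(td => HasCssClass(td, "table-matches__odds"))
                            .Select(td => td.GetAttributeValue("data-odd", ""))
                            .ToList(),
                Date = CleanText(cells.FirstOrDefault(td => HasCssClass(td, "h-text-no-wrap")))
            });
        }
        return rows;
    }
}
```

With the `doc` returned by `web.Load(url)` in the question, this would be called as `var rows = ResultsParser.Parse(doc);`. In the HTML posted in the question the "colored" odds cell carries no data-odd attribute, so its entry in `Odds` comes back as an empty string.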


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow