Import data from HTML table to DataTable in C#

c# datatable html html-agility-pack

Question

I wanted to show the top 16 persons in my Form application's DataGridView after importing some data from an HTML table (here is a link: http://road2paris.com/wp-content/themes/roadtoparis/api/generated_table_august.html). According to what I've heard, using HTML Agility pack is the ideal approach, therefore I downloaded it and added it to my project. I am aware that loading the contents of the html file comes first. The code I applied was as follows:

        string htmlCode = "";
        using (WebClient client = new WebClient())
        {
            client.Headers.Add(HttpRequestHeader.UserAgent, "AvoidError");
            htmlCode = client.DownloadString("http://road2paris.com/wp-content/themes/roadtoparis/api/generated_table_august.html");
        }
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();

        doc.LoadHtml(htmlCode);

Then I ran into a problem. I have no idea how to insert information from the html table into my datatable. I've tried a lot of different things, but nothing is really working. If someone could assist me with that, I would be grateful.

1
7
8/6/2013 9:14:31 PM

Accepted Answer

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlCode);
var headers = doc.DocumentNode.SelectNodes("//tr/th");
DataTable table = new DataTable();
foreach (HtmlNode header in headers)
    table.Columns.Add(header.InnerText); // create columns from th
// select rows with td elements 
foreach (var row in doc.DocumentNode.SelectNodes("//tr[td]")) 
    table.Rows.Add(row.SelectNodes("td").Select(td => td.InnerText).ToArray());

To use this code, you must have the HTML Agility Pack library.

15
12/12/2014 2:48:33 PM

Popular Answer

I've written some code in the section below to avoid having redundant data headers. Each "Column" in a DataTable has to have its own name when it is created. Additionally, an HTML row may sometimes exceed its limitations, in which case you will need to add extra columns to the data table to prevent data loss. I've found a solution in this.

'''
public enum DuplicateHeaderReplacementStrategy
{
    AppendAlpha,
    AppendNumeric,
    Delete
}

public class HtmlServices
{
    private static readonly string[] Alpha = new[] { "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z" };

    public static HtmlDocument RenameDuplicateHeaders(HtmlDocument doc, DuplicateHeaderReplacementStrategy strategy)
    {
        var index = 0;
        try
        {
            foreach (HtmlNode table in doc.DocumentNode?.SelectNodes("//table"))
            {
                var tableHeaders = table.SelectNodes("th")?
                   .GroupBy(x => x)?
                   .Where(g => g.Count() > 1)?
                   .ToList();
                tableHeaders?.ForEach(y =>
                   {
                       switch (strategy)
                       {
                           case DuplicateHeaderReplacementStrategy.AppendNumeric:
                               y.Key.InnerHtml += index++;
                               break;

                           case DuplicateHeaderReplacementStrategy.AppendAlpha:
                               y.Key.InnerHtml += Alpha[index++];
                               break;

                           case DuplicateHeaderReplacementStrategy.Delete:
                               y.Key.InnerHtml = string.Empty;
                               break;
                       }
                });
            }
            return doc;
        }
        catch
        {
            return doc;
        }


    }
}


public static DataTable GetDataTableFromHtmlTable(string url, string[] htmlIds)
    {
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls11 | SecurityProtocolType.Tls12;
        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load(url);
        string html = doc.DocumentNode.OuterHtml;

        doc = HtmlServices.RenameDuplicateHeaders(doc, DuplicateHeaderReplacementStrategy.AppendNumeric);

        var headers = doc.DocumentNode.SelectNodes("//tr/th");

        DataTable table = new DataTable();
        foreach (HtmlNode header in headers)
            if (!table.ColumnExists(header.InnerText))
            {
                table.Columns.Add(header.InnerText); // create columns from th
            }
            else
            {
                int columnIteration = 0;
                while (table.ColumnExists(header.InnerText + columnIteration.ToString()))
                {
                    columnIteration++;
                }
                table.Columns.Add(header.InnerText + columnIteration.ToString()); // create columns from th
            }

        // select rows with td elements
        foreach (var row in doc.DocumentNode.SelectNodes("//tr[td]"))
        {
            var addRow = row.SelectNodes("td").Select(td => td.InnerHtml.StripHtmlTables()).ToArray();

            if (addRow.Count() > table.Columns.Count)
            {
                int m_numberOfRowsToAdd = addRow.Count() - table.Columns.Count;
                for (int i = 0; i < m_numberOfRowsToAdd; i++)
                    table.Columns.Add($"ExtraColumn {i + 1}");
            }

            try
            {
                table.Rows.Add(addRow);
            }
            catch (Exception e)
            {
                debug.Print(e.Message);
            }
        }
        return table;
    }


Related Questions





Related

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow