Parse HTML table to a CSV file (colspan and rowspan)

asp.net c# html-agility-pack html-parsing

Question

I want to parse a HTML table into a CSV file, but keeping the right number of colspan and rowpspan.

I'm using ";" as delimiter cell. Thus, when there colspan of 2 columns, for example, instead of having only one, ";", it will have 2.

I can extract the content of the table and make line breaks where tr indicators ends, but don't know how to treat colspan and rowspan.

HtmlNodeCollection rows = tables[0].SelectNodes("tr");

// Aux vars
int i;
// ncolspan

// For each row...
for (i = 0; i < rows.Count; ++i)
{
    // For each cell in the col...
    foreach (HtmlNode cell in rows[i].SelectNodes("th|td"))
    {
        /* Unsuccessful attempt to treat colspan
        foreach (HtmlNode n_cell in rows[i].SelectNodes("//td[@colspan]"))
        {
            ncolspan = n_cell.Attributes["colspan"].Value;
        }
        */

        text.Write(System.Text.RegularExpressions.Regex.Replace(cell.InnerText, @"\s\s+", ""));
        text.Write(";");
        /*
        for (int x = 0; x <= int.Parse(ncolspan); x++)
        {
            text.Write(";");
        }
        */
    }
    text.WriteLine();
    ncolspan = "0";
}

Any help, please? Thank you!

UPDATE: Here a simple example table to use:

<table id="T123" border="1">
    <tr>
        <td colspan="3"><center><font color="red">Title</font></center></td>
    </tr>
    <tr>
        <th>R1 C1</th>
        <th>R1 C2</th>
        <th>R1 C3</th>
    </tr>
    <tr>
        <td>R2 C1</td>
        <td>R2 C2</td>
        <td>R2 C3</td>
    </tr>
    <tr>
        <td colspan="2">R3 C1 e C2 with "</td>
        <td>R3 C3</td>
    </tr>
    <tr>
        <td>R4 C1</td>
        <td colspan=2>R4 C2 e C3 without "</td>
    </tr>
    <tr>
        <td>R5 C1</td>
        <td>R5 C2</td>
        <td>R5 C3</td>
    </tr>
    <tr>
        <td rowspan ="2">R6/R7 C1: Two lines rowspan. Must leave the second line blank.</td>
        <td>R6 C2</td>
        <td>R6 C3</td>
    </tr>
    <tr>
        <td>R7 C2</td>
        <td>R7 C3</td>
    </tr>
    <tr>
        <td>End</td>
    </tr>
</table>

Popular Answer

CSV doesn't handle rowspan or colspan values - it's a very simple format that has no concept of columns or rows beyond it's delimiter and the end of line character.

If you want to try preserve the rowspan and colspan you will need to use an intermediate object model which you can use to store the specific contents of a cell and it's location, for example, before exporting the model to CSV. And even then, the CSV format will not preserve the colspan and rowspan as you may be hoping (i.e. like an Excel sheet would).



Related

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow