Parse HTML table to a CSV file (colspan and rowspan)

asp.net c# html-agility-pack html-parsing

Question

I want to convert an HTML table into a CSV file while keeping the correct number of rows and columns.

I'm using Html Agility Pack. The goal is that a cell with a colspan of two columns, for instance, produces two ";"s instead of one.

I am able to extract the table's content and insert a line break where each tr element ends, but I am unsure how to handle the colspan and rowspan attributes.

HtmlNodeCollection rows = tables[0].SelectNodes("tr");

// Aux vars
int i;
string ncolspan = "0"; // colspan of the current cell, used by the commented-out attempts below

// For each row...
for (i = 0; i < rows.Count; ++i)
{
    // For each cell in the row...
    foreach (HtmlNode cell in rows[i].SelectNodes("th|td"))
    {
        /* Unsuccessful attempt to treat colspan
        foreach (HtmlNode n_cell in rows[i].SelectNodes("//td[@colspan]"))
        {
            ncolspan = n_cell.Attributes["colspan"].Value;
        }
        */

        text.Write(System.Text.RegularExpressions.Regex.Replace(cell.InnerText, @"\s\s+", ""));
        text.Write(";");
        /*
        for (int x = 0; x <= int.Parse(ncolspan); x++)
        {
            text.Write(";");
        }
        */
    }
    text.WriteLine();
    ncolspan = "0";
}

Any help would be appreciated. Thanks.

UPDATE: Here is a simple sample table:

<table id="T123" border="1">
    <tr>
        <td colspan="3"><center><font color="red">Title</font></center></td>
    </tr>
    <tr>
        <th>R1 C1</th>
        <th>R1 C2</th>
        <th>R1 C3</th>
    </tr>
    <tr>
        <td>R2 C1</td>
        <td>R2 C2</td>
        <td>R2 C3</td>
    </tr>
    <tr>
        <td colspan="2">R3 C1 e C2 with "</td>
        <td>R3 C3</td>
    </tr>
    <tr>
        <td>R4 C1</td>
        <td colspan=2>R4 C2 e C3 without "</td>
    </tr>
    <tr>
        <td>R5 C1</td>
        <td>R5 C2</td>
        <td>R5 C3</td>
    </tr>
    <tr>
        <td rowspan ="2">R6/R7 C1: Two lines rowspan. Must leave the second line blank.</td>
        <td>R6 C2</td>
        <td>R6 C3</td>
    </tr>
    <tr>
        <td>R7 C2</td>
        <td>R7 C3</td>
    </tr>
    <tr>
        <td>End</td>
    </tr>
</table>
7/11/2014 12:37:01 PM

Popular Answer

CSV is a very basic format with no concept of columns or rows beyond its delimiter and its end-of-line character, so it does not support rowspan or colspan values.
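To make the "delimiter and end of line only" point concrete, here is a minimal sketch of writing a single CSV row in C#. The class name `CsvLine` and its two methods are hypothetical helpers, not part of any library, and the quoting rule (wrap the field in double quotes and double any embedded quote) is the common convention rather than something the question specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class CsvLine
{
    // Quote a field only when it contains the delimiter, a double quote,
    // or a line break; embedded double quotes are doubled.
    public static string Escape(string field, char delimiter = ';')
    {
        field = field ?? string.Empty;
        bool needsQuotes = field.IndexOf(delimiter) >= 0
                        || field.IndexOf('"') >= 0
                        || field.IndexOf('\n') >= 0
                        || field.IndexOf('\r') >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    // A row is nothing more than its fields joined by the delimiter, so a
    // cell that "spans" two columns can only be written as a value followed
    // by an empty field.
    public static string Join(IEnumerable<string> fields, char delimiter = ';')
    {
        return string.Join(delimiter.ToString(),
                           fields.Select(f => Escape(f, delimiter)));
    }
}
```

For the row with `colspan="2"` in the sample table, `CsvLine.Join(new[] { "R3 C1 e C2 with \"", "", "R3 C3" })` returns `"R3 C1 e C2 with """;;R3 C3`: the spanned column shows up only as an extra empty field.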

If you want to approximate the rowspan and colspan layout, use an intermediate object model that records the exact contents of each cell and its position, and then export that model to CSV. Even then, despite your best efforts, the CSV format will not preserve the spans themselves the way an Excel sheet would; the spanned positions can only appear as extra empty fields.
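One possible shape for such an intermediate model is sketched below, as an illustration rather than a definitive implementation. The names `SpanCell` and `TableGrid` are hypothetical. The sketch assumes each cell has already been reduced to its text plus its colspan and rowspan, places every cell at the first free column of its row, and fills the other positions it covers with empty strings. Fields are joined as-is, so quoting of special characters is left out for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical cell descriptor: the text plus the spans read from the HTML.
public sealed class SpanCell
{
    public readonly string Text;
    public readonly int ColSpan;
    public readonly int RowSpan;

    public SpanCell(string text, int colSpan = 1, int rowSpan = 1)
    {
        Text = text;
        ColSpan = Math.Max(1, colSpan); // assumption: treat 0 or negative spans as 1
        RowSpan = Math.Max(1, rowSpan);
    }
}

public static class TableGrid
{
    // Expands the spans into a rectangular grid of strings.
    public static List<string[]> Build(IList<IList<SpanCell>> rows)
    {
        // One sparse map (column index -> text) per table row.
        var sparse = new List<Dictionary<int, string>>();
        for (int r = 0; r < rows.Count; r++) sparse.Add(new Dictionary<int, string>());

        for (int r = 0; r < rows.Count; r++)
        {
            int c = 0;
            foreach (SpanCell cell in rows[r])
            {
                // Skip positions already reserved by a rowspan from an earlier row.
                while (sparse[r].ContainsKey(c)) c++;

                // The text goes in the top-left position; every other covered
                // position is reserved as an empty field. Rowspans are clipped
                // at the last row of the table.
                for (int dr = 0; dr < cell.RowSpan && r + dr < rows.Count; dr++)
                    for (int dc = 0; dc < cell.ColSpan; dc++)
                        sparse[r + dr][c + dc] = (dr == 0 && dc == 0) ? cell.Text : "";
                c += cell.ColSpan;
            }
        }

        // Pad every row to the widest one so all lines have the same field count.
        int width = sparse.Select(row => row.Count == 0 ? 0 : row.Keys.Max() + 1)
                          .DefaultIfEmpty(0).Max();
        var grid = new List<string[]>();
        foreach (var row in sparse)
        {
            var line = new string[width];
            for (int c = 0; c < width; c++)
            {
                string text;
                line[c] = row.TryGetValue(c, out text) ? text : "";
            }
            grid.Add(line);
        }
        return grid;
    }

    // Joins the grid with ";" between fields and a line break between rows.
    public static string ToCsv(IEnumerable<string[]> grid)
    {
        return string.Join(Environment.NewLine, grid.Select(row => string.Join(";", row)));
    }
}
```

With Html Agility Pack, the spans for each `th`/`td` node could be read with `cell.GetAttributeValue("colspan", 1)` and `cell.GetAttributeValue("rowspan", 1)` before constructing a `SpanCell`; that wiring is an assumption about how the question's loop would be adapted, not something shown in the original answer. For the sample table, the row after the `rowspan="2"` cell would then start with an empty field, and the final "End" row would be padded to three fields.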

7/11/2014 11:59:57 AM



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow