HTML Agility Pack does not parse tables correctly when closing </tr> or </td> tags are missing

.net c# html-agility-pack parsing winforms

Question

I am using HTML Agility Pack to parse HTML content and extract table data. This works, but when closing </tr> or </td> tags are missing, the rows and cells that lack them are not parsed correctly.

For example:

    <html>
  <head>
    <meta name="generator" content=
    "HTML Tidy for Windows (vers 14 February 2006), see www.w3.org">
    <title></title>
  </head>
  <body>
    <table cellspacing="0" cellpadding="0" width="100%" border="0">
      <tbody>
        <tr>
          <td class="xl27" valign="bottom" colspan="9">
            Sir / Madam,<br>
            I/We have this day done by your order and on your account the
            following transactions:
          </td>
          <td class="xl27boTRL" align="middle" colspan="5">
            Stamp duty as required under the relevant stamp act to be paid on
            consolidated basis at the end of the month.
          </td>
        </tr>
        <tr height="30">
          <td class="xl27boTBL" align="middle" width="7%">
            Order No
          </td>
          <td class="xl27boTBL" align="middle" width="4%">
            Order Time
          </td>

          <td class="xl27boTBL" align="middle" width="5%">
            Net Rate
          </td>
          <td class="xl27boTBL" align="middle" width="5%">
            Service Tax
          </td>
          <td class="xl27boTBL" align="middle" width="5%">
           Amount
          </td>
          <td class="xl27boTRBL" style="BORDER-BOTTOM: windowtext 1pt solid;"
          align="middle" width="8%">
          Net Amount Rs
          </td>
        </tr>
        <tr height="20">
          <td class="xl27boL" nowrap width="7%">
            25222105
          </td>
          <td class="xl27boL" nowrap width="4%">
            14:02:39
          </td>


          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boRL" nowrap align="right" width="8%">
            125288.00 
          </td>

        <tr height="20">
          <td class="xl27boL" nowrap width="7%">
            122122141
          </td>
          <td class="xl27boL" nowrap width="4%">
            14:01:56
          </td>


          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boRL" nowrap align="right" width="8%">
            249612.64 
          </td>

        <tr height="20">
          <td class="xl27boL" nowrap width="7%">
             
          </td>
          <td class="xl27boL" nowrap width="4%">
             
          </td>
          <td class="xl27boL" nowrap width="7%">
             
          </td>
          <td class="xl27boL" nowrap width="4%">
             
          </td>
          <td class="xl27boL" nowrap align="left" width="15%">
            [SERVICE TAX]
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="7%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boRL" nowrap align="right" width="8%">
            61.66
          </td>
        </tr>
      </tbody>
    </table>
  </body>
</html>

So what should I do about this?


Accepted answer

Since you tested my other idea and it did not work, I think you have only two options:

  1. Modify HTML Agility Pack to handle your case, or
  2. Fill in the missing </tr> tags yourself.

Here is a regular expression that might fill in the missing </tr> tags for you:

    html = Regex.Replace(html, "<tr[^>]*>(?:(?!</?tr>|</tbody>|</table>).)*?(?=<tr[^>]*>|</tbody>|</table>)", "$&</tr>", RegexOptions.Singleline | RegexOptions.IgnoreCase);

(If anyone can improve my regular expression, please feel free.)
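
As a rough illustration of option 2, the regular expression above could be wrapped in a small helper that runs before the HTML is handed to the parser. This is only a sketch: `TableRowFixer` and `CloseMissingRows` are hypothetical names made up for this example, not part of HTML Agility Pack, and the pattern is the one from this answer, unchanged.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper (not part of HTML Agility Pack): appends "</tr>" to
// any <tr> row that runs straight into the next <tr>, </tbody> or </table>.
public static class TableRowFixer
{
    // Same pattern as in the answer above, built once and reused.
    private static readonly Regex UnclosedRow = new Regex(
        "<tr[^>]*>(?:(?!</?tr>|</tbody>|</table>).)*?(?=<tr[^>]*>|</tbody>|</table>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string CloseMissingRows(string html)
    {
        // "$&" stands for the whole match (the unclosed row),
        // so the replacement is that row followed by "</tr>".
        return UnclosedRow.Replace(html, "$&</tr>");
    }
}
```

The repaired string can then be loaded as before, for example with `HtmlDocument.LoadHtml`. Note that the pattern only closes rows: it does not add missing </td> tags, and I would not expect it to cope with nested tables.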


Popular answer

You can try HTML Tidy (Tidy.NET). It seems to solve your problem.




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow