HTML Agility Pack does not read table data correctly when the HTML is missing closing </tr> or </td> tags

.net c# html-agility-pack parsing winforms

Question

I am using HTML Agility Pack to parse HTML content and extract table information. This works, but when a closing </tr> or </td> tag is missing from the markup, the affected rows are not parsed correctly.

For example, in the HTML below, two of the rows with height="20" have no closing </tr>:

    <html>
  <head>
    <meta name="generator" content=
    "HTML Tidy for Windows (vers 14 February 2006), see www.w3.org">
    <title></title>
  </head>
  <body>
    <table cellspacing="0" cellpadding="0" width="100%" border="0">
      <tbody>
        <tr>
          <td class="xl27" valign="bottom" colspan="9">
            Sir / Madam,<br>
            I/We have this day done by your order and on your account the
            following transactions:
          </td>
          <td class="xl27boTRL" align="middle" colspan="5">
            Stamp duty as required under the relevant stamp act to be paid on
            consolidated basis at the end of the month.
          </td>
        </tr>
        <tr height="30">
          <td class="xl27boTBL" align="middle" width="7%">
            Order No
          </td>
          <td class="xl27boTBL" align="middle" width="4%">
            Order Time
          </td>

          <td class="xl27boTBL" align="middle" width="5%">
            Net Rate
          </td>
          <td class="xl27boTBL" align="middle" width="5%">
            Service Tax
          </td>
          <td class="xl27boTBL" align="middle" width="5%">
           Amount
          </td>
          <td class="xl27boTRBL" style="BORDER-BOTTOM: windowtext 1pt solid;"
          align="middle" width="8%">
          Net Amount Rs
          </td>
        </tr>
        <tr height="20">
          <td class="xl27boL" nowrap width="7%">
            25222105
          </td>
          <td class="xl27boL" nowrap width="4%">
            14:02:39
          </td>


          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boRL" nowrap align="right" width="8%">
            125288.00 
          </td>

        <tr height="20">
          <td class="xl27boL" nowrap width="7%">
            122122141
          </td>
          <td class="xl27boL" nowrap width="4%">
            14:01:56
          </td>


          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boRL" nowrap align="right" width="8%">
            249612.64 
          </td>

        <tr height="20">
          <td class="xl27boL" nowrap width="7%">
             
          </td>
          <td class="xl27boL" nowrap width="4%">
             
          </td>
          <td class="xl27boL" nowrap width="7%">
             
          </td>
          <td class="xl27boL" nowrap width="4%">
             
          </td>
          <td class="xl27boL" nowrap align="left" width="15%">
            [SERVICE TAX]
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="7%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boL" nowrap align="right" width="5%">
             
          </td>
          <td class="xl27boRL" nowrap align="right" width="8%">
            61.66
          </td>
        </tr>
      </tbody>
    </table>
  </body>
</html>

So how can I handle this?

Accepted Answer

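I tested several other ideas and none of them worked, so I think there are only two options:

  1. Modify HTML Agility Pack so that it handles this case.
  2. Fill in the missing </tr> tags yourself before parsing.

Here is a regex that fills in the missing </tr> tags: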

    html = Regex.Replace(html, "<tr[^>]*>(?:(?!</?tr>|</tbody>|</table>).)*?(?=<tr[^>]*>|</tbody>|</table>)", "$&</tr>", RegexOptions.Singleline | RegexOptions.IgnoreCase);

(If anyone can improve my regex, feel free.)
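
To see what the regex does, here is a minimal, self-contained sketch that applies it to a shortened fragment modeled on the table in the question. The class name `TableRowRepair`, the method name `CloseUnclosedRows`, and the sample string are made up for illustration; only the pattern and options come from the answer. It assumes flat (non-nested) tables, and it only adds missing </tr> tags, not missing </td> tags.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of HTML Agility Pack.
public static class TableRowRepair
{
    // Pattern taken unchanged from the answer: match an opening <tr ...>,
    // lazily consume text that contains no </tr>, </tbody> or </table>,
    // and stop right before the next <tr ...>, </tbody> or </table>.
    // A row that already has its </tr> never reaches that stop point,
    // so it is not matched and stays untouched.
    private static readonly Regex UnclosedRow = new Regex(
        "<tr[^>]*>(?:(?!</?tr>|</tbody>|</table>).)*?(?=<tr[^>]*>|</tbody>|</table>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Appends </tr> to every matched (unclosed) row.
    public static string CloseUnclosedRows(string html)
    {
        return UnclosedRow.Replace(html, "$&</tr>");
    }

    public static void Main()
    {
        // Assumed sample: the second and third rows lack a closing </tr>.
        string html =
            "<table><tbody>" +
            "<tr><td>Order No</td></tr>" +
            "<tr height=\"20\"><td>25222105</td>" +
            "<tr height=\"20\"><td>122122141</td>" +
            "<tr height=\"20\"><td>61.66</td></tr>" +
            "</tbody></table>";

        string repaired = CloseUnclosedRows(html);
        Console.WriteLine(repaired);

        // Running the repair a second time should change nothing,
        // because every row is closed after the first pass.
        Console.WriteLine(CloseUnclosedRows(repaired) == repaired);
    }
}
```

The repaired string can then be handed to HTML Agility Pack in place of the original markup. Nested tables are not covered by this sketch, so check the output on real input before relying on it.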


Popular Answer

You could try HTML Tidy / Tidy.NET to clean up the markup before parsing. It appears to solve your problem.




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow