HtmlAgilityPack - Getting data from an HTML table

c# html html-agility-pack screen-scraping

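My program uses HtmlAgilityPack to fetch an HTML web page and store it in a variable, and I'm trying to get the two tables that sit inside a specific div class (boardcontainer). My current code searches the whole page for every table and displays them, but when a cell is empty it throws an exception:

"NullReferenceException was unhandled - Object reference not set to an instance of an object."

A small portion of the HTML (in this case I searched the site for 'Microsoft'):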

<div class="boardcontainer">
<table cellpadding="4" cellspacing="1" border="0" width="100%">
<tr><td colspan="6" class="catbg" height="18" >Main Database</td></tr>
<tr>
    <td class="windowbg" width="28%" align="center">Company Name</td>
    <td class="windowbg" width="12%" align="center">0870 / 0871</td>
    <td class="windowbg" width="12%" align="center">0844 / 0845</td>
    <td class="windowbg" width="12%" align="center">01 / 02 / 03</td>
    <td class="windowbg" width="12%" align="center">Freephone</td>
    <td class="windowbg" width="24%" align="center">Other Information</td>
</tr>
    <tr>
<td class=windowbg2 width=28% align=center BGCOLOR=#FFFFCC><a href=http://www.websitename.com/exit.php?site=www.microsoft.co.uk target="_blank">Microsoft</a></td><td class=windowbg2 width=12% align=center BGCOLOR=#FFFFCC>0870 601 0100</a></td><td class=windowbg2 width=12% align=center BGCOLOR=#FFFFCC>0844 800 2400</a></td><td class=windowbg2 width=12% align=center BGCOLOR=#FFFFCC>01954 713950</a></td><td class=windowbg2 width=12% align=center BGCOLOR=#FFFFCC></a></td><td class=windowbg2 width=24% align=center BGCOLOR=#FFFFCC><b>Customer Support</b><br><i>Straight to agent (no menu)</i><br><font size=1>Also for 0870 6010200</font></td></tr>
    <tr>
<td class=windowbg2 width=28% align=center BGCOLOR=#FFFFCC><a href=http://www.websitename.com/exit.php?site=www.microsoft.co.uk target="_blank">Microsoft</a></td><td class=windowbg2 width=12% align=center BGCOLOR=#FFFFCC>0870 601 0100</a></td><td class=windowbg2 width=12% align=center BGCOLOR=#FFFFCC>0844 800 2400</a></td><td class=windowbg2 width=12% align=center BGCOLOR=#FFFFCC>0118 909 7800</a></td><td class=windowbg2 width=12% align=center BGCOLOR=#FFFFCC></a></td><td class=windowbg2 width=24% align=center BGCOLOR=#FFFFCC><b>Main UK Switchboard</b><br><i>Ask to be put through to required department</i><br><font size=1>Also for 0870 6010200</font></td></tr>
    <tr>

My current code just grabs every table in the page and prints each row and cell, then throws the exception when it hits a null.

How can I change this to search for a specific div class and extract the tables inside it?

Thanks for reading.

Accepted Answer

The following XPath lets you search the HTML document for a specific DIV (the one with the class "boardcontainer"):

//div[@class='boardcontainer']/table

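To handle empty rows, just check whether the returned HtmlNodeCollection is null. SelectNodes returns null, not an empty collection, when nothing matches, so looping over the result without that check is what raises the NullReferenceException.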

You should also check that a table was actually found, and that the table you found contains rows.
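
The checks described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the answerer's original code: the class name `BoardScraper`, the method `ExtractRows` and the nested-list return type are made up for illustration, and it assumes the HtmlAgilityPack package is referenced. Every `SelectNodes` result is tested for null before it is looped over, which covers a missing table, a table without rows, and a row without cells (such as the dangling `<tr>` at the end of the sample HTML).

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class BoardScraper
{
    // Returns the cell text of every row of every table that is a direct
    // child of <div class="boardcontainer">, one inner list per row.
    public static List<List<string>> ExtractRows(string html)
    {
        var rows = new List<List<string>>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection tables =
            doc.DocumentNode.SelectNodes("//div[@class='boardcontainer']/table");
        if (tables == null)
            return rows; // no matching table was found

        foreach (HtmlNode table in tables)
        {
            HtmlNodeCollection trs = table.SelectNodes(".//tr");
            if (trs == null)
                continue; // the table contains no rows

            foreach (HtmlNode tr in trs)
            {
                HtmlNodeCollection cells = tr.SelectNodes("th|td");
                if (cells == null)
                    continue; // the row contains no cells

                var row = new List<string>();
                foreach (HtmlNode cell in cells)
                {
                    // An empty cell yields an empty string here.
                    row.Add(HtmlEntity.DeEntitize(cell.InnerText).Trim());
                }
                rows.Add(row);
            }
        }
        return rows;
    }
}
```

Calling `BoardScraper.ExtractRows(html)` on the downloaded page would then give one list of strings per table row, with empty cells showing up as empty strings instead of causing an exception.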


Popular Answer

Try:

foreach (HtmlNode table in 
         htmlDoc.DocumentNode.SelectNodes("//div[@class='boardcontainer']/table"))

It is an XPath expression that matches on an attribute. For details, see here:

http://www.exampledepot.com/egs/org.w3c.dom/xpath_getelembyattr.html
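
One caveat with the `foreach` in this answer: if the page contains no matching div, `SelectNodes` returns null and the loop itself throws a NullReferenceException. A small guard avoids that. The sketch below assumes the HtmlAgilityPack package is referenced; the names `XPathHelpers`, `SelectNodesOrEmpty` and `CountBoardTables` are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helpers for illustration; not part of HtmlAgilityPack.
public static class XPathHelpers
{
    // Wraps SelectNodes so callers can always iterate the result safely.
    public static IEnumerable<HtmlNode> SelectNodesOrEmpty(HtmlNode root, string xpath)
    {
        HtmlNodeCollection nodes = root.SelectNodes(xpath);
        if (nodes == null)
            return Enumerable.Empty<HtmlNode>(); // nothing matched
        return nodes;
    }

    // Counts the tables directly inside <div class="boardcontainer">.
    public static int CountBoardTables(string html)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        int count = 0;
        foreach (HtmlNode table in SelectNodesOrEmpty(
                     htmlDoc.DocumentNode, "//div[@class='boardcontainer']/table"))
        {
            count++;
        }
        return count;
    }
}
```

Note that `@class='boardcontainer'` only matches when the class attribute is exactly that string, which is the case in the HTML shown in the question.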




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow