C#從維基頁面刮取數據(屏幕抓取)

c# html-agility-pack screen screen-scraping

我想刮一個維基頁面。具體來說, 這個。

我的應用程序將允許用戶輸入車輛的註冊號(例如,SBS8988Z),它將顯示相關信息(位於頁面本身)。

例如,如果用戶在我的應用程序的文本字段中輸入SBS8988Z,它應該在該Wiki頁面上查找該行

SBS8988Z (SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)

返回SBS8988Z(SLBP 192/194 *) - F&N NutriSoy鮮奶:新加坡第一大豆奶! (第二代)。

到目前為止我的代碼是(從各種網站複製和編輯)...

WebClient getdeployment = new WebClient();
string url = "http://sgwiki.com/wiki/Scania_K230UB_(Batch_1_Euro_V)";

getdeployment.Headers["User-Agent"] = "NextBusApp/GetBusData UserAgent";
string sgwikiresult = getdeployment.DownloadString(url); // <<< EXCEPTION
MessageBox.Show(sgwikiresult); //for debugging only!

HtmlAgilityPack.HtmlDocument sgwikihtml = new HtmlAgilityPack.HtmlDocument();
sgwikihtml.Load(new StreamReader(sgwikiresult));
HtmlNode root = sgwikihtml.DocumentNode;

List<string> anchorTags = new List<string>();   

foreach(HtmlNode deployment in root.SelectNodes("SBS8988Z"))
{
    string att = deployment.OuterHtml;
    anchorTags.Add(att);
}

但是,我得到一個未處理的ArgumentException - 路徑中的非法字符。

代碼有什麼問題?有更簡單的方法嗎?我正在使用HtmlAgilityPack,但如果有更好的解決方案,我很樂意遵守。

一般承認的答案

代碼有什麼問題?坦率地說,一切。 :P

頁面的格式不符合您的閱讀方式。你不能希望以這種方式獲得所需的內容。

頁面的內容(我們感興趣的部分)看起來像這樣:

<h2>
<span id="Deployments" class="mw-headline">Deployments</span>
</h2>
<p>
    <!-- ... -->
    <b>SBS8987B</b>
    (SLBP 192/194*)
    <br>
    <b>SBS8988Z</b>
    (SLBP 192/194*) - F&amp;N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)
    <br>
    <b>SBS8989X</b>
    (SLBP SP)
    <br>
    <!-- ... -->
</p>

基本上我們需要找到包含我們正在尋找的註冊號的b元素。找到該元素後,獲取文本並將其組合在一起以形成結果。這是代碼:

static string GetVehicleInfo(string reg)
{
    var url = "http://sgwiki.com/wiki/Scania_K230UB_%28Batch_1_Euro_V%29";

    // HtmlWeb is a helper class to get pages from the web
    var web = new HtmlAgilityPack.HtmlWeb();

    // Create an HtmlDocument from the contents found at given url
    var doc = web.Load(url);

    // Create an XPath to find the `b` elements which contain the registration numbers
    var xpath = "//h2[span/@id='Deployments']" // find the `h2` element that has a span with the id, 'Deployments' (the header)
              + "/following-sibling::p[1]"     // move to the first `p` element (where the actual content is in) after the header
              + "/b";                          // select the `b` elements

    // Get the elements from the specified XPath
    var deployments = doc.DocumentNode.SelectNodes(xpath);

    // Create a LINQ query to find the  requested registration number and generate a result
    var query =
        from b in deployments                 // from the list of registration numbers
        where b.InnerText == reg              // find the registration we're looking for
        select reg + b.NextSibling.InnerText; // and create the result combining the registration number with the description (the text following the `b` element)

    // The query should yield exactly one result (or we have a problem) or none (null)
    var content = query.SingleOrDefault();

    // Decode the content (to convert stuff like "&amp;" to "&")
    var decoded = System.Net.WebUtility.HtmlDecode(content);

    return decoded;
}


Related

許可下: CC-BY-SA with attribution
不隸屬於 Stack Overflow
許可下: CC-BY-SA with attribution
不隸屬於 Stack Overflow