Scrape data from a wiki page in C# (screen-scraping)

c# html-agility-pack screen screen-scraping


I need to scrape a wiki page, specifically this one.

Users of my app will be able to input the vehicle's registration number (for instance, SBS8988Z), and it will show the relevant information (which is on the page itself).

For instance, if the user types SBS8988Z into a text field, my program should look for this line on that wiki page:

SBS8988Z (SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)

and return the whole line, i.e. SBS8988Z (SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen).

My current code (copied and adapted from various websites) is:

WebClient getdeployment = new WebClient();
string url = "";

getdeployment.Headers["User-Agent"] = "NextBusApp/GetBusData UserAgent";
string sgwikiresult = getdeployment.DownloadString(url); // <<< EXCEPTION
MessageBox.Show(sgwikiresult); //for debugging only!

HtmlAgilityPack.HtmlDocument sgwikihtml = new HtmlAgilityPack.HtmlDocument();
sgwikihtml.Load(new StreamReader(sgwikiresult));
HtmlNode root = sgwikihtml.DocumentNode;

List<string> anchorTags = new List<string>();   

foreach(HtmlNode deployment in root.SelectNodes("SBS8988Z"))
{
    string att = deployment.OuterHtml;
}

However, I am getting an unhandled ArgumentException: Illegal characters in path.

What is wrong with the code? Is there a simpler way to do this? I am currently using HTML Agility Pack, but if there is a better approach, I'd be happy to use it.

9/19/2011 12:29:05 PM

Accepted Answer

What is wrong with the code? To put it simply, everything.

The page is not structured the way it looks when rendered, so you cannot expect to get the contents you need that way.
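As for the exception itself: it most likely comes from new StreamReader(sgwikiresult). The StreamReader(string) constructor treats its argument as a file path, so the downloaded markup ends up being interpreted as a file name. If the markup is already in a string, HtmlDocument.LoadHtml parses it directly. A minimal sketch of that, where the class name LoadHtmlSketch and the method FirstBoldText are made up for illustration:

```csharp
using HtmlAgilityPack;

static class LoadHtmlSketch
{
    // Hypothetical helper: parses markup that is already in memory and
    // returns the text of its first `b` element, or null if there is none.
    public static string FirstBoldText(string html)
    {
        var doc = new HtmlDocument();

        // LoadHtml parses the string itself. Load(new StreamReader(html))
        // would instead try to open a file whose *name* is the markup.
        doc.LoadHtml(html);

        var b = doc.DocumentNode.SelectSingleNode("//b");
        return b == null ? null : b.InnerText;
    }
}
```

In the code from the question, that would mean calling sgwikihtml.LoadHtml(sgwikiresult) instead of sgwikihtml.Load(new StreamReader(sgwikiresult)).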

The section of the page's content that interests us looks like this:

<h2><span id="Deployments" class="mw-headline">Deployments</span></h2>
<p>
    <!-- ... -->
    <b>...</b> (SLBP 192/194*)
    <b>SBS8988Z</b> (SLBP 192/194*) - F&amp;N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)
    <b>...</b> (SLBP SP)
    <!-- ... -->
</p>

Basically, we need to find the `b` elements that contain the registration number we are searching for. Once we've located the right one, we take the text that follows it and combine the two to produce the final result. The code is as follows:

// Requires `using System.Linq;` at the top of the file for the query syntax below
static string GetVehicleInfo(string reg)
{
    var url = "";

    // HtmlWeb is a helper class to get pages from the web
    var web = new HtmlAgilityPack.HtmlWeb();

    // Create an HtmlDocument from the contents found at the given url
    var doc = web.Load(url);

    // Create an XPath to find the `b` elements which contain the registration numbers
    var xpath = "//h2[span/@id='Deployments']" // find the `h2` element that has a span with the id 'Deployments' (the header)
              + "/following-sibling::p[1]"     // move to the first `p` element (where the actual content is) after the header
              + "/b";                          // select the `b` elements

    // Get the elements from the specified XPath
    var deployments = doc.DocumentNode.SelectNodes(xpath);

    // SelectNodes returns null (not an empty collection) when nothing matches
    if (deployments == null)
        return null;

    // Create a LINQ query to find the requested registration number and generate a result
    var query =
        from b in deployments                 // from the list of registration numbers
        where b.InnerText == reg              // find the registration we're looking for
        select reg + b.NextSibling.InnerText; // and create the result combining the registration number with the description (the text following the `b` element)

    // The query should yield exactly one result (or we have a problem) or none (null)
    var content = query.SingleOrDefault();

    // Decode the content (to convert stuff like "&amp;" to "&")
    var decoded = System.Net.WebUtility.HtmlDecode(content);

    return decoded;
}
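
To try the XPath and lookup without hitting the network, the same steps can be run against markup held in a string. This is only a sketch under assumptions: the class DeploymentLookup, the method FindDeployment and the Sample markup are made up for illustration, and Sample mirrors the excerpt above rather than the real page.

```csharp
using System.Linq;
using System.Net;
using HtmlAgilityPack;

static class DeploymentLookup
{
    // Made-up sample that imitates the structure of the excerpt above
    public const string Sample =
        "<h2><span id=\"Deployments\" class=\"mw-headline\">Deployments</span></h2>" +
        "<p><b>SBS8988Z</b> (SLBP 192/194*) - F&amp;N NutriSoy Fresh Milk: " +
        "Singapore's No. 1 Soya Milk! (2nd Gen)</p>";

    // Hypothetical helper: same XPath as above, applied to an in-memory string
    public static string FindDeployment(string html, string reg)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var nodes = doc.DocumentNode.SelectNodes(
            "//h2[span/@id='Deployments']/following-sibling::p[1]/b");
        if (nodes == null)
            return null; // SelectNodes returns null when nothing matches

        // Take the first `b` whose text equals the registration number
        var match = nodes.FirstOrDefault(b => b.InnerText.Trim() == reg);
        if (match == null || match.NextSibling == null)
            return null;

        // Registration number + the text node that follows it, entities decoded
        return WebUtility.HtmlDecode(reg + match.NextSibling.InnerText).TrimEnd();
    }
}
```

Calling DeploymentLookup.FindDeployment(DeploymentLookup.Sample, "SBS8988Z") should give back the full line for that vehicle, and a registration number that is not in the markup should give null, which the caller needs to check for before displaying anything.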
9/24/2011 5:51:01 AM

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow