Scrape data from a wiki page in C# (screen-scraping)

c# html-agility-pack screen screen-scraping


I want to scrape a Wiki page. Specifically, this one.

My app will allow users to enter the registration number of the vehicle (for example, SBS8988Z) and it will display the related information (which is on the page itself).

For example, if the user enters SBS8988Z into a text field in my application, it should look for the line on that wiki page

SBS8988Z (SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)

and return SBS8988Z (SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen).

My code so far is (copied and edited from various websites)...

WebClient getdeployment = new WebClient();
string url = "";

getdeployment.Headers["User-Agent"] = "NextBusApp/GetBusData UserAgent";
string sgwikiresult = getdeployment.DownloadString(url); // <<< EXCEPTION
MessageBox.Show(sgwikiresult); //for debugging only!

HtmlAgilityPack.HtmlDocument sgwikihtml = new HtmlAgilityPack.HtmlDocument();
sgwikihtml.Load(new StreamReader(sgwikiresult));
HtmlNode root = sgwikihtml.DocumentNode;

List<string> anchorTags = new List<string>();   

foreach(HtmlNode deployment in root.SelectNodes("SBS8988Z"))
    string att = deployment.OuterHtml;

However, I am getting a an ArgumentException was unhandled - Illegal Characters in path.

What is wrong with the code? Is there an easier way to do this? I'm using HtmlAgilityPack but if there is a better solution, I'd be glad to comply.

9/19/2011 12:29:05 PM

Accepted Answer

What's wrong with the code? To be blunt, everything. :P

The page is not formatted in the way you are reading it. You can't hope to get the desired contents that way.

The contents of the page (the part we're interested in) looks something like this:

<span id="Deployments" class="mw-headline">Deployments</span>
    <!-- ... -->
    (SLBP 192/194*)
    (SLBP 192/194*) - F&amp;N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)
    (SLBP SP)
    <!-- ... -->

Basically we need to find the b elements that contain the registration number we are looking for. Once we find that element, get the text and put it together to form the result. Here it is in code:

static string GetVehicleInfo(string reg)
    var url = "";

    // HtmlWeb is a helper class to get pages from the web
    var web = new HtmlAgilityPack.HtmlWeb();

    // Create an HtmlDocument from the contents found at given url
    var doc = web.Load(url);

    // Create an XPath to find the `b` elements which contain the registration numbers
    var xpath = "//h2[span/@id='Deployments']" // find the `h2` element that has a span with the id, 'Deployments' (the header)
              + "/following-sibling::p[1]"     // move to the first `p` element (where the actual content is in) after the header
              + "/b";                          // select the `b` elements

    // Get the elements from the specified XPath
    var deployments = doc.DocumentNode.SelectNodes(xpath);

    // Create a LINQ query to find the  requested registration number and generate a result
    var query =
        from b in deployments                 // from the list of registration numbers
        where b.InnerText == reg              // find the registration we're looking for
        select reg + b.NextSibling.InnerText; // and create the result combining the registration number with the description (the text following the `b` element)

    // The query should yield exactly one result (or we have a problem) or none (null)
    var content = query.SingleOrDefault();

    // Decode the content (to convert stuff like "&amp;" to "&")
    var decoded = System.Net.WebUtility.HtmlDecode(content);

    return decoded;
9/24/2011 5:51:01 AM

Related Questions


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow