Parsing with Async, HtmlAgilityPack, and XPath

asynchronous c# html-agility-pack web-scraping xpath


I have run into a rather strange problem. It's very hard to explain so please bear with me, but basically here is a brief introduction:

  • I am new to Async programming but couldn't locate a problem in my code
  • I have used HtmlAgilityPack before, but never the .NET 4.5 version.
  • This is a learning project; I am not trying to scrape or anything like that.

Basically, what is happening is this: I am retrieving a page from the internet, loading it via stream into an HtmlDocument, then retrieving certain HtmlNodes from it using XPath expressions. Here is a piece of simplified code:

            myStream = await httpClient.GetStreamAsync(string.Format("{0}{1}", SomeString, AnotherString));

            using (myStream)

The HTML is being retrieved correctly, but the HtmlNodes extracted by XPath are getting their HTML mangled. Here is a sample piece of HTML from a response captured in Fiddler:

                    <div id="menu">
   <div id="splash">
      <div id="menuItem_1" class="ScreenTitle"  >Horse Racing</div>
      <div id="menuItem_2" class="Title"  >Wednesday Racing</div>
      <div id="subMenu_2">
         <div id="menuItem_3" class="Level2"  >&#187;  <a href="../coupon/?ptid=4020&amp;key=2-70-70-22361707-2-20181217-0-0-1-0-0-4020-0-36200255-1-0-0-0-0">21.51 Britannia Way</a></div>
         <div id="menuItem_4" class="Level2"  >&#187;  <a href="../coupon/?ptid=4020&amp;key=2-70-70-22361710-2-20181217-0-0-1-0-0-4020-0-36200258-1-0-0-0-0">21.54 Britannia Way</a></div>
         <div id="menuItem_5" class="Level2"  >&#187;  <a href="../coupon/?ptid=4020&amp;key=2-70-70-22361713-2-20181217-0-0-1-0-0-4020-0-36200261-1-0-0-0-0">21.57 Britannia Way</a></div>
         <div id="menuItem_6" class="Level2"  >&#187;  <a href="../coupon/?ptid=4020&amp;key=2-70-70-22361716-2-20181217-0-0-1-0-0-4020-0-36200264-1-0-0-0-0">22.00 Britannia Way</a></div>
         <div id="menuItem_7" class="Level2"  >&#187;  <a href="../coupon/?ptid=4020&amp;key=2-70-70-22361719-2-20181217-0-0-1-0-0-4020-0-36200267-1-0-0-0-0">22.03 Britannia Way</a></div>
         <div id="menuItem_8" class="Level2"  >&#187;  <a href="../coupon/?ptid=4020&amp;key=2-70-70-22361722-2-20181217-0-0-1-0-0-4020-0-36200270-1-0-0-0-0">22.06 Britannia Way</a></div>

The XPath I am using is 100% correct because it works in the browser on the same page, but here is an example of an <a> tag that it is retrieving from the previously shown page:

<a href="./coupon/?ptid=4020&amp;key=2-70-70-22361710-2-20181217-0-0-1-0-0-4020-0-36200258-1-0-0-0-0"">1.54 Britannia Way</</a>

And here is the original which I copied from above for simplicity:

<a href="../coupon/?ptid=4020&amp;key=2-70-70-22361710-2-20181217-0-0-1-0-0-4020-0-36200258-1-0-0-0-0">21.54 Britannia Way</a></div>

As you can see, the InnerText has changed considerably and so has the URL. Obviously my program doesn't work, but I don't know why. What can cause this? Is it a bug in HtmlAgilityPack? Please advise! Thanks for reading!

6/11/2014 9:10:32 PM

Accepted Answer

After many hours of guessing and debugging, the problem turned out to be an HtmlDocument that I was re-using. I solved the problem by creating a new HtmlDocument each time I wanted to load a new page, instead of using the same one.
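
A minimal sketch of that fix, assuming the HtmlAgilityPack NuGet package is referenced. The names PageParser, ExtractTexts, and FetchTextsAsync are made up for illustration, and the page is fetched as a string (GetStringAsync, LoadHtml) rather than as a stream only to keep the example short. The point is that every call builds its own HtmlDocument:

```csharp
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public static class PageParser
{
    // Parses one page and returns the InnerText of every node matched by xpath.
    // A new HtmlDocument is created on every call, so nothing from a
    // previously loaded page can leak into this one.
    public static List<string> ExtractTexts(string html, string xpath)
    {
        var doc = new HtmlDocument();   // fresh instance per page
        doc.LoadHtml(html);

        var results = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                results.Add(node.InnerText);
            }
        }
        return results;
    }

    // Hypothetical helper: downloads a page and hands it to ExtractTexts.
    public static async Task<List<string>> FetchTextsAsync(
        HttpClient httpClient, string url, string xpath)
    {
        string html = await httpClient.GetStringAsync(url);
        return ExtractTexts(html, xpath);
    }
}
```

The HttpClient can still be shared between requests; only the HtmlDocument is recreated for each page.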

I hope this saves you the time that I lost!

6/12/2014 10:33:31 PM

Popular Answer

Don't assume that an XPath expression that works in your browser (after DOM conversion, possibly after data has been loaded with AJAX, ...) will also match the raw HTML your program receives. This looks like a site offering betting odds, so I'd guess they load the data with some JavaScript calls.

Verify whether your XPath expression matches the page's source code (as fetched using wget or by clicking "View Source Code" in your browser). Don't use Firebug or similar tools for this, because they show the DOM after the browser has processed it.

If the site is using AJAX to load the data, you might have luck using Firebug to monitor which resources get fetched while the page loads. Often these are JSON or XML files that are very easy to parse, and working with them is easier than digging through a horrible mess of HTML.

Update: In this particular case, the site redirects users who do not send an Accept-Language header to a language-selection page. Send such a header to receive the same content as the browser does. In curl, it would look like this:

curl -H "Accept-Language: en-US;q=0.6,en;q=0.4"
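
Since the question uses HttpClient rather than curl, the same header could be set roughly like this. This is a sketch only: ClientFactory and CreateClient are hypothetical names, and the language values are simply copied from the curl line above.

```csharp
using System.Net.Http;
using System.Net.Http.Headers;

public static class ClientFactory
{
    // Builds an HttpClient that sends an Accept-Language header with every request.
    public static HttpClient CreateClient()
    {
        var httpClient = new HttpClient();

        // Each entry is a language tag plus its quality (q) value,
        // mirroring "en-US;q=0.6,en;q=0.4" from the curl example.
        httpClient.DefaultRequestHeaders.AcceptLanguage.Add(
            new StringWithQualityHeaderValue("en-US", 0.6));
        httpClient.DefaultRequestHeaders.AcceptLanguage.Add(
            new StringWithQualityHeaderValue("en", 0.4));

        return httpClient;
    }
}
```

Requests made through the returned client, such as the GetStreamAsync call in the question, should then carry the header.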


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow