Trying to extract data from a webpage using HtmlAgilityPack

c# html-agility-pack web

Question

I'm trying to extract a single value from this page:
http://www.dsebd.org/displayCompany.php?name=NBL
The field I need (highlighted in the accompanying image) has this XPath: /html/body/table[2]/tbody/tr/td[2]/table/tbody/tr[3]/td1/p1/table1/tbody/tr/td1/table/tbody/tr[2]/td[2]/font

Error: an exception is thrown, and the data at that XPath cannot be found. "An unhandled exception of type 'System.Net.WebException' occurred in HtmlAgilityPack.dll."


Source code:

static void Main(string[] args)
{
    /************************************************************************/
    string tickerid = "Bse_Prc_tick";
    HtmlAgilityPack.HtmlDocument doc = new HtmlWeb().Load(@"http://www.dsebd.org/displayCompany.php?name=NBL", "GET");

    if (doc != null)
    {
        // Fetch the stock price from the Web page
        string stockprice = doc.DocumentNode.SelectSingleNode(string.Format("./html/body/table[2]/tbody/tr/td[2]/table/tbody/tr[3]/td1/p1/table1/tbody/tr/td1/table/tbody/tr[2]/td[2]/font", tickerid)).InnerText;
        Console.WriteLine(stockprice);
    }
    Console.WriteLine("ReadKey Starts........");
    Console.ReadKey();
}
6/20/2014 8:05:55 AM

Accepted Answer

I checked it. The XPaths you were using are simply wrong. The real fun starts when you try to work out why.

Just look at the page's source: it contains multiple html tags plus numerous errors, which makes XPath difficult to use.

Your tool, Chrome Dev Tools, operates on a DOM tree that the browser has already repaired (everything packed into a single html node, some tbody elements added, etc.).

HtmlAgilityPack's parsing breaks because the HTML structure itself is broken.
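
To illustrate the tbody point: an XPath copied from the browser usually contains tbody steps that do not exist in the raw HTML. The sketch below strips them from an XPath string. The class and method names (XPathCleanup, StripTbody) are hypothetical, and for a page as malformed as this one, removing tbody alone may still not produce a path that HtmlAgilityPack can resolve.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XPathCleanup
{
    // Removes every "/tbody" step (with an optional [n] index) from an XPath string.
    public static string StripTbody(string xpath)
    {
        return Regex.Replace(xpath, @"/tbody(\[\d+\])?(?=/|$)", string.Empty);
    }

    public static void Main()
    {
        // Shortened, illustrative path in the style Chrome Dev Tools produces.
        var fromDevTools = "/html/body/table[2]/tbody/tr/td[2]/font";
        Console.WriteLine(StripTbody(fromDevTools));
        // prints: /html/body/table[2]/tr/td[2]/font
    }
}
```

Whatever path you end up with, SelectSingleNode returns null when nothing matches, so check the result before reading InnerText.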

Given that, you can either use a regular expression or simply search for known markers in the HTML (which is much faster, but less flexible).

For instance:

...
using System.Net; //required for Webclient
...
        class Program
        {
            //entry point of console app
            static void Main(string[] args)
            {
                // url to download
                // "var" means I am too lazy to write "string" and let compiler decide typing
                var url = @"http://www.dsebd.org/displayCompany.php?name=NBL";

                // a using block calls Dispose() on the object as soon as the block ends, instead of leaving cleanup to the garbage collector
                using (WebClient client = new WebClient()) // WebClient implements IDisposable
                {

                    // simply download result to string, in this case it will be html code
                    string htmlCode = client.DownloadString(url);
                    // cut the html at the position of "Last Trade:" and keep everything from there on
                    // searching from the beginning of a string is easier/faster than searching in the middle
                    htmlCode = htmlCode.Substring(
                        htmlCode.IndexOf("Last Trade:")
                        );
                    // select from .. to .. and then remove leading and trailing whitespace characters
                    htmlCode = htmlCode.Substring("2\">", "</font></td>").Trim();
                    Console.WriteLine(htmlCode);
                }
                Console.ReadLine();
            }
        }
        // http://stackoverflow.com/a/17253735/3147740 <- copied from here
        // this is an extension class that adds the overloaded Substring() used above; it does what its comments say
        public static class StringExtensions
        {
            /// <summary>
            /// takes a substring between two anchor strings (or the end of the string if that anchor is null)
            /// </summary>
            /// <param name="this">a string</param>
            /// <param name="from">an optional string to search after</param>
            /// <param name="until">an optional string to search before</param>
            /// <param name="comparison">an optional comparison for the search</param>
            /// <returns>a substring based on the search</returns>
            public static string Substring(this string @this, string from = null, string until = null, StringComparison comparison = StringComparison.InvariantCulture)
            {
                var fromLength = (from ?? string.Empty).Length;
                var startIndex = !string.IsNullOrEmpty(from)
                    ? @this.IndexOf(from, comparison) + fromLength
                    : 0;

                if (startIndex < fromLength) { throw new ArgumentException("from: Failed to find an instance of the first anchor"); }

                var endIndex = !string.IsNullOrEmpty(until)
                ? @this.IndexOf(until, startIndex, comparison)
                : @this.Length;

                if (endIndex < 0) { throw new ArgumentException("until: Failed to find an instance of the last anchor"); }

                var subString = @this.Substring(startIndex, endIndex - startIndex);
                return subString;
            }
        }
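
The regular-expression route mentioned above could look roughly like the following sketch. It relies on the same anchors as the string search ("Last Trade:", then `2">`, then `</font>`); the sample fragment is hypothetical and the real page markup may differ, so treat the pattern as a starting point. The class name LastTradeRegex is likewise made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LastTradeRegex
{
    // Returns the trimmed text that follows the "Last Trade:" label,
    // or null when the pattern does not match.
    public static string Extract(string html)
    {
        // Singleline lets '.' cross line breaks between the label and the value.
        var match = Regex.Match(
            html,
            @"Last Trade:.*?2"">\s*([^<]+?)\s*</font>",
            RegexOptions.Singleline);
        return match.Success ? match.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Hypothetical fragment, not copied from the real page.
        var sample = "<td><font size=\"2\">Last Trade:</font></td>\n"
                   + "<td><font size=\"2\"> 12.30 </font></td>";
        Console.WriteLine(Extract(sample) ?? "not found");
    }
}
```

Returning null instead of throwing keeps the caller in control when the site changes its layout and the anchors disappear.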
6/20/2014 9:45:54 AM

Popular Answer

To learn more about the exception, enclose your code in a try-catch block.
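
A minimal sketch of that advice follows. It uses WebClient so that it needs no extra package; since the question's error is also a System.Net.WebException, the same catch clause should work around the HtmlWeb().Load call. The class and helper names (DownloadDiagnostics, Describe) are hypothetical.

```csharp
using System;
using System.Net;

public static class DownloadDiagnostics
{
    // Builds a one-line summary of a WebException for logging.
    public static string Describe(WebException ex)
    {
        var summary = "WebException: " + ex.Status + " - " + ex.Message;
        // Response is only set when the server actually answered (e.g. 404 or 500).
        var response = ex.Response as HttpWebResponse;
        if (response != null)
        {
            summary += " (HTTP " + (int)response.StatusCode + ")";
        }
        return summary;
    }

    public static void Main()
    {
        try
        {
            using (var client = new WebClient())
            {
                string html = client.DownloadString("http://www.dsebd.org/displayCompany.php?name=NBL");
                Console.WriteLine("Downloaded " + html.Length + " characters");
            }
        }
        catch (WebException ex)
        {
            // Status distinguishes DNS failures, timeouts, protocol errors, and so on.
            Console.WriteLine(Describe(ex));
        }
    }
}
```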




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow