Get specific data from a webpage with HTMLAgilityPack

c# html-agility-pack xpath

Question

I've been trying to get data from a webpage in C# using the HTML Agility Pack. I have been able to retrieve data from other webpages, but on this one I am getting a NullReferenceException, and my only guess is that it has something to do with the XPath.

Here is my code, which tries to reach the 'Limbo Wand' text:

string url = "https://www.dofus.com/en/mmorpg/encyclopedia/weapons/180-limbo-wand";
HtmlWeb htmlWeb = new HtmlWeb();
HtmlDocument doc = htmlWeb.Load(url);

string weaponName = doc.DocumentNode.SelectNodes("/html/body/div[2]/div[2]/div/div/div/main/div[2]/div/div[2]/h1/text()")[0].InnerText; // <-- NullReferenceException here

Removing the text() from my XPath doesn't help, and even trying to get the text from /html/head/title doesn't work.

Is there anything wrong with my XPath? Or is there a problem with the webpage that prevents HTML Agility Pack from parsing it properly?

Thank you in advance to anyone who may be able to give me some hints!

Popular Answer

HtmlWeb is a poor choice for fetching the source of a site, mostly because it doesn't handle redirects, although I am not sure that is the underlying problem here. Use a web request instead, like so:

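using System;
using System.Net;
using System.Text;
using HtmlAgilityPack;

// Fetch the page with HttpWebRequest and hand the response stream to HtmlAgilityPack.
HtmlDocument doc = new HtmlDocument();
try
{
    var request = (HttpWebRequest)WebRequest.Create("https://www.dofus.com/en/mmorpg/encyclopedia/weapons/180-limbo-wand");
    request.Method = "GET";

    using (var response = (HttpWebResponse)request.GetResponse())
    using (var stream = response.GetResponseStream())
    {
        // The encoding passed here should match the charset the page is actually served in.
        doc.Load(stream, Encoding.GetEncoding("iso-8859-9"));
    }
}
catch (WebException ex)
{
    // Network and HTTP errors end up here; doc stays empty in that case.
    Console.WriteLine(ex.Message);
}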
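The snippet above hardcodes the encoding name. As a rough sketch of one alternative, the helper below picks an encoding from a charset string (for example the value of `response.CharacterSet`) and falls back to UTF-8 when the name is missing or unknown. `EncodingSketch` and `ResolveEncoding` are hypothetical names introduced for illustration, and UTF-8 as the fallback is an assumption rather than something the answer states.

```csharp
using System;
using System.Text;

public static class EncodingSketch
{
    // Hypothetical helper: maps a charset name to an Encoding,
    // falling back to UTF-8 (an assumption) when the name is empty or unknown.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
            return Encoding.UTF8;

        try
        {
            // Some servers wrap the charset value in quotes, so strip them first.
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            // Thrown when the runtime does not know the requested encoding name.
            return Encoding.UTF8;
        }
    }

    public static void Main()
    {
        Console.WriteLine(ResolveEncoding("utf-8").WebName);
        Console.WriteLine(ResolveEncoding(null).WebName);
        Console.WriteLine(ResolveEncoding("no-such-charset").WebName);
    }
}
```

With this helper, the load call could become `doc.Load(stream, EncodingSketch.ResolveEncoding(response.CharacterSet))`. On .NET Core and later, code-page encodings such as iso-8859-9 are typically only available after registering `CodePagesEncodingProvider`, so the fallback branch may be hit there.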

After this you have an HtmlDocument, and you can easily get the title like so (since there is only one title tag):

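// "//title" searches the whole document; a rooted "/title" would only match a top-level <title> element.
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//title").InnerText);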
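The NullReferenceException in the question comes from indexing or dereferencing a result that is null: HTML Agility Pack returns null from SelectNodes and SelectSingleNode when nothing matches. A minimal offline sketch of a null-safe lookup follows; the inline HTML string and the names `TitleLookupSketch` and `FirstTextOrNull` are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class TitleLookupSketch
{
    // Hypothetical helper: returns the trimmed text of the first node
    // matching the XPath, or null when the XPath matches nothing.
    public static string FirstTextOrNull(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null instead of throwing when there is no match.
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node?.InnerText.Trim();
    }

    public static void Main()
    {
        // Made-up sample markup, not the real page.
        const string html = "<html><head><title>Limbo Wand</title></head><body></body></html>";

        // Descendant search: finds the title wherever it sits in the tree.
        Console.WriteLine(FirstTextOrNull(html, "//title") ?? "(no match)");

        // An XPath that matches nothing yields null rather than an exception.
        Console.WriteLine(FirstTextOrNull(html, "//h1") ?? "(no match)");
    }
}
```

Checking for null before reading InnerText turns a crash into a visible "no match", which makes it easier to tell a wrong XPath apart from a page that did not load as expected.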

Now, to get the weapon name, the simplest XPath would be this:

Console.WriteLine(doc.DocumentNode.SelectSingleNode("//h1[@class='ak-return-link']").InnerText.Trim());

The Trim() at the end is just to remove the whitespace at the start and end of the string.
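The class-based lookup can be made a little more tolerant. The sketch below assumes the h1 might carry more than one class, so it tests for the `ak-return-link` token rather than an exact attribute match, and it returns null instead of throwing when the heading is absent. The sample markup and the names `WeaponNameSketch` and `ExtractName` are hypothetical; only the class name comes from the answer.

```csharp
using System;
using HtmlAgilityPack;

public static class WeaponNameSketch
{
    // Hypothetical helper: returns the heading text, or null if no matching h1 exists.
    public static string ExtractName(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Match the class as a whitespace-separated token so extra classes do not break the lookup.
        HtmlNode h1 = doc.DocumentNode.SelectSingleNode(
            "//h1[contains(concat(' ', normalize-space(@class), ' '), ' ak-return-link ')]");
        if (h1 == null)
            return null;

        // Decode entities such as &amp; and strip surrounding whitespace.
        return HtmlEntity.DeEntitize(h1.InnerText).Trim();
    }

    public static void Main()
    {
        // Made-up sample markup; the real page structure may differ.
        const string html =
            "<html><body><h1 class=\"ak-return-link other\">\n  Limbo Wand\n</h1></body></html>";

        Console.WriteLine(ExtractName(html) ?? "(no match)");
        Console.WriteLine(ExtractName("<html><body></body></html>") ?? "(no match)");
    }
}
```

An exact `@class='ak-return-link'` test, as in the answer, is shorter and works as long as the attribute holds exactly that value.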




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow