Using Html Agility Pack, how do I retrieve img/src or a/hrefs?

.net c# html html-agility-pack html-parsing


I want to use the HTML Agility Pack to extract image sources and href links from an HTML page, but I don't know much about XML or XPath. I have searched many websites for help documents, but I still can't solve the problem. I'm using C# in Visual Studio 2005. My English is not very good, so I would sincerely thank anyone who can write some useful code.

1/29/2011 8:48:02 AM

Accepted Answer

The first example on the home page does something very similar, but consider:

 HtmlDocument doc = new HtmlDocument();
 doc.Load("file.htm"); // would need doc.LoadHtml(htmlSource) if it is not a file
 // Newer releases use doc.DocumentNode and link.Attributes["href"] instead.
 foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[@href]"))
 {
    string href = link["href"].Value;
    // store href somewhere
 }

For img/@src, you can imagine just replacing each a with img, and href with src. You might even be able to simplify it to:

 foreach(HtmlNode node in doc.DocumentElement
              .SelectNodes("//a/@href | //img/@src"))
 { /* process each matched node here */ }

For relative URL handling, look at the Uri class.
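
As a minimal sketch of that relative URL handling, the snippet below resolves an href or src value against the address of the page it came from, using only System.Uri. The class name LinkResolver and the method Resolve are hypothetical names chosen for illustration; they are not part of the Html Agility Pack. It assumes the page URL is absolute (the Uri constructor throws otherwise).

```csharp
using System;

public static class LinkResolver
{
    // Hypothetical helper (not part of Html Agility Pack): resolves an
    // href or src value against the URL of the page it was found on.
    public static string Resolve(string pageUrl, string link)
    {
        // Assumes pageUrl is an absolute URL; throws UriFormatException if not.
        Uri baseUri = new Uri(pageUrl, UriKind.Absolute);
        Uri resolved;
        // TryCreate returns false instead of throwing when the link is malformed.
        return Uri.TryCreate(baseUri, link, out resolved)
            ? resolved.AbsoluteUri
            : null;
    }
}
```

For example, LinkResolver.Resolve("http://example.com/docs/index.html", "../img/logo.png") yields http://example.com/img/logo.png, while a link that is already absolute is returned unchanged.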

1/29/2011 8:51:20 AM

Popular Answer

Both the example and the accepted answer are wrong: they don't compile with the latest version. I tried something else:

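    // requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
    private List<string> ParseLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse the HTML string before querying it
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        // SelectNodes returns null when nothing matches
        return nodes == null ? new List<string>() : nodes.ToList().ConvertAll(
               r => r.Attributes.ToList().ConvertAll(
               i => i.Value)).SelectMany(j => j).ToList();
    }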

This works for me.

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow