Using Html Agility Pack, how do I retrieve img/src or a/hrefs?

.net c# html html-agility-pack html-parsing


I want to use the Html Agility Pack to extract image sources and link hrefs from an HTML page, but I don't know much about XML or XPath. I have looked through help documents on many websites and still can't solve the problem. I am using C# in Visual Studio 2005. I don't speak English fluently, so I would sincerely thank anyone who can write some helpful code.

1/29/2011 8:48:02 AM

Accepted Answer

The first example on the home page does something very similar, but consider:

 HtmlDocument doc = new HtmlDocument();
 doc.Load("file.htm"); // would need doc.LoadHtml(htmlSource) if it is not a file
 HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
 if (links != null) // SelectNodes returns null when nothing matches
     foreach (HtmlNode link in links)
     {
         string href = link.Attributes["href"].Value;
         // store href somewhere
     }

So for img/@src, just replace each a with img, and href with src. You might even be able to simplify to:

 foreach (HtmlNode node in doc.DocumentNode
               .SelectNodes("//a[@href] | //img[@src]"))
 { /* node is an <a> or <img> element; read its href or src attribute */ }

For relative URL handling, look at the Uri class.
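
As a rough sketch of what that could look like: the helper below combines a link value with the address of the page it came from, using only System.Uri. The class name LinkResolver, the method name Resolve, and the example.com addresses are made up for illustration; the only assumption is that you know the URL the page was downloaded from.

```csharp
using System;

public static class LinkResolver
{
    // Hypothetical helper: resolves an href or src value against the URL
    // of the page it was found on. Absolute links are returned unchanged.
    // Returns null when the two parts cannot be combined into a valid URI.
    public static string Resolve(string pageUrl, string link)
    {
        Uri baseUri = new Uri(pageUrl); // throws if pageUrl is not an absolute URI
        Uri result;
        return Uri.TryCreate(baseUri, link, out result) ? result.AbsoluteUri : null;
    }
}
```

For example, Resolve("http://example.com/docs/index.html", "../img/logo.png") gives "http://example.com/img/logo.png", and a link that is already absolute comes back as it was. TryCreate is used instead of the Uri(Uri, string) constructor so that a malformed link yields null rather than an exception.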

1/29/2011 8:51:20 AM

Popular Answer

The home page example and the accepted answer did not compile for me with the latest version of the library, so I tried something else:

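    // needs: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
    private List<string> ParseLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse the markup before querying it
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        // SelectNodes returns null when nothing matches; keep only each href value
        return nodes == null
            ? new List<string>()
            : nodes.Select(a => a.GetAttributeValue("href", "")).ToList();
    }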

This works for me.

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow