How to get img/src or a/hrefs using Html Agility Pack?

.net c# html html-agility-pack html-parsing


I want to use the HTML Agility Pack to parse image and href links from an HTML page, but I don't know much about XML or XPath. I have looked through help documents on many websites and still can't solve the problem. I am using C# in Visual Studio 2005. I don't speak English fluently, so I would sincerely appreciate it if someone could post some sample code.

Accepted Answer

The first example on the home page does something very similar, but consider:

 HtmlDocument doc = new HtmlDocument();
 doc.Load("file.htm"); // would need doc.LoadHtml(htmlSource) if it is not a file
 // note: SelectNodes returns null (not an empty list) when nothing matches
 foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
 {
     string href = link.Attributes["href"].Value;
     // store href somewhere
 }

So you can imagine that for img@src, just replace each a with img, and href with src. You might even be able to simplify to:

 foreach(HtmlNode node in doc.DocumentNode
               .SelectNodes("//a/@href | //img/@src"))
 { /* handle each matched node here */ }
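If the attribute-selecting XPath does not behave as hoped, one alternative is to select the elements themselves and read the attribute from each one. The following is a minimal sketch under that assumption, not a definitive implementation: `LinkExtractor` and `ExtractLinks` are made-up names for illustration, and the code sticks to explicit types (no `var`, no LINQ) so that it should also suit older compilers.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class LinkExtractor
{
    // Hypothetical helper: returns the href of every <a> and the src of every <img>.
    public static List<string> ExtractLinks(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        List<string> results = new List<string>();

        // Select the elements (not the attributes), then read the attribute off each one.
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//a[@href] | //img[@src]");
        if (nodes == null)
        {
            return results; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode node in nodes)
        {
            // Element names are lower-cased by the parser.
            string attributeName = node.Name == "a" ? "href" : "src";
            results.Add(node.GetAttributeValue(attributeName, string.Empty));
        }
        return results;
    }
}
```

Calling `LinkExtractor.ExtractLinks(htmlSource)` gives back the raw attribute values, which may still be relative URLs.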

For relative URL handling, look at the Uri class.
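As a rough illustration of what the Uri class offers here, the sketch below resolves an extracted `href` or `src` value against the address of the page it came from. `LinkResolver`, `Resolve`, and the example.com addresses are assumptions made up for this example.

```csharp
using System;

static class LinkResolver
{
    // Hypothetical helper: turns a (possibly relative) link into an absolute URL,
    // using the URL of the page it was found on as the base.
    // Returns null if the combination cannot be parsed as a URI.
    public static string Resolve(string pageUrl, string link)
    {
        Uri baseUri = new Uri(pageUrl); // pageUrl must itself be an absolute URL
        Uri absolute;
        if (Uri.TryCreate(baseUri, link, out absolute))
        {
            return absolute.AbsoluteUri;
        }
        return null;
    }
}
```

For example, `LinkResolver.Resolve("http://example.com/docs/index.html", "../img/logo.png")` yields `http://example.com/img/logo.png`, while a link that is already absolute is returned as its own absolute URL.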

Popular Answer

The home-page example and the accepted answer did not compile for me with the latest version of the library, so I tried something else:

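    private List<string> ParseLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse the HTML string passed in
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        // SelectNodes returns null when nothing matches; note that this collects
        // every attribute value of each matching <a>, not only href
        return nodes == null ? new List<string>() : nodes.ToList().ConvertAll(
               r => r.Attributes.ToList().ConvertAll(
               i => i.Value)).SelectMany(j => j).ToList();
    }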

This works for me.


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow