Using HTML Agility Pack to get meta-tags and comments

.net c# html-agility-pack html-parsing

Question

HTML Agility Pack looks like it can do what I need, so I've been searching online for tutorials on using it. For such a powerful library, though, there seems to be very little written about it.

I'm writing a simple method that retrieves any given tag by its name:

public string[] GetTagsByName(string TagName, string Source) {
    ...
}

This would be easy with regex, but as we all know, regex shouldn't be used to parse HTML. So far I have the following code:

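...
// TODO: Clear Comments (can this be done or should I use RegEx?)
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(Source);
ArrayList tags = new ArrayList();
string xpath = "//" + TagName;
// SelectNodes yields HtmlNode items and returns null when nothing matches
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
if (nodes != null) {
    foreach (HtmlNode node in nodes) {
        // OuterHtml is the full markup of the matched tag
        tags.Add(node.OuterHtml);
    }
}
return (string[])tags.ToArray(typeof(String));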

I'd like to strip out any HTML comments first and then retrieve the right tags by name. If possible, I'd also like to return specific meta tags based on their attributes, such as robots. I'm not very good with XPath, so any help with it would be welcome.

Any assistance would be much appreciated.

3/2/2010 2:17:44 PM

Accepted Answer

HtmlAgilityPack's HtmlDocument implements IXPathNavigable and therefore uses the default .NET XPath engine. Any XPath 1.0 documentation will apply, particularly if it mentions System.Xml.XPath.

"//comment()" finds all comments
"//meta" finds all "meta" elements
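
As a rough illustration of the two expressions in practice, here is a minimal sketch that strips comments with //comment() and reads one meta tag with //meta[@name='...']. The class name MetaSketch and both method names are made up for this example; it assumes the HtmlAgilityPack package is referenced and uses only HtmlDocument.LoadHtml, SelectNodes, SelectSingleNode, RemoveChild and GetAttributeValue.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper class, not part of HtmlAgilityPack.
public static class MetaSketch
{
    // Returns the markup with every comment node removed.
    public static string StripComments(string source)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(source);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection comments = doc.DocumentNode.SelectNodes("//comment()");
        if (comments != null)
        {
            // Copy to a list so removing nodes cannot disturb the enumeration.
            foreach (HtmlNode comment in comments.ToList())
            {
                comment.ParentNode.RemoveChild(comment);
            }
        }
        return doc.DocumentNode.OuterHtml;
    }

    // Returns the content attribute of <meta name="..."> or null if no such tag exists.
    // Assumption: metaName contains no apostrophe, since it is pasted into the XPath.
    public static string GetMetaContent(string source, string metaName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(source);

        // The predicate [@name='...'] filters meta elements by attribute value.
        HtmlNode meta = doc.DocumentNode.SelectSingleNode("//meta[@name='" + metaName + "']");
        return meta == null ? null : meta.GetAttributeValue("content", "");
    }
}
```

Calling MetaSketch.GetMetaContent(Source, "robots") would then give the robots directives, if the page declares any. Note that the attribute value comparison in XPath is case-sensitive, so a page that writes name="ROBOTS" would need a different predicate.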

HtmlDocument was designed to resemble XmlDocument as closely as possible, so examples and tutorials written for XmlDocument will largely apply as well.
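
To show how closely the two line up, here is the same pair of XPath queries run against System.Xml.XmlDocument. This is only a sketch: the class name XmlParallel and the method Describe are invented for the example, and the input must be well-formed XML, which real-world HTML often is not.

```csharp
using System;
using System.Xml;

// Hypothetical helper class, used only to illustrate the parallel.
public static class XmlParallel
{
    // Counts comment nodes and reads the robots meta content from well-formed XML.
    public static string Describe(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        // Unlike HtmlAgilityPack, XmlNode.SelectNodes returns an empty list when nothing matches.
        XmlNodeList comments = doc.SelectNodes("//comment()");

        // SelectSingleNode returns null when there is no match.
        XmlNode robots = doc.SelectSingleNode("//meta[@name='robots']");
        string content = robots?.Attributes["content"]?.Value ?? "";

        return comments.Count + " comment(s), robots=" + content;
    }
}
```

The XPath strings carry over unchanged; the main difference to watch for is the null return from HtmlAgilityPack's SelectNodes.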

a few MSDN links

3/2/2010 2:49:35 PM


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow