What would be the preferred way to remove all empty and unnecessary nodes? For example,
<p></p>
should be removed, and <font><p><span><br></span></p></font>
should also be removed (so the br tag is considered unnecessary in this case).
Will I have to use some sort of recursive function for this? I'm thinking of something along these lines, maybe:
RemoveEmptyNodes(HtmlNode containerNode)
{
    var nodes = containerNode.DescendantsAndSelf().ToList();
    if (nodes != null)
    {
        foreach (HtmlNode node in nodes)
        {
            if (node.InnerText == null || node.InnerText == "")
            {
                RemoveEmptyNodes(node.ParentNode);
                node.Remove();
            }
        }
    }
}
But that obviously doesn't work (it throws a StackOverflowException).
For tags that should never be removed, add their names to the _notToRemove list. Nodes that have attributes (e.g. images) are also kept, because of the containerNode.Attributes.Count == 0 check.
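using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class Program
{
    // Names of tags that must never be removed, even when they are empty.
    static List<string> _notToRemove;

    static void Main(string[] args)
    {
        _notToRemove = new List<string>();
        _notToRemove.Add("br");

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml("<html><head></head><body><p>test</p><br><font><p><span></span></p></font></body></html>");
        RemoveEmptyNodes(doc.DocumentNode);

        // Print the cleaned markup.
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }

    static void RemoveEmptyNodes(HtmlNode containerNode)
    {
        // Remove the node, together with everything inside it, when it has no
        // attributes, is not on the keep list, and contains no text.
        if (containerNode.Attributes.Count == 0 && !_notToRemove.Contains(containerNode.Name) && string.IsNullOrEmpty(containerNode.InnerText))
        {
            containerNode.Remove();
        }
        else
        {
            // Walk the children backwards so that removing one does not shift
            // the indices of the children still to be visited.
            for (int i = containerNode.ChildNodes.Count - 1; i >= 0; i--)
            {
                RemoveEmptyNodes(containerNode.ChildNodes[i]);
            }
        }
    }
}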
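One caveat with the version above: a parent that has no attributes and no text is removed along with everything inside it, so <p><img src="a.png"></p> would lose its image, and a node containing only whitespace is treated as non-empty. If that matters, a variant along the following lines might help. It is only a sketch: the names EmptyNodeCleaner, Keep, IsRemovable and Clean are made up for illustration, and it assumes that "empty" means "no non-whitespace text and no attributes anywhere in the subtree".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class EmptyNodeCleaner
{
    // Hypothetical keep list, playing the same role as _notToRemove above.
    static readonly HashSet<string> Keep =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "br" };

    // Assumption: an element is removable when it is not on the keep list,
    // has only whitespace text, and nothing in its subtree carries attributes.
    static bool IsRemovable(HtmlNode node)
    {
        return node.NodeType == HtmlNodeType.Element
            && !Keep.Contains(node.Name)
            && string.IsNullOrWhiteSpace(node.InnerText)
            && !node.DescendantsAndSelf().Any(n => n.HasAttributes);
    }

    public static void Clean(HtmlNode node)
    {
        if (IsRemovable(node))
        {
            // Removes the node together with its whole subtree.
            node.Remove();
            return;
        }

        // Walk backwards so removals do not shift the remaining indices.
        for (int i = node.ChildNodes.Count - 1; i >= 0; i--)
        {
            Clean(node.ChildNodes[i]);
        }
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><p>test</p><p> </p><p><img src=\"a.png\"></p><font><p><span><br></span></p></font></div>");
        Clean(doc.DocumentNode);
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

As in the original, a br nested inside an otherwise empty wrapper still goes away with the wrapper, because only the wrapper itself is checked against the keep list before it is removed.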