I'm trying to collapse repeated consecutive <br> tags in my HTML document so that only one remains. This is what I've come up with so far (really stupid code):
HtmlNodeCollection elements = nodeCollection.ElementAt(0)
    .SelectNodes("//br");
if (elements != null)
{
    foreach (HtmlNode element in elements)
    {
        if (element.Name == "br")
        {
            bool iterate = true;
            while (iterate == true)
            {
                iterate = removeChainElements(element);
            }
        }
    }
}

private bool removeChainElements(HtmlNode element)
{
    if (element.NextSibling != null && element.NextSibling.Name == "br")
    {
        element.NextSibling.Remove();
    }
    if (element.NextSibling != null && element.NextSibling.Name == "br")
        return true;
    else
        return false;
}
}
The code does find the br tags but it doesn't remove any elements at all.
I think you've overcomplicated your solution, although the idea seems to be correct, as far as I understand.
I suppose it would be easier to find all the <br /> nodes first and then remove just those whose previous sibling is also a <br /> node.
Let's start with the following example:
var html = @"<div>the first line<br /><br />the next one<br /></div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
Now find the <br /> nodes and remove the chain of duplicate elements:
var nodes = doc.DocumentNode.SelectNodes("//br").ToArray();
foreach (var node in nodes)
    if (node.PreviousSibling != null && node.PreviousSibling.Name == "br")
        node.Remove();
Then get the result:
var output = doc.DocumentNode.OuterHtml;
which is:
<div>the first line<br>the next one<br></div>
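One caveat with the PreviousSibling check: in real markup the <br> tags are often separated by line breaks or indentation, and HtmlAgilityPack exposes that whitespace as text nodes, so the previous sibling is then a text node rather than a <br>. A minimal sketch that skips whitespace-only text nodes might look like this. BrCollapser, Collapse and PreviousNonBlankSibling are names made up for illustration and are not part of the library, and entities such as &nbsp; are not treated as whitespace here:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class BrCollapser
{
    // Hypothetical helper: walks back over whitespace-only text nodes
    // and returns the nearest sibling that is not one of them (or null).
    static HtmlNode PreviousNonBlankSibling(HtmlNode node)
    {
        var prev = node.PreviousSibling;
        while (prev != null
               && prev.NodeType == HtmlNodeType.Text
               && string.IsNullOrWhiteSpace(prev.InnerText))
        {
            prev = prev.PreviousSibling;
        }
        return prev;
    }

    // Removes every <br> whose nearest non-blank previous sibling is also a <br>.
    public static string Collapse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard before iterating.
        var found = doc.DocumentNode.SelectNodes("//br");
        if (found == null)
            return html;

        foreach (var node in found.ToArray())
        {
            var prev = PreviousNonBlankSibling(node);
            if (prev != null && prev.Name == "br")
                node.Remove();
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

The whitespace text nodes themselves are left in place; only the extra <br> elements are removed.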
Maybe you can do this:
htmlSource = htmlSource.Replace("<br /><br />", "<br />");
Or maybe something like this:
string html = "<br><br><br><br><br>";
html = html.Replace("<br>", string.Empty);
html = string.Format("{0}<br />", html);
html = html.Replace(" ", string.Empty);
html = html.Replace("\t", string.Empty);
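If you go the plain string route, note that a single pass of Replace("<br /><br />", "<br />") still leaves two tags behind from a run of three, and it misses the <br> and <br/> spellings. A rough sketch using a regular expression from the standard library could collapse a whole run in one go. BrText and CollapseBreaks are made-up names, and this assumes the tags carry no attributes; a regex does not understand HTML, so it will also touch matches inside comments or scripts:

```csharp
using System;
using System.Text.RegularExpressions;

static class BrText
{
    // Matches two or more <br>, <br/> or <br /> tags in a row,
    // allowing whitespace (including newlines) between them.
    static readonly Regex BrRun = new Regex(
        @"<br\s*/?>(?:\s*<br\s*/?>)+",
        RegexOptions.IgnoreCase);

    // Replaces each run of consecutive break tags with a single <br />.
    public static string CollapseBreaks(string htmlSource)
    {
        return BrRun.Replace(htmlSource, "<br />");
    }
}
```

For example, BrText.CollapseBreaks("a<br /><br /><br />b") returns "a<br />b", while a lone <br> is left as it is.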