Before posting I tried the solution from this thread:
C# - Remove spaces in HTML source in between markups?
Here is a snippet of the HTML I'm working with:
<p>This is my text</p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p>This is next text</p>
I'm using HTML Agility Pack to clean up the HTML:
HtmlDocument doc = new HtmlDocument();
doc.Load(htmlLocation);
foreach (var item in doc.DocumentNode.Descendants("p").ToList())
{
    if (item.InnerHtml == " ")
    {
        item.Remove();
    }
}
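The output of the code above is below. The empty paragraphs are gone, but the whitespace that separated them is still in the source: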
<p>This is my text</p>
<p>This is next text</p>
So my question is: how do I remove the extra whitespace left between the two paragraphs in the HTML source?
Remove the text nodes between the first and last paragraphs:
HTML:
var html = @"
<p>This is my text</p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p>This is next text</p>";
Parse it:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
var paragraphs = doc.DocumentNode.Descendants("p").ToList();
foreach (var item in paragraphs)
{
if (item.InnerHtml == " ") item.Remove();
}
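// Select only the text nodes that are later siblings of the first paragraph.
// (".//following-sibling::text()" would also match text nested inside it.)
// Note: SelectNodes returns null when nothing matches.
var followingText = paragraphs[0]
    .SelectNodes("following-sibling::text()")
    .ToList();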
foreach (var text in followingText)
{
    text.Remove();
}
Result:
<p>This is my text</p><p>This is next text</p>
If you want to keep the line break between the paragraphs, use a for loop and call Remove() on all except the last text node.
for (int i = 0; i < followingText.Count - 1; ++i)
{
    followingText[i].Remove();
}
Result:
<p>This is my text</p>
<p>This is next text</p>
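If the empty paragraphs are not all grouped after the first one, a more general variant may help. The following is only a sketch, not part of the approach above: it assumes the HtmlAgilityPack package is referenced, uses a shortened sample string made up for illustration, and treats any paragraph or text node containing only whitespace as removable.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Shortened sample input (hypothetical, modeled on the HTML in the question).
        var html = "<p>This is my text</p>\n<p> </p>\n<p> </p>\n<p>This is next text</p>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Remove paragraphs whose text is empty or whitespace only.
        // ToList() snapshots the sequence so nodes can be removed while iterating.
        foreach (var p in doc.DocumentNode.Descendants("p").ToList())
        {
            if (string.IsNullOrWhiteSpace(p.InnerText))
            {
                p.Remove();
            }
        }

        // Remove every remaining text node that contains only whitespace,
        // wherever it sits in the tree.
        var blankText = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text
                        && string.IsNullOrWhiteSpace(n.InnerText))
            .ToList();
        foreach (var text in blankText)
        {
            text.Remove();
        }

        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

Be aware that this is deliberately aggressive: it also drops whitespace between inline elements (for example the space in `<b>a</b> <i>b</i>`) and inside `<pre>` blocks, which can change how the page renders. Restrict the second loop to a narrower set of nodes if that matters for your documents.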