I want to extract only the text from my HTML:
var sb = new StringBuilder();
doc.LoadHtml(inputHTml);
foreach (var node in doc.DocumentNode.ChildNodes)
{
if (node.Name == "strong" || node.Name == "#text"
|| node.Name == "br" || node.Name == "div"
|| node.Name == "p" || node.Name != "img")
{
sb.Append(node.InnerHtml);
}
}
Now node.InnerHtml contains this HTML:
1.
<br><div>text</div><div>, text</div><div>text<br>
<img src="http://example.com/55.jpg" alt="" title="" height="100">
<img src="http://example.com/45.jpg" alt="text" title="text" height="100"></div>
2.
text text text. <a
href="/content/essie-classics">text</a><br>
<img> src="" alt="" title="" height="100"><img
src="http://example.com/img_8862.jpg"
alt="" title="" height="100">
How can I remove the img and a tags?
Note that the img tags have no closing tag.
I'm not sure I understand what point 2 means, but if you want to remove all <img> elements from an HtmlNode, you can try this:
var imgs = node.SelectNodes("//img");
// SelectNodes returns null when nothing matches, so guard before iterating.
if (imgs != null)
{
    foreach (var img in imgs)
    {
        img.Remove();
    }
}
Remove() detaches the HtmlNode from its parent. This works fine for me for removing <img> elements, even when they have no closing tag.
UPDATE:
You can use this XPath expression to select all <img> and <a> elements in a single query:
node.SelectNodes("//*[self::img or self::a]");
Then you can iterate through the result set once and remove each node.
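Putting the pieces above together, a minimal sketch might look like the following. The class name HtmlCleaner and the method name RemoveImagesAndLinks are made up for illustration; only HtmlDocument, LoadHtml, SelectNodes, Remove and InnerHtml come from HtmlAgilityPack.

```csharp
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class HtmlCleaner
{
    // Returns the markup with every <img> and <a> element removed.
    public static string RemoveImagesAndLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // One query for both element types.
        var unwanted = doc.DocumentNode.SelectNodes("//*[self::img or self::a]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (unwanted != null)
        {
            foreach (var node in unwanted)
            {
                // Detaches the node, together with its children, from its parent.
                node.Remove();
            }
        }

        return doc.DocumentNode.InnerHtml;
    }
}
```

You would call it as HtmlCleaner.RemoveImagesAndLinks(inputHTml). Keep in mind that Remove() drops the whole element, so the link text inside each <a> disappears along with the tag.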
See this sample on removing an HTML node (img) from an HtmlDocument. You can also do it like this:
var sb = new StringBuilder();
doc.LoadHtml(inputHTml);
foreach (var node in doc.DocumentNode.ChildNodes)
{
if (node.Name != "img" && node.Name!="a")
{
sb.Append(node.InnerHtml);
}
}
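One caveat with the loop above: it only looks at the top-level children, so an <img> or <a> nested inside a <div> still ends up in that div's InnerHtml. If the goal is really just the text, a rough sketch along these lines may be closer to what is wanted. The names TextExtractor and ExtractText are invented for this example, and skipping text inside links is an assumption about the desired result.

```csharp
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class TextExtractor
{
    // Joins the text of every text node, ignoring text that sits inside an <a>.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var parts = doc.DocumentNode
            .Descendants("#text")                       // text nodes at any depth
            .Where(t => !t.Ancestors("a").Any())        // assumption: drop link text
            .Select(t => HtmlEntity.DeEntitize(t.InnerText).Trim())
            .Where(s => s.Length > 0);                  // skip whitespace-only nodes

        return string.Join(" ", parts);
    }
}
```

Since <img> elements contain no text nodes, they drop out without any special handling.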