HTML Agility Pack: strip tags not in whitelist

c# html-agility-pack html-parsing sanitize tags

I am trying to create a function that removes HTML tags and attributes that are not in a whitelist. I have the following HTML:

<b>first text </b>
<b>second text here
       <a>some text here</a>
 <a>some text here</a>

 </b>
<a>some twxt here</a>

I am using the HTML Agility Pack for this.

The output I am trying to achieve is:

<b>first text </b>
<b>second text here
       some text here
 some text here

 </b>
some twxt here

That means I only want to keep the <b> tags.
The reason I am doing this is that some users copy and paste from MS Word into my WYSIWYG HTML editor.

Thanks!

Accepted answer

Hey, apparently I almost found the answer in a blog post someone wrote:

using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

namespace Wayloop.Blog.Core.Markup
{
    public static class HtmlSanitizer
    {
        private static readonly IDictionary<string, string[]> Whitelist;

        static HtmlSanitizer()
        {
            Whitelist = new Dictionary<string, string[]> {
                { "a", new[] { "href" } },
                { "strong", null },
                { "em", null },
                { "blockquote", null },
                };
        }

        public static string Sanitize(string input)
        {
            var htmlDocument = new HtmlDocument();

            htmlDocument.LoadHtml(input);
            SanitizeNode(htmlDocument.DocumentNode);

            return htmlDocument.DocumentNode.WriteTo().Trim();
        }

        private static void SanitizeChildren(HtmlNode parentNode)
        {
            for (int i = parentNode.ChildNodes.Count - 1; i >= 0; i--) {
                SanitizeNode(parentNode.ChildNodes[i]);
            }
        }

        private static void SanitizeNode(HtmlNode node)
        {
            if (node.NodeType == HtmlNodeType.Element) {
                if (!Whitelist.ContainsKey(node.Name)) {
                    node.ParentNode.RemoveChild(node);
                    return;
                }

                if (node.HasAttributes) {
                    for (int i = node.Attributes.Count - 1; i >= 0; i--) {
                        HtmlAttribute currentAttribute = node.Attributes[i];
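                        string[] allowedAttributes = Whitelist[node.Name];
                        // A null whitelist entry means no attributes are allowed on this tag.
                        if (allowedAttributes == null || !allowedAttributes.Contains(currentAttribute.Name)) {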
                            node.Attributes.Remove(currentAttribute);
                        }
                    }
                }
            }

            if (node.HasChildNodes) {
                SanitizeChildren(node);
            }
        }
    }
}

I took HtmlSanitizer from that blog post. Apparently it does not strip the tags; it removes the elements altogether, content included.

OK, here is the solution for anyone who needs it later.

using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

namespace Wayloop.Blog.Core.Markup
{
    public static class HtmlSanitizer
    {
        private static readonly IDictionary<string, string[]> Whitelist;

        static HtmlSanitizer()
        {
            Whitelist = new Dictionary<string, string[]> {
                { "a", new[] { "href" } },
                { "strong", null },
                { "em", null },
                { "blockquote", null },
                };
        }

        public static string Sanitize(string input)
        {
            var htmlDocument = new HtmlDocument();

            htmlDocument.LoadHtml(input);
            SanitizeNode(htmlDocument.DocumentNode);

            return htmlDocument.DocumentNode.WriteTo().Trim();
        }

        private static void SanitizeChildren(HtmlNode parentNode)
        {
            for (int i = parentNode.ChildNodes.Count - 1; i >= 0; i--) {
                SanitizeNode(parentNode.ChildNodes[i]);
            }
        }

        private static void SanitizeNode(HtmlNode node)
        {
            if (node.NodeType == HtmlNodeType.Element) {
                if (!Whitelist.ContainsKey(node.Name)) {
                    node.ParentNode.RemoveChild(node);
                    return;
                }

                if (node.HasAttributes) {
                    for (int i = node.Attributes.Count - 1; i >= 0; i--) {
                        HtmlAttribute currentAttribute = node.Attributes[i];
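                        string[] allowedAttributes = Whitelist[node.Name];
                        // A null whitelist entry means no attributes are allowed on this tag.
                        if (allowedAttributes == null || !allowedAttributes.Contains(currentAttribute.Name)) {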
                            node.Attributes.Remove(currentAttribute);
                        }
                    }
                }
            }

            if (node.HasChildNodes) {
                SanitizeChildren(node);
            }
        }
    }
}

I renamed the nodes because the code would crash during XPath parsing if it had to handle XML-namespaced nodes.
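
For readers who want the tag-stripping behaviour described above (drop the tag, keep its content), here is a minimal sketch rather than a definitive implementation. It assumes a whitelist containing only `b`, as in the question. The class name `TagStripper` and the method `Strip` are hypothetical and do not come from the blog post or from this answer. It relies on the `RemoveChild(node, keepGrandChildren)` overload of `HtmlNode`, the same call used in the other answer on this page.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper; the name and whitelist contents are assumptions.
public static class TagStripper
{
    // Assumed whitelist for the question's example: keep only <b>, with no attributes.
    private static readonly Dictionary<string, string[]> Whitelist =
        new Dictionary<string, string[]> { { "b", null } };

    public static string Strip(string input)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(input);
        StripNode(doc.DocumentNode);
        return doc.DocumentNode.WriteTo().Trim();
    }

    private static void StripNode(HtmlNode node)
    {
        // Visit children first, over a snapshot, because the child list
        // changes whenever a child is unwrapped into its parent.
        foreach (var child in node.ChildNodes.ToList())
            StripNode(child);

        if (node.NodeType != HtmlNodeType.Element)
            return;

        if (!Whitelist.TryGetValue(node.Name, out var allowed))
        {
            // Drop the tag itself but move its children up into the parent.
            node.ParentNode.RemoveChild(node, true);
            return;
        }

        // Remove every attribute that is not explicitly allowed for this tag.
        foreach (var attr in node.Attributes.ToList())
        {
            if (allowed == null || !allowed.Contains(attr.Name))
                node.Attributes.Remove(attr);
        }
    }
}
```

Calling `TagStripper.Strip(html)` on the sample from the question should leave the `<b>` elements in place and keep the text of the `<a>` elements without their tags. Note that unwrapping also keeps the text inside elements such as `script` or `style`, so you may prefer to remove those outright.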


Popular answer

Thanks for your code, great stuff!

I made a few optimizations:

using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class TagSanitizer
{
    List<HtmlNode> _deleteNodes = new List<HtmlNode>();

    public static void Sanitize(HtmlNode node)
    {
        new TagSanitizer().Clean(node);
    }

    void Clean(HtmlNode node)
    {
        CleanRecursive(node);
        for (int i = _deleteNodes.Count - 1; i >= 0; i--)
        {
            HtmlNode nodeToDelete = _deleteNodes[i];
            nodeToDelete.ParentNode.RemoveChild(nodeToDelete, true);
        }
    }

    void CleanRecursive(HtmlNode node)
    {
        if (node.NodeType == HtmlNodeType.Element)
        {
            if (Config.TagsWhiteList.ContainsKey(node.Name) == false)
            {
                _deleteNodes.Add(node);
            }
            else if (node.HasAttributes)
            {
                for (int i = node.Attributes.Count - 1; i >= 0; i--)
                {
                    HtmlAttribute currentAttribute = node.Attributes[i];

                    string[] allowedAttributes = Config.TagsWhiteList[node.Name];
                    if (allowedAttributes != null)
                    {
                        if (allowedAttributes.Contains(currentAttribute.Name) == false)
                        {
                            node.Attributes.Remove(currentAttribute);
                        }
                    }
                    else
                    {
                        node.Attributes.Remove(currentAttribute);
                    }
                }
            }
        }

        if (node.HasChildNodes)
        {
            node.ChildNodes.ToList().ForEach(v => CleanRecursive(v));
        }
    }
}
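
The class above reads its whitelist from `Config.TagsWhiteList`, which the answer does not show. Below is a minimal sketch of what such a class could look like, assuming it has the same shape as the `Whitelist` dictionary in the accepted answer (tag name mapped to allowed attribute names, or `null` for none). The tag list is only an example chosen to match the question.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for the Config class that TagSanitizer refers to.
// The answer does not show it, so the contents here are an assumption.
static class Config
{
    // Key: allowed tag name in lowercase.
    // Value: allowed attribute names, or null when no attributes are kept.
    public static readonly IDictionary<string, string[]> TagsWhiteList =
        new Dictionary<string, string[]>
        {
            { "b", null },
            // Example of allowing a tag with one attribute:
            // { "a", new[] { "href" } },
        };
}
```

With this in place, one way to use the sanitizer is to load the markup into an `HtmlDocument` with `LoadHtml`, call `TagSanitizer.Sanitize(doc.DocumentNode)`, and read the result back with `doc.DocumentNode.WriteTo()`.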



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow