使用正確的換行符將HTML轉換(渲染)為Text

c# html-agility-pack

我需要將HTML字符串轉換為純文本(最好使用HTML Agility包)。適當的白色空間,尤其是正確的換行符

通過“正確的換行符”我的意思是這段代碼:

<div>
    <div>
        <div>
            line1
        </div>
    </div>
</div>
<div>line2</div>

應轉換為

line1
line2

只有一個換行符。

我見過的大多數解決方案只是將所有<div> <br> <p>標籤轉換為\n ,顯然,這是s * cks。

有關C#的html到plaintext渲染邏輯的任何建議嗎?不完整的代碼,至少常見的邏輯答案,如“用換行符替換所有關閉的DIV,但只有當下一個兄弟也不是DIV”時才會真正有用。

我試過的事情:簡單地獲取.InnerText屬性(顯然是錯誤的),正則表達式(緩慢,痛苦,大量黑客,也是正則表達式比HtmlAgilityPack慢12倍 - 我測量它),此解決方案和類似(返回更多換行符然後需要)

一般承認的答案

下面的代碼與提供的示例一起正常工作,甚至處理一些奇怪的東西,如<div><br></div> ,還有一些事情需要改進,但基本的想法就在那裡。查看評論。

public static string FormatLineBreaks(string html)
{
    //first - remove all the existing '\n' from HTML
    //they mean nothing in HTML, but break our logic
    html = html.Replace("\r", "").Replace("\n", " ");

    //now create an Html Agile Doc object
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    //remove comments, head, style and script tags
    foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//comment() | //script | //style | //head"))
    {
        node.ParentNode.RemoveChild(node);
    }

    //now remove all "meaningless" inline elements like "span"
    foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//span | //label")) //add "b", "i" if required
    {
        node.ParentNode.ReplaceChild(HtmlNode.CreateNode(node.InnerHtml), node);
    }

    //block-elements - convert to line-breaks
    foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//p | //div")) //you could add more tags here
    {
        //we add a "\n" ONLY if the node contains some plain text as "direct" child
        //meaning - text is not nested inside children, but only one-level deep

        //use XPath to find direct "text" in element
        var txtNode = node.SelectSingleNode("text()");

        //no "direct" text - NOT ADDDING the \n !!!!
        if (txtNode == null || txtNode.InnerHtml.Trim() == "") continue;

        //"surround" the node with line breaks
        node.ParentNode.InsertBefore(doc.CreateTextNode("\r\n"), node);
        node.ParentNode.InsertAfter(doc.CreateTextNode("\r\n"), node);
    }

    //todo: might need to replace multiple "\n\n" into one here, I'm still testing...

    //now BR tags - simply replace with "\n" and forget
    foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//br"))
        node.ParentNode.ReplaceChild(doc.CreateTextNode("\r\n"), node);

    //finally - return the text which will have our inserted line-breaks in it
    return doc.DocumentNode.InnerText.Trim();

    //todo - you should probably add "&code;" processing, to decode all the &nbsp; and such
}    

//here's the extension method I use
private static HtmlNodeCollection SafeSelectNodes(this HtmlNode node, string selector)
{
    return (node.SelectNodes(selector) ?? new HtmlNodeCollection(node));
}

熱門答案

關注:

  1. 不可見標籤(腳本,樣式)
  2. 塊級標籤
  3. 內聯標籤
  4. Br標籤
  5. 可包含的空間(前導,尾隨和多個空格)
  6. 硬空間
  7. 實體

代數決定:

  plain-text = Process(Plain(html))

  Plain(node-s) => Plain(node-0), Plain(node-1), ..., Plain(node-N)
  Plain(BR) => BR
  Plain(not-visible-element(child-s)) => nil
  Plain(block-element(child-s)) => BS, Plain(child-s), BE
  Plain(inline-element(child-s)) => Plain(child-s)   
  Plain(text) => ch-0, ch-1, .., ch-N

  Process(symbol-s) => Process(start-line, symbol-s)

  Process(start-line, BR, symbol-s) => Print('\n'), Process(start-line, symbol-s)
  Process(start-line, BS, symbol-s) => Process(start-line, symbol-s)
  Process(start-line, BE, symbol-s) => Process(start-line, symbol-s)
  Process(start-line, hard-space, symbol-s) => Print(' '), Process(not-ws, symbol-s)
  Process(start-line, space, symbol-s) => Process(start-line, symbol-s)
  Process(start-line, common-symbol, symbol-s) => Print(common-symbol), 
                                                  Process(not-ws, symbol-s)

  Process(not-ws, BR|BS|BE, symbol-s) => Print('\n'), Process(start-line, symbol-s)
  Process(not-ws, hard-space, symbol-s) => Print(' '), Process(not-ws, symbol-s)
  Process(not-ws, space, symbol-s) => Process(ws, symbol-s)
  Process(not-ws, common-symbol, symbol-s) => Process(ws, symbol-s)

  Process(ws, BR|BS|BE, symbol-s) => Print('\n'), Process(start-line, symbol-s)
  Process(ws, hard-space, symbol-s) => Print(' '), Print(' '), 
                                       Process(not-ws, symbol-s)
  Process(ws, space, symbol-s) => Process(ws, symbol-s)
  Process(ws, common-symbol, symbol-s) => Print(' '), Print(common-symbol),
                                          Process(not-ws, symbol-s)

HtmlAgilityPack和System.Xml.Linq的C#決策:

  //HtmlAgilityPack part
  public static string ToPlainText(this HtmlAgilityPack.HtmlDocument doc)
  {
    var builder = new System.Text.StringBuilder();
    var state = ToPlainTextState.StartLine;

    Plain(builder, ref state, new[]{doc.DocumentNode});
    return builder.ToString();
  }
  static void Plain(StringBuilder builder, ref ToPlainTextState state, IEnumerable<HtmlAgilityPack.HtmlNode> nodes)
  {
    foreach (var node in nodes)
    {
      if (node is HtmlAgilityPack.HtmlTextNode)
      {
        var text = (HtmlAgilityPack.HtmlTextNode)node;
        Process(builder, ref state, HtmlAgilityPack.HtmlEntity.DeEntitize(text.Text).ToCharArray());
      }
      else
      {
        var tag = node.Name.ToLower();

        if (tag == "br")
        {
          builder.AppendLine();
          state = ToPlainTextState.StartLine;
        }
        else if (NonVisibleTags.Contains(tag))
        {
        }
        else if (InlineTags.Contains(tag))
        {
          Plain(builder, ref state, node.ChildNodes);
        }
        else
        {
          if (state != ToPlainTextState.StartLine)
          {
            builder.AppendLine();
            state = ToPlainTextState.StartLine;
          }
          Plain(builder, ref state, node.ChildNodes);
          if (state != ToPlainTextState.StartLine)
          {
            builder.AppendLine();
            state = ToPlainTextState.StartLine;
          }
        }

      }

    }
  }

  //System.Xml.Linq part
  public static string ToPlainText(this IEnumerable<XNode> nodes)
  {
    var builder = new System.Text.StringBuilder();
    var state = ToPlainTextState.StartLine;

    Plain(builder, ref state, nodes);
    return builder.ToString();
  }
  static void Plain(StringBuilder builder, ref ToPlainTextState state, IEnumerable<XNode> nodes)
  {
    foreach (var node in nodes)
    {
      if (node is XElement)
      {
        var element = (XElement)node;
        var tag = element.Name.LocalName.ToLower();

        if (tag == "br")
        {
          builder.AppendLine();
          state = ToPlainTextState.StartLine;
        }
        else if (NonVisibleTags.Contains(tag))
        {
        }
        else if (InlineTags.Contains(tag))
        {
          Plain(builder, ref state, element.Nodes());
        }
        else
        {
          if (state != ToPlainTextState.StartLine)
          {
            builder.AppendLine();
            state = ToPlainTextState.StartLine;
          }
          Plain(builder, ref state, element.Nodes());
          if (state != ToPlainTextState.StartLine)
          {
            builder.AppendLine();
            state = ToPlainTextState.StartLine;
          }
        }

      }
      else if (node is XText)
      {
        var text = (XText)node;
        Process(builder, ref state, text.Value.ToCharArray());
      }
    }
  }
  //common part
  public static void Process(System.Text.StringBuilder builder, ref ToPlainTextState state, params char[] chars)
  {
    foreach (var ch in chars)
    {
      if (char.IsWhiteSpace(ch))
      {
        if (IsHardSpace(ch))
        {
          if (state == ToPlainTextState.WhiteSpace)
            builder.Append(' ');
          builder.Append(' ');
          state = ToPlainTextState.NotWhiteSpace;
        }
        else
        {
          if (state == ToPlainTextState.NotWhiteSpace)
            state = ToPlainTextState.WhiteSpace;
        }
      }
      else
      {
        if (state == ToPlainTextState.WhiteSpace)
          builder.Append(' ');
        builder.Append(ch);
        state = ToPlainTextState.NotWhiteSpace;
      }
    }
  }
  static bool IsHardSpace(char ch)
  {
    return ch == 0xA0 || ch ==  0x2007 || ch == 0x202F;
  }

  private static readonly HashSet<string> InlineTags = new HashSet<string>
  {
      //from https://developer.mozilla.org/en-US/docs/Web/HTML/Inline_elemente
      "b", "big", "i", "small", "tt", "abbr", "acronym", 
      "cite", "code", "dfn", "em", "kbd", "strong", "samp", 
      "var", "a", "bdo", "br", "img", "map", "object", "q", 
      "script", "span", "sub", "sup", "button", "input", "label", 
      "select", "textarea"
  };

  private static readonly HashSet<string> NonVisibleTags = new HashSet<string>
  {
      "script", "style"
  };

  public enum ToPlainTextState
  {
    StartLine = 0,
    NotWhiteSpace,
    WhiteSpace,
  }

}

例子:

// <div>  1 </div>  2 <div> 3  </div>
1
2
3
//  <div>1  <br/><br/>&#160; <b> 2 </b> <div>   </div><div> </div>  &#160;3</div>
1

  2
 3
//  <span>1<style> text </style><i>2</i></span>3
123
//<div>
//    <div>
//        <div>
//            line1
//        </div>
//    </div>
//</div>
//<div>line2</div>
line1
line2


許可下: CC-BY-SA with attribution
不隸屬於 Stack Overflow
這個KB合法嗎? 是的,了解原因
許可下: CC-BY-SA with attribution
不隸屬於 Stack Overflow
這個KB合法嗎? 是的,了解原因