HtmlAgilityPack: substring of HTML by total text length

c# html-agility-pack

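I have HTML with nested elements (mostly just div and p elements). I need to return the same HTML, but cut down to a given number of letters. Obviously the letter count should not include the HTML tags; only the letters of each element's InnerText should be counted. The resulting HTML should keep a proper structure, including any closing tags, so that it stays valid HTML.

Sample input: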

<div>
    <p>some text</p>
    <p>some more text some more text some more text some more text some more text</p>
    <div>
        <p>some more text some more text some more text some more text some more text</p>
        <p>some more text some more text some more text some more text some more text</p>
    </div>
</div>

Given int length = 16, the output should look like this:

<div>
    <p>some text</p>
    <p>some mo</p>
</div>

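Note that the letter count (including spaces) is 16. The subsequent <div> is removed because the letter count has already reached the value of the variable length. Note also that the output HTML is still valid.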
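To make the counting rule concrete, here is a minimal sketch that counts only the characters of text nodes and ignores tags and whitespace-only nodes. The class and method names (TextLength, Count) are hypothetical and chosen for illustration; trimming each text node is an assumption about how the count should treat indentation.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class TextLength
{
    // Hypothetical helper: sums the trimmed length of every text node,
    // so tags and indentation do not contribute to the count.
    public static int Count(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text)
            .Sum(n => n.InnerText.Trim().Length);
    }
}
```

Called on the expected output, TextLength.Count("<div><p>some text</p><p>some mo</p></div>") should give 16: 9 characters for "some text" plus 7 for "some mo".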

I tried an approach of my own, but it did not really work. The output was not as expected: some HTML elements were repeated. (The code of that attempt is not reproduced here.)

UPDATE

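@SergeBelov provided a solution that works for the first sample input, but further testing revealed a problem with input like the following.

Sample input #2: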

some more text some more text
<div>
    <p>some text</p>
    <p>some more text some more text some more text some more text some more text</p>
</div>

Given the variable int maxLength = 7; the output should be equal to some mo. It does not work like that, because ParentNode is null in the code that replaces the last text node.

Creating a new HtmlNode does not seem to help, because its InnerText property is read-only.
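One possible way around the read-only InnerText, offered as a sketch rather than a confirmed fix for the code in question: text nodes parsed by HtmlAgilityPack are HtmlTextNode instances, and HtmlTextNode exposes a writable Text property, so the text can be shortened in place without creating or replacing a node. The class and method names below are hypothetical.

```csharp
using System;
using HtmlAgilityPack;

public static class TextNodeTruncation
{
    // Hypothetical helper: shortens the first non-blank text node in place
    // by assigning to HtmlTextNode.Text, then returns the resulting HTML.
    public static string TruncateFirstText(string html, int maxLength)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        foreach (var node in doc.DocumentNode.Descendants())
        {
            if (node is HtmlTextNode textNode && textNode.Text.Trim().Length > 0)
            {
                var trimmed = textNode.Text.Trim();
                // Math.Min guards against maxLength exceeding the text length
                textNode.Text = trimmed.Substring(0, Math.Min(maxLength, trimmed.Length));
                break; // stop before enumerating further
            }
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

Because the node is edited rather than replaced, ParentNode is never consulted, which sidesteps the null parent problem described above.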

Accepted answer

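The small console program below illustrates one possible approach, namely:

  1. Select the relevant text nodes and compute their running total length;
  2. Take as many nodes as are needed for the running total to reach or exceed the maximum length;
  3. Remove all element nodes from the document, except the ancestors of the nodes selected in steps 1 and 2;
  4. Cut the text in the last node of the list to fit the maximum length.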
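Steps 1, 2 and 4 can be tried in isolation on plain strings, without HtmlAgilityPack. The sketch below is only an illustration of the running-total idea; the RunningTotal class and its Take method are hypothetical names, and the strings stand in for the trimmed text of each text node.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RunningTotal
{
    // Hypothetical helper: keeps texts until the running total reaches
    // maxLength, then cuts the last kept text to fit exactly.
    public static List<string> Take(IEnumerable<string> texts, int maxLength)
    {
        var acc = 0;
        var selected = texts
            .Select(t =>
            {
                acc += t.Length; // running total including this text
                return new { Text = t, TotalLength = acc, NodeLength = t.Length };
            })
            // Keep a text while the total *before* it is still under the limit
            .TakeWhile(x => x.TotalLength - x.NodeLength < maxLength)
            .ToList();

        if (selected.Count == 0)
            return new List<string>();

        var result = selected.Select(x => x.Text).ToList();
        var last = selected[selected.Count - 1];
        // Characters left for the last text, capped at its own length
        var remaining = maxLength - (last.TotalLength - last.NodeLength);
        result[result.Count - 1] =
            last.Text.Substring(0, Math.Min(remaining, last.NodeLength));
        return result;
    }
}
```

For the texts "some text", "some more text", "tail" and a limit of 16, the first two are kept and the second is cut to "some mo". The TakeWhile condition looks at the total before the current text, which is what lets the text that crosses the limit stay in the list so that it can be shortened afterwards.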

Update: this should still work when the first node is a text node; a Trim() may be needed to remove the whitespace from it, as shown below.

    // Requires: using System; using System.Linq; using HtmlAgilityPack;
    static void Main(string[] args)
    {
        int maxLength = 9;
        string input = @"
            some more text some more text 
            <div>
                <p>some text</p>
                <p>some more text some more text some more text some more text some more text</p>
            </div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(input);

        // Get text nodes with the appropriate running total
        var acc = 0;
        var nodes = doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text && n.InnerText.Trim().Length > 0)
            .Select(n => 
            {
                var length = n.InnerText.Trim().Length;
                acc += length;
                return new { Node = n, TotalLength = acc, NodeLength = length }; 
            })
            .TakeWhile(n => (n.TotalLength - n.NodeLength) < maxLength)
            .ToList();

        // Select element nodes we intend to keep
        var nodesToKeep = nodes
            .SelectMany(n => n.Node.AncestorsAndSelf()
                .Where(m => m.NodeType == HtmlNodeType.Element));

        // Select and remove element nodes we don't need
        var nodesToDrop = doc.DocumentNode
            .Descendants()
            .Where(m => m.NodeType == HtmlNodeType.Element)
            .Except(nodesToKeep)
            .ToList();

        foreach (var r in nodesToDrop)
            r.Remove();

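        // Shorten the last node as required
        if (nodes.Count > 0)
        {
            var lastNode = nodes.Last();
            var lastNodeText = lastNode.Node;
            var trimmed = lastNodeText.InnerText.Trim();
            // Characters still available for the last node, capped at its
            // length so that input shorter than maxLength does not throw
            var remaining = maxLength - (lastNode.TotalLength - lastNode.NodeLength);
            var text = trimmed.Substring(0, Math.Min(remaining, trimmed.Length));
            lastNodeText
                .ParentNode
                .ReplaceChild(HtmlNode.CreateNode(text), lastNodeText);
        }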

        doc.Save(Console.Out);
    }



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow