HtmlAgilityPack treats everything after a < (less-than sign) as attributes

c# html-agility-pack

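I receive some input through a textarea and convert that input into an HTML document, which is later converted into a PDF document.

When my user enters a less-than sign (<), everything in my HtmlDocument breaks. HtmlAgilityPack suddenly treats everything after the less-than sign as attributes. See the output: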

Within this Character Data block I can use double dashes as much as I want (along with <, &,="" ',="" and="" ')="" *and="" *="" %="" myparamentity;="" will="" be="" expanded="" to="" the="" text="" 'has="" been="" expanded'...however,="" i="" can't="" use="" the="" cend="" sequence(if="" i="" need="" to="" use="" it="" i="" must="" escape="" one="" of="" the="" brackets="" or="" the="" greater-than="" sign).="">

It gets a little better if I add this:

htmlDocument.OptionOutputOptimizeAttributeValues = true;

That gives me:

Within this Character Data block I can use double dashes as much as I want (along with <, &,= ',= and= ')= *and= *= %= myparamentity;= will= be= expanded= to= the= text= 'has= been= expanded'...however,= i= can't= use= the= cend= sequence(if= i= need= to= use= it= i= must= escape= one= of= the= brackets= or= the= greater-than= sign).=>

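I have tried all the options on HtmlDocument, but none of them lets me tell the parser not to be strict. I could probably live with it stripping the <, but adding all those equals signs does not work for me.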

void Main()
{
    var input = @"Within this Character Data block I can use double dashes as much as I want (along with <, &, ', and ') *and * % MyParamEntity; will be expanded to the text 'Has been expanded'...however, I can't use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).";

    var htmlDoc = WrapContentInHtml(input);

    htmlDoc.DocumentNode.OuterHtml.ToString().Dump();
}

private HtmlDocument WrapContentInHtml(string content)
{
    var htmlBuilder = new StringBuilder();
    htmlBuilder.AppendLine("<!DOCTYPE html>");
    htmlBuilder.AppendLine("<html>");
    htmlBuilder.AppendLine("<head>");
    htmlBuilder.AppendLine("<title></title>");
    htmlBuilder.AppendLine("</head>");
    htmlBuilder.AppendLine("<body><div id='sagsfremstillingContainer'>");
    htmlBuilder.AppendLine(content); 
    htmlBuilder.AppendLine("</div></body></html>");

    var htmlDocument = new HtmlDocument();
    htmlDocument.OptionOutputOptimizeAttributeValues = true;
    var htmlDoc = htmlBuilder.ToString();

    htmlDocument.LoadHtml(htmlDoc);

    return htmlDocument;
}

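Does anyone know how to solve this?

The closest question I could find is: Losing the 'less than' sign in HtmlAgilityPack loadhtml

That poster is actually complaining that the < disappears, which would be fine for me. Of course, fixing the parsing error would be the best solution.

EDIT: I am using HtmlAgilityPack 1.4.9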

Accepted answer

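Your content is simply wrong. This is not about "strictness"; it is about the fact that you are pretending a piece of plain text is valid HTML. In fact, you get this result precisely because the parser is not strict: a strict parser would reject the input, whereas a lenient one tries to make sense of the stray < and reads what follows it as a tag with attributes.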

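When you need to insert plain text into HTML, you must encode it first so that all the HTML control characters are converted to their HTML representations - for example, < must become &lt; and & must become &amp;.

One way to handle this is to use the DOM - set InnerText on the target div instead of splicing strings together and pretending they are HTML. Another is to use an explicit encoding method - for example HttpUtility.HtmlEncode.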
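As a rough illustration of the explicit-encoding route, here is a minimal sketch of the question's WrapContentInHtml with the user text encoded before it is spliced into the markup. It makes a few assumptions that are not in the original post: it uses System.Net.WebUtility.HtmlEncode instead of HttpUtility.HtmlEncode so that no System.Web reference is needed, it returns the HTML string rather than an HtmlDocument so that the snippet runs without HtmlAgilityPack, and it is written with top-level statements (C# 9 or later). The userInput and page names are placeholders.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical sample input containing characters that are special in HTML.
string userInput = "along with <, &, ', and ') *and * %";

string page = WrapContentInHtml(userInput);
Console.WriteLine(page);

// Builds the same page skeleton as in the question, but HTML-encodes the
// plain-text content first so that < and & can no longer be read as markup.
static string WrapContentInHtml(string content)
{
    var htmlBuilder = new StringBuilder();
    htmlBuilder.AppendLine("<!DOCTYPE html>");
    htmlBuilder.AppendLine("<html>");
    htmlBuilder.AppendLine("<head>");
    htmlBuilder.AppendLine("<title></title>");
    htmlBuilder.AppendLine("</head>");
    htmlBuilder.AppendLine("<body><div id='sagsfremstillingContainer'>");
    // The only change from the question's version: encode before appending.
    htmlBuilder.AppendLine(WebUtility.HtmlEncode(content));
    htmlBuilder.AppendLine("</div></body></html>");
    return htmlBuilder.ToString();
}
```

The returned string can then be passed to htmlDocument.LoadHtml exactly as in the question; the parser sees &lt; and &amp; as text, so it has no stray < to interpret as the start of a tag.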


Popular answer

You can use System.Net.WebUtility.HtmlEncode, which works even without a reference to System.Web.dll. System.Web.dll itself also provides HttpServerUtility.HtmlEncode.

var input = @"Within this Character Data block I can use double dashes as much as I want (along with <, &, ', and ') *and * % MyParamEntity; will be expanded to the text 'Has been expanded'...however, I can't use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).";
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(System.Net.WebUtility.HtmlEncode(input));
Debug.Assert(!htmlDocument.ParseErrors.Any());

Result:

Within this Character Data block I can use double dashes as much as I want (along with &lt;, &amp;, &#39;, and &#39;) *and * % MyParamEntity; will be expanded to the text &#39;Has been expanded&#39;...however, I can&#39;t use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).
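The entities in that result are only the encoded form of the text; decoding them yields the original characters again. The following sketch illustrates this round trip with System.Net.WebUtility alone, with no HtmlAgilityPack involved. The sample string and variable names are assumptions made for illustration, and the snippet uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Net;

// Hypothetical sample text with the characters that HtmlEncode rewrites.
string original = "along with <, &, ', and \"quotes\"";

// Encode for insertion into HTML, then decode to recover the plain text.
string encoded = WebUtility.HtmlEncode(original);
string decoded = WebUtility.HtmlDecode(encoded);

Console.WriteLine(encoded);
Console.WriteLine(decoded == original);
```

Because the encoding is reversible, nothing from the user's input is lost: anything that renders the HTML shows the original <, & and quote characters.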



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow