I want to extract meaningful text from an HTML document, and I am using html-agility-pack. Here is my code:
    string convertedContent = HttpUtility.HtmlDecode(
        ConvertHtml(HtmlAgilityPack.HtmlEntity.DeEntitize(htmlAsString))
    );
ConvertHtml:
    public string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }
ConvertTo:
    public void ConvertTo(HtmlAgilityPack.HtmlNode node, TextWriter outText)
    {
        string html;
        switch (node.NodeType)
        {
            case HtmlAgilityPack.HtmlNodeType.Comment:
                // don't output comments
                break;
            case HtmlAgilityPack.HtmlNodeType.Document:
                foreach (HtmlNode subnode in node.ChildNodes)
                {
                    ConvertTo(subnode, outText);
                }
                break;
            case HtmlAgilityPack.HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                    break;
                // get text
                html = ((HtmlTextNode)node).Text;
                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                    break;
                // check the text is meaningful and not a bunch of whitespaces
                if (html.Trim().Length > 0)
                {
                    outText.Write(HtmlEntity.DeEntitize(html) + " ");
                }
                break;
            case HtmlAgilityPack.HtmlNodeType.Element:
                switch (node.Name)
                {
                    case "p":
                        // treat paragraphs as crlf
                        outText.Write("\r\n");
                        break;
                }
                if (node.HasChildNodes)
                {
                    foreach (HtmlNode subnode in node.ChildNodes)
                    {
                        ConvertTo(subnode, outText);
                    }
                }
                break;
        }
    }
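Now, in some cases the HTML page is malformed. For example, the page http://rareseeds.com/cart/products/Purple_of_Romagna_Artichoke-646-72.html has a malformed meta tag, <meta content="text/html; charset=uft-8" http-equiv="Content-Type"> (note "uft" instead of "utf"). When that happens, my code fails as soon as I try to load the HTML document.
Can someone suggest how I can get past these malformed HTML pages and still extract the relevant text from the HTML document?
Thanks, Kapil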
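As the HtmlAgilityPack project page says, "the parser is very tolerant with 'real world' malformed HTML". But the kind of error you describe is too severe and probably cannot be corrected automatically. You can set the default encoding with the following: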
    HtmlDocument doc = new HtmlDocument();
    doc.OptionDefaultStreamEncoding = Encoding.UTF8;
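If setting the default encoding alone is not enough, one possible workaround is to decode the bytes yourself and ask the parser not to act on the charset declared in the meta tag. The following is only a minimal sketch under stated assumptions: `TolerantHtmlLoader` and `LoadIgnoringDeclaredCharset` are hypothetical names introduced here, the page is assumed to really be UTF-8 despite the misspelled "uft-8", and your HtmlAgilityPack version is assumed to expose the `OptionReadEncoding` option. Whether it avoids the failure depends on the library version, so treat it as something to try rather than a confirmed fix.

```csharp
using System.Net;
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class TolerantHtmlLoader
{
    public static HtmlDocument LoadIgnoringDeclaredCharset(string url)
    {
        byte[] raw;
        using (var client = new WebClient())
        {
            // Fetch the raw bytes so that no charset name has to be resolved yet.
            raw = client.DownloadData(url);
        }

        // Assumption: the content is actually UTF-8, whatever the meta tag claims.
        string html = Encoding.UTF8.GetString(raw);

        var doc = new HtmlDocument();
        // Ask the parser not to read the encoding declared in <meta http-equiv="Content-Type">.
        doc.OptionReadEncoding = false;
        doc.LoadHtml(html);
        return doc;
    }
}
```

The returned document can then be passed to the existing code, for example `ConvertTo(doc.DocumentNode, sw)`, in place of the `doc.LoadHtml(html)` call inside `ConvertHtml`.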