Parse Tables in HTML docs and extract TRs and TDs with HTML Agility Pack

html-agility-pack html-parsing vb.net

Question

I've been given the job of converting old data in a table format to a new format.

Old dummy data is as follows:

<table>
<tr>
<td>Some text 1.</td>
<td>Some text 2.</td>
</tr>
..... //any number of TRs goes here
</table>

The problem is that the new data needs to be in this format:

Some text 1. - Some text 2. ....

Summary of what needs to be done here:

Find all TRs in the table. For each TR, find the first TD and concatenate it with the second TD, separated by " - ".

I am using HTML Agility Pack with VB.NET.

Please Help.

Thanks and regards.

Popular Answer

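You can use LINQ and HtmlAgilityPack to get all the td nodes under the table node, read the InnerText of each node, and replace the table's contents with a single new TR / TD holding the joined text. Note that this joins every cell in the table into one string.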

' At the top of the file:
Imports System.Linq
Imports HtmlAgilityPack

' tableNode is the <table> HtmlNode. If you know where the table is, you can use XPath to find it.

' Collect the text of every <td> under the table, in document order,
' and join it with " - " (String.Join leaves no trailing separator).
Dim cellTexts = tableNode.Descendants("td").Select(Function(n) n.InnerText)
Dim joined As String = String.Join(" - ", cellTexts)

' Remove the old rows only after the text has been read.
tableNode.RemoveAllChildren()

' Build a single row with a single cell that holds the joined text.
Dim newTrNode As HtmlNode = tableNode.OwnerDocument.CreateElement("tr")
Dim newTdNode As HtmlNode = tableNode.OwnerDocument.CreateElement("td")

newTdNode.InnerHtml = joined
newTrNode.AppendChild(newTdNode)

tableNode.AppendChild(newTrNode)
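
If what you need is one "first - second" string per TR, as the question describes, a row-by-row variant might look like the sketch below. It is only an illustration under some assumptions: the module name TableRowsDemo and the function JoinRowCells are made up for this example, the HTML is passed in as a string, rows with fewer than two TDs are skipped, and rows of nested tables would be matched as well.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module TableRowsDemo
    ' Hypothetical helper: returns one "first - second" string
    ' for each <tr> that has at least two direct <td> children.
    Function JoinRowCells(html As String) As List(Of String)
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        Dim lines As New List(Of String)()

        ' By default SelectNodes returns Nothing (not an empty collection)
        ' when the XPath expression matches no nodes.
        Dim rows As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//table//tr")
        If rows Is Nothing Then Return lines

        For Each row As HtmlNode In rows
            ' "./td" selects only the cells that are direct children of this row.
            Dim cells As HtmlNodeCollection = row.SelectNodes("./td")
            If cells Is Nothing OrElse cells.Count < 2 Then Continue For

            lines.Add(cells(0).InnerText.Trim() & " - " & cells(1).InnerText.Trim())
        Next

        Return lines
    End Function

    Sub Main()
        Dim html As String =
            "<table><tr><td>Some text 1.</td><td>Some text 2.</td></tr></table>"

        ' Prints: Some text 1. - Some text 2.
        For Each line As String In JoinRowCells(html)
            Console.WriteLine(line)
        Next
    End Sub
End Module
```

InnerText does not decode HTML entities such as &amp;amp;, so if the old data contains them you may want to pass each cell's text through HtmlEntity.DeEntitize before joining.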

I hope it helps

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow