c#: HtmlAgilityPack Descendants

c# html html-agility-pack

Question

Good day. I have a task where I need to convert a Word document to HTML.

This can be done by using Interop to save the document as HTML, but I need to clean up the HTML that Interop produces.

I have a problem with HtmlAgilityPack, though. I thought it worked like XmlDocument in C#.

This is my C# code:

HtmlDocument doc = new HtmlDocument();
doc.Load(htmlLocation);
foreach (var item in doc.DocumentNode.Descendants("p"))
{
    if (item.HasChildNodes)
    {
        foreach (var itm in item.Descendants("span").ToList())
        {
            Console.WriteLine(itm.InnerText);
        }
    }
}

This is the HTML:

<html>

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=Generator content="Microsoft Word 12 (filtered)">

</head>

<body lang=EN-US link="#0066CC" vlink=purple style='text-justify-trim:punctuation'>

<div class=WordSection1>

<p class=Heading61 style='margin-bottom:0in;margin-bottom:.0001pt;text-indent:
.5in;line-height:normal;page-break-after:avoid;background:transparent'><span
class=Heading6><span style='font-size:12.0pt;color:black;background:yellow'>Epilogue</span></span></p>

<p class=MsoBodyText style='line-height:normal;background:transparent'><span
class=BodytextItalic2><span style='font-size:12.0pt;color:black;font-style:
normal'>&nbsp;</span></span></p>

<p class=MsoBodyText style='line-height:normal;background:transparent'><span
class=BodytextItalic2><span style='font-size:12.0pt;color:black;font-style:
normal'>Rebecca sat outside her lodge cradling her infant son in her arms. How
handsome he was, her little warrior, with his dusky skin and thick black hair.
For the first few days after his birth, she had been afraid to let him out of
her sight, out of her arms, for fear she would lose him, but he was a strong
healthy child.</span></span></p>

<p class=MsoBodyText style='text-indent:.5in;line-height:normal;background:
transparent'><span class=BodytextItalic2><span style='font-size:12.0pt;
color:black;font-style:normal'>Looking at him made her heart swell with love
for him and for his father. She had married Wolf Dreamer the day after they
returned to his people. Summer Moon Rising had left the village the following
day.</span></span></p>

</div>

</body>

</html>

This is the output of the code above:

Epilogue
Epilogue
&nbsp;
&nbsp;
Rebecca sat outside her lodge cradling her infant son in her arms. How
handsome he was, her little warrior, with his dusky skin and thick black hair.
For the first few days after his birth, she had been afraid to let him out of
her sight, out of her arms, for fear she would lose him, but he was a strong
healthy child.
Rebecca sat outside her lodge cradling her infant son in her arms. How
handsome he was, her little warrior, with his dusky skin and thick black hair.
For the first few days after his birth, she had been afraid to let him out of
her sight, out of her arms, for fear she would lose him, but he was a strong
healthy child.
Looking at him made her heart swell with love
for him and for his father. She had married Wolf Dreamer the day after they
returned to his people. Summer Moon Rising had left the village the following
day.
Looking at him made her heart swell with love
for him and for his father. She had married Wolf Dreamer the day after they
returned to his people. Summer Moon Rising had left the village the following
day.

I expected the inner foreach to print only the spans that belong to the current item element. Why does it repeat the text?

Popular Answer

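You have four p tags, and each one contains two span elements, one nested inside the other. Descendants returns every descendant node with the matching name, at any depth, so your inner foreach runs once for each of the two spans. Because the InnerText of the outer span includes the text of the nested span, the same text is printed twice.

Your inner foreach could be: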

    foreach (var itm in item.ChildNodes)
    {
      Console.WriteLine(itm.InnerText);
    }
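
Note that ChildNodes returns every direct child, including text nodes and elements other than span. If that matters for your documents, a slightly stricter variant is sketched below. It assumes the Word output always wraps the paragraph text in a span that is a direct child of the p, as in your sample; the class name, method name, and sample HTML are made up for illustration. It uses Elements("span") to take only direct child spans, and HtmlEntity.DeEntitize to turn entities such as &nbsp; into real characters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SpanTextDemo
{
    // Hypothetical helper: returns one string per direct child <span> of each <p>.
    // Nested spans are not visited separately, so their text is not repeated.
    public static List<string> DirectSpanTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("p")                      // every <p> at any depth
            .SelectMany(p => p.Elements("span"))   // direct child spans only
            .Select(span => HtmlEntity.DeEntitize(span.InnerText))
            .ToList();
    }

    public static void Main()
    {
        // Shortened stand-in for the Word output in the question.
        const string html =
            "<div>" +
            "<p><span class=\"Heading6\"><span style=\"color:black\">Epilogue</span></span></p>" +
            "<p><span class=\"BodytextItalic2\"><span>Looking at him</span></span></p>" +
            "</div>";

        foreach (var text in DirectSpanTexts(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

With the sample above, each paragraph's text is written once. If you would rather read the innermost spans, you could keep Descendants("span") and filter with something like `.Where(s => !s.Descendants("span").Any())` instead.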


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow