How to get text that is not inside any tag with HtmlAgilityPack

c# html html-agility-pack xpath

Question

I have an HTML file like the one below:

  <div>

  <div style="margin-left:0.5em;">
  <div class="tiny" style="margin-bottom:0.5em;">
  <b><span class="h3color tiny">This review is from: </span>You Meet</b>
  </div>
  If you know Ron Kaufman as I do ...
  <br /><br />Whether you're the CEO....
  <br /><br />Written in a distinctive, ...
  <br /><br />My advice? Don't just get one copy
  <div style="padding-top: 10px; clear: both; width: 100%;"></div>
  </div>

  <div style="margin-left:0.5em;">
  <div class="tiny" style="margin-bottom:0.5em;">
  <b><span class="h3color tiny">This review is from: </span>My Review</b>
  </div>
  I became a fan of Ron Kaufman after reading an earlier book of his years ago...
  <div style="padding-top: 10px; clear: both; width: 100%;"></div>
  </div>

  </div>

I want to get the review text, which is not wrapped in any HTML tag of its own. This is the code I am using now:

  foreach (HtmlNode divReview in doc.DocumentNode.SelectNodes(@"//div[@style='margin-left:0.5em;']"))
  {
      if (divReview != null)
      {
          review.Add(divReview.Descendants("div")
              .Where(d => d.Attributes.Contains("style") &&
                          d.Attributes["style"].Value.Contains("padding-top: 10px; clear: both; width: 100%;"))
              .Select(d => d.PreviousSibling.InnerText.Trim())
              .SingleOrDefault());
      }
  }

It only returns "My advice? Don't just get one copy". How can I get the whole text?

Update: Even if I remove every "br" tag from the HtmlNode, the code above still returns only the "My advice? Don't just get one copy" part. Any suggestions?

Accepted Answer

I've updated the code to this:

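// Find the spacer div by its style; GetAttributeValue avoids a NullReferenceException on divs without a style attribute.
var allText = (reviewDiv.Descendants("div")
  .First(div => div.GetAttributeValue("style", "") == "padding-top: 10px; clear: both; width: 100%;")
  // SelectNodes returns null when nothing matches, hence the fallback to an empty collection.
  .SelectNodes("./preceding-sibling::text()") ?? new HtmlNodeCollection(null))
  .Select(text => text.InnerText);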

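This should return an IEnumerable of strings with the text preceding the div with the intricate style. Your original code returned only the last fragment because PreviousSibling yields just the single node immediately before that div, whereas the preceding-sibling::text() axis selects every text node sibling that comes before it.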

Without having a little more of the surrounding HTML it's hard to tell whether this is exactly what you're after. I'm currently guessing that you have selected a div and that that div is the direct parent of this whole block of text (given your reference to a reviewDiv). Your HTML sample doesn't seem to contain this piece of HTML, so I'm making a few assumptions here.

With the following input:

<div><div class="tiny" style="margin-bottom:0.5em;">
<b><span class="h3color tiny">This review is from: </span>You Meet</b>
</div>
If you know Ron Kaufman as I do ...
<br /><br />Whether you're the CEO....
<br /><br />Written in a distinctive, ...
<br /><br />My advice? Don't just get one copy
<div style="padding-top: 10px; clear: both; width: 100%;"></div></div>

It extracts this:

If you know Ron Kaufman as I do ...
Whether you're the CEO....
Written in a distinctive, ...
My advice? Don't just get one copy

To build a single string I used: string extractedText = string.Join("", allText);
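
For completeness, here is a minimal sketch of how the whole thing could look when run over every review in the question's HTML. It is a variation rather than the exact code above: instead of the preceding-sibling axis it takes every text node that is a direct child of each review div, so it also assumes no stray text follows the spacer div. ReviewTextExtractor and ExtractReviews are hypothetical names made up for this illustration, and joining with Environment.NewLine instead of "" is just one possible choice.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class ReviewTextExtractor
{
    // Returns one string per review div, built from the text nodes that are
    // direct children of that div (text inside <b>, <span>, etc. is skipped).
    public static List<string> ExtractReviews(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var reviews = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection reviewDivs =
            doc.DocumentNode.SelectNodes("//div[@style='margin-left:0.5em;']");
        if (reviewDivs == null)
        {
            return reviews;
        }

        foreach (HtmlNode reviewDiv in reviewDivs)
        {
            IEnumerable<string> paragraphs = reviewDiv.ChildNodes
                .Where(node => node.NodeType == HtmlNodeType.Text)
                // InnerText keeps entities such as &amp; encoded, so decode them.
                .Select(node => HtmlEntity.DeEntitize(node.InnerText).Trim())
                // Drop the whitespace-only nodes left between tags.
                .Where(text => text.Length > 0);

            reviews.Add(string.Join(Environment.NewLine, paragraphs));
        }

        return reviews;
    }
}
```

Calling ReviewTextExtractor.ExtractReviews(html) on the sample from the question should give a list with one entry per review, each entry holding that review's paragraphs separated by line breaks.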




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow