Find all elements with text string?

c# html html-agility-pack xpath

Question

I'm trying to remove every HTML element (tag) that contains a certain text string. I currently have 2376 HTML pages written against various doctype standards. Some of them even lack a doctype (which might be irrelevant to this question).

So far I've found that the text string "How to cite this paper" appears inside either a <p>-tag, an <h4>-tag, or a <legend>-tag.

The <p>-tag usually looks like this:

<p style="text-align : center; color : Red; font-weight : bold;">How to cite this paper:</i></p>

The <h4>-tag usually looks like this:

<h4>How to cite this paper:</h4>Antunes, P., Costa, C.J. &amp; Pino, J.A. (2006).

The <legend>-tag looks like this:

<legend style="color: white; background-color: maroon; font-size: medium; padding: .1ex .5ex; border-right: 1px solid navy; border-bottom: 1px solid navy; font-weight: bold;">How to cite this paper</legend>

The goal is to find these tags, remove them from the file, and save the document again. I have other tags to remove as well, but I need some guidance on how to find specific tags by their values or other distinctive information.

So far I have written a console application in C#. This is my Main code:

//Variables
string Ext = "*.html";
string folder = @"D:\websites\dev.openjournal.tld\public\arkivet\";
IEnumerable<string> files = GetHTMLFiles(folder, Ext);
List<string> cite_files = new List<string>();
var doc = new HtmlDocument();

//Loop to match all html-elements to query
foreach (var file in files)
{
    try
    {
        doc.Load(file);
        cite_files.Add(doc.DocumentNode.SelectNodes("//h4[contains(., 'How to cite this paper')]").ToString());
        cite_files.Add(doc.DocumentNode.SelectNodes("//p[contains(., 'How to cite this paper')]").ToString());
    }
    catch (Exception Ex)
    {
        Console.WriteLine(Ex.Message);
    }
}

//Counts numbers of hits and prints data to user
int filecount = files.Count();
int citations = cite_files.Count();
Console.WriteLine("Number of files scanned: " + filecount);
Console.WriteLine("Number of citations: {0}", citations);

// Program end
Console.WriteLine("Press any key to close program....");
Console.ReadKey();

Additionally, this is the helper method that searches the folders for files.

//List all HTML-files recursively and return them to a list
public static IEnumerable<string> GetHTMLFiles(string directory, string Ext)
{
    List<string> files = new List<string>();

    try
    {
        files.AddRange(Directory.GetFiles(directory, Ext, SearchOption.AllDirectories));
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.Message);
    }
    return files;
}

Since "How to cite this paper" seems to be unique, I'm looking for the specific tags that contain this phrase so that I can remove them. I'm trying to catch all 1094 files that contain the phrase, according to Notepad.

Any assistance is highly appreciated.

2/7/2018 8:28:29 PM

Accepted Answer

In this situation, Html Agility Pack's support for LINQ selectors is really useful. Based on the HTML you provided above:

var html =
@"<html><head></head><body>

<!-- selector match: delete these nodes -->
<p style='text-align: center; color: Red; font-weight: bold;'>How to cite this paper:</i></p>
<h4> How to cite this paper:</h4> Antunes, P., Costa, C.J. &amp; Pino, J.A. (2006).
<legend style='color: white; background-color: maroon; font-size: medium; padding: .1ex .5ex; border-right: 1px solid navy; border-bottom: 1px solid navy; font-weight: bold;'>How to cite this paper </legend>
<div><p><i><b>How to cite this paper (NESTED)</b></i></p></div>

<!-- no match: keep these nodes -->
<p>DO NOT DELETE - How to cite</p>
<h4>DO NOT DELETE - cite this paper:</h4>
<legend>DO NOT DELETE</legend>

</body></html>";

Create a list of the tag names to search, select the nodes that match, and then delete them as follows:

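// Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
var tagsToDelete = new string[] { "p", "h4", "legend" };
var nodesToDelete = new List<HtmlNode>();

var document = new HtmlDocument();
document.LoadHtml(html);

// Collect the matches first, then remove them in a separate pass,
// so the tree is not modified while Descendants() is being enumerated.
foreach (var tag in tagsToDelete)
{
    nodesToDelete.AddRange(
        from candidate in document.DocumentNode.Descendants(tag)
        where candidate.InnerText.Contains("How to cite this paper")
        select candidate
    );
}

foreach (var node in nodesToDelete) node.Remove();

// OUTPUT is a placeholder for the destination file path.
document.Save(OUTPUT);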

This produces the following output:

<html><head></head><body>

<!-- selector match: delete these nodes -->

 Antunes, P., Costa, C.J. &amp; Pino, J.A. (2006).

<div></div>

<!-- no match: keep these nodes -->
<p>DO NOT DELETE - How to cite</p>
<h4>DO NOT DELETE - cite this paper:</h4>
<legend>DO NOT DELETE</legend>

</body></html>
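
To connect this back to the files on disk from the question, the same LINQ selection could be wrapped in a helper that loads each file, removes the matching nodes, and saves the file only when something changed. This is a minimal sketch, not a tested solution: the class name CitationCleaner, the method name RemoveCitationNodes, and passing the folder as the first command-line argument are all assumptions made for illustration, and it assumes a Html Agility Pack build that provides HtmlDocument.Load(string) and HtmlDocument.Save(string).

```csharp
using System;
using System.IO;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper class; the names here are illustrative and not from the answer above.
public static class CitationCleaner
{
    private static readonly string[] TagsToDelete = { "p", "h4", "legend" };
    private const string Phrase = "How to cite this paper";

    // Loads one file, removes the matching nodes, and returns how many were removed.
    public static int RemoveCitationNodes(string path)
    {
        var document = new HtmlDocument();
        document.Load(path);

        // ToList() materializes the matches so the tree is not modified
        // while Descendants() is still being enumerated.
        var nodesToDelete = TagsToDelete
            .SelectMany(tag => document.DocumentNode.Descendants(tag))
            .Where(node => node.InnerText.Contains(Phrase))
            .ToList();

        foreach (var node in nodesToDelete) node.Remove();

        // Only rewrite files that actually changed. This overwrites the file in place.
        if (nodesToDelete.Count > 0) document.Save(path);
        return nodesToDelete.Count;
    }

    public static void Main(string[] args)
    {
        // Assumption: the folder to scan is passed as the first argument.
        var folder = args.Length > 0 ? args[0] : ".";
        int changedFiles = 0;

        foreach (var file in Directory.GetFiles(folder, "*.html", SearchOption.AllDirectories))
        {
            try
            {
                if (RemoveCitationNodes(file) > 0) changedFiles++;
            }
            catch (Exception ex)
            {
                Console.WriteLine("{0}: {1}", file, ex.Message);
            }
        }

        Console.WriteLine("Files changed: {0}", changedFiles);
    }
}
```

Because the sketch saves over the original files, it is probably worth running it on a copy of the folder first. Note also that InnerText.Contains is case-sensitive and matches the phrase only when it appears as one unbroken run of text, so the number of changed files may differ from the count a text editor reports.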
2/7/2018 8:21:11 PM


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow