How to use C# to delete redundant attributes from XML

c# html-agility-pack validation xml


I am parsing XML files from a third-party provider, and unfortunately they are not always well-formed XML: some elements contain duplicate attributes.

I don't have control over the source and I don't know which elements may have duplicate attributes nor do I know the duplicate attribute names in advance.

Obviously, loading the content into an XmlDocument object raises an XmlException on the duplicate attributes, so I thought I could use an XmlReader to step through the XML element by element and deal with the duplicate attributes when I get to the offending element.

However, the XmlException is raised on reader.Read(), before I get a chance to inspect the element's attributes.

Here's a sample method to demonstrate the issue:

// Requires: using System.IO; using System.Text; using System.Xml;
public static void ParseTest()
{
    const string xmlString =
        @"<?xml version='1.0'?>
        <!-- This is a sample XML document -->
        <Items dupattr=""10"" id=""20"" dupattr=""33"">
            <Item>test with a child element <more/> stuff</Item>
        </Items>";

    var output = new StringBuilder();
    using (XmlReader reader = XmlReader.Create(new StringReader(xmlString)))
    {
        XmlWriterSettings ws = new XmlWriterSettings();
        ws.Indent = true;
        using (XmlWriter writer = XmlWriter.Create(output, ws))
        {
            while (reader.Read())   /* Exception thrown here when Items element encountered */
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        writer.WriteStartElement(reader.Name);
                        if (reader.HasAttributes){ /* CopyNonDuplicateAttributes(); */}
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(reader.Value);
                        break;
                    case XmlNodeType.XmlDeclaration:
                    case XmlNodeType.ProcessingInstruction:
                        writer.WriteProcessingInstruction(reader.Name, reader.Value);
                        break;
                    case XmlNodeType.Comment:
                        writer.WriteComment(reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        writer.WriteFullEndElement();
                        break;
                }
            }
        }
    }
    string str = output.ToString();
}

Is there another way to parse the input and remove the duplicate attributes without having to use regular expressions and string manipulation?

7/13/2011 9:12:59 AM

Accepted Answer

I found a solution by thinking of the XML as an HTML document. Then using the open-source Html Agility Pack library, I was able to get valid XML.

The trick was to save the XML with an HTML header first.
So replace the XML declaration
<?xml version="1.0" encoding="utf-8" ?>
with an HTML declaration like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "">
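
The header swap can be done with plain string handling before the content is written to disk. This is only a minimal sketch: the class name HeaderSwap and both method names are hypothetical helpers, not part of Html Agility Pack, and the doctype string is copied from the line above with its empty system identifier left as-is.

```csharp
using System;
using System.IO;

public static class HeaderSwap
{
    // Doctype line as quoted in the answer (system identifier left empty there).
    public const string HtmlDoctype =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"\">";

    // Returns the content with a leading XML declaration (if any) replaced
    // by the HTML doctype. The rest of the content is left untouched.
    public static string ReplaceXmlDeclaration(string xml)
    {
        string body = xml.TrimStart();

        // Only treat "<?xml" followed by whitespace as the declaration, so a
        // processing instruction such as <?xml-stylesheet ...?> is not removed.
        bool hasDeclaration =
            body.StartsWith("<?xml", StringComparison.Ordinal) &&
            body.Length > 5 &&
            char.IsWhiteSpace(body[5]);

        if (hasDeclaration)
        {
            int end = body.IndexOf("?>", StringComparison.Ordinal);
            if (end >= 0)
            {
                body = body.Substring(end + 2).TrimStart();
            }
        }

        return HtmlDoctype + Environment.NewLine + body;
    }

    // Writes the modified content to the given path (hypothetical helper).
    public static void SaveWithHtmlHeader(string xml, string path)
    {
        File.WriteAllText(path, ReplaceXmlDeclaration(xml));
    }
}
```

Calling HeaderSwap.SaveWithHtmlHeader(rawXml, path) would produce the file that is then handed to the loading method.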

Once the contents are saved to file, this method will return a valid XML Document.

// Requires reference to HtmlAgilityPack
// (using HtmlAgilityPack; using System.IO; using System.Xml;)
public XmlDocument LoadHtmlAsXml(string url)
{
    var web = new HtmlWeb();

    var m = new MemoryStream();
    var xtw = new XmlTextWriter(m, null);

    // Load the content into the writer
    web.LoadHtmlAsXml(url, xtw);

    // Make sure everything buffered in the writer has reached the stream
    xtw.Flush();

    // Rewind the memory stream
    m.Position = 0;

    // Create, fill, and return the xml document
    XmlDocument xmlDoc = new XmlDocument();
    xmlDoc.LoadXml((new StreamReader(m)).ReadToEnd());
    return xmlDoc;
}

The duplicate attribute nodes are automatically removed with the later attribute values overwriting the earlier ones.

7/13/2011 9:09:32 AM

Popular Answer

OK, I think you need to catch the error:

Then you should be able to use the following methods:




to get the following properties:


This will enable you to get all the attribute values.
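
The methods and properties this answer refers to did not survive in this copy, so the sketch below covers only the first step, catching the error. It assumes nothing beyond the standard System.Xml API: XmlException exposes LineNumber and LinePosition, which at least locate the offending attribute. The class and method names are hypothetical. As far as I know the reader is left in an error state once the exception is thrown, so this shows where the problem is rather than how to continue reading past it.

```csharp
using System;
using System.IO;
using System.Xml;

public static class DuplicateAttributeProbe
{
    // Reads the whole input and returns the first XmlException raised,
    // or null when the input is well-formed.
    public static XmlException FindFirstError(string xml)
    {
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
            {
                // Read() throws as soon as the malformed element is reached.
                while (reader.Read()) { }
            }
            return null;
        }
        catch (XmlException ex)
        {
            // ex.LineNumber and ex.LinePosition point at the problem;
            // ex.Message describes it.
            return ex;
        }
    }
}
```

A caller could log error.LineNumber, error.LinePosition and error.Message from the returned exception to report which element needs repair.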

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow