How to remove duplicate attributes from XML with C#

c# html-agility-pack validation xml

Question

I am parsing some XML files from a third-party provider, and unfortunately they are not always well-formed: some elements contain duplicate attributes.

I don't have control over the source, and I don't know in advance which elements may have duplicate attributes or what the duplicate attribute names will be.

Obviously, loading the content into an XmlDocument object raises an XmlException on the duplicate attributes, so I thought I could use an XmlReader to step through the XML element by element and deal with the duplicate attributes when I get to the offending element.

However, the XmlException is raised on reader.Read(), before I get a chance to inspect the element's attributes.

Here's a sample method to demonstrate the issue:

public static void ParseTest()
{
    const string xmlString = 
        @"<?xml version='1.0'?>
        <!-- This is a sample XML document -->
        <Items dupattr=""10"" id=""20"" dupattr=""33"">
            <Item>test with a child element <more/> stuff</Item>
        </Items>";

    var output = new StringBuilder();
    using (XmlReader reader = XmlReader.Create(new StringReader(xmlString)))
    {
        XmlWriterSettings ws = new XmlWriterSettings();
        ws.Indent = true;
        using (XmlWriter writer = XmlWriter.Create(output, ws))
        {
            while (reader.Read())   /* Exception thrown here when the Items element is encountered */
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        writer.WriteStartElement(reader.Name);
                        if (reader.HasAttributes){ /* CopyNonDuplicateAttributes(); */}
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(reader.Value);
                        break;
                    case XmlNodeType.XmlDeclaration:
                    case XmlNodeType.ProcessingInstruction:
                        writer.WriteProcessingInstruction(reader.Name, reader.Value);
                        break;
                    case XmlNodeType.Comment:
                        writer.WriteComment(reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        writer.WriteFullEndElement();
                        break;
                }
            }

        }
    }
    string str = output.ToString();
}
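
To pin down where the parser gives up, the exception can be caught and inspected. The following is a minimal sketch using only System.Xml; the class name `DuplicateAttributeProbe` and the method `FindFirstError` are hypothetical names chosen for illustration. It only reports the first well-formedness error and does not repair anything.

```csharp
using System;
using System.IO;
using System.Xml;

public static class DuplicateAttributeProbe
{
    // Hypothetical helper: reads the whole input and returns the first
    // XmlException raised, or null if the input parsed without error.
    public static XmlException FindFirstError(string xml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            try
            {
                // Consume every node; a duplicate attribute makes Read() throw.
                while (reader.Read()) { }
                return null;
            }
            catch (XmlException ex)
            {
                // LineNumber and LinePosition describe where the parser stopped.
                return ex;
            }
        }
    }
}
```

Calling `DuplicateAttributeProbe.FindFirstError(xmlString)` on the sample above should return an exception whose `LineNumber` and `LinePosition` point into the `Items` start tag, which at least identifies the offending element even though its attributes cannot be read.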

Is there another way to parse the input and remove the duplicate attributes without having to use regular expressions and string manipulation?

Accepted Answer

I found a solution by thinking of the XML as an HTML document. Then using the open-source Html Agility Pack library, I was able to get valid XML.

The trick was to save the XML with an HTML header first.
So replace the XML declaration
<?xml version="1.0" encoding="utf-8" ?>
with an HTML DOCTYPE declaration like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Once the contents are saved to a file, this method will return a valid XmlDocument.

// Requires reference to HtmlAgilityPack
public XmlDocument LoadHtmlAsXml(string url)
{
    var web = new HtmlWeb();

    var m = new MemoryStream();
    var xtw = new XmlTextWriter(m, null);

    // Load the content into the writer
    web.LoadHtmlAsXml(url, xtw);

    // Rewind the memory stream
    m.Position = 0;

    // Create, fill, and return the xml document
    XmlDocument xmlDoc = new XmlDocument();
    xmlDoc.LoadXml((new StreamReader(m)).ReadToEnd());
    return xmlDoc;
}

The duplicate attribute nodes are automatically removed, with the later attribute values overwriting the earlier ones.
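
To make the "later value wins" rule concrete, here is a small sketch of it applied to a plain list of name/value pairs. It does not use or reproduce Html Agility Pack's code. `AttributeCollapser` and `CollapseLastWins` are hypothetical names, and keeping each attribute at the position of its first occurrence is an assumption of this sketch, not documented library behavior.

```csharp
using System;
using System.Collections.Generic;

public static class AttributeCollapser
{
    // Hypothetical helper: collapses duplicate attribute names so that each
    // name appears once, carrying the value of its LAST occurrence.
    public static List<KeyValuePair<string, string>> CollapseLastWins(
        IEnumerable<KeyValuePair<string, string>> attributes)
    {
        var order = new List<string>();   // names in first-seen order (sketch assumption)
        var values = new Dictionary<string, string>(StringComparer.Ordinal);

        foreach (var attribute in attributes)
        {
            if (!values.ContainsKey(attribute.Key))
            {
                order.Add(attribute.Key);
            }
            // A later value overwrites any earlier one for the same name.
            values[attribute.Key] = attribute.Value;
        }

        var result = new List<KeyValuePair<string, string>>();
        foreach (var name in order)
        {
            result.Add(new KeyValuePair<string, string>(name, values[name]));
        }
        return result;
    }
}
```

For the attributes in the question's sample, in document order (dupattr=10, id=20, dupattr=33), this yields dupattr=33 followed by id=20.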


Popular Answer

OK, I think you need to catch the error.

Then you should be able to use the following methods:

reader.MoveToFirstAttribute();

and

reader.MoveToNextAttribute();

to get the following properties:

reader.Value
reader.Name

This will enable you to get all the attribute values.
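
As a minimal sketch of that attribute loop, the following walks the attributes of the root element with MoveToFirstAttribute and MoveToNextAttribute. It runs on well-formed input only; whether the reader is still usable after an XmlException has been caught is not something this sketch establishes. `AttributeWalker` and `ReadRootAttributes` are hypothetical names used for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class AttributeWalker
{
    // Hypothetical helper: returns the root element's attributes as
    // name/value pairs, in the order the reader reports them.
    public static List<KeyValuePair<string, string>> ReadRootAttributes(string xml)
    {
        var result = new List<KeyValuePair<string, string>>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            // Skip the declaration, comments and whitespace; stop on the root element.
            reader.MoveToContent();

            if (reader.MoveToFirstAttribute())
            {
                do
                {
                    result.Add(new KeyValuePair<string, string>(reader.Name, reader.Value));
                } while (reader.MoveToNextAttribute());

                // Return to the owning element before reading further.
                reader.MoveToElement();
            }
        }
        return result;
    }
}
```

For example, `AttributeWalker.ReadRootAttributes("<Items id=\"20\" other=\"33\"/>")` returns the pairs id=20 and other=33.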




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why