HtmlAgilityPack Attributes.Remove on Image Only Removes One, When There Are Two

attributes c# html-agility-pack

Question

I'm using HtmlAgilityPack in our project to display HTML from another of our systems. I ran across this issue in my unit testing, and want to make sure that I'm not doing something wrong. If an image has two "src" attributes, I'd like to pick one, remove them both, and add one back in with the right path. I don't think this will happen with our HTML, but just in case.

So, here's an example image tag:

    <img align="left" alt="" src="/blah.jpg" src="/knowledge/blah.jpg" border="0" />

Here's the code to manipulate the HTML:

    public static string FixHtmlLinks(this string html)
    {
        var htmlDoc = new HtmlDocument()
        {
            OptionWriteEmptyNodes = true
        };
        htmlDoc.LoadHtml(html);

        var imagesToCheck = htmlDoc.DocumentNode.SelectNodes("//img[@src!='']");

        if (null != imagesToCheck)
        {
            foreach (var image in imagesToCheck.ToList())
            {
                var src = image.GetAttributeValue("src", string.Empty);
                if (Uri.IsWellFormedUriString(src, UriKind.Relative))
                {
                    image.Attributes.Remove("src");
                    image.SetAttributeValue("src", string.Format(RELATIVE_IMAGE_PROTOCOL_AND_HOST, src));
                }
                else if (Uri.IsWellFormedUriString(src, UriKind.Absolute))
                {
                    image.Attributes.Remove("src");
                    image.SetAttributeValue("src", src.Replace(ABSOLUTE_IMAGE_HOST_TO_REPLACE, IMAGE_PROTOCOL_AND_HOST));
                }
            }
        }

        return htmlDoc.DocumentNode.OuterHtml;
    }

When I debug and execution reaches the line image.Attributes.Remove("src");, there are two "src" attributes, as expected. After that line runs, one "src" attribute remains: the one that starts with "/knowledge". However, I would expect them both to be removed, since the summary for Remove says:

Removes an attribute from the list, using its name. If there are more than one attributes with this name, they will all be removed.

I checked the source code for HtmlAttributeCollection on CodePlex, and the Remove method loops over the attributes to remove the matching ones, so everything looks like it should work.

Am I using this wrong, or have I found an opportunity to offer a patch in HtmlAgilityPack?

Accepted Answer

Confirmed: image.Attributes.Remove only removes the first occurrence.
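
A plausible explanation, not confirmed by anything in this thread, is the classic skip that happens when a forward index loop removes items in place: after RemoveAt(i), the next item shifts into slot i, and the i++ step then passes over it. The sketch below reproduces that pattern on a plain List<string>. RemoveLoopDemo and both method names are hypothetical, and this is not HtmlAgilityPack's actual code.

```csharp
using System;
using System.Collections.Generic;

public static class RemoveLoopDemo
{
    // Forward index loop that removes in place (the suspected pattern).
    public static List<string> RemoveForward(IEnumerable<string> names, string target)
    {
        var items = new List<string>(names);
        for (int i = 0; i < items.Count; i++)
        {
            if (items[i] == target)
            {
                // The following element shifts into slot i; the loop's i++
                // then moves past it without checking it.
                items.RemoveAt(i);
            }
        }
        return items;
    }

    // Walking backward avoids the skip, because removals only shift
    // elements that have already been visited.
    public static List<string> RemoveBackward(IEnumerable<string> names, string target)
    {
        var items = new List<string>(names);
        for (int i = items.Count - 1; i >= 0; i--)
        {
            if (items[i] == target)
            {
                items.RemoveAt(i);
            }
        }
        return items;
    }
}
```

With the attribute names from the example tag (align, alt, src, src, border), RemoveForward leaves the second src in place, which matches what the debugger showed, while RemoveBackward removes both.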

One quick fix is to call Remove multiple times. If it's called and the attribute isn't found, it does nothing.
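
Rather than guessing how many times to call Remove, one option is to snapshot the matching attributes and remove each instance individually. This is a minimal sketch, assuming the HtmlAgilityPack package is referenced. RemoveAllAttributes and its containing class are hypothetical names, not part of the library; the sketch relies only on the collection being enumerable and on the Remove(HtmlAttribute) overload.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class HtmlNodeAttributeExtensions
{
    // Hypothetical helper (not part of HtmlAgilityPack): removes every
    // attribute on the node whose name matches, however many there are.
    public static void RemoveAllAttributes(this HtmlNode node, string name)
    {
        if (node == null) throw new ArgumentNullException(nameof(node));
        if (name == null) throw new ArgumentNullException(nameof(name));

        // Snapshot the matches first so the collection is not modified
        // while it is being enumerated.
        var matches = node.Attributes
            .Where(a => string.Equals(a.Name, name, StringComparison.OrdinalIgnoreCase))
            .ToList();

        foreach (var attribute in matches)
        {
            // Remove(HtmlAttribute) removes this specific instance.
            node.Attributes.Remove(attribute);
        }
    }
}
```

In FixHtmlLinks, image.RemoveAllAttributes("src"); would take the place of each image.Attributes.Remove("src"); call, with SetAttributeValue still adding the single corrected value afterwards.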

You might want to let the HtmlAgilityPack authors know about this.




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow