I have some cases where my clients send me an HTML string in which some element attributes are not correctly structured, like this:
<img src="../imgTest.jpg" alt="Something "quoted here, or here"">
How can I dynamically change these cases to something like the following?
<img src="../imgTest.jpg" alt="Something 'quoted here, or here'">
I don't need this HTML for display in a browser; I need it to perform some operations on it.
I'm using HtmlAgilityPack to fix HTML problems, but in these cases it changes my HTML string to the following, which isn't what I want:
<img src="../imgTest.jpg" alt="Something" quoted="" here,="" or="" here="">
My code with HtmlAgilityPack:
var htmlDoc = new HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.LoadHtml(myHtmlStr);
var htmlError = htmlDoc.ParseErrors.SafeAny();
if (!htmlError)
    myHtmlStr = htmlDoc.DocumentNode.InnerHtml;
My idea is to match a " if it is inside a tag and is not an attribute qualifier.
DISCLAIMER: This solution might not work in 100% of cases (it will need adapting if namespaces are added to element or attribute names), but it should work when the tag name immediately follows the <, double quotes are used as attribute value qualifiers, and there are no < symbols inside attribute values.
and replace it with a single quote (').
See the regex demo.
The first lookbehind ensures that we are looking at a double quote inside a tag. The second lookbehind fails the match if a word followed by an equals sign appears right before the double quote (that is, the quote opens an attribute value). The negative lookahead fails the match if the double quote is followed by optional whitespace and a closing angle bracket (optionally preceded by a forward slash), or by whitespace and then a word followed by an equals sign (that is, the quote closes an attribute value).
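The three conditions described above can be sketched in C# as follows. This is a minimal illustration assembled from the description, not necessarily the exact pattern from the linked demo: the helper name FixInnerQuotes is hypothetical, and the use of [^<>] (rather than only excluding <) in the first lookbehind is an assumption made here so that quotes in text following a tag are left alone. It relies on .NET's support for variable-length lookbehind.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the original answer): replaces stray double
// quotes inside attribute values with single quotes.
static string FixInnerQuotes(string html)
{
    // (?<=<\w+\s[^<>]*)   the quote is inside a tag: "<", a tag name, whitespace,
    //                     then no "<" or ">" up to the quote (assumption: no ">" in values)
    // (?<!\w+=)           the quote does not directly follow name= (not an opening qualifier)
    // (?!\s*/?>|\s+\w+=)  the quote is not followed by the tag end or by the next
    //                     name= (not a closing qualifier)
    const string pattern = @"(?<=<\w+\s[^<>]*)(?<!\w+=)""(?!\s*/?>|\s+\w+=)";
    return Regex.Replace(html, pattern, "'");
}

string input = "<img src=\"../imgTest.jpg\" alt=\"Something \"quoted here, or here\"\">";
Console.WriteLine(FixInnerQuotes(input));
// <img src="../imgTest.jpg" alt="Something 'quoted here, or here'">
```

Running the replacement before passing the string to HtmlAgilityPack should keep the alt value in one piece. Note that an inner quote that happens to be followed by whitespace and something like word= would still be treated as a closing qualifier and left unchanged, so such inputs need separate handling.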