Sometimes my customers give me an HTML string containing elements with malformed attributes, like this:
<img src="../imgTest.jpg" alt="Something "quoted here, or here"">
How can I dynamically rewrite such cases to look like this?
<img src="../imgTest.jpg" alt="Something 'quoted here, or here'">
I need to process this HTML in code rather than display it in a browser.
I use HtmlAgilityPack to fix HTML problems, but in cases like this one it rewrites my HTML string in a way I don't want:
<img src="../imgTest.jpg" alt="Something" quoted="" here,="" or="" here="">
My HtmlAgilityPack code is:

    var htmlDoc = new HtmlDocument();
    htmlDoc.OptionFixNestedTags = true;
    htmlDoc.LoadHtml(myHtmlStr);

    // Check whether the parser reported any errors
    var htmlError = htmlDoc.ParseErrors.SafeAny();

    // Keep the parser's output only when there were none
    if (!htmlError)
        myHtmlStr = htmlDoc.DocumentNode.InnerHtml;
I suggest matching a double quote (") that is inside a tag and is not an attribute value delimiter, then replacing it with a single quote (').
This approach may not work 100% of the time (it needs adjusting if element or attribute names carry namespace prefixes), but it should work when the tag name directly follows the opening angle bracket, attribute values are enclosed in double quotes immediately after the equal sign, and there are no < or > symbols inside attribute values.
The first lookbehind checks that the double quote is inside a tag. The second lookbehind fails the match if the double quote is immediately preceded by a word followed by an equal sign, which marks an opening attribute delimiter. The negative lookahead fails the match if the double quote is followed by optional whitespace and a closing angle bracket (optionally preceded by a forward slash), or by whitespace and a word followed by an equal sign, both of which mark a closing attribute delimiter.
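The lookarounds described above can be sketched in C# roughly as follows. This is a minimal sketch, not necessarily the exact pattern the answer had in mind: the pattern is reconstructed from the description, and the class name `AttributeQuoteFixer` and method name `Fix` are hypothetical. It relies on .NET's support for variable-length lookbehind and on the same assumptions listed above (no namespace prefixes, no < or > inside attribute values).

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeQuoteFixer
{
    // Assumed pattern, rebuilt from the prose description of the three lookarounds.
    private static readonly Regex StrayQuote = new Regex(
        @"(?<=<\w+\s[^<>]*?)" +   // inside a tag: "<name", whitespace, then no angle brackets
        @"(?<!\w+=)" +            // not an opening delimiter: no "name=" right before the quote
        @"""" +                   // the double quote itself
        @"(?!\s+\w+=|\s*/?>)");   // not a closing delimiter: no " name=" and no ">" or "/>" after it

    // Replaces every stray double quote inside a tag with a single quote.
    public static string Fix(string html)
    {
        return StrayQuote.Replace(html, "'");
    }
}
```

Under these assumptions, passing the malformed img tag from the question through `AttributeQuoteFixer.Fix` should give the single-quoted version the question asks for, and the result can then be handed to `htmlDoc.LoadHtml` as before. Quotes in text outside of tags are left alone because the first lookbehind cannot cross a closing angle bracket.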