Detect data URI in image src using HtmlAgilityPack

base64 c# html-agility-pack

Question

I process a lot of HTML and transform it into PDF files. Before I can transform the HTML, I have to detect whether any of the images are referenced files. If an image is a referenced file, I base64-encode it and replace the src with the encoded data.

Right now I am relying on Regex to do the detection for me, but since I am using HtmlAgilityPack I was wondering if I can achieve the same with HtmlAgilityPack?

I would like to do this so that I don't have to maintain the Regex, given that I am already using HtmlAgilityPack.

Right now I am detecting the data URI via Regex with the following:

void Main()
{
    var myHtml = @"<html><head></head><body><p><img src='data:image/gif;base64,R0lGODlhAQABAIAAAAUEBAAAACwAAAAAAQABAAACAkQBADs='/></p></body></html>";
    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(myHtml);

    var imgs = htmlDoc.DocumentNode.SelectNodes("//img");
    if (imgs != null && imgs.Count > 0)
    {
        foreach (var imgNode in imgs)
        {
            var srcAttribute = imgNode.Attributes.FirstOrDefault(a => string.Equals("src", a.Name, StringComparison.InvariantCultureIgnoreCase));

            if (!string.IsNullOrEmpty(srcAttribute?.Value) && !StringIsDataUri(srcAttribute.Value))
            {
                Console.WriteLine("BASE ENCODE THE REFERENCED FILE");
            }
        }
    }
}

//Regex from http://stackoverflow.com/a/5714355/1958344
private static Regex regex = new Regex(@"data:(?<mime>[\w/\-\.]+);(?<encoding>\w+),(?<data>.*)", RegexOptions.Compiled);

private bool StringIsDataUri(string stringToTest)
{
    var match = regex.Match(stringToTest);
    return match.Success;
}

Accepted Answer

HtmlAgilityPack doesn't have a built-in function to detect a data URI, so you still need to supply your own implementation of such a check.
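If the goal is mainly to avoid maintaining a regular expression, one possible alternative is a plain prefix check, since a data URI always begins with the `data:` scheme. The sketch below is only an illustration: `DataUriHelper` and `IsDataUri` are hypothetical names, and the check assumes that any `src` starting with `data:` (ignoring case and leading whitespace) should be treated as already embedded. Unlike the regex, it does not validate the MIME type or encoding.

```csharp
using System;

public static class DataUriHelper
{
    // Hypothetical helper: returns true when the value uses the "data:" scheme.
    // It does not validate the media type, the encoding, or the payload.
    public static bool IsDataUri(string value)
    {
        if (string.IsNullOrWhiteSpace(value))
        {
            return false;
        }

        // URI schemes are case-insensitive, and attribute values
        // may carry leading whitespace, so trim before comparing.
        return value.TrimStart().StartsWith("data:", StringComparison.OrdinalIgnoreCase);
    }
}
```

This could stand in for `StringIsDataUri` if strict validation of the URI contents is not needed.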

As an aside, you can use HtmlAgilityPack's LINQ API to select only the img elements whose src attribute is a reference in the first place:

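var referenceImgs = htmlDoc.DocumentNode
                           .Descendants("img")
                           .Where(o =>
                           {
                               // GetAttributeValue returns the default ("") when src is missing
                               var src = o.GetAttributeValue("src", "");
                               // Skip images without a src, as the loop in the question does
                               return !string.IsNullOrEmpty(src) && !StringIsDataUri(src);
                           });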

foreach(HtmlNode img in referenceImgs)
{
    Console.WriteLine("BASE ENCODE THE REFERENCED FILE");
}
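For the encoding step that the placeholder message stands for, a minimal sketch might look like the following. `DataUriBuilder` and `ToDataUri` are hypothetical names, and the caller is assumed to supply both the image bytes and the correct MIME type. How a `src` value is resolved to a file on disk is not covered by the question, so it is left out here.

```csharp
using System;

public static class DataUriBuilder
{
    // Hypothetical helper: builds a base64 data URI from raw bytes.
    // The caller is responsible for passing a MIME type that matches the bytes.
    public static string ToDataUri(byte[] bytes, string mimeType)
    {
        if (bytes == null)
        {
            throw new ArgumentNullException(nameof(bytes));
        }
        if (string.IsNullOrEmpty(mimeType))
        {
            throw new ArgumentException("A MIME type is required.", nameof(mimeType));
        }

        return "data:" + mimeType + ";base64," + Convert.ToBase64String(bytes);
    }
}
```

Inside the loop, the result could then be written back with something like `img.SetAttributeValue("src", DataUriBuilder.ToDataUri(File.ReadAllBytes(path), "image/png"))`, where `path` and the MIME type are assumptions that depend on where the referenced files live.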



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow