Extracting all iframe tags using HtmlAgilityPack

c# html html-agility-pack

Question

I'm using HtmlAgilityPack to extract several HTML tags. Here's what I do:

        HtmlDoc = new HtmlDocument();
        StringReader sr = new StringReader(decodedHTML);
        HtmlDoc.Load(sr);
        sr.Close();
        var anchor_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_ANCHOR + "[@" + HTML.ATTRIBUT_HREF + "]");
        var embed_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_EMBED + "[@" + HTML.TAG_EMBED_SRC + "]");
        var iframe_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_IFRAME + "[@" + HTML.TAG_IFRAME_SRC + "]");
        var img_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_IMG + "[@" + HTML.TAG_IMG_SRC + "]");
        var audio_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_AUDIO);       // may contain inner-html
        var object_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_OBJECT);     // may contain inner-html
        var video_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_VIDEO);       // may contain inner-html

Here, decodedHTML is the HTML page as a string. After that, I check whether the variables above are null:

        if (anchor_tags != null)
        {
            ExtractLinks_AnchorTags(anchor_tags);
        }
        if(audio_tags != null)
        {
            ExtractLinks_AudioTags(audio_tags);
        }
        if(embed_tags!=null)
        {
            ExtractLinks_EmbedTags(embed_tags);
        }
        if (iframe_tags != null)
        {
            ExtractLinks_iFrameTags(iframe_tags);
        }
        if (img_tags != null)
        {
            ExtractLinks_ImgTags(img_tags);
        }
        if (object_tags != null)
        {
            ExtractLinks_ObjectTags(object_tags);
        }
        if (video_tags != null)
        {
            ExtractLinks_ObjectTags(video_tags);
        }

Some of them are definitely null, because most of the ExtractLinks methods are never called. For example, when I visit youtube.com, the page contains several iframe tags, but the code doesn't recognize them.

Edit:

When I remove the "[@" + HTML.TAG_IFRAME_SRC + "]" predicate, the iframes are recognized, but I only want to extract the iframes that have a src attribute. What's the correct XPath syntax for that?

Accepted Answer

HtmlAgilityPack does not load the contents of iframe elements.

In order to inspect the content of an iframe, read the src attribute (which represents the iframe's URI) and perform a separate web request to load that into a separate HtmlDocument.
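
A minimal sketch of that separate request, assuming the iframe's src has already been turned into an absolute http(s) URL. The names FrameLoader and LoadFrame are illustrative and not part of HtmlAgilityPack; HtmlWeb is the library's built-in downloader.

```csharp
using System;
using HtmlAgilityPack;

public static class FrameLoader
{
    // Downloads the page an iframe points to and parses it into its own HtmlDocument.
    // Throws ArgumentException if frameUrl is not an absolute http or https URL.
    public static HtmlDocument LoadFrame(string frameUrl)
    {
        bool isWebUrl = Uri.TryCreate(frameUrl, UriKind.Absolute, out Uri uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
        if (!isWebUrl)
            throw new ArgumentException("frameUrl must be an absolute http(s) URL", nameof(frameUrl));

        // HtmlWeb.Load performs the HTTP request and returns the parsed document.
        var web = new HtmlWeb();
        return web.Load(uri.AbsoluteUri);
    }
}
```

The returned document can then be queried with SelectNodes just like the outer page.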

Along the way, be aware of these possible issues:

  • The src attribute may contain a relative URI. For example, if you visit http://www.example.com and see an iframe with src="/samplePage", you should first convert that to an absolute URI (in this case, http://www.example.com/samplePage).

  • Some iframe elements may not have a src attribute in the downloaded HTML, because it is added dynamically via JavaScript when the document is rendered in a browser. Entire iframe elements can also be created with JavaScript, so you would not see them at all in the response to a regular HttpWebRequest. In cases like these, you have to analyze the JavaScript on the page and duplicate its logic in your program.
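
The relative-URI case from the first point can be handled with the framework's System.Uri class. This is a small sketch, assuming the URL of the containing page is known; FrameUrl and Resolve are illustrative names, not part of any library.

```csharp
using System;

public static class FrameUrl
{
    // Resolves an iframe's src value against the URL of the page it was found on.
    // Returns null when src is empty or cannot be combined into an absolute URI.
    // pageUrl itself must be a valid absolute URL, otherwise new Uri(...) throws.
    public static Uri Resolve(string pageUrl, string src)
    {
        if (string.IsNullOrWhiteSpace(src))
            return null;

        var baseUri = new Uri(pageUrl, UriKind.Absolute);

        // Uri.TryCreate(Uri, string, out Uri) combines the base and relative parts;
        // a src that is already absolute is returned unchanged.
        return Uri.TryCreate(baseUri, src.Trim(), out Uri absolute) ? absolute : null;
    }
}
```

With the values from the example above, FrameUrl.Resolve("http://www.example.com", "/samplePage") yields http://www.example.com/samplePage.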

Update

The XPath expression for iframe elements that have a src attribute is: //iframe[@src]
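
Applied with HtmlAgilityPack, that expression could be used roughly as follows. This is a sketch: IframeExtractor and GetIframeSources are made-up names, and the HTML is assumed to be available as a string, as in the question.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class IframeExtractor
{
    // Returns the src value of every iframe element that declares a src attribute.
    public static List<string> GetIframeSources(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sources = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var iframes = doc.DocumentNode.SelectNodes("//iframe[@src]");
        if (iframes == null)
            return sources;

        foreach (HtmlNode iframe in iframes)
        {
            sources.Add(iframe.GetAttributeValue("src", string.Empty));
        }
        return sources;
    }
}
```

Iframes without a src attribute in the downloaded markup are skipped by the predicate, so an empty result for a page that visibly uses iframes usually points to the JavaScript case described above.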




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow