Extracting all iframe tags using HtmlAgilityPack

c# html html-agility-pack

Question

I'm extracting a number of html-tags with htmlagilitypack. What I do is this:

        HtmlDoc = new HtmlDocument();
        StringReader sr = new StringReader(decodedHTML);
        HtmlDoc.Load(sr);
        sr.Close();
        var anchor_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_ANCHOR + "[@" + HTML.ATTRIBUT_HREF + "]");
        var embed_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_EMBED + "[@" + HTML.TAG_EMBED_SRC + "]");
        var iframe_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_IFRAME + "[@" + HTML.TAG_IFRAME_SRC + "]");
        var img_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_IMG + "[@" + HTML.TAG_IMG_SRC + "]");
        var audio_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_AUDIO);       // may contain inner-html
        var object_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_OBJECT);     // may contain inner-html
        var video_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_VIDEO);       // may contain inner-html

decodedHTML holds the HTML page as a string. I then check whether each of the variables above is null:

        if (anchor_tags != null)
        {
            ExtractLinks_AnchorTags(anchor_tags);
        }
        if(audio_tags != null)
        {
            ExtractLinks_AudioTags(audio_tags);
        }
        if(embed_tags!=null)
        {
            ExtractLinks_EmbedTags(embed_tags);
        }
        if (iframe_tags != null)
        {
            ExtractLinks_iFrameTags(iframe_tags);
        }
        if (img_tags != null)
        {
            ExtractLinks_ImgTags(img_tags);
        }
        if (object_tags != null)
        {
            ExtractLinks_ObjectTags(object_tags);
        }
        if (video_tags != null)
        {
            ExtractLinks_ObjectTags(video_tags);
        }

Some of them are definitely null, because most of the ExtractLinks methods are never called. For instance, when I load youtube.com, the page contains several iframe tags, but the code does not find them.

edit:

When I remove the "[@" + HTML.TAG_IFRAME_SRC + "]" part, the iframes are detected, but I only want to extract iframes that have a src attribute. What would be the appropriate XPath syntax here?

1/15/2013 1:54:47 PM

Accepted Answer

HtmlAgilityPack does not load the contents of iframe elements.

To examine the content of an iframe, read its src attribute (which holds the iframe's URI) and issue a separate HTTP request to load that content into a separate HtmlDocument.

Keep an eye out for these potential problems along the way:

  • The src attribute could contain a relative URI. For instance, if you load http://www.example.com and find an iframe with src="/samplePage", you must first convert it into an absolute URI (in this example, http://www.example.com/samplePage).

  • Some iframe elements may have no src attribute when the page is first loaded, because JavaScript adds it dynamically once the page is displayed in a browser. JavaScript can also add entire iframe elements, which you would not see at all when fetching the page with a plain HttpWebRequest. In such cases, you must examine the JavaScript on the page and replicate its logic in your application.
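
The relative-URI case from the first point can be handled with the framework's System.Uri class. This is a minimal sketch, assuming the address of the containing page is known; the class and method names (IframeUriHelper, ResolveIframeSrc) are hypothetical and not part of HtmlAgilityPack:

```csharp
using System;

public static class IframeUriHelper
{
    // Hypothetical helper: turns an iframe's src value into an absolute URI,
    // using the address of the page that contained the iframe as the base.
    public static Uri ResolveIframeSrc(string pageUrl, string src)
    {
        var baseUri = new Uri(pageUrl, UriKind.Absolute);

        // Uri(Uri, string) resolves a relative reference against the base;
        // a src that is already absolute is returned as-is.
        return new Uri(baseUri, src);
    }
}
```

For the example above, ResolveIframeSrc("http://www.example.com", "/samplePage").AbsoluteUri gives http://www.example.com/samplePage. The resulting address can then be fetched separately, for instance with HtmlAgilityPack's HtmlWeb.Load, to obtain the second HtmlDocument.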

Update

The XPath expression for iframe elements with a src attribute is: //iframe[@src]
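
As an illustration, that expression could be used with HtmlAgilityPack roughly as follows. This is a sketch only: the class and method names (IframeExtractor, GetIframeSources) are assumptions made for the example, and it only sees iframes present in the HTML string it is given, not ones added later by JavaScript:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class IframeExtractor
{
    // Hypothetical helper: returns the src value of every iframe that has one.
    public static List<string> GetIframeSources(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sources = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches,
        // which is why the question's code needs its null checks.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//iframe[@src]");
        if (nodes == null)
        {
            return sources;
        }

        foreach (HtmlNode node in nodes)
        {
            sources.Add(node.GetAttributeValue("src", string.Empty));
        }
        return sources;
    }
}
```

Returning an empty list instead of null keeps the caller free of null checks; whether that fits the ExtractLinks methods in the question is a design choice left to the reader.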

1/15/2013 1:55:02 PM



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow