How to scrape a Flash-based site?

c# flash html-agility-pack web-crawler web-scraping

Question

We are using Html Agility Pack to scrape data from HTML-based sites. Is there any DLL like Html Agility Pack for scraping Flash-based sites?

Popular Answer

It really depends on the site you are trying to scrape. There are two types of sites in this regard:

  • If the site has the data inside the SWF file, then you'll have to decompile the SWF file and read the data inside. With enough work you can probably do it programmatically. However, if this is the case, it might be easier to just gather the data manually, since it probably isn't going to change much.

  • In most cases, however, especially with sites that have a lot of data, the Flash file is actually contacting an external API. In that case you can simply ignore the Flash altogether and go to the API directly. If you're not sure, just activate Firebug's Net panel (or any browser's network inspector) and start browsing. If it's using an external API, it should become obvious.
    Once you find that API, you could probably reverse engineer how to manipulate it to give you whatever data you need.
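As a rough illustration of the second case, here is a minimal Python sketch of calling such a backend directly once the network panel has revealed it. Everything specific in it is an assumption: the URL `https://example.com/api/items`, the `page` parameter, the optional `Referer` header, and the `{"items": [...]}` JSON shape are all hypothetical. A real Flash backend may just as well return XML or binary AMF, in which case the parsing step would need to change. The same request could be made from C# with `HttpClient`.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint: replace with whatever the network panel shows.
API_URL = "https://example.com/api/items"


def build_request(base_url, params, referer=None):
    """Build a GET request that mimics the call the Flash movie makes."""
    query = urllib.parse.urlencode(params)
    request = urllib.request.Request(f"{base_url}?{query}")
    if referer:
        # Assumption: some backends check the Referer header.
        request.add_header("Referer", referer)
    return request


def parse_items(body):
    """Decode the response body; assumes JSON shaped like {"items": [...]}."""
    return json.loads(body).get("items", [])


def fetch_items(page):
    """Fetch one page of data straight from the (hypothetical) API."""
    request = build_request(API_URL, {"page": page})
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_items(response.read().decode("utf-8"))
```

Calling `fetch_items(1)` would perform the actual network request; `build_request` and `parse_items` are kept separate so the request shape and the parsing can be adjusted independently as you learn more about the API.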

Also note that if it's a big enough site, there are probably non-Flash ways to get to the same data:

  • It might have a mobile site (with no Flash). Try accessing the site with an iPhone user-agent.
  • It might have a site for crawlers (like Googlebot). Try accessing the site with a Googlebot user-agent.
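Trying a different user-agent only requires overriding one request header. The Python sketch below shows one way to do it. The iPhone string is an illustrative example (real values vary by device and OS version), the helper names `request_as` and `fetch_html` are made up for this example, and whether a given site actually serves a Flash-free page to these agents has to be checked case by case.

```python
import urllib.request

# Illustrative user-agent strings; the iPhone value is just one example.
USER_AGENTS = {
    "iphone": (
        "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) "
        "AppleWebKit/605.1.15 (KHTML, like Gecko) "
        "Version/16.0 Mobile/15E148 Safari/604.1"
    ),
    "googlebot": (
        "Mozilla/5.0 (compatible; Googlebot/2.1; "
        "+http://www.google.com/bot.html)"
    ),
}


def request_as(url, agent):
    """Build a request for `url` that presents the chosen user-agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENTS[agent]})


def fetch_html(url, agent):
    """Download the page as the given agent and return it as text."""
    with urllib.request.urlopen(request_as(url, agent), timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

If the HTML returned for the `iphone` or `googlebot` agent contains the data as plain markup, it can be parsed with a regular HTML parser such as Html Agility Pack.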

EDIT: If you're talking about crawling (getting data from any random site) rather than scraping (getting structured data from a specific site), then there's not much you can do; even Googlebot isn't scraping Flash content. This is mostly because, unlike HTML, Flash doesn't have a standardized syntax from which you can immediately tell what is text, what is a link, etc.




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why