Use HTML Agility Pack to extract dynamic content

c# html-agility-pack

Question

Let's say that I have a list of 10 news sources that I'd like to import into my local database. I need to open each of these external news pages, extract the main content, and save it. The HTML structure of each of these pages is different: some use div, while others use article tags.

I know that I can use the HttpWebRequest object to open the page, and use HtmlAgilityPack to load the HTML document.

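var doc = new HtmlAgilityPack.HtmlDocument();
var request = (HttpWebRequest)WebRequest.Create(url);
// Dispose the response and its stream once the document has been loaded.
using (var response = (HttpWebResponse)request.GetResponse())
using (var stream = response.GetResponseStream())
{
    doc.Load(stream);
}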

However, I don't know how I could target the main element without knowing the type.

Is what I'm trying to do even possible?

Popular Answer

HTML Agility Pack is EXTREMELY useful, but the code using it generally has to be customized to the structure of the site.

You can try to be generic and adaptive, but even the "Big Boys" like Evernote need different clipping options for different site layouts.
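
If you do want to try the generic route first, here is a minimal sketch of one way to do it: keep a short list of likely containers and take the first one that matches. The XPath candidates and the MainContentFinder name are my own assumptions, not something the sites guarantee, so expect to tune the list.

```csharp
using HtmlAgilityPack;

public static class MainContentFinder
{
    // Hypothetical candidate selectors, tried in order; adjust them for the sites you target.
    private static readonly string[] CandidateXPaths =
    {
        "//article",
        "//main",
        "//div[@id='content']",
        "//div[contains(@class, 'article')]"
    };

    // Returns the first node matched by any candidate, or null when none match.
    public static HtmlNode FindMainContent(HtmlDocument doc)
    {
        foreach (var xpath in CandidateXPaths)
        {
            // SelectSingleNode returns null when the expression matches nothing.
            var node = doc.DocumentNode.SelectSingleNode(xpath);
            if (node != null)
            {
                return node;
            }
        }
        return null;
    }
}
```

A null result is the signal that the page needs its own hand-written selector.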

The first thing I'd look at: If it's news, should you be using their RSS feeds instead? (Not just technically, but legally. Check out the sites' terms of service sections.)
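
If a site does publish a feed, reading it can be much simpler than scraping. This is a rough sketch using LINQ to XML; it assumes a plain RSS 2.0 feed (Atom feeds use a namespace and would need different element names), and FeedItem and RssReader are names I made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical container for the fields we care about.
public class FeedItem
{
    public string Title { get; set; }
    public string Link { get; set; }
    public string Description { get; set; }
}

public static class RssReader
{
    // Parses RSS 2.0 XML, whose <item> elements carry no XML namespace.
    public static List<FeedItem> ParseItems(string rssXml)
    {
        var feed = XDocument.Parse(rssXml);
        return feed.Descendants("item")
            .Select(item => new FeedItem
            {
                // Casting a missing XElement to string yields null rather than throwing.
                Title = (string)item.Element("title"),
                Link = (string)item.Element("link"),
                Description = (string)item.Element("description")
            })
            .ToList();
    }
}
```

Many feeds only carry a summary in the description, so you may still need the page itself for the full text.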

If you have to go with parsing their site, I'd suggest making an interface and a separate class for each site that implements the interface. Tweak each class to match the respective site's structure.
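
Roughly what I have in mind is sketched below. Everything here is hypothetical: the interface name, the two example sites, their host names, and the XPath expressions are placeholders you would replace after inspecting each real site.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical interface: one implementation per news source.
public interface IArticleExtractor
{
    // Returns the main article text, or null if the expected element is missing.
    string ExtractMainContent(HtmlDocument doc);
}

// Hypothetical site whose story lives in an <article> element.
public class ExampleDailyExtractor : IArticleExtractor
{
    public string ExtractMainContent(HtmlDocument doc)
    {
        var node = doc.DocumentNode.SelectSingleNode("//article");
        return node?.InnerText.Trim();
    }
}

// Hypothetical site that wraps the story in <div class="story-body">.
public class ExampleTimesExtractor : IArticleExtractor
{
    public string ExtractMainContent(HtmlDocument doc)
    {
        var node = doc.DocumentNode.SelectSingleNode("//div[@class='story-body']");
        return node?.InnerText.Trim();
    }
}

public static class ExtractorRegistry
{
    // Maps a host name to the extractor written for that site.
    private static readonly Dictionary<string, IArticleExtractor> ByHost =
        new Dictionary<string, IArticleExtractor>(StringComparer.OrdinalIgnoreCase)
        {
            { "daily.example.com", new ExampleDailyExtractor() },
            { "times.example.com", new ExampleTimesExtractor() }
        };

    // Returns the extractor for the URL's host, or null for an unknown site.
    public static IArticleExtractor ForUrl(string url)
    {
        var host = new Uri(url).Host;
        return ByHost.TryGetValue(host, out var extractor) ? extractor : null;
    }
}
```

The import loop then only has to load each page, call ExtractorRegistry.ForUrl(url), and save whatever ExtractMainContent returns. When a site changes its layout, the fix stays inside that site's class.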



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow