Screen Scraping, Web Scraping, Web Harvesting, Web Data Extraction, etc. using C# and the .NET Framework

.net c# html-agility-pack visual-studio web-scraping

Question

I am working on a Microsoft .NET application in C# for web harvesting (also known as web scraping, web data extraction, or screen scraping). For parsing HTML, I'm trying to incorporate HTML Agility Pack, but it's not as easy as I thought it would be. I have included some specifications and images of what I have so far and was hoping to get your opinions on how I could proceed. Basically, I want to do something similar to the layout used in Visual Web Ripper, but I have no idea how they do it. Any ideas?

Images:

http://img69.imageshack.us/img69/8880/webharvester1.png

http://img198.imageshack.us/img198/9563/webharvester2.png

Specifications:

My goal is to make a user-friendly point-and-click application for downloading data and images from the web. I would like to load HTML pages using the web browser control and output the parsed data and image links into the text box. The user can specify which HTML tags they want and then download the data into the grid. Finally, they can export the data into whatever format they need.

I'm trying to use HTML Agility Pack to load the HTML on the webpage and display it in the textbox.

    // Load Web Browser
    private void Form6_Load(object sender, EventArgs e)
    {
        // Navigate to webpage
        webBrowser.Navigate("http://www.webopedia.com/TERM/H/HTML.html");

        // Save URL to memory
        SiteMemoryArray[count] = urlTextBox.Text; 

        // Load HTML from webBrowser
        HtmlWindow window = webBrowser.Document.Window; 
        string str = window.Document.Body.OuterHtml;

        // Extract tags using HtmlAgilityPack and display in textbox
        HtmlAgilityPack.HtmlDocument HtmlDoc = new HtmlAgilityPack.HtmlDocument();
        HtmlDoc.LoadHtml(str);

        HtmlAgilityPack.HtmlNodeCollection Nodes = HtmlDoc.DocumentNode.SelectNodes("//a");

        foreach (HtmlAgilityPack.HtmlNode Node in Nodes)
        {
            textBox2.Text += Node.OuterHtml + "\r\n";
        }

    }

For: HtmlWindow window = webBrowser.Document.Window;

I get the error: Object reference not set to an instance of an object.

Accepted Answer

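The page has probably not finished loading when you reference the browser window, so webBrowser.Document is still null right after Navigate returns. The Windows Forms WebBrowser control raises its DocumentCompleted event when the document has finished loading, so do the parsing in a handler for that event. See this SO answer for an example: C# how to wait for a webpage to finish loading before continuing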
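
As a rough illustration of the idea, here is a minimal sketch that moves the question's parsing code out of the Load handler and into a DocumentCompleted handler. It assumes a Windows Forms project with the HtmlAgilityPack package referenced; the class name ScraperForm, the HAP alias, and the control layout are made up for this example, and the ReadyState check is just one common way to skip the extra events raised for frames.

```csharp
using System;
using System.Text;
using System.Windows.Forms;
using HAP = HtmlAgilityPack; // alias avoids a clash with System.Windows.Forms.HtmlDocument

// Hypothetical form for illustration; control names follow the question.
public class ScraperForm : Form
{
    private readonly WebBrowser webBrowser = new WebBrowser { Dock = DockStyle.Fill };
    private readonly TextBox textBox2 = new TextBox
    {
        Dock = DockStyle.Bottom,
        Multiline = true,
        Height = 200,
        ScrollBars = ScrollBars.Vertical
    };

    public ScraperForm()
    {
        Controls.Add(webBrowser);
        Controls.Add(textBox2);

        // Subscribe before navigating so the event is not missed.
        webBrowser.DocumentCompleted += WebBrowser_DocumentCompleted;
        Load += (s, e) => webBrowser.Navigate("http://www.webopedia.com/TERM/H/HTML.html");
    }

    private void WebBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // The event can fire once per frame; wait until the whole page reports complete.
        if (webBrowser.ReadyState != WebBrowserReadyState.Complete) return;
        if (webBrowser.Document == null || webBrowser.Document.Body == null) return;

        string html = webBrowser.Document.Body.OuterHtml;

        var doc = new HAP.HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HAP.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a");
        if (nodes == null) return;

        var sb = new StringBuilder();
        foreach (HAP.HtmlNode node in nodes)
        {
            sb.AppendLine(node.OuterHtml);
        }
        textBox2.Text = sb.ToString();
    }

    [STAThread]
    private static void Main()
    {
        Application.EnableVisualStyles();
        Application.Run(new ScraperForm());
    }
}
```

Building the text in a StringBuilder and assigning it once avoids repainting the text box for every link, which the `+=` loop in the question would do.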


Popular Answer

For screen scraping, if you are searching for particular images or shapes, you can use Emgu CV, a .NET wrapper for OpenCV: http://www.emgu.com/wiki/index.php/Main_Page. It might come in handy.

You can also "read" the screen using the Windows API, like this:

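    // Capture the client area of the window.
    // GetClientRectangle() is a helper that is not shown in this answer.
    private Bitmap Capture(IntPtr hwnd)
    {
        return Capture(hwnd, GetClientRectangle());
    }

    // Copy the given zone of the window into a managed Bitmap using GDI.
    private Bitmap Capture(IntPtr hwnd, Rectangle zone)
    {
        // Get the window's device context, then create a compatible
        // in-memory DC and a bitmap to receive the pixels.
        IntPtr hdcSrc = GetWindowDC(hwnd);
        IntPtr hdcDest = CreateCompatibleDC(hdcSrc);
        IntPtr hBitmap = CreateCompatibleBitmap(hdcSrc, zone.Width, zone.Height);

        // Select the bitmap into the memory DC and copy the zone across.
        IntPtr hOld = SelectObject(hdcDest, hBitmap);
        BitBlt(hdcDest, 0, 0, zone.Width, zone.Height, hdcSrc, zone.X, zone.Y, SRCCOPY);

        // Restore the previously selected object and release both DCs.
        SelectObject(hdcDest, hOld);
        DeleteDC(hdcDest);
        ReleaseDC(hwnd, hdcSrc);

        // FromHbitmap makes a copy, so the GDI bitmap can be deleted afterwards.
        Bitmap retBitmap = Bitmap.FromHbitmap(hBitmap);
        DeleteObject(hBitmap);
        return retBitmap;
    }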
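
The snippet above calls several Win32 functions without showing their declarations. A minimal sketch of the P/Invoke signatures it would need is given below; the class name NativeMethods is just a convention, and the GetClientRectangle helper is a guess at what the answer's own helper does (here it takes the window handle and wraps the Win32 GetClientRect call), not the author's code.

```csharp
using System;
using System.ComponentModel;
using System.Drawing;
using System.Runtime.InteropServices;

internal static class NativeMethods
{
    // Raster operation code for a straight source-to-destination copy.
    public const int SRCCOPY = 0x00CC0020;

    [StructLayout(LayoutKind.Sequential)]
    public struct RECT
    {
        public int Left, Top, Right, Bottom;
    }

    [DllImport("user32.dll")]
    public static extern IntPtr GetWindowDC(IntPtr hWnd);

    [DllImport("user32.dll")]
    public static extern int ReleaseDC(IntPtr hWnd, IntPtr hDC);

    [DllImport("user32.dll", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    public static extern bool GetClientRect(IntPtr hWnd, out RECT lpRect);

    [DllImport("gdi32.dll")]
    public static extern IntPtr CreateCompatibleDC(IntPtr hdc);

    [DllImport("gdi32.dll")]
    public static extern IntPtr CreateCompatibleBitmap(IntPtr hdc, int width, int height);

    [DllImport("gdi32.dll")]
    public static extern IntPtr SelectObject(IntPtr hdc, IntPtr hgdiobj);

    [DllImport("gdi32.dll")]
    [return: MarshalAs(UnmanagedType.Bool)]
    public static extern bool BitBlt(IntPtr hdcDest, int xDest, int yDest, int width, int height,
                                     IntPtr hdcSrc, int xSrc, int ySrc, int rop);

    [DllImport("gdi32.dll")]
    [return: MarshalAs(UnmanagedType.Bool)]
    public static extern bool DeleteDC(IntPtr hdc);

    [DllImport("gdi32.dll")]
    [return: MarshalAs(UnmanagedType.Bool)]
    public static extern bool DeleteObject(IntPtr hObject);

    // Hypothetical helper: returns the client area of a window as a Rectangle.
    public static Rectangle GetClientRectangle(IntPtr hwnd)
    {
        RECT r;
        if (!GetClientRect(hwnd, out r))
        {
            throw new Win32Exception(); // picks up the last Win32 error code
        }
        return new Rectangle(r.Left, r.Top, r.Right - r.Left, r.Bottom - r.Top);
    }
}
```

Note that GetWindowDC returns a device context for the whole window, including the title bar and borders, while GetClientRect reports coordinates relative to the client area, so the captured zone may be offset unless you account for the non-client area.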



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow