Using C# and the .NET Framework for screen scraping, web scraping, web harvesting, web data extraction, and more

.net c# html-agility-pack visual-studio web-scraping

Question

Call it what you like (web harvesting, web scraping, web data extraction, screen scraping, and so on); I'm building a tool for it as a Microsoft .NET application in C#. I'm trying to use HTML Agility Pack for the HTML parsing, but it's proving harder than I anticipated. I've included my requirements and examples of what I have so far, and I'd appreciate feedback on how to take it further. Basically, I'm trying to replicate the layout of Visual Web Ripper, but I'm not sure how to go about it. Any thoughts?

Images:

http://img69.imageshack.us/img69/8880/webharvester1.png

http://img198.imageshack.us/img198/9563/webharvester2.png

Specifications:

My objective is to create a very user-friendly point-and-click program for downloading information and photos from the internet. I want to use the web browser control to load HTML pages and output the parsed data, along with image links, into a text field. After choosing the HTML tags they want, the user can download the data into the grid and finally export it in the desired format.

I'm trying to take the HTML of the page loaded in the web browser control, parse it with HTML Agility Pack, and show the result in the textbox.

    // Load Web Browser
    private void Form6_Load(object sender, EventArgs e)
    {
        // Navigate to webpage
        webBrowser.Navigate("http://www.webopedia.com/TERM/H/HTML.html");

        // Save URL to memory
        SiteMemoryArray[count] = urlTextBox.Text; 

        // Load HTML from webBrowser
        HtmlWindow window = webBrowser.Document.Window; 
        string str = window.Document.Body.OuterHtml;

        // Extract tags using HtmlAgilityPack and display in textbox
        HtmlAgilityPack.HtmlDocument HtmlDoc = new HtmlAgilityPack.HtmlDocument();
        HtmlDoc.LoadHtml(str);

        HtmlAgilityPack.HtmlNodeCollection Nodes = HtmlDoc.DocumentNode.SelectNodes("//a");

        foreach (HtmlAgilityPack.HtmlNode Node in Nodes)
        {
            textBox2.Text += Node.OuterHtml + "\r\n";
        }

    }

On this line: HtmlWindow window = webBrowser.Document.Window;

I get the following error message: Object reference not set to an instance of an object.

1
2
2/28/2012 8:27:47 PM

Accepted Answer

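The page may not have finished loading when you access the browser window. Navigate returns immediately, so webBrowser.Document can still be null on the next line, which is what produces the error. The browser control raises an event once loading has finished (DocumentCompleted on the Windows Forms WebBrowser control); move the parsing code into a handler for that event. For an example, see this SO answer: How to wait till a website has fully loaded in C# before moving on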
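
As a rough illustration of that advice, here is a minimal sketch that starts navigation in the Load handler and parses only once the document has loaded. It assumes a Windows Forms project that references the HtmlAgilityPack package; the control names (webBrowser, textBox2) and the form name come from the question, while the LinkExtractor helper class is a hypothetical name introduced here for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

// Hypothetical helper: parsing is kept apart from the UI so it can be checked on its own.
public static class LinkExtractor
{
    public static List<string> ExtractAnchors(string html)
    {
        var result = new List<string>();
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html ?? string.Empty);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlAgilityPack.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a");
        if (nodes == null)
            return result;

        foreach (HtmlAgilityPack.HtmlNode node in nodes)
            result.Add(node.OuterHtml);
        return result;
    }
}

public class Form6 : Form
{
    private readonly WebBrowser webBrowser = new WebBrowser { Dock = DockStyle.Fill };
    private readonly TextBox textBox2 = new TextBox
    {
        Multiline = true,
        Dock = DockStyle.Bottom,
        Height = 150,
        ScrollBars = ScrollBars.Vertical
    };

    public Form6()
    {
        Controls.Add(webBrowser);
        Controls.Add(textBox2);
        webBrowser.DocumentCompleted += WebBrowser_DocumentCompleted;
        Load += Form6_Load;
    }

    private void Form6_Load(object sender, EventArgs e)
    {
        // Navigate is asynchronous: only start the load here, do not touch Document yet.
        webBrowser.Navigate("http://www.webopedia.com/TERM/H/HTML.html");
    }

    private void WebBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // The event can fire once per frame, so wait until the whole page reports Complete.
        if (webBrowser.ReadyState != WebBrowserReadyState.Complete)
            return;
        if (webBrowser.Document == null || webBrowser.Document.Body == null)
            return;

        string html = webBrowser.Document.Body.OuterHtml;
        // Build the text once instead of appending to textBox2.Text inside a loop.
        textBox2.Text = string.Join("\r\n", LinkExtractor.ExtractAnchors(html).ToArray());
    }
}
```

The null check on the SelectNodes result matters as well: on a page with no matching tags, the foreach in the question would fail with the same kind of null reference error.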

3
5/23/2017 12:24:15 PM

Popular Answer

If you want to search for particular images or shapes while screen scraping, Emgu CV (http://www.emgu.com/wiki/index.php/Main_Page) may be useful.

WinAPI allows you to "read" the screen as well.

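    // Capture the whole client area of the window.
    // GetClientRectangle is a helper that is not shown in this answer.
    private Bitmap Capture(IntPtr hwnd)
    {
        return Capture(hwnd, GetClientRectangle());
    }

    // Copy the given zone of the window into a new Bitmap using GDI.
    private Bitmap Capture(IntPtr hwnd, Rectangle zone)
    {
        // Source DC for the window, plus a memory DC and bitmap to copy into
        IntPtr hdcSrc = GetWindowDC(hwnd);
        IntPtr hdcDest = CreateCompatibleDC(hdcSrc);
        IntPtr hBitmap = CreateCompatibleBitmap(hdcSrc, zone.Width, zone.Height);

        // Select the bitmap into the memory DC and copy the zone into it
        IntPtr hOld = SelectObject(hdcDest, hBitmap);
        BitBlt(hdcDest, 0, 0, zone.Width, zone.Height, hdcSrc, zone.X, zone.Y, SRCCOPY);

        // Restore the previously selected object and release both DCs
        SelectObject(hdcDest, hOld);
        DeleteDC(hdcDest);
        ReleaseDC(hwnd, hdcSrc);

        // FromHbitmap copies the pixels, so the GDI handle can be freed afterwards
        Bitmap retBitmap = Bitmap.FromHbitmap(hBitmap);
        DeleteObject(hBitmap);
        return retBitmap;
    }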
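
The capture code relies on several Win32 functions and a SRCCOPY constant that it does not declare. A minimal sketch of the P/Invoke declarations it appears to assume is given below. The NativeMethods class name, the RECT struct, and the GetClientRectangle(IntPtr) helper are assumptions made for illustration; the answer calls a parameterless GetClientRectangle() whose implementation is not shown.

```csharp
using System;
using System.Drawing;
using System.Runtime.InteropServices;

// Hypothetical container for the native declarations used by the capture code.
public static class NativeMethods
{
    // Raster-operation code for a straight source-to-destination copy.
    public const int SRCCOPY = 0x00CC0020;

    [StructLayout(LayoutKind.Sequential)]
    public struct RECT
    {
        public int Left, Top, Right, Bottom;
    }

    [DllImport("user32.dll")]
    public static extern IntPtr GetWindowDC(IntPtr hWnd);

    [DllImport("user32.dll")]
    public static extern int ReleaseDC(IntPtr hWnd, IntPtr hDC);

    [DllImport("user32.dll")]
    public static extern bool GetClientRect(IntPtr hWnd, out RECT lpRect);

    [DllImport("gdi32.dll")]
    public static extern IntPtr CreateCompatibleDC(IntPtr hdc);

    [DllImport("gdi32.dll")]
    public static extern IntPtr CreateCompatibleBitmap(IntPtr hdc, int width, int height);

    [DllImport("gdi32.dll")]
    public static extern IntPtr SelectObject(IntPtr hdc, IntPtr hObject);

    [DllImport("gdi32.dll")]
    public static extern bool BitBlt(IntPtr hdcDest, int xDest, int yDest, int width, int height,
                                     IntPtr hdcSrc, int xSrc, int ySrc, int rop);

    [DllImport("gdi32.dll")]
    public static extern bool DeleteDC(IntPtr hdc);

    [DllImport("gdi32.dll")]
    public static extern bool DeleteObject(IntPtr hObject);

    // Assumed helper: the client area of a window as a Rectangle (empty on failure).
    public static Rectangle GetClientRectangle(IntPtr hwnd)
    {
        RECT rc;
        if (!GetClientRect(hwnd, out rc))
            return Rectangle.Empty;
        return Rectangle.FromLTRB(rc.Left, rc.Top, rc.Right, rc.Bottom);
    }
}
```

With these in place, the capture methods can call NativeMethods.GetWindowDC(hwnd) and so on, or the declarations can be pasted into the same class. Note that GetWindowDC returns a DC for the whole window, including the title bar and borders, so a client-area rectangle is measured from the window's top-left corner; GetDC would give a DC for the client area only.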



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow