Is there any way to use "BrowserSession" to download files? C#

c# cookies download html-agility-pack

Question

Before you may download anything from my website, you must log in. Currently, I log in and do the necessary scrapes using the BrowserSession Class (at least for the most part).

(The source for the BrowserSession class is at the bottom of the post.)

The download links are in the document nodes. Since I have already heavily modified the BrowserSession class (I should have made the changes in a partial class, but didn't), I don't really want to switch away from it. However, I don't know how to add download functionality to that class, and if I try to download the files with a WebClient, it fails.

As far as I can tell, BrowserSession downloads and loads the pages using HtmlAgilityPack's HtmlWeb.

If there is no simple way to modify BrowserSession, is there a way to use its CookieCollection with WebClient?

PS: To download the file, I must be logged in; otherwise, the link takes me to the login page. Because of this, I can't just use WebClient; instead, I have to either change the BrowserSession class to allow downloading or change WebClient to utilize cookies before retrieving a page.

I'll confess that I don't really understand cookies (I'm not sure whether they're utilized on GETs as well as POSTs), but so far, BrowserSession has taken care of everything.

PPS: The BrowserSession posted here is the original version without my changes, but the core functionality is the same.

public class BrowserSession
{
private bool _isPost;
private HtmlDocument _htmlDoc;

/// <summary>
/// System.Net.CookieCollection. Provides a collection container for instances of Cookie class 
/// </summary>
public CookieCollection Cookies { get; set; }

/// <summary>
/// Provide a key-value-pair collection of form elements 
/// </summary>
public FormElementCollection FormElements { get; set; }

/// <summary>
/// Makes a HTTP GET request to the given URL
/// </summary>
public string Get(string url)
{
    _isPost = false;
    CreateWebRequestObject().Load(url);
    return _htmlDoc.DocumentNode.InnerHtml;
}

/// <summary>
/// Makes a HTTP POST request to the given URL
/// </summary>
public string Post(string url)
{
    _isPost = true;
    CreateWebRequestObject().Load(url, "POST");
    return _htmlDoc.DocumentNode.InnerHtml;
}

/// <summary>
/// Creates the HtmlWeb object and initializes all event handlers. 
/// </summary>
private HtmlWeb CreateWebRequestObject()
{
    HtmlWeb web = new HtmlWeb();
    web.UseCookies = true;
    web.PreRequest = new HtmlWeb.PreRequestHandler(OnPreRequest);
    web.PostResponse = new HtmlWeb.PostResponseHandler(OnAfterResponse);
    web.PreHandleDocument = new HtmlWeb.PreHandleDocumentHandler(OnPreHandleDocument);
    return web;
}

/// <summary>
/// Event handler for HtmlWeb.PreRequestHandler. Occurs before an HTTP request is executed.
/// </summary>
protected bool OnPreRequest(HttpWebRequest request)
{
    AddCookiesTo(request);               // Add cookies that were saved from previous requests
    if (_isPost) AddPostDataTo(request); // We only need to add post data on a POST request
    return true;
}

/// <summary>
/// Event handler for HtmlWeb.PostResponseHandler. Occurs after a HTTP response is received
/// </summary>
protected void OnAfterResponse(HttpWebRequest request, HttpWebResponse response)
{
    SaveCookiesFrom(response); // Save cookies for subsequent requests
}

/// <summary>
/// Event handler for HtmlWeb.PreHandleDocumentHandler. Occurs before a HTML document is handled
/// </summary>
protected void OnPreHandleDocument(HtmlDocument document)
{
    SaveHtmlDocument(document);
}

/// <summary>
/// Assembles the Post data and attaches to the request object
/// </summary>
private void AddPostDataTo(HttpWebRequest request)
{
    string payload = FormElements.AssemblePostPayload();
    byte[] buff = Encoding.UTF8.GetBytes(payload);
    request.ContentLength = buff.Length;
    request.ContentType = "application/x-www-form-urlencoded";
    // Dispose the request stream so the body is flushed before the response is read
    using (System.IO.Stream reqStream = request.GetRequestStream())
    {
        reqStream.Write(buff, 0, buff.Length);
    }
}

/// <summary>
/// Add cookies to the request object
/// </summary>
private void AddCookiesTo(HttpWebRequest request)
{
    if (Cookies != null && Cookies.Count > 0)
    {
        request.CookieContainer.Add(Cookies);
    }
}

/// <summary>
/// Saves cookies from the response object to the local CookieCollection object
/// </summary>
private void SaveCookiesFrom(HttpWebResponse response)
{
    if (response.Cookies.Count > 0)
    {
        if (Cookies == null)  Cookies = new CookieCollection(); 
        Cookies.Add(response.Cookies);
    }
}

/// <summary>
/// Saves the form elements collection by parsing the HTML document
/// </summary>
private void SaveHtmlDocument(HtmlDocument document)
{
    _htmlDoc = document;
    FormElements = new FormElementCollection(_htmlDoc);
}
}

Class: FormElementCollection

/// <summary>
/// Represents a combined list and collection of Form Elements.
/// </summary>
public class FormElementCollection : Dictionary<string, string>
{
/// <summary>
/// Constructor. Parses the HtmlDocument to get all form input elements. 
/// </summary>
public FormElementCollection(HtmlDocument htmlDoc)
{
    var inputs = htmlDoc.DocumentNode.Descendants("input");
    foreach (var element in inputs)
    {
        string name = element.GetAttributeValue("name", "undefined");
        string value = element.GetAttributeValue("value", "");
        if (!name.Equals("undefined")) Add(name, value);
    }
}

/// <summary>
/// Assembles all form elements and values to POST. Also URL-encodes the values.
/// </summary>
public string AssemblePostPayload()
{
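    StringBuilder sb = new StringBuilder();
    foreach (var element in this)
    {
        // Separate pairs with '&'; building it this way also avoids
        // Substring(1) throwing when the form has no elements
        if (sb.Length > 0) sb.Append('&');
        sb.Append(System.Web.HttpUtility.UrlEncode(element.Key));
        sb.Append('=');
        sb.Append(System.Web.HttpUtility.UrlEncode(element.Value));
    }
    return sb.ToString();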
}
}
11/12/2015 7:36:57 PM

Accepted Answer

I got it working by modifying BrowserSession and using a subclassed WebClient:

First, change _htmlDoc to public so that you can access the document nodes:

public class BrowserSession
{
    private bool _isPost;
    public string previous_Response { get; private set; }
    public HtmlDocument _htmlDoc { get; private set; }
}

Second, add the following method to BrowserSession:

public void DownloadCookieProtectedFile(string url, string Filename)
{
    using (CookieAwareWebClient wc = new CookieAwareWebClient())
    {
        wc.Cookies = Cookies;
        wc.DownloadFile(url, Filename);
    }
}
//rest of BrowserSession

Third, add this class somewhere; it lets the WebClient use the cookies from the BrowserSession.

public class CookieAwareWebClient : WebClient
{
    public CookieCollection Cookies = new CookieCollection();
    private void AddCookiesTo(HttpWebRequest request)
    {
        if (Cookies != null && Cookies.Count > 0)
        {
            request.CookieContainer.Add(Cookies);
        }
    }

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        HttpWebRequest webRequest = request as HttpWebRequest;
        if (webRequest != null)
        {
            if (webRequest.CookieContainer == null) webRequest.CookieContainer = new CookieContainer();
            AddCookiesTo(webRequest);
        }
        return request;
    }
}
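To illustrate what copying the CookieCollection into a request's CookieContainer does, here is a minimal sketch. The cookie name, value, and the example.com / other.test hosts are made up for illustration and are not from the original post. It suggests that the container decides which cookies to send based on the target URI (domain and path), not on whether the request is a GET or a POST, which is why the same cookies that BrowserSession collected at login can be reused for a file download.

```csharp
using System;
using System.Net;

public static class CookieHeaderDemo
{
    // Returns the Cookie header value a CookieContainer would attach
    // to a request for the given URI, regardless of HTTP method.
    public static string CookieHeaderFor(CookieCollection cookies, Uri uri)
    {
        var container = new CookieContainer();
        container.Add(cookies);               // same call AddCookiesTo uses
        return container.GetCookieHeader(uri);
    }

    public static void Main()
    {
        // Hypothetical session cookie, standing in for what a login response would set
        var cookies = new CookieCollection();
        cookies.Add(new Cookie("session", "abc123", "/", "example.com"));

        // Matching host: the cookie is included
        Console.WriteLine(CookieHeaderFor(cookies, new Uri("http://example.com/files/report.pdf")));

        // Different host: nothing is sent
        Console.WriteLine(CookieHeaderFor(cookies, new Uri("http://other.test/")));
    }
}
```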

That should let you use BrowserSession as you normally would. When you need a file that is only accessible while logged in, call BrowserSession.DownloadCookieProtectedFile(), or use CookieAwareWebClient directly as if it were a WebClient, setting only the cookies first:

// "session" is your logged-in BrowserSession instance;
// url and fileName are placeholders for the download link and target path
using (var wc = new CookieAwareWebClient())
{
    wc.Cookies = session.Cookies;
    // Download with WebClient as normal
    wc.DownloadFile(url, fileName);
}
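Since an unauthenticated request for the file is redirected to the login page, it can be worth checking that the download actually returned the file. The helper below is only a heuristic sketch under the assumption that the protected files are not themselves HTML; the class and method names (DownloadCheck, LooksLikeHtml) are hypothetical and not part of BrowserSession or WebClient.

```csharp
using System;
using System.Net;

public static class DownloadCheck
{
    // Heuristic: a response whose Content-Type is text/html is probably
    // the login page rather than the requested file.
    public static bool LooksLikeHtml(WebHeaderCollection headers)
    {
        string contentType = headers == null ? null : headers["Content-Type"];
        return contentType != null &&
               contentType.TrimStart().StartsWith("text/html", StringComparison.OrdinalIgnoreCase);
    }
}
```

After wc.DownloadFile(url, fileName) returns, passing wc.ResponseHeaders to DownloadCheck.LooksLikeHtml gives a quick signal that the cookies were not accepted and the saved file is probably the login page.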
11/19/2015 4:13:18 AM

Popular Answer

Logging in and downloading web pages is not easy. I recently had the same problem. If you find a solution other than this one, please share it.

Using PhantomJS and Selenium was what I did. I can communicate with the web browser of my choosing using Selenium.

Also, the Browser class does not use Html Agility Pack, which is a third-party library available through NuGet.

I'd like to direct your attention to this question, where I've written a comprehensive example of how to use Selenium, download an HTML document, and filter out the relevant data with XPath.




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow