Scraping data from a website that requires login, using HttpClient in C#

c# html-agility-pack httpclient

I would like to get some data from the following website:

http://wttv.click-tt.de/cgi-bin/WebObjects/nuLigaTTDE.woa/wa/teamPortrait?teamtable=1673669&pageState=rueckrunde&championship=SK+Bez.+BB+13%2F14&group=204559#

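The website contains some data about table tennis. Only the current season can be accessed without logging in; everything else requires a login. For the current season I have already written some code to get the data, and it works fine. I am using HttpClient together with HtmlAgilityPack. The code looks like this: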

            HttpClient http = new HttpClient();
            var response = await http.GetByteArrayAsync(website);
            // Decode the full response body as UTF-8
            String source = Encoding.UTF8.GetString(response, 0, response.Length);
            source = WebUtility.HtmlDecode(source);
            HtmlDocument resultat = new HtmlDocument();
            resultat.LoadHtml(source);

            // Do something to get the relevant data from resultat by scanning its DocumentNode...

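Now I would like to get data from the part of the website that requires a login. Does anyone know how to log in to the website and get the data? The login has to be done by clicking "Ergebnishistorie freischalten ..." and then entering a username and password.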

Accepted answer

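There are many ways to log in to a website, depending on the authentication method the particular site uses (forms authentication, basic authentication, Windows authentication, etc.). Websites usually use forms authentication.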

To log in to a standard forms authentication website with HttpClient, you need to set a CookieContainer, because the authentication data is stored in cookies.

In your specific example, the login form makes a POST to any page over HTTPS; I used https://wttv.click-tt.de/cgi-bin/WebObjects/nuLigaTTDE.woa/wa/teamPortrait?teamtable=1673669&pageState=rueckrunde&championship=SK+Bez.+BB+13%2F14&group=204559 as an example. This is the code to make the request with HttpClient:

var baseAddress = new Uri("https://wttv.click-tt.de/");
var cookieContainer = new CookieContainer();
using (var handler = new HttpClientHandler() { CookieContainer = cookieContainer })
using (var client = new HttpClient(handler) { BaseAddress = baseAddress })
{
    // I usually make a standard request without authentication first, e.g. to the home page.
    // This request stores some initial cookie values that the server might check in the subsequent login request.
    // (await assumes this code runs inside an async method.)
    var homePageResult = await client.GetAsync("/");
    homePageResult.EnsureSuccessStatusCode();

    var content = new FormUrlEncodedContent(new[]
    {
        // The names of the form values must match the name attributes of the <input /> tags of the login form; in this case the tag is <input type="text" name="username">
        new KeyValuePair<string, string>("username", "username"),
        new KeyValuePair<string, string>("password", "password"),
    });
    var loginResult = await client.PostAsync("/cgi-bin/WebObjects/nuLigaTTDE.woa/wa/teamPortrait?teamtable=1673669&pageState=rueckrunde&championship=SK+Bez.+BB+13%2F14&group=204559", content);
    loginResult.EnsureSuccessStatusCode();

    // Make the subsequent web requests using the same HttpClient object
}
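
A successful status code does not by itself prove that the login worked, since many sites answer a failed login with a normal 200 page. The following is a minimal sketch, not a definitive implementation, of how the subsequent request might be made and checked. The class name `ProtectedPageFetcher`, the `protectedPath` parameter, and the heuristic "the page still contains a password field, so we are probably not logged in" are assumptions for illustration; the field name `password` is taken from the login form above.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

static class ProtectedPageFetcher
{
    // Downloads a page with the already logged-in client and parses it.
    // protectedPath is a placeholder for whatever page you actually need.
    public static async Task<HtmlDocument> FetchAsync(HttpClient client, string protectedPath)
    {
        // The client must be the same instance that performed the login,
        // so that its CookieContainer sends the session cookies along.
        string html = await client.GetStringAsync(protectedPath);

        var document = new HtmlDocument();
        document.LoadHtml(html);

        if (LooksLoggedOut(document))
        {
            throw new InvalidOperationException(
                "The page still shows a login form; the login probably failed.");
        }
        return document;
    }

    // Heuristic (an assumption, adjust to the real site): a page that still
    // contains <input name="password"> is treated as "not logged in".
    public static bool LooksLoggedOut(HtmlDocument document)
    {
        return document.DocumentNode.SelectSingleNode("//input[@name='password']") != null;
    }
}
```

Inside the `using` block above this would be called as `var page = await ProtectedPageFetcher.FetchAsync(client, protectedPath);`, after which `page.DocumentNode` can be scanned in the same way as in the question.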

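However, many websites use form values loaded by JavaScript, or even CAPTCHA controls, and obviously this solution will not work for them. In that case the login can be done with a WebBrowser control (by automating the user input on the form fields and then clicking the login button; this link has an example: https://social.msdn.microsoft.com/Forums/vstudio/en-US/0b77ca8c-48ce-4fa8-9367-c7491aa359b0/yahoo-login-via-systemnetsockets-namespace?forum=vbgeneral ).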

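As a general rule, to check how the login works on the website you need, use Fiddler: http://www.telerik.com/fiddler . When you click the login button on the website, watch Fiddler and find the login request (it is usually the first request after clicking the login button, and usually a POST request).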

Then inspect the request data (select the request and go to the "Inspectors" - "TextView" tab) and try to replicate the request in your code.

In the left pane Fiddler lists all intercepted requests; in the right pane are the request and response inspectors (the request inspector at the top, the response inspector at the bottom).
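
When the request captured in Fiddler contains more fields than just the username and password (hidden inputs such as tokens are common), those fields have to be sent too. The following is a rough sketch of one way to replicate such a request. It assumes the extra values are plain hidden `<input>` fields already present in the HTML of the login page, and that the page contains only one form; `LoginFormHelper`, `ReadHiddenFields`, `PostLoginAsync` and `loginPath` are hypothetical names. It will not help when the values are generated by JavaScript.

```csharp
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

static class LoginFormHelper
{
    // Collects the name/value pairs of all hidden <input> fields in the HTML.
    public static Dictionary<string, string> ReadHiddenFields(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var fields = new Dictionary<string, string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//input[@type='hidden']");
        if (nodes == null)
        {
            return fields;
        }

        foreach (HtmlNode node in nodes)
        {
            string name = node.GetAttributeValue("name", "");
            if (name.Length > 0)
            {
                // Decode HTML entities such as &amp; before sending the value back.
                fields[name] = HtmlEntity.DeEntitize(node.GetAttributeValue("value", ""));
            }
        }
        return fields;
    }

    // Loads the login page, then posts its hidden fields together with the credentials.
    // "username" and "password" are the field names used in the answer above.
    public static async Task<HttpResponseMessage> PostLoginAsync(
        HttpClient client, string loginPath, string username, string password)
    {
        string loginPageHtml = await client.GetStringAsync(loginPath);

        Dictionary<string, string> fields = ReadHiddenFields(loginPageHtml);
        fields["username"] = username;
        fields["password"] = password;

        HttpResponseMessage response =
            await client.PostAsync(loginPath, new FormUrlEncodedContent(fields));
        response.EnsureSuccessStatusCode();
        return response;
    }
}
```

Comparing the body this code sends with the one shown in Fiddler's TextView tab is a quick way to see which fields are still missing.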

Edit

The same code with the old WebRequest class: http://rextester.com/LLP86817

var cookieContainer = new CookieContainer();

HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create("https://wttv.click-tt.de/");
request.CookieContainer = cookieContainer;
//set the user agent and accept header values, to simulate a real web browser
request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";


//SET AUTOMATIC DECOMPRESSION
request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;

Console.WriteLine("FIRST RESPONSE");
Console.WriteLine();
using (WebResponse response = request.GetResponse())
{
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        Console.WriteLine(sr.ReadToEnd());
    }
}

request = (HttpWebRequest)HttpWebRequest.Create("https://wttv.click-tt.de/cgi-bin/WebObjects/nuLigaTTDE.woa/wa/teamPortrait?teamtable=1673669&pageState=rueckrunde&championship=SK+Bez.+BB+13%2F14&group=204559");
//set the cookie container object
request.CookieContainer = cookieContainer;
request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";

//set method POST and content type application/x-www-form-urlencoded
request.Method = "POST";
request.ContentType = "application/x-www-form-urlencoded";

//SET AUTOMATIC DECOMPRESSION
request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;

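//insert your username and password; escape them so that characters such as & or = do not break the form body
string data = string.Format("username={0}&password={1}", Uri.EscapeDataString("username"), Uri.EscapeDataString("password"));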
byte[] bytes = System.Text.Encoding.UTF8.GetBytes(data);

request.ContentLength = bytes.Length;

using (Stream dataStream = request.GetRequestStream())
{
    dataStream.Write(bytes, 0, bytes.Length);
    dataStream.Close();
}

Console.WriteLine("LOGIN RESPONSE");
Console.WriteLine();
using (WebResponse response = request.GetResponse())
{
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        Console.WriteLine(sr.ReadToEnd());
    }
}

//request = (HttpWebRequest)HttpWebRequest.Create("INTERNAL PROTECTED PAGE ADDRESS");
//After a successful login, you must use the same cookie container for all requests
//request.CookieContainer = cookieContainer;

//....



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow