Waiting for AJAX with HtmlAgilityPack in Xamarin

ajax c# html-agility-pack xamarin

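I asked a similar question before, but this one is a bit different. I am trying to scrape data from this website, but the problem is that it seems to be loaded with AJAX, because my application cannot find the ids and classes I am looking for in the HTML.

You can reproduce the problem by inspecting an element or viewing the source. When viewing the source I see far less content than when inspecting an element.

I thought I could track down the file that uses AJAX to load this HTML by pressing F12, going to the Network tab and selecting XHR, but I was unable to find it.

My question is: how can I retrieve this data, or find out which file is used to collect it?

My code sample (I am unable to find Timetable_toolbar_elementSelect_popup0):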

private async Task GetHtmlDocument(string url)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    //request.Credentials = new LoginCredentials().Credentials;

    try
    {
        // Dispose the response once the document has been loaded.
        using (WebResponse myResponse = await request.GetResponseAsync())
        {
            HtmlDocument htmlDoc = new HtmlDocument();
            htmlDoc.OptionFixNestedTags = true;
            htmlDoc.Load(myResponse.GetResponseStream());
            // test is null here: the element is not in the HTML returned by the server.
            var test = htmlDoc.GetElementbyId("Timetable_toolbar_elementSelect_popup0");
        }
    }
    catch (Exception e)
    {
        // Log instead of silently swallowing the error.
        Console.WriteLine(e.Message);
    }
}

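Accepted answer

Solution that calls the ajax method with a web request.

So I got bored and figured most of it out. What is missing below is how to identify a klas by its id. The example below fetches the klas '1GLD'. We need the cookie so that the request knows which school we are fetching the klas from. Also, the code below returns only JSON, not HTML, because it is the ajax method we are calling.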

CookieContainer cookies = new CookieContainer();
try
{
    string webAddr = "https://roosters.windesheim.nl/";
    var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr);
    httpWebRequest.ContentType = "application/json; charset=utf-8";
    httpWebRequest.Method = "POST";
    httpWebRequest.CookieContainer = cookies;        
    httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    httpWebRequest.Headers.Add("X-Requested-With", "XMLHttpRequest");

    var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse();
    using (var streamReader = new StreamReader(httpResponse.GetResponseStream()))
    {
        cookies.Add(httpWebRequest.CookieContainer.GetCookies(httpWebRequest.RequestUri));
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}

//According to my web debugger the cookie lasts until the 10th of December, so a new cookie has to be fetched after that.
//I noticed the site appends a unix timestamp to the url, so we add one to the end of the url for each request.
long unixTimeStamp = new DateTimeOffset(DateTime.Now).ToUnixTimeMilliseconds() - 100;

//we are now ready to call the ajax method and get the JSON.
try
{
    string webAddr = "https://roosters.windesheim.nl/WebUntis/Timetable.do?request.preventCache="+unixTimeStamp.ToString();
    var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr);
    httpWebRequest.ContentType = "application/x-www-form-urlencoded; charset=utf-8";
    httpWebRequest.Method = "POST";
    httpWebRequest.CookieContainer = cookies;
    httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    httpWebRequest.Headers.Add("X-Requested-With", "XMLHttpRequest");

    using (var streamWriter = new StreamWriter(httpWebRequest.GetRequestStream()))
    {
        string json = "ajaxCommand=getWeeklyTimetable&elementType=1&elementId=13090&date=20171126&formatId=7&departmentId=0&filterId=-2";

        //The command below will return a JSON datastructure containing all the klases and their relevant ID.
        //string otherJson = "ajaxCommand=getPageConfig&type=1&filter=-2"


        streamWriter.Write(json);
        streamWriter.Flush();
    }


    var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse();
    using (var streamReader = new StreamReader(httpResponse.GetResponseStream()))
    {
        var responseText = streamReader.ReadToEnd();
        //The result is printed here.
        Console.Write(responseText);
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}
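
The request URL and form body above are plain strings, so they can be built by a small helper. The following is a minimal sketch, not part of the original answer: the class and method names are hypothetical, and the yyyyMMdd date format is an assumption inferred from the value 20171126.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper: builds the URL and form body used in the answer above.
public static class TimetableRequest
{
    private const string Endpoint = "https://roosters.windesheim.nl/WebUntis/Timetable.do";

    // Appends a unix timestamp in milliseconds as the cache-busting parameter.
    public static string BuildUrl(DateTimeOffset now)
    {
        return Endpoint + "?request.preventCache="
            + now.ToUnixTimeMilliseconds().ToString(CultureInfo.InvariantCulture);
    }

    // Percent-encodes key/value pairs and joins them as key=value&key=value.
    public static string BuildFormBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields.Select(f =>
            Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value)));
    }

    // Assumption: the date parameter is yyyyMMdd, inferred from "20171126".
    // The other values are copied unchanged from the request in the answer.
    public static string WeeklyTimetableBody(int elementId, DateTime date)
    {
        var fields = new[]
        {
            new KeyValuePair<string, string>("ajaxCommand", "getWeeklyTimetable"),
            new KeyValuePair<string, string>("elementType", "1"),
            new KeyValuePair<string, string>("elementId", elementId.ToString(CultureInfo.InvariantCulture)),
            new KeyValuePair<string, string>("date", date.ToString("yyyyMMdd", CultureInfo.InvariantCulture)),
            new KeyValuePair<string, string>("formatId", "7"),
            new KeyValuePair<string, string>("departmentId", "0"),
            new KeyValuePair<string, string>("filterId", "-2"),
        };
        return BuildFormBody(fields);
    }
}
```

Calling TimetableRequest.WeeklyTimetableBody(13090, new DateTime(2017, 11, 26)) reproduces the body string written to the request stream above, and BuildUrl(DateTimeOffset.UtcNow) gives the address to pass to WebRequest.Create.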

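Alternative solution using Selenium and the Firefox driver.

This was easier to do, but it also takes some time to run. It gives you HTML to work with instead of JSON, as you asked. Not all of the Thread.Sleep calls are necessary, but I found them to be necessary in the last foreach loop.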

public static void Main(string[] args)
{
    HtmlDocument doc = new HtmlDocument();
    //According to my web debugger the cookie lasts until the 10th of December, so a new cookie has to be fetched after that.
    //I noticed the site appends a unix timestamp to the url, so we add one to the end of the url for each request.
    long unixTimeStamp = new DateTimeOffset(DateTime.Now).ToUnixTimeMilliseconds() - 100;
    string webAddr = "https://roosters.windesheim.nl/WebUntis/Timetable.do?request.preventCache="+unixTimeStamp.ToString();
    var ffOptions = new FirefoxOptions();
    ffOptions.BrowserExecutableLocation = @"C:\Program Files (x86)\Mozilla Firefox\firefox.exe";
    ffOptions.LogLevel = FirefoxDriverLogLevel.Default;
    ffOptions.Profile = new FirefoxProfile { AcceptUntrustedCertificates = true };
    var service = FirefoxDriverService.CreateDefaultService();

    var driver = new FirefoxDriver(service, ffOptions, TimeSpan.FromSeconds(120));


    driver.Navigate().GoToUrl(webAddr);


    driver.FindElement(By.XPath("//input[@id='school']")).SendKeys("Windesheim"+Keys.Enter);
    Thread.Sleep(2000);
    driver.FindElement(By.XPath("//span[@id='dijit_PopupMenuBarItem_0_text' and text() ='Lesrooster']")).Click();

    driver.FindElement(By.XPath("//td[@id='dijit_MenuItem_0_text' and text() ='Klassen']")).Click();
    Thread.Sleep(2000);

    driver.FindElement(By.XPath("//div[@id='widget_Timetable_toolbar_elementSelect']//input[@class='dijitReset dijitInputField dijitArrowButtonInner']")).Click();

    //we get all the options for Klase
    doc.LoadHtml(driver.PageSource);
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@id='Timetable_toolbar_elementSelect_popup']/div[@item]");
    List<String> options = new List<String>();
    foreach (HtmlNode n in nodes)
    {
        options.Add(n.InnerText);
    }

    foreach(string s in options)
    {
        driver.FindElement(By.XPath("//input[@id='Timetable_toolbar_elementSelect']")).Clear();
        driver.FindElement(By.XPath("//input[@id='Timetable_toolbar_elementSelect']")).SendKeys(s);
        Thread.Sleep(2000);
        driver.FindElement(By.XPath("//body")).SendKeys(Keys.Enter);
        Thread.Sleep(2000);
        doc.LoadHtml(driver.PageSource);
        //Console.WriteLine(driver.Url); //Now we can see the id of the current Klase
    }

    Console.WriteLine(doc.DocumentNode.InnerHtml);

    Console.ReadKey();
}
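
The fixed Thread.Sleep(2000) calls above wait the full two seconds even when the page is ready sooner. One alternative is to poll for a condition with a timeout. The following is a minimal sketch of such a helper, not part of the original answer; the class name Poll and its method are hypothetical. Selenium's support package also ships its own WebDriverWait class, which serves the same purpose.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper: waits for a condition instead of sleeping a fixed time.
public static class Poll
{
    // Evaluates condition repeatedly until it returns true or the timeout elapses.
    // Returns true if the condition was met, false if the timeout was reached first.
    public static bool Until(Func<bool> condition, TimeSpan timeout, TimeSpan interval)
    {
        var clock = Stopwatch.StartNew();
        while (true)
        {
            if (condition())
            {
                return true;
            }
            if (clock.Elapsed >= timeout)
            {
                return false;
            }
            Thread.Sleep(interval);
        }
    }
}
```

With the driver from the code above, a call such as Poll.Until(() => driver.FindElements(By.Id("Timetable_toolbar_elementSelect")).Count > 0, TimeSpan.FromSeconds(10), TimeSpan.FromMilliseconds(200)) could replace a fixed sleep before the element is used.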

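Final update

Using the Selenium solution I was able to get the IDs of all the klassen. I have included the file here, so you can use it together with the ajax and web request approach.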


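Popular answer

I was going to post this as a comment, but it is too long and the formatting would be too bad. So here we go.

First of all, the site is updated dynamically by javascript that calls ajax commands.

If you can open a session and store the cookie that contains the SESSIONID and the now "encrypted" school name, then you can call the ajax commands.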

    https://roosters.windesheim.nl/ajaxCommand=getWeeklyTimetable&elementType=1&elementId=13090&date=20171126&formatId=7&departmentId=0&filterId=-2

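However, this does require that you know what elementType is and what elementId is.

In this case the elementId (13090) refers to the Klas 1GLD, and the formatId (7) refers to the Roosterformaat "Beknopt". You will have to figure out what the remaining variables do. More importantly, if you do manage to send a valid ajax command to the server, you will not get HTML in response; you will receive the data as JSON.

The easiest way to do what you want is to put all the klassen in a separate file and use it as a reference point. The same goes for the other options.

Then use a headless browser such as phantomjs.org. That way you can find and click the klas you want to scrape, load the HTML into an HtmlAgilityPack.HtmlDocument, and do what you need to do. Selenium / PhantomJS will keep track of your cookies. This approach is slower, but easier to do.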
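
A reference file like that can be loaded into a dictionary that maps a klas name to its elementId. The following is a minimal sketch under stated assumptions: the "name=id" line format and the KlasLookup class are hypothetical, and the only pairing taken from this answer is 1GLD with 13090.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: reads "name=id" lines into a name -> elementId lookup.
public static class KlasLookup
{
    public static Dictionary<string, int> Parse(IEnumerable<string> lines)
    {
        // Klas names are matched case-insensitively.
        var map = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (string raw in lines)
        {
            string line = raw.Trim();
            // Skip blank lines and lines starting with '#'.
            if (line.Length == 0 || line[0] == '#')
            {
                continue;
            }
            int sep = line.IndexOf('=');
            if (sep <= 0)
            {
                continue; // no name before '=' or no '=' at all
            }
            string name = line.Substring(0, sep).Trim();
            if (int.TryParse(line.Substring(sep + 1).Trim(), out int id))
            {
                map[name] = id;
            }
        }
        return map;
    }
}
```

For example, KlasLookup.Parse(File.ReadAllLines("klassen.txt")) would read such a file from disk (the file name is a placeholder), after which the elementId for a request can be looked up by klas name.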

Edit: storing cookies from a webrequest - the simple way.

I am not an expert on this, but the OP asked. If anyone has a better way, please edit.

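// doc is used further down but was never declared in the original snippet.
HtmlDocument doc = new HtmlDocument();
CookieContainer cookies = new CookieContainer();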
try
{
    string webAddr = "https://roosters.windesheim.nl/WebUntis/";

    var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr);
    httpWebRequest.ContentType = "application/json; charset=utf-8";
    httpWebRequest.Method = "POST";
    httpWebRequest.CookieContainer = cookies;

    httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    httpWebRequest.Headers.Add("X-Requested-With", "XMLHttpRequest");
    using (var streamWriter = new StreamWriter(httpWebRequest.GetRequestStream()))
    {
        string json = "ajaxCommand=getWeeklyTimetable&elementType=1&elementId=13092&date=20171126&formatId=7&departmentId=0&filterId=-2";

        streamWriter.Write(json);
        streamWriter.Flush();
    }


    var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse();
    using (var streamReader = new StreamReader(httpResponse.GetResponseStream()))
    {
        cookies.Add(httpWebRequest.CookieContainer.GetCookies(httpWebRequest.RequestUri));
        //cookies.Add(httpResponse.Cookies);
        var responseText = streamReader.ReadToEnd();
        doc.LoadHtml(responseText);
        foreach(Cookie c in httpResponse.Cookies)
        {
            Console.WriteLine(c.ToString());
        } 
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}
Console.WriteLine(doc.DocumentNode.InnerHtml);

Console.ReadKey();
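
To check what a CookieContainer will actually send on the next request, you can ask it for the Cookie header of a given URL. The following is a minimal sketch, not part of the original answer; the class name CookieInspector is hypothetical.

```csharp
using System;
using System.Net;

// Hypothetical helper: shows which cookies a container would send to a URL.
public static class CookieInspector
{
    // Returns the value of the Cookie header that would be sent to the URL.
    public static string HeaderFor(CookieContainer cookies, string url)
    {
        return cookies.GetCookieHeader(new Uri(url));
    }
}
```

After the first request above has filled the container, CookieInspector.HeaderFor(cookies, "https://roosters.windesheim.nl/WebUntis/Timetable.do") shows whether the session cookie will be sent along with the ajax call.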



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow