How to parse a FORM from a WebResponse into a WebRequest's POST body

c# html-agility-pack webrequest webresponse web-scraping

Question

The goal is to write a transaction in C# that uses WebRequest/WebResponse to walk through the page flow of a web app. I'm new to this; it is my first attempt. The challenge is building a dynamic POST body and POST URL for each WebRequest from the name/value pairs in the preceding WebResponse. The Request/Response mechanism itself works, cookies and all: I can complete a transaction when the POST URLs and POST bodies are hardcoded. Only the initial WebRequest has a static URL and a "hardcoded" body; every subsequent Request has to be created from the FORM value pairs of the Response before it. This is a portion of the FORM in the Response:

<form id="expressform" method="post" action="">
<div>
    <input type="hidden" name="ScreenData.widgets.modified" value=""/>
    <input type="hidden" name="ScreenData.header.hidden.name" value="ScreenData.widgets.modified"/>
    <input type="hidden" name="ScreenData.marshalled" value="true"/>
    <input type="hidden" name="ScreenData.header.hidden.name" value="ScreenData.marshalled"/>
    <input type="hidden" name="isCreateAccountWizard" value="true"/>
    <input type="hidden" name="ScreenData.header.hidden.name" value="isCreateAccountWizard"/>
    <input type="hidden" name="versionPoint" value="77777"/>

and then a few text fields for entering values, like these:

<tr>
    <td class="dataOut" style="padding-left:30px">
        <textarea name="ScreenData.sicInfo.natureOfBusiness" rows="5"  cols="60" class="dataOut" onmouseup="textAreaCounter(this,250);;" onkeypress="textAreaCounter(this,250);;" onkeyup="textAreaCounter(this,250);;" onchange="markDataDirty(this);;"></textarea> 
    </td>
</tr>

and the URL is in the href of the Submit link:

 <a class="detailBtnOn" href="javascript:submitForm('express?displayAction=CreateAccountWizard&amp;saveAction=SaveCreateSICCode&amp;flow=forward&amp;saveActionToken=84454A7D-50FE-5856-CE17-916B70EDFE1A&amp;flowToken=CF3827F4-1DE7-54B1-D87B-D72F01C454C3')">Submit</a>

The next WebRequest should then carry the following in its POST body:

ScreenData.widgets.modified=&ScreenData.header.hidden.name=ScreenData.widgets.modified&ScreenData.marshalled=true&ScreenData.header.hidden.name=ScreenData.marshalled&isCreateAccountWizard=true&ScreenData.header.hidden.name=isCreateAccountWizard&versionPoint=77777&ScreenData.commonHeaderInfo.accountName=SomeAccountName&ScreenData.commonHeaderInfo.effectiveDate=08%2F01%2F2011&ScreenData.sicInfo.natureOfBusiness=business&ScreenData.sicInfo.sic=7777&ScreenData.widgets.modified=ScreenData.sicInfo.natureOfBusiness&ScreenData.widgets.modified=ScreenData.sicInfo.sic

and the following URL:

express?displayAction=CreateAccountWizard&saveAction=SaveCreateSICCode&flow=forward&saveActionToken=84454A7D-50FE-5856-CE17-916B70EDFE1A&flowToken=CF3827F4-1DE7-54B1-D87B-D72F01C454C3 
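The mapping described above can be sketched with the standard library alone. This is a minimal sketch, not a definitive implementation: `PostBuilder`, `BuildBody`, and `ExtractSubmitUrl` are hypothetical names introduced here, and the regular expression assumes the href always has the `javascript:submitForm('...')` shape shown in the example. It also assumes the server accepts standard `application/x-www-form-urlencoded` encoding.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

static class PostBuilder
{
    // Encodes name/value pairs as application/x-www-form-urlencoded.
    // A list of pairs is used instead of a dictionary because the form
    // repeats names (e.g. ScreenData.header.hidden.name) and order matters.
    public static string BuildBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields.Select(f =>
            WebUtility.UrlEncode(f.Key) + "=" + WebUtility.UrlEncode(f.Value)));
    }

    // Pulls the relative URL out of href="javascript:submitForm('...')"
    // and turns HTML entities such as &amp; back into plain characters.
    // Returns null when the href does not have that shape.
    public static string ExtractSubmitUrl(string href)
    {
        var m = Regex.Match(href, @"submitForm\('([^']*)'\)");
        return m.Success ? WebUtility.HtmlDecode(m.Groups[1].Value) : null;
    }
}
```

For instance, passing the pair `ScreenData.commonHeaderInfo.effectiveDate` / `08/01/2011` to `BuildBody` yields `ScreenData.commonHeaderInfo.effectiveDate=08%2F01%2F2011`, matching the expected body. The trailing `ScreenData.widgets.modified=...` entries in that body are probably added by the page's `markDataDirty` script rather than by the form markup. That is an inference from the `onchange` handler, so they would likely have to be appended by hand for each field that is filled in.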

Beyond building that parsing engine, I can't even extract the value pairs from the FORM yet. I'm trying to use HtmlAgilityPack. The following should at the very least print out the "essential" text of each FORM:

var page = new HtmlDocument();
page.OptionReadEncoding = false;
var stream = HttpWResponse.GetResponseStream(); 
page.Load(stream);
foreach (var f in page.DocumentNode.Descendants("form"))
{
    foreach (var d in page.DocumentNode.Descendants("div"))
    {
        Loggers.EventsLogger.Info("");
        Loggers.EventsLogger.Info((f.GetAttributeValue("name", null) ?? f.GetAttributeValue("id", "<no name>")) + ": ");
        Loggers.EventsLogger.Info("");
        Loggers.EventsLogger.Info(f.GetAttributeValue("method", "<no method>") + ' ');
        Loggers.EventsLogger.Info("");
        Loggers.EventsLogger.Info(f.GetAttributeValue("action", "<no action>"));

        foreach(var i in f.Descendants("input"))//{

        {
            Loggers.EventsLogger.Info("");
            Loggers.EventsLogger.Info('\t' + (i.GetAttributeValue("name", null) ?? f.GetAttributeValue("id", "<no name>")));
            Loggers.EventsLogger.Info("");
            Loggers.EventsLogger.Info(" (");
            Loggers.EventsLogger.Info("");
            Loggers.EventsLogger.Info(i.GetAttributeValue("type", "<no type>"));
            Loggers.EventsLogger.Info("");
            Loggers.EventsLogger.Info("): " + i.GetAttributeValue("value", "<no value>"));
        }
        Loggers.EventsLogger.Info("");
        Loggers.EventsLogger.Info("");
    }
}

But all that is printed is this:

INFO  EventsLogger - 
INFO  EventsLogger - expressform: 
INFO  EventsLogger - 
INFO  EventsLogger - post 

(If the "div" bit, foreach (var d in page.DocumentNode.Descendants("div")), is removed, nothing happens at all.)


Any advice on what the FORM printout parser is doing wrong, or on how to build a parsing engine that constructs requests from responses, would be highly appreciated.

8/5/2011 3:47:52 AM

Popular Answer

Check out these zzz-5, zzz-9, zzz-13, and zzz-17 for more information.

EDIT: Additional details:

You use a foreach loop to iterate over the forms in the HTML page, but the nested foreach ignores the current form and iterates over every DIV in the document instead. The inner loop(s) need to be scoped to the current node, something like:

foreach (var d in f.SelectNodes(".//div"))

and

foreach (var i in d.SelectNodes(".//input"))
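Put together, a scoped extraction could look like the sketch below. It is a hedged illustration rather than a tested fix for the page in the question: `FormFields` and `Collect` are hypothetical names, and the `ElementsFlags` line rests on the assumption that the installed HtmlAgilityPack version flags `<form>` as an empty element, which would leave the form node without children and may explain why the question's inner loop printed nothing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class FormFields
{
    // Returns the name/value pairs of every named <input> and <textarea>
    // inside the form with the given id, in document order.
    public static List<KeyValuePair<string, string>> Collect(string html, string formId)
    {
        // Assumption: some HtmlAgilityPack versions treat <form> as an
        // empty element. Removing the flag before loading lets the form
        // keep its children; the call does nothing if the key is absent.
        HtmlNode.ElementsFlags.Remove("form");

        var page = new HtmlDocument();
        page.LoadHtml(html);

        var pairs = new List<KeyValuePair<string, string>>();
        var form = page.DocumentNode.Descendants("form")
            .FirstOrDefault(f => f.GetAttributeValue("id", "") == formId);
        if (form == null) return pairs;

        // Search below the current form, not below page.DocumentNode.
        foreach (var field in form.Descendants()
                     .Where(n => n.Name == "input" || n.Name == "textarea"))
        {
            var name = field.GetAttributeValue("name", "");
            if (name.Length == 0) continue; // unnamed controls are not submitted

            // A textarea carries its value as text content, not as an attribute.
            var value = field.Name == "textarea"
                ? field.InnerText
                : field.GetAttributeValue("value", "");
            pairs.Add(new KeyValuePair<string, string>(name, value));
        }
        return pairs;
    }
}
```

`Descendants` is used here instead of `SelectNodes` because `SelectNodes` returns null when nothing matches, which would throw inside a foreach. Calling `FormFields.Collect(html, "expressform")` on the response body should give the hidden inputs and the textarea as ordered pairs, ready to be URL-encoded into the next POST body.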
5/23/2017 11:48:56 AM


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow