How to parse a FORM from WebResponse into a POST body of a WebRequest

c# html-agility-pack webrequest webresponse web-scraping

Question

I'm new to this. The task at hand is to create a transaction in C# that navigates through the page flow of a web app via WebRequest/WebResponse. I have the Request/Response mechanism working, cookies and all: I can successfully execute a transaction with hardcoded values for the POST URLs and POST bodies. The difficulty is generating the POST body and POST URL for each WebRequest dynamically from the value pairs in the previous WebResponse. Essentially, once the flow is started with the first WebRequest, which always has the same static URL and "hardcoded" body, each following Request is built from the FORM value pairs of the previous Response. For example, here is part of the FORM that's in the Response:

    <form id="expressform" method="post" action="">
<div>
    <input type="hidden" name="ScreenData.widgets.modified" value=""/><input type="hidden" name="ScreenData.header.hidden.name" value="ScreenData.widgets.modified"/><input type="hidden" name="ScreenData.marshalled" value="true"/><input type="hidden" name="ScreenData.header.hidden.name" value="ScreenData.marshalled"/><input type="hidden" name="isCreateAccountWizard" value="true"/><input type="hidden" name="ScreenData.header.hidden.name" value="isCreateAccountWizard"/>
    <input type="hidden" name="versionPoint" value="77777"/>

and then some text areas in the form to submit values, like this:

<tr>
    <td class="dataOut" style="padding-left:30px">
        <textarea name="ScreenData.sicInfo.natureOfBusiness" rows="5"  cols="60" class="dataOut" onmouseup="textAreaCounter(this,250);;" onkeypress="textAreaCounter(this,250);;" onkeyup="textAreaCounter(this,250);;" onchange="markDataDirty(this);;"></textarea> 
    </td>
</tr>

and then on Submit there's the URL:

 <a class="detailBtnOn" href="javascript:submitForm('express?displayAction=CreateAccountWizard&amp;saveAction=SaveCreateSICCode&amp;flow=forward&amp;saveActionToken=84454A7D-50FE-5856-CE17-916B70EDFE1A&amp;flowToken=CF3827F4-1DE7-54B1-D87B-D72F01C454C3')">Submit</a>
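For reference, one way the target URL could be pulled out of that href is a small regex over the HTML-decoded attribute value. This is only a sketch using plain .NET calls (`WebUtility.HtmlDecode` and `Regex`); the class name `SubmitLink` and the method `ExtractUrl` are made up for illustration, and it assumes the link always has the `submitForm('...')` shape shown above.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class SubmitLink
{
    // Captures the single-quoted argument of submitForm('...').
    private static readonly Regex SubmitFormCall =
        new Regex(@"submitForm\(\s*'([^']*)'\s*\)", RegexOptions.Compiled);

    // Returns the relative URL inside the href, or "" if the href
    // does not contain a submitForm('...') call.
    public static string ExtractUrl(string href)
    {
        if (string.IsNullOrEmpty(href)) return "";

        // The raw attribute text still contains &amp; entities; decode them first.
        string decoded = WebUtility.HtmlDecode(href);

        Match match = SubmitFormCall.Match(decoded);
        return match.Success ? match.Groups[1].Value : "";
    }
}
```

The returned value is relative, so it would still need to be combined with the base address of the previous response, for example with `new Uri(baseUri, relativeUrl)`.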

And then the next WebRequest should have this in its POST body:

ScreenData.widgets.modified=&ScreenData.header.hidden.name=ScreenData.widgets.modified&ScreenData.marshalled=true&ScreenData.header.hidden.name=ScreenData.marshalled&isCreateAccountWizard=true&ScreenData.header.hidden.name=isCreateAccountWizard&versionPoint=77777&ScreenData.commonHeaderInfo.accountName=SomeAccountName&ScreenData.commonHeaderInfo.effectiveDate=08%2F01%2F2011&ScreenData.sicInfo.natureOfBusiness=business&ScreenData.sicInfo.sic=7777&ScreenData.widgets.modified=ScreenData.sicInfo.natureOfBusiness&ScreenData.widgets.modified=ScreenData.sicInfo.sic

and this as a URL:

express?displayAction=CreateAccountWizard&saveAction=SaveCreateSICCode&flow=forward&saveActionToken=84454A7D-50FE-5856-CE17-916B70EDFE1A&flowToken=CF3827F4-1DE7-54B1-D87B-D72F01C454C3 
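To illustrate the body format above: it is ordinary `application/x-www-form-urlencoded` data in which the same name can appear more than once, so an ordered list of pairs fits better than a dictionary. The sketch below assumes the pairs have already been collected; `FormBody` and `Encode` are hypothetical names, and `Uri.EscapeDataString` is used for the percent-encoding (it encodes a space as `%20` rather than `+`). The extra `ScreenData.widgets.modified` entries at the end of the body look like they are added by the page's JavaScript (`markDataDirty`), which is a guess on my part; this sketch does not reproduce that and would need them appended by the caller.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of any library.
public static class FormBody
{
    // Joins name/value pairs into name1=value1&name2=value2..., keeping
    // the original order and any repeated names.
    public static string Encode(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields.Select(f =>
            Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value ?? "")));
    }
}
```

Usage would look something like this:

```csharp
var fields = new List<KeyValuePair<string, string>>
{
    new KeyValuePair<string, string>("versionPoint", "77777"),
    new KeyValuePair<string, string>("ScreenData.commonHeaderInfo.effectiveDate", "08/01/2011"),
};
string body = FormBody.Encode(fields);
```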

Not only can I not figure out how to build this parsing engine, I can't even grab the value pairs from the FORM. I'm trying to use HtmlAgilityPack; here's a bit that should at least print out the FORM's "important" content:

var page = new HtmlDocument();
page.OptionReadEncoding = false;
var stream = HttpWResponse.GetResponseStream(); 
page.Load(stream);
foreach (var f in page.DocumentNode.Descendants("form"))
{
    foreach (var d in page.DocumentNode.Descendants("div"))
    {
        Loggers.EventsLogger.Info("");
        Loggers.EventsLogger.Info((f.GetAttributeValue("name", null) ?? f.GetAttributeValue("id", "<no name>")) + ": ");
        Loggers.EventsLogger.Info("");
        Loggers.EventsLogger.Info(f.GetAttributeValue("method", "<no method>") + ' ');
        Loggers.EventsLogger.Info("");
        Loggers.EventsLogger.Info(f.GetAttributeValue("action", "<no action>"));

        foreach(var i in f.Descendants("input"))//{

        {
            Loggers.EventsLogger.Info("");
            Loggers.EventsLogger.Info('\t' + (i.GetAttributeValue("name", null) ?? f.GetAttributeValue("id", "<no name>")));
            Loggers.EventsLogger.Info("");
            Loggers.EventsLogger.Info(" (");
            Loggers.EventsLogger.Info("");
            Loggers.EventsLogger.Info(i.GetAttributeValue("type", "<no type>"));
            Loggers.EventsLogger.Info("");
            Loggers.EventsLogger.Info("): " + i.GetAttributeValue("value", "<no value>"));
        }
        Loggers.EventsLogger.Info("");
        Loggers.EventsLogger.Info("");
    }
}

but it only prints out this:

INFO  EventsLogger - 
INFO  EventsLogger - expressform: 
INFO  EventsLogger - 
INFO  EventsLogger - post 

(If I get rid of the "div" bit, foreach (var d in page.DocumentNode.Descendants("div")), nothing changes.)


Any help or suggestions on what's going on with the FORM print out parser and how to build a parsing engine for building Requests from Responses would be greatly appreciated.

Popular Answer

Check out "Parsing HTML page with HtmlAgilityPack", http://refactoringaspnet.blogspot.com/2010/04/using-htmlagilitypack-to-get-and-post_19.html, http://htmlagilitypack.codeplex.com/discussions/247206, and "How would I get the inputs from a certain form with HtmlAgility Pack? Lang: C#.net".

EDIT - some more info:

You loop over the forms in the HTML document with foreach, but the next foreach goes after the DIVs of the whole document without referencing the current form. In the inner foreach loop(s) you need something similar to

foreach (var d in f.SelectNodes(".//div"))

and

foreach (var i in d.SelectNodes(".//input"))
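Putting that together, a rough sketch of collecting the name/value pairs from one form might look like the following. Two things here are assumptions rather than anything confirmed in the question: `FormReader` and `ReadFields` are made-up names, and the `HtmlNode.ElementsFlags.Remove("form")` line is a commonly suggested workaround for HtmlAgilityPack versions that treat `<form>` as an overlapping element, leaving the inputs as siblings of the form instead of its descendants. That may well be why the inputs never show up in the log output, but I have not verified it against the exact version in use. Note also that `SelectNodes` returns null, not an empty collection, when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class FormReader
{
    public static List<KeyValuePair<string, string>> ReadFields(string html, string formId)
    {
        // Assumption: lets <form> contain child nodes in versions that
        // flag it as overlapping; harmless if the flag is not present.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var fields = new List<KeyValuePair<string, string>>();

        HtmlNode form = doc.GetElementbyId(formId);
        if (form == null) return fields;

        // Relative XPath (leading dot) keeps the search inside this form.
        HtmlNodeCollection nodes = form.SelectNodes(".//input[@name] | .//textarea[@name]");
        if (nodes == null) return fields;

        foreach (HtmlNode node in nodes)
        {
            string name = node.GetAttributeValue("name", "");

            // A textarea carries its value as inner text, an input as an attribute.
            bool isTextArea = string.Equals(node.Name, "textarea", StringComparison.OrdinalIgnoreCase);
            string raw = isTextArea ? node.InnerText : node.GetAttributeValue("value", "");

            fields.Add(new KeyValuePair<string, string>(name, HtmlEntity.DeEntitize(raw)));
        }
        return fields;
    }
}
```

A list of pairs is used instead of a dictionary because the form in the question repeats the name ScreenData.header.hidden.name several times. Values entered by the "user" (the textarea contents, for instance) would still have to be overwritten in the list before the next Request is built.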


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow