HTTP Protocol violation when downloading webpage using HtmlAgilityPack

.net c# html-agility-pack system.net.webexception

Question

I'm trying to parse download pages from www.mediafire.com, but I very often get a System.Net.WebException with the following message when I try to load a page into an HtmlDocument:

The server committed a protocol violation. Section=ResponseStatusLine

This is my code:

HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = null;
string url = "http://www.mediafire.com/?abcdefghijkl"; // There are many different links

try
{
    doc = web.Load(url); // Of 30 links, usually only 10 load properly
}
catch (WebException)
{
    // The exception is swallowed here, so doc stays null for this link
}

Any ideas why only 10 of 30 links work (the links change every time, because my program is a "search engine") and how I can resolve the problem?

When I load those pages in my browser, everything works fine.


I've tried adding the following lines to my app.config, but that doesn't help either:

<system.net>
    <settings>
        <httpWebRequest useUnsafeHeaderParsing="true" />
    </settings>
</system.net>

Accepted Answer

This is not related to the Html Agility Pack directly, but rather to the underlying HTTP/socket layer. This error means the server is not sending back a correct HTTP status line.

The status line is defined in the HTTP RFC (RFC 2616, section 6), available here: http://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html

I quote:

The first line of a Response message is the Status-Line, consisting of the protocol version followed by a numeric status code and its associated textual phrase, with each element separated by SP characters. No CR or LF is allowed except in the final CRLF sequence.

   Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF
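To make the grammar above concrete, here is a minimal sketch of a check you could run against the first line found in the socket trace. It is only an illustration of the RFC 2616 rule, not how System.Net parses responses internally; the class and method names (StatusLineCheck, IsValidStatusLine) are hypothetical, and the input is assumed to have had its trailing CRLF removed already.

```csharp
using System;
using System.Text.RegularExpressions;

internal static class StatusLineCheck
{
    // HTTP-Version SP Status-Code SP Reason-Phrase, per RFC 2616 section 6.1.
    // \A and \z anchor the whole string, so a stray CR or LF anywhere fails the match.
    // The reason phrase may be empty, but the SP before it is still required.
    private static readonly Regex StatusLine =
        new Regex(@"\AHTTP/[0-9]+\.[0-9]+ [0-9]{3} [^\r\n]*\z", RegexOptions.CultureInvariant);

    // Returns true when 'line' (without its trailing CRLF) matches the grammar.
    public static bool IsValidStatusLine(string line)
    {
        return line != null && StatusLine.IsMatch(line);
    }
}
```

For example, IsValidStatusLine("HTTP/1.1 200 OK") returns true, while a line with a leading space or an embedded line break, such as " HTTP/1.1 200 OK" or "HTTP/1.1 200 OK\r\nfoo", returns false.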

To check this, you can enable socket tracing with a full hex dump in your app.config:

<configuration>
    <system.diagnostics>
        <sources>
            <source name="System.Net.Sockets" tracemode="includehex">
                <listeners>
                    <add name="System.Net.Sockets" type="System.Diagnostics.TextWriterTraceListener" initializeData="SocketTrace.log" />
                </listeners>
            </source>
        </sources>
        <switches>
            <add name="System.Net.Sockets" value="Verbose"/>
        </switches>
        <trace autoflush="true" />
    </system.diagnostics>
</configuration>

This will create a SocketTrace.log file in the current directory. Have a look in there; the protocol violation should be visible. You can post it here if it's not too big :-)

Unfortunately, if you don't own the server, there is not much more you can do (you already added the useUnsafeHeaderParsing setting, which is good) except fail gracefully in these cases.
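One way to "fail gracefully" is sketched below, assuming you want to retry a few times and then skip the link. The helper name SafeLoader.TryLoad and the retry count are hypothetical; WebException.Status and WebExceptionStatus.ServerProtocolViolation are the standard System.Net members that identify this particular error.

```csharp
using System;
using System.Net;

internal static class SafeLoader
{
    // Runs 'load' up to maxAttempts times. Only protocol violations are retried;
    // any other WebException is rethrown to the caller unchanged.
    public static bool TryLoad<T>(Func<T> load, int maxAttempts, out T result)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                result = load();
                return true;
            }
            catch (WebException ex)
            {
                if (ex.Status != WebExceptionStatus.ServerProtocolViolation)
                {
                    throw;
                }
                // Record the failure and fall through to the next attempt.
                Console.Error.WriteLine("Attempt {0} failed: {1}", attempt, ex.Message);
            }
        }

        // Every attempt hit a protocol violation: report failure instead of throwing.
        result = default(T);
        return false;
    }
}
```

With the question's variables, a call could look like SafeLoader.TryLoad(() => web.Load(url), 3, out doc); when it returns false, the link can be skipped or logged. Retrying is a guess at what helps here, since the server's behavior appears to be intermittent.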


Popular Answer

Setting the KeepAlive property to false fixed this issue for me. I am not sure whether HtmlAgilityPack exposes this property, so using WebClient is a workable alternative.

Instead of loading the URL directly with web.Load, download the HTML of the desired URL with a custom WebClient. In the custom WebClient, override the GetWebRequest method to set HttpWebRequest.KeepAlive = false. Then load the downloaded file with web.Load().

// CustomWebClient is the class defined below; web is an HtmlAgilityPack.HtmlWeb instance
CustomWebClient client = new CustomWebClient();
client.DownloadFile(searchURL, @"C:\index.html");
var doc = web.Load(@"C:\index.html");

Overriding GetWebRequest

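using System;
using System.Net;

namespace MyProject
{
    // WebClient that disables HTTP keep-alive on every request it creates
    internal class CustomWebClient : WebClient
    {
        protected override WebRequest GetWebRequest(Uri address)
        {
            WebRequest request = base.GetWebRequest(address);

            // Only HTTP(S) requests have a KeepAlive property
            HttpWebRequest httpRequest = request as HttpWebRequest;
            if (httpRequest != null)
            {
                httpRequest.KeepAlive = false;
            }
            return request;
        }
    }
}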
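If you would rather not write a temporary file to disk, a variation on the same idea is to download the page as a string and hand it to HtmlDocument.LoadHtml. This is only a sketch under a few assumptions: NoKeepAliveWebClient and PageLoader are hypothetical names (the client repeats the override above so the block compiles on its own), and the character encoding handling of WebClient.DownloadString may differ from what web.Load does.

```csharp
using System;
using System.Net;
using HtmlAgilityPack;

// Same override as above, repeated under a hypothetical name.
internal class NoKeepAliveWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        HttpWebRequest httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
        {
            httpRequest.KeepAlive = false;
        }
        return request;
    }
}

internal static class PageLoader
{
    // Parses an HTML string into an HtmlDocument without touching the network.
    public static HtmlDocument Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    // Downloads 'url' with keep-alive disabled and parses the result in memory.
    public static HtmlDocument LoadWithoutKeepAlive(string url)
    {
        using (var client = new NoKeepAliveWebClient())
        {
            string html = client.DownloadString(url);
            return Parse(html);
        }
    }
}
```

A call such as PageLoader.LoadWithoutKeepAlive(searchURL) would then replace the DownloadFile and web.Load pair. A WebException can still be thrown here, so the call belongs inside the same kind of try/catch as before.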



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow