I want to extract the title, description and keywords from a series of URLs.
I have this code:
WebClient x = new WebClient();
string pageSource = (x.DownloadString(url));
query.title = Regex.Match(pageSource, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;
But I do not want to download the whole page, because that is too time-consuming for a series of URLs. Is there any way to get this information without downloading the whole page?
I should mention that I get these URLs from the Google search results page by sending a query to Google.
You can request and download a partial response using HttpClient by specifying a Range header. You define how many bytes you want to download and read:
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

static void Main()
{
    Test().GetAwaiter().GetResult();
}

private static async Task Test()
{
    const string url = "http://google.com";
    const int bytesToRead = 2000;
    using (var httpclient = new HttpClient())
    {
        // Ask the server for only the first bytesToRead bytes.
        httpclient.DefaultRequestHeaders.Range = new RangeHeaderValue(0, bytesToRead);
        // ResponseHeadersRead returns as soon as the headers arrive, so the body is streamed rather than buffered in full.
        var response = await httpclient.GetAsync(url, HttpCompletionOption.ResponseHeadersRead);
        using (var stream = await response.Content.ReadAsStreamAsync())
        {
            // Read may return fewer bytes than requested, so loop until the buffer is full or the stream ends.
            var buffer = new byte[bytesToRead];
            int totalRead = 0, read;
            while (totalRead < buffer.Length &&
                   (read = await stream.ReadAsync(buffer, totalRead, buffer.Length - totalRead)) > 0)
            {
                totalRead += read;
            }
            var partialHtml = Encoding.UTF8.GetString(buffer, 0, totalRead);
            //extract required info from partial html
        }
    }
}
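Note that not every server honors the Range header; some ignore it and return 200 OK with the full body, which is why the read above is capped at bytesToRead either way. As a minimal sketch of the extraction step, reusing the regex approach from the question (the meta-tag patterns below are illustrative assumptions that expect name before content; for robust parsing you may prefer HtmlAgilityPack):

using System.Text.RegularExpressions;

// Returns the first capture group of the first match, or an empty string if nothing matches.
static string ExtractFirstGroup(string html, string pattern)
{
    var match = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
    return match.Success ? match.Groups[1].Value.Trim() : string.Empty;
}

// Usage with the partialHtml from above:
var title       = ExtractFirstGroup(partialHtml, @"<title\b[^>]*>([\s\S]*?)</title>");
var description = ExtractFirstGroup(partialHtml, @"<meta\s+name\s*=\s*[""']description[""']\s+content\s*=\s*[""']([^""']*)");
var keywords    = ExtractFirstGroup(partialHtml, @"<meta\s+name\s*=\s*[""']keywords[""']\s+content\s*=\s*[""']([^""']*)");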
The same result can be achieved using the "old" WebClient.
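A minimal sketch of that approach, assuming you subclass WebClient so you can call AddRange on the HttpWebRequest it creates (WebClient does not expose the Range header directly; the class name PartialWebClient is my own):

using System;
using System.Net;
using System.Text;

// Override GetWebRequest to set a byte range on the underlying HttpWebRequest.
class PartialWebClient : WebClient
{
    private readonly int _bytesToRead;

    public PartialWebClient(int bytesToRead)
    {
        _bytesToRead = bytesToRead;
    }

    protected override WebRequest GetWebRequest(Uri address)
    {
        var request = base.GetWebRequest(address);
        if (request is HttpWebRequest httpRequest)
        {
            // Request only the first _bytesToRead bytes of the response.
            httpRequest.AddRange(0, _bytesToRead);
        }
        return request;
    }
}

// Usage:
// using (var client = new PartialWebClient(2000))
// {
//     var partialBytes = client.DownloadData("http://google.com");
//     var partialHtml = Encoding.UTF8.GetString(partialBytes);
// }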