How to extract meta tags from a series of URLs without downloading the whole HTML in C#
c# html html-agility-pack metadata


I want to extract the title, description, and keywords of a series of URLs.
I have this code:

    WebClient x = new WebClient();
    string pageSource = x.DownloadString(url);
    query.title = Regex.Match(pageSource, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;

But I do not want to download the whole page, because that is very time-consuming for a series of URLs. Is there any way to get this information without downloading the whole page?
I should mention that I get these URLs from the Google search results page by sending a query to Google.

6/21/2016 6:00:40 AM

Popular Answer

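You can request a partial response with HttpClient by setting the Range header. Choose how many bytes you want to download and read only that much. Servers that do not support range requests answer with the full page (status 200 instead of 206), but the code below still reads only the first bytes from the response stream: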

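    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Text;
    using System.Threading.Tasks;

    static void Main()
    {
        Test().GetAwaiter().GetResult();
    }

    private static async Task Test()
    {
        const string url = "";
        const int bytesToRead = 2000;

        using (var httpclient = new HttpClient())
        {
            // Range bounds are inclusive: 0..bytesToRead-1 is exactly bytesToRead bytes.
            httpclient.DefaultRequestHeaders.Range = new RangeHeaderValue(0, bytesToRead - 1);
            // Return once the headers arrive instead of buffering the whole body.
            using (var response = await httpclient.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
            using (var stream = await response.Content.ReadAsStreamAsync())
            {
                var buffer = new byte[bytesToRead];
                int total = 0, read;
                // A single read can return fewer bytes than requested, so loop until full or end of stream.
                while (total < buffer.Length && (read = await stream.ReadAsync(buffer, total, buffer.Length - total)) > 0)
                    total += read;
                var partialHtml = Encoding.UTF8.GetString(buffer, 0, total);
                //extract required info from partial html
            }
        }
    }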
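
For the "extract required info" step, here is a minimal sketch of pulling the title, description, and keywords out of the partial HTML with regular expressions, in the same spirit as the regex in the question. The names `MetaExtractor`, `GetTitle`, and `GetMeta` are hypothetical, not part of any library. It assumes the tags of interest fall entirely within the downloaded bytes and that attribute values are quoted; a real HTML parser such as Html Agility Pack would be more robust on messy pages.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; regex-based, so it only handles simple, well-formed tags.
public static class MetaExtractor
{
    // <title ...>text</title>; if the download was cut off before </title>, nothing matches.
    private static readonly Regex TitleRegex = new Regex(
        @"<title\b[^>]*>\s*(?<Title>[\s\S]*?)\s*</title>", RegexOptions.IgnoreCase);

    // Any complete <meta ...> tag.
    private static readonly Regex MetaTagRegex = new Regex(
        @"<meta\b[^>]*>", RegexOptions.IgnoreCase);

    // Returns the decoded page title, or null when no complete <title> element is present.
    public static string GetTitle(string html)
    {
        var match = TitleRegex.Match(html);
        return match.Success ? WebUtility.HtmlDecode(match.Groups["Title"].Value) : null;
    }

    // Returns the content of <meta name="..."> (name compared case-insensitively), or null.
    public static string GetMeta(string html, string name)
    {
        foreach (Match tag in MetaTagRegex.Matches(html))
        {
            var tagName = GetAttribute(tag.Value, "name");
            if (string.Equals(tagName, name, StringComparison.OrdinalIgnoreCase))
                return GetAttribute(tag.Value, "content");
        }
        return null;
    }

    // Reads a quoted attribute value from a single tag; attribute order does not matter.
    private static string GetAttribute(string tag, string attribute)
    {
        var match = Regex.Match(
            tag,
            @"\b" + Regex.Escape(attribute) + @"\s*=\s*([""'])(?<v>.*?)\1",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return match.Success ? WebUtility.HtmlDecode(match.Groups["v"].Value) : null;
    }
}
```

With this in place, `MetaExtractor.GetTitle(partialHtml)`, `MetaExtractor.GetMeta(partialHtml, "description")`, and `MetaExtractor.GetMeta(partialHtml, "keywords")` could be called on the decoded buffer. If a value comes back null, the tag may simply lie beyond the bytes that were read, so increasing `bytesToRead` is worth trying.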

The same result can be achieved with the "old" WebClient.
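
One possible way to do that, sketched here as an assumption rather than the only approach, is to subclass WebClient and add the range to the underlying HttpWebRequest. `RangeWebClient` and its `lastByte` parameter are hypothetical names introduced for this example; `GetWebRequest` and `HttpWebRequest.AddRange` are the standard framework members.

```csharp
using System;
using System.Net;

// Hypothetical helper: a WebClient that asks only for bytes 0..lastByte of each resource.
public class RangeWebClient : WebClient
{
    private readonly int lastByte;

    public RangeWebClient(int lastByte)
    {
        this.lastByte = lastByte;
    }

    protected override WebRequest GetWebRequest(Uri address)
    {
        var request = base.GetWebRequest(address);

        // Only HTTP(S) requests support ranges; other schemes are passed through unchanged.
        var httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
        {
            // Inclusive range, so lastByte = 1999 requests the first 2000 bytes.
            httpRequest.AddRange(0, lastByte);
        }
        return request;
    }
}
```

Calling `DownloadString(url)` on an instance created with `new RangeWebClient(1999)` then returns only the first part of the page when the server honours the Range header. A server that ignores it will send the whole page, so this variant does not cap the download the way reading a fixed-size buffer from the stream does.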

6/22/2016 5:52:17 AM

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow