How to extract meta tags of a series of URLs without downloading whole html in c#

asp.net c# html html-agility-pack metadata

Question

I want to extract title , description and keywords of a seris of URLs
I have this code

 WebClient x = new WebClient();
 string  pageSource = (x.DownloadString(url));     
 query.title = Regex.Match(pageSource, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;

But I do not want to download whole page because It is so time consuming for a series of URLs. Is there any way to get get these information without downloading whole page?
I should mention that I get these URLs in google search result page buy sending query to google.

1
0
6/21/2016 6:00:40 AM

Popular Answer

You can request and download partial result using HttpClient by specifying range header. You can define the buffer length you want to download and read:

    static void Main()
    {
        Test().GetAwaiter().GetResult();
    }

    private static async Task Test()
    {
        const string url = "http://google.com";
        const int bytesToRead = 2000;

        using (var httpclient = new HttpClient())
        {
            httpclient.DefaultRequestHeaders.Range = new RangeHeaderValue(0, bytesToRead);

            var response = await httpclient.GetAsync(url, HttpCompletionOption.ResponseHeadersRead);

            using (var stream = await response.Content.ReadAsStreamAsync())
            {
                var buffer = new byte[bytesToRead];
                stream.Read(buffer, 0, buffer.Length);

                var partialHtml = Encoding.UTF8.GetString(buffer);
                //extract required info from partial html
            }
        }
    }

Same result could be achieved using "old" WebClient

2
6/22/2016 5:52:17 AM


Related Questions





Related

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow