HTML Agility Pack URL scraping: getting the full link

c# html-agility-pack url web-crawler web-scraping

Question

Hello, I'm scraping a website to collect all of the URLs on a page, using the HTML Agility Pack from NuGet. My code is below. However, the URLs in the output are only relative paths rather than full addresses such as http://www.foo/bar/foobar.com; all I get is "/foobar". Is it possible to modify the code below so that it returns the full URL of each link? Thanks!

static void Main(string[] args)
{
    List<string> linksToVisit = ParseLinks("https://www.facebook.com");
}

public static List<string> ParseLinks(string email)
{
    WebClient webClient = new WebClient();

    byte[] data = webClient.DownloadData(email);
    string download = Encoding.ASCII.GetString(data);

    HashSet<string> list = new HashSet<string>();

    var doc = new HtmlDocument();
    doc.LoadHtml(download);
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");

    foreach (var n in nodes)
    {
        string href = n.Attributes["href"].Value;
        list.Add(href);
    }
    return list.ToList();
}
1/3/2016 11:20:24 PM

Popular Answer

An href value can be either an absolute or a relative URL, so check which one you have. Load the value into a Uri and test whether it is relative. If it is, convert it to an absolute URL by resolving it against the URL of the page you crawled.

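using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using HtmlAgilityPack;

static void Main(string[] args)
{
    List<string> linksToVisit = ParseLinks("https://www.facebook.com");
}

public static List<string> ParseLinks(string urlToCrawl)
{
    string download;
    using (WebClient webClient = new WebClient())
    {
        byte[] data = webClient.DownloadData(urlToCrawl);
        // UTF-8 is assumed here; ASCII would garble any non-ASCII characters.
        download = Encoding.UTF8.GetString(data);
    }

    // A HashSet drops duplicate links.
    HashSet<string> list = new HashSet<string>();

    var doc = new HtmlDocument();
    doc.LoadHtml(download);
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");

    // SelectNodes returns null (not an empty collection) when nothing matches.
    if (nodes == null)
        return new List<string>();

    foreach (var n in nodes)
    {
        string href = n.Attributes["href"].Value;
        // Resolve relative hrefs against the URL of the crawled page.
        list.Add(GetAbsoluteUrlString(urlToCrawl, href));
    }
    return list.ToList();
}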

A helper function that converts a relative URL to an absolute one:

static string GetAbsoluteUrlString(string baseUrl, string url)
{
    var uri = new Uri(url, UriKind.RelativeOrAbsolute);
    if (!uri.IsAbsoluteUri)
        uri = new Uri(new Uri(baseUrl), uri);
    return uri.ToString();
}
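
A crawler will also run into hrefs that are not web pages (mailto:, javascript:) or that cannot be parsed at all, and the Uri constructor throws on the latter. Below is a minimal sketch of a more defensive variant, not part of the original answer. The class and method names (LinkResolver, TryGetAbsoluteHttpUrl) and the example.com URLs are made up for illustration, and the inline `out` declarations assume C# 7 or later.

```csharp
using System;

public static class LinkResolver
{
    // Hypothetical helper: resolves href against pageUrl and keeps only http/https results.
    public static bool TryGetAbsoluteHttpUrl(string pageUrl, string href, out string absoluteUrl)
    {
        absoluteUrl = null;

        // The page URL itself must be a valid absolute URL.
        if (!Uri.TryCreate(pageUrl, UriKind.Absolute, out Uri baseUri))
            return false;

        // Returns false instead of throwing when href cannot be parsed.
        // An href that is already absolute is kept as is.
        if (!Uri.TryCreate(baseUri, href, out Uri resolved))
            return false;

        // Skip mailto:, javascript:, and other non-web schemes.
        if (resolved.Scheme != Uri.UriSchemeHttp && resolved.Scheme != Uri.UriSchemeHttps)
            return false;

        absoluteUrl = resolved.AbsoluteUri;
        return true;
    }

    public static void Main()
    {
        const string page = "https://www.example.com/dir/page.html";
        string[] hrefs =
        {
            "/foobar", "other.html", "../up",
            "https://cdn.example.org/x", "mailto:someone@example.com"
        };

        foreach (string href in hrefs)
        {
            if (TryGetAbsoluteHttpUrl(page, href, out string url))
                Console.WriteLine(href + " -> " + url);
            else
                Console.WriteLine(href + " -> skipped");
        }
    }
}
```

With the page URL above, "/foobar" should resolve to https://www.example.com/foobar and "other.html" to https://www.example.com/dir/other.html, while the mailto: link is skipped. In ParseLinks you would then add a link to the set only when the helper returns true.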
1/5/2016 3:06:13 AM



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow