html agility pack url scraping-- getting full html link

c# html-agility-pack url web-crawler web-scraping

Question

Hi I am using html agility pack from the nuget packages in order to scrape a web page to get all of the urls on the page. The code is shown below. However the way it returns to me in the output the links are just extensions of the actual website but not the full url link like http://www.foo/bar/foobar.com. All I will get is "/foobar". Is there a way to get the full links of the url with the code below? Thanks!

static void Main(string[] args)
    {
        List<string> linksToVisit = ParseLinks("https://www.facebook.com");
    }

public static List<string> ParseLinks(string email)
    {

        WebClient webClient = new WebClient();

        byte[] data = webClient.DownloadData(email);
        string download = Encoding.ASCII.GetString(data);

        HashSet<string> list = new HashSet<string>();

        var doc = new HtmlDocument();
        doc.LoadHtml(download);
        HtmlNodeCollection nodes =    doc.DocumentNode.SelectNodes("//a[@href]");

            foreach (var n in nodes)
            {
                string href = n.Attributes["href"].Value;
                list.Add(href);
            }
        return list.ToList();
    }

Popular Answer

You can check the HREF value if it's relative URL or absolute. Load the link into a Uri and test whether it is relative If it relative convert it to absolute will be the way to go.

static void Main(string[] args)
    {
        List<string> linksToVisit = ParseLinks("https://www.facebook.com");
    }

public static List<string> ParseLinks(string urlToCrawl)
    {

        WebClient webClient = new WebClient();

        byte[] data = webClient.DownloadData(urlToCrawl);
        string download = Encoding.ASCII.GetString(data);

        HashSet<string> list = new HashSet<string>();

        var doc = new HtmlDocument();
        doc.LoadHtml(download);
        HtmlNodeCollection nodes =    doc.DocumentNode.SelectNodes("//a[@href]");

            foreach (var n in nodes)
            {
                string href = n.Attributes["href"].Value;
                list.Add(GetAbsoluteUrlString(urlToCrawl, href));
            }
        return list.ToList();
    }

Function to convert Relative URL to Absolute

static string GetAbsoluteUrlString(string baseUrl, string url)
{
    var uri = new Uri(url, UriKind.RelativeOrAbsolute);
    if (!uri.IsAbsoluteUri)
        uri = new Uri(new Uri(baseUrl), uri);
    return uri.ToString();
}



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why