html敏捷包url抓 - 獲得完整的HTML鏈接

c# html-agility-pack url web-crawler web-scraping

嗨,我正在使用nuget包中的html agility pack來刮取網頁以獲取頁面上的所有網址。代碼如下所示。然而,它在輸出中返回給我的方式鏈接只是實際網站的擴展,而不是像http://www.foo/bar/foobar.com這樣的完整網址鏈接。我得到的只是“/ foobar”。有沒有辦法通過下面的代碼獲取網址的完整鏈接?謝謝!

static void Main(string[] args)
    {
        List<string> linksToVisit = ParseLinks("https://www.facebook.com");
    }

public static List<string> ParseLinks(string email)
    {

        WebClient webClient = new WebClient();

        byte[] data = webClient.DownloadData(email);
        string download = Encoding.ASCII.GetString(data);

        HashSet<string> list = new HashSet<string>();

        var doc = new HtmlDocument();
        doc.LoadHtml(download);
        HtmlNodeCollection nodes =    doc.DocumentNode.SelectNodes("//a[@href]");

            foreach (var n in nodes)
            {
                string href = n.Attributes["href"].Value;
                list.Add(href);
            }
        return list.ToList();
    }

熱門答案

您可以檢查HREF值,如果它是相對URL或絕對值。將鏈接加載到Uri並測試它是否是相對的如果它相對轉換為絕對將是要走的路。

static void Main(string[] args)
    {
        List<string> linksToVisit = ParseLinks("https://www.facebook.com");
    }

public static List<string> ParseLinks(string urlToCrawl)
    {

        WebClient webClient = new WebClient();

        byte[] data = webClient.DownloadData(urlToCrawl);
        string download = Encoding.ASCII.GetString(data);

        HashSet<string> list = new HashSet<string>();

        var doc = new HtmlDocument();
        doc.LoadHtml(download);
        HtmlNodeCollection nodes =    doc.DocumentNode.SelectNodes("//a[@href]");

            foreach (var n in nodes)
            {
                string href = n.Attributes["href"].Value;
                list.Add(GetAbsoluteUrlString(urlToCrawl, href));
            }
        return list.ToList();
    }

將相對URL轉換為絕對的函數

static string GetAbsoluteUrlString(string baseUrl, string url)
{
    var uri = new Uri(url, UriKind.RelativeOrAbsolute);
    if (!uri.IsAbsoluteUri)
        uri = new Uri(new Uri(baseUrl), uri);
    return uri.ToString();
}


Related

許可下: CC-BY-SA with attribution
不隸屬於 Stack Overflow
許可下: CC-BY-SA with attribution
不隸屬於 Stack Overflow