html敏捷包url抓 - 获得完整的HTML链接

c# html-agility-pack url web-crawler web-scraping

嗨,我正在使用nuget包中的html agility pack来刮取网页以获取页面上的所有网址。代码如下所示。然而,它在输出中返回给我的方式链接只是实际网站的扩展,而不是像http://www.foo/bar/foobar.com这样的完整网址链接。我得到的只是“/ foobar”。有没有办法通过下面的代码获取网址的完整链接?谢谢!

static void Main(string[] args)
    {
        List<string> linksToVisit = ParseLinks("https://www.facebook.com");
    }

public static List<string> ParseLinks(string email)
    {

        WebClient webClient = new WebClient();

        byte[] data = webClient.DownloadData(email);
        string download = Encoding.ASCII.GetString(data);

        HashSet<string> list = new HashSet<string>();

        var doc = new HtmlDocument();
        doc.LoadHtml(download);
        HtmlNodeCollection nodes =    doc.DocumentNode.SelectNodes("//a[@href]");

            foreach (var n in nodes)
            {
                string href = n.Attributes["href"].Value;
                list.Add(href);
            }
        return list.ToList();
    }

热门答案

您可以检查HREF值,如果它是相对URL或绝对值。将链接加载到Uri并测试它是否是相对的如果它相对转换为绝对将是要走的路。

static void Main(string[] args)
    {
        List<string> linksToVisit = ParseLinks("https://www.facebook.com");
    }

public static List<string> ParseLinks(string urlToCrawl)
    {

        WebClient webClient = new WebClient();

        byte[] data = webClient.DownloadData(urlToCrawl);
        string download = Encoding.ASCII.GetString(data);

        HashSet<string> list = new HashSet<string>();

        var doc = new HtmlDocument();
        doc.LoadHtml(download);
        HtmlNodeCollection nodes =    doc.DocumentNode.SelectNodes("//a[@href]");

            foreach (var n in nodes)
            {
                string href = n.Attributes["href"].Value;
                list.Add(GetAbsoluteUrlString(urlToCrawl, href));
            }
        return list.ToList();
    }

将相对URL转换为绝对的函数

static string GetAbsoluteUrlString(string baseUrl, string url)
{
    var uri = new Uri(url, UriKind.RelativeOrAbsolute);
    if (!uri.IsAbsoluteUri)
        uri = new Uri(new Uri(baseUrl), uri);
    return uri.ToString();
}


Related

许可下: CC-BY-SA with attribution
不隶属于 Stack Overflow
许可下: CC-BY-SA with attribution
不隶属于 Stack Overflow