HTML Agility Pack link correction

c# html-agility-pack syntax

Question

I'm working on a small project and have run into a small problem that I hope you can help with.

I have these few basic lines that load a given URL and select some tags:

var webGet2 = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = webGet2.Load(pattern);
var htmlMatches = doc.DocumentNode.SelectNodes("//li[@class=''] | //li[@class='f']");

Once I have the collection, I need to run a foreach loop that takes every href and src link and makes it valid. In the downloaded source the links look like /folder/folder/image.jpg, and I want to add http://www.site.com in front of each one.

I built this project with Regex before and had no problem doing that, but I can't work out how to do it with HTML Agility Pack.

Thank you!

7/31/2012 7:59:06 PM

Accepted Answer

So you want to search some nodes for certain attributes that contain relative URLs and change them to absolute URLs? You could do this:

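// Rewrites every attrName attribute found below root to an absolute URL.
// Note: Descendants() does not include root itself; use DescendantsAndSelf()
// if the selected node may carry the attribute too.
static void AdjustAttributes(HtmlNode root, string baseUrl, string attrName)
{
    var query =
        from node in root.Descendants()
        let attr = node.Attributes[attrName]
        where attr != null
        select attr;
    foreach (var attr in query)
    {
        var url = GetAbsoluteUrlString(baseUrl, attr.Value);
        attr.Value = url;
    }
}

// Returns url unchanged if it is already absolute, otherwise resolves it against baseUrl.
static string GetAbsoluteUrlString(string baseUrl, string url)
{
    var uri = new Uri(url, UriKind.RelativeOrAbsolute);
    if (!uri.IsAbsoluteUri)
        uri = new Uri(new Uri(baseUrl), uri);
    return uri.ToString();
}

// Usage: baseUrl is the site root that the relative links should be resolved against.
var baseUrl = "http://www.site.com";
var web = new HtmlWeb();
var doc = web.Load(pattern);
var selectedNodes = doc.DocumentNode.SelectNodes("//li[@class=''] | //li[@class='f']");
if (selectedNodes != null) // SelectNodes returns null when nothing matches
{
    foreach (var node in selectedNodes)
    {
        AdjustAttributes(node, baseUrl, "href");
        AdjustAttributes(node, baseUrl, "src");
    }
}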
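
To see what the URL resolution step does on its own, here is a minimal sketch that uses only System.Uri and leaves HTML Agility Pack out. The class name UrlFixer, the method ToAbsolute, and the sample URLs other than http://www.site.com are made up for illustration; they are not part of the answer's code.

```csharp
using System;

public static class UrlFixer
{
    // Hypothetical helper: resolves value against baseUrl.
    // With the Uri(Uri, string) constructor, an already absolute value
    // takes precedence over the base, so absolute links are kept.
    public static string ToAbsolute(string baseUrl, string value)
    {
        var baseUri = new Uri(baseUrl, UriKind.Absolute);
        return new Uri(baseUri, value).ToString();
    }

    public static void Main()
    {
        // Root-relative link, as in the question.
        Console.WriteLine(ToAbsolute("http://www.site.com", "/folder/folder/image.jpg"));
        // prints http://www.site.com/folder/folder/image.jpg

        // Path-relative link: resolved against the directory of the base page.
        Console.WriteLine(ToAbsolute("http://www.site.com/blog/post.html", "images/pic.png"));
        // prints http://www.site.com/blog/images/pic.png

        // Already absolute link: the base is ignored.
        Console.WriteLine(ToAbsolute("http://www.site.com", "http://other.example/a.png"));
        // prints http://other.example/a.png
    }
}
```

The second case is why it can be worth passing the URL of the page you actually loaded as the base rather than just the site root: links without a leading slash are resolved relative to that page's folder.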
7/31/2012 10:32:37 PM


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow