How to extract full url with HtmlAgilityPack - C#

c# extraction html-agility-pack hyperlink

Question

With the code below, only the relative URL is extracted, exactly as it is written in the href attribute.

The extraction code:

foreach (HtmlNode link in hdDoc.DocumentNode.SelectNodes("//a[@href]"))
{
    // Value is already a string, so ToString() is not needed
    lsLinks.Add(link.Attributes["href"].Value);
}

The link markup:

<a href="Login.aspx">Login</a>

The extracted URL:

Login.aspx

But I want to get the full link as the browser would resolve it, like this:

http://www.monstermmorpg.com/Login.aspx

I could check whether the URL contains http and, if not, prepend the domain, but that may cause problems in some cases and does not seem like a very wise solution.

C# 4.0, HtmlAgilityPack 1.4.0

Accepted Answer

Assuming you have the URL of the page the links came from, you can combine it with each parsed href, something like this:

// The address of the page you crawled
var baseUrl = new Uri("http://example.com/path/to-page/here.aspx");

// root relative
var url = new Uri(baseUrl, "/Login.aspx");
Console.WriteLine (url.AbsoluteUri); // prints 'http://example.com/Login.aspx'

// relative
url = new Uri(baseUrl, "../foo.aspx?q=1");
Console.WriteLine (url.AbsoluteUri); // prints 'http://example.com/path/foo.aspx?q=1'

// absolute
url = new Uri(baseUrl, "http://stackoverflow.com/questions/7760286/");
Console.WriteLine (url.AbsoluteUri); // prints 'http://stackoverflow.com/questions/7760286/'

// other...
url = new Uri(baseUrl, "javascript:void(0)");
Console.WriteLine (url.AbsoluteUri); // prints 'javascript:void(0)'

Note the use of AbsoluteUri rather than ToString(): ToString() decodes the URL (to make it more "human-readable"), which is not typically what you want when you store or request the link.
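
The same idea can be wrapped in a small helper for the loop from the question. This is only a sketch under a few assumptions: the class and method names (LinkResolver, ResolveLinks) are made up for illustration, the hrefs are passed in as plain strings, and only http/https results are kept so that entries such as javascript:void(0) are dropped. It uses nothing beyond System.Uri.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of HtmlAgilityPack or the original answer.
public static class LinkResolver
{
    // Resolves raw href values against the address of the crawled page
    // and returns the absolute http/https URLs.
    public static List<string> ResolveLinks(Uri pageUrl, IEnumerable<string> hrefs)
    {
        var result = new List<string>();
        foreach (var href in hrefs)
        {
            // Skip empty attributes
            if (string.IsNullOrWhiteSpace(href)) continue;

            // TryCreate returns false instead of throwing on malformed input
            Uri resolved;
            if (!Uri.TryCreate(pageUrl, href.Trim(), out resolved)) continue;

            // Assumption: only web links are of interest to the crawler
            if (resolved.Scheme == Uri.UriSchemeHttp ||
                resolved.Scheme == Uri.UriSchemeHttps)
            {
                result.Add(resolved.AbsoluteUri);
            }
        }
        return result;
    }
}
```

With the page address http://www.monstermmorpg.com/ and the href Login.aspx, this yields http://www.monstermmorpg.com/Login.aspx. In the question's loop, each link.Attributes["href"].Value would be the string passed in.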


Popular Answer

"I can do it with checking the url whether containing http and if not add the domain value"

That is essentially what you should do. Html Agility Pack itself has nothing to help you with this, but the Uri class can do the combining for you:

// baseUrl is assumed here to be the page address as a string
var url = new Uri(
    new Uri(new Uri(baseUrl).GetLeftPart(UriPartial.Path)),
    link.Attributes["href"].Value);
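
To make the role of GetLeftPart(UriPartial.Path) in the snippet above more concrete, here is a minimal sketch. The class and method names are hypothetical, and it assumes, as the snippet seems to, that the page address starts out as a string. GetLeftPart(UriPartial.Path) returns the scheme, authority and path of the address, leaving out the query string and fragment.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class BaseUrlHelper
{
    // Returns the page address without its query string and fragment,
    // ready to be used as the base for relative hrefs.
    public static Uri StripQueryAndFragment(string pageAddress)
    {
        string leftPart = new Uri(pageAddress).GetLeftPart(UriPartial.Path);
        return new Uri(leftPart);
    }
}
```

For example, http://example.com/path/to-page/here.aspx?x=1#top becomes http://example.com/path/to-page/here.aspx, and resolving Login.aspx against that gives http://example.com/path/to-page/Login.aspx.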



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow