Using HtmlAgilityPack to extract the complete URL - C#

c# extraction html-agility-pack hyperlink

Question

With the approach below, only the relative URLs are extracted, exactly as they appear in the href attributes.

The extraction code

foreach (HtmlNode link in hdDoc.DocumentNode.SelectNodes("//a[@href]"))
{
    lsLinks.Add(link.Attributes["href"].Value);
}

The link code

<a href="Login.aspx">Login</a>

The retrieved link

Login.aspx

But I want the full URL, as the browser would resolve it:

http://www.monstermmorpg.com/Login.aspx

I could do that by checking whether the URL contains http and prepending the domain if it doesn't, but that can lead to problems in some cases, and I don't think it is a very smart approach.

I am using HtmlAgilityPack 1.4.0 and C# 4.0.

10/13/2011 8:52:51 PM

Accepted Answer

Assuming you have the original URL of the page, you can combine it with the parsed href like this:

// The address of the page you crawled
var baseUrl = new Uri("http://example.com/path/to-page/here.aspx");

// root relative
var url = new Uri(baseUrl, "/Login.aspx");
Console.WriteLine (url.AbsoluteUri); // prints 'http://example.com/Login.aspx'

// relative
url = new Uri(baseUrl, "../foo.aspx?q=1");
Console.WriteLine (url.AbsoluteUri); // prints 'http://example.com/path/foo.aspx?q=1'

// absolute
url = new Uri(baseUrl, "http://stackoverflow.com/questions/7760286/");
Console.WriteLine (url.AbsoluteUri); // prints 'http://stackoverflow.com/questions/7760286/'

// other...
url = new Uri(baseUrl, "javascript:void(0)");
Console.WriteLine (url.AbsoluteUri); // prints 'javascript:void(0)'

Note the use of AbsoluteUri rather than ToString(): ToString() unescapes the URL to make it more "human-readable", which is usually not what you want when you need a link you can request.
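
The difference between AbsoluteUri and ToString() is easiest to see with an href that contains an escaped character. The following is a minimal sketch, not part of the original answer; the file name my%20page.aspx is made up for illustration.

```csharp
using System;

class Program
{
    static void Main()
    {
        // The address of the page you crawled (same example as above)
        var baseUrl = new Uri("http://example.com/path/to-page/here.aspx");

        // Hypothetical href containing an escaped space
        var url = new Uri(baseUrl, "my%20page.aspx");

        // AbsoluteUri keeps the escaping intact
        Console.WriteLine(url.AbsoluteUri); // prints 'http://example.com/path/to-page/my%20page.aspx'

        // ToString() unescapes where it safely can, so the %20 is expected
        // to come back as a literal space here
        Console.WriteLine(url.ToString());
    }
}
```

If the links are going to be stored and requested later, the AbsoluteUri form is the one that can be used as-is.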

10/13/2011 10:33:39 PM

Popular Answer

"I can do it by checking whether the URL contains http and, if not, adding the domain value"

That is exactly what you should do. Nothing in the HTML Agility Pack can help you with this:

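// baseUrl is assumed to be the page address as a string here
// GetLeftPart(UriPartial.Path) drops the query string and fragment
var pageUri = new Uri(new Uri(baseUrl).GetLeftPart(UriPartial.Path));
var url = new Uri(pageUri, link.Attributes["href"].Value);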
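
Putting the two answers together, the resolution step can be wrapped in a small helper. This is only a sketch under some assumptions: LinkResolver and ResolveLinks are hypothetical names, the page address Default.aspx and the href /Register.aspx are invented for the example, and filtering out everything that is not http or https is a design choice that may not suit every crawler. The hrefs are passed in as plain strings so the sketch runs without HtmlAgilityPack; in practice they would come from the SelectNodes loop in the question.

```csharp
using System;
using System.Collections.Generic;

static class LinkResolver
{
    // Hypothetical helper: resolves raw href values against the page address
    // and returns absolute http/https URLs only.
    public static List<string> ResolveLinks(Uri pageUri, IEnumerable<string> hrefs)
    {
        var result = new List<string>();
        foreach (var href in hrefs)
        {
            Uri resolved;
            // TryCreate returns false instead of throwing on malformed input
            if (!Uri.TryCreate(pageUri, href, out resolved))
                continue;

            // Assumption: links such as javascript: or mailto: are not wanted
            if (resolved.Scheme == Uri.UriSchemeHttp || resolved.Scheme == Uri.UriSchemeHttps)
                result.Add(resolved.AbsoluteUri);
        }
        return result;
    }
}

class Program
{
    static void Main()
    {
        // Assumed address of the crawled page
        var pageUri = new Uri("http://www.monstermmorpg.com/Default.aspx");

        // Sample href values, as they might come out of the extraction loop
        var hrefs = new[]
        {
            "Login.aspx",
            "/Register.aspx",
            "http://stackoverflow.com/",
            "javascript:void(0)"
        };

        foreach (var link in LinkResolver.ResolveLinks(pageUri, hrefs))
            Console.WriteLine(link);
    }
}
```

With these inputs, Login.aspx should resolve to http://www.monstermmorpg.com/Login.aspx, which is the URL the question asks for, and the javascript: link is skipped.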


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow