I'm trying to make a webscraper where I get all the download links for the css/js/images from a html file.
Problem
The first breakpoint does hit, but the second one not after hitting "Continue".
Code I'm talking about:
private static async void GetHtml(string url, string downloadDir)
{
//Get html data, create and load htmldocument
HttpClient httpClient = new HttpClient();
//This code gets executed
var html = await httpClient.GetStringAsync(url);
//This code not
Console.ReadLine();
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
//Get all css download urls
var linkUrl = htmlDocument.DocumentNode.Descendants("link")
.Where(node => node.GetAttributeValue("type", "")
.Equals("text/css"))
.Select(node=>node.GetAttributeValue("href",""))
.ToList();
//Downloading css, js, images and source code
using (var client = new WebClient())
{
for (var i = 0; i <scriptUrl.Count; i++)
{
Uri uri = new Uri(scriptUrl[i]);
client.DownloadFile(uri,
downloadDir + @"\js\" + uri.Segments.Last());
}
}
Edit
Im calling the getHtml method from here:
private static void Start()
{
//Create a list that will hold the names of all the subpages
List<string> subpagesList = new List<string>();
//Ask user for url and asign that to var url, also add the url to the url list
Console.WriteLine("Geef url van de website:");
string url = "https://www.hethwc.nl";
//Ask user for download directory and assign that to var downloadDir
Console.WriteLine("Geef locatie voor download:");
var downloadDir = @"C:\Users\Daniel\Google Drive\Almere\C# II\Download tests\hethwc\";
//Download and save the index file
var htmlSource = new System.Net.WebClient().DownloadString(url);
System.IO.File.WriteAllText(@"C:\Users\Daniel\Google Drive\Almere\C# II\Download tests\hethwc\index.html", htmlSource);
// Creating directories
string jsDirectory = System.IO.Path.Combine(downloadDir, "js");
string cssDirectory = System.IO.Path.Combine(downloadDir, "css");
string imagesDirectory = System.IO.Path.Combine(downloadDir, "images");
System.IO.Directory.CreateDirectory(jsDirectory);
System.IO.Directory.CreateDirectory(cssDirectory);
System.IO.Directory.CreateDirectory(imagesDirectory);
GetHtml("https://www.hethwc.nu", downloadDir);
}
How are you calling GetHtml
? Presumably that is from a sync Main
method, and you don't have any other non-worker thread in play (because your main thread exited): the process will terminate. Something like:
static void Main() {
GetHtml();
}
The above will terminate the process immediately after GetHtml
returns and the Main
method ends, which will be at the first incomplete await
point.
In current C# versions (C# 7.1 onwards) you can create an async Task Main()
method, which will allow you to await
your GetHtml
method properly, as long as you change GetHtml
to return Task
:
async static Task Main() {
await GetHtml();
}