Getting the text between all tags in an HTML document and recursively following links

c# html html-agility-pack web-crawler

Question

I looked at a few posts on Stack Overflow about getting every word between HTML tags, and they left me completely lost: for a single tag, some people suggest regular expressions while others recommend a parser. I'm trying to build a web crawler. I have already downloaded the HTML of my starting URL into a string, and I have extracted the links from that HTML into my data string. Now I want to follow every link I pulled out, go into each of those pages, and extract the words there. I have two questions. First, how can I get the text of each web page while ignoring tags and JavaScript? Second, how do I crawl the links recursively?

I'm reading the HTML into a string like this:

public void getting_html_code_of_link()
{
    string urlAddress = "http://google.com";

    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    if (response.StatusCode == HttpStatusCode.OK)
    {
        Stream receiveStream = response.GetResponseStream();
        StreamReader readStream = null;

        // Use the encoding reported by the server when one is available.
        if (response.CharacterSet == null)
            readStream = new StreamReader(receiveStream);
        else
            readStream = new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));

        data = readStream.ReadToEnd(); // 'data' is a class-level field, also used below
        response.Close();
        readStream.Close();
        Console.WriteLine(data);
    }
}

From that string, I extract the link references as follows:

public void regex_ka_kaam()
{
    StringBuilder sb = new StringBuilder();
    //Regex hrefs = new Regex("<a href.*?>");
    Regex http = new Regex("http://.*?>"); // grabs everything from "http://" up to the next '>'

    foreach (Match m in http.Matches(data))
    {
        sb.Append(m.ToString());
        if (http.IsMatch(m.ToString()))
        {
            sb.Append(http.Match(m.ToString()));
            sb.Append("                                                                        ");
            //sb.Append("<br>");
        }
        else
        {
            sb.Append(m.ToString().Substring(1, m.ToString().Length - 1)); //+ "<br>");
        }
    }
    Console.WriteLine(sb);
}

Popular Answer

Regex should not be used to parse HTML files.

HTML's structure is neither rigid nor consistent.

Use HtmlAgilityPack instead.
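To see why, here is a quick illustration (the markup is hypothetical) of how the regex pattern from the question over-matches, while an XPath query through HtmlAgilityPack returns just the attribute value:

using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class RegexVsParser
{
    static void Main()
    {
        // Single quotes and extra attributes are perfectly legal HTML,
        // but the pattern "http://.*?>" swallows everything up to the next '>'.
        string html = "<a class=\"nav\" href='http://example.com/page' target=\"_blank\">link</a>";

        Console.WriteLine(Regex.Match(html, "http://.*?>").Value);
        // -> http://example.com/page' target="_blank">

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        Console.WriteLine(doc.DocumentNode.SelectSingleNode("//a[@href]").Attributes["href"].Value);
        // -> http://example.com/page
    }
}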


This pulls every absolute (http) link from the page.

public List<string> getAllLinks(string webAddress)
{
    HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
    HtmlDocument doc = web.Load(webAddress);

    // SelectNodes returns null when the page has no matching <a> elements.
    HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
    if (anchors == null)
        return new List<string>();

    // Keep only absolute http(s) links.
    return anchors.Where(a => a.Attributes["href"].Value.StartsWith("http"))
                  .Select(a => a.Attributes["href"].Value)
                  .ToList();
}
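Note that the StartsWith("http") filter drops relative links such as /about. If you want those as well, one possible variant (the name getAllLinksResolved is mine, and it assumes System.Uri plus the same usings as above) resolves every href against the page address:

public List<string> getAllLinksResolved(string webAddress)
{
    HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
    HtmlDocument doc = web.Load(webAddress);
    Uri baseUri = new Uri(webAddress);
    List<string> links = new List<string>();

    HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
    if (anchors == null)
        return links;

    foreach (HtmlNode a in anchors)
    {
        Uri resolved;
        // Uri.TryCreate resolves relative hrefs against the page address
        // and quietly skips values it cannot parse.
        if (Uri.TryCreate(baseUri, a.Attributes["href"].Value, out resolved)
            && resolved.Scheme.StartsWith("http"))
        {
            links.Add(resolved.AbsoluteUri);
        }
    }

    return links.Distinct().ToList();
}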

This extracts the page text while ignoring tags, scripts, and styles.

public string getContent(string webAddress)
{
    HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
    HtmlDocument doc = web.Load(webAddress);

    // Take only text nodes, and skip <script>/<style>, so tags and JavaScript are ignored.
    return string.Join(" ",
        doc.DocumentNode.Descendants()
           .Where(n => n.NodeType == HtmlNodeType.Text
                    && n.ParentNode.Name != "script"
                    && n.ParentNode.Name != "style")
           .Select(n => n.InnerText.Trim())
           .Where(t => t.Length > 0));
}

This ties the two together for the seed page (a recursive sketch follows below).

public void crawl(string seedSite)
{
    getContent(seedSite);  // gets all the content
    getAllLinks(seedSite); // gets all the links
}
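The crawl method above only touches the seed page. For the recursive part of the question, a minimal sketch (the method name crawlRecursive and the depth limit are my own; it assumes the getAllLinks and getContent methods above plus System.Collections.Generic for the HashSet):

HashSet<string> visited = new HashSet<string>();

public void crawlRecursive(string url, int depth)
{
    // Stop at the depth limit, or if this URL has already been visited.
    if (depth <= 0 || !visited.Add(url))
        return;

    Console.WriteLine(getContent(url));  // words of the current page

    foreach (string link in getAllLinks(url))
        crawlRecursive(link, depth - 1); // follow each link one level deeper
}

// e.g. crawlRecursive("http://google.com", 2);

A real crawler would also want to catch failed requests and throttle itself, but that is outside the scope of the question.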


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow