Getting InnerText ignoring script node by using Html Agility Pack in C#

c# html-agility-pack html-parsing


I want to acquire a table-based list of proxy servers from the following page:

The table's rows are all ul elements. My issue is getting the first li element of each ul, the one with the "proxy" class. I read its InnerText to get the IP and port, but because that li contains a script child node, all I get back is the text of the script node.

Below is a screenshot of the page's layout:

[screenshot of the page layout]

I tried the following code using LINQ and HTML Agility Pack:

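// Requires: using System.Collections.Generic; using System.Linq; using System.Net; using HtmlAgilityPack;
WebClient webClient = new WebClient();
string page = webClient.DownloadString(""); // page URL omitted in the original post

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page); // parse the downloaded markup before querying it

List<List<string>> table = doc.DocumentNode.SelectSingleNode("//div[@class='table']")
            .Descendants("ul") // each table row is a ul element
            .Where(ul => ul.Elements("li").Count() > 1)
            .Select(ul => ul.Elements("li").Select(li =>
                {
                    string result = string.Empty;
                    if (li.HasClass("proxy"))
                    {
                        // The proxy cell is read from a copy of the node
                        HtmlNode liTmp = li.Clone();
                        result = liTmp.InnerText.Trim();
                    }
                    else
                    {
                        result = li.InnerText.Trim();
                    }
                    return result;
                }).ToList())
            .ToList();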

I can get a list with the fields (Proxy, País, Tipo, Velocidad, HTTPS/SSL) shown for each item, but the Proxy field is never filled in. Additionally, I cannot get at the "País" and "Ciudad" columns at all.

5/20/2018 1:09:49 AM

Accepted Answer

This is because JavaScript injects that data into the DOM when the page loads. The value inside Proxy() is actually a Base64 representation of the information you want.

The value in the picture you uploaded above, MTQ4LjI0My4zNy4xMDE6NTMyODE=, decodes to 148.243.37.101:53281.
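
As a quick check of that decoding step, here is a minimal sketch that decodes the value from the screenshot. The class name `Base64Check` is made up for illustration, and UTF-8 is assumed for the decoded bytes (the payload is plain ASCII, so any ASCII-compatible encoding should give the same text).

```csharp
using System;
using System.Text;

public static class Base64Check
{
    // Decodes a Base64 string into text, assuming the bytes are UTF-8/ASCII.
    public static string Decode(string base64)
    {
        byte[] bytes = Convert.FromBase64String(base64);
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        // Value taken from the question's screenshot.
        Console.WriteLine(Decode("MTQ4LjI0My4zNy4xMDE6NTMyODE="));
        // prints 148.243.37.101:53281
    }
}
```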

You are giving Html Agility Pack the raw downloaded string, in which the proxy cell contains nothing but the script with the Proxy() call...

    <div class=\"table-wrap\">\r\n
        <div class=\"table\">\r\n
                <li class=\"proxy\">
                    <script type=\"text/javascript\">
                <li class=\"https\">HTTP</li>\r\n
                <li class=\"speed\">29.5kbit</li>\r\n
                <li class=\"type\">
                <li class=\"country-city\">\r\n
                        <span class=\"country\" title=\"Brazil\">
                            <span class=\"country-code\">
                                <span class=\"flag br\"></span>
                                <span class=\"name\">BR Brasil</span>
                        <!--\r\n                     -->
                        <span class=\"city\">
                        </span>\r\n </div>\r\n </li>\r\n </ul>\r\n
            <div class=\"clear\"></div>\r\n

Using the code below:

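        // Requires: using System; using System.Linq; using System.Net.Http; using System.Text;
        //           using System.Text.RegularExpressions; using HtmlAgilityPack;
        HttpClient client = new HttpClient();
        var docResult = client.GetStringAsync("").Result; // page URL omitted in the original post
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(docResult); // parse the raw markup
        // Captures the Base64 argument of the Proxy('...') call inside the script
        Regex reg = new Regex(@"Proxy\('(?<value>.*?)'\)", RegexOptions.Compiled | RegexOptions.IgnoreCase);

        var stuff = doc.DocumentNode.SelectSingleNode("//div[@class='table']")
        .Descendants("li")
        .Where(x => x.HasClass("proxy"))
        .Select(li =>
        {
            return li.InnerText;
        });

        foreach (var item in stuff)
        {
            var match = reg.Match(item);
            var proxy = Encoding.Default.GetString(System.Convert.FromBase64String(match.Groups["value"].Value));
            Console.WriteLine($"{item}\t\tproxy = {proxy}");
        }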

I get the following output: [screenshot]
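
The regex-and-decode step can also be tried without the network or Html Agility Pack. The sketch below is an illustration under assumptions, not the code that produced the output above: the `sampleRow` markup is invented (built from the Base64 value in the question and the `Proxy('...')` pattern the regex expects), and `ProxyDecoder` and `TryDecode` are hypothetical names. It also returns null for cells with no `Proxy(...)` call, such as a header row, and for values that are not valid Base64.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class ProxyDecoder
{
    // Same pattern as in the answer: captures the argument of Proxy('...').
    private static readonly Regex ProxyCall =
        new Regex(@"Proxy\('(?<value>.*?)'\)", RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Returns the decoded "ip:port" text, or null when the input has no
    // Proxy('...') call or the captured value is not valid Base64.
    public static string TryDecode(string cellText)
    {
        Match match = ProxyCall.Match(cellText);
        if (!match.Success)
            return null;
        try
        {
            byte[] bytes = Convert.FromBase64String(match.Groups["value"].Value);
            return Encoding.UTF8.GetString(bytes);
        }
        catch (FormatException)
        {
            return null;
        }
    }

    public static void Main()
    {
        // Hypothetical markup for one proxy cell; the real page may differ.
        string sampleRow =
            "<li class=\"proxy\"><script type=\"text/javascript\">" +
            "document.write(Proxy('MTQ4LjI0My4zNy4xMDE6NTMyODE='))</script></li>";

        // prints 148.243.37.101:53281
        Console.WriteLine(TryDecode(sampleRow) ?? "(no proxy found)");
        // prints (no proxy found)
        Console.WriteLine(TryDecode("<li class=\"proxy\">Proxy</li>") ?? "(no proxy found)");
    }
}
```

Returning null instead of throwing keeps a loop over many rows going when one cell does not contain an encoded value.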

5/20/2018 2:20:04 AM

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow