Getting InnerText ignoring script node by using Html Agility Pack in C#

c# html-agility-pack html-parsing

Question

I have the following page, from which I want to get the list of proxy servers shown in a table:

http://proxy-list.org/spanish/search.php?search=&country=any&type=any&port=any&ssl=any

Each row in the table is a ul element. My problem arises when reading the first li element of each ul, the one whose class is "proxy". I want the IP and port, so I read InnerText, but because the li element has a script child node, InnerText returns the text of the script node.

Below is an image of the structure of the page:

[image: structure of the page]

I have tried the code below, using Html Agility Pack and LINQ:

WebClient webClient = new WebClient();
string page = webClient.DownloadString("http://proxy-list.org/spanish/search.php?search=&country=any&type=any&port=any&ssl=any");

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);

List<List<string>> table = doc.DocumentNode.SelectSingleNode("//div[@class='table']")
            .Descendants("ul")
            .Where(ul => ul.Elements("li").Count() > 1)
            .Select(ul => ul.Elements("li").Select(li =>
                {
                    string result = string.Empty;
                    if (li.HasClass("proxy"))
                    {
                        HtmlNode liTmp = li.Clone();
                        liTmp.RemoveAllChildren();
                        result = liTmp.InnerText.Trim();
                    }
                    else
                    {
                        result = li.InnerText.Trim();
                    }
                    return result;
                }).ToList()).ToList();

I get a list in which each item is itself a list of the fields (Proxy, País, Tipo, Velocidad, HTTPS/SSL), but the Proxy field is always empty. I am also not getting the "País" and "Ciudad" columns at all.

Accepted Answer

That is because those values are injected into the DOM by JavaScript after the page loads. The argument passed to Proxy() is a Base64 representation of the value you are looking for.

In the image you have posted above the value MTQ4LjI0My4zNy4xMDE6NTMyODE= decodes to 148.243.37.101:53281
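
As a quick check of that claim, here is a minimal sketch that decodes the value with the standard .NET Base64 API. The class and method names (DecodeDemo, DecodeBase64) are made up for illustration, and UTF-8 is an assumed encoding that is safe here because the payload is plain ASCII.

```csharp
using System;
using System.Text;

public static class DecodeDemo
{
    // Hypothetical helper: turns the Base64 argument of Proxy('...') into text.
    public static string DecodeBase64(string encoded)
    {
        byte[] bytes = Convert.FromBase64String(encoded);
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        // Value taken from the question's screenshot.
        Console.WriteLine(DecodeBase64("MTQ4LjI0My4zNy4xMDE6NTMyODE="));
        // prints 148.243.37.101:53281
    }
}
```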

The raw HTML string you are feeding to Html Agility Pack contains only the Proxy('...') call inside the proxy field:

    <div class="table-wrap">
        <div class="table">
            <ul>
                <li class="proxy">
                    <script type="text/javascript">
                        Proxy('MTM4Ljk3LjkyLjI0OTo1MzgxNg==')
                    </script>
                </li>
                <li class="https">HTTP</li>
                <li class="speed">29.5kbit</li>
                <li class="type">
                    <strong>Elite</strong>
                </li>
                <li class="country-city">
                    <div>
                        <span class="country" title="Brazil">
                            <span class="country-code">
                                <span class="flag br"></span>
                                <span class="name">BR Brasil</span>
                            </span>
                        </span>
                        <!-- -->
                        <span class="city">
                            <span>Rondon</span>
                        </span>
                    </div>
                </li>
            </ul>
            <div class="clear"></div>

Using the following code:

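        // Requires: using System; using System.Linq; using System.Net.Http; using System.Text;
        //           using System.Text.RegularExpressions; using HtmlAgilityPack;
        HttpClient client = new HttpClient();
        // Blocking on .Result keeps the sample short; prefer await in async code.
        var docResult = client.GetStringAsync("http://proxy-list.org/spanish/search.php?search=&country=any&type=any&port=any&ssl=any").Result;
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(docResult);
        // Captures the Base64 argument of the Proxy('...') call.
        Regex reg = new Regex(@"Proxy\('(?<value>.*?)'\)", RegexOptions.Compiled | RegexOptions.IgnoreCase);

        // InnerText of each li.proxy is the script text, e.g. Proxy('MTM4Ljk3...').
        var stuff = doc.DocumentNode.SelectSingleNode("//div[@class='table']")
            .Descendants("li")
            .Where(x => x.HasClass("proxy"))
            .Select(li => li.InnerText)
            .ToList();

        foreach (var item in stuff)
        {
            var match = reg.Match(item);
            if (!match.Success) continue; // skip items that contain no Proxy('...') call
            // The decoded payload is plain ASCII, so UTF-8 decodes it safely.
            var proxy = Encoding.UTF8.GetString(Convert.FromBase64String(match.Groups["value"].Value));
            Console.WriteLine($"{item}\t\tproxy = {proxy}");
        }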

I get the raw Proxy('...') text of each row followed by its decoded proxy = ip:port value.
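
The question also asks about the country and city columns. The following is only a sketch of how a whole row could be read in one pass, assuming the live markup matches the sample shown earlier (li.proxy, span.name and span.city inside each ul). The names ProxyRow, ProxyRowParser and TextOf are hypothetical, and the code needs the HtmlAgilityPack NuGet package. It matches the regex against InnerHtml rather than InnerText so that it does not depend on how InnerText treats script content.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical container for one table row.
public sealed class ProxyRow
{
    public string Address { get; set; } // decoded "ip:port"
    public string Country { get; set; }
    public string City { get; set; }
}

public static class ProxyRowParser
{
    // Captures the Base64 argument of Proxy('...').
    private static readonly Regex ProxyCall = new Regex(@"Proxy\('(?<value>[^']*)'\)");

    public static List<ProxyRow> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<ProxyRow>();
        // SelectNodes returns null when nothing matches.
        var uls = doc.DocumentNode.SelectNodes("//div[@class='table']//ul");
        if (uls == null) return rows;

        foreach (var ul in uls)
        {
            var proxyLi = ul.SelectSingleNode("./li[@class='proxy']");
            if (proxyLi == null) continue;

            // InnerHtml of the li includes the script source.
            var match = ProxyCall.Match(proxyLi.InnerHtml);
            if (!match.Success) continue; // no Proxy('...') call in this row

            rows.Add(new ProxyRow
            {
                Address = Encoding.UTF8.GetString(
                    Convert.FromBase64String(match.Groups["value"].Value)),
                // Class names assumed from the sample markup.
                Country = TextOf(ul, ".//span[@class='name']"),
                City = TextOf(ul, ".//span[@class='city']"),
            });
        }
        return rows;
    }

    // Returns the trimmed, entity-decoded text of the first match, or "".
    private static string TextOf(HtmlNode scope, string xpath)
    {
        var node = scope.SelectSingleNode(xpath);
        return node == null ? string.Empty : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    public static void Main()
    {
        // One row, condensed from the sample markup.
        const string sample =
            "<div class=\"table\"><ul>" +
            "<li class=\"proxy\"><script type=\"text/javascript\">Proxy('MTM4Ljk3LjkyLjI0OTo1MzgxNg==')</script></li>" +
            "<li class=\"country-city\"><div>" +
            "<span class=\"country\" title=\"Brazil\"><span class=\"country-code\">" +
            "<span class=\"flag br\"></span><span class=\"name\">BR Brasil</span></span></span>" +
            "<span class=\"city\"><span>Rondon</span></span>" +
            "</div></li></ul></div>";

        foreach (var row in Parse(sample))
            Console.WriteLine($"{row.Address} | {row.Country} | {row.City}");
    }
}
```

To run it against the live page, pass the downloaded string (docResult in the answer's code) to ProxyRowParser.Parse instead of the inline sample. If the site changes its class names, the XPath expressions will need adjusting.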




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why