How to parse an HTML document using C#

c# html html-agility-pack html-parsing

Question

I need to parse the document below. I am trying HtmlAgilityPack, but I find it very complicated. I need the inner text of this tag: <td style="background: #36461f;color: #ffffff;font-weight: bold;padding: 2px;font-size: 12px;height: 25px;">Mac Bahsi</td> and the following child elements:

<div class="div1" style="width: 288px;" onclick="parent.javaScriptAddSlip('slip', '164518117;;;-;11.25;1;Maç Bahsi;164518117')">
<div class="div1" style="width: 288px;" onclick="parent.javaScriptAddSlip('slip', '164518117;;;-;6.50;0;Maç Bahsi;164518117')">
<div class="div1" style="width: 288px;" onclick="parent.javaScriptAddSlip('slip', '164518117;;;-;1.18;2;Maç Bahsi;164518117')">

<!DOCTYPE HTML>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <style>
        .table1 {
            width: 100%;
            margin: 0px;
            padding: 0px;
            border-collapse: collapse;
            padding: 0px;
        }

        .div1 {
            cursor: pointer;
            margin: 1px;
            border: 1px solid #999999;
            float: left;
            font-size: 12px;
        }

        .td1 {
            text-align: center;
            font-size: 20px;
            font-weight: bold;
            color: #33460E;
            height: 20px;
            padding: 0px;
        }

        .td2 {
            text-align: center;
            font-weight: bold;
            color: #808000;
            padding: 0px;
        }
    </style>
</head>
<body style="background: #FFFFCC;margin: 0px;padding: 0px;font-size: 12px;">
    <p></p>
    <table style="width: 100%" cellpadding="0" cellspacing="0">
        <tr>
            <td style="background: #36461f;color: #ffffff;font-weight: bold;padding: 2px;font-size: 12px;height: 25px;">Mac Bahsi</td>
        </tr>
        <tr>
            <td>
                <div class="div1" style="width: 288px;" onclick="parent.javaScriptAddSlip('slip', '164518117;;;-;11.25;1;Maç Bahsi;164518117')">
                    <table class="table1">
                        <tr class="menuClickable">
                            <td class="td1">11.25</td>
                        </tr>
                        <tr class="menuClickable">
                            <td class="td2">Club America Mexico</td>
                        </tr>
                    </table>
                </div>
                <div class="div1" style="width: 288px;" onclick="parent.javaScriptAddSlip('slip', '164518117;;;-;6.50;0;Maç Bahsi;164518117')">
                    <table class="table1">
                        <tr class="menuClickable">
                            <td class="td1">6.50</td>
                        </tr>
                        <tr class="menuClickable">
                            <td class="td2">Beraberlik</td>
                        </tr>
                    </table>
                </div>
                <div class="div1" style="width: 288px;" onclick="parent.javaScriptAddSlip('slip', '164518117;;;-;1.18;2;Maç Bahsi;164518117')">
                    <table class="table1">
                        <tr class="menuClickable">
                            <td class="td1">1.18</td>
                        </tr>
                        <tr class="menuClickable">
                            <td class="td2">Real Madrid</td>
                        </tr>
                    </table>
                </div>
            </td>
        </tr>
    </table>
    <table style="width: 100%" cellpadding="0" cellspacing="0">
        <tr>
            <td style="background: #36461f;color: #ffffff;font-weight: bold;padding: 2px;font-size: 12px;height: 25px;">Ilk Yari Bahsi</td>
        </tr>
        <tr>
            <td>
                <div class="div1" style="width: 288px;" onclick="parent.javaScriptAddSlip('slip', '164518128;;;-;8.50;1;İlk Yarı Bahsi;164518128')">
                    <table class="table1">
                        <tr class="menuClickable">
                            <td class="td1">8.50</td>
                        </tr>
                        <tr class="menuClickable">
                            <td class="td2">Club America Mexico</td>
                        </tr>
                    </table>
                </div>
                <div class="div1" style="width: 288px;" onclick="parent.javaScriptAddSlip('slip', '164518128;;;-;3.05;0;İlk Yarı Bahsi;164518128')">
                    <table class="table1">
                        <tr class="menuClickable">
                            <td class="td1">3.05</td>
                        </tr>
                        <tr class="menuClickable">
                            <td class="td2">Beraberlik</td>
                        </tr>
                    </table>
                </div>
                <div class="div1" style="width: 288px;" onclick="parent.javaScriptAddSlip('slip', '164518128;;;-;1.50;2;İlk Yarı Bahsi;164518128')">
                    <table class="table1">
                        <tr class="menuClickable">
                            <td class="td1">1.50</td>
                        </tr>
                        <tr class="menuClickable">
                            <td class="td2">Real Madrid</td>
                        </tr>
                    </table>
                </div>
            </td>
        </tr>
    </table>
    <table style="width: 100%" cellpadding="0" cellspacing="0">
        <tr>
            <td style="background: #36461f;color: #ffffff;font-weight: bold;padding: 2px;font-size: 12px;height: 25px;">İkinci Yarı Bahsi</td>
        </tr>
        <tr>
            <td>
                <div class="div1" style="width: 288px;" onclick="parent.javaScriptAddSlip('slip', '164518133;;;-;8.50;1;İkinci Yarı Bahsi;164518133')">
                    <table class="table1">
                        <tr class="menuClickable">
                            <td class="td1">8.50</td>
                        </tr>
                        <tr class="menuClickable">
                            <td class="td2">Club America Mexico</td>
                        </tr>
                    </table>
                </div>
                <div class="div1" style="width: 288px;" onclick="parent.javaScriptAddSlip('slip', '164518133;;;-;3.70;0;İkinci Yarı Bahsi;164518133')">
                    <table class="table1">
                        <tr class="menuClickable">
                            <td class="td1">3.70</td>
                        </tr>
                        <tr class="menuClickable">
                            <td class="td2">Beraberlik</td>
                        </tr>
                    </table>
                </div>
                <div class="div1" style="width: 288px;" onclick="parent.javaScriptAddSlip('slip', '164518133;;;-;1.40;2;İkinci Yarı Bahsi;164518133')">
                    <table class="table1">
                        <tr class="menuClickable">
                            <td class="td1">1.40</td>
                        </tr>
                        <tr class="menuClickable">
                            <td class="td2">Real Madrid</td>
                        </tr>
                    </table>
                </div>
            </td>
        </tr>
    </table>
    <br />
    <br />
    <br />
</body>
</html>

Popular Answer

First, install the HtmlAgilityPack NuGet package into your project.

Then, as an example:

using System.Linq; // needed for the Any() call on ParseErrors below

HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

// There are various options; set them as needed
htmlDoc.OptionFixNestedTags = true;

// filePath is the path to a file containing the HTML
htmlDoc.Load(filePath);

// To load from a string instead, use: htmlDoc.LoadHtml(htmlString);
// (an earlier revision of this answer said htmlDoc.LoadXML(xmlString))

// ParseErrors enumerates any errors reported by the Load call
if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Any())
{
    // Handle any parse errors as required
}
else
{
    if (htmlDoc.DocumentNode != null)
    {
        HtmlAgilityPack.HtmlNode bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//body");

        if (bodyNode != null)
        {
            // Do something with bodyNode
        }
    }
}

(NB: This code is an example only and not necessarily the best/only approach. Do not use it blindly in your own application.)
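Applied to the markup in the question, the same calls could look like the sketch below. It is only one possible approach: the class name OddsExample, the method ExtractOdds and the trimmed SampleHtml string are made up for illustration, and the XPath expressions assume the page keeps the layout shown in the question (one outer table per section, with a header cell in its first row and one div.div1 per option).

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class OddsExample
{
    // Trimmed copy of the question's markup: two sections, three options in total.
    public const string SampleHtml = @"<html><body>
<table>
  <tr><td style=""height: 25px;"">Mac Bahsi</td></tr>
  <tr><td>
    <div class=""div1"" onclick=""parent.javaScriptAddSlip('slip', '164518117;;;-;11.25;1;Maç Bahsi;164518117')"">
      <table class=""table1"">
        <tr class=""menuClickable""><td class=""td1"">11.25</td></tr>
        <tr class=""menuClickable""><td class=""td2"">Club America Mexico</td></tr>
      </table>
    </div>
    <div class=""div1"" onclick=""parent.javaScriptAddSlip('slip', '164518117;;;-;6.50;0;Maç Bahsi;164518117')"">
      <table class=""table1"">
        <tr class=""menuClickable""><td class=""td1"">6.50</td></tr>
        <tr class=""menuClickable""><td class=""td2"">Beraberlik</td></tr>
      </table>
    </div>
  </td></tr>
</table>
<table>
  <tr><td style=""height: 25px;"">Ilk Yari Bahsi</td></tr>
  <tr><td>
    <div class=""div1"" onclick=""parent.javaScriptAddSlip('slip', '164518128;;;-;8.50;1;İlk Yarı Bahsi;164518128')"">
      <table class=""table1"">
        <tr class=""menuClickable""><td class=""td1"">8.50</td></tr>
        <tr class=""menuClickable""><td class=""td2"">Club America Mexico</td></tr>
      </table>
    </div>
  </td></tr>
</table>
</body></html>";

    // Returns one entry per div.div1, tagged with the header text of its section.
    public static List<(string Section, string Name, string Odds, string OnClick)> ExtractOdds(string html)
    {
        var rows = new List<(string Section, string Name, string Odds, string OnClick)>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // One outer table per section. SelectNodes returns null when nothing matches.
        HtmlNodeCollection sections = doc.DocumentNode.SelectNodes("//body/table");
        if (sections == null) return rows;

        foreach (HtmlNode section in sections)
        {
            // Header cell: the td in the first row of the outer table.
            string header = section.SelectSingleNode("./tr[1]/td")?.InnerText.Trim();

            // ".//" keeps the search inside the current section.
            HtmlNodeCollection options = section.SelectNodes(".//div[@class='div1']");
            if (options == null) continue;

            foreach (HtmlNode div in options)
            {
                string odds = div.SelectSingleNode(".//td[@class='td1']")?.InnerText.Trim();
                string name = div.SelectSingleNode(".//td[@class='td2']")?.InnerText.Trim();
                string onclick = div.GetAttributeValue("onclick", string.Empty);
                rows.Add((header, name, odds, onclick));
            }
        }
        return rows;
    }

    public static void Main()
    {
        foreach (var row in ExtractOdds(SampleHtml))
        {
            Console.WriteLine($"{row.Section} | {row.Name} | {row.Odds} | {row.OnClick}");
        }
    }
}
```

To run this against the real page, pass the full HTML string to ExtractOdds, or swap LoadHtml for Load with a file path.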

The HtmlDocument.Load() method also accepts a stream, which is very useful for integrating with other stream-oriented classes in the .NET Framework. HtmlEntity.DeEntitize() is another useful method; it converts HTML entities in a string back into the characters they stand for. (thanks Matthew)
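As a rough illustration of both points, the sketch below loads HTML from a MemoryStream and decodes the entities in a cell's text. The class name StreamExample, the helper FirstCellText and the sample markup are assumptions made up for this example; in a real application the stream would more likely come from a file or an HTTP response.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

public static class StreamExample
{
    // Reads HTML from any stream and returns the decoded text of the first <td>, or null.
    public static string FirstCellText(Stream input)
    {
        var doc = new HtmlDocument();
        doc.Load(input, Encoding.UTF8); // Load(Stream, Encoding) overload

        HtmlNode cell = doc.DocumentNode.SelectSingleNode("//td");
        if (cell == null) return null;

        // Turn entities such as &ccedil; and &amp; back into plain characters.
        return HtmlEntity.DeEntitize(cell.InnerText).Trim();
    }

    public static void Main()
    {
        byte[] bytes = Encoding.UTF8.GetBytes(
            "<table><tr><td>Ma&ccedil; Bahsi &amp; more</td></tr></table>");

        using (var stream = new MemoryStream(bytes))
        {
            Console.WriteLine(FirstCellText(stream));
        }
    }
}
```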

HtmlDocument and HtmlNode are the classes you'll use most. Similar to an XML parser, HtmlNode provides the SelectSingleNode and SelectNodes methods, which accept XPath expressions.
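Two details of these methods tend to trip people up, so here is a small sketch of them. With default options, SelectNodes returns null rather than an empty collection when nothing matches, and an XPath that starts with "//" searches the whole document even when it is called on a child node (use ".//" to stay below that node). The class XPathExample, the helper Texts and the sample markup are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathExample
{
    public const string Sample =
        "<div id='a'><span class='name'>Beraberlik</span></div>" +
        "<div id='b'><span class='name'>Real Madrid</span></div>";

    // Returns the trimmed inner text of every node matching xpath, or an empty list.
    public static List<string> Texts(HtmlNode root, string xpath)
    {
        var result = new List<string>();
        HtmlNodeCollection nodes = root.SelectNodes(xpath);
        if (nodes == null) return result; // no match yields null by default

        foreach (HtmlNode node in nodes)
        {
            result.Add(node.InnerText.Trim());
        }
        return result;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Sample);

        HtmlNode first = doc.DocumentNode.SelectSingleNode("//div[@id='a']");

        // Relative path: only spans below the first div.
        Console.WriteLine(string.Join(", ", Texts(first, ".//span")));

        // Absolute path: every span in the document, even though we start at 'first'.
        Console.WriteLine(string.Join(", ", Texts(first, "//span")));

        // No tables in the sample, so the helper falls back to an empty list.
        Console.WriteLine(Texts(first, ".//table").Count);
    }
}
```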

Pay attention to the HtmlDocument.Option* Boolean properties (such as OptionFixNestedTags). These control how the Load and LoadHtml methods process your HTML/XHTML.

There is also a compiled help file called HtmlAgilityPack.chm that has a complete reference for each of the objects. This is normally in the base folder of the solution.




Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow