Using HtmlAgilityPack to get specific data in C# and serialize it to JSON

c# html html-agility-pack json visual-studio

Question

I've downloaded the HTML source of a page and I'm trying to extract some data from it and serialize it to a JSON file.

This is the HTML source file: https://drive.google.com/file/d/0BzweTZsfeoxMTWk2LVdnYTJMRUE/view?usp=sharing

In the HTML there are two groups that I wish to collect data from.

So far I have managed to get the markup inside these two groups and display it in two panels using labels. My code is as follows:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using HtmlAgilityPack;

namespace Parser_Test_1._0
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {

        }

        private void button1_Click(object sender, EventArgs e)
        {
            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.Load(@"C:...\bin\Debug\xbFrSourceCode.txt");

            string datacollected1 = doc.DocumentNode.SelectNodes("//*[@id=\"favoritesContent\"]/div[2]/div[2]/ul")[0].InnerHtml;
            string datacollected2 = doc.DocumentNode.SelectNodes("//*[@id=\"friendsContent\"]/div[2]/div[2]")[0].InnerHtml;
            label1.Text = datacollected1;
            label2.Text = datacollected2;
        }      

    }
}

From these two groups I wish to collect the users and, for each user, their respective data, and serialize it all into a JSON file.

Each user is wrapped in its own <li ...></li> element.

For each user I wish to get:

  • Gamertag: data-gamertag="this is the gamertag"
  • Gamerpic: inside class="gamerpicWrapper", the src="this is the gamerpic"
  • Realname: <div class="realName">this is the realname</div>
  • PrimaryInfo: <div class="primaryInfo">this is the primaryinfo</div>
  • isOnline: <div class="statusIcon"> if there is any markup here, this value will be true in the JSON file </div>

This is an example of the desired JSON file format (note that the following may not be well written):

{
    "favorites" : 
    [
        {
            "gamertag" : "Gamertag1",
            "gamerpic" : "gamerpicURL",
            "realname" : "",
            "primaryInfo" : "",
            "isOnline" : false
        },
        {
            "gamertag" : "Gamertag2",
            "gamerpic" : "gamerpicURL",
            "realname" : "realname2",
            "primaryInfo" : "primaryinfo2",
            "isOnline" : true
        },
        {
            "gamertag" : "Gamertag3",
            "gamerpic" : "gamerpicURL",
            "realname" : "",
            "primaryInfo" : "",
            "isOnline" : false
        },
        {
            "gamertag" : "Gamertag4",
            "gamerpic" : "gamerpicURL",
            "realname" : "realname4",
            "primaryInfo" : "",
            "isOnline" : true
        }
    ],
    "friends" : 
    [
        {
            "gamertag" : "Gamertag1",
            "gamerpic" : "gamerpicURL",
            "realname" : "",
            "primaryInfo" : "",
            "isOnline" : true
        },
        {
            "gamertag" : "Gamertag2",
            "gamerpic" : "gamerpicURL",
            "realname" : "realname2",
            "primaryInfo" : "primaryinfo2",
            "isOnline" : false
        },
        {
            "gamertag" : "Gamertag3",
            "gamerpic" : "gamerpicURL",
            "realname" : "realname3",
            "primaryInfo" : "",
            "isOnline" : true
        },
        {
            "gamertag" : "Gamertag4",
            "gamerpic" : "gamerpicURL",
            "realname" : "",
            "primaryInfo" : "",
            "isOnline" : false
        }
    ]
}
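
A shape like the one above maps fairly directly onto two small C# types. The following is only a sketch under stated assumptions: the type names Gamer and GamerLists and the ToJson helper are hypothetical, and it uses System.Text.Json, which ships with .NET Core 3.0 and later (on .NET Framework it would have to be added as a NuGet package; Json.NET would work along similar lines).

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical type: one user entry, with the JSON names taken from the example.
public class Gamer
{
    [JsonPropertyName("gamertag")]    public string Gamertag { get; set; } = "";
    [JsonPropertyName("gamerpic")]    public string Gamerpic { get; set; } = "";
    [JsonPropertyName("realname")]    public string Realname { get; set; } = "";
    [JsonPropertyName("primaryInfo")] public string PrimaryInfo { get; set; } = "";
    [JsonPropertyName("isOnline")]    public bool IsOnline { get; set; }
}

// Hypothetical root type: the two groups from the page.
public class GamerLists
{
    [JsonPropertyName("favorites")] public List<Gamer> Favorites { get; set; } = new List<Gamer>();
    [JsonPropertyName("friends")]   public List<Gamer> Friends { get; set; } = new List<Gamer>();
}

public static class Program
{
    // Serializes the whole structure as indented JSON text.
    public static string ToJson(GamerLists lists)
    {
        var options = new JsonSerializerOptions { WriteIndented = true };
        return JsonSerializer.Serialize(lists, options);
    }

    public static void Main()
    {
        var lists = new GamerLists();
        lists.Favorites.Add(new Gamer { Gamertag = "Gamertag1", Gamerpic = "gamerpicURL" });
        lists.Friends.Add(new Gamer
        {
            Gamertag = "Gamertag2",
            Gamerpic = "gamerpicURL",
            Realname = "realname2",
            PrimaryInfo = "primaryinfo2",
            IsOnline = true
        });

        // Write the text to the console; File.WriteAllText would save it to disk instead.
        Console.WriteLine(ToJson(lists));
    }
}
```

Letting a serializer produce the text also sidesteps hand-written syntax slips such as trailing commas, which strict JSON parsers reject.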

I would greatly appreciate it if anyone could show me how to do this.

Popular Answer

The following code shows one appropriate way to use XPath with HAP (HtmlAgilityPack). The XPath expressions could be simplified, but you gave me a 4k HTML file and I don't feel like learning its entire structure. The code does get everything you want into variables, though. Putting those values into a JSON structure is now your job; if you don't have any knowledge of JSON, consider using XML instead.

        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.OptionFixNestedTags = true;
        doc.Load("damn.html");

        // First we find the <li> nodes we want to collect data from. SelectNodes returns every
        // match (or null when nothing matches); inside the loop SelectSingleNode is enough,
        // unlike your code, which selects all nodes and then takes element [0].
        // The path could be cut down, but it is spelled out here for simplicity.
        HtmlNodeCollection favoritesContent = doc.DocumentNode.SelectNodes("//div[@id='favoritesContent']/div[@class='personListWrapper']/div[@class='gamerList']/ul//li");

        if (favoritesContent != null)
        {
            foreach (HtmlNode x in favoritesContent)
            {
                // The gamertag is an attribute on the <li>; if the <li> does not have it,
                // the default value "" (empty string, as specified) is returned.
                string gamerTag = x.GetAttributeValue("data-gamertag", "");

                // SelectSingleNode returns null when there is no match, so each lookup is guarded.
                HtmlNode temp = x.SelectSingleNode("./a[@class='gamerpicWrapper']/*/img[@class='favorite']");
                string srcOnPic = temp != null ? temp.GetAttributeValue("src", "not found") : "not found";

                HtmlNode realNameNode = x.SelectSingleNode("./descendant::*//div[@class='realName']");
                string realName = realNameNode != null ? realNameNode.InnerText : "";

                HtmlNode primaryInfoNode = x.SelectSingleNode("./descendant::*//div[@class='primaryInfo']");
                string primaryInfo = primaryInfoNode != null ? primaryInfoNode.InnerText : "";

                // The user counts as online when the statusIcon div contains any markup.
                HtmlNode statusIcon = x.SelectSingleNode("./div[@class='statusIcon']");
                bool online = statusIcon != null && statusIcon.InnerHtml.Trim().Length > 0;
            }
        }
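
The loop above only covers the favorites group and leaves its values in local variables. As a rough sketch of how the same lookups might be wrapped in a reusable helper for both groups, consider the following. The helper name ParseGroup, the looser XPath expressions, and the tiny inline HTML fragment are all assumptions made for illustration; the real page's nesting and class attributes may differ, so the expressions would need checking against the actual file.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class GamerListParser
{
    // Hypothetical helper: returns one dictionary per <li data-gamertag="..."> found
    // anywhere under the element whose id is containerId.
    public static List<Dictionary<string, object>> ParseGroup(HtmlDocument doc, string containerId)
    {
        var result = new List<Dictionary<string, object>>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection items = doc.DocumentNode.SelectNodes(
            "//div[@id='" + containerId + "']//li[@data-gamertag]");
        if (items == null)
        {
            return result;
        }

        foreach (HtmlNode li in items)
        {
            // Assumed structure: these elements sit somewhere below the <li>.
            HtmlNode img = li.SelectSingleNode(".//a[@class='gamerpicWrapper']//img");
            HtmlNode realName = li.SelectSingleNode(".//div[@class='realName']");
            HtmlNode primaryInfo = li.SelectSingleNode(".//div[@class='primaryInfo']");
            HtmlNode status = li.SelectSingleNode(".//div[@class='statusIcon']");

            result.Add(new Dictionary<string, object>
            {
                ["gamertag"] = li.GetAttributeValue("data-gamertag", ""),
                ["gamerpic"] = img != null ? img.GetAttributeValue("src", "") : "",
                ["realname"] = realName != null ? HtmlEntity.DeEntitize(realName.InnerText).Trim() : "",
                ["primaryInfo"] = primaryInfo != null ? HtmlEntity.DeEntitize(primaryInfo.InnerText).Trim() : "",
                // Online when the statusIcon div contains any markup at all.
                ["isOnline"] = status != null && status.InnerHtml.Trim().Length > 0
            });
        }

        return result;
    }

    public static void Main()
    {
        // Made-up fragment that only imitates the structure described in the question.
        string html = @"
<div id='favoritesContent'><ul>
  <li data-gamertag='Gamertag1'>
    <a class='gamerpicWrapper'><img src='gamerpicURL'></a>
    <div class='realName'>realname1</div>
    <div class='primaryInfo'></div>
    <div class='statusIcon'><span></span></div>
  </li>
</ul></div>
<div id='friendsContent'><ul>
  <li data-gamertag='Gamertag2'>
    <a class='gamerpicWrapper'><img src='gamerpicURL'></a>
    <div class='realName'></div>
    <div class='primaryInfo'></div>
    <div class='statusIcon'></div>
  </li>
</ul></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach (string id in new[] { "favoritesContent", "friendsContent" })
        {
            foreach (Dictionary<string, object> gamer in ParseGroup(doc, id))
            {
                Console.WriteLine(id + ": " + gamer["gamertag"] + " online=" + gamer["isOnline"]);
            }
        }
    }
}
```

The dictionary keys deliberately match the property names in the desired JSON, so the two lists can be handed to whichever JSON serializer you prefer without further mapping.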



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow