How to identify whether a tweet is an original or a retweet when scraping with HtmlAgilityPack?

c# filter html-agility-pack twitter web-scraping

Question

I want a user's tweets for data analysis. I used the HtmlAgilityPack package to scrape Twitter, and it gives me the top 30 tweets.

I identified the tweet-text element and fetched all the tweets, but I want to tell whether each one is an original tweet or a retweet. How can I do that?

I have analysed the HTML. A retweet contains an element with the classes tweet-context with-icn. But when I select that element for every tweet, I get a null reference exception, because not all tweets have it. What should I check, and how, to find out whether a tweet is a retweet?

Code:

HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("https://twitter.com/BarackObama");

// SelectNodes returns null when nothing matches, so check before iterating.
var tweetNodes = doc.DocumentNode.SelectNodes("//tr[@class='tweet-container']");

if (tweetNodes != null)
{
    foreach (var item in tweetNodes)
    {
        Console.WriteLine(item.InnerText);
    }
}

The code above fetches the tweets from Barack Obama's profile, and I get the top 30. How can I recognize which ones are retweets?
Thank you.

Accepted Answer

Scraping Twitter 101

  1. Get all tweets from a page (each one comes in a handy table, <table class='tweet '>):

    HtmlWeb p = new HtmlWeb();
    var doc = p.Load(@"https://twitter.com/dailygametips");
    var nodes = doc.DocumentNode.SelectNodes("//table[@class='tweet  ']");
    
  2. Look in each node for the <span class='context'>, which indicates that the tweet is a retweet:

    List<Tweet> tweets = new List<Tweet>();
    foreach (var node in nodes)
    {
        bool isRetweet = false;
        var spanNode = node.SelectSingleNode(".//span[@class='context']");
        if (spanNode != null && spanNode.InnerHtml.Contains("retweeted"))
        {
            isRetweet = true;
        }
    
  3. We also want the message text, so scrape the <div class='tweet-text'> next:

        string msg = string.Empty;
        var msgNode = node.SelectSingleNode(".//div[@class='tweet-text']");
        if (msgNode != null)
        {
            msg = msgNode.InnerText.Trim();
        }
        tweets.Add(new Tweet(msg, isRetweet));
    }
    

Additionally, the Tweet container class:

class Tweet
{
    public Tweet(string message, bool isRetweet)
    {
        Message = message;
        IsRetweet = isRetweet;
    }

    // Public getters so that callers can read the scraped values.
    public string Message { get; private set; }
    public bool IsRetweet { get; private set; }
}
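
The three steps and the Tweet class can be assembled into one small program. What follows is only a sketch under stated assumptions: it parses an inline HTML string with HtmlDocument.LoadHtml instead of downloading a live page, the sample markup is hypothetical and merely mimics the structure described in the steps, and TweetParser is a helper name introduced here for illustration. It guards both SelectNodes and SelectSingleNode against null, which is the case that throws when a tweet has no retweet marker.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public sealed class Tweet
{
    public Tweet(string message, bool isRetweet)
    {
        Message = message;
        IsRetweet = isRetweet;
    }

    public string Message { get; }
    public bool IsRetweet { get; }
}

// Hypothetical helper, not part of HtmlAgilityPack.
public static class TweetParser
{
    public static List<Tweet> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var tweets = new List<Tweet>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//table[@class='tweet  ']");
        if (nodes == null)
        {
            return tweets;
        }

        foreach (var node in nodes)
        {
            // SelectSingleNode returns null when the element is absent,
            // which is the normal case for an original tweet.
            var contextNode = node.SelectSingleNode(".//span[@class='context']");
            bool isRetweet = contextNode != null
                && contextNode.InnerText.Contains("retweeted");

            var msgNode = node.SelectSingleNode(".//div[@class='tweet-text']");
            string msg = msgNode != null ? msgNode.InnerText.Trim() : string.Empty;

            tweets.Add(new Tweet(msg, isRetweet));
        }

        return tweets;
    }
}

public static class Program
{
    public static void Main()
    {
        // Assumed sample markup: one retweet followed by one original tweet.
        const string sample =
            "<html><body>" +
            "<table class='tweet  '><tr><td>" +
            "<span class='context'>Someone retweeted</span>" +
            "<div class='tweet-text'> Shared post </div>" +
            "</td></tr></table>" +
            "<table class='tweet  '><tr><td>" +
            "<div class='tweet-text'> Own post </div>" +
            "</td></tr></table>" +
            "</body></html>";

        foreach (var tweet in TweetParser.Parse(sample))
        {
            Console.WriteLine((tweet.IsRetweet ? "retweet: " : "original: ") + tweet.Message);
        }
    }
}
```

To run it against a real page, the string passed to Parse could come from HtmlWeb.Load(url).DocumentNode.OuterHtml, assuming the live markup still matches these selectors.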

As you can tell, this is not really rocket science, but you need to understand the basic principles of XPath and scraping.
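
One XPath basic worth knowing here: @class='...' compares the whole attribute string, so it stops matching as soon as the page adds a second class or changes the whitespace. A common XPath 1.0 idiom matches a single class token instead. The sketch below is an illustration only; HasClass and CountTweets are hypothetical helper names, and the sample markup is made up.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathClassMatch
{
    // Builds an XPath 1.0 predicate that matches one class token, regardless
    // of extra whitespace or additional classes on the element.
    public static string HasClass(string className)
    {
        return "contains(concat(' ', normalize-space(@class), ' '), ' " + className + " ')";
    }

    // Counts <table> elements whose class list contains the token "tweet".
    public static int CountTweets(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var nodes = doc.DocumentNode.SelectNodes("//table[" + HasClass("tweet") + "]");
        return nodes == null ? 0 : nodes.Count;
    }
}

public static class XPathDemo
{
    public static void Main()
    {
        // Assumed sample: two tables carry the "tweet" token, one does not.
        const string sample =
            "<table class='tweet  '></table>" +
            "<table class='tweet promoted'></table>" +
            "<table class='retweeted'></table>";

        Console.WriteLine(XPathClassMatch.CountTweets(sample));
    }
}
```

Padding the normalized class list with spaces on both sides is what keeps "retweeted" from matching the token "tweet".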



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why