Up to 15 percent of active Twitter accounts are really bots: autonomous agents driven by algorithms rather than actual human personalities. That's about 48 million fakes. The 15 percent figure comes courtesy of a new analysis by computer scientists at Indiana University and the University of Southern California using a machine learning framework designed to detect bots based on nearly a thousand distinct Twitter user characteristics. The group's work is described in a paper posted this week to the arXiv preprint server.
On its face, this is a classic machine learning classification problem. Take some properties of an entity—screen name length, account age, and number of retweets per hour, for example—and look at those same properties across many different versions or instances of that entity along with some other property that we want to predict (whether an account is a bot or a human, in our case). Crunch some linear algebra and you'll wind up with an abstracted model of that entity with respect to whatever it is we're making predictions about.
This model is basically a formula: feed it some so-far unseen observations and you get a prediction in return. If we take the above properties for many thousands or millions of Twitter accounts, we should be able to build a model that makes predictions from new observations of those same properties. Machine learning is really just clever statistics.
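To make that concrete, here is a minimal sketch of the train-then-predict loop in Python. It is not the researchers' method: the algorithm here is plain logistic regression, chosen only because it is short, and the accounts are synthetic numbers invented for illustration. The three features (screen name length, account age, retweets per hour) are the ones named above, each assumed to be rescaled to the range 0 to 1.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(model, features):
    # Returns the model's estimated probability that an account is a bot.
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, features)) + bias)

def train(rows, labels, epochs=200, lr=0.1):
    # Stochastic gradient descent on the logistic loss.
    model = ([0.0] * len(rows[0]), 0.0)
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            err = predict(model, x) - y
            weights, bias = model
            model = ([w - lr * err * v for w, v in zip(weights, x)],
                     bias - lr * err)
    return model

def make_account(is_bot, rng):
    # Synthetic, made-up data: [name length, account age, retweets per hour],
    # all scaled to 0..1. Assumption: bots are younger and retweet more.
    if is_bot:
        return [rng.uniform(0.5, 1.0), rng.uniform(0.0, 0.3), rng.uniform(0.5, 1.0)]
    return [rng.uniform(0.2, 0.8), rng.uniform(0.2, 1.0), rng.uniform(0.0, 0.3)]

def make_dataset(n, rng):
    labels = [i % 2 for i in range(n)]  # alternate bot (1) and human (0)
    rows = [make_account(label == 1, rng) for label in labels]
    return rows, labels

rng = random.Random(0)
train_rows, train_labels = make_dataset(200, rng)
test_rows, test_labels = make_dataset(100, rng)

model = train(train_rows, train_labels)

# Score the model on accounts it has never seen.
correct = sum(
    (predict(model, x) > 0.5) == (y == 1)
    for x, y in zip(test_rows, test_labels)
)
accuracy = correct / len(test_rows)
print(f"held-out accuracy: {accuracy:.2f}")
```

The part that matters is the last step: the model is scored on accounts it did not see during training, which is the same thing the researchers do when they point their trained model at millions of unlabeled accounts.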
Part of what makes the new research interesting is the sheer number of features used in the classification model. Just think of a Twitter account and then try to come up with as many different ways of describing it as you can: number of followers, age, username length, number of retweets, number of tweets, verified or not, average tweet length. Um. I could give you maybe 20 or so. Meanwhile, the researchers here are considering things like "positive emoticons entropy of single tweets," "time between two consecutive tweets," and "fraction of users [friends] with default profile and default picture."
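Those feature names sound exotic, but each one boils down to a small calculation. The Python sketch below shows one plausible reading of the three quoted features. The paper's exact definitions are not given here, so everything in it is an assumption: the emoticon list, the use of Shannon entropy over emoticon counts, and the `default_profile` and `default_picture` field names are all hypothetical.

```python
import math
from collections import Counter
from datetime import datetime

# Assumption: a small, hand-picked set of "positive" emoticons.
POSITIVE_EMOTICONS = (":)", ":-)", ":D", ";)")

def emoticon_entropy(text):
    # Shannon entropy (in bits) of the positive emoticons in one tweet.
    # 0.0 means no emoticons, or only one kind; higher means more variety.
    counts = Counter()
    for emoticon in POSITIVE_EMOTICONS:
        n = text.count(emoticon)
        if n:
            counts[emoticon] = n
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def gaps_in_seconds(timestamps):
    # Time between each pair of consecutive tweets, oldest first.
    ordered = sorted(timestamps)
    return [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]

def fraction_default(friends):
    # Share of friends that still have both the default profile and the
    # default picture. The dictionary keys are hypothetical field names.
    if not friends:
        return 0.0
    hits = sum(1 for f in friends
               if f["default_profile"] and f["default_picture"])
    return hits / len(friends)

tweet_times = [
    datetime(2017, 3, 1, 12, 0, 0),
    datetime(2017, 3, 1, 12, 1, 0),
    datetime(2017, 3, 1, 12, 2, 0),
]
friends = [
    {"default_profile": True, "default_picture": True},
    {"default_profile": True, "default_picture": False},
    {"default_profile": False, "default_picture": False},
    {"default_profile": True, "default_picture": True},
]

print(emoticon_entropy("great news :) :) :D :D"))  # 1.0
print(gaps_in_seconds(tweet_times))                # [60.0, 60.0]
print(fraction_default(friends))                   # 0.5
```

An account that tweets at perfectly regular 60-second intervals, as in the toy timestamps above, is the kind of pattern a classifier can pick up on once the gaps are turned into numbers.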
To train their model, the researchers used publicly available datasets consisting of some 15,000 manually verified Twitter bots and 16,000 verified human accounts. (So, these are all accounts that at some point a real person sorted through and decided were human or bot.) They considered the most recent 200 tweets from each account as well as the most recent 100 tweets mentioning each account. That's about 2.6 million bot-generated tweets and 3 million human-generated tweets. Using the resulting model, the researchers went on to classify 14 million Twitter accounts, finding that between 9 and 15 percent of all accounts are likely bots.
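The tweet totals are easy to sanity-check. At 200 tweets per account, the training set could hold at most 3 million bot tweets and 3.2 million human tweets, so the reported totals imply that some accounts had fewer than 200 tweets available. That last point is an inference from the rounded figures above, not something the researchers state. A quick Python check:

```python
# Figures as reported above; all are rounded ("some 15,000", "about 2.6 million").
BOT_ACCOUNTS = 15_000
HUMAN_ACCOUNTS = 16_000
MAX_TWEETS_PER_ACCOUNT = 200

REPORTED_BOT_TWEETS = 2_600_000
REPORTED_HUMAN_TWEETS = 3_000_000

# Upper bounds if every account contributed the full 200 tweets.
max_bot_tweets = BOT_ACCOUNTS * MAX_TWEETS_PER_ACCOUNT      # 3,000,000
max_human_tweets = HUMAN_ACCOUNTS * MAX_TWEETS_PER_ACCOUNT  # 3,200,000

# Implied average number of tweets per account.
avg_bot_tweets = REPORTED_BOT_TWEETS / BOT_ACCOUNTS         # about 173
avg_human_tweets = REPORTED_HUMAN_TWEETS / HUMAN_ACCOUNTS   # 187.5

print(f"bots:   {REPORTED_BOT_TWEETS:,} of at most {max_bot_tweets:,} "
      f"({avg_bot_tweets:.0f} per account)")
print(f"humans: {REPORTED_HUMAN_TWEETS:,} of at most {max_human_tweets:,} "
      f"({avg_human_tweets:.1f} per account)")
```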
There are a couple of interesting caveats here. "First, we do not exclude the possibility that very sophisticated bots can systematically escape a human annotator's judgement," the authors write. "These complex bots may be active on Twitter, and therefore present in our datasets, and may have been incorrectly labeled as humans, making even the 15 percent figure a conservative estimate. Second, increasing evidence suggests the presence on social media of hybrid human-bot accounts (sometimes referred to as cyborgs) that perform automated actions with some human supervision. Some have been allegedly used for terrorist propaganda and recruitment purposes. It remains unclear how these accounts should be labeled, and how pervasive they are."
Twitter itself has estimated that up to 8.5 percent of users "used third party applications that may have automatically contacted our servers for regular updates without any discernible additional user-initiated action," according to a recent SEC filing. The company had no additional comment on its own classification methods or the current study.
It's worth pointing out that not all bots are inherently evil. There are many bots out there that don't even really pretend to be human and provide automated emergency notifications or serve customer service roles. Though the study observed some clustering of different bot types—PR bots, link pushers, spam accounts with no followers, the aforementioned cyborgs—it's not totally clear where benevolent service-bots fit into the ecosystem.