Few things are worse than spam disguised as real content.
Over the past few years, spam – including actual fake episodes – has become a real problem in our industry. In RSS.com’s ongoing battle to combat podcast spam and thanks to the magic of natural language processing techniques, we have come out solidly in the lead.
Curious to find out how we did it? Then read on…
By Alberto Betella, PhD
This is a guest post case study by our friends at podcast hosting provider RSS.com
Identifying Exploited Areas
The modus operandi of a creative malignant agent is to try and figure out how to use a system against itself. In this case, there were three facets of the platform that RSS.com has built to promote podcasters that the spammers realized they could potentially exploit.
The first exploit targeted RSS.com’s freemium business model, which lets creators start using our tools without providing credit card details upfront. Obviously, this meant that spammers could create a theoretically limitless number of new “shows” with exactly one episode and use our platform to publish them.
Secondly, they wanted to leech off our website’s clean technical reputation to lend a misleading credibility to themselves. You see, to make content delivery seamless, we’ve automated the creation of a public webpage for each show. This not only gave spammers a quick and easy method of establishing technically clean search engine optimization (SEO) for their “shows,” but also gave them improved backlinks via the podcast description section.
And finally, to improve their discoverability, they submitted their podcast “episodes,” which were usually randomly strung together strings of noise and musical extracts, to major podcast directories (such as Amazon Music and Spotify) either manually or through our API integrations.
Hundreds of Fake Podcasts
To emphasize the scale of this problem: there were hundreds upon hundreds of these fake podcast shows. Normally, when something like this is done at scale, one expects to find that it’s being driven by bots. These were not. Bots have patterns of behavior that are easy for other bots to identify automatically. But the usual rules of identification were not flagging these fake shows. This meant that these were all being created and uploaded by real users.
Once we had identified their goals (which we outlined above), we got to work!
The first two facets of their exploit—our freemium model and automated webpage building—were simple to solve, especially after consulting Google’s methods for dealing with such common tactics:
First, we identified every show that met the following criteria:
- it had only one short episode, and
- the publisher didn’t have an active subscription to our service.
Once we’d done that, we added a rel=”ugc” attribute, and similar, to the webpage meta tags and to all the hyperlinks for all such shows. The purpose of modifying the metadata of these pages in this way was to signal to web crawlers not to index them for search results, and also to discount the backlinks linking to these pages. In other words, we curtailed the incentive to create such spam pages.
The third facet, involving high-volume spam podcast submissions to other directories, was harder to solve. We had two options:
- we could either hire more people whose sole task would be to review submissions manually, or
- we could let our in-house team flex their under-utilized AI/machine-learning muscles to craft a scalable, cost-efficient way to deal with this.
We went with option B.
Natural Language Processing
Okay, let’s talk about Natural Language Processing (NLP).
Teaching AI how to read, parse and understand how humans write or speak has been one of the most wildly successful projects in computing. Generally speaking, NLP uses pre-tagged text from a huge corpus of words to parse the syntax and semantics of human communication. It does this by breaking words down to their root form, removing filler words, analyzing word frequency, and chunk texts into easy-to-understand tokens.
At this point, the science of NLP has progressed enough to let trained AI models analyze survey results, deal with customer feedback, and, as it pertains to our case, detect spam! You see, most SEO information is textual, which means it’s smack in the domain of NLP analysis!
Back to our problem.
It turns out that our spammers would rarely put any effort into the audio portion of the content they were uploading to our platform. In other words, we could safely ignore that and focus instead on the text portions of their uploads.
We manually reviewed suspicious podcast pages on our platforms and isolated the ones that we thought qualified as spam. Next, we extracted the titles and descriptions from these suspect pages and used them to create a corpus of training data. We fed this data to our ML model. Once trained, we built SpamBot, whose job was to crawl through our shows, extract the relevant text, and use our ML model to output a SPAM score.
If the SPAM score for any given podcast was too high, the bot would auto-delete it. If it was under that critical score but still suspicious, the bot would notify us via a Slack integration. Then, we had to review the podcast ourselves to determine if it was spam.
An important aspect of this manual review process is that it helps our team flag false positives. We were also able to improve the bot’s performance by correcting it in those instances. And yes, we can provide the bot with that feedback via a simple click right there in Slack!
This continuous semi-supervisory training was critical to getting our bot’s spam recognition accuracy to where it is now.
An important aspect of this manual review process is that it also helps our team flag false positives. This semi-supervisory training was critical to helping improve the accuracy of our SPAM bot. You can see the results in the confusion matrix below:
As you can see, with a precision of 99.32%, there’s barely a need for human intervention anymore.
Good to Great
But just because our bot was good didn’t mean we couldn’t make it great. We came up with other ways to differentiate between genuine content and spam. For instance, genuine content was more likely to have its own custom cover art rather than the default that RSS.com provides.
Another thing we did was to start auto-transcribing the audio in our content into text, and then feed the transcripted text into the same machine-learning model we use to flag the podcast descriptions.
That transcription not only helps us detect spam, but also becomes a bonus feature that we can offer to our users.
Our bot was also having problems with false positives, e.g. flagging of podcast content that was produced in less-frequently represented languages (like French). We enhanced this via the simple solution of training our ML model using the text from the most popular podcasts from France.
Stopping Podcast Spammers
We hope that by sharing these strategies, other serious players and creators in podcasting are able to tackle their own inevitable flood of spam content.