How RSS.com Outsmarted Podcast Spammers

Few things are worse than spam disguised as real content.

Over the past few years, spam – including actual fake episodes – has become a real problem in our industry. In RSS.com’s ongoing battle against podcast spam, and thanks to natural language processing techniques, we have come out solidly in the lead.

Curious to find out how we did it? Then read on…

This is a guest post case study by our friends at podcast hosting provider RSS.com.

Identifying Exploited Areas

The modus operandi of a creative malicious actor is to figure out how to use a system against itself. In this case, the spammers realized they could potentially exploit three facets of the platform that RSS.com has built to promote podcasters.

The first exploit targeted RSS.com’s freemium business model, which lets creators start using our tools without providing credit card details upfront. Obviously, this meant that spammers could create a theoretically limitless number of new “shows” with exactly one episode and use our platform to publish them.

Secondly, they wanted to leech off our website’s clean technical reputation to lend a misleading credibility to themselves. You see, to make content delivery seamless, we’ve automated the creation of a public webpage for each show. This not only gave spammers a quick and easy method of establishing technically clean search engine optimization (SEO) for their “shows,” but also gave them improved backlinks via the podcast description section.

And finally, to improve their discoverability, they submitted their podcast “episodes,” which were usually random snippets of noise and musical extracts strung together, to major podcast directories (such as Amazon Music and Spotify) either manually or through our API integrations.

Hundreds of Fake Podcasts

To emphasize the scale of this problem: there were hundreds upon hundreds of these fake podcast shows. Normally, when something like this is done at scale, one expects to find that it’s being driven by bots. These were not. Bots have patterns of behavior that are easy for other bots to identify automatically. But the usual rules of identification were not flagging these fake shows. This meant that these were all being created and uploaded by real users.

Once we had identified their goals (which we outlined above), we got to work!

The first two facets of their exploit—our freemium model and automated webpage building—were simple to solve, especially after consulting Google’s methods for dealing with such common tactics:

First, we identified every show that met the following criteria:

it had only one short episode, and
the publisher didn’t have an active subscription to our service.

Once we’d done that, we added rel="ugc" and similar attributes to the webpage meta tags and to all the hyperlinks for all such shows. The purpose of modifying the metadata of these pages in this way was to signal to web crawlers not to index them for search results, and also to discount the backlinks embedded in these pages. In other words, we curtailed the incentive to create such spam pages.
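The two steps above can be sketched in Python. This is only an illustration under stated assumptions, not RSS.com's actual code: the `Show` fields, the 60-second cut-off for a "short" episode, and the choice of a `noindex` robots meta tag are all hypothetical. Only the rel="ugc" attribute is named in the text.

```python
from dataclasses import dataclass
from html import escape

# Hypothetical threshold: the article does not say how short "short" is.
SHORT_EPISODE_SECONDS = 60


@dataclass
class Show:
    title: str
    episode_durations: list[int]  # seconds, one entry per episode
    has_active_subscription: bool


def is_low_trust(show: Show) -> bool:
    """True when the show has exactly one short episode and no paid plan."""
    return (
        len(show.episode_durations) == 1
        and show.episode_durations[0] <= SHORT_EPISODE_SECONDS
        and not show.has_active_subscription
    )


def robots_meta(show: Show) -> str:
    """Meta tag for the show page; noindex is an assumed choice of directive."""
    return '<meta name="robots" content="noindex">' if is_low_trust(show) else ""


def render_link(show: Show, url: str, text: str) -> str:
    """Render a description link, marking it as user-generated when low trust."""
    rel = ' rel="ugc"' if is_low_trust(show) else ""
    return f'<a href="{escape(url, quote=True)}"{rel}>{escape(text)}</a>'
```

Keeping the criteria in a single `is_low_trust` function means the page template and the link renderer cannot disagree about which shows are affected.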

The third facet, involving high-volume spam podcast submissions to other directories, was harder to solve. We had two options:

we could either hire more people whose sole task would be to review submissions manually, or
we could let our in-house team flex their under-utilized AI/machine-learning muscles to craft a scalable, cost-efficient way to deal with this.

We went with the second option.

Natural Language Processing

Okay, let’s talk about Natural Language Processing (NLP).

Teaching AI how to read, parse and understand how humans write or speak has been one of the most wildly successful projects in computing. Generally speaking, NLP uses pre-tagged text from a huge corpus of words to parse the syntax and semantics of human communication. It does this by breaking words down to their root form, removing filler words, analyzing word frequency, and chunking text into easy-to-understand tokens.
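To make those preprocessing steps concrete, here is a toy pipeline in plain Python. It is a rough sketch rather than a description of any particular NLP library: the stop-word list is deliberately tiny, and `crude_stem` is a hypothetical stand-in for a real stemmer.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "for", "on"}


def crude_stem(word: str) -> str:
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text: str) -> list[str]:
    """Lowercase, split into word tokens, drop stop words, and stem."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def term_frequencies(text: str) -> Counter:
    """Count how often each processed token occurs."""
    return Counter(tokenize(text))
```

For example, `tokenize("Streaming the movies and streaming shows")` drops "the" and "and" and reduces the remaining words to `stream`, `movie`, `stream`, `show`.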

At this point, the science of NLP has progressed enough to let trained AI models analyze survey results, deal with customer feedback, and, as it pertains to our case, detect spam! You see, most SEO information is textual, which means it’s smack in the domain of NLP analysis!

Back to our problem.

It turns out that our spammers would rarely put any effort into the audio portion of the content they were uploading to our platform. In other words, we could safely ignore that and focus instead on the text portions of their uploads.

We manually reviewed suspicious podcast pages on our platform and isolated the ones that we thought qualified as spam. Next, we extracted the titles and descriptions from these suspect pages and used them to create a corpus of training data. We fed this data to our ML model. Once the model was trained, we built SpamBot, whose job was to crawl through our shows, extract the relevant text, and use our ML model to output a SPAM score.
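The article does not say which kind of model was used, so the following is purely an illustration of the idea: a small Naive Bayes text classifier, written with the standard library only, that learns from labeled titles and descriptions and returns a score between 0 and 1. The class name, the "spam" and "ham" labels, and the use of genuine shows as counter-examples are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSpamModel:
    """Multinomial Naive Bayes with add-one smoothing over two classes.

    Train with at least one example of each label before scoring.
    """

    def __init__(self) -> None:
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}
        self.vocab: set[str] = set()

    def train(self, text: str, label: str) -> None:
        """Add one labeled title/description to the training data."""
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def _log_likelihood(self, tokens: list[str], label: str) -> float:
        total_docs = sum(self.doc_counts.values())
        total_words = sum(self.word_counts[label].values())
        score = math.log(self.doc_counts[label] / total_docs)  # class prior
        for tok in tokens:
            count = self.word_counts[label][tok] + 1  # add-one smoothing
            score += math.log(count / (total_words + len(self.vocab)))
        return score

    def spam_score(self, text: str) -> float:
        """Return the estimated probability that the text is spam."""
        tokens = tokenize(text)
        log_spam = self._log_likelihood(tokens, "spam")
        log_ham = self._log_likelihood(tokens, "ham")
        # Normalize in log space to avoid floating-point underflow.
        top = max(log_spam, log_ham)
        p_spam = math.exp(log_spam - top)
        p_ham = math.exp(log_ham - top)
        return p_spam / (p_spam + p_ham)
```

After a few calls such as `model.train("watch free movies online hd stream", "spam")` and `model.train("a history podcast about ancient rome", "ham")` (made-up examples), `model.spam_score(text)` gives the value a bot could act on.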

SPAM Score

If the SPAM score for any given podcast was too high, the bot would auto-delete it. If it was under that critical score but still suspicious, the bot would notify us via a Slack integration. Then, we had to review the podcast ourselves to determine if it was spam.
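This two-threshold rule can be written down in a few lines. The threshold values and the names below are hypothetical, since the real cut-offs are not published, and the score is assumed to lie between 0 and 1.

```python
from enum import Enum

# Hypothetical thresholds: the article does not publish the real cut-offs.
DELETE_THRESHOLD = 0.95
REVIEW_THRESHOLD = 0.60


class Action(Enum):
    DELETE = "delete"
    REVIEW = "review"
    KEEP = "keep"


def route(spam_score: float) -> Action:
    """Map a spam score in [0, 1] to the action the bot takes."""
    if spam_score >= DELETE_THRESHOLD:
        return Action.DELETE
    if spam_score >= REVIEW_THRESHOLD:
        return Action.REVIEW  # e.g. post to a Slack channel for a human to check
    return Action.KEEP
```

Keeping a wide band between the two thresholds sends more borderline shows to human review, which costs reviewer time but reduces the risk of auto-deleting a genuine podcast.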

An important aspect of this manual review process is that it helps our team flag false positives. We were also able to improve the bot’s performance by correcting it in those instances. And yes, we can provide the bot with that feedback via a simple click right there in Slack!

This continuous semi-supervised training was critical to getting our bot’s spam recognition accuracy to where it is now.

You can see the results in the confusion matrix below:

[Image: confusion matrix]

As you can see, with a precision of 99.32%, there’s barely a need for human intervention anymore.
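For readers unfamiliar with the metric, precision is the share of shows flagged as spam that really were spam. The sketch below shows how it is computed from confusion-matrix counts. The function names are our own, and any counts passed in are illustrative, not the figures from the matrix above.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of shows flagged as spam that really were spam."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no shows were flagged, so precision is undefined")
    return true_positives / flagged


def recall(true_positives: int, false_negatives: int) -> float:
    """Share of actual spam shows that the bot managed to flag."""
    actual_spam = true_positives + false_negatives
    if actual_spam == 0:
        raise ValueError("no spam shows exist, so recall is undefined")
    return true_positives / actual_spam
```

With made-up counts of 90 correctly flagged shows and 10 wrongly flagged ones, `precision(90, 10)` is 0.9. High precision is what makes auto-deletion safe, while recall measures how much spam still slips through.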

Good to Great

But just because our bot was good didn’t mean we couldn’t make it great. We came up with other ways to differentiate between genuine content and spam. For instance, genuine content was more likely to have its own custom cover art rather than the default that RSS.com provides.

Another thing we did was to start auto-transcribing the audio in our content into text, and then feed the transcribed text into the same machine-learning model we use to flag the podcast descriptions.

That transcription not only helps us detect spam, but also becomes a bonus feature that we can offer to our users.

Our bot was also having problems with false positives, e.g. flagging podcast content that was produced in less frequently represented languages (like French). We fixed this with the simple solution of training our ML model on text from the most popular podcasts from France.

Stopping Podcast Spammers

We hope that sharing these strategies helps other serious players and creators in podcasting tackle their own inevitable flood of spam content.

A big thanks to our friends at RSS.com for this guest post case study. Check out our full review of RSS.com to find out more about their great podcast hosting services.
