Google Analytics: How to filter out spam traffic to get accurate data

When analysing your site’s traffic, it’s easy to assume that the number of sessions you see in your Google Analytics account accurately represents the number of real people visiting your site, but this isn’t always the case.

Sometimes traffic isn’t a real person. Sometimes it’s a bot or spider, and sometimes it’s a spammer.

First up, definitions:

A bot is a service looking around your site for any number of reasons, but typically to check for new content or to update some kind of feed. It may even be a service you use yourself to see if your site is still up and running.

A spider is a type of bot specifically checking each page of your site so that it can be indexed by a search engine.

By and large both bots and spiders are harmless, or even useful, and you can always employ one of several methods to keep them away from pages that you don’t think they need to be checking.

Then there are spammers. A spammer is a real person, but the traffic they send to your site does not come from a real interaction with it. The majority of this kind of spam comes via a tool that Google has added to its service to allow unusual interactions with your Analytics account. It’s called the Measurement Protocol and its intended purpose is innocent.

This article goes into detail about how the spammers do what they do, how much it might be affecting your site, and then focuses on a particular method of reducing or entirely eliminating such spam.

The how – Google’s Measurement Protocol

Google is actually very good at empowering developers to interact with its tools in novel ways – for Google Analytics this means allowing hits to be reported from sources that aren’t necessarily websites. That’s what the Measurement Protocol is for. Anything from a toaster to a server room can have its own Analytics account via the Measurement Protocol API, or add offline events to a regular website’s account.

For example: an online shop could integrate with their courier service and report package delivery details back to Analytics. The journey of a single user could be viewed in Google Analytics from landing on the site, through purchasing, all the way to delivery. Or you could just rig up your office coffee pot to record events when it’s full, empty, brewing, and so on. It’s a very versatile tool; the possibilities are only limited by your imagination…

…and that is the problem.

Data can be injected into your account by anyone with enough know-how. There’s no robust security system behind the Measurement Protocol to prevent a malicious user from sending you traffic.

You could even have a go right now in your browser. The Measurement Protocol Hit Builder Tool can be used to very easily set up a fake payload:

[Image: Google hit builder]
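To make the point concrete, here is a minimal sketch of what such a payload amounts to: a handful of query parameters on a URL. It only builds and prints the URL, it does not send anything. The parameter names (`v`, `tid`, `cid`, `t`, `dh`, `dp`, `dr`, `ul`) are Universal Analytics Measurement Protocol fields; the property ID, client ID, hostnames and the helper name `buildHitUrl` are placeholders assumed for illustration.

```javascript
// Sketch only: builds (but does not send) a Universal Analytics
// Measurement Protocol pageview hit. All values are placeholders.
function buildHitUrl(fields) {
  var params = new URLSearchParams(fields);
  return 'https://www.google-analytics.com/collect?' + params.toString();
}

var exampleHit = buildHitUrl({
  v: '1',                              // protocol version
  tid: 'UA-XXXXX-Y',                   // property ID (placeholder)
  cid: '555',                          // client ID (placeholder)
  t: 'pageview',                       // hit type
  dh: 'example.com',                   // hostname the sender claims
  dp: '/',                             // page path the sender claims
  dr: 'http://spam-referrer.example/', // referrer the sender claims
  ul: 'any text the sender likes'      // "language" the sender claims
});

console.log(exampleHit);
```

Nothing in the payload proves that the sender ever loaded a page on your site, which is why the referrer and language fields can carry whatever a spammer wants to show you.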

But that’s just the ‘how’ of traffic spam, there’s the ‘who’ and the ‘why’ too.

The who and the why – the real impact of fake traffic

Fortunately there is a very public case study for the misuse of the Measurement Protocol: a man from Russia by the name of Vitaly Popov. You’d think that misusing a tool in this way would be the sole preserve of shady practitioners, but Vitaly has made sure that there is a very earnest and public face to it.

Vitaly isn’t a spammer in his own mind; he’s just a canny marketer. To him, the Measurement Protocol is a method of advertising himself that few others are trying. As a result, Vitaly’s name and exploits can be seen across most Google Analytics accounts and, considering that in 2017 50% of sites used Google Analytics, that’s a lot of free advertising.

If you want to make this article more interactive, just go into a Google Analytics account now, open up the date range to the last 24 months, move to Acquisition > All Traffic > Channels and add ‘Language’ as a secondary dimension. In the language field you may notice a few odd entries amongst the “en-gb”, “en-us”, or whatever language codes your site normally sees.

One in particular may look like this:

[Image: Vitaly GA spam]

That’s Vitaly.

And so are a lot of the other odd entries. Vitaly has a number of domains that he likes to promote, sometimes claiming they’re better versions of Google Search, and sometimes maybe even telling you to vote for Trump in 2016 and congratulating Trump after he won.

But how does this help Vitaly? It’s just a dodgy language value!

Vitaly also inserts one of his domains as a referral. This means you think you’re getting traffic from somewhere that you’re actually not. Furthermore, Vitaly often uses misspelled domains, or characters from other alphabets that look very much like Latin alphabet characters. For example:

[Image: Vitaly GA spam]

Can you see it?

That’s not a regular ‘k’ in ‘lifehacĸ’ but a lookalike character, and it will take you not to the popular productivity site but to one of Vitaly’s sites.
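If your eyes don’t catch it, a few lines of script will. The sketch below flags every character outside plain ASCII in a referrer hostname and prints its Unicode code point. The function name `findNonAsciiChars` and the `.example` domain are made up for illustration, and the example assumes the lookalike character is U+0138.

```javascript
// Sketch: list any characters in a string that fall outside plain ASCII,
// which is where lookalike letters live.
function findNonAsciiChars(text) {
  var found = [];
  Array.from(text).forEach(function (ch, index) {
    var codePoint = ch.codePointAt(0);
    if (codePoint > 127) {
      found.push({
        char: ch,
        index: index,
        codePoint: 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0')
      });
    }
  });
  return found;
}

// Assumption: the lookalike is U+0138; the domain is a made-up placeholder.
var suspicious = findNonAsciiChars('lifehac\u0138.example');
console.log(suspicious);

// An ordinary ASCII hostname produces an empty list.
console.log(findNonAsciiChars('example.com'));
```

Bear in mind that legitimate internationalised domains also contain non-ASCII characters, so a check like this is a prompt to look closer rather than proof of spam.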

Vitaly is actually a godsend in this regard. Yes, he’s ruining referral and social channels with his spam, but he’s doing it obviously, to push his opinions and sites. This means we can learn from him, because there may also be spammers doing this subtly. Some spammers may do this for the sole reason of ruining your site’s stats.

The solution – Google Tag Manager to the rescue

All this spam traffic needs filtering out and so, for a long time, analysts, webmasters, site-owners, and marketers have pored over resources online to set up filters within Google Analytics: filtering out bad language settings, only recording traffic with the correct hostname, or simply blocking referrals from specific spam domains. It’s been a game of whack-a-mole – reacting to problems as they pop up. What really needs to be done is to make your site more secure by asking more of your users, to verify that they aren’t fake. That sounds like a tall order for your users, but really it’s not.

What we can do is have the site set a cookie. HTTP cookies are small pieces of data sent from a website and stored on the user’s computer by the user’s web browser. Just by setting our own cookie with any arbitrary value we can check for it and confirm that the hit to the site is a real session.

The easiest way for us to do that ourselves is with a little help from Google Tag Manager.

The Solution in a nutshell:

  1. Get Google Tag Manager to set a first-party cookie with any value we want (or use an existing cookie if you like!)
  2. Set up Google Analytics to accept values to a new Custom Dimension
  3. Have the Google Analytics tag in Google Tag Manager insert the value it finds in the cookie to our new Custom Dimension
  4. Then, set Google Analytics to only record hits from traffic that includes the correct value in that Custom Dimension.

The way it works is that the current slew of fake traffic doesn’t ever reach your site, so it never gets the cookie, and traffic without the cookie never gets recorded in your Google Analytics view. It’s a secret handshake of sorts.

And now, in detail:

Step 1

Set that cookie!

New tag

Type: Custom HTML

  1. Make a cookie from a simple script. You can do this however you like, but you can see how we did it below.
  2. Set the tag to ‘Support document.write’
  3. Set the tag firing priority to an arbitrarily high value, like ‘100’ (we’ll set a lower priority tag later, as we want this one to fire first)
  4. Set the tag to fire once per event
  5. Set the trigger to be All Pages

You should now have something like this:

[Image: set cookie script]
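As a rough guide, a cookie-setting script for this tag could look like the sketch below. It is not necessarily the script used in our own container. The cookie name `site_handshake`, the value `real-visitor` and the 30-day lifetime are arbitrary choices assumed for illustration, so pick your own. In a Custom HTML tag the code would sit between `<script>` and `</script>` tags.

```javascript
// Hypothetical settings: choose your own name, value and lifetime.
var COOKIE_NAME = 'site_handshake';
var COOKIE_VALUE = 'real-visitor';
var COOKIE_DAYS = 30;

// Build the string that gets assigned to document.cookie.
function buildCookieString(name, value, days, now) {
  var expires = new Date(now.getTime() + days * 24 * 60 * 60 * 1000);
  return encodeURIComponent(name) + '=' + encodeURIComponent(value) +
    '; expires=' + expires.toUTCString() +
    '; path=/';
}

// Only touch document.cookie when running in a browser.
if (typeof document !== 'undefined') {
  document.cookie = buildCookieString(COOKIE_NAME, COOKIE_VALUE, COOKIE_DAYS, new Date());
}
```

Setting `path=/` makes the cookie readable on every page of the site, which matters because the Google Analytics tag fires on all of them.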

Step 2

Now get your Custom Dimension set up in Google Analytics

  1. Load up your site’s Google Analytics property
  2. Navigate to the admin section
  3. In the central column select ‘Custom Definitions’, then ‘Custom Dimensions’
  4. Enter a name for your new custom dimension
  5. Select the ‘User’ scope

Which should get you something like this:

[Image: user scope]

Step 3

Now we’ll set up the cookie as a variable in Google Tag Manager and then have the Google Analytics tag add that variable as the value to be picked up by the Custom Dimension.

1. Add a new user-defined variable in Google Tag Manager

2. Select the ‘1st-Party Cookie’ type

3. Enter the name of your cookie from earlier (so the variable knows which cookie to look for)

[Image: variable first-party cookie config]

4. Now go to your Universal Google Analytics tag in Google Tag Manager and in the settings add the cookie variable as the value you want to send to that Custom Dimension you set up earlier, just by setting the Index value, and the variable name:

[Image: custom dimension]

5. Now set your Google Analytics tag to fire at a lower priority value than the Cookie tag, just to make sure the cookie is there before Google Analytics looks for it. In our Google Tag Manager we set ‘100’ for the cookie, so we’ll set ‘50’ here. If we start making tonnes of tags that need to fire in a particular order we can tweak this, but there’s already plenty of space between 50 and 100 to fit other tags in.

6. Finally, for this step we’ll make that firing order a bit more robust by setting up Tag Sequencing, right at the bottom of the tag. Set it to ‘Fire a tag before [name here] fires’, and select the Cookie tag you made earlier.

Things may all seem a bit arcane in this step but all we’re doing is looking out for the cookie, pulling the value from inside the cookie, and then getting that into Google Analytics as a Custom Dimension.
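If it helps to see that lookup spelled out, the sketch below does by hand roughly what a ‘1st-Party Cookie’ variable does for you: it finds one named cookie in the browser’s cookie string and returns its value. The function `readCookie` and the cookie name `site_handshake` are illustrative assumptions, not part of Google Tag Manager.

```javascript
// Sketch: pull one cookie's value out of a "name=value; name=value" string.
// Returns undefined when the cookie is not present.
function readCookie(cookieString, name) {
  var pairs = cookieString ? cookieString.split('; ') : [];
  for (var i = 0; i < pairs.length; i++) {
    var separator = pairs[i].indexOf('=');
    if (separator === -1) {
      continue; // skip malformed entries with no "="
    }
    var key = decodeURIComponent(pairs[i].slice(0, separator));
    if (key === name) {
      return decodeURIComponent(pairs[i].slice(separator + 1));
    }
  }
  return undefined;
}

// In a browser you would pass document.cookie; a literal string is used here.
var handshake = readCookie('theme=dark; site_handshake=real-visitor', 'site_handshake');
console.log(handshake);
```

A Measurement Protocol hit sent straight to Google never runs this lookup in a browser, so it has no value to put in the Custom Dimension.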

Step 4

Finally, we can set up a trusty Google Analytics filter! Google Analytics can’t filter based on the presence of a cookie normally, but we’ve handily translated the presence of the cookie into our Custom Dimension value.

  1. In the Google Analytics Admin section, in the far-right column, set up a new filter
  2. Give it a name
  3. Select the Custom type
  4. Select ‘Include’
  5. Select the name of your Custom Dimension from the ‘Filter Field’
  6. Enter the value you’re expecting the cookie to contain
  7. Hit save!

[Image: GA filter]

Now, you’re pretty much done but I’d highly recommend testing this setup before publishing the container. Preview the changes in Google Tag Manager and see if your filtered view picks up your own traffic to your site.

As an extra step for testing a drastic new filter like this one we at Bozboz make use of three views.

  1. is our master view, no filtering whatsoever,
  2. is our filtered no-spam view with the above filter,
  3. is the exact opposite, just so we can see how much spam the filter is catching.

To make that yourself just follow the same steps above, but at Step 4 make it an ‘exclude’ filter in Google Analytics instead of ‘include’.
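The logic of the three views can be sketched in a few lines. This is a simplified model, not how Google Analytics is implemented: the hit records, the `dimension1` key and the value `real-visitor` are made-up examples, and exact equality stands in for the filter’s pattern matching.

```javascript
// Sketch: split a list of hits into the three views described above.
function splitIntoViews(hits, dimensionKey, expectedValue) {
  var matches = function (hit) {
    return hit[dimensionKey] === expectedValue;
  };
  return {
    master: hits.slice(),                        // no filtering whatsoever
    noSpam: hits.filter(matches),                // 'include' filter
    spamOnly: hits.filter(function (hit) {       // 'exclude' filter
      return !matches(hit);
    })
  };
}

// Hypothetical hits: two from real browsers, one injected without the cookie value.
var sampleHits = [
  { page: '/', dimension1: 'real-visitor' },
  { page: '/', dimension1: undefined },
  { page: '/pricing', dimension1: 'real-visitor' }
];

var views = splitIntoViews(sampleHits, 'dimension1', 'real-visitor');
console.log(views.master.length, views.noSpam.length, views.spamOnly.length);
```

The include and exclude views always add up to the master view, which makes for a quick sanity check once real traffic starts arriving.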

Then just wait for a bit of traffic. You should see the traffic splitting between the views, and you’ll know very quickly whether it’s working or not as the fake traffic will have very odd behavioural patterns, like 0-1 second pageviews or 100% bounce rate.


Obviously, this is a hefty filter, requiring a bit more to happen on your user’s side before their session gets tracked. We’re talking in milliseconds, but there’s still an added delay in Google’s very purposefully optimised tracking setup. However, any real traffic that is missed by the new filter because of timings is traffic that is barely interacting with the site anyhow.

The filter will ignore any hit from a real user who blocks cookies from being set. Then again, a user who blocks cookies is already relatively anti-tracking.

Finally, this kind of filtering isn’t just targeting the fake traffic; it stops all Measurement Protocol hits from being recorded in the view. If you use the Measurement Protocol legitimately, this filter will need amendments, though if you know your way round the Measurement Protocol this should be simple for you. Adding a new item to your Measurement Protocol payloads which purposefully populates the custom dimension is all that’s needed.
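As a minimal sketch of that amendment: Universal Analytics Measurement Protocol payloads carry custom dimensions as `cd` followed by the dimension’s index, so a legitimate hit only needs one extra field. The helper name `addHandshake`, the index `1`, the value `real-visitor` and the example event are all assumptions for illustration, so use the index and value from your own setup.

```javascript
// Sketch: copy a payload and add the custom dimension the filter looks for.
function addHandshake(payload, dimensionIndex, expectedValue) {
  var copy = Object.assign({}, payload);
  copy['cd' + dimensionIndex] = expectedValue;
  return copy;
}

// Hypothetical offline event, e.g. a courier reporting a delivery.
var deliveryEvent = {
  v: '1',
  tid: 'UA-XXXXX-Y', // placeholder property ID
  cid: '555',        // placeholder client ID
  t: 'event',
  ec: 'delivery',    // event category
  ea: 'delivered'    // event action
};

var signedEvent = addHandshake(deliveryEvent, 1, 'real-visitor');
console.log(new URLSearchParams(signedEvent).toString());
```

Anyone who learns the index and value could add the same field to fake hits, so treat the value like a password and change it if spam starts getting through.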

