Bot detection should be a cybersecurity priority for any business with a website or other online presence. Close to 40% of global web traffic comes from malicious bots, and these bots are responsible for many serious data breaches, DDoS attacks, and other cybersecurity threats. 

On the other hand, detecting and filtering out malicious bot traffic is now harder than ever. Today’s bots are highly sophisticated at mimicking human behavior and masking their attack patterns to circumvent common bot detection solutions. 

Today’s fourth-generation bots also leverage AI and machine learning technologies, making an advanced bot detection solution a necessity. They are virtually impossible to detect without specialized expertise in dealing with bot activity and the assistance of AI-based behavioral detection.

In this guide, we will look at the current state of malicious bot activity and how it is distributed, the key requirements for a reliable bot detection solution, and how to properly secure our systems from bot-related cybersecurity threats.

Let us begin by discussing the current state of malicious bot activities and the challenges in bot detection and filtration.

Current Challenges in Detecting Malicious Bot Traffic

First, we have to understand that detecting malicious bot traffic involves two different problems: differentiating bot traffic from legitimate human traffic, and then differentiating good bots from malicious bots.

Obviously, we wouldn’t want to accidentally block valuable human traffic, and with the sophistication of today’s bots in mimicking human activity, differentiating between human users and bots is already challenging on its own. Today’s malicious bots are purposely designed to evade traditional bot detection solutions. 

Second, we have to remember that there are ‘good’ bots we wouldn’t want to block from our site. For example, we wouldn’t want to block Google’s crawler bots, which are essential for getting our site ranked on Google’s SERP. Distinguishing these good bots from malicious ones is even more difficult. We will discuss good bots versus malicious bots in more detail in the next section.

Internet bots have evolved dramatically in recent years, and from a cybersecurity point of view, we can divide them into four generations: 

  • Gen-1 bots: first-generation bots are built with basic scripting tools and perform simple automation tasks such as web/content scraping, carding, comment/form spamming, and other relatively simple threats. They are characterized by inconsistent User-Agents (UAs) and tend to reuse the same IP addresses, so IP-based detection and blocking are typically effective (a minimal sketch of this approach follows this list). 
  • Gen-2 bots: second-generation bots typically operate through “headless” browsers such as PhantomJS or the headless mode of Chrome and Firefox. They can execute JavaScript and maintain cookies, so JavaScript-based challenges and simple CAPTCHAs are largely ineffective against them. 
  • Gen-3 bots: third-generation bots can mimic basic human interactions such as simple non-linear mouse movements, clicks, and irregular keystrokes. However, they are not yet sophisticated at simulating human-like randomness. 
  • Gen-4 bots: currently the latest generation, these bots can perform sophisticated human-like behaviors such as genuinely non-linear mouse movements and random clicks. They can also rotate User-Agents while cycling through hundreds or even thousands of IP addresses. With these advanced traits, detecting gen-4 bots is very challenging.
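As noted in the gen-1 description above, a simple per-IP rate limit is often all it takes to catch first-generation bots. The sketch below illustrates the idea in Python; the threshold, window length, and function names are arbitrary assumptions for illustration, not part of any particular product.

```python
import time
from collections import defaultdict, deque

# Hypothetical limits: flag any IP that sends more than 100 requests
# within a 60-second sliding window. Tune these to your real traffic profile.
MAX_REQUESTS = 100
WINDOW_SECONDS = 60

request_log = defaultdict(deque)  # ip -> timestamps of recent requests
blocked_ips = set()

def register_request(ip, now=None):
    """Record a request and return True if the IP should now be blocked."""
    now = time.time() if now is None else now
    timestamps = request_log[ip]
    timestamps.append(now)
    # Drop timestamps that have fallen outside the sliding window.
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    if len(timestamps) > MAX_REQUESTS:
        blocked_ips.add(ip)
    return ip in blocked_ips
```

In practice, this kind of check belongs at the edge (in a reverse proxy, WAF, or CDN) rather than in application code, but the logic is the same.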

Good Bots vs. Bad Bots

As briefly discussed, it’s very important to avoid accidentally blocking good bots since they can be really beneficial for our website.

The typical good bots can perform the following tasks:

  • Crawl and index your pages so your content can be found in search results
  • Monitor your website’s overall performance (for example, feeding data into tools like Google Analytics)
  • Fetch your RSS feed data
  • Protect your website and web server (some cybersecurity solutions use bots for this)

On the other hand, bad bots or malicious bots are used for harmful purposes, for example:

  • Content scraping: bad bots can strain your bandwidth and may also access data that isn’t supposed to be publicly accessible
  • DDoS attacks or other forms of attack intended to disrupt your site’s performance
  • Posting spam content (e.g. in comment sections), fraudulent form submissions, and similar activities
  • Clicking on PPC ads, which inflates your advertising costs and ruins your ROI

The damage caused by malicious bots can be extremely serious. They can distort website traffic, producing incorrect metrics and undermining the accuracy of your reports. They can steal sensitive data and cause long-term damage to your reputation, and bad bots are often the means of launching DDoS attacks. 

This is why properly differentiating traffic coming from good bots and malicious bots is so important for your site’s and business’s cybersecurity. Below we list some of the major types of good bots so you can identify them properly: 

  • Search engine crawlers: such as Googlebot and Bingbot. As briefly discussed above, these bots crawl your site to index it for their respective search engines. You can specify how these bots may crawl your site via the robots.txt file.
  • Social media bots: major social media platforms run their own bots to perform various tasks, such as driving engagement on their platforms and interacting with users. The Facebook Bot is an example of this type.
  • Vendor bots: for example, Alexa and Slackbot. These bots provide information and services related to the vendor’s app. 
  • Link checker bots: typically operated by SEO analytics solutions, these bots analyze the incoming and outgoing links on a website. SEMrushBot, owned by SEMrush, is a good example of this type.
  • Monitoring bots: their main task is to monitor the uptime and performance of the websites they check. AlertBot is an example of a monitoring bot.
  • Aggregator bots: like Google Feedfetcher, these bots aggregate information from websites and provide users with customized news feeds.

Managing Good Bots

While, as discussed, good bots are beneficial for our site and we shouldn’t block them, in certain cases they can still cause harm. 

For example, good bot traffic can consume a lot of bandwidth and exceed your server’s capacity, slowing down your site or even bringing it down entirely. This is why, before we discuss how to detect and block malicious bot activity, we should discuss how to manage good bot traffic. 

Properly managing good bot traffic will also help us differentiate between good bots and bad bots and avoid false blocking.

When managing good bots, note that blocking them altogether isn’t recommended, as doing so can have negative consequences. Instead, whitelist the good bots that are essential for your site. For example, you’ll almost certainly want to whitelist Googlebot and the major social media bots.

It is, however, advisable to block good bots that are completely unnecessary for your business. For example, if you don’t have a blog, blocking aggregator bots probably won’t affect your business at all. You can also implement geographic restrictions to block bots from certain countries. If you don’t do business in Asia, for example, you might want to block all bots from China and Japan. Doing this can help you conserve your site’s resources.

It’s generally best to prioritize the bots that are critical for your business and block or limit non-crucial ones. A good practice here is to set up rules in your site’s robots.txt file.

The robots.txt file is a text file that lives on your web server and specifies rules for any bot that makes a request to the website. For example, we can write a rule in robots.txt that stops Googlebot from crawling a specific page or section (e.g. outdated content). 
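As an illustration, here is a minimal sketch of such a rule and how a compliant crawler interprets it, using Python’s standard urllib.robotparser; the /old-content/ path and example.com domain are hypothetical.

```python
from urllib import robotparser

# Hypothetical robots.txt: keep Googlebot out of an outdated section of the
# site while leaving everything else open to all bots.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /old-content/

User-agent: *
Disallow:
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler checks the rules before fetching a URL.
print(parser.can_fetch("Googlebot", "https://example.com/old-content/page.html"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/new-post.html"))     # True
```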

Good bots are programmed to look for robots.txt and follow its rules, but it is important to note that bad bots tend to disregard the file, or even mine what’s written in it to find potential vulnerabilities. So, while defining rules for bot behavior in the robots.txt file is important, it isn’t going to be 100% effective. 

Another important thing to consider is that malicious bots commonly masquerade as crawler bots or other types of good bots to evade traditional bot detection systems. Measures such as reverse DNS lookups, behavior comparison (more on this below), and other techniques can help counter this attack vector.
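As a sketch, here is what a forward-confirmed reverse DNS check for traffic claiming to be Googlebot might look like in Python. The accepted hostname suffixes follow Google’s published guidance for verifying Googlebot, but treat the details as assumptions to double-check against current documentation.

```python
import socket

def is_verified_googlebot(ip):
    """Forward-confirmed reverse DNS check for an IP that claims to be Googlebot."""
    try:
        # Step 1: reverse lookup -- the hostname should belong to Google's crawl infrastructure.
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        # Step 2: forward lookup -- the hostname must resolve back to the same IP,
        # otherwise the PTR record may simply be spoofed.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except (socket.herror, socket.gaierror):
        return False
```

The same two-step check works for other self-identified crawlers (Bingbot, for instance) with their own hostname suffixes.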

Two Types of Bot Detection Techniques

While there are many different bot detection methods, they boil down to two main techniques: fingerprinting/signature-based detection and behavioral-based detection.

  1. Fingerprinting-Based Detection

This family of detection techniques involves recognizing information (the ‘fingerprints’) such as the device used by the incoming traffic, the browser, the OS, the number of CPU cores, and other identifiable signatures and patterns.

The main approach is to collect a set of attributes related to the traffic’s device or browser in order to characterize it, and then analyze the collected attributes for any ‘fingerprints’ of known malicious bots.

When performing fingerprinting in a security context, it’s safest to assume that attackers and malicious bots are lying about their fingerprints. This is why security fingerprinting requires collecting multiple attributes and comparing them to check whether their values are consistent with one another (to identify spoofing). 

Browser Fingerprinting

Browser fingerprinting typically uses JavaScript to collect client-side attributes (such as navigator properties and screen characteristics) alongside the HTTP headers sent by the browser. The main idea in detecting bots via browser fingerprinting is to collect information about the device, the operating system (OS), and the browser used. The collected ‘fingerprint’ is then run through heuristic analysis to check whether it belongs to known malicious bots and/or has been modified to mask its identity.

The heuristic analysis is typically performed on a separate server: if it were run directly in the browser or on the web server, the bot operator could obtain the fingerprinting script and analyze it to learn why their bot is being detected (and modify it accordingly). 

Sophisticated bots commonly remove obvious fingerprints, such as the attributes that betray headless browsers (PhantomJS, Selenium, Nightmare, Puppeteer, etc.), so it’s very important in fingerprinting to also check whether the browser’s attributes have been modified. This is why we must perform consistency checks such as the following (a simplified sketch follows the list):

  • OS consistency: checks whether the OS claimed in the User-Agent has been modified or spoofed. 
  • Browser consistency: checks whether features that should be present in the claimed browser actually are; various JavaScript challenges can also be used to verify the claimed browser. 
  • Behavioral inconsistency: checks for inconsistencies that mainly reveal whether the browser is running in headless mode. 
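Here is the simplified sketch referenced above: an OS consistency check that compares the OS claimed in the User-Agent against a navigator.platform value collected by a client-side fingerprinting script. The mapping table and function name are rough illustrations, not a production rule set.

```python
# Simplified OS consistency check: the OS claimed in the User-Agent should
# agree with the platform reported by client-side JavaScript (navigator.platform).
UA_OS_TO_PLATFORMS = {
    "Windows": {"Win32", "Win64"},
    "Mac OS X": {"MacIntel"},
    "Android": {"Linux armv7l", "Linux aarch64"},
    "Linux": {"Linux x86_64", "Linux i686"},
}

def os_is_consistent(user_agent, navigator_platform):
    """Return False when the UA claims an OS that contradicts navigator.platform."""
    for ua_os, platforms in UA_OS_TO_PLATFORMS.items():
        if ua_os in user_agent:
            return navigator_platform in platforms
    # Unknown OS claim: treat as inconsistent (suspicious) rather than letting it pass.
    return False

# Example: a bot spoofing a Windows User-Agent from a Linux headless browser.
print(os_is_consistent(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Linux x86_64",
))  # False -> inconsistent, likely spoofed
```
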
  2. Behavioral-Based Detection

Behavioral detection techniques mainly leverage the differences in behavior between human users and bots. A bot, for example, can move the mouse in a perfectly straight line, but most humans can’t (at least not at a bot’s speed). However, as discussed above, today’s 3rd- and especially 4th-gen bots are quite good at mimicking human behavior. 

Yet even when sophisticated bots are good at simulating human behavior, behavioral-based detection remains useful. Even if the detection itself fails, it forces the malicious bot to constantly mimic human behavior, which makes it perform its tasks much more slowly. This slows the bot’s progress toward its objective, and the attacker may simply give up on the attack. 

Today’s advanced bot detection solutions also apply machine-learning and AI-based behavioral detection techniques that analyze features extracted from user interactions to better detect gen-4 bots. 

In behavioral-based detection, the detection solution can analyze signals such as the following (a small feature-extraction sketch follows the list):

  • Mouse movements (linear or non-linear)
  • Scrolling speed and randomness
  • Keystrokes and the time between two consecutive keys pressed
  • Mouse clicks
  • The number of pages viewed during each session (bots may view the same number of pages across repeated sessions)
  • Whether there’s a pattern in the order of pages viewed. Bots tend to follow a fixed pattern, especially in web scraping tasks, to be more efficient; humans tend to be more random.
  • The total number of requests
  • The average time spent between two consecutive pages
  • Resources blocked/loaded. To be more efficient, bots may block certain resources like CSS, images, ads, and trackers.
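To make a couple of these signals concrete, the sketch below extracts two simple features from raw event data: how straight the mouse path is and how uniform the keystroke timing is. The event format, thresholds, and function names are all illustrative assumptions; real solutions feed dozens of such features into machine-learning models.

```python
import statistics

def mouse_path_linearity(points):
    """Ratio of straight-line distance to actual path length (1.0 = perfectly straight).

    `points` is a list of (x, y) cursor positions. Human mouse paths are usually
    noticeably curved, so values very close to 1.0 are suspicious.
    """
    if len(points) < 2:
        return 0.0
    path_length = sum(
        ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )
    (sx, sy), (ex, ey) = points[0], points[-1]
    straight_line = ((ex - sx) ** 2 + (ey - sy) ** 2) ** 0.5
    return straight_line / path_length if path_length else 0.0

def keystroke_timing_spread(key_timestamps):
    """Standard deviation of inter-key intervals; near-zero spread suggests automation."""
    intervals = [b - a for a, b in zip(key_timestamps, key_timestamps[1:])]
    return statistics.pstdev(intervals) if len(intervals) > 1 else 0.0

# Illustrative check with arbitrary thresholds.
mouse_points = [(0, 0), (50, 1), (100, 2), (150, 3)]   # almost perfectly straight path
key_times = [0.0, 0.1, 0.2, 0.3, 0.4]                  # metronome-like typing rhythm
suspicious = (mouse_path_linearity(mouse_points) > 0.99
              and keystroke_timing_spread(key_times) < 0.01)
print(suspicious)  # True -> worth flagging for further analysis
```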

Ghost Traffic and How to Deal With It

Ghost traffic refers to bot activity that never actually visits your site but still appears in your Analytics reports, especially in the form of referral traffic from irrelevant sites. Ghost traffic targets Google Analytics’ servers instead of your web server, injecting data that distorts your reports. 

What is the purpose of a ghost traffic ‘attack’? The main goal is to lure you, the webmaster, into checking the source and visiting the referral site. Once you visit the referral site, you become vulnerable: the perpetrator may try to compromise your system, inject malware, or simply show you an ad to earn money. 

Dealing with ghost traffic is fairly simple: filter it out of your Analytics reports. First, make a list of all the hostnames sending data to your Analytics property; a legitimate hostname is any domain where your Analytics tracking code is actually present. Most ghost traffic comes from the “(not set)” hostname, or it may use real-looking hostnames (e.g. facebook.com) whose URLs won’t match once you check the Source tab.

Start by getting a list of all the hostnames recorded for your site: go to Audience > Technology > Network in the reporting section, then select Hostname as your primary dimension. 

The idea is to create a whitelist of genuine hostnames (including your own site’s hostname) by building a regular expression that covers only these authentic hostnames (see the example after the steps below). The next step is to create a filter that includes only the valid hostnames:

  1. Click on the Admin section on your Google Analytics dashboard
  2. Go to Filters, then Add Filter in the View section, then select Custom as your filter type.
  3. Select Hostname, then paste the regular expression you’ve created above under Filter Pattern
  4. Click Save
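As a quick sketch, here is what such a hostname whitelist expression might look like, along with a way to sanity-check it before pasting it into the Filter Pattern field; the hostnames are hypothetical placeholders for your own.

```python
import re

# Hypothetical whitelist: your own domain plus any other hostnames that
# legitimately run your tracking code (e.g. a checkout subdomain).
VALID_HOSTNAMES = re.compile(r"^(www\.)?(yourdomain\.com|checkout\.yourdomain\.com)$")

for hostname in ["www.yourdomain.com", "checkout.yourdomain.com", "(not set)", "facebook.com"]:
    verdict = "keep" if VALID_HOSTNAMES.match(hostname) else "filter out"
    print(f"{hostname}: {verdict}")
```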

Don’t forget to update your whitelist filter whenever you add your site’s tracking ID to a new website.

Solutions To Block Malicious Bot Traffic

Now, how should we deal with malicious bot traffic coming to our site? There are several approaches we can implement: 

  • Investing in a Bot Management Solution

Arguably the best approach, provided you have the budget for it. A bot management solution is dedicated software or hardware that identifies and protects your website and systems from malicious bot traffic. These solutions typically maintain constantly updated databases of known good and bad bots, so you don’t need to worry about accidentally blocking good bots. 

The catch is that a good bot management solution like DataDome isn’t free, so you’ll need to consider whether the investment fits your current cybersecurity budget. 

  • WAF (Web Application Firewall) 

A web application firewall, or WAF, is a type of firewall designed to protect your web application (web app). Since most websites today rely on web apps, a WAF is a common solution for blocking bad bots from launching web application attacks. 

A WAF acts as a reverse proxy, sitting between the client and the web application (the web server). Incoming requests reach the WAF first, where they are analyzed and filtered; only traffic that passes the ‘test’ is forwarded to the application. 

There are many free, open-source WAFs to choose from, as well as advanced, premium ones that can monitor all HTTP requests in real time. However, WAFs are only effective at blocking web application attacks; they cannot block every type of malicious bot.
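To illustrate the filter-in-front-of-the-app idea in miniature (this is not a real WAF), here is a tiny WSGI middleware sketch that rejects requests carrying a couple of hypothetical bad-bot User-Agent signatures before they ever reach the application.

```python
import re

# A couple of hypothetical bad-bot signatures. Real WAFs ship with continuously
# updated rule sets that inspect far more than the User-Agent header.
BAD_BOT_UA = re.compile(r"python-requests|curl|scrapy", re.IGNORECASE)

class TinyBotFilter:
    """WSGI middleware that sits in front of the application, WAF-style."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if not user_agent or BAD_BOT_UA.search(user_agent):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Request blocked"]
        # Only traffic that passes the checks reaches the real application.
        return self.app(environ, start_response)

def hello_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, human visitor!"]

application = TinyBotFilter(hello_app)  # serve with any WSGI server, e.g. gunicorn
```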

  • CAPTCHA

CAPTCHA stands for “Completely Automated Public Turing test to tell Computers and Humans Apart,” and as the name suggests, it’s a common approach for filtering out bots while allowing human visitors through. CAPTCHAs are designed to be (very) easy for humans to solve, yet very difficult for bots. 

Implementing a CAPTCHA nowadays is pretty simple; for example, Google’s reCAPTCHA is free to use. However, not only are today’s bots getting better at solving CAPTCHAs, there are also CAPTCHA farm services where real humans are paid to solve the challenges so that bot traffic can take over afterward. 
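For reference, server-side verification of a reCAPTCHA v2 token comes down to a single POST to Google’s siteverify endpoint, roughly as sketched below; the secret key is a placeholder, and the `requests` library is assumed to be installed.

```python
import requests

RECAPTCHA_SECRET = "your-secret-key"  # placeholder: issued in the reCAPTCHA admin console

def verify_recaptcha(token, remote_ip=None):
    """Check the token the browser submitted with the form against Google's API."""
    payload = {"secret": RECAPTCHA_SECRET, "response": token}
    if remote_ip:
        payload["remoteip"] = remote_ip
    resp = requests.post(
        "https://www.google.com/recaptcha/api/siteverify", data=payload, timeout=5
    )
    return resp.json().get("success", False)
```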

So, think of CAPTCHAs as a preliminary layer of defense against bots rather than a one-size-fits-all answer. 

End Words

Above, we discussed some of the most effective ways to keep malicious bots away from your website, but they are obviously not the only options available. Site owners should always pay close attention to incoming traffic and implement web application security best practices to identify and minimize bad bot activity as early as possible. 

When left unchecked, bad bot traffic can easily grow into more serious forms of cybersecurity threats such as DDoS attacks, hacking, and even major data breaches.