Some server operators find themselves battling with a high CPU load on their systems, which inevitably slows down website responses.
High CPU load is often caused by search engine crawlers and “bad bots”: crawlers that behave much like search engine crawlers but serve no purpose for you. What bad bots do with the crawled data often remains a mystery.
Hackers also scan websites to identify the software being used to operate them. They do this to exploit any security vulnerabilities found in the software. In some cases, these scans may even attempt to access passwords that have been accidentally left behind by web designers in web space files.
In this article, you will learn how to effectively keep annoying visitors away with simple means using Plesk and Fail2Ban. To do this, we extend the existing Fail2Ban “plesk-apache-badbot” jail and the “apache-badbots” filter.
If you’re unfamiliar with Fail2Ban, fear not – we’ll provide an overview in this article. And if you haven’t yet incorporated Fail2Ban into your system, there’s no better time than now to do so.
A Few Tips on CPU Load
How Do You Notice a High CPU Load?
A simple answer: the server reacts slowly. For everything. Whether websites, mail retrieval, or hard disk processes – when the load is high, everything runs a little slower than usual. Command line tools such as “top”, “htop” and “uptime” provide insight. “uptime” shows the load averages, i.e. how heavily the processor’s CPUs are being used, while “top” and “htop” show which processes are consuming the most resources on a server. An evaluation of the Linux processes can also provide good insights. If you want to observe live the 20 processes that place the highest load on a server, you can do this with a watch command:
# watch "ps aux | sort -nrk 3,3 | head -n 20"
Sometimes slow database processes or a high number of them can slow down database transactions. Let’s extend the watch with an extra section that displays the current database processes, too:
# MYSQL_PWD=`cat /etc/psa/.psa.shadow` watch "ps aux | sort -nrk 3,3 | head -n 20 && echo "\ " && mysqladmin proc status -u admin"
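If you just want a quick impression of whether the machine is overloaded at all, compare the load averages reported by “uptime” with the number of CPU cores. The output below is shortened and purely illustrative:

# uptime
 14:32:08 up 12 days,  3:41,  1 user,  load average: 8.12, 6.43, 5.01
# nproc
4

A load average that stays well above the number of cores – here 8 on a 4-core machine – means that processes are queuing for CPU time.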
Which Processes Cause the Most Problems?
In practice, problems arise primarily from the widespread use of PHP. Many websites use PHP, and rendering their pages for each requestor is correspondingly computationally intensive. With
# ps aux | grep php-fpm | grep -vE "master process|grep "
you get a snapshot of which of your domains are currently particularly busy. The %CPU column of the output shows how much CPU each process is using.
What Can Cause a High CPU Load?
High CPU Load Due to Useless Traffic
Every time requests arrive at the network interface, your host computer and server software such as Nginx and Apache web server have to do work. Data packets have to be read and interpreted, the web server has to start an interpreter such as PHP, scripts are executed, database queries are made, resulting in disk accesses, until finally the finished web page is rendered and sent back to the requestor. A single request results in thousands of operations. Each of these operations requires computing time and, in the worst case, access to the hard disk.
The more such requests arrive, the higher the load increases. All requests are queued and processed one after the other. (Up to a certain number of requests can be processed in parallel, but once that limit is exceeded, requests must be queued.) The longer the queues, the longer the processing time, the slower the response time of a website.
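Where exactly this parallelism limit sits depends on your stack. For PHP websites it is typically the PHP-FPM pool configuration; the following directives are only an illustrative sketch, not recommended values:

; illustrative PHP-FPM pool settings (values are examples only)
pm = dynamic
; hard limit on the number of parallel PHP worker processes
pm.max_children = 20
pm.start_servers = 4
pm.min_spare_servers = 2
pm.max_spare_servers = 6

Once all workers are busy, additional requests wait in the queue – exactly the effect described above. In Plesk, such values are usually managed via the domain’s PHP settings rather than edited by hand.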
Your goal is to shorten the queues. You can do this by blocking visitors that you do not want on the server. Your websites should serve regular visitors, but your server should recognize useless visitors and not let them in at all. These useless visitors are “bad bots” and hackers.
High CPU Load Due to Well-Intentioned Processes and Software
Not all situations in which a server is heavily loaded are caused by bad bots and hackers.
For example, data compression processes require a lot of computing power. If you use the Plesk Backup Manager to create a backup and compress the contents of the backup, this can place a heavy load on the server. In Plesk, however, you can reduce the priority of such processes (adjust the niceness) and limit the data compression so that you have some control over the load that occurs. Due to the priority that can be given to other processes in the operating system, website visitors hardly notice the backup process.
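The same idea applies to heavy jobs that you start manually. As a hedged example, a compression job can be run at low CPU and I/O priority like this (the paths are placeholders):

# nice -n 19 ionice -c3 tar czf /backup/site-backup.tar.gz /var/www/vhosts/example.com/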
Another common cause is caching plugins for popular content management systems. Such plugins visit your website as if they were users. They call up the pages of your website and generate cached, fully rendered files from them. Later, when a real user visits the website, only the fully rendered files need to be delivered. A good idea in itself, but if a website has a lot of content but few visitors, caching plugins often generate significantly more server load (and thus slow down the server overall) than if the website was operated without them. Caching plugins are therefore counterproductive for many websites.
Limit CPU Load With Cgroups
To prevent individual websites from generating so much load that other websites are slowed down because the server no longer has enough computing power available, you can limit the load by using “cgroups” (“control groups”). Individual services and users thus receive a defined maximum share of computing power.
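For illustration, this is roughly what the manual approach looks like on a systemd-based system with cgroups v2; the service name is a placeholder, and Plesk users will normally prefer the component mentioned in the tip below:

# systemctl set-property --runtime httpd.service CPUQuota=200%
# systemd-cgtop

CPUQuota=200% caps the service at the equivalent of two full CPU cores, and systemd-cgtop shows the resulting resource usage per control group.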
Pro Tip
Cgroups can be set up manually in the operating system, but it is easy to lose track of them because there are so many control options. It is much easier with the Plesk Cgroups component. The Cgroups component is included free of charge in the Plesk Web Pro and Web Host editions. You can find out more here.
Cgroups are a good idea and important, but they do not prevent useless traffic from continuing to arrive on a server and demanding maximum performance from individual websites. It is therefore better to get to the root of the problem.
Typical Attack Scenarios
Here we look at the type of request we want to avoid on a server as far as possible.
Bad Bots
“Bad bots” are useless crawlers that generate a lot of traffic on your server but will not promote your website. What bad bots do with the data they collect often remains a mystery. Some harvest email addresses from the source code of websites, some check websites for trademark infringements.
The bots use many different source IP addresses and behave like other crawlers that scan a website, but they are often far more aggressive. In your website log file, for example, it looks like this:
123.123.123.123 - - [16/Jul/2023:20:26:11 +0200] "GET / HTTP/1.1" 200 162 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://www.majestic12.co.uk/bot.php?+)"
234.234.234.234 - - [16/Jul/2023:22:22:24 +0200] "GET / HTTP/1.1" 200 162 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://www.majestic12.co.uk/bot.php?+)"
111.222.111.222 - - [16/Jul/2023:22:23:36 +0200] "GET / HTTP/1.1" 200 162 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://www.majestic12.co.uk/bot.php?+)"
232.232.232.232 - - [16/Jul/2023:23:10:01 +0200] "GET / HTTP/1.1" 200 162 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://www.majestic12.co.uk/bot.php?+)"
The example shows how the bot retrieves the homepage four times within a short period of time (“GET /”). A normal search engine would only retrieve the page once and thus generate much less CPU load and traffic. Below we show how to block these requests.
Hacker Scans
General Scans
In the website log file you will occasionally find requests that try to retrieve non-existent files. These suspicious requests come from the same source IP address, but they may use different user agents. Log entries could look like this:
123.123.123.123 - - [02/Nov/2023:15:04:33 +0100] "GET //domain.tld/wp-content/plugins/woocommerce/assets/js/zoom/jquery.zoom.min.js HTTP/1.0" 301 728 "-" "Mozilla/5.0 (Linux; Android 7.1.1; XT1710-02 Build/NDS26.74-36) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36"
123.123.123.123 - - [02/Nov/2023:15:04:32 +0100] "GET //domain.tld/wp-content/plugins/borlabs-cookie/assets/javascript/borlabs-cookie.min.js HTTP/1.0" 301 737 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
123.123.123.123 - - [02/Nov/2023:15:04:33 +0100] "GET //domain.tld/wp-content/plugins/quform/cache/quform.js HTTP/1.0" 301 705 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:15.0) Gecko/20120724 Debian Iceweasel/15.02"
123.123.123.123 - - [02/Nov/2023:15:04:33 +0100] "GET //domain.tld/wp-content/plugins/gtranslate/js/flags.js HTTP/1.0" 301 705 "-" "Web Downloader/6.9"
123.123.123.123 - - [02/Nov/2023:15:04:33 +0100] "GET //domain.tld/wp-content/plugins/woocommerce/assets/js/frontend/cart-fragments.min.js HTTP/1.0" 301 735 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.49 Safari/537.36"
In this example, you can see that the IP address is always the same, but the user agents vary:
- Mozilla/5.0 (Linux; Android …)
- Mozilla/5.0 (Macintosh; Intel Mac OS …)
- Mozilla/5.0 (X11; Linux x86_64 …)
- Web Downloader/6.9
A real search engine would use the same user agent for every request, so these cannot be the requests of a search engine.
Such crawls aim to find out which software a website uses. To do this, they request files that are characteristic of a particular piece of software. For example, the file “cart-fragments.min.js” is characteristic of Woocommerce. If the web server returns the file, the requestor knows that Woocommerce is running on the website. In a further step, the attacker can tailor an attack to vulnerabilities common to Woocommerce websites.
In the sample log extract, you can also see the server’s “301” response code.
123.123.123.123 - - [02/Nov/2023:15:04:33 +0100] "GET //domain.tld/wp-content/plugins/gtranslate/js/flags.js HTTP/1.0" 301 705 "-" "Web Downloader/6.9"
The request is therefore redirected – an indication that the requested file does not exist. Instead of a 404-not-found, a 301-permanent-redirect is sent so that an ordinary visitor who has mistyped the URL still sees a valid page in response. This is a procedure that many websites use to avoid losing valuable visitors in the event of typos.
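For context, this redirect behaviour can be produced by a rule along the following lines in the website’s .htaccess – a sketch of the mechanism only, not a recommendation:

RewriteEngine On
# if the requested file or directory does not exist, send a 301 redirect to the homepage
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^ / [R=301,L]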
If the file had been available and no URL rewrites had jumped in to respond with a 301 code, the log entry would have looked like this:
123.123.123.123 - - [02/Nov/2023:15:04:33 +0100] "GET //domain.tld/wp-content/plugins/gtranslate/js/flags.js HTTP/1.0" 200 705 "-" "Web Downloader/6.9"
The code 200 would have signaled to the attacker: “This is a website that uses the ‘gtranslate’ plugin.” The hacker could now specifically check security vulnerabilities that are known for this plugin.
Since such scans are carried out within seconds on hundreds of file names and against different websites on the same host, a high unnecessary base load is created for the CPU and hard disk.
Scans for Author IDs
In WordPress, each author has a unique ID number. Hackers want to find out which authors exist so they can target those accounts. Normal requests that address an author archive use a valid number. Hackers, however, do not yet know the valid numbers and therefore iterate through numbers in the URL, such as “?author=8”, to test which of them generate valid responses. If an author ID does not exist, the website returns a 404-not-found error. This is how it could look in your log:
123.123.123.123 - - [17/Dec/2023:23:44:15 +0100] "GET /?author=10 HTTP/1.0" 404 45852 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0"
123.123.123.123 - - [17/Dec/2023:23:44:15 +0100] "GET /?author=11 HTTP/1.0" 404 45852 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0"
123.123.123.123 - - [17/Dec/2023:23:44:24 +0100] "GET /?author=12 HTTP/1.0" 404 45852 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0"
123.123.123.123 - - [17/Dec/2023:23:44:24 +0100] "GET /?author=13 HTTP/1.0" 404 45852 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0"
Scans for Passwords
In a similar way, hackers try to read passwords directly from a website. You may wonder how this is possible, as files such as the WordPress “wp-config.php” cannot be downloaded directly. However, hackers have found that many users save backups of important configuration files in their web space – with extensions such as .bak, .backup or .txt. Because such copies are no longer executed as PHP, the web server delivers their contents as plain text, and they can be displayed directly in the browser.
This is not just an attempt to crack WordPress, Joomla and other well-known systems. Some users also thoughtlessly save access data to the server itself as a downloadable “backup” file. We have observed scans that attempt to determine Amazon AWS access data. Based on the timestamps in the following example, you can see that several file names are checked within a very short time. As with the general hacker scan shown in the previous section, the attacker pretends to be a different browser with each request:
123.123.123.123 - - [22/Oct/2023:19:09:06 +0200] "GET /opt/data/secrets/aws.csv HTTP/1.1" 301 162 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
123.123.123.123 - - [22/Oct/2023:19:09:06 +0200] "GET /usr/local/etc/aws/config.json HTTP/1.1" 301 162 "-" "Mozilla/50 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36"
123.123.123.123 - - [22/Oct/2023:19:09:06 +0200] "GET /v1/credentials/aws.json HTTP/1.1" 301 162 "-" "Mozilla/5.0 (iPhone; CPU OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) FxiOS/28.0 Mobile/15E148 Safari/605.1.15"
123.123.123.123 - - [22/Oct/2023:19:09:06 +0200] "GET /webpack-aws.config.js HTTP/1.1" 301 162 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"
123.123.123.123 - - [22/Oct/2023:19:09:06 +0200] "GET /usr/local/aws/credentials.json HTTP/1.1" 301 162 "-" "Mozilla/5.0 (Linux; Android 12; SM-G991B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Mobile Safari/537.36 EdgA/100.0.1185.50"
123.123.123.123 - - [22/Oct/2023:19:09:06 +0200] "GET /usr/local/etc/aws/credentials.yml HTTP/1.1" 301 162 "-" "Mozilla/4.0 (compatible; GoogleToolbar 4.0.1019.5266-big; Windows XP 5.1; MSIE 6.0.2900.2180)"
123.123.123.123 - - [22/Oct/2023:19:09:06 +0200] "GET /salt/pillar/aws.sls HTTP/1.1" 301 162 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.846.563 Safari/537.36"
123.123.123.123 - - [22/Oct/2023:19:09:06 +0200] "GET /salt/pillar/aws_keys.sls HTTP/1.1" 301 162 "-" "Mozilla/4.1 (compatible; MSIE 5.0; Symbian OS; Nokia 6600;452) Opera 6.20 [en-US]"
123.123.123.123 - - [22/Oct/2023:19:09:06 +0200] "GET /wp-content/plugins/secrets/aws.yml HTTP/1.1" 301 162 "-" "Mozilla/5.0 (iPad; CPU OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko ) Version/5.1 Mobile/9B176 Safari/7534.48.3"
How to Combat Bad Bots and Other Attackers
Approaches With Little Effect
Four common methods of blocking attackers work poorly in practice:
- Send a complaint to the network operator of the attacker IP
- Exclude crawlers by the robots.txt file
- Block individual IP addresses
- Block user agents with a rewrite rule in the .htaccess file
Send a Complaint to the Network Operator of the Attacker IP
In our tests it was noticeable that attempts to guess AWS data always came from AWS EC2 server instances themselves. Of course, that does not mean Amazon is carrying out these attacks. Instead, hackers rent a server instance there, upload a script and execute it. Or an existing EC2 instance with vulnerabilities gets infected and immediately starts working as a bot for the hacker – something that can be done in a matter of minutes. As a reputable provider, Amazon AWS is very cooperative in stopping such server instances. You can inform the provider, for example, via their abuse form.
Other reputable providers also offer Internet users the opportunity to contact them if a domain or server is under attack. In the domain and network records of providers, you will usually find “abuse” e-mail contact addresses. A good way to find the network provider behind a domain or IP address is the “Domain Dossier” tool on Central Ops.
Unfortunately, it takes a while to process such complaints to the network operators. In addition, attackers can easily change the IP address. This usually happens automatically because websites of unsuspecting users are infected with malware through hacker attacks, which in turn attacks other servers. Complaints are therefore ineffective in the medium term.
Exclude Crawlers by the robots.txt File
The robots.txt file is intended to give crawlers instructions as to which files on a website should be crawled and which should not. You can try using entries such as
User-agent: MJ12bot
Disallow: /
to block a bot such as MJ12bot. Unfortunately, robots.txt is only a recommendation: bad bots that are not interested in it simply ignore the user’s wishes and continue to crawl the website. This makes the procedure unsuitable for excluding bad bots, and it is useless against hackers anyway.
Blocking Individual IP Addresses
If numerous requests from individual IP addresses are directed against a website that you want to protect, you could add the IP addresses to the recidive jail of Fail2Ban with a console command, e.g.
# fail2ban-client set recidive banip 123.123.123.123
This means that no further network requests from the attacker IP are passed on to the web server for the duration of the period set in the recidive jail. This method is particularly suitable if you want to stop traffic from certain sources as quickly as possible as a first aid measure. The recidive jail is chosen because it usually has a long ban period. Once IP addresses have been identified as problematic, you want to block them in the long term.
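To see what the recidive jail has banned so far, and to lift a ban if you blocked the wrong address, fail2ban-client can help as well:

# fail2ban-client status recidive
# fail2ban-client set recidive unbanip 123.123.123.123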
You can determine the ten most frequent requester IPs from your web server log file with this console command:
# awk '{ print $1}' access_ssl_log | sort | uniq -c | sort -nr | head -n 10
Before you block any of these, however, please make sure that you do not block your own server IP or any IPs of real search engines, as these could also appear in the result.
Unfortunately, attackers often change IP addresses. This is why you will see attacks with the same scheme again after a short time, but with a different IP address. The protective measure is only effective for a short time. However, it is suitable as an immediate measure to allow you to take a deep breath.
Block User Agents With a Rewrite Rule in the .htaccess File
You know the .htaccess file as a helpful tool for controlling the Apache web server.
You could add a rule there at the very beginning that blocks the request whenever a visitor’s “user agent” contains a certain character string:
RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} (PetalBot|UptimeRobot|seocompany|LieBaoFast|SEOkicks|Uptimebot|Cliqzbot|ssearch_bot|domaincrawler|spot|DigExt|Sogou|MegaIndex.ru|majestic12|80legs|SISTRIX|HTTrack|Semrush|MJ12|Ezooms|CCBot|TalkTalk|Ahrefs|BLEXBot) [NC]
RewriteRule .* - [F]
If the rule applies, no further actions are executed. This means that although the web server must always view and process the request, it does not have to render a page or return extensive data to the attacker via the network. Instead, it only responds with a 403 forbidden. This reduces the load on the CPU. The example only shows a short list of user agents. Below, we show a comprehensive bad bot list as part of the highly effective solution approaches.
The rewrite rule procedure is only effective against bad bots that identify themselves with their bot name in the user agent, and it has the disadvantage that incoming traffic must still be processed by the Apache web server. This continues to create unnecessary CPU load. In addition, the setting must be made for each individual website. Overall, using .htaccess costs computing time and adds disk I/O load, because the web server has to read and evaluate the file for every resource access.
Highly Effective Approaches
Once traffic gets through to your web server, it causes memory, CPU and disk I/O load. That’s why you need a solution in which no attacker traffic is forwarded to services on the server.
This is what Fail2Ban does in combination with “iptables” on your server. You are probably already using Fail2Ban. If not, it’s high time you did. Fail2Ban can scan the log files of your websites, detect attacks and add the IP addresses of attackers to the on-server “iptables” firewall to block further traffic from the attacker IP. To do this, it uses “filters” and “jails”.
A “filter” determines which IP addresses are to be evaluated as malicious. A “jail” determines what should be done with such IP addresses, e.g. how long they should be banned. The jail that is relevant for you is the “plesk-apache-badbot” entry in the file /etc/fail2ban/jail.local with the corresponding filter in the file /etc/fail2ban/filter.d/apache-badbots.conf. It bans well known bad bots by default. It is also mentioned in /etc/fail2ban/jail.d/plesk.conf, but we won’t edit that file to achieve our goal.
Jails are tolerant by default. For example, they wait for several malicious requests before actually banning the attacker. Here we will tighten things up and block bad bots hard as nails. Once the tolerance limit has been removed, the same jail is also suitable for blocking hacker scans.
How To Block Bad Bots and Hackers Quickly and for the Long Term With Fail2Ban
In the default settings, Fail2Ban is tolerant of misbehavior. For example, a user who enters a password incorrectly several times in a login dialog should not be locked out of the server immediately. In the case of bad bots and hacker scans, however, you should mutate into an extremist. You want to catch these attackers at the first attempt and lock them out. To do this, edit the file /etc/fail2ban/jail.local in a text editor.
The file begins with the [DEFAULT] section. There you will find all the settings that are used by Fail2Ban if no individual settings have been made for certain jails, e.g.
[DEFAULT]
maxretry = 6
This setting means that a user has six failed attempts before a jail blocks them. This is a reasonable tolerance limit for users who mistype their password. But not for hackers or bad bots. You can set the “maxretry” parameter (as well as almost all other Fail2Ban jail parameters) individually for each jail. Anyone who behaves like a bad bot or hacker has no business on your server after the first attempt. Therefore, do not change “maxretry” in the default settings, but set it to “1” for the “plesk-apache-badbot” jail. “1” is the default on new Plesk installations anyway, but my test installations are upgrades that started out from Plesk 12, then 17 and 17.5, where this value may still have been greater. You may be in the same situation and should check the value.
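Before editing anything, you can ask Fail2Ban which value is currently in effect for the jail:

# fail2ban-client get plesk-apache-badbot maxretry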
In the /etc/fail2ban/jail.local file, scroll down to the [plesk-apache-badbot] section and change the “maxretry” entry there. If there is none yet, simply add it:
[plesk-apache-badbot]
maxretry = 1
You should also be very strict with the blocking time (“bantime”). Popular bantime entries for [DEFAULT] are 300 or 600 seconds, i.e. 5 minutes or 10 minutes. This makes sense for users who accidentally mistype their password. However, you can lock out bad bots and hackers immediately and for a very long time without hesitation, e.g. a whole month. This is what it looks like now:
[plesk-apache-badbot]
maxretry = 1
bantime = 30d
You may say that 30 days is too long, because IP addresses change dynamically and could be reassigned to legitimate users who then cannot access your website. In my experience, however, attacks are not driven from dial-up lines but from infected websites. Such websites would normally never contact your server at all, so I deem it safe to ban requests from malicious hosts for a long time. In the end, though, it is up to you how long you want to ban.
You do not need to set the otherwise frequently used “findtime” parameter, because the first attempt at unwanted access is already recognized and blocked by “maxretry = 1”. It is therefore irrelevant how many further access attempts are made within a certain time.
Save the changes. In order for the changes in /etc/fail2ban/jail.local to take effect, you must reload Fail2Ban once:
# systemctl reload fail2ban
How To Combat Bad Bots
The apache-badbots filter is installed from the original Fail2Ban resources, but it has not been updated by the Fail2Ban maintainers since 2013. A lot has changed since then. Chances are that your apache-badbots filter is dated, so it’s time to update it and make it more effective.
We suggest updating the list of bad bots with the following. It is built from user agents collected from the User Agents website, converted into a Fail2Ban filter file with Yaroslav Halchenko’s GitHub resource, and enriched with some extras taken from everyday web hosting experience. Replace the content of the /etc/fail2ban/filter.d/apache-badbots.conf file with
[Definition]
badbotscustom = thesis-research-bot
badbots = GPTBot|AmazonBot|Bytespider|Bytedance|fidget-spinner-bot|EmailCollector|WebEMailExtrac|ClaudeBot| TrackBack/1\.02|sogou music spider|seocompany|LieBaoFast|SEOkicks|Uptimebot|Cliqzbot|ssearch_bot|domaincrawler|AhrefsBot|spot|DigExt|Sogou|MegaIndex\.ru|majestic12|80legs|SISTRIX|HTTrack|Semrush|MJ12|Ezooms|CCBot|TalkTalk|Ahrefs|BLEXBot|Atomic_Email_Hunter/4\.0|atSpider/1\.0|autoemailspider|bwh3_user_agent|China Local Browse 2\.6|ContactBot/0\.2|ContentSmartz|DataCha0s/2\.0|DBrowse 1\.4b|DBrowse 1\.4d|Demo Bot DOT 16b|Demo Bot Z 16b|DSurf15a 01|DSurf15a 71|DSurf15a 81|DSurf15a VA|EBrowse 1\.4b|Educate Search VxB|EmailSiphon|EmailSpider|EmailWolf 1\.00|ESurf15a 15|ExtractorPro|Franklin Locator 1\.8|FSurf15a 01|Full Web Bot 0416B|Full Web Bot 0516B|Full Web Bot 2816B|Guestbook Auto Submitter|Industry Program 1\.0\.x|ISC Systems iRc Search 2\.1|IUPUI Research Bot v 1\.9a|LARBIN-EXPERIMENTAL \(efp@gmx\.net\)|LetsCrawl\.com/1\.0 \+http\://letscrawl\.com/|Lincoln State Web Browser|LMQueueBot/0\.2|LWP\:\:Simple/5\.803|Mac Finder 1\.0\.xx|MFC Foundation Class Library 4\.0|Microsoft URL Control - 6\.00\.8xxx|Missauga Locate 1\.0\.0|Missigua Locator 1\.9|Missouri College Browse|Mizzu Labs 2\.2|Mo College 1\.9|MVAClient|Mozilla/2\.0 \(compatible; NEWT ActiveX; Win32\)|Mozilla/3\.0 \(compatible; Indy Library\)|Mozilla/3\.0 \(compatible; scan4mail \(advanced version\) http\://www\.peterspages\.net/?scan4mail\)|Mozilla/4\.0 \(compatible; Advanced Email Extractor v2\.xx\)|Mozilla/4\.0 \(compatible; Iplexx Spider/1\.0 http\://www\.iplexx\.at\)|Mozilla/4\.0 \(compatible; MSIE 5\.0; Windows NT; DigExt; DTS Agent|Mozilla/4\.0 efp@gmx\.net|Mozilla/5\.0 \(Version\: xxxx Type\:xx\)|NameOfAgent \(CMS Spider\)|NASA Search 1\.0|Nsauditor/1\.x|PBrowse 1\.4b|PEval 1\.4b|Poirot|Port Huron Labs|Production Bot 0116B|Production Bot 2016B|Production Bot DOT 3016B|Program Shareware 1\.0\.2|PSurf15a 11|PSurf15a 51|PSurf15a VA|psycheclone|RSurf15a 41|RSurf15a 51|RSurf15a 81|searchbot admin@google\.com|ShablastBot 1\.0|snap\.com beta crawler v0|Snapbot/1\.0|Snapbot/1\.0 \(Snap Shots, \+http\://www\.snap\.com\)|sogou develop spider|Sogou Orion spider/3\.0\(\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07\)|sogou spider|Sogou web spider/3\.0\(\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07\)|sohu agent|SSurf15a 11 |TSurf15a 11|Under the Rainbow 2\.2|User-Agent\: Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.1\)|VadixBot|WebVulnCrawl\.unknown/1\.0 libwww-perl/5\.803|Wells Search II|WEP Search 00
failregex = ^<HOST> -[^"]*"(?:GET|POST|HEAD) \/.* HTTP\/\d(?:\.\d+)" \d+ \d+ "[^"]*" "[^"]*(?:%(badbots)s|%(badbotscustom)s)[^"]*"$
ignoreregex =
The “badbotscustom” variable is there to improve clarity. You can add further user agent strings to it if required.
The “failregex” line inserts the variables into a regular expression that generates the filter. Fail2Ban will search for all log entries that contain one of the badbots or badbotscustom strings as a user agent.
Test against a real log file whether the new rules work:
# fail2ban-regex /var/www/vhosts/<your domain>/logs/access_ssl_log /etc/fail2ban/filter.d/apache-badbots.conf
In the output you’ll see a line mentioning the number of matches:
Lines: 54 lines, 0 ignored, 1 matched, 53 missed
If your log has user agent strings that match your bad bot list, you’ll see at least one match. Is everything working as expected? Let Fail2Ban load the new version of the filter:
# fail2ban-client reload plesk-apache-badbot
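After the jail has been running for a while, you can check whether it is catching anything. The output below only illustrates the format; the path and IP addresses are examples:

# fail2ban-client status plesk-apache-badbot
Status for the jail: plesk-apache-badbot
|- Filter
|  |- Currently failed: 0
|  |- Total failed:     12
|  `- File list:        /var/www/vhosts/example.com/logs/access_ssl_log
`- Actions
   |- Currently banned: 3
   |- Total banned:     3
   `- Banned IP list:   123.123.123.123 234.234.234.234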
How to Combat Hacker Scans – Vol I: Beginner Level
The plesk-apache-badbot jail is not only suitable for detecting bad bots. With an extension, you can also use it to detect hacker scans.
Hackers test websites not just for the presence of a few files, but for many. Although they may come across files that are actually present and thus receive the response code 200 from the web server, the attackers will also request files that are missing. The web server responds to such requests with a 301-permanent-redirect or 404-not-found. We take advantage of this to recognize the scan. More on this below.
Hackers also like to scan for common file names of authentication systems that we do not use in our websites. For example, if a website does not run in an AWS instance and does not use any special AWS software, we can safely treat all calls to such files – calls aimed at obtaining sensitive information from a website – as an attack and ban the attacker. We become the trapper by editing the file /etc/fail2ban/filter.d/apache-badbots.conf again. There we add further filter lines after the existing “failregex = …” line. A professional system administrator would prefer to create a separate filter file and a new jail for these extra lines: 1) modifying the existing file can prevent the packaging system (YUM/DNF or APT) from updating it, and 2) if the packaging system does update it, the file may be overwritten and your modifications lost. For this proof of concept of banning specific hacker scans, and to keep this blog post from getting even longer, editing the existing filter will do. If you prefer creating a separate filter and jail, a minimal sketch follows below, and search engines will turn up dozens of manuals that explain the details.
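If you do go the separate-filter route, the structure could look roughly like the following sketch. The filter file name, the log path glob and the ban settings are assumptions that you would adapt to your system:

# /etc/fail2ban/filter.d/hacker-scans.conf (hypothetical file name)
[Definition]
# add your attack patterns here, e.g. the aws/oauth lines shown below
failregex = ^<HOST> .*GET .*aws(/|_|-)(credentials|secrets|keys).*
ignoreregex =

# added to /etc/fail2ban/jail.local
[hacker-scans]
enabled  = true
filter   = hacker-scans
port     = http,https
# adjust the log path to your vhost log layout
logpath  = /var/www/vhosts/*/logs/*access*log
maxretry = 1
bantime  = 30d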
Since attackers don’t know what kind of websites they are dealing with, they have to try a few things. If you know your own website well, you can formulate rules that include paths that are typical for security information but are not found in your website. A regular visitor will never retrieve such paths, but an attacker requesting these paths will be recognized immediately.
In the following example, we assume that there are no “aws” folders or files in your websites, nor is oauth used. This is the case for most websites. We add the following filter lines:
^<HOST> .*GET .*aws(/|_|-)(credentials|secrets|keys).*
^<HOST> .*GET .*credentials/aws.*
^<HOST> .*GET .*secrets/(aws|keys).*
^<HOST> .*GET .*oauth/config.*
^<HOST> .*GET .*config/oauth.*
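When you add these patterns to the existing apache-badbots.conf, note that they belong to the same “failregex =” key and must be indented as continuation lines. The section then looks like this (the first line is the existing badbots pattern):

failregex = ^<HOST> -[^"]*"(?:GET|POST|HEAD) \/.* HTTP\/\d(?:\.\d+)" \d+ \d+ "[^"]*" "[^"]*(?:%(badbots)s|%(badbotscustom)s)[^"]*"$
            ^<HOST> .*GET .*aws(/|_|-)(credentials|secrets|keys).*
            ^<HOST> .*GET .*credentials/aws.*
            ^<HOST> .*GET .*secrets/(aws|keys).*
            ^<HOST> .*GET .*oauth/config.*
            ^<HOST> .*GET .*config/oauth.*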
Remember to check the regular expressions of the filter after changing the /etc/fail2ban/filter.d/apache-badbots.conf file:
# fail2ban-regex /var/www/vhosts/<your domain>/logs/access_ssl_log /etc/fail2ban/filter.d/apache-badbots.conf
and then restart the jail so that the new filters take effect:
# fail2ban-client reload plesk-apache-badbot
The rules ensure that all GET queries to files and paths with the names oauth/config, config/oauth and aws-credentials, aws/credentials, aws_secrets etc. are considered an attack. If you actually use these paths in your websites, you must of course not use the rules. In the vast majority of cases, however, they can be used without hesitation.
You can build similar rules for other widely used software that you do not use yourself. For example, if you know that you do not use “Prestashop” on your servers, you could use a filter to recognize common folder names for Prestashop such as “travis-scripts”, “tests-legacy”, which are uncommon for other applications. Someone who tests your websites and tries to access these paths will be recognized as an attacker:
^<HOST> .*GET .*(travis-scripts|tests-legacy)/.*
These are just examples of how you can better detect and automatically stop attacks in day-to-day server operation. Use these examples as inspiration to write your own filters to suit your environment.
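A quick way to check a new pattern before relying on it is to feed fail2ban-regex a single, hand-crafted log line instead of a whole log file. The log line below is made up purely for the test:

# fail2ban-regex '123.123.123.123 - - [22/Oct/2023:19:09:06 +0200] "GET /wp-content/plugins/secrets/aws.yml HTTP/1.1" 301 162 "-" "Mozilla/5.0"' /etc/fail2ban/filter.d/apache-badbots.conf

If the pattern works, the summary at the end reports one matched line.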
How To Combat Hacker Scans – Vol II: Sleight Of Hand Level
Finally, we have a special trick for you that is very effective in detecting hacker scans:
^<HOST> .*"GET .*(freshio|woocommerce).*frontend.*" (301|404).*
^<HOST> .*"GET .*contact-form-7/includes.*" (301|404).*
^<HOST> .*"(GET|POST) .*author=.*" 404.*
In the first two lines of this example we take advantage of the fact that popular software such as “Woocommerce” or “Contact Form 7” is included in hacker scans. However, since an attacker does not know in advance which software a site runs, he does not only search for exactly this software, but also for other applications, e.g. “Freshio”. Regular website visitors will never call up a plugin path that does not exist on a website. But attackers are forced to try a lot and are very likely to come across a file that is not there.
This is exactly the case we are interested in, because it recognizes that the requester is not a regular visitor, but an attacker. This is why the first two filters mentioned as examples check the response code of the web server and filter the lines in which code 301 or 404 is returned. As a result, Fail2Ban knows that it is an attack and can ban the attacker’s IP address.
If your websites actually use Woocommerce, Freshio or Contact Form 7, nothing happens. But the attacker who provokes a 301 or 404 response is recognized by the filter.
In the third line of this example we are banning all users who are performing author scans on websites. If an author ID is missing, the website returns a 404-not-found error. So if someone retrieves a missing number, Fail2Ban recognizes the 404 error code and thus the attacker. Some of these scans are also carried out against websites that are not running WordPress. All the better for us, because these also result in a 404 error and thus provide us with the knowledge that it is a hacker attack.
In live tests with a five-digit number of websites, these methods were able to detect almost all hacker scans. They contribute greatly to reducing the CPU load of a server when many websites are running on the same host. This is because once an attacker IP has been blocked, further attacks against other websites on the same host are also prevented.
Conclusion
When a server receives many requests, it can get bogged down with high CPU usage, slowing down website responses. Thankfully, the Plesk Cgroups component steps in to regulate this load, offering a very effective solution.
With the Fail2Ban apache-badbots jail improvements demonstrated here, you can automatically fend off unwanted requests from “bad bots” and hackers. This is more effective than trying to block bad bots and hackers manually.
As a result, your host computer is less busy processing pointless requests, reducing the overall CPU load. This means faster responses for regular visitors since the processor has more room to handle computing processes right away.