Block IP addresses that overload the server

mikussikus

Hi,
I'm looking for a way to block abusive IP addresses from accessing my server.

I host ~300 websites on this server, and every day the logs show large numbers of requests from IPs in Microsoft data centers, sometimes 400 connections at a time. Many of them are bots. I check these IPs on abuseipdb.com and block the ones that are listed, but doing this by hand is not practical.

I'm not sure which solution would work best for my server and my customers. What do you think of the options below? Please reply with how you deal with many connections from single IPs.

1. Add the AbuseIPDB API to CSF.
2. Limit connections per IP in CSF (see the csf.conf sketch below).
3. Add a Crawl-delay with some value to robots.txt on every site.
4. Block specific bots in httpd.conf.
5. Add custom rules to the ModSecurity WAF.
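For option 2, connection limiting in CSF is done with the connection-tracking settings in /etc/csf/csf.conf. A minimal sketch with example values (the setting names are CSF's own, the numbers are only illustrative):
Code:
# /etc/csf/csf.conf -- connection tracking
# Block an IP once it holds more than this many concurrent connections:
CT_LIMIT = "300"
# How often (in seconds) the connection count is checked:
CT_INTERVAL = "30"
# Length of the temporary block, in seconds:
CT_BLOCK_TIME = "1800"
# Only count connections to the web ports:
CT_PORTS = "80,443"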
 
CSF has blocklists that you can enable, one of which is AbuseIPDB. Check the readme file for CSF; it's easy to set up.
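For reference, the blocklists live in /etc/csf/csf.blocklists, one list per line in NAME|TTL|MAX|URL form. A sketch of what the AbuseIPDB entry looks like (take the exact URL from the CSF readme and the AbuseIPDB API docs, and use your own API key), then reload with csf -r:
Code:
# /etc/csf/csf.blocklists -- format: NAME|TTL|MAX|URL
ABUSEIPDB|86400|10000|https://api.abuseipdb.com/api/v2/blacklist?key=YOUR_API_KEY&plaintext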
 
We use option 3, and an alternative to option 4: we keep a bad-bot list in the .htaccess of sites that are often hit by certain bots. But of course there are more ways to do it.
We had option 1 active before, but that list is always around 10k IP addresses and didn't seem to really do anything, so we disabled it again.

The .htaccess works great though.
 
This is the .htaccess file; you can of course change or extend it as desired.
Code:
BrowserMatchNoCase "libwww-perl" bad_bot
BrowserMatchNoCase "wget" bad_bot
BrowserMatchNoCase "meta-externalagent" bad_bot
BrowserMatchNoCase "LieBaoFast" bad_bot
BrowserMatchNoCase "Mb2345Browser" bad_bot
BrowserMatchNoCase "zh-CN" bad_bot
BrowserMatchNoCase "MicroMessenger" bad_bot
BrowserMatchNoCase "zh_CN" bad_bot
BrowserMatchNoCase "Kinza" bad_bot
BrowserMatchNoCase "Bytespider" bad_bot
BrowserMatchNoCase "Baiduspider" bad_bot
BrowserMatchNoCase "Sogou" bad_bot
BrowserMatchNoCase "Datanyze" bad_bot
BrowserMatchNoCase "AspiegelBot" bad_bot
BrowserMatchNoCase "adscanner" bad_bot
BrowserMatchNoCase "serpstatbot" bad_bot
BrowserMatchNoCase "spaziodat" bad_bot
BrowserMatchNoCase "undefined" bad_bot
BrowserMatchNoCase "claudebot" bad_bot
BrowserMatchNoCase "facebook" bad_bot
BrowserMatchNoCase "Petalbot" bad_bot
BrowserMatchNoCase "YandexBot" bad_bot
BrowserMatchNoCase "Applebot" bad_bot
BrowserMatchNoCase "aiohttp" bad_bot
BrowserMatchNoCase "facebookexternalhit/1.1" bad_bot
BrowserMatchNoCase "facebookcatalog/1.0" bad_bot
BrowserMatchNoCase "aiohttp" bad_bot
Order deny,allow
Deny from env=bad_bot
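(Side note: Order deny,allow / Deny from env=bad_bot is the old Apache 2.2 access-control syntax; on Apache 2.4 it only works while mod_access_compat is loaded. The 2.4-native equivalent, the same idea used in an example further down this thread, would be:)
Code:
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>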

And this is a robots.txt example from one of the sites:
Code:
User-agent: BoardReader
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: facebookexternalhit
Disallow: /

User-agent: AhrefsBot
Disallow: /

User-agent: Amazonbot 
Disallow: / 

User-agent: anthropic-ai
Disallow: /

User-agent: BLEXBot
Disallow: /

User-agent: dotbot
Disallow: /

User-agent: MJ12bot
Disallow: / 

User-agent: GPTBot
Disallow: /

User-agent: nbot
Disallow: /

User-agent: SemrushBot
Disallow: / 

User-agent: SemrushBot-SA
Disallow: / 

User-Agent: trendictionbot
Disallow: /

And in robots.txt one can add this too:
Code:
User-agent: YandexImages
User-agent: msnbot-media
User-agent: msnbot-MM
User-agent: PetalBot
User-agent: dotbot
User-agent: AspiegelBot
User-agent: AhrefsBot
User-agent: DotBot
User-agent: MauiBot
User-agent: MJ12bot
# SemrushBot list
User-agent: SemrushBot
User-agent: SiteAuditBot
User-agent: SemrushBot-BA
User-agent: SemrushBot-SI
User-agent: SemrushBot-SWA
User-agent: SemrushBot-CT
User-agent: SemrushBot-BM
User-agent: SplitSignalBot
User-agent: SemrushBot-COUB
Disallow: /
 
I prefer to block bots on the nginx side: add this to the custom nginx config, or put it in a separate conf file and include it via a custom nginx template:
Code:
if ($http_user_agent ~* (360Spider|80legs.com|GPTBot|YandexBot|Abonti|AcoonBot|Acunetix|adbeat_bot|AddThis.com|adidxbot|ADmantX|AhrefsBot|AngloINFO|Antelope|Applebot|BaiduSpider|BeetleBot|billigerbot|binlar|bitlybot|BlackWidow|BLP_bbot|BoardReader|Bolt\ 0|BOT\ for\ JCE|Bot\ mailto\:craftbot@yahoo\.com|casper|CazoodleBot|CCBot|checkprivacy|ChinaClaw|chromeframe|Clerkbot|Cliqzbot|clshttp|CommonCrawler|comodo|CPython|crawler4j|Crawlera|CRAZYWEBCRAWLER|Curious|Custo|CWS_proxy|Default\ Browser\ 0|diavol|DigExt|Digincore|DIIbot|discobot|DISCo|DoCoMo|DotBot|Download\ Demon|DTS.Agent|EasouSpider|eCatch|ecxi|EirGrabber|Elmer|EmailCollector|EmailSiphon|EmailWolf|Exabot|ExaleadCloudView|ExpertSearchSpider|ExpertSearch|Express\ WebPictures|ExtractorPro|extract|EyeNetIE|Ezooms|F2S|FastSeek|feedfinder|FeedlyBot|FHscan|finbot|Flamingo_SearchEngine|FlappyBot|FlashGet|flicky|Flipboard|g00g1e|Genieo|genieo|GetRight|GetWeb\!|GigablastOpenSource|GozaikBot|Go\!Zilla|Go\-Ahead\-Got\-It|GrabNet|grab|Grafula|GrapeshotCrawler|GTB5|GT\:\:WWW|Guzzle|harvest|heritrix|HMView|HomePageBot|HTTP\:\:Lite|HTTrack|HubSpot|ia_archiver|icarus6|IDBot|id\-search|IlseBot|Image\ Stripper|Image\ Sucker|Indigonet|Indy\ Library|integromedb|InterGET|InternetSeer\.com|Internet\ Ninja|IRLbot|ISC\ Systems\ iRc\ Search\ 2\.1|jakarta|Java|JetCar|JobdiggerSpider|JOC\ Web\ Spider|Jooblebot|kanagawa|KINGSpider|kmccrew|larbin|LeechFTP|libwww|Lingewoud|LinkChecker|linkdexbot|LinksCrawler|LinksManager\.com_bot|linkwalker|LinqiaRSSBot|LivelapBot|ltx71|LubbersBot|lwp\-trivial|Mail.RU_Bot|masscan|Mass\ Downloader|maverick|Maxthon$|Mediatoolkitbot|MegaIndex|MegaIndex|megaindex|MFC_Tear_Sample|Microsoft\ URL\ Control|microsoft\.url|MIDown\ tool|miner|Missigua\ Locator|Mister\ PiX|mj12bot|Mozilla.*Indy|Mozilla.*NEWT|MSFrontPage|msnbot|Navroad|NearSite|NetAnts|netEstate|NetSpider|NetZIP|Net\ Vampire|NextGenSearchBot|nutch|Octopus|Offline\ Explorer|Offline\ Navigator|OpenindexSpider|OpenWebSpider|OrangeBot|Owlin|PageGrabber|PagesInventory|panopta|panscient\.com|Papa\ Foto|pavuk|pcBrowser|PECL\:\:HTTP|PeoplePal|Photon|PHPCrawl|planetwork|PleaseCrawl|PNAMAIN.EXE|PodcastPartyBot|prijsbest|proximic|psbot|purebot|QuerySeekerSpider|R6_CommentReader|R6_FeedFetcher|RealDownload|ReGet|Riddler|Rippers\ 0|rogerbot|RSSingBot|rv\:1.9.1|RyzeCrawler|SafeSearch|SBIder|Scrapy|Scrapy|Screaming|SeaMonkey$|search.goo.ne.jp|SearchmetricsBot|search_robot|SemrushBot|Semrush|SentiBot|SEOkicks|SeznamBot|ShowyouBot|SightupBot|SISTRIX|sitecheck\.internetseer\.com|siteexplorer.info|SiteSnagger|skygrid|Slackbot|Slurp|SmartDownload|Snoopy|Sogou|Sosospider|spaumbot|Steeler|sucker|SuperBot|Superfeedr|SuperHTTP|SurdotlyBot|Surfbot|tAkeOut|Teleport\ Pro|TinEye-bot|TinEye|Toata\ dragostea\ mea\ pentru\ diavola|Toplistbot|trendictionbot|TurnitinBot|turnit|Twitterbot|URI\:\:Fetch|urllib|Vagabondo|Vagabondo|vikspider|VoidEYE|VoilaBot|WBSearchBot|webalta|WebAuto|WebBandit|WebCollage|WebCopier|WebFetch|WebGo\ IS|WebLeacher|WebReaper|WebSauger|Website\ eXtractor|Website\ Quester|WebStripper|WebWhacker|WebZIP|Web\ Image\ Collector|Web\ Sucker|Wells\ Search\ II|WEP\ Search|WeSEE|Wget|Widow|WinInet|woobot|woopingbot|worldwebheritage.org|Wotbox|WPScan|WWWOFFLE|WWW\-Mechanize|Xaldon\ WebSpider|XoviBot|yacybot|Yahoo|YisouSpider|zermelo|Zeus|zh-CN|ZmEu|ZumBot|ZyBorg) ) {
    return 503;
}
 
Does DirectAdmin have a native feature to create a custom .htaccess that gets copied into public_html when a new user is created?

I think it's a good idea to block all bots by default; if a user wants to allow some bots, they can still edit their own .htaccess.
 
Yes. You can add any files you want to users' public_html on account creation.
For example, you can add a .htaccess in your /home/admin/domains/default directory,
or, as a reseller, in the /home/reseller/domains/default directory.

If you check, you will see that an index.html and a logo.jpg are already present there, for example. You can create your own landing page and logo for domain creation this way too.
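In practice that means dropping your prepared bad-bot .htaccess into that skeleton directory, for example (the source filename is just a placeholder):
Code:
cp /root/bad-bots.htaccess /home/admin/domains/default/.htaccess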
 
If you are using ModSecurity (Comodo rules), you can add those agents to the agent field list. That way the customer can easily toggle rules on/off or bypass some of them.

Since .htaccess is used by many frameworks, they might remove everything, including your .htaccess, before setting up their site.
 
Yep, I know, but no, I don't use ModSecurity. I figured that if it's possible in nginx with a separate conf, there might be some way to do it in Apache too.
 
#./httpd/conf/extra/httpd-include.conf
Code:
<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (EvilBotHere|SpamSpewer|SecretAgentAgent) [NC]
    RewriteRule (.*) - [F,L]
</IfModule>

Try this; it's a global block.

I have rewrite rules that prevent access to phpMyAdmin and webmail through proxy services like Cloudflare and other CDNs, so these rules should work too, I guess...
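(For context, a rough sketch of what such a proxy block can look like; the Cloudflare header check and the paths here are assumptions for illustration, not the actual rules in use:)
Code:
<IfModule mod_rewrite.c>
    RewriteEngine On
    # deny /phpmyadmin and /webmail when the request came in through Cloudflare's proxy
    # (Cloudflare adds CF-Connecting-IP; direct visitors don't send it)
    RewriteCond %{HTTP:CF-Connecting-IP} !^$
    RewriteRule ^/?(phpmyadmin|webmail) - [F,L]
</IfModule>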
 
RewriteRule (.*) - [F,L]
Just a question about this one. I'm not that good with rewrites, but I found this post.
Is it correct as you have it, or should I also change it for this reason to:
Code:
RewriteCond %{REQUEST_URI} !^/my-403-page\.html$
or use some other way to prevent the 403 page itself from being blocked? Or is that not needed?
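(For illustration, if you did add it, it would just slot in as an extra condition in front of the rule, something like this, using the hypothetical page name from above:)
Code:
<IfModule mod_rewrite.c>
    RewriteEngine On
    # don't let the bad-bot rule catch the custom 403 page itself
    RewriteCond %{REQUEST_URI} !^/my-403-page\.html$
    RewriteCond %{HTTP_USER_AGENT} (EvilBotHere|SpamSpewer|SecretAgentAgent) [NC]
    RewriteRule (.*) - [F,L]
</IfModule>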
 
It's up to you.


### Note
From @Zhenyapan's reply: he meant it's better to put the filter list in front of the web application (:80, :443), not behind the proxy.

Example: if you use nginx_apache, nginx works as the front end and Apache stays behind it, so filtering in nginx reduces resource usage more than filtering in Apache.
 
Found another one for you to block. Today I got high load on a server, caused by GPTBot.
If you want (your own choice, of course), you can block it in robots.txt:
User-agent: GPTBot
Disallow: /

Or block the agent via one of the other methods.
This was its identification in the logs:
GPTBot/1.2; +https://openai.com/gptbot)
 
I came here with this exact question. In my case it's not bots that announce themselves, but ones that actively hunt for .env files, vulnerabilities, older backups, ... and they hit almost all sites on the server at the same time. Ideally you would have something that watches the logs and, if it sees multiple 404s from the same IP across multiple sites, blocks that IP.

As the OP said, it's mostly coming from free Azure IPs. I've already blocked quite a few subnets, but new ones pop up every day.
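A crude sketch of that idea, assuming a DirectAdmin-style log layout (the log path and the threshold of 50 are just examples); the IPs it prints could then be blocked with csf -d:
Code:
# list IPs that produced more than 50 status-404 responses across all domain logs
awk '$9 == 404 { hits[$1]++ } END { for (ip in hits) if (hits[ip] > 50) print ip }' \
    /var/log/httpd/domains/*.log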
 
Zhenyapan is right: it is better to put all the bots in a conf file and block them globally.

Think of it like this: a .htaccess is like a plugin that Apache has to hook in on every request.
If you use a .htaccess per site, it gets read every time, and that can have a negative effect on site speed
if you have a long list of bots or URIs in the .htaccess file.

You can also add the bot names or URIs in a custom Apache file, like Zhenyapan did in nginx, to block spiders or URIs.

For Apache, test this:
open your /etc/httpd/conf/extra/httpd-includes.conf file

and add the following lines. I put in some example bot names; you can add more.

Then restart Apache. This will give the bots named here a 403 error.


<IfModule mod_setenvif.c>
    <Location />
        SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
        SetEnvIfNoCase User-Agent "MJ12bot" bad_bot
        SetEnvIfNoCase User-Agent "AhrefsBot" bad_bot
        SetEnvIfNoCase User-Agent "SemrushBot" bad_bot
        SetEnvIfNoCase User-Agent "BLEXBot" bad_bot
        SetEnvIfNoCase User-Agent "Sogou Spider" bad_bot
        SetEnvIfNoCase User-Agent "SeznamBot" bad_bot
        <RequireAll>
            Require all granted
            Require not env bad_bot
        </RequireAll>
    </Location>
</IfModule>


About iworx's ".env" attack issue, or attacks like the one on the WordPress xmlrpc.php file:

Just add lines matching on Request_URI; these will also return a 403 error.

<IfModule mod_setenvif.c>
    <Location />
        SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
        SetEnvIfNoCase User-Agent "MJ12bot" bad_bot
        SetEnvIfNoCase User-Agent "AhrefsBot" bad_bot
        SetEnvIfNoCase User-Agent "SemrushBot" bad_bot
        SetEnvIfNoCase User-Agent "BLEXBot" bad_bot
        SetEnvIfNoCase User-Agent "Sogou Spider" bad_bot
        SetEnvIfNoCase User-Agent "SeznamBot" bad_bot
        SetEnvIfNoCase Request_URI "\.env" bad_bot
        SetEnvIfNoCase Request_URI "xmlrpc\.php" bad_bot
        <RequireAll>
            Require all granted
            Require not env bad_bot
        </RequireAll>
    </Location>
</IfModule>

You can read more about SetEnvIf in the Apache documentation.
 
Just curious. What is the difference between this one and @Ohm J's example that @Hostmavi posted?
Or is there no difference in load or resources, and both just block?

This last one is nice because you don't have everything on one line, so you get a better overview of what's in there.
 