Introduction: The Bot Problem
Malicious bots account for roughly 30% of all internet traffic, and the number is climbing. They scrape content, stuff credentials, launch DDoS attacks, spam forms, and impersonate legitimate crawlers to slip past basic defenses. Unlike the AI crawlers covered in our AI crawlers guide, malicious bots operate with deliberate intent to harm, steal, or exploit your infrastructure.
Your server logs are the single most reliable source of truth for understanding bot activity. Every request carries a fingerprint: the IP address, User-Agent string, request path, timing, status code, and referrer. Analyzed together, these fields expose patterns that no bot can fully disguise.
🔑 Key Insight: Most bot detection services operate at the edge and miss bots that rotate IPs or mimic browser headers. Server-side log analysis catches what perimeter tools miss because it reveals behavioral patterns across time, not just single-request signatures.
In this guide, we'll walk through concrete techniques for identifying and blocking the most common malicious bot categories using nothing more than your server logs, standard Linux tools, and a few carefully crafted rules. If you want to automate this analysis, tools like LogBeast can parse millions of log lines and surface suspicious patterns in seconds.
Types of Malicious Bots
Before you can block bots, you need to classify them. Different bot categories exhibit different behavioral signatures and require different mitigation strategies.
| Bot Type | Goal | Key Log Signature | Risk Level |
|---|---|---|---|
| Scraper Bots | Steal content, prices, listings | Sequential URL patterns, high request rate, no JS/CSS loads | 🟡 Medium |
| Credential Stuffers | Test stolen username/password pairs | POST floods to /login, /api/auth; many 401/403 responses | 🔴 Critical |
| DDoS Bots | Overwhelm server resources | Thousands of requests/sec from distributed IPs, identical paths | 🔴 Critical |
| SEO Spam Bots | Inject backlinks, spam comments | POST to /comments, /contact, /wp-admin; referrer spam domains | 🟡 Medium |
| Fake Googlebots | Bypass allowlists, scrape content | Googlebot UA but non-Google IP; fails reverse DNS check | 🟠 High |
| Vulnerability Scanners | Find exploitable weaknesses | Requests to /wp-admin, /.env, /phpmyadmin, /actuator | 🔴 Critical |
| Account Creation Bots | Mass-create fake accounts | POST floods to /register, /signup; similar form data patterns | 🟠 High |
| Click Fraud Bots | Drain advertising budgets | Repeated ad URL clicks from same IP ranges, short sessions | 🟡 Medium |
⚠️ Warning: Many malicious bots rotate User-Agent strings and IPs to evade simple signature-based blocking. Always combine multiple detection signals rather than relying on a single field.
Detecting Fake Googlebots
Fake Googlebots are one of the most common and dangerous forms of bot impersonation. Attackers spoof the Googlebot User-Agent string because many websites whitelist Googlebot to ensure their content gets indexed. This creates a bypass for rate limits, paywalls, and access controls.
Step 1: Find All Googlebot Requests
Start by extracting every request that claims to be Googlebot:
# Extract IPs claiming to be Googlebot
grep "Googlebot" /var/log/nginx/access.log | awk '{print $1}' | sort -u > googlebot_ips.txt
# Count requests per IP
grep "Googlebot" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -20
Step 2: Reverse DNS Verification
All legitimate Googlebots resolve to *.googlebot.com or *.google.com via reverse DNS, and a forward lookup on that hostname returns the original IP. This two-way DNS check is the definitive verification method:
#!/bin/bash
# verify_googlebot.sh - Verify Googlebot IPs via reverse DNS
while read ip; do
hostname=$(host "$ip" 2>/dev/null | awk '/domain name pointer/ {print $NF}')
if [[ "$hostname" == *".googlebot.com." ]] || [[ "$hostname" == *".google.com." ]]; then
# Forward DNS confirmation
forward_ip=$(host "$hostname" 2>/dev/null | awk '/has address/ {print $NF}')
if [[ "$forward_ip" == "$ip" ]]; then
echo "LEGITIMATE: $ip -> $hostname"
else
echo "FAKE (forward mismatch): $ip -> $hostname -> $forward_ip"
fi
else
echo "FAKE: $ip -> ${hostname:-NO_PTR_RECORD}"
fi
done < googlebot_ips.txt
💡 Pro Tip: LogBeast automatically performs reverse DNS verification on all bot IPs and flags impersonators in its bot analysis report. This saves hours of manual verification on busy sites.
Step 3: IP Range Validation
Google publishes its crawler IP ranges in JSON format. You can cross-reference suspicious IPs against these ranges programmatically:
#!/usr/bin/env python3
"""Check if an IP belongs to Google's published crawler ranges."""
import ipaddress
import json
import urllib.request
def load_google_ranges():
url = "https://developers.google.com/search/apis/ipranges/googlebot.json"
with urllib.request.urlopen(url) as resp:
data = json.loads(resp.read())
networks = []
for prefix in data["prefixes"]:
if "ipv4Prefix" in prefix:
networks.append(ipaddress.ip_network(prefix["ipv4Prefix"]))
elif "ipv6Prefix" in prefix:
networks.append(ipaddress.ip_network(prefix["ipv6Prefix"]))
return networks
def is_real_googlebot(ip_str, networks):
ip = ipaddress.ip_address(ip_str)
return any(ip in net for net in networks)
if __name__ == "__main__":
import sys
networks = load_google_ranges()
for line in open(sys.argv[1]):
ip = line.strip()
status = "REAL" if is_real_googlebot(ip, networks) else "FAKE"
print(f"{status}: {ip}")
What to Do with Fake Googlebots
- Block immediately: Fake Googlebots have zero legitimate purpose
- Log the behavior: Note which pages they target to understand the attacker's goal
- Add to blocklist: Feed verified-fake IPs into your firewall or fail2ban rules
- Monitor for patterns: Fake Googlebot IPs often come from the same ASN or hosting provider
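The blocklist step is easy to script. A minimal sketch (file and variable names here are illustrative) that turns a list of verified-fake IPs into nginx `deny` directives, skipping anything that is not a valid address so a malformed line can't break the config:

```python
#!/usr/bin/env python3
"""Convert a list of verified-fake IPs into nginx deny rules."""
import ipaddress

def build_deny_rules(ips):
    """Validate each entry and emit an nginx `deny` line; skip malformed input."""
    rules = []
    for raw in ips:
        raw = raw.strip()
        if not raw:
            continue
        try:
            ipaddress.ip_address(raw)  # accepts IPv4 and IPv6
        except ValueError:
            continue  # don't let a garbage line produce an invalid directive
        rules.append(f"deny {raw};")
    return rules

# Example: feed it the FAKE entries collected by the verification script
fake_ips = ["203.0.113.42", "198.51.100.7", "not-an-ip"]
print("\n".join(build_deny_rules(fake_ips)))
```

Redirect the output into an nginx include file and reload, as shown in the automation section later in this guide.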
Identifying Scraper Bots
Content scrapers are bots that systematically download your pages to steal content, harvest pricing data, or replicate your site. They are often harder to detect than brute-force bots because they try to mimic human browsing patterns.
Behavioral Signals in Server Logs
- Sequential path crawling: Requests follow predictable URL patterns (e.g., /product/1, /product/2, /product/3)
- No static asset requests: Real browsers load CSS, JS, images, and fonts. Scrapers typically skip these entirely
- Uniform request intervals: Human browsing has irregular timing; scrapers fire requests at fixed intervals (e.g., exactly 2.0 seconds apart)
- Missing or static referrer: Every page request has the same referrer or no referrer at all
- High page-to-session ratio: Hundreds of pages accessed from a single IP with no dwell time
Detection with grep and awk
# Find IPs that request more than 100 pages but zero CSS/JS/image files
# Step 1: IPs with high page request counts
awk '$7 ~ /\.(html|php)$|\/$/ {print $1}' access.log | sort | uniq -c | sort -rn | \
awk '$1 > 100 {print $2}' > high_volume_ips.txt
# Step 2: Check which of those IPs never requested static assets
while read ip; do
static_count=$(grep "^$ip " access.log | grep -cE '\.(css|js|png|jpg|woff2|svg)')
page_count=$(grep "^$ip " access.log | grep -cvE '\.(css|js|png|jpg|woff2|svg|ico)')
if [ "$static_count" -eq 0 ] && [ "$page_count" -gt 50 ]; then
echo "SCRAPER: $ip ($page_count pages, 0 static assets)"
fi
done < high_volume_ips.txt
# Inspect inter-request intervals for one suspect IP (substitute the IP under
# investigation); requires GNU date
awk '/^203\.0\.113\.42/ {print $4}' access.log | \
sed 's/\[//' | \
while read ts; do
date -d "$(echo $ts | sed 's/:/ /' | sed 's/\// /g')" +%s
done | awk 'NR>1 {print $1 - prev} {prev=$1}' | sort | uniq -c | sort -rn
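If GNU date is unavailable, or you want a single regularity metric instead of eyeballing interval counts, the same check is straightforward in Python. A sketch, assuming the timestamps have already been converted to epoch seconds; the 0.5-second threshold matches the scoring table later in this guide:

```python
import statistics

def interval_stddev(timestamps):
    """Return the stddev of gaps between consecutive requests, in seconds."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if len(gaps) < 2:
        return None  # too few requests to judge regularity
    return statistics.stdev(gaps)

# A human browses irregularly; a scraper fires at fixed intervals
human = [0, 3.1, 9.8, 11.2, 30.5]
scraper = [0, 2.0, 4.0, 6.0, 8.0]
for label, ts in (("human", human), ("scraper", scraper)):
    sd = interval_stddev(ts)
    verdict = "SUSPICIOUS" if sd < 0.5 else "ok"
    print(f"{label}: stddev={sd:.2f}s -> {verdict}")
```

A near-zero standard deviation across dozens of requests is essentially impossible for a human and is one of the hardest signals for a scraper to fake without slowing itself down.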
User-Agent Analysis
# Find User-Agents making the most requests
awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -rn | head -20
# Common scraper User-Agent patterns
grep -iE "(python-requests|scrapy|wget|curl|httpclient|java/|Go-http|libwww)" access.log | \
awk '{print $1}' | sort | uniq -c | sort -rn
🔑 Key Insight: Sophisticated scrapers use real browser User-Agents and even execute JavaScript. For these, behavioral analysis (request patterns, timing, asset loading ratios) is the only reliable detection method. LogBeast calculates per-IP behavioral scores that flag these advanced scrapers automatically.
Credential Stuffing Detection
Credential stuffing attacks use lists of stolen username/password pairs (from data breaches) to attempt logins on your site. They are one of the most damaging bot attacks because a successful hit gives the attacker direct access to a real user account.
Log Patterns to Watch
- POST volume to auth endpoints: Abnormal spike in POST requests to /login, /api/auth, /oauth/token, or /wp-login.php
- High 401/403 rate: Legitimate users occasionally mistype passwords; credential stuffers produce 95%+ failure rates
- Geographic anomalies: Login attempts from countries where you have no users
- Distributed source IPs: Attackers rotate through hundreds or thousands of proxy IPs
- Timing patterns: Requests arrive in machine-like bursts, often with identical inter-request gaps
Monitoring Login Endpoints
# Count POST requests to login endpoints per minute
awk '$6 ~ /POST/ && $7 ~ /\/(login|signin|api\/auth|wp-login)/' access.log | \
awk '{print substr($4, 2, 17)}' | sort | uniq -c | sort -rn | head -20
# Find IPs with high login failure rates
awk '$6 ~ /POST/ && $7 ~ /\/login/ && ($9 == 401 || $9 == 403) {print $1}' access.log | \
sort | uniq -c | sort -rn | head -20
# Check if login failures come from distributed IPs (credential stuffing signature)
awk '$6 ~ /POST/ && $7 ~ /\/login/ && $9 == 401 {print $1}' access.log | \
sort -u | wc -l
# If this number is high (100+) with many failures each, it's likely credential stuffing
Python Script for Credential Stuffing Detection
#!/usr/bin/env python3
"""Detect credential stuffing patterns in access logs."""
import re
import sys
from collections import defaultdict
from datetime import datetime
LOGIN_PATTERN = re.compile(r'(POST|PUT).*/(?:login|signin|auth|wp-login|oauth/token)')
LOG_PATTERN = re.compile(
r'(\d+\.\d+\.\d+\.\d+).*\[(.+?)\].*"(\w+) (.+?) HTTP.*" (\d+)'
)
def analyze_logs(log_file, threshold_failures=10, threshold_ips=5):
ip_failures = defaultdict(int)
ip_successes = defaultdict(int)
ip_timestamps = defaultdict(list)
minute_counts = defaultdict(int)
with open(log_file) as f:
for line in f:
match = LOG_PATTERN.search(line)
if not match:
continue
ip, timestamp, method, path, status = match.groups()
if not LOGIN_PATTERN.search(f"{method} {path}"):
continue
minute_key = timestamp[:17]
minute_counts[minute_key] += 1
if status in ('401', '403'):
ip_failures[ip] += 1
ip_timestamps[ip].append(timestamp)
elif status in ('200', '302'):
ip_successes[ip] += 1
# Report suspicious IPs
print("=== CREDENTIAL STUFFING SUSPECTS ===\n")
suspects = 0
for ip, failures in sorted(ip_failures.items(), key=lambda x: -x[1]):
successes = ip_successes.get(ip, 0)
total = failures + successes
failure_rate = failures / total if total else 0
if failures >= threshold_failures and failure_rate > 0.9:
suspects += 1
print(f" IP: {ip}")
print(f" Failures: {failures} | Successes: {successes} | Rate: {failure_rate:.1%}")
print()
# Report minute-by-minute spikes
avg_per_min = sum(minute_counts.values()) / max(len(minute_counts), 1)
print(f"\n=== LOGIN VOLUME SPIKES (avg: {avg_per_min:.1f}/min) ===\n")
for minute, count in sorted(minute_counts.items(), key=lambda x: -x[1])[:10]:
if count > avg_per_min * 3:
print(f" {minute}: {count} attempts ({count/max(avg_per_min,1):.1f}x normal)")
unique_ips = len(ip_failures)
print(f"\n=== SUMMARY ===")
print(f" Unique IPs with login failures: {unique_ips}")
print(f" Suspect IPs (>{threshold_failures} failures, >90% fail rate): {suspects}")
if unique_ips > threshold_ips * 10:
print(f" ⚠ DISTRIBUTED ATTACK: {unique_ips} distinct source IPs detected")
if __name__ == "__main__":
analyze_logs(sys.argv[1])
⚠️ Warning: Credential stuffing attacks often succeed on 0.1-2% of attempts. Even a low-volume attack testing 10,000 credentials can compromise 10-200 accounts. Early detection is critical.
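The arithmetic behind that warning is worth making explicit, since expected takeovers scale linearly with the size of the stolen credential list:

```python
def expected_takeovers(credentials_tested, hit_rate):
    """Expected number of compromised accounts for a given list and hit rate."""
    return credentials_tested * hit_rate

# Hit rates of 0.1%-2% are typical for credential stuffing
for rate in (0.001, 0.02):
    n = expected_takeovers(10_000, rate)
    print(f"10,000 credentials at {rate:.1%} -> ~{n:.0f} compromised accounts")
```

Even the low end of that range justifies aggressive rate limiting and alerting on auth endpoints.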
Server-Level Blocking Techniques
Once you have identified malicious bot IPs and patterns, you need to block them at the server level. Here are production-ready configurations for the most common stacks.
Nginx: Rate Limiting and Bot Blocking
# /etc/nginx/conf.d/bot-protection.conf
# Define rate limiting zones
limit_req_zone $binary_remote_addr zone=login:10m rate=5r/m;
limit_req_zone $binary_remote_addr zone=general:10m rate=30r/m;
limit_req_zone $binary_remote_addr zone=api:10m rate=60r/m;
# Map known bad User-Agents to a block variable
map $http_user_agent $bad_bot {
default 0;
~*(python-requests|scrapy|wget|curl/|HttpClient) 1;
~*(MJ12bot|AhrefsBot|SemrushBot|DotBot) 1;
~*(masscan|nikto|sqlmap|nmap) 1;
}
# Map for fake Googlebot detection (use with geo module or Lua)
# This blocks non-Google IPs claiming to be Googlebot
map $http_user_agent $claims_googlebot {
default 0;
~*Googlebot 1;
}
server {
# Block known bad bots
if ($bad_bot) {
return 403;
}
# Rate limit login endpoints
location ~ ^/(login|signin|api/auth|wp-login\.php) {
limit_req zone=login burst=3 nodelay;
limit_req_status 429;
proxy_pass http://backend;
}
# Rate limit API endpoints
location /api/ {
limit_req zone=api burst=20 nodelay;
limit_req_status 429;
proxy_pass http://backend;
}
# General rate limiting
location / {
limit_req zone=general burst=10 nodelay;
proxy_pass http://backend;
}
# Block access to sensitive paths
location ~ /\.(env|git|svn|htaccess|htpasswd) {
return 404;
}
location ~ ^/(phpmyadmin|wp-admin|administrator|actuator) {
# Only allow from trusted IPs
allow 10.0.0.0/8;
deny all;
}
}
Apache: .htaccess Bot Blocking
# Block bad bots by User-Agent
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (python-requests|scrapy|wget|HttpClient) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (masscan|nikto|sqlmap|nmap) [NC]
RewriteRule .* - [F,L]
# Block by IP ranges (add confirmed malicious IPs)
# Note: negated Require directives only work inside a <RequireAll> block
<RequireAll>
    Require all granted
    Require not ip 203.0.113.0/24
    Require not ip 198.51.100.0/24
</RequireAll>
# Rate limit login pages with mod_evasive
<IfModule mod_evasive24.c>
DOSHashTableSize 3097
DOSPageCount 5
DOSSiteCount 50
DOSPageInterval 1
DOSSiteInterval 1
DOSBlockingPeriod 600
DOSLogDir "/var/log/mod_evasive"
</IfModule>
iptables: Network-Level Blocking
# Block specific IPs
iptables -A INPUT -s 203.0.113.42 -j DROP
iptables -A INPUT -s 198.51.100.0/24 -j DROP
# Rate limit new connections per IP (anti-DDoS)
iptables -A INPUT -p tcp --dport 80 -m connlimit --connlimit-above 50 -j REJECT
iptables -A INPUT -p tcp --dport 443 -m connlimit --connlimit-above 50 -j REJECT
# Rate limit new connections per second
iptables -A INPUT -p tcp --dport 80 -m state --state NEW -m recent --set
iptables -A INPUT -p tcp --dport 80 -m state --state NEW -m recent --update \
--seconds 60 --hitcount 30 -j DROP
# Block entire country ranges using ipset (more efficient than individual rules)
ipset create blocked_countries hash:net
ipset add blocked_countries 5.188.0.0/16 # Example range
ipset add blocked_countries 185.220.0.0/16 # Example range
iptables -A INPUT -m set --match-set blocked_countries src -j DROP
Fail2Ban: Automated Blocking
# /etc/fail2ban/filter.d/bot-detection.conf
[Definition]
failregex = ^<HOST>.*"(GET|POST|HEAD).*HTTP.*" (400|401|403|404|405) .* "(python-requests|scrapy|wget|curl|Go-http-client|Java/)".*$
^<HOST>.*"(GET|POST).*/(\.env|\.git|wp-admin|phpmyadmin|actuator).*HTTP.*".*$
^<HOST>.*"POST.*/(?:login|signin|wp-login).*HTTP.*" (401|403).*$
ignoreregex =
# /etc/fail2ban/jail.d/bot-detection.conf
[bot-detection]
enabled = true
port = http,https
filter = bot-detection
logpath = /var/log/nginx/access.log
maxretry = 10
findtime = 300
bantime = 86400
action = iptables-multiport[name=bot-detection, port="http,https"]
# Aggressive jail for credential stuffing
[credential-stuffing]
enabled = true
port = http,https
filter = bot-detection
logpath = /var/log/nginx/access.log
maxretry = 5
findtime = 60
bantime = 604800
action = iptables-multiport[name=credential-stuffing, port="http,https"]
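Before enabling these jails, verify the failregex actually matches your log format. fail2ban ships the `fail2ban-regex` tool for exactly this; as a quick stand-in, the same check can be sketched in Python by expanding the `<HOST>` token into a capture group (this mirrors, but is not identical to, fail2ban's internal handling):

```python
import re

# One of the failregex lines from the filter above
FAILREGEX = r'^<HOST>.*"POST.*/(?:login|signin|wp-login).*HTTP.*" (401|403).*$'

def compile_fail2ban(pattern):
    """Expand fail2ban's <HOST> token into a named capture group."""
    return re.compile(pattern.replace("<HOST>", r"(?P<host>\S+)"))

def matching_hosts(pattern, lines):
    """Return the host captured from every log line the pattern matches."""
    rx = compile_fail2ban(pattern)
    hosts = []
    for line in lines:
        m = rx.search(line)
        if m:
            hosts.append(m.group("host"))
    return hosts

sample = [
    '203.0.113.42 - - [10/Oct/2024:13:55:36 +0000] "POST /login HTTP/1.1" 401 512 "-" "Mozilla/5.0"',
    '198.51.100.9 - - [10/Oct/2024:13:55:37 +0000] "GET /about HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]
print(matching_hosts(FAILREGEX, sample))
```

fail2ban's real `<HOST>` expansion is more elaborate (it also handles IPv6-mapped forms), so treat this as a smoke test and confirm with `fail2ban-regex /var/log/nginx/access.log /etc/fail2ban/filter.d/bot-detection.conf` before going live.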
💡 Pro Tip: Use LogBeast to continuously analyze your logs and generate dynamic blocklists that you can feed directly into fail2ban or iptables. This creates a feedback loop: logs reveal bots, bots get blocked, logs confirm the block is working.
Advanced Detection with Log Analysis
Simple rules catch simple bots. Advanced attackers use residential proxies, real browser User-Agents, and randomized timing. To catch these, you need statistical and behavioral analysis.
Request Rate Scoring
Assign a suspicion score to each IP based on multiple behavioral factors:
| Signal | Score Weight | Detection Logic |
|---|---|---|
| High request volume | +3 | >200 requests/hour from single IP |
| No static assets | +4 | 0 CSS/JS/image requests with >20 page loads |
| Regular timing | +3 | Standard deviation of inter-request time < 0.5s |
| Sequential URLs | +3 | Requests follow numeric or alphabetic sequence |
| High error rate | +2 | >50% responses are 4xx or 5xx |
| Known bot UA | +5 | Matches known scraper/tool User-Agent |
| No cookies | +2 | Never sends session cookies after initial visit |
| Single page type | +2 | >80% requests target same URL pattern |
An IP scoring 8+ out of 24 warrants investigation. An IP scoring 12+ should be blocked automatically.
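The table maps directly onto code. A minimal sketch of the weighted scoring, assuming the per-IP signals have already been extracted from the logs (all field names here are illustrative, not from any particular tool):

```python
# (signal name, weight, predicate) -- mirrors the scoring table above
RULES = [
    ("high_volume",      3, lambda s: s["requests_per_hour"] > 200),
    ("no_static_assets", 4, lambda s: s["static_requests"] == 0 and s["page_loads"] > 20),
    ("regular_timing",   3, lambda s: s["interval_stddev"] < 0.5),
    ("sequential_urls",  3, lambda s: s["sequential_paths"]),
    ("high_error_rate",  2, lambda s: s["error_rate"] > 0.5),
    ("known_bot_ua",     5, lambda s: s["known_bot_ua"]),
    ("no_cookies",       2, lambda s: not s["sends_cookies"]),
    ("single_page_type", 2, lambda s: s["dominant_path_share"] > 0.8),
]

def suspicion_score(signals):
    """Sum the weights of all triggered rules (maximum 24)."""
    return sum(weight for _, weight, pred in RULES if pred(signals))

def verdict(score):
    if score >= 12:
        return "BLOCK"
    if score >= 8:
        return "INVESTIGATE"
    return "OK"

# A polite-looking scraper: browser UA, low error rate, but no assets,
# clockwork timing, and sequential paths
scraper = {
    "requests_per_hour": 900, "static_requests": 0, "page_loads": 450,
    "interval_stddev": 0.1, "sequential_paths": True, "error_rate": 0.02,
    "known_bot_ua": False, "sends_cookies": False, "dominant_path_share": 0.95,
}
score = suspicion_score(scraper)
print(score, verdict(score))
```

Note how this example IP scores well past the blocking threshold despite a clean User-Agent and a near-zero error rate: no single signal convicts it, but the combination does.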
Session Fingerprinting
Even when bots rotate IPs, they often share fingerprint characteristics:
#!/usr/bin/env python3
"""Fingerprint and cluster bot sessions from access logs."""
import re
import sys
from collections import defaultdict
LOG_RE = re.compile(
r'(\S+) \S+ \S+ \[(.+?)\] "(\S+) (\S+) \S+" (\d+) (\d+) "([^"]*)" "([^"]*)"'
)
def fingerprint_ip(lines):
"""Create a behavioral fingerprint for an IP's session."""
paths = []
status_codes = defaultdict(int)
has_static = False
ua_set = set()
sizes = []
for line in lines:
m = LOG_RE.search(line)
if not m:
continue
ip, ts, method, path, status, size, referer, ua = m.groups()
paths.append(path)
status_codes[status] += 1
ua_set.add(ua)
sizes.append(int(size) if size != '-' else 0)
if re.search(r'\.(css|js|png|jpg|gif|woff|svg|ico)$', path):
has_static = True
total = sum(status_codes.values())
error_rate = sum(v for k, v in status_codes.items() if k.startswith(('4', '5'))) / max(total, 1)
unique_paths = len(set(paths))
return {
'total_requests': total,
'unique_paths': unique_paths,
'error_rate': round(error_rate, 2),
'has_static_assets': has_static,
'unique_user_agents': len(ua_set),
'avg_response_size': sum(sizes) // max(len(sizes), 1),
'path_diversity': round(unique_paths / max(total, 1), 2),
}
def score_fingerprint(fp):
score = 0
if fp['total_requests'] > 200:
score += 3
if not fp['has_static_assets'] and fp['total_requests'] > 20:
score += 4
if fp['error_rate'] > 0.5:
score += 2
if fp['path_diversity'] < 0.1:
score += 2
if fp['unique_user_agents'] > 3:
score += 2 # rotating UAs is suspicious
return score
if __name__ == "__main__":
ip_lines = defaultdict(list)
with open(sys.argv[1]) as f:
for line in f:
ip = line.split()[0]
ip_lines[ip].append(line)
print(f"{'IP':<20} {'Reqs':>6} {'Errors':>7} {'Static':>7} {'Score':>6} {'Verdict'}")
print("-" * 75)
for ip, lines in sorted(ip_lines.items(), key=lambda x: -len(x[1]))[:50]:
fp = fingerprint_ip(lines)
score = score_fingerprint(fp)
verdict = "🔴 BLOCK" if score >= 12 else "🟡 WATCH" if score >= 8 else "✅ OK"
print(f"{ip:<20} {fp['total_requests']:>6} {fp['error_rate']:>6.0%} "
f"{'Yes' if fp['has_static_assets'] else 'No':>7} {score:>6} {verdict}")
ASN and Hosting Provider Analysis
Legitimate users rarely browse from data center IPs. If you see traffic from hosting providers like DigitalOcean, AWS, Hetzner, or OVH hitting your user-facing pages, it is almost certainly automated:
# Install whois and use it to check ASN for suspicious IPs
while read ip; do
asn_info=$(whois -h whois.cymru.com " -v $ip" 2>/dev/null | tail -1)
echo "$ip | $asn_info"
done < suspicious_ips.txt
# Common hosting ASNs to flag:
# AS14061 - DigitalOcean
# AS16509 - Amazon AWS
# AS24940 - Hetzner
# AS16276 - OVH
# AS45090 - Tencent Cloud
# AS37963 - Alibaba Cloud
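Flagging hosting-provider traffic can be automated by parsing the Team Cymru output. A sketch, assuming the pipe-delimited rows the whois command above returns (`AS | IP | ... | AS Name`; columns are trimmed in the sample for brevity):

```python
# ASNs belonging to hosting providers (extend with your own observations)
HOSTING_ASNS = {"14061", "16509", "24940", "16276", "45090", "37963"}

def flag_hosting_ips(cymru_lines):
    """Parse 'AS | IP | ...' rows; return (ip, asn) pairs on hosting ASNs."""
    flagged = []
    for line in cymru_lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 2 or not fields[0].isdigit():
            continue  # skip the header row and malformed lines
        asn, ip = fields[0], fields[1]
        if asn in HOSTING_ASNS:
            flagged.append((ip, asn))
    return flagged

sample = [
    "AS      | IP               | AS Name",
    "14061   | 203.0.113.42     | DIGITALOCEAN-ASN, US",
    "7922    | 198.51.100.9     | COMCAST-7922, US",
]
print(flag_hosting_ips(sample))
```

Data-center origin alone is not proof of malice (VPNs and corporate proxies also live there), which is why the next step is to combine it with the behavioral score.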
🔑 Key Insight: Combine ASN analysis with behavioral scoring. A data center IP with a high suspicion score is almost certainly a bot. CrawlBeast can help you verify your blocking rules by crawling your site from different IPs and confirming that legitimate access still works while malicious patterns are blocked.
Building a Bot Management Strategy
Effective bot management is not a one-time configuration but an ongoing process. Here is a framework for building a sustainable strategy.
1. Establish a Baseline
Before you can detect anomalies, you need to know what normal looks like:
- Normal request volume: Average requests per minute/hour/day
- Typical bot ratio: What percentage of traffic is bots vs. humans
- Login attempt baseline: Normal login failure rate and volume
- Geographic distribution: Where your real users come from
- Peak traffic patterns: When your site naturally gets more traffic
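The volume baseline above is a one-off script. A sketch, assuming `(minute, ip)` pairs have already been parsed out of the access log; the mean-plus-three-sigma threshold is a common starting point for anomaly alerts, not a universal constant:

```python
from collections import Counter
import statistics

def baseline(minute_ip_pairs):
    """Per-minute volume stats plus a mean + 3-sigma anomaly threshold."""
    per_minute = Counter(minute for minute, _ in minute_ip_pairs)
    counts = list(per_minute.values())
    mean = statistics.mean(counts)
    stdev = statistics.stdev(counts) if len(counts) > 1 else 0.0
    return {
        "avg_per_minute": round(mean, 1),
        "peak_per_minute": max(counts),
        "alert_threshold": round(mean + 3 * stdev, 1),  # flag minutes above this
    }

pairs = [("13:55", "1.2.3.4"), ("13:55", "5.6.7.8"), ("13:56", "1.2.3.4")]
print(baseline(pairs))
```

Run this over at least a full week of logs so the baseline captures your natural weekday/weekend cycle before you start alerting on deviations.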
2. Implement Layered Defense
No single technique stops all bots. Use defense in depth:
| Layer | Technique | Blocks |
|---|---|---|
| Network | iptables, ipset, firewall rules | Known-bad IPs, DDoS floods, port scans |
| Edge / CDN | Cloudflare, AWS WAF, rate limiting | Volumetric attacks, known bot signatures |
| Web Server | Nginx/Apache rules, mod_security | Bad UAs, path traversals, injection attempts |
| Application | CAPTCHA, device fingerprinting, JS challenges | Headless browsers, advanced scrapers |
| Log Analysis | LogBeast, custom scripts, SIEM | Behavioral anomalies, slow-and-low attacks, new patterns |
3. Create a Response Playbook
- Severity 1 (Critical): Credential stuffing, active DDoS -- Block immediately at network level, alert security team
- Severity 2 (High): Aggressive scraping, fake Googlebots -- Block at web server level, review daily
- Severity 3 (Medium): SEO spam, comment spam -- Mitigate with rate limiting and CAPTCHA, review weekly
- Severity 4 (Low): Known benign bots behaving aggressively -- Rate limit, monitor, adjust crawl-delay in robots.txt
4. Automate and Iterate
Manual log review does not scale. Automate your detection and blocking pipeline:
# Example: Automated daily bot analysis pipeline
#!/bin/bash
# daily_bot_scan.sh - Run via cron at midnight
LOG="/var/log/nginx/access.log"
BLOCKLIST="/etc/nginx/conf.d/blocklist.conf"
REPORT="/var/log/bot-reports/$(date +%Y-%m-%d).txt"
# 1. Extract suspicious IPs (>500 requests, >80% error rate)
python3 /opt/scripts/score_ips.py "$LOG" --threshold 12 > /tmp/block_candidates.txt
# 2. Verify none are legitimate (reverse DNS check on Googlebot claimants)
python3 /opt/scripts/verify_bots.py /tmp/block_candidates.txt > /tmp/verified_bad.txt
# 3. Update nginx blocklist
echo "# Auto-generated $(date)" > "$BLOCKLIST"
while read ip; do
echo "deny $ip;" >> "$BLOCKLIST"
done < /tmp/verified_bad.txt
# 4. Reload nginx
nginx -t && systemctl reload nginx
# 5. Generate report
cat /tmp/verified_bad.txt | wc -l | xargs -I{} echo "Blocked {} IPs on $(date)" > "$REPORT"
cat /tmp/verified_bad.txt >> "$REPORT"
💡 Pro Tip: LogBeast provides automated bot scoring, trend analysis, and exportable blocklists out of the box. Pair it with CrawlBeast to verify that your blocking rules do not accidentally block legitimate crawlers like Googlebot, Bingbot, or your own monitoring tools.
5. Measure Effectiveness
Track these metrics to verify your bot management is working:
- Bot-to-human ratio: Should decrease over time as blocking improves
- Login failure rate: Should drop after credential stuffing mitigation
- Server resource usage: CPU and bandwidth consumed by bot traffic should decline
- False positive rate: Monitor support tickets for users incorrectly blocked
- New bot patterns: Track how many new unrecognized bot signatures appear each week
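The first metric is simple to track daily. A sketch using a crude User-Agent classifier (a real deployment should fold in the behavioral signals from earlier sections, since sophisticated bots spoof browser UAs):

```python
import re

# Coarse UA-based classifier; catches declared bots and common tools only
BOT_UA = re.compile(r"bot|crawler|spider|python-requests|curl|scrapy", re.I)

def bot_ratio(user_agents):
    """Fraction of requests whose User-Agent matches a known bot pattern."""
    if not user_agents:
        return 0.0
    bots = sum(1 for ua in user_agents if BOT_UA.search(ua))
    return bots / len(user_agents)

uas = [
    "Mozilla/5.0 (Windows NT 10.0)",
    "Googlebot/2.1",
    "python-requests/2.31",
    "Mozilla/5.0 (iPhone)",
]
print(f"bot ratio: {bot_ratio(uas):.0%}")
```

Chart this number per day; a downward trend confirms your blocking is working, while a sudden drop to near zero usually means bots have switched to browser UAs rather than gone away.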
Conclusion
Malicious bots are not going away. As defenses improve, attackers adapt with residential proxies, headless browsers, and AI-generated browsing patterns. The fundamentals, however, remain constant: every request leaves a trace in your server logs, and statistical analysis of those traces will always reveal automated behavior.
The key takeaways from this guide:
- Classify first, block second. Understand what type of bot you are dealing with before writing rules
- Verify Googlebot claims. Reverse DNS is the definitive test; never trust User-Agent strings alone
- Use behavioral signals. Request patterns, timing, and asset loading ratios are harder for bots to fake than headers
- Layer your defenses. Network, server, application, and log analysis layers catch different bot categories
- Automate the pipeline. Manual log review does not scale; build scripts and use tools like LogBeast to stay ahead
Start with your server logs today. Run the commands in this guide against your access logs, and you will likely discover bot traffic you never knew existed. From there, build your rules, automate your blocking, and iterate.
🎯 Next Steps: Read our guide on detecting DDoS attacks in server logs for more on volumetric attack detection, and check out the complete server logs guide for a primer on log formats and parsing techniques.