1. Why Log Format Matters
Every HTTP request that hits your web server generates a log entry. The format of that entry determines what you can analyze, how quickly you can parse it, and whether your monitoring pipeline can ingest it efficiently. Choosing the right log format is not a trivial decision -- it directly impacts your ability to debug production issues, detect security threats, optimize performance, and understand traffic patterns.
Apache HTTP Server and Nginx account for over 70% of all web servers on the internet. Despite serving the same fundamental purpose, they use different syntax for log format configuration, different default field orders, and different variable naming conventions. Understanding both is essential for any operations engineer, SRE, or developer working with web infrastructure.
Key Insight: The default log formats for both Apache and Nginx derive from the NCSA Common Log Format defined in the early 1990s. Despite being over 30 years old, CLF remains the foundation that most log analysis tools expect. Understanding this lineage helps you make informed decisions about custom formats.
In this guide, we will dissect every field in both Apache and Nginx log formats, compare their configuration directives side-by-side, build custom JSON log formats for modern observability pipelines, and write parsers in multiple languages. By the end, you will have a complete reference for any log format scenario you encounter.
2. Apache Common Log Format (CLF)
The Common Log Format is the most basic standardized log format. Apache defines it with the following LogFormat directive:
```apache
LogFormat "%h %l %u %t \"%r\" %>s %b" common
CustomLog /var/log/apache2/access.log common
```
A typical CLF entry looks like this:
```
203.0.113.50 - frank [10/Feb/2025:13:55:36 -0700] "GET /api/v2/users HTTP/1.1" 200 2326
```
Field-by-Field Breakdown
| Field | Directive | Example Value | Description |
|---|---|---|---|
| Remote Host | `%h` | `203.0.113.50` | Client IP address. Logs the DNS hostname instead if `HostnameLookups On` |
| Identity | `%l` | `-` | RFC 1413 identity. Almost always a hyphen. Requires `mod_ident` |
| User | `%u` | `frank` | Authenticated username. Hyphen if no auth |
| Timestamp | `%t` | `[10/Feb/2025:13:55:36 -0700]` | Request time in `[dd/Mon/yyyy:HH:mm:ss zzzzz]` format (customizable with strftime codes) |
| Request Line | `%r` | `GET /api/v2/users HTTP/1.1` | Full first line of the request: method, URI, protocol |
| Status Code | `%>s` | `200` | Final HTTP status code (after internal redirects) |
| Bytes Sent | `%b` | `2326` | Response body size in bytes. Hyphen for zero bytes; use `%B` for a numeric zero |
Warning: The %h directive will perform a DNS reverse lookup if HostnameLookups is enabled, which can significantly slow your server under load. Always keep HostnameLookups Off in production and use %a if you need the client IP behind a proxy.
Status Code Nuance: %s vs %>s
Apache distinguishes between the original status code (%s) and the final status code (%>s). This matters when internal redirects occur:
```apache
# Original request returns 301, internal redirect returns 200
# %s  = 301 (original status)
# %>s = 200 (final status after redirect)

# For ErrorDocument handling:
# Request to /missing -> 404 -> ErrorDocument -> 200
# %s  = 404
# %>s = 200
```
Always use %>s in production log formats unless you specifically need to track pre-redirect status codes for debugging rewrite rules.
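When debugging rewrite rules, it can help to log both values side by side. A diagnostic format along these lines would work (the format name `status_debug` and log path are our example, not an Apache convention):

```apache
# Log original and final status next to each other for rewrite debugging
LogFormat "%h %t \"%r\" %s %>s" status_debug
CustomLog /var/log/apache2/status_debug.log status_debug
```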
3. Apache Combined Log Format
The Combined Log Format extends CLF with two critical fields: Referer and User-Agent. This is the de facto standard for web analytics and the default recommendation for most deployments.
```apache
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
CustomLog /var/log/apache2/access.log combined
```
Example output:
```
203.0.113.50 - frank [10/Feb/2025:13:55:36 -0700] "GET /api/v2/users HTTP/1.1" 200 2326 "https://example.com/dashboard" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
```
The Two Extra Fields
| Field | Directive | Description |
|---|---|---|
| Referer | `%{Referer}i` | The URL the client came from. Hyphen if direct/bookmarked. Note: the header name "Referer" is a historical misspelling of "referrer" |
| User-Agent | `%{User-Agent}i` | Browser or bot identification string. Critical for bot detection and SEO analysis |
Custom LogFormat Directives
Apache's mod_log_config supports extensive customization. Here are the most useful directives beyond the defaults:
```apache
# Add response time in microseconds
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %D" combined_with_time

# Log request duration in both seconds and microseconds
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %T/%D" timed

# Add SSL protocol and cipher
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{SSL_PROTOCOL}x %{SSL_CIPHER}x" combined_ssl

# Add X-Forwarded-For for reverse proxy setups
LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" proxy_combined

# Add virtual host and server port
LogFormat "%v:%p %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined
```
| Directive | Description | Example Value |
|---|---|---|
| `%D` | Request processing time in microseconds | `234567` |
| `%T` | Request processing time in seconds | `0` |
| `%{ms}T` | Request processing time in milliseconds | `234` |
| `%I` | Bytes received (requires `mod_logio`) | `4872` |
| `%O` | Bytes sent including headers (requires `mod_logio`) | `23456` |
| `%v` | Canonical server name | `www.example.com` |
| `%p` | Server port | `443` |
| `%X` | Connection status: `X` = aborted, `+` = keep-alive, `-` = closed | `+` |
| `%{VARNAME}e` | Environment variable | Varies |
| `%{Header}i` | Request header value | Varies |
| `%{Header}o` | Response header value | Varies |
Conditional Logging
Apache supports conditional logging based on environment variables, which is useful for excluding health checks or internal traffic:
```apache
# Don't log health check requests
SetEnvIf Request_URI "^/health$" dontlog
SetEnvIf Request_URI "^/readyz$" dontlog
CustomLog /var/log/apache2/access.log combined env=!dontlog

# Log bots to a separate file
SetEnvIf User-Agent "Googlebot" is_bot
SetEnvIf User-Agent "bingbot" is_bot
SetEnvIf User-Agent "GPTBot" is_bot
CustomLog /var/log/apache2/bot_access.log combined env=is_bot
CustomLog /var/log/apache2/human_access.log combined env=!is_bot
```
Best Practice: Separating bot traffic into its own log file makes analysis significantly faster. You can monitor Googlebot crawl patterns without filtering through millions of human visitor lines. LogBeast can automatically detect and categorize bot traffic from any log format.
4. Nginx Default Log Format
Nginx defines its default log format in the http block using the log_format directive. The built-in format is called combined and closely mirrors Apache's Combined Log Format:
```nginx
# Nginx built-in default (you don't need to define this)
log_format combined '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent"';

access_log /var/log/nginx/access.log combined;
error_log  /var/log/nginx/error.log warn;
```
Example output:
```
203.0.113.50 - frank [10/Feb/2025:13:55:36 +0000] "GET /api/v2/users HTTP/1.1" 200 2326 "https://example.com/dashboard" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
```
Nginx Variable Reference
| Nginx Variable | Apache Equivalent | Description |
|---|---|---|
| `$remote_addr` | `%h` / `%a` | Client IP address |
| `$remote_user` | `%u` | Authenticated username |
| `$time_local` | `%t` | Local time in CLF format |
| `$time_iso8601` | `%{%Y-%m-%dT%H:%M:%S%z}t` | ISO 8601 timestamp |
| `$request` | `%r` | Full request line |
| `$status` | `%>s` | Response status code |
| `$body_bytes_sent` | `%b` | Body bytes sent (excludes headers) |
| `$bytes_sent` | `%O` | Total bytes sent (includes headers) |
| `$http_referer` | `%{Referer}i` | Referer header |
| `$http_user_agent` | `%{User-Agent}i` | User-Agent header |
| `$request_time` | `%D` (different unit) | Request processing time in seconds with millisecond resolution |
| `$upstream_response_time` | N/A | Time spent waiting for the upstream (proxy/FastCGI) |
| `$connection` | N/A | Connection serial number |
| `$connection_requests` | N/A | Number of requests on this connection |
| `$msec` | N/A | Time in seconds with millisecond resolution at log write |
| `$pipe` | N/A | Pipelined request indicator: `p` or `.` |
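Several of these variables only pay off once you log them together. As an illustration (the format name `timing` and log path are our example), a timing-oriented format built from the variables above might look like:

```nginx
log_format timing '$remote_addr [$time_local] "$request" $status '
                  '$request_time $upstream_response_time $pipe';
access_log /var/log/nginx/timing.log timing;
```

Comparing `$request_time` against `$upstream_response_time` in such a log quickly shows whether latency comes from the backend or from Nginx itself (buffering, slow clients).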
Nginx Error Log Configuration
Unlike access logs, Nginx error logs have a fixed format that cannot be customized. You can only control the severity level:
```nginx
# Error log levels (from most to least verbose):
# debug, info, notice, warn, error, crit, alert, emerg
error_log /var/log/nginx/error.log warn;

# Per-server block error logs
server {
    listen 443 ssl;
    server_name example.com;
    error_log /var/log/nginx/example.com.error.log error;
}

# Error log format (fixed, cannot be changed):
# 2025/02/10 13:55:36 [error] 1234#1234: *5678 open() "/var/www/html/missing.html" failed (2: No such file or directory), client: 203.0.113.50, server: example.com, request: "GET /missing.html HTTP/1.1", host: "example.com"
```
Warning: Setting error_log to debug level in production generates enormous volumes of output and measurably impacts performance. Use warn or error for production, and only enable debug temporarily when investigating specific issues.
5. Custom Log Formats: Apache vs Nginx Directives
The real power of web server logging comes from custom formats. Here is a comprehensive side-by-side comparison for achieving the same data capture in both servers.
Complete Directive Comparison Table
| Data Point | Apache Directive | Nginx Variable |
|---|---|---|
| Client IP | `%a` | `$remote_addr` |
| Client IP (behind proxy) | `%{X-Forwarded-For}i` | `$http_x_forwarded_for` |
| Real client IP (proxy-aware) | `%a` (with `mod_remoteip`) | `$realip_remote_addr` (with the realip module) |
| Server hostname | `%v` | `$server_name` |
| Server port | `%p` | `$server_port` |
| Request method | `%m` | `$request_method` |
| Request URI | `%U` | `$uri` |
| Request URI (original) | `%U%q` | `$request_uri` |
| Query string | `%q` | `$args` |
| Protocol | `%H` | `$server_protocol` |
| Request time (seconds) | `%T` | `$request_time` |
| Request time (microseconds) | `%D` | N/A (`$request_time` is in seconds; scale it downstream) |
| Request time (milliseconds) | `%{ms}T` | N/A (`$request_time` is in seconds; scale it downstream) |
| Bytes received | `%I` | `$request_length` |
| Bytes sent (body only) | `%b` | `$body_bytes_sent` |
| Bytes sent (total) | `%O` | `$bytes_sent` |
| SSL protocol | `%{SSL_PROTOCOL}x` | `$ssl_protocol` |
| SSL cipher | `%{SSL_CIPHER}x` | `$ssl_cipher` |
| HTTP/2 | `%{H2_PUSH}e` (push indicator) | `$http2` (negotiated protocol; empty if not HTTP/2) |
| Upstream response time | N/A (`%{BALANCER_WORKER_ROUTE}e` logs the balancer route, not timing) | `$upstream_response_time` |
| Upstream status | N/A | `$upstream_status` |
| Upstream address | N/A | `$upstream_addr` |
| GeoIP country | `%{GEOIP_COUNTRY_CODE}e` | `$geoip_country_code` |
| Connection reuse | `%X` (connection status) | `$connection_requests` (requests on this connection) |
| Any request header | `%{HeaderName}i` | `$http_headername` (lowercase, hyphens become underscores) |
| Any response header | `%{HeaderName}o` | `$sent_http_headername` |
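The header-name mapping in the last two rows is mechanical, which makes it easy to encode. A tiny sketch (the helper name is ours):

```python
def nginx_request_header_var(header_name: str) -> str:
    """Map an HTTP request header name to its Nginx $http_* variable:
    lowercase the name and turn hyphens into underscores."""
    return '$http_' + header_name.lower().replace('-', '_')

print(nginx_request_header_var('User-Agent'))       # -> $http_user_agent
print(nginx_request_header_var('X-Forwarded-For'))  # -> $http_x_forwarded_for
```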
Production-Ready Custom Formats
Here are battle-tested custom formats for both servers that include the most useful fields for production analysis:
```apache
# Apache - Extended production format
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %D %{X-Forwarded-For}i %v %{SSL_PROTOCOL}x %X" production

# Usage:
CustomLog /var/log/apache2/access.log production
```

```nginx
# Nginx - Extended production format
log_format production '$remote_addr - $remote_user [$time_local] '
                      '"$request" $status $body_bytes_sent '
                      '"$http_referer" "$http_user_agent" '
                      '$request_time $http_x_forwarded_for '
                      '$server_name $ssl_protocol '
                      '$upstream_response_time $upstream_status';

access_log /var/log/nginx/access.log production;
```
Tip: When adding custom fields, always append them to the end of the Combined format rather than inserting them in the middle. This ensures backward compatibility with existing log parsers that expect CLF/Combined field order.
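One way to check that appended fields stay backward compatible is to confirm that a plain Combined-format parser still matches the extended line. A quick sketch (regex and sample line are illustrative):

```python
import re

# Combined-format prefix; re.match anchors at the start of the line, so
# any fields appended after the User-Agent are simply ignored.
COMBINED_PREFIX = re.compile(
    r'^(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<ts>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<useragent>[^"]*)"'
)

# Extended line: Combined fields plus an appended %D value (234567)
line = ('203.0.113.50 - - [10/Feb/2025:13:55:36 +0000] '
        '"GET / HTTP/1.1" 200 2326 "-" "curl/8.5.0" 234567')

m = COMBINED_PREFIX.match(line)
print(m is not None, m.group('status'), m.group('bytes'))
```

Had the `%D` value been inserted before the status code instead, the same pattern would fail to match, which is exactly why appending is the safe choice.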
6. JSON Structured Logging
Modern observability stacks (Elasticsearch, Splunk, Datadog, Loki) work best with structured data. JSON log formats eliminate parsing ambiguity, support nested fields, and enable schema evolution without breaking downstream consumers.
Apache JSON Configuration
Apache does not natively output JSON, so you must construct it manually in the LogFormat directive. Special care is needed to escape quotes within field values:
```apache
# Apache JSON log format
LogFormat "{\"timestamp\":\"%{%Y-%m-%dT%H:%M:%S%z}t\",\"remote_addr\":\"%a\",\"remote_user\":\"%u\",\"request_method\":\"%m\",\"request_uri\":\"%U%q\",\"protocol\":\"%H\",\"status\":%>s,\"body_bytes_sent\":%B,\"http_referer\":\"%{Referer}i\",\"http_user_agent\":\"%{User-Agent}i\",\"request_time_us\":%D,\"ssl_protocol\":\"%{SSL_PROTOCOL}x\",\"ssl_cipher\":\"%{SSL_CIPHER}x\",\"x_forwarded_for\":\"%{X-Forwarded-For}i\",\"vhost\":\"%v\",\"server_port\":\"%p\"}" json
CustomLog /var/log/apache2/access.json.log json
```
Example JSON output (formatted for readability):
```json
{
  "timestamp": "2025-02-10T13:55:36+0000",
  "remote_addr": "203.0.113.50",
  "remote_user": "-",
  "request_method": "GET",
  "request_uri": "/api/v2/users?page=2",
  "protocol": "HTTP/1.1",
  "status": 200,
  "body_bytes_sent": 2326,
  "http_referer": "https://example.com/dashboard",
  "http_user_agent": "Mozilla/5.0 ...",
  "request_time_us": 234567,
  "ssl_protocol": "TLSv1.3",
  "ssl_cipher": "TLS_AES_256_GCM_SHA384",
  "x_forwarded_for": "198.51.100.78",
  "vhost": "api.example.com",
  "server_port": "443"
}
```
Warning: Apache's manual JSON construction is fragile. If a User-Agent string contains an unescaped double quote, it will produce invalid JSON. Consider piping logs through jq for validation, or use a log shipper (Filebeat, Fluentd) that handles JSON encoding properly.
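A lightweight guard against this failure mode is to validate each line before it enters the pipeline. A minimal sketch (the helper name and sample lines are ours):

```python
import json

def invalid_json_lines(lines):
    """Return (line_number, error) pairs for lines that are not valid JSON."""
    bad = []
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines rather than flagging them
        try:
            json.loads(line)
        except json.JSONDecodeError as e:
            bad.append((n, str(e)))
    return bad

# A well-formed entry, and one broken by an unescaped quote in the UA string
sample = [
    '{"status": 200, "http_user_agent": "curl/8.5.0"}',
    '{"status": 200, "http_user_agent": "Mozilla/5.0 "broken""}',
]
for n, err in invalid_json_lines(sample):
    print(f"line {n}: {err}")
```

Run periodically (or on a sample of lines) this catches escaping bugs before they silently poison an Elasticsearch or Loki index.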
Nginx JSON Configuration
Nginx has the same limitation -- no native JSON output. However, Nginx's escape=json parameter (available since 1.11.8) properly escapes special characters in variable values:
```nginx
# Nginx JSON log format with proper escaping
log_format json_log escape=json
  '{'
    '"timestamp":"$time_iso8601",'
    '"remote_addr":"$remote_addr",'
    '"remote_user":"$remote_user",'
    '"request_method":"$request_method",'
    '"request_uri":"$request_uri",'
    '"protocol":"$server_protocol",'
    '"status":$status,'
    '"body_bytes_sent":$body_bytes_sent,'
    '"request_length":$request_length,'
    '"http_referer":"$http_referer",'
    '"http_user_agent":"$http_user_agent",'
    '"request_time":$request_time,'
    '"upstream_response_time":"$upstream_response_time",'
    '"upstream_status":"$upstream_status",'
    '"upstream_addr":"$upstream_addr",'
    '"ssl_protocol":"$ssl_protocol",'
    '"ssl_cipher":"$ssl_cipher",'
    '"http_x_forwarded_for":"$http_x_forwarded_for",'
    '"server_name":"$server_name",'
    '"server_port":"$server_port",'
    '"connection":$connection,'
    '"connection_requests":$connection_requests,'
    '"pipe":"$pipe"'
  '}';

access_log /var/log/nginx/access.json.log json_log;
```
Best Practice: Always use escape=json in Nginx JSON log formats. Without it, user-agent strings, referer URLs, and request URIs containing special characters will produce invalid JSON that breaks your ingestion pipeline. This single directive saves hours of debugging.
JSON Format Comparison
| Feature | Apache JSON | Nginx JSON |
|---|---|---|
| Native JSON support | No | No |
| Auto-escaping | No (manual) | Yes (`escape=json`) |
| Numeric types | Manual (omit quotes) | Manual (omit quotes) |
| Nested objects | Not supported | Not supported |
| Array values | Not supported | Not supported |
| ISO 8601 timestamps | `%{%Y-%m-%dT%H:%M:%S%z}t` | `$time_iso8601` |
| Upstream metrics | Limited | Comprehensive |
| Broken JSON risk | High | Low (with `escape=json`) |
7. Log Rotation and Management
Without rotation, web server logs grow unbounded. A busy site serving 10 million requests per day produces roughly 3.3 GB of Combined format logs daily. Proper rotation ensures you retain useful history without exhausting disk space.
Apache Logrotate Configuration
```
# /etc/logrotate.d/apache2
/var/log/apache2/*.log {
    daily
    missingok
    rotate 52
    compress
    delaycompress
    notifempty
    create 640 root adm
    sharedscripts
    postrotate
        if invoke-rc.d apache2 status > /dev/null 2>&1; then
            invoke-rc.d apache2 reload > /dev/null
        fi
    endscript
}

# For JSON logs with different retention
/var/log/apache2/*.json.log {
    daily
    missingok
    rotate 14
    compress
    delaycompress
    notifempty
    create 640 root adm
    sharedscripts
    postrotate
        if invoke-rc.d apache2 status > /dev/null 2>&1; then
            invoke-rc.d apache2 reload > /dev/null
        fi
    endscript
}
```
Nginx Logrotate Configuration
```
# /etc/logrotate.d/nginx
/var/log/nginx/*.log {
    daily
    missingok
    rotate 52
    compress
    delaycompress
    notifempty
    create 0640 www-data adm
    sharedscripts
    prerotate
        if [ -d /etc/logrotate.d/httpd-prerotate ]; then
            run-parts /etc/logrotate.d/httpd-prerotate
        fi
    endscript
    postrotate
        invoke-rc.d nginx rotate >/dev/null 2>&1
    endscript
}
```
Key Difference: Apache requires a full reload (graceful restart) to reopen log files after rotation. Nginx supports a dedicated rotate signal (USR1) that reopens log files without any service interruption. This makes Nginx log rotation zero-downtime by default.
Signal-Based Rotation
```bash
# Manual rotation with signals

# Apache - graceful restart to reopen logs
sudo apachectl graceful
# or
sudo kill -USR1 $(cat /var/run/apache2/apache2.pid)

# Nginx - reopen log files (zero downtime)
sudo nginx -s reopen
# or
sudo kill -USR1 $(cat /var/run/nginx.pid)
```
Disk Space Estimation
| Requests/day | CLF (~150 bytes/line) | Combined (~350 bytes/line) | JSON (~600 bytes/line) |
|---|---|---|---|
| 100,000 | ~14 MB | ~33 MB | ~57 MB |
| 1,000,000 | ~143 MB | ~333 MB | ~572 MB |
| 10,000,000 | ~1.4 GB | ~3.3 GB | ~5.7 GB |
| 100,000,000 | ~14 GB | ~33 GB | ~57 GB |
With gzip compression (typical for logrotate), expect 85-95% size reduction. A 3.3 GB daily Combined log compresses to roughly 250-500 MB.
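The figures in the table are straightforward to reproduce. A back-of-envelope sketch using the same per-line assumptions (function names and the 90% compression figure are ours):

```python
def daily_log_size_mib(requests_per_day: int, bytes_per_line: int) -> float:
    """Rough daily log volume in MiB, matching the table's assumptions."""
    return requests_per_day * bytes_per_line / (1024 * 1024)

def compressed_mib(raw_mib: float, reduction: float = 0.90) -> float:
    """Apply an assumed gzip reduction (85-95% is typical; 90% here)."""
    return raw_mib * (1 - reduction)

raw = daily_log_size_mib(10_000_000, 350)  # Combined format, 10M req/day
print(f"raw: {raw:.0f} MiB, gzipped (est.): {compressed_mib(raw):.0f} MiB")
```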
8. Parsing Log Files
Raw log files are only useful if you can parse them. This section provides production-grade patterns for extracting structured data from both Apache and Nginx logs.
Regex Patterns
The Combined Log Format regex works for both Apache and Nginx since they use the same output format:
```
# Combined Log Format regex (PCRE)
^(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) (?P<protocol>\S+)" (?P<status>\d{3}) (?P<bytes>\S+) "(?P<referer>[^"]*)" "(?P<useragent>[^"]*)"

# Common Log Format regex
^(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) (?P<protocol>\S+)" (?P<status>\d{3}) (?P<bytes>\S+)

# Handle malformed requests (missing method/path/protocol)
^(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) "(?P<referer>[^"]*)" "(?P<useragent>[^"]*)"
```
AWK One-Liners for Quick Analysis
```bash
# Top 20 IP addresses
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -20

# Top 20 requested URLs
awk '{print $7}' access.log | sort | uniq -c | sort -rn | head -20

# Status code distribution
awk '{print $9}' access.log | sort | uniq -c | sort -rn

# Requests per hour
awk '{print substr($4,2,14)}' access.log | sort | uniq -c

# Total bandwidth in MB
awk '{sum+=$10} END {printf "%.2f MB\n", sum/1024/1024}' access.log

# Average response size per status code
awk '{status[$9]++; bytes[$9]+=$10} END {for (s in status) printf "%s: %d requests, avg %.0f bytes\n", s, status[s], bytes[s]/status[s]}' access.log

# Find all Googlebot requests
awk -F'"' '$6 ~ /Googlebot/ {print $2}' access.log | sort | uniq -c | sort -rn | head -20

# Requests per minute (for spike detection)
awk '{print substr($4,2,17)}' access.log | sort | uniq -c | sort -rn | head -20

# 5xx errors with full details
awk '$9 ~ /^5/ {print $0}' access.log | tail -50

# Slow requests (if request time is the last field, in microseconds)
awk '{if ($NF > 1000000) print $0}' access.log | head -20
```
Python Log Parser
```python
#!/usr/bin/env python3
"""
Production-grade log parser for Apache/Nginx Combined Log Format.
Handles malformed lines, compressed files, and streaming input.
"""
import re
import gzip
import sys
from collections import Counter

# Compiled regex for performance
COMBINED_RE = re.compile(
    r'^(?P<ip>\S+) \S+ (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>\S+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" '
    r'"(?P<useragent>[^"]*)"'
)

# Fallback for malformed request lines
FALLBACK_RE = re.compile(
    r'^(?P<ip>\S+) \S+ (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
)

def parse_line(line):
    """Parse a single log line, returning a dict or None."""
    match = COMBINED_RE.match(line)
    if match:
        d = match.groupdict()
        d['bytes'] = 0 if d['bytes'] == '-' else int(d['bytes'])
        d['status'] = int(d['status'])
        return d
    match = FALLBACK_RE.match(line)
    if match:
        d = match.groupdict()
        d['bytes'] = 0 if d['bytes'] == '-' else int(d['bytes'])
        d['status'] = int(d['status'])
        d['method'] = d['path'] = d['protocol'] = None
        d['referer'] = d['useragent'] = None
        return d
    return None

def open_log(filepath):
    """Open plain or gzipped log files."""
    if filepath.endswith('.gz'):
        return gzip.open(filepath, 'rt', encoding='utf-8', errors='replace')
    return open(filepath, 'r', encoding='utf-8', errors='replace')

def analyze_log(filepath):
    """Analyze a log file and print summary statistics."""
    stats = {
        'total': 0,
        'parsed': 0,
        'failed': 0,
        'status_codes': Counter(),
        'top_ips': Counter(),
        'top_paths': Counter(),
        'top_agents': Counter(),
        'bytes_total': 0,
    }
    with open_log(filepath) as f:
        for line in f:
            stats['total'] += 1
            entry = parse_line(line.strip())
            if entry is None:
                stats['failed'] += 1
                continue
            stats['parsed'] += 1
            stats['status_codes'][entry['status']] += 1
            stats['top_ips'][entry['ip']] += 1
            stats['bytes_total'] += entry['bytes']
            if entry.get('path'):
                stats['top_paths'][entry['path']] += 1
            if entry.get('useragent'):
                stats['top_agents'][entry['useragent']] += 1

    # Print report
    print(f"\n{'='*60}")
    print(f"Log Analysis: {filepath}")
    print(f"{'='*60}")
    print(f"Total lines: {stats['total']:,}")
    print(f"Parsed:      {stats['parsed']:,}")
    print(f"Failed:      {stats['failed']:,}")
    print(f"Total bytes: {stats['bytes_total']/1024/1024:.2f} MB")
    print("\nStatus codes:")
    for code, count in stats['status_codes'].most_common():
        print(f"  {code}: {count:,}")
    print("\nTop 10 IPs:")
    for ip, count in stats['top_ips'].most_common(10):
        print(f"  {ip}: {count:,}")
    print("\nTop 10 paths:")
    for path, count in stats['top_paths'].most_common(10):
        print(f"  {path}: {count:,}")

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print(f"Usage: {sys.argv[0]} <logfile> [logfile2 ...]")
        sys.exit(1)
    for filepath in sys.argv[1:]:
        analyze_log(filepath)
```
Parsing JSON Logs
If you have already configured JSON output, parsing becomes trivial:
```bash
# jq - extract all 5xx errors
jq -r 'select(.status >= 500) | "\(.timestamp) \(.status) \(.request_uri) \(.upstream_response_time)"' access.json.log

# jq - top IPs by request count
jq -r '.remote_addr' access.json.log | sort | uniq -c | sort -rn | head -20

# jq - average response time per endpoint
jq -r '"\(.request_uri) \(.request_time)"' access.json.log | \
  awk '{sum[$1]+=$2; count[$1]++} END {for (u in sum) printf "%s %.3f (%d reqs)\n", u, sum[u]/count[u], count[u]}' | \
  sort -k2 -rn | head -20

# Python one-liner for JSON logs
python3 -c "
import json, sys
from collections import Counter
c = Counter()
for line in sys.stdin:
    try:
        e = json.loads(line)
        c[e['status']] += 1
    except ValueError:
        pass
for k, v in c.most_common():
    print(f'{k}: {v}')
" < access.json.log
```
Pro Tip: JSON logs eliminate the need for complex regex parsing entirely. If your analysis pipeline supports it, the small disk space overhead (roughly 70% larger than Combined format) is well worth the parsing simplicity and reliability. LogBeast natively supports both Combined and JSON log formats with automatic format detection.
9. Which Format to Choose
The right format depends on your infrastructure, team, and use case. Use this decision matrix to guide your choice:
Decision Matrix
| Criteria | CLF | Combined | Custom Extended | JSON |
|---|---|---|---|---|
| Disk usage | Lowest | Medium | Medium-High | Highest |
| Parse complexity | Simple regex | Moderate regex | Complex regex | Trivial (native JSON) |
| Tool compatibility | Universal | Universal | Custom parsers needed | Modern tools only |
| Bot/SEO analysis | No (no User-Agent) | Yes | Yes | Yes |
| Performance debugging | No (no timing) | No (no timing) | Yes | Yes |
| ELK/Splunk/Datadog | Supported | Supported | Grok patterns needed | Native ingest |
| Schema evolution | Rigid | Rigid | Version carefully | Add fields freely |
| Human readability | Good | Good | Moderate | Verbose but clear |
| Malformed line risk | Low | Medium (UA strings) | Medium | Low (with escape=json) |
Recommendations by Use Case
Small sites (< 100K requests/day): Use Combined format. It provides the best balance of information and simplicity. Every tool supports it natively, and disk usage is negligible at this scale.
Medium sites (100K - 10M requests/day): Use Custom Extended format with request timing. Performance data becomes critical at this scale. Consider running a Combined log alongside for compatibility.
Large sites (> 10M requests/day): Use JSON format piped directly to your observability platform. The parsing efficiency gain at this volume justifies the disk overhead. Use log sampling if storage is a constraint.
Microservices / Kubernetes: Use JSON format exclusively. Container-based environments use stdout/stderr for logging, and JSON integrates natively with Fluentd, Fluent Bit, and other log collectors in the ecosystem.
SEO-focused analysis: Use Combined or Custom Extended format with the User-Agent field. Bot detection and crawl analysis require the User-Agent string at minimum. Add request timing to track how fast pages are served to Googlebot.
Hybrid Approach: Many production deployments write two log files simultaneously -- a Combined format log for backward compatibility and ad-hoc analysis, plus a JSON format log for pipeline ingestion. Both Apache and Nginx support multiple CustomLog/access_log directives pointing to different files with different formats.
```apache
# Apache - Dual logging
CustomLog /var/log/apache2/access.log combined
CustomLog /var/log/apache2/access.json.log json
```

```nginx
# Nginx - Dual logging
access_log /var/log/nginx/access.log combined;
access_log /var/log/nginx/access.json.log json_log;
```
10. Conclusion
Apache and Nginx share a common heritage in the NCSA Common Log Format, but their configuration syntax and available variables differ significantly. The key takeaways from this guide:
- CLF and Combined formats remain the universal standard. Start here unless you have a specific reason not to.
- Apache uses `%` directives while Nginx uses `$` variables. The mapping between them is well-defined but not always one-to-one.
- JSON logging eliminates parsing ambiguity at the cost of disk space. Use `escape=json` in Nginx to avoid broken output.
- Request timing (`%D` in Apache, `$request_time` in Nginx) is the single most valuable custom field you can add.
- Log rotation is non-negotiable in production. Nginx has a slight advantage with its zero-downtime `USR1` signal.
- Dual logging (Combined + JSON) gives you the best of both worlds at the cost of double disk usage.
Whatever format you choose, the most important step is to actually analyze your logs regularly. The best log format in the world is useless if no one reads the data.
Next Step: Ready to analyze your Apache and Nginx logs without writing parsers? LogBeast automatically detects CLF, Combined, Custom, and JSON log formats from both servers. Import your logs and get instant insights into traffic patterns, bot behavior, and performance metrics.