📑 Table of Contents
- Understanding the Crawl-to-Index Pipeline
- Common Indexation Problems and Their Log Signatures
- Crawled but Not Indexed: What Logs Reveal
- Orphan Pages: Finding Pages Crawlers Can't Reach
- Crawl Frequency and Indexation Correlation
- Duplicate Content Issues Visible in Crawl Patterns
- JavaScript Rendering and Indexation Challenges
- XML Sitemap Effectiveness Analysis Through Logs
- Internal Linking Impact on Crawl Discovery
- Using LogBeast to Build an Indexation Monitoring Dashboard
Understanding the Crawl-to-Index Pipeline
Before a page appears in search results, it must pass through a multi-stage pipeline: discovery, crawl, render, evaluate, and index. Most SEO teams focus on the end result -- whether a page appears in Google's index -- without understanding where in the pipeline things break down. Server logs give you direct visibility into the first two stages and indirect evidence about the rest.
The pipeline works like this:
- Discovery: Googlebot finds a URL through sitemaps, internal links, external backlinks, or previously known URLs
- Crawl: Googlebot sends an HTTP request to fetch the page. Your server logs record this request with the IP, User-Agent, status code, and response size
- Render: For JavaScript-heavy pages, Google's Web Rendering Service (WRS) executes JavaScript and fetches additional resources. These render requests also appear in your logs
- Evaluate: Google assesses the page's content quality, uniqueness, canonical signals, and whether it merits indexation
- Index: If the page passes evaluation, it is added to Google's index and becomes eligible for ranking
🔑 Key Insight: Google Search Console tells you the outcome -- indexed or not -- but your server logs tell you the process. A page that is "Discovered - currently not indexed" in GSC might never have been crawled at all, or it might have been crawled and returned a 500 error. Only logs reveal which scenario applies.
The critical distinction is between pages that Google cannot crawl (server errors, blocked by robots.txt, unreachable via links) and pages that Google chooses not to index (thin content, duplicate content, low quality signals). Server logs help you identify the first category definitively and provide strong clues about the second.
To start analyzing crawl behavior, you need to isolate Googlebot requests from your access logs:
# Extract recent Googlebot requests (add rotated log files to cover your full analysis window)
grep "Googlebot" /var/log/nginx/access.log | \
awk '{print $4, $7, $9}' | head -50
# Count unique URLs crawled by Googlebot
grep "Googlebot" /var/log/nginx/access.log | \
awk '{print $7}' | sort -u | wc -l
# Compare total site URLs vs. crawled URLs
echo "Total URLs in sitemap:"
grep -c "<loc>" /var/www/html/sitemap.xml
echo "Unique URLs crawled by Googlebot:"
grep "Googlebot" /var/log/nginx/access.log | awk '{print $7}' | sort -u | wc -l
Common Indexation Problems and Their Log Signatures
Every indexation problem leaves a distinct fingerprint in your server logs. Learning to recognize these patterns lets you diagnose issues in minutes instead of after weeks of guesswork.
| Problem | Log Signature | GSC Status | Fix Priority |
|---|---|---|---|
| Server errors during crawl | Googlebot requests returning 500/502/503 | Server error (5xx) | 🔴 Critical |
| Soft 404 pages | 200 status but tiny response size (<1KB) | Soft 404 | 🟠 High |
| Redirect loops | Multiple 301/302 chains for same URL in sequence | Redirect error | 🟠 High |
| Blocked by robots.txt | No Googlebot requests for entire URL paths | Blocked by robots.txt | 🟠 High |
| Orphan pages | URL in sitemap but zero Googlebot hits | Discovered - not indexed | 🟡 Medium |
| Crawl budget waste | Googlebot repeatedly hitting parameter URLs, faceted nav | Crawled - not indexed | 🟡 Medium |
| Slow response times | Googlebot requests with high TTFB (>2s) | Crawled - not indexed | 🟡 Medium |
| Duplicate content | Googlebot crawling multiple URLs with identical response sizes | Duplicate without canonical | 🟡 Medium |
Detecting Server Errors During Crawl
Server errors are the most critical indexation blocker because they prevent Google from seeing your content at all. A page that consistently returns 5xx to Googlebot will eventually be dropped from the index entirely.
# Find all 5xx responses to Googlebot
grep "Googlebot" /var/log/nginx/access.log | awk '$9 ~ /^5/ {print $4, $7, $9}' | \
sort | head -30
# Count 5xx errors by URL pattern
grep "Googlebot" /var/log/nginx/access.log | awk '$9 ~ /^5/ {print $7}' | \
sed 's/\?.*//; s/[0-9]\+/ID/g' | sort | uniq -c | sort -rn | head -20
# Calculate Googlebot error rate over time
grep "Googlebot" /var/log/nginx/access.log | \
awk '{day=substr($4,2,11); total[day]++; if($9 ~ /^5/) errors[day]++}
END {for(d in total) printf "%s: %d/%d (%.1f%% errors)\n", d, errors[d]+0, total[d], (errors[d]+0)/total[d]*100}' | sort
⚠️ Warning: Intermittent 5xx errors are worse than consistent ones for indexation. If a page returns 200 sometimes and 503 other times, Google may crawl it less frequently and eventually deindex it, even though the page "works most of the time." Monitor error rates, not just error counts.
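One way to surface intermittent failures is to flag URLs that have served Googlebot both success and 5xx responses within the same log window. Below is a minimal Python sketch; it assumes combined log format (path in field 7, status in field 9), and the sample log lines, IPs, and paths are fabricated for illustration:

```python
"""Flag URLs that fail for Googlebot only some of the time."""
from collections import defaultdict

def find_flaky_urls(log_lines):
    """Return URLs that served Googlebot both 2xx and 5xx responses."""
    classes_seen = defaultdict(set)
    for line in log_lines:
        if 'Googlebot' not in line:
            continue
        parts = line.split()
        if len(parts) < 9:
            continue
        path, status = parts[6], parts[8]
        classes_seen[path.split('?')[0]].add(status[0])  # status class: '2', '5', ...
    return sorted(u for u, c in classes_seen.items() if '2' in c and '5' in c)

# Fabricated combined-log-format lines for illustration:
sample = [
    '66.249.66.1 - - [10/Jan/2024:03:10:11 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [10/Jan/2024:09:42:55 +0000] "GET /pricing HTTP/1.1" 503 312 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [10/Jan/2024:04:00:02 +0000] "GET /about HTTP/1.1" 200 4096 "-" "Googlebot/2.1"',
]
print(find_flaky_urls(sample))  # ['/pricing']
```

Any URL this reports deserves the same urgency as a page that is fully down: from Googlebot's perspective it is unreliable.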
Identifying Soft 404 Pages
Soft 404s are pages that return a 200 status code but contain no meaningful content. Google detects these algorithmically, but you can spot them first by checking response sizes:
# Find Googlebot 200 responses with suspiciously small body sizes
grep "Googlebot" /var/log/nginx/access.log | \
awk '$9 == 200 && $10 < 1024 {print $10, $7}' | sort -n | head -30
# Compare average response size for known-good vs. suspect pages
grep "Googlebot" /var/log/nginx/access.log | \
awk '$9 == 200 {sum+=$10; count++} END {print "Avg response:", sum/count, "bytes"}'
Crawled but Not Indexed: What Logs Reveal
"Crawled - currently not indexed" is one of the most frustrating statuses in Google Search Console. It means Google fetched your page, looked at the content, and decided it was not worth adding to the index. While the final decision happens on Google's side, your server logs can reveal patterns that explain why.
Crawl Depth as a Quality Signal
Pages buried deep in your site's architecture receive fewer crawler visits, which correlates strongly with lower indexation rates. Log analysis can quantify this:
# Count Googlebot visits per URL depth level
grep "Googlebot" /var/log/nginx/access.log | \
awk '{print $7}' | \
awk -F'/' '{print NF-1, $0}' | \
awk '{depth[$1]++} END {for(d in depth) printf "Depth %d: %d crawls\n", d, depth[d]}' | sort -t' ' -k2 -n
# Find pages crawled only once in 90 days (low priority to Google)
grep "Googlebot" /var/log/nginx/access.log | \
awk '{print $7}' | sort | uniq -c | awk '$1 == 1 {print $2}' | head -30
Response Time Correlation
Google has confirmed that server response speed affects crawl rate. Pages that take too long to respond get crawled less frequently, reducing their indexation chances:
# If using nginx with $request_time logging:
# Identify slow pages that Googlebot visits
grep "Googlebot" /var/log/nginx/access.log | \
awk '$NF > 2.0 {print $NF, $7}' | sort -rn | head -20
# Average response time for Googlebot vs. all traffic
echo "Googlebot avg response time:"
grep "Googlebot" /var/log/nginx/access.log | \
awk '{sum+=$NF; n++} END {printf "%.3fs\n", sum/n}'
echo "All traffic avg response time:"
awk '{sum+=$NF; n++} END {printf "%.3fs\n", sum/n}' /var/log/nginx/access.log
🔑 Key Insight: Pages with response times consistently above 2 seconds are significantly less likely to be indexed. If your logs show Googlebot encountering slow responses, fix the performance issue before investigating other indexation factors. Speed is a prerequisite, not just a ranking signal.
Content Size Patterns
Log response sizes can reveal thin content patterns. Pages with very small response bodies are more likely to be classified as "crawled but not indexed":
# Compare response sizes of indexed vs. non-indexed pages
# Export non-indexed URLs from GSC, then cross-reference with log sizes
grep "Googlebot" /var/log/nginx/access.log | \
awk '$9 == 200 {print $10, $7}' | sort -t' ' -k2 > /tmp/crawl_sizes.txt
# Check if non-indexed pages have smaller responses
while read -r url; do
size=$(awk -v u="$url" '$2 == u {s=$1} END {print s+0}' /tmp/crawl_sizes.txt)
echo "$size $url"
done < gsc_not_indexed_urls.txt | sort -rn
Orphan Pages: Finding Pages Crawlers Can't Reach
Orphan pages are URLs that exist on your site but have no internal links pointing to them. Search engine crawlers discover pages primarily by following links, so orphan pages often go undiscovered and unindexed even if they appear in your XML sitemap.
Server logs are the most reliable way to identify orphan pages because they show you exactly which pages crawlers actually visit versus which pages you think they visit.
Cross-Referencing Sitemaps with Crawl Logs
#!/bin/bash
# find_orphan_pages.sh - Identify sitemap URLs never crawled by Googlebot
# Step 1: Extract all URLs from sitemap
grep -oP '<loc>\K[^<]+' /var/www/html/sitemap.xml | \
sed 's|https://example.com||' | sort -u > /tmp/sitemap_urls.txt
# Step 2: Extract all URLs Googlebot has crawled (as far back as your retained logs go)
grep "Googlebot" /var/log/nginx/access.log* | \
awk '{print $7}' | sed 's/\?.*//' | sort -u > /tmp/crawled_urls.txt
# Step 3: Find URLs in sitemap but never crawled
comm -23 /tmp/sitemap_urls.txt /tmp/crawled_urls.txt > /tmp/orphan_urls.txt
echo "=== ORPHAN PAGE REPORT ==="
echo "Sitemap URLs: $(wc -l < /tmp/sitemap_urls.txt)"
echo "Crawled URLs: $(wc -l < /tmp/crawled_urls.txt)"
echo "Orphan URLs: $(wc -l < /tmp/orphan_urls.txt)"
echo ""
echo "Top orphan URL patterns:"
cat /tmp/orphan_urls.txt | sed 's/[0-9]\+/NUM/g' | sort | uniq -c | sort -rn | head -15
Detecting Link Isolation
Some pages are technically linked but only from deep, rarely-crawled sections of the site. These are "practically orphaned" even if they are not structurally orphaned:
# Find pages crawled fewer than 3 times in 90 days
# These are "practically orphaned" - linked but poorly connected
grep "Googlebot" /var/log/nginx/access.log | \
awk '$9 == 200 {print $7}' | sed 's/\?.*//' | \
sort | uniq -c | sort -n | awk '$1 <= 2 {print $1, $2}' | head -30
💡 Pro Tip: LogBeast automatically cross-references your sitemap URLs with actual crawl data and generates an orphan page report. Combined with CrawlBeast, you can verify your internal link structure and confirm which pages are truly reachable from your homepage.
Fixing Orphan Pages
- Add internal links: Link to orphan pages from topically relevant, frequently crawled pages
- Update your sitemap: Ensure orphan pages are listed and that the sitemap is submitted to GSC
- Create hub pages: Build category or topic pages that link to groups of otherwise isolated content
- Check navigation: Verify that important pages are accessible within 3 clicks from the homepage
- Remove dead weight: If a page has no links and no search value, consider removing it rather than forcing indexation
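Crawl-frequency data can also tell you where those new internal links should live: link each orphan from the most-crawled page in its own section, so crawl equity flows from proven hubs. A hedged sketch of that pairing logic (all URLs and counts below are hypothetical):

```python
"""Pair orphan URLs with heavily-crawled pages in the same site section."""

def suggest_link_sources(orphans, crawl_counts, top_n=1):
    """For each orphan, return the most-crawled URLs sharing its first path segment."""
    def section(url):
        return url.strip('/').split('/')[0]
    suggestions = {}
    for orphan in orphans:
        candidates = [(hits, url) for url, hits in crawl_counts.items()
                      if section(url) == section(orphan) and url != orphan]
        candidates.sort(reverse=True)  # most-crawled first
        suggestions[orphan] = [url for _, url in candidates[:top_n]]
    return suggestions

# Hypothetical inputs: orphans from the comm -23 step, counts from sort | uniq -c
crawl_counts = {'/blog/': 120, '/blog/popular-post': 45, '/products/widget': 30}
orphans = ['/blog/forgotten-post']
print(suggest_link_sources(orphans, crawl_counts))  # {'/blog/forgotten-post': ['/blog/']}
```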
Crawl Frequency and Indexation Correlation
There is a strong correlation between how often Googlebot crawls a page and whether that page stays indexed. Pages that are crawled frequently tend to remain in the index. Pages that are crawled infrequently are at risk of being dropped, especially during core algorithm updates.
Measuring Crawl Frequency per URL
#!/usr/bin/env python3
"""Analyze Googlebot crawl frequency patterns from access logs."""
import re
import sys
from collections import defaultdict
from datetime import datetime
LOG_RE = re.compile(
    r'(\S+) \S+ \S+ \[(.+?)\] "(\S+) (\S+) \S+" (\d+) (\S+)'  # body size may be "-"
)
def analyze_crawl_frequency(log_file):
url_dates = defaultdict(set)
url_counts = defaultdict(int)
with open(log_file) as f:
for line in f:
if 'Googlebot' not in line:
continue
m = LOG_RE.search(line)
if not m:
continue
ip, timestamp, method, path, status, size = m.groups()
# Normalize path (remove query strings)
path = path.split('?')[0]
# Extract date
date_str = timestamp.split(':')[0]
url_dates[path].add(date_str)
url_counts[path] += 1
print(f"{'URL':<60} {'Crawls':>7} {'Days':>5} {'Freq':>8}")
print("-" * 85)
for url, dates in sorted(url_dates.items(), key=lambda x: -len(x[1])):
crawls = url_counts[url]
unique_days = len(dates)
freq = f"{crawls/max(unique_days,1):.1f}/day"
print(f"{url[:60]:<60} {crawls:>7} {unique_days:>5} {freq:>8}")
if __name__ == "__main__":
analyze_crawl_frequency(sys.argv[1])
Crawl Velocity Trends
A declining crawl rate is an early warning sign that Google is losing interest in sections of your site:
# Googlebot crawls per day over the last 30 days
grep "Googlebot" /var/log/nginx/access.log | \
awk '{print substr($4,2,11)}' | sort | uniq -c | \
awk '{printf "%s: %s crawls", $2, $1;
for(i=0;i<$1/100;i++) printf "#"; print ""}'
# Compare crawl velocity for different site sections
# (arrays of arrays require GNU awk 4+, so invoke gawk explicitly)
grep "Googlebot" /var/log/nginx/access.log | \
gawk '{
day=substr($4,2,11)
url=$7
if(url ~ /^\/blog\//) section="blog"
else if(url ~ /^\/product\//) section="products"
else if(url ~ /^\/category\//) section="categories"
else section="other"
data[day][section]++
} END {
for(d in data) {
printf "%s |", d
for(s in data[d]) printf " %s:%d", s, data[d][s]
print ""
}
}' | sort
⚠️ Warning: If you see a sudden drop in crawl frequency for an entire section of your site, investigate immediately. Common causes include accidental robots.txt changes, server errors that spike during peak crawl hours, or a manual action in Google Search Console. Check your recent deployments and robots.txt changes first.
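The per-section counts from the script above can feed a simple drop detector. A sketch, assuming you have already aggregated two weekly totals per section (the section names and numbers here are invented):

```python
"""Flag site sections whose weekly Googlebot crawl volume fell sharply."""

def flag_crawl_drops(weekly_counts, threshold=0.30):
    """weekly_counts maps section -> [previous_week_total, current_week_total]."""
    alerts = []
    for section, (prev, curr) in weekly_counts.items():
        if prev and (prev - curr) / prev >= threshold:
            alerts.append((section, prev, curr))
    return alerts

# Invented weekly totals, e.g. aggregated from the per-day output above
weekly = {'blog': [700, 210], 'products': [400, 380]}
print(flag_crawl_drops(weekly))  # [('blog', 700, 210)]
```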
Duplicate Content Issues Visible in Crawl Patterns
Duplicate content dilutes crawl budget and confuses search engines about which version of a page to index. Server logs reveal duplicate content issues through several telltale patterns that are invisible to standard SEO audits.
Response Size Fingerprinting
When multiple URLs return the exact same response size to Googlebot, they are likely serving duplicate content:
# Group Googlebot 200 responses by exact response size
grep "Googlebot" /var/log/nginx/access.log | \
awk '$9 == 200 {print $10, $7}' | \
sort -n | \
awk '{
if($1 == prev_size && $1 > 500) {
if(!printed_prev) { print prev_size, prev_url; printed_prev=1 }
print $1, $2
} else {
printed_prev=0
}
prev_size=$1; prev_url=$2
}'
# Find URL patterns that produce identical response sizes
grep "Googlebot" /var/log/nginx/access.log | \
awk '$9 == 200 {sizes[$10]++; urls[$10]=urls[$10] " " $7}
END {for(s in sizes) if(sizes[s] > 3) print sizes[s], s, substr(urls[s],1,200)}' | \
sort -rn | head -20
Parameter Duplication
URL parameters are the most common source of duplicate content issues. Googlebot will crawl every parameter variation it discovers:
# Find URLs where Googlebot is crawling parameter variations
grep "Googlebot" /var/log/nginx/access.log | \
awk '$7 ~ /\?/ {split($7,a,"?"); print a[1]}' | \
sort | uniq -c | sort -rn | head -20
# Show all parameter variations Googlebot is crawling for top base URLs
grep "Googlebot" /var/log/nginx/access.log | \
awk '$7 ~ /\?/ {split($7,a,"?"); base=a[1]; params[base]++;
if(!(base in example)) example[base]=$7}
END {for(b in params) if(params[b] > 5) print params[b], b, example[b]}' | \
sort -rn | head -15
🔑 Key Insight: If Googlebot is crawling 50 parameter variations of the same base URL, it is spending 50x the crawl budget on what is essentially one page. Use the rel="canonical" tag to point all variations to the preferred version, and disallow crawl-wasting parameters in robots.txt (Google retired Search Console's URL Parameters tool in 2022).
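To put a number on the waste, measure what share of Googlebot's requests carry a query string. A small Python sketch (the sample paths are illustrative):

```python
"""Estimate the share of Googlebot crawls spent on parameterized URLs."""

def parameter_crawl_share(paths):
    """Fraction of crawled paths that contain a query string."""
    if not paths:
        return 0.0
    return sum(1 for p in paths if '?' in p) / len(paths)

# Illustrative paths, e.g. extracted with awk '{print $7}' from Googlebot lines
paths = ['/shoes', '/shoes?color=red', '/shoes?color=blue&sort=price', '/about']
print(f"{parameter_crawl_share(paths):.0%}")  # 50%
```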
Trailing Slash and Case Variations
# Check if Googlebot is crawling both /page and /page/ versions
# (dedupe first so repeat crawls of one URL don't look like slash variants)
grep "Googlebot" /var/log/nginx/access.log | \
awk '$9 == 200 {print $7}' | sed 's/\?.*//' | sort -u | \
sed 's/\/$//' | sort | uniq -c | \
awk '$1 > 1 {print}' | sort -rn | head -20
# Detect case-sensitivity issues
grep "Googlebot" /var/log/nginx/access.log | \
awk '$9 == 200 {print tolower($7), $7}' | \
sort -u | awk '{
lower=$1; original=$2
if(lower == prev_lower && original != prev_original)
print "DUPLICATE:", prev_original, "vs", original
prev_lower=lower; prev_original=original
}'
JavaScript Rendering and Indexation Challenges
JavaScript-rendered content creates unique indexation challenges because Google uses a two-phase indexing process. First, Googlebot fetches the raw HTML. Then, the Web Rendering Service (WRS) executes JavaScript to see the final rendered content. Both phases leave traces in your server logs.
Identifying WRS Requests in Logs
The Web Rendering Service makes additional requests for JavaScript files, CSS, API endpoints, and other resources needed to render the page. These requests come from Google's IP ranges but often use a Chrome-like User-Agent:
# Find Google's rendering-related resource requests
# WRS fetches JS, CSS, and API calls needed to render pages
grep -E "Googlebot|Google-InspectionTool|APIs-Google" /var/log/nginx/access.log | \
grep -E "\.(js|css|json|woff2)(\?|$| )" | \
awk '{print $7}' | sort | uniq -c | sort -rn | head -20
# Check if critical JS/CSS files are blocked or returning errors
grep -E "Googlebot|APIs-Google" /var/log/nginx/access.log | \
grep -E "\.(js|css)" | \
awk '$9 != 200 {print $9, $7}' | sort | uniq -c | sort -rn
Render Budget Analysis
Google's rendering resources are limited. If your pages require heavy JavaScript rendering, some pages may be crawled but not rendered, leading to indexation of incomplete content:
# Compare pages that Googlebot fetches vs. resources it renders
grep "Googlebot" /var/log/nginx/access.log | \
awk '{
url=$7
if(url ~ /\.(js|css|json|woff|png|jpg|svg)/) type="resource"
else type="page"
counts[type]++
} END {
printf "Pages crawled: %d\n", counts["page"]
printf "Resources fetched: %d\n", counts["resource"]
printf "Resource-to-page ratio: %.1f\n", counts["resource"]/counts["page"]
}'
# A ratio below 5:1 suggests Google may not be fully rendering your pages
# Modern SPAs typically need 15-30+ resources per page render
⚠️ Warning: If your logs show Googlebot fetching HTML pages but not the corresponding JavaScript bundles, your content is being indexed from raw HTML only. For SPAs and heavily JavaScript-dependent sites, this means Google may see an empty page. Implement server-side rendering (SSR) or pre-rendering as a priority fix.
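A quick way to check for this condition is to confirm that every JS bundle your templates load appears somewhere in Googlebot's crawl history. A sketch; the bundle paths below are placeholders you would replace with the scripts your pages actually reference:

```python
"""Check whether Googlebot ever fetched the JS bundles your pages depend on."""

def unfetched_bundles(crawled_paths, required_bundles):
    """Return required render resources absent from the crawl history."""
    crawled = {p.split('?')[0] for p in crawled_paths}  # ignore cache-busting params
    return sorted(b for b in required_bundles if b not in crawled)

# Placeholder paths: substitute the bundles your templates actually load
required = ['/static/app.js', '/static/vendor.js']
crawled = ['/products/widget', '/static/app.js?v=3', '/api/stock']
print(unfetched_bundles(crawled, required))  # ['/static/vendor.js']
```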
API Endpoint Visibility
JavaScript applications that fetch content from API endpoints have an additional failure point. If the API returns errors to Google, the rendered page will be empty:
# Check API endpoint responses to Google
grep -E "Googlebot|APIs-Google" /var/log/nginx/access.log | \
grep "/api/" | \
awk '{print $9, $7}' | sort | uniq -c | sort -rn | head -20
# Find API calls that are blocking rendering (non-200 responses)
grep -E "Googlebot|APIs-Google" /var/log/nginx/access.log | \
grep "/api/" | awk '$9 != 200 {print $9, $7}' | \
sort | uniq -c | sort -rn
XML Sitemap Effectiveness Analysis Through Logs
Your XML sitemap is a direct communication channel with search engines, declaring which URLs you consider important. But submitting a sitemap does not guarantee crawling or indexation. Server logs let you measure exactly how effective your sitemap is at driving crawler behavior.
Sitemap Fetch Monitoring
First, confirm that search engines are actually fetching your sitemap:
# Check when Googlebot last fetched your sitemap
grep "Googlebot" /var/log/nginx/access.log | \
grep -i "sitemap" | \
awk '{print $4, $7, $9}' | tail -10
# Monitor sitemap fetch frequency and status codes
grep -iE "googlebot|bingbot" /var/log/nginx/access.log | \
grep -i "sitemap" | \
awk '{print substr($4,2,11), $7, $9}' | sort
Sitemap URL Coverage Analysis
#!/bin/bash
# sitemap_effectiveness.sh - Measure how well your sitemap drives crawling
SITEMAP="/var/www/html/sitemap.xml"
LOG="/var/log/nginx/access.log"
# Extract sitemap URLs
grep -oP '<loc>\K[^<]+' "$SITEMAP" | \
sed 's|https://example.com||' | sort -u > /tmp/sitemap_urls.txt
SITEMAP_COUNT=$(wc -l < /tmp/sitemap_urls.txt)
# Extract Googlebot-crawled URLs
grep "Googlebot" "$LOG" | awk '$9 == 200 {print $7}' | \
sed 's/\?.*//' | sort -u > /tmp/crawled_urls.txt
CRAWLED_COUNT=$(wc -l < /tmp/crawled_urls.txt)
# URLs in sitemap AND crawled
comm -12 /tmp/sitemap_urls.txt /tmp/crawled_urls.txt > /tmp/sitemap_crawled.txt
BOTH_COUNT=$(wc -l < /tmp/sitemap_crawled.txt)
# URLs crawled but NOT in sitemap (discovered via links)
comm -13 /tmp/sitemap_urls.txt /tmp/crawled_urls.txt > /tmp/crawled_not_in_sitemap.txt
EXTRA_CRAWLED=$(wc -l < /tmp/crawled_not_in_sitemap.txt)
# URLs in sitemap but NOT crawled (orphan or ignored)
comm -23 /tmp/sitemap_urls.txt /tmp/crawled_urls.txt > /tmp/in_sitemap_not_crawled.txt
NOT_CRAWLED=$(wc -l < /tmp/in_sitemap_not_crawled.txt)
echo "=== SITEMAP EFFECTIVENESS REPORT ==="
echo "Sitemap URLs: $SITEMAP_COUNT"
echo "Crawled URLs: $CRAWLED_COUNT"
echo "Sitemap URLs crawled: $BOTH_COUNT ($(( BOTH_COUNT * 100 / SITEMAP_COUNT ))%)"
echo "Sitemap URLs NOT crawled: $NOT_CRAWLED ($(( NOT_CRAWLED * 100 / SITEMAP_COUNT ))%)"
echo "Crawled but not in sitemap: $EXTRA_CRAWLED"
echo ""
echo "Top uncrawled sitemap URL patterns:"
cat /tmp/in_sitemap_not_crawled.txt | \
sed 's/[0-9]\+/NUM/g' | sort | uniq -c | sort -rn | head -10
💡 Pro Tip: A healthy sitemap should have 80%+ of its URLs crawled within 30 days. If your coverage is below 50%, your sitemap may contain too many low-quality URLs, diluting its effectiveness. LogBeast generates sitemap effectiveness reports automatically and tracks coverage trends over time.
Lastmod Accuracy Check
The <lastmod> tag in your sitemap tells crawlers when a page was last updated. If this date is inaccurate, Google may learn to ignore your sitemap's update signals:
# Extract lastmod dates and compare with actual content changes
# Verify that recently-modified sitemap URLs are getting recrawled
# (assumes one tag per line, with <loc> on the line directly above <lastmod>)
grep -B1 "lastmod" /var/www/html/sitemap.xml | \
grep -E "(loc|lastmod)" | paste - - | \
awk '{print $2, $1}' | sort -r | head -20
# Check if Google recrawls pages after lastmod updates
# Compare sitemap lastmod dates with most recent crawl dates
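The comparison described in the comments above can be sketched as follows: parse each URL's <lastmod> date from the sitemap, take its most recent Googlebot hit from the logs, and flag pages modified after their last crawl. The URLs and dates below are invented for illustration:

```python
"""Flag sitemap URLs modified after their most recent Googlebot crawl."""
from datetime import date

def stale_after_update(lastmod, last_crawl):
    """Return URLs whose <lastmod> is newer than their last recorded crawl."""
    return sorted(u for u, mod in lastmod.items()
                  if u not in last_crawl or last_crawl[u] < mod)

# Invented dates: lastmod parsed from the sitemap, last_crawl from access logs
lastmod = {'/guide': date(2024, 1, 10), '/pricing': date(2023, 12, 1)}
last_crawl = {'/guide': date(2024, 1, 3), '/pricing': date(2024, 1, 5)}
print(stale_after_update(lastmod, last_crawl))  # ['/guide']
```

If flagged URLs stay uncrawled for weeks after an update, Google may have learned to distrust your lastmod signals.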
Internal Linking Impact on Crawl Discovery
Internal links are the primary mechanism by which search engines discover and prioritize pages on your site. Your server logs reveal exactly how internal linking structure influences crawl behavior, giving you data to optimize link placement for maximum indexation impact.
Crawl Depth Analysis
Pages closer to the homepage (fewer clicks away) receive more crawler attention. Log analysis quantifies this relationship directly:
# Measure crawl frequency by URL depth
grep "Googlebot" /var/log/nginx/access.log | \
awk '$9 == 200 {
url=$7
gsub(/[^\/]/, "", url)
depth=length(url)
counts[depth]++
if(depth > maxd) maxd=depth
total++
} END {
for(d=1; d<=maxd; d++)
if(counts[d]) printf "Depth %d: %6d crawls (%5.1f%%)\n", d, counts[d], counts[d]/total*100
}'
# Find high-value pages buried too deep
grep "Googlebot" /var/log/nginx/access.log | \
awk '$9 == 200 {
path=$7; d=path; gsub(/[^\/]/,"",d); depth=length(d)
if(depth > 4) deep[path]++
} END {for(u in deep) if(deep[u] < 3) print deep[u], u}' | sort -n | head -20
Hub Page Identification
Pages that Googlebot visits most frequently are acting as "hubs" in your link architecture. These are the pages whose internal links carry the most crawl influence:
# Identify your most-crawled pages (these are hub pages)
grep "Googlebot" /var/log/nginx/access.log | \
awk '$9 == 200 {print $7}' | sed 's/\?.*//' | \
sort | uniq -c | sort -rn | head -25
# Check if hub pages link to your most important content
# Compare most-crawled pages with pages you WANT indexed
# High crawl count = hub page = add links to underperforming content here
🔑 Key Insight: If your most-crawled pages are utility pages (login, search, cart) rather than content pages (articles, products, categories), your internal linking is wasting crawl budget. Use log data to identify your true hub pages and add internal links from them to your priority indexation targets.
Referrer Path Analysis
While Googlebot does not send traditional referrer headers, you can infer crawl paths by analyzing the sequence of requests from the same Googlebot IP within short time windows:
#!/usr/bin/env python3
"""Infer Googlebot crawl paths from request sequences."""
import re
import sys
from collections import defaultdict
from datetime import datetime
LOG_RE = re.compile(
r'(\S+) \S+ \S+ \[(.+?)\] "GET (\S+) HTTP'
)
def parse_timestamp(ts):
return datetime.strptime(ts.split()[0], '%d/%b/%Y:%H:%M:%S')
def analyze_crawl_paths(log_file):
ip_sessions = defaultdict(list)
with open(log_file) as f:
for line in f:
if 'Googlebot' not in line:
continue
m = LOG_RE.search(line)
if not m:
continue
ip, timestamp, path = m.groups()
ts = parse_timestamp(timestamp)
ip_sessions[ip].append((ts, path))
# Analyze crawl sequences
transitions = defaultdict(int)
for ip, requests in ip_sessions.items():
requests.sort()
for i in range(1, len(requests)):
prev_ts, prev_path = requests[i-1]
curr_ts, curr_path = requests[i]
# Only count transitions within 30 seconds
if (curr_ts - prev_ts).total_seconds() <= 30:
transitions[(prev_path, curr_path)] += 1
print("=== MOST COMMON CRAWL TRANSITIONS ===\n")
for (src, dst), count in sorted(transitions.items(), key=lambda x: -x[1])[:30]:
print(f" {src[:40]:<40} -> {dst[:40]:<40} ({count}x)")
if __name__ == "__main__":
analyze_crawl_paths(sys.argv[1])
Using LogBeast to Build an Indexation Monitoring Dashboard
Manual log analysis with grep and awk is powerful for investigation, but it does not scale for ongoing monitoring. To keep indexation problems from recurring, you need a continuous monitoring dashboard that alerts you to issues before they impact your search visibility.
LogBeast is a desktop log analysis tool that automates all the analyses covered in this guide and presents them in an actionable dashboard format.
Key Dashboard Metrics
An effective indexation monitoring dashboard should track these metrics daily:
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Total Googlebot crawls/day | Overall crawl budget allocation | Drop >30% vs. 7-day average |
| Googlebot error rate | Server health from crawler perspective | Any 5xx rate above 2% |
| Unique URLs crawled | Breadth of crawl coverage | Drop >20% vs. prior week |
| Sitemap coverage % | Sitemap effectiveness | Below 70% coverage |
| Orphan page count | Internal linking gaps | Increase >10% month-over-month |
| Avg response time to Googlebot | Crawl speed and server performance | Above 1.5 seconds |
| Parameter URL crawl % | Crawl budget waste on duplicates | Above 20% of total crawls |
| JS resource error rate | Rendering pipeline health | Any critical JS file returning non-200 |
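The thresholds above translate directly into a small alert check you can run after each daily aggregation, whether or not you use a dedicated tool. A sketch covering four of the metrics (the field names and sample values are made up):

```python
"""Evaluate daily crawl metrics against the alert thresholds in the table above."""

def check_thresholds(m):
    """Return human-readable alerts for breached thresholds."""
    alerts = []
    if m['crawls_today'] < 0.7 * m['crawls_7day_avg']:
        alerts.append('crawl volume down >30% vs 7-day average')
    if m['error_rate_5xx'] > 0.02:
        alerts.append('Googlebot 5xx rate above 2%')
    if m['sitemap_coverage'] < 0.70:
        alerts.append('sitemap coverage below 70%')
    if m['avg_response_s'] > 1.5:
        alerts.append('avg response time to Googlebot above 1.5s')
    return alerts

# Made-up daily aggregates
metrics = {'crawls_today': 900, 'crawls_7day_avg': 1500,
           'error_rate_5xx': 0.004, 'sitemap_coverage': 0.82,
           'avg_response_s': 0.6}
print(check_thresholds(metrics))  # ['crawl volume down >30% vs 7-day average']
```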
Setting Up LogBeast for Indexation Analysis
LogBeast parses standard access log formats (Apache, Nginx, IIS) and automatically segments bot traffic from human traffic. To configure it for indexation monitoring:
- Import your logs: Point LogBeast at your access log files or configure log streaming for real-time analysis
- Import your sitemap: Upload your XML sitemap so LogBeast can calculate coverage metrics and identify orphan pages
- Set baseline metrics: Let LogBeast analyze 30 days of historical data to establish normal crawl patterns
- Configure alerts: Set thresholds for the key metrics listed above to get notified when indexation issues emerge
- Schedule reports: Generate weekly indexation health reports to track trends and measure the impact of your fixes
Combining LogBeast with CrawlBeast
For a complete indexation monitoring workflow, pair LogBeast (passive log analysis) with CrawlBeast (active crawling):
- LogBeast identifies the problems: Orphan pages, 5xx errors to Googlebot, low crawl frequency pages, duplicate content patterns
- CrawlBeast validates the fixes: After you add internal links, fix server errors, or implement canonical tags, CrawlBeast re-crawls your site to confirm the changes are working
- LogBeast monitors the results: After fixes deploy, track whether Googlebot crawl patterns change -- increased frequency to previously orphaned pages, elimination of error patterns, improved response times
💡 Pro Tip: The most effective SEO teams run LogBeast analysis weekly and CrawlBeast audits monthly. This cadence catches indexation issues within days of their introduction while keeping your crawl infrastructure data fresh.
Building Custom Reports
LogBeast supports exporting analysis data for custom reporting. Here is an example workflow for building an indexation trends report:
# Export LogBeast crawl data for custom analysis
# 1. Export Googlebot crawl summary as CSV
logbeast export --bot googlebot --format csv --output crawl_summary.csv
# 2. Generate week-over-week comparison
logbeast compare --period weekly --metric crawl_count,error_rate,unique_urls
# 3. Cross-reference with GSC data
# Download GSC Coverage report and merge with LogBeast output
logbeast merge --gsc-report coverage_report.csv --output full_indexation_report.csv
Conclusion
Indexation problems are diagnostic problems. The difference between a site with 95% indexation and one with 50% indexation is rarely one catastrophic issue -- it is an accumulation of small barriers that compound over time. Server logs give you the forensic evidence to identify every barrier and prioritize fixes based on actual crawler behavior rather than guesswork.
The key takeaways from this guide:
- Understand the pipeline. Crawling is a prerequisite for indexation, but crawling alone does not guarantee it. Use logs to verify both stages
- Fix server errors first. Any 5xx response to Googlebot is a critical indexation blocker that should be resolved before anything else
- Eliminate orphan pages. Cross-reference your sitemap with crawl logs to find pages that Google never reaches
- Monitor crawl frequency. Declining crawl rates are an early warning of indexation problems and should trigger immediate investigation
- Stop crawl budget waste. Parameter URLs, duplicate content, and faceted navigation consume crawl budget that should go to your important pages
- Verify JavaScript rendering. If Googlebot is not fetching your JS resources, your dynamic content is invisible to search engines
- Measure sitemap effectiveness. A sitemap with under 70% crawl coverage is not doing its job
- Automate monitoring. Use tools like LogBeast to catch indexation problems as they emerge, not weeks later in GSC reports
Start by running the grep and awk commands in this guide against your server logs today. You will almost certainly discover indexation barriers you did not know existed. Then build a systematic monitoring process -- with LogBeast or your own scripts -- to prevent those barriers from returning.
🎯 Next Steps: Read our guide on crawl budget optimization for deeper strategies on maximizing how search engines spend their time on your site, and check out the complete server logs guide for a primer on log formats and parsing techniques.