A production-grade deep dive into Linux logging for DevOps Engineers, SREs, and Platform Engineers working in Kubernetes, cloud, and regulated environments.
1. Introduction: Why Logging Is Non-Negotiable
In any production system, three pillars define observability: metrics, traces, and logs. Metrics tell you that something is wrong. Traces tell you where in the call chain it went wrong. Logs tell you why.
Logging is the oldest, most universal form of system telemetry, and yet it remains one of the most misunderstood. Engineers over-log, under-log, log the wrong things, fail to centralize, or ignore retention entirely — until a 3 AM incident forces a painful reckoning with a full /var/log or with log data that simply doesn't exist for the timeframe in question.
In modern distributed systems — Kubernetes clusters, microservices, serverless functions — logging has become substantially more complex. A single user request may touch dozens of pods across multiple nodes and namespaces. Without structured, correlated, centralized logging, reconstructing the sequence of events is nearly impossible.
In FinTech and regulated environments, logging isn’t just an engineering convenience — it’s a compliance requirement. PCI-DSS, PSD2, GDPR, and CESOP all mandate audit trails, access logging, and data retention policies. The consequences of missing logs during a regulatory audit are severe.
This guide covers the full logging stack: from kernel ring buffer and systemd-journald at the bottom, through rsyslog and syslog-ng in the middle, to Fluent Bit, Loki, and Grafana at the top. Every section is oriented toward production use.
2. Linux Logging Architecture Overview
The Logging Hierarchy
Linux logging is a layered system. Understanding each layer prevents misdiagnosis during incidents.
┌─────────────────────────────────────────┐
│ Central Logging Platform │
│ (Loki / Elasticsearch / CloudWatch) │
└──────────────────┬──────────────────────┘
│
┌──────────────────▼──────────────────────┐
│ Log Shipper / Aggregator │
│ (Fluent Bit / Fluentd / Logstash) │
└──────────────────┬──────────────────────┘
│
┌──────────────────▼──────────────────────┐
│ Syslog Daemon Layer │
│ (rsyslog / syslog-ng) │
└────────┬─────────────────────┬──────────┘
│ │
┌────────▼──────┐ ┌─────────▼──────────┐
│ systemd- │ │ /var/log files │
│ journald │ │ (flat files) │
└────────┬──────┘ └────────────────────┘
│
┌────────▼──────────────────────────────────┐
│ Applications / Services / Kernel │
│ (stdout/stderr, syslog(), /dev/kmsg) │
└────────────────────────────────────────────┘

Kernel Logging
The kernel writes messages to a circular ring buffer accessible via /dev/kmsg. The dmesg command reads this buffer. Messages include hardware detection, driver initialization, OOM killer events, and filesystem errors. On systems with systemd, journald captures kernel messages and makes them available via journalctl -k.
dmesg -T | grep -i error
dmesg -T | grep -i "oom"
journalctl -k --since "1 hour ago"

User-Space Logging
Applications log through several mechanisms:
- syslog() — the classic POSIX C-library logging interface; routes through journald's /dev/log socket on modern systems
- stdout/stderr — captured by the service manager (systemd or container runtime)
- Direct file writes — applications writing to /var/log/app/ directly
- Structured logging libraries — writing JSON to stdout (common in Go, Python, Java microservices)
The syslog Protocol
The syslog protocol (standardized in RFC 5424, superseding the legacy RFC 3164 format) defines a facility (who is logging: kern, auth, daemon, user, etc.) and a severity (emergency, alert, critical, error, warning, notice, info, debug). Together they form the message's priority, which is used to route messages to different destinations.
Severity   Level   Meaning
emerg      0       System is unusable
alert      1       Action must be taken immediately
crit       2       Critical conditions
err        3       Error conditions
warning    4       Warning conditions
notice     5       Normal but significant
info       6       Informational
debug      7       Debug-level messages
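Facility and severity combine arithmetically: the PRI value transmitted at the start of every syslog message is facility × 8 + severity. A minimal sketch:

```shell
# PRI = facility * 8 + severity (RFC 5424).
# Example: facility daemon (3), severity err (3).
facility=3
severity=3
pri=$((facility * 8 + severity))
echo "<$pri>"   # the PRI field as it appears on the wire: <27>
```

This is why a filter like `daemon.err` in rsyslog can be resolved to a single numeric comparison.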
3. systemd-journald Deep Dive
How journald Works
systemd-journald is the primary log collection daemon on modern Linux systems (RHEL 7+, Ubuntu 16.04+, Debian 8+). It replaces the traditional syslog daemon as the first recipient of log messages.
journald collects from:
- /dev/kmsg (kernel messages)
- /dev/log (syslog socket, legacy)
- stdout/stderr of systemd-managed services
- The native journal protocol via /run/systemd/journal/
- The audit subsystem
Binary Journal Format
Unlike traditional syslog’s plaintext files, journald stores logs in a structured binary format at:
- Volatile (runtime): /run/log/journal/ — lost on reboot
- Persistent: /var/log/journal/ — survives reboots
Each journal file supports Forward Secure Sealing to detect tampering (generate keys with journalctl --setup-keys, set Seal=yes, check with journalctl --verify), and supports indexed querying by unit, PID, timestamp, and message ID.
To enable persistent storage, create the directory:
mkdir -p /var/log/journal
systemd-tmpfiles --create --prefix /var/log/journal
systemctl restart systemd-journald

Essential journalctl Commands
# Follow logs in real time (like tail -f)
journalctl -f

# Show all logs from the last hour
journalctl --since "1 hour ago"

# Show logs for a specific service
journalctl -u nginx.service
journalctl -u nginx.service --since "2024-01-15 10:00:00" --until "2024-01-15 11:00:00"

# Show kernel messages only
journalctl -k

# Show verbose context with explanations for errors
journalctl -xe

# Filter by priority (0=emerg to 7=debug)
journalctl -p err              # errors and above
journalctl -p warning..err

# Filter by PID
journalctl _PID=1234

# Output formats
journalctl -u nginx -o json-pretty
journalctl -u nginx -o short-iso

# Disk usage
journalctl --disk-usage

# Vacuum old logs
journalctl --vacuum-time=7d
journalctl --vacuum-size=500M
journald Configuration
/etc/systemd/journald.conf:
[Journal]
Storage=persistent
Compress=yes
SystemMaxUse=2G
SystemKeepFree=500M
SystemMaxFileSize=100M
MaxRetentionSec=30day
RateLimitInterval=30s
RateLimitBurst=10000
ForwardToSyslog=yes

Key tuning parameters:

- SystemMaxUse — hard cap on journal disk usage
- RateLimitBurst — max messages per RateLimitInterval per service; critical for noisy services
- ForwardToSyslog=yes — bridges journald to rsyslog for forwarding
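Sizing SystemMaxUse is simple arithmetic; a rough sketch, where the daily volume and compression ratio are assumptions you should replace with measurements from journalctl --disk-usage over time:

```shell
# Rough retention estimate for SystemMaxUse=2G (all numbers are assumptions)
raw_mb_per_day=800        # uncompressed journal volume per day
compression_savings=75    # journald compression typically saves 70-80%
stored_mb_per_day=$(( raw_mb_per_day * (100 - compression_savings) / 100 ))
retention_days=$(( 2048 / stored_mb_per_day ))
echo "~${retention_days} days of journal retention in 2G"
```

If the result is shorter than your debugging window (or your compliance window), raise the cap or forward to central storage.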
4. rsyslog Deep Dive
Architecture
rsyslog is a high-performance syslog daemon that handles log routing, filtering, transformation, and forwarding. On most distributions it sits alongside journald, receiving forwarded messages and writing to /var/log/ files or remote destinations.
rsyslog processes messages through a pipeline:
Input → Parser → Rulesets (filters + actions) → Output

Configuration Structure
/etc/rsyslog.conf # Main config
/etc/rsyslog.d/*.conf    # Drop-in configs (loaded alphabetically)

A typical /etc/rsyslog.conf:
# Load modules
module(load="imuxsock") # Local syslog socket
module(load="imjournal" # Read from journald
       StateFile="imjournal.state")

# Global settings
$WorkDirectory /var/spool/rsyslog
$ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat

# Local log files
auth,authpriv.* /var/log/auth.log
*.*;auth,authpriv.none /var/log/syslog
kern.* /var/log/kern.log
*.emerg                          :omusrmsg:*

# Include drop-ins
$IncludeConfig /etc/rsyslog.d/*.conf
Remote Forwarding
# UDP forwarding (fire and forget, lower overhead)
*.* @logserver.example.com:514

# TCP forwarding (reliable, buffered)
*.* @@logserver.example.com:514

# TCP with TLS
*.* action(type="omfwd"
target="logserver.example.com"
port="6514"
protocol="tcp"
StreamDriver="gtls"
StreamDriverMode="1"
StreamDriverAuthMode="x509/name")
UDP vs TCP for log forwarding:
UDP (single @) is low-overhead but loses messages under load or network failures. TCP (double @@) buffers and retransmits, making it appropriate for compliance and audit logging. For high-volume production, add a queue:
*.* action(type="omfwd"
target="logserver.example.com"
port="514"
protocol="tcp"
queue.type="LinkedList"
queue.size="10000"
queue.dequeueBatchSize="100"
queue.filename="fwdRule"
queue.maxDiskSpace="1g"
queue.saveOnShutdown="on"
    action.resumeRetryCount="-1")

Structured Output with Templates
rsyslog can output JSON for downstream consumers:
template(name="JSONFormat" type="string"
    string="{\"time\":\"%timereported:::date-rfc3339%\",\"host\":\"%HOSTNAME%\",\"severity\":\"%syslogseverity-text%\",\"facility\":\"%syslogfacility-text%\",\"program\":\"%programname%\",\"pid\":\"%procid%\",\"message\":\"%msg:::json%\"}\n")

*.* action(type="omfile" file="/var/log/json/all.log" template="JSONFormat")
5. syslog-ng vs rsyslog
syslog-ng takes a different design philosophy: it uses a declarative configuration language based on sources, filters, and destinations rather than rsyslog’s hybrid rule-based approach.
When to choose rsyslog:
- Default on RHEL/CentOS, Ubuntu, Debian — familiar, well-documented
- Complex routing rules with rulesets
- High-performance single-node ingestion
- Integration with imjournal for journald bridging
When to choose syslog-ng:
- Configuration clarity is a priority (readable pipeline syntax)
- Heavy use of parsers (CSV, JSON, Apache, etc.)
- Complex correlation and rewriting
- Premium enterprise features (syslog-ng Store Box, etc.)
A syslog-ng equivalent for remote forwarding:
source s_local {
system();
internal();
};

destination d_remote {
network("logserver.example.com"
port(514)
transport("tcp"));
};

log {
source(s_local);
destination(d_remote);
};
For most Linux infrastructure teams, rsyslog is the practical default unless team familiarity or specific parsing requirements favor syslog-ng.
6. Log Files and Locations
Standard Log Locations
Path                      Contents
/var/log/syslog           General system messages (Debian/Ubuntu)
/var/log/messages         General system messages (RHEL/CentOS)
/var/log/auth.log         Authentication events (Debian/Ubuntu)
/var/log/secure           Authentication events (RHEL/CentOS)
/var/log/kern.log         Kernel messages
/var/log/dmesg            Boot-time kernel messages
/var/log/apt/             Package management (Debian)
/var/log/yum.log          Package management (RHEL)
/var/log/nginx/           Nginx access and error logs
/var/log/apache2/         Apache access and error logs
/var/log/mysql/           MySQL error log
/var/log/postgresql/      PostgreSQL logs
/var/log/audit/audit.log  Linux audit daemon
Application Logging Best Practices
Modern applications should write to stdout/stderr rather than files. This is the Twelve-Factor App standard, and it’s what Kubernetes, Docker, and systemd all expect. The log infrastructure (journald, container runtime) handles capture, rotation, and forwarding.
When direct file logging is unavoidable (legacy apps, databases), ensure:
- Log directory permissions are tight (app user only)
- Logrotate is configured (see Section 7)
- The application handles SIGHUP for log file reopening (or use copytruncate)
7. Log Rotation with logrotate
Without log rotation, /var/log eventually fills the root filesystem. On a busy web server, access logs can grow by gigabytes per day.
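The arithmetic is worth doing explicitly when sizing disks; a sketch with illustrative numbers (request rate and line size are assumptions):

```shell
# Back-of-envelope access-log growth estimate
rps=2000               # requests per second (assumption)
bytes_per_line=250     # average access-log line size (assumption)
bytes_per_day=$(( rps * bytes_per_line * 86400 ))
echo "$(( bytes_per_day / 1024 / 1024 / 1024 )) GiB/day"
```

At these numbers a 100 GiB /var/log partition lasts well under a week without rotation and compression.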
Configuration
The system logrotate job runs daily via cron (/etc/cron.daily/logrotate) or the logrotate.timer systemd timer.
# /etc/logrotate.d/nginx
/var/log/nginx/*.log {
daily
rotate 14
compress
delaycompress
missingok
notifempty
sharedscripts
postrotate
nginx -s reopen
endscript
}

Option reference:
Option                Effect
daily/weekly/monthly  Rotation frequency
rotate N              Keep N rotated files
compress              Gzip old files
delaycompress         Skip compressing the most recent rotated file (allows apps to finish writing)
missingok             Don't error if the log file is missing
notifempty            Skip rotation if the file is empty
copytruncate          Copy then truncate (for apps that don't support SIGHUP)
postrotate            Script to run after rotation (e.g., reload nginx)
dateext               Use date in rotated filename instead of numbers
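The copytruncate mechanism can be sketched in plain shell: copy the live file, then truncate it in place so the writing process keeps its still-valid file descriptor. The known caveat is that lines written between the copy and the truncate are lost:

```shell
# Sketch of what copytruncate does (paths are throwaway examples)
tmp=$(mktemp -d)
printf 'old line 1\nold line 2\n' > "$tmp/app.log"

cp "$tmp/app.log" "$tmp/app.log.1"   # step 1: copy the live file
: > "$tmp/app.log"                   # step 2: truncate in place, same inode

printf 'new line\n' >> "$tmp/app.log"   # writer continues on the same fd
wc -l < "$tmp/app.log.1"                # rotated copy keeps the old lines
```

This race window is why signal-based reopening (postrotate + SIGHUP) is preferred when the application supports it.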
Testing logrotate
# Dry run
logrotate -d /etc/logrotate.d/nginx

# Force rotation now
logrotate -f /etc/logrotate.d/nginx
systemd-journald Rotation
journald has its own built-in rotation controlled by journald.conf. For systemd-only environments without rsyslog, ensure SystemMaxUse is set to prevent unbounded growth.
8. Centralized Logging Architecture
Single-host logging is fragile and unscalable. Centralized logging solves:
- Single pane of glass for multi-host/multi-service debugging
- Correlation across services and nodes
- Long-term retention independent of host lifecycle
- Compliance and audit requirements
Architecture Patterns
Pattern 1: Lightweight (small scale)
Linux hosts → rsyslog TCP → Central syslog server → Files

Pattern 2: Modern observability stack

Linux hosts → Fluent Bit → Loki → Grafana

Pattern 3: Enterprise ELK

Linux hosts → Fluent Bit → Kafka → Logstash → Elasticsearch → Kibana

Pattern 4: High-volume regulated environment
Linux hosts → Fluent Bit (per node) → Kafka (durable buffer) →
Logstash (parse/enrich) → Elasticsearch (hot-warm-cold) → Kibana
↓
Cold storage (S3/GCS)

Kafka as a buffer between shippers and the indexer is critical in regulated environments. If Elasticsearch goes down for maintenance, Kafka holds the messages. Without it, you lose log data during indexer downtime.
Fluent Bit
Fluent Bit is the lightweight log shipper of choice for Kubernetes and resource-constrained environments. Written in C, it consumes ~450KB RAM at idle. It runs as a DaemonSet in Kubernetes or as a systemd service on Linux hosts.
# /etc/fluent-bit/fluent-bit.conf
[SERVICE]
Flush 5
Daemon Off
Log_Level info
    Parsers_File parsers.conf

[INPUT]
Name systemd
Tag host.*
Systemd_Filter _SYSTEMD_UNIT=nginx.service
DB /var/log/flb_systemd.db
    Read_From_Tail On

[FILTER]
Name record_modifier
Match host.*
Record hostname ${HOSTNAME}
    Record environment production

[OUTPUT]
Name loki
Match host.*
Host loki.monitoring.svc.cluster.local
Port 3100
Labels job=fluentbit,host=${HOSTNAME}
line_format json
Logstash
Logstash is the heavy-duty ETL layer: it parses, enriches, and routes logs. Use it when you need complex transformations — grok parsing of unstructured logs, GeoIP enrichment, field normalization, or routing to multiple outputs.
input {
kafka {
bootstrap_servers => "kafka:9092"
topics => ["logs"]
codec => json
}
}

filter {
if [kubernetes][namespace] == "payments" {
mutate {
add_field => { "compliance_scope" => "pci" }
}
}
if [message] =~ /ERROR/ {
mutate { add_tag => ["error"] }
}
}

output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "logs-%{+YYYY.MM.dd}"
}
}
9. Logging in Kubernetes
The Kubernetes Logging Model
Kubernetes has no built-in centralized logging. The official model: containers write to stdout/stderr, the container runtime captures these streams, and you’re responsible for the rest.
On each node, container logs land at:
/var/log/containers/<pod>_<namespace>_<container>-<id>.log

These are symlinks to:
/var/log/pods/<namespace>_<pod>_<uid>/<container>/<n>.log

kubectl Logging Commands
# Current logs
kubectl logs <pod> -n <namespace>

# Follow
kubectl logs -f <pod> -n <namespace>

# Previous container instance (after crash)
kubectl logs --previous <pod> -n <namespace>

# Multi-container pod
kubectl logs <pod> -c <container> -n <namespace>

# All pods matching a label
kubectl logs -l app=nginx -n production --prefix

# With timestamps
kubectl logs <pod> --timestamps=true
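The node-level container log filename encodes pod, namespace, and container name; a sketch splitting it with shell parameter expansion (the filename itself is a made-up example):

```shell
# Format: <pod>_<namespace>_<container>-<id>.log (filename is hypothetical)
f="payment-api-7d9c4_production_app-0123abcdef.log"
base=${f%.log}
pod=${base%%_*}              # up to the first underscore
rest=${base#*_}
namespace=${rest%%_*}        # between the underscores
container=${rest#*_}
container=${container%-*}    # strip the trailing -<id>
echo "$pod / $namespace / $container"
```

This is essentially what log shippers' kubernetes filters do before they enrich records with metadata from the API server.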
Node-Level Log Architecture
Container → container runtime (containerd/CRI-O) → /var/log/pods/...
↓
Fluent Bit DaemonSet
↓
Central log system

journald integration depends on the container runtime. With systemd cgroups, containerd can forward to journald:
journalctl -u containerd CONTAINER_NAME=nginx

Kubernetes Logging Stack: Fluent Bit + Loki
The production-recommended lightweight stack for Kubernetes:
# fluent-bit-daemonset.yaml (simplified)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      containers:
        - name: fluent-bit
          image: cr.fluentbit.io/fluent/fluent-bit:3.0
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: config
              mountPath: /fluent-bit/etc
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: config
          configMap:
            name: fluent-bit-config

Loki is a horizontally scalable log aggregation system designed to work like Prometheus but for logs. It indexes only labels (not full text), making it far cheaper than Elasticsearch for pure log storage.
# Loki labels strategy for Kubernetes
labels:
  namespace: "{{ .kubernetes.namespace }}"
  pod: "{{ .kubernetes.pod_name }}"
  container: "{{ .kubernetes.container_name }}"
  node: "{{ .kubernetes.host }}"

Critical: Keep Loki label cardinality low. Don't use pod IDs or request IDs as labels. Use them as log line content instead. High-cardinality labels destroy Loki performance.
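Why cardinality matters is plain multiplication: each distinct combination of label values is a separate Loki stream, with its own chunks and index entries (the numbers below are illustrative):

```shell
# Stream count = product of distinct values per label (illustrative numbers)
namespaces=10
apps=50
streams=$(( namespaces * apps ))
echo "bounded labels:  $streams streams"                      # manageable

request_ids=1000000   # a request ID used as a label value...
echo "with request_id: $(( streams * request_ids )) streams"  # catastrophic
```

A few hundred streams query quickly; hundreds of millions of single-entry streams make both ingestion and queries collapse.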
Kubernetes Logging Pitfalls
- Log rotation on nodes: kubelet rotates container logs by default (10MB / 5 files). Fluent Bit's DB option tracks position and survives rotation.
- Pod lifecycle: When a pod is deleted, its logs are deleted from the node. Ship logs before pods die.
- CrashLoopBackOff: Use the --previous flag, and ship logs before the process exits to avoid losing crash context.
- Sidecar containers: Some teams inject a sidecar that reads app log files and writes to stdout. This works but adds resource overhead.
10. Cloud Logging
AWS CloudWatch Logs
CloudWatch Logs is the native AWS solution. The CloudWatch Agent collects from:
- systemd journal
- /var/log/* files
- Custom application logs
// /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
{
"logs": {
"logs_collected": {
"files": {
"collect_list": [
{
"file_path": "/var/log/nginx/access.log",
"log_group_name": "/ec2/nginx/access",
"log_stream_name": "{instance_id}",
"timestamp_format": "%d/%b/%Y:%H:%M:%S %z"
}
]
},
"journald": {
"collect_list": [
{
"log_group_name": "/ec2/system",
"log_stream_name": "{instance_id}/journald"
}
]
}
}
}
}

For Kubernetes on EKS, Fluent Bit ships directly to CloudWatch:
[OUTPUT]
Name cloudwatch_logs
Match kube.*
region eu-west-1
log_group_name /eks/cluster/application
log_stream_prefix ${HOST_NAME}-
    auto_create_group true

GCP Cloud Logging
Google Cloud Logging (formerly Stackdriver) uses the Ops Agent on GCE, and the GKE logging integration is automatic. For custom configurations:
logging:
  receivers:
    nginx_access:
      type: files
      include_paths:
        - /var/log/nginx/access.log
  processors:
    nginx_parser:
      type: parse_nginx_combined
  pipelines:
    nginx_pipeline:
      receivers: [nginx_access]
      processors: [nginx_parser]

Azure Monitor
Azure Monitor Logs (Log Analytics) uses the Azure Monitor Agent (AMA), which replaces the legacy Log Analytics Agent (MMA). Configure via Data Collection Rules (DCR):
{
"dataSources": {
"syslog": [
{
"streams": ["Microsoft-Syslog"],
"facilityNames": ["auth", "authpriv", "daemon"],
"logLevels": ["Warning", "Error", "Critical"],
"name": "syslogSource"
}
]
}
}

11. Debugging and Troubleshooting with Logs
Diagnosing a Failed Service
# Step 1: What's the service status?
systemctl status nginx

# Step 2: Get full journal context
journalctl -u nginx -xe --since "10 minutes ago"

# Step 3: Check last 100 lines
journalctl -u nginx -n 100 --no-pager

# Step 4: Check for dependency failures
journalctl -b -p err
SSH Authentication Failures
# Failed logins
journalctl -u ssh --since today | grep "Failed password"

# Successful logins
journalctl -u ssh --since today | grep "Accepted"

# On systems with auth.log:
grep "Failed password" /var/log/auth.log | awk '{print $11}' | sort | uniq -c | sort -rn | head -20
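One caveat with the awk '{print $11}' approach: the field position shifts when sshd logs "invalid user", which inserts two extra words. Pulling the IP with a regex anchored on "from" is more robust; a sketch against sample lines:

```shell
# Sample sshd lines (note the extra "invalid user" words on line 2)
sample=$(mktemp)
cat > "$sample" <<'EOF'
Jan 15 10:00:01 host sshd[100]: Failed password for root from 203.0.113.7 port 22 ssh2
Jan 15 10:00:02 host sshd[101]: Failed password for invalid user admin from 203.0.113.7 port 22 ssh2
Jan 15 10:00:03 host sshd[102]: Failed password for root from 198.51.100.9 port 22 ssh2
EOF

# Extract the IP after "from" instead of relying on a fixed field number
grep 'Failed password' "$sample" \
  | grep -oE 'from [0-9.]+' | awk '{print $2}' \
  | sort | uniq -c | sort -rn
```

The same pipeline works unchanged on /var/log/auth.log or on `journalctl -u ssh` output.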
Kernel OOM Events
# OOM killer events
journalctl -k | grep -i "oom\|killed process\|out of memory"

# Memory pressure before OOM
dmesg -T | grep -A5 "oom_kill"
Nginx 502/503 Debugging
# Combine nginx error log with upstream logs
journalctl -u nginx --since "15 minutes ago" -o json | \
python3 -c "
import sys, json
for line in sys.stdin:
e = json.loads(line)
msg = e.get('MESSAGE','')
if 'upstream' in msg or 'error' in msg.lower():
print(e['__REALTIME_TIMESTAMP'], msg)
"

Correlating Logs Across Services
In a microservices environment, use a correlation ID (trace ID) injected at the API gateway and propagated through all service calls. When debugging:
# Search for a specific request ID across all logs
grep "req-abc123" /var/log/*/access.log
# Or with Loki:
{namespace="production"} |= "req-abc123"

12. Security and Compliance Logging
auditd: The Kernel Audit Framework
auditd captures security-relevant system calls at the kernel level — before any application-layer filtering. This is essential for PCI-DSS, SOX, and CESOP compliance.
# Install
apt install auditd audispd-plugins # Debian/Ubuntu
dnf install audit                     # RHEL

# Enable
systemctl enable --now auditd
Audit rules in /etc/audit/rules.d/:
# Monitor privileged command execution
-a always,exit -F arch=b64 -S execve -F euid=0 -k root_commands

# Monitor file access to sensitive files
-w /etc/passwd -p wa -k identity
-w /etc/shadow -p wa -k identity
-w /etc/sudoers -p wa -k sudoers

# Monitor network configuration changes
-a always,exit -F arch=b64 -S sethostname -S setdomainname -k system-locale

# Monitor successful/failed login attempts
-w /var/log/faillog -p wa -k logins
-w /var/log/lastlog -p wa -k logins

# Monitor sudo usage
-w /usr/bin/sudo -p x -k sudo_usage
Query audit logs:
# All events for a specific user
ausearch -ua 1001 --start today

# All file writes to /etc
ausearch -f /etc --success yes

# Failed login attempts
ausearch -m USER_AUTH --success no --start today

# Generate a summary report
aureport --summary
aureport --failed
Authentication Log Analysis
# Monitor for brute force attempts (>10 failures from same IP)
awk '/Failed password/{print $11}' /var/log/auth.log | \
sort | uniq -c | sort -rn | awk '$1>10{print $2, $1, "attempts"}'

# Account lockouts
grep "pam_unix.*authentication failure" /var/log/auth.log

# sudo escalations
grep "sudo:" /var/log/auth.log | grep "COMMAND"
FinTech and Compliance Specifics
In regulated environments (PSD2, CESOP, PCI-DSS), logging requirements include:
- Tamper-evident storage: Use journald Forward Secure Sealing (journalctl --setup-keys plus Seal=yes) or write-once storage (S3 Object Lock, WORM drives)
- Retention: PCI-DSS requires 1 year minimum (3 months online, 9 months archival)
- Access logging: Every access to cardholder data must be logged with user, timestamp, action, and source IP
- Privileged access monitoring: All root/sudo activity must be captured and reviewed
- Log integrity verification: Regular hash verification of archived logs
- Separation of duties: Log data must be inaccessible to the users being monitored
Example compliance-focused rsyslog config for financial systems:
# Auth events to tamper-evident remote store (TCP with TLS)
auth.* action(type="omfwd"
target="siem.compliance.internal"
port="6514"
protocol="tcp"
StreamDriver="gtls"
StreamDriverMode="1"
queue.type="LinkedList"
queue.filename="complianceFwd"
queue.saveOnShutdown="on"
    action.resumeRetryCount="-1")

# Local copy for immediate access
auth.* /var/log/auth.log
13. Performance and Optimization
Rate Limiting
A misbehaving service can generate millions of log lines per second, flooding journald and consuming disk. journald rate limiting:
# /etc/systemd/journald.conf
RateLimitInterval=30s
RateLimitBurst=10000

If a service exceeds 10,000 messages in 30 seconds, journald drops further messages until the interval resets. When this happens you'll see:

Suppressed N messages from unit nginx.service

For rsyslog:
if $programname == 'noisy-app' then {
    action(type="omfile" file="/dev/null")
    stop
}

Disk I/O Optimization
- Async writes: rsyslog uses async I/O by default; don’t disable this
- journald compression: Keep Compress=yes; journal entries compress well (70-80% reduction)
- Separate log partition: Mount /var/log on a dedicated partition or LVM volume to prevent root filesystem exhaustion
- tmpfs for volatile logs: If you only need logs for live debugging (no persistence requirement), use /run/log/journal/ (volatile mode)
journald Performance Tuning
[Journal]
Storage=persistent
Compress=yes
SystemMaxUse=4G # Absolute cap
SystemKeepFree=1G # Always keep this free on disk
SystemMaxFileSize=200M # Max size per journal file
MaxFileSec=1month        # Rotate files older than this

Check current disk usage and auto-vacuum:
journalctl --disk-usage
journalctl --verify
journalctl --vacuum-size=2G
journalctl --vacuum-time=30d

Fluent Bit Buffer Tuning
For high-volume environments, configure Fluent Bit’s memory and filesystem buffering:
[SERVICE]
Flush 1
storage.path /var/log/flb-storage/
storage.sync normal
storage.checksum off
    storage.max_chunks_up 128

[INPUT]
Name tail
storage.type filesystem # Persist to disk if output unavailable
Mem_Buf_Limit 50MB
Buffer_Max_Size 5MB
14. Best Practices
1. Centralize Everything
No exception. Single-host log analysis is acceptable for development but never for production. You will have an incident at 3 AM where you need logs from 12 hosts simultaneously.
2. Use Structured Logging (JSON)
Unstructured logs are human-readable but machine-painful. JSON logs enable field-level filtering, aggregation, and alerting.
Application output:
{"time":"2024-01-15T10:23:45Z","level":"ERROR","service":"payment-api","trace_id":"abc123","user_id":9821,"message":"Payment gateway timeout","duration_ms":5000,"gateway":"stripe"}

With structured logs, you can query: {service="payment-api"} | json | duration_ms > 3000
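Even shell scripts and cron jobs benefit from structured output. A minimal sketch of a hypothetical helper (it does not escape quotes inside the message, so real applications should emit JSON through a proper logging library):

```shell
# Hypothetical helper: emit one JSON log line per call.
# Caveat: arguments must not contain double quotes or backslashes.
log_json() {
  printf '{"time":"%s","level":"%s","service":"%s","message":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3"
}

log_json ERROR payment-api "Payment gateway timeout"
```

Piped to journald or captured from stdout, these lines are immediately filterable with `| json` in Loki or a json codec in Logstash.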
3. Never Log Secrets
PAN numbers, passwords, API keys, tokens, session IDs — none of these belong in logs. Implement log scrubbing:
- At the application level (redact before logging)
- At the shipper level (Fluent Bit lua filter or Logstash mutate)
- Audit regularly (grep for patterns like card numbers in log samples)
-- Fluent Bit Lua filter to redact card numbers
function redact_pci(tag, timestamp, record)
if record["message"] then
record["message"] = string.gsub(record["message"], "%d%d%d%d%s?%d%d%d%d%s?%d%d%d%d%s?%d%d%d%d", "****-****-****-****")
end
return 1, timestamp, record
end

4. Add Consistent Metadata
Every log line should carry: hostname, service name, environment (production/staging), version, and a correlation/trace ID. This is non-negotiable in distributed systems.
5. Monitor Log Volume
Log volume is a signal. A spike in log volume often precedes or accompanies an incident. Set up alerts:
- Loki: rate({job="myapp"}[5m]) > 1000 — alert if the log rate exceeds 1000 lines per second
- Elasticsearch: watcher on index document rate
6. Test Log Retention and Recovery
Quarterly: confirm that you can actually retrieve 90-day-old logs. Compliance logs that can’t be retrieved are compliance failures.
7. Separate Application and Security Logs
Route auth events, audit events, and privileged command logs to a separate, higher-retention, tamper-evident destination. Don’t mix them with application debug logs.
8. Document Log Schema
For every service, maintain a log schema document: what fields are emitted, what values are valid, what severity means in your context. This pays dividends during incidents and onboarding.
15. Modern Logging Stack: journald + Fluent Bit + Loki + Grafana
This stack is the practical choice for teams running Kubernetes with existing Grafana infrastructure. It’s open source, lightweight, and deeply integrated.
┌─────────────┐ ┌──────────────┐ ┌──────────┐ ┌─────────┐
│ systemd │ │ Fluent Bit │ │ Loki │ │ Grafana │
│ journald │───▶│ DaemonSet │───▶│ (ingest) │───▶│(Explore)│
│ │ │ │ │ │ │ │
│ containers │ │ per node │ │ S3/GCS │ │ Alerts │
└─────────────┘ └──────────────┘ └──────────┘ └─────────┘

Advantages over ELK:
- Cost: Loki indexes labels only; full-text search is done at query time. Storage costs are 10–20x cheaper than Elasticsearch for equivalent retention.
- Operational overhead: No JVM tuning, no shard management, no complex cluster coordination.
- Grafana native: Same tool for metrics (Prometheus) and logs (Loki). Correlate a spike in error rate directly with log entries in the same dashboard.
- LogQL: A powerful query language that mirrors PromQL, familiar to any Prometheus user.
Loki query examples (LogQL):
# All errors from production payment service
{namespace="production", app="payment-api"} |= "ERROR"

# JSON parsing and field filtering
{namespace="production"} | json | level="error" | duration_ms > 1000

# Error rate over time
rate({namespace="production"} |= "ERROR" [5m])

# Top 10 slowest endpoints by average duration over the last hour
topk(10, avg_over_time({app="api"} | json | unwrap duration_ms [1h]) by (path))
16. Practical Production Setup: Single Linux Server
Full working example: one Linux server shipping systemd journal to Loki via Fluent Bit.
Step 1: Install Fluent Bit
curl https://raw.githubusercontent.com/fluent/fluent-bit/master/install.sh | sh
systemctl enable fluent-bit

Step 2: Configure Fluent Bit
# /etc/fluent-bit/fluent-bit.conf
[SERVICE]
Flush 5
Daemon Off
Log_Level info
Parsers_File parsers.conf
    storage.path /var/log/flb-storage/

[INPUT]
Name systemd
Tag systemd.*
DB /var/log/flb_journal.db
Read_From_Tail On
    Strip_Underscores On

[INPUT]
Name tail
Path /var/log/nginx/access.log
Tag nginx.access
DB /var/log/flb_nginx.db
    Parser nginx

[FILTER]
Name record_modifier
Match *
Record hostname ${HOSTNAME}
    Record env production

[FILTER]
Name lua
Match *
script /etc/fluent-bit/redact.lua
    call   redact_pci

[OUTPUT]
Name loki
Match *
Host loki.internal
Port 3100
Labels job=fluent-bit,host=${HOSTNAME}
line_format json
auto_kubernetes_labels on
Step 3: Parsers
# /etc/fluent-bit/parsers.conf
[PARSER]
Name nginx
Format regex
    Regex ^(?<remote>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
Time_Key time
    Time_Format %d/%b/%Y:%H:%M:%S %z

Step 4: Loki Configuration (minimal)
# /etc/loki/loki.yaml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  wal:
    dir: /var/loki/wal
  lifecycler:
    ring:
      replication_factor: 1

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
  filesystem:
    directory: /var/loki/chunks

limits_config:
  retention_period: 30d

Start and verify:
systemctl start fluent-bit loki
# Check Fluent Bit is reading journal
journalctl -u fluent-bit -f
# Verify Loki is receiving
curl http://localhost:3100/ready
curl http://localhost:3100/loki/api/v1/labels

17. Common Logging Problems and Solutions
Problem: Logs Missing for a Time Window
Symptoms: Can’t find logs for a specific period.
Diagnosis:
journalctl --list-boots
journalctl --verify
journalctl --disk-usage

Causes and fixes:
- Volatile storage: journald in volatile mode loses logs on reboot. Enable persistent storage: mkdir /var/log/journal && systemctl restart systemd-journald
- Rate limiting: Check for suppression messages: journalctl | grep "Suppressed"
- Fluent Bit position DB: If Fluent Bit crashed, the DB file may be corrupted. Delete it and restart (accept re-shipping some logs).
- Log rotation removed files: Fluent Bit's tail input with a DB survives rotation; without it, rotating files can cause missed lines.
Problem: Logs Not Rotating
# Test config
logrotate -d /etc/logrotate.d/myapp

# Force rotation
logrotate -f /etc/logrotate.conf

# Check cron/timer
systemctl status logrotate.timer

Common cause: the log file is not owned by the expected user, or the postrotate script is failing (check its exit code).
Problem: /var/log Full
# Find largest consumers
du -sh /var/log/* | sort -rh | head -20

# Check journald
journalctl --disk-usage

# Emergency cleanup
journalctl --vacuum-size=500M   # Trim journal to 500MB
journalctl --vacuum-time=3d     # Remove entries older than 3 days

# Find any rotated but uncompressed logs
find /var/log -name "*.log.*" ! -name "*.gz" -size +100M
gzip /var/log/bigapp/app.log.1

Prevention: Add a filesystem alert at 80% usage on the /var/log partition.
Problem: journald Consuming Excessive Disk
journalctl --disk-usage
# Expected: < configured SystemMaxUse value

# If over the limit, journald should auto-vacuum; if not:
journalctl --vacuum-size=2G

# Permanently fix
cat >> /etc/systemd/journald.conf << EOF
SystemMaxUse=2G
SystemKeepFree=500M
EOF
systemctl restart systemd-journald

Problem: Fluent Bit Not Forwarding to Loki
# Check Fluent Bit logs
journalctl -u fluent-bit -n 100

# Test Loki connectivity
curl -s http://loki.internal:3100/ready

# Manually push a test log
curl -H "Content-Type: application/json" \
  -X POST http://loki.internal:3100/loki/api/v1/push \
  --data '{"streams":[{"stream":{"job":"test"},"values":[["'$(date +%s%N)'","test message"]]}]}'

Problem: High Log Cardinality Breaking Loki
Symptoms: Loki query performance degrades, ingestion slows, stream limit errors.
# Check number of unique streams
count(count_over_time({job="myapp"}[5m]))
18. Conclusion
Linux logging is not a checkbox — it’s operational infrastructure as important as networking or storage. The progression from single-host journald to a fully centralized, structured, correlated logging platform represents the difference between flying blind and having genuine observability.
The modern logging stack — journald capturing everything locally, Fluent Bit shipping efficiently at the node level, Loki storing cost-effectively with label-based indexing, and Grafana providing unified dashboards and alerting — gives you production-grade observability without the operational weight of a full ELK cluster.
For FinTech and regulated environments, logging is compliance. Missing logs are audit failures. The investment in proper centralization, tamper-evident storage, appropriate retention policies, and regular testing of log retrieval isn’t optional — it’s the operational cost of operating in regulated space.
The non-negotiables:
- Centralize or accept blindness — distributed logs you can't query are not observability
- Structured logging — JSON everywhere, from day one; retrofitting is painful
- No secrets in logs — ever; implement scrubbing at multiple layers
- Test your retention — logs you can't retrieve don't count for compliance
- Monitor log volume — it's a signal; spikes precede incidents
Master these fundamentals, implement the practices in this guide, and logging becomes a superpower rather than a liability.
_Vladimiras Levinas is a Lead DevOps Engineer with 18+ years in fintech infrastructure. He runs a production K3s homelab and writes about AI infrastructure at doc.thedevops.dev_