A production-grade deep dive into Linux logging for DevOps Engineers, SREs, and Platform Engineers working in Kubernetes, cloud, and regulated environments.
1. Introduction: Why Logging Is Non-Negotiable
In any production system, three pillars define observability: metrics, traces, and logs. Metrics tell you that something is wrong. Traces tell you where in the call chain it went wrong. Logs tell you why.
Logging is the oldest, most universal form of system telemetry, and yet it remains one of the most misunderstood. Engineers over-log, under-log, log the wrong things, fail to centralize, or ignore retention entirely — until a 3 AM incident forces a painful reckoning with a full /var/log or with log data that simply doesn't exist for the timeframe in question.
In modern distributed systems — Kubernetes clusters, microservices, serverless functions — logging has become substantially more complex. A single user request may touch dozens of pods across multiple nodes and namespaces. Without structured, correlated, centralized logging, reconstructing the sequence of events is nearly impossible.
In FinTech and regulated environments, logging isn’t just an engineering convenience — it’s a compliance requirement. PCI-DSS, PSD2, GDPR, and CESOP all mandate audit trails, access logging, and data retention policies. The consequences of missing logs during a regulatory audit are severe.
This guide covers the full logging stack: from kernel ring buffer and systemd-journald at the bottom, through rsyslog and syslog-ng in the middle, to Fluent Bit, Loki, and Grafana at the top. Every section is oriented toward production use.
2. Linux Logging Architecture Overview
The Logging Hierarchy
Linux logging is a layered system. Understanding each layer prevents misdiagnosis during incidents.
┌─────────────────────────────────────────┐
│ Central Logging Platform │
│ (Loki / Elasticsearch / CloudWatch) │
└──────────────────┬──────────────────────┘
│
┌──────────────────▼──────────────────────┐
│ Log Shipper / Aggregator │
│ (Fluent Bit / Fluentd / Logstash) │
└──────────────────┬──────────────────────┘
│
┌──────────────────▼──────────────────────┐
│ Syslog Daemon Layer │
│ (rsyslog / syslog-ng) │
└────────┬─────────────────────┬──────────┘
│ │
┌────────▼──────┐ ┌─────────▼──────────┐
│ systemd- │ │ /var/log files │
│ journald │ │ (flat files) │
└────────┬──────┘ └────────────────────┘
│
┌────────▼──────────────────────────────────┐
│ Applications / Services / Kernel │
│ (stdout/stderr, syslog(), /dev/kmsg) │
└────────────────────────────────────────────┘

Kernel Logging
The kernel writes messages to a circular ring buffer accessible via /dev/kmsg. The dmesg command reads this buffer. Messages include hardware detection, driver initialization, OOM killer events, and filesystem errors. On systems with systemd, journald captures kernel messages and makes them available via journalctl -k.
dmesg -T | grep -i error
dmesg -T | grep -i "oom"
journalctl -k --since "1 hour ago"

User-Space Logging
Applications log through several mechanisms:
- syslog() — the classic POSIX C-library logging interface; routes through journald's /dev/log socket on modern systems
- stdout/stderr — captured by the service manager (systemd or container runtime)
- Direct file writes — applications writing to /var/log/app/ directly
- Structured logging libraries — writing JSON to stdout (common in Go, Python, Java microservices)
The syslog Protocol
The syslog protocol (standardized in RFC 5424, superseding the legacy RFC 3164 format) defines a facility (who is logging: kern, auth, daemon, user, etc.) and a severity (emergency, alert, critical, error, warning, notice, info, debug). Together they form the message's priority, which is used to route messages to different destinations.
Severity   Level   Meaning
emerg      0       System is unusable
alert      1       Action must be taken immediately
crit       2       Critical conditions
err        3       Error conditions
warning    4       Warning conditions
notice     5       Normal but significant
info       6       Informational
debug      7       Debug-level messages
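Facility and severity combine arithmetically: the PRI value transmitted at the start of every syslog message is facility × 8 + severity. A minimal sketch:

```shell
# PRI = facility * 8 + severity (RFC 5424).
# Example: facility daemon (3), severity err (3).
facility=3
severity=3
pri=$((facility * 8 + severity))
echo "<$pri>"   # the PRI field as it appears on the wire: <27>
```

This is why a filter like `daemon.err` in rsyslog can be resolved to a single numeric comparison.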
3. systemd-journald Deep Dive
How journald Works
systemd-journald is the primary log collection daemon on modern Linux systems (RHEL 7+, Ubuntu 16.04+, Debian 8+). It replaces the traditional syslog daemon as the first recipient of log messages.
journald collects from:
- /dev/kmsg (kernel messages)
- /dev/log (syslog socket, legacy)
- stdout/stderr of systemd-managed services
- The native journal protocol via /run/systemd/journal/
- The audit subsystem
Binary Journal Format
Unlike traditional syslog’s plaintext files, journald stores logs in a structured binary format at:
- Volatile (runtime): /run/log/journal/ — lost on reboot
- Persistent: /var/log/journal/ — survives reboots
Each journal file supports Forward Secure Sealing to detect tampering (generate keys with journalctl --setup-keys, set Seal=yes, check with journalctl --verify), and supports indexed querying by unit, PID, timestamp, and message ID.
To enable persistent storage, create the directory:
mkdir -p /var/log/journal
systemd-tmpfiles --create --prefix /var/log/journal
systemctl restart systemd-journald

Essential journalctl Commands
# Follow logs in real time (like tail -f)
journalctl -f

# Show all logs from the last hour
journalctl --since "1 hour ago"

# Show logs for a specific service
journalctl -u nginx.service
journalctl -u nginx.service --since "2024-01-15 10:00:00" --until "2024-01-15 11:00:00"

# Show kernel messages only
journalctl -k

# Show verbose context with explanations for errors
journalctl -xe

# Filter by priority (0=emerg to 7=debug)
journalctl -p err              # errors and above
journalctl -p warning..err

# Filter by PID
journalctl _PID=1234

# Output formats
journalctl -u nginx -o json-pretty
journalctl -u nginx -o short-iso

# Disk usage
journalctl --disk-usage

# Vacuum old logs
journalctl --vacuum-time=7d
journalctl --vacuum-size=500M
journald Configuration
/etc/systemd/journald.conf:
[Journal]
Storage=persistent
Compress=yes
SystemMaxUse=2G
SystemKeepFree=500M
SystemMaxFileSize=100M
MaxRetentionSec=30day
RateLimitInterval=30s
RateLimitBurst=10000
ForwardToSyslog=yes

Key tuning parameters:

- SystemMaxUse — hard cap on journal disk usage
- RateLimitBurst — max messages per RateLimitInterval per service; critical for noisy services
- ForwardToSyslog=yes — bridges journald to rsyslog for forwarding
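Sizing SystemMaxUse is simple arithmetic; a rough sketch, where the daily volume and compression ratio are assumptions you should replace with measurements from journalctl --disk-usage over time:

```shell
# Rough retention estimate for SystemMaxUse=2G (all numbers are assumptions)
raw_mb_per_day=800        # uncompressed journal volume per day
compression_savings=75    # journald compression typically saves 70-80%
stored_mb_per_day=$(( raw_mb_per_day * (100 - compression_savings) / 100 ))
retention_days=$(( 2048 / stored_mb_per_day ))
echo "~${retention_days} days of journal retention in 2G"
```

If the result is shorter than your debugging window (or your compliance window), raise the cap or forward to central storage.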
4. rsyslog Deep Dive
Architecture
rsyslog is a high-performance syslog daemon that handles log routing, filtering, transformation, and forwarding. On most distributions it sits alongside journald, receiving forwarded messages and writing to /var/log/ files or remote destinations.
rsyslog processes messages through a pipeline:
Input → Parser → Rulesets (filters + actions) → Output

Configuration Structure
/etc/rsyslog.conf # Main config
/etc/rsyslog.d/*.conf    # Drop-in configs (loaded alphabetically)

A typical /etc/rsyslog.conf:
# Load modules
module(load="imuxsock") # Local syslog socket
module(load="imjournal" # Read from journald
       StateFile="imjournal.state")

# Global settings
$WorkDirectory /var/spool/rsyslog
$ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat

# Local log files
auth,authpriv.* /var/log/auth.log
*.*;auth,authpriv.none /var/log/syslog
kern.* /var/log/kern.log
*.emerg                          :omusrmsg:*

# Include drop-ins
$IncludeConfig /etc/rsyslog.d/*.conf
Remote Forwarding
# UDP forwarding (fire and forget, lower overhead)
*.* @logserver.example.com:514

# TCP forwarding (reliable, buffered)
*.* @@logserver.example.com:514

# TCP with TLS
*.* action(type="omfwd"
target="logserver.example.com"
port="6514"
protocol="tcp"
StreamDriver="gtls"
StreamDriverMode="1"
StreamDriverAuthMode="x509/name")
UDP vs TCP for log forwarding:
UDP (single @) is low-overhead but loses messages under load or network failures. TCP (double @@) buffers and retransmits, making it appropriate for compliance and audit logging. For high-volume production, add a queue:
*.* action(type="omfwd"
target="logserver.example.com"
port="514"
protocol="tcp"
queue.type="LinkedList"
queue.size="10000"
queue.dequeueBatchSize="100"
queue.filename="fwdRule"
queue.maxDiskSpace="1g"
queue.saveOnShutdown="on"
    action.resumeRetryCount="-1")

Structured Output with Templates
rsyslog can output JSON for downstream consumers:
template(name="JSONFormat" type="string"
    string="{\"time\":\"%timereported:::date-rfc3339%\",\"host\":\"%HOSTNAME%\",\"severity\":\"%syslogseverity-text%\",\"facility\":\"%syslogfacility-text%\",\"program\":\"%programname%\",\"pid\":\"%procid%\",\"message\":\"%msg:::json%\"}\n")

*.* action(type="omfile" file="/var/log/json/all.log" template="JSONFormat")
5. syslog-ng vs rsyslog
syslog-ng takes a different design philosophy: it uses a declarative configuration language based on sources, filters, and destinations rather than rsyslog’s hybrid rule-based approach.
When to choose rsyslog:
- Default on RHEL/CentOS, Ubuntu, Debian — familiar, well-documented
- Complex routing rules with rulesets
- High-performance single-node ingestion
- Integration with imjournal for journald bridging
When to choose syslog-ng:
- Configuration clarity is a priority (readable pipeline syntax)
- Heavy use of parsers (CSV, JSON, Apache, etc.)
- Complex correlation and rewriting
- Premium enterprise features (syslog-ng Store Box, etc.)
A syslog-ng equivalent for remote forwarding:
source s_local {
system();
internal();
};

destination d_remote {
network("logserver.example.com"
port(514)
transport("tcp"));
};

log {
source(s_local);
destination(d_remote);
};
For most Linux infrastructure teams, rsyslog is the practical default unless team familiarity or specific parsing requirements favor syslog-ng.
6. Log Files and Locations
Standard Log Locations
Path                      Contents
/var/log/syslog           General system messages (Debian/Ubuntu)
/var/log/messages         General system messages (RHEL/CentOS)
/var/log/auth.log         Authentication events (Debian/Ubuntu)
/var/log/secure           Authentication events (RHEL/CentOS)
/var/log/kern.log         Kernel messages
/var/log/dmesg            Boot-time kernel messages
/var/log/apt/             Package management (Debian)
/var/log/yum.log          Package management (RHEL)
/var/log/nginx/           Nginx access and error logs
/var/log/apache2/         Apache access and error logs
/var/log/mysql/           MySQL error log
/var/log/postgresql/      PostgreSQL logs
/var/log/audit/audit.log  Linux audit daemon
Application Logging Best Practices
Modern applications should write to stdout/stderr rather than files. This is the Twelve-Factor App standard, and it’s what Kubernetes, Docker, and systemd all expect. The log infrastructure (journald, container runtime) handles capture, rotation, and forwarding.
When direct file logging is unavoidable (legacy apps, databases), ensure:
- Log directory permissions are tight (app user only)
- Logrotate is configured (see Section 7)
- The application handles SIGHUP for log file reopening (or use copytruncate)
7. Log Rotation with logrotate
Without log rotation, /var/log eventually fills the root filesystem. On a busy web server, access logs can grow by gigabytes per day.
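The arithmetic is worth doing explicitly when sizing disks; a sketch with illustrative numbers (request rate and line size are assumptions):

```shell
# Back-of-envelope access-log growth estimate
rps=2000               # requests per second (assumption)
bytes_per_line=250     # average access-log line size (assumption)
bytes_per_day=$(( rps * bytes_per_line * 86400 ))
echo "$(( bytes_per_day / 1024 / 1024 / 1024 )) GiB/day"
```

At these numbers a 100 GiB /var/log partition lasts well under a week without rotation and compression.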
Configuration
The system logrotate job runs daily via cron (/etc/cron.daily/logrotate) or the logrotate.timer systemd timer.
# /etc/logrotate.d/nginx
/var/log/nginx/*.log {
daily
rotate 14
compress
delaycompress
missingok
notifempty
sharedscripts
postrotate
nginx -s reopen
endscript
}

Option reference:
Option                Effect
daily/weekly/monthly  Rotation frequency
rotate N              Keep N rotated files
compress              Gzip old files
delaycompress         Skip compressing the most recent rotated file (allows apps to finish writing)
missingok             Don't error if the log file is missing
notifempty            Skip rotation if the file is empty
copytruncate          Copy then truncate (for apps that don't support SIGHUP)
postrotate            Script to run after rotation (e.g., reload nginx)
dateext               Use date in rotated filename instead of numbers
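The copytruncate mechanism can be sketched in plain shell: copy the live file, then truncate it in place so the writing process keeps its still-valid file descriptor. The known caveat is that lines written between the copy and the truncate are lost:

```shell
# Sketch of what copytruncate does (paths are throwaway examples)
tmp=$(mktemp -d)
printf 'old line 1\nold line 2\n' > "$tmp/app.log"

cp "$tmp/app.log" "$tmp/app.log.1"   # step 1: copy the live file
: > "$tmp/app.log"                   # step 2: truncate in place, same inode

printf 'new line\n' >> "$tmp/app.log"   # writer continues on the same fd
wc -l < "$tmp/app.log.1"                # rotated copy keeps the old lines
```

This race window is why signal-based reopening (postrotate + SIGHUP) is preferred when the application supports it.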
Testing logrotate
# Dry run
logrotate -d /etc/logrotate.d/nginx

# Force rotation now
logrotate -f /etc/logrotate.d/nginx
systemd-journald Rotation
journald has its own built-in rotation controlled by journald.conf. For systemd-only environments without rsyslog, ensure SystemMaxUse is set to prevent unbounded growth.
8. Centralized Logging Architecture
Single-host logging is fragile and unscalable. Centralized logging solves:
- Single pane of glass for multi-host/multi-service debugging
- Correlation across services and nodes
- Long-term retention independent of host lifecycle
- Compliance and audit requirements
Architecture Patterns
Pattern 1: Lightweight (small scale)
Linux hosts → rsyslog TCP → Central syslog server → Files

Pattern 2: Modern observability stack

Linux hosts → Fluent Bit → Loki → Grafana

Pattern 3: Enterprise ELK

Linux hosts → Fluent Bit → Kafka → Logstash → Elasticsearch → Kibana

Pattern 4: High-volume regulated environment
Linux hosts → Fluent Bit (per node) → Kafka (durable buffer) →
Logstash (parse/enrich) → Elasticsearch (hot-warm-cold) → Kibana
↓
Cold storage (S3/GCS)

Kafka as a buffer between shippers and the indexer is critical in regulated environments. If Elasticsearch goes down for maintenance, Kafka holds the messages. Without it, you lose log data during indexer downtime.
Fluent Bit
Fluent Bit is the lightweight log shipper of choice for Kubernetes and resource-constrained environments. Written in C, it consumes ~450KB RAM at idle. It runs as a DaemonSet in Kubernetes or as a systemd service on Linux hosts.
# /etc/fluent-bit/fluent-bit.conf
[SERVICE]
Flush 5
Daemon Off
Log_Level info
    Parsers_File parsers.conf

[INPUT]
Name systemd
Tag host.*
Systemd_Filter _SYSTEMD_UNIT=nginx.service
DB /var/log/flb_systemd.db
    Read_From_Tail On

[FILTER]
Name record_modifier
Match host.*
Record hostname ${HOSTNAME}
    Record environment production

[OUTPUT]
Name loki
Match host.*
Host loki.monitoring.svc.cluster.local
Port 3100
Labels job=fluentbit,host=${HOSTNAME}
line_format json
Logstash
Logstash is the heavy-duty ETL layer: it parses, enriches, and routes logs. Use it when you need complex transformations — grok parsing of unstructured logs, GeoIP enrichment, field normalization, or routing to multiple outputs.
input {
kafka {
bootstrap_servers => "kafka:9092"
topics => ["logs"]
codec => json
}
}

filter {
if [kubernetes][namespace] == "payments" {
mutate {
add_field => { "compliance_scope" => "pci" }
}
}
if [message] =~ /ERROR/ {
mutate { add_tag => ["error"] }
}
}

output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "logs-%{+YYYY.MM.dd}"
}
}
9. Logging in Kubernetes
The Kubernetes Logging Model
Kubernetes has no built-in centralized logging. The official model: containers write to stdout/stderr, the container runtime captures these streams, and you’re responsible for the rest.
On each node, container logs land at:
/var/log/containers/<pod>_<namespace>_<container>-<id>.log

These are symlinks to:
/var/log/pods/<namespace>_<pod>_<uid>/<container>/<n>.log

kubectl Logging Commands
# Current logs
kubectl logs <pod> -n <namespace>

# Follow
kubectl logs -f <pod> -n <namespace>

# Previous container instance (after crash)
kubectl logs --previous <pod> -n <namespace>

# Multi-container pod
kubectl logs <pod> -c <container> -n <namespace>

# All pods matching a label
kubectl logs -l app=nginx -n production --prefix

# With timestamps
kubectl logs <pod> --timestamps=true
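The node-level container log filename encodes pod, namespace, and container name; a sketch splitting it with shell parameter expansion (the filename itself is a made-up example):

```shell
# Format: <pod>_<namespace>_<container>-<id>.log (filename is hypothetical)
f="payment-api-7d9c4_production_app-0123abcdef.log"
base=${f%.log}
pod=${base%%_*}              # up to the first underscore
rest=${base#*_}
namespace=${rest%%_*}        # between the underscores
container=${rest#*_}
container=${container%-*}    # strip the trailing -<id>
echo "$pod / $namespace / $container"
```

This is essentially what log shippers' kubernetes filters do before they enrich records with metadata from the API server.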
Node-Level Log Architecture
Container → container runtime (containerd/CRI-O) → /var/log/pods/...
↓
Fluent Bit DaemonSet
↓
Central log system

journald integration depends on the container runtime. With systemd cgroups, containerd can forward to journald:
journalctl -u containerd CONTAINER_NAME=nginx

Kubernetes Logging Stack: Fluent Bit + Loki
The production-recommended lightweight stack for Kubernetes:
# fluent-bit-daemonset.yaml (simplified)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      containers:
        - name: fluent-bit
          image: cr.fluentbit.io/fluent/fluent-bit:3.0
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: config
              mountPath: /fluent-bit/etc
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: config
          configMap:
            name: fluent-bit-config

Loki is a horizontally scalable log aggregation system designed to work like Prometheus but for logs. It indexes only labels (not full text), making it far cheaper than Elasticsearch for pure log storage.
# Loki labels strategy for Kubernetes
labels:
  namespace: "{{ .kubernetes.namespace }}"
  pod: "{{ .kubernetes.pod_name }}"
  container: "{{ .kubernetes.container_name }}"
  node: "{{ .kubernetes.host }}"

Critical: Keep Loki label cardinality low. Don't use pod IDs or request IDs as labels. Use them as log line content instead. High-cardinality labels destroy Loki performance.
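Why cardinality matters is plain multiplication: each distinct combination of label values is a separate Loki stream, with its own chunks and index entries (the numbers below are illustrative):

```shell
# Stream count = product of distinct values per label (illustrative numbers)
namespaces=10
apps=50
streams=$(( namespaces * apps ))
echo "bounded labels:  $streams streams"                      # manageable

request_ids=1000000   # a request ID used as a label value...
echo "with request_id: $(( streams * request_ids )) streams"  # catastrophic
```

A few hundred streams query quickly; hundreds of millions of single-entry streams make both ingestion and queries collapse.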
Kubernetes Logging Pitfalls
- Log rotation on nodes: kubelet rotates container logs by default (10MB / 5 files). Fluent Bit's DB option tracks position and survives rotation.
- Pod lifecycle: When a pod is deleted, its logs are deleted from the node. Ship logs before pods die.
- CrashLoopBackOff: Use the --previous flag, and ship logs before the process exits to avoid losing crash context.
- Sidecar containers: Some teams inject a sidecar that reads app log files and writes to stdout. This works but adds resource overhead.
10. Cloud Logging
AWS CloudWatch Logs
CloudWatch Logs is the native AWS solution. The CloudWatch Agent collects from:
- systemd journal
- /var/log/* files
- Custom application logs
// /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
{
"logs": {
"logs_collected": {
"files": {
"collect_list": [
{
"file_path": "/var/log/nginx/access.log",
"log_group_name": "/ec2/nginx/access",
"log_stream_name": "{instance_id}",
"timestamp_format": "%d/%b/%Y:%H:%M:%S %z"
}
]
},
"journald": {
"collect_list": [
{
"log_group_name": "/ec2/system",
"log_stream_name": "{instance_id}/journald"
}
]
}
}
}
}

For Kubernetes on EKS, Fluent Bit ships directly to CloudWatch:
[OUTPUT]
Name cloudwatch_logs
Match kube.*
region eu-west-1
log_group_name /eks/cluster/application
log_stream_prefix ${HOST_NAME}-
    auto_create_group true

GCP Cloud Logging
Google Cloud Logging (formerly Stackdriver) uses the Ops Agent on GCE, and the GKE logging integration is automatic. For custom configurations:
logging:
  receivers:
    nginx_access:
      type: files
      include_paths:
        - /var/log/nginx/access.log
  processors:
    nginx_parser:
      type: parse_nginx_combined
  pipelines:
    nginx_pipeline:
      receivers: [nginx_access]
      processors: [nginx_parser]

Azure Monitor
Azure Monitor Logs (Log Analytics) uses the Azure Monitor Agent (AMA), which replaces the legacy Log Analytics Agent (MMA). Configure via Data Collection Rules (DCR):
{
"dataSources": {
"syslog": [
{
"streams": ["Microsoft-Syslog"],
"facilityNames": ["auth", "authpriv", "daemon"],
"logLevels": ["Warning", "Error", "Critical"],
"name": "syslogSource"
}
]
}
}

11. Debugging and Troubleshooting with Logs
Diagnosing a Failed Service
# Step 1: What's the service status?
systemctl status nginx

# Step 2: Get full journal context
journalctl -u nginx -xe --since "10 minutes ago"

# Step 3: Check last 100 lines
journalctl -u nginx -n 100 --no-pager

# Step 4: Check for dependency failures
journalctl -b -p err
SSH Authentication Failures
# Failed logins
journalctl -u ssh --since today | grep "Failed password"

# Successful logins
journalctl -u ssh --since today | grep "Accepted"

# On systems with auth.log:
grep "Failed password" /var/log/auth.log | awk '{print $11}' | sort | uniq -c | sort -rn | head -20
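One caveat with the awk '{print $11}' approach: the field position shifts when sshd logs "invalid user", which inserts two extra words. Pulling the IP with a regex anchored on "from" is more robust; a sketch against sample lines:

```shell
# Sample sshd lines (note the extra "invalid user" words on line 2)
sample=$(mktemp)
cat > "$sample" <<'EOF'
Jan 15 10:00:01 host sshd[100]: Failed password for root from 203.0.113.7 port 22 ssh2
Jan 15 10:00:02 host sshd[101]: Failed password for invalid user admin from 203.0.113.7 port 22 ssh2
Jan 15 10:00:03 host sshd[102]: Failed password for root from 198.51.100.9 port 22 ssh2
EOF

# Extract the IP after "from" instead of relying on a fixed field number
grep 'Failed password' "$sample" \
  | grep -oE 'from [0-9.]+' | awk '{print $2}' \
  | sort | uniq -c | sort -rn
```

The same pipeline works unchanged on /var/log/auth.log or on `journalctl -u ssh` output.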
Kernel OOM Events
# OOM killer events
journalctl -k | grep -i "oom\|killed process\|out of memory"

# Memory pressure before OOM
dmesg -T | grep -A5 "oom_kill"
Nginx 502/503 Debugging
# Combine nginx error log with upstream logs
journalctl -u nginx --since "15 minutes ago" -o json | \
python3 -c "
import sys, json
for line in sys.stdin:
e = json.loads(line)
msg = e.get('MESSAGE','')
if 'upstream' in msg or 'error' in msg.lower():
print(e['__REALTIME_TIMESTAMP'], msg)
"

Correlating Logs Across Services
In a microservices environment, use a correlation ID (trace ID) injected at the API gateway and propagated through all service calls. When debugging:
# Search for a specific request ID across all logs
grep "req-abc123" /var/log/*/access.log
# Or with Loki:
{namespace="production"} |= "req-abc123"

12. Security and Compliance Logging
auditd: The Kernel Audit Framework
auditd captures security-relevant system calls at the kernel level — before any application-layer filtering. This is essential for PCI-DSS, SOX, and CESOP compliance.
# Install
apt install auditd audispd-plugins # Debian/Ubuntu
dnf install audit                     # RHEL

# Enable
systemctl enable --now auditd
Audit rules in /etc/audit/rules.d/:
# Monitor privileged command execution
-a always,exit -F arch=b64 -S execve -F euid=0 -k root_commands

# Monitor file access to sensitive files
-w /etc/passwd -p wa -k identity
-w /etc/shadow -p wa -k identity
-w /etc/sudoers -p wa -k sudoers

# Monitor network configuration changes
-a always,exit -F arch=b64 -S sethostname -S setdomainname -k system-locale

# Monitor successful/failed login attempts
-w /var/log/faillog -p wa -k logins
-w /var/log/lastlog -p wa -k logins

# Monitor sudo usage
-w /usr/bin/sudo -p x -k sudo_usage
Query audit logs:
# All events for a specific user
ausearch -ua 1001 --start today

# All file writes to /etc
ausearch -f /etc --success yes

# Failed login attempts
ausearch -m USER_AUTH --success no --start today

# Generate a summary report
aureport --summary
aureport --failed
Authentication Log Analysis
# Monitor for brute force attempts (>10 failures from same IP)
awk '/Failed password/{print $11}' /var/log/auth.log | \
sort | uniq -c | sort -rn | awk '$1>10{print $2, $1, "attempts"}'

# Account lockouts
grep "pam_unix.*authentication failure" /var/log/auth.log

# sudo escalations
grep "sudo:" /var/log/auth.log | grep "COMMAND"
FinTech and Compliance Specifics
In regulated environments (PSD2, CESOP, PCI-DSS), logging requirements include:
- Tamper-evident storage: Use journald Forward Secure Sealing (journalctl --setup-keys plus Seal=yes) or write-once storage (S3 Object Lock, WORM drives)
- Retention: PCI-DSS requires 1 year minimum (3 months online, 9 months archival)
- Access logging: Every access to cardholder data must be logged with user, timestamp, action, and source IP
- Privileged access monitoring: All root/sudo activity must be captured and reviewed
- Log integrity verification: Regular hash verification of archived logs
- Separation of duties: Log data must be inaccessible to the users being monitored
Example compliance-focused rsyslog config for financial systems:
# Auth events to tamper-evident remote store (TCP with TLS)
auth.* action(type="omfwd"
target="siem.compliance.internal"
port="6514"
protocol="tcp"
StreamDriver="gtls"
StreamDriverMode="1"
queue.type="LinkedList"
queue.filename="complianceFwd"
queue.saveOnShutdown="on"
    action.resumeRetryCount="-1")

# Local copy for immediate access
auth.* /var/log/auth.log
13. Performance and Optimization
Rate Limiting
A misbehaving service can generate millions of log lines per second, flooding journald and consuming disk. journald rate limiting:
# /etc/systemd/journald.conf
RateLimitInterval=30s
RateLimitBurst=10000

If a service exceeds 10,000 messages in 30 seconds, journald drops further messages until the interval resets. When this happens you'll see:

Suppressed N messages from unit nginx.service

For rsyslog:
if $programname == 'noisy-app' then {
    action(type="omfile" file="/dev/null")
    stop
}

Disk I/O Optimization
- Async writes: rsyslog uses async I/O by default; don’t disable this
- journald compression: Keep Compress=yes; journal entries compress well (70-80% reduction)
- Separate log partition: Mount /var/log on a dedicated partition or LVM volume to prevent root filesystem exhaustion
- tmpfs for volatile logs: If you only need logs for live debugging (no persistence requirement), use /run/log/journal/ (volatile mode)
journald Performance Tuning
[Journal]
Storage=persistent
Compress=yes
SystemMaxUse=4G # Absolute cap
SystemKeepFree=1G # Always keep this free on disk
SystemMaxFileSize=200M # Max size per journal file
MaxFileSec=1month        # Rotate files older than this

Check current disk usage and auto-vacuum:
journalctl --disk-usage
journalctl --verify
journalctl --vacuum-size=2G
journalctl --vacuum-time=30d

Fluent Bit Buffer Tuning
For high-volume environments, configure Fluent Bit’s memory and filesystem buffering:
[SERVICE]
Flush 1
storage.path /var/log/flb-storage/
storage.sync normal
storage.checksum off
    storage.max_chunks_up 128

[INPUT]
Name tail
storage.type filesystem # Persist to disk if output unavailable
Mem_Buf_Limit 50MB
Buffer_Max_Size 5MB
14. Best Practices
1. Centralize Everything
No exception. Single-host log analysis is acceptable for development but never for production. You will have an incident at 3 AM where you need logs from 12 hosts simultaneously.
2. Use Structured Logging (JSON)
Unstructured logs are human-readable but machine-painful. JSON logs enable field-level filtering, aggregation, and alerting.
Application output:
{"time":"2024-01-15T10:23:45Z","level":"ERROR","service":"payment-api","trace_id":"abc123","user_id":9821,"message":"Payment gateway timeout","duration_ms":5000,"gateway":"stripe"}

With structured logs, you can query: {service="payment-api"} | json | duration_ms > 3000
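Even shell scripts and cron jobs benefit from structured output. A minimal sketch of a hypothetical helper (it does not escape quotes inside the message, so real applications should emit JSON through a proper logging library):

```shell
# Hypothetical helper: emit one JSON log line per call.
# Caveat: arguments must not contain double quotes or backslashes.
log_json() {
  printf '{"time":"%s","level":"%s","service":"%s","message":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3"
}

log_json ERROR payment-api "Payment gateway timeout"
```

Piped to journald or captured from stdout, these lines are immediately filterable with `| json` in Loki or a json codec in Logstash.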
3. Never Log Secrets
PAN numbers, passwords, API keys, tokens, session IDs — none of these belong in logs. Implement log scrubbing:
- At the application level (redact before logging)
- At the shipper level (Fluent Bit lua filter or Logstash mutate)
- Audit regularly (grep for patterns like card numbers in log samples)
-- Fluent Bit Lua filter to redact card numbers
function redact_pci(tag, timestamp, record)
if record["message"] then
record["message"] = string.gsub(record["message"], "%d%d%d%d%s?%d%d%d%d%s?%d%d%d%d%s?%d%d%d%d", "****-****-****-****")
end
return 1, timestamp, record
end

4. Add Consistent Metadata
Every log line should carry: hostname, service name, environment (production/staging), version, and a correlation/trace ID. This is non-negotiable in distributed systems.
5. Monitor Log Volume
Log volume is a signal. A spike in log volume often precedes or accompanies an incident. Set up alerts:
- Loki: rate({job="myapp"}[5m]) > 1000 — alert if the log rate exceeds 1000 lines per second
- Elasticsearch: watcher on index document rate
6. Test Log Retention and Recovery
Quarterly: confirm that you can actually retrieve 90-day-old logs. Compliance logs that can’t be retrieved are compliance failures.
7. Separate Application and Security Logs
Route auth events, audit events, and privileged command logs to a separate, higher-retention, tamper-evident destination. Don’t mix them with application debug logs.
8. Document Log Schema
For every service, maintain a log schema document: what fields are emitted, what values are valid, what severity means in your context. This pays dividends during incidents and onboarding.
15. Modern Logging Stack: journald + Fluent Bit + Loki + Grafana
This stack is the practical choice for teams running Kubernetes with existing Grafana infrastructure. It’s open source, lightweight, and deeply integrated.
┌─────────────┐ ┌──────────────┐ ┌──────────┐ ┌─────────┐
│ systemd │ │ Fluent Bit │ │ Loki │ │ Grafana │
│ journald │───▶│ DaemonSet │───▶│ (ingest) │───▶│(Explore)│
│ │ │ │ │ │ │ │
│ containers │ │ per node │ │ S3/GCS │ │ Alerts │
└─────────────┘ └──────────────┘ └──────────┘ └─────────┘

Advantages over ELK:
- Cost: Loki indexes labels only; full-text search is done at query time. Storage costs are 10–20x cheaper than Elasticsearch for equivalent retention.
- Operational overhead: No JVM tuning, no shard management, no complex cluster coordination.
- Grafana native: Same tool for metrics (Prometheus) and logs (Loki). Correlate a spike in error rate directly with log entries in the same dashboard.
- LogQL: A powerful query language that mirrors PromQL, familiar to any Prometheus user.
Loki query examples (LogQL):
# All errors from production payment service
{namespace="production", app="payment-api"} |= "ERROR"

# JSON parsing and field filtering
{namespace="production"} | json | level="error" | duration_ms > 1000

# Error rate over time
rate({namespace="production"} |= "ERROR" [5m])

# Top 10 slowest endpoints by average duration over the last hour
topk(10, avg_over_time({app="api"} | json | unwrap duration_ms [1h]) by (path))
16. Practical Production Setup: Single Linux Server
Full working example: one Linux server shipping systemd journal to Loki via Fluent Bit.
Step 1: Install Fluent Bit
curl https://raw.githubusercontent.com/fluent/fluent-bit/master/install.sh | sh
systemctl enable fluent-bit

Step 2: Configure Fluent Bit
# /etc/fluent-bit/fluent-bit.conf
[SERVICE]
Flush 5
Daemon Off
Log_Level info
Parsers_File parsers.conf
    storage.path /var/log/flb-storage/

[INPUT]
Name systemd
Tag systemd.*
DB /var/log/flb_journal.db
Read_From_Tail On
    Strip_Underscores On

[INPUT]
Name tail
Path /var/log/nginx/access.log
Tag nginx.access
DB /var/log/flb_nginx.db
    Parser nginx

[FILTER]
Name record_modifier
Match *
Record hostname ${HOSTNAME}
    Record env production

[FILTER]
Name lua
Match *
script /etc/fluent-bit/redact.lua
    call   redact_pci

[OUTPUT]
Name loki
Match *
Host loki.internal
Port 3100
Labels job=fluent-bit,host=${HOSTNAME}
line_format json
auto_kubernetes_labels on
Step 3: Parsers
# /etc/fluent-bit/parsers.conf
[PARSER]
Name nginx
Format regex
    Regex ^(?<remote>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
Time_Key time
    Time_Format %d/%b/%Y:%H:%M:%S %z

Step 4: Loki Configuration (minimal)
# /etc/loki/loki.yaml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  wal:
    dir: /var/loki/wal
  lifecycler:
    ring:
      replication_factor: 1

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
  filesystem:
    directory: /var/loki/chunks

limits_config:
  retention_period: 30d

Start and verify:
systemctl start fluent-bit loki
# Check Fluent Bit is reading journal
journalctl -u fluent-bit -f
# Verify Loki is receiving
curl http://localhost:3100/ready
curl http://localhost:3100/loki/api/v1/labels

17. Common Logging Problems and Solutions
Problem: Logs Missing for a Time Window
Symptoms: Can’t find logs for a specific period.
Diagnosis:
journalctl --list-boots
journalctl --verify
journalctl --disk-usage

Causes and fixes:
- Volatile storage: journald in volatile mode loses logs on reboot. Enable persistent storage: mkdir /var/log/journal && systemctl restart systemd-journald
- Rate limiting: Check for suppression messages: journalctl | grep "Suppressed"
- Fluent Bit position DB: If Fluent Bit crashed, the DB file may be corrupted. Delete it and restart (accept re-shipping some logs).
- Log rotation removed files: Fluent Bit's tail input with a DB survives rotation; without it, rotating files can cause missed lines.
Problem: Logs Not Rotating
# Test config
logrotate -d /etc/logrotate.d/myapp

# Force rotation
logrotate -f /etc/logrotate.conf

# Check cron/timer
systemctl status logrotate.timer

Common cause: the log file is not owned by the expected user, or the postrotate script is failing (check its exit code).
Problem: /var/log Full
# Find largest consumers
du -sh /var/log/* | sort -rh | head -20

# Check journald
journalctl --disk-usage

# Emergency cleanup
journalctl --vacuum-size=500M   # Trim journal to 500MB
journalctl --vacuum-time=3d     # Remove entries older than 3 days

# Find any rotated but uncompressed logs
find /var/log -name "*.log.*" ! -name "*.gz" -size +100M
gzip /var/log/bigapp/app.log.1

Prevention: Add a filesystem alert at 80% usage on the /var/log partition.
Problem: journald Consuming Excessive Disk
journalctl --disk-usage
# Expected: < configured SystemMaxUse value

# If over the limit, journald should auto-vacuum; if not:
journalctl --vacuum-size=2G

# Permanently fix
cat >> /etc/systemd/journald.conf << EOF
SystemMaxUse=2G
SystemKeepFree=500M
EOF
systemctl restart systemd-journald

Problem: Fluent Bit Not Forwarding to Loki
# Check Fluent Bit logs
journalctl -u fluent-bit -n 100

# Test Loki connectivity
curl -s http://loki.internal:3100/ready

# Manually push a test log
curl -H "Content-Type: application/json" \
  -X POST http://loki.internal:3100/loki/api/v1/push \
  --data '{"streams":[{"stream":{"job":"test"},"values":[["'$(date +%s%N)'","test message"]]}]}'

Problem: High Log Cardinality Breaking Loki
Symptoms: Loki query performance degrades, ingestion slows, stream limit errors.
# Check number of unique streams
count(count_over_time({job="myapp"}[5m]))
18. Conclusion
Linux logging is not a checkbox — it’s operational infrastructure as important as networking or storage. The progression from single-host journald to a fully centralized, structured, correlated logging platform represents the difference between flying blind and having genuine observability.
The modern logging stack — journald capturing everything locally, Fluent Bit shipping efficiently at the node level, Loki storing cost-effectively with label-based indexing, and Grafana providing unified dashboards and alerting — gives you production-grade observability without the operational weight of a full ELK cluster.
For FinTech and regulated environments, logging is compliance. Missing logs are audit failures. The investment in proper centralization, tamper-evident storage, appropriate retention policies, and regular testing of log retrieval isn’t optional — it’s the operational cost of operating in regulated space.
The non-negotiables:
- Centralize or accept blindness — distributed logs you can't query are not observability
- Structured logging — JSON everywhere, from day one; retrofitting is painful
- No secrets in logs — ever; implement scrubbing at multiple layers
- Test your retention — logs you can't retrieve don't count for compliance
- Monitor log volume — it's a signal; spikes precede incidents
Master these fundamentals, implement the practices in this guide, and logging becomes a superpower rather than a liability.
_Vladimiras Levinas is a Lead DevOps Engineer with 18+ years in fintech infrastructure. He runs a production K3s homelab and writes about AI infrastructure at doc.thedevops.dev_