This guide covers everything you need to run Log in production on Kubernetes: a complete Helm chart, S3-backed storage with a local disk cache, health checks, monitoring, and security.

Deployment

Overview

Log does not yet support partitioning, so a production deployment runs exactly one replica; the primary way to scale is vertically, which can take you quite far. Because all data is persisted to S3, even a single-node Log is highly durable. A production deployment consists of:
  • A single-replica Deployment running the opendata-log container
  • An S3 bucket for durable data storage
  • A PersistentVolumeClaim backed by a fast SSD for the SlateDB disk cache
  • A ConfigMap for SlateDB tuning parameters
  • A ServiceAccount with an IAM role for S3 access (IRSA on EKS)
Log uses SlateDB’s epoch-based fencing, which means only one writer can hold the epoch lock at a time. The Deployment uses the Recreate strategy so that the old pod is fully terminated before the new one starts — a RollingUpdate would cause the new pod to be fenced by the old one and never become ready.
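
If you want to confirm that a running release is using Recreate rather than the default RollingUpdate, you can query the Deployment directly (this sketch assumes the release is named opendata-log, as in the install command later in this guide):

```shell
# Print the Deployment's update strategy; it should be "Recreate".
kubectl get deployment opendata-log -o jsonpath='{.spec.strategy.type}'
```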

Helm chart

Below is a complete Helm chart for deploying Log to production. Create these files under charts/opendata-log/ (alongside a minimal Chart.yaml with the chart's name and version, which Helm requires).

values.yaml

image:
  repository: ghcr.io/opendata-oss/log
  tag: "0.2.2"

port: 8080

# S3 storage configuration
s3:
  bucket: my-log-bucket
  region: us-west-2

# SlateDB disk cache — use a fast SSD-backed StorageClass
cache:
  size: 100Gi
  storageClassName: gp3
  maxCacheSizeBytes: 107374182400  # 100 GiB

# SlateDB tuning
slatedb:
  defaultTtl: 604800000          # 7 days (ms)
  maxUnflushedBytes: 134217728   # 128 MB
  l0SstSizeBytes: 16777216       # 16 MB
  maxSstSize: 67108864           # 64 MB

resources:
  requests:
    cpu: 100m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 8Gi

# IRSA role ARN for S3 access
serviceAccount:
  roleArn: ""

templates/configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-slatedb-settings
data:
  SlateDb.yaml: |
    default_ttl: {{ int .Values.slatedb.defaultTtl }}
    max_unflushed_bytes: {{ int .Values.slatedb.maxUnflushedBytes }}
    l0_sst_size_bytes: {{ int .Values.slatedb.l0SstSizeBytes }}
    compactor_options:
      max_concurrent_compactions: 2
      max_sst_size: {{ int .Values.slatedb.maxSstSize }}
    garbage_collector_options:
      manifest_options:
        interval: '60s'
        min_age: '3600s'
      wal_options:
        interval: '60s'
        min_age: '60s'
      compacted_options:
        interval: '60s'
        min_age: '3600s'
      compactions_options:
        interval: '60s'
        min_age: '3600s'
    object_store_cache_options:
      root_folder: /cache
      max_cache_size_bytes: {{ int .Values.cache.maxCacheSizeBytes }}

templates/serviceaccount.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ .Release.Name }}
  annotations:
    eks.amazonaws.com/role-arn: {{ .Values.serviceAccount.roleArn }}

templates/pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ .Release.Name }}-cache
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: {{ .Values.cache.storageClassName }}
  resources:
    requests:
      storage: {{ .Values.cache.size }}

templates/deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}
    spec:
      serviceAccountName: {{ .Release.Name }}
      terminationGracePeriodSeconds: 60
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
      containers:
        - name: log
          image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
          args:
            - "--port"
            - "{{ .Values.port }}"
            - "--s3-bucket"
            - "{{ .Values.s3.bucket }}"
            - "--s3-region"
            - "{{ .Values.s3.region }}"
          ports:
            - containerPort: {{ .Values.port }}
              name: http
          env:
            - name: RUST_LOG
              value: info
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: http
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /-/ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
          volumeMounts:
            - name: slatedb-settings
              mountPath: /app/SlateDb.yaml
              subPath: SlateDb.yaml
              readOnly: true
            - name: cache
              mountPath: /cache
      volumes:
        - name: slatedb-settings
          configMap:
            name: {{ .Release.Name }}-slatedb-settings
        - name: cache
          persistentVolumeClaim:
            claimName: {{ .Release.Name }}-cache

templates/service.yaml

apiVersion: v1
kind: Service
metadata:
  name: {{ .Release.Name }}
spec:
  selector:
    app: {{ .Release.Name }}
  ports:
    - port: {{ .Values.port }}
      targetPort: http
      name: http

Install the chart

helm install opendata-log ./charts/opendata-log \
  --set s3.bucket=my-log-bucket \
  --set s3.region=us-west-2 \
  --set serviceAccount.roleArn=arn:aws:iam::123456789012:role/opendata-log
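
Once installed, a quick way to verify the release came up (the release name and label below match the chart above):

```shell
# Wait for the single replica to become ready, then inspect it.
kubectl rollout status deployment/opendata-log --timeout=120s
kubectl get pods -l app=opendata-log
kubectl logs deployment/opendata-log --tail=20
```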

Disk cache

SlateDB caches frequently accessed data on local disk to avoid repeated reads from S3. For production workloads, use an SSD-backed StorageClass:
  • EKS: Use gp3 (General Purpose SSD) or io2 for higher IOPS. For maximum performance, use instance-store NVMe volumes with a local-static-provisioner.
  • Size the cache based on your active working set. The default of 100 Gi is a good starting point; increase if you see frequent cache evictions in the slatedb_* metrics.
Avoid using HDD-backed volumes (e.g. st1, sc1) for the cache. SlateDB issues many small random reads, and spinning disks will bottleneck performance.
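
To see how full the cache volume is on a running pod (a sketch; assumes the release name opendata-log and the /cache mount path from the chart above):

```shell
# Check disk usage on the SlateDB cache volume.
kubectl exec deploy/opendata-log -- df -h /cache
```

If usage sits at the cap and slatedb_* metrics show frequent evictions, increase cache.size and cache.maxCacheSizeBytes together.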

Health checks

Log exposes two health-check endpoints:
Endpoint      Type        Behavior
/-/healthy    Liveness    Returns 200 if the process is running
/-/ready      Readiness   Returns 200 if the storage check passes, 503 otherwise
Both probes are included in the Helm chart’s Deployment template above.
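
You can exercise both endpoints by hand with a port-forward (assuming the default port 8080 and release name opendata-log):

```shell
# Forward the service locally, then print the status code of each probe.
kubectl port-forward svc/opendata-log 8080:8080 &
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/-/healthy
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/-/ready
kill %1  # stop the port-forward
```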

Graceful shutdown

Log handles SIGTERM and SIGINT signals gracefully:
  1. Stops accepting new connections
  2. Drains in-flight requests
  3. Flushes pending data to durable storage
  4. Exits cleanly
The Helm chart sets terminationGracePeriodSeconds: 60 to give the server enough time to complete the flush before Kubernetes force-kills the pod.
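
One way to observe the shutdown sequence in practice (a sketch; the exact log lines will vary):

```shell
# Delete the pod without waiting and follow its logs as it drains;
# it should flush and exit well within the 60s grace period.
kubectl delete pod -l app=opendata-log --wait=false
kubectl logs -l app=opendata-log -f
```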

Monitoring

All metrics are exposed at /metrics in Prometheus text format.

Key metrics

Metric                          Type       Labels                    Description
log_append_records_total        counter    —                         Total records appended
log_append_bytes_total          counter    —                         Total bytes appended
log_records_scanned_total       counter    —                         Total records scanned
log_bytes_scanned_total         counter    —                         Total bytes scanned
http_requests_total             counter    method, endpoint, status  Total HTTP requests handled
http_request_duration_seconds   histogram  method, endpoint          Request latency distribution
http_requests_in_flight         gauge      —                         Number of HTTP requests currently being served
Log also exposes slatedb_* metrics from the underlying SlateDB storage engine. These are useful for debugging storage-level performance and compaction behavior.
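
To spot-check what the endpoint exposes before wiring up Prometheus (assumes the default port 8080 and release name opendata-log):

```shell
# Port-forward and sample the log_*, http_*, and slatedb_* metric families.
kubectl port-forward svc/opendata-log 8080:8080 &
curl -s http://localhost:8080/metrics | grep -E '^(log_|http_|slatedb_)' | head -20
kill %1
```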

Example PromQL queries

# Request rate (requests per second over 5 minutes)
rate(http_requests_total[5m])

# Error rate (5xx responses)
rate(http_requests_total{status=~"5.."}[5m])

# p99 request latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# In-flight requests
http_requests_in_flight

# Append throughput (records per second)
rate(log_append_records_total[5m])

# Scan throughput (bytes per second)
rate(log_bytes_scanned_total[5m])

Security

TLS and authentication

Log does not include built-in TLS termination or authentication. Place a reverse proxy (nginx, Envoy, or a cloud load balancer) in front of Log to handle TLS and access control.
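
As one option on Kubernetes, an Ingress can terminate TLS in front of the Service. The sketch below assumes an nginx ingress controller, a hypothetical hostname log.example.com, and a pre-created TLS secret named log-tls; the source-range allowlist is one simple access-control measure, since Log itself performs no authentication:

```shell
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: opendata-log
  annotations:
    # Restrict who can reach Log at the proxy layer.
    nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/8"
spec:
  ingressClassName: nginx
  tls:
    - hosts: [log.example.com]
      secretName: log-tls
  rules:
    - host: log.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: opendata-log
                port:
                  name: http
EOF
```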

Object storage security

The Helm chart uses IRSA (IAM Roles for Service Accounts) so that the pod receives temporary AWS credentials automatically — no static access keys required. Create an IAM role with the following policy and attach it to the ServiceAccount via the serviceAccount.roleArn value:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::my-log-bucket",
        "arn:aws:s3:::my-log-bucket/*"
      ]
    }
  ]
}
The IAM role’s trust policy should scope access to your EKS cluster’s OIDC provider and the specific service account:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE:aud": "sts.amazonaws.com",
          "oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE:sub": "system:serviceaccount:default:opendata-log"
        }
      }
    }
  ]
}
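If you use eksctl, it can create the role with the correct OIDC trust policy in one step (cluster name and policy ARN below are placeholders):

```shell
# Create an IAM role trusted by the cluster's OIDC provider,
# scoped to the opendata-log service account in the default namespace.
eksctl create iamserviceaccount \
  --cluster my-cluster \
  --namespace default \
  --name opendata-log \
  --attach-policy-arn arn:aws:iam::123456789012:policy/opendata-log-s3 \
  --role-name opendata-log \
  --role-only \
  --approve
```

The --role-only flag keeps eksctl from creating the ServiceAccount itself, since the Helm chart already templates one; pass the resulting role ARN via serviceAccount.roleArn.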
Additional recommendations:
  • Enable encryption at rest on the S3 bucket (SSE-S3 or SSE-KMS).
  • Use a VPC endpoint for S3 to keep traffic off the public internet.
  • Block all public access on the bucket.
  • Add a lifecycle rule to transition old data to Intelligent-Tiering after 30 days and abort incomplete multipart uploads after 7 days.
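
The bucket-hardening steps above can be applied with the AWS CLI, for example (the bucket name is a placeholder):

```shell
# Block all public access.
aws s3api put-public-access-block --bucket my-log-bucket \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

# Default encryption at rest (SSE-S3).
aws s3api put-bucket-encryption --bucket my-log-bucket \
  --server-side-encryption-configuration \
  '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'

# Lifecycle: Intelligent-Tiering after 30 days; abort stale multipart uploads after 7.
aws s3api put-bucket-lifecycle-configuration --bucket my-log-bucket \
  --lifecycle-configuration \
  '{"Rules":[{"ID":"log-lifecycle","Status":"Enabled","Filter":{},"Transitions":[{"Days":30,"StorageClass":"INTELLIGENT_TIERING"}],"AbortIncompleteMultipartUpload":{"DaysAfterInitiation":7}}]}'
```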