# Deployment
## Overview
Since we haven’t yet built partitioning into Timeseries, a production Timeseries deployment consists of a single replica only. The primary means of scaling is scaling up, which can take you pretty far. Because all data is persisted on S3, data in a single-node Timeseries is highly durable. A production deployment of Timeseries therefore consists of:

- A single-replica Deployment running the `opendata-timeseries` container
- An S3 bucket for durable data storage
- A PersistentVolumeClaim backed by a fast SSD for the SlateDB disk cache
- A ConfigMap for the Prometheus-compatible scrape configuration, S3 storage settings, and SlateDB tuning
- A ServiceAccount with an IAM role for S3 access (IRSA on EKS)
Timeseries uses SlateDB’s epoch-based fencing, which means only one writer can hold the epoch lock at a time. The Deployment uses the `Recreate` strategy so that the old pod is fully terminated before the new one starts; a `RollingUpdate` creates the possibility that the new pod is fenced by the old one and never becomes ready.

## Helm chart
Below is a complete Helm chart for deploying Timeseries to production. Create these files under `charts/opendata-timeseries/`.
`values.yaml`
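A sketch of the chart's tunables. The key names and defaults here are assumptions shared by the templates below, not a published schema; adjust them for your environment:

```yaml
# Container image (repository path is a placeholder).
image:
  repository: opendata/timeseries
  tag: latest
  pullPolicy: IfNotPresent

service:
  # Listen port; assumed to follow the Prometheus convention.
  port: 9090

serviceAccount:
  # IAM role assumed via IRSA; see "Object storage security" below.
  roleArn: ""

s3:
  bucket: my-timeseries-bucket
  region: us-east-1

cache:
  # SSD-backed PVC for the SlateDB disk cache.
  storageClassName: gp3
  size: 100Gi

resources:
  requests:
    cpu: "2"
    memory: 8Gi
  limits:
    memory: 8Gi

# Prometheus-compatible scrape configuration rendered into prometheus.yaml.
scrapeConfig:
  scrape_configs:
    - job_name: self
      static_configs:
        - targets: ["localhost:9090"]
```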
`templates/configmap.yaml`
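Renders `prometheus.yaml` with the scrape configuration plus the S3 and SlateDB settings. The `storage` schema below is a sketch; check the project's configuration reference for the authoritative field names:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-timeseries
data:
  prometheus.yaml: |
    {{- toYaml .Values.scrapeConfig | nindent 4 }}
    # S3 storage settings and SlateDB tuning (field names assumed).
    storage:
      s3:
        bucket: {{ .Values.s3.bucket }}
        region: {{ .Values.s3.region }}
      slatedb:
        disk_cache_path: /cache
```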
`templates/serviceaccount.yaml`
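The `eks.amazonaws.com/role-arn` annotation is how IRSA associates the pod with the IAM role:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ .Release.Name }}-timeseries
  {{- if .Values.serviceAccount.roleArn }}
  annotations:
    # IRSA: EKS injects temporary credentials for this role into the pod.
    eks.amazonaws.com/role-arn: {{ .Values.serviceAccount.roleArn }}
  {{- end }}
```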
`templates/pvc.yaml`
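The SSD-backed claim for the SlateDB disk cache:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ .Release.Name }}-timeseries-cache
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: {{ .Values.cache.storageClassName }}
  resources:
    requests:
      storage: {{ .Values.cache.size }}
```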
`templates/deployment.yaml`
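The single-replica Deployment. Note the `Recreate` strategy (see the fencing discussion above) and the 60-second grace period (see "Graceful shutdown" below); the `--config` flag is an assumed CLI interface:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-timeseries
spec:
  replicas: 1        # no partitioning yet: exactly one writer
  strategy:
    type: Recreate   # never run old and new pods concurrently (epoch fencing)
  selector:
    matchLabels:
      app: {{ .Release.Name }}-timeseries
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}-timeseries
    spec:
      serviceAccountName: {{ .Release.Name }}-timeseries
      terminationGracePeriodSeconds: 60   # time for the shutdown flush
      containers:
        - name: opendata-timeseries
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          args: ["--config=/etc/timeseries/prometheus.yaml"]  # flag name assumed
          ports:
            - name: http
              containerPort: {{ .Values.service.port }}
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: http
          readinessProbe:
            httpGet:
              path: /-/ready
              port: http
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          volumeMounts:
            - name: config
              mountPath: /etc/timeseries
            - name: cache
              mountPath: /cache
      volumes:
        - name: config
          configMap:
            name: {{ .Release.Name }}-timeseries
        - name: cache
          persistentVolumeClaim:
            claimName: {{ .Release.Name }}-timeseries-cache
```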
`templates/service.yaml`
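A ClusterIP Service in front of the single pod:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: {{ .Release.Name }}-timeseries
spec:
  selector:
    app: {{ .Release.Name }}-timeseries
  ports:
    - name: http
      port: {{ .Values.service.port }}
      targetPort: http
```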
### Install the chart
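For example (release name, namespace, and flag values are placeholders):

```bash
helm install timeseries charts/opendata-timeseries \
  --namespace timeseries --create-namespace \
  --set serviceAccount.roleArn=arn:aws:iam::123456789012:role/timeseries-s3 \
  --set s3.bucket=my-timeseries-bucket
```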
## Disk cache
SlateDB caches frequently accessed data on local disk to avoid repeated reads from S3. For production workloads, use an SSD-backed StorageClass:

- EKS: use `gp3` (General Purpose SSD) or `io2` for higher IOPS (a `gp3` StorageClass sketch follows this list). For maximum performance, use instance-store NVMe volumes with a local static provisioner.
- Size the cache based on your active working set. The default of 100Gi is a good starting point; increase it if you see frequent cache evictions in the `slatedb_*` metrics.
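A sketch of a `gp3` StorageClass for EKS, assuming the EBS CSI driver is installed; the IOPS and throughput numbers are illustrative:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  # gp3 decouples IOPS/throughput from volume size; raise these for hot caches.
  iops: "6000"
  throughput: "250"
volumeBindingMode: WaitForFirstConsumer
```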
## Block cache
On top of SlateDB’s disk cache, Timeseries can keep decoded blocks in a hybrid memory-plus-disk block cache backed by foyer. Add it under `storage` in `prometheus.yaml`:
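A sketch of the stanza, using only the fields documented in the list below; the `block_cache` key name and value formats are assumptions:

```yaml
storage:
  block_cache:
    # Roughly the recent working set (last few hours of bucket index + samples).
    memory_capacity: 4GiB
    # On the same SSD-backed PVC as the SlateDB disk cache.
    disk_path: /cache/block-cache
    # Every cached block is also written to the disk tier on insert.
    write_policy: WriteOnInsertion
    # Raise if foyer_* write-queue metrics show backpressure.
    flushers: 4
```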
- Set `memory_capacity` to roughly the recent working set (the last few hours of bucket index and sample data).
- Point `disk_path` at the same SSD-backed PVC used by the disk cache; the two workloads coexist.
- Keep `write_policy: WriteOnInsertion` so every cached block is also on disk. Restarts then hit the disk tier instead of re-reading from S3.
- Raise `flushers` if the `foyer_*` write-queue metrics show backpressure.
See <block_cache_config> for the full field reference.
## Cache warmer
On startup the server scans recent time bucket key ranges through the storage reader, which populates the block cache. Queries that arrive in the first few seconds after a restart therefore avoid a cold-cache penalty. This is on by default and covers the last 24 hours, including sample data. To tune or disable it, see <cache_warmer_config>.
## Durable OTLP ingest
For high-volume OTel metrics, run the stateless ingest path instead of (or alongside) direct OTLP/HTTP writes. Producers keep writing during TSDB restarts, writes stay inside the AZ, and a crashed consumer resumes from the last acked batch on its own.

## Health checks
Timeseries exposes two health-check endpoints:

| Endpoint | Type | Behavior |
|---|---|---|
| `/-/healthy` | Liveness | Returns 200 if the process is running |
| `/-/ready` | Readiness | Returns 200 once the TSDB is initialized and ready to serve queries |
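To spot-check from a workstation (the Service name follows the chart sketch above; the port is the assumed default):

```bash
kubectl port-forward svc/timeseries-timeseries 9090 &
curl -fsS http://localhost:9090/-/healthy
curl -fsS http://localhost:9090/-/ready
```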
## Graceful shutdown
Timeseries handles `SIGTERM` and `SIGINT` signals gracefully:
- Stops accepting new connections
- Drains in-flight requests
- Flushes TSDB data from memory to durable storage
- Exits cleanly
Set `terminationGracePeriodSeconds: 60` to give the server enough time to complete the flush before Kubernetes force-kills the pod.
## Monitoring
All metrics are exposed at `/metrics` in Prometheus text format. Since Timeseries is itself a Prometheus-compatible data source, you can configure it to scrape its own metrics endpoint (included in the default `scrapeConfig` above).
### Key metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| `scrape_samples_scraped` | counter | job, instance | Number of samples scraped per target |
| `scrape_samples_failed` | counter | job, instance | Number of samples that failed validation |
| `remote_write_samples_ingested_total` | counter | — | Total samples ingested via remote write |
| `remote_write_samples_failed_total` | counter | — | Total samples that failed remote write ingestion |
| `http_requests_total` | counter | method, endpoint, status | Total HTTP requests handled |
| `http_request_duration_seconds` | histogram | method, endpoint | Request latency distribution |
| `http_requests_in_flight` | gauge | — | Number of HTTP requests currently being served |
Timeseries also exposes `slatedb_*` metrics from the underlying SlateDB storage engine. These are useful for debugging storage-level performance and compaction behavior.

### Example PromQL queries
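A few starting points built from the metrics above (label values are examples):

```promql
# Remote-write ingest rate, samples per second over 5 minutes
rate(remote_write_samples_ingested_total[5m])

# Fraction of remote-write samples failing ingestion
rate(remote_write_samples_failed_total[5m])
  / rate(remote_write_samples_ingested_total[5m])

# p99 request latency per endpoint
histogram_quantile(0.99,
  sum by (endpoint, le) (rate(http_request_duration_seconds_bucket[5m])))

# 5xx rate per endpoint
sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m]))
```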
## Security
### TLS and authentication
### Object storage security
The Helm chart uses IRSA (IAM Roles for Service Accounts) so that the pod receives temporary AWS credentials automatically, with no static access keys required. Create an IAM role with the following policy and attach it to the ServiceAccount via the `serviceAccount.roleArn` value:
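A minimal sketch (the bucket name is a placeholder; the engine may additionally need multipart-upload permissions):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TimeseriesList",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-timeseries-bucket"
    },
    {
      "Sid": "TimeseriesReadWrite",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my-timeseries-bucket/*"
    }
  ]
}
```

Beyond IAM, harden the bucket itself: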
- Enable encryption at rest on the S3 bucket (SSE-S3 or SSE-KMS).
- Use a VPC endpoint for S3 to keep traffic off the public internet.
- Block all public access on the bucket.
- Add a lifecycle rule to transition old data to Intelligent-Tiering after 30 days and abort incomplete multipart uploads after 7 days.