Engineering learnings from production

Full set of practical learnings from distributed systems, data platforms, and reliability work.

Kafka

Ordering · Throughput · Hot keys

Your key choice defines ordering guarantees, consumer parallelism, backfills, and what correctness means under load.
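A minimal sketch of why the key matters, assuming a simple hash-mod partitioner. The names `partition_for` and `partition_load` are hypothetical, and CRC32 is only a stand-in for illustration (Kafka's default partitioner uses a murmur2 hash). The point is that one key always lands on one partition, so a hot key concentrates load there.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash: Kafka's default partitioner uses murmur2, not CRC32.
    # What matters here is that the mapping is deterministic per key.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_load(keys, num_partitions: int) -> Counter:
    # Count how many records each partition would receive.
    return Counter(partition_for(k, num_partitions) for k in keys)


# Hypothetical traffic: one tenant produces 90% of the records.
keys = ["tenant-A"] * 90 + [f"tenant-{i}" for i in range(10)]
load = partition_load(keys, num_partitions=6)
hottest_partition, hottest_count = load.most_common(1)[0]
```

Records for `tenant-A` stay ordered because they share a partition, but that partition also carries at least 90 of the 100 records, which caps how much adding consumers can help.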

Cassandra

Query-first · Compaction · Tombstones

Cassandra rewards predictable access paths and punishes flexible queries with hotspots, tombstones, and slow repairs.

Caching

Staleness · Stampedes · Invalidation

Choose bounded staleness and predictable failure modes over perfect invalidation that becomes operational debt.
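One way to read "bounded staleness" is a plain time-to-live: a value may be served for at most a fixed number of seconds, then it is reloaded. The sketch below assumes a single-threaded caller; `TTLCache` and `get_or_load` are hypothetical names, and it does not address stampedes (many callers reloading the same key at once), which would need per-key locking or request coalescing on top.

```python
import time


class TTLCache:
    """Serve cached values for at most ttl_seconds, then reload."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so tests can control time
        self.entries = {}       # key -> (value, stored_at)

    def get_or_load(self, key, loader):
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]       # still within the staleness bound
        value = loader(key)     # expired or missing: reload from the source
        self.entries[key] = (value, now)
        return value
```

The failure mode is predictable: if invalidation is missed entirely, readers see data that is at most `ttl_seconds` old.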

Pipelines

Retries · Idempotency · Backpressure

Retries happen. Make handlers idempotent, carry a dedupe key with every message, and expose backpressure before latency becomes an outage.
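A minimal sketch of an idempotent handler, assuming every message carries a unique ID. `IdempotentHandler` and its fields are hypothetical, and the in-memory set is for illustration only: a real pipeline would need a durable dedupe store that is updated atomically with the side effect.

```python
class IdempotentHandler:
    """Apply each message at most once, keyed by message ID."""

    def __init__(self):
        self.seen = set()   # assumption: in-memory; production needs durable storage
        self.balance = 0

    def handle(self, message_id: str, amount: int) -> bool:
        if message_id in self.seen:
            return False    # duplicate delivery: skip the side effect
        self.balance += amount
        self.seen.add(message_id)
        return True


handler = IdempotentHandler()
first = handler.handle("msg-1", 10)
retry = handler.handle("msg-1", 10)   # redelivery of the same message
```

After the retry, the balance reflects the message once, so the sender can redeliver safely whenever it is unsure whether the first attempt succeeded.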

Reliability

SLOs · Error budgets · Triage

Define what matters to users, then pick signals that explain failures. Everything else becomes noise and burnout.
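The arithmetic behind an error budget is short enough to sketch. This assumes a request-based availability SLO; the function names and the sample numbers are hypothetical.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    # Failed requests the SLO tolerates over the window.
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    # Fraction of the budget still unspent; negative means the SLO is breached.
    allowed = error_budget(slo_target, total_requests)
    if allowed <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1.0 - failed_requests / allowed
```

For example, a 99.9% target over 1,000,000 requests tolerates about 1,000 failures, so 250 failures leaves roughly three quarters of the budget.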

Data

Consistency · Reconciliation · Audits

Use invariants, audits, and repair tools. The safest distributed system assumes partial failure and drift.
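A reconciliation audit can be as simple as diffing two views of the same data. The sketch below assumes both stores fit in memory as dictionaries keyed by record ID; `reconcile` and the sample records are hypothetical.

```python
def reconcile(source: dict, replica: dict) -> dict:
    """Report drift between a source of truth and a derived copy."""
    missing = sorted(k for k in source if k not in replica)
    extra = sorted(k for k in replica if k not in source)
    mismatched = sorted(
        k for k in source if k in replica and source[k] != replica[k]
    )
    return {"missing": missing, "extra": extra, "mismatched": mismatched}


source = {"a": 1, "b": 2, "c": 3}
replica = {"a": 1, "b": 20, "d": 4}
report = reconcile(source, replica)
```

Each bucket in the report maps to a different repair action: backfill what is missing, delete or investigate what is extra, and rewrite what is mismatched.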

Additional

Systems that are easy to reason about, observe, and recover usually outperform feature-rich designs under sustained production pressure.

Additional

Strong runbooks, ownership, and fast feedback loops reduce blast radius and build trust faster than any single tooling change.