An operator's briefing.
What we approved this week, and what it means for your team. Each paying customer gets a private VM; openberth runs their apps inside. This deck: what your customers experience, what we promise, and where we still need your call.
The customer journey.
The shapes of what you can sell, what the customer lives through, and what they see when something changes — billed or otherwise.
Four tiers, plus custom for sales.
Three self-serve sizes and an enterprise escape hatch. App counts below are marketing approximations, not contracts — we enforce the envelope (vCPU/RAM/disk); how many apps fit is the customer's choice.
Two things product should know. Trial isn't a flavor — it's a Hobby VM with upstream metadata. Trial-vs-paying distinction lives entirely in the frontpage; infra never sees billing state. Second: these specific numbers can change (Hobby might become 1 vCPU) without re-opening the design — the shape is settled, the values are tunable.
Four states. That's all.
One paying customer = one workspace = one VM = four possible life states. The third column below is what product needs to know: when a workspace is active, infra doesn't care if the customer is trialing, paying, or a month overdue — that's the frontpage's concern. We only see "run it" or "take it offline".
| state | vm | local disk | snapshot | url response |
|---|---|---|---|---|
| active | running | attached | periodic 24h | openberth serves |
| suspended | powered off | retained | last periodic | edge: suspended page |
| archived | destroyed | wiped | fresh, object storage | edge: archived page |
| deleted | gone | gone | purged | 404 / tombstone |
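The four states compose with a small set of verbs into a transition map. A minimal sketch, assuming the allowed transitions implied by the table (the exact matrix is the spec's to confirm, e.g. whether delete is legal from every state):

```python
# Hypothetical workspace state machine. The allowed-transition matrix below is an
# assumption read off the state table, not the confirmed spec.
TRANSITIONS = {
    "suspend": {"active": "suspended"},
    "archive": {"active": "archived", "suspended": "archived"},
    "restore": {"suspended": "active", "archived": "active"},
    "delete":  {"active": "deleted", "suspended": "deleted", "archived": "deleted"},
}

def apply(state: str, verb: str) -> str:
    """Return the next state, or raise if the verb is invalid from this state."""
    try:
        return TRANSITIONS[verb][state]
    except KeyError:
        raise ValueError(f"cannot {verb} a {state} workspace")
```

Encoding the matrix as data rather than branching logic keeps the "four states, four verbs" claim checkable at a glance.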
Four verbs. Safe to retry.
This is the entire transition vocabulary between frontpage and infra. Every call carries a client-supplied request_id; sending the same call twice is safe — we return the same result, never duplicate work. Conflicting in-flight operations get a clear 409 your client can retry on.
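The retry contract above can be sketched in a few lines. Class and method names are illustrative, not the real API; the point is the two rules: same (workspace, request_id) returns the same result, and a conflicting in-flight operation gets a retryable 409:

```python
import uuid

class ConflictError(Exception):
    """Stands in for the 409 the real API returns."""

class OperationStore:
    """Sketch of the idempotency contract: (workspace_id, request_id) is unique,
    so replaying a call returns the original operation instead of new work."""

    def __init__(self):
        self._by_key = {}      # (workspace_id, request_id) -> operation dict
        self._in_flight = {}   # workspace_id -> operation id

    def submit(self, workspace_id, verb, request_id):
        key = (workspace_id, request_id)
        if key in self._by_key:
            return self._by_key[key]   # same call twice: same result, never duplicate work
        if workspace_id in self._in_flight:
            raise ConflictError("409: another operation is in flight; retry later")
        op = {"id": str(uuid.uuid4()), "verb": verb, "status": "running"}
        self._by_key[key] = op
        self._in_flight[workspace_id] = op["id"]
        return op

    def complete(self, workspace_id):
        self._in_flight.pop(workspace_id, None)
```

In the real system the unique key lives on the operations table (UNIQUE on workspace + request_id), so the database, not application memory, enforces this.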
Archive is minutes, not instant.
These durations are the contract: "Suspending…" finishes in seconds, "Restoring from archive…" needs to read minutes. Two implications: product messaging must match these ranges, and the frontpage grace-period timers must account for them — don't schedule an archive for 30s before a billing retry fires.
Worst-case for archive and restore-from-archive. Don't show customers a spinner that implies seconds. The recommended UX is a confirmation screen with an ETA, then an email / notification when the workspace is back.
Two off-states. Very different costs.
The customer-visible difference is how quickly we can bring them back when they update their card. Product should map these to your grace-period tiers.
Three pages you own the copy for.
When a workspace isn't running, the customer's URL doesn't 500 — our edge answers directly with a state page. These are three surfaces product/marketing should own the wording and branding of. Each should tell the customer exactly what to do next (pay, restore, contact support).
- suspended · tap down · disk retained
- archived · slot freed · snapshot in S3
- deleted · row pseudonymized
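The routing decision behind those three pages can be sketched as a tiny edge responder. The copy is product's to write, and the non-404 status codes here are placeholders, not decided values:

```python
# Illustrative edge responder. Page copy, branding, and the exact status codes for
# the suspended/archived pages are assumptions; only deleted -> 404 is from the table.
STATE_PAGES = {
    "suspended": (503, "suspended page: workspace paused, update billing to resume"),
    "archived":  (503, "archived page: workspace stored, restoring takes minutes"),
    "deleted":   (404, "tombstone"),
}

def respond(workspace_state: str):
    if workspace_state == "active":
        return (200, "proxy to openberth")   # openberth serves the app
    return STATE_PAGES[workspace_state]
```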
Our obligations.
The boundary between your team and ours, the legal obligation we carry on deletion, and how much data a customer can lose if a host fails.
Your team owns when. Ours owns how.
This is the single most important line in the whole design. The frontpage decides when each transition should happen (grace periods, trial expiries, promo extensions — all yours). Infra decides how it happens (atomicity, ordering, verification, rollback). We don't run billing clocks; you don't run VMs.
Below: the relational model that backs these commitments. Seven tables, one Postgres. Operators call this the "boring-on-purpose" design — every stateful service we have to run at 3am is a liability.
regions · catalog
- ◆ id · text · "us-west-1"
- display_name · text
- state · active / retiring
- created_at · timestamptz

hosts · bare-metal
- ◆ id · uuid
- → region_id · regions
- fqdn · text · unique
- total_vcpu / ram / disk · int · int · int
- state · healthy / draining / retired
- last_heartbeat_at · timestamptz

workspaces · main entity
- ◆ id · uuid
- external_workspace_id · text
- external_user_id · text
- display_name · text
- → region_id / host_id · regions / hosts
- flavor · hobby · pro · team · custom
- vcpu / ram_gb / disk_gb · envelope
- state · active · suspended · archived · deleted
- → current_operation_id · operations (deferred)
- snapshot_interval_seconds · default 86400
- snapshot_retention_count · default 30

workspace_domains · edge routing
- ◆ id · uuid
- → workspace_id · workspaces
- domain · text · unique
- kind · default_subdomain / custom
- cert_state · pending · issued · failed · expired
- verified_at / cert_issued_at · timestamptz

operations · transitions
- ◆ id · uuid
- → workspace_id · workspaces
- verb · suspend · archive · restore · delete
- request_id · UNIQUE(ws, req)
- status · pending · running · ok · failed · rolled_back
- step_state · jsonb · resumable
- river_job_id · bigint
- requested/started/completed_at · timestamptz

snapshots · object store
- ◆ id · uuid
- → workspace_id · workspaces
- object_uri · text
- size_bytes / checksum · bigint · text
- tool · kopia · restic · qemu-img
- kind · periodic · pre_archive
- verified_at · once checksum verified

audit_log · append-only · survives delete
- ◆ id · bigserial
- workspace_id · nullable
- former_workspace_id · uuid · set after delete
- event_type · e.g. transition.archive.succeeded
- event_data · jsonb · no PII
- actor · system · api:<caller> · admin:<user>

Legend · ◆ primary key · → foreign key · PII fields scrubbed atomically on delete.
Delete means gone. Verified.
When your team calls Delete, we do not mark a column and move on. We verify every step and only report success at the end. This is the GDPR artefact — if a regulator asks, this is what we show them.
Scope of this obligation. Account-deletion decision is frontpage's (whenever the customer clicks the button or legal retention expires). Execution — the actual wipe — is ours. GDPR completeness is our risk to manage.
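The verify-then-report discipline can be sketched as a small pipeline. The step names here are assumptions based on the state table (disk wiped, snapshots purged, row pseudonymized); the invariant is the real content, success is only reported once every step has independently verified:

```python
# Hypothetical delete pipeline: each step runs, then verifies, and success is only
# reported when every verification has passed. Step names are illustrative.
def run_delete(steps):
    """steps: list of (name, do, verify) callables. Returns the audit trail."""
    trail = []
    for name, do, verify in steps:
        do()
        if not verify():
            trail.append((name, "failed_verification"))
            raise RuntimeError(f"delete step {name!r} did not verify; not reporting success")
        trail.append((name, "verified"))
    return trail  # only reached when everything verified: the artefact a regulator sees
```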
RPO promise: 24 hours.
Snapshots run every 24 hours by default. If a bare-metal host fails during active use, the customer can lose up to 24 hours of work. Enterprise customers can buy tighter cadence as part of a custom contract — default should stay at 24h to control storage cost.
Default snapshot cadence. Incremental, backed by a periodic full, retained for the archive-retention window. Product should surface this in the Terms of Service and on higher-tier plan comparisons — tighter RPO is a paid upgrade axis.
Scheduling note: host capacity is derived from live workspace assignments, not stored as a counter — so there's no bookkeeping that can drift out of sync with reality. Below is the actual query we run to place a new workspace. Small fleet, cheap query, zero drift risk.
    -- find a host in region $1 with room for $2 vcpu / $3 ram / $4 disk
    SELECT h.id
    FROM hosts h
    WHERE h.region_id = $1
      AND h.state = 'healthy'
      AND h.total_vcpu - COALESCE((SELECT SUM(w.vcpu) FROM workspaces w
            WHERE w.host_id = h.id AND w.state IN ('active','suspended')), 0) >= $2
      AND h.total_ram_gb - COALESCE((SELECT SUM(w.ram_gb) FROM workspaces w
            WHERE w.host_id = h.id AND w.state IN ('active','suspended')), 0) >= $3
      AND h.total_disk_gb - COALESCE((SELECT SUM(w.disk_gb) FROM workspaces w
            WHERE w.host_id = h.id AND w.state IN ('active','suspended')), 0) >= $4
    FOR UPDATE SKIP LOCKED
    LIMIT 1;
FOR UPDATE SKIP LOCKED lets concurrent schedulers pick different hosts. The workspace INSERT in the same transaction reserves the slot.
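For readers who don't speak SQL, here is the same logic as a pure-Python sketch, with illustrative field names. The point it mirrors: free capacity is always derived from live workspace rows, never read from a stored counter that could drift:

```python
# Pure-Python mirror of the placement query's logic, for illustration only.
# hosts / workspaces are lists of dicts standing in for the two tables.
def pick_host(hosts, workspaces, region, need_vcpu, need_ram, need_disk):
    for h in hosts:
        if h["region"] != region or h["state"] != "healthy":
            continue
        # active and suspended workspaces both hold their slot; archived ones don't
        live = [w for w in workspaces
                if w["host_id"] == h["id"] and w["state"] in ("active", "suspended")]
        used = lambda k: sum(w[k] for w in live)
        if (h["total_vcpu"] - used("vcpu") >= need_vcpu
                and h["total_ram_gb"] - used("ram_gb") >= need_ram
                and h["total_disk_gb"] - used("disk_gb") >= need_disk):
            return h["id"]
    return None  # region at capacity: surfaces as RESOURCE_EXHAUSTED upstream
```

What Python cannot mirror is the concurrency part: the real query's row lock plus the workspace INSERT in the same transaction is what makes two schedulers unable to double-book a host.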
What this is, technically.
For the engineering folks on your team. The surface your backend talks to, the errors it sees, and the telemetry we publish so ops can integrate without asking us for a read-only DB.
gRPC to backends, not browsers.
The control-plane API is not a public API. Three consumers: your frontpage server, our admin tooling, and the host agents. No browser clients, no third-party integrations — those stay on top of whatever you build on the frontpage. This lets us keep one schema/tooling stack for every inbound call, which simplifies the security story considerably.
For your backend team: every mutating call returns an Operation handle immediately. You poll until it's terminal — no wait parameters, no dual sync/async modes. Integrations against this API are, by design, one code path.
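The single code path looks roughly like this on the client side. The method name `get_operation` and the status values are assumptions matching the operations table; substitute whatever your generated stub exposes:

```python
import time

TERMINAL = {"ok", "failed", "rolled_back"}

def wait_for(client, operation_id, poll_seconds=2.0, timeout=3600):
    """Poll a hypothetical GetOperation endpoint until the operation is terminal.
    `client` is whatever stub your backend generates; names are illustrative."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        op = client.get_operation(operation_id)
        if op["status"] in TERMINAL:
            return op
        time.sleep(poll_seconds)
    raise TimeoutError(f"operation {operation_id} not terminal after {timeout}s")
```

For slow verbs like restore-from-archive, the timeout and poll interval should match the minutes-scale durations promised earlier, not seconds.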
Your client only branches on reasons.
Every error from us carries a machine-readable reason. The set is closed and versioned — new reasons are additive; existing reasons never change meaning. Your frontend code never needs a fix because we renamed something. This matters most when you're writing customer-facing error copy on top.
One category your product team should plan for: RESOURCE_EXHAUSTED means "we're at capacity in that region". You need UX for "we can't provision you a workspace right now, try a different region or contact us" — this is not impossible, just uncommon enough that it gets forgotten in mock-ups.
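A sketch of what "branch on reasons" means in client code. RESOURCE_EXHAUSTED is from this deck; the other reason name and all the copy are invented placeholders, and the fallback branch is what makes additive new reasons safe:

```python
# Illustrative client-side reason handling. Only RESOURCE_EXHAUSTED is a reason
# named in this briefing; the rest is placeholder.
CUSTOMER_COPY = {
    "RESOURCE_EXHAUSTED": ("We can't provision a workspace in this region right now. "
                           "Try another region or contact us."),
    "OPERATION_IN_FLIGHT": "Another change is still applying. Please retry shortly.",
}

def copy_for(reason: str) -> str:
    # The reason set is closed but additive: unknown values must degrade gracefully.
    return CUSTOMER_COPY.get(reason, "Something went wrong. Our team has been notified.")
```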
Ops integration out of the box.
Three separate layers SRE will care about — Prometheus metrics, OpenTelemetry spans, and an append-only audit log. Nothing custom to wire up.
Metrics · Prometheus :9090/metrics
Tracing · OpenTelemetry
Every API call gets a span; span context propagates into the job runner, so one trace covers the full transition — API call → job start → step execution → completion. Useful when your team asks "why did this customer's archive take 28 minutes?" — SRE can answer in minutes, not hours.
Audit log · survives account deletion
Every transition writes an append-only record. On customer deletion, PII is scrubbed, but the opaque record stays — so we can reconstruct what happened for incident post-mortems without GDPR risk. Legal/compliance should review the audit schema; it's the artefact they'll lean on in any dispute.
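Mechanically, the scrub follows the audit_log schema above: the live workspace_id reference is nulled, former_workspace_id keeps the lineage, and the event itself is never touched. A minimal sketch, assuming rows as plain dicts:

```python
# Sketch of the append-only audit contract: records are never deleted; on account
# deletion the workspace reference is pseudonymized, not the history.
def scrub_on_delete(audit_rows, workspace_id):
    for row in audit_rows:
        if row.get("workspace_id") == workspace_id:
            row["former_workspace_id"] = workspace_id  # keep lineage for post-mortems
            row["workspace_id"] = None                 # drop the live reference
    return audit_rows
```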
Where we run — and what's still open.
Where our VMs physically live, how we secure the link, how we survive disconnects — and the short list of decisions we still need from you or from prototyping.
Bare-metal, any cloud, through NAT.
Each bare-metal host in our fleet runs a small daemon that dials the control plane and keeps a long-lived session open. Hosts never accept inbound connections from us — they're free to sit in any data center, any cloud, behind any firewall. This is what makes "run our platform on any provider" actually feasible. Hetzner today, OVH tomorrow, on-prem next year — no networking change to the control plane.
Commands →
Control plane → agent dispatch. Ack confirms receipt and validation, not execution. command_id = operations.id.
Events ←
Agent → control plane telemetry. Monotonic seq per session, high-water-mark acks, durable replay buffer on reconnect.
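The seq / high-water-mark / replay mechanics above fit in a few lines. A minimal agent-side sketch (class and method names are illustrative):

```python
from collections import deque

class EventBuffer:
    """Events carry a per-session monotonic seq; the control plane acks a
    high-water mark, and everything above it is replayed on reconnect."""

    def __init__(self):
        self._seq = 0
        self._pending = deque()   # (seq, event), in emit order

    def emit(self, event):
        self._seq += 1
        self._pending.append((self._seq, event))
        return self._seq

    def ack(self, high_water_mark):
        # Everything at or below the mark is durably received: drop it.
        while self._pending and self._pending[0][0] <= high_water_mark:
            self._pending.popleft()

    def replay(self):
        # On reconnect, resend every unacked event, oldest first.
        return list(self._pending)
```

A single high-water mark is cheaper than per-event acks and is what makes the durable replay buffer bounded in the common case.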
Mutual TLS. From day one.
Agents are long-lived, autonomous, and run on hardware we don't always directly control. Bearer tokens are not enough — we use mutual TLS, with our own internal CA issuing per-host certificates bound to a host id and region. Tokens are fine for the frontpage-to-infra call; they're not fine for the infra-to-fleet link.
What this buys us: even if a stolen bearer token leaks, it can't impersonate a host. Even if a host physically disappears (stolen, re-imaged, etc.), retiring it via the admin API cuts it off the fleet immediately — no waiting for cert expiry.
Section 14 originally detailed the command/event stream design (nine command verbs like ProvisionVM, TakeSnapshot; five event types; independent back-pressure on a single mTLS connection). That engineering detail lives in the source spec — it's implementation work the infra team owns. Above is the posture summary.
What we need from you.
Everything above is approved design, pre-implementation. Three categories of open work — where we need decisions, where we accept risk, and what we'll ship first.
Recovery · how we survive disconnects
When a host's connection to the control plane drops and comes back, neither side trusts its state memory. The agent sends an inventory of everything it has on disk; we compare to our database and reconcile differences. Expected but missing → provision again. Orphans (things on the host we don't know about) → alert an operator, never auto-destroy. Stuck in-flight ops get marked failed so upstream can retry. This is what lets us promise that a network blip never leaves a customer's workspace in an inconsistent limbo.
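The three reconciliation rules above reduce to set arithmetic. A minimal sketch, with plain sets standing in for the agent's inventory and our database view:

```python
# Sketch of reconnect reconciliation. Real agents send richer inventory than ids;
# the three rules are the content: re-provision missing, flag orphans (never
# auto-destroy), fail stuck in-flight ops so upstream can retry.
def reconcile(db_expected, agent_inventory, in_flight_ops):
    """db_expected / agent_inventory: sets of workspace ids on this host."""
    return {
        "provision":    sorted(db_expected - agent_inventory),    # expected but missing
        "alert_orphan": sorted(agent_inventory - db_expected),    # operator decides
        "mark_failed":  sorted(in_flight_ops),                    # safe: verbs are retryable
    }
```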
Deferred decisions · needed before build
Explicitly out of scope · needed eventually
Your asks · what we need from the application team
Specs approved. Everything above is a designed commitment. Implementation begins next; first end-to-end demo of suspend / archive / restore is the near-term milestone.