OBI · ESTABLISHING LINK
briefing · application VP · 2026-04-19

An operator's briefing.

What we approved this week, and what it means for your team. Each paying customer gets a private VM; openberth runs their apps inside. This deck: what your customers experience, what we promise, and where we still need your call.

4 specs approved · 15 slides · 12 min read · status: approved, pre-implementation
Chapter I · what your customers experience

The customer journey.

covers 01 → 06 · tiers · states · timings · pages · source: lifecycle spec, 2026-04-18

The shapes of what you can sell, what the customer lives through, and what they see when something changes — billed or otherwise.

01 · Flavors
what product can sell

Four tiers, plus custom for sales.

Three self-serve sizes and an enterprise escape hatch. App counts below are marketing approximations, not contracts — we enforce the envelope (vCPU/RAM/disk); how many apps fit is the customer's choice.

Tier     vCPU   RAM      Disk     ~apps   ~static
Hobby    2      4 GB     25 GB    5       40
Pro      4      8 GB     50 GB    10      100
Team     8      20 GB    100 GB   25      300
Custom   neg.   neg.     neg.     n/a     n/a      (via sales)

Two things product should know. First: trial isn't a flavor — it's a Hobby VM with upstream metadata; the trial-vs-paying distinction lives entirely in the frontpage, and infra never sees billing state. Second: these specific numbers can change (Hobby might become 1 vCPU) without re-opening the design — the shape is settled; the values are tunable.

02 · Customer states
the customer's workspace, across its life

Four states. That's all.

One paying customer = one workspace = one VM = four possible life states. The third column below is what product needs to know: when a workspace is active, infra doesn't care if the customer is trialing, paying, or a month overdue — that's the frontpage's concern. We only see "run it" or "take it offline".

active · persistent
suspended · persistent
archived · persistent
deleted · terminal
state       vm            local disk    snapshot                    url response
active      running       attached      periodic 24h                openberth serves
suspended   powered off   retained      last periodic               edge: suspended page
archived    destroyed     wiped         fresh, in object storage    edge: archived page
deleted     gone          gone          purged                      404 / tombstone
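The states above and the four lifecycle verbs (suspend, archive, restore, delete) fit in a few lines; a minimal sketch in Python — the state and verb names are from the spec, while the dict encoding and the delete-from-any-non-terminal-state assumption are ours:

```python
# Sketch of the four-state lifecycle. State and verb names come from the
# spec; the dict encoding (and the assumption that delete is legal from
# any non-terminal state) is ours, not control-plane code.
TRANSITIONS = {
    ("active",    "suspend"): "suspended",
    ("suspended", "archive"): "archived",
    ("suspended", "restore"): "active",
    ("archived",  "restore"): "active",
    ("active",    "delete"):  "deleted",
    ("suspended", "delete"):  "deleted",
    ("archived",  "delete"):  "deleted",
}

def apply(state: str, verb: str) -> str:
    """Return the next state, or raise on an illegal transition
    (the API surfaces that case as FAILED_PRECONDITION)."""
    try:
        return TRANSITIONS[(state, verb)]
    except KeyError:
        raise ValueError(f"FAILED_PRECONDITION: cannot {verb} a {state} workspace")
```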
03 · The instruction set
what frontpage sends us

Four verbs. Safe to retry.

This is the entire transition vocabulary between frontpage and infra. Every call carries a client-supplied request_id; sending the same call twice is safe — we return the same result, never duplicate work. Conflicting in-flight operations get a clear 409 your client can retry on.

POST /workspaces/{id}:suspend
POST /workspaces/{id}:archive
POST /workspaces/{id}:restore
POST /workspaces/{id}:delete
GET  /workspaces/{id}  → state · transient-op flag · ETA
i
archive is snapshot-then-wipe. Snapshot uploaded and checksum-verified before local disk is wiped. Mid-flight failure rolls back to suspended; partial artifacts cleaned up.
ii
delete verifies purge completeness. Not marked deleted until VM gone, disk wiped, snapshots purged, row pseudonymized, audit-log entry written.
iii
restore is scheduled. Picks a same-region host with capacity, pulls snapshot if from archived, boots, waits for healthcheck. Failure returns to prior state — no transient limbo.
iv
409 on conflict. Caller retries after the transient op completes. Infra never silently blocks.
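The calling convention these notes imply can be sketched as follows — one request_id per logical operation, generated once and resent verbatim on 409. `post`, the payload shape, and the backoff parameters are illustrative, not the actual client API:

```python
import time
import uuid

def call_with_idempotency(post, url, max_attempts=5, base_delay=1.0):
    """Generate request_id once; resend it verbatim on every retry so the
    control plane can deduplicate. `post` is a stand-in for your HTTP/gRPC
    client; names and payload shape are illustrative."""
    request_id = str(uuid.uuid4())   # created once, reused across retries
    delay = base_delay
    for _ in range(max_attempts):
        status, body = post(url, {"request_id": request_id})
        if status == 409:            # conflicting in-flight op: wait, then retry
            time.sleep(delay)
            delay = min(delay * 2, 30.0)
            continue
        return status, body          # same request_id, same result, no duplicate work
    raise TimeoutError("operation still conflicting after retries")
```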
04 · SLA timings
numbers for product copy

Archive is minutes, not instant.

These durations are the contract: "Suspending…" finishes in seconds, "Restoring from archive…" needs to read minutes. Two implications: product messaging must match these ranges, and the frontpage grace-period timers must account for them — don't schedule an archive for 30s before a billing retry fires.

suspend                 10s · max 2m
archive                 5–30m
restore ← suspended     20s · max 2m
restore ← archived      5–30m
delete                  1m · max 10m
30m

The worst case for archive and restore-from-archive. Don't show customers a spinner that implies seconds. The recommended UX is a confirmation screen with an ETA, then an email / notification when the workspace is back.

05 · When payment lapses
suspended vs archived

Two off-states. Very different costs.

The customer-visible difference is how quickly we can bring them back when they update their card. Product should map these to your grace-period tiers.

i
Suspended. VM powered off, disk kept on the host. Revive in seconds. More expensive per-idle-workspace because the host slot stays occupied. Right for "card expired, probably updating tomorrow".
ii
Archived. VM destroyed, snapshot in object storage. Revive in minutes. Cheap at rest, but restoration needs a full rehydrate. Right for "haven't heard from them in weeks, want to keep data just in case".
iii
Your call when each kicks in. Infra only does the transition when frontpage calls the API. We strongly suggest suspended → archived after 7–14 days of non-payment, then archived → deleted at whatever retention window legal wants.
06 · Edge state pages
what the customer's url shows

Three pages you own the copy for.

When a workspace isn't running, the customer's URL doesn't 500 — our edge answers directly with a state page. These are three surfaces product/marketing should own the wording and branding of. Each should tell the customer exactly what to do next (pay, restore, contact support).

suspended
edge → static page
tap down · disk retained
archived
edge → static page
slot freed · snapshot in S3
deleted
edge → 404 tombstone
row pseudonymized
Chapter II · what we commit to

Our obligations.

covers 07 → 09 · ownership · gdpr · durability · source: lifecycle & schema specs

The boundary between your team and ours, the legal obligation we carry on deletion, and how much data a customer can lose if a host fails.

07 · Who owns what
the boundary · read this twice

Your team owns when. Ours owns how.

This is the single most important line in the whole design. The frontpage decides when each transition should happen (grace periods, trial expiries, promo extensions — all yours). Infra decides how it happens (atomicity, ordering, verification, rollback). We don't run billing clocks; you don't run VMs.

You decide when. Grace periods, trial timers, retention windows, promotional extensions. The moment you decide "archive this", you call our API.
We decide how. Atomicity, ordering, checksum verification, audit trail, rollback on failure. You get a 200 only when it's durably complete.
Infra never learns why. We don't see "payment_failed" — we just see "archive workspace X". This protects us from scope creep and protects you from needing infra changes every time billing logic evolves.
!
Billing records aren't in our schema. We keep an operational audit log (for incident post-mortems); revenue recognition, invoices, Stripe artifacts all live in your systems. On customer deletion, our PII is scrubbed; audit entries survive pseudonymized.

Below: the relational model that backs these commitments. Seven tables, one Postgres. Operators call this the "boring-on-purpose" design — every stateful service we have to run at 3am is a liability.

regions · catalog

  • id (pk): text · "us-west-1"
  • display_name: text
  • state: active / retiring
  • created_at: timestamptz

hosts · bare-metal

  • id (pk): uuid
  • region_id (fk → regions)
  • fqdn: text · unique
  • total_vcpu / total_ram / total_disk: int
  • state: healthy / draining / retired
  • last_heartbeat_at: timestamptz

workspaces · main entity

  • id (pk): uuid
  • external_workspace_id: text (pii)
  • external_user_id: text (pii)
  • display_name: text (pii)
  • region_id / host_id (fk → regions / hosts)
  • flavor: hobby · pro · team · custom
  • vcpu / ram_gb / disk_gb: the enforced envelope
  • state: active · suspended · archived · deleted
  • current_operation_id (fk → operations, deferred)
  • snapshot_interval_seconds: default 86400
  • snapshot_retention_count: default 30

workspace_domains · edge routing

  • id (pk): uuid
  • workspace_id (fk → workspaces)
  • domain: text · unique
  • kind: default_subdomain / custom
  • cert_state: pending · issued · failed · expired
  • verified_at / cert_issued_at: timestamptz

operations · transitions

  • id (pk): uuid
  • workspace_id (fk → workspaces)
  • verb: suspend · archive · restore · delete
  • request_id: UNIQUE(workspace_id, request_id)
  • status: pending · running · ok · failed · rolled_back
  • step_state: jsonb · resumable
  • river_job_id: bigint
  • requested_at / started_at / completed_at: timestamptz

snapshots · object store

  • id (pk): uuid
  • workspace_id (fk → workspaces)
  • object_uri: text
  • size_bytes / checksum: bigint · text
  • tool: kopia · restic · qemu-img
  • kind: periodic · pre_archive
  • verified_at: set once checksum verified

audit_log · append-only · survives delete

  • id (pk): bigserial
  • workspace_id (fk, nullable)
  • former_workspace_id: uuid · set after delete
  • event_type: e.g. transition.archive.succeeded
  • event_data: jsonb · no PII
  • actor: system · api:<caller> · admin:<user>

Legend · (pk) primary key · (fk) foreign key · (pii) scrubbed atomically on delete

08 · GDPR deletion
our legal obligation

Delete means gone. Verified.

When your team calls Delete, we do not mark a column and move on. We verify every step and only report success at the end. This is the GDPR artefact — if a regulator asks, this is what we show them.

1
VM stopped and destroyed. The bare-metal host unmounts, secure-wipes the disk, and the entry in the scheduler is freed.
2
All snapshots purged. Every copy in object storage is deleted and the delete is confirmed by a list-then-verify pass.
3
PII scrubbed atomically. In a single database statement, we set state to deleted and NULL the three PII fields: external_workspace_id, external_user_id, display_name. A unit test ensures any new PII column can't ship without being added to the scrub.
4
Audit entry written. An append-only record that we deleted the workspace survives — with no PII linkage, just an opaque former-id for forensics.
Only then does the API return 200. If any step fails, we don't mark deleted. Your team sees the operation stuck "running" until we fix it; no false completion.
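The schema guard mentioned in step 3 can be sketched as follows — the three column names are from the spec; the real test would introspect the live schema rather than hard-code the lists:

```python
# The three PII columns named in step 3. In the real guard this set would
# be introspected from the schema, not hard-coded.
PII_COLUMNS = {"external_workspace_id", "external_user_id", "display_name"}

def scrub_statement() -> str:
    """Single-statement scrub: flip state and NULL the PII columns in one
    atomic UPDATE. (Parameterize the id in real code; this is a sketch.)"""
    sets = ", ".join(f"{c} = NULL" for c in sorted(PII_COLUMNS))
    return f"UPDATE workspaces SET state = 'deleted', {sets} WHERE id = $1"

def scrub_covers_declared_pii(declared: set, scrubbed: set) -> bool:
    """The unit-test idea from step 3: a newly declared PII column that is
    not in the scrub set fails the build."""
    return declared == scrubbed
```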

Scope of this obligation. Account-deletion decision is frontpage's (whenever the customer clicks the button or legal retention expires). Execution — the actual wipe — is ours. GDPR completeness is our risk to manage.

09 · Data durability
what a customer can lose

RPO promise: 24 hours.

Snapshots run every 24 hours by default. If a bare-metal host fails during active use, the customer can lose up to 24 hours of work. Enterprise customers can buy tighter cadence as part of a custom contract — default should stay at 24h to control storage cost.

24h

Default snapshot cadence. Incremental, backed by a periodic full, retained for the archive-retention window. Product should surface this in the Terms of Service and on higher-tier plan comparisons — tighter RPO is a paid upgrade axis.

Scheduling note: host capacity is derived from live workspace assignments, not stored as a counter — so there's no bookkeeping that can drift out of sync with reality. Below is the actual query we run to place a new workspace. Small fleet, cheap query, zero drift risk.

-- find a host in region $1 with room for $2 vcpu / $3 ram / $4 disk
SELECT h.id
FROM hosts h
WHERE h.region_id = $1
  AND h.state = 'healthy'
  AND h.total_vcpu    - COALESCE((SELECT SUM(w.vcpu)    FROM workspaces w WHERE w.host_id = h.id AND w.state IN ('active','suspended')), 0) >= $2
  AND h.total_ram_gb  - COALESCE((SELECT SUM(w.ram_gb)  FROM workspaces w WHERE w.host_id = h.id AND w.state IN ('active','suspended')), 0) >= $3
  AND h.total_disk_gb - COALESCE((SELECT SUM(w.disk_gb) FROM workspaces w WHERE w.host_id = h.id AND w.state IN ('active','suspended')), 0) >= $4
FOR UPDATE SKIP LOCKED
LIMIT 1;

FOR UPDATE SKIP LOCKED lets concurrent schedulers pick different hosts. The workspace INSERT in the same transaction reserves the slot.

Chapter III · technical posture

What this is, technically.

covers 10 → 12 · backend api · error surface · observability · source: control-plane api spec

For the engineering folks on your team. The surface your backend talks to, the errors it sees, and the telemetry we publish so ops can integrate without asking us for a read-only DB.

10 · Backend contract
backend-to-backend only

gRPC to backends, not browsers.

The control-plane API is not a public API. Three consumers: your frontpage server, our admin tooling, and the host agents. No browser clients, no third-party integrations — those stay on top of whatever you build on the frontpage. This lets us keep one schema/tooling stack for every inbound call, which simplifies the security story considerably.

auth: bearer tokens over TLS · upgrade to mTLS later
tokens: 2–3 at MVP · frontpage · admin · CI
perimeter: allowed subnets first, tokens second
rotation: manual at MVP, automate if it gets painful

For your backend team: every mutating call returns an Operation handle immediately. You poll until it's terminal — no wait parameters, no dual sync/async modes. Integrations against this API are, by design, one code path.
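That one code path is a short loop; a minimal sketch — terminal statuses come from the operations schema, while `get_status`, the 2s interval, and the 30m timeout are assumptions, not the actual client API:

```python
import time

# Terminal operation statuses, per the operations table in section 07.
TERMINAL = {"ok", "failed", "rolled_back"}

def wait_for_operation(get_status, op_id, poll_interval=2.0, timeout=1800.0):
    """Poll until the operation reaches a terminal status. `get_status`
    stands in for your client; interval and timeout are assumptions."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(op_id)
        if status in TERMINAL:
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"operation {op_id} not terminal after {timeout}s")
```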

11 · Stable errors
a closed, versioned contract

Your client only branches on reasons.

Every error from us carries a machine-readable reason. The set is closed and versioned — new reasons are additive; existing reasons never change meaning. Your frontend code never needs a fix because we renamed something. This matters most when you're writing customer-facing error copy on top.

NOT_FOUND
Referenced workspace / operation / host does not exist.
ALREADY_EXISTS
Idempotent resubmit with divergent payload, or external_workspace_id collision.
FAILED_PRECONDITION
Illegal state transition (e.g. Restore on active).
ABORTED
Conflicting in-flight op. Caller retries after terminal.
RESOURCE_EXHAUSTED
No host capacity in region for requested flavor.
INVALID_ARGUMENT
Malformed flavor / domain / region.
UNAUTHENTICATED
Missing or malformed bearer token.
PERMISSION_DENIED
Token scope insufficient for the RPC.
UNAVAILABLE
Temporary server issue; retry with backoff.
INTERNAL
Programming bug; not retryable.

One category your product team should plan for: RESOURCE_EXHAUSTED means "we're at capacity in that region". You need UX for "we can't provision you a workspace right now, try a different region or contact us" — this is not impossible, just uncommon enough that it gets forgotten in mock-ups.
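A sketch of the reason-branching a client might do — the reason strings are the documented closed set, but the grouping into retryable vs. user-actionable is our reading of the descriptions above, not part of the contract:

```python
# Reason strings are the documented closed set; this grouping is our
# reading of the descriptions, not part of the versioned contract.
RETRY_WITH_BACKOFF = {"UNAVAILABLE"}
RETRY_AFTER_CONFLICT = {"ABORTED"}
USER_ACTIONABLE = {"RESOURCE_EXHAUSTED", "FAILED_PRECONDITION", "INVALID_ARGUMENT"}

def handle(reason: str) -> str:
    if reason in RETRY_WITH_BACKOFF:
        return "retry with backoff"
    if reason in RETRY_AFTER_CONFLICT:
        return "retry after current operation completes"
    if reason == "RESOURCE_EXHAUSTED":       # needs its own UX, per the note above
        return "show 'no capacity in region' UX"
    if reason in USER_ACTIONABLE:
        return "surface to caller"
    return "do not retry; page an operator"  # NOT_FOUND, INTERNAL, auth errors
```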

12 · Observability
metrics · tracing · audit

Ops integration out of the box.

Three separate layers SRE will care about — Prometheus metrics, OpenTelemetry spans, and an append-only audit log. Nothing custom to wire up.

Metrics · Prometheus :9090/metrics

  • grpc_requests_total · counts by RPC and terminal code
  • grpc_request_duration · latency histograms
  • workspaces_total · gauge by state, region, flavor
  • host_capacity_used_ratio · derived utilization
  • operation_duration · by verb and terminal status
  • river_jobs · queued · in_progress · retries

Tracing · OpenTelemetry

Every API call gets a span; span context propagates into the job runner, so one trace covers the full transition — API call → job start → step execution → completion. Useful when your team asks "why did this customer's archive take 28 minutes?" — SRE can answer in minutes, not hours.

Audit log · survives account deletion

Every transition writes an append-only record. On customer deletion, PII is scrubbed, but the opaque record stays — so we can reconstruct what happened for incident post-mortems without GDPR risk. Legal/compliance should review the audit schema; it's the artefact they'll lean on in any dispute.

Chapter IV · the fleet & what's open

Where we run — and what's still open.

covers 13 → 15 · agents · security · recovery · asks · source: host-agent spec + open questions

Where our VMs physically live, how we secure the link, how we survive disconnects — and the short list of decisions we still need from you or from prototyping.

13 · Agents anywhere
the fleet can live on any provider

Bare-metal, any cloud, through NAT.

Each bare-metal host in our fleet runs a small daemon that dials the control plane and keeps a long-lived session open. Hosts never accept inbound connections from us — they're free to sit in any data center, any cloud, behind any firewall. This is what makes "run our platform on any provider" actually feasible. Hetzner today, OVH tomorrow, on-prem next year — no networking change to the control plane.

control plane · WorkspaceService · mTLS
agent CA · 90-day certs · revocation list
agent a → host-01 · us-west-1
agent b → host-02 · eu-central-1
agent c → host-03 · us-east-1

Commands

Control plane → agent dispatch. Ack confirms receipt and validation, not execution. A command's command_id equals operations.id.

Events

Agent → control plane telemetry. Monotonic seq per session, high-water-mark acks, durable replay buffer on reconnect.

mTLS: from day one · bootstrap token + enrollment URL
cert: CN=host_id · OU=region_id · rotate at 60 days · 7-day overlap
heartbeat: 10s · stale @ 30s
reconnect: exponential backoff, cap 30s
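The heartbeat and reconnect numbers can be made concrete in a few lines; a deterministic sketch (the spec states only the 30s cap and the 30s staleness threshold, so the base delay and attempt count are assumptions):

```python
def reconnect_delays(base=1.0, cap=30.0, attempts=8):
    """Exponential-backoff schedule capped at 30s, per the card above.
    Deterministic sketch; base delay and attempt count are assumptions."""
    delay, out = base, []
    for _ in range(attempts):
        out.append(min(delay, cap))
        delay *= 2
    return out

def is_stale(last_heartbeat_at: float, now: float, threshold: float = 30.0) -> bool:
    """Heartbeats arrive every 10s; a host counts as stale at 30s."""
    return now - last_heartbeat_at > threshold
```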
14 · Security posture
how we trust our own fleet

Mutual TLS. From day one.

Agents are long-lived, autonomous, and run on hardware we don't always directly control. Bearer tokens are not enough — we use mutual TLS, with our own internal CA issuing per-host certificates bound to a host id and region. Tokens are fine for the frontpage-to-infra call; they're not fine for the infra-to-fleet link.

enrollment: bootstrap token + CSR on first run
cert lifetime: 90 days · rotates at 60 days
revocation: retire a host and its cert is invalid immediately
identity: CN=host_id · OU=region_id
ca: separate from anything public-facing

What this buys us: even if a stolen bearer token leaks, it can't impersonate a host. Even if a host physically disappears (stolen, re-imaged, etc.), retiring it via the admin API cuts it off the fleet immediately — no waiting for cert expiry.

This section originally detailed the command/event stream design (nine command verbs like ProvisionVM, TakeSnapshot; five event types; independent back-pressure on a single mTLS connection). That engineering detail lives in the source spec — it's implementation work the infra team owns. Above is the posture summary.

15 · Recovery & gaps
final slide

What we need from you.

Everything above is approved design, pre-implementation. Three categories of open work — where we need decisions, where we accept risk, and what we'll ship first.

Recovery · how we survive disconnects

When a host's connection to the control plane drops and comes back, neither side trusts its state memory. The agent sends an inventory of everything it has on disk; we compare to our database and reconcile differences. Expected but missing → provision again. Orphans (things on the host we don't know about) → alert an operator, never auto-destroy. Stuck in-flight ops get marked failed so upstream can retry. This is what lets us promise that a network blip never leaves a customer's workspace in an inconsistent limbo.
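The reconcile pass reduces to a set difference; a sketch (function and key names are illustrative, not agent code):

```python
def reconcile(expected: set, reported: set) -> dict:
    """Compare the control plane's expected workspaces for a host against
    the agent's on-disk inventory after a reconnect. Key names are
    illustrative."""
    return {
        "provision": sorted(expected - reported),       # expected but missing
        "alert_operator": sorted(reported - expected),  # orphans: never auto-destroy
    }
```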

Deferred decisions · needed before build

?
Hypervisor choice. KVM/QEMU vs. Cloud Hypervisor. Decided during prototyping — doesn't affect any API shape or the contract with your team.
?
Snapshot tool. kopia / restic / qemu-img. The tool column in snapshots lets us change this fleet-wide without downtime. No decision needed from your side.
?
Hypervisor + snapshot combination. Will be settled by prototype benchmarks. ETA: 2–3 weeks into implementation.

Explicitly out of scope · needed eventually

×
Live migration between hosts. Not in v1. Our "migration" is an archive + restore (minutes of customer-visible downtime). Acceptable for Hobby/Pro; Team+Custom customers may want this eventually.
×
Networking spec — bandwidth shaping, IP allocation, per-app quota — is a separate design still to be written. Networking constraints that product wants to sell should be raised now.
×
Capacity auto-rebalancing. If a host gets hot, we don't automatically migrate workspaces off it. Operator does it manually. Fine at early scale; revisit when fleet size demands it.
×
Workspace creation, resize, list APIs. Not in the lifecycle spec — covered by separate specs we'll circulate before build.

Your asks · what we need from the application team

Confirm the grace-period policy. How many days from card-decline → suspended? Suspended → archived? Archived → deleted? These drive product copy and retention language; we need the numbers.
Own the state-page copy. Three edge pages (suspended / archived / deleted) — wording, branding, what call-to-action each shows. We'll ship placeholders; we need the real thing before GA.
Define resize paths. Can a customer self-serve upgrade from Hobby → Pro? Is it always in-place, or are we OK with archive-restore if the region is tight?
Legal review of the audit-log schema. Compliance should confirm what we keep after deletion meets retention/minimisation requirements in each jurisdiction we sell in.
Tighter-RPO product story. Is 24h data-loss window acceptable to sell at any tier, or should Team and above offer 1h/6h? This is a paid upgrade axis we could open.
4 specs approved.

Everything above is a designed commitment. Implementation begins next; the first end-to-end demo of suspend / archive / restore is the near-term milestone.