An operator's briefing.
What we approved this week, and what it means for your team. Each paying customer gets a private VM; openberth runs their apps inside. This deck: what your customers experience, what we promise, and where we still need your call.
The customer journey.
The shapes of what you can sell, what the customer lives through, and what they see when something changes — billed or otherwise.
Four tiers, plus custom for sales.
Three self-serve sizes and an enterprise escape hatch. App counts below are marketing approximations, not contracts — we enforce the envelope (vCPU/RAM/disk); how many apps fit is the customer's choice.
Two things product should know. Trial isn't a flavor — it's a Hobby VM with upstream metadata. Trial-vs-paying distinction lives entirely in the frontpage; infra never sees billing state. Second: these specific numbers can change (Hobby might become 1 vCPU) without re-opening the design — the shape is settled, the values are tunable.
Four states. That's all.
One paying customer = one workspace = one VM = four possible life states. The third column below is what product needs to know: when a workspace is active, infra doesn't care if the customer is trialing, paying, or a month overdue — that's the frontpage's concern. We only see "run it" or "take it offline".
| state | vm | local disk | snapshot | url response |
|---|---|---|---|---|
| active | running | attached | periodic 24h | openberth serves |
| suspended | powered off | retained | last periodic | edge: suspended page |
| archived | destroyed | wiped | fresh, object storage | edge: archived page |
| deleted | gone | gone | purged | 404 / tombstone |
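The four states compose with a small set of verbs into a transition map. A minimal sketch, assuming the allowed transitions implied by the table (the exact matrix is the spec's to confirm, e.g. whether delete is legal from every state):

```python
# Hypothetical workspace state machine. The allowed-transition matrix below is an
# assumption read off the state table, not the confirmed spec.
TRANSITIONS = {
    "suspend": {"active": "suspended"},
    "archive": {"active": "archived", "suspended": "archived"},
    "restore": {"suspended": "active", "archived": "active"},
    "delete":  {"active": "deleted", "suspended": "deleted", "archived": "deleted"},
}

def apply(state: str, verb: str) -> str:
    """Return the next state, or raise if the verb is invalid from this state."""
    try:
        return TRANSITIONS[verb][state]
    except KeyError:
        raise ValueError(f"cannot {verb} a {state} workspace")
```

Encoding the matrix as data rather than branching logic keeps the "four states, four verbs" claim checkable at a glance.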
Four verbs. Safe to retry.
This is the entire transition vocabulary between frontpage and infra. Every call carries a client-supplied request_id; sending the same call twice is safe — we return the same result, never duplicate work. Conflicting in-flight operations get a clear 409 your client can retry on.
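The retry contract above can be sketched in a few lines. Class and method names are illustrative, not the real API; the point is the two rules: same (workspace, request_id) returns the same result, and a conflicting in-flight operation gets a retryable 409:

```python
import uuid

class ConflictError(Exception):
    """Stands in for the 409 the real API returns."""

class OperationStore:
    """Sketch of the idempotency contract: (workspace_id, request_id) is unique,
    so replaying a call returns the original operation instead of new work."""

    def __init__(self):
        self._by_key = {}      # (workspace_id, request_id) -> operation dict
        self._in_flight = {}   # workspace_id -> operation id

    def submit(self, workspace_id, verb, request_id):
        key = (workspace_id, request_id)
        if key in self._by_key:
            return self._by_key[key]   # same call twice: same result, never duplicate work
        if workspace_id in self._in_flight:
            raise ConflictError("409: another operation is in flight; retry later")
        op = {"id": str(uuid.uuid4()), "verb": verb, "status": "running"}
        self._by_key[key] = op
        self._in_flight[workspace_id] = op["id"]
        return op

    def complete(self, workspace_id):
        self._in_flight.pop(workspace_id, None)
```

In the real system the unique key lives on the operations table (UNIQUE on workspace + request_id), so the database, not application memory, enforces this.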
Archive is minutes, not instant.
These durations are the contract: "Suspending…" finishes in seconds, "Restoring from archive…" needs to read minutes. Two implications: product messaging must match these ranges, and the frontpage grace-period timers must account for them — don't schedule an archive for 30s before a billing retry fires.
Worst-case for archive and restore-from-archive. Don't show customers a spinner that implies seconds. The recommended UX is a confirmation screen with an ETA, then an email / notification when the workspace is back.
Two off-states. Very different costs.
The customer-visible difference is how quickly we can bring them back when they update their card. Product should map these to your grace-period tiers.
Three pages you own the copy for.
When a workspace isn't running, the customer's URL doesn't 500 — our edge answers directly with a state page. These are three surfaces product/marketing should own the wording and branding of. Each should tell the customer exactly what to do next (pay, restore, contact support).
- suspended · tap down · disk retained
- archived · slot freed · snapshot in S3
- deleted · row pseudonymized
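The routing decision behind those three pages can be sketched as a tiny edge responder. The copy is product's to write, and the non-404 status codes here are placeholders, not decided values:

```python
# Illustrative edge responder. Page copy, branding, and the exact status codes for
# the suspended/archived pages are assumptions; only deleted -> 404 is from the table.
STATE_PAGES = {
    "suspended": (503, "suspended page: workspace paused, update billing to resume"),
    "archived":  (503, "archived page: workspace stored, restoring takes minutes"),
    "deleted":   (404, "tombstone"),
}

def respond(workspace_state: str):
    if workspace_state == "active":
        return (200, "proxy to openberth")   # openberth serves the app
    return STATE_PAGES[workspace_state]
```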
Our obligations.
The boundary between your team and ours, the legal obligation we carry on deletion, and how much data a customer can lose if a host fails.
Your team owns when. Ours owns how.
This is the single most important line in the whole design. The frontpage decides when each transition should happen (grace periods, trial expiries, promo extensions — all yours). Infra decides how it happens (atomicity, ordering, verification, rollback). We don't run billing clocks; you don't run VMs.
Below: the relational model that backs these commitments. Seven tables, one Postgres. Operators call this the "boring-on-purpose" design — every stateful service we have to run at 3am is a liability.
regions · catalog
- ◆ id · text · "us-west-1"
- display_name · text
- state · active / retiring
- created_at · timestamptz

hosts · bare-metal
- ◆ id · uuid
- → region_id · regions
- fqdn · text · unique
- total_vcpu / ram / disk · int · int · int
- state · healthy / draining / retired
- last_heartbeat_at · timestamptz

workspaces · main entity
- ◆ id · uuid
- external_workspace_id · text
- external_user_id · text
- display_name · text
- → region_id / host_id · regions / hosts
- flavor · hobby · pro · team · custom
- vcpu / ram_gb / disk_gb · envelope
- state · active · suspended · archived · deleted
- → current_operation_id · operations (deferred)
- snapshot_interval_seconds · default 86400
- snapshot_retention_count · default 30

workspace_domains · edge routing
- ◆ id · uuid
- → workspace_id · workspaces
- domain · text · unique
- kind · default_subdomain / custom
- cert_state · pending · issued · failed · expired
- verified_at / cert_issued_at · timestamptz

operations · transitions
- ◆ id · uuid
- → workspace_id · workspaces
- verb · suspend · archive · restore · delete
- request_id · UNIQUE(ws, req)
- status · pending · running · ok · failed · rolled_back
- step_state · jsonb · resumable
- river_job_id · bigint
- requested/started/completed_at · timestamptz

snapshots · object store
- ◆ id · uuid
- → workspace_id · workspaces
- object_uri · text
- size_bytes / checksum · bigint · text
- tool · kopia · restic · qemu-img
- kind · periodic · pre_archive
- verified_at · once checksum verified

audit_log · append-only · survives delete
- ◆ id · bigserial
- workspace_id · nullable
- former_workspace_id · uuid · set after delete
- event_type · e.g. transition.archive.succeeded
- event_data · jsonb · no PII
- actor · system · api:<caller> · admin:<user>

Legend · ◆ primary key · → foreign key · PII fields scrubbed atomically on delete.
Delete means gone. Verified.
When your team calls Delete, we do not mark a column and move on. We verify every step and only report success at the end. This is the GDPR artefact — if a regulator asks, this is what we show them.
Scope of this obligation. Account-deletion decision is frontpage's (whenever the customer clicks the button or legal retention expires). Execution — the actual wipe — is ours. GDPR completeness is our risk to manage.
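The verify-then-report discipline can be sketched as a small pipeline. The step names here are assumptions based on the state table (disk wiped, snapshots purged, row pseudonymized); the invariant is the real content, success is only reported once every step has independently verified:

```python
# Hypothetical delete pipeline: each step runs, then verifies, and success is only
# reported when every verification has passed. Step names are illustrative.
def run_delete(steps):
    """steps: list of (name, do, verify) callables. Returns the audit trail."""
    trail = []
    for name, do, verify in steps:
        do()
        if not verify():
            trail.append((name, "failed_verification"))
            raise RuntimeError(f"delete step {name!r} did not verify; not reporting success")
        trail.append((name, "verified"))
    return trail  # only reached when everything verified: the artefact a regulator sees
```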
RPO promise: 24 hours.
Snapshots run every 24 hours by default. If a bare-metal host fails during active use, the customer can lose up to 24 hours of work. Enterprise customers can buy tighter cadence as part of a custom contract — default should stay at 24h to control storage cost.
Default snapshot cadence. Incremental, backed by a periodic full, retained for the archive-retention window. Product should surface this in the Terms of Service and on higher-tier plan comparisons — tighter RPO is a paid upgrade axis.
Scheduling note: host capacity is derived from live workspace assignments, not stored as a counter — so there's no bookkeeping that can drift out of sync with reality. Below is the actual query we run to place a new workspace. Small fleet, cheap query, zero drift risk.
    -- find a host in region $1 with room for $2 vcpu / $3 ram / $4 disk
    SELECT h.id
    FROM hosts h
    WHERE h.region_id = $1
      AND h.state = 'healthy'
      AND h.total_vcpu - COALESCE((SELECT SUM(w.vcpu) FROM workspaces w
            WHERE w.host_id = h.id AND w.state IN ('active','suspended')), 0) >= $2
      AND h.total_ram_gb - COALESCE((SELECT SUM(w.ram_gb) FROM workspaces w
            WHERE w.host_id = h.id AND w.state IN ('active','suspended')), 0) >= $3
      AND h.total_disk_gb - COALESCE((SELECT SUM(w.disk_gb) FROM workspaces w
            WHERE w.host_id = h.id AND w.state IN ('active','suspended')), 0) >= $4
    FOR UPDATE SKIP LOCKED
    LIMIT 1;
FOR UPDATE SKIP LOCKED lets concurrent schedulers pick different hosts. The workspace INSERT in the same transaction reserves the slot.
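For readers who don't speak SQL, here is the same logic as a pure-Python sketch, with illustrative field names. The point it mirrors: free capacity is always derived from live workspace rows, never read from a stored counter that could drift:

```python
# Pure-Python mirror of the placement query's logic, for illustration only.
# hosts / workspaces are lists of dicts standing in for the two tables.
def pick_host(hosts, workspaces, region, need_vcpu, need_ram, need_disk):
    for h in hosts:
        if h["region"] != region or h["state"] != "healthy":
            continue
        # active and suspended workspaces both hold their slot; archived ones don't
        live = [w for w in workspaces
                if w["host_id"] == h["id"] and w["state"] in ("active", "suspended")]
        used = lambda k: sum(w[k] for w in live)
        if (h["total_vcpu"] - used("vcpu") >= need_vcpu
                and h["total_ram_gb"] - used("ram_gb") >= need_ram
                and h["total_disk_gb"] - used("disk_gb") >= need_disk):
            return h["id"]
    return None  # region at capacity: surfaces as RESOURCE_EXHAUSTED upstream
```

What Python cannot mirror is the concurrency part: the real query's row lock plus the workspace INSERT in the same transaction is what makes two schedulers unable to double-book a host.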
What this is, technically.
For the engineering folks on your team. The surface your backend talks to, the errors it sees, and the telemetry we publish so ops can integrate without asking us for a read-only DB.
gRPC to backends, not browsers.
The control-plane API is not a public API. Three consumers: your frontpage server, our admin tooling, and the host agents. No browser clients, no third-party integrations — those stay on top of whatever you build on the frontpage. This lets us keep one schema/tooling stack for every inbound call, which simplifies the security story considerably.
For your backend team: every mutating call returns an Operation handle immediately. You poll until it's terminal — no wait parameters, no dual sync/async modes. Integrations against this API are, by design, one code path.
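The single code path looks roughly like this on the client side. The method name `get_operation` and the status values are assumptions matching the operations table; substitute whatever your generated stub exposes:

```python
import time

TERMINAL = {"ok", "failed", "rolled_back"}

def wait_for(client, operation_id, poll_seconds=2.0, timeout=3600):
    """Poll a hypothetical GetOperation endpoint until the operation is terminal.
    `client` is whatever stub your backend generates; names are illustrative."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        op = client.get_operation(operation_id)
        if op["status"] in TERMINAL:
            return op
        time.sleep(poll_seconds)
    raise TimeoutError(f"operation {operation_id} not terminal after {timeout}s")
```

For slow verbs like restore-from-archive, the timeout and poll interval should match the minutes-scale durations promised earlier, not seconds.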
Your client only branches on reasons.
Every error from us carries a machine-readable reason. The set is closed and versioned — new reasons are additive; existing reasons never change meaning. Your frontend code never needs a fix because we renamed something. This matters most when you're writing customer-facing error copy on top.
One category your product team should plan for: RESOURCE_EXHAUSTED means "we're at capacity in that region". You need UX for "we can't provision you a workspace right now, try a different region or contact us" — this is not impossible, just uncommon enough that it gets forgotten in mock-ups.
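A sketch of what "branch on reasons" means in client code. RESOURCE_EXHAUSTED is from this deck; the other reason name and all the copy are invented placeholders, and the fallback branch is what makes additive new reasons safe:

```python
# Illustrative client-side reason handling. Only RESOURCE_EXHAUSTED is a reason
# named in this briefing; the rest is placeholder.
CUSTOMER_COPY = {
    "RESOURCE_EXHAUSTED": ("We can't provision a workspace in this region right now. "
                           "Try another region or contact us."),
    "OPERATION_IN_FLIGHT": "Another change is still applying. Please retry shortly.",
}

def copy_for(reason: str) -> str:
    # The reason set is closed but additive: unknown values must degrade gracefully.
    return CUSTOMER_COPY.get(reason, "Something went wrong. Our team has been notified.")
```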
Ops integration out of the box.
Three separate layers SRE will care about — Prometheus metrics, OpenTelemetry spans, and an append-only audit log. Nothing custom to wire up.
Metrics · Prometheus :9090/metrics
Tracing · OpenTelemetry
Every API call gets a span; span context propagates into the job runner, so one trace covers the full transition — API call → job start → step execution → completion. Useful when your team asks "why did this customer's archive take 28 minutes?" — SRE can answer in minutes, not hours.
Audit log · survives account deletion
Every transition writes an append-only record. On customer deletion, PII is scrubbed, but the opaque record stays — so we can reconstruct what happened for incident post-mortems without GDPR risk. Legal/compliance should review the audit schema; it's the artefact they'll lean on in any dispute.
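Mechanically, the scrub follows the audit_log schema above: the live workspace_id reference is nulled, former_workspace_id keeps the lineage, and the event itself is never touched. A minimal sketch, assuming rows as plain dicts:

```python
# Sketch of the append-only audit contract: records are never deleted; on account
# deletion the workspace reference is pseudonymized, not the history.
def scrub_on_delete(audit_rows, workspace_id):
    for row in audit_rows:
        if row.get("workspace_id") == workspace_id:
            row["former_workspace_id"] = workspace_id  # keep lineage for post-mortems
            row["workspace_id"] = None                 # drop the live reference
    return audit_rows
```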
Where we run — and what's still open.
Where our VMs physically live, how we secure the link, how we survive disconnects — and the short list of decisions we still need from you or from prototyping.
Bare-metal, any cloud, through NAT.
Each bare-metal host in our fleet runs a small daemon that dials the control plane and keeps a long-lived session open. Hosts never accept inbound connections from us — they're free to sit in any data center, any cloud, behind any firewall. This is what makes "run our platform on any provider" actually feasible. Hetzner today, OVH tomorrow, on-prem next year — no networking change to the control plane.
Commands →
Control plane → agent dispatch. Ack confirms receipt and validation, not execution. command_id = operations.id.
Events ←
Agent → control plane telemetry. Monotonic seq per session, high-water-mark acks, durable replay buffer on reconnect.
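The seq / high-water-mark / replay mechanics above fit in a few lines. A minimal agent-side sketch (class and method names are illustrative):

```python
from collections import deque

class EventBuffer:
    """Events carry a per-session monotonic seq; the control plane acks a
    high-water mark, and everything above it is replayed on reconnect."""

    def __init__(self):
        self._seq = 0
        self._pending = deque()   # (seq, event), in emit order

    def emit(self, event):
        self._seq += 1
        self._pending.append((self._seq, event))
        return self._seq

    def ack(self, high_water_mark):
        # Everything at or below the mark is durably received: drop it.
        while self._pending and self._pending[0][0] <= high_water_mark:
            self._pending.popleft()

    def replay(self):
        # On reconnect, resend every unacked event, oldest first.
        return list(self._pending)
```

A single high-water mark is cheaper than per-event acks and is what makes the durable replay buffer bounded in the common case.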
Mutual TLS. From day one.
Agents are long-lived, autonomous, and run on hardware we don't always directly control. Bearer tokens are not enough — we use mutual TLS, with our own internal CA issuing per-host certificates bound to a host id and region. Tokens are fine for the frontpage-to-infra call; they're not fine for the infra-to-fleet link.
What this buys us: even if a stolen bearer token leaks, it can't impersonate a host. Even if a host physically disappears (stolen, re-imaged, etc.), retiring it via the admin API cuts it off the fleet immediately — no waiting for cert expiry.
Section 14 originally detailed the command/event stream design (nine command verbs like ProvisionVM, TakeSnapshot; five event types; independent back-pressure on a single mTLS connection). That engineering detail lives in the source spec — it's implementation work the infra team owns. Above is the posture summary.
What we need from you.
Everything above is approved design, pre-implementation. Three categories of open work — where we need decisions, where we accept risk, and what we'll ship first.
Recovery · how we survive disconnects
When a host's connection to the control plane drops and comes back, neither side trusts its state memory. The agent sends an inventory of everything it has on disk; we compare to our database and reconcile differences. Expected but missing → provision again. Orphans (things on the host we don't know about) → alert an operator, never auto-destroy. Stuck in-flight ops get marked failed so upstream can retry. This is what lets us promise that a network blip never leaves a customer's workspace in an inconsistent limbo.
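The three reconciliation rules above reduce to set arithmetic. A minimal sketch, with plain sets standing in for the agent's inventory and our database view:

```python
# Sketch of reconnect reconciliation. Real agents send richer inventory than ids;
# the three rules are the content: re-provision missing, flag orphans (never
# auto-destroy), fail stuck in-flight ops so upstream can retry.
def reconcile(db_expected, agent_inventory, in_flight_ops):
    """db_expected / agent_inventory: sets of workspace ids on this host."""
    return {
        "provision":    sorted(db_expected - agent_inventory),    # expected but missing
        "alert_orphan": sorted(agent_inventory - db_expected),    # operator decides
        "mark_failed":  sorted(in_flight_ops),                    # safe: verbs are retryable
    }
```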
Deferred decisions · needed before build
Explicitly out of scope · needed eventually
Your asks · what we need from the application team
Specs approved. Everything above is a designed commitment. Implementation begins next; first end-to-end demo of suspend / archive / restore is the near-term milestone.