
πŸš€ The SaaS Template Playbook - Part 1 πŸ“–

A comprehensive, opinionated, actionable guide for building a professional, reusable SaaS template that you can fork and reskin for any vertical (CRM, project management, analytics, internal tooling, vertical SaaS, etc.).

If you read only one section first, read Β§3 The 12 Pillars and Β§5 Multi-Tenancy β€” those two ideas dictate every other decision in this document.


πŸ“‹ Table of Contents

  1. 🧐 What "SaaS Template" Actually Means
  2. ⚑ The 30-Second Mental Model
  3. πŸ›οΈ The 12 Pillars of a Production SaaS
  4. πŸ—οΈ Reference Architecture
  5. 🏒 Multi-Tenancy β€” the Keystone Decision
  6. πŸ” Authentication & Authorization
  7. πŸ‘₯ Accounts, Organizations, Workspaces, Teams
  8. πŸšͺ Onboarding & Activation
  9. πŸ’³ Billing, Subscriptions & Metering
  10. πŸ—„οΈ Database Design Patterns
  11. 🌐 API Design
  12. βš™οΈ Background Jobs, Queues & Schedulers
  13. πŸ“‘ Real-time & Eventing
  14. πŸ“¨ Email, Notifications & Inbox
  15. πŸ“¦ File Storage, Uploads & CDN
  16. πŸ”Ž Search (Full-Text + Semantic)
  17. 🚩 Feature Flags & Experiments
  18. πŸ“Š Audit Logs, Activity Feeds & Telemetry
  19. πŸ›‘οΈ Security, Compliance & Privacy
  20. ⚑ Performance, Caching & Scaling
  21. πŸ“ˆ Observability β€” Logs, Metrics, Traces, Errors
  22. 🎨 Frontend Architecture
  23. 🌍 Internationalization & Accessibility
  24. πŸ”§ Admin & Internal Tooling
  25. πŸ“ Marketing Site, Docs & SEO
  26. 🚒 CI/CD, Environments & Release Strategy
  27. 🧰 Developer Experience (DX)
  28. πŸ§ͺ Testing Strategy
  29. πŸ’° Pricing, Plans & Packaging Strategy
  30. 🎯 Product Analytics & Growth
  31. 🀝 Customer Support & Success
  32. πŸ“¦ Reusability β€” How to Make This a Template
  33. πŸ—ΊοΈ The 14-Phase Build Plan
  34. ⚠️ Common Pitfalls & Hard-Won Guardrails
  35. πŸ“‹ Cheat Sheet

1. 🧐 What "SaaS Template" Actually Means

A reusable SaaS template is the boring 80% you'd otherwise rebuild for every product:

  • Sign-up, login, password reset, SSO, MFA
  • Organizations / workspaces / teams / invites
  • Roles + permissions
  • Billing, subscriptions, plans, usage metering, invoices
  • Email + notifications + in-app inbox
  • Audit logs + activity feeds
  • Admin panel
  • Feature flags
  • Background jobs, scheduled jobs, webhooks
  • File uploads + CDN
  • API keys + rate limiting
  • Observability + error tracking
  • CI/CD + multi-environment deploys
  • Marketing landing page + docs site

It is NOT:

  • Your product's domain logic β€” that's the unique 20% you build on top.
  • A no-code platform β€” it's a code starter.
  • A magic SaaS-in-a-box β€” you still need product judgment.

The right mental model: infrastructure for the parts every SaaS has, with clean seams where your domain plugs in.


2. ⚑ The 30-Second Mental Model

                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                β”‚  Marketing Site  +  Docs  +  Status β”‚
                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                      β”‚
                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                β”‚            Web App (SPA)            β”‚
                β”‚       + (optional) Mobile/Desktop   β”‚
                β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚ REST/GraphQL    β”‚ WS/SSE
                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                β”‚  Edge / API Gateway                 β”‚
                β”‚   (auth, rate limit, CORS, WAF)     β”‚
                β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
       β–Ό                 β–Ό                             β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ App APIβ”‚ ◄───► β”‚Worker(s) β”‚                 β”‚ Webhooks β”‚
  β”‚  (BFF) β”‚       β”‚+ Cron    β”‚                 β”‚ Out/In   β”‚
  β””β”€β”€β”€β”¬β”€β”€β”€β”€β”˜       β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜                 β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜
      β”‚                 β”‚                            β”‚
      β–Ό                 β–Ό                            β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚  Postgres (core)  β€’  Redis (cache+queue)            β”‚
  β”‚  Object Storage (S3)  β€’  Search (PG/Meili/Elastic)  β”‚
  β”‚  Time-series / Analytics (ClickHouse / DuckDB)      β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                  β”‚
                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                  β–Ό               β–Ό                     β–Ό
              Stripe          Email (Resend)        Auth (Clerk/
              (billing)       SMS (Twilio)          WorkOS) [opt]
              Sentry          Segment/PostHog       OpenAI/etc.

Three deployable surfaces, one source of truth:

| Surface | Built from | Where it runs |
|---|---|---|
| Marketing + docs | Next.js static / Astro | CDN (Vercel / Cloudflare Pages) |
| Web app | React SPA (Vite) or Next.js | CDN + edge |
| API + workers | Go / Python / Node | Container platform (Fly / Railway / ECS / k8s) |

3. πŸ›οΈ The 12 Pillars of a Production SaaS

Every SaaS template needs all twelve. Skip one, and you eat scope creep later.

| # | Pillar | What "done" looks like |
|---|---|---|
| 1 | Identity | Email/password, OAuth (Google/GitHub), magic link, MFA, SSO (SAML/OIDC), session + token model. |
| 2 | Tenancy | Org/workspace boundary, every query filtered by workspace_id, RBAC + (optional) ABAC. |
| 3 | Billing | Stripe wired, plans configurable, trials, dunning, usage metering, invoice portal. |
| 4 | Lifecycle | Onboarding flow, email verification, invites, offboarding, account deletion (GDPR-clean). |
| 5 | Eventing | In-process bus → outbox → workers → webhooks. Idempotent. |
| 6 | Observability | Structured logs + traces + metrics + error tracker, all correlated by request_id + tenant_id. |
| 7 | Audit | Append-only audit log of every privileged action, queryable by tenant. |
| 8 | Notifications | Transactional email + in-app inbox + (opt) SMS/push, all with per-user preferences. |
| 9 | Files | Direct-to-S3 uploads via signed URLs; never proxy bytes through your API. |
| 10 | Admin | Internal dashboard for support: impersonate, refund, suspend, inspect tenant. |
| 11 | Flags | Feature flags per environment + per tenant + per user. Kill-switch culture. |
| 12 | DX | One command to dev (make dev), seed data, fast tests, docs that don't lie. |

4. πŸ—οΈ Reference Architecture

4.1 The Spine

          [Browser / Mobile / Desktop]
                       β”‚
                       β–Ό
              [CDN / Edge Cache]
                       β”‚
                       β–Ό
            [Reverse Proxy / WAF]   ← TLS terminates here
            (Caddy: automatic HTTPS via Let's Encrypt,
             or Traefik: dynamic routing from Docker/K8s labels)
                       β”‚
            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
            β–Ό          β–Ό           β–Ό
     [API Gateway] [WebSocket]  [Static Assets]
            β”‚          β”‚
            β–Ό          β–Ό
       [App API (stateless, horizontally scalable)]
            β”‚
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β–Ό        β–Ό             β–Ό             β–Ό
 [DB]   [Cache]      [Queue]       [Object Store]
Postgres  Redis      Redis/SQS         S3
   β”‚        β”‚             β”‚             β”‚
   β–Ό        β–Ό             β–Ό             β–Ό
[Read    [Pub/Sub   [Workers +     [CDN signed
 replica] for WS]    cron]          URLs]

4.2 What lives where

| Concern | Where |
|---|---|
| Source of truth | Postgres |
| Hot reads, sessions, idempotency keys, rate-limit counters | Redis |
| Heavy/slow work, retries, scheduled work | Workers consuming a queue |
| Real-time fanout to clients | WS hub backed by Redis pub/sub (multi-node) |
| Bulk analytics & reporting | ClickHouse / BigQuery / DuckDB (mirrored from Postgres) |
| Static UI | CDN |
| User-uploaded files | S3 + CDN with signed URLs |
| Secrets | Env (dev) / SSM / Vault / Doppler (prod) |

4.3 Suggested tech stack (opinionated, swappable)

| Layer | Default | Why |
|---|---|---|
| API (Go) | chi + sqlc + pgx (lean) or Gin + GORM (batteries-included) | Fast, predictable, low-overhead. Gin/GORM is the path-of-least-resistance combo most Go SaaS teams ship on. |
| API (Node) | Hono / Fastify + Prisma | Edge-friendly, ergonomic |
| ML / heavy compute | Python (FastAPI + uv + pydantic v2 + structlog) | Ecosystem advantage; structlog gives you JSON logs out of the box |
| Web | React 19 + TypeScript + Vite + TanStack Query + Zustand + Tailwind | Boring, excellent, zero magic |
| DB | Postgres 16+ (with pgvector, pg_trgm) | One DB to do 90% of jobs |
| Cache | Redis 7 | Battle-tested |
| Queue / Eventing | Redis (simple) → NATS JetStream (durable streams, replay, KV, multi-tenant subjects) | NATS is the right answer when you need at-least-once delivery, replay, or fan-out across services without standing up Kafka. |
| Search | Postgres FTS (start) → Meilisearch / Typesense (scale) | Cheap → fast |
| Object store | S3 / Cloudflare R2 (no egress) / Supabase Storage (if you're already on Supabase) | Standard |
| Email | Resend or Postmark | Reliable transactional, simple SDKs |
| Auth (managed SaaS) | Clerk (fast UX), WorkOS (enterprise SSO/SCIM), Supabase Auth (auth + DB + storage in one) | Saves weeks; pick by where the rest of your stack lives. |
| Auth (self-hosted OSS) | Ory Kratos (identity) + Ory Hydra (OIDC) + Ory Keto (permissions), pure API with no UI bundled; or Casdoor, full-stack IAM with built-in admin UI, OIDC/SAML, RBAC, MFA | Own your identity layer without writing it. Kratos = composable primitives; Casdoor = drop-in IAM. |
| Auth (DIY) | Lucia / Auth.js / your own JWT + refresh | Maximum ownership, maximum maintenance |
| Billing | Stripe (default) / Paddle or LemonSqueezy (Merchant-of-Record, global tax) / PayPal as a secondary payment method for non-card markets (LATAM, parts of EU, gamer/creator audiences) | Stripe owns card-first markets; PayPal is the second checkout option customers ask for. |
| Logging (Go) | zerolog (zero-allocation JSON) or slog (stdlib, 1.21+) | zerolog is the production default for Go SaaS: fast, structured, contextual. |
| Logging (Python) | structlog + orjson renderer | Structured, contextvars-aware, async-safe |
| Background jobs | Asynq (Go, Redis) / River (Go, Postgres) / BullMQ (Node) / Celery / Arq (Python) / NATS JetStream consumers (cross-language) | Match language, or use NATS if you already have it for eventing. |
| Reverse proxy / TLS | Caddy (automatic HTTPS, simplest config) or Traefik (dynamic config, great with Docker/K8s/labels); nginx if you have a reason | Caddy = "it just works" for VMs. Traefik = service-discovery-driven for containerized stacks. |
| Observability | OpenTelemetry → Grafana / Honeycomb / Datadog | Vendor-neutral export |
| Errors | Sentry | Best-in-class |
| Analytics | PostHog (self-host or cloud) | Product + flags + session replay in one |
| CI/CD | GitHub Actions | Where your code already is |
| Infra (PaaS, fastest start) | Fly.io / Railway / Render | Push-to-deploy, no ops |
| Infra (cheap VMs, more control) | Hetzner (best €/CPU in the market, €4–€40/mo dedicated cores) or DigitalOcean (polished UX, managed PG/Redis, App Platform) | Most bootstrapped SaaS run profitably on a Hetzner box + DO managed Postgres. Pair with Caddy/Traefik. |
| Infra (hyperscaler, when you have to) | AWS / GCP / Azure | Compliance, region breadth, enterprise procurement |

Two reference stacks to pick from on day one:

  • "Bootstrapped solo / small team": Go (Gin + GORM + zerolog) + Postgres + NATS JetStream + Caddy on a single Hetzner box, Casdoor or Ory Kratos for auth, Stripe + PayPal for payments. ~€30/mo, scales to thousands of paying customers.
  • "Funded / enterprise-ready": Go (chi + sqlc) + managed Postgres + Redis + NATS cluster behind Traefik on Digital Ocean App Platform / Kubernetes, WorkOS or Supabase Auth, Stripe Billing, OTel β†’ Grafana Cloud.

4.4 Cross-cutting building blocks (the glossary)

These are the load-bearing concepts every later section assumes. Define them once here; deeper coverage is in the linked sections.

🧱 The middleware chain

A request flows through a fixed stack of middleware before any handler runs. Order is load-bearing β€” wire it once in main.go and don't rearrange.

Request
  β”‚
  β–Ό
[1] Recovery        β€” catch panics, return 500 + Sentry capture
[2] RequestID       β€” generate or accept X-Request-ID header
[3] Logger          β€” bind request_id to ctx logger (zerolog/structlog)
[4] Tracing         β€” OTel span for the request
[5] CORS            β€” allowlist origins
[6] RateLimit       β€” Redis token bucket per IP / API key (Β§11.7)
[7] Auth            β€” verify session/JWT/API key β†’ set Actor in ctx (Β§6)
[8] Tenant          β€” resolve workspace_id β†’ set in ctx + SET LOCAL app.workspace_id (Β§5)
[9] CSRF            β€” cookie endpoints only
[10] Idempotency    β€” POSTs with Idempotency-Key header (Β§11.6)
  β”‚
  β–Ό
Handler β†’ Service β†’ Repository
  β”‚
  β–Ό
Response
  β”‚
  β–Ό
[Logger middleware closes the span, emits access log line]

Auth comes before Tenant (you need an actor before resolving their workspace). Recovery is outermost so a panic anywhere still produces a clean 500. RateLimit goes before Auth so unauthenticated abuse hits the limiter first.

πŸ“¦ What ctx carries

context.Context is the request-scoped envelope. Everything below is bound by middleware and read by handlers/services/repos.

| Key | Set by | Read by |
|---|---|---|
| request_id | RequestID middleware | logs, error responses, traces |
| logger | Logger middleware | every layer (log.Ctx(ctx)) |
| actor | Auth middleware | permission checks, audit log |
| workspace_id | Tenant middleware | every repo query, RLS GUC |
| trace_id / span | OTel middleware | downstream HTTP/DB instrumentation |
| db (per-request handle with GUCs set) | Tenant middleware | repos |

Rule: if a function needs any of these, it takes ctx context.Context as the first argument. No globals. No req.Context() 3 layers deep β€” pass ctx explicitly.

🎭 The Actor type (polymorphic identity)

Every action in the system is performed by something β€” a human, an API key, or the system itself. Don't model "user" everywhere; model Actor.

type Actor struct {
    Type ActorType // user | api_key | system
    ID   uuid.UUID
    // for users: cached membership in current workspace
    Role        Role     // owner | admin | member | viewer
    Permissions []string // resolved at auth time
}

func (a *Actor) Can(action string, resource Resource) bool { /* Β§6.3 */ }

This pairs with the polymorphic-actor DB pattern (created_by_type, created_by_id β€” see Β§35) so audit logs, activity feeds, and created_by fields handle integrations and humans uniformly.

πŸ›οΈ Layered architecture (handler β†’ service β†’ repo)

Each layer has a strict allowed-imports list. Violations are caught by golangci-lint depguard rules (or equivalent in other languages).

| Layer | Knows about | Forbidden |
|---|---|---|
| Handler | HTTP, Service interfaces, request/response DTOs | DB, SQL, third-party SDKs |
| Service | Domain logic, other Services, Repository interfaces, the Bus | HTTP types (http.Request, gin.Context) |
| Repository | DB driver, SQL, models | HTTP, business rules, other repos |

A handler never touches the DB. A repo never decides whether an action is allowed. This is what makes services testable without a server and repos swappable.

πŸ”Œ The kernel interfaces (the seams)

Every cross-cutting capability is a Go interface (or TS type) defined in kernel/. The product imports the interface; wiring picks the implementation at startup. These are the seams that keep the template reusable.

type Auth interface {                         // Β§6
    Authenticate(ctx, token) (*Actor, error)
    Issue(ctx, user *User) (Token, error)
}

type Bus interface {                          // Β§13
    Publish(ctx, subject string, payload []byte) error
    Subscribe(ctx, subject string, h Handler) (Subscription, error)
}

type Storage interface {                      // Β§15
    PresignPut(ctx, key string, opts PutOpts) (string, error)
    PresignGet(ctx, key string, ttl time.Duration) (string, error)
}

type Mailer interface {                       // Β§14
    Send(ctx, msg Message) error
}

type Meter interface {                        // Β§9.6
    Increment(ctx, workspaceID uuid.UUID, metric string, n int64) error
}

type Flags interface {                        // Β§17
    IsEnabled(ctx, key string, scope FlagScope) bool
}

type Cache interface {                        // Β§20
    Get(ctx, key string) ([]byte, bool, error)
    Set(ctx, key string, val []byte, ttl time.Duration) error
    Bump(ctx, tag string) error // tag-based invalidation
}

Implementations: casdoor.Auth, workos.Auth, kratos.Auth / nats.Bus, redis.Bus, inproc.Bus / s3.Storage, r2.Storage, supabase.Storage / resend.Mailer, postmark.Mailer / etc. Swapping providers = changing one line in main.go.

πŸ”’ Transactions: the WithTx pattern

Don't manually Begin/Commit/Rollback β€” it leaks on panics and confuses nested calls. Use a closure helper that the repo layer owns:

func (r *Repo) WithTx(ctx context.Context, fn func(tx *Repo) error) error {
    return r.db.Transaction(func(db *gorm.DB) error {
        return fn(&Repo{db: db})
    })
}

// Service:
err := repo.WithTx(ctx, func(tx *Repo) error {
    if err := tx.Orders().Create(ctx, order); err != nil { return err }
    return tx.Outbox().Append(ctx, "order.created", order) // Β§12.4
})

Two rules:

  • Never hold a transaction across a network call (HTTP, Stripe, S3). Read first, do external work, then write fast inside the tx.
  • DB writes + event emission live in the same tx via the outbox pattern (Β§12.4). Anything else is eventually-inconsistent in failure modes.

πŸ” Idempotency (everywhere, not just Β§11.6)

Three places idempotency shows up; same idea, different keys:

| Surface | Key | Storage |
|---|---|---|
| Public API POST | Idempotency-Key header (§11.6) | Redis, 24h TTL, scoped by (workspace_id, key) |
| Stripe/PayPal webhooks | event.id (§9.3) | Redis, 7-day TTL |
| Background jobs | (job_type, dedup_key) (§12.3) | Postgres unique index, or Redis SETNX |

The shape is always: check if you've seen this key β†’ if yes, return cached result / no-op β†’ else do work, then record the key.
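That check-then-record shape can be sketched as one small helper. This is a minimal sketch: an in-memory map stands in for Redis/Postgres, and the type and key names are illustrative, not from the template.

```go
package main

import (
	"fmt"
	"sync"
)

// IdemStore runs work at most once per key and replays the recorded
// result for duplicates. In production the map would be Redis with a TTL.
type IdemStore struct {
	mu   sync.Mutex
	seen map[string][]byte // key -> recorded result
}

func NewIdemStore() *IdemStore { return &IdemStore{seen: map[string][]byte{}} }

func (s *IdemStore) Do(key string, work func() ([]byte, error)) ([]byte, error) {
	s.mu.Lock()
	if res, ok := s.seen[key]; ok {
		s.mu.Unlock()
		return res, nil // duplicate: replay the cached result, don't redo work
	}
	s.mu.Unlock()

	res, err := work()
	if err != nil {
		return nil, err // failed work is not recorded, so a retry can succeed
	}

	s.mu.Lock()
	s.seen[key] = res
	s.mu.Unlock()
	return res, nil
}

func main() {
	store := NewIdemStore()
	calls := 0
	work := func() ([]byte, error) { calls++; return []byte("ok"), nil }
	store.Do("ws_a:abc", work) // runs work
	store.Do("ws_a:abc", work) // replays, does not run work again
	fmt.Println("calls:", calls)
}
```

Scoping the key by tenant (`ws_a:abc`) mirrors the Public API row above: two workspaces reusing the same Idempotency-Key must not collide.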

πŸ†” ID conventions

  • UUID v7 for all primary keys β€” sortable by time, single column for PK + chronology, no created_at index needed for ordering.
  • Prefixed display IDs in API responses for human-readable references: proj_01HMZ..., inv_01HMZ.... The DB stores the raw UUID; the API serializer adds the prefix. Saves debugging time when a customer pastes an ID into a ticket.

🌍 The standard handler shape

Every handler in the codebase looks the same. Deviation = reviewer flag.

func (h *ProjectHandler) Create(c *gin.Context) {
    ctx := c.Request.Context()
    actor := auth.ActorFrom(ctx)            // set by Auth middleware
    workspaceID := tenant.IDFrom(ctx)       // set by Tenant middleware

    var req CreateProjectRequest
    if err := c.ShouldBindJSON(&req); err != nil {
        respondError(c, errs.Validation(err)); return
    }

    project, err := h.svc.Create(ctx, actor, workspaceID, req)
    if err != nil {
        respondError(c, err); return         // single error envelope (Β§11.5)
    }

    c.JSON(201, project)
}

Five lines of mechanical work, then one line of actual business logic delegated to the service. If a handler grows past 20 lines, push the logic down a layer.

5. 🏒 Multi-Tenancy — the Keystone Decision

The single most consequential architectural choice. Decide on day one and enforce it in code.

5.1 The three models

| Model | Description | When to use |
|---|---|---|
| Pool (shared) | One DB, every row tagged workspace_id (or org_id). | Default for B2B SaaS. Best ops/cost. |
| Bridge (silo schema) | One DB, one schema per tenant. | Mid-enterprise; per-tenant migrations possible. |
| Silo (isolated DB) | One DB per tenant. | Regulated tenants (banks, healthcare), VIP customers. |

Recommendation: Start with Pool. Add Silo later as an enterprise tier. Don't try to do all three on day one.

5.2 Hard rules for the Pool model

  1. Every tenant-owned table has workspace_id (or org_id) NOT NULL.
  2. Every query filters by workspace_id β€” no exceptions. Enforce via:
    • Repository methods that require workspaceID as a typed argument.
    • Postgres Row-Level Security (RLS) as a belt-and-suspenders defense.
  3. The active tenant is resolved once per request from the auth token and stored in context.Context / request-local state.
  4. Cross-tenant queries (admin, analytics) go through a separate, audited code path. Never inside the user request handler.
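Rule 2's "typed argument" enforcement can look like the sketch below. The distinct WorkspaceID type and the in-memory repo are illustrative stand-ins; the idea is that a repo method without the tenant argument simply doesn't compile.

```go
package main

import "fmt"

// WorkspaceID is a distinct type, not a bare string/uuid, so a caller
// can't accidentally pass a user ID or forget the tenant entirely.
type WorkspaceID string

type Project struct {
	ID          string
	WorkspaceID WorkspaceID
	Name        string
}

// ProjectRepo: every method takes the tenant as a typed, required argument.
type ProjectRepo struct{ rows []Project }

func (r *ProjectRepo) ListByWorkspace(ws WorkspaceID) []Project {
	var out []Project
	for _, p := range r.rows {
		if p.WorkspaceID == ws { // the filter no caller can skip
			out = append(out, p)
		}
	}
	return out
}

func main() {
	repo := &ProjectRepo{rows: []Project{
		{ID: "p1", WorkspaceID: "ws_a", Name: "Alpha"},
		{ID: "p2", WorkspaceID: "ws_b", Name: "Beta"},
	}}
	fmt.Println(len(repo.ListByWorkspace("ws_a")))
}
```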

5.3 Postgres RLS as defense-in-depth

ALTER TABLE issue ENABLE ROW LEVEL SECURITY;

CREATE POLICY issue_tenant_isolation ON issue
    USING (workspace_id = current_setting('app.workspace_id')::uuid);

In your handler middleware:

tx.Exec(`SELECT set_config('app.workspace_id', $1, true)`, workspaceID)

(Use set_config with the third argument true rather than a literal SET LOCAL: SET statements don't accept bind parameters in Postgres's extended protocol, and true gives the same transaction-local scope.)

Even if a developer forgets a WHERE workspace_id = ?, RLS blocks the leak.

5.4 The "two-actor" rule for queries

Every query has two implicit parameters:

  • actor_user_id (who's asking)
  • tenant_id (which tenant they're acting in)

Don't accept "logged-in user" alone. The same user can belong to multiple workspaces.

5.5 Tenant resolution

Either:

  • Subdomain: acme.app.yourtool.com β†’ acme β†’ workspace lookup.
  • Path: app.yourtool.com/w/acme/...
  • Header: X-Workspace-ID: <uuid> (good for APIs, but UI needs a workspace switcher).

Most SaaS pick subdomain or path β€” pick one and stick with it.
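For the subdomain scheme, resolution is a small pure function. This is a sketch under assumptions: the base domain comes from config, and the helper name is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// WorkspaceSlug extracts "acme" from "acme.app.yourtool.com" given the
// configured base domain. Returns false for the bare domain or anything
// that isn't exactly one label deep.
func WorkspaceSlug(host, baseDomain string) (string, bool) {
	host = strings.ToLower(strings.Split(host, ":")[0]) // drop any port
	if !strings.HasSuffix(host, "."+baseDomain) {
		return "", false
	}
	sub := strings.TrimSuffix(host, "."+baseDomain)
	if sub == "" || strings.Contains(sub, ".") {
		return "", false // reject nested subdomains like a.b.app.yourtool.com
	}
	return sub, true
}

func main() {
	slug, ok := WorkspaceSlug("acme.app.yourtool.com", "app.yourtool.com")
	fmt.Println(slug, ok)
}
```

The slug then goes through a cached workspace lookup; an unknown slug is a 404 before auth ever runs.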


6. πŸ” Authentication & Authorization

6.1 Auth methods you must support

  • Email + password (always β€” even if SSO available).
  • Magic link (best UX for low-stakes products).
  • OAuth: Google + GitHub minimum. Apple if iOS app.
  • MFA: TOTP (Authenticator apps) β€” easy to add, big trust signal.
  • Passkeys (WebAuthn) β€” increasingly expected.
  • SSO (SAML 2.0 + OIDC) β€” gate behind enterprise plan; outsource to WorkOS or Clerk unless you want to own the support burden.
  • API keys β€” per-workspace, scoped, revocable, hashed at rest (sha256).
  • Personal access tokens (PATs) β€” for CLIs, with rotation.

6.2 Sessions vs JWTs β€” pick a hybrid

| Use case | Mechanism |
|---|---|
| Browser session | HttpOnly secure cookie with opaque session ID → server-side session in Redis. Easy revocation. |
| Mobile / desktop / CLI | Short-lived JWT (15 min) + refresh token stored securely. |
| Public API | API key (long-lived, scoped, revocable). |
| Service-to-service | mTLS or signed JWT with short TTL. |

Rule: JWT or server-side session β€” pick per surface. Don't mix-and-match within one surface.

6.3 Authorization β€” RBAC, then ABAC if needed

Start with role-based access control (RBAC):

Workspace roles: owner | admin | member | viewer
Resource permissions derived from role

Only add attribute-based access control (ABAC) (e.g., "user X can edit only resources where assignee_id = user.id") when RBAC alone produces unmaintainable conditionals.

// Permission helper signature
func Can(actor *Actor, action string, resource Resource) bool

Centralize all permission logic in one package. Never inline if user.Role == "admin" checks in handlers.
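A hand-rolled Can() can start as one role-to-permission table, simplified here to role + action before any resource-level checks. The action names are hypothetical.

```go
package main

import "fmt"

type Role string

const (
	RoleOwner  Role = "owner"
	RoleAdmin  Role = "admin"
	RoleMember Role = "member"
	RoleViewer Role = "viewer"
)

// rolePerms is the single source of truth for what each role may do.
var rolePerms = map[Role]map[string]bool{
	RoleViewer: {"project.read": true},
	RoleMember: {"project.read": true, "project.write": true},
	RoleAdmin:  {"project.read": true, "project.write": true, "member.invite": true},
	RoleOwner:  {"project.read": true, "project.write": true, "member.invite": true, "workspace.delete": true},
}

// Can is the one helper every handler and service calls; no inline
// role == "admin" checks anywhere else.
func Can(role Role, action string) bool {
	return rolePerms[role][action]
}

func main() {
	fmt.Println(Can(RoleViewer, "project.write"), Can(RoleAdmin, "member.invite"))
}
```

When ABAC arrives, Can grows a resource argument and the table grows predicates; callers don't change.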

6.4 Open-source policy engines

  • Casbin β€” Go, lightweight, RBAC + ABAC.
  • OPA (Open Policy Agent) β€” sidecar, enterprise-grade.
  • Oso β€” embedded, declarative.
  • Ory Keto β€” Google Zanzibar–style relationship-based access control as a service.

For a template, hand-rolled Can() is fine until you hit ~20 permission rules.

6.5 Don't-build-it-yourself: managed & self-hostable identity

Auth is a tarpit. Ship a real identity service before you ship your second feature. Pick by where you want the trust boundary:

| Option | Type | Sweet spot | Watch out for |
|---|---|---|---|
| Clerk | Managed SaaS | B2C/PLG products that want pre-built React components and great DX. | Per-MAU pricing scales painfully past ~50k actives. |
| WorkOS | Managed SaaS | B2B selling into mid-market/enterprise: SSO (SAML/OIDC), SCIM, directory sync, audit log API. | Light on consumer-style password/magic-link flows; pair with Clerk or your own for those. |
| Supabase Auth (GoTrue) | Managed or self-hosted | You're already using Supabase Postgres + Storage; auth comes "free" with RLS hooks wired in. | You're now Supabase-shaped; migrating off later isn't trivial. |
| Casdoor | Self-hosted OSS | Single-binary IAM with a built-in admin UI. OIDC/OAuth2/SAML/CAS providers, RBAC/ABAC, MFA, social logins, webhooks. | UI is functional, not premium; usually fine since admins use it, not end users. |
| Ory Kratos + Hydra + Keto | Self-hosted OSS | API-first, headless, composable. Kratos = identity + flows, Hydra = OIDC/OAuth2 server, Keto = permissions. You bring your own UI. | More moving parts; budget a week to wire flows + UI. |
| Authentik / Zitadel / Keycloak | Self-hosted OSS | Alternatives in the same shape as Casdoor; pick on UX preference and language affinity. | Keycloak is JVM-heavy; Authentik/Zitadel are lighter. |

Template recommendation by audience:

  • Solo / bootstrapped: start with Casdoor (one container, admin UI, OIDC works in 30 minutes) or Supabase Auth if you want DB + auth co-located.
  • Funded B2B: WorkOS for SSO/SCIM + your own password/magic-link, or Ory Kratos if you must self-host for compliance.
  • Consumer-facing PLG: Clerk for the fastest path to a polished sign-in experience.

Your app should talk to identity through a thin auth package interface (Authenticate(token) β†’ Actor, Issue(ctx, user) β†’ token). Swapping Casdoor for WorkOS later is then a ~1-day adapter change, not a rewrite.

6.6 Auth security checklist

  • [ ] Passwords hashed with argon2id (or bcrypt cost 12+).
  • [ ] Email enumeration defended (same response for "email not found" and "wrong password").
  • [ ] Rate limiting on /login (5/min/IP + 10/hr/email).
  • [ ] Lockout after N failed attempts, with email notification.
  • [ ] CSRF protection on cookie-auth endpoints.
  • [ ] Session fixation defense: rotate session ID on login.
  • [ ] Logout invalidates server-side session.
  • [ ] Refresh tokens rotated on use; revoke entire family on reuse-detection.
  • [ ] Password reset tokens are single-use, expire in 1h, are sent to verified email only.
  • [ ] MFA backup codes generated, shown once, hashed at rest.

7. πŸ‘₯ Accounts, Organizations, Workspaces, Teams

7.1 The canonical hierarchy

User  ─┬─►  Membership  ─►  Workspace (tenant)
       β”‚                       β”‚
       β”‚                       β”œβ”€β”€ Teams (subgroups)
       β”‚                       β”œβ”€β”€ Resources (projects, issues, …)
       β”‚                       β”œβ”€β”€ Subscription (Stripe)
       β”‚                       └── Settings (branding, SSO, etc.)
       β”‚
       └─►  Personal account (optional β€” for solo plans)

A User is a global identity. A Membership ties a user to a workspace with a role.

7.2 Required tables (minimum)

user (id, email, password_hash, email_verified_at, mfa_enabled, created_at, ...)
workspace (id, slug, name, plan, owner_user_id, created_at, ...)
membership (id, user_id, workspace_id, role, status, invited_by, joined_at)
invite (id, workspace_id, email, role, token_hash, expires_at, accepted_at)
team (id, workspace_id, name, parent_team_id NULL)
team_membership (id, team_id, user_id, role)
api_key (id, workspace_id, name, prefix, hash, scopes JSONB, created_by, last_used_at, revoked_at)

7.3 Invites

  • Email a single-use signed token (expires in 7 days).
  • Accepting creates the membership row.
  • Critical: if invitee already has an account, just attach a membership β€” don't force a separate signup flow.

7.4 Workspace switcher UI

A persistent UI element (sidebar dropdown or top nav) that:

  • Shows current workspace.
  • Lets user switch (changes URL: /w/<slug>/...).
  • Lets user create a new workspace.
  • Cache the active workspace ID per-user in a cookie/localStorage so it survives reloads.

7.5 Offboarding & deletion

  • Delete account: GDPR right-to-be-forgotten. Anonymize PII, retain audit log entries with user_id = NULL + display_name = "Deleted user".
  • Leave workspace: just removes the membership row.
  • Delete workspace: 30-day soft-delete with restore option. Hard-delete after grace period via cron.

8. πŸšͺ Onboarding & Activation

The 5-minute window between sign-up and first value is the highest-leverage UX you'll ever build.

8.1 The signup flow

1. /signup → email + password (or OAuth)
2. Send the verification email immediately (but don't block app entry on it)
3. Drop into the "create your workspace" step
4. Land in the product with a one-time guided tour
5. Reach the first aha moment within ≤ 3 clicks

8.2 Activation events

Define the activation event β€” the action that predicts retention. Examples:

  • Slack: send 2,000 team messages
  • Dropbox: upload 1 file
  • Linear: create 3 issues
  • Figma: invite 1 collaborator

Track this as activated_at on the workspace, fire it from your event bus, and trigger lifecycle emails off it.

8.3 Email verification β€” required vs optional

  • Required for sensitive actions (billing, inviting users, API keys).
  • Optional for read-only browsing.
  • Show a banner ("Verify your email β€” we sent a link to alice@…") and a one-click resend button.

8.4 Sample data / templates

For B2B SaaS, ship a pre-populated demo workspace so new users can explore before setting up their own data.

8.5 Empty states are product surface

Every list view (/issues, /projects, …) needs an empty state with:

  • One sentence of context ("No issues yet β€” issues are how you track work").
  • A primary CTA button.
  • An optional "import from CSV / Linear / Jira" hook.

9. πŸ’³ Billing, Subscriptions & Metering

9.1 Use Stripe. (Or Paddle / LemonSqueezy if you want them to handle global tax.)

Don't build billing yourself. Stripe has solved every edge case you'd hit in year three.

On PayPal: Stripe is the default subscription engine. PayPal is a checkout option, not a billing system. A meaningful slice of customers β€” LATAM, parts of Asia/EU, freelancer/creator markets, B2C audiences who don't want to hand over a card β€” will bounce if PayPal isn't there. The right shape is:

  • Subscriptions ledger lives in your DB. Plan, status, period, seats β€” your tables, your truth.
  • Stripe for cards / Apple Pay / Google Pay / SEPA / ACH (subscription billing via Stripe Billing).
  • PayPal Subscriptions API wired as a parallel payment provider β€” same subscription row, different payment_provider column.
  • One webhook handler per provider writing into the same idempotent state machine. Don't try to unify webhooks; unify the resulting state.
subscription (
    id UUID PK,
    workspace_id UUID,
    plan_id UUID,
    status TEXT,                    -- trialing | active | past_due | canceled
    payment_provider TEXT,          -- 'stripe' | 'paypal' | 'manual'
    provider_subscription_id TEXT,  -- stripe sub_… / paypal I-…
    provider_customer_id TEXT,
    current_period_end TIMESTAMPTZ,
    cancel_at TIMESTAMPTZ NULL,
    ...
)

Skip PayPal until a real customer asks for it twice. Then add it behind a feature flag and offer it only on the plan-selection page.
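"Unify the resulting state" in practice means each provider's webhook handler maps its own status vocabulary onto the one status column of the subscription table. A minimal sketch; the provider status strings shown are a non-exhaustive illustration, not the full vocabularies.

```go
package main

import "fmt"

// UnifiedStatus maps a provider-specific subscription status into the
// internal state machine: trialing | active | past_due | canceled.
func UnifiedStatus(provider, providerStatus string) string {
	switch provider {
	case "stripe":
		switch providerStatus {
		case "trialing":
			return "trialing"
		case "active":
			return "active"
		case "past_due", "unpaid":
			return "past_due"
		case "canceled", "incomplete_expired":
			return "canceled"
		}
	case "paypal":
		switch providerStatus {
		case "ACTIVE":
			return "active"
		case "SUSPENDED":
			return "past_due"
		case "CANCELLED", "EXPIRED":
			return "canceled"
		}
	}
	return "unknown" // log + alert; never guess a billing state
}

func main() {
	fmt.Println(UnifiedStatus("stripe", "past_due"), UnifiedStatus("paypal", "ACTIVE"))
}
```

Each webhook handler calls this, then writes the unified status to the subscription row keyed by provider_subscription_id.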

9.2 Required Stripe surfaces

| Surface | Stripe product |
|---|---|
| Plan selection at signup | Stripe Checkout (hosted) |
| In-app upgrade/downgrade | Stripe Billing Portal (hosted), or build your own using the API |
| Usage-based billing | Metered prices |
| Trials | Set trial_period_days on subscription |
| Discounts / coupons | Stripe coupons + promotion codes |
| Invoices, payment methods, receipts | Customer Portal handles all this for free |

9.3 The webhook contract

Subscribe to (at minimum):

  • customer.subscription.created
  • customer.subscription.updated
  • customer.subscription.deleted
  • invoice.paid
  • invoice.payment_failed
  • customer.updated
  • checkout.session.completed

Idempotency rule: every webhook handler must be idempotent. Stripe will retry. Use the event.id as a dedup key.
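A minimal sketch of that rule in Go. A mutex-guarded map stands in for what would be a unique index on (provider, event_id) in Postgres, and handleInvoicePaid is an illustrative name, not a Stripe SDK call:

```go
package main

import (
	"fmt"
	"sync"
)

// Dedup store. In production this is a unique index on
// (provider, event_id) in Postgres; an in-memory map stands in here.
type eventDedup struct {
	mu   sync.Mutex
	seen map[string]bool
}

// MarkProcessed returns false if the event was already handled.
func (d *eventDedup) MarkProcessed(eventID string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.seen[eventID] {
		return false
	}
	d.seen[eventID] = true
	return true
}

func handleInvoicePaid(d *eventDedup, eventID string, apply func()) {
	if !d.MarkProcessed(eventID) {
		return // Stripe retry: already applied, ack and move on
	}
	apply()
}

func main() {
	d := &eventDedup{seen: map[string]bool{}}
	applied := 0
	for i := 0; i < 3; i++ { // simulate Stripe retrying the same event
		handleInvoicePaid(d, "evt_123", func() { applied++ })
	}
	fmt.Println("applied:", applied) // applied once despite 3 deliveries
}
```

In real code the "mark processed" step belongs in the same transaction as the state change, so a crash can never leave the event half-applied but marked done.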

9.4 Plan model

plan (id, name, stripe_price_id, monthly_price_cents, yearly_price_cents, features JSONB, limits JSONB)
subscription (id, workspace_id, stripe_subscription_id, stripe_customer_id, plan_id, status, current_period_end, cancel_at, ...)
usage_record (id, workspace_id, metric, quantity, recorded_at, billed_at)

features and limits should be JSONB so you can add new feature gates without migrations:

{
  "features": { "sso": false, "audit_log_export": false, "custom_domains": false },
  "limits":   { "members": 10, "projects": 5, "ai_credits_per_month": 1000 }
}

9.5 Feature gating

// Single helper, used everywhere
if (!can(workspace, "feature.sso")) {
  return upgradePrompt("SSO is available on the Team plan and above");
}

Every paywall is a can() check + a UI prompt. Never silently 403.
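In Go, the same gate can be a one-line lookup against the plan's features map from Β§9.4. The Plan/Workspace shapes here are illustrative, not a fixed schema; the key property is that unknown features default to false:

```go
package main

import "fmt"

// Plan mirrors the features/limits JSONB from Β§9.4 after decoding.
type Plan struct {
	Features map[string]bool
	Limits   map[string]int
}

type Workspace struct {
	Plan Plan
}

// can is the single gate helper used everywhere. Unknown features
// default to false, so a new paid feature stays locked until a plan
// explicitly enables it.
func can(w Workspace, feature string) bool {
	return w.Plan.Features[feature]
}

func main() {
	free := Workspace{Plan: Plan{Features: map[string]bool{"sso": false}}}
	team := Workspace{Plan: Plan{Features: map[string]bool{"sso": true}}}
	fmt.Println(can(free, "sso"))          // false
	fmt.Println(can(team, "sso"))          // true
	fmt.Println(can(free, "unknown.flag")) // false: safe default
}
```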

9.6 Metering

For usage-based pricing (AI credits, API calls, storage GB, …):

// In the request path, fast and non-blocking:
meter.Increment(ctx, workspaceID, "ai.tokens", n)

meter.Increment does an INCR on a Redis counter; a worker periodically flushes the buffered totals to Postgres / Stripe. Never call Stripe synchronously in the request path.
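A sketch of the buffering side, with an in-memory map standing in for the Redis counter (a restart would lose this buffer; Redis is what fixes that):

```go
package main

import (
	"fmt"
	"sync"
)

// Meter buffers increments; a worker flushes totals to Postgres/Stripe.
type Meter struct {
	mu     sync.Mutex
	counts map[string]int64 // key: workspaceID + "|" + metric
}

func NewMeter() *Meter { return &Meter{counts: map[string]int64{}} }

// Increment is what the request path calls: O(1), no network, no Stripe.
func (m *Meter) Increment(workspaceID, metric string, n int64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.counts[workspaceID+"|"+metric] += n
}

// Flush drains the buffer; the worker would write usage_record rows
// and report quantities to Stripe's metered prices here.
func (m *Meter) Flush() map[string]int64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	out := m.counts
	m.counts = map[string]int64{}
	return out
}

func main() {
	m := NewMeter()
	m.Increment("ws_1", "ai.tokens", 120)
	m.Increment("ws_1", "ai.tokens", 80)
	fmt.Println(m.Flush()["ws_1|ai.tokens"]) // 200
}
```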

9.7 Dunning (failed payments)

  • 1st failure: email "We couldn't charge your card."
  • 3rd failure (~7 days): downgrade to free + email.
  • 30 days unpaid: suspend workspace (read-only) + email.
  • 60 days: hard-delete or hand to collections.

Stripe handles the retry schedule (Smart Retries) β€” you handle the in-app messaging.

9.8 Trials done right

  • Length: 14 days is the cultural norm. Don't overthink it.
  • Card upfront vs not: card-up-front filters tire-kickers (lower volume, higher conversion); no-card maximizes top-of-funnel. For a B2B SaaS template, default to no-card with trial countdown banners.
  • Trial extension: offer once, free, no questions. ("Need more time? Extend 7 days.")
  • Trial expiration UX: read-only mode + upgrade banner. Don't delete data.

9.9 When you'd outgrow Stripe-direct: Merchant-of-Record platforms

Stripe leaves you responsible for global tax (VAT, GST, US state sales tax). Below ~$1M ARR or with US-only customers, that's fine. Beyond that, or if you sell into the EU/UK as a non-resident, the compliance overhead becomes a real cost β€” at which point a Merchant-of-Record (MoR) buys the product from you and resells it to the customer, taking the tax problem off your plate.

  • Paddle (managed MoR). Sweet spot: established (15+ years), broad payment-method coverage, good for B2B SaaS selling globally. Watch out for: higher fees than raw Stripe (~5% all-in vs ~2.9% + 30Β’), and less granular control over the checkout.
  • LemonSqueezy (managed MoR, Stripe-owned since 2024). Sweet spot: indie/SMB-friendly, simple pricing, good license-key + digital-product support. Watch out for: acquired by Stripe, so the long-term roadmap may converge with Stripe Tax.
  • Polar (OSS + managed MoR). Sweet spot: open-source, developer-focused, optimized for indie hackers and dev-tool SaaS; native usage-based billing, GitHub integration, customer benefits/perks built in. The right pick when you want MoR + a tool that feels native to a dev-first product. Watch out for: younger than Paddle/LemonSqueezy, with a smaller ecosystem of integrations; verify supported regions/payment methods match your market.
  • Stripe Tax (managed add-on, not a MoR). You stay the merchant of record but Stripe calculates and (in some jurisdictions) files tax for you. The middle ground. Watch out for: it doesn't solve "non-resident seller of digital services in the EU" β€” you're still the entity registered for VAT.

Decision rule: stay on raw Stripe until tax compliance starts costing you 1+ engineer-week per quarter. Then go MoR. Polar is the right default for indie / dev-tool / open-core SaaS; Paddle/LemonSqueezy for broader B2B.

The same pattern as PayPal (Β§9.1): your subscription table is provider-agnostic β€” payment_provider TEXT distinguishes stripe / paypal / polar / paddle. Switching MoRs later is a webhook-handler swap, not a rewrite.


10. πŸ—„οΈ Database Design Patterns

10.1 Conventions

  • Singular table names (user, issue) β€” matches Go struct naming.
  • Every table has: id (UUID v7 β€” sortable), created_at, updated_at, and workspace_id (if tenant-scoped).
  • UUID v7 is sortable by time β†’ primary key + chronological order in one column.
  • Soft delete: deleted_at TIMESTAMPTZ NULL with a partial unique index where deleted_at IS NULL.
  • Append-only history tables for things that need provenance (audit log, billing events, webhooks).

10.2 Migrations

  • Always forward. Never edit an applied migration. Create a new one to fix mistakes.
  • Use goose or golang-migrate for Go (both fine: golang-migrate ships a CLI + library + Docker image and supports many DB drivers; goose has nicer Go-based migrations), alembic (Python), prisma migrate or drizzle-kit (TS), or Atlas (declarative, language-agnostic).
  • Number them sequentially: 001_init.up.sql, 002_add_invites.up.sql, ….
  • Run automatically on deploy (with a deploy gate / dry-run for prod).
  • Online migrations: never block writes on a hot table. Add column nullable β†’ backfill in batches β†’ add NOT NULL in a later migration.

10.3 Indexes that pay rent

  • Every foreign key.
  • Every WHERE clause column you actually filter on (run EXPLAIN ANALYZE).
  • (workspace_id, status, created_at DESC) for typical "list X for tenant" queries.
  • Partial indexes for soft delete: WHERE deleted_at IS NULL.

10.4 Transactions

  • Wrap every multi-write operation in a transaction.
  • Use the outbox pattern for cross-service events (see Β§13.3).
  • Don't hold transactions open across HTTP/RPC calls. Read first, do external work, write fast.

10.5 Ergonomics

  • Use sqlc (Go) / Prisma (TS) / SQLAlchemy 2.0 + Alembic (Python). Skip ORMs that hide SQL.
  • Co-locate migrations and queries in the repo; check them in.
  • Seed scripts for local dev that create realistic data (make seed).

11. 🌐 API Design

11.1 REST is the default; GraphQL is the exception

  • REST + JSON for 90% of endpoints. Predictable, cacheable, debuggable.
  • GraphQL if you have a complex, deeply-nested data graph and many client surfaces. Otherwise it's overhead.
  • gRPC for service-to-service inside your infra.

11.2 Resource conventions

GET    /api/v1/projects                 list
POST   /api/v1/projects                 create
GET    /api/v1/projects/:id             read
PATCH  /api/v1/projects/:id             partial update (preferred over PUT)
DELETE /api/v1/projects/:id             delete
GET    /api/v1/projects/:id/issues      sub-collection
POST   /api/v1/projects/:id/issues      create in sub-collection

11.3 Pagination

  • Cursor-based (?cursor=<opaque>&limit=50) β€” not offset. Offsets break under concurrent inserts.
  • Return { items: [], next_cursor, has_more }.
  • Cap limit at 100.
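One way to make the cursor opaque is to base64-encode the sort key of the last item returned. The field names here are an assumption; any versioned encoding works:

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// cursor encodes the sort key of the last item on the page.
// Opaque to clients; short JSON keys keep the token small.
type cursor struct {
	CreatedAt string `json:"c"` // RFC3339 timestamp of last item
	ID        string `json:"i"` // tiebreaker (UUID v7)
}

func encodeCursor(c cursor) string {
	b, _ := json.Marshal(c)
	return base64.RawURLEncoding.EncodeToString(b)
}

func decodeCursor(s string) (cursor, error) {
	var c cursor
	b, err := base64.RawURLEncoding.DecodeString(s)
	if err != nil {
		return c, err
	}
	return c, json.Unmarshal(b, &c)
}

func main() {
	next := encodeCursor(cursor{CreatedAt: "2024-01-15T09:30:00Z", ID: "018d2f1c"})
	fmt.Println(next) // opaque token returned as next_cursor
	c, _ := decodeCursor(next)
	// Next page: WHERE (created_at, id) < ($1, $2)
	//            ORDER BY created_at DESC, id DESC LIMIT 50
	fmt.Println(c.CreatedAt, c.ID)
}
```

The (timestamp, id) pair matters: the id tiebreaker keeps pagination stable when many rows share a created_at.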

11.4 Filtering & sorting

?status=open&priority=high&sort=-created_at&limit=50

Document supported filters per endpoint. Reject unknown query params rather than silently ignoring them β€” otherwise typos never surface.

11.5 Error envelope (one shape, everywhere)

{
  "error": {
    "code": "validation_error",
    "message": "Title is required",
    "fields": { "title": "must not be empty" },
    "request_id": "req_01HMZ..."
  }
}

Include request_id in every response (header + body) so support can grep your logs.

11.6 Idempotency

  • For POST endpoints that create resources or trigger side effects, accept an Idempotency-Key header.
  • Cache (workspace_id, idempotency_key) β†’ response in Redis for 24h.
  • Return the cached response on retry. Stripe's the canonical example.
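A sketch of the cache, with a map standing in for Redis (no 24h TTL here; production wants the expiry):

```go
package main

import (
	"fmt"
	"sync"
)

// idemCache stores responses keyed by (workspace, idempotency key).
type idemCache struct {
	mu    sync.Mutex
	cache map[string]string
}

// Do returns the cached response on a retry instead of re-executing
// the handler, so a double-submitted POST creates exactly one resource.
func (c *idemCache) Do(workspaceID, key string, handler func() string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := workspaceID + "|" + key
	if resp, ok := c.cache[k]; ok {
		return resp
	}
	resp := handler()
	c.cache[k] = resp
	return resp
}

func main() {
	c := &idemCache{cache: map[string]string{}}
	created := 0
	create := func() string { created++; return fmt.Sprintf(`{"id":"proj_%d"}`, created) }
	fmt.Println(c.Do("ws_1", "key_abc", create)) // creates the project
	fmt.Println(c.Do("ws_1", "key_abc", create)) // same body, no second create
	fmt.Println("creates:", created)             // 1
}
```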

11.7 Rate limiting

  • Per API key + per IP + per workspace.
  • Token bucket in Redis (INCR + EXPIRE).
  • Return 429 with Retry-After header.
  • Document limits in your API docs and surface them in the response headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset).

11.8 Versioning

  • URL versioning (/api/v1/, /api/v2/) β€” boring, works.
  • Or header-based (Accept: application/vnd.yourtool.v2+json) β€” fancy, more work.
  • Never break v1 once published. Add v2 alongside.

11.9 OpenAPI

  • Maintain a hand-written or generated OpenAPI 3.1 spec.
  • Generate client SDKs from it (openapi-generator, oapi-codegen).
  • Render docs with Stoplight / Redoc / Mintlify.

11.10 Webhooks (outgoing)

  • Per-workspace endpoints registered in settings.
  • Sign every payload: X-Signature: sha256=<hmac(body, secret)>.
  • Include X-Event-Id (idempotency) and X-Timestamp (replay defense).
  • Retry with exponential backoff (1m, 5m, 30m, 2h, 12h) β€” fail and notify after final retry.
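Signing and verifying with the Go standard library; hmac.Equal gives the constant-time comparison receivers should use:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sign computes the X-Signature value over the raw request body.
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

// verify recomputes the signature and compares in constant time.
func verify(secret, body []byte, header string) bool {
	return hmac.Equal([]byte(sign(secret, body)), []byte(header))
}

func main() {
	secret := []byte("whsec_demo") // hypothetical per-endpoint secret
	body := []byte(`{"event":"issue.created","id":"evt_1"}`)
	sig := sign(secret, body)
	fmt.Println(verify(secret, body, sig))               // true
	fmt.Println(verify(secret, []byte(`tampered`), sig)) // false
}
```

Verify against the raw bytes as received, before any JSON parsing; re-serialized JSON will not match the signature.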

12. βš™οΈ Background Jobs, Queues & Schedulers

12.1 Three job categories

  • Async (fire-and-forget): send email, post to webhook, sync to CRM. Constraint: must be retried on failure.
  • Scheduled: daily reports, dunning emails, data exports. Constraint: must run within its window, not on the hot path.
  • Long-running: imports, AI batch jobs, video transcode. Constraint: needs progress tracking + cancellation.

12.2 Job system

  • Pick one library per language and stick to it.
  • Go: River (Postgres-backed, transactional) or Asynq (Redis-backed).
  • Python: Arq (asyncio + Redis) or Celery (mature, heavy).
  • Node: BullMQ.

12.3 Idempotency

Every handler must tolerate being called twice. Use a (job_type, dedup_key) unique key, or check-then-act inside a transaction.

12.4 Outbox pattern

When you need "DB write + event emission" to be transactional:

INSERT INTO order ...;
INSERT INTO outbox (event_type, payload) VALUES ('order.created', '...');
COMMIT;

A separate worker polls outbox, fires the event (queue / webhook / Stripe sync), marks it done.
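A sketch of that drain loop, with a slice standing in for the outbox table. The key subtlety: a crash between publish and mark-done causes a redelivery, so consumers must dedup on the row ID (Β§12.3):

```go
package main

import "fmt"

// outboxRow mirrors the outbox table; a slice stands in for Postgres.
type outboxRow struct {
	ID        int
	EventType string
	Payload   string
	Done      bool
}

// drain is one worker pass: pick undelivered rows, publish, mark done.
// Publish must be at-least-once safe, because a crash after publish
// but before marking done means the row is published again next pass.
func drain(rows []outboxRow, publish func(outboxRow) error) []outboxRow {
	for i := range rows {
		if rows[i].Done {
			continue
		}
		if err := publish(rows[i]); err != nil {
			continue // leave for the next poll; backoff omitted
		}
		rows[i].Done = true
	}
	return rows
}

func main() {
	rows := []outboxRow{{ID: 1, EventType: "order.created", Payload: "{}"}}
	delivered := 0
	rows = drain(rows, func(r outboxRow) error { delivered++; return nil })
	rows = drain(rows, func(r outboxRow) error { delivered++; return nil }) // no-op
	fmt.Println(delivered, rows[0].Done) // 1 true
}
```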

12.5 Cron / scheduled jobs

  • Use a single, deduplicated scheduler β€” not cron per box (you'll get duplicate runs on multi-instance deploys).
  • Either Postgres-backed pg_cron or a library-level scheduler (robfig/cron + leader election) works fine.
  • Every scheduled job logs its run + duration to a cron_run table for visibility.

12.6 Long-running progress

For jobs the user can see ("Importing 50,000 contacts…"):

  • Persist a job row with status, progress_pct, total, current, result, error.
  • Worker updates progress every N items / N seconds.
  • UI polls GET /jobs/:id or subscribes via WS.

12.7 The tier above queues: durable execution engines

A queue (Asynq, BullMQ) gives you "run this function later, retry on failure." That's enough for 80% of SaaS work. But once your jobs become multi-step workflows that can pause for hours, fan-out and join, survive worker crashes mid-step, and need exactly-once guarantees end-to-end (think: subscription onboarding flow, multi-day customer pipeline, agent runs that pause for human approval), a queue starts to bend. You end up rebuilding state machines, sagas, and resumability on top of it. That's the signal to step up to a durable execution engine.

  • Temporal (OSS; self-host or the managed Temporal Cloud). The category leader: workflows-as-code in Go/TS/Python/Java/.NET, deterministic replay, built-in retries/timeouts/heartbeats/sagas/signals/queries. The right pick for serious multi-step orchestration (billing flows, KYC, ETL pipelines, long-running agents; see Β§18 of the AI playbook). Watch out for: operationally non-trivial β€” a Temporal cluster needs Cassandra/PostgreSQL plus history and matching services; use Temporal Cloud (~$200/mo starter) until you have a reason not to. Workflow code must be deterministic, which is surprising at first.
  • Hatchet (OSS, Postgres-backed). Temporal-shaped (durable workflows, retries, fan-out, human-in-the-loop) but runs on just Postgres β€” no separate cluster. Excellent fit for teams that already have Postgres and don't want to operate Temporal. Python and TS SDKs, Go in progress. Watch out for: younger project, smaller ecosystem; Postgres becomes a hot bottleneck at very high workflow volume β€” fine for thousands/sec, not millions.
  • Inngest (managed, with OSS dev tools). Step-functions-style workflows in TS/Python, focused on developer ergonomics and event-driven triggers. Best for serverless/Vercel-shaped stacks. Watch out for: less control if you self-host; managed pricing scales with executions.
  • Restate (OSS, single binary). Newer durable execution runtime focused on simplicity, with TS/Java/Kotlin/Python/Go/Rust SDKs. Worth watching. Watch out for: smaller community than Temporal/Hatchet today.

When to pick a durable execution engine over a queue:

  • A workflow has β‰₯3 steps, any of which can be retried independently.
  • A workflow needs to pause and wait β€” for an external webhook, a human approval, a timer measured in hours/days.
  • "If the worker crashes mid-step, the work must continue from exactly where it left off" is a real requirement, not a nice-to-have.
  • You're writing your fourth state-machine table this quarter.

Recommendation by stage:

  • Day one of the template: stick with the queue from Β§12.2. Don't import Temporal complexity before you need it.
  • Year one, indie/bootstrapped: if you cross the threshold above, Hatchet is the path of least resistance β€” it slots into your existing Postgres.
  • Year two, funded / enterprise: Temporal Cloud is the safe pick β€” battle-tested, audited, used by Uber/Snap/Netflix, deep tooling. The managed offering removes the operational pain.

The same Bus / Worker interface pattern from Β§4.4 applies: workflows are invoked through a thin adapter so swapping queues for Temporal later is a worker rewrite, not an API rewrite. AI agents in particular (long pause, human-in-the-loop, hours-long runs) are the canonical fit β€” see the AI playbook Β§18.


13. πŸ“‘ Real-time & Eventing

13.1 In-process event bus (the spine)

A simple synchronous publisher with topic-based listeners:

bus.Publish(ctx, "issue.created", IssueCreated{ID: ..., WorkspaceID: ...})

Listeners write derived state, enqueue jobs, and broadcast over WS.

Important: subscribers register before publishers. Document the order in main.go. Order is load-bearing.
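A bus this small fits in ~20 lines. The synchronous dispatch is the point: handlers run inline in the publisher's call, so they can share its transaction if you pass one through the context:

```go
package main

import "fmt"

// Bus is a minimal synchronous topic bus. Not safe for concurrent
// Subscribe calls; registration happens once, at startup.
type Bus struct {
	handlers map[string][]func(payload any)
}

func NewBus() *Bus { return &Bus{handlers: map[string][]func(any){}} }

// Subscribe must run during startup, before any Publish β€”
// registration order is load-bearing, as noted above.
func (b *Bus) Subscribe(topic string, h func(payload any)) {
	b.handlers[topic] = append(b.handlers[topic], h)
}

// Publish invokes every handler for the topic, in registration order.
func (b *Bus) Publish(topic string, payload any) {
	for _, h := range b.handlers[topic] {
		h(payload)
	}
}

func main() {
	bus := NewBus()
	bus.Subscribe("issue.created", func(p any) { fmt.Println("enqueue email for", p) })
	bus.Subscribe("issue.created", func(p any) { fmt.Println("broadcast WS for", p) })
	bus.Publish("issue.created", "issue_42")
}
```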

13.2 WebSocket vs SSE

  • Bidirectional messaging (chat, collaborative editing) β†’ WebSocket.
  • Server-to-client only (live dashboards, notifications) β†’ SSE (simpler, plays nice with HTTP/2).

For most SaaS, SSE is enough. WebSocket only if you have meaningful client→server messaging beyond auth handshake.

13.3 Multi-node fanout

Single API node: in-memory hub. Multi-node: backend hub publishes to a pub/sub bus, every node subscribes and forwards to its connected clients.

  • Redis pub/sub: you already have Redis. Fire-and-forget. No durability β€” a disconnected node misses messages.
  • Redis Streams: same Redis, but with replay + consumer groups. A good middle ground.
  • NATS JetStream: the right answer for any SaaS that's growing into multiple services. Persistent streams, replay, exactly-once-on-ack consumers, KV + object store, per-tenant subjects (ws.<workspace_id>.>), works as eventing backbone and WS fan-out and job queue. Cheap to self-host (single binary), clusters trivially.
  • Kafka / Redpanda: you have a data team and analytics pipelines. Overkill as a starting point.

[Browser] ─WS─► [API node A] ─pub─► [NATS JetStream] ─sub─► [API node B] ─WS─► [Browser]
                                          β”‚
                                          └─► [Worker pool] (durable consumers, replay on crash)

Why NATS JetStream is the recommended template default once you outgrow single-node:

  • One binary replaces Redis pub/sub + a job queue + an event log.
  • Per-tenant subject hierarchy (tenant.<workspace_id>.events.>) maps cleanly to multi-tenancy.
  • Durable consumers give you the outbox-pattern guarantees (Β§12.4) without an outbox table for cross-service events.
  • KV bucket for ephemeral state (presence, rate-limit counters) β€” you can drop Redis in some deployments.

Don't make any of this required for the dev/single-node experience. Single-node self-host should run on Postgres alone, with the bus interface no-op'd to an in-memory channel.

// Bus abstraction β€” same interface, different backends.
type Bus interface {
    Publish(ctx context.Context, subject string, payload []byte) error
    Subscribe(ctx context.Context, subject string, h Handler) (Subscription, error)
}
// inproc.NewBus() | redis.NewBus(rdb) | nats.NewJetStreamBus(js)

13.4 Realtime ↔ Cache invalidation rule

WS events invalidate Query cache. They never write directly to client stores.

Why: WS messages can arrive out of order, can be dropped, can be replayed. Cache invalidation is idempotent; direct writes are not.

ws.on("issue.updated", ({ id }) => {
  queryClient.invalidateQueries(["issue", id])
})

14. πŸ“¨ Email, Notifications & Inbox

14.1 Three notification surfaces

  • Transactional email (Resend / Postmark / SES): verify, reset, invite, receipts, dunning.
  • In-app inbox (your own DB): mentions, comments, status changes, system messages.
  • Push / SMS (Twilio / OneSignal / APNS): mobile-only critical alerts.

14.2 Templates

  • Use MJML or React Email for transactional templates. Renders to bulletproof HTML across clients.
  • Keep one template per email type. Centralize a "layout" component.
  • Plain-text fallback always.

14.3 Per-user preferences

notification_preference (
    user_id, workspace_id, channel TEXT, event_type TEXT, enabled BOOL
)

Every email and in-app alert checks preferences before sending. Default new events to "on" β€” but always allow opt-out with one click.

14.4 Unsubscribe link

  • Every transactional email except security/billing has a List-Unsubscribe header + footer link.
  • One-click unsubscribe (mailto: + URL).
  • Persist the opt-out permanently β€” if the contact record bounces and is later recreated, the suppression must survive.

14.5 In-app inbox

Same data shape as email events. Render a bell icon with unread count + a list view. Keys:

  • notification rows: user_id, workspace_id, kind, payload JSONB, read_at.
  • WS push for live updates.
  • Mark-all-read endpoint.

14.6 Digesting / batching

For high-volume events (chat mentions, comment replies):

  • Real-time push if user is online.
  • Otherwise, batch into a digest email (hourly/daily), configurable per user.

15. πŸ“¦ File Storage, Uploads & CDN

15.1 The cardinal rule

Never proxy file bytes through your API server. Client uploads directly to S3 via signed URL.

[Client] ──GET /upload-url──► [API] ──signed PUT URL──► [Client]
[Client] ──PUT───────────────────────────────────────► [S3]
[Client] ──POST /confirm──► [API] (records metadata)

15.2 Server-issued signed URLs

url := s3.PresignPutObject(ctx, bucket, key, ttl=15min, contentType=..., maxSize=...)

Always set:

  • TTL (15 min usually).
  • Content-Type constraint.
  • Content-Length max (defense against unbounded uploads).
  • Tenant-scoped key prefix: s3://your-bucket/<workspace_id>/<file_id>.

15.3 File metadata

file (
    id UUID PK,
    workspace_id UUID,
    uploader_user_id UUID,
    filename TEXT,
    mime_type TEXT,
    size_bytes BIGINT,
    s3_key TEXT,
    sha256 TEXT,
    status TEXT,  -- pending | uploaded | scanned | quarantined
    created_at TIMESTAMPTZ
)

15.4 Virus / content scanning

  • For user-uploaded files, scan on upload (S3 event β†’ Lambda / worker β†’ ClamAV / proprietary).
  • Until scanned, mark status = pending and refuse to serve.

15.5 Serving private files

  • Generate signed GET URLs (5–60 min TTL), or
  • Stream from server with auth check (only for small / sensitive files).

15.6 CDN

  • Cloudflare or CloudFront in front of S3.
  • Use signed CloudFront URLs for private content.
  • Public assets (avatars, public docs) get a permanent path with cache-busting via content hash.

16. πŸ”Ž Search (Full-Text + Semantic)

16.1 Start with Postgres

CREATE INDEX idx_issue_search ON issue
    USING GIN (to_tsvector('english', title || ' ' || coalesce(content, '')));

pg_trgm adds typo tolerance:

CREATE INDEX idx_issue_title_trgm ON issue USING GIN (title gin_trgm_ops);

This carries you to ~10M rows easily.

16.2 Move to a search engine when you need

  • Fuzzy search across many fields with relevance tuning β†’ Meilisearch or Typesense (both excellent DX).
  • Massive scale + analytics β†’ Elasticsearch / OpenSearch.
  • Replicate from Postgres via CDC (Debezium) or write-on-write triggers.

16.3 Vector / semantic search

CREATE EXTENSION vector;
ALTER TABLE document ADD COLUMN embedding vector(1536);
CREATE INDEX ON document USING hnsw (embedding vector_cosine_ops);

Generate embeddings via OpenAI / local model in a worker after content changes. Don't generate them in the request path.

16.4 Hybrid search

Combine BM25 (keyword) and vector (semantic) with reciprocal rank fusion:

score(doc) = 1/(k + rank_bm25) + 1/(k + rank_vector)

This dramatically beats either alone for product search.
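The formula in code, to make it concrete (k = 60 is the conventional constant; ranks are 1-based):

```go
package main

import "fmt"

// rrf fuses any number of ranked lists: each list contributes
// 1/(k+rank) for every document it ranks. Documents missing from a
// list simply get no contribution from it.
func rrf(k float64, rankings ...map[string]int) map[string]float64 {
	scores := map[string]float64{}
	for _, ranking := range rankings {
		for doc, rank := range ranking {
			scores[doc] += 1.0 / (k + float64(rank))
		}
	}
	return scores
}

func main() {
	bm25 := map[string]int{"doc_a": 1, "doc_b": 2, "doc_c": 3}
	vector := map[string]int{"doc_b": 1, "doc_c": 2, "doc_a": 9}
	scores := rrf(60, bm25, vector)
	// doc_b ranks high in both lists, so fusion puts it on top.
	fmt.Printf("a=%.4f b=%.4f c=%.4f\n", scores["doc_a"], scores["doc_b"], scores["doc_c"])
}
```

Because only ranks matter, BM25 scores and cosine similarities never need to be normalized against each other β€” that is the whole appeal of RRF.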


17. 🚩 Feature Flags & Experiments

17.1 Three flag scopes

flag β†’ environment (dev/staging/prod)
     β†’ workspace (tenant-level rollout)
     β†’ user (individual override)

Every flag check resolves: env default β†’ workspace override β†’ user override.
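The resolution chain as a pure function; the config shape here is illustrative, with the most specific scope winning:

```go
package main

import "fmt"

// FlagConfig holds one flag's state across the three scopes.
type FlagConfig struct {
	EnvDefault bool
	Workspace  map[string]bool // workspaceID β†’ override
	User       map[string]bool // userID β†’ override
}

// resolve applies env default, then workspace override, then user
// override β€” later (more specific) scopes win.
func resolve(f FlagConfig, workspaceID, userID string) bool {
	v := f.EnvDefault
	if ov, ok := f.Workspace[workspaceID]; ok {
		v = ov
	}
	if ov, ok := f.User[userID]; ok {
		v = ov
	}
	return v
}

func main() {
	f := FlagConfig{
		EnvDefault: false,
		Workspace:  map[string]bool{"ws_beta": true},
		User:       map[string]bool{"u_qa": true},
	}
	fmt.Println(resolve(f, "ws_other", "u_1"))  // false: env default
	fmt.Println(resolve(f, "ws_beta", "u_1"))   // true: workspace rollout
	fmt.Println(resolve(f, "ws_other", "u_qa")) // true: user override
}
```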

17.2 Use a service

  • Self-host: PostHog, Unleash, GrowthBook.
  • Hosted: LaunchDarkly, Statsig.
  • DIY: simple flag table + Redis cache β†’ fine for ≀ 50 flags.

17.3 The kill-switch culture

Every risky new feature ships behind a flag. Rule: "if it's not behind a flag, it can't ship."

if flags.IsEnabled(ctx, "new_billing_engine", workspaceID) {
    return newPath()
}
return oldPath()

After 2 weeks of stable rollout: clean up the flag and the dead branch.

17.4 Experiments / A-B tests

Ship via the same flag system with a randomized assignment. Log assignment + outcome to your analytics warehouse. Decide significance with a stats library or PostHog's experiment view β€” don't eyeball.


18. πŸ“Š Audit Logs, Activity Feeds & Telemetry

18.1 Three different things, often confused

  • Audit log: for compliance / security teams; retained for years; immutable, append-only.
  • Activity feed: for end users ("Alice changed the title"); retained for months; mutable summaries OK.
  • Telemetry / analytics: for your team (product/eng); retained months–years; aggregated, anonymized.

Don't try to use one table for all three.

18.2 Audit log table

audit_log (
    id UUID PK,
    workspace_id UUID,
    actor_user_id UUID NULL,
    actor_type TEXT,          -- user | api_key | system
    action TEXT,              -- "issue.delete", "billing.plan.change", "auth.login"
    target_type TEXT,
    target_id UUID,
    metadata JSONB,
    ip_address INET,
    user_agent TEXT,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- never UPDATE or DELETE this table; partition by month

Log every privileged action: settings change, role change, billing change, member invite/remove, file deletion, login, password change, MFA enable/disable.

18.3 Activity feed

For end-user "what happened to my project":

activity (
    id, workspace_id, actor_user_id, verb, object_type, object_id, metadata, created_at
)

Render with templates: "{actor} {verb} {object}".

18.4 Export

Enterprise plan users want audit log export (CSV / JSON / Splunk-compatible). Build the endpoint behind a feature flag.

(To be continued.) Read Part 2 here: https://viblo.asia/p/the-saas-template-playbook-part-2-2vJPdW2MJeK


If you found this helpful, let me know by leaving a πŸ‘ or a comment! And if you think this post could help someone, feel free to share it. Thank you very much! πŸ˜ƒ

