Keycloak Benchmark

A small benchmark setup for Keycloak multi-tenancy: how does Keycloak's performance behave as you add more tenants? Specifically, how does the classic realm-per-tenant model compare to the newer organizations feature (KC 26+) that keeps everything in one realm?

I'd heard repeatedly that "Keycloak gets weird with many realms" but never had concrete numbers. This repo is what came out of trying to make that concrete — a docker-compose stack, a REST-based seeder, three k6 workloads, a sweep driver, and a Python plotter. Everything's reproducible.

TL;DR: for realm-per-tenant multi-tenancy on a single node, Keycloak is not viable past ~500 tenants. Admin API throughput collapses by about four orders of magnitude (7,744 req/s at 1 realm, 0.48 req/s at 1000), p95 latency goes from 2 ms to 16.6 s at 500 realms, and somewhere around 1200 realms the server runs itself out of memory chasing scheduled cache reloads. Under the same workload, organizations mode degrades only mildly: admin throughput stays above 2,700 req/s and p95 under 10 ms through at least 1500 tenants.

[Figure: Admin API throughput]

Hardware

All numbers below come from a single laptop:

  • CPU: AMD Ryzen 5 7640U (6 cores / 12 threads, ~4.3 GHz boost)
  • RAM: 60 GB
  • Storage: NVMe SSD, encrypted (LUKS)
  • OS: Guix System, Linux 6.19
  • Container runtime: Podman 5.8 (rootless), podman-compose
  • Keycloak: 26.0, production start mode, Postgres backend
  • Postgres: 16-alpine, default config
  • Network: k6 → KC over --network host (effectively localhost, no TLS)

No artificial resource limits on the containers — KC gets the whole box. Numbers will vary on other hardware, but the shape of the curves should not. Take the absolute throughput figures with a grain of salt; the realms-vs-orgs delta is what matters.

What's measured

| Workload | What it does | Why it matters |
| --- | --- | --- |
| token-issuance | OIDC password grant against a uniform-random tenant, 50 VUs | The hot path most apps actually hit |
| jwks | GET /realms/{r}/protocol/openid-connect/certs, 50 VUs | Cached path; sanity check that not everything degrades |
| admin-api | List realms + get realm + list users, 10 VUs, admin token | Where the multi-tenancy pain shows up first |
| seed | REST loop creating realms/orgs + clients + roles + groups + users | The provisioning angle: "how slow is automated tenant onboarding" |
| boot | Restart KC, time to /health/ready, snapshot container RSS | Cold-start latency and memory footprint vs realm count |

Each k6 workload is run cold (immediately after a KC restart with N tenants already seeded) and warm (after a 60 s warmup pass that's discarded). The cold/warm split matters because some production failure modes, such as the admin console going unresponsive after a redeploy, only show up cold.
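The latency figures in the findings are p95 values as reported by k6. To make the metric concrete, here is a minimal sketch of a nearest-rank 95th percentile. The nearest-rank method and the sample data are assumptions for illustration; k6 may interpolate differently, and none of this is code from the repo.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile: the smallest sample such that
    at least 95% of all samples are less than or equal to it."""
    if not latencies_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(latencies_ms)
    # ceil(0.95 * n) computed in integer arithmetic; rank is 1-based.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical samples: a slow tail only moves p95 once it exceeds 5% of requests.
mostly_fast = [5] * 96 + [6400] * 4
slow_tail = [5] * 94 + [6400] * 6

print(p95(mostly_fast))  # 5
print(p95(slow_tail))    # 6400
```

This is also why the cold-start numbers are reported with max latency as well as p95: a handful of multi-second requests right after a restart can be invisible in p95.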

Each tenant has the same content in both modes:

  • 2 OIDC clients (1 public web, 1 confidential backend)
  • 5 realm roles + 2 client roles per client
  • 2 groups
  • 10 users, each assigned to a group and a role
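For a sense of scale, the per-tenant content above can be multiplied out. This is a rough sketch that counts only the listed clients, roles, groups, and users; it ignores the realm or organization itself and the role/group assignments, and the function names are made up for illustration.

```python
# Per-tenant content, as listed above.
CLIENTS = 2
REALM_ROLES = 5
CLIENT_ROLES_PER_CLIENT = 2
GROUPS = 2
USERS = 10


def objects_per_tenant() -> int:
    """Clients + realm roles + client roles + groups + users for one tenant."""
    client_roles = CLIENTS * CLIENT_ROLES_PER_CLIENT
    return CLIENTS + REALM_ROLES + client_roles + GROUPS + USERS


def totals(n_tenants: int) -> dict:
    """Total users and total seeded objects for a given tenant count."""
    return {
        "tenants": n_tenants,
        "users": n_tenants * USERS,
        "objects": n_tenants * objects_per_tenant(),
    }


for n in (1, 100, 500, 1000, 1500):
    print(totals(n))
```

At N=1500 that comes to 15,000 users and 34,500 objects, which is small by database standards and consistent with the bottleneck being elsewhere.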

How to run it

There's a Claude Code skill (.claude/skills/keycloak-benchmark/) that walks through this end-to-end. Drop into Claude Code in this repo and ask "run the keycloak benchmark" — it'll handle the prerequisites, suggest a scope, run the sweep, and aggregate the results.

If you'd rather do it by hand:

cp .env.example .env
cd seeder && npm install && cd ..

# Bring up the stack
podman compose up -d
./scripts/wait-ready.sh 180

# Run the sweep (~3–4 hours on Ryzen 7640U for NS=1..1500 both modes)
NS="1 100 500 1000 1500" MODES="realms orgs" ./scripts/run-sweep.sh

# Aggregate + plot
node scripts/aggregate.mjs
guix shell python python-matplotlib python-pandas -- python3 plots/generate.py

The skill is the better path: it knows about the realms-mode trap on the way to N=1500, where seeding blew up at around 1200 realms (see below), and won't let you walk into it unattended.

Findings

Admin API: realm-per-tenant collapses fast

This is the headline. Listing realms, fetching a realm's metadata, and listing users in a realm — the things the admin console does on every page load — degrade catastrophically as realm count grows.

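| N | realms throughput (req/s) | orgs throughput (req/s) | realms p95 | orgs p95 |
| --- | --- | --- | --- | --- |
| 1 | 7,744 | 4,710 | 2 ms | 5 ms |
| 100 | 367 | 4,250 | 76 ms | 6 ms |
| 500 | 1.96 | 4,545 | 16.6 s | 5 ms |
| 1000 | 0.48 | 3,032 | 60.0 s (timeout) | 8 ms |
| 1500 | n/a | 2,777 | n/a | 9 ms |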

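Orgs mode actually starts slower at N=1 (fewer optimisations in the per-org code path) and loses about 40 % of its throughput by N=1500, but it stays in the thousands of requests per second with single-digit-millisecond p95. Realms drops 4 orders of magnitude by N=1000.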
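The "orders of magnitude" framing can be checked directly against the table. A small sketch, using only the throughput figures above (the helper names are illustrative):

```python
import math

# Admin API throughput (req/s), copied from the table above.
REALMS = {1: 7744, 100: 367, 500: 1.96, 1000: 0.48}
ORGS = {1: 4710, 100: 4250, 500: 4545, 1000: 3032, 1500: 2777}


def orders_of_magnitude(before: float, after: float) -> float:
    """Base-10 log of the slowdown factor: 1.0 means 10x slower, 2.0 means 100x."""
    return math.log10(before / after)


def retained(series: dict) -> float:
    """Fraction of the smallest-N throughput still available at the largest N."""
    return series[max(series)] / series[min(series)]


print(f"realms, N=1 -> N=1000: {orders_of_magnitude(REALMS[1], REALMS[1000]):.1f} orders of magnitude")
print(f"orgs, N=1 -> N=1500: {retained(ORGS):.0%} of throughput retained")
```

On these numbers realms mode loses about 4.2 orders of magnitude by N=1000, while orgs mode keeps about 59 % of its N=1 throughput at N=1500.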

[Figure: Admin API p95 latency]

Token issuance: degrades gracefully in both modes

The user-facing OIDC login flow holds up much better. By N=1000, warm throughput is down about 43 % in realms mode (185 to 106 req/s) and about 34 % in orgs mode (147 to 97 req/s), but neither falls off a cliff.

| N | realms (req/s, warm) | orgs (req/s, warm) |
| --- | --- | --- |
| 1 | 185 | 147 |
| 100 | 151 | 143 |
| 500 | 146 | 144 |
| 1000 | 106 | 97 |
| 1500 | n/a | 92 |

[Figure: Token throughput]

Worth noting: the cold-start tail is much uglier than the warm steady state. At N=100 realms, max latency for the first 30 seconds after a restart hits 6.4 seconds — that's a user staring at a hung login form for 6 seconds. By warm phase it's back under 600 ms.

[Figure: Token max latency, cold]

JWKS: cached, stays fast in both modes

The JWKS endpoint (/realms/{r}/protocol/openid-connect/certs) is heavily cached. Both modes hold above 11k req/s at all tenant counts. Not a differentiator; it's included as a sanity check that "the host machine isn't just falling over."

[Figure: JWKS throughput]

Memory: where realms-mode dies

| N | realms RSS (MB) | orgs RSS (MB) |
| --- | --- | --- |
| 1 | 882 | 895 |
| 100 | 891 | 892 |
| 500 | 1,055 | 901 |
| 1000 | 1,062 | 877 |
| 1500 | (killed at 43 GB during seed) | 882 |

[Figure: RSS]

In realms mode, RSS at-ready jumps about 18 % between N=100 and N=500 (891 MB to 1,055 MB). But the real failure was during seeding at N≈1200: KC's ClearExpiredUserSessions scheduled task started iterating every realm on every tick, and each call rehydrated the realm cache. Within minutes, process RSS went from ~1 GB to 43 GB (72 % of total system memory) and CPU pegged at 600 %. Seeding rate collapsed from ~1 s/tenant to ~25 s/tenant and we killed it.

Orgs mode shows none of this — RSS is indistinguishable from N=1 at N=1500.

Provisioning speed: REST loop is brutal on realms

We provisioned each tenant via the standard Keycloak admin REST API (@keycloak/keycloak-admin-client) with 50 parallel requests. This is what you'd actually run in real tenant-onboarding automation.

| N | realms seed (s) | orgs seed (s) |
| --- | --- | --- |
| 1 | 2 | 2 |
| 100 | 29 | 11 |
| 500 | 387 | 37 |
| 1000 | 3,524 | 84 |
| 1500 | (killed) | 159 |

[Figure: Seed time]

At N=1000 realms, seeding took 59 minutes vs orgs' 84 seconds, a 42× gap, and the realms curve is super-linear. Average per-tenant time grows from 0.3 s at N=100 to 3.5 s at N=1000, and the marginal cost had reached ~25 s per tenant by N≈1200 on the way to N=1500, when the run was killed. Same admin REST API, same content per tenant, same concurrency.
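One way to put a number on "super-linear" is the local scaling exponent between adjacent measurements: if seed time grows like N^k, then k = log(t2/t1) / log(N2/N1), and k = 1 would be linear. The sketch below applies that to the seed table; the power-law framing is an assumption for illustration, not a fitted model from the repo.

```python
import math

# Seed times in seconds, copied from the table above (N -> seconds).
REALMS_SEED = {100: 29, 500: 387, 1000: 3524}
ORGS_SEED = {100: 11, 500: 37, 1000: 84, 1500: 159}


def scaling_exponent(n1: int, t1: float, n2: int, t2: float) -> float:
    """Exponent k such that t grows like N**k between two measurements."""
    return math.log(t2 / t1) / math.log(n2 / n1)


def exponents(series: dict) -> list:
    """Local exponent for each pair of adjacent tenant counts."""
    points = sorted(series.items())
    return [
        ((n1, n2), scaling_exponent(n1, t1, n2, t2))
        for (n1, t1), (n2, t2) in zip(points, points[1:])
    ]


for label, series in (("realms", REALMS_SEED), ("orgs", ORGS_SEED)):
    for (n1, n2), k in exponents(series):
        print(f"{label}: N={n1}->{n2}: exponent {k:.2f}")

print(f"gap at N=1000: {REALMS_SEED[1000] / ORGS_SEED[1000]:.0f}x")
```

For realms mode the exponent comes out well above 1 on both intervals and above 3 between N=500 and N=1000, which matches the picture of each added realm making every later one more expensive to create.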

Boot time

| N | realms boot (s) | orgs boot (s) |
| --- | --- | --- |
| 1 | 6.83 | 7.13 |
| 100 | 7.73 | 7.41 |
| 500 | 8.33 | 7.83 |
| 1000 | 12.74 | 9.55 |
| 1500 | n/a | 9.71 |

[Figure: Boot time]

Realms mode adds ~6 seconds of cold-start by N=1000 — meaningful in a restart-after-deploy context. Orgs adds <3 seconds across the whole range.

What this means

If you need true tenant isolation — separate password policies, identity providers, themes, and login flows per tenant — and you expect to grow past ~500 tenants, single-node realm-per-tenant Keycloak is not the answer. The admin API ceiling alone makes the admin console unusable; the memory and provisioning curves make routine operations painful before that.

For the same scale, the organizations feature (KC 26+) holds up cleanly, at the cost of sharing realm-level config across tenants. For most SaaS-style multi-tenancy that trade is fine: tenants don't need their own themes or their own OAuth flows, they need a logical grouping with isolated membership and policies, which is exactly what orgs give you.

The Keycloak team has been clear about this — see their Sizing guide and the Keycloak Benchmark project they maintain — but it's still useful to have your own numbers from your own hardware.

Recommended specs

The sizing suggestions below are inferred from the numbers above, plus the usual "give the JVM some headroom" rule of thumb. One thing to state up front: this repo only measured idle, post-boot RSS; heap behaviour under sustained load was not observed directly. Treat these as starting points, not guarantees.

| Scenario | Mode | Minimum | Recommended |
| --- | --- | --- | --- |
| Dev / staging, ≤10 tenants, <100 users | either | 1 vCPU / 1.5 GB | 2 vCPU / 2 GB |
| Small SaaS, ≤100 tenants | either (orgs preferred) | 2 vCPU / 2 GB | 4 vCPU / 4 GB |
| Mid SaaS, 100–500 tenants | orgs | 2 vCPU / 3 GB | 4 vCPU / 4 GB |
| Large SaaS, 500–1500 tenants | orgs | 4 vCPU / 4 GB | 8 vCPU / 8 GB |
| 1500+ tenants | orgs, HA cluster | benchmark per node first | scale horizontally |
| Real realm isolation past 500 | realms, HA cluster | don't attempt single-node | see the 43 GB seed failure at N≈1200 |
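If you want these suggestions in a provisioning script, the orgs-mode rows of the table reduce to a simple threshold lookup. This is a sketch only: the `suggest` helper is hypothetical, it ignores the "<100 users" condition on the first row, and it does not cover the realm-isolation row.

```python
# Rows transcribed from the sizing table above:
# (max tenants, mode, minimum, recommended)
SIZING = [
    (10, "either", "1 vCPU / 1.5 GB", "2 vCPU / 2 GB"),
    (100, "either (orgs preferred)", "2 vCPU / 2 GB", "4 vCPU / 4 GB"),
    (500, "orgs", "2 vCPU / 3 GB", "4 vCPU / 4 GB"),
    (1500, "orgs", "4 vCPU / 4 GB", "8 vCPU / 8 GB"),
]


def suggest(tenants: int) -> dict:
    """Return the first table row whose tenant ceiling covers the request."""
    if tenants < 1:
        raise ValueError("tenants must be at least 1")
    for max_tenants, mode, minimum, recommended in SIZING:
        if tenants <= max_tenants:
            return {"mode": mode, "minimum": minimum, "recommended": recommended}
    # Past the measured range the table gives advice, not a spec.
    return {
        "mode": "orgs, HA cluster",
        "minimum": "benchmark per node first",
        "recommended": "scale horizontally",
    }


print(suggest(50))
print(suggest(800))
print(suggest(2000))
```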

Add a separate small Postgres instance — 1 vCPU / 1 GB is plenty for these loads. The in-process KC cache is the bottleneck, not the database.

Translating to a cloud SKU:

  • Token issuance is CPU-bound on password hashing (Argon2 is the default in recent Keycloak releases outside FIPS mode, not BCrypt). Bursty logins? Size CPU first, memory second.
  • Set -Xmx explicitly. The benchmark let the JVM auto-size against host RAM. In a container with a limit, JVM ergonomics will pick a much smaller heap than you'd expect — aim for ~75 % of the container limit.
  • Don't size for cold-start. The 6-second token tail at N=100 (and worse beyond) is a one-time hit after deploys, not steady state. Use a readiness probe with slack rather than oversizing the box.
  • Active sessions matter more than total users. Heap scales with the session/refresh-token cache, roughly active_users × refresh_lifetime — not the size of your user table.
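The "~75 % of the container limit" rule from the list above is easy to turn into a flag. A minimal sketch, assuming the limit is known in megabytes; the `xmx_flag` helper is hypothetical, and how the flag reaches the JVM depends on your image and start script.

```python
def xmx_flag(container_limit_mb: int, heap_fraction: float = 0.75) -> str:
    """Build a -Xmx flag that gives the heap a fixed share of the container limit.

    The remainder is left for metaspace, thread stacks, direct buffers,
    and other non-heap memory.
    """
    if container_limit_mb <= 0:
        raise ValueError("container limit must be positive")
    if not 0 < heap_fraction < 1:
        raise ValueError("heap fraction must be between 0 and 1")
    heap_mb = int(container_limit_mb * heap_fraction)
    return f"-Xmx{heap_mb}m"


print(xmx_flag(4096))  # -Xmx3072m
print(xmx_flag(8192))  # -Xmx6144m
```

The HotSpot flag -XX:MaxRAMPercentage=75 expresses the same intent relative to whatever limit the JVM detects, which avoids hard-coding a number per environment.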

Caveats

Take these with a grain of salt:

  • Single-node only. Multi-node Keycloak with distributed Infinispan behaves differently — and possibly worse in some ways (cache invalidation traffic scales with realm count). Not measured here.
  • HTTP, not HTTPS. Pinned to plain HTTP for benchmark consistency. TLS adds per-request cost that varies with cipher and session-resumption config.
  • Uniform random tenant selection. Real traffic is usually heavily skewed, with something like 10 % of tenants getting 90 % of the load. Uniform distribution stresses the cache the most (worst-case for hit rate), so these numbers are conservative for typical SaaS shapes.
  • Default Postgres tuning. No shared_buffers/max_connections/etc. tweaks. Production deployments with tuned Postgres may push the ceiling out somewhat — but not by orders of magnitude, since the bottleneck is KC's in-process cache, not the database.
  • One snapshot of one KC release. 26.0 — newer releases may improve this. Re-running the sweep on a newer version is a good first contribution.
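On the uniform-selection caveat: if you want to rerun a workload with skewed traffic, one option is to weight tenants by rank. The sketch below is in Python rather than k6 and uses a Zipf-like weighting; both the distribution and the helper names are assumptions for illustration, not something the benchmark does.

```python
import random


def zipf_weights(n_tenants: int, s: float = 1.0) -> list:
    """Weight for the tenant at each popularity rank: 1 / rank**s."""
    return [1.0 / (rank ** s) for rank in range(1, n_tenants + 1)]


def top_share(weights: list, top_fraction: float = 0.10) -> float:
    """Share of total load that lands on the most popular tenants."""
    k = max(1, int(len(weights) * top_fraction))
    ordered = sorted(weights, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


def pick_tenants(n_tenants: int, n_requests: int, s: float = 1.0, seed: int = 0) -> list:
    """Sample tenant indices for a run; a fixed seed keeps runs comparable."""
    rng = random.Random(seed)
    return rng.choices(range(n_tenants), weights=zipf_weights(n_tenants, s), k=n_requests)


uniform = [1.0] * 1000
print(f"uniform: top 10% of tenants get {top_share(uniform):.0%} of requests")
print(f"zipf s=1.0: top 10% of tenants get {top_share(zipf_weights(1000)):.0%} of requests")
print(pick_tenants(1000, 5))
```

Raising `s` concentrates load further, so it can be tuned until the top-10 % share matches whatever your own traffic looks like.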