franz/keycloak-benchmark




Franz Geffke 9482b05df9 feat: added recommended specs section to readme		2026-05-26 08:40:53 +01:00
.claude/skills/keycloak-benchmark	initial commit	2026-05-14 14:19:06 +01:00
bench	initial commit	2026-05-14 14:19:06 +01:00
plots	initial commit	2026-05-14 14:19:06 +01:00
results	initial commit	2026-05-14 14:19:06 +01:00
scripts	initial commit	2026-05-14 14:19:06 +01:00
seeder	initial commit	2026-05-14 14:19:06 +01:00
.env.example	initial commit	2026-05-14 14:19:06 +01:00
.gitignore	initial commit	2026-05-14 14:19:06 +01:00
docker-compose.yml	initial commit	2026-05-14 14:19:06 +01:00
README.md	feat: added recommended specs section to readme	2026-05-26 08:40:53 +01:00

README.md

Keycloak Benchmark

A small benchmark setup for Keycloak multi-tenancy: how does Keycloak's performance behave as you add more tenants? Specifically, how does the classic realm-per-tenant model compare to the newer organizations feature (KC 26+) that keeps everything in one realm?

I'd heard repeatedly that "Keycloak gets weird with many realms" but never had concrete numbers. This repo is what came out of trying to make that concrete — a docker-compose stack, a REST-based seeder, three k6 workloads, a sweep driver, and a Python plotter. Everything's reproducible.

TL;DR: for realm-per-tenant multi-tenancy on a single node, Keycloak is not viable past ~500 tenants. Admin API throughput collapses by 4 orders of magnitude, p95 latency goes from 2 ms to 17 s, and somewhere around 1200 realms the server runs itself out of memory chasing scheduled cache reloads. Organizations mode stays essentially flat through at least 1500 tenants under the same workload.

Hardware

All numbers below come from a single laptop:

CPU: AMD Ryzen 5 7640U (6 cores / 12 threads, ~4.3 GHz boost)
RAM: 60 GB
Storage: NVMe SSD, encrypted (LUKS)
OS: Guix System, Linux 6.19
Container runtime: Podman 5.8 (rootless), podman-compose
Keycloak: 26.0, production start mode, Postgres backend
Postgres: 16-alpine, default config
Network: k6 → KC over --network host (effectively localhost, no TLS)

No artificial resource limits on the containers — KC gets the whole box. Numbers will vary on other hardware, but the shape of the curves should not. Take the absolute throughput figures with a grain of salt; the realms-vs-orgs delta is what matters.

What's measured

Workload	What it does	Why it matters
`token-issuance`	OIDC password grant against a uniform-random tenant, 50 VUs	The hot path most apps actually hit
`jwks`	GET `/realms/{r}/protocol/openid-connect/certs`, 50 VUs	Cached path; sanity check that not everything degrades
`admin-api`	List realms + get realm + list users, 10 VUs, admin token	Where the multi-tenancy pain shows up first
`seed`	REST loop creating realms/orgs + clients + roles + groups + users	The provisioning angle: "how slow is automated tenant onboarding"
`boot`	Restart KC, time to `/health/ready`, snapshot container RSS	Cold-start latency and memory footprint vs realm count

Each k6 workload is run cold (immediately after a KC restart with N tenants already seeded) and warm (after a 60 s warmup pass that's discarded). Cold/warm matters because production failure modes — the admin console going unresponsive after a redeploy — are cold-only.

Each tenant has the same content in both modes:

2 OIDC clients (1 public web, 1 confidential backend)
5 realm roles + 2 client roles per client
2 groups
10 users, each assigned to a group and a role
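To make the per-tenant content concrete, here is a small Python sketch that multiplies it out for each sweep size. The `TenantShape` class and `totals` function are hypothetical names for illustration, not part of the seeder, and the sketch only counts objects; it says nothing about how they map onto realms or organizations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantShape:
    """Per-tenant content as listed above (hypothetical helper, not the seeder)."""
    clients: int = 2
    realm_roles: int = 5
    client_roles_per_client: int = 2
    groups: int = 2
    users: int = 10

    def roles(self) -> int:
        # 5 realm roles plus 2 client roles for each of the 2 clients.
        return self.realm_roles + self.clients * self.client_roles_per_client


def totals(n_tenants: int, shape: TenantShape = TenantShape()) -> dict:
    """Total number of objects the seeder has to create for n_tenants."""
    return {
        "clients": n_tenants * shape.clients,
        "roles": n_tenants * shape.roles(),
        "groups": n_tenants * shape.groups,
        "users": n_tenants * shape.users,
    }


if __name__ == "__main__":
    for n in (1, 100, 500, 1000, 1500):
        print(n, totals(n))
```

At N=1000 that is 2,000 clients, 9,000 roles, 2,000 groups, and 10,000 users.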

How to run it

There's a Claude Code skill (.claude/skills/keycloak-benchmark/) that walks through this end-to-end. Drop into Claude Code in this repo and ask "run the keycloak benchmark" — it'll handle the prerequisites, suggest a scope, run the sweep, and aggregate the results.

If you'd rather do it by hand:

cp .env.example .env
cd seeder && npm install && cd ..

# Bring up the stack
podman compose up -d
./scripts/wait-ready.sh 180

# Run the sweep (~3–4 hours on Ryzen 7640U for NS=1..1500 both modes)
NS="1 100 500 1000 1500" MODES="realms orgs" ./scripts/run-sweep.sh

# Aggregate + plot
node scripts/aggregate.mjs
guix shell python python-matplotlib python-pandas -- python3 plots/generate.py

The skill is the better path: it knows about the realms-mode trap above N=1000 (see below) and won't let you walk into it unattended.

Findings

Admin API: realm-per-tenant collapses fast

This is the headline. Listing realms, fetching a realm's metadata, and listing users in a realm — the things the admin console does on every page load — degrade catastrophically as realm count grows.

N	realms throughput (req/s)	orgs throughput (req/s)	realms p95	orgs p95
1	7,744	4,710	2 ms	5 ms
100	367	4,250	76 ms	6 ms
500	1.96	4,545	16.6 s	5 ms
1000	0.48	3,032	60.0 s (timeout)	8 ms
1500	—	2,777	—	9 ms

Orgs mode actually starts slower at N=1 (fewer optimisations in the per-org code path) but stays roughly flat through N=1500. Realms drops 4 orders of magnitude.
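The "orders of magnitude" figure can be checked directly from the table. The sketch below is plain arithmetic on the numbers above; the function names are made up for illustration and nothing here comes from the repo's aggregation script.

```python
import math

# (N, realms req/s, orgs req/s) copied from the admin-api table; None means no data.
ADMIN_API_THROUGHPUT = [
    (1, 7744.0, 4710.0),
    (100, 367.0, 4250.0),
    (500, 1.96, 4545.0),
    (1000, 0.48, 3032.0),
    (1500, None, 2777.0),
]


def orders_of_magnitude_lost(baseline: float, value: float) -> float:
    """log10 of the slowdown factor relative to the baseline."""
    return math.log10(baseline / value)


def slowdown_by_n(rows, column: int) -> dict:
    """Map N -> orders of magnitude lost versus the N=1 row, skipping gaps."""
    baseline = rows[0][column]
    return {
        row[0]: orders_of_magnitude_lost(baseline, row[column])
        for row in rows
        if row[column] is not None
    }


if __name__ == "__main__":
    for label, column in (("realms", 1), ("orgs", 2)):
        for n, lost in slowdown_by_n(ADMIN_API_THROUGHPUT, column).items():
            print(f"{label:6s} N={n:<5d} {lost:5.2f} orders of magnitude below N=1")
```

Realms lands a little above 4 at N=1000 (7,744 / 0.48 is roughly a 16,000× slowdown), while orgs stays well under one order of magnitude at N=1500.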

Token issuance: degrades gracefully in both modes

The user-facing OIDC login flow holds up much better. By N=1000, realms has lost about 43 % of its N=1 throughput and orgs about 34 %, but neither falls off a cliff.

N	realms (req/s, warm)	orgs (req/s, warm)
1	185	147
100	151	143
500	146	144
1000	106	97
1500	—	92

Worth noting: the cold-start tail is much uglier than the warm steady state. At N=100 realms, max latency for the first 30 seconds after a restart hits 6.4 seconds — that's a user staring at a hung login form for 6 seconds. By warm phase it's back under 600 ms.

JWKS: cached, stays fast in both modes

The JWKS endpoint is heavily cached. Both modes hold above 11k req/s at every tenant count. It is not a differentiator; it's included as a sanity check that "the host machine isn't just falling over."

Memory: where realms-mode dies

N	realms RSS (MB)	orgs RSS (MB)
1	882	895
100	891	892
500	1,055	901
1000	1,062	877
1500	(killed at 43 GB during seed)	882

In realms mode, RSS at ready jumps about 18 % between N=100 and N=500 (891 MB to 1,055 MB). But the real failure was during seeding at N≈1200: KC's ClearExpiredUserSessions scheduled task started iterating every realm on every tick, and each call rehydrated the realm cache. Within minutes, process RSS went from ~1 GB to 43 GB (72 % of total system memory) and CPU pegged at 600 %. Seeding rate collapsed from ~1 s/tenant to ~25 s/tenant and we killed it.

Orgs mode shows none of this — RSS is indistinguishable from N=1 at N=1500.

Provisioning speed: REST loop is brutal on realms

We provisioned each tenant via the standard Keycloak admin REST API (@keycloak/keycloak-admin-client) with 50 parallel requests. This is what you'd actually run in real tenant-onboarding automation.
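The repo's seeder is Node.js, so the following is only a Python sketch of the same bounded-concurrency pattern, not the seeder's code. It assumes "50 parallel requests" means at most 50 calls in flight at once; `provision` and `fake_create_tenant` are hypothetical names, and the fake stands in for whatever admin REST calls create one tenant.

```python
import asyncio


async def provision(tenant_ids, create_tenant, concurrency: int = 50):
    """Run create_tenant for every id with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(tenant_id):
        # Wait for a free slot before starting the next tenant.
        async with semaphore:
            return await create_tenant(tenant_id)

    # gather() preserves input order in its result list.
    return await asyncio.gather(*(guarded(t) for t in tenant_ids))


async def fake_create_tenant(tenant_id):
    """Hypothetical stand-in for the admin REST calls that create one tenant."""
    await asyncio.sleep(0)
    return f"tenant-{tenant_id}"


if __name__ == "__main__":
    created = asyncio.run(provision(range(5), fake_create_tenant, concurrency=2))
    print(created)
```

With a fixed concurrency cap, total seed time is driven by how long each admin call takes, which is why the per-tenant cost in the table below is the number to watch.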

N	realms seed (s)	orgs seed (s)
1	2	2
100	29	11
500	387	37
1000	3,524	84
1500	(killed)	159

At N=1000 realms, seeding took 59 minutes vs orgs' 84 seconds, a 42× gap, and the realms curve is super-linear. Average per-tenant time grows from 0.3 s at N=100 to 3.5 s at N=1000, and the marginal rate had reached ~25 s per tenant around N≈1200, where the N=1500 run was killed. Same admin REST API, same content per tenant, same concurrency.
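The per-tenant averages and the realms-to-orgs gap follow from the seed table by simple division. A short sketch, with hypothetical function names; the N=1500 row is left out because realms mode has no completed run there.

```python
# N -> (realms seed seconds, orgs seed seconds), copied from the table above.
SEED_SECONDS = {
    1: (2, 2),
    100: (29, 11),
    500: (387, 37),
    1000: (3524, 84),
}


def per_tenant_seconds(n: int, total_seconds: float) -> float:
    """Average seed time per tenant for a completed run of n tenants."""
    return total_seconds / n


def realms_to_orgs_ratio(n: int) -> float:
    """How many times longer the realms seed took than the orgs seed at n."""
    realms, orgs = SEED_SECONDS[n]
    return realms / orgs


if __name__ == "__main__":
    for n, (realms, orgs) in SEED_SECONDS.items():
        print(
            f"N={n:<5d} realms {per_tenant_seconds(n, realms):6.3f} s/tenant  "
            f"orgs {per_tenant_seconds(n, orgs):6.3f} s/tenant  "
            f"ratio {realms_to_orgs_ratio(n):5.1f}x"
        )
```

These are averages over a whole run. A super-linear curve means the last tenants in a run cost more than the average suggests.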

Boot time

N	realms boot (s)	orgs boot (s)
1	6.83	7.13
100	7.73	7.41
500	8.33	7.83
1000	12.74	9.55
1500	—	9.71

Realms mode adds ~6 seconds of cold-start by N=1000 — meaningful in a restart-after-deploy context. Orgs adds <3 seconds across the whole range.
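For reference, a boot-time measurement of this kind can be sketched as a polling loop against the readiness endpoint. This is an illustration, not the repo's `wait-ready.sh`: the URL (including the port) is an assumption about where `/health/ready` is exposed in your setup, and the 180 default assumes the script's argument is a timeout in seconds. The probe, clock, and sleep functions are injectable so the loop can be exercised without a running server.

```python
import time
import urllib.request

# Assumption: adjust host and port to wherever your KC exposes /health/ready.
READY_URL = "http://localhost:9000/health/ready"


def http_ready(url: str = READY_URL, timeout: float = 2.0) -> bool:
    """True if the readiness endpoint answers 200; any error counts as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError are OSError subclasses).
        return False


def time_until_ready(probe, deadline_s=180.0, interval_s=0.25,
                     clock=time.monotonic, sleep=time.sleep) -> float:
    """Poll probe() until it returns True; return elapsed seconds."""
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= deadline_s:
            raise TimeoutError(f"not ready after {deadline_s} s")
        sleep(interval_s)


if __name__ == "__main__":
    # Start the timer right after triggering the restart.
    print(f"ready after {time_until_ready(http_ready):.2f} s")
```

The polling interval bounds the measurement resolution, so keep it well below the differences you want to see between runs.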

What this means

If you need true tenant isolation — separate password policies, identity providers, themes, and login flows per tenant — and you expect to grow past ~500 tenants, single-node realm-per-tenant Keycloak is not the answer. The admin API ceiling alone makes the admin console unusable; the memory and provisioning curves make routine operations painful before that.

For the same scale, the organizations feature (KC 26+) holds up cleanly, at the cost of sharing realm-level config across tenants. For most SaaS-style multi-tenancy that trade is fine: tenants don't need their own themes or their own OAuth flows, they need a logical grouping with isolated membership and policies, which is exactly what orgs give you.

The Keycloak team has been clear about this — see their Sizing guide and the Keycloak Benchmark project they maintain — but it's still useful to have your own numbers from your own hardware.

Recommended specs

Sizing suggestions inferred from the numbers above, plus the usual "give the JVM some headroom" rule of thumb. Worth re-stating up front: this repo only measured idle, post-boot RSS — heap behaviour under sustained load is not directly observed. Treat these as starting points, not guarantees.

Scenario	Mode	Minimum	Recommended
Dev / staging, ≤10 tenants, <100 users	either	1 vCPU / 1.5 GB	2 vCPU / 2 GB
Small SaaS, ≤100 tenants	either (orgs preferred)	2 vCPU / 2 GB	4 vCPU / 4 GB
Mid SaaS, 100–500 tenants	orgs	2 vCPU / 3 GB	4 vCPU / 4 GB
Large SaaS, 500–1500 tenants	orgs	4 vCPU / 4 GB	8 vCPU / 8 GB
1500+ tenants	orgs, HA cluster	benchmark per node first	scale horizontally
Real realm isolation past 500	realms, HA cluster	don't attempt single-node	see the 43 GB seed failure at N≈1200

Add a separate small Postgres instance — 1 vCPU / 1 GB is plenty for these loads. The in-process KC cache is the bottleneck, not the database.

Translating to a cloud SKU:

Token issuance is CPU-bound (password hashing, Argon2 by default in KC 26). Bursty logins? Size CPU first, memory second.
Set -Xmx explicitly. The benchmark let the JVM auto-size against host RAM. In a container with a limit, JVM ergonomics will pick a much smaller heap than you'd expect — aim for ~75 % of the container limit.
Don't size for cold-start. The 6-second token tail at N=100 (and worse beyond) is a one-time hit after deploys, not steady state. Use a readiness probe with slack rather than oversizing the box.
Active sessions matter more than total users. Heap scales with the session/refresh-token cache, roughly active_users × refresh_lifetime — not the size of your user table.
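The ~75 % heap rule from the list above works out as follows. `xmx_flag` is a hypothetical helper for illustration; how the resulting flag reaches the JVM depends on how you launch Keycloak and is left out here.

```python
def xmx_flag(container_limit_mib: int, heap_fraction: float = 0.75) -> str:
    """Return a -Xmx flag sized as a fraction of the container memory limit."""
    if container_limit_mib <= 0:
        raise ValueError("container limit must be positive")
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    # Round down so the heap never exceeds the intended share.
    heap_mib = int(container_limit_mib * heap_fraction)
    return f"-Xmx{heap_mib}m"


if __name__ == "__main__":
    # Container limits from the "Recommended" column of the table above.
    for limit_gib in (2, 4, 8):
        print(f"{limit_gib} GB limit -> {xmx_flag(limit_gib * 1024)}")
```

For a 4 GB limit this gives `-Xmx3072m`, leaving about 1 GB for metaspace, thread stacks, and other off-heap memory.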

Caveats

Take these with a grain of salt:

Single-node only. Multi-node Keycloak with distributed Infinispan behaves differently — and possibly worse in some ways (cache invalidation traffic scales with realm count). Not measured here.
HTTP, not HTTPS. Pinned to plain HTTP for benchmark consistency. TLS adds per-request cost that varies with cipher and session-resumption config.
Uniform random tenant selection. Real traffic is usually heavily skewed, with something like 10 % of tenants taking 90 % of the load. Uniform selection stresses the cache the most (worst case for hit rate), so these numbers are conservative for typical SaaS shapes.
Default Postgres tuning. No shared_buffers/max_connections/etc. tweaks. Production deployments with tuned Postgres may push the ceiling out somewhat — but not by orders of magnitude, since the bottleneck is KC's in-process cache, not the database.
One snapshot of one KC release. 26.0 — newer releases may improve this. Re-running the sweep on a newer version is a good first contribution.
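To illustrate the uniform-selection caveat, here is a toy simulation comparing cache hit rates under uniform and skewed tenant selection. It models a plain LRU cache of fixed size, which is an assumption made for illustration and not a description of Keycloak's actual cache; the 10 %/90 % split, the cache size, and all names are hypothetical.

```python
import random
from collections import OrderedDict


def hit_rate(tenant_stream, cache_size: int) -> float:
    """Fraction of requests served from a fixed-size LRU cache."""
    cache = OrderedDict()
    hits = total = 0
    for tenant in tenant_stream:
        total += 1
        if tenant in cache:
            hits += 1
            cache.move_to_end(tenant)  # mark as most recently used
        else:
            cache[tenant] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / total


def uniform_stream(n_tenants: int, n_requests: int, rng: random.Random):
    """Every tenant is equally likely on every request."""
    return (rng.randrange(n_tenants) for _ in range(n_requests))


def skewed_stream(n_tenants: int, n_requests: int, rng: random.Random,
                  hot_share: float = 0.10, hot_load: float = 0.90):
    """A hot_share slice of tenants receives hot_load of the requests.

    Assumes n_tenants is large enough that some cold tenants remain.
    """
    n_hot = max(1, int(n_tenants * hot_share))
    for _ in range(n_requests):
        if rng.random() < hot_load:
            yield rng.randrange(n_hot)
        else:
            yield n_hot + rng.randrange(n_tenants - n_hot)


if __name__ == "__main__":
    rng = random.Random(42)
    uniform = hit_rate(uniform_stream(1000, 20_000, rng), cache_size=200)
    skewed = hit_rate(skewed_stream(1000, 20_000, rng), cache_size=200)
    print(f"uniform hit rate: {uniform:.2f}, skewed hit rate: {skewed:.2f}")
```

In this model a cache that holds a fifth of the tenants serves only a small fraction of uniform traffic but most of the skewed traffic, which is the sense in which the benchmark's uniform selection is the hard case.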
