Performance

This document presents benchmark results and performance analysis for the current implementation.

Methodology

Benchmark harness

The benchmark is driven by the Criterion.rs library (crates/ahdapa-bench/) and orchestrated by contrib/bench/bench.sh.

What each run measures:

| Scenario | What it tests |
| --- | --- |
| client_credentials | One token endpoint round-trip per supported auth method |
| auth_code | Full authorization code + PKCE flow: login → /authorize → /token (3 round-trips) |
| introspect | Token introspection of a pre-minted token |
| gossip convergence | Time for a key-value write on node 0 to appear on all other nodes |

Criterion collects 100 samples per benchmark function (warm-up: 3 s, measurement window: 30 s). The 30 s window gives adequate headroom for slower post-quantum algorithms (ML-DSA-87 JWT signatures) and avoids the "Unable to complete 100 samples" warning that the default 5 s window triggers. The reported value is the mean of those 100 samples. Confidence intervals span the 5th–95th percentile of Criterion’s bootstrap estimation. Memory overhead (live heap Δ, peak heap, and total allocation pressure) is printed alongside latency but not analysed here.

Grid dimensions:

  • 7 algorithms: ES256, ES384, ES512, EdDSA, ML-DSA-44, ML-DSA-65, ML-DSA-87
  • 6 node counts: 1, 2, 3, 5, 7, 10
  • 42 runs total; each run takes approximately 7 minutes (30 s Criterion measurement window per benchmark group)

Measurement environment:

  • Host: Fedora 42 (x86-64, Linux 6.15)
  • Build: cargo bench --release (bench profile with --release flag)
  • Ahdapa server: release profile; both the Criterion harness and the Ahdapa nodes are compiled with full optimisations
  • TLS: loopback HTTPS with a per-run self-signed P-256 CA; no OCSP or CRL
  • Rate limiting: disabled (auth_rate_limit = 0)
  • Criterion request routing: round-robin across all cluster nodes via a shared Arc<AtomicUsize> counter
  • Git commit: e207adaf (audit-journald branch)
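
The round-robin routing described above can be sketched with the standard library alone. This is a minimal illustration, not the harness code: the function name next_node and its signature are hypothetical, and only the shared Arc<AtomicUsize> counter comes from the description.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical helper: returns the index of the node that should receive
/// the next request. `counter` is shared by all benchmark tasks and
/// `node_count` must be greater than zero.
fn next_node(counter: &Arc<AtomicUsize>, node_count: usize) -> usize {
    // fetch_add returns the previous value, so the first call selects node 0.
    // Relaxed ordering suffices because only the counter value itself matters.
    counter.fetch_add(1, Ordering::Relaxed) % node_count
}
```

With three nodes, successive calls yield 0, 1, 2, 0, 1, 2, and so on, regardless of which task makes the call.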

Topology:

| Nodes | Topology | Description |
| --- | --- | --- |
| 1–5 | full-mesh | Every node peers with every other node |
| 6–10 | hub-spoke | Node 0 (hub) peers all; others peer only to hub |

Gossip interval is 2 seconds in both topologies. The gossip loop wakes immediately on CRDT writes via tokio::sync::Notify; the interval is a fallback for passive re-sync only.
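
The "wake on write, otherwise fall back to the interval" behaviour can be illustrated with a blocking, std-only stand-in. The sketch below is an assumption-laden analogue, not the Ahdapa loop: it replaces tokio::sync::Notify with an mpsc channel and the async timer with recv_timeout, and the names Wake and wait_for_wake are hypothetical. One difference worth noting is that a channel queues every signal, whereas Notify stores at most one pending permit.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Why the (hypothetical) gossip loop woke up.
#[derive(Debug, PartialEq)]
enum Wake {
    /// A CRDT write signalled the loop.
    Write,
    /// No signal arrived within the interval; run a passive re-sync.
    Fallback,
    /// Every sender has been dropped; the loop should exit.
    Shutdown,
}

/// Blocks until a write notification arrives or `interval` elapses.
fn wait_for_wake(rx: &Receiver<()>, interval: Duration) -> Wake {
    match rx.recv_timeout(interval) {
        Ok(()) => Wake::Write,
        Err(RecvTimeoutError::Timeout) => Wake::Fallback,
        Err(RecvTimeoutError::Disconnected) => Wake::Shutdown,
    }
}
```

A loop built on this helper would push to peers on both Wake::Write and Wake::Fallback; only the trigger differs.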

Auth code node affinity

All three steps of the authorization code flow (login, /authorize, /token) are directed to the same node per iteration. Authorization codes are stored in the node’s local database and are not replicated over CRDT, so mixing nodes within a single flow would cause code-not-found failures.

JWKS caching

private_key_jwt and JWT-bearer flows perform a remote JWKS fetch to verify client assertions. A 5-minute in-memory cache (AppState::jwks_cache) is shared across requests. Without this cache, the loopback JWKS fetch adds ~30–50 ms per request and dominates the latency; with caching, private_key_jwt latency is roughly 1.1–3× that of client_secret_basic, depending on algorithm and cluster size.
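
A time-to-live cache of this kind can be sketched as follows. This is only an illustration of the caching idea: the type JwksCache, its fields, and get_or_fetch are hypothetical and are not the layout of AppState::jwks_cache. The current time is passed in explicitly so the expiry logic can be exercised without sleeping.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical TTL cache keyed by JWKS URL; the value is the raw JWKS body.
struct JwksCache {
    ttl: Duration,
    entries: HashMap<String, (Instant, String)>,
}

impl JwksCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Returns the cached body if it is younger than the TTL; otherwise calls
    /// `fetch` (standing in for the remote request) and stores the result.
    fn get_or_fetch<F>(&mut self, url: &str, now: Instant, fetch: F) -> String
    where
        F: FnOnce() -> String,
    {
        if let Some((stored_at, body)) = self.entries.get(url) {
            if now.duration_since(*stored_at) < self.ttl {
                return body.clone();
            }
        }
        let body = fetch();
        self.entries.insert(url.to_string(), (now, body.clone()));
        body
    }
}
```

With a 300-second TTL, a lookup 299 seconds after the first fetch is served from memory, while a lookup at 300 seconds triggers a new fetch.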


Implementation

The token endpoint critical path avoids all inter-request coordination:

  • JTI replay cache: DashMap<String, i64> provides sharded concurrent access (fine-grained per-shard locks rather than one global lock); check_and_insert_jti is synchronous and adds no async overhead
  • Signing key cache: AppState caches the active JWT signing key after first load; subsequent requests skip the database fetch entirely
  • Audit writes: sent via the JournalWriter Unix datagram socket (or JSONL file append); the write is synchronous but sub-microsecond, so token issuance and revocation latency is not affected
  • Gossip wakeup: CRDT-writing admin operations (create_client, revoke_*, scope and HBAC changes) wake the outbound gossip loop immediately via tokio::sync::Notify; the 2-second gossip interval serves only as a passive fallback
  • Parallel peer push: the gossip loop prepares all per-peer payloads under a single CRDT read lock (Phase 1), then sends HTTP POSTs to all peers concurrently via tokio::task::JoinSet (Phase 2), and processes responses sequentially (Phase 3). Wall-clock gossip time is O(max single-peer round-trip) regardless of cluster size
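
The three phases of the parallel peer push can be sketched with scoped threads. This is a std-only analogue under stated assumptions, not the gossip implementation: std::thread::scope stands in for tokio::task::JoinSet, a Vec<u8> behind an RwLock stands in for the CRDT state, the send closure stands in for the HTTP POST, and the function name push_to_peers is hypothetical. Every peer receives the same snapshot here, whereas a real implementation may build a different payload per peer.

```rust
use std::sync::RwLock;
use std::thread;

/// Hypothetical three-phase push; returns the number of peers that accepted.
fn push_to_peers<F>(state: &RwLock<Vec<u8>>, peers: &[String], send: F) -> usize
where
    F: Fn(&str, &[u8]) -> bool + Sync,
{
    // Phase 1: build every per-peer payload under a single read lock.
    let payloads: Vec<Vec<u8>> = {
        let snapshot = state.read().unwrap();
        let built: Vec<Vec<u8>> = peers.iter().map(|_| snapshot.to_vec()).collect();
        built
    }; // the read lock is released here, before any I/O starts

    // Phase 2: one concurrent send per peer, so the wall-clock time is bounded
    // by the slowest peer rather than by the sum over all peers.
    let send = &send;
    let results: Vec<bool> = thread::scope(|scope| {
        let handles: Vec<_> = peers
            .iter()
            .zip(payloads.iter())
            .map(|(peer, payload)| scope.spawn(move || send(peer.as_str(), payload.as_slice())))
            .collect();
        // A panicked sender is treated as a failed push.
        handles.into_iter().map(|h| h.join().unwrap_or(false)).collect()
    });

    // Phase 3: process the responses sequentially (here: count successes).
    results.into_iter().filter(|ok| *ok).count()
}
```

Releasing the read lock before Phase 2 is the point of the split: writers are never blocked for the duration of a network round-trip.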

Results

client_credentials — token latency (µs, mean)

All auth methods are measured via the client_credentials grant, which exercises only the token endpoint (no redirect flow).

ClientSecretBasic

Symmetric HMAC verification against a shared secret. Lowest overhead method.

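| Algorithm | n=1 | n=2 | n=3 | n=5 | n=7 | n=10 |
| --- | --- | --- | --- | --- | --- | --- |
| ES256 | 151 | 183 | 210 | 283 | 305 | 361 |
| ES384 | 452 | 577 | 639 | 772 | 878 | 899 |
| ES512 | 425 | 560 | 650 | 717 | 772 | 778 |
| EdDSA | 160 | 173 | 238 | 291 | 300 | 329 |
| ML-DSA-44 | 888 | 1053 | 1124 | 1198 | 1258 | 1274 |
| ML-DSA-65 | 1382 | 1424 | 1477 | 1563 | 1590 | 1639 |
| ML-DSA-87 | 1677 | 1706 | 1773 | 1918 | 1858 | 1969 |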

ClientSecretPost

Secret in request body instead of Authorization header; otherwise identical to Basic.

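| Algorithm | n=1 | n=2 | n=3 | n=5 | n=7 | n=10 |
| --- | --- | --- | --- | --- | --- | --- |
| ES256 | 156 | 163 | 206 | 288 | 314 | 350 |
| ES384 | 433 | 650 | 661 | 767 | 840 | 877 |
| ES512 | 400 | 568 | 580 | 709 | 713 | 776 |
| EdDSA | 161 | 171 | 228 | 274 | 271 | 364 |
| ML-DSA-44 | 920 | 1068 | 1119 | 1192 | 1353 | 1297 |
| ML-DSA-65 | 1336 | 1536 | 1503 | 1763 | 1695 | 1683 |
| ML-DSA-87 | 1690 | 1825 | 1712 | 1844 | 1788 | 1831 |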

ClientSecretJwt

Server verifies a client-generated HMAC-based JWT assertion; no JWKS fetch.

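| Algorithm | n=1 | n=2 | n=3 | n=5 | n=7 | n=10 |
| --- | --- | --- | --- | --- | --- | --- |
| ES256 | 164 | 191 | 251 | 319 | 347 | 381 |
| ES384 | 486 | 621 | 711 | 794 | 789 | 889 |
| ES512 | 448 | 536 | 635 | 684 | 765 | 815 |
| EdDSA | 176 | 215 | 259 | 320 | 330 | 370 |
| ML-DSA-44 | 932 | 1061 | 1167 | 1209 | 1231 | 1349 |
| ML-DSA-65 | 1407 | 1456 | 1505 | 1599 | 1610 | 1657 |
| ML-DSA-87 | 1713 | 1739 | 1734 | 1929 | 1893 | 1987 |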

PrivateKeyJwt

Server verifies an asymmetric JWT assertion by fetching the client’s JWKS. The JWKS is cached for 5 minutes, so only requests that miss the cache incur a network round-trip. The JWKS server uses TLS (same CA as the cluster).

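| Algorithm | n=1 | n=2 | n=3 | n=5 | n=7 | n=10 |
| --- | --- | --- | --- | --- | --- | --- |
| ES256 | 383 | 546 | 574 | 676 | 669 | 678 |
| ES384 | 727 | 908 | 923 | 1001 | 1084 | 1184 |
| ES512 | 655 | 780 | 834 | 909 | 979 | 1032 |
| EdDSA | 407 | 524 | 647 | 658 | 680 | 750 |
| ML-DSA-44 | 1162 | 1295 | 1401 | 1370 | 1505 | 1603 |
| ML-DSA-65 | 1650 | 1706 | 1644 | 1917 | 1841 | 2113 |
| ML-DSA-87 | 1904 | 1877 | 2027 | 2078 | 2107 | 2177 |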

TlsClientAuth

mTLS: client presents a CA-signed certificate at the TLS layer; no JWT overhead.

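| Algorithm | n=1 | n=2 | n=3 | n=5 | n=7 | n=10 |
| --- | --- | --- | --- | --- | --- | --- |
| ES256 | 157 | 168 | 215 | 295 | 320 | 366 |
| ES384 | 428 | 547 | 723 | 720 | 819 | 955 |
| ES512 | 441 | 580 | 614 | 726 | 736 | 876 |
| EdDSA | 163 | 172 | 222 | 284 | 314 | 328 |
| ML-DSA-44 | 918 | 1042 | 1139 | 1287 | 1263 | 1427 |
| ML-DSA-65 | 1351 | 1431 | 1471 | 1582 | 1641 | 1735 |
| ML-DSA-87 | 1679 | 1710 | 1828 | 1797 | 1994 | 2029 |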

SelfSignedTlsClientAuth

Client presents a self-signed certificate; server verifies the certificate thumbprint against the registered client record.

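| Algorithm | n=1 | n=2 | n=3 | n=5 | n=7 | n=10 |
| --- | --- | --- | --- | --- | --- | --- |
| ES256 | 149 | 194 | 234 | 269 | 245 | 374 |
| ES384 | 445 | 557 | 625 | 778 | 712 | 779 |
| ES512 | 453 | 555 | 633 | 702 | 749 | 804 |
| EdDSA | 167 | 176 | 217 | 297 | 284 | 370 |
| ML-DSA-44 | 939 | 1029 | 1076 | 1235 | 1233 | 1348 |
| ML-DSA-65 | 1297 | 1419 | 1528 | 1570 | 1670 | 1877 |
| ML-DSA-87 | 1639 | 1743 | 1831 | 1829 | 1923 | 1921 |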

Authorization Code + PKCE — flow latency (µs, mean)

Three sequential loopback round-trips per measurement (login → /authorize → /token). The JWT signing algorithm determines how the session token and authorization code are signed, not how the PKCE proof is verified.

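| Algorithm | n=1 | n=2 | n=3 | n=5 | n=7 | n=10 |
| --- | --- | --- | --- | --- | --- | --- |
| ES256 | 509 | 695 | 826 | 970 | 1041 | 1162 |
| ES384 | 936 | 1236 | 1552 | 1678 | 1935 | 2121 |
| ES512 | 918 | 1203 | 1506 | 1585 | 1747 | 1873 |
| EdDSA | 528 | 671 | 860 | 959 | 1097 | 1145 |
| ML-DSA-44 | 1772 | 2134 | 2406 | 2494 | 2764 | 2921 |
| ML-DSA-65 | 2642 | 3200 | 3259 | 3493 | 3638 | 3716 |
| ML-DSA-87 | 3207 | 3406 | 3760 | 3645 | 3692 | 3984 |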

Token Introspection — latency (µs, mean)

Introspection validates a pre-minted access token; it is largely a local signature-check with no cluster I/O. The client_secret_basic auth method is used for the introspection endpoint itself; other methods vary by ±100 µs.

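| Algorithm | n=1 | n=2 | n=3 | n=5 | n=7 | n=10 |
| --- | --- | --- | --- | --- | --- | --- |
| ES256 | 63 | 69 | 79 | 99 | 112 | 129 |
| ES384 | 61 | 69 | 79 | 95 | 121 | 133 |
| ES512 | 65 | 75 | 82 | 94 | 107 | 134 |
| EdDSA | 63 | 69 | 86 | 101 | 106 | 130 |
| ML-DSA-44 | 72 | 79 | 87 | 112 | 124 | 130 |
| ML-DSA-65 | 76 | 84 | 96 | 115 | 134 | 134 |
| ML-DSA-87 | 95 | 100 | 115 | 133 | 146 | 175 |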

Gossip Convergence — mean (ms)

Time for a write on node 0 to reach all other nodes. Not applicable at n=1. The gossip loop wakes immediately via Notify when a CRDT write occurs and pushes to all peers concurrently via JoinSet; the 2-second polling interval only fires as a fallback.

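| Algorithm | n=2 | n=3 | n=5 | n=7 | n=10 |
| --- | --- | --- | --- | --- | --- |
| ES256 | 6.5 | 7.0 | 6.8 | 9.8 | 10.5 |
| ES384 | 6.4 | 6.8 | 7.0 | 8.0 | 11.3 |
| ES512 | 7.0 | 7.1 | 7.6 | 7.4 | 10.7 |
| EdDSA | 6.4 | 6.7 | 8.4 | 8.9 | 11.0 |
| ML-DSA-44 | 6.3 | 6.7 | 8.7 | 7.8 | 10.2 |
| ML-DSA-65 | 6.4 | 6.5 | 8.0 | 7.9 | 9.1 |
| ML-DSA-87 | 6.6 | 6.4 | 7.1 | 9.8 | 9.1 |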

Analysis

Token endpoint latency scales with node count, not algorithm cost

Latency for client_credentials (ES256) increases from 151 µs at n=1 to 361 µs at n=10 — a ~139% increase in relative terms, but only 210 µs in absolute terms. The per-node overhead is approximately 23 µs per additional node. The slope is gentle because token endpoint handling is fully local: session lookup is in the local database, token signing uses a pre-loaded key, and CRDT synchronisation happens asynchronously on a separate gossip path. Network coordination is not on the critical path.
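
The two figures in this paragraph follow from simple arithmetic on the n=1 and n=10 means. A small sketch, with hypothetical function names, makes the derivation explicit:

```rust
/// Relative latency increase between two cluster sizes, in percent.
fn relative_increase_pct(base_us: f64, scaled_us: f64) -> f64 {
    (scaled_us - base_us) / base_us * 100.0
}

/// Average latency added per extra node between `n_base` and `n_scaled` nodes
/// (`n_scaled` must be larger than `n_base`).
fn per_node_overhead_us(base_us: f64, scaled_us: f64, n_base: u32, n_scaled: u32) -> f64 {
    (scaled_us - base_us) / f64::from(n_scaled - n_base)
}
```

For ES256, relative_increase_pct(151.0, 361.0) is about 139 and per_node_overhead_us(151.0, 361.0, 1, 10) is about 23.3, since the 210 µs difference is spread over nine additional nodes.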

On this x86-64 platform, the dominant cost for classical algorithms is the TLS loopback round-trip; for PQC algorithms, the signing operation dominates. The ratios below compare ClientSecretBasic latency against EdDSA at n=10:

  • ES256 (P-256): 151 µs at n=1, 361 µs at n=10
  • EdDSA (Ed25519): 160 µs at n=1, 329 µs at n=10 (fastest at n=10)
  • ES512 (P-521): 425 µs at n=1, 778 µs at n=10 (~2.4× over EdDSA)
  • ES384 (P-384): 452 µs at n=1, 899 µs at n=10 (~2.7× over EdDSA)
  • ML-DSA-44: 888 µs at n=1, 1274 µs at n=10 (~3.9× over EdDSA)
  • ML-DSA-65: 1382 µs at n=1, 1639 µs at n=10 (~5.0× over EdDSA)
  • ML-DSA-87: 1677 µs at n=1, 1969 µs at n=10 (~6.0× over EdDSA)

PrivateKeyJwt: JWKS caching makes asymmetric auth viable

private_key_jwt uses an in-process TLS JWKS server with 5-minute caching. After the initial JWKS fetch, latency at n=1 is roughly 1.1–2.5× that of client_secret_basic, depending on the algorithm:

  • ES256: 383 µs at n=1 (2.5× over Basic), 678 µs at n=10
  • EdDSA: 407 µs at n=1 (2.5× over Basic), 750 µs at n=10
  • ML-DSA-44: 1162 µs at n=1 (1.3× over Basic), 1603 µs at n=10

The PQC algorithms show a smaller relative overhead for PrivateKeyJwt vs Basic (1.3× for ML-DSA-44 vs 2.5× for ES256) because the JWT signing cost dominates the JWKS fetch overhead at higher algorithm weights.
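
The shrinking ratio can be checked against the n=1 means from the ClientSecretBasic and PrivateKeyJwt tables. The sketch below, with hypothetical names, computes both the absolute and the relative overhead; it suggests that the absolute cost of private_key_jwt stays in a narrow band while the baseline it is divided by grows with algorithm weight.

```rust
/// (algorithm, ClientSecretBasic µs, PrivateKeyJwt µs) at n=1, copied from
/// the result tables.
const N1_LATENCY_US: [(&str, f64, f64); 7] = [
    ("ES256", 151.0, 383.0),
    ("ES384", 452.0, 727.0),
    ("ES512", 425.0, 655.0),
    ("EdDSA", 160.0, 407.0),
    ("ML-DSA-44", 888.0, 1162.0),
    ("ML-DSA-65", 1382.0, 1650.0),
    ("ML-DSA-87", 1677.0, 1904.0),
];

/// Returns (absolute overhead in µs, ratio) of PrivateKeyJwt over Basic.
fn overhead(basic_us: f64, private_key_jwt_us: f64) -> (f64, f64) {
    (private_key_jwt_us - basic_us, private_key_jwt_us / basic_us)
}

/// Smallest and largest absolute overhead across all seven algorithms.
fn overhead_range_us() -> (f64, f64) {
    N1_LATENCY_US
        .iter()
        .map(|&(_, basic, pkj)| overhead(basic, pkj).0)
        .fold((f64::INFINITY, f64::NEG_INFINITY), |(lo, hi), d| (lo.min(d), hi.max(d)))
}
```

The absolute overhead ranges from 227 µs (ML-DSA-87) to 275 µs (ES384), so the same few hundred microseconds appear as 2.5× on a 151 µs baseline and as 1.3× on an 888 µs baseline.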

Authorization code flow: crypto cost dominates at PQC levels

The auth code flow (ES256) ranges from 509 µs at n=1 to 1162 µs at n=10. For classical algorithms the latency is dominated by application-level operations: three sequential HTTP requests (login + /authorize + /token), database lookups for session and authorization code storage, and PKCE verification.

For PQC algorithms the signing overhead becomes the dominant cost: ML-DSA-87 reaches 3207 µs at n=1 and 3984 µs at n=10 — the crypto adds ~2700 µs over the application baseline. ML-DSA-44 sits at a practical sweet spot: 1772 µs at n=1, 2921 µs at n=10.

Token introspection stays at or below 175 µs

Introspection (ES256, ClientSecretBasic) at 63 µs (n=1) to 129 µs (n=10) is dominated by the TLS round-trip and local JWT verification. The cryptographic overhead is negligible: all algorithms fall within the 61–175 µs range across all cluster sizes. Introspection does not sign tokens, so algorithm selection has minimal impact.

Gossip convergence is sub-12 ms across all topologies

Full-mesh (n ≤ 5):
  n=2:  ~6.3–7.0 ms  (notify → parallel push → one round)
  n=3:  ~6.4–7.1 ms
  n=5:  ~6.8–8.7 ms  ← minimal growth; peers processed in parallel

Hub-spoke (n ≥ 7):
  n=7:  ~7.4–9.8 ms  ← hub pushes to all spokes concurrently
  n=10: ~9.1–11.3 ms ← 9 concurrent pushes from hub

In full-mesh, the writing node wakes immediately, prepares payloads under a single CRDT read lock, and sends to all peers concurrently. Growth from n=2 to n=5 is minimal (~1–2 ms) because additional peers are served in parallel.

In hub-spoke, convergence at n=7 is 7–10 ms and at n=10 is 9–11 ms. The hub sends to all spokes concurrently; the limiting factor is the slowest single HTTP round-trip plus CMS verify/merge time on the receiver.

Gossip convergence is algorithm-independent: the algorithm affects only JWT signing/verification on the token endpoint; gossip uses ECDSA P-256 for CMS SignedData authentication and ML-KEM-768 for envelope encryption, regardless of the configured JWT algorithm.

Throughput estimates

The benchmark measures single-request sequential latency. Real deployments issue many concurrent requests. The following estimates assume each Ahdapa node runs one Tokio thread pool with enough concurrency to saturate the local TLS stack (typically 32–64 concurrent requests before TLS becomes the bottleneck).

Using client_credentials / ClientSecretBasic at mean latency with an assumed 32× concurrency factor per node:

| Algorithm | Latency n=1 | Single-node est. | 10-node cluster est. |
| --- | --- | --- | --- |
| ES256 | 151 µs | ~212,000 req/s | ~2,120,000 req/s |
| EdDSA | 160 µs | ~200,000 req/s | ~2,000,000 req/s |
| ES512 | 425 µs | ~75,000 req/s | ~750,000 req/s |
| ES384 | 452 µs | ~71,000 req/s | ~710,000 req/s |
| ML-DSA-44 | 888 µs | ~36,000 req/s | ~360,000 req/s |
| ML-DSA-65 | 1382 µs | ~23,000 req/s | ~230,000 req/s |
| ML-DSA-87 | 1677 µs | ~19,000 req/s | ~190,000 req/s |

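These are order-of-magnitude estimates. The 10-node column multiplies the single-node estimate by 10, that is, it assumes every node keeps its n=1 latency rather than the higher latency measured at n=10. Actual throughput depends on hardware, connection pool sizing, and TLS session resumption. The concurrency factor should be validated with a dedicated load test (e.g., vegeta or oha).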
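
The estimates in the table are the number of requests in flight divided by the latency of one request. A minimal sketch of that calculation follows; the function names are hypothetical, and the 32× concurrency factor is the assumption stated above, not a measured value.

```rust
/// Estimated requests per second for one node with `concurrency` requests in
/// flight, each taking `latency_us` microseconds.
fn estimate_rps(latency_us: f64, concurrency: f64) -> f64 {
    concurrency / (latency_us / 1_000_000.0)
}

/// Cluster estimate under the simplifying assumption that every node keeps
/// the single-node latency.
fn estimate_cluster_rps(latency_us: f64, concurrency: f64, nodes: u32) -> f64 {
    estimate_rps(latency_us, concurrency) * f64::from(nodes)
}
```

For ES256, estimate_rps(151.0, 32.0) comes to roughly 212,000 req/s and estimate_cluster_rps(151.0, 32.0, 10) to roughly 2,120,000 req/s, matching the first table row.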

For the authorization code flow at ~509 µs (n=1, ES256), the limit is the three sequential TLS round-trips, not Ahdapa logic. With HTTP/2 keepalive and 32× concurrency, a single node can sustain ~63,000 flows/s.


Scalability summary

| Scenario | Scales with nodes? | Primary bottleneck |
| --- | --- | --- |
| client_credentials | ~2.4× from n=1 to n=10 (ES256) | TLS round-trip + CRDT lock contention |
| auth_code | ~2.3× from n=1 to n=10 (ES256) | 3× loopback TLS handshakes |
| introspect | ~2.0× from n=1 to n=10 (ES256) | 1× loopback TLS + JWT verify |
| Gossip convergence | ~6–7 ms (full-mesh); ~9–11 ms (hub-spoke n=10) | Parallel push; hub fan-out at n=10 |

The system is effectively horizontally scalable for throughput: adding nodes multiplies aggregate capacity, while per-request latency grows by roughly 2–2.4× between n=1 and n=10 (ES256) and stays below 1.2 ms even for the full auth_code flow. The gossip overhead on the token endpoint critical path is zero; CRDT synchronisation is entirely asynchronous.

Platform note: These benchmarks were conducted on Fedora 42 (x86-64, Linux 6.15) with the full 7-algorithm × 6-node grid.


Algorithm selection

Use ES256 or EdDSA (Ed25519) as the default for new deployments.

Both deliver sub-200 µs client_credentials latency at n=1:

  • ES256 and EdDSA are in the same performance tier (151 µs and 160 µs); either is a sound default
  • EdDSA produces compact 64-byte signatures and has constant-time key operations; preferred when JWT payload size matters
  • ES256 is universally supported including by legacy clients that do not implement Ed25519; preferred for JWKS compatibility with existing P-256 PKI

Use ES384 / ES512 only when regulatory or security policy mandates a specific NIST curve.

  • ES384 ~2.8× over EdDSA (452 µs at n=1); ES512 ~2.7× (425 µs at n=1)
  • No practical security benefit over ES256 for OAuth2 token signing at normal token TTLs

Use ML-DSA-44 when post-quantum security is required and performance matters.

  • ~5.6× the EdDSA latency at n=1 (888 µs vs 160 µs) and ~3.9× at n=10 (1274 µs vs 329 µs)
  • Remains sub-3 ms even for the full auth_code flow at n=10
  • Provides NIST-standardised (FIPS 204) post-quantum security
  • Suitable for green-field PQC deployments or mixed classical/PQC rollouts

Use ML-DSA-65 or ML-DSA-87 only when security level ≥ Category 3 is a hard requirement.

  • ML-DSA-65 ≈ 8.6× over EdDSA (1382 µs at n=1); NIST Category 3
  • ML-DSA-87 ≈ 10.5× over EdDSA (1677 µs at n=1); NIST Category 5
  • Both remain under 4 ms for the full auth_code flow at n=10

Token sizes

PQC algorithms produce significantly larger JWT tokens because the signature dominates the encoded output. The payload (claims) is identical across algorithms; only the header and signature change.

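| Algorithm | Signature (raw) | Access token | ID token | vs ES256 |
| --- | --- | --- | --- | --- |
| ES256 | 64 B | 0.5 KiB | 0.7 KiB | 1.0× |
| EdDSA | 64 B | 0.5 KiB | 0.7 KiB | 1.0× |
| ES384 | 96 B | 0.6 KiB | 0.8 KiB | 1.1× |
| ES512 | 132 B | 0.6 KiB | 0.9 KiB | 1.2× |
| ML-DSA-44 | 2,420 B | 3.6 KiB | 3.8 KiB | 7.1× |
| ML-DSA-65 | 3,309 B | 4.7 KiB | 4.9 KiB | 9.4× |
| ML-DSA-87 | 4,627 B | 6.5 KiB | 6.7 KiB | 12.8× |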

Token sizes assume a typical claims set: iss, sub, aud, exp, nbf, iat, jti, scope, client_id (~250 bytes JSON for access tokens; ~405 bytes for ID tokens with auth_time, acr, amr, at_hash). Actual sizes vary with claim content.

Signature as a fraction of the token:

For EC algorithms the signature is ~17% of the token — the payload dominates. For ML-DSA the signature is 88–93% of the token — the payload is negligible. This means adding claims to a PQC token has almost no impact on total size.
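
These percentages can be reproduced from the raw signature sizes and the access-token sizes in the table, given that a compact JWT carries its signature as unpadded base64url, which expands every 3 bytes to 4 characters. The helper names below are hypothetical.

```rust
/// Length of the unpadded base64url encoding of `raw_len` bytes,
/// i.e. ceil(4 * raw_len / 3).
fn base64url_len(raw_len: usize) -> usize {
    (raw_len * 4 + 2) / 3
}

/// Share of a token taken up by its encoded signature, given the raw
/// signature size in bytes and the total token size in KiB.
fn signature_fraction(raw_sig_bytes: usize, token_kib: f64) -> f64 {
    base64url_len(raw_sig_bytes) as f64 / (token_kib * 1024.0)
}
```

A 64-byte ES256 signature encodes to 86 characters, about 17% of a 0.5 KiB token; a 2,420-byte ML-DSA-44 signature encodes to 3,227 characters, about 88% of a 3.6 KiB token; and a 4,627-byte ML-DSA-87 signature reaches about 93% of a 6.5 KiB token.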

Practical limits:

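| Constraint | Limit | ES256 | ML-DSA-44 | ML-DSA-65 | ML-DSA-87 |
| --- | --- | --- | --- | --- | --- |
| HTTP Authorization header | 8–16 KiB | OK | OK | OK | OK |
| Cookie (Set-Cookie) | 4 KiB | OK | borderline | no | no |
| URL query parameter (redirect) | ~2 KiB | OK | no | no | no |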

ML-DSA-44 access tokens (3.6 KiB) fit in HTTP headers and barely fit in a cookie. ML-DSA-65/87 tokens exceed typical cookie limits but work fine as bearer tokens in Authorization headers and token endpoint responses.

JWKS endpoint overhead:

| Algorithm | Public key (JWK) | JWKS set (single key) |
| --- | --- | --- |
| ES256 | 0.2 KiB | 0.3 KiB |
| EdDSA | 0.1 KiB | 0.2 KiB |
| ML-DSA-44 | 1.8 KiB | 1.9 KiB |
| ML-DSA-65 | 2.6 KiB | 2.7 KiB |
| ML-DSA-87 | 3.5 KiB | 3.6 KiB |

JWKS responses are cached (5-minute TTL in the AppState::jwks_cache), so the larger PQC keys are fetched infrequently. The bandwidth impact is negligible in practice.

Cluster sizing

| Scenario | Recommended size | Topology | Rationale |
| --- | --- | --- | --- |
| Development / single-tenant | 1 node | n/a | No HA needed; gossip overhead zero |
| HA minimum | 3 nodes | full-mesh | Survives one node loss; ~7 ms convergence |
| Production HA | 5 nodes | full-mesh | Two node failures tolerated; ~7 ms convergence |
| High throughput | 10 nodes | hub-spoke | ~10× single-node capacity; ~10 ms convergence |
| Very high throughput | > 10 nodes | hub-spoke | Linear scaling; gossip overhead minimal |

For most enterprise deployments a 3-node full-mesh with ES256 or EdDSA is the right starting point: simple to operate, tolerates one node failure, gossip converges in ~7 ms, and token endpoint latency is under 240 µs with client_secret_basic.

Scale to 10 nodes when aggregate token throughput exceeds ~100,000 req/s or when geographic distribution requires a hub at each site. Note that gossip convergence at n=10 increases to ~10 ms due to hub fan-out overhead.

Flow selection guidance

| Client type | Recommended auth method | Reason |
| --- | --- | --- |
| Server-to-server (machine client) | client_secret_basic | Lowest latency; HTTPS already provides transport security |
| FreeIPA-enrolled machine (SSSD) | kerberos_client_auth | No secret to manage; uses existing host keytab; adds one SPNEGO round-trip (~KDC latency) |
| M2M with key rotation | private_key_jwt | JWKS cache amortises fetch cost; client controls key lifecycle |
| M2M requiring mutual TLS | tls_client_auth | Equivalent latency to Basic; TLS layer provides client identity |
| M2M with self-signed cert | self_signed_tls_client_auth | No CA required; thumbprint validated against registered client |
| Browser / native app | Authorization Code + PKCE | Only flow suitable for public clients; latency is network-bound |
| Microservice / API gateway | Token introspection | Sub-175 µs; ideal for high-frequency access checks |
| PQC-hardened M2M | private_key_jwt with ML-DSA-44 key | JWKS cache hides PQC fetch cost; assertion signing on client |

Graphs

Per-algorithm 6-panel graphs (client_credentials methods, auth_code, introspect, convergence, memory) are shown below. A cross-algorithm comparison panel is also included.

[Figure: cross-algorithm comparison panel]

[Figures: per-algorithm 6-panel graphs for ES256, ES384, ES512, EdDSA, ML-DSA-44, ML-DSA-65, ML-DSA-87]

The benchmark grid can be reproduced with:

for ALG in ES256 ES384 ES512 EdDSA ML-DSA-44 ML-DSA-65 ML-DSA-87; do
    for N in 1 2 3 5 7 10; do
        contrib/bench/bench.sh --algorithm "$ALG" --nodes "$N" --release run
    done
done