Tier-1 FI · Architecture Review

Platform Assessment — GCP Core 4.1.26a

Outages · Gen 1: 0 (6+ years FI production)
Microservices · Gen 2: 10 (NATS transport)
Arch Pillars · Gen 3: 3 (Event · Crypto · Determinism)
Risk Items: 6 (pre-submission review)

Three-Generation Platform Heritage

From FI-Proven Node.js to Rust Cloud-Native Doctrine

Production-proven architecture across three generations. EveLedger ran 6+ years in live FI production with zero outages — the most credible data point in any engagement with a Tier-1 bank.

Gen 1: Live FI Production · Gen 2: Deployment Pending · Gen 3: Proposed GCP

Generation 1 · EveLedger · 2018–2024+

Hardened Production Node.js

runtime: Node.js (legacy)
api: GraphQL
database: MongoDB
routing: Traefik
multi-cloud: Azure + AWS (site-to-site)
uptime: 0 Outages · 6+ Yrs
env: FI Production (live)

Generation 2 · SNR Core · 2026

TypeScript Microservices

runtime: Node.js 24 + TypeScript 5.9
framework: Moleculer 0.14 + NATS
api: Apollo Server 5 / GraphQL 16
database: MongoDB + Mongoose 9
routing: Traefik Alpine
status: Deployment Pending
services: 10 microservices

Generation 3 · GCP Core 4.1.26a

Rust / Cloud-Native Doctrine

runtime: Rust + tokio async
transport: gRPC / tonic + protobuf
events: Kafka (Confluent), event sourcing
database: Cloud Spanner (global)
cache: Redis active-active CRDT
status: Proposed · GCP
decimal: rust_decimal (no IEEE 754)
📜

Event Sourcing as Legal Foundation

Kafka append-only log as source of truth. Signed events. Deterministic replay. Snapshot optimization. Sequence enforcement. State is a projection of events — never mutated directly.
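A minimal sketch of "state is a projection of events" follows, written in Python for brevity rather than in the proposed Rust stack. The `Event` shape, the CREDIT/DEBIT kinds, and the `(seq, balance)` snapshot tuple are illustrative assumptions, not the platform's actual schema.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Event:
    seq: int          # per-account sequence number, starting at 1 (assumed)
    kind: str         # "CREDIT" or "DEBIT" (hypothetical event kinds)
    amount: Decimal


def project(events, snapshot=(0, Decimal("0"))):
    """Fold events into a balance, starting from an optional (seq, balance) snapshot."""
    last_seq, balance = snapshot
    for ev in events:
        if ev.seq <= last_seq:
            continue  # already covered by the snapshot
        if ev.seq != last_seq + 1:
            raise ValueError(f"sequence gap: expected {last_seq + 1}, got {ev.seq}")
        if ev.kind == "CREDIT":
            balance += ev.amount
        elif ev.kind == "DEBIT":
            balance -= ev.amount
        else:
            raise ValueError(f"unknown event kind: {ev.kind}")
        last_seq = ev.seq
    return last_seq, balance


log = [
    Event(1, "CREDIT", Decimal("100.00")),
    Event(2, "DEBIT", Decimal("30.25")),
    Event(3, "CREDIT", Decimal("5.00")),
]

full = project(log)                      # replay from the start of the log
snap = project(log[:2])                  # snapshot taken after event 2
resumed = project(log, snapshot=snap)    # replay from the snapshot
```

Replaying from the beginning and replaying from the snapshot reach the same state, which is the property the snapshot optimization depends on. Nothing in the sketch updates a balance except by applying an event.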

🔐

Cryptographic Tamper Evidence

Every mutation is signed. GDPR erasure via envelope encryption — key destruction, not data deletion. Audit trail is cryptographically sealed and independently verifiable.
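One way to picture a sealed audit trail is a hash chain in which each record's MAC covers the previous record's MAC. The sketch below uses HMAC-SHA256 from the Python standard library as a stand-in. Independent verification by a third party would need asymmetric signatures, and the key, record fields, and `GENESIS` marker here are assumptions for illustration only.

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key-not-for-production"  # assumption: real keys live in a KMS/HSM


def _mac(prev_mac: str, body: str) -> str:
    return hmac.new(SIGNING_KEY, (prev_mac + body).encode(), hashlib.sha256).hexdigest()


def append(chain: list, payload: dict) -> None:
    """Seal a payload onto the chain; the MAC covers the previous record's MAC."""
    prev = chain[-1]["mac"] if chain else "GENESIS"
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    chain.append({"prev": prev, "body": body, "mac": _mac(prev, body)})


def verify(chain: list) -> bool:
    """Recompute every MAC in order; any edited or reordered record breaks the chain."""
    prev = "GENESIS"
    for rec in chain:
        if rec["prev"] != prev or not hmac.compare_digest(rec["mac"], _mac(prev, rec["body"])):
            return False
        prev = rec["mac"]
    return True


chain = []
append(chain, {"seq": 1, "type": "CREDIT", "amount": "100.00"})
append(chain, {"seq": 2, "type": "DEBIT", "amount": "30.25"})
ok_before = verify(chain)

# Simulate tampering with a stored record.
chain[0]["body"] = chain[0]["body"].replace("100.00", "900.00")
ok_after = verify(chain)
```

The sketch covers tamper evidence only. GDPR erasure by key destruction would additionally encrypt each subject's payload under a per-subject key, which is not shown here.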

⚖️

Regulatory-Grade Determinism

rust_decimal keeps IEEE 754 floats out of monetary code: APIs typed on Decimal reject an f32 or f64 at compile time. Reg CC hold logic is code, not config. ACH return-rate thresholds (3% / 0.5%) are hardcoded. AVAILABLE and CURRENT balances are correctly separated.
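The motivation for decimal-only money math can be shown in a few lines. The sketch below uses Python's `decimal` module in place of rust_decimal. Mapping the 3% and 0.5% figures to administrative and unauthorized ACH return rates is an assumption that should be confirmed against the current Nacha rules.

```python
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly.
float_sum = 0.1 + 0.2
decimal_sum = Decimal("0.10") + Decimal("0.20")

# Assumed meaning of the 3% / 0.5% thresholds quoted above.
ADMIN_RETURN_LIMIT = Decimal("0.03")
UNAUTHORIZED_RETURN_LIMIT = Decimal("0.005")


def breaches(returns: int, originated: int, limit: Decimal) -> bool:
    """True when returns / originated exceeds the limit, compared without division."""
    if originated <= 0:
        raise ValueError("originated must be positive")
    return Decimal(returns) > limit * Decimal(originated)
```

Comparing `returns` against `limit * originated` avoids division, so the boundary case (exactly 0.5% or exactly 3%) is decided exactly rather than by rounding.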

Core premise: "The primary risk is not scale — it is state corruption." Event sourcing addresses auditability and regulatory replay requirements that MongoDB mutation-based systems cannot provide natively.

Full Stack Comparison — Three Generations

Dimension | Gen 1 · EveLedger | Gen 2 · SNR Core (TS) | Gen 3 · GCP Core (Rust)
Language | JavaScript / Node.js | TypeScript 5.9 strict | Rust (memory-safe, no GC)
Async | Event loop (single-thread) | Event loop + NATS pub/sub | tokio async runtime (multi-thread)
Service Mesh | Custom / Traefik | Moleculer + NATS | gRPC (tonic) + protobuf (prost)
Message Bus | None / custom | NATS 2.29 (pub/sub) | Kafka (Confluent), durable event log
Primary DB | MongoDB | MongoDB + Mongoose 9 | Cloud Spanner (external consistency)
Cache | Redis (implied) | Redis Alpine + ioredis | Redis active-active CRDT
API Layer | GraphQL | Apollo Server 5 / GraphQL 16 | gRPC internal + REST/GraphQL external
Decimal Math | BigNumber.js scalar | BigNumber (string-stored) | rust_decimal (type-level f64 ban)
State Model | Mutable documents | Mutable + MongoDB sessions | Event sourcing, append-only
Auth | Production | Not implemented | Cryptographic signatures
Compliance | Proven 6yr FI | Schema-ready | Reg CC · GDPR
Outage Record | 0 / 6+ Years | Not deployed | Proposed

SNR Core — What's Already Right

Production-Grade Patterns Present

  • Mod10 / Luhn account generation (IBM/Fidelity IFS standard)
  • Double-entry bookkeeping — DEBIT/CREDIT with balance tracking
  • MongoDB atomic sessions — ACID across 4 document types
  • Aggregator → Platform → Customer entity hierarchy
  • CURRENT / AVAILABLE / DAILY balance kinds (correct model)
  • Federal Reserve BAI codes on all ledger entries
  • gluId UUID for cross-system reconciliation
  • Traefik routing — battle-tested from Gen 1
  • NATS transport — zero SPOF in cluster mode
  • Circuit breaker (50%) + bulkhead (10 concurrent / 100 queue)
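For reference, the Mod10 / Luhn item in the list above comes down to a single check-digit computation. The sketch below is the textbook algorithm in Python, not SNR Core's actual generator. The function names are illustrative.

```python
def luhn_check_digit(partial: str) -> int:
    """Check digit that makes partial + digit pass the Luhn (mod 10) test."""
    total = 0
    # Walk right to left; the rightmost digit of the partial number is doubled first.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def luhn_valid(number: str) -> bool:
    """Validate a full number whose last digit is the Luhn check digit."""
    return number.isdigit() and luhn_check_digit(number[:-1]) == int(number[-1])
```

With the widely used test value `7992739871`, `luhn_check_digit` returns 3, so `79927398713` validates and any single-digit change to it does not.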

Gaps vs Tier-1 Requirements

Must Be Added for Full FI Deployment

  • Authentication not implemented — JWT planned but absent
  • No event sourcing — all mutations are destructive
  • No cryptographic tamper evidence on any records
  • Sanctions screening is mock — L1/L3 are stubs
  • No audit log — state corruption undetectable
  • BigNumber as string scalar — not compile-time enforced
  • External rails (Fedwire, SWIFT, TCH) are placeholders
  • Reg CC hold logic not implemented
  • No CTR/SAR reporting
  • No load test results documented

Critical distinction: NATS in SNR Core is a message transport (service-to-service RPC routing). Kafka in GCP Core 4.1.26a is a durable event log: the legal source of truth for event sourcing, replay, and audit. These are fundamentally different patterns, not competing technologies.

NATS · SNR Core / Moleculer

Strengths
  • Sub-millisecond latency, minimal resource footprint
  • Simple ops: single binary, no ZooKeeper/KRaft
  • Zero SPOF in cluster mode
  • Moleculer circuit breaker works natively
  • At-least-once delivery with NATS JetStream

Limitations
  • State lives in MongoDB; NATS is the routing bus only
  • No offset-based replay on core NATS pub/sub; messages are not retained indefinitely
  • Cannot replay entire financial history from log
  • No consumer group lag monitoring
  • Schema governance is application-level only
  • Not designed for event sourcing pattern

Kafka · GCP Core 4.1.26a

Strengths
  • Durable, ordered, replayable log: the legal record
  • Consumer group lag monitoring gives observable processing
  • Schema registry enforces message contracts at publish
  • Exactly-once semantics (EOS) with transactions
  • Full state reconstruction via event replay
  • Audit trail is the log itself, the regulatory gold standard

Limitations
  • Confluent Cloud vendor dependency
  • Consumer lag = projection lag under high load
  • Significantly higher ops complexity
  • Requires schema governance discipline
  • Partition split failure modes need explicit handling
  • Sequence gap handling critical for financial integrity

Kafka Partition Split

Consumers may receive events out of sequence. Must detect gap and halt — not auto-recover. Auto-recovery without human sign-off is unacceptable.

Mitigation: dead-letter queue plus a manual reconciliation gate before replay resumes.

Spanner Stall

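Projection lag increases. Reads issued with bounded or exact staleness can return old data without the caller noticing. During regional events Spanner preserves external consistency, but commit and strong-read latency rise.

Mitigation: timeout budgets per operation type; circuit breaker on the Spanner client.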
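The mitigation above might look roughly like the following sketch, in Python for brevity. The budget values, operation names, window size, and cool-down are hypothetical. Only the 50% threshold and 60 s cool-down echo the Moleculer settings quoted elsewhere in this review. The clock is passed in explicitly so the behavior is deterministic.

```python
# Hypothetical per-operation timeout budgets in milliseconds.
TIMEOUT_BUDGET_MS = {"card_auth": 100, "balance_read": 250, "statement_export": 5000}


def within_budget(op: str, elapsed_ms: float) -> bool:
    return elapsed_ms <= TIMEOUT_BUDGET_MS[op]


class CircuitBreaker:
    """Opens when the failure rate over the last `window` calls reaches `threshold`."""

    def __init__(self, window: int = 10, threshold: float = 0.5, cooldown_s: float = 60.0):
        self.window = window
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.outcomes = []      # True = success, False = failure or timeout
        self.opened_at = None   # clock reading when the breaker opened

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown_s:
            # Cool-down elapsed: close and start a fresh window (simplified half-open).
            self.opened_at = None
            self.outcomes = []
            return True
        return False

    def record(self, ok: bool, now: float) -> None:
        self.outcomes = (self.outcomes + [ok])[-self.window:]
        failures = self.outcomes.count(False)
        if len(self.outcomes) == self.window and failures / self.window >= self.threshold:
            self.opened_at = now


breaker = CircuitBreaker(window=4, threshold=0.5, cooldown_s=60.0)
for elapsed in (40, 180, 90, 250):   # simulated card_auth latencies in ms
    breaker.record(within_budget("card_auth", elapsed), now=0.0)
```

Two of the four simulated calls exceed the 100 ms budget, so the breaker opens and rejects calls until the cool-down has passed. A production breaker would also limit how many probe calls are let through after the cool-down.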

Redis CRDT Mis-merge

Active-active CRDT can produce incorrect available balance values via concurrent writes — silently.

Mitigation: do NOT use CRDT for balance state. Spanner is the sole authoritative source.
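The failure mode is easy to reproduce with a textbook state-based PN-counter. The sketch below is not Redis's actual CRDT implementation. The replica names and amounts (in cents) are made up for illustration.

```python
class PNCounter:
    """State-based PN-counter: per-replica increment and decrement totals, merged by max."""

    def __init__(self):
        self.inc = {}
        self.dec = {}

    def add(self, replica: str, cents: int) -> None:
        self.inc[replica] = self.inc.get(replica, 0) + cents

    def sub(self, replica: str, cents: int) -> None:
        self.dec[replica] = self.dec.get(replica, 0) + cents

    def value(self) -> int:
        return sum(self.inc.values()) - sum(self.dec.values())

    def merge(self, other: "PNCounter") -> None:
        for mine, theirs in ((self.inc, other.inc), (self.dec, other.dec)):
            for replica, total in theirs.items():
                mine[replica] = max(mine.get(replica, 0), total)


def try_debit(counter: PNCounter, replica: str, cents: int) -> bool:
    """Approve a debit only if the locally visible balance covers it."""
    if counter.value() >= cents:
        counter.sub(replica, cents)
        return True
    return False


us = PNCounter()
us.add("us", 10_000)          # seed: 100.00 available
eu = PNCounter()
eu.merge(us)                  # both regions now see 10_000

approved_us = try_debit(us, "us", 8_000)
approved_eu = try_debit(eu, "eu", 8_000)   # concurrent, before replication

us.merge(eu)
eu.merge(us)
```

Both regions approve their debit against a locally correct balance, and the merge converges without any error to -6_000. The merge behaves exactly as a CRDT should. What it cannot do is enforce the "balance must not go negative" invariant, which is why balance state belongs in the strongly consistent store.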

Gen 1 advantage: No Kafka meant no projection lag, no consumer group ops, no schema governance surface. Simplicity was the resilience. The tradeoff: NATS/MongoDB cannot prove regulatory state reconstruction from first principles, which Tier-1 legal and audit teams will require.

Rust · GCP Core 4.1.26a

async: tokio (multi-threaded, zero-cost)
gRPC: tonic (built on tokio)
proto: prost
decimal: rust_decimal (Decimal-typed APIs reject f64 at compile time)
db client: sqlx (async); Spanner via its PostgreSQL interface
serde: serde (JSON external) + prost (internal)
memory: No GC, deterministic latency
throughput: ~500k–1M req/sec

Strengths
  • Highest throughput for auth / settlement hot path
  • rust_decimal enforces monetary safety at compile time
  • No GC pauses: critical for P99 card network SLAs (~100ms)
  • Ownership model prevents data races at compile time

Limitations
  • 6–18 month onboarding for engineers new to Rust
  • FI internal teams cannot maintain without retraining
  • Slow compile times reduce iteration velocity
  • async Rust + tokio lifetime complexity is non-trivial

Go · Alternative Option

async: goroutines (M:N threading, built-in scheduler)
gRPC: grpc-go (google.golang.org/grpc)
proto: google.golang.org/protobuf
decimal: shopspring/decimal (convention-based)
db client: pgx or Cloud Spanner Go SDK
serde: encoding/json
memory: GC (tunable via GOGC and GOMEMLIMIT; arenas experimental since 1.20)
throughput: ~200k–400k req/sec

Strengths
  • Largest cloud-native / Kubernetes ecosystem
  • Google's own gRPC implementation (grpc-go)
  • Massive talent pool: FI can hire and own independently
  • Fast compile, fast iteration velocity

Limitations
  • GC pauses tunable via GOGC: manageable but present
  • shopspring/decimal requires convention enforcement
  • No compile-time monetary type safety like rust_decimal
  • Data races possible (the -race detector finds them only at run time, on code paths actually exercised)

Recommendation (hybrid): Go for platform services, with Rust scoped to the card authorization hot path where GC pauses breach card network timeout budgets (~100ms). This preserves the FI's ability to own and maintain the platform independently.

Moleculer → Rust / Go Capability Mapping

Moleculer Feature | Moleculer (NATS) | Rust Equivalent | Go Equivalent | Fit
Service registry | Built-in auto-discover | Consul / etcd / k8s | Consul / etcd / k8s | All viable
Load balancing | Round-robin local-pref | Envoy sidecar | Envoy sidecar | Envoy preferred
Circuit breaker | Built-in (50% / 60s) | tower::retry + governor (tower has no built-in breaker) | go-resilience | More config
Bulkhead | Built-in (10 / 100) | tower::limit::ConcurrencyLimit | x/time/rate + x/sync/semaphore | Both solid
Message transport | NATS pub/sub | rdkafka | confluent-kafka-go | Kafka for event source
Decimal math | BigNumber (strings) | rust_decimal | shopspring/decimal | Rust wins
Request tracing | gluId UUID | opentelemetry-rust | go.opentelemetry.io/otel | Both excellent
Inter-service calls | ctx.call() via broker | tonic gRPC | grpc-go | gRPC preferred
Ops complexity | Low (single broker) | High (Kafka + Spanner) | Medium-High | NATS simpler

High Severity

R1

Document Is Extremely Opinionated

Mandates Rust, Spanner, specific AI vendors, Confluent Cloud. Target FI has existing vendor contracts and internal architecture standards. "We selected this" framing triggers ARB conflicts.

High · Reframe: "We align to your preferences"
R2

AI Layer Triggers Model Risk Management

Tool-use AI for financial mutations will alarm the FI's MRM team. SR 11-7 compliance analysis required. Any AI initiating financial mutations requires Model Risk Validation before sign-off.

High · Extract to Appendix A with a human approval gate
R3

Document Length — 100+ Page Engineering Doctrine

No executive reads this in full. Without a proper executive summary it reads as an academic exercise, not a production proposal.

Medium · 10-pg exec · 30-pg technical · 3-pg compliance

Medium Severity

R4

Rust Talent Gap at Target FI

Large FI technologist pools are predominantly Java/Python/C++. Full Rust platform creates maintenance dependency on your team for incident response and evolution.

Medium · Go as primary, Rust for hot path only
R5

Kafka Ops Discipline Requirement

Requires consumer lag management, schema registry, key rotation, partition split recovery, cross-region replication. If FI ops is not Kafka-fluent, GCP Core is a liability.

Medium · Include runbook + SRE staffing estimates
R6

SNR Core (Gen 2) — Critical Gaps vs Tier-1

Auth not implemented. Sanctions screening is mock. External rails are stubs. No audit log. No event sourcing. Strong ledger core — not yet a complete Tier-1 platform.

Medium · Honest: "Production ledger engine, ops pending"

Det. Risk 1

Snapshot Rebuild Safety

Snapshot version must be cryptographically tied to the Kafka offset at creation. Snapshots must be validated against projected state before serving. Never serve an unvalidated snapshot.
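A minimal sketch of tying a snapshot to its log offset is given below, in Python. It uses a plain SHA-256 digest for brevity. A real deployment would sign the binding (for example with a KMS-held key) so that it cannot simply be recomputed by whoever alters the snapshot. The snapshot layout and function names are assumptions.

```python
import hashlib
import json


def state_digest(state: dict) -> str:
    """Canonical digest of the projected state (sorted keys for determinism)."""
    return hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()


def binding_for(offset: int, state: dict) -> str:
    return hashlib.sha256(f"{offset}:{state_digest(state)}".encode()).hexdigest()


def make_snapshot(state: dict, offset: int) -> dict:
    """Bind the snapshot to the log offset it was taken at."""
    return {"offset": offset, "state": dict(state), "binding": binding_for(offset, state)}


def validate_snapshot(snapshot: dict, replayed_state: dict) -> bool:
    """Accept only if the binding is intact and the state matches an independent replay."""
    intact = binding_for(snapshot["offset"], snapshot["state"]) == snapshot["binding"]
    return intact and snapshot["state"] == replayed_state


snap = make_snapshot({"acct-1": "74.75"}, offset=3)
moved = dict(snap, offset=2)   # same state, claimed at a different offset
```

`validate_snapshot` is the gate in front of serving. A snapshot whose offset was altered, or whose state disagrees with a replay up to that offset, is rejected.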

Det. Risk 2

Sequence Gap Handling

Missing event sequence numbers are a legal problem. System must detect gaps, halt projection, alert ops, and refuse to serve stale state. Auto-recovery without human sign-off is unacceptable.
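The detect, halt, alert, and refuse behavior described above can be sketched as a small state machine. This is an illustrative Python sketch. The class, the in-memory alert list, and the `approved_by` sign-off parameter are hypothetical stand-ins for real paging and approval workflows.

```python
class ProjectionHalted(Exception):
    """Raised when the projection has stopped and must not serve state."""


class Projector:
    def __init__(self):
        self.last_seq = 0
        self.balance_cents = 0
        self.halted = False
        self.alerts = []   # stand-in for paging / alerting

    def apply(self, seq: int, delta_cents: int) -> None:
        if self.halted:
            raise ProjectionHalted("projection halted; awaiting sign-off")
        if seq != self.last_seq + 1:
            self.halted = True
            self.alerts.append(f"gap: expected {self.last_seq + 1}, got {seq}")
            raise ProjectionHalted(self.alerts[-1])
        self.balance_cents += delta_cents
        self.last_seq = seq

    def read_balance(self) -> int:
        if self.halted:
            raise ProjectionHalted("refusing to serve possibly stale state")
        return self.balance_cents

    def resume(self, approved_by: str) -> None:
        """Resume only with a named human approver; missing events must then be replayed."""
        if not approved_by:
            raise ValueError("human sign-off required")
        self.alerts.append(f"resumed by {approved_by}")
        self.halted = False


p = Projector()
p.apply(1, 10_000)
p.apply(2, -2_500)
try:
    p.apply(4, 500)      # event 3 is missing
except ProjectionHalted:
    pass
```

After the gap, both `apply` and `read_balance` raise until `resume` is called with an approver. The projector never skips ahead on its own, so event 3 still has to be supplied before event 4 is accepted.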

Det. Risk 3

Projection Lag

For card-present auth, lag must stay below card network timeout (~100ms). Define lag SLOs per operation type. Circuit break on breach for real-time operations.
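Per-operation lag SLOs could be expressed as a small policy table. In the Python sketch below, only the ~100 ms card-present figure comes from the text. The other operation names, limits, and the three outcomes are assumptions for illustration.

```python
# Lag SLOs in milliseconds per operation type (all but card_present_auth are hypothetical).
LAG_SLO_MS = {"card_present_auth": 100, "balance_inquiry": 500, "statement": 60_000}
REAL_TIME_OPS = {"card_present_auth"}


def projection_lag_ms(latest_event_ts_ms: int, last_applied_ts_ms: int) -> int:
    """Lag between the newest event in the log and the newest event applied."""
    return max(0, latest_event_ts_ms - last_applied_ts_ms)


def decide(op: str, lag_ms: int) -> str:
    """Return 'serve', 'degrade' (serve with a staleness warning), or 'reject'."""
    if lag_ms <= LAG_SLO_MS[op]:
        return "serve"
    # Real-time operations circuit-break on breach; others degrade.
    return "reject" if op in REAL_TIME_OPS else "degrade"
```

For example, a lag of 150 ms rejects a card-present authorization but still serves a balance inquiry, whose SLO in this table is 500 ms.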

A · EveLedger (Gen 1) · Production Credibility
6+ years, zero outages, live FI deployment. Your strongest credential. Lead every conversation with this: demonstrated reliability outweighs any architectural argument.
Use as Primary Proof Point

B+ · SNR Core (Gen 2) · Bridge Platform
Excellent ledger core. Correct financial patterns, strong type safety, proven infra stack. Complete auth, event sourcing overlay, and live ops record before presenting to FI.
Complete Critical Gaps First

A− · GCP Core 4.1.26a · Architectural Vision
Intellectually correct. Event sourcing, rust_decimal, Reg CC as code, GDPR envelope encryption are genuinely senior-level. Risk is presentation, not substance.
Right-Size for Target FI

How to Win the Room

Conversation Strategy

  • Lead with Gen 1's 6-year zero-outage FI production record
  • Position SNR Core as the typed evolution of that architecture
  • Present GCP Core as phased target state — not day-one mandate
  • Offer Go as FI-maintainable alternative to Rust
  • Scope Rust to card authorization hot path only
  • AI section → Appendix A with human-approval gate language
  • Make infra flexibility explicit: GCP, AWS, Azure all supportable

Document Package

What to Deliver

Doc 1: 10-page Executive Summary
Doc 2: 30-page Technical Architecture (edited GCP Core)
Doc 3: 3-page Risk & Compliance Summary
Appendix A: AI/ML Layer (SR 11-7, human approval gate)
Appendix B: Infrastructure flexibility (GCP / AWS / Azure)
Appendix C: Rust vs Go technical comparison

Bottom line: GCP Core 4.1.26a is architecturally correct. The risk is presentation, not substance. Present as a recommendation, not a mandate. Lead with the zero-outage heritage and let the architecture speak for itself.