Design a solution for adding real-time collaborative editing to a web-based document app. Evaluate operational transformation versus CRDTs and recommend an architecture with the tradeoffs, key risks, and a rough implementation path.
After evaluating Operational Transformation (OT) and Conflict-Free Replicated Data Types (CRDTs) across research, requirements analysis, strategic evaluation, cross-functional impact assessment, and assumption verification, the recommendation is to adopt an OT-first architecture with a conditional CRDT evaluation path at Month 6+. OT is the right Phase 1 choice: it is proven at Google Docs scale, architecturally aligned with centralized access control, and avoids the CRDT tombstone and implementation-complexity risks that remain poorly benchmarked for your specific workload. However, three blocking issues in the current design must be resolved before Phase 1 coding begins, and the Phase 3 CRDT decision must be gated behind empirical benchmarks on your own data model — not inherited from upstream research claims that remain unverified.
The task is to add real-time collaborative editing to an existing web-based document application. The solution must support multiple concurrent editors with low latency, scale reliably, integrate with the existing codebase and API layers, and provide a credible path toward offline editing. Two competing approaches exist: Operational Transformation (OT) and Conflict-Free Replicated Data Types (CRDTs). The deliverable is an architecture recommendation with tradeoffs, risks, and a phased implementation path.
**Architectural distinction:**

| Dimension | OT | CRDT |
|---|---|---|
| **Coordination model** | Centralized: transformation functions run on the server, and every operation passes through a single coordination point. | Distributed: conflict resolution is embedded in the data structure itself; no central server required. |
| **Data model efficiency** | Compact documents with no per-character metadata; transformation functions are complex but tested once centrally. | Tombstone problem: deleted elements persist as metadata; files with 10M tombstones can balloon to gigabytes. Mitigated via compaction, TTL policies, and hybrid architectures. |
| **Offline capability** | Poor: requires the server for transformation; existing OT algorithms are slow to merge files that have diverged substantially. | Strong: designed for peer-to-peer sync and offline-first use cases. |
| **Access control** | Natural advantage: a centralized server means every operation passes through the auth layer. | Requires an additional coordination layer for access control in a distributed topology. |
| **Implementation risk** | "OT bugs can be subtle and devastating"; exhaustive test suites are critical. | "CRDTs are easy to implement badly; many published algorithms have anomalies" (Martin Kleppmann [R16]). |
| **Maturity** | Dates to 1989 (GROVE system); Google Docs processes 2B+ documents using OT [R4]. | Modern libraries (Yjs, Automerge 2.0) are production-capable [R14]; newer hybrid approaches like Eg-walker merge long-running branches orders of magnitude faster than traditional OT [R7]. |

**Decision matrix (MUST criteria):**

| MUST criterion | OT | CRDT |
|---|---|---|
| Real-time low-latency editing | Satisfies | Satisfies |
| Scalability to concurrent editors | Satisfies | Satisfies |
| Security and access control | Architectural advantage | Requires additional layer |
| Web technology compatibility | Satisfies | Satisfies |
| Offline editing addressed | Conditional (Phase 3) | Native |

Selection rationale: OT satisfies all five MUST requirements.
The only conditional criterion — offline editing — is deferred to Phase 3 with a concrete evaluation gate. **If offline editing is a current hard requirement (not future), this recommendation inverts to CRDT-first.** Product ownership must confirm offline timing before design freeze.
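To make the tombstone cost in the comparison concrete, here is a deliberately naive sketch of a sequence that marks deletions instead of removing elements. It is a toy model under stated assumptions: the class and method names are hypothetical, and real libraries such as Yjs and Automerge use far more compact encodings than one object per character.

```javascript
// Toy model only; all names are hypothetical and not taken from any library.
class TombstoneSequence {
  constructor() {
    this.elements = []; // each entry: { id, char, deleted }
    this.nextId = 0;
  }

  // Map an index into the visible text to an index into `elements`.
  physicalIndex(visibleIndex) {
    let seen = 0;
    for (let i = 0; i < this.elements.length; i++) {
      if (this.elements[i].deleted) continue;
      if (seen === visibleIndex) return i;
      seen++;
    }
    return this.elements.length; // past the last visible character
  }

  insert(visibleIndex, char) {
    const at = this.physicalIndex(visibleIndex);
    this.elements.splice(at, 0, { id: this.nextId++, char, deleted: false });
  }

  // Deleting only flips a flag, so the element (and its id) stays in memory.
  delete(visibleIndex) {
    const at = this.physicalIndex(visibleIndex);
    if (at >= this.elements.length) throw new RangeError("index out of range");
    this.elements[at].deleted = true;
  }

  text() {
    return this.elements.filter((e) => !e.deleted).map((e) => e.char).join("");
  }

  stats() {
    const tombstones = this.elements.filter((e) => e.deleted).length;
    return { live: this.elements.length - tombstones, tombstones };
  }

  // Compaction drops tombstones. In a replicated setting this is only safe
  // once no peer can still reference the removed ids (assumed here).
  compact() {
    const before = this.elements.length;
    this.elements = this.elements.filter((e) => !e.deleted);
    return before - this.elements.length;
  }
}

const seq = new TombstoneSequence();
[..."hello"].forEach((c, i) => seq.insert(i, c));
seq.delete(0);
seq.delete(0);
seq.delete(0);
console.log(seq.text(), seq.stats());
```

The visible text shrinks while the stored element count does not, which is the growth pattern that compaction, TTL policies, or hybrid designs are meant to bound.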
**Phase 1: OT-First Implementation (Months 1–6)**

Components:

- **OT Server**: WebSocket server accepting, transforming, and broadcasting operations (`ot-server.js`)
- **OT Engine**: core transformation logic implementing the selected OT variant (`ot-engine.js`)
- **Client Sync Layer**: client-side operation buffering, optimistic application, server reconciliation (`collaboration-client.js`)
- **Operation Log**: append-only Redis-backed operation history per document (`plith:collab:{docId}:ops`)
- **Document Persistence**: periodic snapshot to database from operation log (existing DB schema + migration)

Architecture (logical):

```
Client A ──WebSocket──┐
Client B ──WebSocket──┤──► OT Server ──► OT Engine (transform)
Client C ──WebSocket──┘        │               │
                               ▼               ▼
                        Access Control   Operation Log (Redis)
                               │               │
                               ▼               ▼
                          Auth Layer     Document Snapshots (DB)
```

**OT variant selection (MUST resolve before Phase 1 coding begins):** An ADR must be produced selecting a specific OT variant (Jupiter, MAYBE, or SOC) with rationale. Jupiter is the most well-documented for centralized server architectures; SOC may be preferable if multi-server topology is anticipated in Phase 2. No variant is currently specified in the design, which is a blocking gap.

Phase 1 targets: 50 concurrent editors per document; sub-500ms p99 latency (alert at 200ms, re-evaluate trigger at 300ms); zero silent data loss.

---

**Phase 2: Optimization & Hardening (Months 4–8, overlapping)**

- Operation history pruning/compaction with snapshot-before-prune safety checks
- Latency telemetry instrumentation (p50/p95/p99)
- Scale testing beyond 50 concurrent editors
- Phase 2 decision gate: re-evaluate the CRDT path before scaling to >100 concurrent editors
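As an illustration of the flow the OT Server and OT Engine components imply (receive an operation tagged with a base revision, transform it against everything logged since, apply, append), here is a minimal sketch. It assumes plain-text documents, single-character deletes, and hypothetical operation shapes and names (`DocumentSession`, `transform`, `apply`); it is not the OT variant the ADR still has to select, and an in-memory array stands in for the Redis operation log.

```javascript
// Hypothetical operation shapes, chosen for illustration only:
//   { type: "insert", pos, text } | { type: "delete", pos } | { type: "noop" }

function apply(doc, op) {
  if (op.type === "insert") return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
  if (op.type === "delete") return doc.slice(0, op.pos) + doc.slice(op.pos + 1);
  return doc; // noop
}

// Rewrite `a` so that it can be applied after the concurrent operation `b`.
// `aWinsTies` breaks the tie when two inserts target the same position.
function transform(a, b, aWinsTies) {
  if (a.type === "noop" || b.type === "noop") return a;
  if (b.type === "insert") {
    const aStaysPut =
      a.type === "insert"
        ? a.pos < b.pos || (a.pos === b.pos && aWinsTies)
        : a.pos < b.pos;
    return aStaysPut ? a : { ...a, pos: a.pos + b.text.length };
  }
  // From here on, `b` is a single-character delete.
  if (a.type === "insert") return a.pos <= b.pos ? a : { ...a, pos: a.pos - 1 };
  if (a.pos === b.pos) return { type: "noop" }; // both deleted the same character
  return a.pos < b.pos ? a : { ...a, pos: a.pos - 1 };
}

class DocumentSession {
  constructor(text) {
    this.text = text;
    this.log = []; // stand-in for the append-only operation log
  }

  // `baseRevision` is the log length the client had seen when it created `op`.
  receive(op, baseRevision) {
    if (!Number.isInteger(baseRevision) || baseRevision < 0 || baseRevision > this.log.length) {
      throw new RangeError("unknown base revision " + baseRevision);
    }
    let rebased = op;
    for (const logged of this.log.slice(baseRevision)) {
      rebased = transform(rebased, logged, false); // already-logged operations win ties
    }
    this.text = apply(this.text, rebased);
    this.log.push(rebased);
    return { op: rebased, revision: this.log.length }; // what would be broadcast
  }
}

const session = new DocumentSession("abc");
session.receive({ type: "insert", pos: 1, text: "X" }, 0); // client A
const ack = session.receive({ type: "delete", pos: 2 }, 0); // client B, same base revision
console.log(session.text, ack);
```

Client B's delete was created against `"abc"`, so the server shifts it one position right before applying it to `"aXbc"`. Range deletes, rich-text attributes, and the client-side half of reconciliation are deliberately left out.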

---

**Phase 3: Conditional CRDT Evaluation (Month 6+, gated)**

Go/no-go decision requires:

1. Automerge 2.0 empirical benchmark on your document model (upstream benchmark claims are unverified)
2. Yjs empirical benchmark on your operation mix and document size distribution
3. CRDT integration tests with adversarial conditions (network partitions, clock skew, out-of-order operations)
4. Product confirmation that offline editing is a shipping requirement, not aspirational

If Phase 3 proceeds: migration from OT to CRDT requires documented data transformation logic, rollback validation tests, and a concrete timeline. The current spec's "migrate document model to CRDT" is a placeholder, not a plan.
| Tradeoff | OT gains | OT costs |
|---|---|---|
| **Proven path vs. modern approach** | Google Docs-scale production precedent; predictable implementation risk. | Locked into centralized model for Months 1–6; offline editing deferred. |
| **Simplicity vs. flexibility** | No tombstone overhead; compact data model; no CRDT metadata bloat. | OT transformation functions have hundreds of edge cases; bugs are subtle. |
| **Access control vs. distribution** | Centralized server is a natural auth enforcement point. | Cannot do peer-to-peer sync; every operation must traverse the server. |
| **Phase 3 optionality vs. commitment** | CRDT path preserved without upfront investment. | Phase 3 may never execute if the roadmap shifts; OT→CRDT migration is re-architecture, not an upgrade. |
| **Testing burden** | OT testing is well-understood; property-based testing frameworks exist. | Exhaustive test suites are required, not optional; corner cases are pathological. |
**5.1 Breaking Risks (Must Resolve Before Phase 1 Coding)**

**BA-0: No Phase 1 OT Emergency Rollback Defined**

If OT produces silent data corruption in production, there is no documented emergency response. Add a Phase 1 rollback procedure: (a) disable WebSocket upgrade on the collaboration endpoint via feature flag, (b) serve the last verified snapshot from the operation log for all affected documents, (c) halt new collaborative sessions until root cause is identified. Add this to the Pre-Phase 1 checklist as a feature-flag gate requirement.

**BA-1: Infrastructure Capacity Unknown**

The design targets 50 concurrent WebSocket editors, described as a "low bar well within standard WebSocket server capacity." This claim is untested against actual infrastructure. If real capacity is 20–30 connections, Phase 1 cannot ship.

- **Action:** Load test production infrastructure with 50 WebSocket connections + 100 ops/sec throughput. Include a high-frequency scenario: 2–5 concurrent editors at 20+ ops/sec each for a sustained 10-minute period, to catch OT transformation bottlenecks that are frequency-driven rather than editor-count-driven.
- **Gate:** Pass/Fail: p99 latency < 300ms with 50 concurrent editors.

**BA-2: OT Variant Unspecified**

The implementation spec says "Implement OT algorithms" without specifying which variant (Jupiter, MAYBE, SOC). Each variant has different properties for concurrency, offline tolerance, and server topology. A wrong choice cascades into Month 2–3 rework.

- **Action:** Produce an ADR selecting the OT variant with rationale before Phase 1 sprint planning. The ADR must evaluate the following decision criteria:
  - (a) Is multi-server topology required in Phase 2? If yes, prefer SOC over Jupiter.
  - (b) Is a convergence correctness proof required? If yes, eliminate MAYBE (its correctness has open questions in the literature).
  - (c) Is the team using an existing library? ShareDB uses Jupiter internally; audit ShareDB compatibility before committing to a custom implementation.

---

**5.2 Degraded Risks (Must Mitigate Before Phase 1 Ship)**

- **OT transformation bugs evade test suite.** Mitigation: property-based testing (QuickCheck-style); bounded client/server reconciliation tests; >10K random operation sequences. Due: Phase 1 acceptance criteria.
- **Latency budget based on unverified <5ms claim.** Mitigation: instrument p50/p95/p99 telemetry; alert at 200ms; re-evaluate architecture if p99 > 300ms. Due: Phase 1 launch.
- **WebSocket auth gap.** Mitigation: mirror REST auth layer to WebSocket connections; add RLS on operation-history table. Due: Phase 1 design.
- **`MCP_GATEWAY_URL` env var undefined.** Mitigation: define in `.env.example` and infrastructure runbooks. Due: Phase 1 deployment.
- **"No API contract changes" claim is false.** Mitigation: document three net-new contract surfaces: WebSocket upgrade endpoint, operation envelope {op, docId, clientId, revision}, broadcast format. Due: before Phase 1 ships.
- **Phase 3 decision window timing unconfirmed.** Mitigation: lock Phase 3 go/no-go date to product roadmap now, not Month 5. Due: design freeze.
- **Phase 2 compaction may lose data.** Mitigation: compaction spec must include snapshot-before-prune, replay verification, and dry-run mode. Due: Phase 2 acceptance criteria.

---

**5.3 CRDT-Specific Risks (Phase 3 Only)**

- CRDT performance benchmarks cited in research (Automerge 2.0: 600ms/260K keystrokes, Yjs: 26K–156K ops/sec) are **unverified**: no source in the research corpus provides these figures. Do not use them as planning inputs.
- CRDTs are "easy to implement badly; many published algorithms have anomalies" [R16]. Using Yjs or Automerge mitigates but does not eliminate this risk.
- CRDT tombstone problem requires aggressive compaction; files with 10M tombstones can reach gigabytes [R10][R12][R13].
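The property-based testing mitigation in 5.2 can be sketched without any framework. The harness below generates random pairs of concurrent operations and checks that both application orders converge; it covers only that pairwise property, not the longer random sequences the acceptance criteria call for. Everything here is an assumption for illustration: the insert-only operation model, the function names, and the seeded generator (a common small PRNG, used so failures are reproducible).

```javascript
// Small seeded PRNG (mulberry32) so a failing case can be replayed.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Returns the first diverging case found, or null if every run converged.
function findDivergence({ apply, transform, genDoc, genOp, runs, seed }) {
  const rand = mulberry32(seed);
  for (let i = 0; i < runs; i++) {
    const doc = genDoc(rand);
    const a = genOp(rand, doc);
    const b = genOp(rand, doc);
    const viaA = apply(apply(doc, a), transform(b, a, false)); // a first, then b'
    const viaB = apply(apply(doc, b), transform(a, b, true)); // b first, then a'
    if (viaA !== viaB) return { doc, a, b, viaA, viaB };
  }
  return null;
}

// Hypothetical insert-only model, kept tiny to exercise the harness.
const pick = (rand, s) => s[Math.floor(rand() * s.length)];
const genDoc = (rand) =>
  Array.from({ length: Math.floor(rand() * 6) }, () => pick(rand, "abc")).join("");
const genOp = (rand, doc) => ({
  pos: Math.floor(rand() * (doc.length + 1)),
  text: pick(rand, "XYZ"),
});
const applyInsert = (doc, op) => doc.slice(0, op.pos) + op.text + doc.slice(op.pos);

// Correct: ties at the same position are broken consistently on both sides.
const transformInsert = (a, b, aWinsTies) =>
  a.pos < b.pos || (a.pos === b.pos && aWinsTies) ? a : { ...a, pos: a.pos + b.text.length };

// Deliberately buggy: ignores the tie-break, so both sides shift on a tie.
const buggyTransform = (a, b) => (a.pos < b.pos ? a : { ...a, pos: a.pos + b.text.length });

const base = { apply: applyInsert, genDoc, genOp, runs: 10000, seed: 42 };
console.log("correct:", findDivergence({ ...base, transform: transformInsert }));
console.log("buggy:  ", findDivergence({ ...base, transform: buggyTransform }));
```

The value of the approach is that the buggy variant passes most hand-written cases and only fails when two inserts land on the same position, which random generation reaches quickly. A real suite would plug in the production transform and add deletes and multi-step sequences.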
- **API Contracts: BREAKING.** Three net-new surfaces (WebSocket, operation envelope, broadcast). Content-type enforcement on the edit resource endpoint must be confirmed before Phase 1 ships.
- **Database: REQUIRES UPDATE.** Add operation-history schema, Redis key namespaces (`plith:collab:{docId}:ops`), compaction job (Phase 2). Plan tombstone storage ceiling if Phase 3 proceeds.
- **Agent Runtime: INFORMATIONAL.** Confirm that agent workflows reading document content are tested against documents under active collaborative editing sessions.
- **MCP Gateway: REQUIRES UPDATE.** If the MCP gateway tool schema is externally published, version the tool to reflect OT-mediated write semantics.
- **Billing/Credits: INFORMATIONAL.** Long-lived WebSocket connections and operation broadcasting increase compute/egress per active document. Monitor per-session credit consumption in Phase 1.
- **Monitoring: REQUIRES UPDATE.** Add latency telemetry (p50/p95/p99), operation throughput metrics, WebSocket connection count, Redis operation log size.
- **Security: REQUIRES UPDATE.** WebSocket auth must mirror REST layer. RLS on operation-history table. Session UUID + sequence number for operation idempotency.
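One way the session UUID + sequence number idempotency check could work is sketched below. The assumptions are loud ones: sequence numbers start at 1 and increase by exactly 1 per session, state is held in memory rather than in Redis, and the `IdempotencyGuard` name and its return values are hypothetical.

```javascript
// Tracks the highest sequence number accepted per session so that a retried
// operation (for example after a reconnect) is not applied twice.
class IdempotencyGuard {
  constructor() {
    this.lastSeq = new Map(); // sessionId -> last accepted sequence number
  }

  // Returns "apply", "duplicate", or "gap".
  check(sessionId, seq) {
    if (!Number.isInteger(seq) || seq < 1) throw new RangeError("invalid sequence " + seq);
    const last = this.lastSeq.get(sessionId) ?? 0;
    if (seq <= last) return "duplicate"; // already applied: acknowledge, do not re-apply
    if (seq !== last + 1) return "gap"; // something was lost: ask the client to resend
    this.lastSeq.set(sessionId, seq);
    return "apply";
  }
}

const guard = new IdempotencyGuard();
const sessionId = "3f1c2a9e-0000-4000-8000-000000000001"; // example UUID
console.log(guard.check(sessionId, 1)); // first delivery
console.log(guard.check(sessionId, 1)); // retry of the same operation
console.log(guard.check(sessionId, 3)); // sequence 2 never arrived
```

Distinguishing "duplicate" from "gap" matters for the join/leave and reconnect cases: a duplicate can be acknowledged silently, while a gap means the server's view of that session is incomplete and applying later operations would risk divergence.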
**Pre-Phase 1 (Blocking: before coding begins)**

- [ ] Feature-flag gate: define emergency rollback procedure (BA-0) and implement WebSocket upgrade feature flag before any Phase 1 coding begins
- [ ] Infrastructure load test: 50 concurrent WebSocket editors, <300ms p99; plus high-frequency scenario (2–5 editors at 20+ ops/sec, sustained 10 minutes)
- [ ] ADR: select OT variant (Jupiter/MAYBE/SOC) with rationale against defined decision criteria (multi-server topology, correctness proof requirement, ShareDB compatibility audit)
- [ ] **Product sign-off: is offline editing Phase 1 or Phase 3? This decision is architecturally sequenced BEFORE any Phase 1 coding begins. If there is any possibility offline editing is a Phase 1 requirement, the OT-first architecture must not be locked.**
- [ ] Define Redis key namespaces for collaboration
- [ ] Design and migrate operation-history schema

**Phase 1 Ship Criteria**

- [ ] Document three net-new API contract surfaces
- [ ] QuickCheck-style OT property-based test suite (>10K random operation sequences)
- [ ] Client/server bounded reconciliation tests
- [ ] p50/p95/p99 latency telemetry with alert at 200ms
- [ ] WebSocket auth mirroring REST layer; RLS on operation-history table
- [ ] Define `MCP_GATEWAY_URL` in `.env.example` and runbooks
- [ ] Operation idempotency via session UUID + sequence number
- [ ] Update MCP tool schema if externally published

**Phase 3 Go/No-Go Gate**

- [ ] Automerge 2.0 benchmark on your document model
- [ ] Yjs benchmark on your operation mix and document sizes
- [ ] CRDT adversarial integration tests (partitions, clock skew, out-of-order)
- [ ] Product confirmation: offline editing is a shipping requirement
- [ ] OT→CRDT migration plan with data transformation logic, rollback validation tests, and timeline estimate
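The latency items in these checklists reduce to a small calculation over recorded round-trip times. The sketch below uses the nearest-rank percentile method and the 200ms and 300ms thresholds from the document; applying both thresholds to p99 is an assumption (the alert threshold's percentile is not stated), and the function names are hypothetical.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samplesMs, p) {
  if (samplesMs.length === 0) throw new RangeError("no samples");
  const sorted = [...samplesMs].sort((x, y) => x - y);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Assumed mapping of the document's thresholds onto p99:
//   p99 > 300ms  -> "re-evaluate" (architecture re-evaluation trigger)
//   p99 >= 200ms -> "alert"
function evaluateLatency(samplesMs) {
  const p50 = percentile(samplesMs, 50);
  const p95 = percentile(samplesMs, 95);
  const p99 = percentile(samplesMs, 99);
  const status = p99 > 300 ? "re-evaluate" : p99 >= 200 ? "alert" : "ok";
  return { p50, p95, p99, status };
}

// 98 fast operations and 2 slow ones: the median looks healthy, p99 does not.
const spiky = [...Array(98).fill(100), 250, 250];
console.log(evaluateLatency(spiky));
```

Reporting p50 alongside p99 is the point of the exercise: a handful of slow transformations under high-frequency editing can trip the gate while the median stays flat.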
1. **OT variant not selected.** The entire design says "implement OT" without specifying Jupiter, MAYBE, or SOC. Each has materially different properties. This is the single most consequential design decision not yet made.
2. **OT latency overhead unquantified.** The <5ms transformation overhead claim has no source. The 500ms latency budget may be adequate even at 50–100ms overhead, but the baseline is unknown. Phase 1 telemetry will resolve this empirically.
3. **Phase 1→Phase 3 "rollback" is re-architecture.** The spec describes migrating from OT to CRDT as a rollback plan. It is not. It is a full re-architecture of the document model, operation log, client sync layer, and persistence schema. No migration steps, data transformation logic, or timeline exist.
4. **Product sign-off on offline editing timing is an architecture gate, not a parallel task.** This dependency is stronger than its current presentation as a parallel task implies. If there is any possibility offline editing is a Phase 1 requirement, the OT-first architecture must not be locked.
5. **Verification plan lacks acceptance criteria.** Review found "Manual testing of real-time editing features" with no test scenarios, success metrics, or connection to the 50-editor / sub-500ms targets. The verification plan must be rewritten with concrete pass/fail gates.
6. **CRDT benchmark claims are unverified.** Automerge 2.0 (600ms/260K keystrokes) and Yjs (26K–156K ops/sec) figures appear in the research but have no corresponding source in the reference corpus. Phase 3 decisions must not rely on these figures without independent empirical validation on your workload.
7. **Race condition on editor join/leave.** No handling exists for WebSocket connection closure during in-flight operations. Rapid join/leave cycles could cause partial operations to be applied to the wrong revision, leading to client/server state divergence.
8. **Redis persistence policy unconfirmed.** The operation log is Redis-backed, but whether AOF is enabled, the RDB snapshot interval, and backup validation procedures are unspecified. If Redis restarts with AOF disabled, operation history is lost.
Overall: **Medium-High**

- **OT-first recommendation: High.** Grounded in Google Docs production precedent; verified by multiple sources; all five MUST criteria satisfied; strategic evaluation unanimous.
- **Phase 1 architecture: Medium.** Sound in principle but missing OT variant selection, verification acceptance criteria, and infrastructure validation. Three blocking issues identified.
- **Phase 3 CRDT evaluation path: Medium-Low.** Conceptually correct but benchmark claims are unverified; migration plan is a placeholder; Phase 3 timeline is not locked to roadmap.
- **Codebase integration: N/A.** No confirmed codebase-specific bugs identified; OT variant selection and infrastructure validation are the outstanding pre-Phase 1 gaps.
- **Cross-functional impact: High.** Systematically assessed across 7 domains; all impacts have concrete actions.

What was well-covered: OT vs. CRDT architectural tradeoffs, production precedent analysis, cross-functional impact mapping, assumption verification with breaking/degraded classification.

Where uncertainty remains: actual infrastructure capacity (untested), OT transformation latency (unmeasured), CRDT library performance on your specific workload (unverified benchmarks), Phase 3 timeline commitment (unconfirmed by product ownership).
1. **Infrastructure load test**: 50 WebSocket editors, 100 ops/sec, measure p99; plus high-frequency scenario (2–5 editors at 20+ ops/sec, sustained 10 minutes). Owner: DevOps/Eng. Deadline: before Phase 1 sprint. Blocks: all Phase 1 coding.
2. **ADR: select OT variant (Jupiter/MAYBE/SOC)** against defined decision criteria. Owner: Eng Lead. Deadline: before Phase 1 sprint. Blocks: OT engine implementation.
3. **Product sign-off: offline editing Phase 1 or Phase 3** (architecturally gates Phase 1 coding). Owner: Product Owner. Deadline: before Phase 1 coding begins. Blocks: architecture direction.
4. **Lock Phase 3 go/no-go date** to product roadmap. Owner: Product/Eng. Deadline: design freeze. Blocks: Phase 3 planning.
5. **Document three net-new API surfaces.** Owner: Eng. Deadline: before Phase 1 ships. Blocks: downstream integrations.
6. **Write QuickCheck OT test suite + reconciliation tests.** Owner: QA/Eng. Deadline: Phase 1 acceptance. Blocks: Phase 1 ship.
7. **Automerge + Yjs empirical benchmarks on your data model.** Owner: Eng. Deadline: before Phase 3 go/no-go. Blocks: Phase 3 commitment.

Actions #1–#3 are Phase 1 blockers. If no named assignee exists at design-freeze time, treat the item as unowned and escalate to Eng Lead before sprint planning commences.
- [R1] Building Collaborative Editing: The Battle Between Operational Transform and CRDTs. Medium. https://medium.com/@sohail_saifi/building-collaborative-editing-the-battle-between-operational-transform-and-crdts-fdceb63c54ac. Source tier: T3.
- [R2] CRDTs vs Operational Transformation: A Practical Guide to Real-Time Collaboration. HackerNoon. https://hackernoon.com/crdts-vs-operational-transformation-a-practical-guide-to-real-time-collaboration. Source tier: T3.
- [R4] CRDTs vs. Operational Transformation: How Google Docs Handles Collaborative Editing. SystemDR. https://systemdr.systemdrd.com/p/crdts-vs-operational-transformation. Source tier: T4.
- [R5] Building Collaborative Interfaces: Operational Transforms vs. CRDTs. DEV Community. https://dev.to/puritanic/building-collaborative-interfaces-operational-transforms-vs-crdts-2obo. Source tier: T3.
- [R7] Collaborative Text Editing with Eg-walker: Better, Faster, Smaller. arXiv (EuroSys '25). https://arxiv.org/pdf/2409.14252. Source tier: T4.
- [R10] CRDTs and Real-Time Collaboration: Building Conflict-Free Distributed Systems. Zylos Research. https://zylos.ai/research/2026-01-29-crdt-real-time-collaboration/. Source tier: T4.
- [R12] CRDTs in Production. InfoQ. https://www.infoq.com/presentations/crdt-production/. Source tier: T2.
- [R13] How to Build CRDT Implementation. OneUptime. https://oneuptime.com/blog/post/2026-01-30-crdt-implementation/view. Source tier: T4.
- [R14] Best CRDT Libraries 2025 — Real-Time Data Sync Guide. Velt. https://velt.dev/blog/best-crdt-libraries-real-time-data-sync. Source tier: T4.
- [R16] CRDTs: The Hard Parts. Martin Kleppmann. https://martin.kleppmann.com/2020/07/06/crdt-hard-parts-hydra.html. Source tier: T4.
- [R19] Operational Transformation. ot.js.org. https://ot.js.org/docs/operational-transformation/. Source tier: T4.
- [R20] Operational Transformation in Real-Time Group Editors. ACM DL. https://dl.acm.org/doi/pdf/10.1145/289444.289469. Source tier: T4.