The eCRF system you build in the first sprint will serve a 3-site, 80-subject pilot. The one you need two years later will run an 80-site, 12,000-subject Phase III trial across 15 countries. These are not the same system — unless you make the right architectural decisions at the beginning. This guide shows you what those decisions are.
Average schema rework cost when monolithic eCRF design meets protocol amendments mid-trial
Median time for a self-built eCRF to reach production readiness for a Phase III trial
NCCI code pairs updated quarterly — one reason compliance maintenance never stops
HIPAA minimum retention period for eCRF audit logs after marketing approval
01. Metadata-Driven Schema Design: The Decision That Defines Everything
The single most consequential eCRF architecture decision is choosing between a monolithic protocol-specific schema and a metadata-driven schema. In a monolithic design, CRF structure is encoded directly in database tables — creating tables named VISIT_2_VITALS, ECG_WEEK12, SCREENING_LABS. Every protocol amendment (which happens in virtually every Phase II/III trial) requires a database migration, which in a regulated environment means a validation protocol, change control record, and partial or full revalidation of the system.
CRF structure encoded in DDL. Each new visit or field = new table migration. Protocol Amendment #3 adds a safety assessment → migration script → validation protocol → revalidation → 4–6 week delay. After 8 amendments: 8 migration cycles, 8 revalidations. Data management budget consumed by maintenance, not science.
Core tables: STUDY, SUBJECT, VISIT, FORM, ITEM, ITEM_DATA. CRF structure lives in metadata configuration, not DDL. Protocol Amendment #3 updates metadata — zero DB migrations, zero schema revalidation. Same validated system handles unlimited amendments. Data architecture stable for the life of the trial.
The core entity model of a metadata-driven eCRF separates the data model (static, validated once, never changes) from the study definition (CRF structure, visit schedule, edit checks — changeable without schema impact). This is the architecture used by every purpose-built EDC platform: Medidata Rave, Oracle Clinical One, OpenClinica, and Veeva Vault CDMS all implement this separation as their foundational design choice.
Read More: Why Many eCRF Systems Fail at Scale
02. 21 CFR Part 11 Audit Trail Architecture
21 CFR Part 11 mandates that any electronic system used to create, modify, or maintain records used in FDA-regulated submissions must maintain a computer-generated, time-stamped audit trail that captures the date, time, and identity of every operator entry and action that creates, modifies, or deletes electronic records. This is non-negotiable — an eCRF without a 21 CFR Part 11 compliant audit trail cannot support a regulatory submission.
21 CFR Part 11 — §11.10(e) — Audit Trail Requirement Use of computer-generated, time-stamped audit trails to independently record the date and time of operator entries and actions that create, modify, or delete electronic records. Record changes shall not obscure previously recorded information and audit trail documentation shall be retained for a period at least as long as that required for the subject electronic records. Source: 21 CFR §11.10(e) · US FDA Electronic Records Regulation
🚨 OCR Audit Response Time
FDA inspectors can request audit evidence within 10 business days. If your audit log export requires a custom engineering query or manual data compilation, you will miss this window. Audit log export must be a self-service function accessible to your QA team — not an engineering ticket. Design the export API before your first subject is enrolled, not during a 483 response.
03. Form Versioning & Protocol Amendment Architecture
Protocol amendments produce CRF amendments. A subject enrolled before Amendment 3 may have Week 4 data captured on CRF version 1.2. After Amendment 3, Week 4 adds a new safety assessment form — captured on CRF version 1.5. At database lock, your system must clearly identify which version of each form was used to collect each data point, for every subject. An eCRF that overwrites its active CRF definition and re-renders historical data against the new version destroys regulatory traceability.
-
Form versions are immutable objects, not editable records. Each CRF version is a new, independently addressable object with its own UUID. Published versions are never modified — only new versions are created. Historical data always references the version active at collection time.
-
Subject-version assignment is tracked per form instance. Each ItemData record carries a crf_version_id foreign key — not just a form_oid. This enables the system to render historical data exactly as it was collected, and the investigator to see what questions a subject answered at each visit.
-
Amendment deployment is tenant-aware. When Amendment 3 is activated, the system determines which subjects are affected based on their enrollment date and applicable protocol version, and activates the correct CRF version for future visits without altering past data.
-
Version comparison audit report is mandatory. When deploying a new CRF version, the system must produce a diff report showing exactly which items were added, removed, or modified — stored permanently as part of the study's regulatory documentation package.
04. Multi-Site Role Architecture That Scales to 100+ Sites
A naive RBAC implementation assigns permissions per-user, per-study, per-site. With 80 sites averaging 6 users each, across 3 active studies, you have 1,440 individual role assignments to maintain — before accounting for staff turnover running at 20–30% annually in clinical research. Maintaining this matrix is a full-time job for a data manager, and errors create either access gaps (sites cannot enter data) or access violations (users see data they shouldn't).
| Role | Data Entry | Query Respond | Read All Sites | Lock Records | Export Data |
|---|---|---|---|---|---|
| Site Investigator (PI) | Own site only | Own site only | No | Own site only | No |
| Site Coordinator (CRC) | Own site only | Own site only | No | No | No |
| Clinical Data Manager | No | All sites | All sites (read) | With approval | All sites |
| Sponsor Monitor (CRA) | No | Assigned sites | Assigned sites | No | Assigned sites |
| Biostatistician | No | No | All (read-only) | No | Anonymized only |
| System Administrator | No clinical data | No | Config access only | No | Audit logs only |
💡 Site-Scoped Role Templates
The scalable architecture uses site-scoped role templates — a role (PI, CRC, Monitor) is assigned at the site level, and all users at that site inherit that role's permissions relative to that site's data. Adding a new PI to site 047 is one assignment. The template handles the rest. Role logic is computed at query time from (site_id, user_id, role_template) — never stored as a flat permission matrix per user.
Peerbits Service: EHR Integration Services
05. Edit Check Rules Engine: Parameterized & Maintainable
Edit checks are the primary mechanism for catching data quality issues at the point of entry. A well-governed edit check library catches protocol deviations, out-of-range values, and missing mandatory data before they become queries. A poorly managed library with 200+ hardcoded checks — added one at a time by data managers over years — fires incorrectly on 15–20% of encounters, trains site staff to ignore warnings, and requires weeks of remediation for each protocol amendment that invalidates check logic.
The correct architecture implements edit checks as a parameterized rule engine: each check is an instance of a typed rule template with parameters stored in the study metadata layer. This enables firing rate monitoring per rule, automated regression testing on amendment, and site-level exception configuration without forking rule logic.
06. Query Management at Scale: Event-Sourced Workflow Service
A Phase III trial with 80 sites and 4,000 subjects may have 20,000 open queries at any point in the study. eCRF systems that store query state in the primary clinical database — requiring joins across clinical data, form metadata, user assignments, and query history for every dashboard load — become progressively slower as the trial scales. Query dashboards that take 8 seconds to load are abandoned by data managers who switch back to spreadsheets.
Queries are a workflow object, not a database column. Architect them as a dedicated service with an event-sourced state model: every state transition (Open → Answered → Closed / Reopened) is an immutable event, not an in-place update. Query state is computed from the event log — you get a complete query lifecycle history for free, and query load is isolated from clinical data entry performance.
💡 Query Performance Benchmark
A well-architected query management service must handle 50,000 concurrent open queries across 100 sites with sub-200ms dashboard load. If your query dashboard takes more than 2 seconds to load at 10,000 open queries, your query architecture will fail before the end of a large Phase III trial. Test this explicitly during architecture review — not during database lock.
07. Offline Data Capture for Low-Connectivity Sites
Clinical trial sites range from academic medical centers with enterprise IT to community clinics in rural Kenya and hospital networks in Southeast Asia where internet connectivity is intermittent. An eCRF with no offline capability forces these sites back to paper, which defeats the entire purpose of electronic data capture. But offline mode in a regulated context is not simply "cache the form and sync later" — it carries specific regulatory requirements around timestamp integrity, conflict resolution, and audit trail continuity.
-
Device clock timestamp vs server receipt timestamp must both be captured. Offline entries are timestamped by device clock at collection time. The server timestamp is captured at sync time. Both must be stored in the audit trail. If device and server timestamps differ by more than 2 minutes, the system must flag the entry and require investigator attestation for the discrepancy.
-
Conflicts require human resolution, never last-write-wins. When an offline session syncs and finds the same field was edited both offline and by a remote user (e.g., a data manager correction), a silent last-write-wins resolution is a data integrity violation under 21 CFR Part 11. Conflicts must surface as queries requiring human resolution with a documented rationale.
-
Form submission must be atomic. All items in a form submission sync as a single transaction — all committed or none. Partial sync leaving some items pending produces incomplete records that fail edit checks unpredictably and create phantom query states.
-
Offline edit checks must match online edit checks. If the offline client runs a different version of edit checks than the server, the same data entry will produce different query states online vs offline. Edit check logic must be delivered to offline clients as versioned configuration, not as compiled code embedded in the client application.
08. CTMS & Central Lab Integration: Preventing Silent Data Misalignment
A functioning clinical trial generates data from at least four concurrent systems: the eCRF (site-entered clinical data), the CTMS (site contacts, monitoring visits, IRB submissions), the central lab system (hematology, chemistry, PK samples via HL7 v2 ORU messages), and the IxRS/RTSM (randomization and trial supply management). Each integration that works at study launch has a 30–40% probability of breaking within 12 months as source systems are upgraded — and when they break, they typically break silently.
A lab integration that stops receiving HL7 v2 ORU messages doesn't produce errors — it produces missing lab values that appear as incomplete CRFs rather than integration failures. The eCRF team discovers this 3 months later during a site monitoring visit. The architectural fix is proactive integration health monitoring: track the expected message frequency from each lab system (a LIMS typically sends result batches within 24 hours of analysis), and alert when the frequency drops below threshold before site staff or monitors notice.
Read More: Common Challenges in FHIR Integrations
09. CDISC SDTM Export: Built-In, Not Bolted On
CDISC SDTM (Study Data Tabulation Model) is the FDA and PMDA required format for clinical trial data submissions. If your eCRF architecture doesn't map to SDTM domains at design time, database lock becomes a 6–12 month data transformation project. This is by far the most expensive mistake in clinical data management — a mistake that is entirely preventable with upfront CDASH-to-SDTM alignment.
The correct approach maps every eCRF item to its SDTM variable at the time the item is defined in the metadata layer. SDTM domain (DM, AE, LB, VS, CM, EX, etc.), variable name, controlled terminology, and origin flag are all metadata attributes of each item definition. SDTM export is then a query against the metadata-driven model — not a custom ETL pipeline developed post-hoc at database lock.
"An eCRF that maps to SDTM at collection generates its submission dataset as a view. An eCRF that defers SDTM mapping generates a 6-month crisis."
— Peerbits Clinical Data Engineering Practice, observed across 40+ EDC implementations
10. Database Lock & Closeout: Orchestrated, Not Manual
Database lock — the formal process of freezing a clinical trial dataset for statistical analysis — should be an orchestrated, automated workflow. In practice, it is frequently a multi-week manual process of spreadsheet-based checklists, individual query reviews, and manual confirmations. An eCRF system built for scale makes database lock a machine-checkable precondition, not a human-reviewed milestone.
01
Lock Readiness Audit (Automated)
System automatically checks: zero outstanding mandatory queries, all required items complete for all subjects, all subject dispositions resolved, all protocol deviation forms completed. Produces a lock readiness report by site, by form, by subject — not a manual curation task.
Output: Lock_Readiness_Report_[date].pdf — directly submittable to QA
02
Double-Entry Verification (if applicable)
For paper-to-electronic transcription trials, automated comparison of primary and secondary entry discrepancies with discrepancy rate by site and form. Discrepancies above threshold trigger pre-lock site follow-up before proceeding.
Reconciliation algorithm: Levenshtein distance for text; absolute tolerance for numerics
03
SDTM Export & Validation
Automated SDTM dataset generation from the metadata-driven export view. CDISC Pinnacle21 Community validation runs automatically, and all P21 errors must be resolved before lock approval is requested. Produces submission-ready XPT files.
Format: SAS XPT v5 transport files · define.xml (CDISC Define-XML 2.1)
04
Formal Lock with Electronic Signatures
Authorized signatories apply electronic signatures via 21 CFR Part 11 compliant e-signature workflow. The lock event is recorded in an immutable audit log. All subsequent data changes require a formal unlock procedure with documented justification.
E-signature: FDA 21 CFR Part 11 §11.50 compliant · username + password + 2FA
05
Archive Package Generation
Automated generation of archive package including complete audit trails, SDTM and ADaM datasets, CRF PDFs per subject, edit check logs, query logs, user access logs, and validation documentation. The archive is encrypted, checksummed, and stored according to compliant retention requirements.
Storage: WORM-compliant object storage with SHA-256 integrity hash per file
Read More: HIPAA by Design: The Engineering Blueprint for Compliant Healthcare Systems
The Complete eCRF Data Flow Architecture
The 10 pillars above work together as an integrated system. Here is how data flows through a production eCRF architecture from site entry to regulatory submission:
Table------------------
💡 FHIR Integration for Patient Pre-Population
An increasingly common eCRF pattern pre-populates subject demographic and baseline data from the site's EHR via FHIR R4 Patient and Condition resources, reducing manual transcription errors and site burden. This requires SMART on FHIR EHR launch from within the eCRF, patient matching via MPI, and FHIR DocumentReference for visit notes. See our FHIR Integration Architecture Guide for the complete EHR pre-population pattern, and our FHIR Data Mapping guide for the terminology and null-data handling challenges unique to clinical trial contexts.
Build Your eCRF to Survive Phase III — Not Just Phase I
The architectural decisions in this guide are the difference between an eCRF that handles your first pilot study and one that handles a 12,000-subject global Phase III trial without a migration, a revalidation cycle, or a database lock crisis. Every decision — metadata-driven schema, immutable audit trail, parameterized edit checks, site-scoped RBAC, SDTM mapping at definition time — is made once and pays dividends across every study the platform runs.
Peerbits has built production eCRF and EDC platforms for Phase I through IV clinical trials, covering FDA, EMA, and PMDA submission environments. Our implementations include metadata-driven schema design validated once, 21 CFR Part 11 compliant audit architecture, CDISC CDASH-aligned data collection with automated SDTM export, and multi-site RBAC that scales to 200+ sites without administrative overhead. We deliver from architecture review through production deployment and ongoing regulatory compliance support.
Book Free eCRF Architecture Review







