Physician burnout is not a new problem — but AI is offering a new answer. The average clinician spends nearly two hours on documentation for every hour of direct patient care. AI medical scribe platforms are closing that gap by listening to clinical conversations in real time and generating structured, accurate clinical notes that flow directly into the EHR. But building one is an architecture challenge of the first order.
In this post, we break down the full technical architecture behind production-grade AI medical scribe systems — the kind Peerbits architects and builds for healthcare clients. Whether you are a CTO evaluating a platform purchase, a startup founder scoping an MVP, or an engineer tasked with building one, this is the blueprint.
What an AI Medical Scribe Actually Does
At its core, an AI medical scribe captures the ambient audio of a clinical encounter, transcribes it, understands clinical intent and context, extracts structured clinical data, and writes it into a draft note — formatted to the physician's specialty, documentation style, and EHR template — in near real time.
The output is not a raw transcript. It is a clinically structured SOAP note, HPI, assessment and plan, or procedure note — with ICD-10 and CPT code suggestions, medication reconciliation, and follow-up action items. That distinction matters enormously for architecture decisions.
💡 Key Insight
AI scribes are not transcription tools with a formatting layer. They are multi-stage clinical intelligence pipelines where every stage — audio capture, ASR, diarisation, NLU, code extraction, EHR write-back — must be engineered for medical-grade accuracy, latency, and security.
The Seven-Layer Architecture
A production AI medical scribe platform can be decomposed into seven distinct architectural layers, each with its own technology choices, failure modes, and compliance obligations.
AI Medical Scribe platform architecture layers:
1. Audio Capture & Streaming Layer
2. Automatic Speech Recognition (ASR) & Speaker Diarisation
3. Clinical NLP & LLM Reasoning Engine
4. Medical Coding & Entity Extraction
5. EHR Integration & FHIR Layer
6. Security, Privacy & Audit Layer
7. Physician Review & Feedback UI
Layer 1: Audio Capture & Streaming
The pipeline begins in the exam room. Audio capture must be low-latency, noise-resilient, and encrypted at the point of capture. Most production platforms use a lightweight mobile or desktop client — sometimes a purpose-built smart badge or ambient microphone device — that encodes audio using the Opus codec and streams it in 100–500ms chunks over a persistent WebSocket connection.
Voice Activity Detection (VAD) runs on-device to suppress silence and non-speech noise, reducing upstream data volume by 40–60% and filtering out background clinic noise before the audio ever reaches the server. This is critical both for latency and for minimising the amount of raw PHI transmitted over the network.
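To make the on-device filtering step concrete, here is a minimal sketch of the simplest possible VAD: an RMS energy gate over 16-bit PCM frames. It is an illustration only. The frame length, sample rate, and threshold are assumed values, and a production client would more likely use a trained VAD model than a fixed energy threshold.

```python
import math
import struct

SAMPLE_RATE = 16_000       # assumed capture rate in Hz
FRAME_MS = 20              # assumed frame length in milliseconds
ENERGY_THRESHOLD = 500.0   # hypothetical RMS threshold; tune per device and room


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian mono PCM frame."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def filter_speech(frames, threshold=ENERGY_THRESHOLD):
    """Keep only the frames whose RMS energy exceeds the threshold."""
    return [f for f in frames if frame_rms(f) > threshold]


def make_frame(amplitude: int) -> bytes:
    """Synthetic square-wave frame, used here only to demonstrate the gate."""
    n = SAMPLE_RATE * FRAME_MS // 1000
    samples = [amplitude if i % 2 == 0 else -amplitude for i in range(n)]
    return struct.pack(f"<{n}h", *samples)


if __name__ == "__main__":
    silence, speech = make_frame(0), make_frame(4000)
    kept = filter_speech([silence, speech, silence])
    print(f"kept {len(kept)} of 3 frames")
```

Only the frames that pass the gate would be encoded and streamed, which is where the reduction in upstream volume comes from.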
Key Engineering Decisions
All audio must be encrypted in transit using TLS 1.3. For multi-doctor clinic environments, beamforming microphones or directional audio capture improve speaker separation before diarisation. Offline-capable clients with local buffering are important for clinics with unreliable connectivity.
Layer 2: ASR & Speaker Diarisation
General-purpose speech-to-text models fail in clinical environments. Drug names, anatomical terms, acronyms like "HbA1c" or "CABG," and physician dictation cadences are poorly handled by consumer ASR. Production medical scribes use either fine-tuned versions of Whisper or specialised medical ASR providers, augmented with custom vocabulary injection for specialty-specific terminology.
Speaker diarisation, the process of segmenting the transcript by speaker ("Doctor" vs. "Patient"), is equally critical. The LLM reasoning layer needs to know who said what: the patient's self-reported symptoms differ from the physician's clinical observations. Tools like pyannote.audio provide speaker-segmented timestamps that are merged with the ASR output before NLU processing begins.
Audio Segmentation: Raw audio chunks are VAD-filtered and segmented into speech windows for streaming ASR inference.
Parallel Transcription: The medical ASR model transcribes each chunk with clinical vocabulary; a rolling context window maintains accuracy.
Diarisation Merge: Speaker timestamps from the diarisation model are merged with the transcript; the output is a labelled turn-by-turn dialogue.
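The diarisation merge can be sketched in a few lines. The example below assumes the ASR stage emits word-level timestamps and the diarisation stage emits speaker segments; the `Word` and `SpeakerSegment` types are hypothetical stand-ins, not the output format of any particular library. Each word is assigned to the speaker whose segment overlaps it most, and consecutive words from the same speaker are collapsed into turns.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float


@dataclass
class SpeakerSegment:
    speaker: str
    start: float
    end: float


def overlap(word: Word, seg: SpeakerSegment) -> float:
    """Length in seconds of the time interval shared by a word and a segment."""
    return max(0.0, min(word.end, seg.end) - max(word.start, seg.start))


def assign_speakers(words, segments):
    """Label each word with the speaker whose segment overlaps it most."""
    labelled = []
    for word in words:
        best = max(segments, key=lambda s: overlap(word, s), default=None)
        if best is None or overlap(word, best) == 0.0:
            labelled.append(("UNKNOWN", word.text))  # no segment covers this word
        else:
            labelled.append((best.speaker, word.text))
    return labelled


def to_turns(labelled):
    """Collapse consecutive words from the same speaker into one turn."""
    turns = []
    for speaker, text in labelled:
        if turns and turns[-1][0] == speaker:
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns


if __name__ == "__main__":
    words = [
        Word("Any", 0.0, 0.3), Word("chest", 0.3, 0.7), Word("pain?", 0.7, 1.0),
        Word("Yes,", 1.4, 1.7), Word("sometimes.", 1.7, 2.3),
    ]
    segments = [SpeakerSegment("Doctor", 0.0, 1.2), SpeakerSegment("Patient", 1.3, 2.5)]
    for speaker, text in to_turns(assign_speakers(words, segments)):
        print(f"{speaker}: {text}")
```

Words that fall outside every segment are labelled `UNKNOWN` rather than guessed, so the downstream layer can treat them with caution.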
Layer 3: The LLM Clinical Reasoning Engine
This is the intellectual core of the platform. The speaker-labelled transcript is passed to a large language model — either a proprietary frontier model like Claude or GPT-4, or a fine-tuned open-source model such as Meditron or BioMedLM — along with a structured system prompt that defines the output schema, specialty context, and documentation style.
Prompt engineering at this layer is sophisticated. The system prompt encodes the note type (SOAP, HPI, DAP), the specialty (cardiology, primary care, psychiatry), the physician's preferred phrasing patterns, and explicit instructions to distinguish between patient-reported symptoms and physician-observed findings. Retrieval-Augmented Generation (RAG) connects the LLM to a clinical knowledge base containing drug interaction data, clinical guidelines, and the patient's longitudinal record for context.
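As a rough illustration of how those elements might be assembled, the sketch below builds a system prompt from the note type, specialty, style hints, and retrieved context. The function name, the wording of the instructions, and the `OUTPUT_KEYS` mapping are all assumptions made for this example; a real prompt would be considerably longer and tuned per specialty.

```python
# Hypothetical mapping from note type to the JSON keys the model must return.
OUTPUT_KEYS = {
    "SOAP": ["subjective", "objective", "assessment", "plan"],
    "DAP": ["data", "assessment", "plan"],
}


def build_system_prompt(note_type, specialty, style_hints=(), retrieved_context=()):
    """Assemble a system prompt from the note type, specialty, style, and RAG context."""
    if note_type not in OUTPUT_KEYS:
        raise ValueError(f"unsupported note type: {note_type}")
    lines = [
        f"You are drafting a {note_type} note for a {specialty} encounter.",
        "Keep patient-reported symptoms separate from physician-observed findings.",
        "Use only information in the transcript or the supplied context; "
        "leave a field empty rather than guess.",
        "Return JSON with exactly these keys: " + ", ".join(OUTPUT_KEYS[note_type]) + ".",
    ]
    if style_hints:
        lines.append("Physician style preferences:")
        lines.extend(f"- {hint}" for hint in style_hints)
    if retrieved_context:
        lines.append("Reference context (retrieved, may be incomplete):")
        lines.extend(f"- {snippet}" for snippet in retrieved_context)
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_system_prompt(
        "SOAP",
        "cardiology",
        style_hints=["Use short declarative sentences."],
        retrieved_context=["Active medication list retrieved from the patient record."],
    ))
```

Keeping the prompt as structured data until the last moment makes it easier to version and test per-specialty variations.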
“The difference between a mediocre AI scribe and a great one is almost entirely in the LLM layer — specifically in how well the prompt architecture mirrors the cognitive workflow of a clinician writing a note.”
— Peerbits Healthcare AI Engineering Team
Hallucination control is a non-negotiable concern. Unlike a consumer chatbot, a hallucinated clinical note can directly harm a patient. Production platforms implement structured output constraints (JSON schema enforcement), confidence scoring, and mandatory physician review gates before any note is committed to the EHR.
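A minimal sketch of these three controls, using only the standard library, might look like the following. The section names, the `confidence` field, and the 0.8 floor are assumptions for illustration; how confidence is actually produced (model self-report, a separate scorer, or token-level statistics) is a design decision this example does not settle.

```python
import json

REQUIRED_SECTIONS = ("subjective", "objective", "assessment", "plan")
CONFIDENCE_FLOOR = 0.8  # hypothetical threshold for flagging a section


def parse_draft(raw: str) -> dict:
    """Parse model output and reject anything that does not match the expected shape."""
    draft = json.loads(raw)  # raises ValueError (JSONDecodeError) on malformed JSON
    if not isinstance(draft, dict):
        raise ValueError("draft must be a JSON object")
    extra = set(draft) - set(REQUIRED_SECTIONS)
    if extra:
        raise ValueError(f"unexpected sections: {sorted(extra)}")
    for name in REQUIRED_SECTIONS:
        section = draft.get(name)
        if not isinstance(section, dict) or not isinstance(section.get("text"), str):
            raise ValueError(f"section missing or malformed: {name}")
        conf = section.get("confidence")
        if isinstance(conf, bool) or not isinstance(conf, (int, float)):
            raise ValueError(f"confidence must be a number: {name}")
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence out of range: {name}")
    return draft


def sections_to_flag(draft: dict, floor: float = CONFIDENCE_FLOOR) -> list:
    """Sections the review UI should highlight for closer attention."""
    return [n for n in REQUIRED_SECTIONS if draft[n]["confidence"] < floor]


def commit_note(draft: dict, physician_approved: bool) -> dict:
    """Refuse to release a note for EHR write-back without explicit approval."""
    if not physician_approved:
        raise PermissionError("physician review is required before commit")
    return {name: draft[name]["text"] for name in REQUIRED_SECTIONS}
```

The commit gate is deliberately independent of the confidence scores: a high-confidence draft still cannot be written back without physician approval.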
Layer 4: Medical Coding & Entity Extraction
In parallel with note generation, a dedicated NLP pipeline extracts structured clinical entities from the LLM output: diagnoses, medications, dosages, lab values, procedures, allergies, and problem list updates. Each entity is mapped to standard terminologies — ICD-10-CM for diagnoses, CPT for procedures, RxNorm for medications, SNOMED CT for clinical findings.
This structured output serves two purposes. First, it auto-populates the EHR's structured fields — problem list, medication list, order entry — reducing the physician's click burden. Second, it provides coding suggestions for the billing team, with HCC (Hierarchical Condition Category) risk scores that are increasingly relevant for value-based care contracts.
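The mapping step can be pictured as a lookup from extracted entity text to a terminology code. The sketch below uses a tiny hand-written table purely for illustration; the table contents, the `CodedEntity` type, and the function names are assumptions, and a real pipeline would query a licensed terminology service rather than a dictionary. Entities with no match are flagged for human review instead of being assigned a guessed code.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative lookup only; a production system would use a terminology server.
TERMINOLOGY = {
    ("diagnosis", "type 2 diabetes mellitus"): ("ICD-10-CM", "E11.9"),
    ("diagnosis", "essential hypertension"): ("ICD-10-CM", "I10"),
    ("procedure", "established patient office visit"): ("CPT", "99213"),
}


@dataclass
class CodedEntity:
    kind: str
    text: str
    system: Optional[str]
    code: Optional[str]

    @property
    def needs_review(self) -> bool:
        """True when no code was found and a human coder must decide."""
        return self.code is None


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so lookups are not spelling-fragile."""
    return " ".join(text.lower().split())


def code_entities(entities):
    """Map (kind, text) pairs to terminology codes, leaving misses uncoded."""
    coded = []
    for kind, text in entities:
        system, code = TERMINOLOGY.get((kind, normalise(text)), (None, None))
        coded.append(CodedEntity(kind, text, system, code))
    return coded


if __name__ == "__main__":
    extracted = [
        ("diagnosis", "Essential  Hypertension"),
        ("medication", "metformin 500 mg"),
    ]
    for entity in code_entities(extracted):
        status = "REVIEW" if entity.needs_review else f"{entity.system} {entity.code}"
        print(f"{entity.kind}: {entity.text} -> {status}")
```

The same pattern extends to RxNorm and SNOMED CT by adding entries for those systems to the lookup source.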
Layer 5: EHR Integration & FHIR
Integration with the EHR is where most AI scribe projects hit their hardest engineering challenges. The EHR vendor ecosystem is fragmented: Epic, Oracle Health (formerly Cerner), Athenahealth, eClinicalWorks, Meditech, and dozens of smaller systems each have their own API standards, authentication models, and data schemas.
The modern approach is FHIR R4 — HL7's RESTful interoperability standard — accessed via SMART on FHIR OAuth2 for secure, delegated access. A well-architected scribe platform maintains a FHIR abstraction layer that normalises EHR-specific data models into canonical FHIR resources (DocumentReference, Composition, DiagnosticReport, MedicationRequest), and translates write operations back into the EHR's native format.
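As an example of the canonical form, the sketch below builds a FHIR R4 DocumentReference for a draft note as a plain Python dictionary. The function name and the choice of the LOINC "Progress note" code are assumptions for illustration; the note is marked `preliminary` to reflect that it has not yet been signed off. Posting it to an EHR would additionally require a SMART on FHIR access token, which is out of scope here.

```python
import base64
import json


def build_document_reference(patient_id: str, practitioner_id: str, note_text: str) -> dict:
    """Wrap a draft note in a FHIR R4 DocumentReference resource."""
    encoded = base64.b64encode(note_text.encode("utf-8")).decode("ascii")
    return {
        "resourceType": "DocumentReference",
        "status": "current",
        "docStatus": "preliminary",  # draft: not yet reviewed and signed
        "type": {
            "coding": [{
                "system": "http://loinc.org",
                "code": "11506-3",
                "display": "Progress note",
            }]
        },
        "subject": {"reference": f"Patient/{patient_id}"},
        "author": [{"reference": f"Practitioner/{practitioner_id}"}],
        "content": [{
            "attachment": {
                "contentType": "text/plain",
                "data": encoded,  # FHIR attachments carry base64-encoded bytes
            }
        }],
    }


if __name__ == "__main__":
    resource = build_document_reference("example-patient", "example-practitioner",
                                        "S: Reports intermittent chest pain.")
    print(json.dumps(resource, indent=2))
```

The abstraction layer described above would produce resources like this one and then hand them to a vendor-specific adapter for the actual write.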
Peerbits Differentiator
Peerbits' HIDEM middleware handles the HL7 v2 / FHIR R4 / MLLP translation layer as a reusable multi-tenant SaaS component, dramatically reducing EHR integration timeline for scribe platform builds from 4–6 months to 6–8 weeks.
Layer 6: HIPAA Compliance & Security Architecture
Every layer of the stack handles PHI, which means every layer must be HIPAA-compliant. This is not a checklist exercise — it is a pervasive architectural constraint that touches infrastructure, data flows, access controls, and vendor agreements.
| Requirement | Implementation | Standard |
|---|---|---|
| PHI Encryption at Rest | AES-256 with per-tenant key management (AWS KMS / Azure Key Vault) | HIPAA §164.312 |
| PHI Encryption in Transit | TLS 1.3 enforced; audio stream encrypted at the capture point before it leaves the device | HIPAA §164.312 |
| Access Control | RBAC with role-specific PHI scopes; MFA enforced for all clinical users | HIPAA §164.308 |
| Audit Logging | Immutable audit trail (CloudTrail / Azure Monitor) with 6-year retention | HIPAA §164.312 |
| Data Residency | US-only data residency; cross-region replication disabled for PHI | BAA Required |
| LLM PHI Controls | No PHI in LLM fine-tuning; PHI de-identification before any third-party API call | HIPAA §164.514 |
| Breach Response | Automated breach detection and DLP monitoring; notification without unreasonable delay and no later than 60 days after discovery | HIPAA §164.400-414 |
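The audit logging row relies on managed cloud services for immutability. As a complementary illustration of what tamper evidence means at the application level, here is a minimal hash-chained log: each entry stores the hash of the previous one, so any later modification breaks verification. This is a sketch under assumptions (the field names and the in-memory list are hypothetical), not a description of how CloudTrail or Azure Monitor work internally.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(body: dict) -> str:
    """Deterministic SHA-256 over the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_event(log: list, timestamp: str, actor: str, action: str, resource: str) -> None:
    """Append an audit entry linked to the hash of the previous entry."""
    body = {
        "timestamp": timestamp,
        "actor": actor,
        "action": action,
        "resource": resource,
        "prev": log[-1]["hash"] if log else GENESIS,
    }
    log.append({**body, "hash": _digest(body)})


def verify(log: list) -> bool:
    """Recompute every hash and link; False if any entry was altered or reordered."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body.get("prev") != prev or _digest(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


if __name__ == "__main__":
    log = []
    append_event(log, "2024-01-01T09:00:00Z", "dr-example", "read", "Patient/example")
    append_event(log, "2024-01-01T09:05:00Z", "dr-example", "commit-note", "Patient/example")
    print("intact:", verify(log))
    log[0]["action"] = "delete"  # simulate tampering
    print("after tampering:", verify(log))
```

A chain like this detects tampering but does not prevent it, so it is best paired with write-once storage.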
One particularly sensitive area is LLM API usage. If your clinical NLP layer calls a third-party LLM API, including frontier models, the API provider must sign a Business Associate Agreement (BAA) before it receives any PHI. Without a BAA, only data de-identified under HIPAA §164.514 (Safe Harbor or Expert Determination) may be transmitted, and de-identifying wherever the task allows remains good practice even with a BAA in place. Peerbits addresses this by supporting both hosted frontier models (with BAA) and on-premise or VPC-deployed open-source models for clients who require zero PHI egress.
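To show the shape of a de-identification pass, the sketch below redacts a few pattern-based identifiers with regular expressions. It is deliberately incomplete: Safe Harbor covers eighteen identifier categories, including names and geographic details that regular expressions cannot reliably catch, so a real pipeline would combine patterns like these with a clinical named-entity model. The pattern list and placeholder labels are assumptions for illustration.

```python
import re

# A small, illustrative subset of identifier patterns (US formats assumed).
PATTERNS = [
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")),
]


def redact(text: str) -> str:
    """Replace each matched identifier with a bracketed placeholder label."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    sample = "Call 555-123-4567 or jane@example.com, DOB 03/14/1962, SSN 123-45-6789."
    print(redact(sample))
    # prints: Call [PHONE] or [EMAIL], DOB [DATE], SSN [SSN].
```

Placeholders keep the sentence structure intact, which helps the downstream model while removing the identifying values.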
Layer 7: Physician Review UX & Feedback Loop
Even the most accurate AI-generated note requires a physician review step. Regulatory requirements, liability, and clinical judgment all demand a human in the loop before any note is finalised. The UX at this layer is disproportionately important to adoption: if the review interface is clunky, physicians will abandon the tool.
Best-in-class review UIs present the draft note with inline editing, a side-by-side transcript for reference, and a single-click commit to the EHR. Section-level confidence indicators highlight areas where the model was uncertain. Physicians' edits are captured as feedback signals that feed a continuous fine-tuning loop, improving model accuracy over time for that specific practice.
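Capturing edits as feedback can start with something as simple as diffing the draft against the signed note. The sketch below uses Python's standard `difflib` to record, per section, whether the physician changed the text, roughly how much, and what the diff was. The record layout and the word-level `edit_ratio` metric are assumptions for illustration; how such records feed a fine-tuning job is beyond this example.

```python
import difflib


def edit_ratio(draft: str, final: str) -> float:
    """Fraction of word-level content that changed: 0.0 means identical."""
    matcher = difflib.SequenceMatcher(None, draft.split(), final.split())
    return 1.0 - matcher.ratio()


def feedback_record(section: str, draft: str, final: str) -> dict:
    """Summarise a physician's edit to one note section as a feedback signal."""
    diff = list(difflib.unified_diff(
        draft.splitlines(), final.splitlines(), lineterm="", n=0
    ))
    return {
        "section": section,
        "changed": draft != final,
        "edit_ratio": round(edit_ratio(draft, final), 3),
        "diff": diff,
    }


if __name__ == "__main__":
    record = feedback_record(
        "assessment",
        "Likely stable angina.",
        "Likely stable angina; rule out GERD.",
    )
    print(record["changed"], record["edit_ratio"])
    for line in record["diff"]:
        print(line)
```

Sections with consistently high edit ratios are natural candidates for prompt or model adjustments for that practice.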
Why Peerbits for AI Medical Scribe Development
Building an AI medical scribe platform requires deep competency across a genuinely unusual combination of disciplines: real-time audio engineering, medical NLP, LLM prompt architecture, FHIR interoperability, and HIPAA-compliant cloud infrastructure. Most engineering teams are strong in one or two of these areas. Peerbits has assembled dedicated pods for all of them.
Our healthcare AI practice has delivered FHIR integration middleware, HIPAA-compliant LLM pipelines, and AI-augmented clinical documentation tools for healthcare clients in the US and Europe. We build on your infrastructure — or provision compliant infrastructure from scratch — with full BAA coverage and end-to-end ownership of the delivery.
Peerbits Healthcare AI Team
Engineering Practice — Ahmedabad, India
Peerbits' healthcare engineering practice builds AI-powered clinical tools, FHIR middleware, and HIPAA-compliant SaaS platforms for health systems, digital health startups, and medical device companies across the US and Europe.








