Architecture Decisions
Key technical choices and the reasoning behind them.
ADR-1: Kotlin + Jetpack Compose
Chosen over: React Native, Flutter, Kotlin + XML
Why Compose
- 80% of the app is native Android services (AccessibilityService, PTY, foreground services, biometrics). Cross-platform frameworks would need Kotlin native modules for all of that, plus a bridge layer.
- Compose is declarative like React — same mental model (state to UI), different syntax.
- Material 3 / Material You theming is first-class.
- OkHttp WebSocket supports wss:// natively. No bridge layer to debug.
- Single language (Kotlin) for the entire app.
Why not React Native
Native modules needed for AccessibilityService, foreground service, biometric auth, EncryptedSharedPreferences, and MediaProjection. That's most of the app in Kotlin anyway, plus JS bridge overhead. Only makes sense if the UI were 80%+ of the codebase — here it's roughly 20%.
Why not Flutter
Same native bridge problem, but with Dart instead of familiar React/JS patterns. Smaller Android-specific ecosystem for security and biometric libraries.
ADR-2: Single WSS Connection with Channel Multiplexing
Decision: One WebSocket connection carries all relay channels (terminal, bridge) via typed message envelopes.
Rationale
Simpler connection management, single auth flow, single reconnect handler. Mobile networks are flaky — one connection is easier to keep alive than three.
Trade-off
If one channel floods (e.g., terminal output), it could delay others. Mitigated by terminal output batching (16ms frames) and priority queuing.
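The envelope and priority-queue idea can be sketched in a few lines. This is a minimal Python model, not the relay's actual wire format: the field names (channel, type, payload), the priority values, and the OutboundQueue class are all assumptions made for illustration.

```python
import heapq
import itertools
import json
from dataclasses import dataclass, field
from typing import Any

# Hypothetical priorities: a lower number is sent first, so bridge
# commands are not stuck behind bulk terminal output.
CHANNEL_PRIORITY = {"bridge": 0, "terminal": 1}


@dataclass
class Envelope:
    channel: str                      # logical channel this frame belongs to
    type: str                         # message type within the channel
    payload: dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(
            {"channel": self.channel, "type": self.type, "payload": self.payload}
        )

    @staticmethod
    def from_json(raw: str) -> "Envelope":
        data = json.loads(raw)
        return Envelope(data["channel"], data["type"], data.get("payload", {}))


class OutboundQueue:
    """Orders envelopes by channel priority, FIFO within a priority level."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, Envelope]] = []
        self._seq = itertools.count()  # tie-breaker that preserves insertion order

    def put(self, env: Envelope) -> None:
        prio = CHANNEL_PRIORITY.get(env.channel, 99)  # unknown channels go last
        heapq.heappush(self._heap, (prio, next(self._seq), env))

    def get(self) -> Envelope:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

With this shape, a bridge command queued behind a burst of terminal frames is still dequeued first, while terminal frames keep their relative order.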
ADR-3: Unified Relay Server as Separate Service
Decision: One Python relay service on port 8767 hosts the terminal, bridge, voice, and notifications channels. It runs alongside the Hermes gateway, not inside it.
Rationale
- Separate service means independent deployment and restart cycles from the gateway.
- Single WSS port keeps the phone's connection model simple: one persistent socket, channel-multiplexed envelopes.
- Future option to merge into the gateway as a platform adapter if the footprint stabilizes.
History
v0.2 ran the bridge as a standalone service on port 8766 (plugin/tools/android_relay.py). v0.3 consolidated that onto the unified relay port 8767 as part of the Phase 3 bridge rollout. The wire protocol was kept byte-for-byte identical so the android_* plugin tools only needed a BRIDGE_URL change to cut over.
ADR-4: Chat via Direct API, Not Relay Proxy
Decision: Chat connects directly from the Android app to the Hermes API Server via HTTP/SSE. The relay server is only used for bridge and terminal channels.
Original Approach
Chat was originally proxied through the relay server, which converted SSE responses to WebSocket envelopes.
Why Direct API Won
- The relay was an unnecessary middleman — it just converted SSE to WebSocket.
- Every other Hermes frontend (Open WebUI, ClawPort, LobeChat) connects directly.
- The Sessions API (/api/sessions/{id}/chat/stream) provides SSE streaming with rich event types.
- Simpler, lower latency, removes relay as single point of failure for chat.
Result
Phone (HTTP/SSE) → Hermes API Server (:8642) [chat — direct]
Phone (WSS) → Relay Server (:8767) [terminal, bridge, voice, notifications]
Auth uses optional Bearer token (API_SERVER_KEY). Most local setups run without one.
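To make the direct path concrete, here is a rough Python sketch of the client-side pieces: building the stream URL, attaching the optional Bearer token, and parsing SSE frames. The helper names and the event names in the usage note are hypothetical, and the sketch does no network I/O, since the request method and body for the endpoint are not described here.

```python
from typing import Iterable, Iterator, Optional


def build_headers(api_key: Optional[str] = None) -> dict[str, str]:
    """Headers for the chat stream; the Bearer token is optional."""
    headers = {"Accept": "text/event-stream"}
    if api_key:  # most local setups run without a key
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


def stream_url(base: str, session_id: str) -> str:
    """Sessions API streaming path from the text, joined to a base URL."""
    return f"{base.rstrip('/')}/api/sessions/{session_id}/chat/stream"


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event, data) pairs from Server-Sent Events text lines.

    A blank line ends an event; multiple data: lines are joined with "\n".
    """
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips a single leading space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)
```

For example, feeding it the lines `event: delta`, `data: hi`, and a blank line yields one `("delta", "hi")` pair; `delta` is a made-up event name, not one taken from the Sessions API.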
ADR-5: xterm.js in WebView for Terminal
Decision: Use xterm.js in a local WebView, not a native Compose canvas renderer.
Rationale
- xterm.js is battle-tested — handles all ANSI escape sequences, Unicode, colors, scrollback.
- A native Compose terminal renderer would take weeks to build and would still render worse.
- The WebView is a single composable in an otherwise fully native app.
ADR-6: tmux for Terminal Sessions
Decision: Terminal channel attaches to tmux sessions, not raw PTY.
Rationale
- Persistence — disconnect and reconnect without losing state.
- Named sessions for multiple contexts.
- Shared sessions — agent and user can see the same terminal.
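One way to get attach-or-create behavior is tmux's `new-session -A`, which behaves like `attach-session` when the named session already exists. The sketch below only builds the argv; how the relay actually spawns tmux is not specified here, and the function name and the session-name whitelist are assumptions.

```python
import re

# Conservative whitelist for session names (an assumption of this sketch).
_SAFE_NAME = re.compile(r"^[A-Za-z0-9_-]+$")


def tmux_attach_argv(session: str) -> list[str]:
    """argv that attaches to `session`, creating it if it does not exist.

    Reconnecting with the same name lands in the same session, which is
    what preserves state across disconnects.
    """
    if not _SAFE_NAME.match(session):
        raise ValueError(f"unsafe tmux session name: {session!r}")
    return ["tmux", "new-session", "-A", "-s", session]
```

Passing the result to a PTY spawner as a list, rather than as a shell string, avoids any quoting issues with the session name.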
ADR-7: Pairing Code Auth for Relay (QR-driven, updated 2026-04-11)
Decision: Initial pairing uses a 6-character code generated on the Hermes host by the pair command (the /hermes-relay-pair skill or the hermes-pair shell shim). The code is pre-registered with the relay through a loopback-only /pairing/register endpoint and embedded in the same QR payload that carries the API server credentials, so one scan configures both chat and the relay. Session tokens handle all subsequent reconnects.
Rationale
- Pairing codes are user-friendly — no pre-shared secrets.
- Driving the code flow from the host (via the pair command) means the operator always has the source of truth; previously the phone generated its own code and the relay had no way to validate it.
- POST /pairing/register is gated to loopback callers only (127.0.0.1 / ::1); the trust anchor is the operator with host shell access. A LAN attacker cannot inject codes.
- Session tokens avoid re-pairing on every restart.
- Tokens stored in EncryptedSharedPreferences (AES-256-GCM, Android Keystore-backed).
- Codes use the full A-Z / 0-9 alphabet (36 chars). The earlier "no ambiguous 0/O/1/I" restriction only mattered when a human had to retype a code from a display; with QR + HTTP the restriction silently rejected valid codes.
- Old API-only QRs (no relay block) still parse cleanly: the relay field is nullable and the Android parser runs with ignoreUnknownKeys = true.
- A future symmetric phone-generates, host-approves flow for the bridge channel will reuse /pairing/register from the opposite direction; phone-side AuthManager.generatePairingCode() is retained for that reason.
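The host-side pieces of this flow (code generation over the full alphabet, validation, and the loopback gate) can be sketched as below. This is an illustrative Python model under stated assumptions: the function names are hypothetical, and the real pair command and relay handler may be structured differently.

```python
import ipaddress
import secrets
import string

ALPHABET = string.ascii_uppercase + string.digits  # full A-Z / 0-9, 36 chars
CODE_LENGTH = 6


def generate_pairing_code() -> str:
    """Random 6-char code from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(CODE_LENGTH))


def is_valid_code(code: str) -> bool:
    """Accept any 6-char code over the full alphabet, including 0/O/1/I."""
    return len(code) == CODE_LENGTH and all(c in ALPHABET for c in code)


def is_loopback(remote_addr: str) -> bool:
    """True only for loopback callers such as 127.0.0.1 or ::1."""
    try:
        return ipaddress.ip_address(remote_addr).is_loopback
    except ValueError:
        return False  # not an IP literal: treat as untrusted
```

A register handler would call `is_loopback` on the peer address before accepting a code, so a request arriving from a LAN address is rejected regardless of its payload.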
ADR-8: Biometric Gate for Terminal Only
Decision: Biometric/PIN required before terminal access. Chat and bridge don't require it.
Rationale
- Terminal = shell access to your server — highest privilege.
- Chat is conversational — no more dangerous than a chat app.
- Bridge enforces its own five-stage safety system (see ADR-9); adding a biometric on top doesn't add security because the bridge is initiated by the agent, not the user.
ADR-9: Bridge Five-Stage Safety Gate
Decision: Every bridge command (v0.3+) must pass five independent gates before a gesture dispatches: session grant → in-app master toggle → HermesAccessibilityService permission → MediaProjection consent → Tier 5 safety rails (blocklist → destructive-verb confirmation → auto-disable reschedule).
Rationale
- Agent-controlled device access is structurally different from user-controlled device access. A single "allow" toggle is not enough — a compromised or confused agent can issue commands just as easily as a trusted one.
- Each gate is a different trust decision: the session grant says "this pairing may use the bridge channel at all"; the master toggle says "right now, the bridge may act"; the a11y/MediaProjection grants are OS-level and survive reboots; the Tier 5 rails are per-command and content-aware.
- Failing any gate fails the command at the phone side, not the relay — so a network-side attacker with the session token still cannot bypass safety rails.
- Blocklisted packages (banking / password managers / 2FA / email / work apps) get a hard 403, matched against the target of /open_app as well as the currently foregrounded app.
- Destructive-verb words (send / pay / delete / transfer / etc.) trigger a full-screen WindowManager overlay confirmation, rendered outside the Hermes activity so it's visible even when the agent sends commands while another app is in the foreground.
- An idle auto-disable timer (5–120 min, resets on every command) keeps a stale grant from surviving a crash or a forgotten session.
Trade-off
The confirmation overlay adds latency and a manual tap to every destructive command. This is the entire point — the user must deliberately authorize state-changing actions even when the agent is otherwise trusted.
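The gate ordering can be modeled as a short decision function. The sketch below is a Python model of logic that runs on the phone, so it is illustrative only: the package names, the verb set, the BridgeState fields, and the three outcome labels are assumptions, and the idle auto-disable timer is left out.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical values for illustration only.
BLOCKLIST = {"com.example.bank", "com.example.passwords"}
DESTRUCTIVE_VERBS = {"send", "pay", "delete", "transfer"}


@dataclass
class BridgeState:
    session_grant: bool             # this pairing may use the bridge at all
    master_toggle: bool             # the bridge may act right now
    accessibility_enabled: bool     # OS-level accessibility grant
    media_projection_granted: bool  # OS-level screen-capture consent
    foreground_package: str         # app currently in the foreground


def check_command(
    state: BridgeState, target_package: Optional[str], text: str
) -> tuple[str, str]:
    """Return (outcome, reason); gates run in order and the first failure wins."""
    gates = [
        (state.session_grant, "no session grant"),
        (state.master_toggle, "master toggle off"),
        (state.accessibility_enabled, "accessibility service not enabled"),
        (state.media_projection_granted, "no MediaProjection consent"),
    ]
    for passed, reason in gates:
        if not passed:
            return "deny", reason
    # Safety rails: the blocklist covers both the target and the foreground app.
    if target_package in BLOCKLIST or state.foreground_package in BLOCKLIST:
        return "deny", "blocklisted package"  # surfaced as a hard 403
    words = set(re.findall(r"[a-z]+", text.lower()))
    if words & DESTRUCTIVE_VERBS:
        return "confirm", "destructive verb needs user confirmation"
    return "allow", "ok"
```

Because every check runs in one function on the device, a caller holding only the session token has no path that skips the later gates.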
ADR-10: Bridge Wake-Scope for Reliable Gesture Dispatch
Decision: Wrap gesture dispatch in a short-lived PowerManager.PARTIAL_WAKE_LOCK via WakeLockManager so commands issued while the screen is dim or idle still land.
Rationale
Android aggressively throttles GestureDescription dispatch on a dim or doze-mode screen. Agent commands that fire during a long session were unreliable — a /tap might succeed at 5-second intervals and silently drop at 30-second intervals. Holding a partial wake lock around the gesture dispatch (released immediately after the callback fires) keeps the CPU awake just long enough for the dispatcher to run, without affecting screen state or battery life meaningfully.
Trade-off
Wake-lock abuse is a real Android antipattern — stale locks drain batteries. The implementation uses scoped try/finally semantics so the lock is always released, even on gesture failure or crash. No long-held locks.
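The scoped acquire/release pattern is language-independent, so it can be shown with a small Python model. FakeWakeLock is a stand-in that only tracks state, and the timeout argument is an extra assumption of this sketch; on Android the same shape would wrap a real wake lock in try/finally.

```python
from contextlib import contextmanager


class FakeWakeLock:
    """Stand-in for a platform wake lock; it only records held state."""

    def __init__(self) -> None:
        self.held = False
        self.acquisitions = 0

    def acquire(self, timeout_ms: int) -> None:
        # timeout_ms is accepted but ignored here; a real lock could use it
        # as a backstop in case release is somehow never reached.
        self.held = True
        self.acquisitions += 1

    def release(self) -> None:
        self.held = False


@contextmanager
def wake_scope(lock: FakeWakeLock, timeout_ms: int = 5_000):
    """Hold the lock only for the duration of the with-block."""
    lock.acquire(timeout_ms)
    try:
        yield
    finally:
        # Runs on success, on gesture failure, and on exceptions alike.
        if lock.held:
            lock.release()
```

Dispatching a gesture inside `with wake_scope(lock): ...` guarantees the release even when the dispatch raises.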
ADR-11: Accessibility Event Stream Instead of Polling
Decision: The bridge exposes /events (poll) and /events/stream (toggle) over an in-memory EventStore that buffers recent AccessibilityEvent objects, rather than making the agent poll /screen repeatedly to detect change.
Rationale
- /screen is expensive: it walks the full accessibility tree and serializes every node.
- Waiting for "has the screen changed?" is a very common agent primitive (wait until this loads, notice when the dialog opens, monitor for a toast).
- AccessibilityEvent is exactly the right level: the OS already dispatches it, the phone just needs to buffer recent events in a bounded store and hand them out on request.
- Combining /events with /screen_hash + /diff_screen gives the agent a cheap "did anything happen, and if so what?" loop without ever re-downloading the full tree.
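A bounded in-memory store with cursor-based polling might look like the following. This is a Python model of the idea, not the phone's EventStore: the capacity, the Event fields, and the sequence-number cursor are assumptions introduced for illustration.

```python
from collections import deque
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Event:
    seq: int          # monotonically increasing cursor value
    event_type: str   # e.g. an AccessibilityEvent type name
    package: str      # package that produced the event


class EventStore:
    """Keeps only the most recent `capacity` events; older ones are dropped."""

    def __init__(self, capacity: int = 256) -> None:
        self._events: deque[Event] = deque(maxlen=capacity)
        self._seq = count(1)

    def record(self, event_type: str, package: str) -> None:
        self._events.append(Event(next(self._seq), event_type, package))

    def since(self, cursor: int = 0) -> list[Event]:
        """Events newer than `cursor`; poll again with the last seq seen."""
        return [e for e in self._events if e.seq > cursor]
```

An agent loop would call `since(last_seq)` and fetch a screen diff only when the returned list is non-empty.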
ADR-12: Android 14+ MediaProjection Foreground Service Type
Decision: BridgeForegroundService declares foregroundServiceType="specialUse|mediaProjection" and the app ORs both type constants on startForeground().
Rationale
Android 14 introduced a requirement that any FGS using MediaProjection must declare mediaProjection in its foreground service type slot. A specialUse-only declaration silently revokes the MediaProjection grant within frames of it being issued — the consent dialog appears, the user allows, the dialog closes, and the grant evaporates before any screen can be captured.
Trade-off
specialUse remains on the type slot because the bridge FGS is used for more than just screen capture (gesture dispatch continues without the projection). Declaring both types and ORing the constants is what lets the same FGS host both surfaces.
ADR-13: Google Play vs Sideload Build Flavors
Decision: Hermes-Relay ships two distinct APKs from the same source tree — googlePlay (conservative, Play-policy-compliant) and sideload (full-feature, GitHub Releases only). They install with different application IDs so both can coexist on the same device.
Rationale
- Google Play's Accessibility Service policy review is strict and slow. Some Phase 3 features — vision-driven navigation, voice-to-bridge intents, direct SMS / contact / call / location tools — are not compatible with the conservative use-case that Play will approve.
- Rather than water down the whole app, we compile out the sensitive tiers in the googlePlay flavor and keep them in sideload. Users who want "safe + autopatch" pick Play; users who want "full agent control" pick sideload.
- BuildFlavor.current + compile-time constants let R8 fold the disabled tiers out of the Play build entirely; this is not a runtime flag.
- The sideload flavor carries .sideload as an applicationId suffix and Hermes Dev as the launcher label so side-by-side installs are disambiguated visually.
Deferrals
| Feature | Reason | When |
|---|---|---|
| iOS support | Android-first, platform-specific APIs | v2+ |
| Multi-device | Single-device simplifies auth and state | Future |
| File transfer | Terminal tools work as a workaround | Future |
| Gateway adapter | WebAPI proxy works well; an adapter is overengineering for now | If WebAPI becomes limiting |