From 0ce972a361dafeb3a20a3809cc09c3015a6a33c3 Mon Sep 17 00:00:00 2001
From: Beda Schmid <beda@tukutoi.com>
Date: Fri, 15 May 2026 13:59:30 -0300
Subject: [PATCH] Add developer documentation

---
 dmarc-watcher/doc/README.md      | 16 ++++++
 dmarc-watcher/doc/agent-notes.md | 26 +++++++++
 dmarc-watcher/doc/code-map.md    | 84 ++++++++++++++++++++++++++++
 dmarc-watcher/doc/repository.md  | 61 ++++++++++++++++++++
 dmarc-watcher/doc/runtime.md     | 96 ++++++++++++++++++++++++++++++++
 5 files changed, 283 insertions(+)
 create mode 100644 dmarc-watcher/doc/README.md
 create mode 100644 dmarc-watcher/doc/agent-notes.md
 create mode 100644 dmarc-watcher/doc/code-map.md
 create mode 100644 dmarc-watcher/doc/repository.md
 create mode 100644 dmarc-watcher/doc/runtime.md

diff --git a/dmarc-watcher/doc/README.md b/dmarc-watcher/doc/README.md
new file mode 100644
index 0000000..cef4cbb
--- /dev/null
+++ b/dmarc-watcher/doc/README.md
@@ -0,0 +1,16 @@
+# DMARC Watch Developer Docs
+
+This folder documents the implemented repository for developers and LLM agents. It is a code-navigation aid, not end-user product documentation.
+
+## Index
+
+- [Repository structure](repository.md): top-level files, application package layout, tests, design references, and generated runtime paths.
+- [Runtime and development workflow](runtime.md): configuration loading, local test commands, Docker entry points, scheduler jobs, and CLI/API processing paths.
+- [Code map](code-map.md): major modules, data flow, persistence models, alerting, LLM usage, and dashboard/API surfaces.
+- [Agent notes](agent-notes.md): scoped guidance for future code changes and documentation work in this repository.
+
+## Scope Notes
+
+- The repository is named DMARC Watch in the orchestration context, while the implemented FastAPI application currently identifies itself as `DMARC Sentinel`.
+- Settings are loaded from host-controlled files and environment variables. The implemented settings page is read-only.
+- These docs describe current files only. They do not define migration, compatibility, or product roadmap decisions.
diff --git a/dmarc-watcher/doc/agent-notes.md b/dmarc-watcher/doc/agent-notes.md
new file mode 100644
index 0000000..da92dc5
--- /dev/null
+++ b/dmarc-watcher/doc/agent-notes.md
@@ -0,0 +1,26 @@
+# Agent Notes
+
+Use this file as a quick orientation guide before editing the repository.
+
+## Ground Rules
+
+- Keep documentation and code claims aligned with implemented files.
+- Do not describe settings as editable through the web application. Runtime settings come from `config/config.yml` or `config/config.example.yml`, plus environment variables.
+- Treat `data/` and `logs/` as runtime output locations. The repository ignores SQLite databases and log files.
+- Do not add compatibility, migration, or legacy-support behavior unless a task explicitly asks for an implemented change.
+- UI work should follow the mockups in `design/`. When documenting UI icons, respect the project policy that all icons must be Material icons.
+
+## Fast Navigation
+
+- Start at `app/main.py` for HTTP routes, template rendering, API auth dependencies, and admin processing endpoints.
+- Read `app/config.py` before changing runtime behavior because settings are Pydantic models loaded at import time by `get_settings()`.
+- Follow ingestion through `app/message_processor.py`, then `app/attachment_extractor.py`, `app/dmarc_parser.py`, `app/known_senders.py`, and `app/analyzer.py`.
+- Use `app/models.py` for database table fields and relationships. Tables are created by `app.db.init_db()` through SQLAlchemy metadata.
+- Use `tests/` for examples of expected parser, extractor, analyzer, storage, homepage, and LLM fallback behavior.
+
+## Documentation Change Checklist
+
+1. Inspect the code path being documented.
+2. Keep filenames in `doc/` lowercase, except `README.md`.
+3. Prefer links to existing files and commands that are present in this repository.
+4. Validate with lightweight checks appropriate to the change. For docs-only changes, at minimum verify the Markdown filenames, and run the test suite if the environment has the dependencies installed.
diff --git a/dmarc-watcher/doc/code-map.md b/dmarc-watcher/doc/code-map.md
new file mode 100644
index 0000000..bac13cc
--- /dev/null
+++ b/dmarc-watcher/doc/code-map.md
@@ -0,0 +1,84 @@
+# Code Map
+
+This map describes the implemented Python modules and their main responsibilities.
+
+## Application Entry Points
+
+- `app/main.py` creates the FastAPI app, loads settings, configures logging, initializes the database, mounts static assets, starts the scheduler on startup, and defines HTML/API routes.
+- `app/cli.py` exposes `python -m app.cli backlog` for backlog processing using the same `process_inbox()` pipeline as the admin API.
+- `Dockerfile` runs `uvicorn app.main:app --host 0.0.0.0 --port 8000`.
+
+## Configuration and Auth
+
+- `app/config.py` defines Pydantic models for `app`, `security`, `llm`, `inboxes`, `known_senders`, and `alerts`.
+- `load_settings()` reads `DMARC_SENTINEL_CONFIG` when set, otherwise `config/config.yml` if it exists, otherwise `config/config.example.yml`.
+- `validate_llm_environment()` requires the configured OpenAI API key when `llm.provider` is `openai`, except when `DMARC_SENTINEL_ALLOW_NO_LLM_FOR_TESTS=true`.
+- `app/auth.py` protects dashboard/admin routes with Basic Auth when enabled and protects homepage API routes with bearer token auth when required.
+- `app/templates/settings.html` renders runtime settings and environment-variable presence as read-only information.
+
+## Persistence
+
+- `app/db.py` creates a SQLAlchemy engine from `settings.app.database_url`, provides session helpers, and creates tables with `Base.metadata.create_all()`.
+- `app/models.py` defines tables for inbox status, mail messages, parsed reports, records, auth results, alerts, daily stats, and stored LLM reports.
+- Duplicate report detection is based on the unique `Report.raw_xml_sha256` field.
+
+## Ingestion Pipeline
+
+The main ingestion path is implemented in `app/message_processor.py`:
+
+1. `process_inbox()` opens an IMAP connection using `app/imap_client.py`.
+2. It searches UIDs in either new-message mode or backlog mode.
+3. It stores or reuses a `MailMessage` row for each fetched message.
+4. `is_candidate_message()` checks recipients, subject text, and attachment hints.
+5. `app/attachment_extractor.py` extracts `.xml`, `.gz`, and `.zip` report payloads, rejects unsafe ZIP paths, skips nested archives inside ZIPs, and enforces the decompressed size limit.
+6. `app/dmarc_parser.py` parses DMARC aggregate XML with `defusedxml`.
+7. `app/known_senders.py` classifies each record using configured CIDR allowlists, DKIM domains, SPF domains, or aligned DKIM evidence.
+8. `app/analyzer.py` creates or updates deterministic alerts.
+9. Email notifications are sent through `app/alerts.py` when configured and when a new alert is created or severity increases.
+
+## Deterministic Analysis
+
+`app/analyzer.py` produces alerts from stored report records. Implemented alert paths include:
+
+- unknown source failed both SPF and DKIM alignment;
+- known sender DMARC failure;
+- quarantine or reject disposition;
+- first observed failing source;
+- first observed passing but unclassified source;
+- high SPF or DKIM alignment failure rate;
+- sudden unknown failure spike;
+- new reporter;
+- first observed policy for a domain;
+- missing expected reporter based on recent report history.
+
+Open alerts are deduplicated by a fingerprint composed from domain, alert type, and key. Existing open alerts are updated rather than duplicated.
+
+## LLM Usage
+
+- `app/llm.py` wraps the OpenAI client for JSON-only alert explanations and daily/weekly summaries.
+- Alert existence and severity are deterministic in `app/analyzer.py`; LLM output is stored as explanation fields on alerts or in `LLMReport` rows.
+- Fallback outputs are returned when LLM calls fail or validation fails.
+- Settings include config flags for raw XML and raw email, but the implemented LLM payloads are built from derived alert or summary facts.
+
+## Scheduler
+
+`app/scheduler.py` starts an APScheduler `BackgroundScheduler` with these jobs:
+
+- `poll`: interval job using `settings.app.poll_interval_minutes`;
+- `daily`: cron job at 07:00 in `settings.app.timezone`;
+- `weekly`: cron job on Monday at 07:30 in `settings.app.timezone`.
+
+Daily and weekly summary jobs aggregate stored data, call `LLMClient`, store `LLMReport` rows, and attempt digest email delivery through `app/alerts.py`.
+
+## HTTP and Template Surfaces
+
+HTML routes in `app/main.py` render templates from `app/templates/`:
+
+- `/`: overview dashboard;
+- `/domains/{domain}`: domain detail;
+- `/reports/{report_id}`: report detail;
+- `/alerts`: alert list and actions;
+- `/inboxes`: inbox status and manual processing controls;
+- `/settings`: read-only runtime configuration.
+
+JSON routes include `/health`, homepage widget endpoints, domain/report/alert APIs, and admin processing endpoints. Authentication behavior is defined by route dependencies in `app/main.py`.
diff --git a/dmarc-watcher/doc/repository.md b/dmarc-watcher/doc/repository.md
new file mode 100644
index 0000000..2cf7617
--- /dev/null
+++ b/dmarc-watcher/doc/repository.md
@@ -0,0 +1,61 @@
+# Repository Structure
+
+The repository contains a FastAPI application for monitoring DMARC aggregate reports. The orchestration context names the repository DMARC Watch; the implemented application title and many runtime names currently use `DMARC Sentinel`.
+
+## Top-Level Files
+
+- `README.md`: current user-facing setup and operational notes.
+- `requirements.txt`: Python runtime and test dependencies.
+- `pytest.ini`: adds the repository root to `pythonpath`.
+- `Dockerfile`: Python 3.12 image that installs requirements, copies `app/` and `config/config.example.yml`, creates runtime directories, and starts Uvicorn.
+- `docker-compose.yml`: builds the app service, mounts `config/` read-only, mounts `data/` and `logs/`, publishes port `8000`, and attaches to the external `npm_proxy` Docker network.
+- `config/config.example.yml`: host-controlled runtime configuration template.
+- `.env.example`: environment variable template for IMAP, dashboard auth, homepage token, OpenAI, and SMTP settings.
+- `dmarc-sentinel-codex-implementation-spec.md`: implementation specification document present in the repository.
+
+## Application Package
+
+`app/` is the main Python package.
+
+- `main.py`: FastAPI app, startup, HTML routes, JSON routes, and admin endpoints.
+- `config.py`: settings models and configuration loading.
+- `db.py`: SQLAlchemy engine, session helpers, table initialization, and health check.
+- `models.py`: SQLAlchemy ORM models.
+- `message_processor.py`: IMAP message scanning, report import, analysis, message status updates, and processing summaries.
+- `imap_client.py`: IMAP connection, UID search/fetch, mark-seen, and move operations.
+- `attachment_extractor.py`: XML, gzip, and ZIP attachment extraction.
+- `dmarc_parser.py`: DMARC aggregate XML parser.
+- `known_senders.py`: deterministic sender classification.
+- `analyzer.py`: deterministic alert creation and update logic.
+- `alerts.py`: SMTP alert and digest delivery.
+- `llm.py`: OpenAI JSON calls and fallback summary/explanation outputs.
+- `scheduler.py`: polling, daily summary, and weekly summary jobs.
+- `homepage.py`: homepage widget and domain summary payload builders.
+- `schemas.py`: Pydantic request models for admin processing endpoints.
+- `auth.py`: dashboard Basic Auth and homepage bearer token dependencies.
+- `smoke_data.py`: helper for seeding smoke data.
+- `templates/`: Jinja2 templates for the server-rendered dashboard.
+- `static/app.css`: dashboard styles.
+
+## Tests and Fixtures
+
+`tests/` contains pytest coverage for current behavior:
+
+- parser behavior: `tests/test_parser.py`;
+- attachment extraction and message attachment detection: `tests/test_attachment_extractor.py`;
+- duplicate storage behavior: `tests/test_storage.py`;
+- alert analysis behavior: `tests/test_analyzer.py`;
+- homepage summary behavior: `tests/test_homepage.py`;
+- LLM fallback validation behavior: `tests/test_llm.py`;
+- test environment defaults: `tests/conftest.py`;
+- sample config and XML fixtures: `tests/fixtures/`.
+
+## Design References
+
+`design/` contains mockups and design reference files. These are reference artifacts for UI work, not runtime code.
+
+## Runtime Output Paths
+
+- `data/` is used for SQLite database files.
+- `logs/` is used for `logs/dmarc-sentinel.log`.
+- `.gitignore` excludes SQLite databases, log files, bytecode, virtual environments, and pytest cache output.
diff --git a/dmarc-watcher/doc/runtime.md b/dmarc-watcher/doc/runtime.md
new file mode 100644
index 0000000..d0d353a
--- /dev/null
+++ b/dmarc-watcher/doc/runtime.md
@@ -0,0 +1,96 @@
+# Runtime and Development Workflow
+
+This file summarizes how the implemented repository runs and how developers can validate changes.
+
+## Configuration Loading
+
+`app/config.py` loads settings in this order:
+
+1. `DMARC_SENTINEL_CONFIG`, when set;
+2. `config/config.yml`, when present;
+3. `config/config.example.yml`.
+
+Secrets are read from environment variables named by the loaded settings. The settings page renders loaded values and environment-variable presence as read-only status; the application does not implement settings writes through the dashboard.
+
+When `llm.provider` is `openai`, `validate_llm_environment()` requires the configured API key environment variable unless `DMARC_SENTINEL_ALLOW_NO_LLM_FOR_TESTS=true`.
+
+## Local Test Workflow
+
+Install dependencies in a Python environment, then run:
+
+```bash
+DMARC_SENTINEL_ALLOW_NO_LLM_FOR_TESTS=true python3 -m pytest
+```
+
+`tests/conftest.py` also sets test defaults for the LLM bypass, auth credentials, and homepage token.
+
+The repository has no Ruff configuration, and `requirements.txt` does not list Ruff; if linting is added, document the exact command together with the added tooling.
+
+## Docker Runtime
+
+The Docker image starts the FastAPI app with:
+
+```bash
+uvicorn app.main:app --host 0.0.0.0 --port 8000
+```
+
+The compose service:
+
+- loads environment variables from `.env`;
+- mounts `./config` at `/app/config` as read-only;
+- mounts `./data` at `/app/data`;
+- mounts `./logs` at `/app/logs`;
+- publishes `8000:8000`;
+- attaches to the external `npm_proxy` network with static address `192.168.99.18`.
+
+`app.main` calls `init_db()` at import time, so database tables are created as soon as the module is imported.
+
+## CLI Backlog Processing
+
+Backlog processing is implemented by `app/cli.py`:
+
+```bash
+python -m app.cli backlog --inbox <inbox-id>
+```
+
+Available arguments are:
+
+- `--folder`;
+- `--since YYYY-MM-DD`;
+- `--before YYYY-MM-DD`;
+- `--limit`;
+- `--dry-run`;
+- `--reprocess`;
+- `--mark-seen`.
+
+The CLI calls `process_inbox()` in backlog mode and prints the resulting `ProcessingSummary` counters.
+
+## HTTP Processing Paths
+
+`app/main.py` exposes admin endpoints that call the same processing pipeline:
+
+- `POST /api/admin/process-now`: processes the configured inbox using request fields from `ProcessNowRequest`.
+- `POST /api/admin/backlog`: runs backlog processing using request fields from `BacklogRequest`.
+
+Both endpoints use dashboard Basic Auth dependencies.
+
+## Scheduled Work
+
+On FastAPI startup, `start_scheduler(settings)` registers:
+
+- polling for all enabled inboxes every `settings.app.poll_interval_minutes`;
+- daily summaries at 07:00 in `settings.app.timezone`;
+- weekly summaries on Monday at 07:30 in `settings.app.timezone`.
+
+Polling uses `process_inbox()` with `mode="new"`. Summary jobs aggregate stored records, call the LLM wrapper, store `LLMReport` rows, and send digest email when SMTP settings allow it.
+
+## Lightweight Validation Commands
+
+For documentation-only changes, useful checks are:
+
+```bash
+find doc -maxdepth 1 -type f -print
+python3 -m pytest
+```
+
+The filename rule for this docs folder is: every Markdown file under `doc/` must have a lowercase filename, except `README.md`.