From 0ce972a361dafeb3a20a3809cc09c3015a6a33c3 Mon Sep 17 00:00:00 2001
From: Beda Schmid
Date: Fri, 15 May 2026 13:59:30 -0300
Subject: [PATCH] Add developer documentation

---
 dmarc-watcher/doc/README.md      | 16 ++++++
 dmarc-watcher/doc/agent-notes.md | 26 +++++++++
 dmarc-watcher/doc/code-map.md    | 84 ++++++++++++++++++++++++++++
 dmarc-watcher/doc/repository.md  | 61 ++++++++++++++++++++
 dmarc-watcher/doc/runtime.md     | 96 ++++++++++++++++++++++++++++++++
 5 files changed, 283 insertions(+)
 create mode 100644 dmarc-watcher/doc/README.md
 create mode 100644 dmarc-watcher/doc/agent-notes.md
 create mode 100644 dmarc-watcher/doc/code-map.md
 create mode 100644 dmarc-watcher/doc/repository.md
 create mode 100644 dmarc-watcher/doc/runtime.md

diff --git a/dmarc-watcher/doc/README.md b/dmarc-watcher/doc/README.md
new file mode 100644
index 0000000..cef4cbb
--- /dev/null
+++ b/dmarc-watcher/doc/README.md
@@ -0,0 +1,16 @@
+# DMARC Watch Developer Docs
+
+This folder documents the implemented repository for developers and LLM agents. It is a code-navigation aid, not end-user product documentation.
+
+## Index
+
+- [Repository structure](repository.md): top-level files, application package layout, tests, design references, and generated runtime paths.
+- [Runtime and development workflow](runtime.md): configuration loading, local test commands, Docker entry points, scheduler jobs, and CLI/API processing paths.
+- [Code map](code-map.md): major modules, data flow, persistence models, alerting, LLM usage, and dashboard/API surfaces.
+- [Agent notes](agent-notes.md): scoped guidance for future code changes and documentation work in this repository.
+
+## Scope Notes
+
+- The repository is named DMARC Watch in the orchestration context, while the implemented FastAPI application currently identifies itself as `DMARC Sentinel`.
+- Settings are loaded from host-controlled files and environment variables. The implemented settings page is read-only.
+- These docs describe current files only. They do not define migration, compatibility, or product roadmap decisions.
diff --git a/dmarc-watcher/doc/agent-notes.md b/dmarc-watcher/doc/agent-notes.md
new file mode 100644
index 0000000..da92dc5
--- /dev/null
+++ b/dmarc-watcher/doc/agent-notes.md
@@ -0,0 +1,26 @@
+# Agent Notes
+
+Use this file as a quick orientation guide before editing the repository.
+
+## Ground Rules
+
+- Keep documentation and code claims aligned with implemented files.
+- Do not describe settings as editable through the web application. Runtime settings come from `config/config.yml` or `config/config.example.yml`, plus environment variables.
+- Treat `data/` and `logs/` as runtime output locations. The repository ignores SQLite databases and log files.
+- Do not add compatibility, migration, or legacy-support behavior unless a task explicitly asks for an implemented change.
+- UI work should follow the mockups in `design/`. If documenting UI icons, respect the project policy that icons must be material icons.
+
+## Fast Navigation
+
+- Start at `app/main.py` for HTTP routes, template rendering, API auth dependencies, and admin processing endpoints.
+- Read `app/config.py` before changing runtime behavior because settings are Pydantic models loaded at import time by `get_settings()`.
+- Follow ingestion through `app/message_processor.py`, then `app/attachment_extractor.py`, `app/dmarc_parser.py`, `app/known_senders.py`, and `app/analyzer.py`.
+- Use `app/models.py` for database table fields and relationships. Tables are created by `app.db.init_db()` through SQLAlchemy metadata.
+- Use `tests/` for examples of expected parser, extractor, analyzer, storage, homepage, and LLM fallback behavior.
+
+## Documentation Change Checklist
+
+1. Inspect the code path being documented.
+2. Keep filenames in `doc/` lowercase, except `README.md`.
+3. Prefer links to existing files and commands that are present in this repository.
+4. Validate with lightweight checks appropriate to the change. For docs-only changes, at minimum verify Markdown filenames and run the test suite when the environment has dependencies installed.
diff --git a/dmarc-watcher/doc/code-map.md b/dmarc-watcher/doc/code-map.md
new file mode 100644
index 0000000..bac13cc
--- /dev/null
+++ b/dmarc-watcher/doc/code-map.md
@@ -0,0 +1,84 @@
+# Code Map
+
+This map describes the implemented Python modules and their main responsibilities.
+
+## Application Entry Points
+
+- `app/main.py` creates the FastAPI app, loads settings, configures logging, initializes the database, mounts static assets, starts the scheduler on startup, and defines HTML/API routes.
+- `app/cli.py` exposes `python -m app.cli backlog` for backlog processing using the same `process_inbox()` pipeline as the admin API.
+- `Dockerfile` runs `uvicorn app.main:app --host 0.0.0.0 --port 8000`.
+
+## Configuration and Auth
+
+- `app/config.py` defines Pydantic models for `app`, `security`, `llm`, `inboxes`, `known_senders`, and `alerts`.
+- `load_settings()` reads `DMARC_SENTINEL_CONFIG` when set, otherwise `config/config.yml` if it exists, otherwise `config/config.example.yml`.
+- `validate_llm_environment()` requires the configured OpenAI API key when `llm.provider` is `openai`, except when `DMARC_SENTINEL_ALLOW_NO_LLM_FOR_TESTS=true`.
+- `app/auth.py` protects dashboard/admin routes with Basic Auth when enabled and protects homepage API routes with bearer token auth when required.
+- `app/templates/settings.html` renders runtime settings and environment-variable presence as read-only information.
+
+## Persistence
+
+- `app/db.py` creates a SQLAlchemy engine from `settings.app.database_url`, provides session helpers, and creates tables with `Base.metadata.create_all()`.
+- `app/models.py` defines tables for inbox status, mail messages, parsed reports, records, auth results, alerts, daily stats, and stored LLM reports.
+- Duplicate report detection is based on the unique `Report.raw_xml_sha256` field.
+
+## Ingestion Pipeline
+
+The main ingestion path is implemented in `app/message_processor.py`:
+
+1. `process_inbox()` opens an IMAP connection using `app/imap_client.py`.
+2. It searches UIDs in either new-message mode or backlog mode.
+3. It stores or reuses a `MailMessage` row for each fetched message.
+4. `is_candidate_message()` checks recipients, subject text, and attachment hints.
+5. `app/attachment_extractor.py` extracts `.xml`, `.gz`, and `.zip` report payloads, rejects unsafe ZIP paths, skips nested archives inside ZIPs, and enforces the decompressed size limit.
+6. `app/dmarc_parser.py` parses DMARC aggregate XML with `defusedxml`.
+7. `app/known_senders.py` classifies each record using configured CIDR allowlists, DKIM domains, SPF domains, or aligned DKIM evidence.
+8. `app/analyzer.py` creates or updates deterministic alerts.
+9. Email notifications are sent through `app/alerts.py` when configured and when a new alert is created or severity increases.
+
+## Deterministic Analysis
+
+`app/analyzer.py` produces alerts from stored report records. Implemented alert paths include:
+
+- unknown source failed both SPF and DKIM alignment;
+- known sender DMARC failure;
+- quarantine or reject disposition;
+- first observed failing source;
+- first observed passing but unclassified source;
+- high SPF or DKIM alignment failure rate;
+- sudden unknown failure spike;
+- new reporter;
+- first observed policy for a domain;
+- missing expected reporter based on recent report history.
+
+Open alerts are deduplicated by a fingerprint composed from domain, alert type, and key. Existing open alerts are updated rather than duplicated.
+
+## LLM Usage
+
+- `app/llm.py` wraps the OpenAI client for JSON-only alert explanations and daily/weekly summaries.
+- Alert existence and severity are deterministic in `app/analyzer.py`; LLM output is stored as explanation fields on alerts or in `LLMReport` rows.
+- Fallback outputs are returned when LLM calls fail or validation fails.
+- Config flags for raw XML and raw email are present in settings. The implemented LLM payloads use derived alert or summary facts.
+
+## Scheduler
+
+`app/scheduler.py` starts an APScheduler `BackgroundScheduler` with these jobs:
+
+- `poll`: interval job using `settings.app.poll_interval_minutes`;
+- `daily`: cron job at 07:00 in `settings.app.timezone`;
+- `weekly`: cron job on Monday at 07:30 in `settings.app.timezone`.
+
+Daily and weekly summary jobs aggregate stored data, call `LLMClient`, store `LLMReport` rows, and attempt digest email delivery through `app/alerts.py`.
+
+## HTTP and Template Surfaces
+
+HTML routes in `app/main.py` render templates from `app/templates/`:
+
+- `/`: overview dashboard;
+- `/domains/{domain}`: domain detail;
+- `/reports/{report_id}`: report detail;
+- `/alerts`: alert list and actions;
+- `/inboxes`: inbox status and manual processing controls;
+- `/settings`: read-only runtime configuration.
+
+JSON routes include `/health`, homepage widget endpoints, domain/report/alert APIs, and admin processing endpoints. Authentication behavior is defined by route dependencies in `app/main.py`.
diff --git a/dmarc-watcher/doc/repository.md b/dmarc-watcher/doc/repository.md
new file mode 100644
index 0000000..2cf7617
--- /dev/null
+++ b/dmarc-watcher/doc/repository.md
@@ -0,0 +1,61 @@
+# Repository Structure
+
+The repository contains a FastAPI application for monitoring DMARC aggregate reports. The orchestration context names the repository DMARC Watch; the implemented application title and many runtime names currently use `DMARC Sentinel`.
+
+## Top-Level Files
+
+- `README.md`: current user-facing setup and operational notes.
+- `requirements.txt`: Python runtime and test dependencies.
+- `pytest.ini`: adds the repository root to `pythonpath`.
+- `Dockerfile`: Python 3.12 image that installs requirements, copies `app/` and `config/config.example.yml`, creates runtime directories, and starts Uvicorn.
+- `docker-compose.yml`: builds the app service, mounts `config/` read-only, mounts `data/` and `logs/`, publishes port `8000`, and attaches to the external `npm_proxy` Docker network.
+- `config/config.example.yml`: host-controlled runtime configuration template.
+- `.env.example`: environment variable template for IMAP, dashboard auth, homepage token, OpenAI, and SMTP settings.
+- `dmarc-sentinel-codex-implementation-spec.md`: implementation specification document present in the repository.
+
+## Application Package
+
+`app/` is the main Python package.
+
+- `main.py`: FastAPI app, startup, HTML routes, JSON routes, and admin endpoints.
+- `config.py`: settings models and configuration loading.
+- `db.py`: SQLAlchemy engine, session helpers, table initialization, and health check.
+- `models.py`: SQLAlchemy ORM models.
+- `message_processor.py`: IMAP message scanning, report import, analysis, message status updates, and processing summaries.
+- `imap_client.py`: IMAP connection, UID search/fetch, mark-seen, and move operations.
+- `attachment_extractor.py`: XML, gzip, and ZIP attachment extraction.
+- `dmarc_parser.py`: DMARC aggregate XML parser.
+- `known_senders.py`: deterministic sender classification.
+- `analyzer.py`: deterministic alert creation and update logic.
+- `alerts.py`: SMTP alert and digest delivery.
+- `llm.py`: OpenAI JSON calls and fallback summary/explanation outputs.
+- `scheduler.py`: polling, daily aggregation, and weekly summary jobs.
+- `homepage.py`: homepage widget and domain summary payload builders.
+- `schemas.py`: Pydantic request models for admin processing endpoints.
+- `auth.py`: dashboard Basic Auth and homepage bearer token dependencies.
+- `smoke_data.py`: helper for seeding smoke data.
+- `templates/`: Jinja2 templates for the server-rendered dashboard.
+- `static/app.css`: dashboard styles.
+
+## Tests and Fixtures
+
+`tests/` contains pytest coverage for current behavior:
+
+- parser behavior: `tests/test_parser.py`;
+- attachment extraction and message attachment detection: `tests/test_attachment_extractor.py`;
+- duplicate storage behavior: `tests/test_storage.py`;
+- alert analysis behavior: `tests/test_analyzer.py`;
+- homepage summary behavior: `tests/test_homepage.py`;
+- LLM fallback validation behavior: `tests/test_llm.py`;
+- test environment defaults: `tests/conftest.py`;
+- sample config and XML fixtures: `tests/fixtures/`.
+
+## Design References
+
+`design/` contains mockups and design reference files. These are reference artifacts for UI work, not runtime code.
+
+## Runtime Output Paths
+
+- `data/` is used for SQLite database files.
+- `logs/` is used for `logs/dmarc-sentinel.log`.
+- `.gitignore` excludes SQLite databases, log files, bytecode, virtual environments, and pytest cache output.
diff --git a/dmarc-watcher/doc/runtime.md b/dmarc-watcher/doc/runtime.md
new file mode 100644
index 0000000..d0d353a
--- /dev/null
+++ b/dmarc-watcher/doc/runtime.md
@@ -0,0 +1,96 @@
+# Runtime and Development Workflow
+
+This file summarizes how the implemented repository runs and how developers can validate changes.
+
+## Configuration Loading
+
+`app/config.py` loads settings in this order:
+
+1. `DMARC_SENTINEL_CONFIG`, when set;
+2. `config/config.yml`, when present;
+3. `config/config.example.yml`.
+
+Secrets are read from environment variables named by the loaded settings. The settings page renders loaded values and environment-variable presence as read-only status; the application does not implement settings writes through the dashboard.
+
+When `llm.provider` is `openai`, `validate_llm_environment()` requires the configured API key environment variable unless `DMARC_SENTINEL_ALLOW_NO_LLM_FOR_TESTS=true`.
+
+## Local Test Workflow
+
+Install dependencies in a Python environment, then run:
+
+```bash
+DMARC_SENTINEL_ALLOW_NO_LLM_FOR_TESTS=true python3 -m pytest
+```
+
+`tests/conftest.py` also sets test defaults for the LLM bypass, auth credentials, and homepage token.
+
+There is no repository-local Ruff configuration or requirement in `requirements.txt`; if linting is added, document the exact command with the added tooling.
+
+## Docker Runtime
+
+The Docker image starts the FastAPI app with:
+
+```bash
+uvicorn app.main:app --host 0.0.0.0 --port 8000
+```
+
+The compose service:
+
+- loads environment variables from `.env`;
+- mounts `./config` at `/app/config` as read-only;
+- mounts `./data` at `/app/data`;
+- mounts `./logs` at `/app/logs`;
+- publishes `8000:8000`;
+- attaches to the external `npm_proxy` network with static address `192.168.99.18`.
+
+The app initializes database tables on import through `app.main` calling `init_db()`.
+
+## CLI Backlog Processing
+
+Backlog processing is implemented by `app/cli.py`:
+
+```bash
+python -m app.cli backlog --inbox
+```
+
+Available arguments are:
+
+- `--folder`;
+- `--since YYYY-MM-DD`;
+- `--before YYYY-MM-DD`;
+- `--limit`;
+- `--dry-run`;
+- `--reprocess`;
+- `--mark-seen`.
+
+The CLI calls `process_inbox()` in backlog mode and prints the resulting `ProcessingSummary` counters.
+
+## HTTP Processing Paths
+
+`app/main.py` exposes admin endpoints that call the same processing pipeline:
+
+- `POST /api/admin/process-now`: processes the configured inbox using request fields from `ProcessNowRequest`.
+- `POST /api/admin/backlog`: runs backlog processing using request fields from `BacklogRequest`.
+
+Both endpoints use dashboard Basic Auth dependencies.
+
+## Scheduled Work
+
+On FastAPI startup, `start_scheduler(settings)` registers:
+
+- polling for all enabled inboxes every `settings.app.poll_interval_minutes`;
+- daily summaries at 07:00 in `settings.app.timezone`;
+- weekly summaries on Monday at 07:30 in `settings.app.timezone`.
+
+Polling uses `process_inbox()` with `mode="new"`. Summary jobs aggregate stored records, call the LLM wrapper, store `LLMReport` rows, and send digest email when SMTP settings allow it.
+
+## Lightweight Validation Commands
+
+For documentation-only changes, useful checks are:
+
+```bash
+find doc -maxdepth 1 -type f -print
+python3 -m pytest
+```
+
+The filename rule for this docs folder is: every Markdown file under `doc/` must be lowercase, except `README.md`.
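
The filename rule that closes `runtime.md` could also be checked mechanically rather than by reading `find` output. The following is only a minimal sketch under stated assumptions: the helper name `find_bad_doc_names` is hypothetical and does not exist in the repository, and the check covers only files directly inside the given folder.

```python
from pathlib import Path


def find_bad_doc_names(doc_dir: str) -> list[str]:
    """Return Markdown filenames in doc_dir that break the lowercase rule.

    Hypothetical helper for illustration; not part of the repository.
    """
    bad = []
    for path in sorted(Path(doc_dir).iterdir()):
        # Only regular Markdown files are covered by the rule.
        if not path.is_file() or path.suffix.lower() != ".md":
            continue
        # README.md is the one documented exception.
        if path.name == "README.md":
            continue
        if path.name != path.name.lower():
            bad.append(path.name)
    return bad
```

Calling `find_bad_doc_names("doc")` from the repository root would return an empty list when every filename follows the rule, which makes it easy to wrap in a pytest assertion if such a check is ever wanted.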