docs: complete repository technical documentation refresh

2026-02-21 11:27:44 -03:00
parent 17dccbbe20
commit 6501841426
6 changed files with 454 additions and 88 deletions
@@ -1,15 +1,18 @@
 # Documentation

-This is the documentation entrypoint for DMS.
+This directory contains technical documentation for DMS.

-## Available Documents
+## Core References

- Project setup and operations: `../README.md`
- Frontend visual system and compact UI rules: `frontend-design-foundation.md`
- Handwriting style implementation plan: `../PLAN.md`
+- `../README.md` - project overview, setup, and quick operations
+- `architecture-overview.md` - backend, frontend, and infrastructure architecture
+- `api-contract.md` - API endpoint contract grouped by route module
+- `data-model-reference.md` - database entity definitions and lifecycle states
+- `operations-and-configuration.md` - runtime operations, ports, volumes, and configuration values
+- `frontend-design-foundation.md` - frontend visual system, tokens, and UI implementation rules

-## Planned Additions
+## Documentation Rules

- Architecture overview
- Data model reference
- API contract details
+- Keep this file as the documentation index and add new technical documents here.
+- Update referenced documents whenever behavior, routes, models, or runtime configuration change.
+- Prefer concise, implementation-backed descriptions with explicit paths to source modules.
@@ -0,0 +1,130 @@
+# API Contract
+
+Base URL prefix: `/api/v1`
+
+Primary implementation modules:
+- `backend/app/api/router.py`
+- `backend/app/api/routes_health.py`
+- `backend/app/api/routes_documents.py`
+- `backend/app/api/routes_search.py`
+- `backend/app/api/routes_processing_logs.py`
+- `backend/app/api/routes_settings.py`
+
+## Health
+
+- `GET /health`
+  - Purpose: liveness check
+  - Response: `{ "status": "ok" }`
+
+## Documents
+
+### Collection and metadata helpers
+
+- `GET /documents`
+  - Query: `offset`, `limit`, `include_trashed`, `only_trashed`, `path_prefix`, `path_filter`, `tag_filter`, `type_filter`, `processed_from`, `processed_to`
+  - Response model: `DocumentsListResponse`
+- `GET /documents/tags`
+  - Query: `include_trashed`
+  - Response: `{ "tags": string[] }`
+- `GET /documents/paths`
+  - Query: `include_trashed`
+  - Response: `{ "paths": string[] }`
+- `GET /documents/types`
+  - Query: `include_trashed`
+  - Response: `{ "types": string[] }`
+- `POST /documents/content-md/export`
+  - Body model: `ContentExportRequest`
+  - Response: ZIP stream containing one markdown file per matched document
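A minimal sketch of how a client might assemble the `GET /documents` query string from the filters listed above. The helper name `documents_list_url`, the `localhost` host in the usage note, and the lowercase `true`/`false` encoding of booleans are assumptions for illustration, not taken from the repository.

```python
from urllib.parse import urlencode

API_PREFIX = "/api/v1"  # base URL prefix stated in this contract


def documents_list_url(base_url: str, **filters) -> str:
    """Build a GET /documents URL, dropping filters that are unset (None)."""
    params = {}
    for key, value in filters.items():
        if value is None:
            continue
        # Assumption: booleans are sent as lowercase "true"/"false".
        params[key] = str(value).lower() if isinstance(value, bool) else value
    query = urlencode(params)
    url = f"{base_url.rstrip('/')}{API_PREFIX}/documents"
    return f"{url}?{query}" if query else url
```

For example, `documents_list_url("http://localhost:8000", offset=0, limit=50, include_trashed=False)` yields a URL ending in `/api/v1/documents?offset=0&limit=50&include_trashed=false`.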
+
+### Per-document operations
+
+- `GET /documents/{document_id}`
+  - Response model: `DocumentDetailResponse`
+- `GET /documents/{document_id}/download`
+  - Response: original file bytes
+- `GET /documents/{document_id}/preview`
+  - Response: inline preview stream where browser-supported
+- `GET /documents/{document_id}/thumbnail`
+  - Response: generated thumbnail image when available
+- `GET /documents/{document_id}/content-md`
+  - Response: extracted markdown content for one document
+- `PATCH /documents/{document_id}`
+  - Body model: `DocumentUpdateRequest`
+  - Response model: `DocumentResponse`
+- `POST /documents/{document_id}/trash`
+  - Response model: `DocumentResponse`
+- `POST /documents/{document_id}/restore`
+  - Response model: `DocumentResponse`
+- `DELETE /documents/{document_id}`
+  - Behavior: permanently deletes the document; the document must be trashed first
+  - Response: deletion counters
+- `POST /documents/{document_id}/reprocess`
+  - Response model: `DocumentResponse`
+  - Behavior: requeues asynchronous processing task
+
+### Upload
+
+- `POST /documents/upload`
+  - Multipart form fields:
+    - `files[]` (required)
+    - `relative_paths[]` (optional)
+    - `logical_path` (optional, defaults to `Inbox`)
+    - `tags` (optional CSV)
+    - `conflict_mode` (`ask`, `replace`, `duplicate`)
+  - Response model: `UploadResponse`
+  - Behavior:
+    - `ask`: returns `conflicts` if duplicate checksum is detected
+    - `replace`: creates new document linked to replaced document id
+    - `duplicate`: creates additional document record
+
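The three `conflict_mode` values can be sketched as a small decision function keyed on the file checksum. This is an illustration under assumptions: `UploadDecision`, `decide_upload`, and the in-memory `known` mapping are hypothetical names, and the real logic in `backend/app/api/routes_documents.py` may differ.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class UploadDecision:
    action: str                        # "create", "conflict", "replace" or "duplicate"
    existing_id: Optional[str] = None  # id of the document with the same checksum


def decide_upload(data: bytes, conflict_mode: str, known: Dict[str, str]) -> UploadDecision:
    """Pick an outcome for one file; `known` maps sha256 hex digest -> document id."""
    if conflict_mode not in ("ask", "replace", "duplicate"):
        raise ValueError(f"unknown conflict_mode: {conflict_mode}")
    existing_id = known.get(hashlib.sha256(data).hexdigest())
    if existing_id is None:
        return UploadDecision("create")  # no checksum match, nothing to resolve
    if conflict_mode == "ask":
        return UploadDecision("conflict", existing_id)  # reported back to the caller
    # "replace" links the new document to the old id; "duplicate" adds another record.
    return UploadDecision(conflict_mode, existing_id)
```

Only `ask` stops short of creating a document; the other two modes always create one and differ in how the new record relates to the existing one.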
+## Search
+
+- `GET /search`
+  - Query: `query` (min length 2), `offset`, `limit`, `include_trashed`, `only_trashed`, `path_filter`, `tag_filter`, `type_filter`, `processed_from`, `processed_to`
+  - Response model: `SearchResponse`
+  - Behavior: PostgreSQL full-text and metadata ranking
+
+## Processing Logs
+
+- `GET /processing/logs`
+  - Query: `offset`, `limit`, `document_id`
+  - Response model: `ProcessingLogListResponse`
+- `POST /processing/logs/trim`
+  - Query: `keep_document_sessions`, `keep_unbound_entries`
+  - Response: trim counters
+- `POST /processing/logs/clear`
+  - Response: clear counters
+
+## Settings
+
+- `GET /settings`
+  - Response model: `AppSettingsResponse`
+- `PATCH /settings`
+  - Body model: `AppSettingsUpdateRequest`
+  - Response model: `AppSettingsResponse`
+- `POST /settings/reset`
+  - Response model: `AppSettingsResponse`
+- `PATCH /settings/handwriting`
+  - Body model: `HandwritingSettingsUpdateRequest`
+  - Response model: `AppSettingsResponse`
+- `GET /settings/handwriting`
+  - Response model: `HandwritingSettingsResponse`
+
+## Schema Families
+
+Document schemas in `backend/app/schemas/documents.py`:
+- `DocumentResponse`
+- `DocumentDetailResponse`
+- `DocumentsListResponse`
+- `UploadConflict`
+- `UploadResponse`
+- `DocumentUpdateRequest`
+- `SearchResponse`
+- `ContentExportRequest`
+
+Processing log schemas in `backend/app/schemas/processing_logs.py`:
+- `ProcessingLogEntryResponse`
+- `ProcessingLogListResponse`
+
+Settings schemas in `backend/app/schemas/settings.py`:
+- Provider, task, upload-default, display, predefined paths and tags, handwriting-style, and legacy handwriting models, grouped under `AppSettingsResponse` and `AppSettingsUpdateRequest`.
@@ -0,0 +1,66 @@
+# Architecture Overview
+
+## System Topology
+
+DMS runs as a multi-service application defined in `docker-compose.yml`:
+- `frontend` serves the React UI on port `5173`
+- `api` serves FastAPI on port `8000`
+- `worker` executes asynchronous extraction and indexing jobs
+- `db` provides PostgreSQL persistence on port `5432`
+- `redis` backs queueing on port `6379`
+- `typesense` stores search index and vector-adjacent metadata on port `8108`
+
+## Backend Architecture
+
+Backend source root: `backend/app/`
+
+Main boundaries:
+- `api/` route handlers and HTTP contract
+- `services/` domain logic (storage, extraction, routing, settings, processing logs, Typesense)
+- `db/` SQLAlchemy base, engine, and session lifecycle
+- `models/` persistence entities (`Document`, `ProcessingLogEntry`)
+- `schemas/` Pydantic response and request schemas
+- `worker/` RQ queue integration and background processing tasks
+
+Application bootstrap in `backend/app/main.py`:
+- mounts routers under `/api/v1`
+- configures CORS from settings
+- initializes storage, settings, database schema, and Typesense collection on startup
+
+## Processing Lifecycle
+
+1. Upload starts at `POST /api/v1/documents/upload`.
+2. API stores file bytes and inserts document rows with status `queued`.
+3. API enqueues `app.worker.tasks.process_document_task` into Redis.
+4. Worker extracts content and metadata, handles ZIP expansion, runs OCR and routing suggestions, and writes processing logs.
+5. Worker updates database fields, document status, and search index entries.
+6. UI polls for documents and processing logs to reflect progress.
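The lifecycle steps above can be sketched with in-memory stand-ins. This is not the project's worker code: a `deque` replaces the Redis-backed RQ queue, a dict replaces the `documents` table, and using `NotImplementedError` to signal an unsupported format is a hypothetical convention chosen for the example.

```python
from collections import deque
from typing import Callable, Deque, Dict

documents: Dict[str, dict] = {}  # stand-in for the `documents` table
queue: Deque[str] = deque()      # stand-in for the Redis-backed RQ queue


def upload(document_id: str, data: bytes) -> None:
    """Steps 1-3: store the bytes, insert a `queued` row, enqueue the task."""
    documents[document_id] = {"data": data, "status": "queued", "extracted_text": None}
    queue.append(document_id)


def run_worker(extract: Callable[[bytes], str]) -> None:
    """Steps 4-5: drain the queue and record the final status of each document."""
    while queue:
        row = documents[queue.popleft()]
        try:
            row["extracted_text"] = extract(row["data"])
            row["status"] = "processed"
        except NotImplementedError:
            row["status"] = "unsupported"  # hypothetical signal for unknown formats
        except Exception:
            row["status"] = "error"
```

Because the API only inserts the row and enqueues the job, a document stays `queued` until a worker picks it up, which is why the UI has to poll.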
+
+## Frontend Architecture
+
+Frontend source root: `frontend/src/`
+
+Core structure:
+- `App.tsx` orchestrates screen switching, state, polling, and action flows
+- `components/` contains upload, filter, grid, viewer, modal, settings, and log panel modules
+- `lib/api.ts` centralizes API client calls
+- `types.ts` defines typed API contracts used by components
+- `design-foundation.css` and `styles.css` define design tokens and global/component styling
+
+Main user flows:
+- Upload and conflict resolution
+- Search and filtered document browsing
+- Metadata editing and lifecycle actions (trash, restore, delete, reprocess)
+- Settings management for providers, tasks, and UI defaults
+- Processing log review
+
+## Persistence and State
+
+Persistent data:
+- PostgreSQL stores document metadata and processing logs
+- Docker volume-backed storage keeps original files, previews, and settings JSON
+- Typesense stores indexed search representations
+
+Transient runtime state:
+- Redis holds queued processing tasks and worker execution state
+- frontend local component state drives active filters, selection, and modal flows
@@ -0,0 +1,53 @@
+# Data Model Reference
+
+Primary SQLAlchemy models are defined in `backend/app/models/`.
+
+## documents
+
+Model: `Document` in `backend/app/models/document.py`
+
+Purpose:
+- Stores source file identity, storage location, extracted content, lifecycle status, and classification metadata.
+
+Core fields:
+- Identity and source: `id`, `original_filename`, `source_relative_path`, `stored_relative_path`
+- File attributes: `mime_type`, `extension`, `sha256`, `size_bytes`
+- Organization: `logical_path`, `suggested_path`, `tags`, `suggested_tags`
+- Processing outputs: `extracted_text`, `image_text_type`, `handwriting_style_id`, `preview_available`
+- Lifecycle and relations: `status`, `is_archive_member`, `archived_member_path`, `parent_document_id`, `replaces_document_id`
+- Metadata and timestamps: `metadata_json`, `created_at`, `processed_at`, `updated_at`
+
+Enum `DocumentStatus`:
+- `queued`
+- `processed`
+- `unsupported`
+- `error`
+- `trashed`
+
+Relationships:
+- Self-referential `parent_document` relationship for archive extraction trees.
+
+## processing_logs
+
+Model: `ProcessingLogEntry` in `backend/app/models/processing_log.py`
+
+Purpose:
+- Stores timestamped pipeline events for upload, extraction, OCR, routing, indexing, and errors.
+
+Core fields:
+- Event identity and timing: `id`, `created_at`
+- Event classification: `level`, `stage`, `event`
+- Document linkage: `document_id`, `document_filename`
+- Model context: `provider_id`, `model_name`
+- Prompt and response traces: `prompt_text`, `response_text`
+- Structured event payload: `payload_json`
+
+Foreign keys:
+- `document_id` references `documents.id` with `ON DELETE SET NULL`.
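To illustrate what `ON DELETE SET NULL` means for log retention, here is a small demonstration using SQLite from the Python standard library in place of PostgreSQL. The column types and sample values are simplified assumptions; only the table names, the listed columns, and the foreign key rule come from the reference above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY)")
conn.execute(
    """
    CREATE TABLE processing_logs (
        id INTEGER PRIMARY KEY,
        document_id TEXT REFERENCES documents(id) ON DELETE SET NULL,
        document_filename TEXT,
        event TEXT
    )
    """
)
conn.execute("INSERT INTO documents (id) VALUES ('doc-1')")
conn.execute(
    "INSERT INTO processing_logs (document_id, document_filename, event) "
    "VALUES ('doc-1', 'invoice.pdf', 'indexed')"
)

# Deleting the document keeps the log row but clears its link to the document.
conn.execute("DELETE FROM documents WHERE id = 'doc-1'")
remaining = conn.execute(
    "SELECT document_id, document_filename, event FROM processing_logs"
).fetchall()
```

After the delete, the log row survives with `document_id` set to NULL, while `document_filename` still identifies which file the event belonged to.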
+
+## Model Lifecycle Notes
+
+- Upload inserts a `Document` row in `queued` state and enqueues background processing.
+- Worker updates extraction results and final status (`processed`, `unsupported`, or `error`).
+- Trash and restore operations toggle `status` while preserving source files until permanent delete.
+- Permanent delete removes the document tree (including archive descendants) and associated stored files.
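A hedged sketch of the status values and the trash-before-delete rule. The enum values mirror the `DocumentStatus` list above; the member names and the helpers `trash` and `can_delete_permanently` are hypothetical and may not match `backend/app/models/document.py`.

```python
from enum import Enum


class DocumentStatus(str, Enum):
    QUEUED = "queued"
    PROCESSED = "processed"
    UNSUPPORTED = "unsupported"
    ERROR = "error"
    TRASHED = "trashed"


def trash(status: DocumentStatus) -> DocumentStatus:
    """Trashing changes only the status; stored files are left in place."""
    return DocumentStatus.TRASHED


def can_delete_permanently(status: DocumentStatus) -> bool:
    """Permanent delete is allowed only once the document is trashed."""
    return status is DocumentStatus.TRASHED
```

With this guard, a `processed` document cannot be removed in one step: it must pass through `trashed` first, which gives the user a chance to restore it.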
@@ -0,0 +1,111 @@
+# Operations And Configuration
+
+## Runtime Services
+
+`docker-compose.yml` defines the runtime stack:
+- `db` (Postgres 16, port `5432`)
+- `redis` (Redis 7, port `6379`)
+- `typesense` (Typesense 29, port `8108`)
+- `api` (FastAPI backend, port `8000`)
+- `worker` (RQ background worker)
+- `frontend` (Vite UI, port `5173`)
+
+## Named Volumes
+
+Persistent volumes:
+- `db-data`
+- `redis-data`
+- `dcm-storage`
+- `typesense-data`
+
+Reset all persisted runtime data:
+
+```bash
+docker compose down -v
+```
+
+## Operational Commands
+
+Start or rebuild stack:
+
+```bash
+docker compose up --build -d
+```
+
+Stop stack:
+
+```bash
+docker compose down
+```
+
+Tail logs:
+
+```bash
+docker compose logs -f
+```
+
+## Backend Configuration
+
+Settings source:
+- Runtime settings class: `backend/app/core/config.py`
+- API settings persistence: `backend/app/services/app_settings.py`
+
+Key environment variables used by `api` and `worker` in compose:
+- `APP_ENV`
+- `DATABASE_URL`
+- `REDIS_URL`
+- `STORAGE_ROOT`
+- `PUBLIC_BASE_URL`
+- `CORS_ORIGINS` (API service)
+- `TYPESENSE_PROTOCOL`
+- `TYPESENSE_HOST`
+- `TYPESENSE_PORT`
+- `TYPESENSE_API_KEY`
+- `TYPESENSE_COLLECTION_NAME`
+
+Selected defaults from `Settings` (`backend/app/core/config.py`):
+- `upload_chunk_size = 4194304`
+- `max_zip_members = 250`
+- `max_zip_depth = 2`
+- `max_text_length = 500000`
+- `default_openai_model = "gpt-4.1-mini"`
+- `default_openai_timeout_seconds = 45`
+- `default_summary_model = "gpt-4.1-mini"`
+- `default_routing_model = "gpt-4.1-mini"`
+- `typesense_timeout_seconds = 120`
+- `typesense_num_retries = 0`
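For a sense of scale, the numeric defaults can be restated in a plain dataclass. This is an illustrative sketch, not the real `Settings` class; `SettingsDefaults` and `chunk_count` are hypothetical names, and it assumes `upload_chunk_size` is measured in bytes, which the list above does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SettingsDefaults:
    upload_chunk_size: int = 4194304  # 4 * 1024 * 1024, i.e. 4 MiB if the unit is bytes
    max_zip_members: int = 250
    max_zip_depth: int = 2
    max_text_length: int = 500000
    default_openai_timeout_seconds: int = 45
    typesense_timeout_seconds: int = 120
    typesense_num_retries: int = 0


def chunk_count(size_bytes: int, chunk_size: int) -> int:
    """Number of chunks needed to cover size_bytes (ceiling division)."""
    return -(-size_bytes // chunk_size)


defaults = SettingsDefaults()
```

Under that assumption, a 10 MiB upload would be handled in `chunk_count(10 * 1024 * 1024, defaults.upload_chunk_size)` chunks, which is 3.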
+
+## Frontend Configuration
+
+Frontend runtime API target:
+- `VITE_API_BASE` in `docker-compose.yml` frontend service
+
+Frontend local commands:
+
+```bash
+cd frontend && npm run dev
+cd frontend && npm run build
+cd frontend && npm run preview
+```
+
+## Settings Persistence
+
+Application-level settings managed from the UI are persisted by the backend settings service (`backend/app/services/app_settings.py`):
+- file path: `<STORAGE_ROOT>/settings.json`
+- endpoints: `/api/v1/settings`, `/api/v1/settings/reset`, `/api/v1/settings/handwriting`
+
+Settings include:
+- upload defaults
+- display options
+- provider configuration
+- OCR, summary, and routing task settings
+- predefined paths and tags
+- handwriting-style clustering settings
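A minimal sketch of file-backed settings persistence at `<STORAGE_ROOT>/settings.json`. The function names are hypothetical, and the write-to-temp-then-rename step is a design choice of this example rather than something confirmed for `backend/app/services/app_settings.py`.

```python
import json
import os
import tempfile
from pathlib import Path


def settings_path(storage_root: str) -> Path:
    return Path(storage_root) / "settings.json"


def load_settings(storage_root: str) -> dict:
    """Return the stored settings, or an empty dict when nothing was saved yet."""
    path = settings_path(storage_root)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_settings(storage_root: str, settings: dict) -> None:
    """Write to a temp file, then rename, so readers never see a half-written file."""
    path = settings_path(storage_root)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(settings, handle, indent=2)
    os.replace(tmp_name, path)
```

Because the file lives under the storage root, it shares the lifetime of the storage volume: it survives container restarts and is removed by `docker compose down -v`.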
+
+## Validation Checklist
+
+After operational or configuration changes, verify:
+- `GET /api/v1/health` is healthy
+- frontend can list, upload, and search documents
+- processing worker logs show successful task execution
+- settings save or reset works and persists after restart
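The first checklist item could be scripted roughly as follows. This is a sketch under assumptions: `is_healthy` and `http_get` are hypothetical helpers, and the base URL passed in (for example `http://localhost:8000`, matching the `api` port above) depends on the deployment.

```python
import json
from typing import Callable
from urllib.request import urlopen

HEALTH_PATH = "/api/v1/health"


def http_get(url: str) -> bytes:
    """Fetch a URL and return the raw response body."""
    with urlopen(url, timeout=5) as response:
        return response.read()


def is_healthy(base_url: str, fetch: Callable[[str], bytes] = http_get) -> bool:
    """True when the health endpoint answers with {"status": "ok"}."""
    try:
        payload = json.loads(fetch(base_url.rstrip("/") + HEALTH_PATH))
    except (OSError, ValueError):
        # OSError covers connection failures; ValueError covers malformed JSON.
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"
```

The `fetch` parameter is injectable so the check can be exercised without a running stack, for instance `is_healthy("http://localhost:8000", fetch=lambda url: b'{"status": "ok"}')`.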