Compare commits


7 Commits

30 changed files with 2075 additions and 134 deletions

REPORT.md Normal file

@@ -0,0 +1,138 @@
# Security Audit Report
Date: 2026-02-21
Repository: /Users/bedas/Developer/GitHub/dcm
Audit type: Static, read-only code and configuration review
## Scope
- Backend API, worker, extraction and routing pipeline, settings handling, and storage interactions.
- Frontend dependency posture.
- Docker runtime and service exposure.
## Method
- File-level inspection with targeted code tracing for authn/authz, input validation, upload and archive processing, outbound network behavior, secret handling, logging, and deployment hardening.
- No runtime penetration testing was performed.
## Findings
### 1) Critical - Missing authentication and authorization on privileged API routes
- Impact: Any reachable client can access document, settings, and log-management functionality.
- Evidence:
- `backend/app/main.py:29`
- `backend/app/api/router.py:14`
- `backend/app/api/routes_documents.py:464`
- `backend/app/api/routes_documents.py:666`
- `backend/app/api/routes_settings.py:148`
- `backend/app/api/routes_processing_logs.py:22`
- Recommendation:
- Enforce authentication globally for non-health routes.
- Add per-endpoint authorization checks for read/update/delete/admin actions.
### 2) Critical - SSRF and data exfiltration risk via configurable model provider base URL
- Impact: An attacker can redirect model calls to attacker-controlled or internal hosts and exfiltrate document-derived content.
- Evidence:
- `backend/app/api/routes_settings.py:148`
- `backend/app/schemas/settings.py:24`
- `backend/app/services/app_settings.py:249`
- `backend/app/services/model_runtime.py:144`
- `backend/app/services/model_runtime.py:170`
- `backend/app/worker/tasks.py:505`
- `backend/app/services/routing_pipeline.py:803`
- Recommendation:
- Restrict provider endpoints to an allowlist.
- Validate URL scheme and block private/link-local destinations.
- Protect settings updates behind strict admin authorization.
- Enforce outbound egress controls at runtime.
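A minimal sketch of the scheme, allowlist, and private-IP checks described above (the allowlist value is illustrative, and DNS resolution of non-literal hosts is deliberately out of scope here):

```python
import ipaddress
from urllib.parse import urlparse

ALLOWED_HOSTS = {"api.openai.com"}  # illustrative allowlist, not the project's config


def is_safe_provider_url(raw: str) -> bool:
    """Reject non-HTTPS schemes, hosts outside the allowlist, and literal private IPs."""
    parsed = urlparse(raw.strip())
    if parsed.scheme != "https" or not parsed.hostname:
        return False
    host = parsed.hostname.lower().rstrip(".")
    try:
        ip = ipaddress.ip_address(host)
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            return False
    except ValueError:
        pass  # not an IP literal; fall through to the hostname allowlist
    return host in ALLOWED_HOSTS or any(host.endswith(f".{h}") for h in ALLOWED_HOSTS)
```

A complete implementation would additionally resolve hostnames and re-check every resolved address, since a public name can point at an internal IP.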
### 3) High - Unbounded upload and archive extraction can cause memory/disk denial of service
- Impact: Oversized files or compressed archive bombs can exhaust API/worker resources.
- Evidence:
- `backend/app/api/routes_documents.py:486`
- `backend/app/services/extractor.py:309`
- `backend/app/services/extractor.py:312`
- `backend/app/worker/tasks.py:122`
- `backend/app/core/config.py:20`
- Recommendation:
- Enforce request and file size limits.
- Stream uploads and extraction where possible.
- Cap total uncompressed archive size and per-entry size.
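The archive caps can be sketched as below; the limit constants mirror common defaults and are assumptions, not values from the codebase. Zip headers can understate sizes, so production code should also bound the bytes actually read during extraction:

```python
import io
import zipfile

MAX_MEMBER_BYTES = 25 * 1024 * 1024   # per-entry uncompressed cap (illustrative)
MAX_TOTAL_BYTES = 150 * 1024 * 1024   # whole-archive uncompressed cap (illustrative)
MAX_RATIO = 120.0                     # compression-ratio cap (illustrative)


def safe_extract_members(data: bytes) -> dict[str, bytes]:
    """Extract a zip in memory while enforcing per-entry, total, and ratio limits."""
    out: dict[str, bytes] = {}
    total = 0
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            if info.file_size > MAX_MEMBER_BYTES:
                raise ValueError(f"{info.filename}: entry exceeds per-file limit")
            compressed = max(info.compress_size, 1)
            if info.file_size / compressed > MAX_RATIO:
                raise ValueError(f"{info.filename}: suspicious compression ratio")
            total += info.file_size
            if total > MAX_TOTAL_BYTES:
                raise ValueError("archive exceeds total uncompressed limit")
            out[info.filename] = archive.read(info)
    return out
```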
### 4) High - Sensitive data logging exposed through unsecured log endpoints
- Impact: Extracted text, prompts, and model outputs may be retrievable by unauthorized callers.
- Evidence:
- `backend/app/models/processing_log.py:31`
- `backend/app/models/processing_log.py:32`
- `backend/app/services/routing_pipeline.py:803`
- `backend/app/services/routing_pipeline.py:814`
- `backend/app/worker/tasks.py:479`
- `backend/app/schemas/processing_logs.py:21`
- `backend/app/api/routes_processing_logs.py:22`
- Recommendation:
- Require admin authorization for log endpoints.
- Remove or redact sensitive payloads from logs.
- Reduce retention for operational logs that may include sensitive context.
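A reduced sketch of the key-based and pattern-based redaction this recommendation implies; the marker list and bearer pattern are illustrative, not the project's actual rules:

```python
import re

REDACTED = "[REDACTED]"
SENSITIVE_KEYS = ("api_key", "token", "secret", "password", "authorization")
BEARER_RE = re.compile(r"(?i)\bbearer\s+[a-z0-9._~+/\-]+=*")


def redact(value):
    """Recursively redact credential-bearing keys and bearer tokens from a log payload."""
    if isinstance(value, dict):
        return {
            k: REDACTED if any(m in k.lower() for m in SENSITIVE_KEYS) else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, str):
        return BEARER_RE.sub(REDACTED, value)
    return value
```

Redaction at write time is preferable to redaction at read time: once a secret is persisted, every consumer of the log store inherits the exposure.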
### 5) High - Internal services exposed with weak default posture in Docker Compose
- Impact: Exposed Redis/Postgres/Typesense can enable data compromise and queue abuse.
- Evidence:
- `docker-compose.yml:5`
- `docker-compose.yml:6`
- `docker-compose.yml:9`
- `docker-compose.yml:21`
- `docker-compose.yml:29`
- `docker-compose.yml:32`
- `docker-compose.yml:68`
- `backend/app/worker/queue.py:15`
- `backend/app/core/config.py:34`
- Recommendation:
- Remove unnecessary host port exposure for internal services.
- Use strong credentials and network ACL segmentation.
- Enable authentication and transport protections for stateful services.
### 6) Medium - Plaintext secrets and weak defaults in configuration paths
- Impact: Credentials and API keys can be exposed from source or storage.
- Evidence:
- `backend/app/services/app_settings.py:129`
- `backend/app/services/app_settings.py:257`
- `backend/app/services/app_settings.py:667`
- `backend/app/core/config.py:17`
- `backend/app/core/config.py:34`
- `backend/.env.example:15`
- Recommendation:
- Use managed secrets storage and encryption at rest.
- Remove default credentials.
- Rotate exposed and default keys/credentials.
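A small sketch of the fail-fast posture for secrets: reject unset values and obvious placeholders at startup rather than shipping a working default (the placeholder prefix is an assumption for illustration):

```python
import os


def require_secret(name: str) -> str:
    """Read a secret from the environment, rejecting unset or placeholder values."""
    value = os.environ.get(name, "").strip()
    if not value or value.startswith("replace-with"):
        raise RuntimeError(f"{name} must be set to a real secret")
    return value
```

Crashing at boot is a feature here: a service that refuses to start with a default credential cannot be deployed with one.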
### 7) Low - Minimal HTTP hardening headers and overly broad CORS configuration
- Impact: Increased browser-side attack surface, especially once authentication is introduced.
- Evidence:
- `backend/app/main.py:23`
- `backend/app/main.py:25`
- `backend/app/main.py:26`
- `backend/app/main.py:27`
- Recommendation:
- Add standard security headers middleware.
- Constrain allowed methods and headers to actual application needs.
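A sketch of a baseline header set and how it would merge into responses; the specific headers and values are a common baseline, not requirements from this audit:

```python
# Illustrative hardening baseline; existing response headers take precedence.
SECURITY_HEADERS = {
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
    "Referrer-Policy": "no-referrer",
    "Content-Security-Policy": "default-src 'none'",
}


def apply_security_headers(headers: dict[str, str]) -> dict[str, str]:
    """Return response headers with baseline hardening values added (existing keys win)."""
    merged = dict(SECURITY_HEADERS)
    merged.update(headers)
    return merged


# In FastAPI/Starlette this logic would live in an http middleware, roughly:
#   response = await call_next(request)
#   for name, value in SECURITY_HEADERS.items():
#       response.headers.setdefault(name, value)
```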
### 8) Low - Containers appear to run as root by default
- Impact: In-container compromise has higher blast radius.
- Evidence:
- `backend/Dockerfile:1`
- `backend/Dockerfile:17`
- `frontend/Dockerfile:1`
- `frontend/Dockerfile:16`
- Recommendation:
- Run containers as non-root users.
- Drop unnecessary Linux capabilities.
## Residual Risk and Assumptions
- This audit assumes services may be reachable beyond a strictly isolated localhost-only environment.
- If an external auth proxy is enforced upstream, risk severity of unauthenticated routes is reduced but not eliminated unless backend also enforces trust boundaries.
- Dependency CVE posture was not exhaustively enumerated in this static pass.
## Priority Remediation Order
1. Enforce authentication and authorization across API routes.
2. Lock down settings mutation paths, especially model provider endpoint configuration.
3. Add strict upload/extraction resource limits.
4. Remove sensitive logging and protect log APIs.
5. Harden Docker/network exposure and secrets management.


@@ -2,6 +2,17 @@ APP_ENV=development
DATABASE_URL=postgresql+psycopg://dcm:dcm@db:5432/dcm
REDIS_URL=redis://redis:6379/0
STORAGE_ROOT=/data/storage
ADMIN_API_TOKEN=replace-with-random-admin-token
USER_API_TOKEN=replace-with-random-user-token
MAX_UPLOAD_FILES_PER_REQUEST=50
MAX_UPLOAD_FILE_SIZE_BYTES=26214400
MAX_UPLOAD_REQUEST_SIZE_BYTES=104857600
MAX_ZIP_MEMBER_UNCOMPRESSED_BYTES=26214400
MAX_ZIP_TOTAL_UNCOMPRESSED_BYTES=157286400
MAX_ZIP_COMPRESSION_RATIO=120
PROVIDER_BASE_URL_ALLOWLIST=["api.openai.com"]
PROVIDER_BASE_URL_ALLOW_HTTP=false
PROVIDER_BASE_URL_ALLOW_PRIVATE_NETWORK=false
DEFAULT_OPENAI_BASE_URL=https://api.openai.com/v1
DEFAULT_OPENAI_MODEL=gpt-4.1-mini
DEFAULT_OPENAI_TIMEOUT_SECONDS=45


@@ -12,6 +12,11 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt
COPY app /app/app
RUN addgroup --system appgroup && adduser --system --ingroup appgroup --uid 10001 appuser
RUN mkdir -p /data/storage && chown -R appuser:appgroup /app /data
COPY --chown=appuser:appgroup app /app/app
USER appuser
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

backend/app/api/auth.py Normal file

@@ -0,0 +1,87 @@
"""Token-based authentication and authorization dependencies for privileged API routes."""
import hmac
from typing import Annotated, NoReturn
from fastapi import Depends, HTTPException, status
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
from app.core.config import Settings, get_settings
bearer_auth = HTTPBearer(auto_error=False)
class AuthRole:
"""Declares supported authorization roles for privileged API operations."""
ADMIN = "admin"
USER = "user"
def _raise_unauthorized() -> NoReturn:
"""Raises an HTTP 401 response with bearer authentication challenge headers."""
raise HTTPException(
status_code=status.HTTP_401_UNAUTHORIZED,
detail="Invalid or missing API token",
headers={"WWW-Authenticate": "Bearer"},
)
def _configured_admin_token(settings: Settings) -> str:
"""Returns required admin token or raises configuration error when unset."""
token = settings.admin_api_token.strip()
if token:
return token
raise HTTPException(
status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
detail="Admin API token is not configured",
)
def _resolve_token_role(token: str, settings: Settings) -> str:
"""Resolves role from a bearer token using constant-time comparisons."""
admin_token = _configured_admin_token(settings)
if hmac.compare_digest(token, admin_token):
return AuthRole.ADMIN
user_token = settings.user_api_token.strip()
if user_token and hmac.compare_digest(token, user_token):
return AuthRole.USER
_raise_unauthorized()
def get_request_role(
credentials: Annotated[HTTPAuthorizationCredentials | None, Depends(bearer_auth)],
settings: Annotated[Settings, Depends(get_settings)],
) -> str:
"""Authenticates request token and returns its authorization role."""
if credentials is None:
_raise_unauthorized()
token = credentials.credentials.strip()
if not token:
_raise_unauthorized()
return _resolve_token_role(token=token, settings=settings)
def require_user_or_admin(role: Annotated[str, Depends(get_request_role)]) -> str:
"""Requires a valid user or admin token and returns resolved role."""
return role
def require_admin(role: Annotated[str, Depends(get_request_role)]) -> str:
"""Requires admin role and rejects requests authenticated as regular users."""
if role != AuthRole.ADMIN:
raise HTTPException(
status_code=status.HTTP_403_FORBIDDEN,
detail="Admin token required",
)
return role


@@ -1,7 +1,8 @@
"""API router registration for all HTTP route modules."""
from fastapi import APIRouter
from fastapi import APIRouter, Depends
from app.api.auth import require_admin, require_user_or_admin
from app.api.routes_documents import router as documents_router
from app.api.routes_health import router as health_router
from app.api.routes_processing_logs import router as processing_logs_router
@@ -11,7 +12,27 @@ from app.api.routes_settings import router as settings_router
api_router = APIRouter()
api_router.include_router(health_router)
api_router.include_router(documents_router, prefix="/documents", tags=["documents"])
api_router.include_router(processing_logs_router, prefix="/processing/logs", tags=["processing-logs"])
api_router.include_router(search_router, prefix="/search", tags=["search"])
api_router.include_router(settings_router, prefix="/settings", tags=["settings"])
api_router.include_router(
documents_router,
prefix="/documents",
tags=["documents"],
dependencies=[Depends(require_user_or_admin)],
)
api_router.include_router(
processing_logs_router,
prefix="/processing/logs",
tags=["processing-logs"],
dependencies=[Depends(require_admin)],
)
api_router.include_router(
search_router,
prefix="/search",
tags=["search"],
dependencies=[Depends(require_user_or_admin)],
)
api_router.include_router(
settings_router,
prefix="/settings",
tags=["settings"],
dependencies=[Depends(require_admin)],
)


@@ -1,4 +1,4 @@
"""Document CRUD, lifecycle, metadata, file access, and content export endpoints."""
"""Authenticated document CRUD, lifecycle, metadata, file access, and content export endpoints."""
import io
import re
@@ -14,7 +14,7 @@ from fastapi.responses import FileResponse, Response, StreamingResponse
from sqlalchemy import or_, func, select
from sqlalchemy.orm import Session
from app.services.app_settings import read_predefined_paths_settings, read_predefined_tags_settings
from app.core.config import get_settings
from app.db.base import get_session
from app.models.document import Document, DocumentStatus
from app.schemas.documents import (
@@ -26,6 +26,7 @@ from app.schemas.documents import (
UploadConflict,
UploadResponse,
)
from app.services.app_settings import read_predefined_paths_settings, read_predefined_tags_settings
from app.services.extractor import sniff_mime
from app.services.handwriting_style import delete_many_handwriting_style_documents
from app.services.processing_logs import log_processing_event, set_processing_log_autocommit
@@ -35,6 +36,7 @@ from app.worker.queue import get_processing_queue
router = APIRouter()
settings = get_settings()
def _parse_csv(value: str | None) -> list[str]:
@@ -227,6 +229,33 @@ def _build_document_list_statement(
return statement
def _enforce_upload_shape(files: list[UploadFile]) -> None:
"""Validates upload request shape against configured file-count bounds."""
if not files:
raise HTTPException(status_code=400, detail="Upload request must include at least one file")
if len(files) > settings.max_upload_files_per_request:
raise HTTPException(
status_code=413,
detail=(
"Upload request exceeds file count limit "
f"({len(files)} > {settings.max_upload_files_per_request})"
),
)
async def _read_upload_bytes(file: UploadFile, max_bytes: int) -> bytes:
"""Reads one upload file while enforcing per-file byte limits."""
data = await file.read(max_bytes + 1)
if len(data) > max_bytes:
raise HTTPException(
status_code=413,
detail=f"File '{file.filename or 'upload'}' exceeds per-file limit of {max_bytes} bytes",
)
return data
def _collect_document_tree(session: Session, root_document_id: UUID) -> list[tuple[int, Document]]:
"""Collects a document and all descendants for recursive permanent deletion."""
@@ -472,18 +501,29 @@ async def upload_documents(
) -> UploadResponse:
"""Uploads files, records metadata, and enqueues asynchronous extraction tasks."""
_enforce_upload_shape(files)
set_processing_log_autocommit(session, True)
normalized_tags = _normalize_tags(tags)
queue = get_processing_queue()
uploaded: list[DocumentResponse] = []
conflicts: list[UploadConflict] = []
total_request_bytes = 0
indexed_relative_paths = relative_paths or []
prepared_uploads: list[dict[str, object]] = []
for idx, file in enumerate(files):
filename = file.filename or f"uploaded_{idx}"
data = await file.read()
data = await _read_upload_bytes(file, settings.max_upload_file_size_bytes)
total_request_bytes += len(data)
if total_request_bytes > settings.max_upload_request_size_bytes:
raise HTTPException(
status_code=413,
detail=(
"Upload request exceeds total size limit "
f"({total_request_bytes} > {settings.max_upload_request_size_bytes} bytes)"
),
)
sha256 = compute_sha256(data)
source_relative_path = indexed_relative_paths[idx] if idx < len(indexed_relative_paths) else filename
extension = Path(filename).suffix.lower()


@@ -1,10 +1,11 @@
"""Read-only API endpoints for processing pipeline event logs."""
"""Admin-only API endpoints for processing pipeline event logs."""
from uuid import UUID
from fastapi import APIRouter, Depends, Query
from sqlalchemy.orm import Session
from app.core.config import get_settings
from app.db.base import get_session
from app.schemas.processing_logs import ProcessingLogEntryResponse, ProcessingLogListResponse
from app.services.app_settings import read_processing_log_retention_settings
@@ -17,12 +18,13 @@ from app.services.processing_logs import (
router = APIRouter()
settings = get_settings()
@router.get("", response_model=ProcessingLogListResponse)
def get_processing_logs(
offset: int = Query(default=0, ge=0),
limit: int = Query(default=120, ge=1, le=400),
limit: int = Query(default=120, ge=1, le=settings.processing_log_max_unbound_entries),
document_id: UUID | None = Query(default=None),
session: Session = Depends(get_session),
) -> ProcessingLogListResponse:
@@ -43,8 +45,8 @@ def get_processing_logs(
@router.post("/trim")
def trim_processing_logs(
keep_document_sessions: int | None = Query(default=None, ge=0, le=20),
keep_unbound_entries: int | None = Query(default=None, ge=0, le=400),
keep_document_sessions: int | None = Query(default=None, ge=0, le=settings.processing_log_max_document_sessions),
keep_unbound_entries: int | None = Query(default=None, ge=0, le=settings.processing_log_max_unbound_entries),
session: Session = Depends(get_session),
) -> dict[str, int]:
"""Deletes old processing logs using query values or persisted retention defaults."""
@@ -61,10 +63,19 @@ def trim_processing_logs(
else int(retention_defaults.get("keep_unbound_entries", 80))
)
capped_keep_document_sessions = min(
settings.processing_log_max_document_sessions,
max(0, int(resolved_keep_document_sessions)),
)
capped_keep_unbound_entries = min(
settings.processing_log_max_unbound_entries,
max(0, int(resolved_keep_unbound_entries)),
)
result = cleanup_processing_logs(
session=session,
keep_document_sessions=resolved_keep_document_sessions,
keep_unbound_entries=resolved_keep_unbound_entries,
keep_document_sessions=capped_keep_document_sessions,
keep_unbound_entries=capped_keep_unbound_entries,
)
session.commit()
return result


@@ -1,6 +1,6 @@
"""API routes for managing persistent single-user application settings."""
"""Admin-only API routes for managing persistent single-user application settings."""
from fastapi import APIRouter
from fastapi import APIRouter, HTTPException
from app.schemas.settings import (
AppSettingsUpdateRequest,
@@ -18,6 +18,7 @@ from app.schemas.settings import (
UploadDefaultsResponse,
)
from app.services.app_settings import (
AppSettingsValidationError,
TASK_OCR_HANDWRITING,
TASK_ROUTING_CLASSIFICATION,
TASK_SUMMARY_GENERATION,
@@ -179,16 +180,19 @@ def set_app_settings(payload: AppSettingsUpdateRequest) -> AppSettingsResponse:
if payload.predefined_tags is not None:
predefined_tags_payload = [item.model_dump(exclude_none=True) for item in payload.predefined_tags]
updated = update_app_settings(
providers=providers_payload,
tasks=tasks_payload,
upload_defaults=upload_defaults_payload,
display=display_payload,
processing_log_retention=processing_log_retention_payload,
handwriting_style=handwriting_style_payload,
predefined_paths=predefined_paths_payload,
predefined_tags=predefined_tags_payload,
)
try:
updated = update_app_settings(
providers=providers_payload,
tasks=tasks_payload,
upload_defaults=upload_defaults_payload,
display=display_payload,
processing_log_retention=processing_log_retention_payload,
handwriting_style=handwriting_style_payload,
predefined_paths=predefined_paths_payload,
predefined_tags=predefined_tags_payload,
)
except AppSettingsValidationError as error:
raise HTTPException(status_code=400, detail=str(error)) from error
return _build_response(updated)
@@ -203,14 +207,17 @@ def reset_settings_to_defaults() -> AppSettingsResponse:
def set_handwriting_settings(payload: HandwritingSettingsUpdateRequest) -> AppSettingsResponse:
"""Updates handwriting transcription settings and returns the resulting configuration."""
updated = update_handwriting_settings(
enabled=payload.enabled,
openai_base_url=payload.openai_base_url,
openai_model=payload.openai_model,
openai_timeout_seconds=payload.openai_timeout_seconds,
openai_api_key=payload.openai_api_key,
clear_openai_api_key=payload.clear_openai_api_key,
)
try:
updated = update_handwriting_settings(
enabled=payload.enabled,
openai_base_url=payload.openai_base_url,
openai_model=payload.openai_model,
openai_timeout_seconds=payload.openai_timeout_seconds,
openai_api_key=payload.openai_api_key,
clear_openai_api_key=payload.clear_openai_api_key,
)
except AppSettingsValidationError as error:
raise HTTPException(status_code=400, detail=str(error)) from error
return _build_response(updated)


@@ -1,7 +1,10 @@
"""Application settings and environment configuration."""
from functools import lru_cache
import ipaddress
from pathlib import Path
import socket
from urllib.parse import urlparse, urlunparse
from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict
@@ -18,9 +21,24 @@ class Settings(BaseSettings):
redis_url: str = "redis://redis:6379/0"
storage_root: Path = Path("/data/storage")
upload_chunk_size: int = 4 * 1024 * 1024
max_upload_files_per_request: int = 50
max_upload_file_size_bytes: int = 25 * 1024 * 1024
max_upload_request_size_bytes: int = 100 * 1024 * 1024
max_zip_members: int = 250
max_zip_depth: int = 2
max_zip_member_uncompressed_bytes: int = 25 * 1024 * 1024
max_zip_total_uncompressed_bytes: int = 150 * 1024 * 1024
max_zip_compression_ratio: float = 120.0
max_text_length: int = 500_000
admin_api_token: str = ""
user_api_token: str = ""
provider_base_url_allowlist: list[str] = Field(default_factory=lambda: ["api.openai.com"])
provider_base_url_allow_http: bool = False
provider_base_url_allow_private_network: bool = False
processing_log_max_document_sessions: int = 20
processing_log_max_unbound_entries: int = 400
processing_log_max_payload_chars: int = 4096
processing_log_max_text_chars: int = 12000
default_openai_base_url: str = "https://api.openai.com/v1"
default_openai_model: str = "gpt-4.1-mini"
default_openai_timeout_seconds: int = 45
@@ -39,6 +57,187 @@ class Settings(BaseSettings):
cors_origins: list[str] = Field(default_factory=lambda: ["http://localhost:5173", "http://localhost:3000"])
LOCAL_HOSTNAME_SUFFIXES = (".local", ".internal", ".home.arpa")
def _normalize_allowlist(allowlist: object) -> tuple[str, ...]:
"""Normalizes host allowlist entries to lowercase DNS labels."""
if not isinstance(allowlist, (list, tuple, set)):
return ()
normalized = {
candidate.strip().lower().rstrip(".")
for candidate in allowlist
if isinstance(candidate, str) and candidate.strip()
}
return tuple(sorted(normalized))
def _host_matches_allowlist(hostname: str, allowlist: tuple[str, ...]) -> bool:
"""Returns whether a hostname is included by an exact or subdomain allowlist rule."""
if not allowlist:
return False
candidate = hostname.lower().rstrip(".")
for allowed_host in allowlist:
if candidate == allowed_host or candidate.endswith(f".{allowed_host}"):
return True
return False
def _is_private_or_special_ip(value: ipaddress.IPv4Address | ipaddress.IPv6Address) -> bool:
"""Returns whether an IP belongs to private, loopback, link-local, or reserved ranges."""
return (
value.is_private
or value.is_loopback
or value.is_link_local
or value.is_multicast
or value.is_reserved
or value.is_unspecified
)
def _validate_resolved_host_ips(hostname: str, port: int, allow_private_network: bool) -> None:
"""Resolves hostnames and rejects private or special addresses when private network access is disabled."""
try:
addresses = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
except socket.gaierror as error:
raise ValueError(f"Provider base URL host cannot be resolved: {hostname}") from error
resolved_ips: set[ipaddress.IPv4Address | ipaddress.IPv6Address] = set()
for entry in addresses:
sockaddr = entry[4]
if not sockaddr:
continue
ip_text = sockaddr[0]
try:
resolved_ips.add(ipaddress.ip_address(ip_text))
except ValueError:
continue
if not resolved_ips:
raise ValueError(f"Provider base URL host resolved without usable IP addresses: {hostname}")
if allow_private_network:
return
blocked = [ip for ip in resolved_ips if _is_private_or_special_ip(ip)]
if blocked:
blocked_text = ", ".join(str(ip) for ip in blocked)
raise ValueError(f"Provider base URL resolves to private or special IP addresses: {blocked_text}")
def _normalize_and_validate_provider_base_url(
raw_value: str,
allowlist: tuple[str, ...],
allow_http: bool,
allow_private_network: bool,
resolve_dns: bool,
) -> str:
"""Normalizes and validates provider base URLs with SSRF-safe scheme and host checks."""
trimmed = raw_value.strip().rstrip("/")
if not trimmed:
raise ValueError("Provider base URL must not be empty")
parsed = urlparse(trimmed)
scheme = parsed.scheme.lower()
if scheme not in {"http", "https"}:
raise ValueError("Provider base URL must use http or https")
if scheme == "http" and not allow_http:
raise ValueError("Provider base URL must use https")
if parsed.query or parsed.fragment:
raise ValueError("Provider base URL must not include query strings or fragments")
if parsed.username or parsed.password:
raise ValueError("Provider base URL must not include embedded credentials")
hostname = (parsed.hostname or "").lower().rstrip(".")
if not hostname:
raise ValueError("Provider base URL must include a hostname")
if allowlist and not _host_matches_allowlist(hostname, allowlist):
allowed_hosts = ", ".join(allowlist)
raise ValueError(f"Provider base URL host is not in allowlist: {hostname}. Allowed hosts: {allowed_hosts}")
if hostname == "localhost" or hostname.endswith(LOCAL_HOSTNAME_SUFFIXES):
if not allow_private_network:
raise ValueError("Provider base URL must not target local or internal hostnames")
try:
ip_host = ipaddress.ip_address(hostname)
except ValueError:
ip_host = None
if ip_host is not None:
if not allow_private_network and _is_private_or_special_ip(ip_host):
raise ValueError("Provider base URL must not target private or special IP addresses")
elif resolve_dns:
resolved_port = parsed.port
if resolved_port is None:
resolved_port = 443 if scheme == "https" else 80
_validate_resolved_host_ips(
hostname=hostname,
port=resolved_port,
allow_private_network=allow_private_network,
)
path = (parsed.path or "").rstrip("/")
if not path.endswith("/v1"):
path = f"{path}/v1" if path else "/v1"
normalized_hostname = hostname
if ":" in normalized_hostname and not normalized_hostname.startswith("["):
normalized_hostname = f"[{normalized_hostname}]"
netloc = f"{normalized_hostname}:{parsed.port}" if parsed.port is not None else normalized_hostname
return urlunparse((scheme, netloc, path, "", "", ""))
@lru_cache(maxsize=256)
def _normalize_and_validate_provider_base_url_cached(
raw_value: str,
allowlist: tuple[str, ...],
allow_http: bool,
allow_private_network: bool,
) -> str:
"""Caches provider URL validation results for non-DNS-resolved checks."""
return _normalize_and_validate_provider_base_url(
raw_value=raw_value,
allowlist=allowlist,
allow_http=allow_http,
allow_private_network=allow_private_network,
resolve_dns=False,
)
def normalize_and_validate_provider_base_url(raw_value: str, *, resolve_dns: bool = False) -> str:
"""Validates and normalizes provider base URL values using configured SSRF protections."""
settings = get_settings()
allowlist = _normalize_allowlist(settings.provider_base_url_allowlist)
allow_http = settings.provider_base_url_allow_http if isinstance(settings.provider_base_url_allow_http, bool) else False
allow_private_network = (
settings.provider_base_url_allow_private_network
if isinstance(settings.provider_base_url_allow_private_network, bool)
else False
)
if resolve_dns:
return _normalize_and_validate_provider_base_url(
raw_value=raw_value,
allowlist=allowlist,
allow_http=allow_http,
allow_private_network=allow_private_network,
resolve_dns=True,
)
return _normalize_and_validate_provider_base_url_cached(
raw_value=raw_value,
allowlist=allowlist,
allow_http=allow_http,
allow_private_network=allow_private_network,
)
@lru_cache(maxsize=1)
def get_settings() -> Settings:
"""Returns a cached settings object for dependency injection and service access."""


@@ -1,7 +1,10 @@
"""FastAPI entrypoint for the DMS backend service."""
from fastapi import FastAPI
from typing import Awaitable, Callable
from fastapi import FastAPI, Request, Response
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
from app.api.router import api_router
from app.core.config import get_settings
@@ -13,6 +16,18 @@ from app.services.typesense_index import ensure_typesense_collection
settings = get_settings()
UPLOAD_ENDPOINT_PATH = "/api/v1/documents/upload"
UPLOAD_ENDPOINT_METHOD = "POST"
def _is_upload_size_guard_target(request: Request) -> bool:
"""Returns whether upload request-size enforcement applies to this request.
Upload-size validation is intentionally scoped to the upload POST endpoint so CORS
preflight OPTIONS requests can pass through CORSMiddleware.
"""
return request.method.upper() == UPLOAD_ENDPOINT_METHOD and request.url.path == UPLOAD_ENDPOINT_PATH
def create_app() -> FastAPI:
@@ -28,6 +43,38 @@ def create_app() -> FastAPI:
)
app.include_router(api_router, prefix="/api/v1")
@app.middleware("http")
async def enforce_upload_request_size(
request: Request,
call_next: Callable[[Request], Awaitable[Response]],
) -> Response:
"""Rejects only POST upload bodies without deterministic length or with oversized request totals."""
if _is_upload_size_guard_target(request):
content_length = request.headers.get("content-length", "").strip()
if not content_length:
return JSONResponse(
status_code=411,
content={"detail": "Content-Length header is required for document uploads"},
)
try:
content_length_value = int(content_length)
except ValueError:
return JSONResponse(status_code=400, content={"detail": "Invalid Content-Length header"})
if content_length_value <= 0:
return JSONResponse(status_code=400, content={"detail": "Content-Length must be a positive integer"})
if content_length_value > settings.max_upload_request_size_bytes:
return JSONResponse(
status_code=413,
content={
"detail": (
"Upload request exceeds total size limit "
f"({content_length_value} > {settings.max_upload_request_size_bytes} bytes)"
)
},
)
return await call_next(request)
@app.on_event("startup")
def startup_event() -> None:
"""Initializes storage directories and database schema on service startup."""


@@ -2,14 +2,121 @@
import uuid
from datetime import UTC, datetime
import re
from typing import Any
from sqlalchemy import BigInteger, DateTime, ForeignKey, String, Text
from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.orm import Mapped, mapped_column
from sqlalchemy.orm import Mapped, mapped_column, validates
from app.core.config import get_settings
from app.db.base import Base
settings = get_settings()
SENSITIVE_KEY_MARKERS = (
"api_key",
"apikey",
"authorization",
"bearer",
"token",
"secret",
"password",
"credential",
"cookie",
)
SENSITIVE_TEXT_PATTERNS = (
re.compile(r"(?i)[\"']authorization[\"']\s*:\s*[\"']bearer\s+[^\"']+[\"']"),
re.compile(r"(?i)[\"']bearer[\"']\s*:\s*[\"'][^\"']+[\"']"),
re.compile(r"(?i)[\"'](?:api[_-]?key|token|secret|password)[\"']\s*:\s*[\"'][^\"']+[\"']"),
re.compile(r"(?i)\bauthorization\b\s*[:=]\s*bearer\s+[a-z0-9._~+/\-]+=*"),
re.compile(r"(?i)\bbearer\s+[a-z0-9._~+/\-]+=*"),
re.compile(r"\b[a-z0-9_-]{8,}\.[a-z0-9_-]{8,}\.[a-z0-9_-]{8,}\b", flags=re.IGNORECASE),
re.compile(r"(?i)\bsk-[a-z0-9]{16,}\b"),
re.compile(r"(?i)\b(api[_-]?key|token|secret|password)\b\s*[:=]\s*['\"]?[^\s,'\";]+['\"]?"),
)
REDACTED_TEXT = "[REDACTED]"
MAX_PAYLOAD_KEYS = 80
MAX_PAYLOAD_LIST_ITEMS = 80
def _truncate(value: str, limit: int) -> str:
"""Truncates long log fields to configured bounds with stable suffix marker."""
normalized = value.strip()
if len(normalized) <= limit:
return normalized
return normalized[: max(0, limit - 3)] + "..."
def _is_sensitive_key(key: str) -> bool:
"""Returns whether a payload key likely contains sensitive credential data."""
normalized = key.strip().lower()
return any(marker in normalized for marker in SENSITIVE_KEY_MARKERS)
def _redact_sensitive_text(value: str) -> str:
"""Redacts token-like segments from log text while retaining non-sensitive context."""
redacted = value
for pattern in SENSITIVE_TEXT_PATTERNS:
redacted = pattern.sub(lambda _: REDACTED_TEXT, redacted)
return redacted
def sanitize_processing_log_payload_value(value: Any, *, parent_key: str | None = None) -> Any:
"""Sanitizes payload structures by redacting sensitive fields and bounding size."""
if parent_key and _is_sensitive_key(parent_key):
return REDACTED_TEXT
if isinstance(value, dict):
sanitized: dict[str, Any] = {}
for index, (raw_key, raw_value) in enumerate(value.items()):
if index >= MAX_PAYLOAD_KEYS:
break
key = str(raw_key)
sanitized[key] = sanitize_processing_log_payload_value(raw_value, parent_key=key)
return sanitized
if isinstance(value, list):
return [
sanitize_processing_log_payload_value(item, parent_key=parent_key)
for item in value[:MAX_PAYLOAD_LIST_ITEMS]
]
if isinstance(value, tuple):
return [
sanitize_processing_log_payload_value(item, parent_key=parent_key)
for item in list(value)[:MAX_PAYLOAD_LIST_ITEMS]
]
if isinstance(value, str):
redacted = _redact_sensitive_text(value)
return _truncate(redacted, settings.processing_log_max_payload_chars)
if isinstance(value, (int, float, bool)) or value is None:
return value
as_text = _redact_sensitive_text(str(value))
return _truncate(as_text, settings.processing_log_max_payload_chars)
def sanitize_processing_log_text(value: str | None) -> str | None:
"""Sanitizes prompt and response fields by redacting credentials and clamping length."""
if value is None:
return None
normalized = value.strip()
if not normalized:
return None
redacted = _redact_sensitive_text(normalized)
return _truncate(redacted, settings.processing_log_max_text_chars)
class ProcessingLogEntry(Base):
"""Stores a timestamped processing event with optional model prompt and response text."""
@@ -31,3 +138,17 @@ class ProcessingLogEntry(Base):
prompt_text: Mapped[str | None] = mapped_column(Text, nullable=True)
response_text: Mapped[str | None] = mapped_column(Text, nullable=True)
payload_json: Mapped[dict] = mapped_column(JSONB, nullable=False, default=dict)
@validates("prompt_text", "response_text")
def _validate_text_fields(self, key: str, value: str | None) -> str | None:
"""Redacts and bounds free-text log fields before persistence."""
return sanitize_processing_log_text(value)
@validates("payload_json")
def _validate_payload_json(self, key: str, value: dict[str, Any] | None) -> dict[str, Any]:
"""Redacts and bounds structured payload fields before persistence."""
if not isinstance(value, dict):
return {}
return sanitize_processing_log_payload_value(value)
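For reference, the token-shaped patterns above can be exercised standalone. This sketch copies the `SENSITIVE_TEXT_PATTERNS` regexes and the `[REDACTED]` marker verbatim from the model module and mirrors `_redact_sensitive_text`; the `redact` helper name is illustrative:

```python
import re

# Verbatim copies of the module's patterns: JWT-like dotted triples,
# sk- style API keys, and key[:=]value credential assignments.
SENSITIVE_TEXT_PATTERNS = (
    re.compile(r"\b[a-z0-9_-]{8,}\.[a-z0-9_-]{8,}\.[a-z0-9_-]{8,}\b", flags=re.IGNORECASE),
    re.compile(r"(?i)\bsk-[a-z0-9]{16,}\b"),
    re.compile(r"(?i)\b(api[_-]?key|token|secret|password)\b\s*[:=]\s*['\"]?[^\s,'\";]+['\"]?"),
)
REDACTED_TEXT = "[REDACTED]"

def redact(text: str) -> str:
    """Applies each pattern in turn, mirroring _redact_sensitive_text."""
    for pattern in SENSITIVE_TEXT_PATTERNS:
        text = pattern.sub(lambda _: REDACTED_TEXT, text)
    return text

sample = "api_key=sk-abcdefghijklmnop1234 token: hunter2-value"
print(redact(sample))
```

Note that the substitution uses `lambda _: REDACTED_TEXT` rather than a plain replacement string, so backslashes in matched text are never interpreted as group references.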

View File

@@ -1,13 +1,16 @@
"""Pydantic schemas for processing pipeline log API payloads."""
from datetime import datetime
from typing import Any
from uuid import UUID
from pydantic import BaseModel, Field, field_validator
from app.models.processing_log import sanitize_processing_log_payload_value, sanitize_processing_log_text
class ProcessingLogEntryResponse(BaseModel):
"""Represents one persisted processing log event with already-redacted sensitive fields."""
id: int
created_at: datetime
@@ -20,7 +23,26 @@ class ProcessingLogEntryResponse(BaseModel):
model_name: str | None
prompt_text: str | None
response_text: str | None
payload_json: dict[str, Any]
@field_validator("prompt_text", "response_text", mode="before")
@classmethod
def _sanitize_text_fields(cls, value: Any) -> str | None:
"""Ensures log text fields are redacted in API responses."""
if value is None:
return None
return sanitize_processing_log_text(str(value))
@field_validator("payload_json", mode="before")
@classmethod
def _sanitize_payload_field(cls, value: Any) -> dict[str, Any]:
"""Ensures payload fields are redacted in API responses."""
if not isinstance(value, dict):
return {}
sanitized = sanitize_processing_log_payload_value(value)
return sanitized if isinstance(sanitized, dict) else {}
class Config:
"""Enables ORM object parsing for SQLAlchemy model instances."""

View File

@@ -5,12 +5,16 @@ import re
from pathlib import Path
from typing import Any
from app.core.config import get_settings, normalize_and_validate_provider_base_url
settings = get_settings()
class AppSettingsValidationError(ValueError):
"""Raised when user-provided settings values fail security or contract validation."""
TASK_OCR_HANDWRITING = "ocr_handwriting"
TASK_SUMMARY_GENERATION = "summary_generation"
TASK_ROUTING_CLASSIFICATION = "routing_classification"
@@ -156,13 +160,13 @@ def _clamp_cards_per_page(value: int) -> int:
def _clamp_processing_log_document_sessions(value: int) -> int:
"""Clamps the number of recent document log sessions kept during cleanup."""
return max(0, min(settings.processing_log_max_document_sessions, value))
def _clamp_processing_log_unbound_entries(value: int) -> int:
"""Clamps retained unbound processing log events kept during cleanup."""
return max(0, min(settings.processing_log_max_unbound_entries, value))
def _clamp_predefined_entries_limit(value: int) -> int:
@@ -242,12 +246,19 @@ def _normalize_provider(
api_key_value = payload.get("api_key", fallback_values.get("api_key", defaults["api_key"]))
api_key = str(api_key_value).strip() if api_key_value is not None else ""
raw_base_url = str(payload.get("base_url", fallback_values.get("base_url", defaults["base_url"]))).strip()
if not raw_base_url:
raw_base_url = str(defaults["base_url"]).strip()
try:
normalized_base_url = normalize_and_validate_provider_base_url(raw_base_url)
except ValueError as error:
raise AppSettingsValidationError(str(error)) from error
return {
"id": provider_id,
"label": str(payload.get("label", fallback_values.get("label", provider_id))).strip() or provider_id,
"provider_type": provider_type,
"base_url": normalized_base_url,
"timeout_seconds": _clamp_timeout(
_safe_int(
payload.get("timeout_seconds", fallback_values.get("timeout_seconds", defaults["timeout_seconds"])),
@@ -576,7 +587,7 @@ def _normalize_handwriting_style_settings(payload: dict[str, Any], defaults: dic
def _sanitize_settings(payload: dict[str, Any]) -> dict[str, Any]:
"""Sanitizes persisted settings into a stable structure while tolerating corrupt provider rows."""
if not isinstance(payload, dict):
payload = {}
@@ -592,7 +603,14 @@ def _sanitize_settings(payload: dict[str, Any]) -> dict[str, Any]:
if not isinstance(provider_payload, dict):
continue
fallback = defaults["providers"][0]
try:
candidate = _normalize_provider(
provider_payload,
fallback_id=f"provider-{index + 1}",
fallback_values=fallback,
)
except AppSettingsValidationError:
continue
if candidate["id"] in seen_provider_ids:
continue
seen_provider_ids.add(candidate["id"])

View File

@@ -300,16 +300,39 @@ def extract_text_content(filename: str, data: bytes, mime_type: str) -> Extracti
def extract_archive_members(data: bytes, depth: int = 0) -> list[ArchiveMember]:
"""Extracts processable ZIP members within configured decompression safety budgets."""
members: list[ArchiveMember] = []
if depth > settings.max_zip_depth:
return members
total_uncompressed_bytes = 0
try:
with zipfile.ZipFile(io.BytesIO(data)) as archive:
infos = [info for info in archive.infolist() if not info.is_dir()][: settings.max_zip_members]
for info in infos:
if info.file_size <= 0:
continue
if info.file_size > settings.max_zip_member_uncompressed_bytes:
continue
if total_uncompressed_bytes + info.file_size > settings.max_zip_total_uncompressed_bytes:
continue
compressed_size = max(1, int(info.compress_size))
compression_ratio = float(info.file_size) / float(compressed_size)
if compression_ratio > settings.max_zip_compression_ratio:
continue
with archive.open(info, mode="r") as archive_member:
member_data = archive_member.read(settings.max_zip_member_uncompressed_bytes + 1)
if len(member_data) > settings.max_zip_member_uncompressed_bytes:
continue
if total_uncompressed_bytes + len(member_data) > settings.max_zip_total_uncompressed_bytes:
continue
total_uncompressed_bytes += len(member_data)
members.append(ArchiveMember(name=info.filename, data=member_data))
except zipfile.BadZipFile:
return []
return members
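The decompression budgets above can be demonstrated with an in-memory archive. This is a minimal sketch: the `MAX_*` constants are illustrative stand-ins for the `settings.max_zip_*` fields, and `safe_members` condenses the guarded loop:

```python
import io
import zipfile

# Illustrative stand-ins for settings.max_zip_member_uncompressed_bytes,
# settings.max_zip_total_uncompressed_bytes, and settings.max_zip_compression_ratio.
MAX_MEMBER_BYTES = 1024
MAX_TOTAL_BYTES = 2048
MAX_RATIO = 50.0

def safe_members(data: bytes) -> list[tuple[str, bytes]]:
    """Reads zip members while enforcing per-member, total, and ratio budgets."""
    members: list[tuple[str, bytes]] = []
    total = 0
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for info in archive.infolist():
                if info.is_dir() or info.file_size <= 0:
                    continue
                if info.file_size > MAX_MEMBER_BYTES or total + info.file_size > MAX_TOTAL_BYTES:
                    continue
                # Highly compressible members are the zip-bomb signature.
                if info.file_size / max(1, info.compress_size) > MAX_RATIO:
                    continue
                payload = archive.read(info)
                total += len(payload)
                members.append((info.filename, payload))
    except zipfile.BadZipFile:
        return []
    return members

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("ok.txt", b"hello world")
    archive.writestr("bomb.txt", b"\0" * 1000)  # deflates to a few bytes, ratio blows the budget
names = [name for name, _ in safe_members(buffer.getvalue())]
print(names)
```

The production code additionally re-reads each member through `archive.open` and re-checks the observed length, which guards against archives whose headers understate `file_size`.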

View File

@@ -2,10 +2,10 @@
from dataclasses import dataclass
from typing import Any
from openai import APIConnectionError, APIError, APITimeoutError, OpenAI
from app.core.config import normalize_and_validate_provider_base_url
from app.services.app_settings import read_task_runtime_settings
@@ -36,18 +36,9 @@ class ModelTaskRuntime:
def _normalize_base_url(raw_value: str) -> str:
"""Normalizes provider base URL and enforces SSRF protections before outbound calls."""
return normalize_and_validate_provider_base_url(raw_value, resolve_dns=True)
def _should_fallback_to_chat(error: Exception) -> bool:
@@ -137,11 +128,16 @@ def resolve_task_runtime(task_name: str) -> ModelTaskRuntime:
if provider_type != "openai_compatible":
raise ModelTaskError(f"unsupported_provider_type:{provider_type}")
try:
normalized_base_url = _normalize_base_url(str(provider_payload.get("base_url", "https://api.openai.com/v1")))
except ValueError as error:
raise ModelTaskError(f"invalid_provider_base_url:{error}") from error
return ModelTaskRuntime(
task_name=task_name,
provider_id=str(provider_payload.get("id", "")),
provider_type=provider_type,
base_url=normalized_base_url,
timeout_seconds=int(provider_payload.get("timeout_seconds", 45)),
api_key=str(provider_payload.get("api_key", "")).strip() or "no-key-required",
model=str(task_payload.get("model", "")).strip(),

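The implementation of `normalize_and_validate_provider_base_url` is not part of this diff. A minimal sketch of the checks the tests exercise (https-only scheme, host allowlist, private-IP literals, optional DNS re-resolution) might look like the following; all names, the fixed 443 port, and the dropped port in the returned URL are assumptions of this sketch, not the audited code:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def validate_provider_base_url(raw_url: str, *, allowlist: set[str], resolve_dns: bool = False) -> str:
    """Sketch: https-only, allowlisted host, no private/loopback/link-local targets."""
    parsed = urlparse(raw_url.strip().rstrip("/"))
    if parsed.scheme != "https" or not parsed.hostname:
        raise ValueError("provider base URL must be https with a host")
    host = parsed.hostname
    try:
        literal = ipaddress.ip_address(host)
    except ValueError:
        literal = None  # host is a name, not an IP literal
    if literal is not None and (literal.is_private or literal.is_loopback or literal.is_link_local):
        raise ValueError("provider base URL targets a private address")
    if literal is None and allowlist and host not in allowlist:
        raise ValueError("provider host is not allowlisted")
    if resolve_dns and literal is None:
        # DNS rebind guard: re-resolve on every call, never cache the verdict.
        for *_, sockaddr in socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP):
            resolved = ipaddress.ip_address(sockaddr[0])
            if resolved.is_private or resolved.is_loopback or resolved.is_link_local:
                raise ValueError("provider host resolves to a private address")
    path = parsed.path or ""
    if not path.endswith("/v1"):
        path = f"{path}/v1" if path else "/v1"
    return f"https://{host}{path}"  # simplification: port is dropped in this sketch
```

The `resolve_dns=True` path matching `test_resolve_dns_validation_revalidates_each_call` is the important detail: a cached DNS verdict would reopen the rebinding window the test closes.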
View File

@@ -0,0 +1,149 @@
"""Unit coverage for resilient provider sanitization in persisted app settings."""
from __future__ import annotations
import sys
import unittest
from pathlib import Path
from types import ModuleType
from typing import Any
from unittest.mock import patch
BACKEND_ROOT = Path(__file__).resolve().parents[1]
if str(BACKEND_ROOT) not in sys.path:
sys.path.insert(0, str(BACKEND_ROOT))
if "pydantic_settings" not in sys.modules:
pydantic_settings_stub = ModuleType("pydantic_settings")
class _BaseSettings:
"""Minimal BaseSettings replacement for dependency-light unit test execution."""
def __init__(self, **kwargs: object) -> None:
for key, value in kwargs.items():
setattr(self, key, value)
def _settings_config_dict(**kwargs: object) -> dict[str, object]:
"""Returns configuration values using dict semantics expected by settings module."""
return kwargs
pydantic_settings_stub.BaseSettings = _BaseSettings
pydantic_settings_stub.SettingsConfigDict = _settings_config_dict
sys.modules["pydantic_settings"] = pydantic_settings_stub
from app.services import app_settings
def _sample_current_payload() -> dict[str, Any]:
"""Builds a sanitized payload used as in-memory persistence fixture for update tests."""
return app_settings._sanitize_settings(app_settings._default_settings())
class AppSettingsProviderResilienceTests(unittest.TestCase):
"""Verifies read-path resilience for corrupt persisted providers without weakening writes."""
def test_sanitize_settings_skips_invalid_persisted_provider_entries(self) -> None:
"""Invalid persisted providers are skipped and tasks rebind to remaining valid providers."""
payload = {
"providers": [
{
"id": "insecure-provider",
"label": "Insecure Provider",
"provider_type": "openai_compatible",
"base_url": "http://api.openai.com/v1",
"timeout_seconds": 45,
"api_key": "",
},
{
"id": "secure-provider",
"label": "Secure Provider",
"provider_type": "openai_compatible",
"base_url": "https://api.openai.com/v1",
"timeout_seconds": 45,
"api_key": "",
},
],
"tasks": {
app_settings.TASK_OCR_HANDWRITING: {"provider_id": "insecure-provider"},
app_settings.TASK_SUMMARY_GENERATION: {"provider_id": "insecure-provider"},
app_settings.TASK_ROUTING_CLASSIFICATION: {"provider_id": "insecure-provider"},
},
}
sanitized = app_settings._sanitize_settings(payload)
self.assertEqual([provider["id"] for provider in sanitized["providers"]], ["secure-provider"])
self.assertEqual(
sanitized["tasks"][app_settings.TASK_OCR_HANDWRITING]["provider_id"],
"secure-provider",
)
self.assertEqual(
sanitized["tasks"][app_settings.TASK_SUMMARY_GENERATION]["provider_id"],
"secure-provider",
)
self.assertEqual(
sanitized["tasks"][app_settings.TASK_ROUTING_CLASSIFICATION]["provider_id"],
"secure-provider",
)
def test_sanitize_settings_uses_default_provider_when_all_persisted_entries_are_invalid(self) -> None:
"""Default provider is restored when all persisted provider rows are invalid."""
payload = {
"providers": [
{
"id": "insecure-provider",
"label": "Insecure Provider",
"provider_type": "openai_compatible",
"base_url": "http://api.openai.com/v1",
"timeout_seconds": 45,
"api_key": "",
}
]
}
sanitized = app_settings._sanitize_settings(payload)
defaults = app_settings._default_settings()
default_provider_id = defaults["providers"][0]["id"]
self.assertEqual(sanitized["providers"][0]["id"], default_provider_id)
self.assertEqual(sanitized["providers"][0]["base_url"], defaults["providers"][0]["base_url"])
self.assertEqual(
sanitized["tasks"][app_settings.TASK_OCR_HANDWRITING]["provider_id"],
default_provider_id,
)
self.assertEqual(
sanitized["tasks"][app_settings.TASK_SUMMARY_GENERATION]["provider_id"],
default_provider_id,
)
self.assertEqual(
sanitized["tasks"][app_settings.TASK_ROUTING_CLASSIFICATION]["provider_id"],
default_provider_id,
)
def test_update_app_settings_keeps_provider_base_url_validation_strict(self) -> None:
"""Provider write updates still reject invalid base URLs instead of silently sanitizing."""
current_payload = _sample_current_payload()
current_provider = current_payload["providers"][0]
provider_update = {
"id": current_provider["id"],
"label": current_provider["label"],
"provider_type": current_provider["provider_type"],
"base_url": "http://api.openai.com/v1",
"timeout_seconds": current_provider["timeout_seconds"],
}
with (
patch.object(app_settings, "_read_raw_settings", return_value=current_payload),
patch.object(app_settings, "_write_settings") as write_settings_mock,
):
with self.assertRaises(app_settings.AppSettingsValidationError):
app_settings.update_app_settings(providers=[provider_update])
write_settings_mock.assert_not_called()
if __name__ == "__main__":
unittest.main()

View File

@@ -0,0 +1,341 @@
"""Unit coverage for API auth, SSRF validation, and processing-log redaction controls."""
from __future__ import annotations
from datetime import UTC, datetime
import socket
import sys
from pathlib import Path
from types import ModuleType, SimpleNamespace
import unittest
from unittest.mock import patch
BACKEND_ROOT = Path(__file__).resolve().parents[1]
if str(BACKEND_ROOT) not in sys.path:
sys.path.insert(0, str(BACKEND_ROOT))
if "pydantic_settings" not in sys.modules:
pydantic_settings_stub = ModuleType("pydantic_settings")
class _BaseSettings:
"""Minimal BaseSettings replacement for dependency-light unit test execution."""
def __init__(self, **kwargs: object) -> None:
for key, value in kwargs.items():
setattr(self, key, value)
def _settings_config_dict(**kwargs: object) -> dict[str, object]:
"""Returns configuration values using dict semantics expected by settings module."""
return kwargs
pydantic_settings_stub.BaseSettings = _BaseSettings
pydantic_settings_stub.SettingsConfigDict = _settings_config_dict
sys.modules["pydantic_settings"] = pydantic_settings_stub
if "fastapi" not in sys.modules:
fastapi_stub = ModuleType("fastapi")
class _HTTPException(Exception):
"""Minimal HTTPException compatible with route dependency tests."""
def __init__(self, status_code: int, detail: str, headers: dict[str, str] | None = None) -> None:
super().__init__(detail)
self.status_code = status_code
self.detail = detail
self.headers = headers or {}
class _Status:
"""Minimal status namespace for auth unit tests."""
HTTP_401_UNAUTHORIZED = 401
HTTP_403_FORBIDDEN = 403
HTTP_503_SERVICE_UNAVAILABLE = 503
def _depends(dependency): # type: ignore[no-untyped-def]
"""Returns provided dependency unchanged for unit testing."""
return dependency
fastapi_stub.Depends = _depends
fastapi_stub.HTTPException = _HTTPException
fastapi_stub.status = _Status()
sys.modules["fastapi"] = fastapi_stub
if "fastapi.security" not in sys.modules:
fastapi_security_stub = ModuleType("fastapi.security")
class _HTTPAuthorizationCredentials:
"""Minimal bearer credential object used by auth dependency tests."""
def __init__(self, *, scheme: str, credentials: str) -> None:
self.scheme = scheme
self.credentials = credentials
class _HTTPBearer:
"""Minimal HTTPBearer stand-in for dependency construction."""
def __init__(self, auto_error: bool = True) -> None:
self.auto_error = auto_error
fastapi_security_stub.HTTPAuthorizationCredentials = _HTTPAuthorizationCredentials
fastapi_security_stub.HTTPBearer = _HTTPBearer
sys.modules["fastapi.security"] = fastapi_security_stub
from fastapi import HTTPException
from fastapi.security import HTTPAuthorizationCredentials
from app.api.auth import AuthRole, get_request_role, require_admin
from app.core import config as config_module
from app.models.processing_log import sanitize_processing_log_payload_value, sanitize_processing_log_text
from app.schemas.processing_logs import ProcessingLogEntryResponse
def _security_settings(
*,
allowlist: list[str] | None = None,
allow_http: bool = False,
allow_private_network: bool = False,
) -> SimpleNamespace:
"""Builds lightweight settings object for provider URL validation tests."""
return SimpleNamespace(
provider_base_url_allowlist=allowlist if allowlist is not None else ["api.openai.com"],
provider_base_url_allow_http=allow_http,
provider_base_url_allow_private_network=allow_private_network,
)
class AuthDependencyTests(unittest.TestCase):
"""Verifies token authentication and admin authorization behavior."""
def test_get_request_role_accepts_admin_token(self) -> None:
"""Admin token resolves admin role."""
settings = SimpleNamespace(admin_api_token="admin-token", user_api_token="user-token")
credentials = HTTPAuthorizationCredentials(scheme="Bearer", credentials="admin-token")
role = get_request_role(credentials=credentials, settings=settings)
self.assertEqual(role, AuthRole.ADMIN)
def test_get_request_role_rejects_missing_credentials(self) -> None:
"""Missing bearer credentials return 401."""
settings = SimpleNamespace(admin_api_token="admin-token", user_api_token="user-token")
with self.assertRaises(HTTPException) as context:
get_request_role(credentials=None, settings=settings)
self.assertEqual(context.exception.status_code, 401)
def test_require_admin_rejects_user_role(self) -> None:
"""User role cannot access admin-only endpoints."""
with self.assertRaises(HTTPException) as context:
require_admin(role=AuthRole.USER)
self.assertEqual(context.exception.status_code, 403)
class ProviderBaseUrlValidationTests(unittest.TestCase):
"""Verifies allowlist, scheme, and private-network SSRF protections."""
def setUp(self) -> None:
"""Clears URL validation cache to keep tests independent."""
config_module._normalize_and_validate_provider_base_url_cached.cache_clear()
def test_validation_accepts_allowlisted_https_url(self) -> None:
"""Allowlisted HTTPS URLs are normalized with /v1 suffix."""
with patch.object(config_module, "get_settings", return_value=_security_settings(allowlist=["api.openai.com"])):
normalized = config_module.normalize_and_validate_provider_base_url("https://api.openai.com")
self.assertEqual(normalized, "https://api.openai.com/v1")
def test_validation_rejects_non_allowlisted_host(self) -> None:
"""Hosts outside configured allowlist are rejected."""
with patch.object(config_module, "get_settings", return_value=_security_settings(allowlist=["api.openai.com"])):
with self.assertRaises(ValueError):
config_module.normalize_and_validate_provider_base_url("https://example.org/v1")
def test_validation_rejects_private_ip_literal(self) -> None:
"""Private and loopback IP literals are blocked."""
with patch.object(config_module, "get_settings", return_value=_security_settings(allowlist=[])):
with self.assertRaises(ValueError):
config_module.normalize_and_validate_provider_base_url("https://127.0.0.1/v1")
def test_validation_rejects_private_ip_after_dns_resolution(self) -> None:
"""DNS rebind protection blocks public hostnames resolving to private addresses."""
mocked_dns_response = [
(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP, "", ("127.0.0.1", 443)),
]
with (
patch.object(config_module, "get_settings", return_value=_security_settings(allowlist=["api.openai.com"])),
patch.object(config_module.socket, "getaddrinfo", return_value=mocked_dns_response),
):
with self.assertRaises(ValueError):
config_module.normalize_and_validate_provider_base_url(
"https://api.openai.com/v1",
resolve_dns=True,
)
def test_resolve_dns_validation_revalidates_each_call(self) -> None:
"""DNS-resolved validation is not cached and re-checks host resolution each call."""
mocked_dns_response = [
(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP, "", ("8.8.8.8", 443)),
]
with (
patch.object(config_module, "get_settings", return_value=_security_settings(allowlist=["api.openai.com"])),
patch.object(config_module.socket, "getaddrinfo", return_value=mocked_dns_response) as getaddrinfo_mock,
):
first = config_module.normalize_and_validate_provider_base_url(
"https://api.openai.com/v1",
resolve_dns=True,
)
second = config_module.normalize_and_validate_provider_base_url(
"https://api.openai.com/v1",
resolve_dns=True,
)
self.assertEqual(first, "https://api.openai.com/v1")
self.assertEqual(second, "https://api.openai.com/v1")
self.assertEqual(getaddrinfo_mock.call_count, 2)
class ProcessingLogRedactionTests(unittest.TestCase):
"""Verifies sensitive processing-log values are redacted for persistence and responses."""
def test_payload_redacts_sensitive_keys(self) -> None:
"""Sensitive payload keys are replaced with redaction marker."""
sanitized = sanitize_processing_log_payload_value(
{
"api_key": "secret-value",
"nested": {
"authorization": "Bearer sample-token",
},
}
)
self.assertEqual(sanitized["api_key"], "[REDACTED]")
self.assertEqual(sanitized["nested"]["authorization"], "[REDACTED]")
def test_text_redaction_removes_bearer_and_jwt_values(self) -> None:
"""Bearer and JWT token substrings are fully removed from log text."""
bearer_token = "super-secret-token-123"
jwt_token = (
"eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9."
"eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4ifQ."
"signaturevalue123456789"
)
sanitized = sanitize_processing_log_text(
f"Authorization: Bearer {bearer_token}\nraw_jwt={jwt_token}"
)
self.assertIsNotNone(sanitized)
sanitized_text = sanitized or ""
self.assertIn("[REDACTED]", sanitized_text)
self.assertNotIn(bearer_token, sanitized_text)
self.assertNotIn(jwt_token, sanitized_text)
def test_text_redaction_removes_json_formatted_secret_values(self) -> None:
"""JSON-formatted quoted secrets are fully removed from redacted log text."""
api_key_secret = "json-api-key-secret"
token_secret = "json-token-secret"
authorization_secret = "json-auth-secret"
bearer_secret = "json-bearer-secret"
json_text = (
"{"
f"\"api_key\":\"{api_key_secret}\","
f"\"token\":\"{token_secret}\","
f"\"authorization\":\"Bearer {authorization_secret}\","
f"\"bearer\":\"{bearer_secret}\""
"}"
)
sanitized = sanitize_processing_log_text(json_text)
self.assertIsNotNone(sanitized)
sanitized_text = sanitized or ""
self.assertIn("[REDACTED]", sanitized_text)
self.assertNotIn(api_key_secret, sanitized_text)
self.assertNotIn(token_secret, sanitized_text)
self.assertNotIn(authorization_secret, sanitized_text)
self.assertNotIn(bearer_secret, sanitized_text)
def test_response_schema_applies_redaction_to_existing_entries(self) -> None:
"""API schema validators redact sensitive fields from legacy stored rows."""
bearer_token = "abc123token"
jwt_token = (
"eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9."
"eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4ifQ."
"signaturevalue123456789"
)
response = ProcessingLogEntryResponse.model_validate(
{
"id": 1,
"created_at": datetime.now(UTC),
"level": "info",
"stage": "summary",
"event": "response",
"document_id": None,
"document_filename": "sample.txt",
"provider_id": "provider",
"model_name": "model",
"prompt_text": f"Authorization: Bearer {bearer_token}",
"response_text": f"token={jwt_token}",
"payload_json": {"password": "secret", "trace_id": "trace-1"},
}
)
self.assertEqual(response.payload_json["password"], "[REDACTED]")
self.assertIn("[REDACTED]", response.prompt_text or "")
self.assertIn("[REDACTED]", response.response_text or "")
self.assertNotIn(bearer_token, response.prompt_text or "")
self.assertNotIn(jwt_token, response.response_text or "")
def test_response_schema_redacts_json_formatted_secret_values(self) -> None:
"""Response schema redacts quoted JSON secret forms from legacy text fields."""
api_key_secret = "legacy-json-api-key"
token_secret = "legacy-json-token"
authorization_secret = "legacy-json-auth"
bearer_secret = "legacy-json-bearer"
prompt_text = (
"{"
f"\"api_key\":\"{api_key_secret}\","
f"\"token\":\"{token_secret}\""
"}"
)
response_text = (
"{"
f"\"authorization\":\"Bearer {authorization_secret}\","
f"\"bearer\":\"{bearer_secret}\""
"}"
)
response = ProcessingLogEntryResponse.model_validate(
{
"id": 2,
"created_at": datetime.now(UTC),
"level": "info",
"stage": "summary",
"event": "response",
"document_id": None,
"document_filename": "sample-json.txt",
"provider_id": "provider",
"model_name": "model",
"prompt_text": prompt_text,
"response_text": response_text,
"payload_json": {"trace_id": "trace-2"},
}
)
self.assertIn("[REDACTED]", response.prompt_text or "")
self.assertIn("[REDACTED]", response.response_text or "")
self.assertNotIn(api_key_secret, response.prompt_text or "")
self.assertNotIn(token_secret, response.prompt_text or "")
self.assertNotIn(authorization_secret, response.response_text or "")
self.assertNotIn(bearer_secret, response.response_text or "")
if __name__ == "__main__":
unittest.main()

View File

@@ -0,0 +1,270 @@
"""Regression tests for upload request-size middleware scope and preflight handling."""
from __future__ import annotations
import importlib
import sys
import unittest
from pathlib import Path
from types import ModuleType, SimpleNamespace
from typing import Any, Awaitable, Callable
BACKEND_ROOT = Path(__file__).resolve().parents[1]
if str(BACKEND_ROOT) not in sys.path:
sys.path.insert(0, str(BACKEND_ROOT))
def _install_main_import_stubs() -> dict[str, ModuleType | None]:
"""Installs lightweight module stubs required for importing app.main in isolation."""
previous_modules: dict[str, ModuleType | None] = {
name: sys.modules.get(name)
for name in [
"fastapi",
"fastapi.middleware",
"fastapi.middleware.cors",
"fastapi.responses",
"app.api.router",
"app.core.config",
"app.db.base",
"app.services.app_settings",
"app.services.handwriting_style",
"app.services.storage",
"app.services.typesense_index",
]
}
fastapi_stub = ModuleType("fastapi")
class _Response:
"""Minimal response base class for middleware typing compatibility."""
class _FastAPI:
"""Captures middleware registration behavior used by app.main tests."""
def __init__(self, *_args: object, **_kwargs: object) -> None:
self.http_middlewares: list[Any] = []
def add_middleware(self, *_args: object, **_kwargs: object) -> None:
"""Accepts middleware registrations without side effects."""
def include_router(self, *_args: object, **_kwargs: object) -> None:
"""Accepts router registration without side effects."""
def middleware(
self,
middleware_type: str,
) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
"""Registers request middleware functions for later invocation in tests."""
def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
if middleware_type == "http":
self.http_middlewares.append(func)
return func
return decorator
def on_event(
self,
*_args: object,
**_kwargs: object,
) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
"""Returns no-op startup and shutdown decorators."""
def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
return func
return decorator
fastapi_stub.FastAPI = _FastAPI
fastapi_stub.Request = object
fastapi_stub.Response = _Response
sys.modules["fastapi"] = fastapi_stub
fastapi_middleware_stub = ModuleType("fastapi.middleware")
sys.modules["fastapi.middleware"] = fastapi_middleware_stub
fastapi_middleware_cors_stub = ModuleType("fastapi.middleware.cors")
class _CORSMiddleware:
"""Placeholder CORS middleware class accepted by FastAPI.add_middleware."""
fastapi_middleware_cors_stub.CORSMiddleware = _CORSMiddleware
sys.modules["fastapi.middleware.cors"] = fastapi_middleware_cors_stub
fastapi_responses_stub = ModuleType("fastapi.responses")
class _JSONResponse:
"""Simple JSONResponse stand-in exposing status code and payload fields."""
def __init__(self, *, status_code: int, content: dict[str, Any]) -> None:
self.status_code = status_code
self.content = content
fastapi_responses_stub.JSONResponse = _JSONResponse
sys.modules["fastapi.responses"] = fastapi_responses_stub
api_router_stub = ModuleType("app.api.router")
api_router_stub.api_router = object()
sys.modules["app.api.router"] = api_router_stub
config_stub = ModuleType("app.core.config")
def get_settings() -> SimpleNamespace:
"""Returns minimal settings consumed by app.main during test import."""
return SimpleNamespace(
cors_origins=["http://localhost:5173"],
max_upload_request_size_bytes=1024,
)
config_stub.get_settings = get_settings
sys.modules["app.core.config"] = config_stub
db_base_stub = ModuleType("app.db.base")
def init_db() -> None:
"""No-op database initializer for middleware scope tests."""
db_base_stub.init_db = init_db
sys.modules["app.db.base"] = db_base_stub
app_settings_stub = ModuleType("app.services.app_settings")
def ensure_app_settings() -> None:
"""No-op settings initializer for middleware scope tests."""
app_settings_stub.ensure_app_settings = ensure_app_settings
sys.modules["app.services.app_settings"] = app_settings_stub
handwriting_style_stub = ModuleType("app.services.handwriting_style")
def ensure_handwriting_style_collection() -> None:
"""No-op handwriting collection initializer for middleware scope tests."""
handwriting_style_stub.ensure_handwriting_style_collection = ensure_handwriting_style_collection
sys.modules["app.services.handwriting_style"] = handwriting_style_stub
storage_stub = ModuleType("app.services.storage")
def ensure_storage() -> None:
"""No-op storage initializer for middleware scope tests."""
storage_stub.ensure_storage = ensure_storage
sys.modules["app.services.storage"] = storage_stub
typesense_stub = ModuleType("app.services.typesense_index")
def ensure_typesense_collection() -> None:
"""No-op Typesense collection initializer for middleware scope tests."""
typesense_stub.ensure_typesense_collection = ensure_typesense_collection
sys.modules["app.services.typesense_index"] = typesense_stub
return previous_modules
def _restore_main_import_stubs(previous_modules: dict[str, ModuleType | None]) -> None:
"""Restores module table entries captured before installing app.main test stubs."""
for module_name, previous in previous_modules.items():
if previous is None:
sys.modules.pop(module_name, None)
else:
sys.modules[module_name] = previous
class UploadRequestSizeMiddlewareTests(unittest.IsolatedAsyncioTestCase):
"""Verifies upload request-size middleware ignores preflight and guards only upload POST."""
@classmethod
def setUpClass(cls) -> None:
"""Installs import stubs and imports app.main once for middleware extraction."""
cls._previous_modules = _install_main_import_stubs()
cls.main_module = importlib.import_module("app.main")
@classmethod
def tearDownClass(cls) -> None:
"""Removes imported module and restores pre-existing module table entries."""
sys.modules.pop("app.main", None)
_restore_main_import_stubs(cls._previous_modules)
def _http_middleware(
self,
) -> Callable[[object, Callable[[object], Awaitable[object]]], Awaitable[object]]:
"""Returns the registered HTTP middleware callable from the stubbed FastAPI app."""
return self.main_module.app.http_middlewares[0]
async def test_options_preflight_skips_upload_content_length_guard(self) -> None:
"""OPTIONS preflight requests for upload endpoint continue without Content-Length enforcement."""
request = SimpleNamespace(
method="OPTIONS",
url=SimpleNamespace(path="/api/v1/documents/upload"),
headers={},
)
expected_response = object()
call_next_count = 0
async def call_next(_request: object) -> object:
nonlocal call_next_count
call_next_count += 1
return expected_response
response = await self._http_middleware()(request, call_next)
self.assertIs(response, expected_response)
self.assertEqual(call_next_count, 1)
async def test_post_upload_without_content_length_is_rejected(self) -> None:
"""Upload POST requests remain blocked when Content-Length is absent."""
request = SimpleNamespace(
method="POST",
url=SimpleNamespace(path="/api/v1/documents/upload"),
headers={},
)
call_next_count = 0
async def call_next(_request: object) -> object:
nonlocal call_next_count
call_next_count += 1
return object()
response = await self._http_middleware()(request, call_next)
self.assertEqual(response.status_code, 411)
self.assertEqual(
response.content,
{"detail": "Content-Length header is required for document uploads"},
)
self.assertEqual(call_next_count, 0)
async def test_post_non_upload_path_skips_upload_content_length_guard(self) -> None:
"""Content-Length enforcement does not run for non-upload POST requests."""
request = SimpleNamespace(
method="POST",
url=SimpleNamespace(path="/api/v1/documents"),
headers={},
)
expected_response = object()
call_next_count = 0
async def call_next(_request: object) -> object:
nonlocal call_next_count
call_next_count += 1
return expected_response
response = await self._http_middleware()(request, call_next)
self.assertIs(response, expected_response)
self.assertEqual(call_next_count, 1)
if __name__ == "__main__":
unittest.main()

View File

@@ -6,7 +6,7 @@ This directory contains technical documentation for DMS.
- `../README.md` - project overview, setup, and quick operations
- `architecture-overview.md` - backend, frontend, and infrastructure architecture
- `api-contract.md` - API endpoint contract grouped by route module, including settings and processing-log trim defaults
- `api-contract.md` - API endpoint contract grouped by route module, including token auth roles, upload limits, and settings or processing-log security constraints
- `data-model-reference.md` - database entity definitions and lifecycle states
- `operations-and-configuration.md` - runtime operations, ports, volumes, and persisted settings configuration
- `frontend-design-foundation.md` - frontend visual system, tokens, UI implementation rules, processing-log timeline behavior, and settings helper-copy guidance
- `operations-and-configuration.md` - runtime operations, hardened compose defaults, security environment variables, and persisted settings configuration and read-sanitization behavior
- `frontend-design-foundation.md` - frontend visual system, tokens, UI implementation rules, authenticated media delivery under API token auth, processing-log timeline behavior, and settings helper-copy guidance

View File

@@ -10,6 +10,17 @@ Primary implementation modules:
- `backend/app/api/routes_processing_logs.py`
- `backend/app/api/routes_settings.py`
## Authentication And Authorization
- Protected endpoints require `Authorization: Bearer <token>`.
- `ADMIN_API_TOKEN` is required for all privileged access and acts as the fail-closed root credential.
- `USER_API_TOKEN` is optional and, when configured, grants access to document endpoints only.
- Authorization matrix:
- `documents/*`: `admin` or `user`
- `search/*`: `admin` or `user`
- `settings/*`: `admin` only
- `processing/logs/*`: `admin` only
## Health
- `GET /health`
@@ -18,6 +29,8 @@ Primary implementation modules:
## Documents
- Access: admin or user token required
### Collection and metadata helpers
- `GET /documents`
@@ -76,9 +89,14 @@ Primary implementation modules:
- `ask`: returns `conflicts` if duplicate checksum is detected
- `replace`: creates new document linked to replaced document id
- `duplicate`: creates additional document record
- upload `POST` request rejected with `411` when `Content-Length` is missing
- `OPTIONS /documents/upload` CORS preflight bypasses upload `Content-Length` enforcement
- request rejected with `413` when file count, per-file size, or total request size exceeds configured limits
## Search
- Access: admin or user token required
- `GET /search`
- Query: `query` (min length 2), `offset`, `limit`, `include_trashed`, `only_trashed`, `path_filter`, `tag_filter`, `type_filter`, `processed_from`, `processed_to`
- Response model: `SearchResponse`
@@ -86,23 +104,32 @@ Primary implementation modules:
## Processing Logs
- Access: admin token required
- `GET /processing/logs`
- Query: `offset`, `limit`, `document_id`
- Response model: `ProcessingLogListResponse`
- `limit` is capped by runtime configuration
- sensitive fields are redacted in API responses
- `POST /processing/logs/trim`
- Query: optional `keep_document_sessions`, `keep_unbound_entries`
- Behavior: omitted query values fall back to persisted `/settings.processing_log_retention`
- query values are capped by runtime retention limits
- Response: trim counters
- `POST /processing/logs/clear`
- Response: clear counters
## Settings
- Access: admin token required
- `GET /settings`
- Response model: `AppSettingsResponse`
- persisted providers with invalid base URLs are ignored during read sanitization; response falls back to remaining valid providers or secure defaults
- `PATCH /settings`
- Body model: `AppSettingsUpdateRequest`
- Response model: `AppSettingsResponse`
- rejects invalid provider base URLs with `400` when scheme, allowlist, or network safety checks fail
- `POST /settings/reset`
- Response model: `AppSettingsResponse`
- `PATCH /settings/handwriting`

View File

@@ -49,6 +49,13 @@ Do not hardcode new palette or spacing values in component styles when a token a
- Do not render queued headers before their animation starts, even when polling returns batched updates.
- Preserve existing header content format and fold/unfold detail behavior as lines are revealed.
## Authenticated Media Delivery
- Document previews and thumbnails must load through authenticated fetch flows in `frontend/src/lib/api.ts`, then render via temporary object URLs.
- Direct `window.open` calls for protected media endpoints are not allowed because browser navigation requests do not include the API token header.
- Download actions for original files and markdown exports must use authenticated blob fetches plus controlled browser download triggers.
- Revoke all temporary object URLs after replacement, unmount, or completion to prevent browser memory leaks.
## Extension Checklist
When adding or redesigning a UI area:

View File

@@ -3,12 +3,12 @@
## Runtime Services
`docker-compose.yml` defines the runtime stack:
- `db` (Postgres 16, port `5432`)
- `redis` (Redis 7, port `6379`)
- `typesense` (Typesense 29, port `8108`)
- `api` (FastAPI backend, port `8000`)
- `db` (Postgres 16, localhost-bound port `5432`)
- `redis` (Redis 7, localhost-bound port `6379`)
- `typesense` (Typesense 29, localhost-bound port `8108`)
- `api` (FastAPI backend, localhost-bound port `8000`)
- `worker` (RQ background worker)
- `frontend` (Vite UI, port `5173`)
- `frontend` (Vite UI, localhost-bound port `5173`)
## Named Volumes
@@ -44,6 +44,15 @@ Tail logs:
docker compose logs -f
```
Before running compose, provide explicit API tokens in your shell or project `.env` file:
```bash
export ADMIN_API_TOKEN="<random-admin-token>"
export USER_API_TOKEN="<random-user-token>"
```
Compose now fails fast if either token variable is missing.
## Backend Configuration
Settings source:
@@ -55,8 +64,13 @@ Key environment variables used by `api` and `worker` in compose:
- `DATABASE_URL`
- `REDIS_URL`
- `STORAGE_ROOT`
- `ADMIN_API_TOKEN`
- `USER_API_TOKEN`
- `PUBLIC_BASE_URL`
- `CORS_ORIGINS` (API service)
- `PROVIDER_BASE_URL_ALLOWLIST`
- `PROVIDER_BASE_URL_ALLOW_HTTP`
- `PROVIDER_BASE_URL_ALLOW_PRIVATE_NETWORK`
- `TYPESENSE_PROTOCOL`
- `TYPESENSE_HOST`
- `TYPESENSE_PORT`
@@ -65,9 +79,17 @@ Key environment variables used by `api` and `worker` in compose:
Selected defaults from `Settings` (`backend/app/core/config.py`):
- `upload_chunk_size = 4194304`
- `max_upload_files_per_request = 50`
- `max_upload_file_size_bytes = 26214400`
- `max_upload_request_size_bytes = 104857600`
- `max_zip_members = 250`
- `max_zip_depth = 2`
- `max_zip_member_uncompressed_bytes = 26214400`
- `max_zip_total_uncompressed_bytes = 157286400`
- `max_zip_compression_ratio = 120.0`
- `max_text_length = 500000`
- `processing_log_max_document_sessions = 20`
- `processing_log_max_unbound_entries = 400`
- `default_openai_model = "gpt-4.1-mini"`
- `default_openai_timeout_seconds = 45`
- `default_summary_model = "gpt-4.1-mini"`
@@ -79,6 +101,15 @@ Selected defaults from `Settings` (`backend/app/core/config.py`):
Frontend runtime API target:
- `VITE_API_BASE` in `docker-compose.yml` frontend service
- `VITE_API_TOKEN` in `docker-compose.yml` frontend service (defaults to `USER_API_TOKEN` in compose, override to `ADMIN_API_TOKEN` when admin-only routes are needed)
Frontend API authentication behavior:
- `frontend/src/lib/api.ts` adds `Authorization: Bearer <VITE_API_TOKEN>` for all API requests only when `VITE_API_TOKEN` is non-empty
- requests are still sent without authorization when `VITE_API_TOKEN` is unset, which keeps unauthenticated endpoints such as `/api/v1/health` backward-compatible
Frontend container runtime behavior:
- the container runs as non-root `node`
- `/app` is owned by `node` in `frontend/Dockerfile` so Vite can create runtime temp config files under `/app`
Frontend local commands:
@@ -103,8 +134,30 @@ Settings include:
- predefined paths and tags
- handwriting-style clustering settings
Read sanitization is resilient to corrupt persisted provider rows. If a persisted provider entry fails URL validation, the entry is skipped and defaults are used when no valid provider remains. This prevents unrelated read endpoints from failing due to stale invalid provider data.
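The skip-and-fall-back behavior described above can be sketched as a small filter over persisted provider rows; the function and field names here are hypothetical:

```python
from typing import Callable


def sanitize_providers(
    persisted: list[dict],
    is_valid_base_url: Callable[[str], bool],
    defaults: list[dict],
) -> list[dict]:
    """Drops persisted provider rows whose base URL fails validation.

    Falls back to secure defaults when no valid provider remains, so read
    endpoints never fail because of stale invalid rows.
    """
    valid = [row for row in persisted if is_valid_base_url(row.get("base_url", ""))]
    return valid if valid else list(defaults)
```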
Retention settings are used by worker cleanup and by `POST /api/v1/processing/logs/trim` when trim query values are not provided.
## Security Controls
- Privileged APIs are token-gated with bearer auth:
- `documents` endpoints: user token or admin token
- `settings` and `processing/logs` endpoints: admin token only
- Authentication fails closed when `ADMIN_API_TOKEN` is not configured.
- Provider base URLs are validated on settings updates and before outbound model calls:
- allowlist enforcement (`PROVIDER_BASE_URL_ALLOWLIST`)
- scheme restrictions (`https` by default)
- local/private-network blocking and per-request DNS revalidation checks for outbound runtime calls
- Upload and archive safety guards are enforced:
- `POST /api/v1/documents/upload` requires `Content-Length` and enforces file-count, per-file size, and total request size limits
- `OPTIONS /api/v1/documents/upload` CORS preflight is excluded from `Content-Length` enforcement
- ZIP member count, per-member uncompressed size, total decompressed size, and compression-ratio guards
- Processing logs redact sensitive payload and text fields, and trim endpoints enforce retention caps from runtime config.
- Compose hardening defaults:
- host ports bind to `127.0.0.1` unless `HOST_BIND_IP` override is set
- `api`, `worker`, and `frontend` drop all Linux capabilities and set `no-new-privileges`
- backend and frontend containers run as non-root users by default
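The provider base URL controls above (scheme restriction, allowlist, private-network blocking with DNS revalidation) can be sketched as one validation function. This is a simplified illustration of the documented checks, not the repository's actual module; names and parameters are assumptions:

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def validate_provider_base_url(
    url: str,
    allowlist: frozenset[str],
    allow_http: bool = False,
    allow_private_network: bool = False,
) -> bool:
    """Checks scheme, host allowlist, and resolved-address safety for a provider URL."""
    parts = urlsplit(url)
    allowed_schemes = {"https", "http"} if allow_http else {"https"}
    if parts.scheme not in allowed_schemes:
        return False
    host = parts.hostname or ""
    if allowlist and host not in allowlist:
        return False
    if not allow_private_network:
        try:
            # Resolve at validation time; outbound runtime calls should revalidate
            # per request so a later DNS change cannot repoint to internal hosts.
            for _family, _type, _proto, _name, sockaddr in socket.getaddrinfo(host, None):
                address = ipaddress.ip_address(sockaddr[0])
                if address.is_private or address.is_loopback or address.is_link_local:
                    return False
        except (socket.gaierror, ValueError):
            return False
    return True
```

Rejecting on resolution failure keeps the check fail-closed: a host that cannot be resolved is never accepted.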
## Validation Checklist
After operational or configuration changes, verify:

View File

@@ -6,7 +6,7 @@ services:
POSTGRES_PASSWORD: dcm
POSTGRES_DB: dcm
ports:
- "5432:5432"
- "${HOST_BIND_IP:-127.0.0.1}:5432:5432"
volumes:
- db-data:/var/lib/postgresql/data
healthcheck:
@@ -18,7 +18,7 @@ services:
redis:
image: redis:7-alpine
ports:
- "6379:6379"
- "${HOST_BIND_IP:-127.0.0.1}:6379:6379"
volumes:
- redis-data:/data
@@ -29,7 +29,7 @@ services:
- "--api-key=dcm-typesense-key"
- "--enable-cors"
ports:
- "8108:8108"
- "${HOST_BIND_IP:-127.0.0.1}:8108:8108"
volumes:
- typesense-data:/data
@@ -41,16 +41,25 @@ services:
DATABASE_URL: postgresql+psycopg://dcm:dcm@db:5432/dcm
REDIS_URL: redis://redis:6379/0
STORAGE_ROOT: /data/storage
ADMIN_API_TOKEN: ${ADMIN_API_TOKEN:?ADMIN_API_TOKEN must be set}
USER_API_TOKEN: ${USER_API_TOKEN:?USER_API_TOKEN must be set}
PROVIDER_BASE_URL_ALLOWLIST: '${PROVIDER_BASE_URL_ALLOWLIST:-["api.openai.com"]}'
PROVIDER_BASE_URL_ALLOW_HTTP: ${PROVIDER_BASE_URL_ALLOW_HTTP:-false}
PROVIDER_BASE_URL_ALLOW_PRIVATE_NETWORK: ${PROVIDER_BASE_URL_ALLOW_PRIVATE_NETWORK:-false}
OCR_LANGUAGES: eng,deu
PUBLIC_BASE_URL: http://192.168.2.5:8000
CORS_ORIGINS: '["http://localhost:5173","http://localhost:3000","http://192.168.2.5:5173"]'
PUBLIC_BASE_URL: ${PUBLIC_BASE_URL:-http://localhost:8000}
CORS_ORIGINS: '${CORS_ORIGINS:-["http://localhost:5173","http://localhost:3000"]}'
TYPESENSE_PROTOCOL: http
TYPESENSE_HOST: typesense
TYPESENSE_PORT: 8108
TYPESENSE_API_KEY: dcm-typesense-key
TYPESENSE_COLLECTION_NAME: documents
ports:
- "8000:8000"
- "${HOST_BIND_IP:-127.0.0.1}:8000:8000"
security_opt:
- no-new-privileges:true
cap_drop:
- ALL
volumes:
- ./backend/app:/app/app
- dcm-storage:/data
@@ -71,6 +80,11 @@ services:
DATABASE_URL: postgresql+psycopg://dcm:dcm@db:5432/dcm
REDIS_URL: redis://redis:6379/0
STORAGE_ROOT: /data/storage
ADMIN_API_TOKEN: ${ADMIN_API_TOKEN:?ADMIN_API_TOKEN must be set}
USER_API_TOKEN: ${USER_API_TOKEN:?USER_API_TOKEN must be set}
PROVIDER_BASE_URL_ALLOWLIST: '${PROVIDER_BASE_URL_ALLOWLIST:-["api.openai.com"]}'
PROVIDER_BASE_URL_ALLOW_HTTP: ${PROVIDER_BASE_URL_ALLOW_HTTP:-false}
PROVIDER_BASE_URL_ALLOW_PRIVATE_NETWORK: ${PROVIDER_BASE_URL_ALLOW_PRIVATE_NETWORK:-false}
OCR_LANGUAGES: eng,deu
PUBLIC_BASE_URL: http://localhost:8000
TYPESENSE_PROTOCOL: http
@@ -81,6 +95,10 @@ services:
volumes:
- ./backend/app:/app/app
- dcm-storage:/data
security_opt:
- no-new-privileges:true
cap_drop:
- ALL
depends_on:
db:
condition: service_healthy
@@ -93,9 +111,10 @@ services:
build:
context: ./frontend
environment:
VITE_API_BASE: http://192.168.2.5:8000/api/v1
VITE_API_BASE: ${VITE_API_BASE:-http://localhost:8000/api/v1}
VITE_API_TOKEN: ${VITE_API_TOKEN:-${USER_API_TOKEN:-}}
ports:
- "5173:5173"
- "${HOST_BIND_IP:-127.0.0.1}:5173:5173"
volumes:
- ./frontend/src:/app/src
- ./frontend/index.html:/app/index.html
@@ -103,6 +122,10 @@ services:
depends_on:
api:
condition: service_started
security_opt:
- no-new-privileges:true
cap_drop:
- ALL
volumes:
db-data:

View File

@@ -3,14 +3,18 @@ FROM node:22-alpine
WORKDIR /app
COPY package.json /app/package.json
RUN npm install
COPY package-lock.json /app/package-lock.json
RUN npm ci
RUN chown -R node:node /app
COPY tsconfig.json /app/tsconfig.json
COPY tsconfig.node.json /app/tsconfig.node.json
COPY vite.config.ts /app/vite.config.ts
COPY index.html /app/index.html
COPY src /app/src
COPY --chown=node:node tsconfig.json /app/tsconfig.json
COPY --chown=node:node tsconfig.node.json /app/tsconfig.node.json
COPY --chown=node:node vite.config.ts /app/vite.config.ts
COPY --chown=node:node index.html /app/index.html
COPY --chown=node:node src /app/src
EXPOSE 5173
USER node
CMD ["npm", "run", "dev", "--", "--host", "0.0.0.0", "--port", "5173"]

View File

@@ -5,6 +5,7 @@
"type": "module",
"scripts": {
"dev": "vite",
"test": "node --experimental-strip-types src/lib/api.test.ts",
"build": "tsc -b && vite build",
"preview": "vite preview --host 0.0.0.0 --port 4173"
},

View File

@@ -14,6 +14,7 @@ import SettingsScreen from './components/SettingsScreen';
import UploadSurface from './components/UploadSurface';
import {
clearProcessingLogs,
downloadBlobFile,
deleteDocument,
exportContentsMarkdown,
getAppSettings,
@@ -117,15 +118,6 @@ export default function App(): JSX.Element {
}
}, []);
const downloadBlob = useCallback((blob: Blob, filename: string): void => {
const objectUrl = URL.createObjectURL(blob);
const anchor = document.createElement('a');
anchor.href = objectUrl;
anchor.download = filename;
anchor.click();
URL.revokeObjectURL(objectUrl);
}, []);
const loadCatalogs = useCallback(async (): Promise<void> => {
const [tags, paths, types] = await Promise.all([listTags(true), listPaths(true), listTypes(true)]);
setKnownTags(tags);
@@ -465,13 +457,13 @@ export default function App(): JSX.Element {
only_trashed: documentView === 'trash',
include_trashed: false,
});
downloadBlob(result.blob, result.filename);
downloadBlobFile(result.blob, result.filename);
} catch (caughtError) {
setError(caughtError instanceof Error ? caughtError.message : 'Failed to export selected markdown files');
} finally {
setIsRunningBulkAction(false);
}
}, [documentView, downloadBlob, selectedDocumentIds]);
}, [documentView, selectedDocumentIds]);
const handleExportPath = useCallback(async (): Promise<void> => {
const trimmedPrefix = exportPathInput.trim();
@@ -487,13 +479,13 @@ export default function App(): JSX.Element {
only_trashed: documentView === 'trash',
include_trashed: false,
});
downloadBlob(result.blob, result.filename);
downloadBlobFile(result.blob, result.filename);
} catch (caughtError) {
setError(caughtError instanceof Error ? caughtError.message : 'Failed to export path markdown files');
} finally {
setIsRunningBulkAction(false);
}
}, [documentView, downloadBlob, exportPathInput]);
}, [documentView, exportPathInput]);
const handleSaveSettings = useCallback(async (payload: AppSettingsUpdate): Promise<void> => {
setIsSavingSettings(true);

View File

@@ -1,12 +1,17 @@
/**
* Card view for displaying document summary, preview, and metadata.
*/
import { useState } from 'react';
import { useEffect, useRef, useState } from 'react';
import type { JSX } from 'react';
import { Download, FileText, Trash2 } from 'lucide-react';
import type { DmsDocument } from '../types';
import { contentMarkdownUrl, downloadUrl, thumbnailUrl } from '../lib/api';
import {
downloadBlobFile,
downloadDocumentContentMarkdown,
downloadDocumentFile,
getDocumentThumbnailBlob,
} from '../lib/api';
/**
* Defines properties accepted by the document card component.
@@ -79,12 +84,59 @@ export default function DocumentCard({
onFilterTag,
}: DocumentCardProps): JSX.Element {
const [isTrashing, setIsTrashing] = useState<boolean>(false);
const [thumbnailObjectUrl, setThumbnailObjectUrl] = useState<string | null>(null);
const thumbnailObjectUrlRef = useRef<string | null>(null);
const createdDate = new Date(document.created_at).toLocaleString();
const status = statusPresentation(document.status);
const compactPath = compactLogicalPath(document.logical_path, 180);
const trashDisabled = isTrashView || document.status === 'trashed' || isTrashing;
const trashTitle = trashDisabled ? 'Already in trash' : 'Move to trash';
/**
* Loads thumbnail preview through authenticated fetch and revokes replaced object URLs.
*/
useEffect(() => {
const revokeThumbnailObjectUrl = (): void => {
if (!thumbnailObjectUrlRef.current) {
return;
}
URL.revokeObjectURL(thumbnailObjectUrlRef.current);
thumbnailObjectUrlRef.current = null;
};
if (!document.preview_available) {
revokeThumbnailObjectUrl();
setThumbnailObjectUrl(null);
return;
}
let cancelled = false;
const loadThumbnail = async (): Promise<void> => {
try {
const blob = await getDocumentThumbnailBlob(document.id);
if (cancelled) {
return;
}
revokeThumbnailObjectUrl();
const objectUrl = URL.createObjectURL(blob);
thumbnailObjectUrlRef.current = objectUrl;
setThumbnailObjectUrl(objectUrl);
} catch {
if (cancelled) {
return;
}
revokeThumbnailObjectUrl();
setThumbnailObjectUrl(null);
}
};
void loadThumbnail();
return () => {
cancelled = true;
revokeThumbnailObjectUrl();
};
}, [document.id, document.preview_available]);
return (
<article
className={`document-card ${isSelected ? 'selected' : ''}`}
@@ -119,8 +171,8 @@ export default function DocumentCard({
</label>
</header>
<div className="document-preview">
{document.preview_available ? (
<img src={thumbnailUrl(document.id)} alt={document.original_filename} loading="lazy" />
{document.preview_available && thumbnailObjectUrl ? (
<img src={thumbnailObjectUrl} alt={document.original_filename} loading="lazy" />
) : (
<div className="document-preview-fallback">{document.extension || 'file'}</div>
)}
@@ -173,7 +225,13 @@ export default function DocumentCard({
onClick={(event) => {
event.preventDefault();
event.stopPropagation();
window.open(downloadUrl(document.id), '_blank', 'noopener,noreferrer');
void (async (): Promise<void> => {
try {
const payload = await downloadDocumentFile(document.id);
downloadBlobFile(payload.blob, payload.filename);
} catch {
// Swallow download failures silently; the card has no inline error surface.
}
})();
}}
>
<Download aria-hidden="true" />
@@ -186,7 +244,13 @@ export default function DocumentCard({
onClick={(event) => {
event.preventDefault();
event.stopPropagation();
window.open(contentMarkdownUrl(document.id), '_blank', 'noopener,noreferrer');
void (async (): Promise<void> => {
try {
const payload = await downloadDocumentContentMarkdown(document.id);
downloadBlobFile(payload.blob, payload.filename);
} catch {
// Swallow download failures silently; the card has no inline error surface.
}
})();
}}
>
<FileText aria-hidden="true" />

View File

@@ -1,14 +1,15 @@
/**
* Embedded document viewer panel for preview, metadata updates, and lifecycle actions.
*/
import { useEffect, useMemo, useState } from 'react';
import { useEffect, useMemo, useRef, useState } from 'react';
import type { JSX } from 'react';
import {
contentMarkdownUrl,
downloadBlobFile,
downloadDocumentContentMarkdown,
deleteDocument,
getDocumentDetails,
previewUrl,
getDocumentPreviewBlob,
reprocessDocument,
restoreDocument,
trashDocument,
@@ -44,6 +45,8 @@ export default function DocumentViewer({
requestConfirmation,
}: DocumentViewerProps): JSX.Element {
const [documentDetail, setDocumentDetail] = useState<DmsDocumentDetail | null>(null);
const [previewObjectUrl, setPreviewObjectUrl] = useState<string | null>(null);
const [isLoadingPreview, setIsLoadingPreview] = useState<boolean>(false);
const [isLoadingDetails, setIsLoadingDetails] = useState<boolean>(false);
const [originalFilename, setOriginalFilename] = useState<string>('');
const [logicalPath, setLogicalPath] = useState<string>('');
@@ -55,6 +58,7 @@ export default function DocumentViewer({
const [isDeleting, setIsDeleting] = useState<boolean>(false);
const [isMetadataDirty, setIsMetadataDirty] = useState<boolean>(false);
const [error, setError] = useState<string | null>(null);
const previewObjectUrlRef = useRef<string | null>(null);
/**
* Syncs editable metadata fields whenever selection changes.
@@ -62,6 +66,12 @@ export default function DocumentViewer({
useEffect(() => {
if (!document) {
setDocumentDetail(null);
if (previewObjectUrlRef.current) {
URL.revokeObjectURL(previewObjectUrlRef.current);
previewObjectUrlRef.current = null;
}
setPreviewObjectUrl(null);
setIsLoadingPreview(false);
setIsMetadataDirty(false);
return;
}
@@ -72,6 +82,57 @@ export default function DocumentViewer({
setError(null);
}, [document?.id]);
/**
* Loads authenticated preview bytes and exposes a temporary object URL for iframe or image rendering.
*/
useEffect(() => {
const revokePreviewObjectUrl = (): void => {
if (!previewObjectUrlRef.current) {
return;
}
URL.revokeObjectURL(previewObjectUrlRef.current);
previewObjectUrlRef.current = null;
};
if (!document) {
revokePreviewObjectUrl();
setPreviewObjectUrl(null);
setIsLoadingPreview(false);
return;
}
let cancelled = false;
setIsLoadingPreview(true);
const loadPreview = async (): Promise<void> => {
try {
const blob = await getDocumentPreviewBlob(document.id);
if (cancelled) {
return;
}
revokePreviewObjectUrl();
const objectUrl = URL.createObjectURL(blob);
previewObjectUrlRef.current = objectUrl;
setPreviewObjectUrl(objectUrl);
} catch {
if (cancelled) {
return;
}
revokePreviewObjectUrl();
setPreviewObjectUrl(null);
} finally {
if (!cancelled) {
setIsLoadingPreview(false);
}
}
};
void loadPreview();
return () => {
cancelled = true;
revokePreviewObjectUrl();
};
}, [document?.id]);
/**
* Refreshes editable metadata from list updates only while form is clean.
*/
@@ -418,10 +479,16 @@ export default function DocumentViewer({
<h2>{document.original_filename}</h2>
<p className="small">Status: {document.status}</p>
<div className="viewer-preview">
{isImageDocument ? (
<img src={previewUrl(document.id)} alt={document.original_filename} />
{previewObjectUrl ? (
isImageDocument ? (
<img src={previewObjectUrl} alt={document.original_filename} />
) : (
<iframe src={previewObjectUrl} title={document.original_filename} />
)
) : isLoadingPreview ? (
<p className="small">Loading preview...</p>
) : (
<iframe src={previewUrl(document.id)} title={document.original_filename} />
<p className="small">Preview unavailable for this document.</p>
)}
</div>
<label>
@@ -561,7 +628,16 @@ export default function DocumentViewer({
<button
type="button"
className="secondary-action"
onClick={() => window.open(contentMarkdownUrl(document.id), '_blank', 'noopener,noreferrer')}
onClick={() => {
void (async (): Promise<void> => {
try {
const payload = await downloadDocumentContentMarkdown(document.id);
downloadBlobFile(payload.blob, payload.filename);
} catch (caughtError) {
setError(caughtError instanceof Error ? caughtError.message : 'Failed to download markdown');
}
})();
}}
disabled={isDeleting}
title="Downloads recognized/extracted content as markdown for this document."
>

View File

@@ -0,0 +1,85 @@
// @ts-expect-error Node strip-types runtime requires explicit .ts extension in ESM imports.
import { downloadDocumentContentMarkdown, downloadDocumentFile, getDocumentPreviewBlob, getDocumentThumbnailBlob } from './api.ts';
/**
* Throws when a test condition is false.
*/
function assert(condition: boolean, message: string): void {
if (!condition) {
throw new Error(message);
}
}
/**
* Verifies that async functions reject with an expected message fragment.
*/
async function assertRejects(action: () => Promise<unknown>, expectedMessage: string): Promise<void> {
try {
await action();
} catch (error) {
const message = error instanceof Error ? error.message : String(error);
assert(message.includes(expectedMessage), `Expected error containing "${expectedMessage}" but received "${message}"`);
return;
}
throw new Error(`Expected rejection containing "${expectedMessage}"`);
}
/**
* Runs API helper tests for authenticated media and download flows.
*/
async function runApiTests(): Promise<void> {
const originalFetch = globalThis.fetch;
try {
const requestUrls: string[] = [];
globalThis.fetch = (async (input: RequestInfo | URL): Promise<Response> => {
requestUrls.push(typeof input === 'string' ? input : input.toString());
return new Response('preview-bytes', { status: 200 });
}) as typeof fetch;
const thumbnail = await getDocumentThumbnailBlob('doc-1');
const preview = await getDocumentPreviewBlob('doc-1');
assert(await thumbnail.text() === 'preview-bytes', 'Thumbnail blob bytes mismatch');
assert(await preview.text() === 'preview-bytes', 'Preview blob bytes mismatch');
assert(
requestUrls[0] === 'http://localhost:8000/api/v1/documents/doc-1/thumbnail',
`Unexpected thumbnail URL ${requestUrls[0]}`,
);
assert(
requestUrls[1] === 'http://localhost:8000/api/v1/documents/doc-1/preview',
`Unexpected preview URL ${requestUrls[1]}`,
);
globalThis.fetch = (async (): Promise<Response> => {
return new Response('file-bytes', {
status: 200,
headers: {
'content-disposition': 'attachment; filename="invoice.pdf"',
},
});
}) as typeof fetch;
const fileResult = await downloadDocumentFile('doc-2');
assert(fileResult.filename === 'invoice.pdf', `Unexpected download filename ${fileResult.filename}`);
assert((await fileResult.blob.text()) === 'file-bytes', 'Original download bytes mismatch');
globalThis.fetch = (async (): Promise<Response> => {
return new Response('# markdown', { status: 200 });
}) as typeof fetch;
const markdownResult = await downloadDocumentContentMarkdown('doc-3');
assert(markdownResult.filename === 'document-content.md', `Unexpected markdown filename ${markdownResult.filename}`);
assert((await markdownResult.blob.text()) === '# markdown', 'Markdown bytes mismatch');
globalThis.fetch = (async (): Promise<Response> => {
return new Response('unauthorized', { status: 401 });
}) as typeof fetch;
await assertRejects(async () => downloadDocumentContentMarkdown('doc-4'), 'Failed to download document markdown');
} finally {
globalThis.fetch = originalFetch;
}
}
await runApiTests();

View File

@@ -16,7 +16,40 @@ import type {
/**
* Resolves backend base URL from environment with localhost fallback.
*/
const API_BASE = import.meta.env.VITE_API_BASE ?? 'http://localhost:8000/api/v1';
const API_BASE = import.meta.env?.VITE_API_BASE ?? 'http://localhost:8000/api/v1';
/**
* Optional bearer token used for authenticated backend routes.
*/
const API_TOKEN = import.meta.env?.VITE_API_TOKEN?.trim();
type ApiRequestInit = Omit<RequestInit, 'headers'> & { headers?: HeadersInit };
/**
* Merges request headers and appends bearer authorization when configured.
*/
function buildRequestHeaders(headers?: HeadersInit): Headers | undefined {
if (!API_TOKEN && !headers) {
return undefined;
}
const requestHeaders = new Headers(headers);
if (API_TOKEN) {
requestHeaders.set('Authorization', `Bearer ${API_TOKEN}`);
}
return requestHeaders;
}
/**
* Executes an API request with centralized auth-header handling.
*/
function apiRequest(input: string, init: ApiRequestInit = {}): Promise<Response> {
const headers = buildRequestHeaders(init.headers);
return fetch(input, {
...init,
...(headers ? { headers } : {}),
});
}
/**
* Encodes query parameters while skipping undefined and null values.
@@ -45,6 +78,22 @@ function responseFilename(response: Response, fallback: string): string {
return match[1];
}
/**
* Triggers a browser file download for blob payloads and releases temporary object URLs.
*/
export function downloadBlobFile(blob: Blob, filename: string): void {
const objectUrl = URL.createObjectURL(blob);
const anchor = document.createElement('a');
anchor.href = objectUrl;
anchor.download = filename;
document.body.appendChild(anchor);
anchor.click();
anchor.remove();
window.setTimeout(() => {
URL.revokeObjectURL(objectUrl);
}, 0);
}
/**
* Loads documents from the backend list endpoint.
*/
@@ -72,7 +121,7 @@ export async function listDocuments(options?: {
processed_from: options?.processedFrom,
processed_to: options?.processedTo,
});
const response = await fetch(`${API_BASE}/documents${query}`);
const response = await apiRequest(`${API_BASE}/documents${query}`);
if (!response.ok) {
throw new Error('Failed to load documents');
}
@@ -108,7 +157,7 @@ export async function searchDocuments(
processed_from: options?.processedFrom,
processed_to: options?.processedTo,
});
const response = await fetch(`${API_BASE}/search${query}`);
const response = await apiRequest(`${API_BASE}/search${query}`);
if (!response.ok) {
throw new Error('Search failed');
}
@@ -128,7 +177,7 @@ export async function listProcessingLogs(options?: {
offset: options?.offset ?? 0,
document_id: options?.documentId,
});
const response = await fetch(`${API_BASE}/processing/logs${query}`);
const response = await apiRequest(`${API_BASE}/processing/logs${query}`);
if (!response.ok) {
throw new Error('Failed to load processing logs');
}
@@ -146,7 +195,7 @@ export async function trimProcessingLogs(options?: {
keep_document_sessions: options?.keepDocumentSessions ?? 2,
keep_unbound_entries: options?.keepUnboundEntries ?? 80,
});
const response = await fetch(`${API_BASE}/processing/logs/trim${query}`, {
const response = await apiRequest(`${API_BASE}/processing/logs/trim${query}`, {
method: 'POST',
});
if (!response.ok) {
@@ -159,7 +208,7 @@ export async function trimProcessingLogs(options?: {
* Clears all persisted processing logs.
*/
export async function clearProcessingLogs(): Promise<{ deleted_entries: number }> {
const response = await fetch(`${API_BASE}/processing/logs/clear`, {
const response = await apiRequest(`${API_BASE}/processing/logs/clear`, {
method: 'POST',
});
if (!response.ok) {
@@ -173,7 +222,7 @@ export async function clearProcessingLogs(): Promise<{ deleted_entries: number }
*/
export async function listTags(includeTrashed = false): Promise<string[]> {
const query = buildQuery({ include_trashed: includeTrashed });
const response = await fetch(`${API_BASE}/documents/tags${query}`);
const response = await apiRequest(`${API_BASE}/documents/tags${query}`);
if (!response.ok) {
throw new Error('Failed to load tags');
}
@@ -186,7 +235,7 @@ export async function listTags(includeTrashed = false): Promise<string[]> {
*/
export async function listPaths(includeTrashed = false): Promise<string[]> {
const query = buildQuery({ include_trashed: includeTrashed });
const response = await fetch(`${API_BASE}/documents/paths${query}`);
const response = await apiRequest(`${API_BASE}/documents/paths${query}`);
if (!response.ok) {
throw new Error('Failed to load paths');
}
@@ -199,7 +248,7 @@ export async function listPaths(includeTrashed = false): Promise<string[]> {
*/
export async function listTypes(includeTrashed = false): Promise<string[]> {
const query = buildQuery({ include_trashed: includeTrashed });
const response = await fetch(`${API_BASE}/documents/types${query}`);
const response = await apiRequest(`${API_BASE}/documents/types${query}`);
if (!response.ok) {
throw new Error('Failed to load document types');
}
@@ -228,7 +277,7 @@ export async function uploadDocuments(
formData.append('tags', options.tags);
formData.append('conflict_mode', options.conflictMode);
const response = await fetch(`${API_BASE}/documents/upload`, {
const response = await apiRequest(`${API_BASE}/documents/upload`, {
method: 'POST',
body: formData,
});
@@ -245,7 +294,7 @@ export async function updateDocumentMetadata(
documentId: string,
payload: { original_filename?: string; logical_path?: string; tags?: string[] },
): Promise<DmsDocument> {
const response = await fetch(`${API_BASE}/documents/${documentId}`, {
const response = await apiRequest(`${API_BASE}/documents/${documentId}`, {
method: 'PATCH',
headers: {
'Content-Type': 'application/json',
@@ -262,7 +311,7 @@ export async function updateDocumentMetadata(
* Moves a document to trash state without removing stored files.
*/
export async function trashDocument(documentId: string): Promise<DmsDocument> {
const response = await fetch(`${API_BASE}/documents/${documentId}/trash`, { method: 'POST' });
const response = await apiRequest(`${API_BASE}/documents/${documentId}/trash`, { method: 'POST' });
if (!response.ok) {
throw new Error('Failed to trash document');
}
@@ -273,7 +322,7 @@ export async function trashDocument(documentId: string): Promise<DmsDocument> {
* Restores a document from trash to active state.
*/
export async function restoreDocument(documentId: string): Promise<DmsDocument> {
const response = await fetch(`${API_BASE}/documents/${documentId}/restore`, { method: 'POST' });
const response = await apiRequest(`${API_BASE}/documents/${documentId}/restore`, { method: 'POST' });
if (!response.ok) {
throw new Error('Failed to restore document');
}
@@ -284,7 +333,7 @@ export async function restoreDocument(documentId: string): Promise<DmsDocument>
* Permanently deletes a document record and associated stored files.
*/
export async function deleteDocument(documentId: string): Promise<{ deleted_documents: number; deleted_files: number }> {
const response = await fetch(`${API_BASE}/documents/${documentId}`, { method: 'DELETE' });
const response = await apiRequest(`${API_BASE}/documents/${documentId}`, { method: 'DELETE' });
if (!response.ok) {
throw new Error('Failed to delete document');
}
@@ -295,7 +344,7 @@ export async function deleteDocument(documentId: string): Promise<{ deleted_docu
* Loads full details for one document, including extracted text content.
*/
export async function getDocumentDetails(documentId: string): Promise<DmsDocumentDetail> {
const response = await fetch(`${API_BASE}/documents/${documentId}`);
const response = await apiRequest(`${API_BASE}/documents/${documentId}`);
if (!response.ok) {
throw new Error('Failed to load document details');
}
@@ -306,7 +355,7 @@ export async function getDocumentDetails(documentId: string): Promise<DmsDocumen
* Re-enqueues one document for extraction and classification processing.
*/
export async function reprocessDocument(documentId: string): Promise<DmsDocument> {
const response = await fetch(`${API_BASE}/documents/${documentId}/reprocess`, {
const response = await apiRequest(`${API_BASE}/documents/${documentId}/reprocess`, {
method: 'POST',
});
if (!response.ok) {
@@ -343,6 +392,60 @@ export function contentMarkdownUrl(documentId: string): string {
return `${API_BASE}/documents/${documentId}/content-md`;
}
/**
* Downloads preview bytes for one document using centralized auth headers.
*/
export async function getDocumentPreviewBlob(documentId: string): Promise<Blob> {
const response = await apiRequest(previewUrl(documentId));
if (!response.ok) {
throw new Error('Failed to load document preview');
}
return response.blob();
}
/**
* Downloads thumbnail bytes for one document using centralized auth headers.
*/
export async function getDocumentThumbnailBlob(documentId: string): Promise<Blob> {
const response = await apiRequest(thumbnailUrl(documentId));
if (!response.ok) {
throw new Error('Failed to load document thumbnail');
}
return response.blob();
}
/**
* Downloads the original document payload with backend-provided filename fallback.
*/
export async function downloadDocumentFile(documentId: string): Promise<{ blob: Blob; filename: string }> {
const response = await apiRequest(downloadUrl(documentId));
if (!response.ok) {
throw new Error('Failed to download document');
}
const blob = await response.blob();
return {
blob,
filename: responseFilename(response, 'document-download'),
};
}
/**
* Downloads extracted markdown content for one document with backend-provided filename fallback.
*/
export async function downloadDocumentContentMarkdown(
documentId: string,
): Promise<{ blob: Blob; filename: string }> {
const response = await apiRequest(contentMarkdownUrl(documentId));
if (!response.ok) {
throw new Error('Failed to download document markdown');
}
const blob = await response.blob();
return {
blob,
filename: responseFilename(response, 'document-content.md'),
};
}
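The `responseFilename` helper used by the two download functions above is not shown in this hunk. A plausible sketch, assuming it reads the `Content-Disposition` response header and falls back to the supplied default when no filename is present (the regex and decoding details are assumptions; the project's actual implementation may differ):

```typescript
// Sketch of the filename helper referenced above, for illustration only.
function responseFilename(response: Response, fallback: string): string {
  const disposition = response.headers.get('Content-Disposition') ?? '';
  // Match filename="name.ext", bare filename=name.ext, or RFC 5987 filename*=UTF-8''name.ext.
  const match = /filename\*?=(?:UTF-8''|")?([^";]+)"?/i.exec(disposition);
  return match ? decodeURIComponent(match[1]) : fallback;
}
```

Keeping the fallback argument per call site (`'document-download'`, `'document-content.md'`) means a missing or malformed header still yields a sensible saved-file name.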
/**
* Exports extracted content markdown files for selected documents or path filters.
*/
@@ -352,7 +455,7 @@ export async function exportContentsMarkdown(payload: {
include_trashed?: boolean;
only_trashed?: boolean;
}): Promise<{ blob: Blob; filename: string }> {
const response = await fetch(`${API_BASE}/documents/content-md/export`, {
const response = await apiRequest(`${API_BASE}/documents/content-md/export`, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
@@ -373,7 +476,7 @@ export async function exportContentsMarkdown(payload: {
* Retrieves persisted application settings from backend.
*/
export async function getAppSettings(): Promise<AppSettings> {
const response = await fetch(`${API_BASE}/settings`);
const response = await apiRequest(`${API_BASE}/settings`);
if (!response.ok) {
throw new Error('Failed to load application settings');
}
@@ -384,7 +487,7 @@ export async function getAppSettings(): Promise<AppSettings> {
* Updates provider and task settings for OpenAI-compatible model execution.
*/
export async function updateAppSettings(payload: AppSettingsUpdate): Promise<AppSettings> {
const response = await fetch(`${API_BASE}/settings`, {
const response = await apiRequest(`${API_BASE}/settings`, {
method: 'PATCH',
headers: {
'Content-Type': 'application/json',
@@ -401,7 +504,7 @@ export async function updateAppSettings(payload: AppSettingsUpdate): Promise<App
* Resets persisted provider and task settings to backend defaults.
*/
export async function resetAppSettings(): Promise<AppSettings> {
const response = await fetch(`${API_BASE}/settings/reset`, {
const response = await apiRequest(`${API_BASE}/settings/reset`, {
method: 'POST',
});
if (!response.ok) {