CertWatch — Platform Overview

CertWatch automates the discovery, extraction, and tracking of organizational compliance certificates (ISO 9001, ISO 14001, EMAS, TISAX, etc.) for supply chain risk management.

What CertWatch Does

  • Organization & Site Management — Track suppliers and their physical locations
  • Certificate Lifecycle Tracking — Monitor certificates through active, expired, revoked, and pending states with an enforced state machine
  • Automated Web Scraping — Periodically scrape configured URLs for certificate data, with Playwright fallback for JS-rendered pages
  • LLM-Based Extraction — Use the OpenAI API to extract structured certificate information from unstructured HTML and PDF content
  • PDF-Aware Change Detection — Per-PDF content hashing enables selective re-extraction when individual documents change
  • Duplicate Detection — Prevent redundant certificate records using hash-based and semantic matching
  • Webhook Notifications — Push real-time events with HMAC-SHA256 payload signatures when certificates are created, updated, expiring, or expired
  • Expiry Notification Deduplication — Windowed cooldowns prevent duplicate alerts (6-day cooldown for the 7-day window, 25-day cooldown for the 30-day window)
  • Adaptive Scheduling — Automatically increase scrape frequency as certificates approach expiry (4h / 12h / 24h tiers)
  • Bulk Upload with OCR — Upload certificate files (PDF, images), extract text via OCR, and match to organizations
  • Data Import — Import organizations and certificate types from Excel/CSV files
  • Email Verification — New user registrations require email verification before login
  • Multi-Tenant Architecture — Each customer organization has fully isolated data
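
The windowed cooldown for expiry notifications can be sketched roughly as follows. This is a minimal illustration rather than CertWatch's actual code: the names should_notify and COOLDOWN_DAYS, and the idea of storing one last-sent timestamp per certificate and window, are assumptions. Only the window and cooldown lengths come from the feature list above.

```python
from datetime import datetime, timedelta
from typing import Optional

# Cooldown per expiry window, in days (values taken from the feature list).
COOLDOWN_DAYS = {7: 6, 30: 25}


def should_notify(window_days: int, last_sent: Optional[datetime], now: datetime) -> bool:
    """Return True if an expiry alert for this window may be sent at `now`.

    `last_sent` is the time of the previous alert for the same certificate
    and window, or None if no alert has been sent yet (hypothetical storage).
    """
    cooldown = timedelta(days=COOLDOWN_DAYS[window_days])
    return last_sent is None or now - last_sent >= cooldown
```

Under these assumptions, a task that runs daily can call should_notify for each certificate and window without sending the same alert twice inside one cooldown period.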

Architecture

┌──────────────┐     ┌───────────────────┐     ┌───────────┐
│   Astro SSR  │────▶│  Django REST API  │────▶│ PostgreSQL│
│  Frontend    │     │  (gunicorn)       │     │    16     │
│  :4321       │     │  :8000            │     │  :5432    │
└──────────────┘     └─────────┬─────────┘     └───────────┘
                               │
                     ┌─────────┴────────┐
                     │                  │
                ┌────▼─────┐     ┌──────▼───────┐
                │  Celery  │     │  Celery Beat │
                │  Worker  │     │  (scheduler) │
                └────┬─────┘     └──────────────┘
                     │
                ┌────▼─────┐
                │  Redis 7 │
                │  :6379   │
                └──────────┘

All services are orchestrated via Docker Compose.

Tech Stack

Layer          Technology
Backend        Python 3.12, Django 5.x, Django REST Framework
Database       PostgreSQL 16 with connection pooling (django-db-connection-pool)
Task Queue     Celery 5.x + Redis 7
Scheduling     Celery Beat with django_celery_beat
AI Extraction  OpenAI API
OCR            Tesseract via pytesseract
PDF Parsing    pdfplumber
Web Scraping   requests, beautifulsoup4, Playwright (optional fallback)
Data Import    pandas + openpyxl
Frontend       Astro 5.x with SSR (@astrojs/node)
API Docs       drf-spectacular (OpenAPI / Swagger)

Security

  • Email Verification — Registration creates an inactive user; a verification email must be confirmed before login is allowed. Tokens expire after 24 hours.
  • HMAC Webhook Signatures — Every webhook delivery includes an X-Webhook-Signature: sha256=<hex> header computed with the subscriber's signing_secret.
  • API Key & Signing Secret Rotation — Dedicated endpoints to rotate subscriber credentials without losing subscription history.
  • Rate Limiting — Configurable throttles on resource-intensive endpoints (bulk upload, scrape trigger, discovery, import).
  • Admin IP Restriction — AdminIPRestrictionMiddleware blocks /admin/ access from non-allowed IPs/CIDRs (configured via the ADMIN_ALLOWED_IPS env var). Supports X-Forwarded-For for proxy setups.
  • Filename Sanitization — Bulk upload filenames are stripped of path separators, null bytes, and HTML/script tags before storage.
  • Non-Nullable Tenant FKs — Organization, Subscriber, and UploadBatch tenant foreign keys are enforced as non-nullable at the database level.
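
On the receiving side, a webhook subscriber could check the X-Webhook-Signature header along these lines. This is a sketch under stated assumptions: the source only gives the header format and the use of HMAC-SHA256 with the signing_secret. That the signature covers the raw request body and that the secret is UTF-8 encoded are assumptions, and the function names are hypothetical.

```python
import hashlib
import hmac


def sign_payload(signing_secret: str, body: bytes) -> str:
    """Build a header value in the documented `sha256=<hex>` form.

    Assumes the HMAC is computed over the raw request body.
    """
    digest = hmac.new(signing_secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify_signature(signing_secret: str, body: bytes, header_value: str) -> bool:
    """Compare the received header against a locally computed signature."""
    expected = sign_payload(signing_secret, body)
    # compare_digest runs in constant time, which avoids leaking
    # information about the expected value through timing differences.
    return hmac.compare_digest(expected.encode("utf-8"), header_value.encode("utf-8"))
```

A subscriber would call verify_signature with the unparsed request body before decoding the JSON, and reject the delivery if it returns False.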

Reliability

  • Database Connection Pooling — SQLAlchemy-based pooling via django-db-connection-pool with configurable pool size and overflow.
  • Request ID Propagation — Every HTTP request gets a UUID (from X-Request-ID header or auto-generated). The ID propagates into Celery task headers and log records for end-to-end tracing.
  • Scrape Failure Alerting — After each periodic scrape run, the failure rate from the last 24 hours is computed. A CRITICAL log is emitted if it exceeds the configurable threshold (default 50%).
  • Scraper Retry Budget — The scrape_endpoint Celery task has max_retries=0 at the task level; HTTP retries are handled internally by the Scraper class (up to 3 attempts with exponential backoff).
  • Invitation Emails — Sent asynchronously via Celery with retry (max 3 attempts, exponential backoff).
  • Async Document Hashing — Certificate document hash computation is offloaded to a Celery task to avoid blocking save().

Multi-Tenancy

CertWatch uses a single-database, shared-schema multi-tenancy model:

  • A Tenant represents a customer organization (distinct from Organization, which is a tracked supplier).
  • Every API request carries tenant context via a TenantToken in the Authorization header.
  • TenantMiddleware resolves request.tenant from the token after authentication.
  • All viewsets inherit TenantQuerySetMixin, which filters querysets to the authenticated tenant's data.
  • New records are automatically associated with the request tenant on creation.

Data belonging to one tenant is never visible to another.
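
The isolation rule can be illustrated without Django. The sketch below is a framework-free stand-in, not the real TenantQuerySetMixin: the Record and TenantScopedStore names are hypothetical, and an in-memory list takes the place of the database. It only shows the two behaviours described above, which are that reads are filtered by the request tenant and that new records take the request tenant automatically.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    tenant_id: int
    name: str


class TenantScopedStore:
    """In-memory stand-in for a tenant-filtered queryset (illustration only)."""

    def __init__(self) -> None:
        self._rows: List[Record] = []

    def create(self, request_tenant_id: int, name: str) -> Record:
        # New records are tied to the tenant resolved for the request,
        # never to a tenant ID supplied in the request payload.
        row = Record(tenant_id=request_tenant_id, name=name)
        self._rows.append(row)
        return row

    def for_tenant(self, request_tenant_id: int) -> List[Record]:
        # Every read is filtered by the tenant of the current request.
        return [r for r in self._rows if r.tenant_id == request_tenant_id]
```

In a Django viewset the same idea would typically live in get_queryset and perform_create, but how CertWatch implements it is not shown in this overview.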

Observability

  • Structured JSON Logging — Production uses python-json-logger for machine-parseable log output. Development uses a human-readable verbose format.
  • Request ID in Logs — Every log record includes a request_id field via RequestIDFilter.
  • Health Endpoint — GET /api/health/ checks database and Redis connectivity, returns 200 or 503.
  • Monitoring Health — GET /api/monitoring/health/ includes scrape_failure_rate from the last 24 hours.
  • Scrape Logs — Every scrape attempt is recorded with status, duration, content hash, and certificate count.
  • Webhook Delivery Log — Every delivery attempt is recorded with response status, error message, and attempt count.
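
A logging filter that adds request_id to each record might look like the sketch below. RequestIDFilter is the name used in this document, but the body shown here is an assumption, as are the request_id_var context variable and the resolve_request_id helper. The plain dict lookup is also a simplification, since real header access in Django is case-insensitive.

```python
import contextvars
import logging
import uuid

# Holds the ID of the request currently being handled (assumed mechanism).
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIDFilter(logging.Filter):
    """Attach the current request ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True  # never drop the record, only annotate it


def resolve_request_id(headers: dict) -> str:
    """Reuse the X-Request-ID header if present, otherwise generate a UUID."""
    return headers.get("X-Request-ID") or str(uuid.uuid4())
```

A middleware would call request_id_var.set(resolve_request_id(...)) at the start of each request, and a formatter can then reference the request_id field.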

API Documentation

Interactive API documentation and the machine-readable schema are served at:

GET /api/docs/      — Swagger UI
GET /api/schema/    — OpenAPI 3.0 schema (JSON)