CertWatch — Platform Overview
CertWatch automates the discovery, extraction, and tracking of organizational compliance certificates (ISO 9001, ISO 14001, EMAS, TISAX, etc.) for supply chain risk management.
What CertWatch Does
- Organization & Site Management — Track suppliers and their physical locations
- Certificate Lifecycle Tracking — Monitor certificates through active, expired, revoked, and pending states with an enforced state machine
- Automated Web Scraping — Periodically scrape configured URLs for certificate data, with Playwright fallback for JS-rendered pages
- LLM-Based Extraction — Use OpenAI API to extract structured certificate information from unstructured HTML and PDF content
- PDF-Aware Change Detection — Per-PDF content hashing enables selective re-extraction when individual documents change
- Duplicate Detection — Prevent redundant certificate records using hash-based and semantic matching
- Webhook Notifications — Push real-time events with HMAC-SHA256 payload signatures when certificates are created, updated, expiring, or expired
- Expiry Notification Deduplication — Windowed cooldowns prevent duplicate alerts (6-day cooldown for 7-day window, 25-day cooldown for 30-day window)
- Adaptive Scheduling — Automatically increase scrape frequency as certificates approach expiry (4h / 12h / 24h tiers)
- Bulk Upload with OCR — Upload certificate files (PDF, images), extract text via OCR, and match to organizations
- Data Import — Import organizations and certificate types from Excel/CSV files
- Email Verification — New user registrations require email verification before login
- Multi-Tenant Architecture — Each customer organization has fully isolated data
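The windowed cooldowns and scheduling tiers listed above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the 7-day and 30-day cut-offs used to pick a scrape tier are assumptions, since the list gives only the tier intervals (4h / 12h / 24h).

```python
from datetime import date, datetime, timedelta
from typing import Optional

# Cooldowns from the feature list: 7-day window -> 6 days, 30-day window -> 25 days.
COOLDOWN_DAYS = {7: 6, 30: 25}


def should_notify(window_days: int, last_sent: Optional[datetime], now: datetime) -> bool:
    """Return True when an expiry alert for this window is outside its cooldown."""
    if last_sent is None:
        return True  # no alert has been sent for this window yet
    return now - last_sent >= timedelta(days=COOLDOWN_DAYS[window_days])


def scrape_interval_hours(expires_on: date, today: date) -> int:
    """Pick a scrape tier. The 7- and 30-day cut-offs are assumed, not documented."""
    days_left = (expires_on - today).days
    if days_left <= 7:
        return 4
    if days_left <= 30:
        return 12
    return 24
```

Because each cooldown is slightly shorter than its window, a certificate can trigger at most one alert per window while still being re-alerted if it stays in that state across a full cycle.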
Architecture
┌─────────────┐     ┌──────────────────┐     ┌────────────┐
│ Astro SSR   │────▶│ Django REST API  │────▶│ PostgreSQL │
│ Frontend    │     │ (gunicorn)       │     │ 16         │
│ :4321       │     │ :8000            │     │ :5432      │
└─────────────┘     └────────┬─────────┘     └────────────┘
                             │
                    ┌────────┴─────────┐
                    │                  │
               ┌────▼────┐      ┌──────▼──────┐
               │ Celery  │      │ Celery Beat │
               │ Worker  │      │ (scheduler) │
               └────┬────┘      └─────────────┘
                    │
               ┌────▼────┐
               │ Redis 7 │
               │ :6379   │
               └─────────┘
All services are orchestrated via Docker Compose.
Tech Stack
| Layer | Technology |
|---|---|
| Backend | Python 3.12, Django 5.x, Django REST Framework |
| Database | PostgreSQL 16 with connection pooling (django-db-connection-pool) |
| Task Queue | Celery 5.x + Redis 7 |
| Scheduling | Celery Beat with django_celery_beat |
| AI Extraction | OpenAI API |
| OCR | Tesseract via pytesseract |
| PDF Parsing | pdfplumber |
| Web Scraping | requests, beautifulsoup4, Playwright (optional fallback) |
| Data Import | pandas + openpyxl |
| Frontend | Astro 5.x with SSR (@astrojs/node) |
| API Docs | drf-spectacular (OpenAPI / Swagger) |
Security
- Email Verification — Registration creates an inactive user; a verification email must be confirmed before login is allowed. Tokens expire after 24 hours.
- HMAC Webhook Signatures — Every webhook delivery includes an `X-Webhook-Signature: sha256=<hex>` header computed with the subscriber's `signing_secret`.
- API Key & Signing Secret Rotation — Dedicated endpoints to rotate subscriber credentials without losing subscription history.
- Rate Limiting — Configurable throttles on resource-intensive endpoints (bulk upload, scrape trigger, discovery, import).
- Admin IP Restriction — `AdminIPRestrictionMiddleware` blocks `/admin/` access from non-allowed IPs/CIDRs (configured via `ADMIN_ALLOWED_IPS` env var). Supports `X-Forwarded-For` for proxy setups.
- Filename Sanitization — Bulk upload filenames are stripped of path separators, null bytes, and HTML/script tags before storage.
- Non-Nullable Tenant FKs — Organization, Subscriber, and UploadBatch tenant foreign keys are enforced as non-nullable at the database level.
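A webhook receiver could check the `X-Webhook-Signature` header along the following lines. This is a minimal sketch that assumes the signature is an HMAC-SHA256 hex digest over the raw request body; the exact bytes CertWatch signs are not stated here, and the function names are hypothetical.

```python
import hashlib
import hmac


def sign_payload(signing_secret: str, body: bytes) -> str:
    """Build the header value, assuming the raw request body is what gets signed."""
    digest = hmac.new(signing_secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify_signature(signing_secret: str, body: bytes, header_value: str) -> bool:
    """Compare in constant time so the check does not leak how many characters matched."""
    expected = sign_payload(signing_secret, body)
    return hmac.compare_digest(expected.encode("utf-8"), header_value.encode("utf-8"))
```

Verification should run against the bytes exactly as received, before any JSON parsing or re-serialization, because even whitespace changes alter the digest.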
Reliability
- Database Connection Pooling — SQLAlchemy-based pooling via `django-db-connection-pool` with configurable pool size and overflow.
- Request ID Propagation — Every HTTP request gets a UUID (from `X-Request-ID` header or auto-generated). The ID propagates into Celery task headers and log records for end-to-end tracing.
- Scrape Failure Alerting — After each periodic scrape run, the failure rate from the last 24 hours is computed. A CRITICAL log is emitted if it exceeds the configurable threshold (default 50%).
- Scraper Retry Budget — The `scrape_endpoint` Celery task has `max_retries=0` at the task level; HTTP retries are handled internally by the Scraper class (up to 3 attempts with exponential backoff).
- Invitation Emails — Sent asynchronously via Celery with retry (max 3 attempts, exponential backoff).
- Async Document Hashing — Certificate document hash computation is offloaded to a Celery task to avoid blocking `save()`.
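The retry budget described above (up to 3 attempts with exponential backoff, handled inside the scraper rather than by Celery) might look roughly like this. The helper name, the 1-second base delay, and the doubling factor are assumptions for illustration; the actual Scraper class may differ.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,  # assumed value; the real delay is not documented
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() up to max_attempts times, doubling the wait after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Keeping retries inside the fetch helper means a failed scrape is reported once, after the budget is spent, instead of being re-queued as a new Celery task.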
Multi-Tenancy
CertWatch uses a single-database, shared-schema multi-tenancy model:
- A Tenant represents a customer organization (distinct from Organization, which is a tracked supplier).
- Every API request carries tenant context via a TenantToken in the `Authorization` header.
- TenantMiddleware resolves `request.tenant` from the token after authentication.
- All viewsets inherit TenantQuerySetMixin, which filters querysets to the authenticated tenant's data.
- New records are automatically associated with the request tenant on creation.
Data belonging to one tenant is never visible to another.
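The scoping pattern can be illustrated without Django. The sketch below is a framework-free stand-in, not the actual TenantQuerySetMixin: `Record` and `TenantScopedStore` are hypothetical names, and a plain list takes the place of a queryset.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    name: str
    tenant_id: int


class TenantScopedStore:
    """Stand-in for a viewset whose reads and writes are bound to the request tenant."""

    def __init__(self, rows: List[Record]) -> None:
        self._rows = rows

    def get_queryset(self, request_tenant_id: int) -> List[Record]:
        # Every read is filtered by the tenant resolved for the request.
        return [r for r in self._rows if r.tenant_id == request_tenant_id]

    def perform_create(self, request_tenant_id: int, name: str) -> Record:
        # The tenant comes from the request, never from client-supplied data.
        record = Record(name=name, tenant_id=request_tenant_id)
        self._rows.append(record)
        return record
```

Applying the filter in one shared place, rather than in each view, is what makes the isolation guarantee hard to bypass by accident.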
Observability
- Structured JSON Logging — Production uses `python-json-logger` for machine-parseable log output. Development uses human-readable verbose format.
- Request ID in Logs — Every log record includes a `request_id` field via `RequestIDFilter`.
- Health Endpoint — `GET /api/health/` checks database and Redis connectivity, returns `200` or `503`.
- Monitoring Health — `GET /api/monitoring/health/` includes `scrape_failure_rate` from the last 24 hours.
- Scrape Logs — Every scrape attempt is recorded with status, duration, content hash, and certificate count.
- Webhook Delivery Log — Every delivery attempt is recorded with response status, error message, and attempt count.
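One plausible way to attach a `request_id` to every log record uses a standard `logging.Filter` with a context variable. The sketch below is a guess at the shape of such a filter, not CertWatch's actual `RequestIDFilter`; `resolve_request_id` and the `"-"` placeholder are assumptions.

```python
import contextvars
import logging
import uuid
from typing import Optional

# Holds the ID for the current request; "-" is an assumed placeholder outside requests.
_request_id = contextvars.ContextVar("request_id", default="-")


def resolve_request_id(header_value: Optional[str]) -> str:
    """Reuse an incoming X-Request-ID value or generate a UUID, then remember it."""
    request_id = header_value or str(uuid.uuid4())
    _request_id.set(request_id)
    return request_id


class RequestIDFilter(logging.Filter):
    """Copy the current request ID onto each record so formatters can emit it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = _request_id.get()
        return True  # never drop records; this filter only annotates them
```

With the filter added to a handler, a format string such as `"%(request_id)s %(message)s"` (or a JSON formatter) can include the field on every line.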
API Documentation
The API schema and an interactive Swagger UI are served at:
GET /api/docs/ — Swagger UI
GET /api/schema/ — OpenAPI 3.0 schema (JSON)