# CertWatch — Module Reference
## Module Map

| Module | Purpose | API Prefix |
|---|---|---|
| `certwatch/` | Project settings, root URL conf, Celery config, middleware | — |
| `tenants/` | Multi-tenant auth, memberships, invitations, token management | `/api/auth/` |
| `organizations/` | Organization & Site CRUD | `/api/organizations/` |
| `certificates/` | CertificateType & Certificate CRUD, public search | `/api/certificates/` |
| `monitoring/` | MonitoredEndpoint, ScrapeLog, scraper, LLM extractor, discovery | `/api/monitoring/` |
| `webhooks/` | Subscriber, Subscription, WebhookDelivery | `/api/webhooks/` |
| `imports/` | Excel/CSV data import | `/api/imports/` |
| `bulk_upload/` | Bulk certificate file upload, OCR, org matching, review | `/api/bulk-upload/` |
## certwatch

Project-level configuration, middleware, and infrastructure.

### Key Components

- **RequestIDMiddleware** — Assigns a UUID to every request (from the `X-Request-ID` header, or auto-generated). Sets `request.request_id` and returns the ID in the `X-Request-ID` response header. A companion `RequestIDFilter` injects the ID into all log records.
- **AdminIPRestrictionMiddleware** — Blocks requests to `/admin/` from IPs not in the `ADMIN_ALLOWED_IPS` env var (comma-separated IPs/CIDRs). Supports `X-Forwarded-For` for reverse-proxy setups. Empty config allows all access (dev default).
- **HealthCheckView** — `GET /api/health/` verifies database and Redis connectivity.
- **Celery signals** — `before_task_publish`, `task_prerun`, and `task_postrun` signals propagate request IDs into Celery task headers and thread-local storage for end-to-end tracing.
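The request-ID resolution rule can be sketched in plain Python — this is an illustrative helper, not the actual middleware (which also attaches the ID to the request object and response headers):

```python
import uuid


def resolve_request_id(headers: dict) -> str:
    """Return the inbound X-Request-ID if the client sent one,
    otherwise generate a fresh UUID4 string."""
    rid = headers.get("X-Request-ID")
    return rid if rid else str(uuid.uuid4())
```

Downstream log records and Celery task headers would then carry whatever this returns, giving one ID per request end to end.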
## tenants

Handles authentication, multi-tenant isolation, team management, and API key lifecycle.

### Models
| Model | Description |
|---|---|
| Tenant | A customer organization using CertWatch. Has name, slug (auto-generated), UUID primary key. |
| Membership | Links a Django User to a Tenant with a role (admin or member). Unique per user+tenant. |
| Invitation | A pending invite for a new user. Contains a unique URL-safe token, email, role, status (pending/accepted/expired), and expires_at (7 days). |
| TenantToken | API token carrying both user and tenant context. Has optional name, optional expires_at. Replaces default DRF token auth. |
| EmailVerification | Tracks email verification for new registrations. Fields: user (OneToOne to User), token (unique URL-safe string), created_at, expires_at (24 hours). |
### Key Components

- **TenantTokenAuthentication** — Custom DRF auth backend. Reads `Authorization: Token <key>`, looks up `TenantToken`, and returns `(user, token)` so `request.auth` carries tenant context.
- **TenantMiddleware** — Runs after `AuthenticationMiddleware`. Resolves `request.tenant` from the auth token. Public paths (register, login, health, etc.) skip tenant resolution.
- **TenantQuerySetMixin** — Reusable viewset mixin. Filters querysets by `request.tenant` using a configurable `tenant_field` attribute. Auto-sets the tenant on `perform_create`.
- **IsAdminMember** — Permission class requiring the user to have an `admin` role membership in the current tenant.
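The two behaviors of `TenantQuerySetMixin` — scoping reads and stamping writes — can be shown as plain-Python analogues (the real mixin works on Django querysets; these dict-based helpers are only a sketch):

```python
def scope_to_tenant(rows, tenant_id, tenant_field="tenant_id"):
    """Analogue of the queryset filter: keep only the current tenant's rows."""
    return [r for r in rows if r.get(tenant_field) == tenant_id]


def stamp_tenant(payload, tenant_id, tenant_field="tenant_id"):
    """Analogue of perform_create: force the current tenant onto new records,
    regardless of what the client submitted."""
    return {**payload, tenant_field: tenant_id}
```

Stamping on create is what prevents a client from writing rows into another tenant simply by posting a different tenant ID.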
## organizations

Manages the suppliers whose certificates are being tracked.

### Models
| Model | Description |
|---|---|
| Organization | A business entity (supplier). Fields: external_id (unique), name, website (validated URL), lei, vat, country (ISO 3166-1 alpha-2). Belongs to a Tenant. |
| Site | A physical/operational location. Fields: name, address. Belongs to an Organization. Multiple sites per organization. |
### Behavior

- On organization creation, certificate discovery is automatically triggered (async Celery task).
- List views annotate organizations with `active_certificates`, `expiring_certificates`, and `expired_certificates` counts.
- Detail views include nested sites.
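The three annotated counts amount to bucketing certificates by expiry date. A minimal sketch, assuming a 30-day "expiring" window and disjoint buckets (the real annotation's window, and whether expiring certificates also count as active, are not specified here):

```python
from datetime import date, timedelta


def certificate_counts(expiry_dates, today, expiring_window_days=30):
    """Bucket certificates into expired / expiring / active by expiry date."""
    cutoff = today + timedelta(days=expiring_window_days)
    counts = {"active": 0, "expiring": 0, "expired": 0}
    for d in expiry_dates:
        if d < today:
            counts["expired"] += 1
        elif d <= cutoff:
            counts["expiring"] += 1
        else:
            counts["active"] += 1
    return counts
```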
## certificates

Core certificate data and lifecycle management.

### Models
| Model | Description |
|---|---|
| CertificateType | Master data for certificate categories. Fields: code (unique, e.g. ISO_9001), name, description. Seeded via migrations. |
| Certificate | A compliance document issued to an organization. Fields: certificate_number, issuing_body, scope, issue_date, expiry_date, status, document_url, source_url, document_hash (SHA-256), external_id, last_expiry_notification_at. |
### Certificate Statuses

| Status | Description |
|---|---|
| `ACTIVE` | Certificate is currently valid |
| `EXPIRED` | Certificate has passed its expiry date |
| `REVOKED` | Certificate has been revoked by the issuing body |
| `PENDING` | Certificate is awaiting validation |
### Status State Machine

Status transitions are enforced by `certificates/state_machine.py` via `Certificate.clean()`:

- `PENDING` → `ACTIVE`, `REVOKED`, `EXPIRED`
- `ACTIVE` → `EXPIRED`, `REVOKED`
- `EXPIRED` → `ACTIVE` (re-certification — requires future `expiry_date`)
- `REVOKED` → (terminal — no transitions allowed)

Invalid transitions raise a `ValidationError`. The `EXPIRED` → `ACTIVE` transition additionally requires that `expiry_date` is in the future.
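The transition rules above can be expressed as a small lookup table. This sketch raises `ValueError` for illustration; the real validator runs inside `Certificate.clean()` and raises Django's `ValidationError`:

```python
from datetime import date

# Transition table transcribed from the rules above.
ALLOWED_TRANSITIONS = {
    "PENDING": {"ACTIVE", "REVOKED", "EXPIRED"},
    "ACTIVE": {"EXPIRED", "REVOKED"},
    "EXPIRED": {"ACTIVE"},  # re-certification only
    "REVOKED": set(),       # terminal state
}


def validate_transition(old, new, expiry_date=None, today=None):
    """Reject disallowed status transitions; EXPIRED -> ACTIVE also
    requires a future expiry_date."""
    if new not in ALLOWED_TRANSITIONS.get(old, set()):
        raise ValueError(f"invalid status transition: {old} -> {new}")
    if old == "EXPIRED" and new == "ACTIVE":
        today = today or date.today()
        if expiry_date is None or expiry_date <= today:
            raise ValueError("re-certification requires a future expiry_date")
```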
### Async Document Hashing
When a certificate's document_url is set or changed, the synchronous hash computation is skipped during save(). Instead, a compute_document_hash Celery task is enqueued to fetch the document and compute its SHA-256 hash asynchronously.
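The hashing the task performs is a standard streaming SHA-256. A minimal sketch of that core step (the task's fetching, retries, and model update are not shown and their details are an assumption):

```python
import hashlib


def sha256_of_stream(chunks):
    """Compute a SHA-256 hex digest from an iterable of byte chunks,
    so a large document never has to be fully buffered in memory."""
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()
```

In the Celery task, `chunks` would come from streaming the HTTP response for `document_url`; the resulting digest is stored on `document_hash`.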
### Supported Certificate Types (seeded)
ISO_9001, ISO_14001, ISO_45001, EMAS, TISAX, ISO_27001, ISO_50001, ISO_22301, ISO_13485, ISO_20000, SA_8000, OHSAS_18001, FSC, PEFC, BSCI, SEDEX, ECOVADIS, CDP, SBTi, GRI, IATF_16949, AS_9100, NADCAP, ISO_3834, EN_1090, CE_MARKING, API_SPEC, PED_2014_68_EU, ATEX, IECEx
### Signals

- **pre_save** — Caches old field values for change detection.
- **post_save** — Dispatches a `NEW_CERTIFICATE` webhook on creation, and `CERTIFICATE_UPDATED` when watched fields change (`status`, `expiry_date`, `issuing_body`, `scope`, `document_url`). Resets `last_expiry_notification_at` to `None` when `expiry_date` changes (allows fresh notifications after renewal).
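The watched-field check reduces to a set comparison between the cached pre-save values and the new ones. A plain-dict sketch of that logic (the real handlers operate on model instances):

```python
WATCHED_FIELDS = ("status", "expiry_date", "issuing_body", "scope", "document_url")


def changed_watched_fields(old: dict, new: dict) -> set:
    """Fields whose change should trigger a CERTIFICATE_UPDATED webhook."""
    return {f for f in WATCHED_FIELDS if old.get(f) != new.get(f)}
```

If the returned set is non-empty, the `post_save` handler dispatches the webhook; if it contains `expiry_date`, the notification timestamp is also reset.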
## monitoring

Automated web scraping pipeline for certificate discovery and extraction.

### Models
| Model | Description |
|---|---|
| MonitoredEndpoint | A URL to periodically scrape. Fields: url, organization (FK), check_frequency (duration, default 24h), last_checked, last_content_hash (MD5), processed_pdf_urls (JSON list), pdf_content_hashes (JSON dict — URL→SHA-256), is_active. |
| ScrapeLog | Audit record per scrape attempt. Fields: endpoint (FK), timestamp, status (SUCCESS/FAILED/NO_CHANGE), content_hash, certificates_found, error_message, duration_ms. |
### Scraping Pipeline

1. **Fetch** — HTTP GET (with Playwright fallback for JS-rendered pages). Extracts HTML and discovers linked PDFs.
2. **Change Detection** — MD5 hash comparison against `last_content_hash`. Skips extraction if unchanged.
3. **PDF Processing** — Downloads linked PDFs, extracts text via `pdfplumber`. Tracks processed URLs to avoid re-downloading. Per-PDF content hashes (`pdf_content_hashes`) enable selective re-extraction when individual PDFs change, even if the HTML page is unchanged.
4. **LLM Extraction** — Sends content to the OpenAI API with a structured prompt. Parses the JSON response into `ExtractedCertificate` objects.
5. **Duplicate Detection** — Multi-strategy: exact document hash, certificate number match, semantic field comparison, SimHash similarity.
6. **Certificate Creation** — Creates or updates `Certificate` records. Triggers webhook signals.
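The selective per-PDF re-extraction in step 3 is a hash-map diff: compare each downloaded PDF's digest against the stored `pdf_content_hashes` entry. A sketch of that comparison (the real code also persists the updated map back to `MonitoredEndpoint`):

```python
import hashlib


def pdfs_needing_reextraction(pdf_bodies: dict, known_hashes: dict) -> list:
    """Return URLs of PDFs that are new or whose content changed.

    pdf_bodies:   url -> raw bytes just downloaded
    known_hashes: url -> stored SHA-256 hex digest (pdf_content_hashes)
    """
    changed = []
    for url, body in pdf_bodies.items():
        digest = hashlib.sha256(body).hexdigest()
        if known_hashes.get(url) != digest:
            changed.append(url)
    return changed
```

Only the returned URLs are sent on to LLM extraction, so one updated PDF does not force re-processing of every document on the page.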
### Playwright Fallback
The scraper includes an automatic Playwright fallback for pages that cannot be fetched with standard HTTP requests. This handles JavaScript-rendered certificate pages that return incomplete content via plain HTTP GET.
Trigger conditions — the fallback activates when the initial HTTP request encounters any of:
- HTTP 403 Forbidden — the target server blocks non-browser requests
- Suspiciously small responses — the response body is too small to contain meaningful content (likely a JS-only shell)
- Connection errors — the initial request fails entirely
When triggered, the scraper launches a headless Chromium browser via Playwright, renders the page with full JavaScript execution, and extracts the resulting HTML.
Playwright is an optional dependency. If Playwright is not installed, the fallback is skipped and the scraper proceeds with whatever content the HTTP request returned. To enable the fallback in production:
```shell
# Install the Playwright Python package (if not already in requirements)
pip install playwright

# Install the Chromium browser binary
playwright install chromium
```

Note: the `playwright install chromium` command downloads a Chromium binary (~150 MB). Run it during Docker image build or deployment provisioning, not at runtime.
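The three trigger conditions can be collapsed into one decision function. A sketch under the assumption of a 500-byte "suspiciously small" threshold (the scraper's actual cutoff is not stated here):

```python
def should_use_playwright(status_code, body, error=False, min_bytes=500):
    """Decide whether to fall back to headless rendering, mirroring the
    three trigger conditions: connection error, HTTP 403, tiny response."""
    if error:                 # initial request failed entirely
        return True
    if status_code == 403:    # server blocks non-browser clients
        return True
    if body is not None and len(body) < min_bytes:
        return True           # likely a JS-only shell page
    return False
```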
### Adaptive Scheduling
Check frequency adjusts based on certificate expiry proximity:
| Condition | Frequency |
|---|---|
| Certificates expiring within 7 days | Every 4 hours |
| Certificates expiring within 30 days | Every 12 hours |
| No expiring certificates | Every 24 hours (default) |
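The table above maps directly onto a threshold function. A sketch taking the days until the endpoint's nearest certificate expiry (`None` meaning no expiring certificates):

```python
from datetime import timedelta


def check_frequency(days_until_nearest_expiry):
    """Scrape interval per the adaptive-scheduling table."""
    if days_until_nearest_expiry is None:
        return timedelta(hours=24)   # default: nothing expiring
    if days_until_nearest_expiry <= 7:
        return timedelta(hours=4)    # imminent expiry: check often
    if days_until_nearest_expiry <= 30:
        return timedelta(hours=12)
    return timedelta(hours=24)
```

The result would feed `MonitoredEndpoint.check_frequency`, which is a duration field with a 24h default.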
### Website Discovery

The `WebsiteDiscovery` class crawls an organization's website to find certificate-related pages and PDFs. It:

- Starts from the organization's homepage
- Follows links matching certificate-related keywords
- Classifies discovered pages and documents
- Auto-creates `MonitoredEndpoint` records for relevant URLs
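The link filter in the second step is keyword matching over the URL and anchor text. A sketch with a hypothetical keyword list (the actual keywords used by `WebsiteDiscovery` are not documented here):

```python
# Hypothetical keyword list; "certificat" also matches "certification"/"certificate".
CERT_KEYWORDS = ("certificat", "iso", "audit", "compliance", "quality")


def is_certificate_link(href: str, link_text: str = "") -> bool:
    """Heuristic: follow a link if its URL or anchor text mentions
    any certificate-related keyword."""
    haystack = f"{href} {link_text}".lower()
    return any(k in haystack for k in CERT_KEYWORDS)
```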
## webhooks

Push notifications for certificate lifecycle events. See the Webhooks Documentation for full details.

### Models
| Model | Description |
|---|---|
| Subscriber | An external system receiving notifications. Fields: name, webhook_url, api_key (auto-generated), signing_secret (auto-generated, for HMAC signatures), is_active. Belongs to a Tenant. |
| Subscription | Defines what events a subscriber wants. Links subscriber to optional organization, optional certificate_type, and a list of event_types. |
| WebhookDelivery | Audit log per delivery attempt. Fields: subscriber, event_type, payload (JSON), response_status, error_message, attempts, status (PENDING/SUCCESS/FAILED), delivered_at. |
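The `signing_secret` lets subscribers authenticate deliveries. A sketch of HMAC signing and constant-time verification, assuming HMAC-SHA256 over the raw payload with a hex digest (a common scheme; the exact header name and digest format CertWatch uses are not specified here):

```python
import hashlib
import hmac


def sign_payload(signing_secret: str, payload: bytes) -> str:
    """HMAC-SHA256 hex signature the sender would attach to a delivery."""
    return hmac.new(signing_secret.encode(), payload, hashlib.sha256).hexdigest()


def verify_signature(signing_secret: str, payload: bytes, received: str) -> bool:
    """Subscriber-side check; compare_digest avoids timing side channels."""
    return hmac.compare_digest(sign_payload(signing_secret, payload), received)
```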
## imports

Bulk data import from spreadsheet files.

### Behavior

- Accepts `.xlsx`, `.xls`, and `.csv` files via multipart upload.
- Uses `pandas` + `openpyxl` to parse the file.
- Creates or updates `Organization` and `CertificateType` records.
- Associates imported organizations with the authenticated tenant.
- Automatically triggers certificate discovery for newly created organizations.
- Returns counts of created/updated records and any errors.
- Supports partial success (HTTP 207) when some rows fail.
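The response-status choice can be sketched as follows. Only the 207 partial-success behavior is stated above; the 200 and 400 branches are assumptions for illustration:

```python
def import_response_status(created: int, updated: int, errors: list) -> int:
    """Pick an HTTP status for an import result: 207 when some rows
    succeeded and some failed, 400 when nothing imported (assumption),
    200 when every row succeeded (assumption)."""
    if errors and (created or updated):
        return 207  # Multi-Status: partial success
    if errors:
        return 400
    return 200
```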
## bulk_upload

Bulk certificate file upload with OCR extraction, AI-powered data extraction, organization matching, and human review.

### Models
| Model | Description |
|---|---|
| UploadBatch | A group of files uploaded together. Fields: tenant (FK), status (pending/processing/completed). |
| UploadItem | A single file within a batch. Fields: original_filename, stored_file, file_type, file_size, status, extracted_text, extracted_data (JSON), matched_organization (FK), match_confidence, certificate (FK), error_message, user_corrections (JSON). |
### Upload Item Status Flow

```
pending → processing → matched            (auto-matched, certificate created)
                     → needs_review       (low confidence or missing fields)
                     → extraction_failed  (OCR/LLM error)

needs_review → matched  (user confirms)
             → skipped  (user skips)
```
### Processing Pipeline

1. **Upload** — Files validated (extensions: `.pdf`, `.png`, `.jpg`, `.jpeg`, `.tiff`, `.webp`; max 20 MB each). Stored in `media/bulk_uploads/`.
2. **Filename Parsing** — Extracts hints from the filename (org name, cert type).
3. **Text Extraction** — PDF text via `pdfplumber`, image OCR via `pytesseract`.
4. **LLM Extraction** — Sends extracted text + filename hints to OpenAI. Parses structured certificate fields.
5. **Organization Matching** — Fuzzy-matches the extracted org name against existing organizations.
6. **Auto-Match** — If confidence ≥ 0.8 and all required fields are present → auto-creates the certificate, status = `matched`.
7. **Review Routing** — Otherwise → status = `needs_review` for human review.
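The routing decision in steps 6–7 can be sketched as one function. The 0.8 threshold comes from the pipeline above; the exact list of required fields is an assumption:

```python
# Hypothetical required-field list; the real pipeline may check more fields.
REQUIRED_FIELDS = ("certificate_number", "issue_date", "expiry_date")


def route_upload_item(match_confidence: float, extracted: dict,
                      threshold: float = 0.8) -> str:
    """Auto-match only when confidence clears the threshold AND every
    required field was extracted; otherwise send to human review."""
    if match_confidence >= threshold and all(extracted.get(f) for f in REQUIRED_FIELDS):
        return "matched"
    return "needs_review"
```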
### Quality Metrics
- Auto-match rate per batch and aggregate
- Extraction failure rate
- User correction tracking (which fields were changed during review)