CertWatch — Module Reference

Module Map

Module | Purpose | API Prefix
certwatch/ | Project settings, root URL conf, Celery config, middleware |
tenants/ | Multi-tenant auth, memberships, invitations, token management | /api/auth/
organizations/ | Organization & Site CRUD | /api/organizations/
certificates/ | CertificateType & Certificate CRUD, public search | /api/certificates/
monitoring/ | MonitoredEndpoint, ScrapeLog, scraper, LLM extractor, discovery | /api/monitoring/
webhooks/ | Subscriber, Subscription, WebhookDelivery | /api/webhooks/
imports/ | Excel/CSV data import | /api/imports/
bulk_upload/ | Bulk certificate file upload, OCR, org matching, review | /api/bulk-upload/

certwatch

Project-level configuration, middleware, and infrastructure.

Key Components

  • RequestIDMiddleware — Assigns a UUID to every request (from X-Request-ID header or auto-generated). Sets request.request_id and returns the ID in the X-Request-ID response header. A companion RequestIDFilter injects the ID into all log records.
  • AdminIPRestrictionMiddleware — Blocks requests to /admin/ from IPs not in the ADMIN_ALLOWED_IPS env var (comma-separated IPs/CIDRs). Supports X-Forwarded-For for reverse proxy setups. Empty config allows all access (dev default).
  • HealthCheckView — GET /api/health/ verifies database and Redis connectivity.
  • Celery signals — before_task_publish, task_prerun, and task_postrun signals propagate request IDs into Celery task headers and thread-local storage for end-to-end tracing.
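
The allowlist check behind AdminIPRestrictionMiddleware can be sketched with the standard ipaddress module. This is a minimal illustration, not the project's code: the function names are hypothetical, and taking the first X-Forwarded-For entry as the client address is an assumption that only holds behind a trusted proxy.

```python
import ipaddress


def parse_allowed_networks(raw: str) -> list:
    """Parse a comma-separated string of IPs/CIDRs (ADMIN_ALLOWED_IPS style)."""
    networks = []
    for entry in raw.split(","):
        entry = entry.strip()
        if entry:
            # strict=False also accepts host addresses written with a prefix
            networks.append(ipaddress.ip_network(entry, strict=False))
    return networks


def client_ip(remote_addr: str, forwarded_for: str = "") -> str:
    """Assumption: the first X-Forwarded-For hop is the real client."""
    if forwarded_for:
        return forwarded_for.split(",")[0].strip()
    return remote_addr


def is_admin_allowed(ip: str, allowed: str) -> bool:
    """Return True when the IP may reach /admin/."""
    networks = parse_allowed_networks(allowed)
    if not networks:
        return True  # empty config allows all access (dev default)
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)
```

A single IP such as 192.168.1.10 is treated as a /32 network, so plain addresses and CIDR ranges go through the same membership test.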

tenants

Handles authentication, multi-tenant isolation, team management, and API key lifecycle.

Models

Model | Description
Tenant | A customer organization using CertWatch. Has name, slug (auto-generated), UUID primary key.
Membership | Links a Django User to a Tenant with a role (admin or member). Unique per user+tenant.
Invitation | A pending invite for a new user. Contains a unique URL-safe token, email, role, status (pending/accepted/expired), and expires_at (7 days).
TenantToken | API token carrying both user and tenant context. Has optional name, optional expires_at. Replaces default DRF token auth.
EmailVerification | Tracks email verification for new registrations. Fields: user (OneToOne to User), token (unique URL-safe string), created_at, expires_at (24 hours).

Key Components

  • TenantTokenAuthentication — Custom DRF auth backend. Reads Authorization: Token <key>, looks up TenantToken, returns (user, token) so request.auth carries tenant context.
  • TenantMiddleware — Runs after AuthenticationMiddleware. Resolves request.tenant from the auth token. Public paths (register, login, health, etc.) skip tenant resolution.
  • TenantQuerySetMixin — Reusable viewset mixin. Filters querysets by request.tenant using a configurable tenant_field attribute. Auto-sets tenant on perform_create.
  • IsAdminMember — Permission class requiring the user to have an admin role membership in the current tenant.
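
The token lookup performed by TenantTokenAuthentication can be illustrated without Django or DRF. The sketch below is framework-free and hypothetical: the TenantToken dataclass, the in-memory store, and the case-insensitive keyword check are assumptions made for the example, while the Authorization: Token <key> format and the (user, token) return value come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple


@dataclass
class TenantToken:
    """Stand-in for the TenantToken model (fields reduced for the sketch)."""
    key: str
    user: str
    tenant: str
    expires_at: Optional[datetime] = None


def parse_token_header(header: str) -> Optional[str]:
    """Extract <key> from 'Token <key>', or None if the header is malformed."""
    parts = header.split()
    if len(parts) != 2 or parts[0].lower() != "token":
        return None
    return parts[1]


def authenticate(
    header: str,
    store: Dict[str, TenantToken],
    now: Optional[datetime] = None,
) -> Optional[Tuple[str, TenantToken]]:
    """Return (user, token) so the caller can read token.tenant, else None."""
    key = parse_token_header(header)
    if key is None:
        return None
    token = store.get(key)
    if token is None:
        return None
    now = now or datetime.now(timezone.utc)
    if token.expires_at is not None and token.expires_at <= now:
        return None  # expired tokens are rejected
    return token.user, token
```

Returning the token as the second element is what lets a later middleware step read the tenant from request.auth.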

organizations

Manages the suppliers whose certificates are being tracked.

Models

Model | Description
Organization | A business entity (supplier). Fields: external_id (unique), name, website (validated URL), lei, vat, country (ISO 3166-1 alpha-2). Belongs to a Tenant.
Site | A physical/operational location. Fields: name, address. Belongs to an Organization. Multiple sites per organization.

Behavior

  • On organization creation, certificate discovery is automatically triggered (async Celery task).
  • List views annotate organizations with active_certificates, expiring_certificates, and expired_certificates counts.
  • Detail views include nested sites.

certificates

Core certificate data and lifecycle management.

Models

Model | Description
CertificateType | Master data for certificate categories. Fields: code (unique, e.g. ISO_9001), name, description. Seeded via migrations.
Certificate | A compliance document issued to an organization. Fields: certificate_number, issuing_body, scope, issue_date, expiry_date, status, document_url, source_url, document_hash (SHA-256), external_id, last_expiry_notification_at.

Certificate Statuses

Status | Description
ACTIVE | Certificate is currently valid
EXPIRED | Certificate has passed its expiry date
REVOKED | Certificate has been revoked by the issuing body
PENDING | Certificate is awaiting validation

Status State Machine

Status transitions are enforced by certificates/state_machine.py via Certificate.clean():

PENDING  → ACTIVE, REVOKED, EXPIRED
ACTIVE   → EXPIRED, REVOKED
EXPIRED  → ACTIVE  (re-certification — requires future expiry_date)
REVOKED  → (terminal — no transitions allowed)

Invalid transitions raise a ValidationError. The EXPIRED → ACTIVE transition additionally requires that expiry_date is in the future.
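
The transition table above maps naturally onto a dictionary lookup. The following is a minimal sketch, not the contents of certificates/state_machine.py: TransitionError stands in for Django's ValidationError, validate_transition is a hypothetical name, and treating an unchanged status as a no-op is an assumption.

```python
from datetime import date
from typing import Optional

# Allowed target statuses per current status, as listed above
ALLOWED_TRANSITIONS = {
    "PENDING": {"ACTIVE", "REVOKED", "EXPIRED"},
    "ACTIVE": {"EXPIRED", "REVOKED"},
    "EXPIRED": {"ACTIVE"},
    "REVOKED": set(),  # terminal
}


class TransitionError(ValueError):
    """Stand-in for django.core.exceptions.ValidationError in this sketch."""


def validate_transition(
    old: str,
    new: str,
    expiry_date: Optional[date] = None,
    today: Optional[date] = None,
) -> None:
    """Raise TransitionError if moving from old to new is not allowed."""
    if old == new:
        return  # assumption: saving without a status change is always fine
    if new not in ALLOWED_TRANSITIONS.get(old, set()):
        raise TransitionError(f"Cannot transition from {old} to {new}")
    if old == "EXPIRED" and new == "ACTIVE":
        today = today or date.today()
        # Re-certification requires an expiry date in the future
        if expiry_date is None or expiry_date <= today:
            raise TransitionError("Re-certification requires a future expiry_date")
```

Keeping the rules in one mapping means adding a status only touches the table, not the validation logic.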

Async Document Hashing

When a certificate's document_url is set or changed, the synchronous hash computation is skipped during save(). Instead, a compute_document_hash Celery task is enqueued to fetch the document and compute its SHA-256 hash asynchronously.
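
The hashing step inside such a task could look like the sketch below. It only covers digest computation over an already opened byte stream; fetching the document and the Celery wiring are left out, and the function name and chunk size are assumptions.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 64 * 1024) -> str:
    """Compute a SHA-256 hex digest without loading the whole file into memory."""
    digest = hashlib.sha256()
    # iter() with a sentinel keeps reading until read() returns empty bytes
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Usage with an in-memory stand-in for a downloaded document
document_hash = sha256_of_stream(io.BytesIO(b"%PDF-1.7 example bytes"))
```

Reading in chunks matters for large PDFs, since the worker never holds more than chunk_size bytes of the document at once.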

Supported Certificate Types (seeded)

ISO_9001, ISO_14001, ISO_45001, EMAS, TISAX, ISO_27001, ISO_50001, ISO_22301, ISO_13485, ISO_20000, SA_8000, OHSAS_18001, FSC, PEFC, BSCI, SEDEX, ECOVADIS, CDP, SBTi, GRI, IATF_16949, AS_9100, NADCAP, ISO_3834, EN_1090, CE_MARKING, API_SPEC, PED_2014_68_EU, ATEX, IECEx

Signals

  • pre_save — Caches old field values for change detection.
  • post_save — Dispatches NEW_CERTIFICATE webhook on creation, CERTIFICATE_UPDATED when watched fields change (status, expiry_date, issuing_body, scope, document_url). Resets last_expiry_notification_at to None when expiry_date changes (allows fresh notifications after renewal).
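
The change detection done by this signal pair can be sketched with plain dictionaries standing in for the cached old values and the saved instance. The watched field list and event names come from the description above; the function names are hypothetical.

```python
from typing import List, Optional

WATCHED_FIELDS = ("status", "expiry_date", "issuing_body", "scope", "document_url")


def changed_watched_fields(old: dict, new: dict) -> List[str]:
    """Return the watched fields whose values differ between old and new."""
    return [field for field in WATCHED_FIELDS if old.get(field) != new.get(field)]


def event_for_save(created: bool, old: dict, new: dict) -> Optional[str]:
    """Pick the webhook event to dispatch after a save, if any."""
    if created:
        return "NEW_CERTIFICATE"
    if changed_watched_fields(old, new):
        return "CERTIFICATE_UPDATED"
    return None  # only unwatched fields changed


def should_reset_notification(old: dict, new: dict) -> bool:
    """last_expiry_notification_at is reset when expiry_date changes."""
    return "expiry_date" in changed_watched_fields(old, new)
```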

monitoring

Automated web scraping pipeline for certificate discovery and extraction.

Models

Model | Description
MonitoredEndpoint | A URL to periodically scrape. Fields: url, organization (FK), check_frequency (duration, default 24h), last_checked, last_content_hash (MD5), processed_pdf_urls (JSON list), pdf_content_hashes (JSON dict — URL→SHA-256), is_active.
ScrapeLog | Audit record per scrape attempt. Fields: endpoint (FK), timestamp, status (SUCCESS/FAILED/NO_CHANGE), content_hash, certificates_found, error_message, duration_ms.

Scraping Pipeline

  1. Fetch — HTTP GET (with Playwright fallback for JS-rendered pages). Extracts HTML and discovers linked PDFs.
  2. Change Detection — MD5 hash comparison against last_content_hash. Skips extraction if unchanged.
  3. PDF Processing — Downloads linked PDFs, extracts text via pdfplumber. Tracks processed URLs to avoid re-downloading. Per-PDF content hashes (pdf_content_hashes) enable selective re-extraction when individual PDFs change, even if the HTML page is unchanged.
  4. LLM Extraction — Sends content to OpenAI API with a structured prompt. Parses JSON response into ExtractedCertificate objects.
  5. Duplicate Detection — Multi-strategy: exact document hash, certificate number match, semantic field comparison, SimHash similarity.
  6. Certificate Creation — Creates or updates Certificate records. Triggers webhook signals.
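
Steps 2 and 3 both reduce to hash comparisons. The sketch below shows one way to express them, assuming the page body and PDF bodies have already been downloaded as bytes; the function names are hypothetical.

```python
import hashlib
from typing import Dict, List, Optional


def page_changed(html: bytes, last_content_hash: Optional[str]) -> bool:
    """Step 2: compare the page's MD5 against the stored last_content_hash."""
    return hashlib.md5(html).hexdigest() != last_content_hash


def pdfs_to_extract(
    pdf_bodies: Dict[str, bytes],
    pdf_content_hashes: Dict[str, str],
) -> List[str]:
    """Step 3: return URLs whose SHA-256 is new or differs from the stored one."""
    changed = []
    for url, body in pdf_bodies.items():
        if pdf_content_hashes.get(url) != hashlib.sha256(body).hexdigest():
            changed.append(url)
    return changed
```

Because the PDF check is independent of the page check, a replaced PDF behind an unchanged HTML page still shows up in the result.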

Playwright Fallback

The scraper includes an automatic Playwright fallback for pages that cannot be fetched with standard HTTP requests. This handles JavaScript-rendered certificate pages that return incomplete content via plain HTTP GET.

Trigger conditions — the fallback activates when the initial HTTP request encounters any of:

  • HTTP 403 Forbidden — the target server blocks non-browser requests
  • Suspiciously small responses — the response body is too small to contain meaningful content (likely a JS-only shell)
  • Connection errors — the initial request fails entirely

When triggered, the scraper launches a headless Chromium browser via Playwright, renders the page with full JavaScript execution, and extracts the resulting HTML.
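
The trigger conditions can be expressed as a small predicate. This is a sketch under assumptions: the document does not state the size threshold, so MIN_BODY_BYTES is a placeholder value, and the function names are hypothetical. The availability check uses importlib so that a missing Playwright install is detected without raising.

```python
import importlib.util
from typing import Optional

MIN_BODY_BYTES = 512  # assumed threshold for a "suspiciously small" response


def needs_browser_fallback(
    status_code: Optional[int],
    body: bytes,
    error: Optional[Exception] = None,
) -> bool:
    """Decide whether the plain HTTP result should be retried in a browser."""
    if error is not None:
        return True  # connection error: the initial request failed entirely
    if status_code == 403:
        return True  # server blocks non-browser requests
    return len(body) < MIN_BODY_BYTES  # likely a JS-only shell


def playwright_available() -> bool:
    """Playwright is optional, so check for it before attempting the fallback."""
    return importlib.util.find_spec("playwright") is not None
```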

Playwright is an optional dependency. If Playwright is not installed, the fallback is skipped and the scraper proceeds with whatever content the HTTP request returned. To enable the fallback in production:

# Install the Playwright Python package (if not already in requirements)
pip install playwright

# Install the Chromium browser binary
playwright install chromium

Note: The playwright install chromium command downloads a Chromium binary (~150 MB). Run this during Docker image build or deployment provisioning, not at runtime.

Adaptive Scheduling

Check frequency adjusts based on certificate expiry proximity:

Condition | Frequency
Certificates expiring within 7 days | Every 4 hours
Certificates expiring within 30 days | Every 12 hours
No expiring certificates | Every 24 hours (default)
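
The schedule can be written as a function of the nearest upcoming expiry date. In this sketch the boundaries are treated as inclusive and already expired certificates are ignored; both choices are assumptions, as is the function name.

```python
from datetime import date, timedelta
from typing import Iterable


def check_frequency(expiry_dates: Iterable[date], today: date) -> timedelta:
    """Return how often an endpoint should be scraped."""
    # Days until expiry for certificates that have not expired yet
    upcoming = [(d - today).days for d in expiry_dates if d >= today]
    if not upcoming:
        return timedelta(hours=24)  # default
    soonest = min(upcoming)
    if soonest <= 7:
        return timedelta(hours=4)
    if soonest <= 30:
        return timedelta(hours=12)
    return timedelta(hours=24)
```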

Website Discovery

The WebsiteDiscovery class crawls an organization's website to find certificate-related pages and PDFs. It:

  • Starts from the organization's homepage
  • Follows links matching certificate-related keywords
  • Classifies discovered pages and documents
  • Auto-creates MonitoredEndpoint records for relevant URLs
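
The link-following step can be illustrated with the standard library alone. This is not the WebsiteDiscovery implementation: the keyword list, the same-host restriction, and the page/pdf classification are assumptions chosen for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical keyword list; the real one is not given in this document
KEYWORDS = ("certificate", "certification", "zertifikat", "compliance")


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def candidate_links(base_url: str, html: str) -> list:
    """Return (absolute URL, kind) for same-host links matching a keyword."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(base_url).netloc
    found = []
    for href, text in parser.links:
        url = urljoin(base_url, href)
        if urlparse(url).netloc != host:
            continue  # assumption: discovery stays on the organization's site
        if any(keyword in f"{url} {text}".lower() for keyword in KEYWORDS):
            kind = "pdf" if urlparse(url).path.lower().endswith(".pdf") else "page"
            found.append((url, kind))
    return found
```

Matching on both the URL and the anchor text catches links such as /files/iso9001.pdf whose path alone carries no keyword.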

webhooks

Push notifications for certificate lifecycle events. See Webhooks Documentation for full details.

Models

Model | Description
Subscriber | An external system receiving notifications. Fields: name, webhook_url, api_key (auto-generated), signing_secret (auto-generated, for HMAC signatures), is_active. Belongs to a Tenant.
Subscription | Defines what events a subscriber wants. Links subscriber to optional organization, optional certificate_type, and a list of event_types.
WebhookDelivery | Audit log per delivery attempt. Fields: subscriber, event_type, payload (JSON), response_status, error_message, attempts, status (PENDING/SUCCESS/FAILED), delivered_at.
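
The signing_secret field exists for HMAC signatures. As a rough illustration of how signing and receiver-side verification fit together, the sketch below assumes HMAC-SHA256 over a compact JSON body; the actual hash algorithm, serialization, and header name are defined in the Webhooks Documentation, not here, and the function names are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Tuple


def sign_payload(signing_secret: str, payload: dict) -> Tuple[bytes, str]:
    """Serialize the payload and return (body, hex signature)."""
    # Assumption: the exact bytes sent on the wire are the bytes that get signed
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    signature = hmac.new(signing_secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return body, signature


def verify_signature(signing_secret: str, body: bytes, signature: str) -> bool:
    """Receiver side: recompute the signature and compare in constant time."""
    expected = hmac.new(signing_secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

Using hmac.compare_digest instead of == avoids leaking timing information about how many leading characters of the signature matched.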

imports

Bulk data import from spreadsheet files.

Behavior

  • Accepts .xlsx, .xls, and .csv files via multipart upload.
  • Uses pandas + openpyxl to parse the file.
  • Creates or updates Organization and CertificateType records.
  • Associates imported organizations with the authenticated tenant.
  • Automatically triggers certificate discovery for newly created organizations.
  • Returns counts of created/updated records and any errors.
  • Supports partial success (HTTP 207) when some rows fail.
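
The partial-success behaviour can be sketched as a row loop that collects errors instead of aborting. To keep the example self-contained it uses the standard csv module in place of pandas, and a dictionary in place of the Organization table. The column names, the error format, and the 200 and 400 status codes are assumptions; only HTTP 207 for partial success comes from the list above.

```python
import csv
import io
from typing import Dict


def import_rows(csv_text: str, existing: Dict[str, str]) -> dict:
    """Create or update organizations keyed by external_id, row by row."""
    created = updated = 0
    errors = []
    # start=2 because line 1 of the file is the header row
    for line_no, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        external_id = (row.get("external_id") or "").strip()
        name = (row.get("name") or "").strip()
        if not external_id or not name:
            errors.append({"row": line_no, "error": "external_id and name are required"})
            continue  # keep going so valid rows are still imported
        if external_id in existing:
            updated += 1
        else:
            created += 1
        existing[external_id] = name
    if errors and (created or updated):
        status = 207  # partial success
    elif errors:
        status = 400  # assumed code when every row fails
    else:
        status = 200  # assumed code for full success
    return {"status": status, "created": created, "updated": updated, "errors": errors}
```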

bulk_upload

Bulk certificate file upload with OCR extraction, AI-powered data extraction, organization matching, and human review.

Models

Model | Description
UploadBatch | A group of files uploaded together. Fields: tenant (FK), status (pending/processing/completed).
UploadItem | A single file within a batch. Fields: original_filename, stored_file, file_type, file_size, status, extracted_text, extracted_data (JSON), matched_organization (FK), match_confidence, certificate (FK), error_message, user_corrections (JSON).

Upload Item Status Flow

pending → processing → matched         (auto-matched, certificate created)
                     → needs_review     (low confidence or missing fields)
                     → extraction_failed (OCR/LLM error)

needs_review → matched  (user confirms)
             → skipped  (user skips)

Processing Pipeline

  1. Upload — Files validated (extensions: .pdf, .png, .jpg, .jpeg, .tiff, .webp; max 20 MB each). Stored in media/bulk_uploads/.
  2. Filename Parsing — Extracts hints from the filename (org name, cert type).
  3. Text Extraction — PDF text via pdfplumber, image OCR via pytesseract.
  4. LLM Extraction — Sends extracted text + filename hints to OpenAI. Parses structured certificate fields.
  5. Organization Matching — Fuzzy matches the extracted org name against existing organizations.
  6. Auto-Match — If confidence ≥ 0.8 and all required fields present → auto-creates certificate, status = matched.
  7. Review Routing — Otherwise → status = needs_review for human review.
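
Steps 5 to 7 can be sketched as a scoring function followed by a threshold check. The document does not name the fuzzy matching algorithm, so difflib.SequenceMatcher is used here purely as a stand-in; the required field list and the organization_name key are also assumptions. The 0.8 threshold comes from step 6.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

AUTO_MATCH_THRESHOLD = 0.8
# Hypothetical required fields; the real list is not given in this document
REQUIRED_FIELDS = ("certificate_number", "expiry_date")


def best_org_match(name: str, organizations: List[str]) -> Tuple[Optional[str], float]:
    """Return the closest organization name and its similarity in [0, 1]."""
    best, best_score = None, 0.0
    for org in organizations:
        score = SequenceMatcher(None, name.lower(), org.lower()).ratio()
        if score > best_score:
            best, best_score = org, score
    return best, best_score


def route_item(extracted: dict, organizations: List[str]) -> Tuple[str, Optional[str], float]:
    """Return (status, matched organization, confidence) for one upload item."""
    match, confidence = best_org_match(extracted.get("organization_name", ""), organizations)
    missing = [field for field in REQUIRED_FIELDS if not extracted.get(field)]
    if match is not None and confidence >= AUTO_MATCH_THRESHOLD and not missing:
        return "matched", match, confidence
    return "needs_review", match, confidence
```

Returning the best candidate even when the item goes to review lets the reviewer confirm a suggestion rather than search from scratch.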

Quality Metrics

  • Auto-match rate per batch and aggregate
  • Extraction failure rate
  • User correction tracking (which fields were changed during review)