CertWatch — Module Reference

Module Map

Module	Purpose	API Prefix
`certwatch/`	Project settings, root URL conf, Celery config, middleware	—
`tenants/`	Multi-tenant auth, memberships, invitations, token management	`/api/auth/`
`organizations/`	Organization & Site CRUD	`/api/organizations/`
`certificates/`	CertificateType & Certificate CRUD, public search	`/api/certificates/`
`monitoring/`	MonitoredEndpoint, ScrapeLog, scraper, LLM extractor, discovery	`/api/monitoring/`
`webhooks/`	Subscriber, Subscription, WebhookDelivery	`/api/webhooks/`
`imports/`	Excel/CSV data import	`/api/imports/`
`bulk_upload/`	Bulk certificate file upload, OCR, org matching, review	`/api/bulk-upload/`

certwatch

Project-level configuration, middleware, and infrastructure.

Key Components

RequestIDMiddleware — Assigns a UUID to every request (from X-Request-ID header or auto-generated). Sets request.request_id and returns the ID in the X-Request-ID response header. A companion RequestIDFilter injects the ID into all log records.
AdminIPRestrictionMiddleware — Blocks requests to /admin/ from IPs not in the ADMIN_ALLOWED_IPS env var (comma-separated IPs/CIDRs). Supports X-Forwarded-For for reverse proxy setups. Empty config allows all access (dev default).
HealthCheckView — GET /api/health/ verifies database and Redis connectivity.
Celery signals — before_task_publish, task_prerun, and task_postrun signals propagate request IDs into Celery task headers and thread-local storage for end-to-end tracing.
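
The allowlist check described for AdminIPRestrictionMiddleware can be sketched with the standard-library ipaddress module. This is a minimal illustration under stated assumptions, not the project's code: the helper names parse_allowed and is_admin_ip_allowed are hypothetical, and the raw string stands in for the value of ADMIN_ALLOWED_IPS. X-Forwarded-For handling is left out.

```python
import ipaddress


def parse_allowed(raw: str):
    """Parse a comma-separated list of IPs/CIDRs into network objects."""
    entries = [part.strip() for part in raw.split(",") if part.strip()]
    # strict=False lets a bare IP such as "10.0.0.5" parse as a single-host network.
    return [ipaddress.ip_network(entry, strict=False) for entry in entries]


def is_admin_ip_allowed(client_ip: str, raw_allowed: str) -> bool:
    """Return True if client_ip may reach /admin/ (hypothetical helper)."""
    networks = parse_allowed(raw_allowed)
    if not networks:
        # Empty config allows all access, matching the documented dev default.
        return True
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in networks)
```

For example, is_admin_ip_allowed("10.1.2.3", "10.0.0.0/8, 192.168.1.5") accepts the request, while an address outside both entries is rejected.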

tenants

Handles authentication, multi-tenant isolation, team management, and API key lifecycle.

Models

Model	Description
Tenant	A customer organization using CertWatch. Has `name`, `slug` (auto-generated), UUID primary key.
Membership	Links a Django `User` to a `Tenant` with a role (`admin` or `member`). Unique per user+tenant.
Invitation	A pending invite for a new user. Contains a unique URL-safe `token`, `email`, `role`, `status` (pending/accepted/expired), and `expires_at` (7 days).
TenantToken	API token carrying both user and tenant context. Has optional `name`, optional `expires_at`. Replaces default DRF token auth.
EmailVerification	Tracks email verification for new registrations. Fields: `user` (OneToOne to User), `token` (unique URL-safe string), `created_at`, `expires_at` (24 hours).

Key Components

TenantTokenAuthentication — Custom DRF auth backend. Reads Authorization: Token <key>, looks up TenantToken, returns (user, token) so request.auth carries tenant context.
TenantMiddleware — Runs after AuthenticationMiddleware. Resolves request.tenant from the auth token. Public paths (register, login, health, etc.) skip tenant resolution.
TenantQuerySetMixin — Reusable viewset mixin. Filters querysets by request.tenant using a configurable tenant_field attribute. Auto-sets tenant on perform_create.
IsAdminMember — Permission class requiring the user to have an admin role membership in the current tenant.
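
The behavior of TenantQuerySetMixin can be sketched as follows. This is an illustrative outline based only on the description above, not the actual implementation; it assumes a DRF-style viewset that exposes self.request, get_queryset, and perform_create, and the default value of tenant_field is an assumption.

```python
class TenantQuerySetMixin:
    """Sketch of a viewset mixin that scopes data to request.tenant."""

    # Name of the model field (or lookup path) that points at the tenant.
    # The default "tenant" is an assumption for this sketch.
    tenant_field = "tenant"

    def get_queryset(self):
        # Delegate to the base viewset, then narrow to the current tenant.
        queryset = super().get_queryset()
        return queryset.filter(**{self.tenant_field: self.request.tenant})

    def perform_create(self, serializer):
        # Set the tenant server-side so clients cannot choose it.
        # This only works when tenant_field is a direct model field;
        # a nested lookup path would need its own perform_create.
        serializer.save(**{self.tenant_field: self.request.tenant})
```

The mixin must come before the base viewset class in the inheritance list so that its get_queryset wraps the base implementation.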

organizations

Manages the suppliers whose certificates are being tracked.

Models

Model	Description
Organization	A business entity (supplier). Fields: `external_id` (unique), `name`, `website` (validated URL), `lei`, `vat`, `country` (ISO 3166-1 alpha-2). Belongs to a `Tenant`.
Site	A physical/operational location. Fields: `name`, `address`. Belongs to an `Organization`. Multiple sites per organization.

Behavior

Creating an organization automatically triggers certificate discovery as an asynchronous Celery task.
List views annotate organizations with active_certificates, expiring_certificates, and expired_certificates counts.
Detail views include nested sites.

certificates

Core certificate data and lifecycle management.

Models

Model	Description
CertificateType	Master data for certificate categories. Fields: `code` (unique, e.g. `ISO_9001`), `name`, `description`. Seeded via migrations.
Certificate	A compliance document issued to an organization. Fields: `certificate_number`, `issuing_body`, `scope`, `issue_date`, `expiry_date`, `status`, `document_url`, `source_url`, `document_hash` (SHA-256), `external_id`, `last_expiry_notification_at`.

Certificate Statuses

Status	Description
`ACTIVE`	Certificate is currently valid
`EXPIRED`	Certificate has passed its expiry date
`REVOKED`	Certificate has been revoked by the issuing body
`PENDING`	Certificate is awaiting validation

Status State Machine

Status transitions are enforced by certificates/state_machine.py via Certificate.clean():

PENDING  → ACTIVE, REVOKED, EXPIRED
ACTIVE   → EXPIRED, REVOKED
EXPIRED  → ACTIVE  (re-certification — requires future expiry_date)
REVOKED  → (terminal — no transitions allowed)

Invalid transitions raise a ValidationError. The EXPIRED → ACTIVE transition additionally requires that expiry_date is in the future.
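
The transition rules above can be expressed as a lookup table plus one extra check. The sketch below is not the contents of certificates/state_machine.py; the names ALLOWED_TRANSITIONS and validate_transition are hypothetical, ValueError stands in for Django's ValidationError so the example runs without Django, and treating an unchanged status as a no-op is an assumption.

```python
from datetime import date

# Allowed target statuses for each current status, as listed above.
ALLOWED_TRANSITIONS = {
    "PENDING": {"ACTIVE", "REVOKED", "EXPIRED"},
    "ACTIVE": {"EXPIRED", "REVOKED"},
    "EXPIRED": {"ACTIVE"},
    "REVOKED": set(),  # terminal status
}


def validate_transition(old_status, new_status, expiry_date, today=None):
    """Raise ValueError if the status change is not permitted."""
    today = today or date.today()
    if old_status == new_status:
        return  # assumption: saving without a status change is always fine
    if new_status not in ALLOWED_TRANSITIONS[old_status]:
        raise ValueError(f"Invalid transition: {old_status} -> {new_status}")
    if old_status == "EXPIRED" and new_status == "ACTIVE":
        # Re-certification requires an expiry date in the future.
        if expiry_date is None or expiry_date <= today:
            raise ValueError("Re-activation requires a future expiry_date")
```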

Async Document Hashing

When a certificate's document_url is set or changed, save() does not compute the hash synchronously. Instead, it enqueues a compute_document_hash Celery task, which fetches the document and computes its SHA-256 hash in the background.
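
The hashing step inside such a task can be sketched with hashlib. This is a minimal example, not the body of compute_document_hash: the function name sha256_of_stream and the chunk size are assumptions, and fetching the document over HTTP is omitted. Reading in chunks keeps memory use flat for large PDFs.

```python
import hashlib
from typing import BinaryIO


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 64 * 1024) -> str:
    """Return the hex SHA-256 digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    # iter() with a sentinel stops when read() returns an empty bytes object.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()
```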

Supported Certificate Types (seeded)

ISO_9001, ISO_14001, ISO_45001, EMAS, TISAX, ISO_27001, ISO_50001, ISO_22301, ISO_13485, ISO_20000, SA_8000, OHSAS_18001, FSC, PEFC, BSCI, SEDEX, ECOVADIS, CDP, SBTi, GRI, IATF_16949, AS_9100, NADCAP, ISO_3834, EN_1090, CE_MARKING, API_SPEC, PED_2014_68_EU, ATEX, IECEx

Signals

pre_save — Caches old field values for change detection.
post_save — Dispatches NEW_CERTIFICATE webhook on creation, CERTIFICATE_UPDATED when watched fields change (status, expiry_date, issuing_body, scope, document_url). Resets last_expiry_notification_at to None when expiry_date changes (allows fresh notifications after renewal).
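
The change detection performed by the two signals can be sketched as a comparison of cached and current field values. The helper names below are hypothetical and plain dictionaries stand in for model instances; only the watched field list and event names come from the description above. Not resetting the notification timestamp on creation is an assumption.

```python
WATCHED_FIELDS = ("status", "expiry_date", "issuing_body", "scope", "document_url")


def changed_watched_fields(old: dict, new: dict) -> list:
    """Return the watched fields whose values differ between old and new."""
    return [field for field in WATCHED_FIELDS if old.get(field) != new.get(field)]


def events_for_save(created: bool, old: dict, new: dict):
    """Return (event name or None, whether to reset last_expiry_notification_at)."""
    if created:
        return "NEW_CERTIFICATE", False
    changed = changed_watched_fields(old, new)
    event = "CERTIFICATE_UPDATED" if changed else None
    # A changed expiry_date clears the notification timestamp so that
    # fresh expiry notifications can be sent after a renewal.
    return event, "expiry_date" in changed
```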

monitoring

Automated web scraping pipeline for certificate discovery and extraction.

Models

Model	Description
MonitoredEndpoint	A URL to periodically scrape. Fields: `url`, `organization` (FK), `check_frequency` (duration, default 24h), `last_checked`, `last_content_hash` (MD5), `processed_pdf_urls` (JSON list), `pdf_content_hashes` (JSON dict — URL→SHA-256), `is_active`.
ScrapeLog	Audit record per scrape attempt. Fields: `endpoint` (FK), `timestamp`, `status` (SUCCESS/FAILED/NO_CHANGE), `content_hash`, `certificates_found`, `error_message`, `duration_ms`.

Scraping Pipeline

Fetch — HTTP GET (with Playwright fallback for JS-rendered pages). Extracts HTML and discovers linked PDFs.
Change Detection — MD5 hash comparison against last_content_hash. Skips extraction if unchanged.
PDF Processing — Downloads linked PDFs, extracts text via pdfplumber. Tracks processed URLs to avoid re-downloading. Per-PDF content hashes (pdf_content_hashes) enable selective re-extraction when individual PDFs change, even if the HTML page is unchanged.
LLM Extraction — Sends content to OpenAI API with a structured prompt. Parses JSON response into ExtractedCertificate objects.
Duplicate Detection — Multi-strategy: exact document hash, certificate number match, semantic field comparison, SimHash similarity.
Certificate Creation — Creates or updates Certificate records. Triggers webhook signals.
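
The two change-detection layers in the pipeline (an MD5 fingerprint for the page, SHA-256 per linked PDF) can be sketched as below. The function names are hypothetical and the arguments are plain values rather than MonitoredEndpoint fields; this shows the comparison logic only, not the scraper itself.

```python
import hashlib


def page_changed(html: str, last_content_hash):
    """Return (changed, new_hash) using an MD5 digest of the page HTML."""
    # MD5 serves here only as a cheap change fingerprint, not for security.
    new_hash = hashlib.md5(html.encode("utf-8")).hexdigest()
    return new_hash != last_content_hash, new_hash


def pdfs_needing_extraction(pdf_bytes_by_url: dict, pdf_content_hashes: dict) -> list:
    """Return URLs whose content is new or differs from the stored SHA-256."""
    changed = []
    for url, content in pdf_bytes_by_url.items():
        digest = hashlib.sha256(content).hexdigest()
        if pdf_content_hashes.get(url) != digest:
            changed.append(url)
    return changed
```

With this split, a page whose HTML is unchanged can still yield work when one of its linked PDFs has been replaced under the same URL.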

Playwright Fallback

The scraper includes an automatic Playwright fallback for pages that cannot be fetched with standard HTTP requests. This handles JavaScript-rendered certificate pages that return incomplete content via plain HTTP GET.

Trigger conditions — the fallback activates when the initial HTTP request encounters any of:

HTTP 403 Forbidden — the target server blocks non-browser requests
Suspiciously small responses — the response body is too small to contain meaningful content (likely a JS-only shell)
Connection errors — the initial request fails entirely

When triggered, the scraper launches a headless Chromium browser via Playwright, renders the page with full JavaScript execution, and extracts the resulting HTML.
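
The trigger conditions can be condensed into a single predicate. The sketch below is an assumption-laden illustration: the function name needs_browser_fallback is hypothetical, and the MIN_BODY_BYTES threshold is invented for the example because the text does not state what counts as suspiciously small.

```python
from typing import Optional

# Hypothetical threshold; the real cutoff is not documented here.
MIN_BODY_BYTES = 512


def needs_browser_fallback(
    status_code: Optional[int],
    body: Optional[bytes],
    error: Optional[Exception] = None,
) -> bool:
    """Decide whether to retry the fetch with a headless browser."""
    if error is not None or status_code is None:
        return True  # connection error: the initial request failed entirely
    if status_code == 403:
        return True  # server blocks non-browser requests
    # A tiny body is likely a JavaScript-only shell.
    return len(body or b"") < MIN_BODY_BYTES
```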

Playwright is an optional dependency. If Playwright is not installed, the fallback is skipped and the scraper proceeds with whatever content the HTTP request returned. To enable the fallback in production:

# Install the Playwright Python package (if not already in requirements)
pip install playwright

# Install the Chromium browser binary
playwright install chromium

Note: The playwright install chromium command downloads a Chromium binary (~150 MB). Run this during Docker image build or deployment provisioning, not at runtime.

Adaptive Scheduling

The check frequency increases as certificates approach their expiry date:

Condition	Frequency
Certificates expiring within 7 days	Every 4 hours
Certificates expiring within 30 days	Every 12 hours
No expiring certificates	Every 24 hours (default)
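
The table can be read as a function of the soonest upcoming expiry date. The sketch below makes that reading explicit; the function name check_frequency is hypothetical, and ignoring certificates that have already expired is an assumption rather than documented behavior.

```python
from datetime import date, timedelta


def check_frequency(expiry_dates, today=None) -> timedelta:
    """Return the scrape interval based on the soonest upcoming expiry."""
    today = today or date.today()
    # Assumption: only certificates that have not yet expired are considered.
    remaining = [(expiry - today).days for expiry in expiry_dates if expiry >= today]
    if not remaining:
        return timedelta(hours=24)  # default
    soonest = min(remaining)
    if soonest <= 7:
        return timedelta(hours=4)
    if soonest <= 30:
        return timedelta(hours=12)
    return timedelta(hours=24)
```

Checking the 7-day condition first means a certificate expiring in 5 days gets the 4-hour interval even though it also falls within 30 days.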

Website Discovery

The WebsiteDiscovery class crawls an organization's website to find certificate-related pages and PDFs. It:

Starts from the organization's homepage
Follows links matching certificate-related keywords
Classifies discovered pages and documents
Auto-creates MonitoredEndpoint records for relevant URLs
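
The link-following step can be illustrated with the standard-library HTML parser. This is not the WebsiteDiscovery class: the keyword list, the LinkCollector class, and the certificate_links function are hypothetical, and the sketch matches keywords against the URL only, whereas a real crawler would likely also inspect link text.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical keyword list for this sketch.
CERT_KEYWORDS = ("certificate", "certification", "compliance")


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))


def certificate_links(html: str, base_url: str) -> list:
    """Return links whose URL contains a certificate-related keyword."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [
        url for url in collector.links
        if any(keyword in url.lower() for keyword in CERT_KEYWORDS)
    ]
```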

webhooks

Push notifications for certificate lifecycle events. See Webhooks Documentation for full details.

Models

Model	Description
Subscriber	An external system receiving notifications. Fields: `name`, `webhook_url`, `api_key` (auto-generated), `signing_secret` (auto-generated, for HMAC signatures), `is_active`. Belongs to a `Tenant`.
Subscription	Defines what events a subscriber wants. Links `subscriber` to optional `organization`, optional `certificate_type`, and a list of `event_types`.
WebhookDelivery	Audit log per delivery attempt. Fields: `subscriber`, `event_type`, `payload` (JSON), `response_status`, `error_message`, `attempts`, `status` (PENDING/SUCCESS/FAILED), `delivered_at`.
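
How a signing_secret can be used for HMAC signatures is sketched below. The details are assumptions for illustration: SHA-256 as the digest, compact sorted JSON as the signed body, and the function names sign_payload and verify_signature. The Webhooks Documentation remains the authority on the actual scheme and header names.

```python
import hashlib
import hmac
import json


def sign_payload(signing_secret: str, payload: dict):
    """Serialize the payload and return (body bytes, hex HMAC signature)."""
    # Assumption: the signature covers the exact bytes that are sent.
    body = json.dumps(payload, separators=(",", ":"), sort_keys=True).encode("utf-8")
    signature = hmac.new(signing_secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return body, signature


def verify_signature(signing_secret: str, body: bytes, signature: str) -> bool:
    """Recompute the HMAC on the receiver side and compare in constant time."""
    expected = hmac.new(signing_secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

A receiver should verify against the raw request body rather than a re-serialized copy, since any change in whitespace or key order changes the digest.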

imports

Bulk data import from spreadsheet files.

Behavior

Accepts .xlsx, .xls, and .csv files via multipart upload.
Parses the file with pandas and openpyxl.
Creates or updates Organization and CertificateType records.
Associates imported organizations with the authenticated tenant.
Automatically triggers certificate discovery for newly created organizations.
Returns counts of created/updated records and any errors.
Returns HTTP 207 (Multi-Status) on partial success, when some rows fail and others are imported.
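
The partial-success behavior can be sketched with a row-by-row loop that collects errors instead of aborting. To stay dependency-free, this example uses the standard-library csv module rather than pandas. The function name import_rows, the column names external_id and name, and the 200 and 400 status codes are assumptions; only the 207 case comes from the description above.

```python
import csv
import io


def import_rows(csv_text: str):
    """Return (http_status, summary) for CSV text with external_id and name columns."""
    created, errors = 0, []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        external_id = (row.get("external_id") or "").strip()
        name = (row.get("name") or "").strip()
        if not external_id or not name:
            errors.append({"row": row_number, "error": "external_id and name are required"})
            continue
        created += 1  # a real import would create or update an Organization here
    if errors and created:
        status = 207  # some rows imported, some failed
    elif errors:
        status = 400  # assumption: nothing could be imported
    else:
        status = 200  # assumption: everything imported
    return status, {"created": created, "errors": errors}
```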

bulk_upload

Bulk certificate file upload with OCR extraction, AI-powered data extraction, organization matching, and human review.

Models

Model	Description
UploadBatch	A group of files uploaded together. Fields: `tenant` (FK), `status` (pending/processing/completed).
UploadItem	A single file within a batch. Fields: `original_filename`, `stored_file`, `file_type`, `file_size`, `status`, `extracted_text`, `extracted_data` (JSON), `matched_organization` (FK), `match_confidence`, `certificate` (FK), `error_message`, `user_corrections` (JSON).

Upload Item Status Flow

pending → processing → matched         (auto-matched, certificate created)
                     → needs_review     (low confidence or missing fields)
                     → extraction_failed (OCR/LLM error)

needs_review → matched  (user confirms)
             → skipped  (user skips)

Processing Pipeline

Upload — Files validated (extensions: .pdf, .png, .jpg, .jpeg, .tiff, .webp; max 20 MB each). Stored in media/bulk_uploads/.
Filename Parsing — Extracts hints from the filename (org name, cert type).
Text Extraction — PDF text via pdfplumber, image OCR via pytesseract.
LLM Extraction — Sends extracted text + filename hints to OpenAI. Parses structured certificate fields.
Organization Matching — Fuzzy matches the extracted org name against existing organizations.
Auto-Match — If confidence ≥ 0.8 and all required fields present → auto-creates certificate, status = matched.
Review Routing — Otherwise → status = needs_review for human review.
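
The last three steps (matching, auto-match, review routing) can be sketched together. This example uses difflib.SequenceMatcher as a stand-in for whatever fuzzy matcher the module uses, and the REQUIRED_FIELDS tuple, the organization_name key, and the function names are hypothetical. Only the 0.8 threshold and the two resulting statuses come from the pipeline description.

```python
from difflib import SequenceMatcher

AUTO_MATCH_THRESHOLD = 0.8
# Hypothetical list of required fields for this sketch.
REQUIRED_FIELDS = ("certificate_number", "expiry_date")


def best_match(extracted_name: str, org_names):
    """Return (name, score) of the closest organization, or (None, 0.0)."""
    best, best_score = None, 0.0
    for candidate in org_names:
        score = SequenceMatcher(None, extracted_name.lower(), candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


def route_item(extracted: dict, org_names):
    """Return (status, matched organization name, confidence)."""
    name, confidence = best_match(extracted.get("organization_name", ""), org_names)
    complete = all(extracted.get(field) for field in REQUIRED_FIELDS)
    if confidence >= AUTO_MATCH_THRESHOLD and complete:
        return "matched", name, confidence
    return "needs_review", name, confidence
```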

Quality Metrics

Auto-match rate per batch and aggregate
Extraction failure rate
User correction tracking (which fields were changed during review)
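
These metrics can be computed per batch from the item records. The sketch below works on plain dictionaries and is illustrative only: the function name batch_metrics and the auto_matched flag are hypothetical, since the matched status alone does not distinguish automatic matches from matches confirmed during review.

```python
from collections import Counter


def _rate(count: int, total: int) -> float:
    """Return count/total, or 0.0 for an empty batch."""
    return count / total if total else 0.0


def batch_metrics(items):
    """Compute auto-match rate, failure rate, and corrected-field counts."""
    total = len(items)
    # "auto_matched" is a hypothetical flag marking items matched without review.
    auto_matched = sum(1 for item in items if item.get("auto_matched"))
    failed = sum(1 for item in items if item.get("status") == "extraction_failed")
    corrected = Counter()
    for item in items:
        # user_corrections maps field name to the corrected value.
        corrected.update((item.get("user_corrections") or {}).keys())
    return {
        "auto_match_rate": _rate(auto_matched, total),
        "extraction_failure_rate": _rate(failed, total),
        "corrected_fields": dict(corrected),
    }
```

Aggregate figures follow by passing the items of several batches in one list.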