PDFy

PDFy PDF Malware Scanner Design

1. Product Summary

PDFy is a website-first PDF malware scanning product. Users upload a PDF, receive a fast initial verdict from static analysis, and can optionally receive richer findings from background analysis. The first release is anonymous by default and optimized for low-friction public use.

2. Goals

Deliver a practical PDF malware scanning workflow that feels like a public web utility.
Return a fast, understandable initial verdict from static analysis.
Support deeper queued analysis without blocking the first result.
Keep storage, abuse exposure, and privacy risk low by defaulting to immediate deletion after analysis.
Build a backend analysis core that can support future website growth and additional clients.

3. Non-Goals

user accounts or login-gated usage in the MVP
long-term storage by default
live detonation or runtime execution of PDFs in the MVP
enterprise workflow features such as case management or billing

4. Approved Architecture

The system uses a split architecture:

Next.js for the public website, upload flow, result pages, and web-facing API layer
Python analysis services for PDF parsing, heuristic checks, extraction, and result normalization
Redis for background job queueing and transient coordination
Postgres for scan records, findings summaries, and retention metadata
S3-compatible object storage for temporary source-file and artifact storage
Background workers for advanced analysis, enrichment, report generation, and cleanup

This split is chosen for two reasons: the product is website-first, which Next.js serves directly, and the strongest practical tooling for PDF malware analysis is Python-based, so the analysis core lives in Python services behind the web layer.

5. User Experience Scope

The MVP user flow is:

User visits the scanner homepage.
User uploads a PDF.
User chooses retention behavior, with immediate deletion as the default.
System runs the fast scan and returns an initial verdict without waiting for background analysis.
Background analysis extends the result when deeper findings become available.
User reviews a structured report.
Any retained files expire automatically.

Planned first-release pages:

Home / Scanner
Scan Result
Advanced Findings
Report View
Status / Expired

6. Retention And Privacy Rules

Retention policy is part of the product design, not an afterthought.

Default mode: delete_immediately
Optional mode: temporary_cache
Temporary cache target: up to 6 hours

Immediate deletion is the default because it reduces storage cost, lowers abuse potential, and creates a safer baseline for a public upload service. Temporary caching exists only as an opt-in convenience path for users who want short-lived revisit access to results.
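
The retention rules above can be sketched as a small policy function. This is a minimal illustration, not the implementation: the mode names and the 6-hour ceiling come from this design, while RetentionMode, MAX_CACHE_TTL, and resolve_expiry are hypothetical names, and allowing the user to request a shorter cache window is an assumption.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class RetentionMode(str, Enum):
    # Mode names taken from the design document.
    DELETE_IMMEDIATELY = "delete_immediately"
    TEMPORARY_CACHE = "temporary_cache"


# Upper bound from the design: temporary cache lasts up to 6 hours.
MAX_CACHE_TTL = timedelta(hours=6)


def resolve_expiry(
    mode: RetentionMode,
    analysis_finished_at: datetime,
    requested_ttl: Optional[timedelta] = None,
) -> datetime:
    """Return the moment at which the source file must be gone.

    Hypothetical helper: in delete_immediately mode the file expires as soon
    as analysis finishes; in temporary_cache mode the requested TTL is
    clamped to the 6-hour ceiling.
    """
    if mode is RetentionMode.DELETE_IMMEDIATELY:
        return analysis_finished_at
    ttl = MAX_CACHE_TTL if requested_ttl is None else requested_ttl
    if ttl < timedelta(0):
        raise ValueError("requested_ttl must not be negative")
    return analysis_finished_at + min(ttl, MAX_CACHE_TTL)
```

Clamping in one place means a client cannot extend retention past the documented ceiling, whatever value it sends.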

Third-party reputation checks, such as VirusTotal hash lookups, are allowed in the design but must be optional and clearly disclosed, because they send file hashes or other indicators to an external service.

7. Analysis Pipeline

Fast Scan

The fast scan returns the first usable result and should cover:

file hashing and fingerprinting
metadata extraction
suspicious PDF keyword detection
structural anomaly checks
URL and IP extraction
initial severity scoring and verdict generation
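
A minimal sketch of how part of the fast scan could look, covering hashing, keyword detection, URL extraction, and an initial verdict. Metadata extraction, IP extraction, and structural checks are left out for brevity. The keyword list uses PDF names commonly associated with active content; the weights, thresholds, verdict labels, and the fast_scan name are all assumptions for illustration. The sketch scans raw bytes only, so it will miss keywords hidden inside compressed streams or written with hex-escaped names.

```python
import hashlib
import re

# Hypothetical weights per suspicious PDF name; tune against real samples.
KEYWORD_WEIGHTS = {
    "/JavaScript": 3,
    "/JS": 3,
    "/OpenAction": 2,
    "/AA": 2,
    "/Launch": 4,
    "/EmbeddedFile": 2,
}

# Stops at whitespace and at common PDF string or array delimiters.
URL_RE = re.compile(rb"https?://[^\s<>()\[\]\"']+")


def fast_scan(data: bytes) -> dict:
    """Run cheap static checks over the raw bytes of an uploaded file."""
    keyword_counts = {}
    for name in KEYWORD_WEIGHTS:
        # The lookahead keeps "/AA" from matching inside longer names.
        pattern = re.escape(name.encode("ascii")) + rb"(?![A-Za-z0-9])"
        count = len(re.findall(pattern, data))
        if count:
            keyword_counts[name] = count

    urls = sorted({u.decode("ascii", "replace") for u in URL_RE.findall(data)})

    # Each keyword contributes its weight once, regardless of repeat count.
    score = sum(KEYWORD_WEIGHTS[name] for name in keyword_counts)
    if score >= 5:
        verdict = "suspicious"
    elif score > 0:
        verdict = "needs_review"
    else:
        verdict = "no_indicators_found"

    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
        "keyword_counts": keyword_counts,
        "urls": urls,
        "score": score,
        "verdict": verdict,
    }
```

The "no_indicators_found" label is deliberately not "clean": a static pass with no hits does not prove a file is safe.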

Advanced Scan

Advanced analysis runs in the background and may cover:

object enumeration and extraction
stream decompression and inspection
JavaScript extraction and suspicious-pattern review
obfuscation heuristic detection
embedded file detection
exploit-pattern detection
optional third-party enrichment

Fast-scan results must remain visible even if advanced jobs fail or take longer than expected.
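
One way to keep fast-scan results visible when advanced jobs fail is to run each advanced step in isolation and attach its findings or its error to a copy of the scan record. The sketch below assumes a dictionary-shaped record and uses stream inflation plus a simple JavaScript pattern check as the example step. The function names, the pattern list, and the "partial"/"complete" labels are hypothetical, and the regex-based stream finder is a simplification of real PDF parsing.

```python
import re
import zlib
from typing import Callable, Dict, List

# Simplified stream locator; a real parser would honor /Length and filters.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)

# Hypothetical patterns that often appear in obfuscated PDF JavaScript.
JS_PATTERNS = (b"eval(", b"unescape(", b"String.fromCharCode")


def inflate_streams(data: bytes) -> List[bytes]:
    """Return the bodies of all streams that inflate as zlib data."""
    inflated = []
    for raw in STREAM_RE.findall(data):
        try:
            # decompressobj tolerates a trailing byte lost at the EOL boundary.
            inflated.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            continue  # not Flate-compressed, or corrupt; skip it
    return inflated


def suspicious_js_patterns(data: bytes) -> List[str]:
    """List which known patterns occur in any inflated stream."""
    hits = set()
    for body in inflate_streams(data):
        hits.update(p.decode("ascii") for p in JS_PATTERNS if p in body)
    return sorted(hits)


def run_advanced(scan: dict, data: bytes,
                 steps: Dict[str, Callable[[bytes], object]]) -> dict:
    """Run each step independently; a failing step never touches fast results."""
    findings, errors = {}, {}
    for name, step in steps.items():
        try:
            findings[name] = step(data)
        except Exception as exc:  # deliberate: one bad step must not abort the rest
            errors[name] = f"{type(exc).__name__}: {exc}"
    advanced = {
        "status": "partial" if errors else "complete",
        "findings": findings,
        "errors": errors,
    }
    # Return a new record so the stored fast-scan fields are left unchanged.
    return {**scan, "advanced": advanced}
```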

8. Data Flow

Upload enters through the web layer.
The web layer validates the request and creates a scan record.
The source file is stored temporarily for analysis.
Fast analysis runs and writes the initial result.
Advanced tasks are queued when applicable.
Deeper findings are attached to the same scan record.
Report payloads are generated from normalized findings.
Source files are deleted immediately or expired according to retention mode.
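
The final step of the flow can be sketched as an idempotent cleanup pass. This is an illustration under stated assumptions: StoredFile and run_cleanup are hypothetical names, the records would live in Postgres in the approved architecture, and delete_object stands in for whatever call the S3-compatible storage client provides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, List, Optional


@dataclass
class StoredFile:
    # Hypothetical retention metadata for one uploaded source file.
    scan_id: str
    object_key: str
    expires_at: datetime
    deleted_at: Optional[datetime] = None


def run_cleanup(
    files: Iterable[StoredFile],
    delete_object: Callable[[str], None],
    now: datetime,
) -> List[str]:
    """Delete every file whose expiry has passed; return the affected scan IDs.

    Files already marked deleted are skipped, so running the job twice
    does not issue duplicate storage deletes.
    """
    removed = []
    for item in files:
        if item.deleted_at is not None or item.expires_at > now:
            continue
        delete_object(item.object_key)  # placeholder for the storage client call
        item.deleted_at = now
        removed.append(item.scan_id)
    return removed
```

Under this model, delete_immediately is just a record whose expires_at equals the analysis finish time, so one job can enforce both retention modes.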

9. API Direction

The MVP should expose at least:

POST /api/scans for upload and scan creation
GET /api/scans/:scanId for status and summary retrieval
GET /api/scans/:scanId/report for the structured report payload

The product should use stable scan identifiers instead of login-bound history for the first release.
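
Because the first release has no accounts, the scan identifier is effectively the only thing protecting a result page, so an unguessable random ID is a reasonable choice. The sketch below is one possible shape, not the API contract: new_scan_id, is_valid_scan_id, scan_summary, and the response field names are assumptions. It is written in Python for consistency with the analysis services, although the web-facing layer in the approved architecture is Next.js.

```python
import re
import secrets
from typing import Optional

# token_urlsafe(16) yields 22 URL-safe base64 characters (128 random bits).
SCAN_ID_RE = re.compile(r"[A-Za-z0-9_-]{22}")


def new_scan_id() -> str:
    """Create an unguessable, URL-safe scan identifier."""
    return secrets.token_urlsafe(16)


def is_valid_scan_id(value: str) -> bool:
    """Cheap shape check before hitting the database with a lookup."""
    return SCAN_ID_RE.fullmatch(value) is not None


def scan_summary(scan_id: str, status: str, verdict: Optional[str] = None) -> dict:
    """Hypothetical body for GET /api/scans/:scanId."""
    return {
        "scanId": scan_id,
        "status": status,
        "verdict": verdict,
        "reportUrl": f"/api/scans/{scan_id}/report",
    }
```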

10. Safety Boundaries

Static analysis is in scope for the MVP.
Live detonation is out of scope for the MVP.
Analyzer and worker execution must remain behind internal service boundaries.
File validation, rate limiting, and retention cleanup are required parts of the MVP baseline, not hardening deferred to a later release.
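
As an illustration of the file-validation boundary, the sketch below rejects empty uploads, oversized uploads, and files without a PDF header before any analyzer sees them. The 25 MB limit, the 1024-byte header window, the error codes, and the validate_upload name are assumptions. The window reflects the fact that some readers accept a header that does not sit at offset 0; a stricter service could require offset 0 instead.

```python
from typing import List

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # hypothetical size limit
HEADER_WINDOW = 1024  # assumed leniency: header may appear within this prefix


def validate_upload(data: bytes, max_bytes: int = MAX_UPLOAD_BYTES) -> List[str]:
    """Return a list of problem codes; an empty list means the upload is accepted."""
    if not data:
        return ["empty_file"]
    problems = []
    if len(data) > max_bytes:
        problems.append("file_too_large")
    if b"%PDF-" not in data[:HEADER_WINDOW]:
        problems.append("missing_pdf_header")
    return problems
```

Passing this check only means the file looks like a PDF; malformed or hostile content is still expected and must be handled by the analyzers.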

11. Documentation Policy

Documentation is a required deliverable during development.

The repository should maintain:

product documentation
architecture documentation
API contracts
security and privacy guidance
operational runbooks
schema references
architecture decision records

These docs must be updated alongside code changes and used as the active guide for future implementation and expansion.

12. Initial Documentation Set

The project should keep the following files current:

docs/README.md
docs/product/vision.md
docs/architecture/system-overview.md
docs/architecture/analysis-pipeline.md
docs/api/contracts.md
docs/security/privacy.md
docs/operations/deployment.md
docs/decisions/ADR-0001-platform-architecture.md
docs/runbooks/retention-cleanup.md
docs/schemas/scan-result-schema.md

13. Validation And Delivery Expectations

The production-leaning MVP should be built with:

upload validation
safe default retention behavior
reliable scan status transitions
graceful handling of malformed PDFs
non-fatal advanced-analysis failure handling
enforceable cleanup jobs
documentation updates bundled with major feature work
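
Reliable status transitions and non-fatal advanced-analysis failures can both be expressed with an explicit transition table. The sketch below is one possible model, not the approved schema: every state name and every allowed edge is an assumption chosen to match the flow described in this document.

```python
from enum import Enum


class ScanStatus(str, Enum):
    # Hypothetical state names.
    QUEUED = "queued"
    FAST_RUNNING = "fast_running"
    FAST_COMPLETE = "fast_complete"
    ADVANCED_RUNNING = "advanced_running"
    ADVANCED_FAILED = "advanced_failed"  # fast result is still served
    COMPLETE = "complete"
    FAILED = "failed"  # fast scan itself could not produce a result
    EXPIRED = "expired"


ALLOWED = {
    ScanStatus.QUEUED: {ScanStatus.FAST_RUNNING, ScanStatus.FAILED},
    ScanStatus.FAST_RUNNING: {ScanStatus.FAST_COMPLETE, ScanStatus.FAILED},
    ScanStatus.FAST_COMPLETE: {
        ScanStatus.ADVANCED_RUNNING,
        ScanStatus.COMPLETE,
        ScanStatus.EXPIRED,
    },
    ScanStatus.ADVANCED_RUNNING: {ScanStatus.COMPLETE, ScanStatus.ADVANCED_FAILED},
    ScanStatus.ADVANCED_FAILED: {ScanStatus.EXPIRED},
    ScanStatus.COMPLETE: {ScanStatus.EXPIRED},
    ScanStatus.FAILED: {ScanStatus.EXPIRED},
    ScanStatus.EXPIRED: set(),  # terminal
}


def transition(current: ScanStatus, target: ScanStatus) -> ScanStatus:
    """Return the new status, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping ADVANCED_FAILED separate from FAILED lets the result page distinguish "no verdict at all" from "initial verdict available, deeper analysis unavailable".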

14. Implementation Readiness

This design is intentionally structured so the next phase can produce:

a concrete repo plan
a modular monorepo layout
explicit service boundaries
implementation tasks tied to the documented architecture