PDFy PDF Malware Scanner Design
1. Product Summary
PDFy is a website-first PDF malware scanning product. Users upload a PDF, receive a fast initial verdict, and optionally receive richer findings from background analysis. The first release is anonymous by default and optimized for simple public use with minimal friction.
2. Goals
- Deliver a practical PDF malware scanning workflow that feels like a public web utility.
- Return a fast, understandable initial verdict from static analysis.
- Support deeper queued analysis without blocking the first result.
- Keep storage, abuse exposure, and privacy risk low by defaulting to immediate deletion after analysis.
- Build a backend analysis core that can support future website growth and additional clients.
3. Non-Goals
- user accounts or login-gated usage in the MVP
- long-term storage by default
- live detonation or runtime execution of PDFs in the MVP
- enterprise workflow features such as case management or billing
4. Approved Architecture
The system uses a split architecture:
- Next.js for the public website, upload flow, result pages, and web-facing API layer
- Python analysis services for PDF parsing, heuristic checks, extraction, and result normalization
- Redis for background job queueing and transient coordination
- Postgres for scan records, findings summaries, and retention metadata
- S3-compatible object storage for temporary source-file and artifact storage
- Background workers for advanced analysis, enrichment, report generation, and cleanup
This split was chosen because the product is website-first, while the strongest practical tooling for PDF malware analysis is written in Python.
5. User Experience Scope
The MVP user flow is:
- User visits the scanner homepage.
- User uploads a PDF.
- User chooses retention behavior, with immediate deletion as the default.
- System runs the fast scan and returns an initial verdict.
- Background analysis extends the result when deeper findings become available.
- User reviews a structured report.
- Any retained files expire automatically.
Planned first-release pages:
- Home / Scanner
- Scan Result
- Advanced Findings
- Report View
- Status / Expired
6. Retention And Privacy Rules
Retention policy is part of the product design, not an afterthought.
- Default mode: delete_immediately
- Optional mode: temporary_cache
- Temporary cache target: up to 6 hours
Immediate deletion is the default because it reduces storage cost, lowers abuse potential, and creates a safer baseline for a public upload service. Temporary caching exists only as an opt-in convenience path for users who want short-lived revisit access to results.
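The two retention modes and the 6-hour ceiling can be captured in a small helper that decides when a source file becomes eligible for deletion. This is a minimal sketch rather than a settled implementation: the names RetentionMode, MAX_CACHE_TTL, and compute_expiry are hypothetical, and the idea that a user could request a shorter cache window is an assumption not stated in this design.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class RetentionMode(str, Enum):
    DELETE_IMMEDIATELY = "delete_immediately"
    TEMPORARY_CACHE = "temporary_cache"


# Upper bound taken from the design ("up to 6 hours").
MAX_CACHE_TTL = timedelta(hours=6)


def compute_expiry(
    mode: RetentionMode,
    analysis_finished_at: datetime,
    requested_ttl: Optional[timedelta] = None,
) -> datetime:
    """Return the moment at which the source file should be deleted."""
    if mode is RetentionMode.DELETE_IMMEDIATELY:
        # Eligible for deletion as soon as analysis is done.
        return analysis_finished_at
    # Assumption: callers may ask for less than the maximum, never more.
    ttl = MAX_CACHE_TTL if requested_ttl is None else min(requested_ttl, MAX_CACHE_TTL)
    if ttl <= timedelta(0):
        raise ValueError("requested_ttl must be positive")
    return analysis_finished_at + ttl


finished = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(compute_expiry(RetentionMode.TEMPORARY_CACHE, finished, timedelta(hours=24)))
```

Clamping the requested window, rather than rejecting it, keeps the upload form forgiving while still enforcing the ceiling on the server side.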
Third-party reputation checks, such as VirusTotal hash lookups, are allowed in the design but must be optional and clearly disclosed because they can share hashes or indicators externally.
7. Analysis Pipeline
Fast Scan
The fast scan returns the first usable result and should cover:
- file hashing and fingerprinting
- metadata extraction
- suspicious PDF keyword detection
- structural anomaly checks
- URL and IP extraction
- initial severity scoring and verdict generation
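Several of the fast-scan steps above (hashing, keyword detection, URL extraction, and a first verdict) can be sketched over the raw file bytes. This is an illustrative sketch only: the keyword list, the verdict labels, and the function name fast_scan are assumptions, and scanning raw bytes will miss names hidden inside compressed streams or written with #xx hex escapes.

```python
import hashlib
import re
from typing import Dict, List

# Assumed starter list of PDF name tokens commonly associated with active content.
SUSPICIOUS_NAMES = (
    b"/JavaScript", b"/JS", b"/OpenAction", b"/AA", b"/Launch", b"/EmbeddedFile",
)

# URLs inside PDF literal strings end at ")", so parentheses are excluded.
URL_PATTERN = re.compile(rb"https?://[^\s<>()\[\]\"']+")


def fast_scan(data: bytes) -> Dict[str, object]:
    """Hash the file, count suspicious names, extract URLs, and pick a verdict."""
    keyword_counts: Dict[str, int] = {}
    for name in SUSPICIOUS_NAMES:
        # The lookahead stops /AA from matching a longer name such as /AAPL.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        keyword_counts[name.decode("ascii")] = len(re.findall(pattern, data))

    urls: List[str] = sorted(
        {match.decode("latin-1") for match in URL_PATTERN.findall(data)}
    )
    # Hypothetical two-level verdict; real scoring would weight each indicator.
    verdict = "suspicious" if any(keyword_counts.values()) else "no_indicators"
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "keyword_counts": keyword_counts,
        "urls": urls,
        "verdict": verdict,
    }


sample = (
    b"%PDF-1.7\n1 0 obj\n<< /OpenAction << /S /JavaScript /JS (app.alert(1)) >> "
    b"/URI (http://example.com/a) >>\nendobj\n%%EOF"
)
print(fast_scan(sample)["verdict"])
```

A keyword hit is an indicator, not proof of malice: many benign PDFs use /OpenAction, which is why the design treats this as an initial verdict that background analysis can refine.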
Advanced Scan
Advanced analysis runs in the background and may cover:
- object enumeration and extraction
- stream decompression and inspection
- JavaScript extraction and suspicious-pattern review
- obfuscation heuristic detection
- embedded file detection
- exploit-pattern detection
- optional third-party enrichment
Fast-scan results must remain visible even if advanced jobs fail or take longer than expected.
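One way to honor this rule is to run each advanced task in isolation and record failures next to the findings instead of letting them propagate. The sketch below assumes a scan result is a plain dictionary; the function run_advanced and the field names advanced, advanced_errors, and advanced_status are hypothetical.

```python
from typing import Any, Callable, Dict


def run_advanced(
    fast_result: Dict[str, Any],
    tasks: Dict[str, Callable[[bytes], Any]],
    data: bytes,
) -> Dict[str, Any]:
    """Attach advanced findings to a copy of the fast result; never discard it."""
    result = dict(fast_result)  # fast-scan fields are carried over untouched
    findings: Dict[str, Any] = {}
    errors: Dict[str, str] = {}

    for name, task in tasks.items():
        try:
            findings[name] = task(data)
        except Exception as exc:  # a failing task must not abort the others
            errors[name] = f"{type(exc).__name__}: {exc}"

    if not errors:
        status = "complete"
    elif findings:
        status = "partial"
    else:
        status = "failed"

    result["advanced"] = findings
    result["advanced_errors"] = errors
    result["advanced_status"] = status
    return result


def count_bytes(data: bytes) -> int:
    return len(data)


def broken_task(data: bytes) -> None:
    raise ValueError("bad stream")


fast = {"verdict": "suspicious", "sha256": "abc"}
out = run_advanced(fast, {"size": count_bytes, "javascript": broken_task}, b"1234")
print(out["verdict"], out["advanced_status"])
```

In the queued design each task would more likely run as its own worker job, but the same principle applies: a failure is written to the scan record as data, and the initial verdict stays readable.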
8. Data Flow
- Upload enters through the web layer.
- The web layer validates the request and creates a scan record.
- The source file is stored temporarily for analysis.
- Fast analysis runs and writes the initial result.
- Advanced tasks are queued when applicable.
- Deeper findings are attached to the same scan record.
- Report payloads are generated from normalized findings.
- Source files are deleted immediately or expired according to retention mode.
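The flow above implies a small set of scan states. A minimal sketch of a transition guard follows; the status names and the ALLOWED_TRANSITIONS table are hypothetical and would need to be aligned with the real scan-record schema.

```python
from typing import Dict, Set

# Hypothetical status names; each key maps to the states it may move to.
ALLOWED_TRANSITIONS: Dict[str, Set[str]] = {
    "received": {"fast_scanning", "failed"},
    "fast_scanning": {"fast_complete", "failed"},
    "fast_complete": {"advanced_queued", "complete"},
    # Advanced failures are non-fatal, so this state only moves forward.
    "advanced_queued": {"complete"},
    "complete": {"expired"},
    "failed": {"expired"},
    "expired": set(),
}


def transition(current: str, new: str) -> str:
    """Return the new status, or raise if the move is not allowed."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new


status = "received"
for step in ("fast_scanning", "fast_complete", "advanced_queued", "complete"):
    status = transition(status, step)
print(status)
```

Keeping the table in one place makes it harder for a web handler or a worker to write a status that skips a step, for example marking a scan complete before the fast result exists.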
9. API Direction
The MVP should expose at least:
- POST /api/scans for upload and scan creation
- GET /api/scans/:scanId for status and summary retrieval
- GET /api/scans/:scanId/report for the structured report payload
The product should use stable scan identifiers instead of login-bound history for the first release.
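Because the first release has no accounts, anyone holding a scan identifier can presumably read that scan, so it seems prudent for identifiers to be hard to guess. The sketch below is one possible approach, not a decided format: the function name new_scan_id and the 16-byte length are assumptions.

```python
import re
import secrets

# URL-safe base64 alphabet, so the value can be used directly as :scanId.
SCAN_ID_PATTERN = re.compile(r"^[A-Za-z0-9_-]{22}$")


def new_scan_id() -> str:
    """Generate a random, URL-safe identifier from 16 bytes (128 bits)."""
    return secrets.token_urlsafe(16)


def is_valid_scan_id(value: str) -> bool:
    """Cheap shape check to reject malformed IDs before a database lookup."""
    return bool(SCAN_ID_PATTERN.match(value))


scan_id = new_scan_id()
print(f"/api/scans/{scan_id}")
```

The shape check only filters obviously malformed input; whether a scan exists or has expired still has to be answered by the scan record itself.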
10. Safety Boundaries
- Static analysis is in scope for the MVP.
- Live detonation is out of scope for the MVP.
- Analyzer and worker execution must remain behind internal service boundaries.
- File validation, rate limiting, and retention cleanup are part of the production-leaning baseline.
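File validation can start with two inexpensive checks before any parser touches the upload: a size limit and a look for the %PDF- header. The sketch below is illustrative; the 25 MiB limit, the function name validate_upload, and the problem codes are assumptions. It searches the first 1024 bytes because some PDF readers tolerate leading bytes before the header; a stricter service could require the header at offset 0.

```python
from typing import List

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # assumed limit, not specified in the design
HEADER_SEARCH_WINDOW = 1024


def validate_upload(data: bytes, max_bytes: int = MAX_UPLOAD_BYTES) -> List[str]:
    """Return a list of problem codes; an empty list means the upload may proceed."""
    if not data:
        return ["empty_file"]

    problems: List[str] = []
    if len(data) > max_bytes:
        problems.append("too_large")
    if b"%PDF-" not in data[:HEADER_SEARCH_WINDOW]:
        problems.append("missing_pdf_header")
    return problems


print(validate_upload(b"%PDF-1.7\n%%EOF"))
print(validate_upload(b"MZ\x90\x00 not a pdf"))
```

Passing these checks says nothing about whether the file is well formed, so the analyzers still need to handle malformed PDFs gracefully.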
11. Documentation Policy
Documentation is a required deliverable during development.
The repository should maintain:
- product documentation
- architecture documentation
- API contracts
- security and privacy guidance
- operational runbooks
- schema references
- architecture decision records
These docs must be updated alongside code changes and used as the active guide for future implementation and expansion.
12. Initial Documentation Set
The project should keep the following files current:
- docs/README.md
- docs/product/vision.md
- docs/architecture/system-overview.md
- docs/architecture/analysis-pipeline.md
- docs/api/contracts.md
- docs/security/privacy.md
- docs/operations/deployment.md
- docs/decisions/ADR-0001-platform-architecture.md
- docs/runbooks/retention-cleanup.md
- docs/schemas/scan-result-schema.md
13. Validation And Delivery Expectations
The production-leaning MVP should be built with:
- upload validation
- safe default retention behavior
- reliable scan status transitions
- graceful handling of malformed PDFs
- non-fatal advanced-analysis failure handling
- enforceable cleanup jobs
- documentation updates bundled with major feature work
14. Implementation Readiness
This design is intentionally structured so the next phase can produce:
- a concrete repo plan
- a modular monorepo layout
- explicit service boundaries
- implementation tasks tied to the documented architecture