PDFy

PDFy PDF Malware Scanner Design

1. Product Summary

PDFy is a website-first PDF malware scanning product. Users upload a PDF, receive a fast initial verdict, and optionally receive richer findings from background analysis. The first release is anonymous by default and optimized for simple public use with minimal friction.

2. Goals

3. Non-Goals

4. Approved Architecture

The system uses a split architecture:

This architecture is chosen because the product is website-first and the strongest practical malware-analysis ecosystem for PDFs is Python-oriented.

5. User Experience Scope

The MVP user flow is:

  1. User visits the scanner homepage.
  2. User uploads a PDF.
  3. User chooses retention behavior, with immediate deletion as the default.
  4. System runs the fast scan and returns an initial verdict.
  5. Background analysis extends the result when deeper findings become available.
  6. User reviews a structured report.
  7. Any retained files expire automatically.

Planned first-release pages:

6. Retention And Privacy Rules

Retention policy is part of the product design, not an afterthought.

Immediate deletion is the default because it reduces storage cost, lowers abuse potential, and creates a safer baseline for a public upload service. Temporary caching exists only as an opt-in convenience path for users who want short-lived revisit access to results.
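
The default-to-deletion rule could be sketched as follows. This is a minimal illustration, not a confirmed implementation: the names RetentionMode, parse_retention_choice, and expires_at are hypothetical, and the 24-hour cache lifetime is an assumed placeholder because this design does not specify one.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class RetentionMode(Enum):
    DELETE_IMMEDIATELY = "delete_immediately"  # the default
    TEMPORARY_CACHE = "temporary_cache"        # opt-in only


# Assumed value for illustration; the design does not state a cache lifetime.
ASSUMED_CACHE_TTL = timedelta(hours=24)


def parse_retention_choice(raw: Optional[str]) -> RetentionMode:
    """Map a submitted form value to a mode.

    Anything other than an explicit opt-in falls back to immediate deletion.
    """
    if raw == RetentionMode.TEMPORARY_CACHE.value:
        return RetentionMode.TEMPORARY_CACHE
    return RetentionMode.DELETE_IMMEDIATELY


def expires_at(mode: RetentionMode, analysis_finished: datetime) -> datetime:
    """Return the moment after which the source file must no longer exist."""
    if mode is RetentionMode.TEMPORARY_CACHE:
        return analysis_finished + ASSUMED_CACHE_TTL
    return analysis_finished
```

Treating a missing or unrecognized value as immediate deletion keeps the safer behavior as the fallback rather than relying on the client to send the default.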

Third-party reputation checks, such as VirusTotal hash lookups, are allowed in the design but must be optional and clearly disclosed because they send file hashes or other indicators to an external service.
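
One way to keep reputation checks strictly opt-in is to gate the external call behind an explicit flag and pass only a locally computed hash. The sketch below assumes a hypothetical lookup callable supplied by the caller; it does not model any real reputation service's API, and the function names are illustrative.

```python
import hashlib
from typing import Callable, Optional


def sha256_hex(data: bytes) -> str:
    """Compute the SHA-256 digest of the uploaded bytes locally."""
    return hashlib.sha256(data).hexdigest()


def reputation_check(
    data: bytes,
    user_opted_in: bool,
    lookup: Callable[[str], dict],  # hypothetical client for an external service
) -> Optional[dict]:
    """Run the external lookup only when the user has opted in.

    Only the hash leaves the system; the file contents are never passed on.
    """
    if not user_opted_in:
        return None  # nothing is shared externally
    return lookup(sha256_hex(data))
```

Returning None when the user has not opted in lets the report distinguish "check not performed" from "check performed, nothing found".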

7. Analysis Pipeline

Fast Scan

The fast scan returns the first usable result and should cover:

Advanced Scan

Advanced analysis runs in the background and may cover:

Fast-scan results must remain visible even if advanced jobs fail or take longer than expected.
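
The requirement that fast-scan results survive advanced-job failures can be illustrated by keeping the two result sets in separate fields. This is a sketch under assumptions: ScanReport, apply_advanced, the status strings, and the example verdict values are all hypothetical and not defined by this design.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ScanReport:
    fast_verdict: str
    fast_findings: List[str]
    advanced_status: str = "pending"  # assumed states: pending, complete, failed
    advanced_findings: List[str] = field(default_factory=list)
    advanced_error: Optional[str] = None


def apply_advanced(report: ScanReport, job: Callable[[], List[str]]) -> ScanReport:
    """Attach advanced findings without ever touching the fast-scan fields."""
    try:
        findings = job()
    except Exception as exc:
        # Record the failure; the fast verdict and findings stay as they were.
        report.advanced_status = "failed"
        report.advanced_error = type(exc).__name__
        return report
    report.advanced_findings = findings
    report.advanced_status = "complete"
    return report
```

Because the advanced job only ever writes to its own fields, a timeout or crash degrades the report to "fast result plus a failure note" instead of removing what the user has already seen.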

8. Data Flow

  1. Upload enters through the web layer.
  2. The web layer validates the request and creates a scan record.
  3. The source file is stored temporarily for analysis.
  4. Fast analysis runs and writes the initial result.
  5. Advanced tasks are queued when applicable.
  6. Deeper findings are attached to the same scan record.
  7. Report payloads are generated from normalized findings.
  8. Source files are deleted immediately or expired according to retention mode.
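
The eight steps above can be sketched in a single function. This is a simplified illustration under stated assumptions: the advanced scan runs inline instead of through a queue, the header check is a stand-in for real request validation, and ScanRecord, process_upload, and the injected fast_scan and advanced_scan callables are hypothetical names.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ScanRecord:
    scan_id: str
    retain: bool
    source_path: Optional[str] = None
    fast_result: Optional[str] = None
    findings: List[str] = field(default_factory=list)
    report: Optional[dict] = None


def process_upload(
    scan_id: str,
    data: bytes,
    retain: bool,
    fast_scan: Callable[[str], str],
    advanced_scan: Optional[Callable[[str], List[str]]] = None,
) -> ScanRecord:
    # Steps 1-2: validate the request and create the scan record.
    if not data.startswith(b"%PDF-"):
        raise ValueError("upload does not look like a PDF")
    record = ScanRecord(scan_id=scan_id, retain=retain)

    # Step 3: store the source file temporarily for analysis.
    fd, path = tempfile.mkstemp(suffix=".pdf")
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    record.source_path = path

    try:
        # Step 4: the fast scan writes the initial result.
        record.fast_result = fast_scan(path)
        # Steps 5-6: deeper findings attach to the same record (inline here).
        if advanced_scan is not None:
            try:
                record.findings.extend(advanced_scan(path))
            except Exception:
                pass  # the fast result stays usable if deeper analysis fails
        # Step 7: build the report payload from the collected findings.
        record.report = {
            "scan_id": scan_id,
            "verdict": record.fast_result,
            "findings": list(record.findings),
        }
    finally:
        # Step 8: delete the source now unless the user opted into retention.
        if not retain:
            os.remove(path)
            record.source_path = None
    return record
```

Placing the deletion in a finally block means the source file is removed even when analysis raises, which matches the intent of immediate deletion as the default. Expiry of retained files is left out of this sketch.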

9. API Direction

The MVP should expose at least:

The product should use stable scan identifiers instead of login-bound history for the first release.
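
Because anonymous users have no account to tie results to, the scan identifier itself acts as the access token for a report. A minimal sketch, assuming identifiers are random and unguessable (the design does not state an identifier format, and new_scan_id, get_report, and the in-memory store are hypothetical):

```python
import secrets
from typing import Dict, Optional


def new_scan_id() -> str:
    """Return a random URL-safe identifier built from 16 random bytes."""
    return secrets.token_urlsafe(16)


# Stand-in for whatever persistence layer the implementation chooses.
_reports: Dict[str, dict] = {}


def save_report(report: dict) -> str:
    """Store a report and return the identifier the user can revisit."""
    scan_id = new_scan_id()
    _reports[scan_id] = report
    return scan_id


def get_report(scan_id: str) -> Optional[dict]:
    """Look a report up by identifier alone; no login is involved."""
    return _reports.get(scan_id)
```

Sequential or otherwise predictable identifiers would let anyone enumerate other users' reports, so a cryptographically random source is assumed here.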

10. Safety Boundaries

11. Documentation Policy

Documentation is a required deliverable during development.

The repository should maintain:

These docs must be updated alongside code changes and used as the active guide for future implementation and expansion.

12. Initial Documentation Set

The project should keep the following files current:

13. Validation And Delivery Expectations

The production-leaning MVP should be built with:

14. Implementation Readiness

This design is intentionally structured so the next phase can produce: