Analysis Pipeline
Two-Phase Scan Model
PDFy uses a hybrid scan model:
- Fast scan: synchronous or near-synchronous checks that return an initial result quickly
- Advanced scan: queued analysis that augments the result with deeper findings
Fast Scan Responsibilities
The fast scan should produce the first usable verdict and include:
- file hash and basic fingerprinting
- metadata extraction
- suspicious PDF keyword detection
- quick structural anomaly checks
- URL and IP extraction
- initial severity scoring
- an initial verdict summary
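The fast-scan checks above can be sketched roughly as follows. This is a minimal illustration, not PDFy's actual implementation: the function name `fast_scan`, the keyword weights, the severity thresholds, and the result field names are all assumptions made for the example.

```python
import hashlib
import ipaddress
import re

# Hypothetical keyword weights (assumption; the real list is not given in this document).
KEYWORD_WEIGHTS = {
    b"/JavaScript": 3,
    b"/JS": 3,
    b"/Launch": 3,
    b"/OpenAction": 2,
    b"/AA": 2,
    b"/EmbeddedFile": 2,
}
URL_RE = re.compile(rb"https?://[^\s<>()\"']+")
IP_RE = re.compile(rb"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def _valid_ip(text: str) -> bool:
    """Reject dotted quads such as 999.1.1.1 that the regex lets through."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False


def fast_scan(data: bytes) -> dict:
    # Quick structural anomaly checks: header at offset 0, EOF marker near the end.
    anomalies = []
    if not data.startswith(b"%PDF-"):
        anomalies.append("missing_header")
    if b"%%EOF" not in data[-1024:]:
        anomalies.append("missing_eof")

    # Suspicious keyword detection. The lookahead stops "/AA" matching "/AAPL".
    keywords = {}
    score = len(anomalies)
    for name, weight in KEYWORD_WEIGHTS.items():
        hits = re.findall(re.escape(name) + rb"(?![A-Za-z0-9])", data)
        if hits:
            keywords[name.decode()] = len(hits)
            score += weight  # each keyword type counts once, regardless of hit count

    # Hypothetical thresholds for the initial severity score.
    if score >= 5:
        severity = "high"
    elif score >= 2:
        severity = "medium"
    elif score > 0:
        severity = "low"
    else:
        severity = "none"

    urls = sorted({m.decode("latin-1") for m in URL_RE.findall(data)})
    ips = sorted({m.decode() for m in IP_RE.findall(data) if _valid_ip(m.decode())})

    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "size": len(data),
        "anomalies": anomalies,
        "keywords": keywords,
        "urls": urls,
        "ips": ips,
        "score": score,
        "severity": severity,
        "summary": f"{severity}: {len(keywords)} keyword type(s), {len(anomalies)} anomaly(ies)",
    }
```

Metadata extraction is left out of the sketch because it needs a PDF parser. Everything shown works on raw bytes, which keeps this phase cheap enough to run synchronously.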
Advanced Scan Responsibilities
Advanced analysis runs after the fast verdict and may include:
- deep object enumeration and extraction
- stream decompression and inspection
- embedded JavaScript extraction and heuristic review
- obfuscation pattern checks such as suspicious eval, unescape, or encoded payload usage
- embedded file detection
- exploit-pattern detection against known suspicious techniques
- optional external enrichment such as VirusTotal hash reputation checks
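As a rough illustration of two of the items above, stream decompression and obfuscation pattern checks, the sketch below inflates Flate-compressed streams with the standard `zlib` module and flags a few JavaScript patterns. The helper names `inflate_streams` and `obfuscation_findings` and the exact patterns are assumptions for the example. A real analyzer would also honor each stream's `/Filter` and `/Length` entries rather than scanning with a regular expression.

```python
import re
import zlib

# Naive stream locator (assumption: good enough for a sketch, not for hostile input).
# The lookbehind prevents the "stream" inside "endstream" from starting a match.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)

# Hypothetical obfuscation indicators, checked in this order.
OBFUSCATION_PATTERNS = {
    "eval": re.compile(rb"\beval\s*\("),
    "unescape": re.compile(rb"\bunescape\s*\("),
    "fromCharCode": re.compile(rb"String\.fromCharCode"),
    "encoded_blob": re.compile(rb"(?:%u[0-9a-fA-F]{4}){8,}|(?:\\x[0-9a-fA-F]{2}){16,}"),
}


def inflate_streams(data: bytes) -> list[bytes]:
    """Return every stream body, inflated when it is valid zlib data."""
    bodies = []
    for match in STREAM_RE.finditer(data):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes that trail the
            # compressed data before "endstream"; they land in unused_data.
            bodies.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            bodies.append(raw)  # not Flate-encoded or damaged: keep the raw bytes
    return bodies


def obfuscation_findings(payload: bytes) -> list[str]:
    """Names of the obfuscation patterns present in one decoded payload."""
    return [name for name, rx in OBFUSCATION_PATTERNS.items() if rx.search(payload)]
```

A caller would run `obfuscation_findings` over each body returned by `inflate_streams` and attach any hits to the scan record as advanced findings.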
Scan Lifecycle
- Receive upload request and retention preference.
- Persist a scan record.
- Store the uploaded file temporarily for processing.
- Run fast analysis and publish initial findings.
- Queue advanced tasks when applicable.
- Append advanced findings to the same scan record.
- Generate a structured report payload.
- Delete or expire the source file according to retention policy.
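The lifecycle steps above could be wired together along these lines. The sketch assumes an in-memory `ScanRecord`, a standard-library `Queue` standing in for the job queue, and caller-supplied `fast` and `advanced` callables. All of these names are hypothetical, and a real deployment would use a database and a separate worker process.

```python
import os
import tempfile
import uuid
from dataclasses import dataclass, field
from queue import Queue


@dataclass
class ScanRecord:
    scan_id: str
    retain_file: bool
    status: str = "received"
    findings: list = field(default_factory=list)


def run_lifecycle(data: bytes, retain_file: bool, fast, advanced, jobs: Queue) -> dict:
    # Receive the upload and retention preference, then persist a scan record.
    record = ScanRecord(scan_id=uuid.uuid4().hex, retain_file=retain_file)

    # Store the uploaded file temporarily for processing.
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)

        # Run fast analysis and publish initial findings.
        record.findings += fast(path)
        record.status = "fast_complete"

        # Queue advanced tasks. A worker would normally consume the queue;
        # the job is drained inline here only to keep the sketch self-contained.
        jobs.put(record.scan_id)
        jobs.get()

        # Append advanced findings to the same scan record.
        record.findings += advanced(path)
        record.status = "complete"
    finally:
        # Delete the source file unless the retention preference says otherwise.
        if not retain_file:
            os.remove(path)

    # Generate a structured report payload.
    return {
        "scan_id": record.scan_id,
        "status": record.status,
        "findings": record.findings,
        "source_path": path if retain_file else None,
    }
```

Placing the cleanup in `finally` means the temporary file is removed even if an analysis step raises, so the retention policy holds on failure paths too.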
Failure Behavior
- Invalid or malformed PDFs should produce a completed scan with an error-aware result, not a silent failure.
- Advanced job failures should not erase fast-scan findings.
- Third-party enrichment failures should be non-fatal and reported separately from core analysis results.
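One way to satisfy these three rules is to isolate each stage in its own error boundary. The sketch below is an assumption about how that might look: `build_result`, the stage callables, and the result layout are hypothetical names, and a malformed PDF is modeled as a `ValueError` raised by the fast stage.

```python
def build_result(fast, advanced, enrich) -> dict:
    """Run the three stages so that a failure in one never hides another's output."""
    result = {
        "status": "completed",  # a malformed PDF still yields a completed scan
        "fast": None,
        "advanced": None,
        "errors": [],  # core analysis errors only
        "enrichment": {"status": "skipped", "data": None},
    }

    try:
        result["fast"] = fast()
    except ValueError as exc:
        # Invalid or malformed input: report it explicitly instead of failing silently.
        result["errors"].append({"stage": "fast", "message": str(exc)})
        return result

    try:
        result["advanced"] = advanced()
    except Exception as exc:
        # Fast-scan findings are already stored and are left untouched.
        result["errors"].append({"stage": "advanced", "message": str(exc)})

    try:
        result["enrichment"] = {"status": "ok", "data": enrich()}
    except Exception as exc:
        # Non-fatal, and kept out of the core "errors" list.
        result["enrichment"] = {"status": "failed", "error": str(exc)}

    return result
```

Keeping enrichment failures in their own field lets a report consumer tell "the analysis found nothing" apart from "the reputation lookup was unavailable".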