Bumblebee: Read-only inventory collector for package, extension, and developer-tool metadata
Bumblebee: A Detailed Look at a Read-Only Inventory Scanner for Developer Endpoints
Introduction
In modern software supply chains, knowing what actually sits on developer machines is as important as knowing what shipped in an SBOM. Bumblebee is a purpose-built, read-only inventory collector for macOS and Linux endpoints that brings order to scattered on-disk state. It focuses on package, extension, and developer-tool metadata found in lockfiles, package-manager metadata, extension manifests, and MCP (Model Context Protocol) JSON configs. When given an exposure catalog, Bumblebee flags exact matches, supporting fast, read-only exposure checks for teams that already know what they are looking for. In short, it answers the question: when an advisory names a package, extension, or version, which developer machines show a match in their on-disk metadata right now?
What Bumblebee Is Designed To Do
- Narrow the supply-chain question into a manageable on-disk state snapshot.
- Translate scattered local state into structured NDJSON component records for easy ingestion and correlation.
- Operate as a single static binary with zero non-standard dependencies, ensuring reproducibility and ease of deployment.
- Support targeted exposure checks by consuming an exposure catalog and returning precise matches.
- Avoid running external package-manager commands or reading arbitrary source files, respecting endpoint safety and performance constraints.
Scope and Build Philosophy
Bumblebee is crafted as a self-contained, one-shot scanner with a conservative security posture and a focused feature set.
Key scope points include:
- A single static binary built with Go 1.25+ and zero non-stdlib dependencies.
- Three scan profiles to accommodate different populations and cadences:
  - baseline: broad, global/user-level inventories.
  - project: inventories scoped to known development directories.
  - deep: targeted, on-demand checks with explicit roots and optional exposure catalogs.
- Read-only collection of local state from allowed sources: lockfiles, package-manager install metadata, extension manifests, and MCP JSON configs. It explicitly does not execute package manager commands (like npm ls, pip show, go list) nor read source files.
- MCP host configs may carry environment values and credentials in their env blocks. Bumblebee parses these values for server inventory needs but does not emit those values in its own records.
- Output is NDJSON, one record per line, ending with a scan_summary record that tells receivers whether the run is ready to be treated as current state.
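The env-block handling described above can be illustrated with a small parser sketch. This is not Bumblebee's implementation: the mcpServers key follows the layout commonly used by MCP host configs, and the function name and output fields (server_name, command, has_env) are hypothetical.

```python
import json


def summarize_mcp_servers(config_text: str) -> list:
    """Return one inventory entry per configured MCP server, without env values."""
    config = json.loads(config_text)
    servers = config.get("mcpServers", {})  # assumed top-level key
    entries = []
    for name in sorted(servers):
        spec = servers[name]
        entries.append({
            "server_name": name,  # hypothetical output field
            "command": spec.get("command"),
            # The env block is inspected only to note that it exists;
            # its values (which may be credentials) are never copied.
            "has_env": bool(spec.get("env")),
        })
    return entries


sample_config = json.dumps({
    "mcpServers": {
        "files": {
            "command": "npx",
            "env": {"API_TOKEN": "secret-value"},
        }
    }
})
entries = summarize_mcp_servers(sample_config)
print(entries)
```

The point of the design is that the secret never reaches the output structure at all, so nothing downstream has to remember to redact it.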
Profiles and Their Use Cases
Bumblebee exposes three distinct profiles, designed for different workflows and cadences:
Baseline
- Scans: Common global and user package roots, language toolchains, editor extensions, browser extensions, and MCP configs.
- Use case: Recurring, lightweight inventory checks performed by external runners to maintain a baseline view of endpoints.
Project
- Scans: Configured development directories such as ~/code, ~/src, or ~/work.
- Use case: Recurring inventory for known project workspaces, offering a project-scoped view of dependencies and configurations.
Deep
- Scans: Explicit --root paths, including broad roots like $HOME.
- Use case: On-demand incident or campaign checks, typically with --ecosystem, --exposure-catalog, or --findings-only enabled. Deep is the most thorough and is designed for targeted investigations.
Notes on Root Scope
- Baseline and project profiles are not designed to scan bare home directories.
- Deep is the only profile that supports scanning home-root-like paths, enabling a broader reach when necessary.
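The root-scope rule above can be sketched as a simple guard. This is an assumption-laden illustration: the function name is hypothetical, and "home-root-like" is reduced here to an exact match on the home directory, which may be narrower than what Bumblebee actually checks.

```python
import os.path


def root_allowed(profile: str, root: str, home: str) -> bool:
    """Return True if `root` may be scanned under `profile`."""
    # Normalize so "/Users/alex/" and "/Users/alex" compare equal.
    is_bare_home = os.path.normpath(root) == os.path.normpath(home)
    # Only the deep profile is meant to accept the home directory itself.
    return profile == "deep" or not is_bare_home
```

For example, root_allowed("project", "/Users/alex/code", "/Users/alex") is allowed, while the same call with the bare home path is refused unless the profile is deep.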
Install, Build, and Self-Test
Install and build instructions are deliberately straightforward to support quick deployments and consistent builds.
Install (Go 1.25+)
- To install the latest tagged release into your Go bin:
  - go install github.com/perplexityai/bumblebee/cmd/bumblebee@latest
- To pin a specific tag:
  - go install github.com/perplexityai/bumblebee/cmd/bumblebee@v0.1.1
Build from a checkout
- go build -o bumblebee ./cmd/bumblebee
- Optional tests:
  - go test ./...
Stamp a version at build time
- go build -ldflags "-X main.Version=v0.1.1" -o bumblebee ./cmd/bumblebee
Version reporting
- Running bumblebee version prints the version, the VCS revision, build time, and the Go runtime, allowing production records to be traced back to a specific build.
- Version precedence: -ldflags override, module version from go install, then the in-tree default tracked in VERSION.
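The precedence order above amounts to "first non-empty value wins". A minimal sketch, with a hypothetical function name and a hypothetical "unknown" fallback that the source does not mention:

```python
def resolve_version(ldflags_version: str, module_version: str, default_version: str) -> str:
    """Return the first non-empty candidate, in the documented precedence order."""
    for candidate in (ldflags_version, module_version, default_version):
        if candidate:
            return candidate
    return "unknown"  # hypothetical fallback when nothing is set
```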
Self-test
- After installing, run the built-in end-to-end check against embedded fixtures:
  - bumblebee selftest
  - Expected output: selftest OK (2 findings in 1ms)
- About the fixtures: they live inside the binary, use fake package names (e.g., bumblebee-selftest-evil@0.0.0), and do not perform network calls.
- A non-zero exit indicates the local install can no longer detect what it should, serving as a quick pre-deployment smoke test for fleet rollouts.
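A fleet rollout script could gate on the self-test exit code as described above. The wrapper below is a sketch: the function name and timeout are assumptions, and only the bumblebee selftest command and its zero/non-zero exit convention come from the text.

```python
import subprocess
import sys


def selftest_passes(command=("bumblebee", "selftest"), timeout_s=60):
    """Run the self-test command and report whether it exited with status 0."""
    try:
        result = subprocess.run(
            list(command), capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A missing binary or a hung process both count as a failed smoke test.
        print(f"selftest could not complete: {exc}", file=sys.stderr)
        return False
    if result.returncode != 0:
        print(result.stderr, file=sys.stderr)
    return result.returncode == 0
```

A deployment step would call selftest_passes() once after install and stop the rollout on False.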
Quick Start and Day-to-Day Usage
Bumblebee is a one-shot scanner: each invocation performs a single scan and exits. Cadence is left to whatever runs it (cron, launchd, systemd, MDM, and so on).
A quick sample workflow might be:
- Baseline global inventory:
  - bumblebee scan --profile baseline > inventory.ndjson
- Daily project sweep with explicit roots:
  - bumblebee scan --profile project --root "$HOME/code" --root "$HOME/Developer"
- Limit a run to selected ecosystems:
  - bumblebee scan --profile baseline --ecosystem npm,pypi --ecosystem go
- On-demand exposure scan against a published advisory:
  - bumblebee scan --profile deep --root "$HOME" --exposure-catalog ./catalog.json --max-duration 10m
Preview roots without scanning
- To preview the resolved roots for a given profile, use:
  - bumblebee roots --profile baseline
- This prints tab-delimited lines of potential roots and used paths, so operators can plan a scan before running one.
Flag notes and behaviors
- --root: Filesystem path to scan; required for the deep profile (--profile deep), optional for other profiles.
- --ecosystem: Repeatable and comma-separated; limits results to specified ecosystems.
- --exposure-catalog: Accepts a JSON file or a directory of JSON catalogs (merged non-recursively). All files must share schema_version.
- --findings-only: Requires --exposure-catalog and suppresses package records while retaining findings.
- --help: Lists every flag and its usage.
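A runner that assembles the command line can enforce the flag rules above before invoking the binary. The helper below is a hypothetical sketch; it uses only the flags listed in this post and assumes --findings-only is a plain boolean switch.

```python
def build_scan_argv(profile, roots=(), ecosystems=(), exposure_catalog=None,
                    findings_only=False):
    """Build an argv list for `bumblebee scan`, rejecting invalid combinations."""
    if profile not in ("baseline", "project", "deep"):
        raise ValueError(f"unknown profile: {profile}")
    if profile == "deep" and not roots:
        raise ValueError("--root is required for the deep profile")
    if findings_only and exposure_catalog is None:
        raise ValueError("--findings-only requires --exposure-catalog")

    argv = ["bumblebee", "scan", "--profile", profile]
    for root in roots:
        argv += ["--root", root]
    if ecosystems:
        # The flag accepts comma-separated values, so one flag is enough.
        argv += ["--ecosystem", ",".join(ecosystems)]
    if exposure_catalog is not None:
        argv += ["--exposure-catalog", exposure_catalog]
    if findings_only:
        argv.append("--findings-only")
    return argv


print(build_scan_argv("baseline", ecosystems=["npm", "pypi"]))
```

Failing early in the runner keeps a misconfigured job from producing an empty or misleading inventory file.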
What the Output Looks Like
Bumblebee emits NDJSON, one record per line. Diagnostics go to stderr, also as NDJSON, and each run ends with a scan_summary record. Receivers can consume the output as a stream or in batches, depending on the downstream pipeline.
Package records
- Contain fields for identity, provenance, and confidence, enabling downstream systems to determine how much trust to place in a match.
- Example (simplified view):
- { "record_type": "package", "record_id": "package:…", "schema_version": "0.1.0", "scanner_name": "bumblebee", "scanner_version": "v0.1.1", "run_id": "9b1f0c2e4d5a6b7c8d9e0f1a2b3c4d5e", "scan_time": "2026-05-15T18:22:01.482Z", "endpoint": { "hostname": "alex-mbp", "os": "darwin", "arch": "arm64", "username": "alex", "uid": "501", "device_id": "MDM-7F4A2B" }, "profile": "project", "ecosystem": "npm", "package_name": "@tanstack/query-core", "normalized_name": "@tanstack/query-core", "version": "5.59.20", "project_path": "/Users/alex/code/web-app", "root_kind": "project_root", "package_manager": "pnpm", "source_type": "pnpm-lockfile", "source_file": "/Users/alex/code/web-app/pnpm-lock.yaml", "has_lifecycle_scripts": false, "confidence": "high" }
Finding records
- Represent exposure matches against an inventory snapshot when an exposure catalog is used.
- Example (simplified):
- { "record_type": "finding", "record_id": "finding:…", "schema_version": "0.1.0", "scanner_name": "bumblebee", "scanner_version": "v0.1.1", "run_id": "3a8c7d1e9f0b2a4c6d8e0f1a2b3c4d5e", "scan_time": "2026-05-15T18:22:01.482Z", "endpoint": { "hostname": "alex-mbp", "os": "darwin", "arch": "arm64", "username": "alex", "uid": "501", "device_id": "MDM-7F4A2B" }, "profile": "deep", "finding_type": "package_exposure", "severity": "critical", "catalog_id": "advisory-2026-0042", "catalog_name": "example-pkg 1.2.3 (compromised release)", "ecosystem": "npm", "package_name": "example-pkg", "normalized_name": "example-pkg", "version": "1.2.3", "root_kind": "deep_home_root", "project_path": "/Users/alex/code/web-app", "source_type": "pnpm-lockfile", "source_file": "/Users/alex/code/web-app/pnpm-lock.yaml", "confidence": "high", "evidence": "exact name+version match (version=1.2.3)" }
Important: The record_id is a content-addressed hash of a canonical identity tuple for the given record type. It ensures stable identity across runs and allows deduplication logic to be robust in downstream systems.
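To make the idea of a content-addressed ID concrete, here is one way such an ID could be derived. The hash function (SHA-256), the canonical encoding (compact JSON), and the fields in the identity tuple are all assumptions for illustration; the post does not specify what Bumblebee actually hashes.

```python
import hashlib
import json


def content_addressed_id(record_type: str, identity: tuple) -> str:
    """Derive a stable ID from a record type and its canonical identity tuple."""
    # Compact, fixed-order JSON gives a canonical byte string for hashing.
    canonical = json.dumps([record_type, *identity], separators=(",", ":"),
                           ensure_ascii=False)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{record_type}:{digest}"


first = content_addressed_id("package", ("npm", "@tanstack/query-core", "5.59.20"))
second = content_addressed_id("package", ("npm", "@tanstack/query-core", "5.59.20"))
other = content_addressed_id("package", ("npm", "@tanstack/query-core", "5.59.21"))
print(first == second, first == other)  # prints: True False
```

Because the ID depends only on the identity tuple, two scans that observe the same component produce the same ID, which is what lets a receiver deduplicate without keeping per-run state.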
Exposure Catalog Format
Bumblebee accepts exposure catalogs in a minimal JSON structure that focuses on exact matches by (ecosystem, name, version):
Minimal catalog format:
{ "schema_version": "0.1.0", "entries": [ { "id": "advisory-2026-0042", "name": "example-pkg 1.2.3 (compromised release)", "ecosystem": "npm", "package": "example-pkg", "versions": ["1.2.3"], "severity": "critical" } ] }
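Exact matching on (ecosystem, name, version) maps naturally onto a dictionary lookup. The sketch below uses the catalog keys shown above; the function names are hypothetical, and installed components are passed as plain tuples so that no record field names need to be assumed. It also skips name normalization, which a real scanner would need.

```python
def build_exposure_index(catalog: dict) -> dict:
    """Index catalog entries by (ecosystem, package, version) for O(1) lookups."""
    index = {}
    for entry in catalog["entries"]:
        for version in entry["versions"]:
            index[(entry["ecosystem"], entry["package"], version)] = entry
    return index


def find_exposures(installed, index):
    """Return (component, catalog_entry) pairs for every exact tuple match."""
    return [(component, index[component]) for component in installed
            if component in index]


catalog = {
    "schema_version": "0.1.0",
    "entries": [{
        "id": "advisory-2026-0042",
        "name": "example-pkg 1.2.3 (compromised release)",
        "ecosystem": "npm",
        "package": "example-pkg",
        "versions": ["1.2.3"],
        "severity": "critical",
    }],
}
installed = [("npm", "example-pkg", "1.2.3"), ("npm", "example-pkg", "1.2.4")]
matches = find_exposures(installed, build_exposure_index(catalog))
for component, entry in matches:
    print(component, entry["id"], entry["severity"])
```

Only the 1.2.3 component matches; 1.2.4 is ignored because there is no range logic, just equality on the full tuple.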
The catalog should be a JSON object with schema_version and entries keys; bare top-level arrays are rejected, and unsupported future schema versions are rejected.
You can load multiple catalogs by pointing --exposure-catalog at a directory containing multiple *.json catalogs; they are merged non-recursively, and all files must share a common schema_version.
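The catalog rules above (object at the top level, required keys, rejected future versions, non-recursive directory merge, one shared schema_version) can be sketched as a loader. Function names are hypothetical, and the set of supported versions is an assumption based on the single version shown in this post.

```python
import json
import tempfile
from pathlib import Path

SUPPORTED_SCHEMA_VERSIONS = {"0.1.0"}  # assumption: only the version shown above


def parse_catalog(text: str) -> dict:
    """Validate one catalog document and return it."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("catalog must be a JSON object, not a bare array")
    if "schema_version" not in data or "entries" not in data:
        raise ValueError("catalog needs schema_version and entries keys")
    if data["schema_version"] not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"unsupported schema_version: {data['schema_version']}")
    return data


def load_catalogs(path: str) -> dict:
    """Load a single catalog file, or merge every *.json directly inside a directory."""
    target = Path(path)
    # Path.glob("*.json") matches direct children only, so the merge is non-recursive.
    files = sorted(target.glob("*.json")) if target.is_dir() else [target]
    catalogs = [parse_catalog(f.read_text(encoding="utf-8")) for f in files]
    if not catalogs:
        raise ValueError("no catalogs found")
    versions = {c["schema_version"] for c in catalogs}
    if len(versions) != 1:
        raise ValueError("all catalogs must share one schema_version")
    merged = {"schema_version": versions.pop(), "entries": []}
    for catalog in catalogs:
        merged["entries"].extend(catalog["entries"])
    return merged


def demo() -> dict:
    """Write two catalogs plus a nested (ignored) file, then merge the directory."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("a.json", "b.json"):
            entry = {"id": name, "ecosystem": "npm", "package": "example-pkg",
                     "versions": ["1.2.3"]}
            doc = {"schema_version": "0.1.0", "entries": [entry]}
            Path(tmp, name).write_text(json.dumps(doc), encoding="utf-8")
        nested = Path(tmp, "nested")
        nested.mkdir()
        Path(nested, "c.json").write_text("[]", encoding="utf-8")  # not a direct child
        return load_catalogs(tmp)
```

Calling demo() returns a merged catalog with the entries from a.json and b.json; the invalid file under nested/ is never read because the merge does not recurse.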
Sample Exposure Catalogs and Threat Intel
The project maintains a threat_int directory with exposure catalogs built from public threat intelligence. These catalogs are assembled by Perplexity Computer and updated via PRs as new campaigns are reported. See threat_int/README.md for the current catalog list and guidance on review. This is a practical starting point for on-demand scans that need up-to-date exposure data.
How It Integrates into Operations
- Quiet, read-only operation means Bumblebee can be run on endpoints and in fleet rollouts without introducing new risk or dependencies.
- The NDJSON stream is designed for downstream processing: a lightweight, scalable approach to inventory, with the ability to feed into SIEMs, asset inventories, or security analytics pipelines.
- The scan_summary record helps receivers determine whether a given run should be promoted to current state, enabling controlled rollouts and versioned state.
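On the receiver side, the simplest promotion gate is to check that a run actually ended with its summary record. The sketch below assumes the summary carries a record_type field equal to "scan_summary"; the field name is passed as a parameter because it is an assumption, and the real promotion criteria live in docs/state-model.md, which this post does not reproduce.

```python
import json


def run_is_complete(ndjson_text: str, type_key: str = "record_type") -> bool:
    """Return True if the last non-empty NDJSON line is a scan_summary record."""
    lines = [line for line in ndjson_text.splitlines() if line.strip()]
    if not lines:
        return False
    try:
        last = json.loads(lines[-1])
    except json.JSONDecodeError:
        # A truncated final line means the run was cut off mid-write.
        return False
    return isinstance(last, dict) and last.get(type_key) == "scan_summary"
```

A receiver would keep the previous current state whenever this returns False, so an interrupted scan never replaces a complete one.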
Security, Licensing, and Documentation
- License: Apache License 2.0
- Documentation references: The project includes docs/transport.md for output destinations and docs/state-model.md for receiver-side state modeling. There are also docs/inventory-sources.md and related files that enumerate supported sources for the on-disk state.
- The design deliberately avoids network calls or code execution during scanning, emphasizing safety and predictability on developer endpoints.
Inventory Sources and Ecosystem Coverage
Bumblebee covers a wide range of ecosystems through the on-disk state it reads. The per-ecosystem details summarized in the project include:
- npm family
  - Emitted ecosystem: npm
  - Sources: package-lock.json, npm-shrinkwrap.json, node_modules/.package-lock.json, node_modules/**/package.json
- pnpm
  - Emitted ecosystem: npm
  - Sources: pnpm-lock.yaml, .pnpm/…/package.json
- Yarn
  - Emitted ecosystem: npm
  - Sources: yarn.lock (Classic + Berry)
- Bun
  - Emitted ecosystem: npm
  - Sources: bun.lock; bun.lockb presence as diagnostic
- PyPI (Python)
  - Emitted ecosystem: pypi
  - Sources: *.dist-info/METADATA, INSTALLER, direct_url.json, *.egg-info/PKG-INFO
- Go modules
  - Emitted ecosystem: go
  - Sources: go.sum, go.mod
- RubyGems
  - Emitted ecosystem: rubygems
  - Sources: Gemfile.lock, installed *.gemspec
- Composer (PHP)
  - Emitted ecosystem: packagist
  - Sources: composer.lock, vendor/composer/installed.json
- MCP (Model Context Protocol) host configs
  - Emitted ecosystem: mcp
  - Sources: JSON host configs such as mcp.json, .mcp.json, claude_desktop_config.json, mcp_config.json, mcp_settings.json, cline_mcp_settings.json, plus ~/.gemini/settings.json
  - Note: Non-JSON configs (Codex config.toml, Continue YAML) are not parsed in v0.1
- Editor extensions
  - Emitted ecosystem: editor-extension
  - Sources: VS Code, Cursor, Windsurf, VSCodium manifests
- Browser extensions
  - Emitted ecosystem: browser-extension
  - Sources: Chromium-family manifests (manifest.json) and Firefox extensions.json per profile
Per-ecosystem notes and caveats are available in docs/inventory-sources.md.
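One consequence of the list above is that several package managers collapse into a single emitted ecosystem. A lookup table makes that mapping explicit. This sketch covers only the lockfile-style sources that can be identified by file name alone; the table and function names are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional

# File name -> emitted ecosystem, taken from the coverage list above.
LOCKFILE_ECOSYSTEMS = {
    "package-lock.json": "npm",
    "npm-shrinkwrap.json": "npm",
    "pnpm-lock.yaml": "npm",
    "yarn.lock": "npm",
    "bun.lock": "npm",
    "go.sum": "go",
    "go.mod": "go",
    "Gemfile.lock": "rubygems",
    "composer.lock": "packagist",
}


def ecosystem_for(path: str) -> Optional[str]:
    """Return the emitted ecosystem for a lockfile path, or None if unrecognized."""
    return LOCKFILE_ECOSYSTEMS.get(PurePosixPath(path).name)


print(ecosystem_for("/Users/alex/code/web-app/pnpm-lock.yaml"))  # prints: npm
```

This is why an exposure catalog entry with ecosystem npm can match a component that was found in a pnpm or Yarn lockfile.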
NDJSON Records: A Practical Guide to Reading Bumblebee Output
- Each line of the output is a well-formed JSON object representing a single record (package, finding, etc.).
- To identify what was found on a given endpoint, you primarily scan for package records and, when using an exposure catalog, you’ll also see finding records indicating exposure matches.
- The scan_summary at the end of a run contains high-level metrics to help teams decide whether to promote the run to current state or run again after adjustments.
A Glance at the Format:
- The package record demonstrates the identity and provenance information you can rely on to map to inventories and risk surfaces.
- The finding record demonstrates exposure matches against a catalog, including severity and evidence about the match.
- The run_id allows you to tie together all emitted records from a single invocation, connecting package detections with exposure findings.
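Tying records together by run is a small grouping exercise once the stream is parsed. The sketch below assumes snake_case field names (run_id and record_type); both are passed as parameters so they can be adjusted to the actual schema, and the sample records are deliberately minimal.

```python
import json
from collections import defaultdict


def group_by_run(lines, run_key="run_id", type_key="record_type"):
    """Group NDJSON records as {run_id: {record_type: [records]}}."""
    runs = defaultdict(lambda: defaultdict(list))
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        runs[record.get(run_key)][record.get(type_key)].append(record)
    return runs


stream = [
    '{"record_type": "package", "run_id": "run-a", "version": "1.2.3"}',
    '{"record_type": "finding", "run_id": "run-a", "severity": "critical"}',
    '{"record_type": "package", "run_id": "run-b", "version": "5.59.20"}',
]
runs = group_by_run(stream)
counts = {run: {kind: len(recs) for kind, recs in kinds.items()}
          for run, kinds in runs.items()}
print(counts)
```

With the records grouped this way, a responder can start from a finding and pull every package record from the same invocation for context.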
Important Concepts and Terms
- NDJSON: Newline-delimited JSON – each line is a discrete JSON object. This format supports streaming and easy ingestion into log pipelines or data lakes.
- record_id: A content-addressed hash that uniquely and reproducibly identifies a given record across runs. This is crucial for deduplication and traceability.
- root_kind: A label that differentiates the root source type (e.g., project_root, deep_home_root) to help receivers keep populations separate.
- environment integrity: Bumblebee does not emit environment variables from MCP env blocks, even though it reads them during inventory collection. This design choice trades certain data availability in records for security and privacy protections.
- findings-only mode: A mode where package records are suppressed and only findings (exposure alerts) are emitted, which is useful for alert-driven workflows.
Documentation and Further Reading
- Core docs for Bumblebee include:
  - docs/transport.md for how NDJSON is written to HTTPS or file-based destinations.
  - docs/state-model.md for a detailed description of the receiver-side current-state model and how to interpret scan_summary.
  - docs/inventory-sources.md for a breakdown of per-ecosystem sources and how Bumblebee reads local metadata.
- Sample catalogs and threat intelligence resources can be found in threat_int/, including the current README and guidance on integrating current catalogs into exposure checks.
Conclusion: A Clear, Read-Only Lens into Developer Endpoint State
Bumblebee offers a focused, safe, and reproducible approach to inventorying what lies on developer endpoints. By turning scattered local state into structured NDJSON records and providing a flexible exposure-catalog-driven mode, it enables rapid, read-only exposure checks that can inform incident response, fleet management, and supply-chain risk assessment. Its emphasis on zero-dependency builds, clear profiles, and an explicit set of allowed sources makes it a practical tool for teams looking to understand, among other things, which machines might be affected by a given advisory, without triggering additional system changes or execution on endpoints.
Repository: https://github.com/perplexityai/bumblebee