GPUs Do Not Promise Determinism. Here Is the Hardware Reason.
Why running the same CUDA code twice can produce different floating-point results, and what you can do about it.
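The root cause is that floating-point addition is not associative, while GPU reductions built on atomics commit partial sums in whatever order the hardware scheduler happens to produce. The Python sketch below imitates that effect on the CPU by summing the same values in shuffled orders; the values and shuffle seeds are illustrative, not taken from any real kernel.

```python
import random

# Floating-point addition is not associative: (a + b) + c != a + (b + c)
# in general. CUDA reductions built on atomicAdd commit partial sums in
# whatever order warps finish, so the summation order (and therefore the
# rounded result) can change from run to run. This sketch imitates that
# by summing identical values in different orders.
values = [random.uniform(-1e6, 1e6) for _ in range(100_000)]

results = set()
for seed in range(5):
    shuffled = values[:]
    random.Random(seed).shuffle(shuffled)  # a different "scheduling" per run
    total = 0.0
    for v in shuffled:
        total += v
    results.add(total)

print(results)  # typically more than one distinct float
```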
We focus on traceable model runs and offline deployment.
Clear provenance, documented configs, and repeatable setups.
Designed to run without outbound network calls or telemetry.
We aim for reproducible setups with documented tolerances and signed artifacts.
We design for minimal data collection and keep evidence local where possible.
We do not promise the model is right. We aim to show what model ran, with what config, on what input. That's the part you can verify.
Artifacts should trace back to their origin. Model weights, adapters, and runtime are identified where possible.
Structured declarations of what should run. Machine-readable. Diffable.
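As a sketch of what such a declaration could look like, here is a hypothetical run manifest serialized as canonical JSON; the field names and values are illustrative, not a published schema.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical run manifest: fields are illustrative, not a published schema.
@dataclass
class RunManifest:
    model_sha256: str    # content hash of the weights file
    adapter_sha256: str  # content hash of the adapter, if any
    runtime: str         # inference runtime and version
    config: dict         # decoding parameters, etc.

manifest = RunManifest(
    model_sha256="<weights hash>",
    adapter_sha256="<adapter hash>",
    runtime="<runtime name and build>",
    config={"temperature": 0.0, "seed": 42},
)

# sort_keys + fixed separators give a canonical byte representation, so
# two manifests diff cleanly and hash identically when they are equal.
canonical = json.dumps(asdict(manifest), sort_keys=True,
                       separators=(",", ":")).encode()
print(canonical.decode())
```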
Configurations can be signed so tampering is detectable.
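A minimal sketch of that idea, assuming an Ed25519 keypair and the Python `cryptography` package; the manifest bytes are placeholders, and a real deployment would load a stored key rather than generating one.

```python
# Requires: pip install cryptography
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

manifest_bytes = b'{"model_sha256":"<weights hash>","runtime":"..."}'

signing_key = Ed25519PrivateKey.generate()  # in practice, a stored key
signature = signing_key.sign(manifest_bytes)

# Anyone holding the public key can detect tampering:
public_key = signing_key.public_key()
try:
    public_key.verify(signature, manifest_bytes)         # intact: passes
    public_key.verify(signature, manifest_bytes + b" ")  # tampered: raises
except InvalidSignature:
    print("tampering detected")
```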
Each log entry can reference the previous one, so deletion or modification becomes detectable.
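A minimal hash-chain sketch of that property, using SHA-256 over canonical JSON; the entry fields are hypothetical.

```python
import hashlib
import json

def append_entry(log: list[dict], payload: dict) -> None:
    # Each entry stores the hash of the previous entry, so deleting or
    # editing any earlier entry breaks every later hash.
    prev = log[-1]["hash"] if log else "0" * 64
    body = json.dumps({"prev": prev, "payload": payload}, sort_keys=True)
    log.append({"prev": prev, "payload": payload,
                "hash": hashlib.sha256(body.encode()).hexdigest()})

def verify_chain(log: list[dict]) -> bool:
    prev = "0" * 64
    for entry in log:
        body = json.dumps({"prev": prev, "payload": entry["payload"]},
                          sort_keys=True)
        if (entry["prev"] != prev or
                hashlib.sha256(body.encode()).hexdigest() != entry["hash"]):
            return False
        prev = entry["hash"]
    return True

log: list[dict] = []
append_entry(log, {"event": "run_started", "model": "<weights hash>"})
append_entry(log, {"event": "run_finished", "tokens": 512})
print(verify_chain(log))           # True
log[0]["payload"]["event"] = "x"   # tamper with an early entry
print(verify_chain(log))           # False
```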
In transfer-heavy workloads, data movement dominates energy cost. Unified memory architectures can reduce this cost by eliminating copies between CPU and GPU memory. We measure inference efficiency in Joules per token.
We document a measurement methodology for Joules/token benchmarking on Apple silicon.
macOS powermetrics tool • 10-run averaging • thermal normalization • documented tolerances
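A minimal sketch of the Joules-per-token arithmetic under that methodology, assuming per-run power traces have already been collected (for example with `sudo powermetrics --samplers cpu_power,gpu_power -i 1000`); the traces below are placeholders, and sampler names and output fields vary across macOS versions.

```python
from statistics import mean, stdev

# Placeholder setup: in practice, each run's power trace comes from
# powermetrics sampled while the model generates a fixed prompt set.
SAMPLE_INTERVAL_S = 1.0   # powermetrics sampling interval
TOKENS_PER_RUN = 512      # tokens generated per benchmark run

def joules_per_token(power_mw: list[float]) -> float:
    # Energy = sum of (power * interval); powermetrics reports milliwatts.
    joules = sum(p / 1000.0 * SAMPLE_INTERVAL_S for p in power_mw)
    return joules / TOKENS_PER_RUN

# Ten runs (the averaging step above), with placeholder power traces.
runs = [[5200.0 + 10 * i] * 30 for i in range(10)]
per_run = [joules_per_token(trace) for trace in runs]
print(f"J/token: {mean(per_run):.3f} ± {stdev(per_run):.3f} (n={len(per_run)})")
```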
Verification proves what happened, not that the output is correct. This distinction matters for compliance and trust.
Why we measure inference efficiency in Joules per token, and how to do it repeatably on macOS.