Tokens per second tells you how fast. Joules per token tells you how efficient.
Why Energy Matters
Edge deployments face constraints that data center deployments don't:
- Battery life: Mobile and disconnected systems run on limited power
- Thermal limits: Fanless devices throttle under sustained load
- Cost: Energy bills scale with usage in always-on applications
Raw throughput ignores these constraints. If you're battery-constrained, a model that runs 50 tokens/second at 40 watts is not automatically more useful than one running 30 tokens/second at 15 watts: the slower configuration spends less energy on every token it produces.
The Metric
Joules per token normalizes energy consumption by output:
J/token = (Power × Time) / Tokens
= (Watts × Seconds) / Tokens
Lower is better. The metric can be compared across hardware (with caveats), and it captures the efficiency of the full stack: model, runtime, and hardware.
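As a quick worked example of the formula, the sketch below plugs in the two configurations from the battery example earlier (50 tokens/second at 40 watts, 30 tokens/second at 15 watts). The function name `joules_per_token` is ours, chosen for illustration.

```python
def joules_per_token(watts: float, seconds: float, tokens: int) -> float:
    """Energy per generated token: (power x time) / tokens."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return (watts * seconds) / tokens


# One second of generation for each configuration from the battery example.
fast = joules_per_token(watts=40.0, seconds=1.0, tokens=50)
slow = joules_per_token(watts=15.0, seconds=1.0, tokens=30)
print(f"fast: {fast:.2f} J/token, slow: {slow:.2f} J/token")
# prints "fast: 0.80 J/token, slow: 0.50 J/token"
```

The faster configuration costs 60% more energy per token, which is the trade-off raw throughput hides.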
Measuring on macOS
Apple provides powermetrics for power measurement. Here's our protocol:
1. Stabilize Thermal State
# Wait 5 minutes at idle
sleep 300
Thermal state affects power consumption. A hot chip throttles and draws power differently than a cold one, so every run should start from the same idle state.
2. Capture Baseline
sudo powermetrics --samplers cpu_power \
-i 100 \
--sample-count 300 \
-o baseline.txt
300 samples at 100 ms intervals give 30 seconds of idle power. That average is the floor you subtract from the workload measurement.
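One way to reduce a capture to a single number is to average a per-sample power line from the text output. The sketch below is a hypothetical `parse_average` helper, not part of powermetrics. It assumes each sample contains a line of the form `Combined Power (CPU + GPU + ANE): 123 mW`; the exact wording can differ by chip and macOS version, so check your own capture and adjust the pattern.

```python
import re
from pathlib import Path

# Assumed line format; adjust to match your own powermetrics output.
POWER_LINE = re.compile(r"^Combined Power.*:\s*([\d.]+)\s*mW\s*$", re.MULTILINE)


def parse_average_text(text: str) -> float:
    """Return the mean of all matched power readings, in watts."""
    readings_mw = [float(m) for m in POWER_LINE.findall(text)]
    if not readings_mw:
        raise ValueError("no power readings matched; check the line pattern")
    return sum(readings_mw) / len(readings_mw) / 1000.0


def parse_average(path: str) -> float:
    """Average power in watts for a powermetrics text capture on disk."""
    return parse_average_text(Path(path).read_text(errors="replace"))


# Tiny synthetic capture, only to show the expected shape of the input.
sample = (
    "Combined Power (CPU + GPU + ANE): 200 mW\n"
    "Combined Power (CPU + GPU + ANE): 400 mW\n"
)
print(parse_average_text(sample))
```

Raising on zero matches is deliberate: a silent 0 W baseline would inflate every downstream number.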
3. Run Workload
./inference --input benchmark.txt --output results.txt
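Use a fixed input for repeatability. Keep powermetrics recording to workload.txt for the whole run, with the same sampler and interval as the baseline, so that step 4 has a workload trace to average.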
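A minimal sketch of driving the run from a script: start the sampler that writes workload.txt (the file step 4 reads), time the workload, then stop the sampler. The helper name `run_with_sampler` is ours. The sketch assumes the script itself runs as root, so that powermetrics can be launched without sudo and stopped directly; confirm that your capture file is complete after the sampler is terminated.

```python
import subprocess
import time


def run_with_sampler(sampler_cmd: list[str], workload_cmd: list[str]) -> float:
    """Run workload_cmd while sampler_cmd records; return workload seconds."""
    sampler = subprocess.Popen(sampler_cmd)
    try:
        # Timing covers only the workload, not sampler startup or shutdown.
        start = time.perf_counter()
        subprocess.run(workload_cmd, check=True)
        elapsed = time.perf_counter() - start
    finally:
        # Stop the sampler even if the workload fails.
        sampler.terminate()
        sampler.wait()
    return elapsed
```

On the benchmark machine, `sampler_cmd` would be the powermetrics command from step 2 without the sample count and with workload.txt as the output file, and `workload_cmd` would be the inference command above. The returned value can serve as the runtime in step 4.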
4. Calculate
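# parse_average, count_tokens, and measure_runtime are placeholder helpers
baseline_watts = parse_average("baseline.txt")     # mean idle power, W
workload_watts = parse_average("workload.txt")     # mean power under load, W
inference_watts = workload_watts - baseline_watts  # power attributable to inference
tokens = count_tokens("results.txt")               # number of generated tokens
time_seconds = measure_runtime()                   # wall-clock duration of the run
joules = inference_watts * time_seconds            # energy above idle, J
j_per_token = joules / tokens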
Repeatability Rules
For results to be comparable:
- Same hardware - Don't compare M1 to M2
- Same thermal state - Always stabilize first
- Same input - Use identical benchmark data
- Same power source - Stay on the power adapter; battery mode behaves differently
- Report conditions - Ambient temperature, chip variant, macOS version
We see <5% run-to-run variation across 10 runs when following this protocol.
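To check the spread of your own runs, one common summary is the coefficient of variation: the sample standard deviation divided by the mean. It is one reasonable definition of spread, not the only one. The helper name and the run values below are placeholders for illustration, not measured results.

```python
import statistics


def coefficient_of_variation(values: list[float]) -> float:
    """Sample standard deviation as a fraction of the mean."""
    if len(values) < 2:
        raise ValueError("need at least two runs")
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("mean is zero; relative spread is undefined")
    return statistics.stdev(values) / mean


# Placeholder J/token values for illustration only, not measured results.
runs = [0.50, 0.51, 0.49, 0.50, 0.52]
cv = coefficient_of_variation(runs)
print(f"spread: {cv:.1%}")
```

If the spread is much larger than expected, the usual suspects are thermal state and background activity during the baseline capture.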
What It Doesn't Tell You
Joules per token doesn't tell you:
- Quality of outputs
- Latency per request
- Memory efficiency
- Cost per token (unless you know your electricity rate)
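If you do know your electricity rate, the conversion is short, since one kilowatt-hour is 3.6 million joules. The sketch below uses a hypothetical helper name and placeholder inputs; substitute your own measurement and rate.

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 1000 W x 3600 s


def cost_per_token(j_per_token: float, price_per_kwh: float) -> float:
    """Electricity cost of one token, in the currency of price_per_kwh."""
    return j_per_token / JOULES_PER_KWH * price_per_kwh


# Placeholder inputs: 0.5 J/token at 0.20 currency units per kWh.
cost = cost_per_token(0.5, 0.20)
print(f"{cost * 1_000_000:.4f} per million tokens")
```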
It's one metric among many. Use it alongside others.
Conclusion
Energy efficiency is measurable and comparable with proper methodology. Joules per token provides a normalized metric for edge deployment planning. The measurement is straightforward on macOS with appropriate stabilization and baseline correction.