Practical Outcomes
- Teams can compare model/runtime options using a single efficiency number, not just speed.
- On-prem and air-gapped deployments can forecast power costs more accurately.
- Procurement can evaluate hardware tradeoffs with measured efficiency data.
- Capacity planning improves when energy use is normalized per token.
Tokens per second tells you how fast. Joules per token tells you how efficient.
Why Energy Matters
Edge deployments face constraints that many data center deployments do not:
- Battery life: Mobile and disconnected systems run on limited power
- Thermal limits: Fanless devices throttle under sustained load
- Cost: Energy bills scale with usage in always-on applications
Raw throughput ignores these constraints. A model that runs 50 tokens/second at 40 watts (0.8 J/token) isn't more useful than one running 30 tokens/second at 15 watts (0.5 J/token) if you're battery-constrained.
The Metric
Joules per token normalizes energy consumption by output:
J/token = (Power × Time) / Tokens
        = (Watts × Seconds) / Tokens
Lower is better. It's hardware-comparable (with caveats) and captures the efficiency of the full stack: model, runtime, and hardware.
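Since J/token = (Watts × Seconds) / Tokens, at steady state it reduces to watts divided by tokens per second, which makes tradeoffs like the battery example above easy to check. A minimal sketch in Python:

```python
def joules_per_token(watts, seconds, tokens):
    """J/token = (Watts x Seconds) / Tokens."""
    return (watts * seconds) / tokens

# Equivalent per-second form: J/token = Watts / (tokens per second).
# The two scenarios below reuse the hypothetical numbers from the
# throughput comparison earlier in this article.
fast = joules_per_token(watts=40, seconds=1, tokens=50)  # 0.8 J/token
slow = joules_per_token(watts=15, seconds=1, tokens=30)  # 0.5 J/token
print(f"{fast:.2f} vs {slow:.2f} J/token")
```

The faster model costs 60% more energy per token, which is exactly the distinction raw throughput hides.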
Measuring on macOS
Apple provides powermetrics for power measurement. Here's our protocol:
1. Stabilize Thermal State
# Wait 5 minutes at idle
sleep 300
Thermal state affects power consumption. Hot chips throttle and use power differently than cold chips.
2. Capture Baseline
sudo powermetrics --samplers cpu_power \
    --sample-interval 100 \
    --sample-count 300 \
    --output-file baseline.txt
30 seconds of idle power (300 samples at 100 ms each) tells you the floor.
3. Run Workload
Start a second powermetrics capture (same flags, writing to workload.txt), then run inference while it samples:
./inference --input benchmark.txt --output results.txt
Use a fixed input for repeatability, and record the wall-clock runtime for step 4.
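The workload capture has to overlap the inference run, since step 4 reads workload.txt. A sketch of that orchestration, assuming the same powermetrics flags as the baseline capture and the ./inference invocation shown above:

```python
import subprocess
import time

def measure_workload(cmd, log="workload.txt", interval_ms=100):
    """Sample power with powermetrics while `cmd` runs.

    Returns wall-clock seconds for the workload (this is what
    measure_runtime() in step 4 stands for). powermetrics needs sudo.
    """
    # Start powermetrics in the background, writing samples to `log`.
    meter = subprocess.Popen(
        ["sudo", "powermetrics", "--samplers", "cpu_power",
         "--sample-interval", str(interval_ms),
         "--output-file", log])
    start = time.monotonic()
    try:
        # e.g. cmd = ["./inference", "--input", "benchmark.txt",
        #             "--output", "results.txt"]
        subprocess.run(cmd, check=True)
    finally:
        meter.terminate()  # stop sampling as soon as inference finishes
        meter.wait()
    return time.monotonic() - start
```

Stopping the meter immediately after the workload matters: trailing idle samples would drag the workload average down toward baseline.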
4. Calculate
baseline_watts = parse_average("baseline.txt")
workload_watts = parse_average("workload.txt")
inference_watts = workload_watts - baseline_watts
tokens = count_tokens("results.txt")
time_seconds = measure_runtime()
joules = inference_watts * time_seconds
j_per_token = joules / tokens
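The parse_average helper is the only nontrivial piece of the calculation. A sketch, assuming the powermetrics log contains per-sample lines like "CPU Power: 472 mW" (the exact label varies by chip and macOS version, so verify against your own logs):

```python
import re

# Assumed powermetrics line format; adjust for your chip/macOS version.
MW_PATTERN = re.compile(r"CPU Power:\s+(\d+)\s+mW")

def parse_average(path):
    """Average power in watts across all samples in a powermetrics log."""
    with open(path) as f:
        samples_mw = [int(m.group(1)) for m in MW_PATTERN.finditer(f.read())]
    return sum(samples_mw) / len(samples_mw) / 1000.0  # mW -> W

def j_per_token(baseline_watts, workload_watts, time_seconds, tokens):
    """Baseline-corrected joules per generated token."""
    return (workload_watts - baseline_watts) * time_seconds / tokens
```

Subtracting the baseline is what isolates inference energy from whatever else the machine was doing at idle.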
Repeatability Rules
For results to be comparable:
- Same hardware - Don't compare M1 to M2
- Same thermal state - Always stabilize first
- Same input - Use identical benchmark data
- Same power source - Stay on the adapter; battery mode behaves differently
- Report conditions - Ambient temperature, chip variant, macOS version
We see <5% variance across 10 runs when following this protocol.
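That repeatability claim is easy to check yourself: compute the relative standard deviation of J/token across runs. A sketch with hypothetical measurements:

```python
from statistics import mean, pstdev

# Hypothetical J/token results from 10 repeated runs of the protocol.
runs = [0.51, 0.49, 0.50, 0.52, 0.48, 0.50, 0.51, 0.49, 0.50, 0.50]

# Relative standard deviation (coefficient of variation).
rel_std = pstdev(runs) / mean(runs)
print(f"variance across runs: {rel_std:.1%}")
```

If this figure creeps above 5%, the usual culprits are thermal state and background load, in that order.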
What It Doesn't Tell You
Joules per token doesn't tell you:
- Quality of outputs
- Latency per request
- Memory efficiency
- Cost per token (unless you know your electricity rate)
It is one metric among many. Use it alongside others.
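The cost caveat above is a one-line conversion once you do know your electricity rate. A sketch, using a hypothetical 0.5 J/token and $0.15/kWh:

```python
J_PER_KWH = 3_600_000  # 1 kWh = 3.6e6 J

def usd_per_million_tokens(j_per_token, usd_per_kwh):
    """Electricity cost of generating one million tokens."""
    return j_per_token * 1_000_000 / J_PER_KWH * usd_per_kwh

# Hypothetical numbers: 0.5 J/token at $0.15/kWh.
cost = usd_per_million_tokens(0.5, 0.15)
print(f"${cost:.4f} per million tokens")
```

The per-token electricity cost is tiny in absolute terms; it only becomes a planning input for always-on or fleet-scale deployments.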
Bottom Line
Energy efficiency is measurable and comparable with a consistent method. Joules per token provides a normalized metric for edge deployment planning. On macOS, measurement is straightforward when thermal stabilization and baseline correction are done correctly.