
A peer-reviewed study has found that consumer-grade GPUs, including Nvidia’s RTX 4090, can significantly reduce the cost of running large language model (LLM) inference.
The research, published by io.net, a US developer of decentralised GPU cloud infrastructure, and accepted at the 6th International Artificial Intelligence and Blockchain Conference (AIBC 2025), provides the first open benchmarks of heterogeneous GPU clusters deployed on the company’s decentralised cloud platform.
The paper, “Idle Consumer GPUs as a Complement to Enterprise Hardware for LLM Inference”, reports that clusters built from RTX 4090 GPUs can deliver between 62% and 78% of the throughput of enterprise-grade H100 hardware at roughly half the cost.
For batch processing or latency-tolerant workloads, token costs fell by up to 75%.
The study also notes that, while H100 GPUs remain more energy efficient on a per-token basis, extending the life of existing consumer hardware and using renewable-rich grids can reduce overall emissions.
Aline Almeida, Head of Research at IOG Foundation and lead author of the study, says, “Our findings demonstrate that hybrid routing across enterprise and consumer GPUs offers a pragmatic balance between performance, cost, and sustainability.
“Rather than a binary choice, heterogeneous infrastructure allows organisations to optimise for their specific latency and budget requirements while reducing carbon impact.”
The research outlines how AI developers and MLOps teams can use mixed hardware clusters to improve cost-efficiency.
Enterprise GPUs can support real-time applications, while consumer GPUs can be deployed for batch tasks, development, overflow capacity, and workloads with higher latency tolerance.
Under these conditions, the study reports that organisations can achieve near-H100 performance with substantially lower operating costs.
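The paper itself does not publish routing code, but the hybrid pattern it describes can be sketched in a few lines. The Python below is a minimal, hypothetical illustration of latency-aware routing; the pool names, the `Request` structure, and the 100ms threshold are assumptions made for this sketch, not part of io.net’s platform.

```python
from dataclasses import dataclass

# Hypothetical pool names; illustrative only, not io.net identifiers.
ENTERPRISE_POOL = "h100-cluster"   # low-latency, real-time serving
CONSUMER_POOL = "rtx4090-cluster"  # cheaper, latency-tolerant work

@dataclass
class Request:
    prompt: str
    latency_slo_ms: int  # caller's time-to-first-token budget
    interactive: bool    # True for user-facing traffic, False for batch

def route(req: Request) -> str:
    """Send tight-SLO interactive traffic to enterprise GPUs; batch,
    embedding, and evaluation jobs go to consumer GPUs."""
    # The study reports H100s holding sub-55ms P99 time-to-first-token,
    # while consumer clusters suit 200-500ms tail latencies.
    if req.interactive and req.latency_slo_ms < 100:
        return ENTERPRISE_POOL
    return CONSUMER_POOL

# A 50ms chat request lands on H100s; an offline embedding job does not.
print(route(Request("hello", latency_slo_ms=50, interactive=True)))
print(route(Request("embed corpus", latency_slo_ms=400, interactive=False)))
```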
Gaurav Sharma, CEO of io.net, comments, “This peer-reviewed analysis validates the core thesis behind io.net: that the future of compute will be distributed, heterogeneous, and accessible.
“By harnessing both data-centre-grade and consumer hardware, we can democratise access to advanced AI infrastructure while making it more sustainable.”
The company also argues that the study supports its position that decentralised networks can expand global compute capacity by making distributed GPU resources available to developers through a single, programmable platform.
Key findings include:
• Cost-performance ratios — Clusters of four RTX 4090 GPUs delivered 62% to 78% of H100 throughput at around half the operational cost, achieving the lowest cost per million tokens ($0.111–$0.149); a worked example of this calculation follows the list.
• Latency profiles — H100 hardware maintained sub-55ms P99 time-to-first-token even at higher loads, while consumer GPU clusters were suited to workloads tolerating 200–500ms tail latencies, such as research, development environments, batch jobs, embeddings, and evaluation tasks.
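The cost-per-token metric cited above is simple to derive. The sketch below shows the arithmetic with hypothetical inputs (an assumed cluster price and throughput, not figures taken from the paper); with these example numbers the result lands at the low end of the reported $0.111–$0.149 range.

```python
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """Cost to generate one million tokens at a given hourly cluster
    price and sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000

# Hypothetical inputs: $1.60/hour for a 4x RTX 4090 cluster sustaining
# 4,000 tokens/second gives roughly $0.111 per million tokens.
print(f"${cost_per_million_tokens(1.60, 4000):.3f} per 1M tokens")
```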
