7 Oct 2025
This project aims to connect five Mac Studios into a single powerful AI cluster, with the ambitious goal of running the colossal Llama 3.1 405B model, a feat typically reserved for enterprise-grade cloud servers. The endeavor involves leveraging unified memory, specialized clustering software from EXO Labs, and overcoming significant networking bottlenecks inherent in consumer hardware setups.

The primary objective is to connect five Mac Studios into one AI cluster capable of running the largest and most demanding AI models, specifically Llama 3.1 405B, which typically requires a massive cloud-based cluster.
The five Mac Studios were originally acquired to move the NetworkChuck studio's video editing pipeline from PC to Mac. Before deploying them, the opportunity was seized to experiment with AI clustering using new beta software from EXO Labs.
EXO Labs' software, exo, is a new beta tool for AI clustering that lets heterogeneous hardware, from a Raspberry Pi to a powerful gaming PC, connect and pool resources to run AI models. It automatically discovers nodes on the network and provides a web-based GUI plus an OpenAI-compatible API.
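Because the API is OpenAI-compatible, any standard OpenAI client can talk to the cluster. Here is a minimal sketch; the node address, port, and model id are assumptions to check against your own exo GUI, not values confirmed in the video:

```python
# Minimal sketch of calling exo's OpenAI-compatible endpoint.
# Host, port, and model id below are assumptions -- check the exo
# web GUI or startup logs for the real values on your cluster.
from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.1.10:52415/v1",  # assumed address of one cluster node
    api_key="not-needed",                     # local endpoints typically ignore the key
)

response = client.chat.completions.create(
    model="llama-3.2-1b",  # assumed model id; client.models.list() shows what's loaded
    messages=[{"role": "user", "content": "Explain unified memory in one sentence."}],
)
print(response.choices[0].message.content)
```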
Local AI models offer privacy and independence from cloud services like ChatGPT. However, the larger and more capable models are resource-intensive: approaching ChatGPT-level quality demands powerful GPUs or an AI cluster, well beyond what a standard laptop can handle.
Parameters are an AI model's learned weights; more parameters generally mean more capability. Larger models demand more video RAM (VRAM): roughly 4GB for Llama 3.2 1B, 48GB for Llama 3.3 70B, and about one terabyte for the target Llama 3.1 405B.
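To see where those VRAM figures come from, a back-of-the-envelope estimate is parameters times bytes per parameter, padded for runtime overhead; the 20% overhead factor below is an assumption, not a measured number:

```python
# Rough VRAM estimate: parameters x bytes-per-parameter, padded by ~20%
# for activations and KV cache (the 20% is an assumed fudge factor).

def vram_gb(params_billions: float, bytes_per_param: float, overhead: float = 1.2) -> float:
    # 1 billion parameters at 1 byte each is ~1 GB
    return params_billions * bytes_per_param * overhead

for name, params in [("Llama 3.2 1B", 1), ("Llama 3.3 70B", 70), ("Llama 3.1 405B", 405)]:
    fp16 = vram_gb(params, 2.0)   # 16-bit weights: 2 bytes per parameter
    int4 = vram_gb(params, 0.5)   # 4-bit weights: half a byte per parameter
    print(f"{name}: ~{fp16:.0f} GB at FP16, ~{int4:.0f} GB at INT4")
```

At 16-bit precision, 405 billion parameters lands near the one-terabyte figure; 4-bit quantization brings it down to roughly the ~200GB the cluster attempted.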
Quantization makes large AI models fit on smaller GPUs by reducing the numerical precision of their weights. It costs some accuracy (roughly 1-3% loss at INT8, 10-30% at INT4), but it lets consumer-grade hardware run models that would otherwise be far too large.
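As a minimal illustration of the idea (symmetric per-tensor INT8, far simpler than what production quantizers do):

```python
# Toy symmetric INT8 quantization: map each float weight to an 8-bit
# integer via one per-tensor scale, then dequantize to measure the error.
# Real quantizers use per-channel scales, calibration, and INT4 packing.
import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)  # stand-in for a weight tensor

scale = np.abs(weights).max() / 127.0                # largest magnitude maps to 127
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)  # 4x smaller than FP32
deq = q.astype(np.float32) * scale                   # approximate reconstruction

print("max abs error:", np.abs(weights - deq).max())  # the precision loss
```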
Apple's M-series Macs feature unified memory: a single pool of RAM shared by the CPU and GPU, which avoids the copy-to-VRAM bottleneck of discrete GPUs and makes them cost-effective per gigabyte of GPU-accessible memory compared to cards like the Nvidia RTX 4090. Each Mac Studio in the cluster has 64GB of unified RAM.
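Using the prices quoted later in this summary, the cost-per-gigabyte gap is easy to check (noting the Mac Studio price buys an entire computer while the 4090 price is the GPU alone):

```python
# Cost per GB of GPU-accessible memory, using the prices cited in this piece.
mac_studio = 2600 / 64   # $2600 buys 64 GB of unified memory -> ~$41/GB
rtx_4090   = 1600 / 24   # $1600 buys 24 GB of VRAM           -> ~$67/GB
print(f"Mac Studio: ${mac_studio:.2f}/GB vs RTX 4090: ${rtx_4090:.2f}/GB")
```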
Nvidia GPUs like the RTX 4090 typically outperform Macs in AI tasks thanks to dedicated tensor cores and CUDA, the industry standard that most AI models are optimized for. On the Macs, exo uses Apple's MLX machine learning framework, which works well but enjoys far less ecosystem support and optimization than CUDA.
The Mac Studios were connected first over their built-in 10 Gigabit Ethernet and then over Thunderbolt. The 10 Gigabit Ethernet proved a significant bottleneck, causing substantial performance degradation. Thunderbolt offered higher bandwidth (up to 40 Gbps) and more direct PCIe access, but with multiple nodes it still bottlenecked.
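A quick estimate shows why link speed matters so much when every inference step forces activations to hop between nodes; the 100MB payload below is purely an illustrative assumption:

```python
# How long one inter-node activation transfer takes at each link speed.
# The 100 MB payload is an illustrative assumption, not a measured value.
def transfer_ms(megabytes: float, gigabits_per_sec: float) -> float:
    # MB -> megabits; dividing megabits by Gbps yields milliseconds
    return megabytes * 8 / gigabits_per_sec

payload_mb = 100  # assumed per-hop traffic for a very large model
for label, gbps in [("10 GbE", 10), ("Thunderbolt (40 Gbps)", 40), ("Enterprise (400 Gbps)", 400)]:
    print(f"{label}: {transfer_ms(payload_mb, gbps):.0f} ms per {payload_mb} MB hop")
```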
Installing exo involved setting up Python 3.12, installing MLX for Mac-specific acceleration, cloning the exo repository, and running a configuration script. Initial testing with the Llama 3.2 1B model showed a single Mac Studio producing 117 tokens per second, which dropped to 29 tokens per second when clustered over 10 Gigabit Ethernet because of networking limitations.
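Reproducing those tokens-per-second numbers is straightforward against the OpenAI-compatible endpoint; as before, the address and model id are assumptions, and this relies on the server filling in the standard usage field:

```python
# Quick tokens-per-second check against the cluster's API.
# Endpoint and model id are assumptions (see the earlier API sketch).
import time
from openai import OpenAI

client = OpenAI(base_url="http://192.168.1.10:52415/v1", api_key="not-needed")

start = time.monotonic()
resp = client.chat.completions.create(
    model="llama-3.2-1b",  # assumed model id
    messages=[{"role": "user", "content": "Write a 200-word story about a router."}],
)
elapsed = time.monotonic() - start
print(f"{resp.usage.completion_tokens / elapsed:.1f} tokens/sec")
```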
NordVPN sponsored the video, offering services like anonymity by masking public IP addresses, geo-unblocking content (e.g., Netflix regions), and protecting devices on public Wi-Fi networks with features like threat protection and ad blocking.
Running the Llama 3.3 70B model on the cluster over 10 Gigabit Ethernet achieved about 15 tokens per second with good memory distribution across nodes: usable, though far from fast. The Thunderbolt connection performed slightly better but still showed signs of a network bottleneck.
Running the colossal Llama 3.1 405B model (4-bit quantized, ~200GB) was the ultimate goal. A single Mac Studio failed, quickly consuming swap memory and timing out. The five-Mac cluster over 10 Gigabit Ethernet successfully loaded the model into distributed unified memory without touching swap, albeit at a very slow 0.8 tokens per second. Thunderbolt yielded similarly slow performance (0.6 tokens per second), reinforcing that the network, not memory, was the primary bottleneck.
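The arithmetic behind the no-swap result is simple: assuming the model splits roughly evenly across identical nodes (exo weights partitions by node capacity), each Mac holds well under its 64GB of unified RAM:

```python
# Why the cluster avoided swap: ~200 GB split five ways fits per-node RAM.
# An even split is an assumption for identical nodes.
model_gb, nodes, ram_per_node = 200, 5, 64
per_node = model_gb / nodes
print(f"~{per_node:.0f} GB per node of {ram_per_node} GB available "
      f"({per_node / ram_per_node:.0%} used before activations and cache)")
```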
Ollama ran a 70B model faster on a single Mac Studio, suggesting exo's MLX path still has optimization headroom. Meanwhile, exo's OpenAI-compatible API enabled integration with the Fabric project, allowing tasks like summarization and story generation to run against the clustered AI, demonstrating practical utility despite the performance limitations.
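The Fabric integration boils down to pointing an OpenAI-compatible client at the cluster. Below is a hedged sketch of the same summarization pattern; the endpoint, model id, and system prompt are all illustrative assumptions rather than values from the video:

```python
# Fabric-style summarization routed through the cluster, streamed as it
# generates. Endpoint, model id, and prompt are illustrative assumptions;
# Fabric itself handles this wiring once given a custom API base URL.
from openai import OpenAI

client = OpenAI(base_url="http://192.168.1.10:52415/v1", api_key="not-needed")

article = open("article.txt").read()  # the text to summarize

stream = client.chat.completions.create(
    model="llama-3.3-70b",  # assumed model id
    messages=[
        {"role": "system", "content": "Summarize the input in five bullet points."},
        {"role": "user", "content": article},
    ],
    stream=True,  # print tokens as the cluster produces them
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```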
exo is a promising tool, but its performance on Mac with MLX still needs work, particularly around networking bottlenecks. Future experiments include testing exo with Nvidia-based clusters or Raspberry Pi AI clusters to compare performance and look for improvements.
This endeavor demonstrates that the largest and most challenging AI models, like Llama 3.1 405B, can run (however slowly) on a cluster of consumer-grade Mac Studios, pushing the boundaries of local AI capabilities against the backdrop of enterprise infrastructure.
| Aspect | Detail |
|---|---|
| Target AI model | Llama 3.1 405B |
| Normal requirement | 1 TB VRAM (NVIDIA H100s/A100s) |
| Cluster goal | Run on 5 Mac Studios (320GB total unified RAM) |

| Hardware | Cost (unit) | AI Optimization | Power Efficiency | VRAM |
|---|---|---|---|---|
| Mac Studio (M2 Ultra, 64GB unified RAM) | $2600 (entire computer) | MLX (Apple-specific) | Extremely high | 64GB per unit (unified) |
| NVIDIA GeForce RTX 4090 | $1600 (GPU only) | Tensor Cores, CUDA (industry standard) | Lower than Mac Studio | 24GB |

| Cluster Networking | Performance |
|---|---|
| 10 Gigabit Ethernet | Significant bottleneck: 29 tokens/sec for Llama 3.2 1B (vs 117 tokens/sec on a single Mac) |
| Thunderbolt | Improved but still bottlenecked: 50 tokens/sec for Llama 3.2 1B (hub setup) |
| Enterprise AI standard | 400-800+ Gigabits/sec with reduced overhead |

| Quantization (INT4) | Detail |
|---|---|
| Purpose | Fit large models on smaller GPUs |
| Size reduction (vs FP32) | 8 times smaller |
| Precision loss | 10-30% |
| Impact on Llama 3.1 405B | Crucial for attempting the run on Mac Studios |

| Llama 3.1 405B Setup | Result |
|---|---|
| Single Mac Studio | Failed; rapid swap memory use |
| 5 Mac Studios (10 Gig Ethernet) | Successful but very slow (0.8 tokens/sec); no swap thanks to distributed unified memory |
| 5 Mac Studios (Thunderbolt) | Similar very slow performance (0.6 tokens/sec); loading issues |
