7 Oct 2025
This project aims to connect five Mac Studios into a single powerful AI cluster, with the ambitious goal of running the colossal Llama 3.1 405B model, a feat typically reserved for enterprise-grade cloud servers. The endeavor involves leveraging unified memory, specialized clustering software from EXO Labs, and overcoming significant networking bottlenecks inherent in consumer hardware setups.

The primary objective is to connect five Mac Studios into one AI cluster capable of running the largest and most demanding AI models, specifically Llama 3.1 405B, which typically requires a massive cloud-based cluster.
The five Mac Studios were originally acquired to move the NetworkChuck studio's video editing pipeline from PC to Mac. Before deploying them, the opportunity was seized to experiment with AI clustering using new beta software from EXO Labs.
EXO Labs' software, exo, is a new beta tool for AI clustering that lets heterogeneous hardware, from a Raspberry Pi to a powerful gaming PC, connect and pool resources to run AI models. It automatically discovers nodes on the network and provides a web-based GUI plus an OpenAI-compatible API.
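Because the API is OpenAI-compatible, any standard OpenAI client can talk to the cluster. Here is a minimal sketch; the node address, port, and model id are assumptions to check against your own exo GUI, not values confirmed in the video:

```python
# Minimal sketch of calling exo's OpenAI-compatible endpoint.
# Host, port, and model id below are assumptions -- check the exo
# web GUI or startup logs for the real values on your cluster.
from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.1.10:52415/v1",  # assumed address of one cluster node
    api_key="not-needed",                     # local endpoints typically ignore the key
)

response = client.chat.completions.create(
    model="llama-3.2-1b",  # assumed model id; client.models.list() shows what's loaded
    messages=[{"role": "user", "content": "Explain unified memory in one sentence."}],
)
print(response.choices[0].message.content)
```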
Local AI models offer privacy and independence from cloud services like ChatGPT. However, the larger and more capable models are resource-intensive: approaching ChatGPT-level quality demands powerful GPUs or an AI cluster, well beyond what a standard laptop can handle.
Parameters are an AI model's learned weights; more parameters generally mean more capability. Larger models demand more video RAM (VRAM): roughly 4GB for Llama 3.2 1B, 48GB for Llama 3.3 70B, and about one terabyte for the target Llama 3.1 405B.
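To see where those VRAM figures come from, a back-of-the-envelope estimate is parameters times bytes per parameter, padded for runtime overhead; the 20% overhead factor below is an assumption, not a measured number:

```python
# Rough VRAM estimate: parameters x bytes-per-parameter, padded by ~20%
# for activations and KV cache (the 20% is an assumed fudge factor).

def vram_gb(params_billions: float, bytes_per_param: float, overhead: float = 1.2) -> float:
    # 1 billion parameters at 1 byte each is ~1 GB
    return params_billions * bytes_per_param * overhead

for name, params in [("Llama 3.2 1B", 1), ("Llama 3.3 70B", 70), ("Llama 3.1 405B", 405)]:
    fp16 = vram_gb(params, 2.0)   # 16-bit weights: 2 bytes per parameter
    int4 = vram_gb(params, 0.5)   # 4-bit weights: half a byte per parameter
    print(f"{name}: ~{fp16:.0f} GB at FP16, ~{int4:.0f} GB at INT4")
```

At 16-bit precision, 405 billion parameters lands near the one-terabyte figure; 4-bit quantization brings it down to roughly the ~200GB the cluster attempted.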
Quantization makes large AI models fit on smaller GPUs by reducing the numerical precision of their weights. It costs some accuracy (roughly 1-3% loss at INT8, 10-30% at INT4), but it lets consumer-grade hardware run models that would otherwise be far too large.
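As a minimal illustration of the idea (symmetric per-tensor INT8, far simpler than what production quantizers do):

```python
# Toy symmetric INT8 quantization: map each float weight to an 8-bit
# integer via one per-tensor scale, then dequantize to measure the error.
# Real quantizers use per-channel scales, calibration, and INT4 packing.
import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)  # stand-in for a weight tensor

scale = np.abs(weights).max() / 127.0                # largest magnitude maps to 127
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)  # 4x smaller than FP32
deq = q.astype(np.float32) * scale                   # approximate reconstruction

print("max abs error:", np.abs(weights - deq).max())  # the precision loss
```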
Apple's M-series Macs feature unified memory: a single pool of RAM shared by the CPU and GPU, which avoids the copy-to-VRAM bottleneck of discrete GPUs and makes them cost-effective per gigabyte of GPU-accessible memory compared to cards like the Nvidia RTX 4090. Each Mac Studio in the cluster has 64GB of unified RAM.
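Using the prices quoted later in this summary, the cost-per-gigabyte gap is easy to check (noting the Mac Studio price buys an entire computer while the 4090 price is the GPU alone):

```python
# Cost per GB of GPU-accessible memory, using the prices cited in this piece.
mac_studio = 2600 / 64   # $2600 buys 64 GB of unified memory -> ~$41/GB
rtx_4090   = 1600 / 24   # $1600 buys 24 GB of VRAM           -> ~$67/GB
print(f"Mac Studio: ${mac_studio:.2f}/GB vs RTX 4090: ${rtx_4090:.2f}/GB")
```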
Nvidia GPUs like the RTX 4090 typically outperform Macs in AI tasks thanks to dedicated tensor cores and CUDA, the industry standard that most AI models are optimized for. On the Macs, exo uses Apple's MLX machine learning framework, which works well but enjoys far less ecosystem support and optimization than CUDA.
The Mac Studios were connected first over their built-in 10 Gigabit Ethernet and then over Thunderbolt. The 10 Gigabit Ethernet proved a significant bottleneck, causing substantial performance degradation. Thunderbolt offered higher bandwidth (up to 40 Gbps) and more direct PCIe access, but with multiple nodes it still bottlenecked.
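A quick estimate shows why link speed matters so much when every inference step forces activations to hop between nodes; the 100MB payload below is purely an illustrative assumption:

```python
# How long one inter-node activation transfer takes at each link speed.
# The 100 MB payload is an illustrative assumption, not a measured value.
def transfer_ms(megabytes: float, gigabits_per_sec: float) -> float:
    # MB -> megabits; dividing megabits by Gbps yields milliseconds
    return megabytes * 8 / gigabits_per_sec

payload_mb = 100  # assumed per-hop traffic for a very large model
for label, gbps in [("10 GbE", 10), ("Thunderbolt (40 Gbps)", 40), ("Enterprise (400 Gbps)", 400)]:
    print(f"{label}: {transfer_ms(payload_mb, gbps):.0f} ms per {payload_mb} MB hop")
```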
Installing exo involved setting up Python 3.12, installing MLX for Mac-specific acceleration, cloning the exo repository, and running a configuration script. Initial testing with the Llama 3.2 1B model showed a single Mac Studio producing 117 tokens per second, which dropped to 29 tokens per second when clustered over 10 Gigabit Ethernet because of networking limitations.
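Reproducing those tokens-per-second numbers is straightforward against the OpenAI-compatible endpoint; as before, the address and model id are assumptions, and this relies on the server filling in the standard usage field:

```python
# Quick tokens-per-second check against the cluster's API.
# Endpoint and model id are assumptions (see the earlier API sketch).
import time
from openai import OpenAI

client = OpenAI(base_url="http://192.168.1.10:52415/v1", api_key="not-needed")

start = time.monotonic()
resp = client.chat.completions.create(
    model="llama-3.2-1b",  # assumed model id
    messages=[{"role": "user", "content": "Write a 200-word story about a router."}],
)
elapsed = time.monotonic() - start
print(f"{resp.usage.completion_tokens / elapsed:.1f} tokens/sec")
```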
NordVPN sponsored the video, offering services like anonymity by masking public IP addresses, geo-unblocking content (e.g., Netflix regions), and protecting devices on public Wi-Fi networks with features like threat protection and ad blocking.
Running the Llama 3.3 70B model on the cluster over 10 Gigabit Ethernet achieved about 15 tokens per second with good memory distribution across nodes: usable, though far from fast. The Thunderbolt connection performed slightly better but still showed signs of a network bottleneck.
Running the colossal Llama 3.1 405B model (4-bit quantized, ~200GB) was the ultimate goal. A single Mac Studio failed, quickly consuming swap memory and timing out. The five-Mac cluster over 10 Gigabit Ethernet successfully loaded the model into distributed unified memory without touching swap, albeit at a very slow 0.8 tokens per second. Thunderbolt yielded similarly slow performance (0.6 tokens per second), reinforcing that the network, not memory, was the primary bottleneck.
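The arithmetic behind the no-swap result is simple: assuming the model splits roughly evenly across identical nodes (exo weights partitions by node capacity), each Mac holds well under its 64GB of unified RAM:

```python
# Why the cluster avoided swap: ~200 GB split five ways fits per-node RAM.
# An even split is an assumption for identical nodes.
model_gb, nodes, ram_per_node = 200, 5, 64
per_node = model_gb / nodes
print(f"~{per_node:.0f} GB per node of {ram_per_node} GB available "
      f"({per_node / ram_per_node:.0%} used before activations and cache)")
```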
Ollama ran a 70B model faster on a single Mac Studio, suggesting exo's MLX path still has optimization headroom. Meanwhile, exo's OpenAI-compatible API enabled integration with the Fabric project, allowing tasks like summarization and story generation to run against the clustered AI, demonstrating practical utility despite the performance limitations.
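The Fabric integration boils down to pointing an OpenAI-compatible client at the cluster. Below is a hedged sketch of the same summarization pattern; the endpoint, model id, and system prompt are all illustrative assumptions rather than values from the video:

```python
# Fabric-style summarization routed through the cluster, streamed as it
# generates. Endpoint, model id, and prompt are illustrative assumptions;
# Fabric itself handles this wiring once given a custom API base URL.
from openai import OpenAI

client = OpenAI(base_url="http://192.168.1.10:52415/v1", api_key="not-needed")

article = open("article.txt").read()  # the text to summarize

stream = client.chat.completions.create(
    model="llama-3.3-70b",  # assumed model id
    messages=[
        {"role": "system", "content": "Summarize the input in five bullet points."},
        {"role": "user", "content": article},
    ],
    stream=True,  # print tokens as the cluster produces them
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```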
exo is a promising tool, but its performance on Mac with MLX still needs work, particularly around networking bottlenecks. Future experiments include testing exo with Nvidia-based clusters or Raspberry Pi AI clusters to compare performance and look for improvements.
This endeavor demonstrates that the largest and most challenging AI models, like Llama 3.1 405B, can run (however slowly) on a cluster of consumer-grade Mac Studios, pushing the boundaries of local AI capabilities against the backdrop of enterprise infrastructure.
| Aspect | Detail |
|---|---|
| Target AI model | Llama 3.1 405B |
| Normal requirement | 1 TB VRAM (NVIDIA H100s/A100s) |
| Cluster goal | Run on 5 Mac Studios (320GB total unified RAM) |

| Hardware | Cost (unit) | AI Optimization | Power Efficiency | VRAM |
|---|---|---|---|---|
| Mac Studio (M2 Ultra, 64GB unified RAM) | $2600 (entire computer) | MLX (Apple-specific) | Extremely high | 64GB per unit (unified) |
| NVIDIA GeForce RTX 4090 | $1600 (GPU only) | Tensor Cores, CUDA (industry standard) | Lower than Mac Studio | 24GB |

| Cluster Networking | Performance |
|---|---|
| 10 Gigabit Ethernet | Significant bottleneck: 29 tokens/sec for Llama 3.2 1B (vs 117 tokens/sec on a single Mac) |
| Thunderbolt | Improved but still bottlenecked: 50 tokens/sec for Llama 3.2 1B (hub setup) |
| Enterprise AI standard | 400-800+ Gigabits/sec with reduced overhead |

| Quantization (INT4) | Detail |
|---|---|
| Purpose | Fit large models on smaller GPUs |
| Size reduction (vs FP32) | 8 times smaller |
| Precision loss | 10-30% |
| Impact on Llama 3.1 405B | Crucial for attempting the run on Mac Studios |

| Llama 3.1 405B Setup | Result |
|---|---|
| Single Mac Studio | Failed; rapid swap memory use |
| 5 Mac Studios (10 Gig Ethernet) | Successful but very slow (0.8 tokens/sec); no swap thanks to distributed unified memory |
| 5 Mac Studios (Thunderbolt) | Similar very slow performance (0.6 tokens/sec); loading issues |
