The end-to-end open-source platform for on-device LLM deployment. Train with custom Triton kernels, export to CoreML, run on Apple Neural Engine.
From training to deployment in minutes, not months.
LoRA/QLoRA fine-tuning with custom Triton kernels. 8x faster training on consumer GPUs.
Convert to CoreML with ANE-optimized LUT quantization. 4-bit, 6-bit, and 8-bit support.
Run on Apple Neural Engine. Not CPU — NPU. Real-time inference on iPhone and iPad.
Everything you need to train and deploy LLMs on-device, nothing you don't.
Hand-optimized RMSNorm (8x), SwiGLU (5x), and RoPE (2x) kernels for dramatically faster training.
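For reference, the operation the RMSNorm kernel fuses is simple to state. A pure-Python sketch of the math (this is only a reference implementation, not the Triton kernel itself):

```python
import math

def rmsnorm(x, weight, eps=1e-6):
    """Reference RMSNorm: scale each element by the reciprocal of the
    vector's root-mean-square. The Triton kernel fuses the reduction,
    rsqrt, and scaling into a single GPU pass; this version only shows
    the math being fused."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

hidden = [1.0, 2.0, 3.0, 4.0]
normed = rmsnorm(hidden, [1.0] * 4)  # output has unit mean square
```

The speedup comes from fusion: a naive implementation launches separate kernels for the square, mean, rsqrt, and multiply, each reading and writing the full tensor.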
Full support for OLoRA, PiSSA, DoRA, RSLoRA, and LoftQ initialization — go beyond standard LoRA.
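All of these variants share the same underlying low-rank update: a frozen weight matrix W is augmented with a scaled product of two small matrices, W + (alpha/r)·B·A. A toy sketch of the merge step (illustrative only; the names below are not Nimbo's internals):

```python
def merge_lora(W, A, B, r, alpha):
    """Merge a LoRA adapter into a frozen weight matrix.

    W: (out, in) base weights; B: (out, r); A: (r, in).
    The merged weight is W + (alpha / r) * B @ A. Variants such as
    PiSSA, OLoRA, or LoftQ change how A and B are initialized (and
    RSLoRA changes the scaling), not this basic shape."""
    scale = alpha / r
    merged = [row[:] for row in W]
    for i in range(len(W)):
        for j in range(len(W[0])):
            delta = sum(B[i][k] * A[k][j] for k in range(r))
            merged[i][j] += scale * delta
    return merged

W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [0.0]]  # (out, r) with r=1
A = [[0.0, 2.0]]    # (r, in)
merged = merge_lora(W, A, B, r=1, alpha=2)
# merged == [[1.0, 4.0], [0.0, 1.0]]
```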
ANE-optimized export pipeline with LUT quantization. Deploy directly to Apple Neural Engine.
~50 MB install vs. 500 MB+ alternatives. Minimal dependencies, maximum speed.
Instruction tuning with masked loss — train on completions only for cleaner, more focused models.
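Masked completion loss is conventionally implemented by setting prompt-token labels to an ignore index (typically -100) so the cross-entropy loss averages only over response tokens. A minimal sketch of the masking step (illustrative, not Nimbo's internals):

```python
IGNORE_INDEX = -100  # conventional cross-entropy ignore index

def mask_prompt_labels(token_ids, prompt_len):
    """Copy labels from the token ids, masking every token before the
    response so the loss is computed on completions only."""
    return [IGNORE_INDEX if i < prompt_len else tok
            for i, tok in enumerate(token_ids)]

# Prompt occupies the first 3 tokens; only the last 2 contribute to loss.
labels = mask_prompt_labels([101, 7592, 2088, 999, 102], prompt_len=3)
# labels == [-100, -100, -100, 999, 102]
```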
Built-in W&B integration, early stopping, memory monitoring, and checkpoint management.
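Early stopping in trainers of this kind typically tracks the best validation loss and halts after a fixed patience window with no improvement. A minimal sketch of that logic (a generic pattern, not Nimbo's actual implementation):

```python
class EarlyStopping:
    """Stop training after `patience` evaluations without improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, val_loss):
        """Record a validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

stopper = EarlyStopping(patience=2)
history = [2.0, 1.5, 1.6, 1.7]  # loss stops improving after epoch 2
stops = [stopper.step(loss) for loss in history]
# stops == [False, False, False, True]
```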
From training to on-device inference — clean APIs at every step.
```python
from nimbo import Nimbo, LoRAConfig, TrainingConfig

# Initialize trainer with model and dataset
trainer = Nimbo(
    base_model_name="meta-llama/Llama-3.2-1B",
    dataset="your_dataset.jsonl",
    lora_config=LoRAConfig(r=16, lora_alpha=32),
    training_config=TrainingConfig(
        learning_rate=2e-4,
        num_train_epochs=3,
        train_on_responses_only=True,  # Masked loss on completions
    ),
    use_triton_kernels=True,  # 8x faster RMSNorm, SwiGLU, RoPE
)

trainer.train()
trainer.save()  # Merged model → ./nimbo_output/final_merged
```
```python
from nimbo.export.coreml import convert_hf_to_coreml

# Convert merged model to CoreML with LUT quantization
result = convert_hf_to_coreml(
    model_id="./nimbo_output/final_merged",
    output_dir="./coreml_output",
    lut_bits=4,        # 4-bit LUT quantization for ANE
    context_length=512,
    split_model=True,  # Split: embeddings, decoder, lm_head
)

# Output: .mlpackage files + meta.yaml + tokenizer
print(result.output_paths)
```
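LUT (lookup-table) quantization maps each weight to the nearest of 2^bits table entries and stores only the small indices plus the table; with `lut_bits=4` that is 16 entries. A toy sketch of the encode/decode round trip using a uniform table (real palettization, as in CoreML, fits the table to the weight distribution, e.g. with k-means; this only illustrates the idea):

```python
def lut_quantize(weights, bits=4):
    """Quantize weights to indices into a uniform lookup table.

    Returns (indices, table). A production LUT quantizer learns the
    table from the weights; a uniform table keeps this self-contained."""
    n = 2 ** bits
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (n - 1)
    table = [lo + i * step for i in range(n)]
    indices = [min(range(n), key=lambda i: abs(w - table[i])) for w in weights]
    return indices, table

def lut_dequantize(indices, table):
    """Recover approximate weights by table lookup."""
    return [table[i] for i in indices]

w = [-0.30, -0.01, 0.12, 0.30]
idx, table = lut_quantize(w, bits=4)
restored = lut_dequantize(idx, table)
```

Each index fits in 4 bits, so the decoder weights shrink roughly 4x versus fp16 while the 16-entry table adds negligible overhead.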
```swift
import NimboCore

// Load model from Files app or bundle
let manager = InferenceManager()
try await manager.loadModel(from: modelURL)

// Generate with streaming tokens
try await manager.generate(
    prompt: "Explain quantum computing",
    maxTokens: 512,
    temperature: 0.7
) { token in
    print(token, terminator: "")
}

// Runs on Apple Neural Engine — not CPU
```
Train, export, and deploy popular open-source LLMs.
NimboChat — a production-ready SwiftUI chat app powered by on-device inference.
A fully featured iOS chat application that demonstrates real-time LLM inference on Apple Neural Engine. Built with SwiftUI, powered by NimboCore.
See how Nimbo compares to other fine-tuning frameworks.
| Feature | Nimbo | Transformers | Unsloth |
|---|---|---|---|
| Install size | ~50 MB | 500 MB+ | 200 MB+ |
| Dependencies | Minimal | Heavy | Moderate |
| CoreML export | ✓ Built-in | ✗ | ✗ |
| On-device sample app | ✓ NimboChat | ✗ | ✗ |
| NPU support | ✓ Apple ANE | ✗ | ✗ |
| Custom Triton kernels | ✓ 8x speedup | ✗ | ✓ |
| Learning curve | Low | Steep | Moderate |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 |
Get started with Nimbo in minutes. Open source, Apache 2.0 licensed, community driven.