Ollama
We have moved Ollama to the hold ring because we have switched to llama.cpp server, which ships updates more quickly, has fewer bugs, and offers more settings and flexibility. For example, it lets us use some text-only models as VLMs by injecting a custom image encoder.
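As a rough sketch of that llama.cpp workflow: the snippet below starts llama-server with a separate image-encoder (multimodal projector) file and then sends an image through its OpenAI-compatible chat endpoint. The file names, port, model, and prompt are all placeholder assumptions, and it presumes a llama-server build with multimodal (`--mmproj`) support.

```python
# Hedged sketch: pairing a text-only GGUF model with a separate image
# encoder (multimodal projector) on llama.cpp's llama-server, then
# querying the OpenAI-compatible chat endpoint with an image.
# File names, port, and prompt are placeholders, not real artifacts.
import base64
import subprocess
import time

import requests

server = subprocess.Popen([
    "llama-server",
    "-m", "text-model.gguf",           # text-only base model (placeholder)
    "--mmproj", "image-encoder.gguf",  # custom image encoder (placeholder)
    "--port", "8080",
])

# Poll the health endpoint until the model has finished loading.
for _ in range(120):  # wait up to ~2 minutes
    try:
        if requests.get("http://localhost:8080/health").status_code == 200:
            break
    except requests.ConnectionError:
        pass
    time.sleep(1)

with open("photo.jpg", "rb") as f:  # placeholder image
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
server.terminate()
```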
Ollama is an open-source framework that allows you to run and manage large language models (LLMs) locally on your machine. It provides a simple way to download, run, and interact with various open-source models. Common use cases include local development and testing of AI applications, privacy-focused AI interactions where all data stays on your machine, building a personal AI assistant without cloud dependencies, and educational purposes or experimentation with LLMs.
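As a minimal sketch of that interaction model, the request below queries a locally running Ollama instance over its default REST endpoint; the model name is an assumption standing in for whatever model you have pulled.

```python
# Minimal sketch: querying a locally running Ollama instance over its
# REST API. The model name is a placeholder for any model fetched with
# `ollama pull`; all data stays on the local machine.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default local port
    json={
        "model": "llama3.1",  # placeholder; use any locally available model
        "prompt": "Explain what a KV cache is in one sentence.",
        "stream": False,      # return a single JSON object, not a stream
    },
)
print(resp.json()["response"])
```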
While Ollama is user-friendly and great for getting started with LLMs, it has some performance limitations: inference can be slower than with optimized frameworks such as vLLM; it offers no tensor parallelism, so model computation cannot be distributed across multiple GPUs; and it lacks KV cache quantization, which leads to higher memory usage during inference.
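For contrast, here is a hedged sketch of the kind of tensor parallelism vLLM offers, sharding each layer's weights across GPUs; the model name and GPU count are assumptions for illustration.

```python
# Hedged sketch: tensor-parallel inference in vLLM, the kind of
# optimization Ollama lacks. Model and GPU count are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    tensor_parallel_size=2,  # shard each layer's weight matrices across 2 GPUs
)
params = SamplingParams(max_tokens=64)
outputs = llm.generate(["What does tensor parallelism change?"], params)
print(outputs[0].outputs[0].text)
```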