OpenAI-style API for open large language models (see the usage sketch after this list)
Run local LLMs on any device; open-source
The Triton Inference Server provides an optimized cloud and edge inferencing solution
FlashInfer: Kernel Library for LLM Serving
LMDeploy is a toolkit for compressing, deploying, and serving LLMs
A high-throughput and memory-efficient inference and serving engine for LLMs
Ready-to-use OCR with 80+ supported languages
Operating LLMs in production
A library for accelerating Transformer models on NVIDIA GPUs
The official Python client for the Hugging Face Hub
Simplifies the local serving of AI models from any source
AIMET is a library that provides advanced quantization and compression techniques for trained neural network models
Everything you need to build state-of-the-art foundation models
The easiest and laziest way to build multi-agent LLM applications
Uncover insights, surface problems, monitor, and fine-tune your LLM
Optimizing inference proxy for LLMs
GPU environment management and cluster orchestration
Large Language Model Text Generation Inference
Official inference library for Mistral models
Multilingual Automatic Speech Recognition with word-level timestamps
Neural Network Compression Framework for enhanced OpenVINO inference
Phi-3.5 for Mac: Locally-run Vision and Language Models
Standardized Serverless ML Inference Platform on Kubernetes
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed
Bring the notion of Model-as-a-Service to life
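Several entries above (the OpenAI-style API server, vLLM, LMDeploy, Text Generation Inference) expose OpenAI-compatible endpoints. Below is a minimal sketch of querying such an endpoint with the official openai Python client; the base URL, API key, and model name are placeholder assumptions, not values taken from any specific project above.

```python
# Minimal sketch: calling an OpenAI-compatible chat-completions endpoint.
# The base_url, api_key, and model name are hypothetical placeholders for
# whatever the locally running server actually exposes.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local server address
    api_key="not-needed-locally",         # many local servers ignore the key
)

response = client.chat.completions.create(
    model="my-model",  # hypothetical model name registered with the server
    messages=[{"role": "user", "content": "Summarize what an inference server does."}],
)
print(response.choices[0].message.content)
```

Because the request shape matches the OpenAI API, the same client code works against any of the compatible servers by changing only the base URL and model name.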