FlashInfer: Kernel Library for LLM Serving

FlashInfer is a kernel library for serving Large Language Models (LLMs). It provides high-performance inference kernels that integrate with existing serving systems, with the goal of reducing latency and improving efficiency in LLM deployments. FlashInfer supports multiple hardware architectures and is designed to scale to production workloads.
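In LLM serving, the central kernel is attention over a key/value (KV) cache. FlashInfer's kernels fall into this category. The sketch below is a plain-Python reference for what a decode-attention kernel computes for one newly generated token. It is not FlashInfer's API. The function names (`partial_attention`, `merge_states`, `decode_attention`) and the list-based tensor layout are hypothetical. It assumes grouped-query attention, where several query heads share one KV head. It also processes the cache in chunks and merges the partial results through their log-sum-exp, which is the online-softmax idea that FlashAttention-style GPU kernels use to split long KV caches.

```python
import math
from typing import List, Optional, Tuple

Vector = List[float]
State = Tuple[Vector, float]  # (chunk-normalized output, log-sum-exp of scores)


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def partial_attention(q: Vector, keys: List[Vector], values: List[Vector]) -> State:
    """Attention of one query over a chunk of KV entries (hypothetical helper).

    Returns the output normalized within the chunk plus the log-sum-exp (LSE)
    of the scaled scores, which is all that is needed to merge chunks later.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [_dot(q, k) * scale for k in keys]
    m = max(scores)  # subtract the max before exp() for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(len(values[0]))
    ]
    return out, m + math.log(total)


def merge_states(a: State, b: State) -> State:
    """Combine results computed over two disjoint KV chunks."""
    (out_a, lse_a), (out_b, lse_b) = a, b
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)  # relative softmax mass
    merged = [(wa * x + wb * y) / (wa + wb) for x, y in zip(out_a, out_b)]
    return merged, m + math.log(wa + wb)


def decode_attention(
    q_heads: List[Vector],
    k_cache: List[List[Vector]],
    v_cache: List[List[Vector]],
    chunk_size: int = 2,
) -> List[Vector]:
    """Decode attention for a single new token (hypothetical reference).

    q_heads:          [num_qo_heads][head_dim]
    k_cache, v_cache: [kv_len][num_kv_heads][head_dim]
    """
    num_qo, num_kv = len(q_heads), len(k_cache[0])
    group = num_qo // num_kv  # query heads per shared KV head (GQA)
    outputs = []
    for h, q in enumerate(q_heads):
        kv_h = h // group
        keys = [tok[kv_h] for tok in k_cache]
        vals = [tok[kv_h] for tok in v_cache]
        state: Optional[State] = None
        for start in range(0, len(keys), chunk_size):
            part = partial_attention(
                q, keys[start:start + chunk_size], vals[start:start + chunk_size]
            )
            state = part if state is None else merge_states(state, part)
        outputs.append(state[0])
    return outputs


if __name__ == "__main__":
    # One query head and two identical keys: the output is the mean of the values.
    q = [[1.0, 0.0]]
    k = [[[1.0, 0.0]], [[1.0, 0.0]]]
    v = [[[2.0, 0.0]], [[4.0, 2.0]]]
    print(decode_attention(q, k, v, chunk_size=1))  # → [[3.0, 1.0]]
```

Each chunk's result carries its own log-sum-exp, so the merged output does not depend on `chunk_size` beyond floating-point rounding. A production GPU kernel applies the same arithmetic to tensors and uses the independence of chunks to parallelize work across a long cache.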

Features

Optimized kernel operations for LLM inference
Seamless integration with existing serving frameworks
Support for multiple hardware architectures
Scalable design for production environments
Reduction in inference latency
Improved resource utilization
Compatibility with popular LLM architectures
Open-source availability
Active community support

License

Apache License 2.0

Additional Project Details

Operating Systems

Linux

Programming Language

Python

Related Categories

Python LLM Inference Tool

Registered

2025-03-18

Similar Business Software

LM-Kit.NET

LM-Kit.NET is a cutting-edge, high-level inference SDK designed specifically to bring the advanced capabilities of Large Language Models (LLM) into the C# ecosystem. Tailored for developers working within .NET, LM-Kit.NET provides a comprehensive suite of powerful Generative AI tools, making...

Vertex AI

Build, deploy, and scale machine learning (ML) models faster, with fully managed ML tools for any use case. Through Vertex AI Workbench, Vertex AI is natively integrated with BigQuery, Dataproc, and Spark. You can use BigQuery ML to create and execute machine learning models in BigQuery...

Google AI Studio

Google AI Studio is a unified development platform that helps teams explore, build, and deploy applications using Google’s most advanced AI models, including Gemini 3. It brings text, image, audio, and video models together in one interactive playground. With vibe coding, developers can use...

vLLM

vLLM is a high-performance library designed to facilitate efficient inference and serving of Large Language Models (LLMs). Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry. It offers...

RunPod

RunPod offers a cloud-based platform designed for running AI workloads, focusing on providing scalable, on-demand GPU resources to accelerate machine learning (ML) model training and inference. With its diverse selection of powerful GPUs like the NVIDIA A100, RTX 3090, and H100, RunPod supports...

NVIDIA TensorRT

NVIDIA TensorRT is an ecosystem of APIs for high-performance deep learning inference, encompassing an inference runtime and model optimizations that deliver low latency and high throughput for production applications. Built on the CUDA parallel programming model, TensorRT optimizes neural...
