Popular - Optimization
Updated: 2026-04-02
- Open framework for fast LLM and multimodal serving on clusters.
- RadixAttention plus disaggregated prefill and decode paths.
- Speculative decoding and scheduling for multi-GPU throughput.
- Sits next to vLLM and TensorRT-LLM in cutting-edge LLM servers.
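The prefix-sharing idea behind RadixAttention can be sketched in a few lines. This is an illustration of the concept, not SGLang's actual data structure: cached KV entries are keyed by token prefixes, so a request that shares a prompt prefix with earlier traffic only needs fresh prefill for its tail.

```python
# Sketch of prefix sharing (illustrative; SGLang uses a radix tree over
# token sequences, not a flat scan like this).

def shared_prefix_len(cache_keys, tokens):
    """Return the longest cached prefix of `tokens`, in tokens."""
    best = 0
    for key in cache_keys:
        n = 0
        for a, b in zip(key, tokens):
            if a != b:
                break
            n += 1
        best = max(best, n)
    return best

cache = [("sys", "You", "are"), ("sys", "You", "were")]
req = ("sys", "You", "are", "helpful")
hit = shared_prefix_len(cache, req)
# First 3 tokens are already cached; only 1 token needs fresh prefill.
```

With many requests sharing a long system prompt, this reuse is where most of the prefill savings come from.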
- NVIDIA inference compiler and runtime, including TensorRT-LLM.
- Kernel fusion, quantization, and tuning for GeForce through data-center GPUs.
- Imports from PyTorch, ONNX, and Hugging Face export flows.
- Typical path when teams standardize on NVIDIA GPU inference.
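A quantization pass of the kind listed above can be shown with a generic symmetric INT8 round-trip. This is an illustration only; TensorRT's calibration and per-channel scaling are far more sophisticated.

```python
# Generic symmetric INT8 quantization round-trip (illustrative, not
# TensorRT's implementation).

def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0  # map max magnitude to 127
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.01, 1.27]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# 4x smaller storage per weight, with quantization error bounded by scale/2.
```

The trade the tooling manages for you is exactly this one: smaller, faster integer math against a bounded accuracy loss.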
- Open-source LLM server with paged attention and continuous batching.
- Higher GPU throughput than naive Hugging Face serving loops.
- OpenAI-style HTTP API with broad model and quant support.
- Common beside TensorRT-LLM and SGLang in production LLM stacks.
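Paged attention in miniature, as a sketch of the idea rather than vLLM's code: the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks, so memory is claimed on demand instead of reserved for the maximum sequence length.

```python
# Toy paged KV-cache allocator (illustrative only).

BLOCK = 4  # tokens per physical block

class BlockTable:
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks  # shared pool of physical block ids
        self.blocks = []                # this sequence's block table

    def append_token(self, pos):
        if pos % BLOCK == 0:            # current block full: grab a new one
            self.blocks.append(self.free_blocks.pop())

free = list(range(100))
seq = BlockTable(free)
for pos in range(10):
    seq.append_token(pos)
# 10 tokens at 4 tokens/block -> only 3 physical blocks allocated.
```

Because blocks are fixed-size and pooled, short and long sequences can be batched together without fragmenting GPU memory, which is what enables continuous batching at high occupancy.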
- Experimentation for models, prompts, and parameters with guardrails.
- A/B routes, flags, and sequential tests tuned for LLM products.
- OpenAI-compatible gateway so clients swap models without rewrites.
- Sensible defaults for stats-heavy teams outgrowing spreadsheet-based toggles.
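The routing such gateways do can be sketched with deterministic bucketing (a generic sketch; the function and names here are hypothetical, not the product's API): hash a stable unit id so the same user always lands in the same arm, with a configurable traffic split.

```python
import hashlib

def assign_variant(user_id, experiment, treatment_pct=50):
    """Deterministically bucket a user into 'treatment' or 'control'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100   # stable bucket in [0, 100)
    return "treatment" if bucket < treatment_pct else "control"

arm = assign_variant("user-42", "prompt-v2-test")
same = assign_variant("user-42", "prompt-v2-test")
# Assignment is stable per (user, experiment), so results stay analyzable.
```

Salting the hash with the experiment name keeps assignments independent across experiments, which is what the sequential-testing statistics assume.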
- Compression API for LLMs, vision, and speech with accuracy guardrails.
- Exports ONNX, PyTorch, and TensorFlow after shrinking passes.
- Self-serve runs for teams testing latency before edge deploys.
- Shrink models as a service instead of hand-tuning every layer.
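How an accuracy guardrail works in principle (all names here are hypothetical; this is not the vendor's API): apply progressively stronger compression only while the measured metric stays within a tolerated drop from baseline.

```python
def compress_with_guardrail(levels, evaluate, baseline, max_drop=0.01):
    """Return the most aggressive level whose accuracy drop stays in budget."""
    chosen = None
    for level in levels:                       # ordered mild -> aggressive
        if baseline - evaluate(level) <= max_drop:
            chosen = level
        else:
            break                              # guardrail tripped; stop here
    return chosen

# Toy stand-in: accuracy falls as compression gets more aggressive.
scores = {"fp16": 0.90, "int8": 0.897, "int4": 0.85}
best = compress_with_guardrail(["fp16", "int8", "int4"], scores.get, 0.90)
# int8 stays inside the 0.01 accuracy budget; int4 does not.
```

The service value is automating this search per model instead of an engineer hand-tuning each layer against an eval set.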
- Mojo: Python-like language that compiles to fast native kernels.
- MAX platform packages runtimes and ops for AI on many accelerators.
- Targets Python teams stuck on interpreter and glue-code overhead.
- Smaller deployment images than naive Python stacks when tuned models ship to production.
- Intel toolkit to optimize and deploy models on CPU, iGPU, NPU, and VPU.
- Imports PyTorch, TensorFlow, and ONNX graphs for tuned kernels.
- Common on industrial PCs, gateways, and consumer Intel laptops.
- Pairs with oneAPI drivers for edge and embedded inference stacks.
- Hosted GPU inference APIs for open-weights LLMs and embeddings.
- Autoscaling and dedicated endpoints with optional fine-tune paths.
- For teams that want production endpoints without building a stack.
- Broad model catalog versus single-vendor proprietary APIs only.
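Why an OpenAI-compatible surface matters: swapping providers or models is a payload and base-URL change, not a client rewrite. A minimal chat-completions request body built with the stdlib (the model name below is a placeholder):

```python
import json

def chat_request(model, user_msg):
    """Build an OpenAI-style chat-completions request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
    })

body = chat_request("open-weights-model-name", "Summarize this document.")
payload = json.loads(body)
# The same shape works against any OpenAI-style endpoint; only the
# "model" field and base URL change when you swap providers.
```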
- Sparsification and quantization paths for faster transformers on CPUs.
- Open recipes plus services for sparse inference workflows.
- Targets x86 servers when GPU budgets are tight at inference time.
- Hooks into Hugging Face and export stacks teams already use.
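Magnitude pruning is the simplest sparsification pass (a generic sketch, not any vendor's recipe): zero out the smallest-magnitude weights so sparse CPU kernels can skip them entirely.

```python
def prune_by_magnitude(weights, sparsity=0.5):
    """Zero out the `sparsity` fraction of weights with smallest magnitude."""
    k = int(len(weights) * sparsity)           # number of weights to drop
    if k == 0:
        return list(weights)
    cutoff = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= cutoff else w for w in weights]

w = [0.9, -0.05, 0.4, -0.01, 0.7, 0.02]
sparse = prune_by_magnitude(w, sparsity=0.5)
# Half the weights become exact zeros, which sparse kernels never touch.
```

Real recipes prune gradually during fine-tuning to recover accuracy, but the mechanism, skipping exact zeros, is the same one the CPU kernels exploit.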
- Cross-platform inference engine for ONNX graphs on CPU and GPU.
- Execution providers plug in CUDA, TensorRT, DirectML, and more.
- One exported model targets many devices without op rewrites.
- Common behind mobile, edge, and server ONNX deployments.
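The execution-provider pattern in miniature, as a plain-Python sketch of the selection logic rather than ONNX Runtime's API: callers state a preference order, and the runtime falls back through it based on what the machine actually offers.

```python
# Provider names match ONNX Runtime's identifiers; the selection function
# itself is an illustrative stand-in, not the library's API.
PREFERENCE = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "DmlExecutionProvider",
    "CPUExecutionProvider",
]

def pick_provider(available, preference=PREFERENCE):
    """Return the first preferred execution provider that is available."""
    for ep in preference:
        if ep in available:
            return ep
    raise RuntimeError("no usable execution provider")

# A CPU-only box skips the GPU providers and lands on the CPU fallback.
choice = pick_provider({"CPUExecutionProvider"})
```

This fallback chain is what lets one exported ONNX model run unchanged from a CUDA server down to a DirectML laptop or a bare CPU.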