Popular - Optimization
Updated: 2026-04-02
- Open framework for fast LLM and multimodal serving on clusters.
- RadixAttention plus disaggregated prefill and decode paths.
- Speculative decoding and scheduling for multi-GPU throughput.
- Sits next to vLLM and TensorRT-LLM in cutting-edge LLM servers.
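The prefix-sharing idea behind RadixAttention can be sketched in a few lines. This is an illustration of the concept, not SGLang's actual data structure: cached KV entries are keyed by token prefixes, so a request that shares a prompt prefix with earlier traffic only needs fresh prefill for its tail.

```python
# Sketch of prefix sharing (illustrative; SGLang uses a radix tree over
# token sequences, not a flat scan like this).

def shared_prefix_len(cache_keys, tokens):
    """Return the longest cached prefix of `tokens`, in tokens."""
    best = 0
    for key in cache_keys:
        n = 0
        for a, b in zip(key, tokens):
            if a != b:
                break
            n += 1
        best = max(best, n)
    return best

cache = [("sys", "You", "are"), ("sys", "You", "were")]
req = ("sys", "You", "are", "helpful")
hit = shared_prefix_len(cache, req)
# First 3 tokens are already cached; only 1 token needs fresh prefill.
```

With many requests sharing a long system prompt, this reuse is where most of the prefill savings come from.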
- NVIDIA inference compiler and runtime, including TensorRT-LLM.
- Kernel fusion, quantization, and tuning for GeForce through data-center GPUs.
- Imports from PyTorch, ONNX, and Hugging Face export flows.
- Typical path when teams standardize on NVIDIA GPU inference.
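A quantization pass of the kind listed above can be shown with a generic symmetric INT8 round-trip. This is an illustration only; TensorRT's calibration and per-channel scaling are far more sophisticated.

```python
# Generic symmetric INT8 quantization round-trip (illustrative, not
# TensorRT's implementation).

def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0  # map max magnitude to 127
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.01, 1.27]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# 4x smaller storage per weight, with quantization error bounded by scale/2.
```

The trade the tooling manages for you is exactly this one: smaller, faster integer math against a bounded accuracy loss.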
- Open-source LLM server with paged attention and continuous batching.
- Higher GPU throughput than naive Hugging Face serving loops.
- OpenAI-style HTTP API with broad model and quant support.
- Common beside TensorRT-LLM and SGLang in production LLM stacks.
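Paged attention in miniature, as a sketch of the idea rather than vLLM's code: the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks, so memory is claimed on demand instead of reserved for the maximum sequence length.

```python
# Toy paged KV-cache allocator (illustrative only).

BLOCK = 4  # tokens per physical block

class BlockTable:
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks  # shared pool of physical block ids
        self.blocks = []                # this sequence's block table

    def append_token(self, pos):
        if pos % BLOCK == 0:            # current block full: grab a new one
            self.blocks.append(self.free_blocks.pop())

free = list(range(100))
seq = BlockTable(free)
for pos in range(10):
    seq.append_token(pos)
# 10 tokens at 4 tokens/block -> only 3 physical blocks allocated.
```

Because blocks are fixed-size and pooled, short and long sequences can be batched together without fragmenting GPU memory, which is what enables continuous batching at high occupancy.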
- Experimentation for models, prompts, and parameters with guardrails.
- A/B routes, flags, and sequential tests tuned for LLM products.
- OpenAI-compatible gateway so clients swap models without rewrites.
- Sensible defaults for stats-heavy teams outgrowing spreadsheet-based toggles.
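The routing such gateways do can be sketched with deterministic bucketing (a generic sketch; the function and names here are hypothetical, not the product's API): hash a stable unit id so the same user always lands in the same arm, with a configurable traffic split.

```python
import hashlib

def assign_variant(user_id, experiment, treatment_pct=50):
    """Deterministically bucket a user into 'treatment' or 'control'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100   # stable bucket in [0, 100)
    return "treatment" if bucket < treatment_pct else "control"

arm = assign_variant("user-42", "prompt-v2-test")
same = assign_variant("user-42", "prompt-v2-test")
# Assignment is stable per (user, experiment), so results stay analyzable.
```

Salting the hash with the experiment name keeps assignments independent across experiments, which is what the sequential-testing statistics assume.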
- Compression API for LLMs, vision, and speech with accuracy guardrails.
- Exports ONNX, PyTorch, and TensorFlow after shrinking passes.
- Self-serve runs for teams testing latency before edge deploys.
- Shrink models as a service instead of hand-tuning every layer.
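How an accuracy guardrail works in principle (all names here are hypothetical; this is not the vendor's API): apply progressively stronger compression only while the measured metric stays within a tolerated drop from baseline.

```python
def compress_with_guardrail(levels, evaluate, baseline, max_drop=0.01):
    """Return the most aggressive level whose accuracy drop stays in budget."""
    chosen = None
    for level in levels:                       # ordered mild -> aggressive
        if baseline - evaluate(level) <= max_drop:
            chosen = level
        else:
            break                              # guardrail tripped; stop here
    return chosen

# Toy stand-in: accuracy falls as compression gets more aggressive.
scores = {"fp16": 0.90, "int8": 0.897, "int4": 0.85}
best = compress_with_guardrail(["fp16", "int8", "int4"], scores.get, 0.90)
# int8 stays inside the 0.01 accuracy budget; int4 does not.
```

The service value is automating this search per model instead of an engineer hand-tuning each layer against an eval set.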
- Mojo: Python-like language that compiles to fast native kernels.
- MAX platform packages runtimes and ops for AI on many accelerators.
- Targets Python teams stuck on interpreter and glue-code overhead.
- Smaller deployment images than naive Python stacks when tuned models ship to production.
- Intel toolkit to optimize and deploy models on CPU, iGPU, NPU, and VPU.
- Imports PyTorch, TensorFlow, and ONNX graphs for tuned kernels.
- Common on industrial PCs, gateways, and consumer Intel laptops.
- Pairs with oneAPI drivers for edge and embedded inference stacks.
- Hosted GPU inference APIs for open-weights LLMs and embeddings.
- Autoscaling and dedicated endpoints with optional fine-tune paths.
- For teams that want production endpoints without building a stack.
- Broad model catalog versus single-vendor proprietary APIs only.
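Why an OpenAI-compatible surface matters: swapping providers or models is a payload and base-URL change, not a client rewrite. A minimal chat-completions request body built with the stdlib (the model name below is a placeholder):

```python
import json

def chat_request(model, user_msg):
    """Build an OpenAI-style chat-completions request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
    })

body = chat_request("open-weights-model-name", "Summarize this document.")
payload = json.loads(body)
# The same shape works against any OpenAI-style endpoint; only the
# "model" field and base URL change when you swap providers.
```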
- Sparsification and quantization paths for faster transformers on CPUs.
- Open recipes plus services for sparse inference workflows.
- Targets x86 servers when GPU budgets are tight at inference time.
- Hooks into Hugging Face and export stacks teams already use.
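Magnitude pruning is the simplest sparsification pass (a generic sketch, not any vendor's recipe): zero out the smallest-magnitude weights so sparse CPU kernels can skip them entirely.

```python
def prune_by_magnitude(weights, sparsity=0.5):
    """Zero out the `sparsity` fraction of weights with smallest magnitude."""
    k = int(len(weights) * sparsity)           # number of weights to drop
    if k == 0:
        return list(weights)
    cutoff = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= cutoff else w for w in weights]

w = [0.9, -0.05, 0.4, -0.01, 0.7, 0.02]
sparse = prune_by_magnitude(w, sparsity=0.5)
# Half the weights become exact zeros, which sparse kernels never touch.
```

Real recipes prune gradually during fine-tuning to recover accuracy, but the mechanism, skipping exact zeros, is the same one the CPU kernels exploit.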
- Cross-platform inference engine for ONNX graphs on CPU and GPU.
- Execution providers plug in CUDA, TensorRT, DirectML, and more.
- One exported model targets many devices without op rewrites.
- Common behind mobile, edge, and server ONNX deployments.
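The execution-provider pattern in miniature, as a plain-Python sketch of the selection logic rather than ONNX Runtime's API: callers state a preference order, and the runtime falls back through it based on what the machine actually offers.

```python
# Provider names match ONNX Runtime's identifiers; the selection function
# itself is an illustrative stand-in, not the library's API.
PREFERENCE = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "DmlExecutionProvider",
    "CPUExecutionProvider",
]

def pick_provider(available, preference=PREFERENCE):
    """Return the first preferred execution provider that is available."""
    for ep in preference:
        if ep in available:
            return ep
    raise RuntimeError("no usable execution provider")

# A CPU-only box skips the GPU providers and lands on the CPU fallback.
choice = pick_provider({"CPUExecutionProvider"})
```

This fallback chain is what lets one exported ONNX model run unchanged from a CUDA server down to a DirectML laptop or a bare CPU.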