NVIDIA TensorRT vs ONNX Runtime

TL;DR

  • ONNX Runtime is a versatile, hardware-agnostic inference engine with support for multiple execution providers, making it suitable for various deployment environments beyond NVIDIA hardware. It focuses on general-purpose optimization and cross-platform compatibility.

  • NVIDIA TensorRT is a specialized inference engine for NVIDIA GPUs, offering advanced, GPU-specific optimizations that maximize performance, especially for low-latency and real-time applications.

NVIDIA TensorRT

TensorRT is a high-performance deep learning inference SDK built specifically for NVIDIA GPUs. It integrates tightly with the CUDA and cuDNN libraries to extract the best possible performance from GPU hardware.

These are its main features (an engine-build sketch follows the list):

  1. NVIDIA hardware-specific optimizations with native CUDA support.

  2. Layer and kernel auto-tuning (see Kernel Auto-Tuning).

  3. Quantization

  4. Tensor memory management: GPU memory is allocated only when needed, leaving more memory free and enabling bigger batch sizes.
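For illustration, here is a minimal sketch of building a serialized engine from an ONNX model with the TensorRT 8.x Python API. The file paths are hypothetical, and the FP16 flag assumes a GPU with fast half-precision support:

```python
import tensorrt as trt

ONNX_PATH = "model.onnx"      # hypothetical exported model
ENGINE_PATH = "model.engine"  # hypothetical output path

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# ONNX parsing requires an explicit-batch network definition.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open(ONNX_PATH, "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # allow mixed FP32/FP16 kernels

# Layer/kernel auto-tuning happens inside this call: TensorRT times
# candidate kernels on the current GPU and serializes the fastest plan.
engine = builder.build_serialized_network(network, config)
with open(ENGINE_PATH, "wb") as f:
    f.write(engine)
```

The resulting engine is tuned to the GPU it was built on, which is why TensorRT engines are usually rebuilt per deployment target rather than shipped as portable artifacts.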

ONNX Runtime

ONNX Runtime is an open-source inference engine that executes models in the Open Neural Network Exchange (ONNX) format. It is designed to be hardware-agnostic, so the same model can be deployed across CPUs, GPUs, FPGAs, and specialized accelerators.

Main features (a usage sketch follows the list):

  1. Hardware-agnostic: one API that targets many backends through execution providers.

  2. Model flexibility: runs ONNX models exported from TensorFlow, PyTorch, Keras, MXNet, and more.

  3. Graph optimizations like constant folding, operator fusion, and dead node elimination.

  4. Quantization

  5. Dynamic input shapes
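Below is a minimal sketch of the execution-provider mechanism using the onnxruntime Python package. The model file and its 1x3x224x224 float input are assumptions for illustration:

```python
import numpy as np
import onnxruntime as ort

# Providers are tried in order; the first one available on this machine
# wins, so the same code runs on GPU and CPU hosts alike.
session = ort.InferenceSession(
    "model.onnx",  # hypothetical model file
    providers=[
        "TensorrtExecutionProvider",
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)

input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed input shape

outputs = session.run(None, {input_name: x})  # None = return all outputs
print(outputs[0].shape)
```

Note that the TensorRT execution provider lets ONNX Runtime delegate supported subgraphs to TensorRT while keeping the rest of the model on CUDA or CPU, which is a common way to combine the two engines.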

Comparison of features

| Features | ONNX Runtime | NVIDIA TensorRT |
| --- | --- | --- |
| Hardware Support | Multi-platform: CPUs, NVIDIA GPUs, Intel hardware, FPGAs, custom accelerators | NVIDIA GPUs only |
| Execution Providers | CUDA, TensorRT, OpenVINO, DirectML, etc. | CUDA and TensorRT for NVIDIA GPUs |
| Model Format | ONNX models (exported from TensorFlow, PyTorch, Keras, etc.) | TensorFlow, PyTorch, ONNX, Caffe; converted to TensorRT's own engine format |
| Optimization Techniques | Graph optimizations and basic kernel auto-tuning (depending on the execution provider) | Advanced kernel auto-tuning, mixed precision, and INT8 quantization |
| Precision Support | FP32, FP16, and INT8 quantization | Optimized FP16 and INT8 support with extensive calibration tools |
| Dynamic Shapes | Natively supports dynamic input sizes and batch dimensions | Supported, but requires explicit optimization profiles during engine building |
| Use Case Flexibility | Suitable for a wide range of hardware and model types; adaptable to various deployment environments | Best for high-performance, low-latency applications on NVIDIA GPUs |
| Ease of Use | More flexible; integrates with different backends via execution providers | More complex; requires NVIDIA GPU-specific setup and optimization |
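Both engines list quantization as a core optimization. As a concrete example, here is a minimal sketch of post-training dynamic quantization with ONNX Runtime's quantization tooling; the file names are hypothetical, and dynamic quantization mainly benefits CPU inference:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Rewrites weights to INT8 in the exported graph; activations are
# quantized on the fly at runtime, so no calibration dataset is needed.
quantize_dynamic(
    model_input="model.onnx",        # hypothetical FP32 model
    model_output="model.int8.onnx",  # quantized output model
    weight_type=QuantType.QInt8,
)
```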
