VRAM Estimator

Estimate GPU VRAM usage of transformer-based models.
Running Parameters

  • Mode: Inference or Training
  • Precision: mixed or full (fp32)
  • Optimizer: Adam or SGD
Model Parameters

Model parameters can be taken from config.json on HuggingFace or directly from the model via model.config.

  • Intermediate Size: the expanded dimensionality within the MLP block; usually 4 × hidden size.
  • Number of Key-Value Heads: may differ from the number of attention heads when Grouped Query Attention is used.
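For example, here is a minimal sketch that reads these fields with the Hugging Face transformers library (the model id is only an illustration, and some architectures use slightly different field names):

    from transformers import AutoConfig

    # Downloads and parses config.json for a model hosted on HuggingFace.
    config = AutoConfig.from_pretrained("microsoft/phi-1_5")  # example model id

    print(config.num_hidden_layers)    # number of layers
    print(config.hidden_size)          # hidden size
    print(config.intermediate_size)    # MLP expansion, usually 4 * hidden_size
    print(config.vocab_size)           # vocabulary size
    print(config.num_attention_heads)  # number of attention heads
    # Models without Grouped Query Attention may not define key-value heads separately.
    print(getattr(config, "num_key_value_heads", None) or config.num_attention_heads)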

Estimation Result

An example estimate for a training run (values shown in MiB; the tool can also display GiB):
  • Total VRAM usage is 27836 MiB (a worked recalculation of these components appears after this list)

  • CUDA Kernels use 1000 MiB of VRAM

    When PyTorch uses CUDA for the first time, it allocates between 300 MiB and 2 GiB of VRAM

  • Parameters use 8114 MiB of VRAM

    Number of Parameters (1.418 billion) × number of bytes per parameter (6, since in mixed precision each parameter is stored in both full precision, 4 bytes, and half precision, 2 bytes)

  • Activations use 7104 MiB of VRAM

    Sum of the sizes of all intermediate tensors produced during the forward pass across all 24 layers. Activation size grows quadratically with Sequence Length, because the attention score matrices are Sequence Length × Sequence Length.

  • Gradients use 5409 MiB of VRAM

    A gradient is stored for each parameter in full precision, so this is Number of Parameters (1.418 billion) × number of bytes per parameter (4)

  • First Moments use 5409 MiB of VRAM

    The optimizer stores a moving average of the gradients for each parameter in full precision, so this is Number of Parameters (1.418 billion) × number of bytes per parameter (4)

  • Output tensor uses 800 MiB of VRAM

    Batch Size (4) × Sequence Length (512) × Vocabulary Size (51200) × number of bytes per value (4) × 2 (the probabilities after the softmax are stored as well, and they have the same size as the output logits)

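Putting the formulas from the list together, the sketch below re-derives these numbers in Python. The activation figure is taken as reported, because re-deriving it requires the full per-layer tensor breakdown covered in the post; everything else uses the example inputs above (1.418 billion parameters, batch size 4, sequence length 512, vocabulary size 51200).

    MIB = 1024 ** 2

    num_parameters = 1.418e9  # number of parameters
    batch_size, seq_len, vocab_size = 4, 512, 51200

    cuda_kernels  = 1000 * MIB                   # PyTorch CUDA context, roughly 300 MiB - 2 GiB
    parameters    = num_parameters * 6           # 4-byte fp32 copy + 2-byte fp16 copy
    activations   = 7104 * MIB                   # as reported by the estimator for this model
    gradients     = num_parameters * 4           # one fp32 gradient per parameter
    first_moments = num_parameters * 4           # one fp32 gradient moving average per parameter
    outputs       = batch_size * seq_len * vocab_size * 4 * 2  # fp32 logits + softmax probabilities

    total = cuda_kernels + parameters + activations + gradients + first_moments + outputs

    for name, size in [("CUDA kernels", cuda_kernels), ("Parameters", parameters),
                       ("Activations", activations), ("Gradients", gradients),
                       ("First moments", first_moments), ("Output tensor", outputs),
                       ("Total", total)]:
        print(f"{name:>13}: {size / MIB:6.0f} MiB")  # reproduces the MiB figures above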

While the estimates might not be completely precise, they reflect my current understanding of the topic. For an in-depth explanation and the logic behind these numbers, feel free to check out my detailed post and the calculation code in the source repo. If you feel something is wrong, please reach out via email at alex@asmirnov.xyz or create an issue/PR in the repo. Cheers!