Model Parameters can be taken from config.json on HuggingFace or read directly from the model via model.config
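For example, these values can be read programmatically with the transformers library; a minimal sketch, where the model id is only an illustrative placeholder:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Read config.json from the HuggingFace Hub without downloading the weights.
# "microsoft/phi-1_5" is a placeholder model id, not one named above.
config = AutoConfig.from_pretrained("microsoft/phi-1_5")
print(config.hidden_size, config.num_hidden_layers, config.vocab_size)

# Or read the same values from an already-loaded model via model.config.
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5")
print(model.config.hidden_size, model.config.num_hidden_layers)
```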
The expanded (intermediate) dimensionality within the MLP block; usually it is 4 × hidden size.
May differ from the number of attention heads when Grouped Query Attention is used
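Both values can be sanity-checked against the config; a minimal sketch, assuming the usual transformers field names (intermediate_size, num_attention_heads, num_key_value_heads) and a placeholder model id:

```python
from transformers import AutoConfig

# Placeholder model id; substitute the model you are estimating for.
config = AutoConfig.from_pretrained("microsoft/phi-1_5")

# MLP expansion: intermediate_size is usually, but not always, 4 x hidden_size.
print(config.intermediate_size, 4 * config.hidden_size)

# With Grouped Query Attention, num_key_value_heads < num_attention_heads.
# Older configs may not define the field, hence the fallback.
kv_heads = getattr(config, "num_key_value_heads", config.num_attention_heads)
print(kv_heads, config.num_attention_heads)
```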
Total VRAM usage is 27836 MiB
When PyTorch uses CUDA for the first time, it allocates between 300 MiB and 2 GiB of VRAM
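A rough way to see this overhead is to compare the VRAM reported by nvidia-smi before and after the first CUDA operation. This is a sketch that assumes an NVIDIA GPU with the driver tools available; the measured difference also includes the small block reserved by PyTorch's caching allocator:

```python
import subprocess

import torch

def used_mib(gpu_index: int = 0) -> int:
    # Ask nvidia-smi how much VRAM is currently used on one GPU, in MiB.
    out = subprocess.check_output([
        "nvidia-smi",
        f"--id={gpu_index}",
        "--query-gpu=memory.used",
        "--format=csv,noheader,nounits",
    ])
    return int(out.decode().strip())

before = used_mib()
torch.zeros(1, device="cuda")  # first CUDA call: context and kernels are loaded
after = used_mib()
print(f"VRAM taken by CUDA initialization: ~{after - before} MiB")
```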
Number of Parameters (1.418 billion) × number of bytes per parameter (6; with mixed precision each parameter is stored in both full precision, 4 bytes, and half precision, 2 bytes)
Sum of the sizes of all intermediate tensors during the forward pass across all 24 layers. Activation size has a quadratic dependence on Sequence Length.
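A large part of that quadratic term comes from the attention score matrices, which have shape (Batch Size, number of heads, Sequence Length, Sequence Length). A rough sketch of this single contribution; the head count (32) and half-precision scores (2 bytes) are assumptions for illustration, not values given above:

```python
# Size of the attention score tensors alone, per layer and across all layers.
batch, seq_len, num_heads, num_layers = 4, 512, 32, 24  # num_heads is assumed
bytes_per_value = 2  # assuming the scores are kept in half precision

per_layer = batch * num_heads * seq_len * seq_len * bytes_per_value
total = per_layer * num_layers
print(f"{per_layer / 2**20:.0f} MiB per layer, {total / 2**20:.0f} MiB in total")

# Doubling seq_len to 1024 quadruples these numbers: the quadratic dependence.
```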
A gradient is stored for each parameter in full precision, so it is Number of Parameters (1.418 billion) × number of bytes per gradient (4)
The optimizer stores a moving average of the gradients (the first moments) for each parameter in full precision, so it is Number of Parameters (1.418 billion) × number of bytes per value (4)
Batch Size (4) × Sequence Length (512) × Vocabulary Size (51200) × number of bytes per value (4) × 2 (the probabilities after softmax are stored in addition to the output logits and have the same size)
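Evaluating the per-component formulas above gives the following sizes; the remaining gap up to the 27836 MiB total is taken by the activations and the CUDA kernels:

```python
MiB = 2**20
n_params = 1.418e9  # Number of Parameters

parameters = n_params * 6          # fp32 + fp16 copy of every parameter
gradients = n_params * 4           # one fp32 gradient per parameter
first_moments = n_params * 4       # one fp32 moving average per parameter
outputs = 4 * 512 * 51200 * 4 * 2  # fp32 logits plus softmax probabilities

for name, size in [
    ("Parameters", parameters),
    ("Gradients", gradients),
    ("First Moments", first_moments),
    ("Outputs", outputs),
]:
    print(f"{name}: {size / MiB:.0f} MiB")
```

This prints roughly 8114, 5409, 5409 and 800 MiB respectively.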