Model Parameters can be taken from config.json on HuggingFace or read directly from the model via model.config
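For example, these values can be read programmatically with the transformers library; a minimal sketch, where the model id is only an illustrative placeholder:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Read config.json from the HuggingFace Hub without downloading the weights.
# "microsoft/phi-1_5" is a placeholder model id, not one named above.
config = AutoConfig.from_pretrained("microsoft/phi-1_5")
print(config.hidden_size, config.num_hidden_layers, config.vocab_size)

# Or read the same values from an already-loaded model via model.config.
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5")
print(model.config.hidden_size, model.config.num_hidden_layers)
```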
The expanded (intermediate) dimensionality within the MLP block; usually it is 4 × hidden size.
May differ from the number of attention heads when Grouped Query Attention is used
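Both values can be sanity-checked against the config; a minimal sketch, assuming the usual transformers field names (intermediate_size, num_attention_heads, num_key_value_heads) and a placeholder model id:

```python
from transformers import AutoConfig

# Placeholder model id; substitute the model you are estimating for.
config = AutoConfig.from_pretrained("microsoft/phi-1_5")

# MLP expansion: intermediate_size is usually, but not always, 4 x hidden_size.
print(config.intermediate_size, 4 * config.hidden_size)

# With Grouped Query Attention, num_key_value_heads < num_attention_heads.
# Older configs may not define the field, hence the fallback.
kv_heads = getattr(config, "num_key_value_heads", config.num_attention_heads)
print(kv_heads, config.num_attention_heads)
```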
Total VRAM usage is 27836 MiB
When PyTorch uses CUDA for the first time, it allocates between 300 MiB and 2 GiB of VRAM
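A rough way to see this overhead is to compare the VRAM reported by nvidia-smi before and after the first CUDA operation. This is a sketch that assumes an NVIDIA GPU with the driver tools available; the measured difference also includes the small block reserved by PyTorch's caching allocator:

```python
import subprocess

import torch

def used_mib(gpu_index: int = 0) -> int:
    # Ask nvidia-smi how much VRAM is currently used on one GPU, in MiB.
    out = subprocess.check_output([
        "nvidia-smi",
        f"--id={gpu_index}",
        "--query-gpu=memory.used",
        "--format=csv,noheader,nounits",
    ])
    return int(out.decode().strip())

before = used_mib()
torch.zeros(1, device="cuda")  # first CUDA call: context and kernels are loaded
after = used_mib()
print(f"VRAM taken by CUDA initialization: ~{after - before} MiB")
```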
Number of Parameters (1.418 billion) × number of bytes per parameter (6; with mixed precision each parameter is stored in both full precision, 4 bytes, and half precision, 2 bytes)
Sum of the sizes of all intermediate tensors during the forward pass across all 24 layers. Activation size has a quadratic dependence on Sequence Length.
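A large part of that quadratic term comes from the attention score matrices, which have shape (Batch Size, number of heads, Sequence Length, Sequence Length). A rough sketch of this single contribution; the head count (32) and half-precision scores (2 bytes) are assumptions for illustration, not values given above:

```python
# Size of the attention score tensors alone, per layer and across all layers.
batch, seq_len, num_heads, num_layers = 4, 512, 32, 24  # num_heads is assumed
bytes_per_value = 2  # assuming the scores are kept in half precision

per_layer = batch * num_heads * seq_len * seq_len * bytes_per_value
total = per_layer * num_layers
print(f"{per_layer / 2**20:.0f} MiB per layer, {total / 2**20:.0f} MiB in total")

# Doubling seq_len to 1024 quadruples these numbers: the quadratic dependence.
```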
A gradient is stored for each parameter in full precision, so it is Number of Parameters (1.418 billion) × number of bytes per gradient (4)
The optimizer stores a moving average of the gradients (the first moments) for each parameter in full precision, so it is Number of Parameters (1.418 billion) × number of bytes per value (4)
Batch Size (4) × Sequence Length (512) × Vocabulary Size (51200) × number of bytes per value (4) × 2 (the probabilities after softmax are stored in addition to the output logits and have the same size)
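Evaluating the per-component formulas above gives the following sizes; the remaining gap up to the 27836 MiB total is taken by the activations and the CUDA kernels:

```python
MiB = 2**20
n_params = 1.418e9  # Number of Parameters

parameters = n_params * 6          # fp32 + fp16 copy of every parameter
gradients = n_params * 4           # one fp32 gradient per parameter
first_moments = n_params * 4       # one fp32 moving average per parameter
outputs = 4 * 512 * 51200 * 4 * 2  # fp32 logits plus softmax probabilities

for name, size in [
    ("Parameters", parameters),
    ("Gradients", gradients),
    ("First Moments", first_moments),
    ("Outputs", outputs),
]:
    print(f"{name}: {size / MiB:.0f} MiB")
```

This prints roughly 8114, 5409, 5409 and 800 MiB respectively.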