fastvideo.distributed.device_communicators.base_device_communicator#
Module Contents#
Classes#
| `DeviceCommunicatorBase` | Base class for device-specific communicator with autograd support. It can use the `cpu_group` to initialize the communicator. |
| `DistributedAutograd` | Collection of autograd functions for distributed operations. |
API#
- class fastvideo.distributed.device_communicators.base_device_communicator.DeviceCommunicatorBase(cpu_group: torch.distributed.ProcessGroup, device: torch.device | None = None, device_group: torch.distributed.ProcessGroup | None = None, unique_name: str = '')[source]#
Base class for device-specific communicator with autograd support. It can use the `cpu_group` to initialize the communicator. If the device has PyTorch integration (PyTorch can recognize its communication backend), the `device_group` will also be given.
Initialization
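A minimal construction sketch. It assumes the process was launched with `torchrun` (so rank/world-size environment variables are set) and uses a CPU-only `gloo` group for illustration; a CUDA deployment would additionally pass an NCCL `device_group`.

```python
import torch
import torch.distributed as dist

from fastvideo.distributed.device_communicators.base_device_communicator import (
    DeviceCommunicatorBase)

# Assumes launch via torchrun so RANK / WORLD_SIZE / MASTER_ADDR are set.
dist.init_process_group(backend="gloo")

comm = DeviceCommunicatorBase(
    cpu_group=dist.group.WORLD,       # CPU group used to initialize the communicator
    device=torch.device("cpu"),       # or a CUDA device, with an NCCL device_group
    unique_name="demo",
)
```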
- all_gather(input_: torch.Tensor, dim: int = -1) torch.Tensor[source]#
Performs an all_gather operation with gradient support.
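A usage sketch, continuing the `comm` object from the construction example above; shapes are illustrative.

```python
# Each rank contributes a (2, 4) tensor; gathering along dim=0 yields
# (2 * world_size, 4). The result participates in autograd, so gradients
# flow back to each rank's local slice.
x = torch.randn(2, 4, requires_grad=True)
gathered = comm.all_gather(x, dim=0)
gathered.sum().backward()
assert x.grad.shape == x.shape
```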
- all_reduce(input_: torch.Tensor, op: torch.distributed.ReduceOp | None = ReduceOp.SUM) torch.Tensor[source]#
Performs an all_reduce operation with gradient support.
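A sketch of the default sum reduction, again continuing the setup above.

```python
from torch.distributed import ReduceOp

# Element-wise sum across ranks; every rank ends up with the same tensor.
y = torch.ones(3, requires_grad=True)
summed = comm.all_reduce(y, op=ReduceOp.SUM)   # each element equals world_size
```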
- all_to_all_4D(input_: torch.Tensor, scatter_dim: int = 2, gather_dim: int = 1) torch.Tensor[source]#
Performs a 4D all-to-all operation with gradient support.
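A shape-level sketch of the default mode (`scatter_dim=2, gather_dim=1`), continuing the setup above. The (batch, seq, heads, head_dim) layout and the divisibility of the sequence and head counts by the world size are assumptions of this example.

```python
world_size = dist.get_world_size()

# Locally the sequence is sharded; heads are replicated.
hidden = torch.randn(1, 16 // world_size, 8, 64)   # (batch, local_seq, heads, head_dim)

# After the exchange each rank holds the full sequence but only its share of heads.
redistributed = comm.all_to_all_4D(hidden, scatter_dim=2, gather_dim=1)
assert redistributed.shape == (1, 16, 8 // world_size, 64)
```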
- gather(input_: torch.Tensor, dst: int = 0, dim: int = -1) torch.Tensor | None[source]#
NOTE: We assume that the input tensor is on the same device across all ranks. NOTE: `dst` is the local rank of the destination rank.
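A sketch continuing the setup above; only the destination rank receives the concatenated tensor, the other ranks get `None`.

```python
z = torch.arange(4, dtype=torch.float32).reshape(1, 4)
out = comm.gather(z, dst=0, dim=0)       # dst is a local rank within the group
if out is not None:                      # true only on local rank 0
    print(out.shape)                     # torch.Size([world_size, 4])
```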
- recv(size: torch.Size, dtype: torch.dtype, src: int | None = None) torch.Tensor[source]#
Receives a tensor from the source rank.
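A point-to-point sketch continuing the setup above. The sending side is not part of this section; pairing `comm.recv` with a plain `torch.distributed.send` is an assumption made here purely for illustration.

```python
rank = dist.get_rank()

if rank == 0:
    # Assumed peer-side send; the communicator's own send counterpart is not shown in this section.
    dist.send(torch.arange(6, dtype=torch.float32), dst=1)
elif rank == 1:
    # The receiving rank preallocates a buffer from the advertised size and dtype.
    received = comm.recv(torch.Size([6]), torch.float32, src=0)
```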
- class fastvideo.distributed.device_communicators.base_device_communicator.DistributedAutograd[source]#
Collection of autograd functions for distributed operations.
This class provides custom autograd functions for distributed operations like all_reduce, all_gather, and all_to_all. Each operation is implemented as a static inner class with proper forward and backward implementations.
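The inner classes are ordinary `torch.autograd.Function` subclasses, so they are invoked through `.apply()`; the communicator methods above are thin wrappers around them. A sketch, assuming the same initialized process group as in the examples above:

```python
from fastvideo.distributed.device_communicators.base_device_communicator import (
    DistributedAutograd)

group = dist.group.WORLD
world_size = dist.get_world_size()

t = torch.randn(2, 2, requires_grad=True)
reduced = DistributedAutograd.AllReduce.apply(group, t, None)
gathered = DistributedAutograd.AllGather.apply(group, t, world_size, -1)
```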
- class AllGather(*args, **kwargs)[source]#
Bases:
torch.autograd.Function
Differentiable all_gather operation.
The operation gathers tensors from all ranks and concatenates them along a specified dimension. The backward pass uses reduce_scatter to efficiently distribute gradients back to source ranks.
Initialization
- static backward(ctx: Any, grad_output: torch.Tensor) tuple[None, torch.Tensor, None, None][source]#
- static forward(ctx: Any, group: torch.distributed.ProcessGroup, input_: torch.Tensor, world_size: int, dim: int) torch.Tensor[source]#
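A gradient-shape sketch of the gather/reduce_scatter pairing, continuing the setup above: the gradient that reaches each rank is its own shard of `grad_output`, reduced across ranks.

```python
x = torch.randn(2, 3, requires_grad=True)
out = DistributedAutograd.AllGather.apply(group, x, world_size, 0)   # (2 * world_size, 3)
out.pow(2).sum().backward()
assert x.grad.shape == x.shape       # backward returns only the local shard's gradient
```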
- class AllReduce(*args, **kwargs)[source]#
Bases:
torch.autograd.Function
Differentiable all_reduce operation.
The gradient of all_reduce is another all_reduce operation since the operation combines values from all ranks equally.
Initialization
- static backward(ctx: Any, grad_output: torch.Tensor) tuple[None, torch.Tensor, None][source]#
- static forward(ctx: Any, group: torch.distributed.ProcessGroup, input_: torch.Tensor, op: torch.distributed.ReduceOp | None = None) torch.Tensor[source]#
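A sketch of the gradient identity, continuing the setup above: because the forward pass weights every rank's input equally, the per-rank gradient is the all_reduce of the incoming gradients.

```python
w = torch.randn(4, requires_grad=True)
out = DistributedAutograd.AllReduce.apply(group, w, None)
out.sum().backward()
assert w.grad.shape == w.shape       # the gradient is itself an all_reduced tensor
```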
- class AllToAll4D(*args, **kwargs)[source]#
Bases:
torch.autograd.Function
Differentiable all_to_all operation specialized for 4D tensors.
This operation is particularly useful for attention operations where we need to redistribute data across ranks for efficient parallel processing.
The operation supports two modes:
- `scatter_dim=2, gather_dim=1`: used for redistributing attention heads
- `scatter_dim=1, gather_dim=2`: used for redistributing sequence dimensions
Initialization
- static backward(ctx: Any, grad_output: torch.Tensor) tuple[None, torch.Tensor, None, None, None][source]#
- static forward(ctx: Any, group: torch.distributed.ProcessGroup, input_: torch.Tensor, world_size: int, scatter_dim: int, gather_dim: int) torch.Tensor[source]#
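A round-trip sketch of the two modes, continuing the setup above; the (batch, seq, heads, head_dim) layout and divisibility by the world size are assumptions of this example.

```python
x4 = torch.randn(1, 32 // world_size, 8, 64)      # (batch, local_seq, heads, head_dim)

# Scatter heads, gather sequence: full sequence, sharded heads.
heads_sharded = DistributedAutograd.AllToAll4D.apply(group, x4, world_size, 2, 1)
assert heads_sharded.shape == (1, 32, 8 // world_size, 64)

# Scatter sequence, gather heads: restores the original layout.
restored = DistributedAutograd.AllToAll4D.apply(group, heads_sharded, world_size, 1, 2)
assert restored.shape == x4.shape
```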