device_communicators
¶
Modules¶
fastvideo.distributed.device_communicators.base_device_communicator
¶
Classes¶
fastvideo.distributed.device_communicators.base_device_communicator.DeviceCommunicatorBase
¶
DeviceCommunicatorBase(cpu_group: ProcessGroup, device: device | None = None, device_group: ProcessGroup | None = None, unique_name: str = '')
Base class for device-specific communicator with autograd support.
It can use the cpu_group to initialize the communicator.
If the device has PyTorch integration (PyTorch can recognize its
communication backend), the device_group will also be given.
Source code in fastvideo/distributed/device_communicators/base_device_communicator.py
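The following is a minimal sketch of how a communicator built on this base class might be wired up, assuming `torch.distributed` has already been initialized; the group setup, device choice, and `unique_name` value are illustrative rather than prescribed by the API.

```python
# Minimal sketch: construct a communicator from an existing torch.distributed
# setup. The gloo/nccl group choices and the unique_name are assumptions.
import torch
import torch.distributed as dist

from fastvideo.distributed.device_communicators.base_device_communicator import (
    DeviceCommunicatorBase)

# dist.init_process_group(backend="nccl", init_method="env://")  # done elsewhere

ranks = list(range(dist.get_world_size()))
cpu_group = dist.new_group(ranks, backend="gloo")     # CPU group, always available
device_group = dist.new_group(ranks, backend="nccl")  # only if PyTorch knows the device backend

comm = DeviceCommunicatorBase(
    cpu_group=cpu_group,
    device=torch.device(f"cuda:{dist.get_rank()}"),  # single-node assumption
    device_group=device_group,
    unique_name="tp",  # hypothetical label for this communicator
)
```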
Functions¶
fastvideo.distributed.device_communicators.base_device_communicator.DeviceCommunicatorBase.all_gather
¶
all_gather(input_: Tensor, dim: int = -1) -> Tensor
Performs an all_gather operation with gradient support.
Source code in fastvideo/distributed/device_communicators/base_device_communicator.py
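A small usage sketch against the documented signature, reusing the `comm` object from the construction sketch above; the tensor shape is arbitrary.

```python
# Each rank contributes a shard and receives the concatenation along `dim`.
# Because the op is autograd-aware, gradients flow back to the local shard.
local = torch.randn(2, 8, device="cuda", requires_grad=True)
full = comm.all_gather(local, dim=-1)   # shape (2, 8 * world_size)
full.sum().backward()                   # local.grad is populated on every rank
```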
fastvideo.distributed.device_communicators.base_device_communicator.DeviceCommunicatorBase.all_reduce
¶
Performs an all_reduce operation with gradient support.
Source code in fastvideo/distributed/device_communicators/base_device_communicator.py
fastvideo.distributed.device_communicators.base_device_communicator.DeviceCommunicatorBase.all_to_all_4D
¶
Performs a 4D all-to-all operation with gradient support.
Source code in fastvideo/distributed/device_communicators/base_device_communicator.py
fastvideo.distributed.device_communicators.base_device_communicator.DeviceCommunicatorBase.gather
¶
NOTE: We assume that the input tensor is on the same device across all the ranks.
NOTE: dst is the local rank of the destination rank.
Source code in fastvideo/distributed/device_communicators/base_device_communicator.py
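The `dst`/`src` indices used by these collectives are read here as ranks within the communicator's group rather than global ranks (an assumption worth verifying against the source). A short sketch of translating between the two with standard `torch.distributed` helpers, using the `cpu_group` from the construction sketch:

```python
# dst is assumed to be an index *within* the communicator's group, not a global rank.
group_local_dst = 0                                          # first rank of the group
global_dst = dist.get_global_rank(cpu_group, group_local_dst)
back_again = dist.get_group_rank(cpu_group, global_dst)      # == group_local_dst
```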
fastvideo.distributed.device_communicators.base_device_communicator.DeviceCommunicatorBase.recv
¶
recv(size: Size, dtype: dtype, src: int | None = None) -> Tensor
Receives a tensor from the source rank.
Source code in fastvideo/distributed/device_communicators/base_device_communicator.py
fastvideo.distributed.device_communicators.base_device_communicator.DeviceCommunicatorBase.send
¶
send(tensor: Tensor, dst: int | None = None) -> None
Sends a tensor to the destination rank in a non-blocking way.
Source code in fastvideo/distributed/device_communicators/base_device_communicator.py
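A hedged point-to-point sketch using the documented send/recv signatures; the even/odd ordering (so both peers do not block in recv first), the neighbor pairing, and the even world size are illustrative assumptions, and ranks are treated as group-local.

```python
# Exchange tensors between paired ranks with the documented signatures.
# Assumes an even world size so every rank has a peer.
world = dist.get_world_size(cpu_group)
rank = dist.get_group_rank(cpu_group, dist.get_rank())
peer = rank + 1 if rank % 2 == 0 else rank - 1   # pair up (0,1), (2,3), ...

payload = torch.full((4, 4), float(rank), device="cuda")
if rank % 2 == 0:
    comm.send(payload, dst=peer)                                 # non-blocking send
    received = comm.recv(payload.size(), payload.dtype, src=peer)
else:
    received = comm.recv(payload.size(), payload.dtype, src=peer)
    comm.send(payload, dst=peer)
```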
fastvideo.distributed.device_communicators.base_device_communicator.DistributedAutograd
¶
Collection of autograd functions for distributed operations.
This class provides custom autograd functions for distributed operations like all_reduce, all_gather, and all_to_all. Each operation is implemented as a static inner class with proper forward and backward implementations.
Classes¶
fastvideo.distributed.device_communicators.base_device_communicator.DistributedAutograd.AllGather
¶
Bases: Function
Differentiable all_gather operation.
The operation gathers tensors from all ranks and concatenates them along a specified dimension. The backward pass uses reduce_scatter to efficiently distribute gradients back to source ranks.
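A generic sketch of this pattern in isolation (not fastvideo's exact implementation): the forward pass gathers and concatenates the per-rank shards, and the backward pass reduce-scatters the incoming gradient so each rank receives the gradient of its own shard.

```python
# Sketch: forward = all_gather along `dim`, backward = reduce_scatter of the gradient.
import torch
import torch.distributed as dist


class AllGatherWithGrad(torch.autograd.Function):

    @staticmethod
    def forward(ctx, input_: torch.Tensor, dim: int, group) -> torch.Tensor:
        ctx.dim = dim
        ctx.group = group
        world_size = dist.get_world_size(group)
        gathered = [torch.empty_like(input_) for _ in range(world_size)]
        dist.all_gather(gathered, input_, group=group)
        return torch.cat(gathered, dim=dim)

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        world_size = dist.get_world_size(ctx.group)
        # Split the gradient back into per-rank shards and reduce-scatter, so
        # each rank receives the (summed) gradient of its own shard.
        shards = [s.contiguous() for s in grad_output.chunk(world_size, dim=ctx.dim)]
        grad_input = torch.empty_like(shards[0])
        dist.reduce_scatter(grad_input, shards, group=ctx.group)
        return grad_input, None, None
```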
fastvideo.distributed.device_communicators.base_device_communicator.DistributedAutograd.AllReduce
¶
Bases: Function
Differentiable all_reduce operation.
The gradient of all_reduce is another all_reduce operation since the operation combines values from all ranks equally.
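The same property sketched as a standalone autograd function, reusing the imports from the previous sketch; the sum reduction is assumed.

```python
class AllReduceWithGrad(torch.autograd.Function):
    """Sketch: the backward of a sum all_reduce is another sum all_reduce."""

    @staticmethod
    def forward(ctx, input_: torch.Tensor, group) -> torch.Tensor:
        ctx.group = group
        out = input_.clone()
        dist.all_reduce(out, group=group)       # combine values from all ranks
        return out

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        grad = grad_output.clone()
        dist.all_reduce(grad, group=ctx.group)  # gradients are combined the same way
        return grad, None
```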
fastvideo.distributed.device_communicators.base_device_communicator.DistributedAutograd.AllToAll4D
¶
Bases: Function
Differentiable all_to_all operation specialized for 4D tensors.
This operation is particularly useful for attention operations where we need to redistribute data across ranks for efficient parallel processing.
The operation supports two modes:
1. scatter_dim=2, gather_dim=1: used for redistributing attention heads
2. scatter_dim=1, gather_dim=2: used for redistributing sequence dimensions
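A shape-level sketch of the first mode under an assumed (batch, sequence, heads, head_dim) layout: each rank starts with a sequence shard of all heads and ends with the full sequence for a subset of heads (the second mode is the inverse reshuffle). This is a generic Ulysses-style reference, not fastvideo's exact code.

```python
# Sketch of mode 1 (scatter_dim=2, gather_dim=1). Layout (bs, seq, heads, head_dim)
# and evenly divisible heads/sequence are assumptions.
import torch
import torch.distributed as dist


def all_to_all_heads(x: torch.Tensor, group=None) -> torch.Tensor:
    world = dist.get_world_size(group)
    bs, seq_shard, heads, hdim = x.shape
    heads_shard = heads // world
    # (bs, seq/P, heads, d) -> (P, seq/P, bs, heads/P, d): dim 0 holds the chunk
    # destined for each peer.
    x = x.reshape(bs, seq_shard, world, heads_shard, hdim).permute(2, 1, 0, 3, 4).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)
    # (P, seq/P, bs, heads/P, d) -> (bs, seq, heads/P, d)
    return out.reshape(world * seq_shard, bs, heads_shard, hdim).transpose(0, 1).contiguous()
```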
fastvideo.distributed.device_communicators.cpu_communicator
¶
Classes¶
fastvideo.distributed.device_communicators.cpu_communicator.CpuCommunicator
¶
CpuCommunicator(cpu_group: ProcessGroup, device: device | None = None, device_group: ProcessGroup | None = None, unique_name: str = '')
Bases: DeviceCommunicatorBase
Source code in fastvideo/distributed/device_communicators/cpu_communicator.py
Functions¶
fastvideo.distributed.device_communicators.cpu_communicator.CpuCommunicator.gather
¶
NOTE: We assume that the input tensor is on the same device across all the ranks.
NOTE: dst is the local rank of the destination rank.
Source code in fastvideo/distributed/device_communicators/cpu_communicator.py
fastvideo.distributed.device_communicators.cuda_communicator
¶
Classes¶
fastvideo.distributed.device_communicators.cuda_communicator.CudaCommunicator
¶
CudaCommunicator(cpu_group: ProcessGroup, device: device | None = None, device_group: ProcessGroup | None = None, unique_name: str = '')
Bases: DeviceCommunicatorBase
Source code in fastvideo/distributed/device_communicators/cuda_communicator.py
Functions¶
fastvideo.distributed.device_communicators.cuda_communicator.CudaCommunicator.recv
¶
recv(size: Size, dtype: dtype, src: int | None = None) -> Tensor
Receives a tensor from the source rank.
Source code in fastvideo/distributed/device_communicators/cuda_communicator.py
fastvideo.distributed.device_communicators.cuda_communicator.CudaCommunicator.send
¶
send(tensor: Tensor, dst: int | None = None) -> None
Sends a tensor to the destination rank in a non-blocking way.
Source code in fastvideo/distributed/device_communicators/cuda_communicator.py
fastvideo.distributed.device_communicators.npu_communicator
¶
Classes¶
fastvideo.distributed.device_communicators.npu_communicator.NpuCommunicator
¶
NpuCommunicator(cpu_group: ProcessGroup, device: device | None = None, device_group: ProcessGroup | None = None, unique_name: str = '')
Bases: DeviceCommunicatorBase
Source code in fastvideo/distributed/device_communicators/npu_communicator.py
Functions¶
fastvideo.distributed.device_communicators.npu_communicator.NpuCommunicator.recv
¶
recv(size: Size, dtype: dtype, src: int | None = None) -> Tensor
Receives a tensor from the source rank.
Source code in fastvideo/distributed/device_communicators/npu_communicator.py
fastvideo.distributed.device_communicators.npu_communicator.NpuCommunicator.send
¶
send(tensor: Tensor, dst: int | None = None) -> None
Sends a tensor to the destination rank in a non-blocking way.
Source code in fastvideo/distributed/device_communicators/npu_communicator.py
fastvideo.distributed.device_communicators.pyhccl
¶
Classes¶
fastvideo.distributed.device_communicators.pyhccl.PyHcclCommunicator
¶
PyHcclCommunicator(group: ProcessGroup | StatelessProcessGroup, device: int | str | device, library_path: str | None = None)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| group | ProcessGroup \| StatelessProcessGroup | the process group to work on. If None, it will use the default process group. | required |
| device | int \| str \| device | the device to bind the PyHcclCommunicator to. If None, it will be bound to f"npu:{local_rank}". | required |
| library_path | str \| None | the path to the HCCL library. If None, it will use the default library path. | None |
It is the caller's responsibility to make sure each communicator is bound to a unique device.
Source code in fastvideo/distributed/device_communicators/pyhccl.py
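A construction sketch following the parameter table above; it assumes `torch.distributed` is already initialized in an NPU-capable environment, and the commented library path is purely hypothetical.

```python
import os

import torch.distributed as dist

from fastvideo.distributed.device_communicators.pyhccl import PyHcclCommunicator

local_rank = int(os.environ.get("LOCAL_RANK", 0))
pyhccl = PyHcclCommunicator(
    group=dist.group.WORLD,       # or a StatelessProcessGroup
    device=f"npu:{local_rank}",   # int, str, or torch.device are accepted
    # library_path="/path/to/libhccl.so",  # hypothetical; defaults to the system library
)
```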
Functions¶
fastvideo.distributed.device_communicators.pynccl
¶
Classes¶
fastvideo.distributed.device_communicators.pynccl.PyNcclCommunicator
¶
PyNcclCommunicator(group: ProcessGroup | StatelessProcessGroup, device: int | str | device, library_path: str | None = None)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| group | ProcessGroup \| StatelessProcessGroup | the process group to work on. If None, it will use the default process group. | required |
| device | int \| str \| device | the device to bind the PyNcclCommunicator to. If None, it will be bound to f"cuda:{local_rank}". | required |
| library_path | str \| None | the path to the NCCL library. If None, it will use the default library path. | None |
It is the caller's responsibility to make sure each communicator is bound to a unique device.
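A matching construction sketch for the CUDA variant; binding each process to its own device (per the note above) is done here with `torch.cuda.set_device`, which is standard practice rather than something this class requires.

```python
import os

import torch
import torch.distributed as dist

from fastvideo.distributed.device_communicators.pynccl import PyNcclCommunicator

local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)  # one communicator per device, per the note above
pynccl = PyNcclCommunicator(
    group=dist.group.WORLD,                     # or a StatelessProcessGroup
    device=torch.device(f"cuda:{local_rank}"),
)
```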