Hi - I want to train a model on [e.g. 256 GPUs]. I want 4-way data parallelism (DDP) to replicate the full model, and within each replica use FSDP to shard the model across 64 GPUs. Any code example?
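To make the layout concrete, here is a rough sketch of what I mean in native PyTorch (assuming torch >= 2.2, using its HYBRID_SHARD strategy with a 2D device mesh; the model and group sizes are just placeholders):

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for the 256 processes.
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# 2D mesh: 4 replica groups (DDP-style) x 64 shards (FSDP) = 256 GPUs.
mesh = init_device_mesh("cuda", (4, 64), mesh_dim_names=("replicate", "shard"))

# Placeholder model; in practice this is my actual network.
model = torch.nn.Linear(4096, 4096).cuda()

# HYBRID_SHARD: fully shard within each 64-GPU group, replicate across the 4 groups.
model = FSDP(
    model,
    device_mesh=mesh,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
)
```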
I know how to write this in native PyTorch, but how do I do it with the Trainer? Is it supported?
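For reference, something along these lines is what I am hoping exists on the Trainer side; I believe recent transformers versions accept a "hybrid_shard" FSDP option, but I am not sure this is the right way to express the 4 x 64 split (or whether the replica/shard group sizes can be controlled at all), so please treat this as a guess:

```python
from transformers import TrainingArguments

# Guess at a Trainer configuration for hybrid sharding; the fsdp_config
# entries below are placeholders and may not be the right keys/values
# to get 4 replicas x 64 shards.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    fsdp="hybrid_shard auto_wrap",
    fsdp_config={
        # Which transformer block class to wrap; placeholder class name.
        "transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"],
    },
)

# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()  # launched with torchrun across 256 processes
```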