For example, I currently use batch size = 64, max steps = 1000, and logging/evaluation/save steps = 100.
Question 1: I want to confirm that if I switch to gradient accumulation (batch size = 32, gradient accumulation steps = 2), I don’t need to change the step arguments (1000 → 2000, 100 → 200). Does transformers handle the gradient_accumulation_steps * xxx_steps scaling entirely on its own? I found the gradient_accumulation_steps documentation and this discussion, but I couldn’t find a concrete example.
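For concreteness, here is a minimal sketch of the two configurations I have in mind (output_dir and the warmup_steps value are just placeholders, not my exact settings):

```python
from transformers import TrainingArguments

# Current setup: effective batch size 64, no gradient accumulation.
args_plain = TrainingArguments(
    output_dir="out",                  # placeholder
    per_device_train_batch_size=64,
    max_steps=1000,
    logging_steps=100,
    eval_steps=100,
    save_steps=100,
    warmup_steps=100,                  # placeholder value, see Question 2
    evaluation_strategy="steps",
    save_strategy="steps",
)

# Proposed setup: same effective batch size (32 * 2 = 64) via accumulation.
# Question 1 is whether max_steps / logging_steps / eval_steps / save_steps
# can stay as-is, or whether they must be doubled to 2000 / 200.
args_accum = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=2,
    max_steps=1000,      # or 2000?
    logging_steps=100,   # or 200?
    eval_steps=100,
    save_steps=100,
    warmup_steps=100,    # placeholder value, see Question 2
    evaluation_strategy="steps",
    save_strategy="steps",
)
```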
Question 2: What about warm-up? Do I just keep the original warm-up arguments unchanged?
Question 3: Will the data be fed into the model in the same order? (I don’t set the seed with --seed, so the default seed 42 is used.)
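To be clear about what I mean by "not setting the seed", a small sketch of my assumption (set_seed is only shown as the explicit equivalent of the default):

```python
from transformers import TrainingArguments, set_seed

# I do not pass --seed, so TrainingArguments falls back to its default seed.
args = TrainingArguments(output_dir="out")
print(args.seed)  # 42

# Explicitly seeding would be the equivalent of that default behaviour.
set_seed(42)
```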