Combined reward over training epochs
FDFO (Row E) converges faster and reaches a higher combined reward than Flow-GRPO, reflecting its more efficient use of Jacobian-transported gradients.