* first TRL GRPO implementation
* added config YAML file
* added README and dependencies
* updated reward format function