I just wrote my own script with the same Python package for a similar problem.
If you want punctuation, a heuristic based on the time gaps between cues should work well. Sentence and paragraph breaks should be easy to get right. Adding commas may be trickier.
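
A minimal sketch of such a time-gap heuristic, assuming the cues have already been parsed into (start, end, text) tuples with times in seconds. The function name `punctuate` and both thresholds are illustrative guesses, not taken from any package, and would need tuning per speaker. It does not attempt commas.

```python
def punctuate(cues, sentence_gap=0.7, paragraph_gap=2.0):
    """Join (start, end, text) cues into text, breaking on silent gaps.

    A gap of at least `sentence_gap` seconds ends a sentence; a gap of at
    least `paragraph_gap` seconds ends a paragraph. Both values are
    assumptions, not measured defaults.
    """
    paragraphs = []  # finished paragraphs
    sentences = []   # finished sentences of the current paragraph
    words = []       # cue texts of the current sentence

    def close_sentence():
        if words:
            s = " ".join(words)
            sentences.append(s[0].upper() + s[1:] + ".")
            words.clear()

    def close_paragraph():
        close_sentence()
        if sentences:
            paragraphs.append(" ".join(sentences))
            sentences.clear()

    prev_end = None
    for start, end, text in cues:
        text = text.strip()
        if not text:
            continue  # skip empty cues without touching the timing state
        if prev_end is not None:
            gap = start - prev_end
            if gap >= paragraph_gap:
                close_paragraph()
            elif gap >= sentence_gap:
                close_sentence()
        words.append(text)
        prev_end = end

    close_paragraph()
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    demo = [
        (0.0, 1.0, "hello there"),
        (1.1, 2.0, "how are you"),
        (3.0, 4.0, "i am fine"),
        (7.0, 8.0, "new topic"),
    ]
    print(punctuate(demo))
```

Auto-generated captions often overlap or repeat text across cues, so some deduplication would likely be needed before feeding cues into something like this.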
And, yes, why did it have to be so hard? Nothing worked! ffmpeg, five different packaged scripts from GitHub, and several subtitle editors on my Linux Mint box all failed. It’s like VTT is a hokey standard that only YouTube pays attention to.