Convert WebVTT to a Transcript using Python


I want to convert YouTube's auto-generated subtitles into a plain transcript. Why is this so hard?

This blog post gives a more detailed explanation than my answer to this StackOverflow question.

Here's what the subtitles look like when you view a video:

And here's what the file which generates those subtitles looks like:

00:00:00.930 --> 00:00:03.080 align:start position:0%



and<00:00:01.230><c> now</c><00:00:01.439><c> can</c><00:00:01.709><c> we</c><00:00:01.800><c> have</c><c.colorCCCCCC><00:00:01.920><c> a</c></c><c.colorE5E5E5><00:00:01.979><c> round</c><00:00:02.370><c> of</c><00:00:02.460><c> applause</c></c>



00:00:03.080 --> 00:00:03.090 align:start position:0%

and now can we have<c.colorCCCCCC> a</c><c.colorE5E5E5> round of applause

 </c>



00:00:03.090 --> 00:00:04.849 align:start position:0%

and now can we have<c.colorCCCCCC> a</c><c.colorE5E5E5> round of applause

for</c><c.colorCCCCCC><00:00:03.120><c> Terrence</c><00:00:03.629><c> Edwards</c><00:00:03.899><c> and</c><00:00:04.170><c> his</c></c><c.colorE5E5E5><00:00:04.200><c> talk</c><00:00:04.529><c> the</c></c>



00:00:04.849 --> 00:00:04.859 align:start position:0%

for<c.colorCCCCCC> Terrence Edwards and his</c><c.colorE5E5E5> talk the

 </c>

WTF? You're looking at WebVTT - the Web Video Text Tracks format. It allows words to be displayed as they're said. Each sentence and word is given a time-code and a position, and colours can also be used to identify multiple speakers. It's great for subtitles, but it is lousy if all you want to do is read a transcript.
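To see how much of each cue is markup, here is a minimal sketch that strips the inline tags from one cue line with a regular expression. The `strip_tags` helper and the `TAG` pattern are hypothetical names made up for this illustration, and the pattern assumes the text itself never contains a literal `<`.

```python
import re

# Matches inline markup such as <00:00:01.230>, <c>, </c> and <c.colorCCCCCC>.
# Assumption: the spoken text never contains a literal "<".
TAG = re.compile(r"<[^>]+>")


def strip_tags(payload: str) -> str:
    """Return the cue text with every inline tag removed."""
    return TAG.sub("", payload)


cue = "and<00:00:01.230><c> now</c><00:00:01.439><c> can</c><00:00:01.709><c> we</c>"
print(strip_tags(cue))  # and now can we
```

This only cleans a single line. It does nothing about the repeated lines between cues, which is the real problem tackled below.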

So, how do we convert the above to something like:

and now can we have a round of applause for Terrence Edwards and his talk the connected house of horrors

Python - the quick and dirty way

Using the open source webvtt-py Python library, we can directly get the raw text of each subtitle entry:

>>> import webvtt
>>> vtt = webvtt.read('subtitles-en.vtt')
>>> vtt[0].text
' \nand now can we have a round of applause'
>>> vtt[1].text
'and now can we have a round of applause\n '
>>> vtt[2].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
>>> vtt[3].text
'for Terrence Edwards and his talk the\n '
>>> vtt[4].text
'for Terrence Edwards and his talk the\nconnected house of horrors good'
>>> vtt[5].text
'connected house of horrors good\n '
>>> vtt[6].text
'connected house of horrors good\nafternoon'

Manually looking through the text, we can see that vtt[2] holds the first complete pair of lines, and vtt[6] holds the next. Starting at index 2, we can step by 4 and grab elements 2, 6, 10, 14, etc. to build up a transcript. Does that work?

Yes! This is what happens if we slice the array:

>>> sub = vtt[2::4]
>>> sub[0].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
>>> sub[1].text
'connected house of horrors good\nafternoon'
>>> sub[2].text
'AMF thank you so much for for coming\nhere my name is Terrence Eaton I need to'
>>> sub[3].text
'tell you three things about this talk so\nthe first thing is that this does'

But are we sure that will work for all the subtitles? Or even for the entirety of this subtitle file?
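One rough way to answer that is to check whether the sliced entries contain every distinct line in the file. The sketch below uses a plain list of strings as a stand-in for the caption texts (with the real file it would be something like `[c.text for c in vtt]`); `lines_from` and `slice_covers_everything` are hypothetical helper names, not part of webvtt-py.

```python
def lines_from(texts):
    """Flatten caption texts into their individual lines, in order."""
    out = []
    for text in texts:
        out.extend(text.strip().splitlines())
    return out


def slice_covers_everything(texts, start=2, step=4):
    """True if texts[start::step] contains every distinct line in texts."""
    return set(lines_from(texts)) == set(lines_from(texts[start::step]))


# Stand-in data: the first seven caption texts shown earlier.
texts = [
    " \nand now can we have a round of applause",
    "and now can we have a round of applause\n ",
    "and now can we have a round of applause\nfor Terrence Edwards and his talk the",
    "for Terrence Edwards and his talk the\n ",
    "for Terrence Edwards and his talk the\nconnected house of horrors good",
    "connected house of horrors good\n ",
    "connected house of horrors good\nafternoon",
]

print(slice_covers_everything(texts))      # True
print(slice_covers_everything(texts[1:]))  # False
```

The second call shows how fragile the slice is: if the pattern is shifted by a single entry, lines go missing. The check also ignores order and legitimately repeated lines, so passing it is a hint rather than a guarantee.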

Python - the hard way

Let's take another look at the first 5 subtitle entries.

>>> vtt[0].text
' \nand now can we have a round of applause'
>>> vtt[1].text
'and now can we have a round of applause\n '
>>> vtt[2].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
>>> vtt[3].text
'for Terrence Edwards and his talk the\n '
>>> vtt[4].text
'for Terrence Edwards and his talk the\nconnected house of horrors good'

We can split those two-line entries using splitlines():

>>> vtt[2].text.splitlines()
['and now can we have a round of applause', 'for Terrence Edwards and his talk the']

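Let's create a new list and add every line to it, splitting each entry on \n. The strip() call removes the whitespace-only line at the start or end of an entry.

lines = []
for line in vtt:
    # Each entry holds one or two lines of text; add them individually
    lines.extend(line.text.strip().splitlines())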

Which gives us:

>>> lines[0]
'and now can we have a round of applause'
>>> lines[1]
'and now can we have a round of applause'
>>> lines[2]
'and now can we have a round of applause'
>>> lines[3]
'for Terrence Edwards and his talk the'
>>> lines[4]
'for Terrence Edwards and his talk the'
>>> lines[5]
'for Terrence Edwards and his talk the'
>>> lines[6]
'connected house of horrors good'

And now, to de-duplicate them:

transcript = ""
previous = None
for line in lines:
    # Skip a line that repeats the one before it
    if line == previous:
        continue
    transcript += " " + line
    previous = line
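As an alternative to the loop, the same "drop a line if it matches the previous one" idea can be sketched with itertools.groupby from the standard library. This is a swapped-in technique rather than what the rest of this post uses, and the short `lines` list here is stand-in data copied from the output above.

```python
from itertools import groupby

# Stand-in for the `lines` list built earlier
lines = [
    "and now can we have a round of applause",
    "and now can we have a round of applause",
    "and now can we have a round of applause",
    "for Terrence Edwards and his talk the",
    "for Terrence Edwards and his talk the",
    "for Terrence Edwards and his talk the",
    "connected house of horrors good",
]

# groupby() groups consecutive equal items; keeping one key per group
# drops the adjacent repeats while preserving the original order.
transcript = " ".join(key for key, _group in groupby(lines))
print(transcript)
```

Because the pieces are joined rather than concatenated, this version has no leading space. Like the loop, it only removes adjacent repeats, so a phrase that is genuinely said again later in the talk is kept.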

Putting it all together

Ta-da!

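import webvtt

vtt = webvtt.read('subtitles.vtt')

# Split every subtitle entry into its individual lines
lines = []
for line in vtt:
    lines.extend(line.text.strip().splitlines())

# Join the lines, skipping any that repeat the previous one
transcript = ""
previous = None
for line in lines:
    if line == previous:
        continue
    transcript += " " + line
    previous = line

# strip() drops the leading space added before the first line
print(transcript.strip())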

One thing to note is that the auto-generated subtitles contain no punctuation, so the result is not as good as a proper transcription.

3 thoughts on “Convert WebVTT to a Transcript using Python”

2019-04-21 12:19

Jack Parsons says:

I just now wrote my own script with the same python package for a similar problem.

If you want punctuation, a heuristic based on time gaps should work well. Sentence and paragraph breaks should be easy to get right. Adding commas may be more tricky.

And, yes, why did it have to be so hard? Nothing worked! ffmpeg, 5 different packaged scripts from github, and several subtitles editors on my Linux Mint box all failed. It’s like VTT is a hokey standard that only Youtube pays attention to.

Cheers!

2022-03-29 22:23

M says:

Hi! I am super new to python/coding. How would I iterate this over all files in a given dictionary? Any help would be much appreciated! Thanks!

2022-04-02 08:18

@edent says:

Let's assume you have a dictionary called transcripts:

transcripts = { "first" : "~/docs/file1.vtt",
                "second" : "~/docs/file2.vtt",
...
}

To iterate over all of them:

for transcript in transcripts.values():
    yourFunction(transcript)

There are lots of good tutorials around - see https://www.geeksforgeeks.org/iterate-over-a-dictionary-in-python/
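If the files actually live in a directory rather than a dictionary (that is an assumption about the question), a minimal sketch using pathlib from the standard library could look like this. `vtt_files` is a hypothetical helper name, and each returned path would then be passed to your own function.

```python
from pathlib import Path


def vtt_files(directory):
    """Return the .vtt files directly inside `directory`, sorted by name."""
    # expanduser() turns a leading "~" into the home directory
    return sorted(Path(directory).expanduser().glob("*.vtt"))


# List the .vtt files in the current directory
for path in vtt_files("."):
    print(path.name)
```

Note that glob("*.vtt") does not look inside sub-folders.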