Convert WebVTT to a Transcript using Python

@edent — Mon, 10 Sep 2018 11:05:23 +0000

I want to convert YouTube's auto-generated subtitles into a plain transcript. Why is this so hard?

This blog post gives a more detailed explanation than my answer to this StackOverflow question.

Here's what the subtitles look like when you view a video:

And here's what the underlying subtitle file looks like:

00:00:00.930 --> 00:00:03.080 align:start position:0%

and<00:00:01.230> now<00:00:01.439> can<00:00:01.709> we<00:00:01.800> have<00:00:01.920> a<00:00:01.979> round<00:00:02.370> of<00:00:02.460> applause

00:00:03.080 --> 00:00:03.090 align:start position:0%
and now can we have a round of applause
 

00:00:03.090 --> 00:00:04.849 align:start position:0%
and now can we have a round of applause
for<00:00:03.120> Terrence<00:00:03.629> Edwards<00:00:03.899> and<00:00:04.170> his<00:00:04.200> talk<00:00:04.529> the

00:00:04.849 --> 00:00:04.859 align:start position:0%
for Terrence Edwards and his talk the

WTF? You're looking at WebVTT, the Web Video Text Tracks format. It allows words to be displayed as they're said. Each line and each word is given a time-code and a position, and colours can be used to identify multiple speakers. It's great for subtitles, but lousy if all you want to do is read a transcript.

So, how do we convert the above to something like:

and now can we have a round of applause for Terrence Edwards and his talk the connected house of horrors
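As a small first step, here is a minimal sketch of stripping the per-word time-codes from a single raw line with a regular expression. The function name `strip_word_timestamps` is made up for illustration, and the pattern assumes the `<hh:mm:ss.ttt>` form shown in the sample above. It does not handle any other inline tags a file might contain.

```python
import re

# Matches inline word time-codes such as <00:00:01.230>
# (assumes the hh:mm:ss.ttt form seen in the sample above)
TIMESTAMP_TAG = re.compile(r"<\d{2}:\d{2}:\d{2}\.\d{3}>")


def strip_word_timestamps(line):
    """Return the line with every inline time-code removed."""
    return TIMESTAMP_TAG.sub("", line)


raw = (
    "and<00:00:01.230> now<00:00:01.439> can<00:00:01.709> we"
    "<00:00:01.800> have<00:00:01.920> a<00:00:01.979> round"
    "<00:00:02.370> of<00:00:02.460> applause"
)
cleaned = strip_word_timestamps(raw)
print(cleaned)
```

This only cleans up one line. It does nothing about the same line being repeated across several subtitle entries, which is the real problem.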

Python - the quick and dirty way

Using the open source webvtt-py Python library, we can read the text of each subtitle entry directly:

import webvtt
vtt = webvtt.read('subtitles-en.vtt')

vtt[0].text
' \nand now can we have a round of applause'
vtt[1].text
'and now can we have a round of applause\n '
vtt[2].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
vtt[3].text
'for Terrence Edwards and his talk the\n '
vtt[4].text
'for Terrence Edwards and his talk the\nconnected house of horrors good'
vtt[5].text
'connected house of horrors good\n '
vtt[6].text
'connected house of horrors good\nafternoon'

Looking through the text by hand, we can see that vtt[2] holds the first complete pair of lines, and vtt[6] holds the next pair. Starting at index 2, we can step by 4 and grab elements 2, 6, 10, 14, etc. to build up a transcript. Does that work?

Yes! This is what happens if we slice the array:

sub = vtt[2::4]

sub[0].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
sub[1].text
'connected house of horrors good\nafternoon'
sub[2].text
'AMF thank you so much for for coming\nhere my name is Terrence Eaton I need to'
sub[3].text
'tell you three things about this talk so\nthe first thing is that this does'

But are we sure that will work for all the subtitles? Or even for the entirety of this subtitle file?
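One rough way to test that worry is to check whether every entry follows the alternating pattern the slice relies on: two lines of text at even positions (after the first entry), one line everywhere else. This is only a sketch. The helper name `first_pattern_break` is invented here, and it works on a plain list of strings standing in for the `.text` of each entry.

```python
def first_pattern_break(cue_texts):
    """Return the index of the first entry that breaks the pattern, or None."""
    for index, text in enumerate(cue_texts):
        # Ignore lines that contain only whitespace
        non_blank = [ln for ln in text.splitlines() if ln.strip()]
        # Even positions after the first should carry two lines of text
        expected = 2 if index > 0 and index % 2 == 0 else 1
        if len(non_blank) != expected:
            return index
    return None


# Stand-ins for vtt[0].text to vtt[6].text, copied from the output above
sample = [
    " \nand now can we have a round of applause",
    "and now can we have a round of applause\n ",
    "and now can we have a round of applause\nfor Terrence Edwards and his talk the",
    "for Terrence Edwards and his talk the\n ",
    "for Terrence Edwards and his talk the\nconnected house of horrors good",
    "connected house of horrors good\n ",
    "connected house of horrors good\nafternoon",
]

print(first_pattern_break(sample))
```

With a real file, the list could be built with something like `[entry.text for entry in vtt]`. If the function returns an index rather than None, the step-by-4 slice will drift out of sync from that point on.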

Python - the hard way

Let's take a look again at the first 5 subtitle entries.

vtt[0].text
' \nand now can we have a round of applause'
vtt[1].text
'and now can we have a round of applause\n '
vtt[2].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
vtt[3].text
'for Terrence Edwards and his talk the\n '
vtt[4].text
'for Terrence Edwards and his talk the\nconnected house of horrors good'

We can split those two-line entries using splitlines():

vtt[2].text.splitlines()
['and now can we have a round of applause', 'for Terrence Edwards and his talk the']

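Let's create a new list and add every line to it, splitting each entry on \n. Calling strip() first removes the whitespace-only line at the start or end of an entry.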

lines = []
for line in vtt:
    lines.extend(line.text.strip().splitlines())

Which gives us:

>>> lines[0]
'and now can we have a round of applause'
>>> lines[1]
'and now can we have a round of applause'
>>> lines[2]
'and now can we have a round of applause'
>>> lines[3]
'for Terrence Edwards and his talk the'
>>> lines[4]
'for Terrence Edwards and his talk the'
>>> lines[5]
'for Terrence Edwards and his talk the'
>>> lines[6]
'connected house of horrors good'
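The strip() call in that loop matters. A quick sketch of the difference, using the text of the first entry as a stand-in:

```python
# The text of the first entry, as shown earlier
cue_text = " \nand now can we have a round of applause"

# Without strip(), the whitespace-only first line survives the split
without_strip = cue_text.splitlines()

# With strip(), the leading " \n" is removed before splitting
with_strip = cue_text.strip().splitlines()

print(without_strip)
print(with_strip)
```

Without it, the list would be littered with single-space entries sitting between the repeated lines, and comparing each line to the one before it would no longer catch the repeats.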

And now, to de-duplicate them:

transcript = ""
previous = None
for line in lines:
    # Skip any line which is identical to the one before it
    if line == previous:
        continue
    transcript += " " + line
    previous = line
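The same "skip it if it matches the previous line" idea can also be sketched with itertools.groupby from the standard library, which groups runs of consecutive equal items. This is an alternative rather than a correction. The `lines` list here is a stand-in copied from the output above.

```python
from itertools import groupby

# Stand-in for the flattened list of subtitle lines
lines = [
    "and now can we have a round of applause",
    "and now can we have a round of applause",
    "and now can we have a round of applause",
    "for Terrence Edwards and his talk the",
    "for Terrence Edwards and his talk the",
    "for Terrence Edwards and his talk the",
    "connected house of horrors good",
]

# groupby yields one key per run of consecutive identical lines
transcript = " ".join(key for key, _group in groupby(lines))
print(transcript)
```

Joining with " " also avoids the leading space that building the string with += leaves behind. Like the loop, this only removes repeats which are next to each other, so a phrase the speaker genuinely says again later is kept.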

Putting it all together

Ta-da!

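import webvtt

# Read the subtitle file
vtt = webvtt.read('subtitles-en.vtt')

# Collect every line of text from every subtitle entry
lines = []
for line in vtt:
    lines.extend(line.text.strip().splitlines())

# Build the transcript, skipping lines which repeat the previous one
transcript = ""
previous = None
for line in lines:
    if line == previous:
        continue
    transcript += " " + line
    previous = line

print(transcript)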
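If you want to reuse this, one option is to wrap the logic in a function. The sketch below is an assumption about how you might package it, not part of the webvtt-py library. The name `captions_to_transcript` is invented, and it accepts any iterable of objects with a `.text` attribute so it can be tried without a subtitle file.

```python
from types import SimpleNamespace


def captions_to_transcript(captions):
    """Join caption text into one string, skipping consecutive repeats."""
    pieces = []
    previous = None
    for caption in captions:
        for line in caption.text.strip().splitlines():
            if line == previous:
                continue
            pieces.append(line)
            previous = line
    return " ".join(pieces)


# Fake captions standing in for the entries read from a real file
fake_captions = [
    SimpleNamespace(text=text)
    for text in [
        " \nand now can we have a round of applause",
        "and now can we have a round of applause\n ",
        "and now can we have a round of applause\nfor Terrence Edwards and his talk the",
        "for Terrence Edwards and his talk the\n ",
    ]
]

result = captions_to_transcript(fake_captions)
print(result)
```

With a real file, the call would be something like `captions_to_transcript(webvtt.read('subtitles-en.vtt'))`.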

One thing to note is that the result has no punctuation, because the auto-generated subtitles contain none. So it's not as good as a proper transcription.
