Convert WebVTT to a Transcript using Python


I want to convert YouTube's auto-generated subtitles into a plain transcript. Why is this so hard?

This blog post gives a more detailed explanation than my answer to this StackOverflow question.

Here's what the subtitles look like when you view a video:
YouTube showing subtitles.

And here's what the code which generates those subtitles looks like:

00:00:00.930 --> 00:00:03.080 align:start position:0%

and<00:00:01.230><c> now</c><00:00:01.439><c> can</c><00:00:01.709><c> we</c><00:00:01.800><c> have</c><c.colorCCCCCC><00:00:01.920><c> a</c></c><c.colorE5E5E5><00:00:01.979><c> round</c><00:00:02.370><c> of</c><00:00:02.460><c> applause</c></c>

00:00:03.080 --> 00:00:03.090 align:start position:0%
and now can we have<c.colorCCCCCC> a</c><c.colorE5E5E5> round of applause
 </c>

00:00:03.090 --> 00:00:04.849 align:start position:0%
and now can we have<c.colorCCCCCC> a</c><c.colorE5E5E5> round of applause
for</c><c.colorCCCCCC><00:00:03.120><c> Terrence</c><00:00:03.629><c> Edwards</c><00:00:03.899><c> and</c><00:00:04.170><c> his</c></c><c.colorE5E5E5><00:00:04.200><c> talk</c><00:00:04.529><c> the</c></c>

00:00:04.849 --> 00:00:04.859 align:start position:0%
for<c.colorCCCCCC> Terrence Edwards and his</c><c.colorE5E5E5> talk the
 </c>

WTF? You're looking at WebVTT, the Web Video Text Tracks format, which allows words to be displayed as they're said. Each sentence and word is given a time-code and a position, and colours can also be used to identify multiple speakers. It's great for subtitles, but lousy if all you want to do is read a transcript.
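To make the markup less mysterious, here is a minimal sketch that strips the inline tags from one line of the sample above using only the standard library. The names `TAG` and `strip_cue_tags` are my own for illustration, and the regular expression is a blunt assumption: it removes anything between angle brackets, which is enough for YouTube's output but is not a full WebVTT parser.

```python
import re

# Matches inline cue tags such as <c>, </c>, <c.colorCCCCCC>
# and word-level timestamps such as <00:00:01.230>.
TAG = re.compile(r"<[^>]+>")


def strip_cue_tags(payload):
    """Return one line of cue text with every <...> tag removed."""
    return TAG.sub("", payload)


raw = "and now can we have<c.colorCCCCCC> a</c><c.colorE5E5E5> round of applause"
print(strip_cue_tags(raw))
```

The spaces between words live inside the tags' contents, so removing the tags leaves the sentence readable.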

So, how do we convert the above to something like:

and now can we have a round of applause for Terrence Edwards and his talk the connected house of horrors

Python - the quick and dirty way

Using the open source webvtt-py Python library, we can directly get the raw text of each caption in the subtitles:

import webvtt
vtt = webvtt.read('subtitles-en.vtt')

vtt[0].text
' \nand now can we have a round of applause'
vtt[1].text
'and now can we have a round of applause\n '
vtt[2].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
vtt[3].text
'for Terrence Edwards and his talk the\n '
vtt[4].text
'for Terrence Edwards and his talk the\nconnected house of horrors good'
vtt[5].text
'connected house of horrors good\n '
vtt[6].text
'connected house of horrors good\nafternoon'

Manually looking through the text, we can see that the element at index 2 is the first to contain two complete lines, and the element at index 6 contains the next two. Starting at index 2, we can step by 4 and grab elements 2, 6, 10, 14, etc. to build up a transcript. Does that work?

Yes! This is what happens if we slice the array:

sub = vtt[2::4]

sub[0].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
sub[1].text
'connected house of horrors good\nafternoon'
sub[2].text
'AMF thank you so much for for coming\nhere my name is Terrence Eaton I need to'
sub[3].text
'tell you three things about this talk so\nthe first thing is that this does'

But are we sure that will work for all the subtitles? Or even for the entirety of this subtitle file?
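One way to check is to test the assumption the slice depends on: every even-numbered caption repeats the last line of the previous even-numbered caption and then adds exactly one new line. The sketch below works on a plain list of caption texts so it can run without the library. The function name `follows_rolling_pattern` is hypothetical, and the check is only as good as that assumption.

```python
def follows_rolling_pattern(texts):
    """Return True if each even-indexed caption starts with the last line
    of the previous even-indexed caption and has exactly two lines."""
    even = [text.splitlines() for text in texts[0::2]]
    for prev, cur in zip(even, even[1:]):
        if not prev or len(cur) != 2 or cur[0] != prev[-1]:
            return False
    return True


# The first five caption texts shown earlier in this post.
sample = [
    ' \nand now can we have a round of applause',
    'and now can we have a round of applause\n ',
    'and now can we have a round of applause\nfor Terrence Edwards and his talk the',
    'for Terrence Edwards and his talk the\n ',
    'for Terrence Edwards and his talk the\nconnected house of horrors good',
]
print(follows_rolling_pattern(sample))
```

With the library loaded, the same check could be run as `follows_rolling_pattern([caption.text for caption in vtt])`. If it returns False anywhere in the file, a fixed step of 4 will drop or repeat lines.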

Python - the hard way

Let's take a look again at the first 5 subtitle entries.

vtt[0].text
' \nand now can we have a round of applause'
vtt[1].text
'and now can we have a round of applause\n '
vtt[2].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
vtt[3].text
'for Terrence Edwards and his talk the\n '
vtt[4].text
'for Terrence Edwards and his talk the\nconnected house of horrors good'

We can split those double lines using splitlines():

vtt[2].text.splitlines()
['and now can we have a round of applause', 'for Terrence Edwards and his talk the']

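Let's create a new list and add every line to it, splitting each caption's text on \n. Calling strip() first removes the blank padding line at the start or end of a caption.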

lines = []
for caption in vtt:
    # Each caption holds one or two lines of text
    lines.extend(caption.text.strip().splitlines())

Which gives us:

>>> lines[0]
'and now can we have a round of applause'
>>> lines[1]
'and now can we have a round of applause'
>>> lines[2]
'and now can we have a round of applause'
>>> lines[3]
'for Terrence Edwards and his talk the'
>>> lines[4]
'for Terrence Edwards and his talk the'
>>> lines[5]
'for Terrence Edwards and his talk the'
>>> lines[6]
'connected house of horrors good'

And now, to de-duplicate them:

transcript = ""
previous = None
for line in lines:
    # Skip a line that repeats the one before it
    if line == previous:
        continue
    transcript += " " + line
    previous = line
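The loop above only drops a line when it matches the line immediately before it. As an alternative sketch, the standard library's `itertools.groupby` collapses runs of consecutive equal items in the same way, and `" ".join` avoids the leading space. The `lines` list here is hard-coded from the output shown earlier so the example runs on its own.

```python
from itertools import groupby

lines = [
    'and now can we have a round of applause',
    'and now can we have a round of applause',
    'and now can we have a round of applause',
    'for Terrence Edwards and his talk the',
    'for Terrence Edwards and his talk the',
    'for Terrence Edwards and his talk the',
    'connected house of horrors good',
]

# groupby yields one (key, group) pair per run of consecutive equal lines,
# so keeping only the keys removes the back-to-back repeats.
deduped = [key for key, _group in groupby(lines)]
transcript = " ".join(deduped)
print(transcript)
```

Like the loop, this keeps a line that legitimately appears again later in the video, because only adjacent repeats are merged.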

Putting it all together

Ta-da!

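import webvtt

vtt = webvtt.read('subtitles-en.vtt')

# Flatten every caption into its individual lines of text
lines = []
for caption in vtt:
    lines.extend(caption.text.strip().splitlines())

# Keep a line only if it differs from the one before it
transcript = ""
previous = None
for line in lines:
    if line == previous:
        continue
    transcript += " " + line
    previous = line

# strip() drops the leading space added before the first line
print(transcript.strip())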
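If installing a library isn't an option, the same idea can be sketched with only the standard library by working on the raw file contents. This is a rough approximation rather than a real WebVTT parser: the function name `vtt_text_to_transcript` is my own, it assumes every timing line contains `-->`, it ignores everything before the first timing line, and it does not handle cue identifiers or NOTE blocks.

```python
import re

TAG = re.compile(r"<[^>]+>")  # inline tags and word-level timestamps


def vtt_text_to_transcript(vtt_source):
    """Build a transcript from raw WebVTT text (simplified sketch)."""
    pieces = []
    previous = None
    seen_cue = False
    for raw_line in vtt_source.splitlines():
        if "-->" in raw_line:
            # Cue timing line: marks the start of a caption
            seen_cue = True
            continue
        if not seen_cue:
            # Header lines before the first caption
            continue
        line = TAG.sub("", raw_line).strip()
        # Drop blank padding lines and back-to-back repeats
        if not line or line == previous:
            continue
        pieces.append(line)
        previous = line
    return " ".join(pieces)


sample = """WEBVTT

00:00:03.080 --> 00:00:03.090 align:start position:0%
and now can we have<c.colorCCCCCC> a</c><c.colorE5E5E5> round of applause
 </c>

00:00:03.090 --> 00:00:04.849 align:start position:0%
and now can we have<c.colorCCCCCC> a</c><c.colorE5E5E5> round of applause
for</c><c.colorCCCCCC><00:00:03.120><c> Terrence</c><00:00:03.629><c> Edwards</c><00:00:03.899><c> and</c><00:00:04.170><c> his</c></c><c.colorE5E5E5><00:00:04.200><c> talk</c><00:00:04.529><c> the</c></c>
"""
print(vtt_text_to_transcript(sample))
```

For a real file, pass in the contents read with `open(path, encoding="utf-8").read()`.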

One thing to note is that the auto-generated subtitles contain no punctuation, so the result is not as good as a proper transcription.
