Convert WebVTT to a Transcript using Python


I want to convert YouTube's auto-generated subtitles into a plain transcript. Why is this so hard?

This blog post gives a more detailed explanation than my answer to this StackOverflow question.

Here's what the subtitles look like when you view a video:

[Image: YouTube showing subtitles]

And here's what the code which generates those subtitles looks like:

00:00:00.930 --> 00:00:03.080 align:start position:0%

and<00:00:01.230><c> now</c><00:00:01.439><c> can</c><00:00:01.709><c> we</c><00:00:01.800><c> have</c><c.colorCCCCCC><00:00:01.920><c> a</c></c><c.colorE5E5E5><00:00:01.979><c> round</c><00:00:02.370><c> of</c><00:00:02.460><c> applause</c></c>

00:00:03.080 --> 00:00:03.090 align:start position:0%
and now can we have<c.colorCCCCCC> a</c><c.colorE5E5E5> round of applause
 </c>

00:00:03.090 --> 00:00:04.849 align:start position:0%
and now can we have<c.colorCCCCCC> a</c><c.colorE5E5E5> round of applause
for</c><c.colorCCCCCC><00:00:03.120><c> Terrence</c><00:00:03.629><c> Edwards</c><00:00:03.899><c> and</c><00:00:04.170><c> his</c></c><c.colorE5E5E5><00:00:04.200><c> talk</c><00:00:04.529><c> the</c></c>

00:00:04.849 --> 00:00:04.859 align:start position:0%
for<c.colorCCCCCC> Terrence Edwards and his</c><c.colorE5E5E5> talk the
 </c>

WTF? You're looking at WebVTT, the Web Video Text Tracks format. It allows words to be displayed as they're said. Each subtitle entry and each word is given a time-code, entries can carry a position, and colours are also possible, for example to identify multiple speakers. It's great for subtitles, but it is lousy if all you want to do is read a transcript.
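
To make that markup a little less mysterious, here is a minimal sketch using only Python's standard re module. It assumes the inline tags always look like the sample above: a <hh:mm:ss.mmm> timestamp followed by a <c> span holding one word. The names strip_cue_tags and word_timings are made up for this illustration and are not part of any library.

```python
import re

# Any inline tag: <c>, </c>, <c.colorCCCCCC>, <00:00:01.230>, ...
TAG = re.compile(r"<[^>]+>")
# A word timestamp followed by the <c> span holding that word.
TIMED_WORD = re.compile(r"<(\d{2}:\d{2}:\d{2}\.\d{3})><c>([^<]*)</c>")


def strip_cue_tags(raw):
    """Remove every inline tag, leaving only the spoken words."""
    return TAG.sub("", raw)


def word_timings(raw):
    """Return (timestamp, word) pairs for each individually timed word."""
    return [(stamp, word.strip()) for stamp, word in TIMED_WORD.findall(raw)]


# The first line of subtitle text from the sample above.
raw = (
    "and<00:00:01.230><c> now</c><00:00:01.439><c> can</c>"
    "<00:00:01.709><c> we</c><00:00:01.800><c> have</c>"
    "<c.colorCCCCCC><00:00:01.920><c> a</c></c>"
    "<c.colorE5E5E5><00:00:01.979><c> round</c><00:00:02.370><c> of</c>"
    "<00:00:02.460><c> applause</c></c>"
)

print(strip_cue_tags(raw))
print(word_timings(raw)[:2])
```

Note that the first word, "and", has no timestamp of its own, so word_timings() does not return it.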

So, how do we convert the above to something like:

and now can we have a round of applause for Terrence Edwards and his talk the connected house of horrors

Python - the quick and dirty way

Using the open source webvtt-py Python library, we can directly get the plain text of each subtitle entry:

import webvtt
vtt = webvtt.read('subtitles-en.vtt')

vtt[0].text
' \nand now can we have a round of applause'
vtt[1].text
'and now can we have a round of applause\n '
vtt[2].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
vtt[3].text
'for Terrence Edwards and his talk the\n '
vtt[4].text
'for Terrence Edwards and his talk the\nconnected house of horrors good'
vtt[5].text
'connected house of horrors good\n '
vtt[6].text
'connected house of horrors good\nafternoon'

Manually looking through the text, we can see that the element at index 2 holds the first complete pair of lines, and the element at index 6 holds the next pair. Starting at 2, we can increment by 4 and grab elements 6, 10, 14, etc. to build up a transcript. Does that work?

Yes! This is what happens if we slice the array:

sub = vtt[2::4]

sub[0].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
sub[1].text
'connected house of horrors good\nafternoon'
sub[2].text
'AMF thank you so much for for coming\nhere my name is Terrence Eaton I need to'
sub[3].text
'tell you three things about this talk so\nthe first thing is that this does'

But are we sure that will work for all the subtitles? Or even for the entirety of this subtitle file?
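
One way to find out is to test the pattern rather than trust it. The sketch below assumes the pattern seen in the sample output: every entry has two lines, the first line repeats the line carried over from the entry before, even-numbered entries add a new second line, and odd-numbered entries end in a blank line. The function name first_pattern_break is hypothetical, and it works on a plain list of strings so it can be tried without a subtitle file.

```python
def first_pattern_break(cue_texts):
    """Return the index of the first entry that breaks the pattern, or None."""
    carried = ""  # the line we expect to see repeated at the top of the next entry
    for index, text in enumerate(cue_texts):
        lines = [part.strip() for part in text.split("\n")]
        if len(lines) != 2 or lines[0] != carried:
            return index
        if index % 2 == 0:
            # Even entries add one new line underneath the carried-over line.
            if not lines[1]:
                return index
            carried = lines[1]
        elif lines[1]:
            # Odd entries should only repeat the line, followed by a blank.
            return index
    return None


# The seven entries shown earlier in this post.
sample = [
    " \nand now can we have a round of applause",
    "and now can we have a round of applause\n ",
    "and now can we have a round of applause\nfor Terrence Edwards and his talk the",
    "for Terrence Edwards and his talk the\n ",
    "for Terrence Edwards and his talk the\nconnected house of horrors good",
    "connected house of horrors good\n ",
    "connected house of horrors good\nafternoon",
]

print(first_pattern_break(sample))
```

With a real file, the list could be built with [caption.text for caption in vtt]. Even when no break is found, vtt[2::4] can still drop the final line of the transcript if the file ends on an entry that the slice skips.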

Python the hard way

Let's take a look again at the first 5 subtitle entries.

vtt[0].text
' \nand now can we have a round of applause'
vtt[1].text
'and now can we have a round of applause\n '
vtt[2].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
vtt[3].text
'for Terrence Edwards and his talk the\n '
vtt[4].text
'for Terrence Edwards and his talk the\nconnected house of horrors good'

We can split those two-line entries using splitlines():

vtt[2].text.splitlines()
['and now can we have a round of applause', 'for Terrence Edwards and his talk the']

Let's create a new list and add every line to it, splitting each entry on \n. Calling strip() first removes the blank line at the start or end of an entry.

lines = []
for line in vtt:
    lines.extend(line.text.strip().splitlines())

Which gives us:

>>> lines[0]
'and now can we have a round of applause'
>>> lines[1]
'and now can we have a round of applause'
>>> lines[2]
'and now can we have a round of applause'
>>> lines[3]
'for Terrence Edwards and his talk the'
>>> lines[4]
'for Terrence Edwards and his talk the'
>>> lines[5]
'for Terrence Edwards and his talk the'
>>> lines[6]
'connected house of horrors good'

And now, to de-duplicate them:

transcript = ""
previous = None
for line in lines:
    # Skip a line that just repeats the one before it.
    if line == previous:
        continue
    transcript += " " + line
    previous = line
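
As an alternative sketch, the same de-duplication can be written with itertools.groupby from the standard library, which groups consecutive equal items. Using " ".join() also avoids the leading space that the += loop leaves at the start of the transcript. The sample lines are copied from the output above, and dedupe_consecutive is just a name chosen for this illustration.

```python
from itertools import groupby


def dedupe_consecutive(lines):
    """Collapse each run of identical consecutive lines into one line."""
    return [line for line, _run in groupby(lines)]


lines = [
    "and now can we have a round of applause",
    "and now can we have a round of applause",
    "and now can we have a round of applause",
    "for Terrence Edwards and his talk the",
    "for Terrence Edwards and his talk the",
    "for Terrence Edwards and his talk the",
    "connected house of horrors good",
]

transcript = " ".join(dedupe_consecutive(lines))
print(transcript)
```

Like the loop, this only removes repeats that sit next to each other, so a phrase the speaker genuinely says again later in the video is kept.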

Putting it all together

Ta-da!

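import webvtt

vtt = webvtt.read('subtitles.vtt')

# Split every subtitle entry into its individual lines.
lines = []
for line in vtt:
    lines.extend(line.text.strip().splitlines())

# Build the transcript, skipping consecutive duplicate lines.
transcript = ""
previous = None
for line in lines:
    if line == previous:
        continue
    transcript += " " + line
    previous = line

# strip() removes the leading space added before the first line.
print(transcript.strip())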

One thing to note is that the auto-generated subtitles contain no punctuation, so the result is not as good as a proper transcription.



3 thoughts on “Convert WebVTT to a Transcript using Python”

  1. Jack Parsons says:

    I just now wrote my own script with the same python package for a similar problem.

    If you want punctuation, a heuristic based on time gaps should work well. Sentence and paragraph breaks should be easy to get right. Adding commas may be more tricky.

    And, yes, why did it have to be so hard? Nothing worked! ffmpeg, 5 different packaged scripts from github, and several subtitles editors on my Linux Mint box all failed. It’s like VTT is a hokey standard that only Youtube pays attention to.

    Cheers!

  2. M says:

    Hi! I am super new to python/coding. How would I iterate this over all files in a given dictionary? Any help would be much appreciated! Thanks!

