<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="https://shkspr.mobi/blog/wp-content/themes/edent-wordpress-theme/rss-style.xsl" type="text/xsl"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	     xmlns:dc="http://purl.org/dc/elements/1.1/"
	   xmlns:atom="http://www.w3.org/2005/Atom"
	     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	  xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>
<channel>
	<title>subtitles &#8211; Terence Eden’s Blog</title>
	<atom:link href="https://shkspr.mobi/blog/tag/subtitles/feed/" rel="self" type="application/rss+xml" />
	<link>https://shkspr.mobi/blog</link>
	<description>Regular nonsense about tech and its effects 🙃</description>
	<lastBuildDate>Tue, 10 Sep 2024 07:07:19 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://shkspr.mobi/blog/wp-content/uploads/2023/07/cropped-avatar-32x32.jpeg</url>
	<title>subtitles &#8211; Terence Eden’s Blog</title>
	<link>https://shkspr.mobi/blog</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title><![CDATA[Convert WebVTT to a Transcript using Python]]></title>
		<link>https://shkspr.mobi/blog/2018/09/convert-webvtt-to-a-transcript-using-python/</link>
					<comments>https://shkspr.mobi/blog/2018/09/convert-webvtt-to-a-transcript-using-python/#comments</comments>
				<dc:creator><![CDATA[@edent]]></dc:creator>
		<pubDate>Mon, 10 Sep 2018 11:05:23 +0000</pubDate>
				<category><![CDATA[/etc/]]></category>
		<category><![CDATA[emfcamp]]></category>
		<category><![CDATA[HowTo]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[subtitles]]></category>
		<category><![CDATA[YouTube]]></category>
		<guid isPermaLink="false">https://shkspr.mobi/blog/?p=30341</guid>

					<description><![CDATA[I want to convert YouTube&#039;s auto-generated subtitles into a plain transcript. Why is this so hard?  This blog post gives a more detailed explanation than my answer to this StackOverflow question.  Here&#039;s what the subtitles look like when you view a video:   And here&#039;s what the code which generates those subtitles looks like:  00:00:00.930 --&#62; 00:00:03.080 align:start position:0% …]]></description>
										<content:encoded><![CDATA[<p>I want to convert YouTube's auto-generated subtitles into a plain transcript. Why is this so hard?</p>

<p>This blog post gives a more detailed explanation than my answer to <a href="https://stackoverflow.com/questions/51784232/how-do-i-convert-the-webvtt-format-to-plain-text">this StackOverflow question</a>.</p>

<p>Here's what the subtitles look like when you view a video:
<img src="https://shkspr.mobi/blog/wp-content/uploads/2018/09/YouTube-showing-subtitles.jpg" alt="YouTube showing subtitles." width="600" height="338" class="aligncenter size-full wp-image-30343"></p>

<p>And here's what the code which generates those subtitles looks like:</p>

<pre><code class="language-_">00:00:00.930 --&gt; 00:00:03.080 align:start position:0%

and&lt;00:00:01.230&gt;&lt;c&gt; now&lt;/c&gt;&lt;00:00:01.439&gt;&lt;c&gt; can&lt;/c&gt;&lt;00:00:01.709&gt;&lt;c&gt; we&lt;/c&gt;&lt;00:00:01.800&gt;&lt;c&gt; have&lt;/c&gt;&lt;c.colorCCCCCC&gt;&lt;00:00:01.920&gt;&lt;c&gt; a&lt;/c&gt;&lt;/c&gt;&lt;c.colorE5E5E5&gt;&lt;00:00:01.979&gt;&lt;c&gt; round&lt;/c&gt;&lt;00:00:02.370&gt;&lt;c&gt; of&lt;/c&gt;&lt;00:00:02.460&gt;&lt;c&gt; applause&lt;/c&gt;&lt;/c&gt;

00:00:03.080 --&gt; 00:00:03.090 align:start position:0%
and now can we have&lt;c.colorCCCCCC&gt; a&lt;/c&gt;&lt;c.colorE5E5E5&gt; round of applause
 &lt;/c&gt;

00:00:03.090 --&gt; 00:00:04.849 align:start position:0%
and now can we have&lt;c.colorCCCCCC&gt; a&lt;/c&gt;&lt;c.colorE5E5E5&gt; round of applause
for&lt;/c&gt;&lt;c.colorCCCCCC&gt;&lt;00:00:03.120&gt;&lt;c&gt; Terrence&lt;/c&gt;&lt;00:00:03.629&gt;&lt;c&gt; Edwards&lt;/c&gt;&lt;00:00:03.899&gt;&lt;c&gt; and&lt;/c&gt;&lt;00:00:04.170&gt;&lt;c&gt; his&lt;/c&gt;&lt;/c&gt;&lt;c.colorE5E5E5&gt;&lt;00:00:04.200&gt;&lt;c&gt; talk&lt;/c&gt;&lt;00:00:04.529&gt;&lt;c&gt; the&lt;/c&gt;&lt;/c&gt;

00:00:04.849 --&gt; 00:00:04.859 align:start position:0%
for&lt;c.colorCCCCCC&gt; Terrence Edwards and his&lt;/c&gt;&lt;c.colorE5E5E5&gt; talk the
 &lt;/c&gt;
</code></pre>

<p>WTF? You're looking at <a href="https://www.w3.org/TR/webvtt1/">WebVTT</a>, the Web Video Text Tracks Format, which allows words to be displayed as they're said. Each sentence and word is given a time-code and a position; colours can also be used to identify multiple speakers. It's great for subtitles, but it is lousy if all you want to do is read a transcript.</p>

<p>So, how do we convert the above to something like:</p>

<blockquote><p>and now can we have a round of applause for Terrence Edwards and his talk the connected house of horrors</p></blockquote>

<h2 id="python-the-quick-and-dirty-way"><a href="https://shkspr.mobi/blog/2018/09/convert-webvtt-to-a-transcript-using-python/#python-the-quick-and-dirty-way">Python - the quick and dirty way</a></h2>

<p>Using the <a href="https://webvtt-py.readthedocs.io">open source WebVTT-PY</a> Python library, we can directly get the raw text of each caption in the subtitle file:</p>

<pre><code class="language-python">&gt;&gt;&gt; import webvtt
&gt;&gt;&gt; vtt = webvtt.read('subtitles-en.vtt')
&gt;&gt;&gt; vtt[0].text
' \nand now can we have a round of applause'
&gt;&gt;&gt; vtt[1].text
'and now can we have a round of applause\n '
&gt;&gt;&gt; vtt[2].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
&gt;&gt;&gt; vtt[3].text
'for Terrence Edwards and his talk the\n '
&gt;&gt;&gt; vtt[4].text
'for Terrence Edwards and his talk the\nconnected house of horrors good'
&gt;&gt;&gt; vtt[5].text
'connected house of horrors good\n '
&gt;&gt;&gt; vtt[6].text
'connected house of horrors good\nafternoon'
</code></pre>

<p>Looking through the text manually, we can see that the element at index 2 holds the first complete pair of lines, and the element at index 6 holds the next. Starting at index 2, we can step by 4 and grab elements 6, 10, 14, etc. to build up a transcript. Does that work?</p>

<p>Yes! This is what happens if we <a href="https://docs.python.org/3/library/functions.html?highlight=slice#slice">slice the array</a>:</p>

<pre><code class="language-python">&gt;&gt;&gt; sub = vtt[2::4]
&gt;&gt;&gt; sub[0].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
&gt;&gt;&gt; sub[1].text
'connected house of horrors good\nafternoon'
&gt;&gt;&gt; sub[2].text
'AMF thank you so much for for coming\nhere my name is Terrence Eaton I need to'
&gt;&gt;&gt; sub[3].text
'tell you three things about this talk so\nthe first thing is that this does'
</code></pre>

<p>But are we sure that will work for all the subtitles? Or even for the entirety of this subtitle file?</p>
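<p>One way to see the risk is a small sketch. It uses plain strings copied from the output above in place of the caption objects, so it runs without a subtitle file. The <code>stride_transcript</code> helper and the "irregular" list (the same captions with one padding caption removed) are hypothetical, made up purely for illustration.</p>

<pre><code class="language-python"># Caption texts copied from the example above; plain strings stand in for
# the caption objects so no subtitle file or library is needed.
texts = [
    " \nand now can we have a round of applause",
    "and now can we have a round of applause\n ",
    "and now can we have a round of applause\nfor Terrence Edwards and his talk the",
    "for Terrence Edwards and his talk the\n ",
    "for Terrence Edwards and his talk the\nconnected house of horrors good",
    "connected house of horrors good\n ",
    "connected house of horrors good\nafternoon",
]


def stride_transcript(caption_texts, start=2, step=4):
    """Join every `step`-th caption text, beginning at index `start`."""
    picked = caption_texts[start::step]
    return " ".join(text.replace("\n", " ") for text in picked)


# With the regular pattern, the fixed stride picks the right captions.
regular = stride_transcript(texts)

# Hypothetical irregular file: the same captions with one padding caption missing.
irregular = texts[:1] + texts[2:]
shifted = stride_transcript(irregular)

print(regular)
print(shifted)
</code></pre>

<p>In this made-up case a single missing caption shifts every later index, so the fixed stride lands on the wrong entries and the opening line disappears from the result. That is the reason to look for an approach that doesn't depend on position.</p>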

<h2 id="python-the-hard-way"><a href="https://shkspr.mobi/blog/2018/09/convert-webvtt-to-a-transcript-using-python/#python-the-hard-way">Python the hard way</a></h2>

<p>Let's take a look again at the first 5 subtitle entries.</p>

<pre><code class="language-_">&gt;&gt;&gt; vtt[0].text
' \nand now can we have a round of applause'
&gt;&gt;&gt; vtt[1].text
'and now can we have a round of applause\n '
&gt;&gt;&gt; vtt[2].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
&gt;&gt;&gt; vtt[3].text
'for Terrence Edwards and his talk the\n '
&gt;&gt;&gt; vtt[4].text
'for Terrence Edwards and his talk the\nconnected house of horrors good'
</code></pre>

<p>We can split those two-line entries using <code>splitlines()</code>:</p>

<pre><code class="language-_">&gt;&gt;&gt; vtt[2].text.splitlines()
['and now can we have a round of applause', 'for Terrence Edwards and his talk the']
</code></pre>

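<p>Let's create a new list and add every line to it, splitting each entry's text on <code>\n</code>. Calling <code>strip()</code> first removes the blank padding line at the start or end of an entry.</p>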

<pre><code class="language-python">lines = []
for line in vtt:
    lines.extend(line.text.strip().splitlines())
</code></pre>

<p>Which gives us:</p>

<pre><code class="language-_">&gt;&gt;&gt; lines[0]
'and now can we have a round of applause'
&gt;&gt;&gt; lines[1]
'and now can we have a round of applause'
&gt;&gt;&gt; lines[2]
'and now can we have a round of applause'
&gt;&gt;&gt; lines[3]
'for Terrence Edwards and his talk the'
&gt;&gt;&gt; lines[4]
'for Terrence Edwards and his talk the'
&gt;&gt;&gt; lines[5]
'for Terrence Edwards and his talk the'
&gt;&gt;&gt; lines[6]
'connected house of horrors good'
</code></pre>
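<p>To see why <code>strip()</code> comes before <code>splitlines()</code>, here is a minimal sketch that runs both variants on the first three caption texts shown earlier. The strings are copied from the output above and stand in for the caption objects; the names <code>with_strip</code> and <code>without_strip</code> are just for illustration.</p>

<pre><code class="language-python"># Caption texts copied from the example above; plain strings stand in for
# caption objects so the snippet runs without a subtitle file.
texts = [
    " \nand now can we have a round of applause",
    "and now can we have a round of applause\n ",
    "and now can we have a round of applause\nfor Terrence Edwards and his talk the",
]

with_strip = []
without_strip = []
for text in texts:
    # strip() removes the lone-space padding line before the text is split
    with_strip.extend(text.strip().splitlines())
    # without strip(), each padding line survives as a " " entry
    without_strip.extend(text.splitlines())

print(with_strip)
print(without_strip)
</code></pre>

<p>Without <code>strip()</code>, the single-space padding lines end up in the list and sit between otherwise identical neighbours, which would get in the way of comparing each line with the one before it.</p>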

<p>And now, to de-duplicate them by skipping any line that is identical to the one before it:</p>

<pre><code class="language-python">transcript = ""
previous = None
for line in lines:
    # Skip a line that repeats the one immediately before it
    if line == previous:
        continue
    transcript += " " + line
    previous = line
</code></pre>
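<p>As an aside, the same "drop consecutive repeats" idea can be sketched with <code>itertools.groupby</code> from the standard library. This is an alternative, not the method used in this post, and the <code>lines</code> list below is simply copied from the output shown earlier so that the snippet runs on its own.</p>

<pre><code class="language-python">from itertools import groupby

# Lines copied from the output shown earlier
lines = [
    "and now can we have a round of applause",
    "and now can we have a round of applause",
    "and now can we have a round of applause",
    "for Terrence Edwards and his talk the",
    "for Terrence Edwards and his talk the",
    "for Terrence Edwards and his talk the",
    "connected house of horrors good",
]

# groupby() yields one group per run of equal neighbours; keep each run's key
unique = [key for key, _group in groupby(lines)]

# Joining the list avoids the leading space that repeated += " " produces
transcript = " ".join(unique)
print(transcript)
</code></pre>

<p>Like the loop, this only collapses repeats that are next to each other, so a phrase the speaker genuinely says again later in the talk is kept.</p>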

<h2 id="putting-it-all-together"><a href="https://shkspr.mobi/blog/2018/09/convert-webvtt-to-a-transcript-using-python/#putting-it-all-together">Putting it all together</a></h2>

<p>Ta-da!</p>

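<pre><code class="language-python">import webvtt

# Read the subtitle file
vtt = webvtt.read('subtitles-en.vtt')

# Flatten every caption into its individual lines of text
lines = []
for caption in vtt:
    lines.extend(caption.text.strip().splitlines())

# Build the transcript, skipping any line that repeats the previous one
transcript = ""
previous = None
for line in lines:
    if line == previous:
        continue
    transcript += " " + line
    previous = line

print(transcript)
</code></pre>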

<p>One thing to note is that there is <em>no</em> punctuation. So it's not as good as a proper transcription.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://shkspr.mobi/blog/2018/09/convert-webvtt-to-a-transcript-using-python/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			</item>
	</channel>
</rss>
