And here’s what the code which generates those subtitles looks like:
00:00:00.930 --> 00:00:03.080 align:start position:0%
and<00:00:01.230><c> now</c><00:00:01.439><c> can</c><00:00:01.709><c> we</c><00:00:01.800><c> have</c><c.colorCCCCCC><00:00:01.920><c> a</c></c><c.colorE5E5E5><00:00:01.979><c> round</c><00:00:02.370><c> of</c><00:00:02.460><c> applause</c></c>
00:00:03.080 --> 00:00:03.090 align:start position:0%
and now can we have<c.colorCCCCCC> a</c><c.colorE5E5E5> round of applause
</c>
00:00:03.090 --> 00:00:04.849 align:start position:0%
and now can we have<c.colorCCCCCC> a</c><c.colorE5E5E5> round of applause
for</c><c.colorCCCCCC><00:00:03.120><c> Terrence</c><00:00:03.629><c> Edwards</c><00:00:03.899><c> and</c><00:00:04.170><c> his</c></c><c.colorE5E5E5><00:00:04.200><c> talk</c><00:00:04.529><c> the</c></c>
00:00:04.849 --> 00:00:04.859 align:start position:0%
for<c.colorCCCCCC> Terrence Edwards and his</c><c.colorE5E5E5> talk the
</c>
WTF? You’re looking at WebVTT, the Web Video Text Tracks format, which allows words to be displayed as they’re said. Each sentence and word is given a time-code and a position; colours can also be used to identify multiple speakers. It’s great for subtitles, but lousy if all you want to do is read a transcript.
So, how do we convert that WebVTT into something like:
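To make the format concrete, here is a minimal sketch that pulls apart a single cue by hand using only the standard library. The helper names `parse_timestamp`, `parse_timing_line` and `strip_cue_tags` are made up for this illustration, the payload is a shortened version of the first cue, and the regex assumes a tag never contains a literal `>`. It is not a full WebVTT parser.

```python
import re

# Hypothetical helpers for illustration; not part of any library.
# Matches inline tags such as <c>, </c>, <c.colorCCCCCC> and <00:00:01.230>.
TAG = re.compile(r"<[^>]+>")


def parse_timestamp(stamp):
    """Convert 'HH:MM:SS.mmm' (or 'MM:SS.mmm') into seconds as a float."""
    parts = stamp.split(":")
    seconds = float(parts[-1])
    minutes = int(parts[-2])
    hours = int(parts[0]) if len(parts) == 3 else 0
    return hours * 3600 + minutes * 60 + seconds


def parse_timing_line(line):
    """Return (start, end) in seconds, ignoring settings like align/position."""
    start, arrow, end = line.split()[:3]
    if arrow != "-->":
        raise ValueError(f"not a cue timing line: {line!r}")
    return parse_timestamp(start), parse_timestamp(end)


def strip_cue_tags(payload):
    """Remove inline timestamp and class tags, leaving only the words."""
    return TAG.sub("", payload)


timing = "00:00:00.930 --> 00:00:03.080 align:start position:0%"
payload = (
    "and<00:00:01.230><c> now</c><00:00:01.439><c> can</c>"
    "<c.colorCCCCCC><00:00:01.920><c> a</c></c>"
)

start, end = parse_timing_line(timing)
text = strip_cue_tags(payload)
print(start, end, text)
```

The tags carry the per-word timing and colour information, so stripping them is exactly what throws away the "karaoke" effect and leaves plain text.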
and now can we have a round of applause for Terrence Edwards and his talk the connected house of horrors
Python – the quick and dirty way
Using the open source WebVTT-PY Python library, we can directly get the raw text of each subtitle entry:
>>> import webvtt
>>> vtt = webvtt.read('subtitles-en.vtt')
>>> vtt[0].text
' \nand now can we have a round of applause'
>>> vtt[1].text
'and now can we have a round of applause\n '
>>> vtt[2].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
>>> vtt[3].text
'for Terrence Edwards and his talk the\n '
>>> vtt[4].text
'for Terrence Edwards and his talk the\nconnected house of horrors good'
>>> vtt[5].text
'connected house of horrors good\n '
>>> vtt[6].text
'connected house of horrors good\nafternoon'
Looking through the text manually, we can see that the element at index 2 holds the first complete pair of lines, and the element at index 6 holds the next pair. Starting at 2, we can step by 4 and grab elements 6, 10, 14, etc. to build up a transcript. Does that work?
Yes! This is what happens if we slice the array:
>>> sub = vtt[2::4]
>>> sub[0].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
>>> sub[1].text
'connected house of horrors good\nafternoon'
>>> sub[2].text
'AMF thank you so much for for coming\nhere my name is Terrence Eaton I need to'
>>> sub[3].text
'tell you three things about this talk so\nthe first thing is that this does'
But are we sure that will work for all the subtitles? Or even for the entirety of this subtitle file?
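One way to probe that question is to check the pattern the slice relies on: the first line of each even-numbered entry should repeat the last line of the even-numbered entry before it. The sketch below works on a plain list of caption texts (for example `[c.text for c in vtt]`). The function name `stride_pattern_breaks` is hypothetical, and an empty result only suggests the slice is safe; it does not prove it.

```python
def stride_pattern_breaks(texts):
    """Return indexes of even-numbered captions that break the rolling pattern."""
    breaks = []
    for i in range(2, len(texts), 2):
        previous_lines = texts[i - 2].strip().splitlines()
        current_lines = texts[i].strip().splitlines()
        if not previous_lines or not current_lines:
            # An empty caption cannot continue the pattern.
            breaks.append(i)
        elif current_lines[0] != previous_lines[-1]:
            # The caption does not start with the line the earlier one ended on.
            breaks.append(i)
    return breaks


# The first seven caption texts shown earlier in the post.
texts = [
    ' \nand now can we have a round of applause',
    'and now can we have a round of applause\n ',
    'and now can we have a round of applause\nfor Terrence Edwards and his talk the',
    'for Terrence Edwards and his talk the\n ',
    'for Terrence Edwards and his talk the\nconnected house of horrors good',
    'connected house of horrors good\n ',
    'connected house of horrors good\nafternoon',
]

print(stride_pattern_breaks(texts))

# Simulate a file where one short "transition" caption is missing.
broken = texts[:3] + texts[4:]
print(stride_pattern_breaks(broken))
```

If the returned list is not empty, a fixed step of 4 will drift out of sync from the first reported index onwards.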
Python – the hard way
Let’s take another look at the first 5 subtitle entries.
>>> vtt[0].text
' \nand now can we have a round of applause'
>>> vtt[1].text
'and now can we have a round of applause\n '
>>> vtt[2].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
>>> vtt[3].text
'for Terrence Edwards and his talk the\n '
>>> vtt[4].text
'for Terrence Edwards and his talk the\nconnected house of horrors good'
We can split those double lines using splitlines():
>>> vtt[2].text.splitlines()
['and now can we have a round of applause', 'for Terrence Edwards and his talk the']
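Let’s create a new list and add every line to it, splitting each entry’s text on \n. Calling strip() first removes the blank padding line at the start or end of an entry.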
lines = []
for line in vtt:
    lines.extend(line.text.strip().splitlines())
Which gives us:
>>> lines[0]
'and now can we have a round of applause'
>>> lines[1]
'and now can we have a round of applause'
>>> lines[2]
'and now can we have a round of applause'
>>> lines[3]
'for Terrence Edwards and his talk the'
>>> lines[4]
'for Terrence Edwards and his talk the'
>>> lines[5]
'for Terrence Edwards and his talk the'
>>> lines[6]
'connected house of horrors good'
And now, to de-duplicate them:
transcript = ""
previous = None
for line in lines:
    # Skip a line that repeats the one immediately before it
    if line == previous:
        continue
    transcript += " " + line
    previous = line
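The same consecutive de-duplication can be sketched with `itertools.groupby` from the standard library, which groups runs of adjacent equal items. The sample `lines` list is copied from the output shown above. Joining with `" ".join` is my own choice here; it avoids the leading space that `transcript += " " + line` leaves at the start of the string.

```python
from itertools import groupby

lines = [
    'and now can we have a round of applause',
    'and now can we have a round of applause',
    'and now can we have a round of applause',
    'for Terrence Edwards and his talk the',
    'for Terrence Edwards and his talk the',
    'for Terrence Edwards and his talk the',
    'connected house of horrors good',
]

# groupby yields one (key, group) pair per run of adjacent equal items,
# so keeping only the keys drops consecutive repeats.
unique_lines = [line for line, _ in groupby(lines)]
transcript = " ".join(unique_lines)
print(transcript)
```

Only adjacent repeats are collapsed, so a phrase the speaker genuinely says again later in the talk is kept.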
Putting it all together
Ta-da!
import webvtt

vtt = webvtt.read('subtitles.vtt')
transcript = ""

# Collect every line of every subtitle entry
lines = []
for line in vtt:
    lines.extend(line.text.strip().splitlines())

# Keep a line only if it differs from the one before it
previous = None
for line in lines:
    if line == previous:
        continue
    transcript += " " + line
    previous = line

print(transcript)
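The script can also be wrapped into reusable functions. This is a sketch rather than a definitive version: the names `captions_to_transcript` and `vtt_file_to_transcript` are hypothetical, and skipping blank lines and stripping each line are small additions of mine that the script above does not make. The text handling is kept separate from the file reading so it can be tried on plain strings without a .vtt file.

```python
def captions_to_transcript(texts):
    """Join caption texts, skipping lines that repeat the previous line."""
    kept = []
    previous = None
    for text in texts:
        for line in text.strip().splitlines():
            line = line.strip()
            # Assumption: blank lines carry no content and can be dropped.
            if not line or line == previous:
                continue
            kept.append(line)
            previous = line
    return " ".join(kept)


def vtt_file_to_transcript(path):
    """Read a .vtt file with the webvtt library and return its transcript."""
    import webvtt  # imported here so the helper above works without it

    return captions_to_transcript(caption.text for caption in webvtt.read(path))


# Caption texts as shown earlier in the post.
sample = [
    ' \nand now can we have a round of applause',
    'and now can we have a round of applause\n ',
    'and now can we have a round of applause\nfor Terrence Edwards and his talk the',
    'for Terrence Edwards and his talk the\n ',
    'for Terrence Edwards and his talk the\nconnected house of horrors good',
    'connected house of horrors good\n ',
    'connected house of horrors good\nafternoon',
]
print(captions_to_transcript(sample))
```

With the library installed, `print(vtt_file_to_transcript('subtitles.vtt'))` would then produce the transcript for a whole file.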
One thing to note: the subtitles contain no punctuation, so the result isn’t as good as a proper transcription.