A quick tutorial to AWS Transcribe with Python
Introduction to transcription services with AWS Transcribe via Google Colab & Google Drive
This past summer, I was working on some products that involve speech-to-text mechanisms, and I found it's best to use existing APIs for these purposes. For this post, I want to share my little experience working with these APIs with those who want to try out these wonderful technologies. Hope you guys find it helpful :)
Oh! And here is the quick content I will cover:
- Setting up: general packages and initializing essential functions
- Single Speaker Files
- Multiple Speakers Files
- Accessing and Uploading files through Colab and Google Drive to S3 Storage
- Creating a vocabulary to enhance transcription accuracy
Link to the code: Google Colab (here), Gist (here), or GitHub (here).
Why Speech-to-text?
Speech-to-text is a promising technology, not by itself as a product, but more in its underlying applications to many other products. As our reading speed is much faster than our listening speed, reading the transcription saves more time than listening to audio with similar content. Google's new flagship phone, the Pixel 4, introduces the Recorder app that performs real-time transcription! Promising products could be transcribing meetings (Zoom already offers this feature), lectures (text transcripts), and many more.
Over the years, speech-to-text has become quite a mature field. As this challenge is more horizontal than vertical, companies with massive amounts of data from various input sources win. Without surprise, big corporations like Amazon, Google, IBM, and Microsoft are leaders in providing transcription services on their clouds.
Each product has its pros and cons and might suit your purpose differently. I strongly recommend trying out all these services and selecting the one that performs best on your desired use case. For this post, I focus on the Amazon Transcribe service because of its rich output: a JSON file with all the timestamps and other information, which is super useful!
(Small hype for my (hopefully) next post: Google Cloud's output (when performing single-speaker transcription) does not have any punctuation. I am hoping to write an RNN model to add punctuation to enrich the output. But this is in the future. Let's get back to AWS for now!)
I will go through the steps using Google Colab. The link to the full code is here. Let's do this!
Setting up: general packages and initializing essential functions
!pip install boto3
import pandas as pd
import time
import boto3
Boto is the AWS software development kit for Python. More info can be found in the Boto3 documentation here.
And we also need the access key to our account on AWS. If you haven't created an account yet, please do so (it's free to create, and you have the Free Tier if you don't use too much)!
When you have your account, here's how to get your personal access key (if you already have your access key, feel free to use it):
- Step 1: Go to the AWS Management Console page.
- Step 2: Click on your username at the top right and choose "My Security Credentials."
- Step 3: Choose "Access keys (access key ID and secret access key)."
- Step 4: Create new keys, and remember to save them!
- Step 5: Add them to our code: initialize the transcription job.
transcribe = boto3.client('transcribe',
                          aws_access_key_id = AWS_ACCESS_KEY_ID,          # insert your access key ID here
                          aws_secret_access_key = AWS_SECRET_ACCESS_KEY,  # insert your secret access key here
                          region_name = "us-east-2")                      # region: usually, I put "us-east-2"
Besides, we need to create/connect our Amazon S3 storage.
AWS Transcribe will transcribe files from your S3 storage. This is very handy because you can store files in Amazon S3 and directly process them from the cloud. Read how to create your S3 bucket here.
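If you prefer to create the bucket from Python as well, here is a minimal sketch with boto3 (the bucket name is a placeholder I made up; outside us-east-1, a LocationConstraint matching your region is required):
s3 = boto3.client('s3',
                  aws_access_key_id = AWS_ACCESS_KEY_ID,
                  aws_secret_access_key = AWS_SECRET_ACCESS_KEY,
                  region_name = "us-east-2")
# "my-transcribe-bucket" is a placeholder; bucket names must be globally unique
s3.create_bucket(Bucket="my-transcribe-bucket",
                 CreateBucketConfiguration={'LocationConstraint': 'us-east-2'})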
Try uploading a random audio/video file to S3 storage, and let's try the transcription service! These are the values we need:
- job_uri: the S3 access link, which is usually "s3://bucket_name/" + audio_file_name (e.g., "s3://viethoangtranduong/aws.wav")
- job_name: for each transcription call, we need a job name. For this example, I use the audio file name itself. We can also use a hash function to automate the system.
Notes: the job will crash if there already exists a job with the same name. Possible ways to avoid this problem (see the sketch after this list) are:
- A hash function to encode both the audio file name and the timestamp of the job (this would avoid duplicates, even for files with the same names)
- A key generator database: if we use base 62 (as we want to avoid "/" and "+"), then we can have 62⁶ ≈ 56.8 B unique codes (which should be plenty). We can use the unused keys for each job. We can have two databases to store used and unused keys. Each time an unused key is used, we move it to the other database. We must keep track of the file name and the matched keys for future lookup. Using this method, we can further develop this into a link shortener for transcription files.
- file_format: the file format. AWS can handle most files like .mp3, .wav, or even videos like .mp4
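To illustrate the hash idea (a minimal sketch, not part of the original code; the helper name make_job_name is mine), hashing the file name together with the current timestamp yields a job name that stays unique even across repeated uploads of the same file:
import hashlib
import time

def make_job_name(audio_file_name):
    # combine the file name with the current timestamp so repeated
    # uploads of the same file still get distinct job names
    raw = audio_file_name + str(time.time())
    # the hex digest is alphanumeric, so it is safe to use as a job name
    return hashlib.sha256(raw.encode()).hexdigest()[:16]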
For easy cases, I created a function called check_job_name to handle duplicate job names.
def check_job_name(job_name):
    job_verification = True

    # all the transcriptions
    existed_jobs = transcribe.list_transcription_jobs()

    for job in existed_jobs['TranscriptionJobSummaries']:
        if job_name == job['TranscriptionJobName']:
            job_verification = False
            break

    if job_verification == False:
        command = input(job_name + " has existed. \nDo you want to override the existing job (Y/N): ")
        if command.lower() == "y" or command.lower() == "yes":
            transcribe.delete_transcription_job(TranscriptionJobName=job_name)
        elif command.lower() == "n" or command.lower() == "no":
            job_name = input("Insert new job name? ")
            check_job_name(job_name)
        else:
            print("Input can only be (Y/N)")
            command = input(job_name + " has existed. \nDo you want to override the existing job (Y/N): ")
    return job_name
For single speaker files
def amazon_transcribe(audio_file_name):
    # your S3 access link
    # Usually, I put it like this to automate the process with the file name
    job_uri = "s3://bucket_name/" + audio_file_name

    # Usually, file names have spaces and include the file extension like .mp3
    # we take only the file name and delete all the spaces to name the job
    job_name = (audio_file_name.split('.')[0]).replace(" ", "")
    # file format
    file_format = audio_file_name.split('.')[1]

    # check if the name is taken or not
    job_name = check_job_name(job_name)
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={'MediaFileUri': job_uri},
        MediaFormat=file_format,
        LanguageCode='en-US')

    while True:
        result = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        if result['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
            break
        time.sleep(15)

    if result['TranscriptionJob']['TranscriptionJobStatus'] == "COMPLETED":
        data = pd.read_json(result['TranscriptionJob']['Transcript']['TranscriptFileUri'])
        return data['results'][1][0]['transcript']
Because the transcription might take time, we create a while loop that waits until the job is done, rechecking every 15 seconds.
The last "if" statement extracts the specific transcript from the JSON file. I will discuss how to extract timestamps at the end of the post.
For multiple speakers files
With AWS Transcribe, the maximum number of speakers it can detect is 10.
I will take 2 arguments as input this time: audio_file_name and max_speakers. I strongly recommend setting the max_speakers value, as it can possibly enhance AWS's accuracy. However, you can also leave it blank.
def amazon_transcribe(audio_file_name, max_speakers = -1):
    if max_speakers > 10:
        raise ValueError("Maximum detected speakers is 10.")

    job_uri = "s3 bucket link" + audio_file_name
    job_name = (audio_file_name.split('.')[0]).replace(" ", "")

    # check if the name is taken or not
    job_name = check_job_name(job_name)

    if max_speakers != -1:
        transcribe.start_transcription_job(
            TranscriptionJobName=job_name,
            Media={'MediaFileUri': job_uri},
            MediaFormat=audio_file_name.split('.')[1],
            LanguageCode='en-US',
            Settings={'ShowSpeakerLabels': True,
                      'MaxSpeakerLabels': max_speakers})
    else:
        transcribe.start_transcription_job(
            TranscriptionJobName=job_name,
            Media={'MediaFileUri': job_uri},
            MediaFormat=audio_file_name.split('.')[1],
            LanguageCode='en-US',
            Settings={'ShowSpeakerLabels': True})

    while True:
        result = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        if result['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
            break
        time.sleep(15)

    if result['TranscriptionJob']['TranscriptionJobStatus'] == 'COMPLETED':
        data = pd.read_json(result['TranscriptionJob']['Transcript']['TranscriptFileUri'])
    return result
This time, the output is not the transcript anymore but the full result (in Python, it's a dictionary data type).
data = pd.read_json(result['TranscriptionJob']['Transcript']['TranscriptFileUri'])
transcript = data['results'][2][0]['transcript']
This code will give you the raw transcript (without speaker labels): the result will be similar to inputting these files to the single speaker model.
How to add speaker labels?
Now, we will read the JSON file in the "TranscriptFileUri."
As we are using Google Colab, I will also demonstrate how to access files inside specific folders.
Assuming we already have it in the folder Colab Notebooks/AWS Transcribe reader, here's how to access it:
from google.colab import drive
import sys
import os

drive.mount('/content/drive/')
sys.path.append("/content/drive/My Drive/Colab Notebooks/AWS Transcribe reader")
os.chdir("/content/drive/My Drive/Colab Notebooks/AWS Transcribe reader")
Now, we need to process the JSON output from AWS Transcribe. The code below will produce a .txt file with [timestamp, speaker label, content].
When inputting the "filename.json" file, expect the "filename.txt" file for the full transcript.
import json
import datetime
import time as ptime

def read_output(filename):
    # example filename: audio.json
    # take the input as the filename
    filename = (filename).split('.')[0]

    # Create an output txt file
    print(filename + '.txt')
    with open(filename + '.txt', 'w') as w:
        with open(filename + '.json') as f:
            data = json.loads(f.read())
            labels = data['results']['speaker_labels']['segments']
            speaker_start_times = {}

            # map each word's start time to its speaker label
            for label in labels:
                for item in label['items']:
                    speaker_start_times[item['start_time']] = item['speaker_label']

            items = data['results']['items']
            lines = []
            line = ''
            time = 0
            speaker = 'null'
            i = 0

            # loop through all elements
            for item in items:
                i = i + 1
                content = item['alternatives'][0]['content']

                # if it has a start time, look up the current speaker
                if item.get('start_time'):
                    current_speaker = speaker_start_times[item['start_time']]
                # in AWS output, there are items typed as punctuation
                elif item['type'] == 'punctuation':
                    line = line + content

                # handle a different speaker
                if current_speaker != speaker:
                    if speaker:
                        lines.append({'speaker': speaker, 'line': line, 'time': time})
                    line = content
                    speaker = current_speaker
                    time = item['start_time']
                elif item['type'] != 'punctuation':
                    line = line + ' ' + content
            lines.append({'speaker': speaker, 'line': line, 'time': time})

            # sort the results by time
            sorted_lines = sorted(lines, key=lambda k: float(k['time']))

            # write into the .txt file
            for line_data in sorted_lines:
                line = '[' + str(datetime.timedelta(seconds=int(round(float(line_data['time']))))) + '] ' + line_data.get('speaker') + ': ' + line_data.get('line')
                w.write(line + '\n\n')
Then, in the same folder, the file "filename.txt" will appear, containing the full transcript.
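For example, after downloading the JSON output as audio.json (a placeholder name), a single call produces the transcript file:
read_output("audio.json")  # writes audio.txt with [timestamp] speaker: content lines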
Bonus 1: Accessing and uploading files directly to S3 storage
Uploading files to AWS S3 Storage will definitely automate a lot of processes.
# define AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and bucket_name
# bucket_name: name of the S3 storage folder
s3 = boto3.client('s3',
                  aws_access_key_id = AWS_ACCESS_KEY_ID,
                  aws_secret_access_key = AWS_SECRET_ACCESS_KEY,
                  region_name = "us-east-2")

s3.upload_file(file_name, bucket_name, file_name)
Bonus 2: Why don't we create a vocabulary to enhance accuracy?
We can manually upload the vocabulary into the AWS Transcribe service through the management console. The accepted files are .csv or .txt.
However, if we want to automate this process with Python, it will be a bit trickier.
Here is one way I found: AWS accepts a specific type of input through Python, which is a DataFrame with 4 columns: ['Phrase', 'IPA', 'SoundsLike', 'DisplayAs'], converted into a .txt file. For more information on the columns' meanings and custom vocabularies, read here.
import csv
import numpy as np

def vocab_name(custom_name):
    vocab = pd.DataFrame([['Los-Angeles', np.nan, np.nan, "Los Angeles"],
                          ["F.B.I.", "ɛ f b i aɪ", np.nan, "FBI"],
                          ["Etienne", np.nan, "eh-tee-en", np.nan]],
                         columns=['Phrase', 'IPA', 'SoundsLike', 'DisplayAs'])
    vocab.to_csv(custom_name + '.csv', header=True, index=None, sep='\t')

    csv_file = custom_name + '.csv'
    txt_file = custom_name + '.txt'
    with open(txt_file, "w") as my_output_file:
        with open(csv_file, "r") as my_input_file:
            [my_output_file.write(" ".join(row) + '\n') for row in csv.reader(my_input_file)]
        my_output_file.close()

    ptime.sleep(30)  # wait for the file to finish

    bucket_name = # name of the S3 bucket
    s3.upload_file(txt_file, bucket_name, txt_file)
    ptime.sleep(60)

    response = transcribe.create_vocabulary(
        VocabularyName=custom_name,
        LanguageCode='en-US',
        VocabularyFileUri="your s3 link" + txt_file)
    # the link usually is bucketname.region.amazonaws.com

# after running vocab_name, we can check the status through this line
# if it's ready, the VocabularyState will be 'READY'
transcribe.list_vocabularies()
Uploading and adding the vocabulary takes quite a bit of time. This might not be the optimal approach, but it works (among the things I tried: a list of strings and lists of lists, but none worked so far).
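Once list_vocabularies() reports the VocabularyState as 'READY', you can attach the custom vocabulary to a transcription job through the Settings parameter (a sketch; the job name and file URI are placeholders):
transcribe.start_transcription_job(
    TranscriptionJobName="job-with-vocab",               # placeholder name
    Media={'MediaFileUri': "s3://bucket_name/aws.wav"},  # placeholder file
    MediaFormat="wav",
    LanguageCode='en-US',
    Settings={'VocabularyName': custom_name})  # the name passed to create_vocabulary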
If you found a way to automate this even further, please comment! I would love to discuss and learn more.
Conclusion and … What's next?
Here is a quick tutorial on AWS Transcribe. I hope you found it useful! Also, I would love to hear your thoughts and discuss them even further. (And I might write about adding punctuation to Google Cloud's punctuation-less transcripts, so I hope that I don't get lazy.)
Link to the code: Google Colab (here), Gist (here), or GitHub (here).
And feel free to contact me through linkedin.com/in/viethoangtranduong/ or comment on this post, and I'll try my best to reply asap!
P.S.: real-time transcription is much better than transcribing a pre-recorded file.
References
A 5 Minute Overview of AWS Transcribe
AWS Transcribe to Docx.