Good morning everyone. This chat is for the mass transcription project.
@Dustin Surwill has joined the conversation
Hey @Nick Ward, I got your email and wanted clarification on the Law Ruler status criteria. When you say 'All', do you mean every single status in the Law Ruler system, or all of the specific statuses you listed (Already Represented, Declined, etc.)? I want to make sure we pull the right data set. I remember our talk, I'm just making 100% sure.
Update on the joining criteria:
James, Dustin & I talked about this yesterday at different times. Here is what we came up with.
This is how the files are matched and found, because it's how they are saved in the sl-five9-recordings Google Cloud bucket: Alizabeth Burge-2025-02-14-14-44-01 6103687090.mp3, aka {agent_name}-{year}-{month}-{day}-{hour}-{minute}-{second} {phone}.mp3
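For reference, here's a minimal sketch of parsing that filename convention. This assumes names always follow the exact pattern above; the regex and function name are illustrative, not the pipeline's actual code:

```python
import re

# Pattern assumed from the naming convention:
# {agent_name}-{year}-{month}-{day}-{hour}-{minute}-{second} {phone}.mp3
FILENAME_RE = re.compile(
    r"^(?P<agent>.+)-(?P<y>\d{4})-(?P<mo>\d{2})-(?P<d>\d{2})"
    r"-(?P<h>\d{2})-(?P<mi>\d{2})-(?P<s>\d{2}) (?P<phone>\d+)\.mp3$"
)

def parse_recording_name(name: str) -> dict:
    """Split a bucket filename into agent, timestamp, and phone."""
    m = FILENAME_RE.match(name)
    if not m:
        raise ValueError(f"unexpected filename: {name}")
    return {
        "agent": m.group("agent"),
        "timestamp": "{y}-{mo}-{d} {h}:{mi}:{s}".format(**m.groupdict()),
        "phone": m.group("phone"),
    }
```

The greedy agent match works because the date fields after it are fixed-width digits, so backtracking lands on the right hyphen even when the agent name contains spaces.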
I have a script we use for the legal compliance pipeline that I'll adapt to download and transcribe files from the sl-five9-recordings Google Cloud bucket.
We're going to upload the .json files with the transcriptions sent back by assembly.ai into the same sl-five9-recordings Google Cloud bucket, with the file name being Alizabeth Burge-2025-02-14-14-44-01 6103687090.json, aka {agent_name}-{year}-{month}-{day}-{hour}-{minute}-{second} {phone}.json
Once this is completed, whenever you want specific criteria in the future, we can find the phone number in the database, match it to the criteria/statuses, and pull the already transcribed .json data straight from the bucket. Doing it this way we avoid double-storing data in Postgres and still have the ability to pull specific calls.
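The matching step can be sketched as a pure filter over blob names (assuming names follow the convention above; in practice the name list would come from listing the bucket, and these helper names are made up for illustration):

```python
def phone_from_blob_name(name: str) -> str:
    """Extract the trailing phone number from
    '{agent}-{Y}-{m}-{d}-{H}-{M}-{S} {phone}.json'."""
    return name.rsplit(" ", 1)[-1].removesuffix(".json")

def select_blobs_for_phones(blob_names: list[str],
                            wanted_phones: set[str]) -> list[str]:
    """Keep only .json transcripts whose phone matched the criteria query."""
    return [
        n for n in blob_names
        if n.endswith(".json") and phone_from_blob_name(n) in wanted_phones
    ]
```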
Currently in my legal pipeline we look for a minimum of 120 seconds. Out of 600 audio files we pull for legal use, we still end up with about 30 voicemail files. So 45 seconds won't be long enough.
As for cost, the estimated total for the whole bucket is ~$9,826 (36,392 hours x $0.27 per hour), which is around 4.5M+ audio files. Removing calls under 120 seconds should cut that down, so this is a high-end estimate.
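Quick sanity check on that math (the rate is the assembly.ai per-hour price quoted above; nothing else is assumed):

```python
HOURS = 36_392          # total audio in the bucket, per the estimate above
RATE_PER_HOUR = 0.27    # quoted assembly.ai pre-recorded price, $/hour

total = HOURS * RATE_PER_HOUR
print(f"${total:,.2f}")  # → $9,825.84, i.e. the ~$9,826 high-end figure
```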
Processing time with assembly.ai speech-to-text (pre-recorded): 200 concurrent transcriptions, with automatic queuing for overflow, and an account-specific limit of 30 requests per minute; requests beyond that fail with a 429 error.
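A sketch of how the script could absorb those 429s: exponential backoff with jitter around whatever function submits a file. `RateLimited` stands in for the real 429 response, and `_sleep` is injectable so this is easy to test; none of this is assembly.ai's SDK, just a generic retry wrapper:

```python
import random
import time

class RateLimited(Exception):
    """Stand-in for an HTTP 429 'too many requests' response."""

def with_backoff(fn, max_tries=5, base=1.0, _sleep=time.sleep):
    """Retry fn() with exponential backoff + jitter when rate-limited.

    Staying under the 30 req/min account limit avoids most 429s;
    this covers the occasional burst that slips through.
    """
    for attempt in range(max_tries):
        try:
            return fn()
        except RateLimited:
            if attempt == max_tries - 1:
                raise  # give up after max_tries attempts
            _sleep(base * 2 ** attempt + random.random())
```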
Once all of this is completed, we can filter down by criteria, pull the .json files we want from the sl-five9-recordings Google Cloud bucket, convert them however SimpleTalk needs, and then fire them off via API or however they accept the files.
I've included a flow chart for visual aid.
Here is the Github https://github.com/shield-legal/mass-transcription-simpletalk
@Chris Krecicki Right now we want all leads/calls that are in this campaign. The lead count is ~4.5K. Once we have the payload, I/we can evaluate whether or not to filter down the list to specific statuses before sending to the folks who will work with the data. I just want to be sure we catch all of the leads in this campaign (not vertical, not campaign type, just this specific campaign), with the current status labeled in the file for future filtering. Glad to chat if this needs more explanation. Thanks!
OK so we're only downloading and transcribing those 4.5K -- what is the campaign? I'll make updates in the code and run a test; expect it by EOD or Monday. @Nick Ward
@Nick Ward confirming the campaign is: Depo-Provera - DL - Flatirons - Shield Legal aka case_type 1923
@Nick Ward @James Turner -- please confirm the sample sent via email
Correct @Chris Krecicki, 1923 is the LR # for the campaign
Please check email chain everyone https://github.com/shield-legal/mass-transcription-simpletalk/blob/master/prod_data/depo_provera_filenames.json -- I have these ready to go. I just need the approval. @Joe Santana
Give me a couple days .. learning that timestamps are 7+ hours ahead in the bucket due to how they are uploaded, and file names vary .. give me a bit to sort this out -- unexpected things
Aight boys, this is done. Spent a few hours sorting out those caveats. I'm waiting for Dustin to get back to add permissions so I can upload the .json file with the transcription to the call bucket. But it is done. When do you all want to set up a meeting to review all this before we move forward?
Shoot me a calendar invite when you get a time together
I'll run this starting tomorrow. Once they get the DB migrated we can just take our local DB inserts and import them to prod so prod can catch up.
Ok. Let's move the demo meeting to tomorrow, get everything ready. FYI @James Scott
@Ryan this is for the mass transcription project, what I was talking about here -- we should still do a demo today over the tort finder we were talking about in the ai-development chat
@Ryan check our thread in the other chat we have together