James Scott (jamesscott@shield-legal.com)
2025-01-13 12:06:09

hey this is easier

👍 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-01-13 12:06:40
James Scott (jamesscott@shield-legal.com)
2025-01-13 12:15:25

https://console.cloud.google.com/home/dashboard?project=ai-projects-406720

Josh Josue (jjosue@shield-legal.com)
2025-01-13 12:16:54

getting my headphones brb

James Scott (jamesscott@shield-legal.com)
2025-01-13 12:38:30

https://console.cloud.google.com/vertex-ai/workbench/instances?referrer=search&project=ai-projects-406720

James Scott (jamesscott@shield-legal.com)
2025-01-13 16:50:07

are you still in the office?

Josh Josue (jjosue@shield-legal.com)
2025-01-13 16:50:17

yeah im still here

Josh Josue (jjosue@shield-legal.com)
2025-01-13 16:55:22

I've been looking into the Notebooks feature of Vertex AI. I created a simple Python script that runs on a schedule

James Scott (jamesscott@shield-legal.com)
2025-01-13 16:56:29

u created that in gcp? yea the notebook itself is going to need to be run cuz it grabs a file, but good work for today. you can go home, have some time back, and we can pick back up tomorrow

Josh Josue (jjosue@shield-legal.com)
2025-01-13 16:56:58

yea, the scheduled script will terminate at 5PM PST just to test it

Josh Josue (jjosue@shield-legal.com)
2025-01-13 16:57:07

oh sweet! thanks! cya tomorrow

James Scott (jamesscott@shield-legal.com)
2025-01-13 16:58:28

awesome we shall talk then!

Josh Josue (jjosue@shield-legal.com)
2025-01-14 11:28:56

G'morning! im back at the office

James Scott (jamesscott@shield-legal.com)
2025-01-14 11:29:11

awesome we can hop on a quick call

Josh Josue (jjosue@shield-legal.com)
2025-01-14 11:30:41

sure!

Josh Josue (jjosue@shield-legal.com)
2025-01-14 11:31:45

i can hear you

James Scott (jamesscott@shield-legal.com)
2025-01-14 11:31:56

ima do a google meet

👍 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-01-14 11:32:34

To join the video meeting, click this link: https://meet.google.com/syy-crhd-xug

To join by phone instead, dial (US) and enter this PIN: 710 152 576#

More phone numbers: https://tel.meet/syy-crhd-xug?pin=3752056146248

James Scott (jamesscott@shield-legal.com)
2025-01-14 12:07:38

Nicholas McFadden is Dustin's direct manager. if you notice anything else, or more things coming from him, please let him know

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-01-14 12:08:04

noted - thank you!

Josh Josue (jjosue@shield-legal.com)
2025-01-14 12:53:53

I'm trying to run a script within the machine-learning instance, but it doesnt seem to have any Environments to show me

Josh Josue (jjosue@shield-legal.com)
2025-01-14 12:54:40

So I'm looking into editing the instance to have one

James Scott (jamesscott@shield-legal.com)
2025-01-14 12:56:35

You can create one

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-01-14 15:36:24

Hi James, are you ok with me giving status reports?

James Scott (jamesscott@shield-legal.com)
2025-01-14 15:37:06

what do you mean

James Scott (jamesscott@shield-legal.com)
2025-01-14 15:37:11

oh aahha

James Scott (jamesscott@shield-legal.com)
2025-01-14 15:37:13

yes thats fine

Josh Josue (jjosue@shield-legal.com)
2025-01-14 15:37:19

haha ok

Josh Josue (jjosue@shield-legal.com)
2025-01-14 15:37:59

alright - so I went down a rabbit hole trying to set up an Environment (via Google Cloud Composer) to see if it would appear in the script's job scheduler. However, it did not.

Now I'm trying the Vertex AI Executor by creating a scheduled run for 2PM. I routed the BigQuery output to a different table_id so it shouldn't mess up the entries currently there.

James Scott (jamesscott@shield-legal.com)
2025-01-14 15:38:24

sounds good with me!

James Scott (jamesscott@shield-legal.com)
2025-01-14 16:08:42

quick question was there any other information about the dustin incident?

James Scott (jamesscott@shield-legal.com)
2025-01-14 16:09:09

they are going to bring u in

Josh Josue (jjosue@shield-legal.com)
2025-01-14 16:09:13

Are there any specifics - the others were present in the room

Josh Josue (jjosue@shield-legal.com)
2025-01-14 16:09:24

bring me in to where?

Josh Josue (jjosue@shield-legal.com)
2025-01-14 16:23:25

some good news! The Vertex cronjob executed at the expected time of 2PM!

Waiting for it to get done to check the results...

Josh Josue (jjosue@shield-legal.com)
2025-01-14 17:03:09

are you free for a call?

James Scott (jamesscott@shield-legal.com)
2025-01-14 17:03:33

its a little late on my side! you can leave good work today

James Scott (jamesscott@shield-legal.com)
2025-01-14 17:03:44

unless its important

Josh Josue (jjosue@shield-legal.com)
2025-01-14 17:03:59

oh ok gotcha - yea we can pick it back up tomorrow

Josh Josue (jjosue@shield-legal.com)
2025-01-14 17:04:04

thanks!

Josh Josue (jjosue@shield-legal.com)
2025-01-14 17:04:19

sorry i forgot bout the time difference

James Scott (jamesscott@shield-legal.com)
2025-01-14 17:04:48

awesome! sounds good have a good time

Josh Josue (jjosue@shield-legal.com)
2025-01-15 11:23:49

G'morning! I'm back at the office

James Scott (jamesscott@shield-legal.com)
2025-01-15 11:26:38

awesome!

Josh Josue (jjosue@shield-legal.com)
2025-01-15 11:27:02

Can i present my findings from yesterday?

Josh Josue (jjosue@shield-legal.com)
2025-01-15 11:27:13

(if you're free)

James Scott (jamesscott@shield-legal.com)
2025-01-15 11:29:29

i will be in meetings for a while today! i would say just keep going with what ur doing

👍 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-01-15 12:09:38

https://console.cloud.google.com/bigquery?referrer=search&project=ai-projects-406720&inv=1&invt=Abm6FA&ws=!1m0

Josh Josue (jjosue@shield-legal.com)
2025-01-15 13:49:28

SUCCESS! 🥳 The Workflow notebook executed on the scheduled time and successfully created a BigQuery table filled with entries

There is a slight security risk tho - I had to embed the contents of your JSON key into the script. But other than that, this seems like a viable way for us to schedule cronjobs

James Scott (jamesscott@shield-legal.com)
2025-01-15 14:45:22

wooohoo!! awesome lol thats great news. now i think step 2 would be to update the script to pull automatically. right now it is for a set period and time, but it needs to be updated and grabbed automatically

Josh Josue (jjosue@shield-legal.com)
2025-01-15 15:16:11

so by "pull automatically" did you mean code being pulled from a repo?

Or did you mean I move the other scripts to the BigQuery Workflow?

James Scott (jamesscott@shield-legal.com)
2025-01-15 16:24:47

we can talk about it tomorrow, but basically the code fetches a json that has the jsons of all the data, which then needs to be downloaded and uploaded to the bigquery table. we need to automatically fetch the latest big data pull and do that json download

Josh Josue (jjosue@shield-legal.com)
2025-01-15 16:25:46

oh ok gotcha - yea we can pick it up again tomorrow!

Josh Josue (jjosue@shield-legal.com)
2025-01-16 11:21:33

g'morning! I'm back

James Scott (jamesscott@shield-legal.com)
2025-01-16 11:29:36

Awesome, I would say just continue with what we were working on yesterday: updating the data, or fetching the data from the fda api in the code

Josh Josue (jjosue@shield-legal.com)
2025-01-16 12:47:21

Yesterday I ran the workflow to download 3 json files and that ran fine. Then today I tested the workflow to download as many as there are based on the csv rows and it failed (stopped after the 280th download)

I think I really will have to create a K8s environment to run all of these scripts.

Josh Josue (jjosue@shield-legal.com)
2025-01-16 12:47:52

I'm going to ask Joe for permissions on installing Docker Desktop so I can get started heading in that direction

James Scott (jamesscott@shield-legal.com)
2025-01-16 12:48:56

yea i mean i think i have the data up until 2023, so it should be just for the year of 2024

James Scott (jamesscott@shield-legal.com)
2025-01-16 12:49:19

i dont think there should be 280+ files for a single year because i think it was like 800 something

James Scott (jamesscott@shield-legal.com)
2025-01-16 12:49:27

for the last 8 years

James Scott (jamesscott@shield-legal.com)
2025-01-16 12:49:52

i can hop on a call real quick

Josh Josue (jjosue@shield-legal.com)
2025-01-16 12:49:59

yea plz

James Scott (jamesscott@shield-legal.com)
2025-01-16 13:01:51
Josh Josue (jjosue@shield-legal.com)
2025-01-17 12:17:31

Hi James, what's this column titled "Unnamed" represent?

Josh Josue (jjosue@shield-legal.com)
2025-01-17 12:17:34
James Scott (jamesscott@shield-legal.com)
2025-01-17 13:16:49

This is just an index column. Sometimes when saving and loading the dataset in python this happens. It needs to be dropped, or when saving u have to set index=False
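
A minimal pandas sketch of where that "Unnamed" column comes from and the two ways to get rid of it. The column names and values here are made up for illustration and are not from the real dataset.

```python
import io

import pandas as pd

df = pd.DataFrame({"year": [2017, 2024], "case_number": ["A1", "B2"]})

# Saving with the default index=True writes the row index as an extra
# column with an empty header; read_csv then labels it "Unnamed: 0".
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
reloaded = pd.read_csv(buf)

# Fix 1: do not write the index when saving.
clean_buf = io.StringIO()
df.to_csv(clean_buf, index=False)
clean_buf.seek(0)
clean = pd.read_csv(clean_buf)

# Fix 2: drop the stray column after loading.
dropped = reloaded.drop(columns=["Unnamed: 0"])
```

Either fix leaves only the real columns; writing with `index=False` avoids the problem at the source.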

Josh Josue (jjosue@shield-legal.com)
2025-01-17 14:34:37

ok gotcha

Josh Josue (jjosue@shield-legal.com)
2025-01-17 14:34:55

so great news - i now have a script that creates a CSV with all those json links!

James Scott (jamesscott@shield-legal.com)
2025-01-17 14:35:11

from 2003 through current? awesome!

Josh Josue (jjosue@shield-legal.com)
2025-01-17 14:35:23

just curious, in the future, do we wanna keep making csv files OR maybe have it on a DB to be referenced by the other scripts

Josh Josue (jjosue@shield-legal.com)
2025-01-17 14:35:32

yep!

James Scott (jamesscott@shield-legal.com)
2025-01-17 14:35:57

well, if you can directly upload it or the script to big query, thats the goal, but the code i found takes a csv then uploads. if u can do direct upload of the data that is perfect

Josh Josue (jjosue@shield-legal.com)
2025-01-17 14:36:01

it can be told to run for any array list of years (e.g. [2014, 2023, 2025])

James Scott (jamesscott@shield-legal.com)
2025-01-17 14:36:18

more so, from the presentations i am doing and the feedback, we want to include all years' worth of data in a table

Josh Josue (jjosue@shield-legal.com)
2025-01-17 14:36:22

yep! That's my next step - uploading the csv to the cloud

James Scott (jamesscott@shield-legal.com)
2025-01-17 14:36:34

yes if u can do it without a csv then perfect

👍 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-01-17 14:36:53

so we can remove that redundancy

Josh Josue (jjosue@shield-legal.com)
2025-01-17 14:37:20

oh ok cool! yea i'll go ahead and store this scraped data on whatever Google's version of DynamoDB is

Josh Josue (jjosue@shield-legal.com)
2025-01-17 14:38:04

i wanna draw a diagram so we can discuss it at some point, but the overview is, the notebook scripts you made will be triggered by a python script cronjob living in Google's version of an EC2

Josh Josue (jjosue@shield-legal.com)
2025-01-17 14:38:51

that way, we benefit from Google's UI for manual override and we keep it to how you're comfortable seeing those notebooks still

James Scott (jamesscott@shield-legal.com)
2025-01-17 14:40:18

i dont think we need to store that json anywhere, it can live as a variable in our code every time we run it, if thats what u doing? from there you go to downloading the data, and the code works for that. we just need to get all the data into the table now, and we can look into not having it in a csv: augment the code to take that data, store it in a dataframe, and then push to bigquery
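
A rough sketch of that "no CSV in between" idea, assuming the downloaded JSON has already been parsed into a list of dicts. `load_records` and its `upload` parameter are hypothetical names; in the real pipeline `upload` would wrap a BigQuery call such as `pandas_gbq.to_gbq(df, ..., if_exists="append")`.

```python
import pandas as pd


def load_records(records, upload):
    """Build a DataFrame in memory and hand it straight to `upload`.

    `upload` is any callable that takes a DataFrame (assumption: in
    production it would push the frame to BigQuery). Returns the row count.
    """
    df = pd.DataFrame.from_records(records)
    upload(df)
    return len(df)


# Stand-in uploader so the sketch runs without touching GCP.
uploaded = []
row_count = load_records(
    [{"case_number": "A1", "year": 2024}, {"case_number": "B2", "year": 2024}],
    uploaded.append,
)
```

Keeping the uploader as a parameter makes it easy to test the flow locally before pointing it at a real table.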

Josh Josue (jjosue@shield-legal.com)
2025-01-17 14:44:10

Oh I’m not storing a json, I’m storing the csv of links onto a DB which could then be referenced by the bigquery workflow script

James Scott (jamesscott@shield-legal.com)
2025-01-17 14:44:33

you have a second?

Josh Josue (jjosue@shield-legal.com)
2025-01-17 14:47:13

Yep

Josh Josue (jjosue@shield-legal.com)
2025-01-17 14:47:30

Want to huddle?

Josh Josue (jjosue@shield-legal.com)
2025-01-17 14:50:17

Just gonna drive to get lunch

James Scott (jamesscott@shield-legal.com)
2025-01-17 14:52:45

oh ok take ur time no rush i stepped away

James Scott (jamesscott@shield-legal.com)
2025-01-17 14:53:16

also its the holiday weekend leave around 2 today i see no reason to stay longer

Josh Josue (jjosue@shield-legal.com)
2025-01-17 15:00:38

Oh awesome! Thanks!

Josh Josue (jjosue@shield-legal.com)
2025-01-17 15:01:02

Ok I’m ready to huddle

Josh Josue (jjosue@shield-legal.com)
2025-01-20 13:44:02

Update: My script now automatically uploads the resulting csv file onto Google Storage! woot!

James Scott (jamesscott@shield-legal.com)
2025-01-20 13:54:57

Just the storage or big query? Good job! For all years right ?

Josh Josue (jjosue@shield-legal.com)
2025-01-20 13:55:23

to big query

your script does the big query stuff so I have to chain it onto the automation

Josh Josue (jjosue@shield-legal.com)
2025-01-20 13:55:36

yep! all years (2004 - to current)

Josh Josue (jjosue@shield-legal.com)
2025-01-20 13:55:54

I was going to get started with Google Run Functions but saw Ryan's message about Fivetran

James Scott (jamesscott@shield-legal.com)
2025-01-20 13:56:22

Yea make sure it’s updated in big query! And is it automated for future years as well? Like not just the current years, so if we run next quarter it automatically downloads

Josh Josue (jjosue@shield-legal.com)
2025-01-20 13:56:36

The architecture i had in mind was, this script runs on a Google Function then kicks off your script for the BigQuery part

James Scott (jamesscott@shield-legal.com)
2025-01-20 13:57:25

Gotcha! There are some tasks after this but this is good. I need to redo the models on the full data

Josh Josue (jjosue@shield-legal.com)
2025-01-20 13:58:09

Yep, it's automated to run from 2004 to the current year
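
One way the "2004 to the current year" range could be computed so it keeps working in future years without edits. This is only a sketch; `years_to_process` is a hypothetical helper, not the name used in the actual scraper.

```python
from datetime import date


def years_to_process(start_year=2004, today=None):
    """Return every year from start_year through the current year, inclusive."""
    today = today or date.today()
    return list(range(start_year, today.year + 1))


# Pinning the date makes the result reproducible.
years = years_to_process(today=date(2025, 1, 20))
```

Because the end of the range comes from the clock, a run next year picks up the new year automatically.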

James Scott (jamesscott@shield-legal.com)
2025-01-20 13:58:19

Awesome let’s go

😁 Josh Josue
🥳 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-01-20 13:58:58

as it stands, the FDA only goes up to 2024 Q3

James Scott (jamesscott@shield-legal.com)
2025-01-20 13:59:52

is it in big query yet i tried querying it still shows 2023 latest year

Josh Josue (jjosue@shield-legal.com)
2025-01-20 14:00:33

not yet

Josh Josue (jjosue@shield-legal.com)
2025-01-20 14:00:45

i could manually run your script to update the bigquery entries

James Scott (jamesscott@shield-legal.com)
2025-01-20 14:01:18

No it should be automated as well, I’m expecting it to run the script automatically end to end

Josh Josue (jjosue@shield-legal.com)
2025-01-20 14:02:14

yes absolutely - eventually that's the goal

  1. Cronjob that runs my scraping script
  2. When done, it kicks off the BigQuery adverse-events workflow
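
The two steps above can be sketched as a tiny runner that only starts step 2 if step 1 finished. The step names and the `run_pipeline` helper are hypothetical, and the real steps would be the scraping script and the BigQuery workflow.

```python
def run_pipeline(steps):
    """Run (name, callable) steps in order; stop at the first one that raises.

    Returns (names of completed steps, error message or None).
    """
    completed = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            return completed, f"{name} failed: {exc}"
        completed.append(name)
    return completed, None


def broken_scrape():
    raise RuntimeError("driver not available")


ok_run = run_pipeline([("scrape", lambda: None), ("bigquery_load", lambda: None)])
bad_run = run_pipeline([("scrape", broken_scrape), ("bigquery_load", lambda: None)])
```
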
James Scott (jamesscott@shield-legal.com)
2025-01-20 14:02:33

Awesome !

Josh Josue (jjosue@shield-legal.com)
2025-01-20 14:03:02

so my next step is to find where I can run this scraping script on Google Cloud

Josh Josue (jjosue@shield-legal.com)
2025-01-20 14:03:36

THEN I could kickoff the BigQuery workflow

James Scott (jamesscott@shield-legal.com)
2025-01-20 14:03:41

Why can’t the scraping script be run in the workflow?

Josh Josue (jjosue@shield-legal.com)
2025-01-20 14:03:54

there's too many components

Josh Josue (jjosue@shield-legal.com)
2025-01-20 14:04:15

it's not just a single file script (for code reuse)

Josh Josue (jjosue@shield-legal.com)
2025-01-20 14:05:14

also more flexible since the config files can easily be replaced (it contains credentials, bucket name, storage path, etc)

James Scott (jamesscott@shield-legal.com)
2025-01-20 14:11:12

Awesome ok that’s fine with me! Did u have a good weekend

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-01-20 14:11:41

It was pretty chill - got back to working on some tinkering

Josh Josue (jjosue@shield-legal.com)
2025-01-20 14:11:44

how was yours?

James Scott (jamesscott@shield-legal.com)
2025-01-22 11:20:24

question is the data for all years in there yet? i am going to adjust some of my models

Josh Josue (jjosue@shield-legal.com)
2025-01-22 11:25:20

Still on the road but I’ll check soon as I’m in

Josh Josue (jjosue@shield-legal.com)
2025-01-22 11:34:02

so i integrated your script into mine yesterday and ran it locally on my machine - but for some reason my machine rebooted overnight 😕

Josh Josue (jjosue@shield-legal.com)
2025-01-22 11:34:22

i'll manually run the workflow script rn

James Scott (jamesscott@shield-legal.com)
2025-01-22 11:34:36

thanks!!

Josh Josue (jjosue@shield-legal.com)
2025-01-22 11:51:47

Update on the automation: Google Function seems to have an unresolved feature ticket (dating back to 2019) concerning the ChromeWebDriver (which is needed for the scraping)

So I'm currently looking for workarounds for this issue

James Scott (jamesscott@shield-legal.com)
2025-01-22 11:52:53

question: does this notebook job run end to end cron-josh-adverse-events

Josh Josue (jjosue@shield-legal.com)
2025-01-22 11:54:08

No - that one starts at the step of the pipeline to download all the scraped json links

Josh Josue (jjosue@shield-legal.com)
2025-01-22 11:54:51

the one that runs end to end (scraping + downloading json + bigquery table creation) is on the Docker image that i'm deploying onto GCP

Josh Josue (jjosue@shield-legal.com)
2025-01-22 11:55:11

which is currently having the ChromeWebDriver issue

James Scott (jamesscott@shield-legal.com)
2025-01-22 11:55:39

ah gotcha awesome ok! ima go ahead and delete that one if it's not needed.

Josh Josue (jjosue@shield-legal.com)
2025-01-22 11:55:51

ok sure

Josh Josue (jjosue@shield-legal.com)
2025-01-22 11:56:26

you have a moment for a call?

James Scott (jamesscott@shield-legal.com)
2025-01-22 11:57:21

yup!

Josh Josue (jjosue@shield-legal.com)
2025-01-22 13:19:33

ok so i've verified that BOTH 2017 and 2024 have a column for "case_date" in their data frame

Josh Josue (jjosue@shield-legal.com)
2025-01-22 13:21:11

im gonna continue to debug the script as a whole

Josh Josue (jjosue@shield-legal.com)
2025-01-22 13:32:58

ok so I traced something odd on the filtering line df_csv = df_csv[df_csv['year'].isin(['2017', '2024'])] Before this line, df_csv has 1k entries, after the filter it's 0, which is odd bcuz I verified that the csv on the Storage bucket does in fact have 2017 and 2024 (among others) on the "year" column
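
One possible cause, shown as a sketch: if `read_csv` inferred the "year" column as integers, comparing it against the strings '2017' and '2024' matches nothing. This is an assumption about the bug, not something confirmed in the thread, and the sample CSV below is made up.

```python
import io

import pandas as pd

csv_text = "year,url\n2017,a.json\n2024,b.json\n2015,c.json\n"
df_csv = pd.read_csv(io.StringIO(csv_text))  # `year` is inferred as an integer column

# Integers never equal strings, so this filter returns zero rows.
empty = df_csv[df_csv["year"].isin(["2017", "2024"])]

# Option A: compare against integers.
by_int = df_csv[df_csv["year"].isin([2017, 2024])]

# Option B: normalize the column to strings before filtering.
by_str = df_csv[df_csv["year"].astype(str).isin(["2017", "2024"])]
```

Checking `df_csv["year"].dtype` right before the filter is a quick way to confirm or rule this out.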

James Scott (jamesscott@shield-legal.com)
2025-01-22 13:33:39

Maybe take that out and run since we doing them all anyway it could be an error in format

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-01-22 13:35:39

Ok i've started the notebook and it's going 🤞hopefully it updates the bigquery table

Meanwhile, I'll proceed with this ChromeWebDriver issue

James Scott (jamesscott@shield-legal.com)
2025-01-22 14:48:03

Awesome, how’s the issue looking

James Scott (jamesscott@shield-legal.com)
2025-01-22 14:48:10

Did it update the table

Josh Josue (jjosue@shield-legal.com)
2025-01-22 14:53:25

Currently downloading jsons for 2013s so not yet

Josh Josue (jjosue@shield-legal.com)
2025-01-22 14:54:55

So I’m trying to save the company money in deploying to Google’s version of AWS Lambda

But it seems I’m going to be forced to deploy onto GCE due to Function’s lack of support for ChromeWebdriver

Josh Josue (jjosue@shield-legal.com)
2025-01-22 14:56:27

The docker image for scraping runs fine on Docker Desktop though

Josh Josue (jjosue@shield-legal.com)
2025-01-22 15:39:07

the workbench notebook went idle at 456/1564 (2015 Q2 dataset)

So i pivoted - I modified my code to append 2024 datasets onto BigQuery

James Scott (jamesscott@shield-legal.com)
2025-01-22 16:46:05

oops sorry just seeing this

James Scott (jamesscott@shield-legal.com)
2025-01-22 16:46:55

does it go idle from just the jupyter notebook instance? like if u run the code there? i know these can run for hours in that, but confused as to why its idling in docker

Josh Josue (jjosue@shield-legal.com)
2025-01-22 16:47:37

not the docker, the jupyter workbench script that i ran earlier - it just stopped outta nowhere

Josh Josue (jjosue@shield-legal.com)
2025-01-22 16:48:16

But im working on a solution to prioritize getting you an updated big query table

Josh Josue (jjosue@shield-legal.com)
2025-01-22 16:58:11

Ok so I've succeeded in appending 2024 Q1 to dev-adverse_events_copy

WHEW! that took 30 minutes to run

Loaded 37821879 rows into ai-projects-406720.drug_model.dev-adverse_events_copy.

Josh Josue (jjosue@shield-legal.com)
2025-01-22 16:58:24

i'll proceed to upload 2024 Q2 and Q3

Josh Josue (jjosue@shield-legal.com)
2025-01-22 17:04:14

Oh also, heads up, on Monday 27th, I have to be at DMV in the morning

James Scott (jamesscott@shield-legal.com)
2025-01-22 18:02:55

i think i got through the error. i created a chunking script that chunks the data by 100 and then appends. its running now on like 300 so we will see what happens. this way it removes the idle thing i think
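
A minimal sketch of chunk-then-append, assuming the work is a list of items (for example JSON links) and that appending is done by some callable. `chunked` and `append_in_chunks` are hypothetical names, not the ones in the actual notebook.

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def append_in_chunks(items, append, size=100):
    """Call `append` once per chunk and return how many items were handed over."""
    total = 0
    for chunk in chunked(items, size):
        append(chunk)
        total += len(chunk)
    return total


# Stand-in for the real append step so the sketch runs locally.
batches = []
appended = append_in_chunks(list(range(250)), batches.append)
```

Appending per chunk means a stalled run loses at most one chunk of work instead of the whole download.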

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-01-22 18:26:05

Nicee

My script finished appending 2024Q1 to the bigquery dev table earlier

Working on 2024Q2 rn

James Scott (jamesscott@shield-legal.com)
2025-01-22 18:35:45

i am rolling up on 600 out of 1500, so when this gets done the full 2004 to 2024 should be appended to big query. so you would just need to productionize this script to append future data that's not there yet (2024 onward) to that bigquery table

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-01-22 18:38:46

Ok so the script on Jupyter notebook is the one containing your chunk changes?

Josh Josue (jjosue@shield-legal.com)
2025-01-22 18:40:43

Tomorrow I’m going to look into selenium alternatives (like beautiful soup) that would be compatible with Google Function Run

James Scott (jamesscott@shield-legal.com)
2025-01-22 18:43:00

yes, its called untitled now. what is selenium going to be used for?

Josh Josue (jjosue@shield-legal.com)
2025-01-22 18:43:41

That was the library that scraped our json links

Josh Josue (jjosue@shield-legal.com)
2025-01-22 18:45:02

But my deployments to Google Run havent been successful due to Google Run not being compatible with this method (they have yet to implement the feature)

James Scott (jamesscott@shield-legal.com)
2025-01-22 18:46:37

with this new dataset or script being done, would the worlflows work now that its just or should need to append new data to the table which should take a lot of memory or time at most it would pr only 20-30 new jsons

Josh Josue (jjosue@shield-legal.com)
2025-01-22 18:53:51

Sorry that was a bit difficult for me to understand. The workflow is only a fraction of the pipeline (at that step, the json links have presumably been scraped)

Josh Josue (jjosue@shield-legal.com)
2025-01-22 18:54:36

But with our new appending technique, it shouldnt take as long as downloading all json starting from 2004’s datasets

Josh Josue (jjosue@shield-legal.com)
2025-01-23 11:57:20

G’morning! So I wanted to present the new gameplan of the cronjob

Josh Josue (jjosue@shield-legal.com)
2025-01-23 11:57:54

Just wanted to make sure I had it aligned with our endgoal

James Scott (jamesscott@shield-legal.com)
2025-01-23 11:59:03

Yes, I am almost complete in the code. we can hop in a call in a couple minutes

👍 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-01-23 12:08:26

ready?

Josh Josue (jjosue@shield-legal.com)
2025-01-23 12:12:19

*Thread Reply:* i think i lost you?

Josh Josue (jjosue@shield-legal.com)
2025-01-23 12:12:26

*Thread Reply:* i cant hear u anymore

James Scott (jamesscott@shield-legal.com)
2025-01-23 12:12:32

i am having such bad issues with my network today

Josh Josue (jjosue@shield-legal.com)
2025-01-23 13:18:12

great news! I came up with a fix for the script (it was easier to debug locally 😅)

Josh Josue (jjosue@shield-legal.com)
2025-01-23 13:18:15

Successfully updated adverse_events_prod

James Scott (jamesscott@shield-legal.com)
2025-01-23 13:19:12

Awesome what happened !!

Josh Josue (jjosue@shield-legal.com)
2025-01-23 13:19:47

so there was a discrepancy between how pandas dataframes and pyarrow handle the datatypes

Josh Josue (jjosue@shield-legal.com)
2025-01-23 13:20:23

the fix was to explicitly change all df fields to type str based on the schema used for bigquery
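
A sketch of that kind of fix, assuming the BigQuery schema is a list of `{"name": ..., "type": ...}` dicts (the shape `pandas_gbq` accepts for `table_schema`). The field names and sample values are illustrative, and `cast_to_schema_strings` is a hypothetical helper.

```python
import pandas as pd

schema = [
    {"name": "case_number", "type": "STRING"},
    {"name": "case_date", "type": "STRING"},
]


def cast_to_schema_strings(df, schema):
    """Return a copy of df where every STRING field in the schema is cast to str."""
    out = df.copy()
    for field in schema:
        name = field["name"]
        if field["type"] == "STRING" and name in out.columns:
            out[name] = out[name].astype(str)
    return out


# Mixed int/str values are the kind of column that trips up type inference.
raw = pd.DataFrame({"case_number": [123, 456], "case_date": ["20170101", 20240101]})
fixed = cast_to_schema_strings(raw, schema)
```

Casting up front means the upload no longer depends on whatever type each chunk happened to be inferred as.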

Josh Josue (jjosue@shield-legal.com)
2025-01-23 13:21:00

I added the bug fix into your notebook

James Scott (jamesscott@shield-legal.com)
2025-01-23 13:21:19

Ahh I had that before I shoulda kept it 😡 it’s updated now ?

Josh Josue (jjosue@shield-legal.com)
2025-01-23 13:24:40

yep! it's updated

James Scott (jamesscott@shield-legal.com)
2025-01-23 13:24:52

so this should be able to be recoded into the prod script in the workflow from ur flow chart suggestion and run in the bigquery workflows to just append

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-01-23 13:25:26

Yeah im going to go ahead and incorporate your updated script into my project

Josh Josue (jjosue@shield-legal.com)
2025-01-23 13:25:58

Also, I'm still trying to debug Docker image deployments into Google Run

James Scott (jamesscott@shield-legal.com)
2025-01-23 13:31:15

import pandas_gbq

project_id = 'ai-projects-406720'
pandas_gbq.to_gbq(
    df,
    'drug_model.adverse_events_prod',
    project_id=project_id,
    if_exists='append',  # Change from 'replace' to 'append'
    table_schema=schema
)

👍 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-01-23 13:31:26

this is the code to append the data instead of replace

James Scott (jamesscott@shield-legal.com)
2025-01-23 13:31:41

so thats how we will use this going forward to update the table with new data

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-01-23 13:35:03

awesome! thanks!

Josh Josue (jjosue@shield-legal.com)
2025-01-23 16:52:41

Just finished integrating your new chunk script into the scraper codebase

I'll go back to debugging Docker image deployments on Google Run

James Scott (jamesscott@shield-legal.com)
2025-01-24 06:48:25

hey you might want to double check this when you get in. so I looked at your table, the one you said you got to work, and there are only 518k rows. my original dataset for the years of 2017+ has 37 million

James Scott (jamesscott@shield-legal.com)
2025-01-24 07:36:20

the csv file in the notebook has 67 million lines

Josh Josue (jjosue@shield-legal.com)
2025-01-24 11:37:42

oh no

Josh Josue (jjosue@shield-legal.com)
2025-01-24 11:37:46

you have time for a call?

James Scott (jamesscott@shield-legal.com)
2025-01-24 11:38:04

Give me 5 I think I fixed it

Josh Josue (jjosue@shield-legal.com)
2025-01-24 11:41:57

is it because the adverse_events_database_prod notebook stopped at chunk 14 of 16?

James Scott (jamesscott@shield-legal.com)
2025-01-24 11:45:30

can i call now

Josh Josue (jjosue@shield-legal.com)
2025-01-24 11:45:37

yep!

Josh Josue (jjosue@shield-legal.com)
2025-01-24 12:52:42

the API seems promising - im gonna go ahead and work on refactoring the work to use this new method instead of web scraping

James Scott (jamesscott@shield-legal.com)
2025-01-24 12:54:49

Awesome !

Josh Josue (jjosue@shield-legal.com)
2025-01-24 13:17:26

Another concern of mine was security - if i structure this code to run on workflow notebooks, the gcloud credentials will most likely be embedded within the code

Josh Josue (jjosue@shield-legal.com)
2025-01-24 13:17:34

is that ok?

James Scott (jamesscott@shield-legal.com)
2025-01-24 13:19:49

Can we make environment variables ? From our data there is no package to the outside or external resources it’s all within gcp right so would that matter

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-01-24 13:21:37

ok yea this should be fine, was just checkin

Josh Josue (jjosue@shield-legal.com)
2025-01-24 14:53:44

so i've looked around bigquery's GUI and couldnt find a place to enter env vars

Josh Josue (jjosue@shield-legal.com)
2025-01-24 14:54:11

how do ppl usually do that with Jupyter notebooks? I've always just loaded these values from a .env in python
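
One common pattern, sketched here with only the standard library: keep the key file out of the notebook and read its location from an environment variable. `GOOGLE_APPLICATION_CREDENTIALS` is the variable Google's auth libraries look for; `credentials_path` is a hypothetical helper, and whether the workflow runtime lets you set variables is still the open question above.

```python
import os


def credentials_path(env=None):
    """Return the key-file path from the environment, or fail loudly if unset."""
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    return path


# Passing a dict instead of os.environ keeps the example self-contained.
example_path = credentials_path({"GOOGLE_APPLICATION_CREDENTIALS": "/tmp/key.json"})
```

Failing early with a clear message is easier to debug than an authentication error halfway through a scheduled run.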

James Scott (jamesscott@shield-legal.com)
2025-01-24 15:23:14

yes its usually a folder system

James Scott (jamesscott@shield-legal.com)
2025-01-24 15:23:29

but dont worry about it if u cant apply it

Josh Josue (jjosue@shield-legal.com)
2025-01-27 12:30:24

g'morning - I'm a little confused, should our code be running on BigQuery workflow or GCF (Google Cloud Function)?

Josh Josue (jjosue@shield-legal.com)
2025-01-27 12:31:30

also, I'm currently running tests on the updated workflow version

it will write any error responses with the corresponding attempted url so that I can retrieve any missing data after a cronjob run
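
A small sketch of that error log, assuming each request result is available as a `(url, status)` pair. The URLs are placeholders and `record_failures` is a hypothetical name.

```python
import csv
import io


def record_failures(results, out):
    """Write every non-200 (url, status) pair to `out` as CSV; return the failed URLs."""
    writer = csv.writer(out)
    writer.writerow(["url", "status"])
    failed = [(url, status) for url, status in results if status != 200]
    writer.writerows(failed)
    return [url for url, _ in failed]


results = [
    ("https://example.invalid/a", 200),
    ("https://example.invalid/b", 429),
    ("https://example.invalid/c", 400),
]
log = io.StringIO()  # in a real run this would be a file or a bucket object
to_retry = record_failures(results, log)
```

The returned list can be fed straight back into the downloader for a retry pass.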

James Scott (jamesscott@shield-legal.com)
2025-01-27 12:33:38

Let me clarify with him

James Scott (jamesscott@shield-legal.com)
2025-01-27 12:40:52

Sounds good with everything else !

Josh Josue (jjosue@shield-legal.com)
2025-01-27 12:41:07

so we are proceeding with bigquery workflow?

James Scott (jamesscott@shield-legal.com)
2025-01-27 12:43:45

Yes I told him

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:05:37

I just realized something - is there a reason the case_number field is not a primary key? I thought it could help when running scripts for missing data and preventing duplicates

Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:06:37

Also, is the "file_name" field still relevant since I no longer have that information when I changed the ingestion method to use FDA's api

James Scott (jamesscott@shield-legal.com)
2025-01-27 16:07:48

file_name is not relevant at all! And case_number can have multiple entries, so there is no primary key in the dataset

Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:09:12

ohh gotcha, so what do you think is the best method for ensuring unique entries?

Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:09:39

like let's say, i found out the cronjob got a 400 response and I have to rerun a job for March 2004

James Scott (jamesscott@shield-legal.com)
2025-01-27 16:12:07

With the data completed for 2004-2024 why would we need to do that? And this is the raw data approach right now. We are just ingesting the data as it comes. Feature engineering later when I get to the models is when unique entries are created and feature engineering is done.

Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:12:59

i'm planning for ingestion of future data and failsafes we'll need

James Scott (jamesscott@shield-legal.com)
2025-01-27 16:13:18

Gotcha that makes sense

Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:14:13

how bout if I made a primary key of case_number + substance_name.. would that be unique enough?

Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:14:56

or case_number + application_number

James Scott (jamesscott@shield-legal.com)
2025-01-27 16:15:26

why is there a need for unique id, i am getting confused on that part, a cronjob requires a unique id?

Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:15:54

so i would assume that if you have duplicates of entries, then the AI you're creating would have those entries having more weights or something

Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:16:49

the unique id would help when the script has to retry getting any failed GET requests, it would ensure no duplicate entries go into our BigQuery table
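
To illustrate the idea only: a sketch that drops repeated rows using a composite key. Whether any combination of columns is actually unique in this dataset is not settled in the thread, and the `substance_name` column name is assumed here.

```python
def dedupe(rows, key_fields=("case_number", "substance_name")):
    """Keep the first row seen for each combination of key_fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"case_number": "1001", "substance_name": "A"},
    {"case_number": "1001", "substance_name": "B"},  # same case, different substance
    {"case_number": "1001", "substance_name": "A"},  # repeat from a retried request
]
unique_rows = dedupe(rows)
```

The same case number survives twice here because the key includes the substance, which matches the point that case_number alone can have multiple entries.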

James Scott (jamesscott@shield-legal.com)
2025-01-27 16:17:11

ahah you're thinking like 10 steps ahead lol

👍 Josh Josue
😁 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-01-27 16:17:20

there are no duplicate entries in my dataset

James Scott (jamesscott@shield-legal.com)
2025-01-27 16:17:22

for the model

James Scott (jamesscott@shield-legal.com)
2025-01-27 16:17:30

you're on the right track though

Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:17:38

is that due to a "cleaning" process after the ingestion phase?

James Scott (jamesscott@shield-legal.com)
2025-01-27 16:18:16

yes, like i said there is feature engineering done, this raw data is nothing like the model training

James Scott (jamesscott@shield-legal.com)
2025-01-27 16:18:20

there is about 4-5 other tables

James Scott (jamesscott@shield-legal.com)
2025-01-27 16:18:28

before we get to the model dataset

Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:18:37

oh ok, i wasnt aware of those already existing

Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:19:09

ok so currently, the code is still running on Workflow (which is a good sign) so I'm gonna make a release of this version

Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:19:34

After it's done, I'll verify the big query rows match the csv entries count

James Scott (jamesscott@shield-legal.com)
2025-01-27 16:19:38

and this updates the prod table, correct? i currently have step 2 ready for the workflow job as well

James Scott (jamesscott@shield-legal.com)
2025-01-27 16:20:21

with all the data this has made step 2 have 130million rows

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:20:26

no, this is pointed at a dev-josh table since your prod table should be considered all good and I dont wanna alter that

James Scott (jamesscott@shield-legal.com)
2025-01-27 16:31:27

i dropped file_name from the prod table, u should be able to append them in the future no worries

👍 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-01-27 16:38:09

and i have step 2 notebook ready to be attached to the workflow like the next one

Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:38:50

ok i think im ready to look at that

Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:39:27

here's an example of the scenario that I would come across

Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:40:09

when the script reached 2023-10-02 TO 2023-11-01, FDA will start telling me to try again later with response code 429

James Scott (jamesscott@shield-legal.com)
2025-01-27 16:40:16

is that "too many requests"? what does that mean, is that the connection hit or the data?

Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:40:27

yea too many requests

Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:40:48

and another scenario is when 1 month contains so much data that the script hits the skip limit

Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:41:15

those are the 2 main scenarios where I would need to run the script to retry to get data at a later time
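
A rough sketch of how those two scenarios could be handled, assuming a hypothetical `fetch` callable that returns a status code and a body. The backoff delays and the halving strategy are illustrative choices, not what the script actually does.

```python
import time
from datetime import date, timedelta


def fetch_with_retry(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()` until it stops returning HTTP 429, backing off exponentially.

    `fetch` is a hypothetical callable returning (status_code, body).
    """
    status, body = None, None
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 429:
            break
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return status, body


def split_window(start, end):
    """Halve a date window so each half is less likely to hit the skip limit."""
    if start >= end:
        return [(start, end)]
    middle = start + (end - start) // 2
    return [(start, middle), (middle + timedelta(days=1), end)]


print(split_window(date(2023, 10, 2), date(2023, 11, 1)))
```

The caller decides when to give up: after `max_attempts` the last 429 is returned as-is, so the window can be queued for a later run.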

James Scott (jamesscott@shield-legal.com)
2025-01-27 16:50:41

can you log into monday and see if you see the board, i am getting the tasks together

Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:51:14

yep! i can see em

Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:51:43

you mind if i start creating cards and editing stuff?

James Scott (jamesscott@shield-legal.com)
2025-01-27 16:54:14

i can add in the task and or we can go over it tomorrow

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-01-27 16:54:25

sounds good, thanks!

James Scott (jamesscott@shield-legal.com)
2025-01-28 12:10:10

you should have access to the github now as well

🙏 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-01-31 13:24:28

Hi James! happy friday!

I've finished formalizing the adverse events workflow. Some highlights for v0.6.0:
• conditional for start date used when downloading data (by default it's the current date, otherwise, take the latest case_date from the bigquery table)
• Some unit tests and integration tests added to the project
• Infrastructure naming (env vars) are organized and accessible

I could hop on a call to discuss the next steps I can tackle
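
The start date conditional could look roughly like this. It is a sketch that assumes the latest case_date has already been queried from the table (or is None when the table is empty); the function name is hypothetical.

```python
from datetime import date


def resolve_start_date(latest_case_date=None, today=None):
    """Pick the start date for the download.

    Uses the latest case_date from the table when there is one,
    otherwise falls back to the current date.
    """
    if latest_case_date is not None:
        return latest_case_date
    return today if today is not None else date.today()


print(resolve_start_date(date(2024, 9, 9)))  # 2024-09-09
```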

Josh Josue (jjosue@shield-legal.com)
2025-01-31 13:27:40

I've scheduled a cronjob for Monday morning just to do a full simulation of all the parts being automated

James Scott (jamesscott@shield-legal.com)
2025-01-31 16:50:11

awesome!!! sounds good, lets have a call on monday. enjoy your weekend

Josh Josue (jjosue@shield-legal.com)
2025-01-31 16:55:31

Thanks! Have a great weekend!

Josh Josue (jjosue@shield-legal.com)
2025-02-03 11:59:43

Good morning! The scheduled workflow on Big Query ran successfully this morning v0.6.0

it gathered data up to 09/09/2024 and stored it into an integration test table

James Scott (jamesscott@shield-legal.com)
2025-02-03 12:00:01

oh awesome!!

Josh Josue (jjosue@shield-legal.com)
2025-02-03 12:01:09

What are the next steps? the ICD10 scripts?

James Scott (jamesscott@shield-legal.com)
2025-02-03 12:01:48

so that created the adverse events table, the next would be to incorporate the icd10 logic

James Scott (jamesscott@shield-legal.com)
2025-02-03 12:01:50

let me link u

James Scott (jamesscott@shield-legal.com)
2025-02-03 12:02:53

this is step 2

Josh Josue (jjosue@shield-legal.com)
2025-02-03 12:04:44

if i close all the other tabs open, will it affect your console?

James Scott (jamesscott@shield-legal.com)
2025-02-03 12:05:07

yes i want to put this in vscode jupyter notebook

👍 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-02-03 12:05:11

i dont think it will affect me no

James Scott (jamesscott@shield-legal.com)
2025-02-03 12:05:13

u can close them

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-02-06 12:49:13

Hi James - when you get a chance, could i plz have the link for step 3?

Josh Josue (jjosue@shield-legal.com)
2025-02-06 12:52:58

also I've created a solution for unifying the notebooks with shared env vars

James Scott (jamesscott@shield-legal.com)
2025-02-06 12:53:52

Ok I can give u the first couple of parts, but for steps 1-2 I also wanted to talk about adding in the logs for each step. I can show u what I mean when I get back, im at an appointment

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-02-06 12:53:57

it comes in the form of having a json downloaded from a bucket

the json will contain:
• base names to Big Query tables
• Schemas

James Scott (jamesscott@shield-legal.com)
2025-02-06 12:54:17

That’s fine ! Nice workaround

🤓 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-02-06 12:54:48

And we can put our keys in there?

Josh Josue (jjosue@shield-legal.com)
2025-02-06 12:55:33

the key is only embedded in the notebook code

Josh Josue (jjosue@shield-legal.com)
2025-02-06 12:56:46

The contents of the json is for infrastructure purposes, (like all relevant tables for integration testing vs the prod tables all being used across several notebook scripts)

Each notebook code still needs the api secrets embedded in them bcuz it allows the script to auth Google cloud libraries
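
For illustration, a shared config of that kind might be read like this. The JSON layout and the key names are assumptions; the real file may be organized differently, and downloading it from the bucket is left out here.

```python
import json

# Hypothetical layout for the shared JSON; the real file's keys may differ.
CONFIG_TEXT = """
{
  "tables": {
    "adverse_events": {
      "schema": [
        {"name": "case_number", "type": "STRING"},
        {"name": "case_date", "type": "DATE"}
      ]
    }
  }
}
"""


def load_config(text):
    """Parse the shared infrastructure config."""
    return json.loads(text)


def table_schema(config, base_name):
    """Return the list of column definitions for one base table name."""
    return config["tables"][base_name]["schema"]


config = load_config(CONFIG_TEXT)
columns = [column["name"] for column in table_schema(config, "adverse_events")]
print(columns)  # ['case_number', 'case_date']
```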

James Scott (jamesscott@shield-legal.com)
2025-02-06 13:41:07

got a second

Josh Josue (jjosue@shield-legal.com)
2025-02-06 14:19:02

Sorry just saw this

Josh Josue (jjosue@shield-legal.com)
2025-02-06 14:19:20

My laptop lost internet connection at the office

Josh Josue (jjosue@shield-legal.com)
2025-02-06 14:19:33

But yea i could talk

James Scott (jamesscott@shield-legal.com)
2025-02-06 14:19:58

i can hop on super quick but i wont stay on long

Josh Josue (jjosue@shield-legal.com)
2025-02-06 14:20:33

Sure just gotta finish ordering my food

Josh Josue (jjosue@shield-legal.com)
2025-02-07 12:34:18

I've completed the logger util

But as I finished it, I foresee some technical debt in updating N number of notebooks using these utility modules. So I'm going to pause refactoring Step3 and implement a solution for each notebook to obtain these util modules

James Scott (jamesscott@shield-legal.com)
2025-02-07 12:36:55

Yes the logging for each step would be different is that what your saying

Josh Josue (jjosue@shield-legal.com)
2025-02-07 12:37:56

not quite, I'm saying all of the steps so far (adverse and icd10) use GoogleCloudUtil that I wrote, and eventually can integrate the LoggerUtil i just made. And so will future steps

Josh Josue (jjosue@shield-legal.com)
2025-02-07 12:38:11

and right now, I have to copy paste these Utils over and over into each notebook

Josh Josue (jjosue@shield-legal.com)
2025-02-07 12:38:22

that's error prone and time consuming

Josh Josue (jjosue@shield-legal.com)
2025-02-07 12:38:35

so I have to create an automation solution

James Scott (jamesscott@shield-legal.com)
2025-02-07 12:39:00

Does google cloud utility provide logging similar to logging utility or am I confused haha

Josh Josue (jjosue@shield-legal.com)
2025-02-07 12:39:28

The google cloud utility is what allows the scripts to auth with Google's API, upload/download stuff to buckets and bigquery

Josh Josue (jjosue@shield-legal.com)
2025-02-07 12:39:59

it also handles ensuring that if a table doesnt exist, it will be created for upsert operations

James Scott (jamesscott@shield-legal.com)
2025-02-07 12:44:03

Ok gotcha

Josh Josue (jjosue@shield-legal.com)
2025-02-07 12:44:33

the number of notebooks is growing and I have to keep up with the changes

Josh Josue (jjosue@shield-legal.com)
2025-02-07 15:42:00

If Step 1 was ingestion of adverse_events and Step 2 was icd10 table creation how would you describe Step 3?

James Scott (jamesscott@shield-legal.com)
2025-02-07 15:43:44

it would be drug label info or extraction

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-02-07 15:45:41

Also the solution I implemented for the modules is:
• bash script downloads all git repo releases of utility modules (GoogleCloudUtil, LoggerUtil, etc)
  ◦ Uploads them to Google Bucket
• Each notebook pulls down the modules and installs dependencies

This will ensure the versioning and unify the modules being reused by our notebook scripts at each step
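
The notebook side of that flow could be sketched as below, assuming the release archive has already been pulled from the bucket to local disk. The archive name and `demo_logger_util` are made up for the demo, and dependency installation is not shown.

```python
import importlib
import sys
import tempfile
import zipfile
from pathlib import Path


def install_modules(archive_path, target_dir):
    """Unpack a release archive and make its modules importable."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    if str(target) not in sys.path:
        sys.path.insert(0, str(target))
    importlib.invalidate_caches()


# Demo with a stand-in archive; `demo_logger_util` is a made-up module name.
workdir = Path(tempfile.mkdtemp())
archive_path = workdir / "utils-v0.1.0.zip"
with zipfile.ZipFile(archive_path, "w") as archive:
    archive.writestr("demo_logger_util.py", "VERSION = '0.1.0'\n")

install_modules(archive_path, workdir / "modules")
demo_logger_util = importlib.import_module("demo_logger_util")
print(demo_logger_util.VERSION)  # 0.1.0
```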

James Scott (jamesscott@shield-legal.com)
2025-02-07 15:51:06

You the best!!

🤓 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-02-07 15:51:17

Lets go over it Monday

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-02-10 14:03:44

you have a moment? - I can present the current state of the Workflow

James Scott (jamesscott@shield-legal.com)
2025-02-10 15:03:00

lets do it tomorrow during our standup call

👍 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-02-11 10:27:03

lets have our meeting closer to 1 today

James Scott (jamesscott@shield-legal.com)
2025-02-11 10:27:06

i will join then

Josh Josue (jjosue@shield-legal.com)
2025-02-11 10:27:31

Are we skipping the 9am meeting?

Josh Josue (jjosue@shield-legal.com)
2025-02-11 10:27:45

Also did you mean 1 pst?

James Scott (jamesscott@shield-legal.com)
2025-02-11 10:28:02

oh no 10 am instead

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-02-11 12:02:31

just a heads up- i think my internet at the office is being weird

Josh Josue (jjosue@shield-legal.com)
2025-02-11 12:06:55

are we huddling in here?

James Scott (jamesscott@shield-legal.com)
2025-02-11 12:07:56

yes i have a couple of minutes for a quick call

James Scott (jamesscott@shield-legal.com)
2025-02-11 12:24:48

pip install google-cloud-secret-manager

James Scott (jamesscott@shield-legal.com)
2025-02-11 12:24:57

from google.cloud import secretmanager


def access_secret(project_id: str, secret_id: str, version_id: str = "latest") -> str:
    """
    Accesses the specified secret version in Google Cloud Secret Manager.

    Args:
        project_id (str): GCP project ID.
        secret_id (str): Name of the secret.
        version_id (str, optional): Secret version (default: "latest").

    Returns:
        str: The secret value.
    """
    # Create the Secret Manager client (uses Application Default Credentials)
    client = secretmanager.SecretManagerServiceClient()

    # Build the resource name
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version_id}"

    # Access the secret version
    response = client.access_secret_version(name=name)

    # Return the secret payload as a string
    return response.payload.data.decode("UTF-8")


# Example usage
project_id = "your-gcp-project-id"
secret_id = "your-secret-name"
secret_value = access_secret(project_id, secret_id)
print("Secret Value:", secret_value)

James Scott (jamesscott@shield-legal.com)
2025-02-11 12:25:13

can you test to see if u need auth to run this inside that environment?

Josh Josue (jjosue@shield-legal.com)
2025-02-11 12:25:35

I've tested that method but yea i can double check

Josh Josue (jjosue@shield-legal.com)
2025-02-11 12:31:58

yea confirmed, that wouldn't work

this line requires the google cloud credentials from the JSON we spoke about: client = secretmanager.SecretManagerServiceClient()

Josh Josue (jjosue@shield-legal.com)
2025-02-11 12:33:13

also, if it did work, then anybody in the world could get anybody else's key(s) just by knowing their key id and project id

Josh Josue (jjosue@shield-legal.com)
2025-02-11 16:37:56

I noticed that the drug info doesnt have a schema - is that right?

James Scott (jamesscott@shield-legal.com)
2025-02-12 05:41:43

yes it should, also i have corrected step 5 so all the way to step 9 is done

👍 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-02-12 13:36:15

once u get done productionalizing the code (steps 1-9 or 10), we have a task we need to switch to and do something cooler: add a piece to the pipeline like i had to do previously. the new one is to extract drug information from studies around the world

Josh Josue (jjosue@shield-legal.com)
2025-02-12 13:37:39

ok yea for sure!

Josh Josue (jjosue@shield-legal.com)
2025-02-12 13:37:46

how'd your presentations go yesterday?

Josh Josue (jjosue@shield-legal.com)
2025-02-12 13:38:09

(I'm currently running integration tests on step3 - if all is good, i'll move to step 4)

James Scott (jamesscott@shield-legal.com)
2025-02-12 13:38:26

That was the result of it lol. we met with some guys who look for these drugs, and we have to take that and incorporate what they do in this process

👍 Josh Josue
😆 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-02-13 15:39:02

Hi james, on step 4 there's this if statement that i just wanna make sure is still relevant

Josh Josue (jjosue@shield-legal.com)
2025-02-13 15:40:42

scenarioA: if file exists in bucket, skip it
scenarioB: it doesnt matter if file is already in bucket, grab its pdf and overwrite existing file in bucket

Josh Josue (jjosue@shield-legal.com)
2025-02-13 15:40:58

Which scenario am I going with?

James Scott (jamesscott@shield-legal.com)
2025-02-13 15:41:31

Scenario A

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-02-14 11:39:45

G'morning James! So you know how the logs are being uploaded onto Big Query? Did you want all the steps to share the same table or shall I separate them by step?

James Scott (jamesscott@shield-legal.com)
2025-02-14 11:48:26

Hmm what do u think

Josh Josue (jjosue@shield-legal.com)
2025-02-14 11:49:56

So I'm thinkin of separating them because if a step failed, we can just look directly for the table confined for that step

Josh Josue (jjosue@shield-legal.com)
2025-02-14 11:50:44

the tables are prepended with the infra property (e.g. "integ_test", "dev", "prod") so they'd be grouped up any way as big query shows them in alphabetical order
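
As a small illustration of that naming scheme, assuming a hypothetical `<env>_<step>_logs` pattern (the actual table names are not shown in this thread):

```python
ENVIRONMENTS = ("integ_test", "dev", "prod")


def log_table_name(env, step):
    """Compose a per-step log table name, e.g. 'dev_step3_logs' (naming is hypothetical)."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env}")
    return f"{env}_{step}_logs"


names = [log_table_name(env, step) for step in ("step1", "step2") for env in ("prod", "dev")]
print(sorted(names))
# ['dev_step1_logs', 'dev_step2_logs', 'prod_step1_logs', 'prod_step2_logs']
```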

Josh Josue (jjosue@shield-legal.com)
2025-02-14 12:04:19

i hope that's cool with you?

James Scott (jamesscott@shield-legal.com)
2025-02-14 12:04:41

Yes that’s what I was going for as well separate them is fine with me

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-02-17 14:06:12

Hi James! got a question about Step5 (plz see image)

DrugSummary.warningsprecautions_ seems to start with a string "This is a list..." and then is overwritten by extractcategory()_ without being used. It seems the initial string value didnt matter at all amirite?

James Scott (jamesscott@shield-legal.com)
2025-02-17 14:07:30

that is not a string value, that is a definition of the column of a pydantic model

James Scott (jamesscott@shield-legal.com)
2025-02-17 14:07:45

well it is string

James Scott (jamesscott@shield-legal.com)
2025-02-17 14:07:49

but its a definition

👍 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-02-18 10:50:48

is there anything you need to show or look at for this meeting? if not we can cancel and work

Josh Josue (jjosue@shield-legal.com)
2025-02-18 10:51:21

Nothing new to demo yet - but I’m close to finishing step 5

James Scott (jamesscott@shield-legal.com)
2025-02-18 10:51:40

ok awesome! we can skip this call how was the weekend

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-02-18 10:52:19

It was pretty fun! Friends and i are hooked on that Marvel Rivals game 👾 😆

How was yours?

James Scott (jamesscott@shield-legal.com)
2025-02-18 10:52:40

!man i play that everyday!! whats your rank?

🔥 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-02-18 10:52:43

pc? or console?

Josh Josue (jjosue@shield-legal.com)
2025-02-18 10:53:31

Noice!! I’m a lowly silver 1 😅 but I’m tryna catch up

I play on ps

Josh Josue (jjosue@shield-legal.com)
2025-02-18 10:55:15

Who do you main as?

James Scott (jamesscott@shield-legal.com)
2025-02-18 11:07:13

Support rocket lol I’m plat 3

🔥 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-02-18 11:07:23

Ewww PlayStation lol

😆 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-02-18 11:07:29

We can deff play

James Scott (jamesscott@shield-legal.com)
2025-02-18 11:07:31

In pc

Josh Josue (jjosue@shield-legal.com)
2025-02-18 11:08:09

Oh dayum!! You’re prolly a great healer haha

James Scott (jamesscott@shield-legal.com)
2025-02-18 11:27:14

hahaha i be doing ok my team mates just suck i would be higher haha

Josh Josue (jjosue@shield-legal.com)
2025-02-18 14:43:32

for step 5 optimization, is it ok if I retrieved all entries in the drugsummaries big query table (if any) and then filter out the brandnames with existing entries? This could cut down on the bedrock_api calls
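
A sketch of that filter, assuming the existing brand names have already been fetched from the table into a list. The function name and the case-insensitive matching are assumptions.

```python
def brands_to_process(candidates, existing):
    """Return candidate brand names that have no entry yet, preserving order."""
    already_done = {name.strip().lower() for name in existing}
    queued = set()
    remaining = []
    for name in candidates:
        key = name.strip().lower()
        if key not in already_done and key not in queued:
            queued.add(key)
            remaining.append(name)
    return remaining


existing_brands = ["Aspirin", "Tylenol"]
incoming = ["ASPIRIN", "Advil", "advil", "Zyrtec"]
print(brands_to_process(incoming, existing_brands))  # ['Advil', 'Zyrtec']
```

Only the names that come back from this step would need a call to the downstream API.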

James Scott (jamesscott@shield-legal.com)
2025-02-18 15:38:21

sure ! thats fine

James Scott (jamesscott@shield-legal.com)
2025-02-18 15:55:06

also let me know when you want to play it is crossplay

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-02-19 14:11:13

Heya James! I cant seem to see the code for step 7. I'm looking at production.ipynb

Josh Josue (jjosue@shield-legal.com)
2025-02-19 14:11:16
James Scott (jamesscott@shield-legal.com)
2025-02-19 15:53:33

yes you can skip that one

James Scott (jamesscott@shield-legal.com)
2025-02-20 06:18:55

did you get to the ranking model yet?

Josh Josue (jjosue@shield-legal.com)
2025-02-20 12:05:29

Which step is that?

Josh Josue (jjosue@shield-legal.com)
2025-02-20 12:05:39

I’m currently on retained and removed labels step 6

James Scott (jamesscott@shield-legal.com)
2025-02-20 12:35:10

Gotcha ok we might have to do another step in there from my talks yesterday

Josh Josue (jjosue@shield-legal.com)
2025-02-20 12:35:37

gotcha - so a step in step 6 OR after step 6?

Josh Josue (jjosue@shield-legal.com)
2025-02-20 12:36:44

I'm also going to add a filtering phase so that step 6 doesnt run on brand_names that already have entries on either the retained/removed tables

Josh Josue (jjosue@shield-legal.com)
2025-02-20 12:37:17

By adding these filtering steps, I found that it significantly cuts down on the entire workflow's runtime (from 2 hrs to 25 min)

James Scott (jamesscott@shield-legal.com)
2025-02-20 12:38:07

After step 6

James Scott (jamesscott@shield-legal.com)
2025-02-20 12:38:19

That’s good !

James Scott (jamesscott@shield-legal.com)
2025-02-20 12:55:15

Basically it’s kinda having to redo the back end of the process a little bit. We need to target drugs that don’t have cases in them, then do a ranking model for those drugs; the ones that do have cases we can also just show as a separate function / feature I guess. I gotta talk to Ryan about it

Josh Josue (jjosue@shield-legal.com)
2025-02-20 13:28:30

i see... well i just released step 6 - what should I tackle on next?

Josh Josue (jjosue@shield-legal.com)
2025-02-20 13:28:59

i'm lookin at step 8 - is this where the aforementioned alterations will go?

Josh Josue (jjosue@shield-legal.com)
2025-02-21 11:53:32

G'morning! So I'm currently testing Step 8 - would you be free to do a call on those changes you mentioned?

Josh Josue (jjosue@shield-legal.com)
2025-02-21 13:25:40

After this aggregation part, you want me to filter out rows that have number_of_cases greater than 0?

# Aggregation
result = df_adverse_financial[df_adverse_financial['earnings'].notnull()].groupby(
    ['manufacturer_name', 'brand_name', 'activesubstancename', 'case_year']
).agg(
    number_of_cases=('case_number', 'nunique'),

James Scott (jamesscott@shield-legal.com)
2025-02-21 13:29:58

Hmmm I would hold off on number 8 actually, this is where we need to have a call. the ranking model doesn’t need to be completed, but we need to get the number of open cases from case text and only filter the ones with 0 open cases, because I need to use this list to extract clinical trial data and make sense of it all

Josh Josue (jjosue@shield-legal.com)
2025-02-21 13:30:37

ok, im free to huddle rn

Josh Josue (jjosue@shield-legal.com)
2025-02-21 13:30:49

or did you wanna huddle another time?

James Scott (jamesscott@shield-legal.com)
2025-02-21 13:31:01

Give me about 10 minutes

Josh Josue (jjosue@shield-legal.com)
2025-02-21 13:31:19

aight cool

James Scott (jamesscott@shield-legal.com)
2025-02-21 13:43:58

~?ready~

James Scott (jamesscott@shield-legal.com)
2025-02-21 13:44:04

ready

Josh Josue (jjosue@shield-legal.com)
2025-02-21 13:44:08

yep!

Josh Josue (jjosue@shield-legal.com)
2025-02-24 14:10:50

hi James! So I'm currently debugging why the chunk method of upserting to a table fails for Step 8's result: it's due to new fields that come up every so often

I've tried to download the schema of your table adverse_events_icd_metrics but it seems that's missing the fields too. Do you know what the expected columns are supposed to be?

James Scott (jamesscott@shield-legal.com)
2025-02-24 14:13:24

if you're trying to create drug_model.adverse_events_icd_metrics_retained_labels_prod, correct? remember we removed the financial data column, so with that, u just have to delete the output table of step 8 so it creates a new one with the exact columns

Josh Josue (jjosue@shield-legal.com)
2025-02-24 14:15:00

does the production.ipynb reflect that change? because i still see the df_adverse_financial on there

James Scott (jamesscott@shield-legal.com)
2025-02-24 14:15:31

the production.ipynb reflects it, but im not sure if ur prod code in workflow does for the previous step

James Scott (jamesscott@shield-legal.com)
2025-02-24 14:15:43

basically

James Scott (jamesscott@shield-legal.com)
2025-02-24 14:15:52

if the table exist already, it needs to have the same column names

James Scott (jamesscott@shield-legal.com)
2025-02-24 14:16:11

so us removing it adjust that, so u have to delete the table in big query so it creates a new one with the new column changes

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-02-24 14:16:17

the table existing wont be an issue - i delete it after each test

James Scott (jamesscott@shield-legal.com)
2025-02-24 14:19:53

ok let me know, hopefully that helped

James Scott (jamesscott@shield-legal.com)
2025-02-24 14:19:53

# Queries

query_metrics = """
    SELECT a.*
    FROM drug_model.adverse_events_icd_prod a
    INNER JOIN drug_model.adverse_events_retained_labels_prod b
    ON a.brand_name = b.brand_name AND a.reactionmeddrapt = b.reactionmeddrapt
"""
James Scott (jamesscott@shield-legal.com)
2025-02-24 14:20:04

this is the only place u are fetching data from in step 8

James Scott (jamesscott@shield-legal.com)
2025-02-24 14:20:15

and this table should have those removed columns

Josh Josue (jjosue@shield-legal.com)
2025-02-24 14:46:49

Ok I’ll double check the table being used in my query

It’s just that i remember we deleted some code for step 8 on friday, but cant remember exactly what it was

James Scott (jamesscott@shield-legal.com)
2025-02-24 17:43:38

that was on step 6 i believe, it was the financial data i was saying

Josh Josue (jjosue@shield-legal.com)
2025-02-24 17:50:46

Oh ok thanks

Josh Josue (jjosue@shield-legal.com)
2025-02-24 17:51:15

I was able to compile a complete schema and got a solution to work. Gonna formalize it and then run it thru another test
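
One way such a complete schema could be compiled, sketched under the assumption that each chunk is a list of row dictionaries. This is not the actual solution, just the general shape: collect every field seen across chunks, then pad each row so all chunks share the same columns.

```python
def compile_columns(chunks):
    """Collect every field name seen across all chunks, in first-seen order."""
    columns = []
    seen = set()
    for chunk in chunks:
        for row in chunk:
            for name in row:
                if name not in seen:
                    seen.add(name)
                    columns.append(name)
    return columns


def normalize(chunk, columns):
    """Give every row the full column set, filling absent fields with None."""
    return [{name: row.get(name) for name in columns} for row in chunk]


chunks = [
    [{"brand_name": "A", "number_of_cases": 3}],
    [{"brand_name": "B", "number_of_cases": 1, "case_year": 2021}],  # new field appears
]
columns = compile_columns(chunks)
print(columns)  # ['brand_name', 'number_of_cases', 'case_year']
print(normalize(chunks[0], columns))
# [{'brand_name': 'A', 'number_of_cases': 3, 'case_year': None}]
```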

James Scott (jamesscott@shield-legal.com)
2025-02-25 09:58:34

ryan is joining this ai spring meeting

James Scott (jamesscott@shield-legal.com)
2025-02-25 11:07:00

take the time and show him some of the stuff u been working on

Josh Josue (jjosue@shield-legal.com)
2025-02-25 11:07:35

Will do

Josh Josue (jjosue@shield-legal.com)
2025-02-27 12:53:36

On the case_text script, the first frame does a query for df_ranking, but it doesn't seem to be referenced at all in the 2nd frame, where the goal is to get brand_name, case_url_link, case_text

Josh Josue (jjosue@shield-legal.com)
2025-02-27 12:53:58

Is the first frame relevant at all for step 9?

James Scott (jamesscott@shield-legal.com)
2025-02-27 12:55:50

can you explain what you mean by first frame

Josh Josue (jjosue@shield-legal.com)
2025-02-27 12:56:17

i thought that's what they're called on a notebook - im referring to the first giant text box

Josh Josue (jjosue@shield-legal.com)
2025-02-27 12:56:33
Josh Josue (jjosue@shield-legal.com)
2025-02-27 12:56:49

starts with that code ^'

Josh Josue (jjosue@shield-legal.com)
2025-02-27 12:57:10

i'm on casetext.ipynb

James Scott (jamesscott@shield-legal.com)
2025-02-27 13:34:08

oh you're working on casetext now

Josh Josue (jjosue@shield-legal.com)
2025-02-27 13:34:37

yes, from what i understand - it happens before step 9

Josh Josue (jjosue@shield-legal.com)
2025-02-27 13:34:58

because it's supposed to help me weed out brand_names with cases - right?

James Scott (jamesscott@shield-legal.com)
2025-02-28 05:27:45

yes !

James Scott (jamesscott@shield-legal.com)
2025-02-28 14:16:37

u doing ok

Josh Josue (jjosue@shield-legal.com)
2025-02-28 14:42:57

yea, i think im on the right path

Josh Josue (jjosue@shield-legal.com)
2025-02-28 14:43:42

i've documented my step9 gameplan on Miro

Josh Josue (jjosue@shield-legal.com)
2025-02-28 14:44:04

plz confirm if i understood the overview correctly

James Scott (jamesscott@shield-legal.com)
2025-03-03 10:59:37

Looks good with me !

James Scott (jamesscott@shield-legal.com)
2025-03-03 11:01:53

Only update I have is using the exact table names but other than that the process looks good to me

Josh Josue (jjosue@shield-legal.com)
2025-03-03 11:09:28

Ok gotcha, I’ll update the table names!

(Step 9 is still undergoing integration tests)

James Scott (jamesscott@shield-legal.com)
2025-03-03 12:43:50

awesome!

James Scott (jamesscott@shield-legal.com)
2025-03-03 14:38:04

once this is done i can run my script and work on the case details, i think if you want we can talk about the next part which is more modeling/llm focused if you wanna work on that, no rush for this part

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-03-04 12:47:41

Status Update: I've implemented a chunkified version for case_text (there's sooo many case urls lol) and I'm exceeding quota for Big Query. My solution is to increase the chunky size
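
To show why a larger chunk size helps, here is a sketch assuming the quota being hit counts write operations against the table (the exact quota is not stated in this thread). Fewer, larger chunks mean fewer writes for the same number of rows.

```python
import math
from itertools import islice


def chunked(items, size):
    """Yield lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def write_count(total_rows, chunk_size):
    """Number of writes needed to upload `total_rows` in chunks."""
    return math.ceil(total_rows / chunk_size)


print(list(chunked(range(5), 2)))  # [[0, 1], [2, 3], [4]]
print(write_count(10_000, 100), write_count(10_000, 2_500))  # 100 4
```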

James Scott (jamesscott@shield-legal.com)
2025-03-04 13:08:35

oh really ok lol i didnt know there was a quota for that, i think we can increase it

James Scott (jamesscott@shield-legal.com)
2025-03-04 14:09:01

How familiar are u with setting of vs code

James Scott (jamesscott@shield-legal.com)
2025-03-04 14:09:06

Up vs code

Josh Josue (jjosue@shield-legal.com)
2025-03-04 14:12:03

yea a lil bit

Josh Josue (jjosue@shield-legal.com)
2025-03-04 14:12:31

what specifically are you trying to setup in vs code?

James Scott (jamesscott@shield-legal.com)
2025-03-04 14:14:35

Instead of Jupyter notebooks I want to be able to have it in VS Code and leverage git more

Josh Josue (jjosue@shield-legal.com)
2025-03-04 14:14:42

ctrl + shift + X opens the panel with extensions

Josh Josue (jjosue@shield-legal.com)
2025-03-04 14:15:15

there's jupyter notebook support made by microsoft

Josh Josue (jjosue@shield-legal.com)
2025-03-04 14:20:23

what's the scope of this setup? are we also taking into account setting up a virtualenv for the python project?

James Scott (jamesscott@shield-legal.com)
2025-03-04 14:22:20

Really? I just want to use the VS Code interface for the projects and future projects instead of Jupyter notebooks, and leverage Copilot. The notebook environment is getting frustrating to me

Josh Josue (jjosue@shield-legal.com)
2025-03-04 14:22:55

i see

Josh Josue (jjosue@shield-legal.com)
2025-03-04 14:23:13

i just searched, they also have github copilot as a vs code extension

James Scott (jamesscott@shield-legal.com)
2025-03-04 14:24:15

Yes let’s do it

Josh Josue (jjosue@shield-legal.com)
2025-03-04 14:24:35

cool! do we gotta huddle or somethin?

James Scott (jamesscott@shield-legal.com)
2025-03-04 14:28:09

No lol it doesn’t have to be right away if u have free time from the other task then we can switch but the key is also maybe to leverage the run times from gcp

Josh Josue (jjosue@shield-legal.com)
2025-03-04 14:29:15

oh lol

Josh Josue (jjosue@shield-legal.com)
2025-03-04 14:29:17

ok gotcha

Josh Josue (jjosue@shield-legal.com)
2025-03-04 14:29:46

yea we could def set that up - i'd be up to huddle on it tomorrow around 10AM PST?

Josh Josue (jjosue@shield-legal.com)
2025-03-04 14:30:40

i def have been running each step locally first and see how long they take (so as to not run up the gcp bill for compute time on just tests)

James Scott (jamesscott@shield-legal.com)
2025-03-04 14:44:04

That’s fine!

James Scott (jamesscott@shield-legal.com)
2025-03-04 14:54:25

Yes I just want to make changes and push to git instead of Jupyter stuff, be more software engineer like and have a process for it

Josh Josue (jjosue@shield-legal.com)
2025-03-05 12:01:03

g'morning! did you still wanna do that huddle for vs code setup?

James Scott (jamesscott@shield-legal.com)
2025-03-05 12:14:50

yes got a couple minutes?

James Scott (jamesscott@shield-legal.com)
2025-03-06 06:23:37

quick question is the ranking model done?

Josh Josue (jjosue@shield-legal.com)
2025-03-06 10:37:07

I had my machine running since 2:30 yesterday and I’m not sure how long these ranknet trainings take

Josh Josue (jjosue@shield-legal.com)
2025-03-06 10:37:39

But it’s reached epoch 3 and is still going this morning

Josh Josue (jjosue@shield-legal.com)
2025-03-06 10:40:14

What’s the expected value for those losses?

James Scott (jamesscott@shield-legal.com)
2025-03-06 10:40:37

It’s not supposed to be NaN

James Scott (jamesscott@shield-legal.com)
2025-03-06 10:40:41

Hmmm

James Scott (jamesscott@shield-legal.com)
2025-03-06 10:41:10

Mathis is because of formatting issues in the data like it looks like it’s containing infs or nans

James Scott (jamesscott@shield-legal.com)
2025-03-06 10:41:15

Maybe **

Josh Josue (jjosue@shield-legal.com)
2025-03-06 10:42:23

I see

Josh Josue (jjosue@shield-legal.com)
2025-03-06 10:42:35

Well I’ll take a look at the dataset and dataloader

Josh Josue (jjosue@shield-legal.com)
2025-03-06 10:43:04

How long does the training usually take? (So i know when something’s wrong)

James Scott (jamesscott@shield-legal.com)
2025-03-06 11:37:33

Hmmm it can take however long but u did right there shouldn’t be any nan on loss

Josh Josue (jjosue@shield-legal.com)
2025-03-06 11:38:12

Ok gotcha

Josh Josue (jjosue@shield-legal.com)
2025-03-06 11:38:53

I’m running a test on GCP to try and find where the nan is coming from

Hoping to find the source by the time i arrive at the office

James Scott (jamesscott@shield-legal.com)
2025-03-06 11:51:20

Take ur time

Josh Josue (jjosue@shield-legal.com)
2025-03-06 12:35:47

do you have a moment to huddle?

Josh Josue (jjosue@shield-legal.com)
2025-03-06 12:49:27

turns out a bunch of records have <NA> or NaN

Josh Josue (jjosue@shield-legal.com)
2025-03-06 12:49:47

lots of occurrences like this:

Josh Josue (jjosue@shield-legal.com)
2025-03-06 12:50:35

is it a valid solution to set them to 0 ? I ask bcuz making median_patient_age 0 seems odd

James Scott (jamesscott@shield-legal.com)
2025-03-06 12:54:14

yes that is odd, i would run in the prod notebook because that gave the results

James Scott (jamesscott@shield-legal.com)
2025-03-06 12:54:20

and see what the data tables are like in there

James Scott (jamesscott@shield-legal.com)
2025-03-06 12:54:28

i am not sure whats wrong in the pipeline perspective

James Scott (jamesscott@shield-legal.com)
2025-03-06 12:54:36

but you can decode from the working prod notebook

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-03-06 13:02:37

i ran it against your prod table and noticed that median_patient_age is not among the fields, and the only NaN values are in avg patient age and avg weight

So i'm going back to the previous step and make sure to cleanup the data being produced

James Scott (jamesscott@shield-legal.com)
2025-03-06 13:59:24

ok sounds good!

Josh Josue (jjosue@shield-legal.com)
2025-03-06 15:29:07

Success!! i took a small sample of 1000 records and managed to produce a ranking table!

Josh Josue (jjosue@shield-legal.com)
2025-03-06 15:29:37

gonna test it on the entire icd retained label metrics table and have it run on GCP

James Scott (jamesscott@shield-legal.com)
2025-03-06 16:33:12

Awesome !!! Nice nice nice this is great let me know when it’s done

Josh Josue (jjosue@shield-legal.com)
2025-03-07 10:33:19

so yesterday, Big Query inexplicably terminated the test i was running (it was on for 2 hr and 43 min) and that'll be a separate thing I have to look at

Josh Josue (jjosue@shield-legal.com)
2025-03-07 10:35:33

My computer's been running since yesterday. It's working on the training ranknet phase for a total of 172,019,504 DrugDataset entries

currently it's on Epoch 3...

Josh Josue (jjosue@shield-legal.com)
2025-03-07 10:36:03
James Scott (jamesscott@shield-legal.com)
2025-03-07 10:36:32

Huh it deff didn’t take that long, because it’s only ranking, and these models are only on records without a claim in case text, correct?

Josh Josue (jjosue@shield-legal.com)
2025-03-07 10:36:48

yes that's correct

Josh Josue (jjosue@shield-legal.com)
2025-03-07 10:37:04
Josh Josue (jjosue@shield-legal.com)
2025-03-07 10:37:34

it got 19900 entries in total from the icd retained labels table, but after the filtering, got 17729

Josh Josue (jjosue@shield-legal.com)
2025-03-07 10:39:09

also i should note that I've also had to make a set - in order to get rid of brand name duplicates

James Scott (jamesscott@shield-legal.com)
2025-03-07 10:40:08

So how is the model training on 172 million entries when there are only 17729 drugs in there? Are you using the right table for training? This should be the metrics icd10 table. The most we are looking at is, let’s say, even 20 years, so for 1 brand_name the most rows it can or should contain is 20. I don’t think we have 175 million rows

Josh Josue (jjosue@shield-legal.com)
2025-03-07 10:41:27

yeah i found that odd

Josh Josue (jjosue@shield-legal.com)
2025-03-07 10:41:58

it should've been 1:1 right? like if the icd10 had 1 mil entries, the model training entries would be also 1 mil right?

Josh Josue (jjosue@shield-legal.com)
2025-03-07 10:42:59

but yes, i am using the table produced by the previous step containing icd metrics retained labels

Josh Josue (jjosue@shield-legal.com)
2025-03-07 10:43:31

and then i eliminate the ones with cases

Josh Josue (jjosue@shield-legal.com)
2025-03-07 10:43:55

(referencing the case texts entries)

James Scott (jamesscott@shield-legal.com)
2025-03-07 10:44:10

Yes it should be one to one, the number of records ur training on should be the same as from the previous step. I honestly think I need to look at it, because with the removal of records from casetext the training should be easier and faster. Have u run the training from the prod notebook and compared times?

Josh Josue (jjosue@shield-legal.com)
2025-03-07 10:46:14

i havent yet

Josh Josue (jjosue@shield-legal.com)
2025-03-07 10:46:22

i'll go ahead and run that now

James Scott (jamesscott@shield-legal.com)
2025-03-07 10:46:43

Yea man ! Do a compare and contrast of that

Josh Josue (jjosue@shield-legal.com)
2025-03-07 11:45:43

ok, i just confirmed that the diff occurs after DrugDataset is created

Josh Josue (jjosue@shield-legal.com)
2025-03-07 11:45:59
Josh Josue (jjosue@shield-legal.com)
2025-03-07 11:47:32

it seems to be due to the pairs being created - so i'll investigate that

James Scott (jamesscott@shield-legal.com)
2025-03-07 12:03:59

Ok!

James Scott (jamesscott@shield-legal.com)
2025-03-07 12:04:16

Yes I knew that something was fishy

Josh Josue (jjosue@shield-legal.com)
2025-03-07 12:33:26

i have a few questions, a huddle might be ideal if you have time

James Scott (jamesscott@shield-legal.com)
2025-03-07 12:35:20

Ok I’m out at lunch if that’s ok

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-03-07 12:35:23

this nested loop will inevitably result in pairs[] being larger than the df coming in
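
A small sketch of why an all-pairs nested loop outgrows its input. The actual pairing code is not shown in this chat, so `build_pairs` is a hypothetical stand-in that pairs every row with every later row:

```python
from itertools import combinations

def build_pairs(items):
    """All-pairs construction: every item is paired with every later item."""
    pairs = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            pairs.append((items[i], items[j]))
    return pairs

def expected_pair_count(n):
    # "n choose 2": the number of unordered pairs among n items.
    return n * (n - 1) // 2

rows = ["drug_a", "drug_b", "drug_c", "drug_d", "drug_e"]
pairs = build_pairs(rows)

# The nested loop produces the same pairs as itertools.combinations.
same_as_itertools = pairs == list(combinations(rows, 2))
```

Pair count grows quadratically: under this assumption 17,729 rows would give `expected_pair_count(17729)`, roughly 157 million pairs, the same order of magnitude as the 172 million entries reported above.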

James Scott (jamesscott@shield-legal.com)
2025-03-07 13:42:17

im back

Josh Josue (jjosue@shield-legal.com)
2025-03-07 13:42:40

cool!

Josh Josue (jjosue@shield-legal.com)
2025-03-07 13:42:43

im free to huddle

Josh Josue (jjosue@shield-legal.com)
2025-03-07 13:44:59

*Thread Reply:* integ_test_adverse_events_ranking

Josh Josue (jjosue@shield-legal.com)
2025-03-07 13:43:06

My code finished after 30min, upserted 17,729 entries to integ_test_integ_test_adverse_events_ranking_logs .The number of entries matches the count for brand_names without cases

James Scott (jamesscott@shield-legal.com)
2025-03-07 17:39:30

NICE!

James Scott (jamesscott@shield-legal.com)
2025-03-07 17:39:34

ima look

Josh Josue (jjosue@shield-legal.com)
2025-03-07 17:39:58

Oh no that msg was before our discussion 😅

Josh Josue (jjosue@shield-legal.com)
2025-03-07 17:40:13

I’m still currently working on automating table replacements

James Scott (jamesscott@shield-legal.com)
2025-03-07 17:41:23

ok lol

James Scott (jamesscott@shield-legal.com)
2025-03-10 14:12:54

hows your day going how was ur weekend

Josh Josue (jjosue@shield-legal.com)
2025-03-10 14:20:04

Heya! It was great- got some good practice sesh airbrush painting haha

How was yours?

Josh Josue (jjosue@shield-legal.com)
2025-03-10 14:21:36

Today’s good - made a fix last night and I’m pickin up where the Workflow job left off. Currently waiting on Step 8 to finish. Then i should have data (born from the full 37 mil adverse events) for Step 9 to rank

James Scott (jamesscott@shield-legal.com)
2025-03-10 17:48:15

did u look into vscode yet

Josh Josue (jjosue@shield-legal.com)
2025-03-10 17:49:19

I did some light researching - seems it’s possible to connect to the GCP server while using vscode so that your code runs remotely

James Scott (jamesscott@shield-legal.com)
2025-03-10 17:49:42

exactly what we need awesome

James Scott (jamesscott@shield-legal.com)
2025-03-10 17:57:04

do you think we can focus on that for tomorrow to get it out the way

Josh Josue (jjosue@shield-legal.com)
2025-03-10 18:01:11

Sure! I’ll pivot to looking into that and see how to set it up with our GCP instance

James Scott (jamesscott@shield-legal.com)
2025-03-10 18:12:25

yes, the other stuff should just be running now to complete correct, deff wanna start getting more software engineer like with it

Josh Josue (jjosue@shield-legal.com)
2025-03-10 18:15:06

im still trying to fix Step 9, all of a sudden running into a missing column issue

verifying if this is a bug in the code or dirty data

Josh Josue (jjosue@shield-legal.com)
2025-03-10 18:15:24

I'm def tryna automate as much as i can

James Scott (jamesscott@shield-legal.com)
2025-03-10 18:15:29

its probably one of the ones we deleted previously

Josh Josue (jjosue@shield-legal.com)
2025-03-10 18:15:41

implementing automated tests, abstracted modules and such

Josh Josue (jjosue@shield-legal.com)
2025-03-10 18:16:57

maybe it is, i'm double checking stuff rn

Josh Josue (jjosue@shield-legal.com)
2025-03-10 18:23:21

also, as im moving along, I'm integrating new modules to older steps (like the logger to step2)

Josh Josue (jjosue@shield-legal.com)
2025-03-10 19:08:36

I'm trying to add SSH keys to the machine-learning instance but I dont have permissions

Josh Josue (jjosue@shield-legal.com)
2025-03-10 19:39:35

Step 9 has completed - but the ranking's earliest year is 2001

I should note that im also waiting on Step 1 to ingest stuff - (I've explicitly told it to start from 1994)

James Scott (jamesscott@shield-legal.com)
2025-03-10 20:21:20

ima take a look

James Scott (jamesscott@shield-legal.com)
2025-03-10 20:21:22

that seems good

James Scott (jamesscott@shield-legal.com)
2025-03-10 20:21:26

i am not tripping about that lol

James Scott (jamesscott@shield-legal.com)
2025-03-10 20:21:31

2001 is good

Josh Josue (jjosue@shield-legal.com)
2025-03-10 20:22:09

oh ok thanks lol

James Scott (jamesscott@shield-legal.com)
2025-03-10 20:29:42

we need to double check tomorrow the casetext query, some of these are coming back with cases

👍 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-03-10 20:31:03

ehh i think its ok, i could give the list of links to cases but combing through them could be annoying, rather give them to the user to figure out

Josh Josue (jjosue@shield-legal.com)
2025-03-11 12:33:56

when you have a moment, im ready to show you how i've ssh'd into a GCP VM

James Scott (jamesscott@shield-legal.com)
2025-03-11 12:38:57

ok lets take alook

James Scott (jamesscott@shield-legal.com)
2025-03-11 12:38:59

u free?

Josh Josue (jjosue@shield-legal.com)
2025-03-11 12:40:11

yep!

Josh Josue (jjosue@shield-legal.com)
2025-03-11 13:59:46

When you get the chance, could u try to add an ssh key to machine-learning? I’ve already given myself an admin role and still not able to do so

Josh Josue (jjosue@shield-legal.com)
2025-03-11 14:00:20

I’m wondering if your account is able to add one

James Scott (jamesscott@shield-legal.com)
2025-03-11 14:26:18

Ok can we go over it shortly

Josh Josue (jjosue@shield-legal.com)
2025-03-11 14:26:50

I’m currently eating 😅

I’ll msg u when i get back

James Scott (jamesscott@shield-legal.com)
2025-03-11 14:27:05

Take ur time no rush

Josh Josue (jjosue@shield-legal.com)
2025-03-11 15:01:13

im back

James Scott (jamesscott@shield-legal.com)
2025-03-11 15:05:18

ok give me a couple minutes

James Scott (jamesscott@shield-legal.com)
2025-03-11 15:41:10

let me look at this tomorrow i have some things i need to take care of, good work today though!

👍 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-03-12 09:24:09

real quick, where are the results of the rank stored? whats the table name?

Josh Josue (jjosue@shield-legal.com)
2025-03-12 12:11:00

sorry just saw this

Josh Josue (jjosue@shield-legal.com)
2025-03-12 12:11:30

integ_test_adverse_events_ranking

Josh Josue (jjosue@shield-legal.com)
2025-03-12 12:12:02

not sure why i didnt get a slack notification this morning

James Scott (jamesscott@shield-legal.com)
2025-03-12 12:24:45

its ok!

James Scott (jamesscott@shield-legal.com)
2025-03-12 12:45:40

hows the vs code coming along

Josh Josue (jjosue@shield-legal.com)
2025-03-12 12:46:18

we have a successful test that finished this morning - Step 1 to Step 9

Josh Josue (jjosue@shield-legal.com)
2025-03-12 12:46:35

i'm making some small fixes (like typos in the logs table names, etc)

James Scott (jamesscott@shield-legal.com)
2025-03-12 12:46:47

Let’s go!! lol can u update the tables in Miro with the actual table names

Josh Josue (jjosue@shield-legal.com)
2025-03-12 12:47:04

oh yea for sure - i'll create a ticket so i dont forget

James Scott (jamesscott@shield-legal.com)
2025-03-12 12:53:34

How’s vs code coming along

Josh Josue (jjosue@shield-legal.com)
2025-03-12 12:56:53

that i prolly need your help on

Josh Josue (jjosue@shield-legal.com)
2025-03-12 12:57:11

bcuz I succeeded in giving myself an admin role for the machine-learning instance

Josh Josue (jjosue@shield-legal.com)
2025-03-12 12:57:43

BUT for some reason, i'm just not allowed to make any changes to it (including adding SSH keys)

James Scott (jamesscott@shield-legal.com)
2025-03-12 12:59:49

Ok let’s take a look

Josh Josue (jjosue@shield-legal.com)
2025-03-12 13:04:03

are you able to add ssh keys? if you're also not able to, I'm guessing this is due to some GCP configuration (iirc on AWS if you create an instance, there's a scenario where you wont be able to add SSH keys afterwards)

James Scott (jamesscott@shield-legal.com)
2025-03-12 13:06:08

Let’s take a look if ur free

Josh Josue (jjosue@shield-legal.com)
2025-03-12 13:06:46

yep im ready to huddle

Josh Josue (jjosue@shield-legal.com)
2025-03-12 13:08:46

ssh-keygen -t rsa -f gcp-shield-legal -C dev-cronjob -b 2048

James Scott (jamesscott@shield-legal.com)
2025-03-12 13:09:31

ssh-keygen -t rsa -f gcp-shield-legal-machinelearning -C shield-legal -b 2048

James Scott (jamesscott@shield-legal.com)
2025-03-12 13:33:06

https://www.zotero.org/

zotero.org
Josh Josue (jjosue@shield-legal.com)
2025-03-12 13:41:59

IdentityFile C:\Users\Jehoshua\.ssh\machine-learning

Josh Josue (jjosue@shield-legal.com)
2025-03-12 13:55:32

what's your github username?

James Scott (jamesscott@shield-legal.com)
2025-03-12 13:55:51

jamesshield

Josh Josue (jjosue@shield-legal.com)
2025-03-12 13:56:41

i see jamesshields and JamesShield

Josh Josue (jjosue@shield-legal.com)
2025-03-12 13:56:59

just wanna make sure i dont add a rando to this secret project haha

James Scott (jamesscott@shield-legal.com)
2025-03-12 13:57:53

It’s the second one Ahha and yes that would be bad

James Scott (jamesscott@shield-legal.com)
2025-03-14 08:16:46

were u able to like get in the environment, connect to github and do stuff

Josh Josue (jjosue@shield-legal.com)
2025-03-14 10:09:33

yea! i git cloned one of my repos into the vs-code-machine-learning and ran it just fine

Josh Josue (jjosue@shield-legal.com)
2025-03-14 10:10:09

i also tried using the git extension on vs code (never used it, i've always used sourcetree) and that worked out fine too

James Scott (jamesscott@shield-legal.com)
2025-03-14 11:05:28

awesome let me try and do it too, ima download what i was working on latest and try to upload it, prob need to make a new branch. have u been following any type of structure or naming convention

Josh Josue (jjosue@shield-legal.com)
2025-03-14 11:12:02

I created a new repo for step 10

Josh Josue (jjosue@shield-legal.com)
2025-03-14 11:12:23

I name the repos based on what your notebooks’ step titles are

Josh Josue (jjosue@shield-legal.com)
2025-03-14 11:16:25

I’ll send u a git invite for Step 10’s repo once i go into the office

James Scott (jamesscott@shield-legal.com)
2025-03-14 11:24:27

gotcha ok, yes if u follow the miro board and have the updated table names thats a fine naming convention since it relates back to that

Josh Josue (jjosue@shield-legal.com)
2025-03-14 12:09:43

alrighty, i've added you to step 10's repo (named clinical_extraction)

James Scott (jamesscott@shield-legal.com)
2025-03-14 13:31:26

are these repos in shield name

Josh Josue (jjosue@shield-legal.com)
2025-03-14 13:31:48

lemme go ahead and do that

James Scott (jamesscott@shield-legal.com)
2025-03-14 13:32:03

like using ur shield account or personal?

Josh Josue (jjosue@shield-legal.com)
2025-03-14 13:32:15

oh im using shield legal git acct

James Scott (jamesscott@shield-legal.com)
2025-03-14 13:32:33

ok awesome !

Josh Josue (jjosue@shield-legal.com)
2025-03-14 13:33:45

im transferring to SL-BI

James Scott (jamesscott@shield-legal.com)
2025-03-14 13:34:04

oh no

James Scott (jamesscott@shield-legal.com)
2025-03-14 13:34:07

dont do that

James Scott (jamesscott@shield-legal.com)
2025-03-14 13:34:09

why?

Josh Josue (jjosue@shield-legal.com)
2025-03-14 13:34:12

gotcha

Josh Josue (jjosue@shield-legal.com)
2025-03-14 13:34:14

ok nvm

Josh Josue (jjosue@shield-legal.com)
2025-03-14 13:34:19

lol i thought that's what you meant

James Scott (jamesscott@shield-legal.com)
2025-03-14 13:35:21

no thats a whole other project lol

Josh Josue (jjosue@shield-legal.com)
2025-03-14 13:35:46

ohh ok i thought that was, like, the "team" we're on

Josh Josue (jjosue@shield-legal.com)
2025-03-14 13:41:40

also, i should mention that i did have to install git and a few things on the VM, not sure if you'll run into that

Josh Josue (jjosue@shield-legal.com)
2025-03-14 14:08:36

Just wanted to confirm - i will be refactoring the entire step 10 script

James Scott (jamesscott@shield-legal.com)
2025-03-14 14:08:53

from what? what do u mean?

James Scott (jamesscott@shield-legal.com)
2025-03-14 15:16:55

also i am able to clone the repos and get in! ima start my coding from here and u can just pull my changes when done

James Scott (jamesscott@shield-legal.com)
2025-03-14 15:18:38

i dont think this has python installed lol

Josh Josue (jjosue@shield-legal.com)
2025-03-14 15:18:57

Lol i thought so too

Josh Josue (jjosue@shield-legal.com)
2025-03-14 15:19:00

Try python3 for executing it

James Scott (jamesscott@shield-legal.com)
2025-03-14 15:19:28

oh i mean like when i am trying to code, there is no python kernel installed

James Scott (jamesscott@shield-legal.com)
2025-03-14 15:23:25

prob install it through terminal idk

Josh Josue (jjosue@shield-legal.com)
2025-03-14 15:23:59

hmmm... i was able to run code just fine on that VM also i usually create a virtual env

James Scott (jamesscott@shield-legal.com)
2025-03-14 15:25:09

ah, i can do that

James Scott (jamesscott@shield-legal.com)
2025-03-14 15:25:16

thought it would be easier

James Scott (jamesscott@shield-legal.com)
2025-03-14 15:25:24

since we would share it

Josh Josue (jjosue@shield-legal.com)
2025-03-14 15:26:15

oh i see

Josh Josue (jjosue@shield-legal.com)
2025-03-14 15:26:43

i guess we'll just have to sync up code via git

James Scott (jamesscott@shield-legal.com)
2025-03-14 15:28:59

what you mean, if we are both in the environment, shouldnt the terminal install work for both of us

Josh Josue (jjosue@shield-legal.com)
2025-03-14 15:29:45

i believe the VM partitions us by user - you cant see my files and i cant see yours either

Josh Josue (jjosue@shield-legal.com)
2025-03-14 15:30:43

if the goal is to have us both see the same files, i could look into how the VM can create a shared storage volume

James Scott (jamesscott@shield-legal.com)
2025-03-14 15:31:21

That’s true, I guess the requirements.txt needs to be pulled from somewhere idk. How would we solve the problem via git? Well if we are working on GitHub we don’t need the same files, but for python packages we need to be in sync for that cuz u are not aware of everything

Josh Josue (jjosue@shield-legal.com)
2025-03-14 15:31:42

the repo has requirements.txt - so we can update it as needed

Josh Josue (jjosue@shield-legal.com)
2025-03-14 15:32:05

and that requirements.txt is what will keep our dependencies in sync
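
One way to check that a given virtual env actually matches the repo's requirements.txt, sketched with the standard library only. It assumes simple `name==version` pins; `parse_requirements` and `find_mismatches` are hypothetical helpers, not part of the repo:

```python
from importlib import metadata

def parse_requirements(text):
    """Parse simple 'name==version' lines; skip blanks and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, version = line.partition("==")
        # Unpinned packages get None, meaning "any installed version is fine".
        pins[name.strip().lower()] = version.strip() if sep else None
    return pins

def find_mismatches(pins):
    """Return {package: (wanted, installed)} for packages that are out of sync."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed is None or (wanted is not None and installed != wanted):
            mismatches[name] = (wanted, installed)
    return mismatches
```

Each of us could run `find_mismatches(parse_requirements(open("requirements.txt").read()))` inside our own env after pulling, and an empty dict would mean the env matches the file.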

Josh Josue (jjosue@shield-legal.com)
2025-03-14 15:33:16

i also put a get_modules.bash script in all of our projects to grab the common modules among them (like table names, google cloud util, etc)

James Scott (jamesscott@shield-legal.com)
2025-03-14 15:36:04

Ok I guess I can keep the repo requirements.txt updated. Is that the same throughout? Like if I’m working on step 10 versus step 1 would that be the same repo file

Josh Josue (jjosue@shield-legal.com)
2025-03-14 15:37:18

most of them use the same requirements.txt, step 9 was the exception - i created a separate requirements.txt for that project

Josh Josue (jjosue@shield-legal.com)
2025-03-14 15:37:59

if you pip installed the requirements.txt currently in the Step 10 repo, it should have everything we've been using in the previous steps

Josh Josue (jjosue@shield-legal.com)
2025-03-14 15:39:31

iirc step 9 was the one step that used torch, which took a while downloading and installing - didnt want the other steps to be bogged down by that when they dont need it

James Scott (jamesscott@shield-legal.com)
2025-03-14 16:52:31

ima try that when i get back to my pc

Josh Josue (jjosue@shield-legal.com)
2025-03-17 11:49:48

G’morning! Recently saw this open source software (docker deployable) that I could implement when we have time or when we do another pipeline

It basically solves our needs for:
• UI showing steps of pipeline
• Automated pulling from our GitHub repository (ensuring integrity of releases and streamlining deployment)
• Live feed of a Gantt chart timeline for each step being done
• Handling of secrets
• Infrastructure as code (steps are saved as YAML)

James Scott (jamesscott@shield-legal.com)
2025-03-17 11:50:42

nice, lets talk about it, so what are u working on now? i have a meeting with cam and he wants us to present something soon, so i wanted to get more data on these clinical trials

Josh Josue (jjosue@shield-legal.com)
2025-03-17 11:51:05

I'm currently refactoring the script used for step 10

Josh Josue (jjosue@shield-legal.com)
2025-03-17 11:51:26

starting with the LLM part

Josh Josue (jjosue@shield-legal.com)
2025-03-17 11:51:38

what is the scope of the presentation?

James Scott (jamesscott@shield-legal.com)
2025-03-17 11:52:08

no idea lol, have a meeting with him tomorrow about it, but more so the refactoring of the llm in regards to the classification correct

Josh Josue (jjosue@shield-legal.com)
2025-03-17 11:58:03

i ran an integration test on Friday, and the integ_test_adverse_events_ranking is updated

James Scott (jamesscott@shield-legal.com)
2025-03-17 12:12:35

nice!

Josh Josue (jjosue@shield-legal.com)
2025-03-17 14:03:44

So i've been going thru Step 10's code, namely the QAResponseModel2 class

aside from breaking down the code, was there a feature you needed implemented/added to it?

James Scott (jamesscott@shield-legal.com)
2025-03-17 14:04:05

can u link me to the file ur looking at

Josh Josue (jjosue@shield-legal.com)
2025-03-17 15:25:23

sure one sec

Josh Josue (jjosue@shield-legal.com)
2025-03-17 15:27:47

im looking at frame 42

James Scott (jamesscott@shield-legal.com)
2025-03-17 17:08:25

no that part is done and final, but there is another part i am currently coding to add

Josh Josue (jjosue@shield-legal.com)
2025-03-17 17:11:14

Oh ok gotcha

James Scott (jamesscott@shield-legal.com)
2025-03-17 17:41:15

after this, are u good with bi dashboarding?

Josh Josue (jjosue@shield-legal.com)
2025-03-17 17:48:16

I’m down to learn it!

Josh Josue (jjosue@shield-legal.com)
2025-03-17 17:48:48

I’ve done Plotly before - is there a python library you’re leaning towards?

Josh Josue (jjosue@shield-legal.com)
2025-03-17 17:50:05

also, who's the target audience? if it's outside the company, i'm also capable of creating the backend with login and auth so users that are only allowed to see the data can see it

James Scott (jamesscott@shield-legal.com)
2025-03-17 17:51:12

Bush’s to check the rank algorithm but that’s fine, we should start plotting this in the chart. It’s going to be Cam and other stakeholders, I don’t think that auth is necessary for now, but let me show u how I envision it and we can start to put together a Miro board

James Scott (jamesscott@shield-legal.com)
2025-03-17 17:52:10

I like this kinda dashboard look and it’s very official

James Scott (jamesscott@shield-legal.com)
2025-03-17 17:52:13

https://kwork.com/virtual-assistant/27330375/i-will-create-a-google-data-studio-or-looker-studio-dashboard

kwork.com
Josh Josue (jjosue@shield-legal.com)
2025-03-17 17:53:13

oooo that does look pretty spiffy!

Josh Josue (jjosue@shield-legal.com)
2025-03-17 17:56:54

that link is referring to Looker studio, (but I also know Node, React and NextJS if that becomes relevant)

James Scott (jamesscott@shield-legal.com)
2025-03-17 18:01:01

I know right lol if we were to get the dashboard looking like that my word !

😆 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-03-18 13:07:38

after the demo and the looker UI stuff, would I have the chance to work on improving our pipeline with Kestra?

Josh Josue (jjosue@shield-legal.com)
2025-03-18 13:08:02

one of my main concerns at the moment is that our deployment is not automated

James Scott (jamesscott@shield-legal.com)
2025-03-18 13:09:51

Yes to the first one and isn’t it in the scheduled to run? That’s fine

Josh Josue (jjosue@shield-legal.com)
2025-03-18 13:10:43

it is scheduled to run

by deployment, I'm referring to the step of putting the code onto the BigQuery Workflow - me - I am the deployment rn lol

Josh Josue (jjosue@shield-legal.com)
2025-03-18 13:11:14

i literally have to copy and paste stuff into the notebooks (which isnt a standard software releasing practice lol)

James Scott (jamesscott@shield-legal.com)
2025-03-18 13:11:31

oh ahha i thought its already in there i am confused but its fine but yes we can work on a more standard approach but this is fine for now

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-03-18 13:12:42

that said i think im almost done refactoring the LLM part of step 10 - when im done what are my next steps?

Josh Josue (jjosue@shield-legal.com)
2025-03-18 13:14:35

or rather, what features does your branch have? (that I will be refactoring when you're done)

James Scott (jamesscott@shield-legal.com)
2025-03-18 13:18:32

when u said refactoring that llm

James Scott (jamesscott@shield-legal.com)
2025-03-18 13:18:34

is for

James Scott (jamesscott@shield-legal.com)
2025-03-18 13:18:37

clinical trials

James Scott (jamesscott@shield-legal.com)
2025-03-18 13:18:46

refactoring was supposed to be for the recommendation llm

James Scott (jamesscott@shield-legal.com)
2025-03-18 13:18:59

theres 2

Josh Josue (jjosue@shield-legal.com)
2025-03-18 13:19:15

ohhh

Josh Josue (jjosue@shield-legal.com)
2025-03-18 13:19:22

ok lemme give it another read

James Scott (jamesscott@shield-legal.com)
2025-03-18 13:19:25

this is the one im on

James Scott (jamesscott@shield-legal.com)
2025-03-18 13:19:28

i mean its fine

James Scott (jamesscott@shield-legal.com)
2025-03-18 13:19:34

it doesnt seem like u updated this code

James Scott (jamesscott@shield-legal.com)
2025-03-18 13:19:40

but that llm would work for refactoring too

James Scott (jamesscott@shield-legal.com)
2025-03-18 13:19:47

if u just change the prompt and ingestion around

James Scott (jamesscott@shield-legal.com)
2025-03-18 13:19:51

which seems like u did

Josh Josue (jjosue@shield-legal.com)
2025-03-18 13:20:49

ok so frame 42 and 43 look identical

Josh Josue (jjosue@shield-legal.com)
2025-03-18 13:21:00

i'm about to finish refactoring frame 42

James Scott (jamesscott@shield-legal.com)
2025-03-18 13:22:58

let me go to ur code

Josh Josue (jjosue@shield-legal.com)
2025-03-18 13:23:09

I dont see code pertaining to recommendations

Or did you mean I am writing code for recommendations

James Scott (jamesscott@shield-legal.com)
2025-03-18 13:23:18

i am going to link u to that

Josh Josue (jjosue@shield-legal.com)
2025-03-18 13:23:18

you can pull it on a branch

Josh Josue (jjosue@shield-legal.com)
2025-03-18 13:23:29

josh/refactor_llm_part is the branch

James Scott (jamesscott@shield-legal.com)
2025-03-18 13:23:52

let me pull and push the example llm that i did before for prod

👍 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-03-18 13:25:01

can u link me to the clone link

Josh Josue (jjosue@shield-legal.com)
2025-03-18 13:25:37

git clone https://github.com/josh-SL/clinical_extraction.git

James Scott (jamesscott@shield-legal.com)
2025-03-18 13:30:11

just pushed the files

James Scott (jamesscott@shield-legal.com)
2025-03-18 13:30:13

take a look

Josh Josue (jjosue@shield-legal.com)
2025-03-18 13:31:00

thanks! i'll take a look at these

Josh Josue (jjosue@shield-legal.com)
2025-03-18 13:32:40

so am i understanding this correctly - step 10 is basically:

  1. clinical study ingestion (you're currently adding to it)
  2. clinical study LLM extraction (I'm almost done refactoring)
  3. litigation recommendation (what I'll refactor next)
James Scott (jamesscott@shield-legal.com)
2025-03-18 13:35:14
  1. clinical study LLM extraction (I'm almost done refactoring)
James Scott (jamesscott@shield-legal.com)
2025-03-18 13:35:19

what do u mean refactor of this

Josh Josue (jjosue@shield-legal.com)
2025-03-18 13:36:07

so just like the previous steps, I've had to abstract some of the logic - to make reusable code - in this case the bedrock calls

Josh Josue (jjosue@shield-legal.com)
2025-03-18 13:37:08

it also involved abstracting the prompts as params and moving them to another file (separate from the code)
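
A rough sketch of that shape, not the actual code on the branch. The prompt text, `PROMPTS`, `run_prompt` and the request body format are all hypothetical, and the model call is injected as a plain callable so the example runs without AWS credentials:

```python
import json
from string import Template

# In the real project these would live in their own file, separate from the code;
# the prompt name and wording here are made up.
PROMPTS = {
    "clinical_extraction": Template(
        "Extract the study design and outcomes from this abstract:\n$abstract"
    ),
}

def run_prompt(invoke, prompt_name, **params):
    """Render a named prompt and send it through an injected `invoke` callable.

    `invoke` takes a JSON request string and returns the model's text; in
    production it would wrap the actual Bedrock client call.
    """
    prompt = PROMPTS[prompt_name].substitute(**params)
    request_body = json.dumps({"prompt": prompt})
    return invoke(request_body)

# A fake invoker for local testing, so no network access is needed.
def fake_invoke(request_body):
    sent = json.loads(request_body)
    return f"received {len(sent['prompt'])} characters"

result = run_prompt(fake_invoke, "clinical_extraction", abstract="Example abstract.")
```

Keeping the prompts as data means the extraction step and the recommendation step can share `run_prompt` and differ only in which prompt name and params they pass.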

James Scott (jamesscott@shield-legal.com)
2025-03-18 13:37:14

gotcha ok! thats fine with me

Josh Josue (jjosue@shield-legal.com)
2025-03-18 15:31:41

in the llm_recommendation notebook's code, the first line is

    # Read the extracted case data
    df = pd.read_csv('extracted_case_data.csv')

are these csv files going to be stored on Google Bucket for our production level pipeline?

Josh Josue (jjosue@shield-legal.com)
2025-03-18 16:03:43

OR is that data supposed to be coming from the case_texts table?

James Scott (jamesscott@shield-legal.com)
2025-03-18 16:29:13

Well I think that is the process that’s changed right so the start is going to come from clinicaltrialprod which is the information from the clinical data

James Scott (jamesscott@shield-legal.com)
2025-03-19 09:48:33

let me know when you have a couple minutes

James Scott (jamesscott@shield-legal.com)
2025-03-19 09:56:46

nvm i got it

James Scott (jamesscott@shield-legal.com)
2025-03-19 10:02:27

able to run the llm and code from end to end in vs code

Josh Josue (jjosue@shield-legal.com)
2025-03-19 10:20:44

Oh awesome! What branch should i pull down?

James Scott (jamesscott@shield-legal.com)
2025-03-19 10:20:59

Not done yet I will let u know

Josh Josue (jjosue@shield-legal.com)
2025-03-19 16:19:36

alrighty, i've finished refactoring the llm_recommendation code that you put in my branch yesterday

James Scott (jamesscott@shield-legal.com)
2025-03-19 16:26:01

Let’s take a look at it tomorrow and see the results

Josh Josue (jjosue@shield-legal.com)
2025-03-19 16:50:02

i'll try to get test data to show you tomorrow, but I thought the data ingestion part of step10 was still undergoing changes in your branch?

James Scott (jamesscott@shield-legal.com)
2025-03-19 16:52:02

Scroll up! The ingestion comes from the clinicaltrialprod

Josh Josue (jjosue@shield-legal.com)
2025-03-19 16:52:51

oh my bad i forgot, ok i'll run it with that data

Josh Josue (jjosue@shield-legal.com)
2025-03-20 10:21:03

good morning! the test results data can be found in integ_test_drug_ranking_llm_recommendation

I might have to do some tweaking since some of the rows and columns are nan

James Scott (jamesscott@shield-legal.com)
2025-03-20 10:22:06

the recommendation is nan? yea prob some prompt engineering, but i actually got a gameplan on how this new dashboard is going to be awesome. i need to invite u to a new miro board so we can get the dashboard together, this is going to be epic and prob get us some raises ahha

😄 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-03-20 12:40:54

alrighty! i fixed it - the data on the aforementioned table looks more correct now

Josh Josue (jjosue@shield-legal.com)
2025-03-20 14:14:18

i'll get started on refactoring the ingestion part of the script

James Scott (jamesscott@shield-legal.com)
2025-03-20 14:14:48

What do u mean

Josh Josue (jjosue@shield-legal.com)
2025-03-20 14:15:15

like, "productionalizing" it

Josh Josue (jjosue@shield-legal.com)
2025-03-20 14:15:28

as i have done to all the previous steps

James Scott (jamesscott@shield-legal.com)
2025-03-20 14:15:41

Oh gotcha !

Josh Josue (jjosue@shield-legal.com)
2025-03-21 10:38:26

g'morning! Ok so i've finished integrating the ingestion part into my code and ran step 10 again

So the tables integ_test_clinical_abstracts and integ_test_drug_ranking_llm_recommendation have been created

James Scott (jamesscott@shield-legal.com)
2025-03-21 11:59:23

got a second

Josh Josue (jjosue@shield-legal.com)
2025-03-21 11:59:34

yep

Josh Josue (jjosue@shield-legal.com)
2025-03-21 11:59:37

whats up?

Josh Josue (jjosue@shield-legal.com)
2025-03-24 14:49:44

Heya! So I found the reason why the abstracts were missing columns. It’s related to how I’m processing the bedrock response - working on a fix for it

James Scott (jamesscott@shield-legal.com)
2025-03-24 14:54:52

Ok !

James Scott (jamesscott@shield-legal.com)
2025-03-25 11:53:18

hey you have a second

Josh Josue (jjosue@shield-legal.com)
2025-03-25 12:30:40

im bout to hop onto the meeting

Josh Josue (jjosue@shield-legal.com)
2025-03-25 12:30:43

what's up?

James Scott (jamesscott@shield-legal.com)
2025-03-25 12:37:10

adverseeventsranking_prod

James Scott (jamesscott@shield-legal.com)
2025-03-25 12:40:04

# this is what you have to fix:
dfupload = pd.merge(dforiginal, rankeddf[['case_year', 'brand_name', 'manufacturer_name', 'activesubstancename', 'rank']], how='left', on=['case_year', 'brand_name', 'manufacturer_name', 'activesubstancename'])

James Scott (jamesscott@shield-legal.com)
2025-03-26 10:48:10

hows the update looking

Josh Josue (jjosue@shield-legal.com)
2025-03-26 11:14:22

I’ve made the fix and am currently running Step 9 to verify the data on BigQuery

Josh Josue (jjosue@shield-legal.com)
2025-03-26 11:15:27

Also running Step 10 all over again from last night - the api decided to cut connection mid run

So i added an upload phase after the ingestion to salvage the work done by a previous run

James Scott (jamesscott@shield-legal.com)
2025-03-26 12:05:13

oh yikes ok sounds good, let me know when step 9 is done

Josh Josue (jjosue@shield-legal.com)
2025-03-26 12:11:17

will do

Josh Josue (jjosue@shield-legal.com)
2025-03-26 12:45:07

oof, my Step 9 running on Big Query was terminated as it ran out of RAM

going to retry on my work laptop

Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:23:20

So I'm having trouble getting through Step 9 after i added that merge line

Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:23:51

the point of failure is at the predict and rank phase

Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:24:07

it just suddenly gets killed

Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:24:21

(this attempt is on our online GCP instance)

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:24:21

lets walk through it

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:24:30

hmmm

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:24:37

this runs fine in workbench

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:24:45

are we using the

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:24:54

or make sure our cpu and stuff is good

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:24:58

i mean maybe the new one

Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:25:00

i've encountered errors coming from pandas

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:25:03

isnt the same as the original one

Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:25:45

got a moment to huddle?

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:26:03

its a little late on my end, i am at a lacrosse game

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:26:04

hmmm

Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:26:09

ohh my bad

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:26:09

i can walk through it text wise though

Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:26:18

ok one sec

Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:26:35
Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:27:06

so on this spot, your notebook was doing it in place, but pandas told me not to do that, hence i changed it

Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:27:27

got past those lines that replaced the nan and fillna()

Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:27:51

but now, the code does the train_ranknet() part and it just dies

Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:29:51

There are only 10 epochs, but they would die before finishing - any guesses as to why that is?

Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:31:09

google says i probably ran out of ram (going to check the ram on our instance)

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:32:15

i think it's a ram issue

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:32:16

the join

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:32:19

is after the model

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:32:20

not before

Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:32:35

ok yea gotcha

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:32:40

so the code shouldnt have been adjusted from what it was previously other than that one line of code after

Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:33:24

i had to adjust it due to an error crash with pandas:

FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:33:50

what line are you getting the error from

Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:34:48

the lines wont translate 1:1 but this is the code:

Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:35:15

i've had to put group['score'] = ...... on the left side

Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:36:10

the original way looked like this:

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:36:27

gotcha ya you can just ask chatgpt how to solve that

Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:36:39

that was chatgpt's suggestion

Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:36:59

remove "inplace" param and set the value to group['score']
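
A minimal sketch of the change described above, for reference. The frame and its values are made up for illustration (the notebook's real data is not shown in this chat); only the `group['score']` name comes from the conversation. Calling an inplace method on a selected column acts on an intermediate object, which is what the FutureWarning is about, so assigning the result back is the safer pattern:

```python
import numpy as np
import pandas as pd

# Hypothetical data; the real notebook's columns and values are not shown here.
group = pd.DataFrame({"brand_name": ["a", "b", "c"], "score": [0.2, np.nan, 0.9]})

# Pattern that raises the FutureWarning on recent pandas versions:
#   group["score"].fillna(0.0, inplace=True)

# Assigning the result back does not rely on inplace chained assignment.
group["score"] = group["score"].fillna(0.0)

print(group["score"].tolist())
```

Both forms are meant to produce the same column today; the assignment form is the one that should keep working after pandas 3.0.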

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:39:36

i would keep in place

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:39:45

the way i have it works

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:40:02

so i wouldnt deviate from that unless you know the data structure you are trying to achieve for that step

Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:41:21

ok i'll put the "in place" back

Josh Josue (jjosue@shield-legal.com)
2025-03-26 16:41:42

i'll have to look into increasing the ram of the instance

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:42:22

yea 48 vCPUs, 192 GB RAM

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:42:25

this is what we were using before

James Scott (jamesscott@shield-legal.com)
2025-03-26 16:42:31

on my previous instance

Josh Josue (jjosue@shield-legal.com)
2025-03-26 17:50:03

just a heads up, i rebooted the vs-code-machine-learning instance

please update your ssh config with the new ip: 35.185.19.12

James Scott (jamesscott@shield-legal.com)
2025-03-26 17:50:20

Ok I can do that

Josh Josue (jjosue@shield-legal.com)
2025-03-27 10:13:02

G'morning - so I think I've gotten the non-scaled values except I forgot to delete the existing table. Gonna rerun Step 9 with a fresh new table

James Scott (jamesscott@shield-legal.com)
2025-03-27 15:06:31

hows it looking now?

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:06:58

been fixing bugs - currently waiting on a new step 9 run

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:07:13

while simultaneously debugging step 10

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:08:46

i forgot to mention that even after upgrading the RAM, the data being merged for step 9 was wayyy too big - like a snake trying to swallow an elephant

so i had to chunkify that process

James Scott (jamesscott@shield-legal.com)
2025-03-27 15:16:16

hmmm thats impossible, remember, if u look at the other data, this step should only be like 3k rows, 4k rows max

James Scott (jamesscott@shield-legal.com)
2025-03-27 15:16:25

it should be doing anything that requires chunking

James Scott (jamesscott@shield-legal.com)
2025-03-27 15:16:34

shouldnt **

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:17:05

i might need your help verifying if this is a data issue then

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:17:44

typically, when I make software, I create automated tests - but in this scenario, it's difficult to make those automated assertions of expected values

James Scott (jamesscott@shield-legal.com)
2025-03-27 15:18:31

can you share with me the code or let me see through github?

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:19:11

it's on github; branch name is dev

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:19:22

project name is ranking_model_yearly

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:19:30

lemme get the github link....

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:19:51

<https://github.com/josh-SL/ranking_model_yearly.git>

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:23:48

The main function in question is YearlyRanking.ranking_procedures()

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:28:09

I have an integration test in this project specifically to verify that the output data is not scaled, but I was hoping we could do a call so I could verify with you the following:
• sample input data is valid
• test assertions are valid

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:38:12

ok i got it

James Scott (jamesscott@shield-legal.com)
2025-03-27 15:38:19

i really dont think we need to make a whole new repo for every single step, do we? seems overkill, should just cycle through the branches

James Scott (jamesscott@shield-legal.com)
2025-03-27 15:38:25

and i dont have access to see your git

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:39:17

oh lemme add you

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:39:37

just sent you an invite

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:40:25

the reason why each step is its own repo is because it makes each of them testable and it makes the entire process flexible

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:40:48

if down the line we adjust ingestion of Step X, we wont have to touch the other steps

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:40:58

goal is to be modularized and not tightly coupled

James Scott (jamesscott@shield-legal.com)
2025-03-27 15:41:00

can you show me at which code/process your test is using a repo

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:41:33

i run pytest

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:42:01

to run just the integration tests use: pytest tests\integrations_tests

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:42:09

would you like to do a call?

James Scott (jamesscott@shield-legal.com)
2025-03-27 15:44:30

i am traveling right now and my service is bad, can you hear me

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:44:45

ohh gotcha

James Scott (jamesscott@shield-legal.com)
2025-03-27 15:45:04

also did you change anything in gcp? i cant even start the notebook now

James Scott (jamesscott@shield-legal.com)
2025-03-27 15:45:10

in workbench

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:45:33

i had to reboot that instance, so might have to update the ip address

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:45:37

lemme get it for u

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:45:55

35.185.19.12

James Scott (jamesscott@shield-legal.com)
2025-03-27 15:45:59

oh no

James Scott (jamesscott@shield-legal.com)
2025-03-27 15:46:01

https://console.cloud.google.com/vertex-ai/workbench/instances?referrer=search&inv=1&invt=AbtLnA&project=ai-projects-406720

James Scott (jamesscott@shield-legal.com)
2025-03-27 15:46:03

im talking about in here

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:46:36

i didnt change anything for machine-learning

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:46:44

but i did get an error msg

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:46:53

exceeded CPUS_ALL_REGIONS

James Scott (jamesscott@shield-legal.com)
2025-03-27 15:46:58

but what about the overall account quotas? associated with any accounts in gcp?

James Scott (jamesscott@shield-legal.com)
2025-03-27 15:47:09

maybe its cuz of the additional one we created

James Scott (jamesscott@shield-legal.com)
2025-03-27 15:47:13

im not too sure

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:47:26

it might be, the new one has a lot more CPUs

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:47:36

the error says Limit: 64 globally

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:48:03

i could shut off the beefier instance - shall i do that?

James Scott (jamesscott@shield-legal.com)
2025-03-27 15:55:42

i forget what i set it to, we could just increase it again to handle both instances

James Scott (jamesscott@shield-legal.com)
2025-03-27 15:56:42

little bit hard to follow ur code though, at which step are u ingesting the data from before

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:56:50

ok i'll hold off shutting the beefy instance down

Josh Josue (jjosue@shield-legal.com)
2025-03-27 15:57:11

the overview of the algorithm is in driver.py

James Scott (jamesscott@shield-legal.com)
2025-03-27 16:07:01

lets talk about this tomorrow, this is kinda overkill. if you could, its ok to ask more questions about the process. the ingestion table is or should be the ranking metrics, which is 3-4k records, and what you do in step name is just adding a new column through scaling of ranking. all you're doing is just joining the new ranking table to the original 3-4k metric table to combine the original table used for ranking with the ranking one

Josh Josue (jjosue@shield-legal.com)
2025-03-27 16:08:53

ok, yes plz i would very much like to huddle about it tomorrow

Josh Josue (jjosue@shield-legal.com)
2025-03-27 16:08:58

safe travels!

James Scott (jamesscott@shield-legal.com)
2025-03-27 16:09:36

can you give me the github to the previous step before this one?

Josh Josue (jjosue@shield-legal.com)
2025-03-27 16:12:46

<https://github.com/josh-SL/retained_label_metrics.git>

Josh Josue (jjosue@shield-legal.com)
2025-03-27 16:12:57

i also added you to that repo

James Scott (jamesscott@shield-legal.com)
2025-03-27 16:13:46

yes this repo stuff def needs to be changed, i should have access to this stuff and/or it should be in our environment and not detached or multiple one off instances of reports

James Scott (jamesscott@shield-legal.com)
2025-03-27 16:13:58

repos

James Scott (jamesscott@shield-legal.com)
2025-03-27 16:14:03

we can talk more about it tomorrow

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-03-27 17:13:21

my chunkified method successfully generated integ_test_adverse_events_ranking, and it seems that the reason why it's so large is bcuz of duplicates

Josh Josue (jjosue@shield-legal.com)
2025-03-27 17:13:39

but the values are no longer scaled

James Scott (jamesscott@shield-legal.com)
2025-03-28 06:30:14

this is still incorrect, great that the values are no longer scaled but look at that table, its a simple comparison. this is showing 161 million records.. mine shows 3700

Josh Josue (jjosue@shield-legal.com)
2025-03-28 10:19:55

*Thread Reply:* yea it definitely is wayyy too big - I aim to find out why the data gets bigger than the input. The tests seem to point to the ranknet function

James Scott (jamesscott@shield-legal.com)
2025-03-28 06:50:45

when you hop on, can you reduce the cpu in the cs code environment

Josh Josue (jjosue@shield-legal.com)
2025-03-28 10:17:43

*Thread Reply:* alrighty, i have set the vs-code-machine-learning instance to 2vCPUs and 8GB memory

Josh Josue (jjosue@shield-legal.com)
2025-03-28 10:12:52

did you mean the vs-code-machine-learning instance?

Josh Josue (jjosue@shield-legal.com)
2025-03-28 10:13:28

also i have a better test now, i replaced everything in my project with the code from your notebook

Josh Josue (jjosue@shield-legal.com)
2025-03-28 10:14:11

this is a simple test to verify that the length of the input data is the same as the output

Josh Josue (jjosue@shield-legal.com)
2025-03-28 10:14:42

the sample data came from integ-test-icd-metrics-retained (this allows me for faster turnarounds when testing)

Josh Josue (jjosue@shield-legal.com)
2025-03-28 10:16:13

the input is 1k, but the output is 87k, so it fails the assertion
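
A minimal sketch of this kind of length check, assuming a hypothetical `ranking_step` function as a stand-in for the Step 9 procedure (the project's real function and test are not shown in this chat). The only point is the assertion that the ranking step must not add or drop rows:

```python
import pandas as pd


def ranking_step(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for Step 9: adds a rank column, keeps every row."""
    out = df.copy()
    out["rank"] = (
        out["number_of_cases"].rank(ascending=False, method="first").astype(int)
    )
    return out


def check_row_count_preserved(df_in: pd.DataFrame) -> pd.DataFrame:
    """Run the step and fail loudly if the output length differs from the input."""
    df_out = ranking_step(df_in)
    assert len(df_out) == len(df_in), (
        f"row count changed: input={len(df_in)}, output={len(df_out)}"
    )
    return df_out


# Made-up sample rows for illustration only.
sample = pd.DataFrame({"brand_name": ["a", "b", "c"], "number_of_cases": [5, 10, 1]})
result = check_row_count_preserved(sample)
```

Under pytest the same assertion would live inside a `test_...` function, with the sample rows loaded from a small fixture table.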

Josh Josue (jjosue@shield-legal.com)
2025-03-28 10:28:38
James Scott (jamesscott@shield-legal.com)
2025-03-28 10:30:56

you have a second for a talk and next steps

Josh Josue (jjosue@shield-legal.com)
2025-03-28 10:31:14

Sure

Josh Josue (jjosue@shield-legal.com)
2025-03-28 12:42:53

I've acquired the diffs between our tables

My table 2 is short by 36 million - going to start debugging from there

James Scott (jamesscott@shield-legal.com)
2025-03-28 14:11:28

I think that is a great starting point!

Josh Josue (jjosue@shield-legal.com)
2025-03-28 15:55:45

while im waiting for Step2's results, I had a breakthrough in Step 9's test

The input and output df lengths match if I eliminated duplicates from the input and output dataframes

Duplicate is established by these columns: ['manufacturer_name', 'brand_name', 'activesubstancename', 'case_year', 'number_of_cases', 'number_of_patients', 'average_patient_age', 'average_patient_weight_kg']
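
A small sketch of why duplicates inflate the merge, using made-up rows. The key columns are taken from the merge line shared earlier in this chat; everything else (values, variable names) is an assumption for illustration. When the same key appears more than once on both sides, a left merge returns every matching combination, so the output grows past the input:

```python
import pandas as pd

# Join keys as they appear in the merge line from this chat.
KEYS = ["case_year", "brand_name", "manufacturer_name", "activesubstancename"]

# Made-up rows: the first two are exact duplicates.
original = pd.DataFrame({
    "case_year": [2020, 2020, 2021],
    "brand_name": ["x", "x", "y"],
    "manufacturer_name": ["m", "m", "n"],
    "activesubstancename": ["s", "s", "t"],
    "number_of_cases": [4, 4, 7],
})
ranked = original[KEYS].copy()
ranked["rank"] = [1, 1, 2]

# Duplicate keys on both sides: 2 x 2 matches plus 1 single match = 5 rows.
inflated = original.merge(ranked, how="left", on=KEYS)

# Dropping duplicates first keeps the row count stable; validate= makes
# pandas raise if the keys are still not unique on either side.
original_dedup = original.drop_duplicates()
ranked_dedup = ranked.drop_duplicates(subset=KEYS)
clean = original_dedup.merge(
    ranked_dedup, how="left", on=KEYS, validate="one_to_one"
)

print(len(original), len(inflated), len(clean))
```

Passing `validate="one_to_one"` turns a silent row explosion into an immediate error, which may be easier to debug than a table that is orders of magnitude too large.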

James Scott (jamesscott@shield-legal.com)
2025-03-28 15:56:37

saw you said breakthrough it made me laugh lol like you solved world hunger or something

😂 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-03-28 15:57:29

i got excited as it was really confusing to me why that was happening haha

Josh Josue (jjosue@shield-legal.com)
2025-03-28 15:58:30

but knowing that, as i'm debuggin these steps, i'll keep in mind if duplicates might be an issue

James Scott (jamesscott@shield-legal.com)
2025-03-28 16:01:40

so the length of step 9 match from both instances?

Josh Josue (jjosue@shield-legal.com)
2025-03-28 16:02:56

oh no that's not it

Im saying i now have a unit test to help me determine if my code is going to mess up the merge

Josh Josue (jjosue@shield-legal.com)
2025-03-28 16:03:30

this would just help me in this debugging process

James Scott (jamesscott@shield-legal.com)
2025-03-28 16:14:30

awesome!

Josh Josue (jjosue@shield-legal.com)
2025-03-31 10:57:43

Good morning! So I ran 3 tests to debug step2, and it shows that my code and the original notebook's code are consistent in its results

I'm not sure why your adverse_events_icd_prod has 116 million entries, but the tests show that the issue is not with the data source nor the code.

James Scott (jamesscott@shield-legal.com)
2025-03-31 10:59:59

whats the comparison with what i have?

Josh Josue (jjosue@shield-legal.com)
2025-03-31 11:00:41

your icd prod table has 116 million while those tests above resulted in around 80 million

Josh Josue (jjosue@shield-legal.com)
2025-03-31 11:01:27

Also, I've confirmed (with an automated test) that Step9 notebook code does in fact result in varying lengths (input data len VS output data len)

Josh Josue (jjosue@shield-legal.com)
2025-03-31 11:01:54

But after I eliminated duplicates (using dataframe operations) the lengths started to match

Josh Josue (jjosue@shield-legal.com)
2025-03-31 11:04:02

Going by this, I ran a query on my step 9 table, it went from 163 million down to 19 million

James Scott (jamesscott@shield-legal.com)
2025-03-31 11:04:31

you should run step 9 in the notebook, it shouldn't take long, and log the metrics or the lengths and see whats happening, because this will show u whether the issue was how u switched the code or the datatables being used

Josh Josue (jjosue@shield-legal.com)
2025-03-31 11:05:17

ok yeah I can run similar tests on step9

James Scott (jamesscott@shield-legal.com)
2025-03-31 11:05:49

like in the notebook how i did it

James Scott (jamesscott@shield-legal.com)
2025-03-31 11:05:54

just log the lengths in the code
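
A minimal sketch of logging the lengths at each stage, assuming a hypothetical `log_rows` helper and toy frames (the notebook's actual stages and helper names are not shown in this chat):

```python
import logging

import pandas as pd

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)
log = logging.getLogger("step9_sketch")  # hypothetical logger name


def log_rows(label: str, df: pd.DataFrame) -> int:
    """Log the row count of a frame and return it so stages can be compared."""
    n = len(df)
    log.info("%s has %d rows.", label, n)
    return n


# Toy stand-ins for the loaded, ranked, and merged frames.
loaded = pd.DataFrame({"brand_name": ["a", "b"], "number_of_cases": [3, 8]})
ranked = loaded.assign(rank=[2, 1])
merged = loaded.merge(ranked[["brand_name", "rank"]], how="left", on="brand_name")

counts = [
    log_rows("Loaded DataFrame", loaded),
    log_rows("Ranked DataFrame", ranked),
    log_rows("Merged DataFrame", merged),
]
```

If the counts diverge between two consecutive stages, that stage is the first place to look.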

James Scott (jamesscott@shield-legal.com)
2025-03-31 15:58:27

how's it looking, did u run the code

Josh Josue (jjosue@shield-legal.com)
2025-03-31 15:58:57

I tried several times - in the jupyterlab notebook itself

Josh Josue (jjosue@shield-legal.com)
2025-03-31 15:59:01

but it kept failing

Josh Josue (jjosue@shield-legal.com)
2025-03-31 15:59:18

literally would just crash the page and I'd have to start over

Josh Josue (jjosue@shield-legal.com)
2025-03-31 15:59:25

so I'm taking a different approach

Josh Josue (jjosue@shield-legal.com)
2025-03-31 16:00:53

I noticed the row count anomaly start between Step 6 and Step 8. So I created a DISTINCT only table from step 6 (went from 3.6 million down to 1.9 million)

I will test it as input for step 8

Josh Josue (jjosue@shield-legal.com)
2025-03-31 16:04:13

But I can tell you with full confidence that given a csv file as input to the original Step 9 notebook code, the input length will vary from the output length

which is why I'm pursuing the possibility of duplicates in previous steps

James Scott (jamesscott@shield-legal.com)
2025-04-01 07:50:26

i updated the instance for higher ram and the code runs

James Scott (jamesscott@shield-legal.com)
2025-04-01 08:02:44

i am adjusting the code and making sure this step is concrete

James Scott (jamesscott@shield-legal.com)
2025-04-01 08:45:42

2025-04-01 13:45:14,813 - INFO - Starting RankNet processing...
2025-04-01 13:45:16,970 - INFO - Loaded DataFrame from BigQuery with 65524 rows.
2025-04-01 13:45:18,727 - INFO - Preprocessing data...
2025-04-01 13:45:18,821 - INFO - Preprocessed DataFrame has 65524 rows.
2025-04-01 13:45:19,743 - INFO - Created 112630 ranking pairs.
2025-04-01 13:45:19,751 - INFO - Training RankNet model...
2025-04-01 13:45:21,649 - INFO - Epoch 1, Loss: 0.6934238484637304
2025-04-01 13:45:23,479 - INFO - Epoch 2, Loss: 0.6931662190366875
2025-04-01 13:45:25,313 - INFO - Epoch 3, Loss: 0.6931543030522086
2025-04-01 13:45:27,157 - INFO - Epoch 4, Loss: 0.6931536412374539
2025-04-01 13:45:29,006 - INFO - Epoch 5, Loss: 0.6931815118274905
2025-04-01 13:45:29,111 - INFO - Ranked DataFrame has 65524 rows.
2025-04-01 13:45:29,140 - INFO - Merged DataFrame has 65524 rows.
65524 out of 65524 rows loaded.
2025-04-01 13:45:34,727 - INFO - 100%|██████████| 1/1 [00:00<00:00, 1610.10it/s]
2025-04-01 13:45:34,729 - INFO - Successfully uploaded ranked data to BigQuery.

👍 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-04-01 08:45:51

logs from the new updates everything matches

Josh Josue (jjosue@shield-legal.com)
2025-04-01 10:09:16

ok thanks for confirming that!

Josh Josue (jjosue@shield-legal.com)
2025-04-01 10:10:08

I think i found the source of the problem - my step 8's chunkified process - it's doing the query_metrics N times

going to run tests

James Scott (jamesscott@shield-legal.com)
2025-04-01 10:56:23

ahh see that could deff be it

James Scott (jamesscott@shield-legal.com)
2025-04-01 10:56:35

u can productionalize that code

Josh Josue (jjosue@shield-legal.com)
2025-04-01 10:57:48

Yea, I’m running a test of the JOIN query of my tables, get that length, and then querying your tables, get that length and then compare if the diff is significant

Josh Josue (jjosue@shield-legal.com)
2025-04-01 11:33:35

ok so i did the Step8's JOIN queries:
• yours came out 51 million entries
• mine was 45.9 million
The difference is most probably due to your icd10 table being 116 million and mine is 80 mil

Josh Josue (jjosue@shield-legal.com)
2025-04-01 11:34:31

So my question for you is, if I performed the JOIN query once (like the test above), could I chunkify the data aggregation and use that same JOIN query df or would that also be wrong?

Josh Josue (jjosue@shield-legal.com)
2025-04-01 11:34:51

If not, then I guess I'll just have to get rid of the chunkified method altogether

Josh Josue (jjosue@shield-legal.com)
2025-04-01 16:37:02

ok good news, Step 6 is confirmed to be fixed!
• adverse_events_retained_labels_prod at 1.7 million
• integ_test_icd_retained_labels at 1.9 million

James Scott (jamesscott@shield-legal.com)
2025-04-01 16:44:30

Awesome !

Josh Josue (jjosue@shield-legal.com)
2025-04-02 12:54:24

Step 8 seems like a big bottleneck with millions of entries from icd and metrics tables

Tried 3 times to run a test (home machine, local machine, vs-code VM) and they would all run out of RAM

Trying to run the Jupyterlab code and see how long it takes to finish

Josh Josue (jjosue@shield-legal.com)
2025-04-02 13:21:08

i need help - the machine-learning VM died and I'm not authorized to start it back up

James Scott (jamesscott@shield-legal.com)
2025-04-02 13:23:20

Yes i was updating it

James Scott (jamesscott@shield-legal.com)
2025-04-02 13:23:22

For more power

Josh Josue (jjosue@shield-legal.com)
2025-04-02 13:23:27

oh gotcha

Josh Josue (jjosue@shield-legal.com)
2025-04-02 13:23:30

thanks

James Scott (jamesscott@shield-legal.com)
2025-04-02 13:24:43

It’s starting now give it like 2 min

👍 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-04-02 13:24:48

I am up and in it

Josh Josue (jjosue@shield-legal.com)
2025-04-02 14:29:08

The Step8 notebooks on GCP keep terminating

I think I'll have to take a different approach

Josh Josue (jjosue@shield-legal.com)
2025-04-02 14:30:51

in this new approach I'll do the following:
• Fetch JOIN query
• batch by batch, take chunks of the query and perform aggregation process on it
• upload batch to BigQuery
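
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: `process_in_batches`, the `case_id` column, the distinct-count aggregation, and the `upload` callback are all hypothetical, and the joined frame here is a toy stand-in for the result of running the JOIN query once. Batches are cut along whole group keys so that no group is split across two batches, which would otherwise produce partial duplicate rows:

```python
from typing import Callable, List

import pandas as pd


def process_in_batches(
    joined: pd.DataFrame,
    group_cols: List[str],
    keys_per_batch: int,
    upload: Callable[[pd.DataFrame], None],
) -> int:
    """Aggregate an already-fetched JOIN result one batch of group keys at a time."""
    keys = joined[group_cols].drop_duplicates().reset_index(drop=True)
    total_rows = 0
    for start in range(0, len(keys), keys_per_batch):
        key_chunk = keys.iloc[start:start + keys_per_batch]
        # Keep only rows whose group key belongs to this batch.
        batch = joined.merge(key_chunk, on=group_cols, how="inner")
        aggregated = batch.groupby(group_cols, as_index=False).agg(
            number_of_cases=("case_id", "nunique")  # assumed aggregation
        )
        upload(aggregated)  # hypothetical: e.g. append to a BigQuery table
        total_rows += len(aggregated)
    return total_rows


# Toy stand-in for the JOIN result (fetched once, not once per batch).
joined = pd.DataFrame({
    "brand_name": ["a", "a", "b", "c", "c", "c"],
    "case_year": [2020, 2020, 2020, 2021, 2021, 2021],
    "case_id": [1, 2, 3, 4, 4, 5],
})
uploaded: List[pd.DataFrame] = []
total = process_in_batches(joined, ["brand_name", "case_year"], 2, uploaded.append)
result = pd.concat(uploaded, ignore_index=True)
```

In this sketch the whole JOIN result still sits in memory, so the batching only bounds the size of each aggregation and upload; a real run would probably page the query result instead.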

James Scott (jamesscott@shield-legal.com)
2025-04-02 14:44:01

let me see

Josh Josue (jjosue@shield-legal.com)
2025-04-02 15:30:32

yea it looks like the Step8 notebook crashed

James Scott (jamesscott@shield-legal.com)
2025-04-02 16:08:27

thats not the error for notebook crashing

James Scott (jamesscott@shield-legal.com)
2025-04-02 16:08:49

that means case number wasnt found

Josh Josue (jjosue@shield-legal.com)
2025-04-02 16:19:17

I get the error msg but it was the last output on that run

Josh Josue (jjosue@shield-legal.com)
2025-04-02 16:19:42

I didnt see any “memory” related errors

Josh Josue (jjosue@shield-legal.com)
2025-04-02 16:21:02

I created a separate branch for the approach I mentioned earlier (involve batch processing and creating pivot tables)

Gonna try to test it out

Josh Josue (jjosue@shield-legal.com)
2025-04-02 16:46:29

alrighty! so this version that I'm currently running on my local machine seems promising

It's already produced some data on manual_test_adverse_events_icd_metrics_retained_labels

James Scott (jamesscott@shield-legal.com)
2025-04-03 07:42:43

quick question, did you ever figure out the clinical trial additions for a bigger table?

Josh Josue (jjosue@shield-legal.com)
2025-04-03 10:23:50

looks like Step 8 is fixed - the new method created a table with 79k, but after a query of SELECT DISTINCT it yielded 71k

Josh Josue (jjosue@shield-legal.com)
2025-04-03 10:24:19

as for your question, are you talking about Step 10?

Josh Josue (jjosue@shield-legal.com)
2025-04-03 10:27:46

I havent been able to find any additional international api's for that yet since I've been focused on debugging as of late

Today, I'll be updating Step 9 based on your new notebook code

Josh Josue (jjosue@shield-legal.com)
2025-04-03 12:02:15

are you updating the machine-learning instance? it's down again and I can't start it up

James Scott (jamesscott@shield-legal.com)
2025-04-03 12:09:27

No sometimes it’s down due to inactivity

Josh Josue (jjosue@shield-legal.com)
2025-04-03 12:09:43

i see

Josh Josue (jjosue@shield-legal.com)
2025-04-03 12:15:16

it's weird, I can start it back up on the Vertex page, but not on the instances page

James Scott (jamesscott@shield-legal.com)
2025-04-03 13:04:31

are u ok now

Josh Josue (jjosue@shield-legal.com)
2025-04-03 13:05:50

yea im able to look at the jupyter notebook now

Josh Josue (jjosue@shield-legal.com)
2025-04-03 13:14:27

going thru your new step 9, looks like you've changed the sigmoid calculation

Josh Josue (jjosue@shield-legal.com)
2025-04-03 13:15:11

and create_pairs()

James Scott (jamesscott@shield-legal.com)
2025-04-03 13:15:46

Yes it is working now, it takes a while to run end to end. I changed it back to all pairs but the code works flawlessly

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-04-03 13:17:31

also there are 2 frames that contain very similar code for step 9

Josh Josue (jjosue@shield-legal.com)
2025-04-03 13:17:43

I'm currently going by the first one

James Scott (jamesscott@shield-legal.com)
2025-04-03 14:23:13

let me look

James Scott (jamesscott@shield-legal.com)
2025-04-03 14:23:16

there is a difference

James Scott (jamesscott@shield-legal.com)
2025-04-03 14:23:24

one has a limit of 5000 pairs in create_pairs

James Scott (jamesscott@shield-legal.com)
2025-04-03 14:23:28

the other is the full one

James Scott (jamesscott@shield-legal.com)
2025-04-03 14:24:38

the first one is ok for now i guess if it helps u create the process but the second one is everything
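
A rough sketch of the difference between the two versions, assuming a hypothetical `create_pairs_sketch` function (the notebook's real `create_pairs()` is not shown in this chat and may group or sample differently). The full version builds every pair of items with different scores; the capped version keeps at most `max_pairs` of them:

```python
import itertools
import random
from typing import List, Optional, Tuple


def create_pairs_sketch(
    scores: List[float], max_pairs: Optional[int] = None, seed: int = 0
) -> List[Tuple[int, int, float]]:
    """Return (i, j, label) triples; label is 1.0 when item i outranks item j.

    Ties are skipped. With max_pairs set, a random subset of that size is kept.
    """
    pairs = [
        (i, j, 1.0 if scores[i] > scores[j] else 0.0)
        for i, j in itertools.combinations(range(len(scores)), 2)
        if scores[i] != scores[j]
    ]
    if max_pairs is not None and len(pairs) > max_pairs:
        pairs = random.Random(seed).sample(pairs, max_pairs)
    return pairs


all_pairs = create_pairs_sketch([3.0, 1.0, 2.0, 5.0])
capped_pairs = create_pairs_sketch([3.0, 1.0, 2.0, 5.0], max_pairs=4)
```

The number of candidate pairs grows quadratically with the number of items being compared, which would be consistent with the uncapped version taking much longer to train.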

Josh Josue (jjosue@shield-legal.com)
2025-04-03 14:27:17

Gotcha

Josh Josue (jjosue@shield-legal.com)
2025-04-03 14:28:00

Ok yea I’m currently running the first one

After that finishes, I’ll update it to reflect the second one

James Scott (jamesscott@shield-legal.com)
2025-04-03 14:31:14

awesome! sounds good with me, the second one takes a while

James Scott (jamesscott@shield-legal.com)
2025-04-03 14:31:17

at least 1-2 hours

👍 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-04-03 14:31:24

i got to step 4 then it failed i was pissed

Josh Josue (jjosue@shield-legal.com)
2025-04-03 14:31:58

Step 4?

Josh Josue (jjosue@shield-legal.com)
2025-04-03 14:32:16

What’s step 4?

Josh Josue (jjosue@shield-legal.com)
2025-04-03 14:36:37

From the jupyterlab notebook?

James Scott (jamesscott@shield-legal.com)
2025-04-03 14:40:26

no, step four in epochs

Josh Josue (jjosue@shield-legal.com)
2025-04-03 14:40:47

Oh gotcha

Josh Josue (jjosue@shield-legal.com)
2025-04-03 15:14:50

alrighty - it finished successfully!

Josh Josue (jjosue@shield-legal.com)
2025-04-03 15:16:32

integ_test_adverse_events_ranking_logs has 68k

Josh Josue (jjosue@shield-legal.com)
2025-04-03 15:18:35

your adverse_ranking_prod table has 65k so I think this is a SUCCESS

Josh Josue (jjosue@shield-legal.com)
2025-04-03 15:18:59

going to update the code to reflect the 2nd one

James Scott (jamesscott@shield-legal.com)
2025-04-03 16:14:07

Haha sounds like a success to me lol

James Scott (jamesscott@shield-legal.com)
2025-04-03 16:14:19

We have to make sure that data is in step 9

Josh Josue (jjosue@shield-legal.com)
2025-04-03 16:25:00

Sorry what do you mean by that? This is the resulting table of Step9

Josh Josue (jjosue@shield-legal.com)
2025-04-03 16:25:25

Also, I’m currently running the updated code based on second frame

James Scott (jamesscott@shield-legal.com)
2025-04-04 07:07:24

And step 9 with 68k came from the new step 8 data?

Josh Josue (jjosue@shield-legal.com)
2025-04-04 10:24:42

yes that is correct

Josh Josue (jjosue@shield-legal.com)
2025-04-04 10:26:12

however, that 68k was based off the 1st frame on jupyterlab

Currently on Epoch3 for the updated code based off the 2nd frame on jupyterlab

Josh Josue (jjosue@shield-legal.com)
2025-04-04 10:27:10

seems like there's 4 to 5 hours between each epoch so far

James Scott (jamesscott@shield-legal.com)
2025-04-04 11:21:55

really thats crazy lol

😆 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-04-04 11:55:13

That said, given the runtime of our entire pipeline, and lack of automated deployment, i think i need to setup Kestra for our production solution

(When all this is done of course and when you have the data needed for your demo)

James Scott (jamesscott@shield-legal.com)
2025-04-07 05:22:08

lets try to get the data pipeline working correctly end to end before we talk about switching, but i understand what you mean

Josh Josue (jjosue@shield-legal.com)
2025-04-07 09:43:15

yea for sure!

Ok so the first frame of Step 9 came out to 68k, while the second frame came out to 2 million (and ran for about a day)

Josh Josue (jjosue@shield-legal.com)
2025-04-07 09:44:20

i know that frame 1 is only a limited version in terms of how many pairs it creates - did your 2nd version ever finish running? I remember you mentioned it failed on epoch 4

Josh Josue (jjosue@shield-legal.com)
2025-04-07 10:08:05

i noticed my batch_size was set to 32, going to update it to 128 and set epochs to 5. Gonna run it again

James Scott (jamesscott@shield-legal.com)
2025-04-07 10:22:52

lets just run the one with the number of pairs listed cuz we need to move on from this this week. so instead of spending a day running that, finalize the working one with the reduced pair count

Josh Josue (jjosue@shield-legal.com)
2025-04-07 10:24:01

copy that

Josh Josue (jjosue@shield-legal.com)
2025-04-07 10:24:16

the result of the first version can be found on integ_test_adverse_events_ranking

Josh Josue (jjosue@shield-legal.com)
2025-04-07 10:24:27

i'll revert my code to that version

James Scott (jamesscott@shield-legal.com)
2025-04-07 10:24:31

and this is 68k records?

Josh Josue (jjosue@shield-legal.com)
2025-04-07 10:24:36

yes 68k

James Scott (jamesscott@shield-legal.com)
2025-04-07 10:24:53

awesome lets use that, there are some things we need to get done asap, so lets hop on a call when thats done

Josh Josue (jjosue@shield-legal.com)
2025-04-07 10:25:32

you want me to run the first one again?

What i'm saying is that the result is already up for that first version

Josh Josue (jjosue@shield-legal.com)
2025-04-07 10:25:46

and i can hop on a call rn

James Scott (jamesscott@shield-legal.com)
2025-04-07 10:26:07

i have a doctors appointment in like 5 minutes but i can be quick

Josh Josue (jjosue@shield-legal.com)
2025-04-07 10:26:17

sure

Josh Josue (jjosue@shield-legal.com)
2025-04-07 10:31:23

integ_test_drug_ranking_llm_recommendation

Josh Josue (jjosue@shield-legal.com)
2025-04-07 12:34:01

I need help obtaining a new API key for Voyage

Josh Josue (jjosue@shield-legal.com)
2025-04-07 12:34:27

I'm not able to properly auth the VoyageAIEmbeddings

Josh Josue (jjosue@shield-legal.com)
2025-04-07 12:38:00

Is that from a paid account?

James Scott (jamesscott@shield-legal.com)
2025-04-07 12:39:04

For that step? Yes it is ugh boy one second

Josh Josue (jjosue@shield-legal.com)
2025-04-07 13:07:05

any luck?

James Scott (jamesscott@shield-legal.com)
2025-04-07 13:10:55

i updated the api key, should run

Josh Josue (jjosue@shield-legal.com)
2025-04-07 13:11:09

awesome! thanks

Josh Josue (jjosue@shield-legal.com)
2025-04-07 13:12:57

i got an error

Josh Josue (jjosue@shield-legal.com)
2025-04-07 13:13:00
James Scott (jamesscott@shield-legal.com)
2025-04-07 13:14:31

ok one second

James Scott (jamesscott@shield-legal.com)
2025-04-07 13:28:37

try now

Josh Josue (jjosue@shield-legal.com)
2025-04-07 13:29:13

i still got the same error 😕

James Scott (jamesscott@shield-legal.com)
2025-04-07 13:31:14

says it takes a couple minutes

Josh Josue (jjosue@shield-legal.com)
2025-04-07 13:31:23

oh ok gotcha

Josh Josue (jjosue@shield-legal.com)
2025-04-07 13:31:29

also i accepted the Voyage invite

Josh Josue (jjosue@shield-legal.com)
2025-04-07 13:32:40

alrighty! step 10 is running now

James Scott (jamesscott@shield-legal.com)
2025-04-07 13:41:26

awesome!

Josh Josue (jjosue@shield-legal.com)
2025-04-07 13:51:26

so i know we have a meeting about the dashboard tomorrow

the last time i ran step10, it was about 7 hrs, so just in case it doesnt finish by EOD, I'll continue to run it thru tonight

James Scott (jamesscott@shield-legal.com)
2025-04-07 13:54:08

which step 10 is this, the file system or clinical trials? is there already a table with a lot of them populated?

Josh Josue (jjosue@shield-legal.com)
2025-04-07 13:55:31

so i have a table saved from the previous run (from like 2 weeks ago) and I'm running the classification part off of those abstracts

James Scott (jamesscott@shield-legal.com)
2025-04-07 13:55:44

can u point me to the table

Josh Josue (jjosue@shield-legal.com)
2025-04-07 13:55:53

one sec

Josh Josue (jjosue@shield-legal.com)
2025-04-07 13:56:38

integ_test_preprocess_abstracts

James Scott (jamesscott@shield-legal.com)
2025-04-07 13:59:23

this is good for right now, let me use this

James Scott (jamesscott@shield-legal.com)
2025-04-07 14:00:04

wait one second

James Scott (jamesscott@shield-legal.com)
2025-04-07 14:01:18

no this needs updating

James Scott (jamesscott@shield-legal.com)
2025-04-07 14:01:21

look at this

James Scott (jamesscott@shield-legal.com)
2025-04-07 14:01:28

versus this

James Scott (jamesscott@shield-legal.com)
2025-04-07 14:01:35

SELECT * FROM `ai-projects-406720.drug_model.clincial_trial_prod` LIMIT 1000

James Scott (jamesscott@shield-legal.com)
2025-04-07 14:01:50

this is the final product im expecting

James Scott (jamesscott@shield-legal.com)
2025-04-07 14:02:30

think it just needs to be rerun with all the code

Josh Josue (jjosue@shield-legal.com)
2025-04-07 14:03:03

ok yea, I'll run the ingestion part too on the new Step 9 rank table

James Scott (jamesscott@shield-legal.com)
2025-04-07 14:04:46

yup that is perfect

Josh Josue (jjosue@shield-legal.com)
2025-04-07 14:07:31

I also wanted to mention that the vs-code-machine-learning instance isnt stable. It sometimes would suddenly terminate outta nowhere - this is another issue I hope to solve with a formal pipeline instance

James Scott (jamesscott@shield-legal.com)
2025-04-07 19:59:51

maybe we need to use github codespaces

Josh Josue (jjosue@shield-legal.com)
2025-04-07 20:01:35

We could, but if this is about the repos being too many, I’m going to combine them all into one repo after we get this data for the dashboard done

James Scott (jamesscott@shield-legal.com)
2025-04-07 20:35:03

oh no its about being able to share code and stuff lol

James Scott (jamesscott@shield-legal.com)
2025-04-07 20:35:14

and work easier

Josh Josue (jjosue@shield-legal.com)
2025-04-08 10:24:45

oh ok gotcha

Josh Josue (jjosue@shield-legal.com)
2025-04-08 10:25:23

Unfortunately, the Step 10 run crashed ughh I'll look into fixes for this

Josh Josue (jjosue@shield-legal.com)
2025-04-08 10:51:20

made some adjustments and running a new test on my local machine

James Scott (jamesscott@shield-legal.com)
2025-04-08 11:00:16

awesome

Josh Josue (jjosue@shield-legal.com)
2025-04-08 11:28:29

ok so while im waiting on Step10 to finish, I'll tinker with Looker today

James Scott (jamesscott@shield-legal.com)
2025-04-08 11:33:01

do u want to have a call about it

Josh Josue (jjosue@shield-legal.com)
2025-04-08 11:35:18

maybe later this afternoon, i'll need to take a look at what i'm working with first

James Scott (jamesscott@shield-legal.com)
2025-04-08 11:35:41

ok i can assist if needed its not that bad

Josh Josue (jjosue@shield-legal.com)
2025-04-08 11:35:58

thanks!

Josh Josue (jjosue@shield-legal.com)
2025-04-08 12:38:36

i still cant view the project

James Scott (jamesscott@shield-legal.com)
2025-04-08 12:42:11

Let me look

Josh Josue (jjosue@shield-legal.com)
2025-04-08 12:47:58

lemme see

Josh Josue (jjosue@shield-legal.com)
2025-04-08 12:48:18

that took me to the original one

Josh Josue (jjosue@shield-legal.com)
2025-04-08 12:48:26

the green looking UI

Josh Josue (jjosue@shield-legal.com)
2025-04-08 12:49:39

how do i navigate to the new dashboard that you created?

Josh Josue (jjosue@shield-legal.com)
2025-04-08 12:54:36

Under "shared with me" i only have Tortellignece.ai MFP 2.0 - Private & Confidential

Josh Josue (jjosue@shield-legal.com)
2025-04-08 12:57:33

i sent a request for access on the blue dashboard you showed earlier today

James Scott (jamesscott@shield-legal.com)
2025-04-08 12:57:57

yea im trying its not working

Josh Josue (jjosue@shield-legal.com)
2025-04-08 12:58:06

aww dang

Josh Josue (jjosue@shield-legal.com)
2025-04-08 13:00:14

does this help?

James Scott (jamesscott@shield-legal.com)
2025-04-08 13:22:32

i see ur in the previous one

James Scott (jamesscott@shield-legal.com)
2025-04-08 13:22:34

lets do this

James Scott (jamesscott@shield-legal.com)
2025-04-08 13:22:36

make a copy of it

James Scott (jamesscott@shield-legal.com)
2025-04-08 13:22:39

and u can work from there

Josh Josue (jjosue@shield-legal.com)
2025-04-08 13:22:53

i already am tinkering with a duplicate of the old one

James Scott (jamesscott@shield-legal.com)
2025-04-08 13:23:04

awesome, do u need the new data tables?

Josh Josue (jjosue@shield-legal.com)
2025-04-08 13:23:23

yeah, i dont actually know what each of your tiles have

Josh Josue (jjosue@shield-legal.com)
2025-04-08 13:37:22

also, im having trouble with getting data to show up - I've checked that the data sources are properly connected

Josh Josue (jjosue@shield-legal.com)
2025-04-08 13:37:46
James Scott (jamesscott@shield-legal.com)
2025-04-08 13:38:22

Let’s have a short meeting on this if ur free

Josh Josue (jjosue@shield-legal.com)
2025-04-08 13:40:30

yea im up for a huddle

James Scott (jamesscott@shield-legal.com)
2025-04-08 13:41:03

cant hear u

Josh Josue (jjosue@shield-legal.com)
2025-04-08 13:41:33

looking at my mic settings

Josh Josue (jjosue@shield-legal.com)
2025-04-08 13:41:36

srry

James Scott (jamesscott@shield-legal.com)
2025-04-08 13:42:43
James Scott (jamesscott@shield-legal.com)
2025-04-08 13:46:57
James Scott (jamesscott@shield-legal.com)
2025-04-08 13:47:06

https://themedialaboratory.slack.com/archives/D088F7N2UG2/p1744134260051499

James Scott (jamesscott@shield-legal.com)
2025-04-08 13:55:05

SELECT *, LAG(rank) OVER (PARTITION BY brandname, manufacturername, activesubstancename ORDER BY caseyear) AS lastyear_rank FROM `ai-projects-406720.drug_model.integ_test_adverse_events_ranking`
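
For anyone checking the query's output locally, here is a rough pandas sketch of what the LAG window computes. The column names come from the query; the sample rows are made up. Note that LAG returns the previous row within each drug group, which is only "last year" when no years are missing.

```python
import pandas as pd

# Toy rows: column names follow the query, values are invented for illustration.
df = pd.DataFrame({
    "brandname": ["A", "A", "A", "B"],
    "manufacturername": ["m1", "m1", "m1", "m2"],
    "activesubstancename": ["s1", "s1", "s1", "s2"],
    "caseyear": [2023, 2021, 2022, 2023],
    "rank": [3, 10, 5, 1],
})

keys = ["brandname", "manufacturername", "activesubstancename"]

# Sort by year inside each drug group, then shift rank down by one row.
# The first row of every group has no predecessor, so it becomes NaN (NULL in SQL).
df = df.sort_values(keys + ["caseyear"]).reset_index(drop=True)
df["lastyear_rank"] = df.groupby(keys)["rank"].shift(1)
```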

James Scott (jamesscott@shield-legal.com)
2025-04-08 14:28:04

for clinical trial prod you have to create a blend with the v3 table

Josh Josue (jjosue@shield-legal.com)
2025-04-08 15:55:37

ok so how's this lookin?

Josh Josue (jjosue@shield-legal.com)
2025-04-08 15:56:03

i added the columns that I could see from your screenshot (e.g. pmid, risk_assessment, etc)

James Scott (jamesscott@shield-legal.com)
2025-04-08 18:14:37

awesome!

James Scott (jamesscott@shield-legal.com)
2025-04-09 06:32:09

i am almost done with the clinical trial data on the backend

Josh Josue (jjosue@shield-legal.com)
2025-04-09 10:21:07

*Thread Reply:* Do you mean that you're making changes to the jupyterlab code?

Josh Josue (jjosue@shield-legal.com)
2025-04-09 10:15:42

oh cool

mine was in the middle of the litigation recommendation phase, but terminated unexpectedly at 4223 out of 4724

Josh Josue (jjosue@shield-legal.com)
2025-04-09 10:15:57

I'm going to rerun that phase

Josh Josue (jjosue@shield-legal.com)
2025-04-09 10:21:55

I'm able to salvage yesterday's run bcuz I saved csv's between each phase
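
A minimal sketch of that kind of per-phase checkpointing, assuming each phase returns a list of dicts. The function name, phase names, and directory layout are hypothetical, not the actual Step 10 code.

```python
import csv
from pathlib import Path


def run_phase(name, compute, checkpoint_dir):
    """Return the rows for a phase, reusing its CSV checkpoint if one exists."""
    path = Path(checkpoint_dir) / f"{name}.csv"
    if path.exists():
        # A previous run already finished this phase: load it instead of recomputing.
        # Values loaded from CSV come back as strings.
        with path.open(newline="") as handle:
            return list(csv.DictReader(handle))
    rows = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as handle:
        fieldnames = list(rows[0].keys()) if rows else []
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

With this shape, a crash in a later phase only costs the phase that was running, since earlier phases are read back from disk on the next run.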

James Scott (jamesscott@shield-legal.com)
2025-04-09 10:21:58

hmm? did you do the like or get all the drugs?

Josh Josue (jjosue@shield-legal.com)
2025-04-09 10:22:21

yea - I'm processing the step 9 ranking table's updated results

James Scott (jamesscott@shield-legal.com)
2025-04-09 10:35:34

im talking about the clinical data

Josh Josue (jjosue@shield-legal.com)
2025-04-09 10:37:55

also yes

Josh Josue (jjosue@shield-legal.com)
2025-04-09 10:38:20

the beginning of step 10 ingested all the clinical abstracts for each drug name obtained from Step 9's ranking table

Josh Josue (jjosue@shield-legal.com)
2025-04-09 11:57:26

so on that dashboard duplicate that I'm working on - What else should I tweak about it?

James Scott (jamesscott@shield-legal.com)
2025-04-09 11:59:24

I would have to get u the table name

James Scott (jamesscott@shield-legal.com)
2025-04-09 12:49:52

ok

James Scott (jamesscott@shield-legal.com)
2025-04-09 12:49:59

so that table

James Scott (jamesscott@shield-legal.com)
2025-04-09 12:50:03

want to have a quick call

Josh Josue (jjosue@shield-legal.com)
2025-04-09 12:52:35

yep sure

Josh Josue (jjosue@shield-legal.com)
2025-04-09 12:53:10

oh man

James Scott (jamesscott@shield-legal.com)
2025-04-09 12:53:32
Josh Josue (jjosue@shield-legal.com)
2025-04-09 12:53:34

it's weird bcuz it shows my mic is picking up my voice

Josh Josue (jjosue@shield-legal.com)
2025-04-09 14:14:19

So i think i have the line graphs set up right, but due to the metrics being blank it doesnt show anything

Josh Josue (jjosue@shield-legal.com)
2025-04-09 14:14:46

just wanted to confirm i did that correctly

James Scott (jamesscott@shield-legal.com)
2025-04-10 05:21:38

ok i gotta fix that graph

Josh Josue (jjosue@shield-legal.com)
2025-04-10 07:00:03

my step 10 llm recommendation phase created integ_test_drug_ranking_llm_recommendation in case we wanted to have something to use for the dashboard demo

James Scott (jamesscott@shield-legal.com)
2025-04-10 07:21:56

what step is this?

James Scott (jamesscott@shield-legal.com)
2025-04-10 07:22:15

like can you show me the code for this?

James Scott (jamesscott@shield-legal.com)
2025-04-10 07:24:20

also, how do you intend to use this? how are you going to reference it back to any of the clinical trials and pmids with just a brand_name recommendation and reason?

Josh Josue (jjosue@shield-legal.com)
2025-04-10 10:26:56

This is for Step 10 https://github.com/josh-SL/clinical_extraction

I wasnt aware it was missing those columns - i was still using the older version of the jupyterlab code during yesterday's run. I'll be updating Step 10 today with your latest jupyterlab code

Josh Josue (jjosue@shield-legal.com)
2025-04-10 10:27:23

I'll also add a test asserting the pmid column is present for each entry
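
One way such a test could look, as a sketch only. It assumes the Step 10 output is loaded into a pandas DataFrame; the helper name and the sample frame are hypothetical.

```python
import pandas as pd


def rows_missing_pmid(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose pmid is absent or blank; fail if the column is missing."""
    if "pmid" not in df.columns:
        raise AssertionError("expected a 'pmid' column in the Step 10 output")
    pmid = df["pmid"]
    blank = pmid.isna() | (pmid.astype(str).str.strip() == "")
    return df[blank]


# Hypothetical sample: the second and third rows should be flagged.
sample = pd.DataFrame({
    "pmid": ["123", None, " "],
    "brand_name": ["A", "B", "C"],
})
flagged = rows_missing_pmid(sample)
```

A test would then assert that `rows_missing_pmid(result)` is empty for the real output table.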

Josh Josue (jjosue@shield-legal.com)
2025-04-10 11:37:21

just to make sure we're on the same page - what are the expected columns for Step 10?

Josh Josue (jjosue@shield-legal.com)
2025-04-10 11:37:34

(this would help me write a test for it)

James Scott (jamesscott@shield-legal.com)
2025-04-10 11:38:04

Look at clinical trial prod

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-04-10 12:16:41

ok so just to confirm, I do have those columns from clinical trial prod - but you're saying that the final table from Step 10 would have those columns + llm recommendation and reason columns

did I understand that correctly?

James Scott (jamesscott@shield-legal.com)
2025-04-10 13:41:23

Yes yes! Let's think through it: having a recommendation doesn’t mean anything if we don’t know what it’s recommending lol

Josh Josue (jjosue@shield-legal.com)
2025-04-10 13:42:23

lol ok gotcha, i thought the target audience only cared about what drug had what recommendation

fixing it rn

James Scott (jamesscott@shield-legal.com)
2025-04-10 13:43:21

But they still need to know the reference of the recommendation lol

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-04-10 15:40:21

did you push your changes to your script onto the repo?

Josh Josue (jjosue@shield-legal.com)
2025-04-10 15:40:57

i'm looking at the jupyterlab code and im not sure what the updates are

James Scott (jamesscott@shield-legal.com)
2025-04-10 15:43:32

No I haven’t pushed anything what changed

Josh Josue (jjosue@shield-legal.com)
2025-04-10 15:44:11

ok gotcha

Josh Josue (jjosue@shield-legal.com)
2025-04-10 15:44:31

well as it stands, the dev branch of step10 now has those expected columns

Josh Josue (jjosue@shield-legal.com)
2025-04-10 15:44:46

I'll wait for your changes before doing another full run of step10

James Scott (jamesscott@shield-legal.com)
2025-04-10 15:45:43

Awesome! Can you try to get everything into one GitHub repo with multiple branches, or even multiple folders of the same repo

James Scott (jamesscott@shield-legal.com)
2025-04-10 15:45:58

If u need naming conventions that are best we can have a meeting

Josh Josue (jjosue@shield-legal.com)
2025-04-10 15:46:12

yea sure we can do a quick huddle

Josh Josue (jjosue@shield-legal.com)
2025-04-10 15:47:24

i'll go ahead and start consolidating all the steps into 1 repo

James Scott (jamesscott@shield-legal.com)
2025-04-10 15:48:31

I am away from my pc now but when are u leaving in an hour or so?

Josh Josue (jjosue@shield-legal.com)
2025-04-10 15:48:52

oh ok it's all good then, I have a good idea on how to structure this repo

Josh Josue (jjosue@shield-legal.com)
2025-04-10 15:49:00

multiple folders - a folder for each step

Josh Josue (jjosue@shield-legal.com)
2025-04-10 15:49:15

prolly gonna leave the office in 30min

Josh Josue (jjosue@shield-legal.com)
2025-04-10 15:49:46

did you have a preference for the name of the repo?

James Scott (jamesscott@shield-legal.com)
2025-04-10 16:07:17

I think it should be shield-genai-tortelllifence

James Scott (jamesscott@shield-legal.com)
2025-04-10 16:07:30

Or the tort name from the dashboard

James Scott (jamesscott@shield-legal.com)
2025-04-11 10:42:41

i think we as a team might move to github codespaces for everything, not too sure, but the github repo with everything is going to be good

James Scott (jamesscott@shield-legal.com)
2025-04-11 10:42:42

more so

James Scott (jamesscott@shield-legal.com)
2025-04-11 10:43:01

the latest table u completed, whats that name

Josh Josue (jjosue@shield-legal.com)
2025-04-11 10:54:14

Yea I’m excited for this new repo bcuz i learned that Kestra can be configured to get each step’s code from there, also i can refactor for proper practices like a .env file and automated deployment
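
As a rough illustration of the .env approach, here is a dependency-free loader sketch. Libraries such as python-dotenv are commonly used for this; the hand-rolled function below and the key name in the usage note are assumptions, not the repo's actual code.

```python
import os
from pathlib import Path


def load_env(path=".env"):
    """Read KEY=VALUE lines into os.environ without overwriting existing values."""
    loaded = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not a KEY=VALUE pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        os.environ.setdefault(key, value)
        loaded[key] = os.environ[key]
    return loaded
```

Each step would call `load_env()` once at startup and then read secrets with something like `os.environ["EXAMPLE_API_KEY"]` (a placeholder name), so keys never appear in a notebook cell.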

Josh Josue (jjosue@shield-legal.com)
2025-04-11 10:54:59

The step10 table?

integ_test_drug_ranking_llm_recommendation

James Scott (jamesscott@shield-legal.com)
2025-04-11 11:28:06

oh yes this wasnt updated yet right?

James Scott (jamesscott@shield-legal.com)
2025-04-11 11:28:15

and i was looking to see what to do next

James Scott (jamesscott@shield-legal.com)
2025-04-11 11:28:26

like which step i need to work on based off of what u did

Josh Josue (jjosue@shield-legal.com)
2025-04-11 11:28:31

Correct - i have to refactor with your new recommendation code

Josh Josue (jjosue@shield-legal.com)
2025-04-11 11:28:36

Then rerun step 10

Josh Josue (jjosue@shield-legal.com)
2025-04-11 11:29:30

Step 10 has abstracts ingestion, abstracts classification, litigation recommendation all done

Josh Josue (jjosue@shield-legal.com)
2025-04-11 11:30:15

But any updates on the code that you made recently, I need it to refactor the code currently on the dev branch

James Scott (jamesscott@shield-legal.com)
2025-04-11 11:35:06

I would say just get everything in there and refactor it later

Josh Josue (jjosue@shield-legal.com)
2025-04-11 11:39:20

Ok

Josh Josue (jjosue@shield-legal.com)
2025-04-11 11:40:25

But my disclaimer is that this is not a simple copy and paste - I’ll still need to update the file references in the code as I’m moving each step into this repo

James Scott (jamesscott@shield-legal.com)
2025-04-14 04:15:12

hey man, i am off this week, but i deff want to focus on your development with this code base, so we can focus on you this week and get anything unresolved or any confusion about the project task expectations out of the way, so let me know whenever you get on!

Josh Josue (jjosue@shield-legal.com)
2025-04-14 10:20:10

g'morning! I've completed putting all Steps (1 to 10) on 1 repo. I've refactored all of them to use a .env file

I was planning on setting up Kestra pipeline this week

is your updated Step 10 on Jupyterlab or on our github repo?

Josh Josue (jjosue@shield-legal.com)
2025-04-14 11:01:20

One of the unresolved issues is that the Jupyterlab code for Step 9 still doesnt pass the test for matching input and output data lengths

This screenshot shows that 1000 entries were used for sample_data, but the result had only 3 entries
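
A sketch of what that length check could look like, plus a helper for finding which entries went missing. Both function names and the `id` key are hypothetical; the real Step 9 data would need its own unique key.

```python
def check_same_length(input_rows, output_rows):
    """Raise AssertionError when a step drops or adds rows."""
    n_in, n_out = len(input_rows), len(output_rows)
    if n_in != n_out:
        raise AssertionError(f"row count changed: {n_in} in, {n_out} out")


def dropped_keys(input_rows, output_rows, key):
    """Return key values present in the input but absent from the output."""
    seen = {row[key] for row in output_rows}
    return [row[key] for row in input_rows if row[key] not in seen]


# Mimic the situation described above: 1000 entries in, 3 out.
sample_data = [{"id": i} for i in range(1000)]
result = sample_data[:3]
missing = dropped_keys(sample_data, result, "id")
```

Listing the dropped keys tends to be more useful for debugging than the bare count, since it shows whether the loss follows a pattern.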

James Scott (jamesscott@shield-legal.com)
2025-04-14 14:38:40

dont set up kestra, we have to discuss as a team

James Scott (jamesscott@shield-legal.com)
2025-04-14 14:38:52

whats the github repo, can u make me owner/admin of it

Josh Josue (jjosue@shield-legal.com)
2025-04-14 15:29:03

ok, I stopped the kestra instance

But i managed to test it to gather info on it and so far the gains we would have from it are:
• Scalable deployments (we're going to have pipelines for drugs, food, clothing, etc)
• Automated deployments
• Realtime Gantt charts (we can see how long each step takes to finish)
• Security in keeping our api keys (using a .env instead of having it out in the open in a notebook)
• Less clunky and more reliable (running our code on Big Query Workflow would crash a lot and cause delays in my development time)

Josh Josue (jjosue@shield-legal.com)
2025-04-14 15:33:24

I also sent the git repo transfer request to you

James Scott (jamesscott@shield-legal.com)
2025-04-14 18:31:51

is kestra in gcp? i just need to go over what it is cuz im not sure

James Scott (jamesscott@shield-legal.com)
2025-04-14 18:33:31

i see your github code!

James Scott (jamesscott@shield-legal.com)
2025-04-14 18:35:35

i did github codespace i like it a lot

👍 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-04-14 18:35:41

its what we been trying to do

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-04-14 18:42:36

Kestra runs on a docker instance which runs in GCP - I was using the free version

Josh Josue (jjosue@shield-legal.com)
2025-04-14 18:44:59

it's a tool used for ETLs

Josh Josue (jjosue@shield-legal.com)
2025-04-14 18:46:44

Here's an example of a pipeline UI from their tutorial

Josh Josue (jjosue@shield-legal.com)
2025-04-14 18:47:42

We'll be able to monitor pipelines running in realtime with these Gantt charts

James Scott (jamesscott@shield-legal.com)
2025-04-15 03:21:04

hows it look flow and performance wise

Josh Josue (jjosue@shield-legal.com)
2025-04-15 11:14:44

I stopped the instance upon your request yesterday

I could go ahead and finish setting it up today with our new repo if that’s ok

The other thing I’m working on is reconciling the discrepancy between our datasets. I’m using the FASENRA entries as a benchmark. So far i was able to gather more ICD10 entries after running Step 1 specifically for 2015. I’m going to continue to look for the years with missing entries

James Scott (jamesscott@shield-legal.com)
2025-04-15 11:15:17

i think the most pronto thing is the dashboard, how has that been looking? is all the data from the tables uploaded in there?

James Scott (jamesscott@shield-legal.com)
2025-04-15 11:15:28

i want to give it to ryan shortly

Josh Josue (jjosue@shield-legal.com)
2025-04-15 11:17:45

All the tables that you told me to put up are there

But as I’ve mentioned, the line graph doesnt show anything since those fields being referenced are blank

Josh Josue (jjosue@shield-legal.com)
2025-04-15 11:18:36

I didnt receive any more specs needed for the dashboard

James Scott (jamesscott@shield-legal.com)
2025-04-15 11:56:15

can you link me to the dashboard

Josh Josue (jjosue@shield-legal.com)
2025-04-15 11:56:55

Should I remove the Drug Litigation section?

Josh Josue (jjosue@shield-legal.com)
2025-04-15 14:00:24

what did Ryan say about the dashboard?

Josh Josue (jjosue@shield-legal.com)
2025-04-15 14:00:30

(he cancelled the meeting today)

James Scott (jamesscott@shield-legal.com)
2025-04-15 14:39:50

oh i havent talked to him about it, i was trying to talk to him today

Josh Josue (jjosue@shield-legal.com)
2025-04-15 14:40:08

gotcha

Josh Josue (jjosue@shield-legal.com)
2025-04-15 14:40:13

have you seen the dashboard?

Josh Josue (jjosue@shield-legal.com)
2025-04-15 14:40:24

i'm not sure if that's ok since we're just using it as a placeholder

Josh Josue (jjosue@shield-legal.com)
2025-04-15 14:41:12

I've been working on our data and pipeline this whole time since I'm not sure what direction to take for that dashboard

James Scott (jamesscott@shield-legal.com)
2025-04-15 14:42:24

yes im looking at it now, the direction is still the same, its just that the data needs to be populated. was the last step run of the prediction with the clinical trial summary or the recommendation?

Josh Josue (jjosue@shield-legal.com)
2025-04-15 14:43:59

the last step ran was step 10 (without your updates)

but the blend that it's using is your data table

Josh Josue (jjosue@shield-legal.com)
2025-04-15 14:44:38

just wanted to show my progress on Kestra - this Gantt chart is pretty useful in remotely monitoring the pipeline

James Scott (jamesscott@shield-legal.com)
2025-04-15 14:45:06

Gotcha I am on vacation so that’s why I been a little spotty this week, however was the step done with my updates since we can’t use the data table with only the single recommendation

Josh Josue (jjosue@shield-legal.com)
2025-04-15 14:46:38

I never got your updated code - would you like me to run step 10 from the jupyterlab notebook?

Josh Josue (jjosue@shield-legal.com)
2025-04-15 14:47:22

otherwise, i have integ_test_ranking_llm_recommendation just to populate the tables

James Scott (jamesscott@shield-legal.com)
2025-04-15 14:48:20

that integ test doesnt mean anything since we cant relate the recommendation run back

James Scott (jamesscott@shield-legal.com)
2025-04-15 14:48:46

this is the recommendation script that needs to be run again

Josh Josue (jjosue@shield-legal.com)
2025-04-15 14:48:53

ok

Josh Josue (jjosue@shield-legal.com)
2025-04-15 14:49:19

im assuming i'll run it based on your tables?

James Scott (jamesscott@shield-legal.com)
2025-04-15 14:50:28

no

James Scott (jamesscott@shield-legal.com)
2025-04-15 14:50:34

`ai-projects-406720.drug_model.integ_test_clinical_abstracts`

James Scott (jamesscott@shield-legal.com)
2025-04-15 14:50:38

this table

Josh Josue (jjosue@shield-legal.com)
2025-04-15 14:50:47

ok thanks, just wanted to confirm

James Scott (jamesscott@shield-legal.com)
2025-04-15 14:53:50

no problem! i got github codespaces, i am testing it out now with the github code u gave me

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-04-15 14:55:29

this is bad practice but in the interest of time, here's the .env file you'll need to be at the project's root dir lol

Josh Josue (jjosue@shield-legal.com)
2025-04-15 14:55:33
Josh Josue (jjosue@shield-legal.com)
2025-04-16 12:04:44

mini status report - the new step 10 code failed due to ValueError: columns overlap but no suffix specified: Index(['error'], dtype='object')
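
For context, pandas raises this error when `DataFrame.join` is called on two frames that share a column name and no suffix is given. A small reproduction and one possible fix, assuming both the classification and recommendation frames carry an `error` column. The frame contents and suffix names are made up.

```python
import pandas as pd

# Hypothetical frames that both carry an 'error' column.
classified = pd.DataFrame({"pmid": [1, 2], "error": [None, "timeout"]})
recommended = pd.DataFrame({"recommendation": ["yes", "no"], "error": [None, None]})

# Reproduce the failure: join() refuses to guess names for overlapping columns.
try:
    classified.join(recommended)
    message = ""
except ValueError as exc:
    message = str(exc)

# One fix: give each side a suffix so both 'error' columns survive the join.
joined = classified.join(
    recommended, lsuffix="_classification", rsuffix="_recommendation"
)
```

Another option is to drop or rename one of the `error` columns before joining, if only one of them is needed downstream.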

Working on a fix for it...

James Scott (jamesscott@shield-legal.com)
2025-04-16 15:31:32

Awesome !

James Scott (jamesscott@shield-legal.com)
2025-04-18 07:18:13

did you end up fixing it

Josh Josue (jjosue@shield-legal.com)
2025-04-18 10:16:58

yep, got the script running all thru yesterday and it finished this morning

Josh Josue (jjosue@shield-legal.com)
2025-04-18 10:18:10

I've had to tweak the new jupyterlab script bcuz the resulting table did not have the score columns

I'm also trying to see if i can fix the blend so the case_year isnt null

Josh Josue (jjosue@shield-legal.com)
2025-04-18 10:23:04

ok i think i fixed the case_year filter

James Scott (jamesscott@shield-legal.com)
2025-04-18 12:05:49

nice!! we dont need to show case year in there. what else do you think is left for steps to do?

James Scott (jamesscott@shield-legal.com)
2025-04-18 12:05:55

i will be back monday

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-04-18 12:14:54

• Step8 data discrepancy: I need to go back to my investigation of the data discrepancy in Step8's table. It's caused by a domino effect that goes all the way back to Step2's table. I'm going to do a more granular ingestion to ensure it doesnt have missing entries (those missing entries are affecting the metrics calculations)
• Production level pipeline: I've succeeded in creating a way to get .env values in our free version of Kestra (the paid version has a nice UI to upload it). I also succeeded in getting some integration tests to run as an example. But each Kestra run is isolated and does pip installs each time, and sometimes runs out of space, so I have to look into optimizing that (maybe find a way to make the virtualenv persist throughout Steps 1-10)

James Scott (jamesscott@shield-legal.com)
2025-04-18 13:32:12

Hmm so before kestra is there or is the pipeline or the data completed and looks good

Josh Josue (jjosue@shield-legal.com)
2025-04-18 13:34:50

sorry i dont understand what you meant by that

Josh Josue (jjosue@shield-legal.com)
2025-04-18 16:01:46

great news! i got our Kestra pipeline to stabilize (by creating a docker image with our dependencies) and now have a remote & dependable way to run our entire project!

this is helping me with debugging because the Big Query Workflow was prone to crashing/terminating which caused delays. it also expedites the deployment process

James Scott (jamesscott@shield-legal.com)
2025-04-18 18:48:51

really? you have to show it to me this is a great job!

Josh Josue (jjosue@shield-legal.com)
2025-04-18 19:41:38

haha thanks, I'm still trying to figure out how your icd table has 116 million. So i'm rerunning Steps 1 and 2 and then comparing it to your table using FASENRA as the reference drug
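
A sketch of how a per-year comparison against a reference drug could be done once both tables are pulled into Python as lists of dicts. The field names `brandname` and `caseyear` are borrowed from the ranking query and are only an assumption for the ICD tables; the sample rows are invented.

```python
from collections import Counter


def yearly_counts(rows, drug, name_field="brandname", year_field="caseyear"):
    """Count rows per year for one drug."""
    return Counter(r[year_field] for r in rows if r[name_field] == drug)


def count_gaps(mine, reference, drug):
    """Map each year to (my_count, reference_count) where the two differ."""
    a = yearly_counts(mine, drug)
    b = yearly_counts(reference, drug)
    return {y: (a[y], b[y]) for y in sorted(set(a) | set(b)) if a[y] != b[y]}


# Invented sample rows: the reference table has extra 2015 and 2024 entries.
mine = [{"brandname": "FASENRA", "caseyear": 2015}]
reference = [
    {"brandname": "FASENRA", "caseyear": 2015},
    {"brandname": "FASENRA", "caseyear": 2015},
    {"brandname": "FASENRA", "caseyear": 2024},
]
gaps = count_gaps(mine, reference, "FASENRA")
```

The years that show up in `gaps` are the ones worth re-ingesting first.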

Josh Josue (jjosue@shield-legal.com)
2025-04-18 19:42:29

http://34.46.241.86:8080/ui // the url of Kestra

Josh Josue (jjosue@shield-legal.com)
2025-04-18 19:42:41

username: dev@shield-legal.com pwd: tortAI123$

Josh Josue (jjosue@shield-legal.com)
2025-04-18 19:43:14

There are a lot of failures - but that was when I was trying to figure out Kestra's yaml, so lots of trial & error lol

Josh Josue (jjosue@shield-legal.com)
2025-04-21 14:23:38

Heya! I was able to reconcile the missing data of 2024 for Step2. So I'll continue doing this method til my Step2 table is closer to the original Step2 table

James Scott (jamesscott@shield-legal.com)
2025-04-22 09:45:49

whenever you get on, lets have a quick meeting

Josh Josue (jjosue@shield-legal.com)
2025-04-22 10:10:41

Good morning - i'm ready for a quick meeting

James Scott (jamesscott@shield-legal.com)
2025-04-22 10:16:43

let's wait till whenever you're in the office to go over some stuff

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-04-22 11:39:10

ok im at the office now

James Scott (jamesscott@shield-legal.com)
2025-04-22 12:02:30

Ok give me a few minutes

James Scott (jamesscott@shield-legal.com)
2025-04-22 12:30:18

you free

Josh Josue (jjosue@shield-legal.com)
2025-04-22 12:30:24

yessir

Josh Josue (jjosue@shield-legal.com)
2025-04-22 13:04:07

are we meeting with Ryan?

James Scott (jamesscott@shield-legal.com)
2025-04-22 13:04:43

Oh no it’s cancelled

Josh Josue (jjosue@shield-legal.com)
2025-04-22 13:04:55

oh lol ok

Josh Josue (jjosue@shield-legal.com)
2025-04-23 10:56:56

gmorning! the integ_test_adverse_events_ranking table has been updated

Josh Josue (jjosue@shield-legal.com)
2025-04-23 10:57:08

currently running Step10 for the recommendations

Josh Josue (jjosue@shield-legal.com)
2025-04-23 10:57:38

Should i change the data source on the Drug Rankings table of the dashboard?

James Scott (jamesscott@shield-legal.com)
2025-04-23 10:57:46

how are you running this? is it in that app or just vs code from github?

James Scott (jamesscott@shield-legal.com)
2025-04-23 10:57:55

yes u should

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-04-23 10:58:01

it's running on a GCP instance - the Kestra pipeline

Josh Josue (jjosue@shield-legal.com)
2025-04-23 10:58:38

with Kestra, I can check on its progress wherever I am

James Scott (jamesscott@shield-legal.com)
2025-04-23 11:04:36

you gotta show me how to do it

Josh Josue (jjosue@shield-legal.com)
2025-04-23 11:04:39

There's something weird with that SQL query on the dashboard

Josh Josue (jjosue@shield-legal.com)
2025-04-23 11:04:43

yea for sure!

Josh Josue (jjosue@shield-legal.com)
2025-04-23 11:04:47

i can do a quick call rn

Josh Josue (jjosue@shield-legal.com)
2025-04-23 11:05:05

my table has rank 1's but our dashboard's query is doing something funky

James Scott (jamesscott@shield-legal.com)
2025-04-23 12:22:10

hey did u get my message

Josh Josue (jjosue@shield-legal.com)
2025-04-23 12:22:28

message about what?

Josh Josue (jjosue@shield-legal.com)
2025-04-23 12:22:32

showing you Kestra?

Josh Josue (jjosue@shield-legal.com)
2025-04-23 12:23:04

im free to do a call rn if you're cool with that

James Scott (jamesscott@shield-legal.com)
2025-04-23 12:26:19

i am bout to walk downstairs to get my lunch, but after i should be free

Josh Josue (jjosue@shield-legal.com)
2025-04-23 12:26:27

alrighty!

Josh Josue (jjosue@shield-legal.com)
2025-04-23 15:34:21

Here's a quick summary of how to access our Kestra pipeline: ```1. Go to http://34.56.77.27:8080/ui (you can obtain the IP address from our GCP instances page)

  1. user: dev@shield-legal.com pwd: tortAI123$
  2. The Flows tab will show you the different flows available
  3. To run a flow, pick one and then click the "Execute" button at the top right

NOTE: Kestra treats any logs as an "error" even when they aren't. Fixing that is on my TODO list```

Josh Josue (jjosue@shield-legal.com)
2025-04-23 15:34:42

Here's an example of a Flow yaml, I've put emojis on the important configurations

Josh Josue (jjosue@shield-legal.com)
2025-04-23 15:35:44

This abstracts the version of code we are running from the pipeline infrastructure that we want to run it on

James Scott (jamesscott@shield-legal.com)
2025-04-24 09:04:10

ok let's go over today how to run this

Josh Josue (jjosue@shield-legal.com)
2025-04-24 11:19:09

Yep sure! Once i get into the office

Josh Josue (jjosue@shield-legal.com)
2025-04-24 11:48:10

alrighty, im at the office

Josh Josue (jjosue@shield-legal.com)
2025-04-24 11:56:03

ready when you are

James Scott (jamesscott@shield-legal.com)
2025-04-24 11:56:23

ok in a little i stepped out for lunch on my end

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-04-24 15:11:56

im back at the office if you wanna huddle

James Scott (jamesscott@shield-legal.com)
2025-04-24 15:20:27

Yes ! I have been super busy today with my WiFi being out trying to get it fixed with google been so noting

James Scott (jamesscott@shield-legal.com)
2025-04-24 15:20:31

Annoying

Josh Josue (jjosue@shield-legal.com)
2025-04-24 15:20:57

ohh dang

Josh Josue (jjosue@shield-legal.com)
2025-04-24 15:21:42

if you'd like, i could also try to make instructions on how to run it with screenshots

Josh Josue (jjosue@shield-legal.com)
2025-04-24 15:27:39

also, i wanted to ask for some collab time on 2 issues:
• Step 9 is up to date with the Jupyterlab code but still has not passed the test for input/output data lengths
• Dashboard ranking table isn't showing any drug with rank 1 (but the source table has rank 1's, so I'm guessing it has something to do with the query being done)

James Scott (jamesscott@shield-legal.com)
2025-04-25 12:09:38

Hey how’s it going

Josh Josue (jjosue@shield-legal.com)
2025-04-25 12:26:27

heya! did you receive my previous message?

Josh Josue (jjosue@shield-legal.com)
2025-04-25 12:27:46

i'm doing some code cleanup and double checking Step 9. The current code on dev and main are identical to the Jupyterlab notebook, yet tests would show the inconsistency in the input/output lengths
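
A minimal sketch of the kind of input/output length check mentioned above, assuming the step is a function that should return one output row per input row. `check_lengths` and `fake_step9` are hypothetical names used for illustration only, not the project's actual test code.

```python
from typing import Callable, Sequence


def check_lengths(step: Callable[[Sequence[dict]], Sequence[dict]],
                  rows: Sequence[dict]) -> tuple[int, int, bool]:
    """Run a step and report (input_len, output_len, lengths_match)."""
    out = step(rows)
    return len(rows), len(out), len(rows) == len(out)


def fake_step9(rows):
    # Hypothetical stand-in for Step 9: it drops rows with no drug name,
    # which is one way an output can end up shorter than its input.
    return [r for r in rows if r.get("drug")]


sample = [{"drug": "FASENRA"}, {"drug": None}, {"drug": "DUPIXENT"}]
result = check_lengths(fake_step9, sample)
print(result)
```

Reporting both lengths, rather than only a pass/fail flag, makes it easier to see how far apart the input and output are when the check fails.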

James Scott (jamesscott@shield-legal.com)
2025-04-25 12:33:26

let me know when ur done with the code cleanup, let's hop on a call

Josh Josue (jjosue@shield-legal.com)
2025-04-25 12:38:39

ok im ready

James Scott (jamesscott@shield-legal.com)
2025-04-25 12:56:00

ready

Josh Josue (jjosue@shield-legal.com)
2025-04-25 15:50:08

So i ran notebook code for steps 1 and 2

Josh Josue (jjosue@shield-legal.com)
2025-04-25 15:51:06

step 1 yielded 977,523 entries using the csv called drug_adverse_event_data_combined_datasets_chunked.csv

Josh Josue (jjosue@shield-legal.com)
2025-04-25 15:51:43

that's about 30 million off from adverse_events_prod

Josh Josue (jjosue@shield-legal.com)
2025-04-25 15:55:09

So what i'll do is copy adverse_events_prod and run Step 2 on that instead

James Scott (jamesscott@shield-legal.com)
2025-04-26 08:32:05

is that my code or yours?

Josh Josue (jjosue@shield-legal.com)
2025-04-27 10:46:40

All the code I'm running now is from the Jupyterlab Notebook and the data is based off your original table adverse_events_prod. The code is on the "jupyterlab" branch

Step 2 crashed after running for 5 hours. It attempted to execute process_adverse_events_data(). I'm sure this code ran well when the dataset was less than 37 million, but we need to collab on a piece-wise solution.

Tomorrow, I'll look into delegating that process to BigQuery through SQL
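
One possible shape for the piece-wise solution mentioned above is to process rows in fixed-size batches, so memory stays bounded and a crash only loses the batch in flight. This is a sketch under assumptions: `process_batch` is a hypothetical stand-in for whatever `process_adverse_events_data()` does, and the batch size is arbitrary.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(rows: Iterable, size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def process_batch(batch: list) -> list:
    # Hypothetical per-batch transform; the real logic lives in the notebook.
    return [{**r, "processed": True} for r in batch]


def run_piecewise(rows: Iterable, size: int = 1000) -> tuple[int, int]:
    """Process rows batch by batch; return (batches_run, rows_processed)."""
    batches = total = 0
    for batch in batched(rows, size):
        total += len(process_batch(batch))
        batches += 1
    return batches, total


summary = run_piecewise(({"case_id": i} for i in range(2500)), size=1000)
print(summary)
```

Because the input is consumed as a generator, only one batch is held in memory at a time.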

James Scott (jamesscott@shield-legal.com)
2025-04-28 05:40:05

ok thanks let me know when you get on

Josh Josue (jjosue@shield-legal.com)
2025-04-28 11:43:03

g'morning - i'm at the office now

James Scott (jamesscott@shield-legal.com)
2025-04-28 12:30:35

ok i will give u a call shortly

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-04-28 13:13:13

i think i have a viable sql solution. I'm doing spot checks and it has matching entry counts

Josh Josue (jjosue@shield-legal.com)
2025-04-28 13:25:41

I also noticed something odd - how does your Step 1 table not have a case entry but Step 2 has 6 of them?

Is there another data ingestion somewhere else? Perhaps a manual one that was done in the past?
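
A quick way to spot-check this kind of discrepancy is an anti-join on the case identifier: list the IDs that exist downstream but not upstream. The sketch below assumes both tables can be reduced to a list of IDs; the sample values are made up for illustration.

```python
def ids_missing_upstream(step1_ids, step2_ids):
    """Return sorted IDs that appear in Step 2 but not in Step 1."""
    return sorted(set(step2_ids) - set(step1_ids))


# Made-up sample IDs: 104 appears in Step 2 only (twice).
step1 = [101, 102, 103]
step2 = [101, 102, 103, 104, 104]

missing = ids_missing_upstream(step1, step2)
extra_rows = sum(1 for i in step2 if i in set(missing))
print(missing, extra_rows)
```

Any ID in `missing` would point to rows that entered Step 2 from somewhere other than the Step 1 table.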

James Scott (jamesscott@shield-legal.com)
2025-04-28 13:47:11

ok do you have a couple minutes at the top of the hour

Josh Josue (jjosue@shield-legal.com)
2025-04-28 13:47:26

yep!

James Scott (jamesscott@shield-legal.com)
2025-04-28 14:01:00

ready

Josh Josue (jjosue@shield-legal.com)
2025-04-28 14:01:11

🫡

Josh Josue (jjosue@shield-legal.com)
2025-04-28 14:01:42

can u hear me?

Josh Josue (jjosue@shield-legal.com)
2025-04-28 14:01:52

i cant hear u

Josh Josue (jjosue@shield-legal.com)
2025-05-01 16:23:28

It's worth noting that casetext.com's service is no longer working, so Step9's ingestion of case texts has been deprecated

Their page just says: "This service is no longer available, but we appreciate you being a part of it. For legal research, please visit Westlaw, and if you're curious about legal AI, check out CoCounsel. Thanks for stopping by!"

legal.thomsonreuters.com
Josh Josue (jjosue@shield-legal.com)
2025-05-07 16:35:51

I was able to produce a table from the Jupyterlab Steps 8 & 9 named jupyter_adverse_events_ranking and the result looks promising

It has unique active substance names and only has a total of 4,884 entries. Perhaps you'd be inclined to look at this table

James Scott (jamesscott@shield-legal.com)
2025-05-07 16:38:36

Yes I can take a look! I been re running some of the code myself thank u I will let u know tomorrow

Josh Josue (jjosue@shield-legal.com)
2025-05-07 16:42:01

yea thanks!

the caveat is that Step 8 was using "replace" so the data kept replacing until it was only 2024. But if you validate the ranking table as "good" then I can make the adjustments to the code to include all years. On the plus side, both Step 8 and Step 9 tables have the exact same length
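
The "replace" behavior described above can be reproduced in a small sketch. The actual Step 8 writer is not shown in this chat, so an in-memory `sqlite3` table is swapped in here purely for illustration; the table and column names are assumptions.

```python
import sqlite3


def write_year(conn, year, drugs, mode):
    """Write one year's rows; mode is 'replace' or 'append'."""
    if mode == "replace":
        # Mirrors a replace-style write: the table is dropped on every call.
        conn.execute("DROP TABLE IF EXISTS ranking")
    conn.execute("CREATE TABLE IF NOT EXISTS ranking (year INTEGER, drug TEXT)")
    conn.executemany("INSERT INTO ranking VALUES (?, ?)",
                     [(year, d) for d in drugs])


def years_after_load(mode):
    """Load three years in a loop and return the years left in the table."""
    conn = sqlite3.connect(":memory:")
    for year in (2022, 2023, 2024):
        write_year(conn, year, ["DUPIXENT", "FASENRA"], mode)
    years = [r[0] for r in conn.execute(
        "SELECT DISTINCT year FROM ranking ORDER BY year")]
    conn.close()
    return years


replace_years = years_after_load("replace")
append_years = years_after_load("append")
print(replace_years, append_years)
```

With a replace-style write inside a per-year loop, only the last year survives; an append-style write keeps every year.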

Josh Josue (jjosue@shield-legal.com)
2025-05-08 11:57:32

I also went ahead and created kestra_adverse_events_ranking to contain all the years, it has 10,059 entries

James Scott (jamesscott@shield-legal.com)
2025-05-08 11:58:13

Yes we can talk, little bit busy but I can switch these tables, I can show u what I did

Josh Josue (jjosue@shield-legal.com)
2025-05-08 11:58:14

I tested to see DUPIXENT and it has a unique entry per year
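
The same spot check can be generalized from one drug to the whole table by counting rows per (drug, year) key. This is a sketch with made-up rows; the `drug` and `year` field names are assumptions, not the real column names.

```python
from collections import Counter


def duplicate_keys(rows, key_fields=("drug", "year")):
    """Return the key tuples that occur more than once, sorted."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)


rows = [
    {"drug": "DUPIXENT", "year": 2023},
    {"drug": "DUPIXENT", "year": 2024},
    {"drug": "FASENRA", "year": 2024},
    {"drug": "FASENRA", "year": 2024},  # deliberate duplicate for the demo
]
dupes = duplicate_keys(rows)
print(dupes)
```

An empty result would mean every drug has at most one entry per year.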

James Scott (jamesscott@shield-legal.com)
2025-05-08 12:54:01

Yes I did my test this morning I can switch to yours forestry

James Scott (jamesscott@shield-legal.com)
2025-05-08 12:54:04

For kestra

Josh Josue (jjosue@shield-legal.com)
2025-05-08 12:57:09

sounds good, plz let me know how that turns out

Josh Josue (jjosue@shield-legal.com)
2025-05-08 16:54:23

so is the data on kestra_adverse_events_ranking good?

James Scott (jamesscott@shield-legal.com)
2025-05-08 16:57:01

I used the one u sent this morning, I haven’t checked that one yet, prob will do it in the morning. im building out all the graphs in python

👍 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-05-09 12:14:22

you free

Josh Josue (jjosue@shield-legal.com)
2025-05-09 12:15:02

Almost done with our meeting

James Scott (jamesscott@shield-legal.com)
2025-05-09 12:18:34

okkkk

Josh Josue (jjosue@shield-legal.com)
2025-05-09 12:20:26

ok im free now

James Scott (jamesscott@shield-legal.com)
2025-05-09 12:24:14

ok

Josh Josue (jjosue@shield-legal.com)
2025-05-09 12:36:04
James Scott (jamesscott@shield-legal.com)
2025-05-09 13:21:40

u never created the recommendation code adjustment correct ?

Josh Josue (jjosue@shield-legal.com)
2025-05-09 13:22:26

no i did not alter that Step 10 recommendation code

James Scott (jamesscott@shield-legal.com)
2025-05-09 13:22:55

Gotcha ok

Josh Josue (jjosue@shield-legal.com)
2025-06-04 12:18:00

I’ll get em to ya once I’m back from this meeting

James Scott (jamesscott@shield-legal.com)
2025-06-04 12:18:56

huh

Josh Josue (jjosue@shield-legal.com)
2025-06-04 12:19:21

Oh srry, wrong chat lol

James Scott (jamesscott@shield-legal.com)
2025-06-04 12:19:50

aahah its ok!

Josh Josue (jjosue@shield-legal.com)
2025-06-17 10:48:42

Hi James, do our ChatGPT accounts need an admin to allow usage or increase quota? I tried to use the api key yesterday and it gave an error 429 stating that I've reached the quota (even though it's the first time I was using it)

James Scott (jamesscott@shield-legal.com)
2025-06-17 10:57:20

The enterprise ChatGPT? I think it's under Ryan's name, I am not even sure if I have admin but actually let me look cuz he is out on vacation

James Scott (jamesscott@shield-legal.com)
2025-06-17 10:59:15

ok i am on i am the owner so i can adjust things

James Scott (jamesscott@shield-legal.com)
2025-06-17 10:59:19

let me look at quota

Josh Josue (jjosue@shield-legal.com)
2025-06-17 11:00:00

thanks! and also plz verify that my account has permissions to use API keys

James Scott (jamesscott@shield-legal.com)
2025-06-17 11:10:16

what project will this be for

Josh Josue (jjosue@shield-legal.com)
2025-06-17 11:11:01

Nick has me working on Short Form being filled out for Acts and Dichello concerning sexual abuse cases

Josh Josue (jjosue@shield-legal.com)
2025-06-17 11:11:37

I tested ChatGPT to extract info from docx files yesterday and it yielded the expected results, so I'm trying to automate that process

James Scott (jamesscott@shield-legal.com)
2025-06-17 11:13:12

yea i just needed to know who to assign the api key to, i have to create a project and stuff

👍 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-06-17 11:15:32

in chatgpt itself i created a new project for you called short form so you should use that for your like questions and code stuff

🙏 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-06-17 11:16:44

now let me do it on the billing side for api keys

Josh Josue (jjosue@shield-legal.com)
2025-06-17 11:19:35

thanks man, I appreciate it!

James Scott (jamesscott@shield-legal.com)
2025-06-17 11:23:00

sk-proj-epGZHG4dY-YaxrhyDAMeoGyxcGHbCex2F8LCp82WPexS2gMeVC17KGnt0lzUvTMYJijX8GdTT3BlbkFJRZHxWz21EsO1eBO3Hn3lSk8wL2_oRkCLZ94HFUaOeVFWxGlWlZSjKUFk0JhOA4vwwtY9yFdogA

James Scott (jamesscott@shield-legal.com)
2025-06-17 11:23:09

let me know when u have it so i can delete

James Scott (jamesscott@shield-legal.com)
2025-06-17 11:23:32

and test this out and see if it works

Josh Josue (jjosue@shield-legal.com)
2025-06-17 11:24:40

just tested it and i got the same 429 error

Josh Josue (jjosue@shield-legal.com)
2025-06-17 11:24:48

i verified that the client does in fact have the key

Josh Josue (jjosue@shield-legal.com)
2025-06-17 11:25:05

openai.RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: <https://platform.openai.com/docs/guides/error-codes/api-errors.>', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}
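
The error above has status 429 but its `code` is `insufficient_quota`, which points at billing rather than request rate, so retrying would not help. A minimal sketch of telling the two cases apart from the error payload follows; `should_retry` is a hypothetical helper, and the `rate_limit_exceeded` value in the second sample is included only as an illustrative non-quota code.

```python
def should_retry(status_code: int, error_body: dict) -> bool:
    """Decide whether a failed API call is worth retrying."""
    code = error_body.get("error", {}).get("code")
    if status_code == 429 and code == "insufficient_quota":
        # Quota/billing problem: retrying will not help until the plan
        # or payment details change.
        return False
    # Any other 429 is treated here as transient rate limiting.
    return status_code == 429


# Payload shaped like the error quoted above (message shortened).
quota_error = {"error": {"message": "You exceeded your current quota",
                         "type": "insufficient_quota",
                         "param": None,
                         "code": "insufficient_quota"}}
# Illustrative payload for an ordinary rate limit.
rate_error = {"error": {"code": "rate_limit_exceeded"}}

print(should_retry(429, quota_error), should_retry(429, rate_error))
```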

James Scott (jamesscott@shield-legal.com)
2025-06-17 11:26:01

ah

James Scott (jamesscott@shield-legal.com)
2025-06-17 11:26:04

that is because

James Scott (jamesscott@shield-legal.com)
2025-06-17 11:26:08

this is different than chat gpt

Josh Josue (jjosue@shield-legal.com)
2025-06-17 11:26:22

ohh

James Scott (jamesscott@shield-legal.com)
2025-06-17 11:26:23

i have to add cams card

James Scott (jamesscott@shield-legal.com)
2025-06-17 11:26:37

but do u need chatgpt?

Josh Josue (jjosue@shield-legal.com)
2025-06-17 11:26:38

I thought OpenAI owned chatgpt

James Scott (jamesscott@shield-legal.com)
2025-06-17 11:26:45

cuz for this kinda stuff we use aws

Josh Josue (jjosue@shield-legal.com)
2025-06-17 11:26:48

I'm just trying to use chatgpt (if that's free)

Josh Josue (jjosue@shield-legal.com)
2025-06-17 11:26:54

oh i see

Josh Josue (jjosue@shield-legal.com)
2025-06-17 11:26:58

the Bedrock LLM right?

James Scott (jamesscott@shield-legal.com)
2025-06-17 11:26:59

oh no its a cost associated

James Scott (jamesscott@shield-legal.com)
2025-06-17 11:27:05

yes all my models i use is in bedrock llm

Josh Josue (jjosue@shield-legal.com)
2025-06-17 11:27:12

i mean I could use bedrock llm if that's ok with you

James Scott (jamesscott@shield-legal.com)
2025-06-17 11:27:21

let me get that

James Scott (jamesscott@shield-legal.com)
2025-06-17 11:27:24

ah

James Scott (jamesscott@shield-legal.com)
2025-06-17 11:27:26

hmmm

James Scott (jamesscott@shield-legal.com)
2025-06-17 11:27:32

u might need an account in aws

James Scott (jamesscott@shield-legal.com)
2025-06-17 11:27:35

u dont have one right

Josh Josue (jjosue@shield-legal.com)
2025-06-17 11:27:40

I dont

James Scott (jamesscott@shield-legal.com)
2025-06-17 11:27:48

let me give u this template and see if it works

Josh Josue (jjosue@shield-legal.com)
2025-06-17 11:28:00

i could also use gemini if that works with our GCP account

James Scott (jamesscott@shield-legal.com)
2025-06-17 11:31:21

naw i dont have anything related to gemini

James Scott (jamesscott@shield-legal.com)
2025-06-17 11:31:52

you should be able to tweak this to what you need but this is basically the template

James Scott (jamesscott@shield-legal.com)
2025-06-17 11:32:31
James Scott (jamesscott@shield-legal.com)
2025-06-17 11:33:01

like this prompt file has the keys and connection the ipynb file has the run commands

Josh Josue (jjosue@shield-legal.com)
2025-06-17 12:23:27

ok thanks!

Josh Josue (jjosue@shield-legal.com)
2025-08-11 11:00:06

G'morning James, I tried to create API keys in OpenAI but it asked me to create a new org - are you able to create an API key for me? (it'll be used for a project for Abe)

James Scott (jamesscott@shield-legal.com)
2025-08-11 11:02:26

Let me try it out !

Josh Josue (jjosue@shield-legal.com)
2025-08-11 11:47:59

any luck?

James Scott (jamesscott@shield-legal.com)
2025-08-11 11:49:24

do u need a new api key i have one existing here for u

James Scott (jamesscott@shield-legal.com)
2025-08-11 11:49:34

sorry 3 other people pinged me same time

Josh Josue (jjosue@shield-legal.com)
2025-08-11 11:50:08

it's all good

I think the last api key i received was for AWS Bedrock

James Scott (jamesscott@shield-legal.com)
2025-08-11 11:50:15

yes

James Scott (jamesscott@shield-legal.com)
2025-08-11 11:50:20

let me give u the chatgpt one

James Scott (jamesscott@shield-legal.com)
2025-08-11 11:50:25

i have both anthropic and openai

Josh Josue (jjosue@shield-legal.com)
2025-08-11 11:50:30

lemme scroll up

James Scott (jamesscott@shield-legal.com)
2025-08-11 11:50:32

do u want both, or just use anthropic

🙏 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-08-11 11:50:58

sk-proj-PYxX1UDM76o5AaydkP65bORfdNbvLZZOkVJ1O61h64fELyb8ij2m-i57uHkEytiUUgq9Rbo1T3BlbkFJNxk9cTFliMJweU2Xtpwbwd23Vp3XBYq2isnjnoQEl2RK65Z6RXBKgOvEGym85-yqfM7ZezyHYA

🙏 Josh Josue
James Scott (jamesscott@shield-legal.com)
2025-08-11 11:51:17

this is your open ai one

Josh Josue (jjosue@shield-legal.com)
2025-08-11 11:51:46

thanks!