https://console.cloud.google.com/home/dashboard?project=ai-projects-406720
I've been looking into the Notebooks feature of Vertex AI - I created a simple Python script that will run on a schedule
u created that in gcp? yea the notebook itself is going to need to be run cuz it grabs a file, but good work for today - you can go home, have some time back and we can pick back up tomorrow
yea, the scheduled script will terminate at 5PM PST just to test it
To join the video meeting, click this link: https://meet.google.com/syy-crhd-xug
To join by phone instead, dial (US)
More phone numbers: https://tel.meet/syy-crhd-xug?pin=3752056146248
nicholas mcfadden is dustin's direct manager, if you feel anything else or more things coming from him please let him know
I'm trying to run a script within the machine-learning instance, but it doesnt seem to have any Environments to show me
So I'm looking into editing the instance to have one
Hi James, are you ok with me giving status reports?
alright - so I went down a rabbit hole trying to set up an Environment (via Google Compose) to see if it would appear in the script's job scheduler. However it did not.
Now I'm trying the Vertex AI Executor by creating a scheduled run at 2PM today. I routed the output on BigQuery to a different table_id so it shouldn't mess up the entries currently there.
quick question was there any other information about the dustin incident?
Are there any specifics - the others were present in the room
some good news! The Vertex cronjob executed at the expected time of 2PM!
Waiting for it to get done to check the results...
its a little late on my side! you can leave - good work today
oh ok gotcha - yea we can pick it back up tomorrow
i will be in meetings for a while today! i would say just keep going with what ur doing
SUCCESS! 🥳 The Workflow notebook executed on the scheduled time and successfully created a BigQuery table filled with entries
There is a slight security risk tho - I had to embed the contents of your JSON key into the script. But other than that, this seems like a viable way for us to schedule cronjobs
wooohoo!! awesome lol thats great news, now i think step 2 would be to update the script to pull automatically - right now that is for a set period and time but it needs to be updated and grabbed automatically
so by "pull automatically" did you mean code being pulled from a repo?
Or did you mean I move the other scripts to the BigQuery Workflow?
we can talk about it tomorrow but basically the code fetches a json that has the jsons of all the data, which then needs to be downloaded and uploaded to the BigQuery table, we need to automatically fetch the latest big data pull and do that json download
oh ok gotcha - yea we can pick it up again tomorrow!
Awesome I would say just continue with what we were working on yesterday - updating the data or fetching the data from the FDA API in the code
Yesterday I ran the workflow to download 3 json files and that ran fine. Then today I tested the workflow to download as many as there are based on the csv rows and it failed (stopped after the 280th download)
I think I really will have to create a K8s environment to run all of these scripts.
I'm going to ask Joe for permissions on installing Docker Desktop so I can get started heading in that direction
yea i mean i think i have the data up until 2023, so it should be just for the year of 2024
i dont think there should be 280+ files for a single year because i think it was like 800 something
Hi James, what's this column titled "Unnamed" represent?
This is just an index column - sometimes when saving and loading the dataset in python this happens. it needs to be dropped, or when saving u have to set index = False
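(for reference, a quick sketch of both fixes - assuming the stray column is named "Unnamed: 0"):
import pandas as pd

df = pd.read_csv("dataset.csv")

# option 1: drop the stray index column if it already made it into the file
df = df.drop(columns=["Unnamed: 0"], errors="ignore")

# option 2: avoid creating it in the first place by not writing the index
df.to_csv("dataset.csv", index=False)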
so great news - i now have a script that creates a CSV with all those json links!
just curious, in the future, do we wanna keep making csv files OR maybe have it on a DB to be referenced by the other scripts
well, if you can directly upload it or the script to big query, thats the goal, but the code i found takes a csv then uploads, if u can do direct upload of the data that is perfect
it can be told to run for any array list of years (e.g. [2014, 2023, 2025])
more so, from the presentations i am doing and the feedback, we want to include all years' worth of data in a table
yep! That's my next step - uploading the csv to the cloud
yes if u can do it without a csv then perfect
oh ok cool! yea i'll go ahead and store this scraped data on whatever Google's version of DynamoDB is
i wanna draw a diagram so we can discuss it at some point, but the overview is, the notebook scripts you made will be triggered by a python script cronjob living in Google's version of an EC2
that way, we benefit from Google's UI for manual override and we keep it to how you're comfortable seeing those notebooks still
i dont think we need to store that json anywhere, it can live as a variable in our code every time we run it, if thats what u doing? cuz from there you go to downloading the data, which the code works for. we just need to now get all the data into the table, which we can look into not having in a csv - so augment the code to take that data, store it in a dataframe, and then push to bigquery
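(a minimal sketch of that dataframe-to-BigQuery push without a csv - the destination table name here is just a placeholder):
from google.cloud import bigquery

client = bigquery.Client(project="ai-projects-406720")

# df is the dataframe built from the downloaded json data
job = client.load_table_from_dataframe(df, "ai-projects-406720.drug_model.dev_adverse_events")
job.result()  # wait for the load job to finish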
Oh I’m not storing a json, I’m storing the csv of links onto a DB which could then be referenced by the bigquery workflow script
oh ok take ur time no rush i stepped away
also its the holiday weekend leave around 2 today i see no reason to stay longer
Update: My script now automatically uploads the resulting csv file onto Google Storage! woot!
Just the storage or big query? Good job! For all years right ?
to big query
your script does the big query stuff so I have to chain it onto the automation
I was going to get started with Google Run Functions but saw Ryan's message about Fivetran
Yea make sure it's updated into big query! And is it automated for future years as well? Like not just the current years, so if we run next quarter it automatically downloads
The architecture i had in mind was, this script runs on a Google Function then kicks off your script for the BigQuery part
Gotcha ! There is some task after this but this is good I need to redo the models in the full data
Yep, it's automated to run from 2004 to the current year
Awesome let’s go
as it stands, the FDA only goes up to 2024 Q3
is it in big query yet i tried querying it still shows 2023 latest year
i could manually run your script to update the bigquery entries
No it should be automated as well, I’m expecting it to run the script automatically end to end
yes absolutely - eventually that's the goal
so my next step is to find where I can run this scraping script on Google Cloud
Why can't the scraping script be run in the workflow!
it's not just a single file script (for code reuse)
also more flexible since the config files can easily be replaced (it contains credentials, bucket name, storage path, etc)
Awesome ok that’s fine with me! Did u have a good weekend
It was pretty chill - got back to working on some tinkering
question is the data for all years in there yet? i am going to adjust some of my models
Still on the road but I’ll check soon as I’m in
so i integrated your script into mine yesterday and ran it locally on my machine - but for some reason my machine rebooted overnight 😕
Update on the automation: Google Function seems to have an unresolved feature ticket (dating back to 2019) concerning the ChromeWebDriver (which is needed for the scraping)
So I'm currently looking for workarounds for this issue
question: does this notebook job run end to end cron-josh-adverse-events
No - that one starts at the step of the pipeline to download all the scraped json links
the one that runs end to end (scraping + downloading json + bigquery table creation) is on the Docker image that i'm deploying onto GCP
which is currently having the ChromeWebDriver issue
ah gotcha awesome ok! ima go ahead and delete that one if its not needed.
ok so i've verified that BOTH 2017 and 2024 have a column for "case_date" in their data frame
im gonna continue to debug the script as a whole
ok so I traced something odd on the filtering line
df_csv = df_csv[df_csv['year'].isin(['2017', '2024'])]
Before this line, df_csv has 1k entries, after the filter it's 0, which is odd bcuz I verified that the csv on the Storage bucket does in fact have 2017 and 2024 (among others) on the "year" column
Maybe take that out and run since we doing them all anyway it could be an error in format
Ok i've started the notebook and it's going 🤞hopefully it updates the bigquery table
Meanwhile, I'll proceed with this ChromeWebDriver issue
Currently downloading jsons for 2013s so not yet
So I’m trying to save the company money in deploying to Google’s version of AWS Lambda
But it seems I’m going to be forced to deploy onto GCE due to Function’s lack of support for ChromeWebdriver
The docker image for scraping runs fine on Docker Desktop though
the workbench notebook went idle at 456/1564 (2015 Q2 dataset)
So i pivoted - I modified my code to append 2024 datasets onto BigQuery
does it go idle from just the jupyter notebook instance? like if u run the code there? i know these can run for hours in that but confused as to why its idling in docker
not the docker, the jupyter workbench script that i ran earlier - it just stopped outta nowhere
But im working on a solution to prioritize getting you an updated big query table
Ok so I've succeeded in appending 2024 Q1 to dev-adverse_events_copy
WHEW! that took 30 minutes to run
Loaded 37821879 rows into ai-projects-406720.drug_model.dev-adverse_events_copy.
Oh also, heads up, on Monday 27th, I have to be at DMV in the morning
i think i got through the error, i created a chunking script that chunks the data by 100 and then appends, its running now on like 300 so we will see what happens - this way it removes the idle thing i think
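(roughly what that chunked append looks like with pandas_gbq - the chunk size of 100 and the dev copy table come from the messages above, the rest is a sketch):
import pandas_gbq

chunk_size = 100
for start in range(0, len(df), chunk_size):
    chunk = df.iloc[start:start + chunk_size]
    pandas_gbq.to_gbq(
        chunk,
        'drug_model.dev-adverse_events_copy',
        project_id='ai-projects-406720',
        if_exists='append',  # each chunk is appended, so an idle timeout loses at most one chunk
    )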
Nicee
My script finished appending 2024Q1 to the bigquery dev table earlier
Working on 2024Q2 rn
i am rolling up on 600 out of 1500 so when this gets done it should be appended to big query, the full 2004 to 2024, so you would just need to productionalize this script to append future data beyond 2024 to that bigquery
Ok so the script on Jupyter notebook is the one containing your chunk changes?
Tomorrow I’m going to look into selenium alternatives (like beautiful soup) that would be compatible with Google Function Run
yes its called untitled now - what is selenium going to be used for
That was the library that scraped our json links
But my deployments to Google Run havent been successful due to Google Run not being compatible with this method (they have yet to implement the feature)
with this new dataset or script being done, would the workflows work now? it should just need to append new data to the table, which shouldn't take a lot of memory or time - at most it would prob only be 20-30 new jsons
Sorry that was a bit difficult for me to understand. The workflow is only a fraction of the pipeline (at that step, the json links have presumably been scraped)
But with our new appending technique, it shouldnt take as long as downloading all json starting from 2004’s datasets
G’morning! So I wanted to present the new gameplan of the cronjob
Just wanted to make sure I had it aligned with our endgoal
Yes I am almost complete on the code, we can hop on a call in a couple minutes
i am having such bad issues with my network today
great news! I came up with a fix for the script (it was easier to debug locally 😅)
so there was a discrepancy between how pandas dataframes handle the datatypes and pyarrow
the fix was to explicitly change all df fields to type str based on the schema used for bigquery
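(the cast is basically this - assuming the schema is the same list-of-dicts format passed to pandas_gbq as table_schema):
# force every column to string so pandas/pyarrow dtypes can't disagree with the BigQuery schema
for field in schema:  # e.g. [{'name': 'brand_name', 'type': 'STRING'}, ...]
    col = field['name']
    if col in df.columns:
        df[col] = df[col].astype(str)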
Ahh I had that before I shoulda kept it 😡 it’s updated now ?
so this should be able to be recoded into the prod script in the workflow from ur flow chart suggestion and run in the bigquery workflows to just append
Yeah im going to go ahead and incorporate your updated script into my project
Also, I'm still trying to debug Docker image deployments into Google Run
import pandas_gbq

project_id = 'ai-projects-406720'
pandas_gbq.to_gbq(
    df,
    'drug_model.adverse_events_prod',
    project_id=project_id,
    if_exists='append',  # Change from 'replace' to 'append'
    table_schema=schema
)
this is the code to append the data instead of replace
so thats how we will use this going forward to update the table with new data
Just finished integrating your new chunk script into the scraper codebase
I'll go back to debugging Docker image deployments on Google Run
hey you might want to double check this when you get in, so I looked at your table - the one you said you got to work - there are only 518k rows, my original dataset for the years of 2017+ has 37 million
the csv file in the notebook has 67 million lines
is it because the adverse_events_database_prod notebook stopped at chunk 14 of 16?
the API seems promising - im gonna go ahead and work on refactoring the work to use this new method instead of web scraping
Another concern of mine was security - if i structure this code to run on workflow notebooks, the gcloud credentials will most likely be embedded within the code
Can we make environment variables ? From our data there is no package to the outside or external resources it’s all within gcp right so would that matter
ok yea this should be fine, was just checkin
so i've looked around bigquery's GUI and couldnt find a place to enter env vars
how do ppl usually do that with Jupyter notebooks? I've always just loaded these values from a .env in python
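(this is roughly what I mean by loading from a .env, using python-dotenv - the variable names are just examples):
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from a local .env file into the environment
bucket_name = os.environ["BUCKET_NAME"]
table_prefix = os.environ.get("TABLE_PREFIX", "dev")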
but dont worry about it if u cant apply it
g'morning - I'm a little confused, should our code be running on BigQuery workflow or GCF (Google Cloud Function)?
also, I'm currently running tests on the updated workflow version
it will write any error responses with the corresponding attempted url so that I can retrieve any missing data after a cronjob run
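(sketch of that error logging - the url list, the ingestion call and the error table name are placeholders):
import pandas as pd
import requests

failed = []
for url in json_urls:  # placeholder: the scraped FDA json links
    try:
        resp = requests.get(url, timeout=60)
        resp.raise_for_status()
        handle_payload(resp.json())  # placeholder for the normal ingestion path
    except Exception as exc:
        # keep the attempted url + error so a later run can retry just these
        failed.append({"url": url, "error": str(exc)})

if failed:
    pd.DataFrame(failed).to_gbq(
        "drug_model.dev_workflow_errors",
        project_id="ai-projects-406720",
        if_exists="append",
    )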
so we are proceeding with bigquery workflow?
I just realized something - is there a reason the case_number field is not a primary key? I thought it could help when running scripts for missing data and preventing duplicates
Also, is the "file_name" field still relevant since I no longer have that information when I changed the ingestion method to use FDA's api
File_name is not relevant at all! And case_number can have multiple entries so there is no primary key in the dataset
ohh gotcha, so what do you think is the best method for ensuring unique entries?
like let's say, i found out the cronjob got a 400 response and I have to rerun a job for March 2004
With the data completed for 2004-2024 why would we need to do that? And this is the raw data approach right now. We are just ingesting the data as it comes. Feature engineering later when I get to the models is when unique entries are created and feature engineering is done.
i'm planning for ingestion of future data and failsafes we'll need
how bout if I made a primary key of case_number + substance_name.. would that be unique enough?
why is there a need for unique id, i am getting confused on that part, a cronjob requires a unique id?
so i would assume that if you have duplicates of entries, then the AI you're creating would have those entries having more weights or something
the unique id would help when the script has to retry getting any failed GET requests, it would ensure no duplicate entries go into our BigQuery table
ahah you're thinking like 10 steps ahead lol
there are no duplicate entries in my dataset
is that due to a "cleaning" process after the ingestion phase?
yes, like i said there is feature engineering done, this raw data is nothing like the model training
oh ok, i wasnt aware of those already existing
ok so currently, the code is still running on Workflow (which is a good sign) so I'm gonna make a release of this version
After it's done, I'll verify the big query rows match the csv entries count
and this updates the prod table correct? i currently have step 2 ready for workflow job as well
with all the data this has made step 2 have 130million rows
no, this is pointed at a dev-josh table since your prod table should be considered all good and I dont wanna alter that
i dropped file_name from the prod table, u should be able to append them in the future no worries
and i have step 2 notebook ready to be attached to the workflow like the next one
here's an example of the scenario that I would come across
when the script reached 2023-10-02 TO 2023-11-01, FDA will start telling me to try again later with response code 429
is too many requests - what does that mean, is that the connection hit or the data?
and also another scenario is 1 month contains way too much data that the script hits the skip limit
those are the 2 main scenarios where I would need to run the script to retry to get data at a later time
can you log into Monday and see if you see the board, i am getting the tasks together
you mind if i start creating cards and editing stuff?
i can add in the task and or we can go over it tomorrow
you should have access to the github now as well
Hi James! happy friday!
I've finished formalizing the adverse events workflow. Some highlights for v0.6.0:
• conditional for start date used when downloading data (by default it's the current date, otherwise, take the latest case_date from the bigquery table)
• Some unit tests and integration tests added to the project
• Infrastructure naming (env vars) are organized and accessible
I could hop on a call to discuss the next steps I can tackle
I've scheduled a cronjob for Monday morning just to do a full simulation of all the parts being automated
awesome!!! sounds good lets have a call on monday enjoy your weekend
Good morning! The scheduled workflow on Big Query ran successfully this morning v0.6.0
it gathered data up to 09/09/2024 and stored it into an integration test table
What are the next steps? the ICD10 scripts?
so that created the adverse events table, the next would be to incorporate the icd10 logic
if i close all the other tabs open, will it affect your console?
yes i want to put this in vscode jupyter notebook
Hi James - when you get a chance, could i plz have the link for step 3?
also I've created a solution for unifying the notebooks with shared env vars
Ok I can give u the first couple of parts for steps 1-2, but I also wanted to talk about adding in the logs for each step. I can show u what I mean when I get back, im at an appointment
it comes in the form of having a json downloaded from a bucket
the json will contain:
• base names to Big Query tables
• Schemas
That’s fine ! Nice workaround
the key is only embedded in the notebook code
The contents of the json is for infrastructure purposes, (like all relevant tables for integration testing vs the prod tables all being used across several notebook scripts)
Each notebook code still needs the api secrets embedded in them bcuz it allows the script to auth Google cloud libraries
My laptop lost internet connection at the office
i can hop on super quick but i wont stay on long
I've completed the logger util
But as I finished it, I foresee some technical debt in updating N number of notebooks using these utility modules. So I'm going to pause refactoring Step3 and implement a solution for each notebook to obtain these util modules
Yes the logging for each step would be different, is that what you're saying
not quite, I'm saying all of the steps so far (adverse and icd10) use GoogleCloudUtil that I wrote, and eventually can integrate the LoggerUtil i just made. And so will future steps
and right now, I have to copy paste these Utils over and over into each notebook
Does google cloud utility provide logging similar to logging utility or am i confused ahha
The google cloud utility is what allows the scripts to auth with Google's API, upload/download stuff to buckets and bigquery
it also handles ensuring that if a table doesnt exist, it will be created for upsert operations
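(that check is basically this, sketched with the standard BigQuery client - the helper name is made up):
from google.cloud import bigquery
from google.cloud.exceptions import NotFound

def ensure_table(client: bigquery.Client, table_id: str, schema: list) -> None:
    # table_id is the full "project.dataset.table" id, schema is a list of bigquery.SchemaField
    try:
        client.get_table(table_id)
    except NotFound:
        client.create_table(bigquery.Table(table_id, schema=schema))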
the number of notebooks is growing and I have to keep up with the changes
If Step 1 was ingestion of adverse_events and Step 2 was icd10 table creation how would you describe Step 3?
it would be drug label info or extraction
Also the solution I implemented for the modules is:
• bash script downloads all git repo releases of utility modules (GoogleCloudUtil, LoggerUtil, etc)
  ◦ Uploads them to Google Bucket
• Each notebook pulls down the modules and installs dependencies
This will ensure the versioning and unify the modules being reused by our notebook scripts at each step
you have a moment? - I can present the current state of the Workflow
lets do it tomorrow during our standup call
lets have our meeting closer to 1 today
just a heads up- i think my internet at the office is being weird
yes i have a couple of minutes for a quick call
pip install google-cloud-secret-manager
from google.cloud import secretmanager
def access_secret(project_id: str, secret_id: str, version_id: str = "latest") -> str:
    """
    Accesses the specified secret version in Google Cloud Secret Manager.

    Args:
        project_id (str): GCP project ID.
        secret_id (str): Name of the secret.
        version_id (str, optional): Secret version (default: "latest").

    Returns:
        str: The secret value.
    """
    # Create the Secret Manager client
    client = secretmanager.SecretManagerServiceClient()

    # Build the resource name
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version_id}"

    # Access the secret version
    response = client.access_secret_version(name=name)

    # Return the secret payload as a string
    return response.payload.data.decode("UTF-8")

project_id = "your-gcp-project-id"
secret_id = "your-secret-name"
secret_value = access_secret(project_id, secret_id)
print("Secret Value:", secret_value)
can you test to see if u need auth to run this inside that environment?
I've tested that method but yea i can double check
yea confirmed, that wouldn't work
this line requires the google cloud credentials from the JSON we spoke about:
client = secretmanager.SecretManagerServiceClient()
also, if it did work, then anybody in the world could get anybody else's key(s) just by knowing their key id and project id
I noticed that the drug info doesnt have a schema - is that right?
yes it should, also i have corrected step 5 so all the way to step 9 is done
once u get done productionalizing the code, we have a task that we need to do - so after steps 1-9 or 10 we need to switch and do something cooler and add a piece of the pipeline like i had to do previously. the new one is to extract drug information from studies around the world
(I'm currently running integration tests on step3 - if all is good, i'll move to step 4)
That was the result of it lol, we met with some guys who look for these drugs and we have to take that and incorporate what they do in that process
Hi james, on step 4 there's this if statement that i just wanna make sure is still relevant
scenario A: if file exists in bucket, skip it
scenario B: it doesnt matter if file is already in bucket, grab its pdf and overwrite existing file in bucket
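(for reference, the bucket check itself is simple - bucket and file paths are placeholders):
from google.cloud import storage

client = storage.Client()
blob = client.bucket("drug-data-bucket").blob("labels/some_drug.pdf")

# scenario A: only upload when the file isn't already in the bucket
if not blob.exists():
    blob.upload_from_filename("some_drug.pdf")

# scenario B: always grab the pdf and overwrite whatever is there
blob.upload_from_filename("some_drug.pdf")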
G'morning James! So you know how the logs are being uploaded onto Big Query? Did you want all the steps to share the same table or shall I separate them by step?
So I'm thinkin of separating them because if a step failed, we can just look directly for the table confined for that step
the tables are prepended with the infra property (e.g. "integ_test", "dev", "prod") so they'd be grouped up any way as big query shows them in alphabetical order
Yes that’s what I was going for as well separate them is fine with me
Hi James! got a question about Step5 (plz see image)
DrugSummary.warnings_precautions seems to start with a string "This is a list..." and then is overwritten by extract_category() without being used. It seems the initial string value didnt matter at all amirite?
that is not a string value, that is a definition of the column of a pydantic model
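(hypothetical sketch of what that looks like - the field and text here are made up, not the real model):
from pydantic import BaseModel, Field

class DrugSummary(BaseModel):
    # the "This is a list..." text defines/describes the column for the model,
    # it isn't a runtime value - extract_category() fills in the actual value later
    warnings_precautions: str = Field(
        description="This is a list of warnings and precautions for the drug."
    )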
is there anything you need to show or look at for this meeting, if not we can cancel and work
Nothing new to demo yet - but I’m close to finishing step 5
ok awesome! we can skip this call how was the weekend
It was pretty fun! Friends and i are hooked on that Marvel Rivals game 👾 😆
How was yours?
man i play that everyday!! whats your rank?
Noice!! I’m a lowly silver 1 😅 but I’m tryna catch up
I play on ps
Support rocket lol I’m plat 3
Oh dayum!! You’re prolly a great healer haha
hahaha i be doing ok my team mates just suck i would be higher haha
for step 5 optimization, is it ok if I retrieved all entries in the drug summaries big query table (if any) and then filter out the brand_names with existing entries? This could cut down on the bedrock_api calls
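(the filter I have in mind, sketched - the table and column names are assumptions):
from google.cloud import bigquery

client = bigquery.Client(project="ai-projects-406720")
existing = {
    row.brand_name
    for row in client.query("SELECT DISTINCT brand_name FROM drug_model.drug_summaries_prod")
}
# only call bedrock for brand_names that don't have a summary yet
to_process = [b for b in brand_names if b not in existing]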
also let me know when you want to play it is crossplay
Heya James! I cant seem to see the code for step 7. I'm looking at production.ipynb
I’m currently on retained and removed labels step 6
Gotcha ok we might have to do another step in there from my talks yesterday
gotcha - so a step in step 6 OR after step 6?
I'm also going to add a filtering phase so that step 6 doesnt run on brand_names that already have entries on either the retained/removed tables
By adding these filtering steps, I found that it significantly cuts down on the entire workflow's runtime (from 2 hrs to 25 min)
Basically it's kinda having to redo the back end of the process a little bit - we need to target drugs that don't have cases in them, then do a ranking model for those drugs, and the ones that do we can also just show as a separate function / feature I guess. I gotta talk to Ryan about it
i see... well i just released step 6 - what should I tackle on next?
i'm lookin at step 8 - is this where the aforementioned alterations will go?
G'morning! So I'm currently testing Step 8 - would you be free to do a call on those changes you mentioned?
After this aggregation part, you want me to filter out rows that have number_of_cases greater than 0?
# Aggregation
result = df_adverse_financial[df_adverse_financial['earnings'].notnull()].groupby(
['manufacturer_name', 'brand_name', 'activesubstancename', 'case_year']
).agg(
number_of_cases=('case_number', 'nunique'),
Hmmm I would hold off on number 8 actually, this is where we need to have a call. the ranking model doesn't need to be completed, but we need to get the number of open cases from case text and only filter the ones with 0 open cases, because I need to use this list to extract clinical trial data from and make sense of it all
hi James! So I'm currently debugging why the chunk method of upserting to a table fails for Step 8's result: it's due to new fields that come up every so often
I've tried to download the schema of your table adverse_events_icd_metrics but it seems that's missing the fields too. Do you know what the expected columns are supposed to be?
if your trying to create drug_model.adverse_events_icd_metrics_retained_labels_prod correct? remember we removed the financial data column, so with that, u just have to delete the output table of step 8 so it creates a new one with the exact columns
does the production.ipynb reflect that change? because i still see the df_adverse_financial on there
the production.ipynb reflects it, but im not sure if ur prod code in workflow does for the previous step
if the table exist already, it needs to have the same column names
so us removing it adjusts that, so u have to delete the table in big query so it creates a new one with the new column changes
the table existing wont be an issue - i delete it after each test
query_metrics = """
SELECT a.*
FROM drug_model.adverse_events_icd_prod a
INNER JOIN drug_model.adverse_events_retained_labels_prod b
ON a.brand_name = b.brand_name AND a.reactionmeddrapt = b.reactionmeddrapt
"""
this is the only place u are fetching data from in step 8
and this table should have those removed columns
Ok I’ll double check the table being used in my query
It’s just that i remember we deleted some code for step 8 on friday, but cant remember exactly what it was
that was on step 6 i believe, it was the financial data i was saying
I was able to compile a complete schema and got a solution to work. Gonna formalize it and then run it thru another test
ryan is joining this ai sprint meeting
take the time and show him some of the stuff u been working on,
On the case_text script, the first frame does a query for df_ranking, but it doesnt seem to be referenced at all in the 2nd frame where the goal is to get brand_name, case_url_link, case_text
Is the first frame relevant at all for step 9?
can you explain what you mean by first frame
i thought that's what they're called on a notebook - im referring to the first giant text box
yes, from what i understand - it happens before step 9
because it's supposed to help me weed out brand_names with cases - right?
plz confirm if i understood the overview correctly
Only update I have is using the exact table names but other than that the process looks good to me
Ok gotcha, I’ll update the table names!
(Step 9 is still undergoing integration tests)
once this is done i can run my script and work on the case details, i think if you want we can talk about the next part which is more modeling/llm focused if you wanna work on that, no rush for this part
Status Update: I've implemented a chunkified version for case_text (there's sooo many case urls lol) and I'm exceeding quota for Big Query. My solution is to increase the chunk size
oh really ok lol i didnt know there was a quota for that, i think we can increase it
How familiar are u with setting up vs code
what specifically are you trying to setup in vs code?
Instead of Jupyter notebooks I want to be able to have it in vs code and leverage it more
ctrl + shift + X opens the panel with extensions
there's jupyter notebook support made by microsoft
what's the scope of this setup? are we also taking into account setting up a virtualenv for the python project?
Really? I just want to use the vs code interface for the projects and future projects instead of Jupyter notebooks and leverage the copilot - the notebook environments are getting frustrating to me
i just searched, they also have github copilot as a vs code extension
No lol it doesn’t have to be right away if u have free time from the other task then we can switch but the key is also maybe to leverage the run times from gcp
yea we could def set that up - i'd be up to huddle on it tomorrow around 10AM PST?
i def have been running each step locally first and see how long they take (so as to not run up the gcp bill for compute time on just tests)
Yes u just want to make changes and push to git instead of Jupyter stuff, be more software engineer like and have a process for it
g'morning! did you still wanna do that huddle for vs code setup?
quick question is the ranking model done?
I had my machine running since 2:30 yesterday and I’m not sure how long these ranknet trainings take
But it’s reached epoch 3 and is still going this morning
What’s the expected value for those losses?
That is because of formatting issues in the data, like it looks like it's containing infs or nans
Well I’ll take a look at the dataset and dataloader
How long does the training usually take? (So i know when something’s wrong)
Hmmm it can take however long, but if u did it right there shouldn't be any nan on loss
I’m running a test on GCP to try and find where the nan is coming from
Hoping to find the source by the time i arrive at the office
turns out a bunch of records have <NA> or NaN
is it a valid solution to set them to 0 ? I ask bcuz making median_patient_age 0 seems odd
yes that is odd, i would run in the prod notebook because that gave the results
and see what the datatables are like in there
i am not sure whats wrong in the pipeline perspective
but you can decode from the working prod notebook
i ran it against your prod table and noticed that median_patient_age is not among the fields, and the only NaN values are in avg patient age and avg weight
So i'm going back to the previous step and make sure to cleanup the data being produced
Success!! i took a small sample of 1000 records and managed to produce a ranking table!
gonna test it on the entire icd retained label metrics table and have it run on GCP
Awesome !!! Nice nice nice this is great let me know when it’s done
so yesterday, Big Query inexplicably terminated the test i was running (it was on for 2 hr and 43 min) and that'll be a separate thing I have to look at
My computer's been running since yesterday. It's working on the training ranknet phase for a total of 172,019,504 DrugDataset entries
currently it's on Epoch 3...
Huh it deff didn't take that long because it's only ranking, and these models are only on records without a claim in case text correct
it got 19900 entries in total from the icd retained labels table, but after the filtering, got 17729
also i should note that I've also had to make a set - in order to get rid of brand name duplicates
So how is the model training on 172 million entries but the drugs in there are only 17729. Are you using the right table for training? This should be the metrics icd10 table. The most we are looking at is, ehh let's say even 20 years, so for 1 brand_name the most rows it can or should contain is 20. I don't think we have 175 million rows
it should've been 1:1 right? like if the icd10 had 1 mil entries, the model training entries would be also 1 mil right?
but yes, i am using the table produced by the previous step containing icd metrics retained labels
Yes it should be one to one - the amount of records ur training from the previous step should be the same. I honestly think I need to look at it because with the removal of records from case_text the training should be easier and faster. have u ran the training from the prod notebook and compared times
Yea man ! Do a compare and contrast of that
ok, i just confirmed that the diff occurs after DrugDataset is created
it seems to be due to the pairs being created - so i'll investigate that
i have a few questions, a huddle might be ideal if you have time
Ok I’m out at lunch if that’s ok
this nested loop will inevitably result in pairs[] being larger than the df coming in
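(roughly the shape of that loop - the column name is just an example - which is why pairs[] can grow to ~n² entries):
pairs = []
for i in range(len(df)):
    for j in range(len(df)):
        if i == j:
            continue
        # every row is compared against every other row,
        # so n input rows can produce up to n * (n - 1) pairs
        if df.iloc[i]["number_of_cases"] != df.iloc[j]["number_of_cases"]:
            pairs.append((i, j))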
*Thread Reply:* integ_test_adverse_events_ranking
My code finished after 30min, upserted 17,729 entries to integ_test_integ_test_adverse_events_ranking_logs. The number of entries matches the count for brand_names without cases
I’m still currently working on automating table replacements
hows your day going how was ur weekend
Heya! It was great- got some good practice sesh airbrush painting haha
How was yours?
Today’s good - made a fix last night and I’m pickin up where the Workflow job left off. Currently waiting on Step 8 to finish. Then i should have data (born from the full 37 mil adverse events) for Step 9 to rank
I did some light researching - seems it’s possible to connect to the GCP server while using vscode so that your code runs remotely
do you think we can focus on that for tomorrow to get it out the way
Sure! I’ll pivot to looking into that and see how to set it up with our GCP instance
yes, the other stuff should just be running now to complete correct, deff wanna start getting more software engineer like with it
im still trying to fix Step 9, all of a sudden running into a missing column issue
verifying if this is a bug in the code or dirty data
its probably one of the ones we deleted previously
implementing automated tests, abstracted modules and such
also, as im moving along, I'm integrating new modules to older steps (like the logger to step2)
I'm trying to add SSH keys to the machine-learning instance but I dont have permissions
Step 9 has completed - but the ranking's earliest year is 2001
I should note that im also waiting on Step 1 to ingest stuff - (I've explicitly told it to start from 1994)
we need to double check tomorrow the casetext query, some of these are coming back with cases
ehh i think its ok, i could give the list of links to cases but combing through them could be annoying - rather give them to the user to figure out
when you have a moment, im ready to show you how i've ssh'd into a GCP VM
When you get the chance, could u try to add an ssh key to machine-learning? I’ve already given myself an admin role and still not able to do so
I’m wondering if your account is able to add one
I’m currently eating 😅
I’ll msg u when i get back
let me look at this tomorrow i have some things i need to take care of, good work today though!
real quick, whats the results of the rank stored in, whats the table name?
not sure why i didnt get a slack notification this morning
we have a successful test that finished this morning - Step 1 to Step 9
i'm making some small fixes (like typos in the logs table names, etc)
Let’s go!! lol can u update the tables in Miro with the actual table names
oh yea for sure - i'll create a ticket so i dont forget
bcuz I succeeded in giving myself an admin role for the machine-learning instance
BUT for some reason, i'm just not allowed to make any changes to it (including adding SSH keys)
are you able to add ssh keys? if you're also not able to, I'm guessing this is due to some GCP configuration (iirc on AWS if you create an instance, there's a scenario where you wont be able to add SSH keys afterwards)
ssh-keygen -t rsa -f gcp-shield-legal -C dev-cronjob -b 2048
ssh-keygen -t rsa -f gcp-shield-legal-machinelearning -C shield-legal -b 2048
IdentityFile C:\Users\Jehoshua\.ssh\machine-learning
just wanna make sure i dont add a rando to this secret project haha
It’s the second one Ahha and yes that would be bad
were u able to like get in the environment, connect to github and do stuff
yea! i git cloned one of my repos into the vs-code-machine-learning and ran it just fine
i also tried using the git extension on vs code (never used it, i've always used sourcetree) and that worked out fine too
awesome let me try and do it too, ima download what i was working on latest and try to upload it, prob need to make a new branch. have u been following any type of structure or naming convention
I name the repos based on what your notebooks’ step titles are
I’ll send u a git invite for Step 10’s repo once i go into the office
gotcha ok, yes if u follow the miro board and have the updated table names thats a fine naming convention since it relates back to that
alrighty, i've added you to step 10's repo (named clinical_extraction)
like using ur shield account or personal?
ohh ok i thought that was, like, the "team" we're on
also, i should mention that i did have to install git and a few things on the VM, not sure if you'll run into that
Just wanted to confirm - i will be refactoring the entire step 10 script
also i am able to clone the repos and get in! ima start my coding from here and u can just pull my changes when done
i dont think this has python installed lol
oh i mean like when i am trying to code, there is no python kernel installed
hmmm... i was able to run code just fine on that VM also i usually create a virtual env
i guess we'll just have to sync up code via git
what you mean? if we are both in the environment, shouldnt the terminal install work for both of us
i believe the VM partitions us by user - you cant see my files and i cant see yours either
if the goal is to have use both see the same files, i could look into how the VM can create shared storage volume
That's true, I guess the requirements.txt needs to be pulled from somewhere idk. how would we solve the problem via git - well if we are working on GitHub we don't need the same files, but for python packages we need to be in sync for that cuz u are not aware of everything
the repo has requirements.txt - so we can update it as needed
and that requirements.txt is what will keep our dependencies in sync
i also put a get_modules.bash script in all of our projects to grab the common modules among them (like table names, google cloud util, etc)
Ok I guess I can keep the repo requirements.txt updated - is that the same throughout? Like if I'm working on step 10 versus step 1 would that be the same repo file
most of them use the same requirements.txt, step 9 was the exception - i created a separate requirements.txt for that project
if you pip installed the requirements.txt currently in the Step 10 repo, it should have everything we've been using in the previous steps
iirc step 9 was the one step that used torch, which took a while downloading and installing - didnt want the other steps to be bogged down by that when they dont need it
G’morning! Recently saw this open source software (docker deployable) that I could implement when we have time or when we do another pipeline
It basically solves our needs for:
• UI showing steps of pipeline
• Automated pulling from our GitHub repository (ensuring integrity of releases and streamlining deployment)
• Live feed of a gantt chart timeline for each step being done
• Handling of secrets
• Infrastructure as code (steps are saved as YAML)
nice, lets talk about it, so what are u working on now? i have a meeting with cam and he wants us to present something soon, so i wanted to get more data on these clinical trials
I'm currently refactoring the script used for step 10
no idea lol, have a meeting with him tomorrow about it, but more so the refactoring of the llm in regards to the classification correct
i ran an integration test on Friday, and the integ_test_adverse_events_ranking is updated
So i've been going thru Step 10's code, namely the QAResponseModel2 class
aside from breaking down the code, was there a feature you needed implemented/added to it?
can u link me to the file ur looking at
no that part is done and final, but there is another part i am currently coding to add
after this, are u good with BI dashboarding?
I’ve done Plotly before - is there a python library you’re leaning towards?
also, who's the target audience? if it's outside the company, i'm also capable of creating the backend with login and auth so users that are only allowed to see the data can see it
Bush's to check the rank algorithm but that's fine, we should start plotting this in the chart. it's going to be cam and other stakeholders - I don't think that auth is necessary for now, but let me show u how I envision it and we can start to put together a mood board
I like this kinda dashboard look and it’s very official
that link is referring to Looker studio, (but I also know Node, React and NextJS if that becomes relevant)
I know right lol if we were to get the dashboard looking like that my word !
after the demo and the looker UI stuff, would I have the chance to work on improving our pipeline with Kestra?
one of my main concerns at the moment is that our deployment is not automated
Yes to the first one and isn’t it in the scheduled to run? That’s fine
it is scheduled to run
by deployment, I'm referring to the step of putting the code onto the BigQuery Workflow - me - I am the deployment rn lol
i literally have to copy and paste stuff into the notebooks (which isnt a standard software releasing practice lol)
oh ahha i thought its already in there i am confused but its fine but yes we can work on a more standard approach but this is fine for now
that said i think im almost done refactoring the LLM part of step 10 - when im done what are my next steps?
or rather, what features does your branch have? (that I will be refactoring when you're done)
refactoring was supposed to be for the recommendation llm
it doesnt seem like u updated this code
but that llm would work for refactoring too
if u just change the prompt and ingestion around
I dont see code pertaining to recommendations
Or did you mean I am writing code for recommendations
let me pull and push the example llm that i did before for prod
git clone https://github.com/josh-SL/clinical_extraction.git
so am i understanding this correctly - step 10 is basically:
so just like the previous steps, I've had to abstract some of the logic - to make reusable code - in this case the bedrock calls
it also involved abstracting the prompts as params and moved them to another file (separate from the code)
in the llm_recommendation notebook's code, the first line is
import pandas as pd

# Read the extracted case data
df = pd.read_csv('extracted_case_data.csv')
are these csv files going to be stored on Google Bucket for our production level pipeline?
OR is that data supposed to be coming from the case_texts table?
Well I think that is the process that's changed right, so the start is going to come from clinical_trial_prod which is the information from the clinical data
let me know when you have a couple minutes
able to run the llm and code from end to end in vs code
Oh awesome! What branch should i pull down?
alrighty, i've finished refactoring the llm_recommendation code that you put in my branch yesterday
Let’s take a look at it tomorrow and see the results
i'll try to get test data to show you tomorrow, but I thought the data ingestion part of step10 was still undergoing changes in your branch?
Scroll up! The ingestion comes from the clinical_trial_prod
oh my bad i forgot, ok i'll run it with that data
good morning! the test results data can be found in integ_test_drug_ranking_llm_recommendation
I might have to do some tweaking since some of the rows and columns are nan
the recommendation is nan? yea prob some prompt engineering, but i actually got a gameplan on how this new dashboard is going to be awesome, i need to invite u to a new miro board so we can get the dashboard together - this is going to be epic and prob get us some raises ahha
alrighty! i fixed it - the data on the aforementioned table looks more correct now
i'll get started on refactoring the ingestion part of the script
g'morning! Ok so i've finished integrating the ingestion part into my code and ran step 10 again
So the tables integ_test_clinical_abstracts integ_test_drug_ranking_llm_recommendation have been created
Heya! So I found the reason why the abstracts were missing columns. It’s related to how I’m processing the bedrock response - working on a fix for it
# this is what you have to fix
df_upload = pd.merge(
    df_original,
    ranked_df[['case_year', 'brand_name', 'manufacturer_name', 'activesubstancename', 'rank']],
    how='left',
    on=['case_year', 'brand_name', 'manufacturer_name', 'activesubstancename']
)
I’ve made the fix and now currently running Step 9 to verify the data on Big Query
Also running Step 10 all over again from last night - the api decided to cut connection mid run
So i added an upload phase after the ingestion to salvage the work done by a previous run
oh yikes ok sounds good, let me know when step 9 is done
oof, my Step 9 running on Big Query was terminated as it ran out of RAM
going to retry on my work laptop
So I'm having trouble getting through Step 9 after i added that merge line
the point of failure is at the predict and rank phase
(this attempt is on our online GCP instance)
or make sure our cpu and stuff is good
its a little late on my end, i am at a lacrosse game
i can walk through it text wise though
so on this spot, your notebook was doing it in place, but pandas told me not to do that, hence i changed it
got past those lines that replaced the nan and fillna()
but now, the code does the train_ranknet() part and it just dies
There are only 10 epochs, but they would die before finishing - any guesses as to why that is?
google says i probably ran out of ram (going to check the ram on our instance)
so the code shouldnt have been adjusted from what it was previously, other than that one line of code after
i had to adjust it due to an error crash with pandas:
FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.
what line are you getting the error from
the lines wont translate 1:1 but this is the code:
i've had to put group['score'] = ...... on the left side
gotcha ya you can just ask chatgpt how to solve that
remove "inplace" param and set the value to group['score']
so i wouldnt deviate from that unless you know the data structure which you are trying to achieve for that step
i'll have to look into increasing the ram of the instance
just a heads up, i rebooted the vs-code-machine-learning instance
please update your ssh config with the new ip: 35.185.19.12
G'morning - so I think I've gotten the non-scaled values except I forgot to delete the existing table. Gonna rerun Step 9 with a fresh new table
been fixing bugs - currently waiting on a new step 9 run
i forgot to mention that even after upgrading the RAM, the data being merged for step 9 was wayyy too big - like a snake trying to swallow an elephant
so i had to chunkify that process
hmmm thats impossible, remember, if u look at the other data, this step should only be like 3k rows, 4k rows max
it shouldn't be doing anything that requires chunking
i might need your help verifying if this is a data issue then
typically, when I make software, I create automated tests - but in this scenario, it's difficult to make those automated assertions of expected values
can you share with me the code or let me see through github?
https://github.com/josh-SL/ranking_model_yearly.git
The main function in question is YearlyRanking.ranking_procedures()
I have an integration test in this project specifically to verify that the output data is not scaled, but I was hoping we could do a call so I could verify with you the following:
• sample input data is valid
• test assertions are valid
i really dont think we need to make a whole new repo for every single step do we? seems overkill, should just cycle through the branches
and i dont have access to see your git
the reason why each step is its own repo is because it makes each of them testable and it makes the entire process flexible
if down the line we adjust ingestion of Step X, we wont have to touch the other steps
goal is to be modularized and not tightly coupled
can you show me at which code/process your test is using a repo
to run just the integration tests use:
pytest tests\integrations_tests
i am traveling right now and my service is bad, can you hear me
also did you change anything in gcp? i cant even start the notebook now
i had to reboot that instance, so might have to update the ip address
i didnt change anything for machine-learning
but what about the overall account quotas? associated with any accounts in gcp?
maybe its cuz of the additional one we created
it might be, the new one has a lot more CPUs
i could shut off the beefier instance - shall i do that?
i forget what i set it to, we could just increase it again to handle both instances
little bit hard to follow ur code though, at which step are u ingesting the data from before
ok i'll hold off shutting the beefy instance down
the overview of the algorithm are in the driver.py
lets talk about this tomorrow, this is kinda overkill. if you could, its ok to ask more questions about the process. the ingestion table is or should be the ranking metrics, which is 3-4k records, and what you do in this step is just adding a new column through scaling of ranking. all your doing is just joining the new ranking table to the original 3-4k metric table, to combine the original table used for ranking with the ranking one
ok, yes plz i would very much like to huddle about it tomorrow
can you give me the github to the previous step before this one?
https://github.com/josh-SL/retained_label_metrics.git
yes this repo stuff deff needs to be changed, i should have access to this stuff and/or it should be in our environment and not detached or multiple one-off instances of repos
we can talk more about it tomorrow
my chunkified method successfully generated integ_test_adverse_events_ranking, and it seems that the reason why it's so large is bcuz of duplicates
this is still incorrect, great that the values are no longer scaled but look at that table, its a simple comparison. this is showing 161 million records.. mine shows 3700
*Thread Reply:* yea it definitely is wayyy too big - I aim to find out why the data gets bigger than the input. The tests seem to point to the ranknet function
when you hop on, can you reduce the cpu in the vs code environment
*Thread Reply:* alrighty, i have set the vs-code-machine-learning instance to 2vCPUs and 8GB memory
did you mean the vs-code-machine-learning instance?
also i have a better test now, i replaced everything in my project with the code from your notebook
this is a simple test to verify that the length of the input data is the same as the output
the sample data came from integ-test-icd-metrics-retained (this allows me for faster turnarounds when testing)
the input is 1k, but the output is 87k, so it fails the assertion
you have a second for a talk and next steps
I've acquired the diffs between our tables
My table 2 is short by 36 million - going to start debugging from there
I think that is a great starting point!
while im waiting for Step2's results, I had a breakthrough in Step 9's test
The input and output df lengths match if I eliminated duplicates from the input and output dataframes
Duplicate is established by these columns:
['manufacturer_name', 'brand_name', 'activesubstancename', 'case_year', 'number_of_cases', 'number_of_patients', 'average_patient_age', 'average_patient_weight_kg']
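(the dedup itself is just a drop_duplicates over those columns - dataframe names are placeholders):
dedup_cols = ['manufacturer_name', 'brand_name', 'activesubstancename', 'case_year',
              'number_of_cases', 'number_of_patients', 'average_patient_age',
              'average_patient_weight_kg']
df_input = df_input.drop_duplicates(subset=dedup_cols)
df_output = df_output.drop_duplicates(subset=dedup_cols)
assert len(df_input) == len(df_output)  # lengths only match once duplicates are gone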
saw you said breakthrough it made me laugh lol like you solved world hunger or something
i got excited as it was really confusing to me why that was happening haha
but knowing that, as i'm debugging these steps, i'll keep in mind if duplicates might be an issue
so the length of step 9 match from both instances?
oh no that's not it
Im saying i now have a unit test to help me determine if my code is going to mess up the merge
this would just help me in this debugging process
Good morning! So I ran 3 tests to debug step2, and it shows that my code and the original notebook's code are consistent in its results
I'm not sure why your adverse_events_icd_prod has 116 million entries, but the tests show that the issue is not with the data source nor the code.
whats the comparison with what i have?
your icd prod table has 116 million while those tests above resulted in around 80 million
Also, I've confirmed (with an automated test) that Step9 notebook code does in fact result in varying lengths (input data len VS output data len)
But after I eliminated duplicates (using dataframe operations) the lengths started to match
Going by this, I ran a query on my step 9 table, it went from 163 million down to 19 million
you should run step 9 in the notebook, it shouldn't take long, and log the metrics or the lengths and see whats happening, because this will show u whether how u switched the code was the issue or the datatables being used
I tried several times - in the jupyterlab notebook itself
literally would just crash the page and I'd have to start over
I noticed the row count anomaly start between Step 6 and Step 8. So I created a DISTINCT only table from step 6 (went from 3.6 million down to 1.9 million)
I will test it as input for step 8
But I can tell you with full confidence that given a csv file as input to the original Step 9 notebook code, the input length will vary from the output length
which is why I'm pursuing the possibility of duplicates in previous steps
i updated the instance for higher ram and the code runs
i am adjusting the code and making sure this step is concrete
2025-04-01 13:45:14,813 - INFO - Starting RankNet processing...
2025-04-01 13:45:16,970 - INFO - Loaded DataFrame from BigQuery with 65524 rows.
2025-04-01 13:45:18,727 - INFO - Preprocessing data...
2025-04-01 13:45:18,821 - INFO - Preprocessed DataFrame has 65524 rows.
2025-04-01 13:45:19,743 - INFO - Created 112630 ranking pairs.
2025-04-01 13:45:19,751 - INFO - Training RankNet model...
2025-04-01 13:45:21,649 - INFO - Epoch 1, Loss: 0.6934238484637304
2025-04-01 13:45:23,479 - INFO - Epoch 2, Loss: 0.6931662190366875
2025-04-01 13:45:25,313 - INFO - Epoch 3, Loss: 0.6931543030522086
2025-04-01 13:45:27,157 - INFO - Epoch 4, Loss: 0.6931536412374539
2025-04-01 13:45:29,006 - INFO - Epoch 5, Loss: 0.6931815118274905
2025-04-01 13:45:29,111 - INFO - Ranked DataFrame has 65524 rows.
2025-04-01 13:45:29,140 - INFO - Merged DataFrame has 65524 rows.
65524 out of 65524 rows loaded.
100%|██████████| 1/1 [00:00<00:00, 1610.10it/s]
2025-04-01 13:45:34,729 - INFO - Successfully uploaded ranked data to BigQuery.
logs from the new updates everything matches
I think i found the source of the problem - my step 8's chunkified process - it's doing the query_metrics N times
going to run tests
Yea, I’m running a test of the JOIN query of my tables, get that length, and then querying your tables, get that length and then compare if the diff is significant
ok so i did the Step8's JOIN queries:
• yours came out 51 million entries
• mine was 45.9 million
The difference is most probably due to your icd10 table being 116 million and mine is 80 mil
So my question for you is, if I performed the JOIN query once (like the test above), could I chunkify the data aggregation and use that same JOIN query df or would that also be wrong?
If not, then I guess I'll just have to get rid of the chunkified method altogether
ok good news, Step 6 is confirmed to be fixed!
• adverse_events_retained_labels_prod at 1.7 million
• integ_test_icd_retained_labels at 1.9 million
Step 8 seems like a big bottleneck with millions of entries from icd and metrics tables
Tried 3 times to run a test (home machine, local machine, vs-code VM) and they would all run out of RAM
Trying to run the Jupyterlab code and see how long it takes to finish
i need help - the machine-learning VM died and I'm not authorized to start it back up
It’s starting now give it like 2 min
The Step8 notebooks on GCP keep terminating
I think I'll have to take a different approach
in this new approach I'll do the following:
• Fetch the JOIN query
• batch by batch, take chunks of the query and perform the aggregation process on it
• upload each batch to BigQuery
yea it looks like the Step8 notebook crashed
thats not the error for notebook crashing
I get the error msg but it was the last output on that run
I created a separate branch for the approach I mentioned earlier (involves batch processing and creating pivot tables)
Gonna try to test it out
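roughly the shape of what that branch does - the JOIN query, column names, and aggregation below are placeholders, not the real Step 8 code:
```
# sketch of the batched Step 8 approach (query, columns, and aggregation are placeholders)
from google.cloud import bigquery

client = bigquery.Client(project="ai-projects-406720")
dest = "ai-projects-406720.drug_model.manual_test_adverse_events_icd_metrics_retained_labels"

# 1. run the JOIN query once
rows = client.query("""
    SELECT icd.*, m.metric_value
    FROM `ai-projects-406720.drug_model.integ_test_icd_retained_labels` AS icd
    JOIN `ai-projects-406720.drug_model.metrics` AS m USING (brandname)
""").result(page_size=100_000)

append = bigquery.LoadJobConfig(write_disposition="WRITE_APPEND")

# 2. aggregate one chunk at a time instead of holding the whole result in RAM
for chunk in rows.to_dataframe_iterable():
    pivot = chunk.pivot_table(index="brandname", values="metric_value", aggfunc="sum").reset_index()
    # 3. upload each batch's result to BigQuery
    client.load_table_from_dataframe(pivot, dest, job_config=append).result()
```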
alrighty! so this version that I'm currently running on my local machine seems promising
It's already produced some data on manual_test_adverse_events_icd_metrics_retained_labels
quick question, did you ever figure out the clinical trial additions for a bigger table?
looks like Step 8 is fixed - the new method created a table with 79k, but after a query of SELECT DISTINCT it yielded 71k
as for your question, are you talking about Step 10?
I havent been able to find any additional international api's for that yet since I've been focused on debugging as of late
Today, I'll be updating Step 9 based on your new notebook code
are you updating the machine-learning instance? it's down again and I can't start it up
No sometimes it’s down due to inactivity
it's weird, I can start it back up on the Vertex page, but not on the instances page
yea im able to look at the jupyter notebook now
going thru your new step 9, looks like you've changed the sigmoid calculation
Yes it is working now, it takes a while to run end to end. I changed it back to all pairs but the code works flawlessly
also there are 2 frames that contain very similar code for step 9
the first one is ok for now i guess, if it helps u create the process, but the second one is everything
Ok yea I’m currently running the first one
After that finishes, I’ll update it to reflect the second one
awesome! sounds good with me, the second one takes a while
i got to step 4 then it failed i was pissed
integ_test_adverse_events_ranking_logs has 68k
your adverse_ranking_prod table has 65k so I think this is a SUCCESS
going to update the code to reflect the 2nd one
We have to make sure that data is in step 9
Sorry what do you mean by that? This is the resulting table of Step9
Also, I’m currently running the updated code based on second frame
And step 9 with 68k came from the new step 8 data?
however, that 68k was based off the 1st frame on jupyterlab
Currently on Epoch3 for the updated code based off the 2nd frame on jupyterlab
seems like there's 4 to 5 hours between each epoch so far
That said, given the runtime of our entire pipeline, and lack of automated deployment, i think i need to setup Kestra for our production solution
(When all this is done of course and when you have the data needed for your demo)
lets try to get the data pipeline working correctly end to end before we talk about switching, but i understand what you mean
yea for sure!
Ok so the first frame of Step 9 came out to 68k, while the second frame came out to 2 million (and ran for about a day)
i know that frame 1 is only a limited version to how many pairs it creates - did your 2nd version ever finish running? I remember you mentioned it failed on epoch 4
i noticed my batch_size was set to 32, going to update it to 128 and set epochs to 5. Gonna run it again
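for context, here's roughly where those two knobs sit in the training loop - a sketch, not the notebook's exact code:
```
# sketch of where batch_size and epochs plug into the RankNet training (not the notebook's exact code)
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

BATCH_SIZE = 128  # was 32
EPOCHS = 5

class RankNet(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x_i, x_j):
        # probability that item i outranks item j, from the score difference
        return torch.sigmoid(self.scorer(x_i) - self.scorer(x_j))

def train(x_i, x_j, labels):
    model = RankNet(x_i.shape[1])
    opt = torch.optim.Adam(model.parameters())
    loss_fn = nn.BCELoss()
    loader = DataLoader(TensorDataset(x_i, x_j, labels), batch_size=BATCH_SIZE, shuffle=True)
    for epoch in range(EPOCHS):
        total = 0.0
        for a, b, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(a, b).squeeze(-1), y)
            loss.backward()
            opt.step()
            total += loss.item()
        print(f"Epoch {epoch + 1}, Loss: {total / len(loader)}")
    return model
```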
lets just run the one with the number of pairs listed cuz we need to move on from this this week, so instead of spending a day running that, finalize the working one with the reduced pair count
the result of the first version can be found on integ_test_adverse_events_ranking
awesome lets use that, there are some things we need to get done asap, so lets hop on a call when thats done
you want me to run the first one again?
What i'm saying is that the result is already up for that first version
i have a doctors appointment in like 5 minutes but i can be quick
I need help obtaining a new API key for Voyage
I'm not able to properly auth the VoyageAIEmbeddings
For that step? Yes it is ugh boy one second
so i know we have a meeting about the dashboard tomorrow
the last time i ran step10, it was about 7 hrs, so just in case it doesnt finish by EOD, I'll continue to run it thru tonight
which step 10 is this the file system or clincial trials? is there already a table with a lot of them populated
so i have a table saved from the previous run (from like 2 weeks ago) and I'm running the classification part off of those abstracts
this is good for right now, let me use this
SELECT * FROM `ai-projects-406720.drug_model.clincial_trial_prod` LIMIT 1000
this is the final product im expecting
think it just needs to be rerun with all the code
ok yea, I'll run the ingestion part too on the new Step 9 rank table
I also wanted to mention that the vs-code-machine-learning instance isnt stable. It sometimes would suddenly terminate outta nowhere - this is another issue I hope to solve with a formal pipeline instance
maybe we need to use github codespaces
We could, but if this is about the repos being too many, I’m going to combine them all into one repo after we get this data for the dashboard done
oh no its about being able to share code and stuff lol
Unfortunately, the Step 10 run crashed ughh I'll look into fixes for this
made some adjustments and running a new test on my local machine
ok so while im waiting on Step10 to finish, I'll tinker with Looker today
maybe later this afternoon, i'll need to take a look at what i'm working with first
ok i can assist if needed its not that bad
how do i navigate to the new dashboard that you created?
Under "shared with me" i only have Tortellignece.ai MFP 2.0 - Private & Confidential
i sent a request for access on the blue dashboard you showed earlier today
i already am tinkering with a duplicate of the old one
yeah, i dont actually know what each of your tiles have
also, im having trouble with getting data to show up - I've checked that the data sources are properly connected
Let’s have a short meeting on this if ur free
https://themedialaboratory.slack.com/archives/D088F7N2UG2/p1744134260051499
SELECT
  *,
  LAG(rank) OVER (PARTITION BY brandname, manufacturername, activesubstancename ORDER BY caseyear) AS lastyear_rank
FROM
  `ai-projects-406720.drug_model.integ_test_adverse_events_ranking`
clinicaltrailprod you have to create a blend with the v3 table
i added the columns that I could see from your screenshot (e.g. pmid, risk_assessment, etc)
i am almost done with the clinical trial data on the backend
Do you mean that you're making changes to the jupyterlab code?
oh cool
mine was in the middle of the litigation recommendation phase, but terminated unexpectedly at 4223 out of 4724
I'm able to salvage yesterday's run bcuz I saved csv's between each phase
hmm? did you do the like or get all the drugs?
yea - I'm processing the step 9 ranking table's updated results
the beginning of step 10 ingested all the clinical abstracts for each drug name obtained from Step 9's ranking table
so on that dashboard duplicate that I'm working on - What else should I tweak about it?
it's weird bcuz it shows my mic is picking up my voice
So i think i have the line graphs set up right, but due to the metrics being blank it doesnt show anything
just wanted to confirm i did that correctly
my step 10 llm recommendation phase created integ_test_drug_ranking_llm_recommendation in case we wanted to have something to use for the dashboard demo
like can you show me the code for this?
also, how do you intend to use this? how are you going to reference it back to any of the clinical trials and pmids with just a brand_name recommendation and reason?
This is for Step 10 https://github.com/josh-SL/clinical_extraction
I wasnt aware it was missing those columns - i was still using the older version of the jupyterlab code during yesterday's run. I'll be updating Step 10 today with your latest jupyterlab code
I'll also add a test asserting the pmid column is present for each entry
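something like this - pointed at my integ test table for now, and assuming pmid is the column name in the final schema:
```
# sketch of the pmid assertion (table and column names assumed from my last run)
from google.cloud import bigquery

def test_every_step10_row_has_pmid():
    client = bigquery.Client(project="ai-projects-406720")
    df = client.query(
        "SELECT pmid FROM `ai-projects-406720.drug_model.integ_test_drug_ranking_llm_recommendation`"
    ).to_dataframe()
    assert not df.empty, "Step 10 table should not be empty"
    assert df["pmid"].notna().all(), "every Step 10 entry should carry a pmid"
```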
just to make sure we're on the same page - what are the expected columns for Step 10?
Look at clinical trial prod
ok so just to confirm, I do have those columns from clinical trial prod - but you're saying that the final table from Step 10 would have those columns + llm recommendation and reason columns
did I understand that correctly?
Yes yes! Let's think through it - having a recommendation doesn't mean anything if we don't know what it's recommending lol
lol ok gotcha, i thought the target audience only cared about what drug had what recommendation
fixing it rn
But they still need to know the reference of the recommendation lol
did you push your changes to your script onto the repo?
i'm looking at the jupterlab code and im not sure what the updates are
No I haven’t pushed anything what changed
well as it stands, the dev branch of step10 now has those expected columns
I'll wait for your changes before doing another full run of step10
Awesome ! Can you try to get everything into one GitHub repo and multiple branches Zac or even multiple folders of the same repo
If u need naming conventions that are best we can have a meeting
i'll go ahead and start consolidating all the steps into 1 repo
I am away from my pc now but when are u leaving in an hour or so?
oh ok it's all good then, I have a good idea on how to structure this repo
did you have a preference for the name of the repo?
I think it should be shield-genai-tortelllifence
i think we as a team might move to github codespaces for everything, not too sure, but the github repo with everything is going to be good
the latest table u completed, whats that name
Yea I’m excited for this new repo bcuz i learned that Kestra can be configured to get each step’s code from there, also i can refactor for proper practices like a .env file and automated deployment
The step10 table?
integ_test_drug_ranking_llm_recommendation
and i was looking to see what to do next
like which step i need to work on based off of what u did
Correct - i have to refactor with your new recommendation code
Step 10 has abstracts ingestion, abstracts classification, litigation recommendation all done
But any updates on the code that you made recently, I need it to refactor the code currently on the dev branch
I would say just get everything in there and refactor it later
But my disclaimer is that this is not a simple copy and paste - I'll still need to update the file references in the code as I'm moving each step into this repo
hey man, i am off this week, but i deff want to focus on your development with this code base, so we can focus on you this week and get anything unresolved or any confusion out of the way with the project task expectations, so let me know whenever you get on!
g'morning! I've completed putting all Steps (1 to 10) on 1 repo. I've refactored all of them to use a .env file
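the pattern each step follows now looks roughly like this - the variable names are just examples, not the exact ones in the repo:
```
# sketch of the .env pattern each step now follows (variable names are examples)
# .env sits at the repo root, e.g.:
#   GCP_PROJECT=ai-projects-406720
#   BQ_DATASET=drug_model
import os
from dotenv import load_dotenv
from google.cloud import bigquery

load_dotenv()  # reads .env instead of hard-coding keys in the notebook

client = bigquery.Client(project=os.environ["GCP_PROJECT"])
DATASET = os.environ["BQ_DATASET"]
```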
I was planning on setting up Kestra pipeline this week
is your updated Step 10 on Jupyterlab or on our github repo?
One of the unresolved issues is that the Jupyterlab code for Step 9 still doesnt pass the test for matching input and output data lengths
This screenshot shows that 1000 entries were used for sample_data, but the result had only 3 entries
dont set up kestra, we have to discuss as a team
whats the github repo, can u make me owner/admin of it
ok, I stopped the kestra instance
But i managed to test it to gather info on it, and so far the gains we would have from it are:
• Scalable deployments (we're going to have pipelines for drugs, food, clothing, etc)
• automated deployments
• Realtime Gantt charts (we can see how long each step takes to finish)
• Security in keeping our api keys (using a .env instead of having them out in the open in a notebook)
• less clunky and more reliable (running our code on Big Query Workflow would crash a lot and caused delays in my development time)
I also sent the git repo transfer request to you
is kestra in gcp i just need to go over what it is cuz im not sure
i did github codespace i like it a lot
its what we been trying to do
Kestra runs on a docker instance which runs in GCP - I was using the free version
Here's an example of a pipeline UI from their tutorial
We'll be able to monitor pipelines running in realtime with these Gantt charts
hows it look, flow and performance wise
I stopped the instance upon your request yesterday
I could go ahead and finish setting it up today with our new repo if that’s ok
The other thing I’m working on is reconciling the discrepancy between our datasets. I’m using the FASENRA entries as a benchmark. So far i was able to gather more ICD10 entries after running Step 1 specifically for 2015. I’m going to continue to look for the years with missing entries
i think the most pronto thing is the dashboard, how has that been looking? is all the data from the tables uploaded in there?
All the tables that you told me to put up are there
But as I’ve mentioned, the line graph doesnt show anything since those fields being referenced are blank
I didnt receive any more specs needed for the dashboard
Should I remove the Drug Litigation section?
oh i havent talked to him about it, i was trying to talk to him today
i'm not sure if that's ok since we're just using it as a placeholder
I've been working on our data and pipeline this whole time since I'm not sure what direction to take for that dashboard
yes im looking at it now, the direction is still the same, its just that the data needs to be populated. was the last step run of the prediction with the clinical trial summary or the recommendation?
the last step ran was step 10 (without your updates)
but the blend that it's using is your data table
just wanted to show my progress on Kestra - this Gantt chart is pretty useful in remotely monitoring the pipeline
Gotcha I am on vacation so that’s why I been a little spotty this week, however was the step done with my updates since we can’t use the data table with only the single recommendation
I never got your updated code - would you like me to run step 10 from jupterlab notebook?
otherwise, i have integtestrankingllmrecommendation just to populate the tables
that integ test doesnt mean anything since we cant relate the recommendation run back
this is the recommendation script that needs to be run again
im assuming i'll run it based on your tables?
`ai-projects-406720.drug_model.integ_test_clinical_abstracts`
no problem! i got github codespaces, i am testing it out now with the github code u gave me
this is bad practice but in the interest of time, here's the .env file - you'll need it to be at the project's root dir lol
mini status report - the new step 10 code failed due to ValueError: columns overlap but no suffix specified: Index(['error'], dtype='object')
Working on a fix for it...
yep, got the script running all thru yesterday and it finished this morning
I've had to tweak the new jupyterlab script bcuz the resulting table did not have the score columns
I'm also trying to see if i can fix the blend so the case_year isnt null
nice!! we dont need to show case year in there, what else is going on, like what steps are left to do you think?
• Step8 data discrepancy: I need to go back to my investigation of the data discrepancy in Step8's table. It's caused by a domino effect that goes all the way back to Step2's table. I'm going to do a more granular ingestion to ensure it doesn't have missing entries (those missing entries are affecting the metrics calculations)
• Production level pipeline: I've succeeded in creating a way to get .env values into our free version of Kestra (the paid version has a nice UI to upload them). I also succeeded in getting some integration tests to run as an example. But the Kestra runs are each isolated and do pip installs each time and sometimes run out of space, so I have to look into optimizing that (maybe find a way to make the virtualenv persist throughout Steps 1-10)
Hmm so before kestra is there or is the pipeline or the data completed and looks good
sorry i dont understand what you meant by that
great news! i got our Kestra pipeline to stabilize (by creating a docker image with our dependencies) and now have a remote & dependable way to run our entire project!
this is helping me with debugging because the Big Query Workflow was prone to crashing/terminating which caused delays. it also expedites the deployment process
really? you have to show it to me this is a great job!
haha thanks, I'm still trying to figure out how your icd table has 116 million. So i'm rerunning Steps 1 and 2 and then comparing them to your table using FASENRA as the reference drug
http://34.46.241.86:8080/ui // the url of Kestra
username: dev@shield-legal.com pwd: tortAI123$
There are a lot of failures - but that was when I was trying to figure out Kestra's yaml, so lots of trial & error lol
Heya! I was able to reconcile the missing data of 2024 for Step2. So I'll continue doing this method til my Step2 table is closer to the original Step2 table
whenever you get on, lets have a quick meeting
Good morning - i'm ready for a quick meeting
lets wait till whenever you're in office to go over some stuff
gmorning! the integ_test_adverse_events_ranking table has been updated
currently running Step10 for the recommendations
Should i change the data source on the Drug Rankings table of the dashboard?
how are you running this? is it in that app or just vs code from github?
it's running on a GCP instance - the Kestra pipeline
with Kestra, I can check on its progress wherever I am
There's something weird with that SQL query on the dashboard
my table has rank 1's but our dashboard's query is doing something funky
im free to do a call rn if you're cool with that
i am bout to walk downstairs to get my lunch, but after i should be free
Here's a quick summary of how to access our Kestra pipeline: ```1. Go to http://34.56.77.27:8080/ui (you can obtain the IP address from our GCP instances page)
NOTE: Kestra treats any logs as an "error" even though it isn't. Fixing that is on my TODO list```
Here's an example of a Flow yaml, I've put emojis on the important configurations
This abstracts the version of code we are running from the pipeline infrastructure that we want to run it on
ok lets go over today how to run this
ok in a little i stepped out for lunch on my end
Yes! I have been super busy today with my WiFi being out, trying to get it fixed with google, it's been so annoying
if you'd like, i could also try to make instructions on how to run it with screenshots
also, i wanted to ask for some collab time on 2 issues:
• Step 9 is up to date with the Jupyterlab code but still has not passed the test for input/output data lengths
• Dashboard ranking table isn't showing any drug with rank 1 (but the source table has rank 1's, so I'm guessing it has something to do with the query being done)
i'm doing some code cleanup and double checking Step 9. The current code on dev and main are identical to the Jupyterlab notebook, yet tests would show the inconsistency in the input/output lengths
let me know when ur done with the code cleanup, lets hop on a call
step 1 yielded 977,523 entries using the csv called drug_adverse_event_data_combined_datasets_chunked.csv
that's about 30 million off from adverse_events_prod
So what i'll do is copy adverse_events_prod and run Step 2 on that instead
All the code I'm running now is from the Jupyterlab Notebook and the data is based off your original table adverse_events_prod. The code is on the "jupyterlab" branch
Step 2 crashed after running for 5 hours. It attempted to execute process_adverse_events_data() . I'm sure this code ran well when the dataset was less than 37 million, but we need to collab on a piece-wise solution.
Tomorrow, I'll look into delegating that process to the Big Query through SQL
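the rough idea is below - the aggregation and columns are placeholders, the real process_adverse_events_data() logic would have to be translated piece by piece:
```
# sketch of pushing the Step 2 processing into BigQuery instead of pandas
# (aggregation and column list are placeholders, not the real process_adverse_events_data logic)
from google.cloud import bigquery

client = bigquery.Client(project="ai-projects-406720")
client.query("""
    CREATE OR REPLACE TABLE `ai-projects-406720.drug_model.integ_test_adverse_events_processed` AS
    SELECT
      brandname,
      activesubstancename,
      caseyear,
      COUNT(*) AS case_count            -- the heavy lifting stays in BigQuery
    FROM `ai-projects-406720.drug_model.adverse_events_prod`
    GROUP BY brandname, activesubstancename, caseyear
""").result()  # nothing gets pulled into local RAM
```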
ok i will give u a call shortly
i think i have a viable sql solution. I'm doing spot checks and it has matching entry counts
I also noticed something odd - how does your Step 1 table not have a case entry but Step 2 has 6 of them?
Is there another data ingestion somewhere else? Perhaps a manual one that was done in the past?
ok do you have a couple minutes at the top of the hour
It's worth noting that casetext.com no longer has the service working, so Step9's ingestion of case texts has been deprecated
Their page just says: "This service is no longer available, but we appreciate you being a part of it. For legal research, please visit Westlaw, and if you're curious about legal AI, check out CoCounsel. Thanks for stopping by!"
I was able to produce a table from the Jupyterlab Steps 8 & 9 named jupyter_adverse_events_ranking
and the result looks promising
It has unique active substance name and only has a total of 4,884 entries. Perhaps you'd be inclined to look at this table
Yes I can take a look! I been re running some of the code myself thank u I will let u know tomorrow
yea thanks!
the caveat is that Step 8 was using "replace" so the data kept replacing until it was only 2024. But if you validate the ranking table as "good" then I can make the adjustments to the code to include all years. On the plus side, both Step 8 and Step 9 tables have the exact same length
I also went ahead and created kestra_adverse_events_ranking to contain all the years, it has 10,059 entries
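in code terms the fix is basically the write mode on the upload - a sketch, the notebook may do this through pandas instead, but the idea is the same:
```
# sketch of the "replace" caveat: each yearly run was truncating the table so only 2024 survived;
# appending keeps every year (what kestra_adverse_events_ranking does now)
from google.cloud import bigquery

client = bigquery.Client(project="ai-projects-406720")

def upload_year(ranking_df, table_id="ai-projects-406720.drug_model.kestra_adverse_events_ranking"):
    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND  # was effectively a truncate-and-replace
    )
    client.load_table_from_dataframe(ranking_df, table_id, job_config=job_config).result()
```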
Hey, we can talk - little bit busy but I can switch these tables, I can show u what I did
I tested to see DUPIXENT and it has a unique entry per year
Yes I did my test this morning, I can switch to yours for sure
sounds good, plz let me know how that turns out
so is the data on kestra_adverse_events_ranking good?
I used the one u sent this morning, I haven't checked that one yet, prob will do it in the morning, im building out all the graphs in python
u never created the recommendation code adjustment correct ?
no i did not alter that Step 10 recommendation code
I’ll get em to ya once I’m back from this meeting
Hi James, do our ChatGPT accounts need an admin to allow usage or increase quota? I tried to use the api key yesterday and it gave an error 429 stating that I've reached the quota (even though it's the first time I was using it)
The enterprise ChatGPT? I think it's under Ryan's name, I am not even sure if I have admin, but actually let me look cuz he is out on vacation
ok i am on i am the owner so i can adjust things
thanks! and also plz verify that my account has permissions to use API keys
Nick has me working on Short Form being filled out for Acts and Dichello concerning sexual abuse cases
I tested ChatGPT to extract info from docx files yesterday and it yielded the expected results, so I'm trying to automate that process
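roughly what I'm putting together - the model name and prompt are placeholders, and it assumes the key works once the quota issue is sorted:
```
# sketch of the short-form extraction automation (model name and prompt are placeholders)
from docx import Document        # pip install python-docx
from openai import OpenAI        # pip install openai

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

def extract_short_form_fields(docx_path: str) -> str:
    text = "\n".join(p.text for p in Document(docx_path).paragraphs)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract the fields needed for the short form from this document."},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content
```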
yea i just needed to know who to assign the api key too i have to create a project and stuff
in chatgpt itself i created a new project for you called short form, so you should use that for your like questions and code stuff
now let me do it on the billing side for api keys
sk-proj-epGZHG4dY-YaxrhyDAMeoGyxcGHbCex2F8LCp82WPexS2gMeVC17KGnt0lzUvTMYJijX8GdTT3BlbkFJRZHxWz21EsO1eBO3Hn3lSk8wL2_oRkCLZ94HFUaOeVFWxGlWlZSjKUFk0JhOA4vwwtY9yFdogA
let me know when u have it so i can delete
just tested it and i got the same 429 error
i verified that the client does in fact have the key
openai.RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: <https://platform.openai.com/docs/guides/error-codes/api-errors.>', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}
I'm just trying to use chatgpt (if that's free)
yes all my models i use is in bedrock llm
i mean I could use bedrock llm if that's ok with you
let me give u this as a template and see if it works
i could also use gemini if that works with our GCP account
naw i dont have anything related to gemini
you should be able to tweak this to what you need but this is basically the template
like this - the prompt file has the keys and connection, the ipynb file has the run commands
G'morning James, I tried to create API keys in OpenAI but it asked me to create a new org - are you able to create an API key for me? (it'll be used for a project for Abe)
do u need a new api key i have one existing here for u
sorry 3 other people pinged me same time
it's all good
I think the last api key i received was for AWS Bedrock
do u want both and or use anthropic
sk-proj-PYxX1UDM76o5AaydkP65bORfdNbvLZZOkVJ1O61h64fELyb8ij2m-i57uHkEytiUUgq9Rbo1T3BlbkFJNxk9cTFliMJweU2Xtpwbwd23Vp3XBYq2isnjnoQEl2RK65Z6RXBKgOvEGym85-yqfM7ZezyHYA