James Turner (jturner@shield-legal.com)
2025-03-28 14:47:58

Thought about it a bit on the way back to my desk, you can create a for loop that goes through the 80k line by line and that creates new dataframes using groupby(medication_name),

Add all those dfs into a dictionary called something like Unique_med_data and write a function that counts the number of diff() to find differences per column for each row in the specific medication DF

James Turner (jturner@shield-legal.com)
2025-03-28 14:48:53

That way you can see clearly if certain columns have a higher count of Diff rate and look for how merge is handling those

James Turner (jturner@shield-legal.com)
2025-03-28 14:51:56

I don't know if that's the best way to do it, but its the first thing that comes to mind lol

Josh Josue (jjosue@shield-legal.com)
2025-03-28 14:52:56

haha alright, thanks for your insight! i'll give that a try

Josh Josue (jjosue@shield-legal.com)
2025-03-28 14:54:33

actually if you have ae moment, can i drop by?

👍 James Turner
James Turner (jturner@shield-legal.com)
2025-03-28 14:54:53

Sure thing, cmon by

James Turner (jturner@shield-legal.com)
2025-03-28 14:55:00

"Come into my office" lol

😆 Josh Josue
James Turner (jturner@shield-legal.com)
2025-04-15 16:54:51

Hey dude, I haven't made a doc like this before, can you let me know if this makes any sense??

Can you take a look at this and let me know if it makes any sense?? https://docs.google.com/document/d/1XG4lVyJw139bxY_2E57r3b5E6n5gtwAngqZKqiXp_pM/edit?usp=sharing

James Turner (jturner@shield-legal.com)
2025-04-15 16:55:03

Hey dude, I haven't made a doc like this before, can you let me know if this makes any sense??

Can you take a look at this and let me know if it makes any sense?? https://docs.google.com/document/d/1XG4lVyJw139bxY_2E57r3b5E6n5gtwAngqZKqiXp_pM/edit?usp=sharing

Josh Josue (jjosue@shield-legal.com)
2025-04-15 17:28:56

Yea sure! Lemme take a look

Josh Josue (jjosue@shield-legal.com)
2025-04-15 17:35:08

V nice duude! Quite comprehensive but didnt feel like a wall of text

Great headers and highlights on relevant table names 💯

James Turner (jturner@shield-legal.com)
2025-04-15 17:36:50

I appreciate that! Thank you!

Josh Josue (jjosue@shield-legal.com)
2025-04-15 17:42:55

Np!

Josh Josue (jjosue@shield-legal.com)
2025-04-30 13:25:19

hey James, do u have a moment to chat?

James Turner (jturner@shield-legal.com)
2025-04-30 13:25:53

Hey there~ I'm jumping between a few things but I can chat. What's up?

Josh Josue (jjosue@shield-legal.com)
2025-04-30 13:27:13

i have a tableA that has 122 entries and tableB has 153 entries

but when i do merges and to see a diff, it says tableA has 30 entries that tableB does not have

Josh Josue (jjosue@shield-legal.com)
2025-04-30 13:27:26

what might cause that? it doesnt make sense to me

Josh Josue (jjosue@shield-legal.com)
2025-04-30 13:28:27
James Turner (jturner@shield-legal.com)
2025-04-30 13:28:52

If you are running a script to find the differences and see that table A has 30 more unique entrees than table B, it might be that table B has duplicates.

It sounds like table B has more TOTAL entrees, but some of them may be duplicates.

Josh Josue (jjosue@shield-legal.com)
2025-04-30 13:29:08

so i thought of the duplicates thing

James Turner (jturner@shield-legal.com)
2025-04-30 13:29:10

Looking at your screenshot now

Josh Josue (jjosue@shield-legal.com)
2025-04-30 13:29:27

but when i run a SQL query that only keeps DISTINCT **, the entries count stays the same at 153

James Turner (jturner@shield-legal.com)
2025-04-30 13:29:35

Here, this might help..

James Turner (jturner@shield-legal.com)
2025-04-30 13:29:40

Here, this might help..

James Turner (jturner@shield-legal.com)
2025-04-30 13:31:12

Comments and names aside, I build this simple script to compare two CSVs to find and list specific discrepencies

James Turner (jturner@shield-legal.com)
2025-04-30 13:33:24

What this does is compare my measurements against Ryans, but it can be used to compare any dataframes that SHOULD be the same and it spits out a csv with the unique values per dataframe

👍 Josh Josue
Josh Josue (jjosue@shield-legal.com)
2025-04-30 13:33:45

oh ok thanks, i'll take a look at that

James Turner (jturner@shield-legal.com)
2025-04-30 13:34:43

You should be able to tweak it a bit and import your Dataframes, specify the column you want to compare and have it give you a clean reading of the differences

Josh Josue (jjosue@shield-legal.com)
2025-04-30 13:34:58

this seems to only deal with specific columns amirite?

James Turner (jturner@shield-legal.com)
2025-04-30 13:35:10

Right now yeah.. but lemme see..

Josh Josue (jjosue@shield-legal.com)
2025-04-30 13:35:19

im tweakin it

James Turner (jturner@shield-legal.com)
2025-04-30 13:39:52

```import pandas as pd

Load files

df1 = pd.readcsv(NewFinqueryresults) df2 = pd.readcsv(OldFinqueryresults)

if you need to change the names on any of the columns between table A and B

df1 = df1.rename(columns={'leadid': 'leadid'}) df2 = df2.rename(columns={'LR Lead ID': 'leadid'})

Make sure the columns you had to change the names of, are the same datatype just to be safe

df1['leadid'] = df1['leadid'].astype(str) df2['leadid'] = df2['leadid'].astype(str)

Set your index column, in my case I am using the lead_id one, but whatever your key is should work??

df1.setindex('leadid', inplace=True) df2.setindex('leadid', inplace=True)

Align both dataframes on the same index and columns

df1, df2 = df1.align(df2, join='inner', axis=0)

Compare rows

diff_mask = (df1 != df2) & ~(df1.isna() & df2.isna())

Filter rows that are different

diffrows = df1[diffmask.any(axis=1)]

Throw those differences side-by-side for inspection..

combineddiff = pd.concat([df1[diffmask.any(axis=1)], df2[diff_mask.any(axis=1)]], keys=['NewFin', 'OldFin'])

and convert that bad boy to CSV

combineddiff.tocsv('fullrowdifferences.csv') print("Row differences exported to fullrowdifferences.csv")```

James Turner (jturner@shield-legal.com)
2025-04-30 13:40:01

I /think/ that will work??

Josh Josue (jjosue@shield-legal.com)
2025-04-30 13:40:03

ok so the resulting csv has the same result from my other script

🙃 James Turner
Josh Josue (jjosue@shield-legal.com)
2025-04-30 13:40:21

oh thanks but i was able to tweak it

Josh Josue (jjosue@shield-legal.com)
2025-04-30 13:40:43

so my question stands, and i am quite confuzzled by this

James Turner (jturner@shield-legal.com)
2025-04-30 13:40:57

Weird. How big are these Dataframes? would it possible to inspect it manually with Jupyter?

Josh Josue (jjosue@shield-legal.com)
2025-04-30 13:41:00

i thought duplicates were the culprit too but not the case

Josh Josue (jjosue@shield-legal.com)
2025-04-30 13:41:20

i literally have the 2 tables as csv files locally

James Turner (jturner@shield-legal.com)
2025-04-30 13:42:32

Are you normalizing the data at any point? If we normalize it during the wrong step, it might actually be differences of minor things like spaces or something?

Josh Josue (jjosue@shield-legal.com)
2025-04-30 13:42:33
Josh Josue (jjosue@shield-legal.com)
2025-04-30 13:42:50

semantically, this would mean that tableB has no entries that are not in tableA amirite?

James Turner (jturner@shield-legal.com)
2025-04-30 13:45:11

Try normalizing it before the comparison, like throw it all into lower() and get rid of spaces

Josh Josue (jjosue@shield-legal.com)
2025-04-30 13:45:36

hmm ok i'll try that thanks

James Turner (jturner@shield-legal.com)
2025-04-30 13:55:01

Can you send me the two CSV's I kinda wanna try something out of curiousity

Josh Josue (jjosue@shield-legal.com)
2025-04-30 13:55:23

sure, altough i've pivotted to a different approach

Josh Josue (jjosue@shield-legal.com)
2025-04-30 13:55:36
James Turner (jturner@shield-legal.com)
2025-04-30 14:20:17

Let me know what you find, I'm curious

Josh Josue (jjosue@shield-legal.com)
2025-04-30 14:22:15

i had to set it aside, but this did help me determine which step was causing the data discrepancy

James Turner (jturner@shield-legal.com)
2025-04-30 14:22:33

I mean, that was the ultimate goal right?

👍 Josh Josue
James Turner (jturner@shield-legal.com)
2025-04-30 14:22:49

Glad we got a little further, even if it was not exactly what we were looking for lol

😁 Josh Josue