Thought about it a bit on the way back to my desk: you could write a loop that goes through the 80k rows, or just split them into new dataframes using groupby(medication_name),
Add all those dfs into a dictionary called something like Unique_med_data and write a function that uses diff() to count the differences per column for each row in that specific medication's DF
That way you can clearly see if certain columns have a higher diff rate and look at how merge is handling those
I don't know if that's the best way to do it, but it's the first thing that comes to mind lol
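Rough sketch of what I mean (totally untested, and the medication_name / _old / _new column names are just placeholders for whatever your merged table actually uses):
```
import pandas as pd

# Sketch only: assumes a merged dataframe where each value column exists twice,
# suffixed _old and _new, plus a medication_name column (placeholder names).
def diff_counts_per_medication(df, value_cols):
    unique_med_data = {}   # one sub-dataframe per medication
    summary = {}           # per-medication count of mismatches per column
    for med, med_df in df.groupby('medication_name'):
        unique_med_data[med] = med_df
        counts = {}
        for col in value_cols:
            old, new = med_df[f'{col}_old'], med_df[f'{col}_new']
            # a row counts as a diff when the two sides disagree (NaN vs NaN is a match)
            counts[col] = int(((old != new) & ~(old.isna() & new.isna())).sum())
        summary[med] = counts
    # rows = medications, columns = value columns, cells = number of differing rows
    return unique_med_data, pd.DataFrame(summary).T
```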
haha alright, thanks for your insight! i'll give that a try
actually if you have a moment, can i drop by?
Hey dude, I haven't made a doc like this before, can you let me know if this makes any sense??
Can you take a look at this and let me know if it makes any sense?? https://docs.google.com/document/d/1XG4lVyJw139bxY_2E57r3b5E6n5gtwAngqZKqiXp_pM/edit?usp=sharing
V nice dude! Quite comprehensive but didn't feel like a wall of text
Great headers and highlights on relevant table names 💯
Hey there~ I'm jumping between a few things but I can chat. What's up?
i have a tableA that has 122 entries and tableB has 153 entries
but when i do a merge to see a diff, it says tableA has 30 entries that tableB does not have
what might cause that? it doesn't make sense to me
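here's roughly the kind of diff i'm running btw (file names and the leadid join key are just stand-ins for my real ones):
```
import pandas as pd

# stand-in file names and join key, not my real ones
a = pd.read_csv('tableA.csv')
b = pd.read_csv('tableB.csv')

# outer merge with an indicator column to see which side each row came from
merged = a.merge(b, on='leadid', how='outer', indicator=True)
only_in_a = merged[merged['_merge'] == 'left_only']    # entries tableA has that tableB doesn't
only_in_b = merged[merged['_merge'] == 'right_only']   # entries tableB has that tableA doesn't
print(len(only_in_a), len(only_in_b))
```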
If you are running a script to find the differences and see that table A has 30 more unique entries than table B, it might be that table B has duplicates.
It sounds like table B has more TOTAL entries, but some of them may be duplicates.
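Quick way to sanity-check that (the file name and the leadid key column here are placeholders, swap in whatever your actual key is):
```
import pandas as pd

# placeholder file name and key column
b = pd.read_csv('tableB.csv')

print(len(b))                  # total rows, e.g. 153
print(b['leadid'].nunique())   # rows with a unique key
# show every row whose key appears more than once
print(b[b.duplicated('leadid', keep=False)].sort_values('leadid'))
```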
but when i run a SQL query that only keeps DISTINCT *, the entries count stays the same at 153
Comments and names aside, I built this simple script to compare two CSVs and find and list specific discrepancies
What this does is compare my measurements against Ryan's, but it can be used to compare any dataframes that SHOULD be the same, and it spits out a csv with the unique values per dataframe
You should be able to tweak it a bit: import your dataframes, specify the column you want to compare, and have it give you a clean reading of the differences
this seems to only deal with specific columns amirite?
```
import pandas as pd

# load the two query exports
df1 = pd.read_csv('NewFinqueryresults.csv')
df2 = pd.read_csv('OldFinqueryresults.csv')

# give both frames the same key column name
df1 = df1.rename(columns={'lead_id': 'leadid'})
df2 = df2.rename(columns={'LR Lead ID': 'leadid'})

# compare the keys as strings and index both frames on them
df1['leadid'] = df1['leadid'].astype(str)
df2['leadid'] = df2['leadid'].astype(str)
df1.set_index('leadid', inplace=True)
df2.set_index('leadid', inplace=True)

# keep only the rows that exist in both frames
df1, df2 = df1.align(df2, join='inner', axis=0)

# a cell counts as different when the values disagree and they are not both NaN
diff_mask = (df1 != df2) & ~(df1.isna() & df2.isna())
diff_rows = df1[diff_mask.any(axis=1)]

# stack the differing rows from both frames and export them
combined_diff = pd.concat([df1[diff_mask.any(axis=1)], df2[diff_mask.any(axis=1)]], keys=['NewFin', 'OldFin'])
combined_diff.to_csv('fullrowdifferences.csv')
print("Row differences exported to fullrowdifferences.csv")
```
ok so the resulting csv gives the same result as my other script
so my question stands, and i am quite confuzzled by this
Weird. How big are these dataframes? Would it be possible to inspect them manually in Jupyter?
i thought duplicates were the culprit too but not the case
i literally have the 2 tables as csv files locally
Are you normalizing the data at any point? If it gets normalized during the wrong step, the differences might actually just be minor things like spaces or something?
semantically, this would mean that tableB has no entries that are not in tableA amirite?
Try normalizing it before the comparison, like throw it all into lower() and get rid of spaces
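Something like this (sketch only, it just hits every text column; adjust the rules to whatever normalization actually makes sense for your data):
```
import pandas as pd

# sketch: lowercase, trim, and collapse whitespace in every text column
# before running the diff, so formatting-only differences don't count
def normalize(df):
    out = df.copy()
    for col in out.select_dtypes(include='object').columns:
        out[col] = (out[col].str.lower()
                            .str.strip()
                            .str.replace(r'\s+', ' ', regex=True))
    return out

df1, df2 = normalize(df1), normalize(df2)  # then compare as before
```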
Can you send me the two CSVs? I kinda wanna try something out of curiosity
sure, although i've pivoted to a different approach
i had to set it aside, but this did help me determine which step was causing the data discrepancy
I mean, that was the ultimate goal right?
Glad we got a little further, even if it was not exactly what we were looking for lol