Deduplication
Dedupe
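To dedupe a dataframe on an Email field, keeping only the first record for each address, use drop_duplicates. Lowercase the field first so that addresses differing only in case are treated as the same value.
df['Email'] = df['Email'].str.lower()
dfDeduped = df.drop_duplicates(subset=['Email'], keep='first')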
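As a minimal sketch of the dedupe step above, the following runs it on a small, made-up dataframe. The sample rows and the Name column are hypothetical, and the extra str.strip() call is an assumption added here to guard against stray whitespace; it is not part of the original snippet.

```python
import pandas as pd

# Hypothetical sample data; only the Email column comes from the text above.
df = pd.DataFrame({
    "Email": ["Ann@Example.com", " ann@example.com", "bob@example.com"],
    "Name": ["Ann", "Ann B.", "Bob"],
})

# Normalise whitespace (assumption) and case so variants compare equal.
df["Email"] = df["Email"].str.strip().str.lower()

# Keep the first row for each email; later rows with the same email are dropped.
dfDeduped = df.drop_duplicates(subset=["Email"], keep="first")

print(dfDeduped)
```

Passing keep='last' instead would retain the final row for each address, which can be preferable when later records are more up to date.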
Find duplicates
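Use duplicated with keep=False to select every row whose 'My field' value occurs more than once (all copies, not just the later ones):
from tabulate import tabulate
df2 = df1[df1.duplicated(['My field'], keep=False)]
print(tabulate(df2.head(10), headers='keys', tablefmt='psql', showindex=False))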
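The effect of the keep argument is easiest to see side by side. The sketch below compares the three options of duplicated on a made-up column; the sample values and the mask variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df1 = pd.DataFrame({"My field": ["a", "b", "a", "c", "b", "a"]})

# keep='first': every occurrence after the first is flagged.
first_mask = df1.duplicated(["My field"], keep="first")

# keep='last': every occurrence before the last is flagged.
last_mask = df1.duplicated(["My field"], keep="last")

# keep=False: all occurrences of a repeated value are flagged.
all_mask = df1.duplicated(["My field"], keep=False)

print(pd.DataFrame({
    "My field": df1["My field"],
    "first": first_mask,
    "last": last_mask,
    "False": all_mask,
}))
```

Only keep=False returns the complete set of duplicated rows, which is usually what is wanted when inspecting duplicates before deciding which copy to keep.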
Find non-duplicates
Negate the same mask with ~ to keep only the rows whose 'My field' value occurs exactly once:
df2 = df1[~df1.duplicated(['My field'], keep=False)]
print(tabulate(df2.head(10), headers='keys', tablefmt='psql', showindex=False))
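Because the duplicate and non-duplicate selections use the same mask, it can be computed once and reused to split a dataframe into two complementary parts. This is a sketch on made-up data; the Value column and the names is_dup, dupes and uniques are hypothetical.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df1 = pd.DataFrame({
    "My field": ["a", "b", "a", "c"],
    "Value": [1, 2, 3, 4],
})

# Compute the mask once: True for every row whose value is repeated.
is_dup = df1.duplicated(["My field"], keep=False)

dupes = df1[is_dup]      # rows whose value appears more than once
uniques = df1[~is_dup]   # rows whose value appears exactly once

# The two parts together account for every original row.
assert len(dupes) + len(uniques) == len(df1)

print(uniques)
```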
Merge duplicates
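To collapse rows that share the same Field1 and Field2 into a single row, group by those fields and choose one aggregation per remaining column. The join on Field3 assumes that column contains only strings.
df_Merge = df.groupby(['Field1', 'Field2'], as_index=False).agg({
    'Field3': lambda x: ' - '.join(x),  # concatenate the text values
    'Field4': 'first',                  # keep the first value in each group
    'Field5': 'sum'                     # add up the numeric values
})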
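A small worked example may make the merge above more concrete. The sample values are made up, and the dropna().astype(str) call inside the lambda is a defensive assumption added here so that missing or non-string values do not make join raise an error; the original snippet joins the values directly.

```python
import pandas as pd

# Hypothetical sample data; the first two rows share Field1 and Field2.
df = pd.DataFrame({
    "Field1": ["x", "x", "y"],
    "Field2": [1, 1, 2],
    "Field3": ["red", "blue", "green"],
    "Field4": ["keep", "drop", "only"],
    "Field5": [10, 5, 7],
})

df_Merge = df.groupby(["Field1", "Field2"], as_index=False).agg({
    # Join text values; skip NaN and cast to str first (assumption).
    "Field3": lambda x: " - ".join(x.dropna().astype(str)),
    "Field4": "first",  # first value in each group
    "Field5": "sum",    # total of the numeric values
})

print(df_Merge)
```

The two rows for ('x', 1) collapse into one row with Field3 equal to 'red - blue' and Field5 equal to 15, while the single ('y', 2) row passes through unchanged.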