Deduplication


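To deduplicate a DataFrame on an Email field and keep only the first record of each group of duplicates, use drop_duplicates. Converting the field to lower case first makes the comparison case-insensitive.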

df['Email'] = df['Email'].str.lower()
dfDeduped = df.drop_duplicates(subset=['Email'], keep='first')
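As a minimal, self-contained sketch of the snippet above: the sample data and the Name column are hypothetical and only serve to show which row survives.

```python
import pandas as pd

# Hypothetical sample data: the first two rows differ only by the case of the email.
df = pd.DataFrame({
    "Name": ["Ann", "Ann B.", "Bob"],
    "Email": ["Ann@Example.com", "ann@example.com", "bob@example.com"],
})

# Normalise case so that the comparison ignores capitalisation.
df["Email"] = df["Email"].str.lower()

# keep='first' retains the first row of each duplicate group;
# keep='last' would retain the last one, keep=False would drop the whole group.
df_deduped = df.drop_duplicates(subset=["Email"], keep="first")

print(df_deduped)
```

Note that drop_duplicates returns a new DataFrame and keeps the original index labels; chain .reset_index(drop=True) if a fresh index is wanted.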

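To list every record whose value in a field occurs more than once, filter on duplicated with keep=False, which flags all the members of each duplicate group:

from tabulate import tabulate

df2 = df1[df1.duplicated(['My field'], keep=False)]

print(tabulate(df2.head(10), headers='keys', tablefmt='psql', showindex=False))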

Conversely, to keep only the records whose value is unique, negate the mask with ~:

df2 = df1[~df1.duplicated(['My field'], keep=False)]

print(tabulate(df2.head(10), headers='keys', tablefmt='psql', showindex=False))
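The two filters above rely on the boolean mask returned by duplicated. The sketch below, built on hypothetical sample data, shows how the keep parameter changes that mask; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical sample data: the value "a" appears twice.
df1 = pd.DataFrame({
    "My field": ["a", "b", "a", "c"],
    "Value": [1, 2, 3, 4],
})

# keep=False flags every member of a duplicate group.
mask_all = df1.duplicated(["My field"], keep=False)

# keep='first' (the default) leaves the first occurrence unflagged.
mask_first = df1.duplicated(["My field"], keep="first")

dupes = df1[mask_all]     # rows 0 and 2
uniques = df1[~mask_all]  # rows 1 and 3

print(mask_all.tolist())
print(mask_first.tolist())
```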

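To merge duplicate records into a single row, group on the key fields and aggregate the remaining ones, here by concatenating Field3, keeping the first Field4 and summing Field5:

df_Merge = df.groupby(['Field1', 'Field2'], as_index=False).agg({
    'Field3': lambda x: ' - '.join(x),
    'Field4': 'first',
    'Field5': 'sum'
})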
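A small worked example of this kind of merge may help. The data below is hypothetical, and the astype(str) call is an added precaution (an assumption, not part of the original snippet) so that join does not fail on non-string values.

```python
import pandas as pd

# Hypothetical sample data: the first two rows share the same (Field1, Field2) key.
df = pd.DataFrame({
    "Field1": ["A", "A", "B"],
    "Field2": ["x", "x", "y"],
    "Field3": ["red", "blue", "green"],
    "Field4": [10, 20, 30],
    "Field5": [1, 2, 3],
})

df_merge = df.groupby(["Field1", "Field2"], as_index=False).agg({
    # Concatenate the text values of each group; astype(str) guards against non-strings.
    "Field3": lambda x: " - ".join(x.astype(str)),
    # Keep the value from the first row of each group.
    "Field4": "first",
    # Add up the numeric values of each group.
    "Field5": "sum",
})

print(df_merge)
```

With as_index=False the grouping keys stay as ordinary columns, so the result has one row per (Field1, Field2) pair and the same column layout as the input.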


Data management with Python, Pandas, Geopandas, SQLAlchemy, Matplotlib and Openpyxl - Deduplication