
Deduplication

Dedupe

To dedupe a dataframe on an Email field and keep only the first record for each address, lowercase the column first so that the comparison is case-insensitive, then call drop_duplicates.

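# Normalise case so 'A@x.com' and 'a@x.com' count as the same address
df['Email'] = df['Email'].str.lower()
# Keep the first row for each address and drop the later repeats
dfDeduped = df.drop_duplicates(subset=['Email'], keep='first')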
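As a minimal sketch of the snippet above, the following runs it on a small made-up dataframe. The sample rows and the `dfLast` variable are assumptions added for illustration; they only show how `keep='first'` and `keep='last'` pick different survivors.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 share an address that differs only by case
df = pd.DataFrame({
    'Email': ['Ann@Example.com', 'bob@example.com', 'ann@example.com'],
    'Name': ['Ann', 'Bob', 'Ann B.'],
})

# Normalise case before comparing
df['Email'] = df['Email'].str.lower()

# keep='first' retains the earliest row of each duplicate group
dfDeduped = df.drop_duplicates(subset=['Email'], keep='first')

# keep='last' retains the latest row instead
dfLast = df.drop_duplicates(subset=['Email'], keep='last')

print(dfDeduped)
print(dfLast)
```

Note that the original row index is preserved; chain `.reset_index(drop=True)` if a fresh 0..n-1 index is wanted.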

Find duplicates

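Use duplicated with keep=False to select every row whose value appears more than once, including the first occurrence: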

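from tabulate import tabulate

# keep=False flags every member of a duplicate group, not only the repeats
df2 = df1[df1.duplicated(['My field'], keep=False)]

print(tabulate(df2.head(10), headers='keys', tablefmt='psql', showindex=False))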

Find non-duplicates

# ~ inverts the mask, leaving only rows whose value appears exactly once
df2 = df1[~df1.duplicated(['My field'], keep=False)]

print(tabulate(df2.head(10), headers='keys', tablefmt='psql', showindex=False))
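The two filters above split a dataframe into complementary halves. Here is a small sketch on invented data; the sample values and the names `mask`, `dupes`, `uniques` and `repeats_only` are assumptions for illustration. It also shows how the default `keep='first'` differs from `keep=False`.

```python
import pandas as pd

# Hypothetical sample data: 'a' and 'b' each appear twice, 'c' once
df1 = pd.DataFrame({
    'My field': ['a', 'b', 'a', 'c', 'b'],
    'Value': [1, 2, 3, 4, 5],
})

# keep=False marks every row that belongs to a duplicate group
mask = df1.duplicated(['My field'], keep=False)

dupes = df1[mask]      # all rows with a repeated value
uniques = df1[~mask]   # rows whose value occurs exactly once

# The default keep='first' marks only the second and later occurrences
repeats_only = df1[df1.duplicated(['My field'])]

print(dupes)
print(uniques)
print(repeats_only)
```

Computing the mask once and reusing it avoids scanning the column twice when both halves are needed.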

Merge duplicates

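# Collapse rows that share Field1 and Field2 into a single row
df_Merge = df.groupby(['Field1', 'Field2'], as_index=False).agg({
    'Field3': lambda x: ' - '.join(x),  # concatenate the text values (assumes strings)
    'Field4': 'first',                  # keep the value from the first row
    'Field5': 'sum'                     # add up the numeric values
})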
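To make the effect of each aggregation concrete, here is a minimal sketch on a made-up dataframe. The sample values are assumptions; the field names follow the snippet above.

```python
import pandas as pd

# Hypothetical sample data: the first two rows share the same (Field1, Field2) key
df = pd.DataFrame({
    'Field1': ['x', 'x', 'y'],
    'Field2': ['k', 'k', 'k'],
    'Field3': ['red', 'blue', 'green'],
    'Field4': [10, 20, 30],
    'Field5': [1, 2, 3],
})

# One output row per distinct (Field1, Field2) pair
df_Merge = df.groupby(['Field1', 'Field2'], as_index=False).agg({
    'Field3': lambda x: ' - '.join(x),  # join the strings of the group
    'Field4': 'first',                  # value from the group's first row
    'Field5': 'sum',                    # total over the group
})

print(df_Merge)
```

The join lambda raises a TypeError if Field3 contains missing or non-string values; in that case something like `' - '.join(x.dropna().astype(str))` is one possible workaround.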

 
