Filter/Extract

Field contains a string

df2 = df1[df1['My field'].str.contains('AAA').fillna(False)]
 
print(tabulate(df2.head(5), headers='keys', tablefmt='psql'))

Field does not contain a string

Just use ~:

df2 = df1[~df1['My field'].str.contains('AAA').fillna(False)]
 
print(tabulate(df2.head(5), headers='keys', tablefmt='psql'))

Field not empty, not NaN, not null

df_notEmpty = df[
    (df['My field'].notnull()) &   # notnull() and notna() are aliases: both exclude NaN/None
    (df['My field'].notna()) &
    (df['My field'] != '') &       # exclude empty strings
    (df['My field'] != 'nan')      # exclude the literal text 'nan'
]
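
A shorter equivalent sketch, assuming the only problem values are NaN/None, empty strings and the literal text 'nan' (same hypothetical column name as above):

import pandas as pd

# Turn the problem values into NA, then keep the rows where the field is not NA
df_notEmpty = df[df['My field'].replace(['', 'nan'], pd.NA).notna()]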

Field starts with

df2 = df1[df1['My field'].str.startswith('A')]
 
print(tabulate(df2.head(5), headers='keys', tablefmt='psql'))

Or, with several possible prefixes (startswith accepts a tuple):

df2 = df1[df1['My field'].str.startswith(('A', 'B'))]
 
print(tabulate(df2.head(5), headers='keys', tablefmt='psql'))

Field is a numeric value

import re

# Function returning the numeric value found in a text (integer or decimal), or None
def extractNumericValue(my_text):
    my_values = re.findall(r'^(?:-?\d+\.\d+|-?\d+)$', str(my_text))
    if my_values:
        return my_values
    else:
        return None
 
# Apply
df['SearchNumericValue'] = df['My column'].apply(extractNumericValue)
 
dfResult = df[
        (df['SearchNumericValue'].notna()) &
        (~df['My column'].astype(str).str.contains('[a-zA-Z]', regex=True))
]
 
print(tabulate(dfResult.head(10), headers='keys', tablefmt='psql', showindex=False))
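
A shorter sketch of the same idea: pd.to_numeric with errors='coerce' turns anything that is not a number into NaN, so keeping the non-NaN rows keeps only numeric fields (same hypothetical column name as above):

import pandas as pd

# Keep only the rows where 'My column' is a valid number (integer or decimal)
dfResult = df[pd.to_numeric(df['My column'], errors='coerce').notna()]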

Field contains an element from a list

Example searching for a space, a star and parentheses (the special characters are escaped because str.contains treats the pattern as a regular expression):

MySearch = [r'\*', ' ', r'\(', r'\)']
 
df2 = df1[df1['zone'].str.contains('|'.join(MySearch)).fillna(False)]
 
print(tabulate(df2.head(40), headers='keys', tablefmt='psql', showindex=False))
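
A sketch that avoids escaping by hand: re.escape protects every regex special character in each element of the list (same hypothetical column and list as above):

import re

MySearch = ['*', ' ', '(', ')']

# Build the pattern with every element safely escaped
pattern = '|'.join(re.escape(s) for s in MySearch)
df2 = df1[df1['zone'].str.contains(pattern).fillna(False)]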

Field equals an element from a list

MySearch = ['Bbbb', 'XXX']
 
df2 = df1[df1['zone'].str.fullmatch('|'.join(MySearch)).fillna(False)]
 
print(tabulate(df2.head(40), headers='keys', tablefmt='psql', showindex=False))

Or like a SQL IN:

searchIn= ['AAA', 'BBB', 'CCC']
 
df2 = df1[df1['My field'].isin(searchIn)]
 
print(tabulate(df2.head(5), headers='keys', tablefmt='psql', showindex=False))

Field contains only digits

df2 = df1.loc[df1['zone'].astype(str).str.isdigit()]
 
print(tabulate(df2.head(5), headers='keys', tablefmt='psql'))
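
Note that str.isdigit() is False for signed or decimal numbers; a possible sketch with a regular expression covers those cases too (same hypothetical column):

# Keep rows where 'zone' is an integer or a decimal, possibly signed
df2 = df1.loc[df1['zone'].astype(str).str.fullmatch(r'[-+]?\d+(?:\.\d+)?')]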

Several conditions

Below we get the rows where one field is null or another field is empty:

df2 = df1[(df1['My field'].isna()) | (df1['My other field'] == '')]
 
print(tabulate(df2.head(5), headers='keys', tablefmt='psql', showindex=False))

Field does not end with a concatenation of other fields

Suppose you have this dataframe df:

Language     PDF
Python       docA-Python.pdf
SQL          docF-SQ.pdf
PHP          docS-PHP.pdf
CSS          docG-CSS.pdf
Javascript   docR-Javascript.pdf
HTML         docW-HTML.pdf

Now you would like to extract the rows where the PDF field does not end with [Language field] + '.pdf':

# Boolean Series: does the PDF name end with the Language value followed by '.pdf'?
Compare = df[['Language', 'PDF']].apply(lambda x: x['PDF'].endswith(x['Language'] + '.pdf'), axis=1)
 
df_temp = pd.DataFrame(Compare)
df_temp.columns = ['True or false']
 
# Add the boolean column to the dataframe and keep the rows where it is False
df_concat = pd.concat([df, df_temp], axis=1)
df_filtered = df_concat[['Language', 'PDF']][df_concat['True or false'] == False]
 
print(tabulate(df_filtered.head(5), headers='keys', tablefmt='psql', showindex=False))
Language  PDF
SQL       docF-SQ.pdf
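
A more compact sketch of the same filter, building the boolean mask directly and negating it with ~ (same columns as above):

# Keep the rows where PDF does not end with Language + '.pdf'
mask = df.apply(lambda x: x['PDF'].endswith(x['Language'] + '.pdf'), axis=1)
df_filtered = df[~mask][['Language', 'PDF']]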

Field does not contain an email 😀

...
df_pb_Email = df[df['Email field'].str.contains(r'[^@]+@[^@]+\.[^@]+') == False]
...

Field does not equal an email

...
df_pb_Email = df[df['Email field'].str.fullmatch(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}') == False]
...

Field is not a number

dfProblem = df[(df['My field'].astype(str).str.match(pat=r'^[-+]?[0-9]+$|^[-+]?[0-9]+\.[0-9]+$') == False)]

Field contains another field

# For each row: is FieldB contained in FieldA?
df['A contains B'] = [x[0] in x[1] for x in zip(df['FieldB'], df['FieldA'])]
# Keep the rows where FieldA does not contain FieldB
df = df[df['A contains B'] == False]

Field ends like another field

# For each row: does FieldA end with FieldB?
df['A finish with B'] = [x[1].endswith(x[0]) for x in zip(df['FieldB'], df['FieldA'])]
# Keep the rows where FieldA does not end with FieldB
df = df[df['A finish with B'] == False]

Field with a length different from a number

dfProbleme = df[
    df['Field1'].str.len() != 7
]

Filter counting a string

How many times a string appears in a cell (here: more than one comma):

dfCount = df[(df['Substitutions'].str.count(',') > 1)]

Regex

If the field is equal to (^...$) a number with a comma as decimal separator:

dfTest = df[df['My field'].str.match(pat=r'^\d+,\d+$') == True]

If the field starts (^) with one of several strings (...|...), each one allowing character variants ([...]):

dfTest = df[
    (df['My field'].str.contains(pat=r'^[tT]he animal') == True) |
    (df['My field'].str.contains(pat=r'^[aA]ny animal') == True) |
    (df['My field'].str.contains(pat=r'^(?:[sS]ad|[hH]appy) animal') == True)
]

Extract a specific string

# Delete what is before SpecificString
df['NewField'] = df['OldField'].replace({'.*(?=SpecificString)': ''}, regex=True)
 
# Delete what is after SpecificString
df['NewField'] = df['NewField'].replace({'(?<=SpecificString).*': ''}, regex=True)
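
The same kind of extraction can be sketched with str.extract, which returns NaN where SpecificString is absent (whereas replace keeps the original value unchanged):

# Keep SpecificString and everything after it; rows without it get NaN
df['NewField'] = df['OldField'].str.extract(r'(SpecificString.*)', expand=False)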

Extract names from email

from email_decomposer.EmailDecomposer import EmailDecomposer
 
email_decomposer = EmailDecomposer()
 
df[['first_name', 'last_name', 'host']] = df['email'].apply(
    lambda x: pd.Series(email_decomposer.decompose(data=x, get_host=True)))
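
Without that package, a rough sketch with str.extract, assuming addresses shaped like first.last@host (a strong assumption; rows that do not match get NaN):

# Split 'first.last@host' into three columns with one capture group per part
df[['first_name', 'last_name', 'host']] = df['email'].str.extract(
    r'^([^.@]+)\.([^.@]+)@(.+)$')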

 
