Python Cheat Sheet for Data Science Practitioners

Here are some Python code snippets that I use very often.

Pandas DataFrame Manipulation

Ensure a data frame column is a float type column

df['column_name'] = df['column_name'].astype(float)

Group by a column and keep the column afterwards

df.groupby(['column_name']).aggregate_function().reset_index()
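As a concrete sketch, with a toy data frame and `mean()` standing in for `aggregate_function()`:

```python
import pandas as pd

df = pd.DataFrame({"city": ["a", "a", "b"], "price": [10, 20, 30]})

# Without reset_index() the group keys become the index;
# with it, "city" stays a regular column in the result
out = df.groupby(["city"]).mean().reset_index()
print(out)
```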

Convert a list of lists into a pandas data frame

df = pd.DataFrame(your_list_of_lists, columns=['column 1', 'column 2', 'column 3'])

Convert a data frame to a list of dictionary values

Let’s say you have a data frame with a name column and an age column, and you want a list of dictionaries, one per row, like this:

[{"name":"rita","age":23}, {"name":"gita","age":45}]

This is the code you would use:

dict_vals = df.to_dict(orient='records')
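A runnable version of the same idea:

```python
import pandas as pd

df = pd.DataFrame({"name": ["rita", "gita"], "age": [23, 45]})

# orient='records' produces one dict per row, keyed by column name
dict_vals = df.to_dict(orient='records')
print(dict_vals)  # [{'name': 'rita', 'age': 23}, {'name': 'gita', 'age': 45}]
```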

Convert a dictionary to a pandas data frame

Let’s say you have a dict as follows:

my_dict = {'mrr': 0.4, 'map': 0.3, 'precision': 0.6}

To convert this to a pandas Data Frame, you can do the following:

import pandas as pd

my_dict={'mrr':0.4,'map':0.3,'precision':0.6}
pd.DataFrame(list(my_dict.items()),columns=['metric','value'])

You will see the following output:

   metric     value
0   mrr         0.4
1   map         0.3
2   precision   0.6

Select rows matching a specific column criteria

Let’s say you want to find rows where the column value matches a specific constraint. You could use the following:

import pandas as pd
df = df[(df['column1'] <= 2) & (df['column2'] <= 3)]
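For example, with toy data (note that each condition must be wrapped in its own parentheses before combining with `&`):

```python
import pandas as pd

df = pd.DataFrame({"column1": [1, 2, 3], "column2": [3, 5, 1]})

# only the first row satisfies both conditions
filtered = df[(df["column1"] <= 2) & (df["column2"] <= 3)]
print(filtered)
```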

Create a new data frame column with specific values

Let’s say you want to add an additional column to a data frame with values generated via some external processing. You can transform the external values into a list and do the following:

vals=[1,2,3,4]

df['vals']=vals

Sort data frame by value

# Sort in descending order
df.sort_values(by=["column_name1","column_name2"],ascending=False)

Get unique values from a data frame column

# Get unique list of values in the df['column_name'] column
uniq_vals=list(df.column_name.unique())

Create a new derived column with df.apply

The goal here is to create a new column with values populated based on the values of an old column. Let’s say you want a new column that adds 1 to a value from an old column.

# Generate a new value from `an_existing_column`
# generate_a_value(x) is a python function that generates a value 
# based on the column value from `an_existing_column`

df['my_new_column'] = df['an_existing_column'].apply(lambda x: generate_a_value(x))
 

If you want to send more than two columns for processing:

# Generate a new value from two or more existing columns

df['my_new_column'] = df.apply(lambda x: generate_a_value(x.column1,x.column2),axis=1)
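A runnable sketch with a toy stand-in for `generate_a_value` that just sums the two columns:

```python
import pandas as pd

df = pd.DataFrame({"column1": [1, 2], "column2": [10, 20]})

# axis=1 passes each row to the lambda, so x.column1 and x.column2
# are that row's values for those columns
df["my_new_column"] = df.apply(lambda x: x.column1 + x.column2, axis=1)
print(df["my_new_column"].tolist())  # [11, 22]
```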
 

Create pandas data frame with column names with a list of lists data

# Set column names

column_names = ["col a", "col b", "col c"]

# Create a DataFrame and assign column names
df=pd.DataFrame(list_of_lists, columns=column_names)
 

Select/display specific columns from a data frame

# select specific columns in a data frame
df= df[['my_col1','my_col2']]

Python List Manipulation

Concatenate two python lists

listone = [1,2,3]
listtwo = [4,5,6]

joinedlist = listone + listtwo

Convert a python string to a list of characters

word = 'abc'
the_list = list(word)

Randomize contents of python list

import random

the_list = ["item1", "item2", "item3"]
random.shuffle(the_list)

print(the_list)
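Note that `random.shuffle` shuffles in place and returns None. If you need a shuffled copy while keeping the original list intact, `random.sample` does that:

```python
import random

the_list = ["item1", "item2", "item3"]

# random.sample returns a new list, leaving the_list untouched
shuffled_copy = random.sample(the_list, k=len(the_list))
print(sorted(shuffled_copy))  # ['item1', 'item2', 'item3']
```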

JSON Manipulation

Convert a dictionary to a json string

import json

r = {'expert': 'True', 'rating': 1.5}
r = json.dumps(r)

Convert a json string back to a python dictionary

import json

my_dict = json.loads(json_string)

Load a json file into a pandas data frame

import pandas as pd

#this assumes one json item per line in json file
df=pd.read_json("path_to_json_file", lines=True)

System Commands

Run a system command from within Python code

import os

os.system("your_system_command")
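`os.system` is fine for fire-and-forget commands, but if you need the command's output, the standard-library `subprocess` module is the usual choice. A minimal sketch (using `echo` as a placeholder command):

```python
import subprocess

# subprocess.run gives you the exit code and captured output,
# which os.system does not
result = subprocess.run(["echo", "hello"], capture_output=True, text=True)
print(result.returncode)      # 0
print(result.stdout.strip())  # hello
```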

File / Directory Operations

Safely create nested directories in Python

import os

#check if path exists, if not create the directory
if not os.path.exists(directory):
    os.makedirs(directory)
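On Python 3.2+, the `exist_ok` flag collapses the check-then-create pattern into a single call (the directory names below are just placeholders):

```python
import os
import tempfile

# exist_ok=True creates the nested path if missing
# and silently succeeds if it already exists
path = os.path.join(tempfile.gettempdir(), "demo_parent", "demo_child")
os.makedirs(path, exist_ok=True)
os.makedirs(path, exist_ok=True)  # safe to call again
```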

Evaluation

Compute Cosine Similarity

import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def get_cosine(vec1, vec2):
    
    # get intersecting keys
    intersection = set(vec1.keys()) & set(vec2.keys())

    # multiply and sum weights
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    # compute denominator
    sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
    sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    # return cosine score
    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator


def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)

def get_cosine_val(text1,text2):
    
    # turn text into vector counts
    vector1 = text_to_vector(text1)
    vector2 = text_to_vector(text2)

    # compute similarity
    cosine = get_cosine(vector1, vector2)
    return cosine
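A quick sanity check, condensing the helpers above into one self-contained snippet: identical texts score 1.0 and texts with no words in common score 0.0.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

def text_to_vector(text):
    # bag-of-words counts, same representation as above
    return Counter(WORD.findall(text))

def get_cosine(vec1, vec2):
    intersection = set(vec1) & set(vec2)
    numerator = sum(vec1[x] * vec2[x] for x in intersection)
    denominator = (math.sqrt(sum(v * v for v in vec1.values()))
                   * math.sqrt(sum(v * v for v in vec2.values())))
    return float(numerator) / denominator if denominator else 0.0

same = get_cosine(text_to_vector("this is a test"), text_to_vector("this is a test"))
diff = get_cosine(text_to_vector("this is a test"), text_to_vector("totally different"))
print(same)  # 1.0
print(diff)  # 0.0
```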

Compute WER

Compute the word error rate: WER = (substitutions + deletions + insertions) / number of words in the reference. For the example below ("this is a test string" vs. "this is a test"), one word out of five is deleted, so the WER is 0.2.

import jiwer
from jiwer import wer

ground_truth = "this is a test string"
hypothesis = "this is a test"

#error = wer(ground_truth, hypothesis)

transformation = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveWhiteSpace(replace_by_space=True),
    jiwer.RemoveMultipleSpaces(),
    jiwer.SentencesToListOfWords(word_delimiter=" ")
])

def get_wer(target_text: str, system_text: str):
    wer_score = wer(target_text,
                    system_text,
                    truth_transform=transformation,
                    hypothesis_transform=transformation)
    return wer_score

Compute per-class precision, recall, f1 scores

The goal here is to compute per-class precision, recall and f1 scores and display the results using a data frame.

The first step is to collect your labels as two separate lists: (1) the predicted labels and (2) the corresponding true labels. For example:

predicted_labels=["positive","positive","negative"]
true_labels=["positive","other","negative"]

Once you have the true and predicted labels in lists, you can use sklearn’s `precision_recall_fscore_support` function to compute all the scores for you. Here’s how you do it:

from sklearn.metrics import precision_recall_fscore_support
import pandas as pd

# the possible labels
labels = ['positive', 'negative', 'other']

# setting average to None returns per-label precision, recall, and f1 scores

per_class_prf=precision_recall_fscore_support(true_labels,predicted_labels,average=None,labels=labels)

# Collect all the p/r/f labels for the 3 classes
precisions = per_class_prf[0]
recalls = per_class_prf[1]
fscores = per_class_prf[2]
supports = per_class_prf[3]

# Zip the values to make rows for each label
table_data = zip(labels, precisions, recalls, fscores, supports)

# Place the zipped values in a dataframe
df = pd.DataFrame(list(table_data),columns=['labels', 'precision', 'recall', 'f-score', 'num_of_examples'])
print(df.sort_values(by=['num_of_examples'], ascending=False))

Example output (from a run on a larger set of examples than the toy lists above):

     labels  precision    recall   f-score  num_of_examples
1  negative   0.875000  0.933333  0.903226               15
0  positive   0.636364  0.777778  0.700000                9
2     other   0.500000  0.200000  0.285714                5