These are some Python code snippets that I use very often.
- Pandas DataFrame Manipulation
- Ensure a data frame column is a float type column
- Group by a column and keep the column afterwards
- Convert a list of lists into a pandas data frame
- Convert a data frame to a list of dictionary values
- Convert a dictionary to a pandas data frame
- Select rows matching a specific column criteria
- Create a new data frame column with specific values
- Sort data frame by value
- Get unique values from a data frame column
- Create a new derived column with df.apply
- Create pandas data frame with column names with a list of lists data
- Select/display specific columns from a data frame
- Python List Manipulation
- JSON Manipulation
- System Commands
- File / Directory Operations
- Evaluation
Pandas DataFrame Manipulation
Ensure a data frame column is a float type column
df['column_name'] = df['column_name'].astype(float)
Group by a column and keep the column afterwards
df.groupby(['column_name']).aggregate_function().reset_index()
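As a concrete sketch of this pattern, the snippet below uses `sum` as the aggregate function on a small made-up data frame (the `team` and `points` column names are illustrative assumptions, not from any real data set). Passing `as_index=False` to `groupby` should give the same result as calling `reset_index()` afterwards.

```python
import pandas as pd

# Hypothetical data: points scored per team
df = pd.DataFrame({"team": ["a", "b", "a", "b"], "points": [1, 2, 3, 4]})

# Group, aggregate, then turn the group key back into a regular column
totals = df.groupby(["team"]).sum().reset_index()

# Alternative form: keep the group key as a column from the start
totals_alt = df.groupby(["team"], as_index=False).sum()

print(totals)
```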
Convert a list of lists into a pandas data frame
df=pd.DataFrame(list(your_list_of_lists),columns=['column 1','column 2','column3'])
Convert a data frame to a list of dictionary values
Let’s say you want a list of dictionaries from a pandas data frame as follows:
From this:
   name  age
0  rita   23
1  gita   45
To this:
[{"name":"rita","age":23}, {"name":"gita","age":45}]
This is the code you would use:
dict_vals=df.to_dict(orient='records')
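Here is a small end-to-end sketch of the conversion. The data frame is built by hand from the example values above; in practice `df` would be whatever data frame you already have.

```python
import pandas as pd

# Build the example data frame by hand
df = pd.DataFrame({"name": ["rita", "gita"], "age": [23, 45]})

# orient='records' produces one dictionary per row, keyed by column name
dict_vals = df.to_dict(orient="records")
print(dict_vals)
```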
Convert a dictionary to a pandas data frame
Let’s say you have a dict as follows:
my_dict={'mrr':0.4,'map':0.3,'precision':0.6}
To convert this to a pandas Data Frame, you can do the following:
import pandas as pd
my_dict={'mrr':0.4,'map':0.3,'precision':0.6}
pd.DataFrame(list(my_dict.items()),columns=['metric','value'])
You will see the following output:
      metric  value
0        mrr    0.4
1        map    0.3
2  precision    0.6
Select rows matching a specific column criteria
Let’s say you want to find rows where the column value matches a specific constraint. You could use the following:
import pandas as pd
df=df[(df['column1']<=2) & (df['column2']<=3)]
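A worked sketch of this kind of filter on made-up data may help. The column names and values are illustrative assumptions. Note that each condition needs its own parentheses, because `&` binds more tightly than comparison operators in Python.

```python
import pandas as pd

# Hypothetical data with two numeric columns
df = pd.DataFrame({"column1": [1, 2, 3], "column2": [3, 5, 1]})

# Keep rows where both conditions hold; each condition is parenthesized
matches = df[(df["column1"] <= 2) & (df["column2"] <= 3)]
print(matches)
```

Only the first row satisfies both conditions here, so `matches` has a single row.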
Create a new data frame column with specific values
Let’s say you want to add an additional column to a data frame with values generated via some external processing. You can transform the external values into a list and do the following:
vals=[1,2,3,4]
df['vals']=vals
Sort data frame by value
# Sort in descending order
df.sort_values(by=["column_name1","column_name2"],ascending=False)
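Two details are easy to miss here: `sort_values` returns a new data frame rather than sorting in place, and `ascending` can also be a list with one entry per sort column. A minimal sketch on made-up data (the `group` and `score` columns are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"group": ["a", "b", "a", "b"], "score": [1, 4, 3, 2]})

# Sort by group ascending, then by score descending within each group.
# The result is a new data frame, so assign it to keep it.
sorted_df = df.sort_values(by=["group", "score"], ascending=[True, False])
print(sorted_df)
```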
Get unique values from a data frame column
# Get unique list of values in the df['column_name'] column
uniq_vals=list(df.column_name.unique())
Create a new derived column with df.apply
The goal here is to create a new column with values populated based on the values of an old column. Let’s say you want a new column that adds 1 to a value from an old column.
# Generate a new value from `an_existing_column`
# generate_a_value(x) is a python function that generates a value
# based on the column value from `an_existing_column`
df['my_new_column'] = df['an_existing_column'].apply(lambda x: generate_a_value(x))
If you want to send two or more columns for processing:
# Generate a new value from two or more existing columns
df['my_new_column'] = df.apply(lambda x: generate_a_value(x.column1,x.column2),axis=1)
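To make both forms concrete, here is a sketch with two stand-in functions. `add_one` and `add_columns` are hypothetical helpers playing the role of `generate_a_value`, and the data is made up.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"column1": [1, 2, 3], "column2": [10, 20, 30]})

def add_one(x):
    # Stand-in for generate_a_value with a single input
    return x + 1

def add_columns(a, b):
    # Stand-in for generate_a_value with two inputs
    return a + b

# One source column: apply the function to each value of the Series
df["plus_one"] = df["column1"].apply(add_one)

# Several source columns: apply row by row (axis=1) and pick the fields needed
df["total"] = df.apply(lambda row: add_columns(row.column1, row.column2), axis=1)
print(df)
```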
Create pandas data frame with column names with a list of lists data
# Set column names
column_names=["col a","col b",'col c']
# Create a DataFrame and assign column names
df=pd.DataFrame(list_of_lists, columns=column_names)
Select/display specific columns from a data frame
# select specific columns in a data frame
df= df[['my_col1','my_col2']]
Python List Manipulation
Concatenate two python lists
listone = [1,2,3]
listtwo = [4,5,6]
joinedlist = listone + listtwo
Convert a python string to a list of characters
word = 'abc'
the_list = list(word)
Randomize contents of python list
import random
the_list = ["item1", "item2", "item3"]
random.shuffle(the_list)
print(the_list)
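Keep in mind that `random.shuffle` reorders the list in place and returns `None`. If you would rather keep the original order and get a shuffled copy, one option is `sample` with `k` set to the list length. The sketch below also uses a seeded `random.Random` instance for repeatable results; the seed value 42 is an arbitrary choice.

```python
import random

the_list = ["item1", "item2", "item3"]

# A seeded generator gives the same order on every run (seed is arbitrary)
rng = random.Random(42)

# sample() returns a new shuffled list and leaves the_list untouched
shuffled_copy = rng.sample(the_list, k=len(the_list))
print(shuffled_copy)
```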
JSON Manipulation
Convert a dictionary to a json string
import json
r = {'expert': 'True', 'rating': 1.5}
r = json.dumps(r)
Convert a json string back to a python dictionary
import json
my_dict = json.loads(json_string)
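Putting the two directions together, a round trip looks roughly like this. The sketch uses a real boolean for `expert` (rather than the string `'True'`) to show that JSON writes it as lowercase `true` and Python reads it back as `True`.

```python
import json

r = {"expert": True, "rating": 1.5}

# Dictionary -> JSON string
json_string = json.dumps(r)
print(json_string)

# JSON string -> dictionary
my_dict = json.loads(json_string)
print(my_dict == r)
```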
Load a json file into a pandas data frame
import pandas as pd
#this assumes one json item per line in json file
df=pd.read_json("path_to_json_file", lines=True)
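If you want to try this without a file on disk, `read_json` also accepts a file-like object. The sketch below feeds it two made-up JSON lines through `io.StringIO`; the field names are assumptions for illustration.

```python
import io
import pandas as pd

# Two JSON items, one per line (made-up data)
json_lines = '{"name": "rita", "age": 23}\n{"name": "gita", "age": 45}\n'

# lines=True tells pandas to parse each line as a separate record
df = pd.read_json(io.StringIO(json_lines), lines=True)
print(df)
```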
System Commands
Run a system command from within Python code
import os
os.system("your_system_command")
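`os.system` only gives you back a status code, not the command's output. If you need the output, the standard library's `subprocess.run` is one alternative. This is a minimal sketch: the command here just runs the current Python interpreter with a one-line program so the example works on any platform; substitute your own argument list.

```python
import subprocess
import sys

# Run a command given as a list of arguments (no shell involved)
# and capture what it prints
result = subprocess.run(
    [sys.executable, "-c", "print('hello')"],
    capture_output=True,  # collect stdout and stderr
    text=True,            # decode bytes to str
    check=True,           # raise CalledProcessError on a non-zero exit code
)
print(result.stdout.strip())
```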
File / Directory Operations
Safely create nested directories in Python
import os
#check if path exists, if not create the directory
if not os.path.exists(directory):
    os.makedirs(directory)
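On Python 3.2 and later, `os.makedirs` also takes an `exist_ok` argument, which folds the existence check into the call itself. A small sketch, using a temporary directory as the base so it is safe to run anywhere (the `a/b/c` path is made up):

```python
import os
import tempfile

# Use a throwaway base directory for the demonstration
base = tempfile.mkdtemp()
directory = os.path.join(base, "a", "b", "c")

# Creates all missing parent directories; no error if the path already exists
os.makedirs(directory, exist_ok=True)

# Calling it a second time is harmless
os.makedirs(directory, exist_ok=True)
print(os.path.isdir(directory))
```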
Evaluation
Compute Cosine Similarity
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

def get_cosine(vec1, vec2):
    # get intersecting keys
    intersection = set(vec1.keys()) & set(vec2.keys())
    # multiply and sum weights
    numerator = sum([vec1[x] * vec2[x] for x in intersection])
    # compute denominator
    sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
    sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    # return cosine score
    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)

def get_cosine_val(text1,text2):
    # turn text into vector counts
    vector1 = text_to_vector(text1)
    vector2 = text_to_vector(text2)
    # compute similarity
    cosine = get_cosine(vector1, vector2)
    return cosine
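For a quick sanity check of the idea, here is a compact, self-contained restatement with a worked example. The `cosine` helper below is a condensed sketch of the same word-count computation, not a replacement for the functions above. For "this is a test" against "this is a test string", four words are shared, so the score is 4 / (sqrt(4) * sqrt(5)).

```python
import math
from collections import Counter

def cosine(c1, c2):
    # Dot product over the words the two count vectors share
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    # Product of the two vector lengths
    norm = math.sqrt(sum(v * v for v in c1.values())) * \
           math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

vec_a = Counter("this is a test".split())
vec_b = Counter("this is a test string".split())

score = cosine(vec_a, vec_b)
print(round(score, 4))
```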
Compute WER
Compute the word error rate.
from jiwer import wer,mer
import jiwer
ground_truth = "this is a test string"
hypothesis = "this is a test"
#error = wer(ground_truth, hypothesis)
transformation = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveWhiteSpace(replace_by_space=True),
    jiwer.RemoveMultipleSpaces(),
    jiwer.SentencesToListOfWords(word_delimiter=" ")
])

def get_wer(target_text:str,system_text:str):
    wer_score=wer(target_text,
                  system_text,
                  truth_transform=transformation,
                  hypothesis_transform=transformation)
    return wer_score
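If you want to see what the score means, or cannot install `jiwer`, word error rate can be sketched in plain Python as the word-level edit distance divided by the number of reference words. This is not how `jiwer` is implemented; it is a simplified stand-in that only lowercases and splits on whitespace, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr

    # Total edits divided by the number of reference words
    return prev[-1] / len(ref)

error = word_error_rate("this is a test string", "this is a test")
print(error)
```

With the example strings from above, one of the five reference words is missing from the hypothesis, so the score is 1/5.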
Compute per-class precision, recall, f1 scores
The goal here is to compute per-class precision, recall and f1 scores and display the results using a data frame.
The first step is to collect your labels as two separate lists: (1) the predicted labels and (2) the corresponding true labels. For example:
predicted_labels=["positive","positive","negative"]
true_labels=["positive","other","negative"]
Once you have the true and predicted labels in lists, you can use sklearn’s `precision_recall_fscore_support` function to compute all the scores for you. Here’s how you do it:
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support
# the possible labels
labels = ['positive', 'negative', 'other']
# setting average to None, returns precision, recall and f1 scores for individual labels
per_class_prf=precision_recall_fscore_support(true_labels,predicted_labels,average=None,labels=labels)
# Collect the precision, recall, f-score and support values for the 3 classes
precisions = per_class_prf[0]
recalls = per_class_prf[1]
fscores = per_class_prf[2]
supports = per_class_prf[3]
# Zip the values to make rows for each label
table_data = zip(labels, precisions, recalls, fscores, supports)
# Place the zipped values in a dataframe
df = pd.DataFrame(list(table_data),columns=['labels', 'precision', 'recall', 'f-score', 'num_of_examples'])
print(df.sort_values(by=['num_of_examples'], ascending=False))
Example output (from a larger labeled set than the three-item lists above):
     labels  precision    recall   f-score  num_of_examples
1  negative   0.875000  0.933333  0.903226               15
0  positive   0.636364  0.777778  0.700000                9
2     other   0.500000  0.200000  0.285714                5
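To see where these numbers come from, here is a plain-Python sketch that computes the same three scores by hand for the small three-item example lists. It is not sklearn's implementation; in particular, treating an undefined ratio (a label that was never predicted) as 0.0 is an assumption of this sketch, and `prf` is a hypothetical helper name.

```python
from collections import Counter

predicted_labels = ["positive", "positive", "negative"]
true_labels = ["positive", "other", "negative"]

# How often each label was predicted, and how often it is the true label
pred_count = Counter(predicted_labels)
true_count = Counter(true_labels)

# True positives: positions where prediction and truth agree
tp = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)

def prf(label):
    # precision = correct predictions of label / all predictions of label
    precision = tp[label] / pred_count[label] if pred_count[label] else 0.0
    # recall = correct predictions of label / all true occurrences of label
    recall = tp[label] / true_count[label] if true_count[label] else 0.0
    # f1 = harmonic mean of precision and recall
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

scores = {label: prf(label) for label in ["positive", "negative", "other"]}
for label, (p, r, f) in scores.items():
    print(label, p, r, round(f, 4))
```

Here "positive" was predicted twice but was correct once, giving a precision of 0.5 and a recall of 1.0, while "other" was never predicted and scores 0.0 throughout.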