This notebook contains both parts of the mandatory portfolio assignment. Part 1 was previously released as Guidance 5. Part 2 is listed below Part 1. You are encouraged to copy your previous work for Part 1 into this notebook to submit both assignments together.
The tasks in this notebook should be submitted as the first part of the mandatory assignment, Portfolio of Practical work, that is due by 17 November 2022. (Note: This portfolio will have additional components. The link to hand in will only be published once all requirements have been posted.)
This notebook requires you to complete the missing parts of the Jupyter notebook. This will include writing comments in markdown and completing the code for the outline classes that are provided.
Most data science projects start by pre-processing a dataset to ensure the data is ready to use for its intended purpose. One of the tasks that a data scientist would typically complete during such a pre-processing phase is to replace missing data values in the dataset using a process known as imputation. A popular toolkit that assists with this task in Python is the class sklearn.preprocessing.Imputer. A discussion of this class and its properties and methods can be found at http://lijiancheng0614.github.io/scikit-learn/modules/generated/sklearn.preprocessing.Imputer.html .
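To make the idea of imputation concrete, here is a minimal sketch in plain Python that fills the missing entries of one column with that column's mean. The helper name `impute_column_mean` and the three-row sample are assumptions made for illustration; they are not part of the assignment or of scikit-learn.

```python
def impute_column_mean(rows, col, missing="nan"):
    """Return a copy of rows with missing entries in column col replaced by the column mean."""
    # Collect the observed (non-missing) values of the chosen column.
    observed = [float(row[col]) for row in rows if row[col] != missing]
    mean = round(sum(observed) / len(observed), 1)
    # Build new rows so the input data is left untouched.
    return [
        [str(mean) if (i == col and value == missing) else value
         for i, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical sample: the age of the first row is missing.
sample = [["France", "nan"], ["Spain", "27.0"], ["Germany", "30.0"]]
filled = impute_column_mean(sample, col=1)
print(filled)
```

The function returns a new list rather than editing `sample`, which mirrors the requirement that transform returns a copy of the input data.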
You are required to design and implement your own version of class Imputer. Your version should make use of the strategy pattern to ensure it is extensible and easy to maintain.
Your class Imputer should (initially) accept two parameters, namely strategy and axis, with the following options:
strategy : string, optional (default=”mean”). The imputation strategy.
axis : integer, optional (default=0). The axis along which to impute.
Your class should support two methods, namely fit and transform with the following behaviours:
Parameters:
Returns:
In other words: Fit receives as input the "matrix" of incomplete data, together with the "boundaries" of the area for which we want to impute (calculate) missing values (i.e. a single column, or the entire matrix), and returns an object containing only the part we want to do the imputation on.
Parameters:
Firstly, the user should create an instance of the imputer (see example command below). In this case the parameters indicate that the imputation strategy should calculate the mean of the values in each column and replace missing values with the calculated mean. You may assume missing values will always be indicated by the word ‘nan’. (Note: The axis = 0 parameter shown in the example here is not a current requirement. It indicates that the imputation should be by columns. Your version should not have this parameter at all.)
Secondly, the user needs to “fit” the imputation to the dataset. This means the user needs to tell the class which rows and columns must be included in the imputation. The statement below specifies that all rows and columns 1 to 3 should be included. You may simplify the syntax your class expects, but it must be documented clearly.
Lastly, the user will invoke the transform method. Transform returns a copy of the input data that has now been imputed.
Example of data before imputation
Example of data after imputation
this = ([['France', 'nan', '72000.0', '560'],
['Spain', '27.0', '48000.0', 'nan'],
['Germany', '30.0', 'nan', '172'],
['Spain', '38.0', '61000.0', 'nan'],
['Germany','40.0', 'nan', '340'],
['France', '35.0', '58000.0', 'nan'],
['Spain', 'nan', '52000.0', '660'],
['France', '48.0', 'nan', '1560'],
['Germany', 'nan', '83000.0', '950'],
['France', '37.0', '67000', 'nan'],
['Sweden', '48' , 'nan', '265'],
['Norway','48' , '67000', 'nan'],
['USA', 'nan', '58000.0', '790'],
['USA', '35.0', '58000.0', 'nan']])
from abc import ABCMeta, abstractmethod


# Abstract strategy interface: each concrete strategy supplies its own
# calculator, chosen according to the requested method of imputation
class CalculateStrategy(metaclass=ABCMeta):
    @abstractmethod
    def calculator(self, fit):
        pass
class MeanCalc(CalculateStrategy):
    def __init__(self, list):
        self.list = list

    # sum the float values in column `fit` and divide by the number of values summed
    def calculator(self, fit):
        sum = 0
        count = 0
        for x in self.list:
            if isinstance(x[fit], float):
                sum += x[fit]
                count += 1
        return round(sum / count, 1)
class MedianCalc(CalculateStrategy):
    def __init__(self, list):
        self.__list = list

    # collect the float values in column `fit` into a temporary list
    def calculator(self, fit):
        count = 0
        median = []
        for x in self.__list:
            if isinstance(x[fit], float):
                median.append(x[fit])
                count += 1
        # sort the values in ascending order
        median = sorted(median)
        # find the centre of the sorted list; with an even number of values
        # there is no single centre, so take the mean of the two centre values
        if count % 2 == 0:
            temp_median = int(count * 0.5)
            calculated_median = (median[temp_median - 1] + median[temp_median]) * 0.5
        else:
            temp_median = int((count - 1) * 0.5)
            calculated_median = median[temp_median]
        return calculated_median
class ModCalc(CalculateStrategy):
    def __init__(self, list):
        self.__list = list

    # collect the float values in column `fit` into a temporary list
    def calculator(self, fit):
        temp_list = []
        for x in self.__list:
            if isinstance(x[fit], float):
                temp_list.append(x[fit])
        # find the most frequently occurring value; if several values tie, the first one is used
        temp = max(temp_list, key=temp_list.count)
        return temp
# Imputation settings; passes the data along to the different functions
class Imputer:
    def __init__(self, replaced, strategy=None, axis=None):
        self.__replaced = replaced
        self.__strategy = strategy
        self.__processing = None
        self.__axis = axis
        self.__dataset = None  # set by fit
        self.__cal_pointer = None

    # run the methods in the correct sequence
    def transform(self):
        list1 = self.transpose(self.__dataset, self.__axis)
        list1 = self.pre_processing(list1)
        self.__dataset = list1
        self.strategy_calculator()
        list2 = self.replace_imputer()
        list3 = self.post_processing(list2, self.__processing)
        list3 = self.transpose(list3, self.__axis)
        self.__dataset = list3

    # declares the fit properties and operational parameters
    # the bounds are given 1-based and inclusive, and stored 0-based
    def fit(self, dataset, from_x=1, to_x=9999, from_y=1, to_y=9999):
        self.__dataset = dataset
        self.__from_x = from_x - 1
        self.__to_x = to_x - 1
        self.__from_y = from_y - 1
        self.__to_y = to_y - 1
    # reads the strategy chosen by the user (default is mean) and instantiates the matching strategy class,
    # which later supplies the replacement values for the dataset
    def strategy_calculator(self):
        # test for None before calling lower(), which would fail on None
        strategy = "mean" if self.__strategy is None else self.__strategy.lower()
        if strategy == "mean":
            self.__cal_pointer = MeanCalc(self.__dataset)
        elif strategy == "median":
            self.__cal_pointer = MedianCalc(self.__dataset)
        elif strategy == "mod":
            self.__cal_pointer = ModCalc(self.__dataset)
        else:
            print("The Imputer only takes the following strategies (mean,median,mod)")
    # transposes the dataset by appending each value to a new list
    def transpose(self, list, axis):
        transposed_list = []
        count = 0
        if axis == 1:
            while count < len(list[0]):
                temp_list = []
                for x in list:
                    temp_list.append(x[count])
                count += 1
                transposed_list.append(temp_list)
            return transposed_list
        elif axis == 0 or axis is None:
            return list
        else:
            print("0=columns, 1=rows")
    # custom float check that keeps "NaN" and "Inf" from being converted into numeric values
    # checks whether the string holds a numeric or an alphabetical value
    # float() would accept "nan" as a float, but for this program it needs to remain a string
    def isfloat(self, num):
        if num.lower() == "nan" or num.lower() == "inf":
            return False
        try:
            float(num)
            return True
        except ValueError:
            return False
    # preprocesses the list, converting strings with numeric values to floats
    def pre_processing(self, list):
        preprocessed_list = []
        for x in list:
            temp_list = []
            for y in x:
                if y.isnumeric():
                    y = float(y)
                    temp_list.append(y)
                elif self.isfloat(y):
                    y = float(y)
                    temp_list.append(y)
                else:
                    temp_list.append(y)
            preprocessed_list.append(temp_list)
        return preprocessed_list
    # takes the dataset and converts each item to a string or leaves it as a float, depending on user requirements
    def post_processing(self, list, processing):
        postprocessed_list = []
        if processing == "string" or processing is None:
            for x in list:
                temp_list = []
                for y in x:
                    temp_list.append(str(y))
                postprocessed_list.append(temp_list)
            return postprocessed_list
        elif processing == "float":
            return list
        else:
            print("has to be string or float")
    # finds the designated value passed by the user and replaces it with the value calculated by the chosen strategy
    def replace_imputer(self):
        new_list = []
        count_y = 0
        for y in self.__dataset:
            temp_list = []
            count_x = 0
            # check the x and y bounds so that only the relevant values in the dataset are replaced
            for x in y:
                if (x == self.__replaced
                        and self.__from_x <= count_x <= self.__to_x
                        and self.__from_y <= count_y <= self.__to_y):
                    temp_list.append(self.__cal_pointer.calculator(count_x))
                else:
                    temp_list.append(x)
                count_x += 1
            new_list.append(temp_list)
            count_y += 1
        return new_list
    # multiple options when printing the finished dataset
    def print_this(self, matrix=False, transposed=False, how=None):
        # flip y and x axis
        if transposed:
            self.__dataset = self.transpose(self.__dataset, 1)
        # print in matrix format
        if matrix:
            self.__dataset = self.post_processing(self.__dataset, "string")
            print('\n'.join([''.join(['{:12}'.format(item) for item in row]) for row in self.__dataset]))
        # print the final result with numeric values as floats
        elif how == "float":
            self.__dataset = self.pre_processing(self.__dataset)
            for x in self.__dataset:
                print(x)
        # simple print
        else:
            for x in self.__dataset:
                print(x)
# Imputer takes the word to be replaced in the dataset and the strategy to use
# axis defines whether the values passed in the fit method apply to columns or rows
# transform simply runs the sequence of operations
# print_this controls how the results are displayed; note that it does change how the dataset is saved in the object
Imput1 = Imputer("nan","mean",axis= 0)
Imput1.fit(this,from_x=1,to_x=2,from_y= 3, to_y= 8)
Imput1.transform()
Imput1.print_this(matrix = True, transposed= False, how ="string")
France      nan         72000.0     560.0
Spain       27.0        48000.0     nan
Germany     30.0        nan         172.0
Spain       38.0        61000.0     nan
Germany     40.0        nan         340.0
France      35.0        58000.0     nan
Spain       38.6        52000.0     660.0
France      48.0        nan         1560.0
Germany     nan         83000.0     950.0
France      37.0        67000.0     nan
Sweden      48.0        nan         265.0
Norway      48.0        67000.0     nan
USA         nan         58000.0     790.0
USA         35.0        58000.0     nan
# Imputer takes the word to be replaced in the dataset and the strategy to use
# axis defines whether the values passed in the fit method apply to columns or rows
# transform simply runs the sequence of operations
# print_this controls how the results are displayed; note that it does change how the dataset is saved in the object
Imput2 = Imputer(replaced="nan", axis=1, strategy= "Mod")
Imput2.fit(this,from_x=1,to_x=5)
Imput2.transform()
Imput2.print_this(how="float")
['France', 72000.0, 72000.0, 560.0]
['Spain', 27.0, 48000.0, 27.0]
['Germany', 30.0, 30.0, 172.0]
['Spain', 38.0, 61000.0, 38.0]
['Germany', 40.0, 40.0, 340.0]
['France', 35.0, 58000.0, 'nan']
['Spain', 'nan', 52000.0, 660.0]
['France', 48.0, 'nan', 1560.0]
['Germany', 'nan', 83000.0, 950.0]
['France', 37.0, 67000.0, 'nan']
['Sweden', 48.0, 'nan', 265.0]
['Norway', 48.0, 67000.0, 'nan']
['USA', 'nan', 58000.0, 790.0]
['USA', 35.0, 58000.0, 'nan']
# Imputer takes the word to be replaced in the dataset and the strategy to use
# axis defines whether the values passed in the fit method apply to columns or rows
# transform simply runs the sequence of operations
# print_this controls how the results are displayed; note that it does change how the dataset is saved in the object
Imput3 = Imputer("nan","median",axis= 0)
Imput3.fit(this,from_x=1,to_x=3)
Imput3.transform()
Imput3.print_this(matrix = False, transposed= True)
['France', 'Spain', 'Germany', 'Spain', 'Germany', 'France', 'Spain', 'France', 'Germany', 'France', 'Sweden', 'Norway', 'USA', 'USA']
['57.5', '27.0', '30.0', '38.0', '40.0', '35.0', '57.5', '48.0', '57.5', '37.0', '48.0', '48.0', '57.5', '35.0']
['72000.0', '48000.0', '91500.5', '61000.0', '91500.5', '58000.0', '52000.0', '91500.5', '83000.0', '67000.0', '91500.5', '67000.0', '58000.0', '58000.0']
['560.0', 'nan', '172.0', 'nan', '340.0', 'nan', '660.0', '1560.0', '950.0', 'nan', '265.0', 'nan', '790.0', 'nan']
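As a cross-check on the even-count case of the median, the sketch below recomputes the median of the ten non-missing values in the second column of `this` (copied here by hand, which is an assumption of this example) both manually and with the standard library function `statistics.median`. Under the usual definition, the mean of the two centre values, the result is 37.5, which differs from the 57.5 in the recorded output above.

```python
from statistics import median

# The ten non-missing values from the second column of `this`, copied by hand.
ages = [27.0, 30.0, 38.0, 40.0, 35.0, 48.0, 37.0, 48.0, 48.0, 35.0]

ordered = sorted(ages)
mid = len(ordered) // 2
# Even number of values: average the two values on either side of the centre.
manual = (ordered[mid - 1] + ordered[mid]) / 2

print(manual, median(ages))
```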
Consider the possibility of having to add a new strategy and/or having to change your implementation to also support imputing along axis 1. Explain how the strategy pattern makes your design resistant to the impact of such changes.
Write a brief reflection on the strategy pattern and how its utility is demonstrated in the above code here. 150 - 300 words.
Incorporating the strategy pattern makes it easy to extend the program with other measures of central tendency, such as the “midrange”, so that replacement values can be calculated by means other than the median, the mode, or the default mean.
However, one weakness of the strategy pattern in this case shows when trying to add a way to impute along another axis. As I see it, there are two ways to switch the axis. One is to hardcode the functionality into the main code base, thus losing the value of a strategy pattern: you need to change the replacement functionality itself and not only pass other imputation values to the main program. The other way would be to transpose (flip the y-axis and x-axis of) the dataset before running the core functionality; this cannot be done by adding another strategy to the existing abstract class, because a strategy essentially only passes a value back to the main process and not an altered dataset. The order of processes in the program would remain the same: the transposition alters the axis on which the program imputes the values, and it is reversed before the result is returned to the end user. One solution would be a new class that wraps the data, transposes it, and passes the transposed dataset on before the remaining functions of the imputer run, thus decorating the strategy pattern.
I have added the ability to switch the axis in my imputer, though it is hardcoded in the core program via the transpose function. However, the problem remains that if this functionality were not the intention of the original design, it would be difficult to incorporate the change while keeping a pure strategy pattern.
My Imputer uses an abstract base class, CalculateStrategy, as the strategy interface, allowing the main program to pass variables to a strategy and receive a value back; the different implemented strategies inherit from this class, which gives flexibility when changing which value you wish to use as the replacement in the dataset.
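The extension argument above can be sketched as follows. This is a simplified illustration under stated assumptions, not the notebook's implementation: `ColumnStrategy`, `MeanOf`, `MidrangeOf` and `fill` are hypothetical names, and the strategies receive a plain list of observed values rather than the whole dataset. The point is that adding the midrange means writing one new class, while `fill` stays untouched.

```python
from abc import ABC, abstractmethod


class ColumnStrategy(ABC):
    """Hypothetical stand-in for the notebook's CalculateStrategy."""

    @abstractmethod
    def calculator(self, values):
        """Return the replacement value for a list of observed floats."""


class MeanOf(ColumnStrategy):
    def calculator(self, values):
        return round(sum(values) / len(values), 1)


class MidrangeOf(ColumnStrategy):
    # New strategy: halfway between the smallest and largest observed value.
    def calculator(self, values):
        return (min(values) + max(values)) / 2


def fill(values, strategy, missing="nan"):
    """Replace missing entries using whichever strategy object is supplied."""
    observed = [v for v in values if v != missing]
    replacement = strategy.calculator(observed)
    return [replacement if v == missing else v for v in values]


ages = [27.0, "nan", 30.0, 48.0]
print(fill(ages, MeanOf()))
print(fill(ages, MidrangeOf()))
```

Supporting a second axis does not fit this shape as neatly, because a strategy here only returns one value and never reshapes the data, which is the limitation the reflection describes.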
One possible problem with the use of the strategy pattern is the reliance on the client to compose the used object with the correct "strategy class" to ensure the required behaviour. The previous specification for Part 1 of this task already required the use of a parameter (for example "strategy = 'mean'") to determine which strategy the class should be composed with. However, this leaves the code to instantiate with a specific strategy inside your imputer class. Ideally we want to keep the imputer open for extension but closed for modification. From a software usage point of view, it would be more convenient, and less error prone, to simply specify the behaviour that would be desirable as a parameter and have an external "factory" take care of the instantiation.
Your task consists of three parts; the first two are short written discussions (you may use diagrams as part of the discussions).
Explain how you can use a factory to take care of the strategy instantiation. This explanation should take the form of a discussion of the suggested design for the overall collection of classes for the imputer and any clients that will use it.
Discuss the benefits and/or negatives of the above design.
Provide all the code for the suggested design, including new versions of any classes you already wrote in Part 1 of the portfolio assignment. Also add code to showcase how the classes work.
The factory instantiates whichever method of calculation the user desires, instead of having this choice embedded in the core program; this allows for much more flexibility if one wants to add further functionality. The core program now only holds a variable through which the strategy object is passed, and that object can be changed depending on the function needed, essentially creating a simple factory pattern.
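One way such a factory could stay open for extension is to keep a registry of strategies instead of an if/elif chain. The sketch below is an assumption-laden illustration, not the notebook's `Strategy_Factory`: `StrategyRegistry`, `register` and `create` are hypothetical names, and the strategies are plain functions over a list of values.

```python
class StrategyRegistry:
    """Hypothetical factory that maps a strategy name to a callable."""

    def __init__(self):
        self._makers = {}

    def register(self, name, maker):
        # Names are stored lower-case so that lookups are case-insensitive.
        self._makers[name.lower()] = maker

    def create(self, name="mean"):
        try:
            return self._makers[name.lower()]
        except KeyError:
            known = sorted(self._makers)
            raise ValueError(f"unknown strategy {name!r}; choose from {known}") from None


def mean(values):
    return round(sum(values) / len(values), 1)


def mode(values):
    # With ties, max() returns the first of the most frequent values.
    return max(values, key=values.count)


registry = StrategyRegistry()
registry.register("mean", mean)
registry.register("mod", mode)

calc = registry.create("Mod")
print(calc([48.0, 35.0, 48.0]))
```

A new strategy is then added with one `register` call, and an unknown name raises an error immediately rather than printing a message and failing later.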
Reflecting on what increases the utility of the design, one could further incorporate the design principles such as: “Strive for loosely coupled design”.
To adhere to the above design principle, one could encapsulate the functionality of the core program, where most of the functions receive a list, apply their modifications, and eventually return it to the core program, thus resulting in an abstract factory. Alongside the aforementioned “strategy factory”, we could then also have an imputation factory, allowing us to switch out core functionality on demand. An example of this could be that the existing print function becomes part of its own factory called “final result”, which depends on the format the user needs to export the dataset in: CSV, JSON, or appending the data directly to a database. Of course, the same functionality could also be decorated. However, the main point is that further expansion is not limited by the design.
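The “final result” idea could look roughly like the sketch below, which selects an exporter by format name. It is a minimal illustration using only the standard library `csv`, `io` and `json` modules; the names `to_csv`, `to_json`, `EXPORTERS` and `export` are hypothetical, and a database exporter is left out.

```python
import csv
import io
import json


def to_csv(rows):
    """Serialise rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialise rows as a JSON array of arrays."""
    return json.dumps(rows)


# Hypothetical lookup table: adding a format means adding one entry here.
EXPORTERS = {"csv": to_csv, "json": to_json}


def export(rows, fmt):
    return EXPORTERS[fmt.lower()](rows)


rows = [["France", "38.6"], ["Spain", "27.0"]]
print(export(rows, "csv"))
print(export(rows, "json"))
```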
Besides the difficulties of adding axis functionality described above, previous working experience of receiving datasets suggests how valuable a custom imputer could be; incorporating the flexibility to add similar functionality later, such as replacing periods with commas or vice versa, could be highly efficient when processing data.
The current state of my program with the strategy pattern would not easily accommodate such changes and can only process datasets given as a comma-separated list; however, expanding on the strategy pattern and establishing a hierarchy that moves functionality out of the main program could allow for functionality such as replacing, for example, periods with commas and appending the result immediately as a string, thus ensuring further capabilities with comma-separated lists. A factory pattern would allow for further expansion when the necessity arises, also allowing the replacement of core functionality if one finds a more efficient solution.
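As one possible reading of the separator example, the sketch below converts strings that use a decimal comma into floats before imputation, while leaving markers such as 'nan' and text values untouched. The function name `normalise_decimal` and its `decimal_sep` parameter are assumptions made for this illustration.

```python
import math


def normalise_decimal(value, decimal_sep=","):
    """Convert a string such as '27,5' to a float; return other strings unchanged."""
    candidate = value.replace(decimal_sep, ".")
    try:
        number = float(candidate)
    except ValueError:
        # Not a number at all, for example a country name.
        return value
    # float() accepts 'nan' and 'inf'; keep those as strings so that
    # the missing-value marker survives this step.
    return number if math.isfinite(number) else value


row = ["France", "27,5", "nan", "72000.0"]
print([normalise_decimal(item) for item in row])
```

Written as a separate step, this kind of conversion could be swapped in or out without touching the replacement logic.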
One could argue that we are over-engineering this Imputer and adding a factory pattern will initially make creating the program more challenging than a strategy pattern; however, accommodating this flexibility further incorporates the design principles.
The “modular” approach of the factory pattern allows for much-needed flexibility and helps ensure the program’s utility and life span. As Johan mentioned in the lecture, quoting the Greek philosopher Heraclitus, “Change is the only constant in life.”
# lets the user name the strategy as before and selects the relevant strategy class,
# returning it for use by another object
class Strategy_Factory(metaclass=ABCMeta):
    def means_of_imputation(self, strategy):
        # test for None before calling lower(), which would fail on None
        strategy = "mean" if strategy is None else strategy.lower()
        if strategy == "mean":
            cal_pointer = MeanCalc
            return cal_pointer
        elif strategy == "median":
            cal_pointer = MedianCalc
            return cal_pointer
        elif strategy == "mod":
            cal_pointer = ModCalc
            return cal_pointer
        else:
            print("The Imputer only takes the following strategies (mean,median,mod)")
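from abc import ABCMeta, abstractmethod


# Abstract strategy interface: each concrete strategy supplies its own
# calculator, chosen according to the requested method of imputation.
# The calculators are static because the factory hands out the class itself.
class CalculateStrategy(metaclass=ABCMeta):
    @staticmethod
    @abstractmethod
    def calculator(fit, dataset):
        pass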
class MeanCalc(CalculateStrategy):
    # sum the float values in column `fit` and divide by the number of values summed
    @staticmethod
    def calculator(fit, dataset):
        sum = 0
        count = 0
        for x in dataset:
            if isinstance(x[fit], float):
                sum += x[fit]
                count += 1
        return round(sum / count, 1)
class ModCalc(CalculateStrategy):
    # collect the float values in column `fit` into a temporary list
    @staticmethod
    def calculator(fit, dataset):
        temp_list = []
        for x in dataset:
            if isinstance(x[fit], float):
                temp_list.append(x[fit])
        # find the most frequently occurring value; if several values tie, the first one is used
        temp = max(temp_list, key=temp_list.count)
        return temp
class MedianCalc(CalculateStrategy):
    # collect the float values in column `fit` into a temporary list
    @staticmethod
    def calculator(fit, dataset):
        count = 0
        median = []
        for x in dataset:
            if isinstance(x[fit], float):
                median.append(x[fit])
                count += 1
        # sort the values in ascending order
        median = sorted(median)
        # find the centre of the sorted list; with an even number of values
        # there is no single centre, so take the mean of the two centre values
        if count % 2 == 0:
            temp_median = int(count * 0.5)
            calculated_median = (median[temp_median - 1] + median[temp_median]) * 0.5
        else:
            temp_median = int((count - 1) * 0.5)
            calculated_median = median[temp_median]
        return calculated_median
# Imputation settings; passes the data along to the different functions
class Imputer():
    def __init__(self, factory, replaced, axis=None):
        self.__cal_pointer = factory
        self.__replaced = replaced
        self.__processing = None
        self.__axis = axis
        self.__dataset = None  # set by fit

    # run the methods in the correct sequence
    def transform(self):
        list1 = self.transpose(self.__dataset, self.__axis)
        list1 = self.pre_processing(list1)
        list2 = self.replace_imputer(list1)
        list3 = self.post_processing(list2, self.__processing)
        list3 = self.transpose(list3, self.__axis)
        self.__dataset = list3

    # declares the fit properties and operational parameters
    # the bounds are given 1-based and inclusive, and stored 0-based
    def fit(self, dataset, from_x=1, to_x=9999, from_y=1, to_y=9999):
        self.__dataset = dataset
        self.__from_x = from_x - 1
        self.__to_x = to_x - 1
        self.__from_y = from_y - 1
        self.__to_y = to_y - 1
    # transposes the dataset
    # by iterating over each item in the dataset and appending each value to a new list
    def transpose(self, list, axis):
        transposed_list = []
        count = 0
        if axis == 1:
            while count < len(list[0]):
                temp_list = []
                for x in list:
                    temp_list.append(x[count])
                count += 1
                transposed_list.append(temp_list)
            return transposed_list
        elif axis == 0 or axis is None:
            return list
        else:
            print("0=columns, 1=rows")
    # float check that keeps "NaN" and "Inf" from being converted into numeric values
    # checks whether the string holds a numeric or an alphabetical value
    # float() would accept "nan" as a float, but for this program it needs to remain a string
    def isfloat(self, num):
        if num.lower() == "nan" or num.lower() == "inf":
            return False
        try:
            float(num)
            return True
        except ValueError:
            return False
    # preprocesses the list, converting strings with numeric values to floats
    def pre_processing(self, list):
        preprocessed_list = []
        for x in list:
            temp_list = []
            for y in x:
                if y.isnumeric():
                    y = float(y)
                    temp_list.append(y)
                elif self.isfloat(y):
                    y = float(y)
                    temp_list.append(y)
                else:
                    temp_list.append(y)
            preprocessed_list.append(temp_list)
        return preprocessed_list
    # takes the dataset and converts each item to a string or leaves it as a float, depending on user requirements
    def post_processing(self, list, processing):
        postprocessed_list = []
        if processing == "string" or processing is None:
            for x in list:
                temp_list = []
                for y in x:
                    temp_list.append(str(y))
                postprocessed_list.append(temp_list)
            return postprocessed_list
        elif processing == "float":
            return list
        else:
            print("has to be string or float")
    # finds the designated value passed by the user and replaces it with the value calculated by the chosen strategy
    def replace_imputer(self, list):
        new_list = []
        count_y = 0
        for y in list:
            temp_list = []
            count_x = 0
            # check the x and y bounds so that only the relevant values in the dataset are replaced
            for x in y:
                if (x == self.__replaced
                        and self.__from_x <= count_x <= self.__to_x
                        and self.__from_y <= count_y <= self.__to_y):
                    temp_list.append(self.__cal_pointer.calculator(count_x, list))
                else:
                    temp_list.append(x)
                count_x += 1
            new_list.append(temp_list)
            count_y += 1
        return new_list
    # multiple options when printing the finished dataset
    def print_this(self, matrix=False, transposed=False, how=None):
        # flip y and x axis
        if transposed:
            self.__dataset = self.transpose(self.__dataset, 1)
        # print in matrix format
        if matrix:
            self.__dataset = self.post_processing(self.__dataset, "string")
            print('\n'.join([''.join(['{:12}'.format(item) for item in row]) for row in self.__dataset]))
        # print the final result with numeric values as floats
        elif how == "float":
            self.__dataset = self.pre_processing(self.__dataset)
            for x in self.__dataset:
                print(x)
        # simple print
        else:
            for x in self.__dataset:
                print(x)
# instantiating the strategy factory with the desired strategy
# Imputer takes the word to be replaced in the dataset and the strategy class returned by the strategy factory
# axis defines whether the values passed in the fit method apply to columns or rows
# transform simply runs the sequence of operations
# print_this controls how the results are displayed; note that it does change how the dataset is saved in the object
imput_cal = Strategy_Factory().means_of_imputation("mean")
Imput2 = Imputer(imput_cal,"nan")
Imput2.fit(this,from_x=1,to_x=3)
Imput2.transform()
Imput2.print_this(how="float", transposed=False ,matrix=True)
France      38.6        72000.0     560.0
Spain       27.0        48000.0     nan
Germany     30.0        62400.0     172.0
Spain       38.0        61000.0     nan
Germany     40.0        62400.0     340.0
France      35.0        58000.0     nan
Spain       38.6        52000.0     660.0
France      48.0        62400.0     1560.0
Germany     38.6        83000.0     950.0
France      37.0        67000.0     nan
Sweden      48.0        62400.0     265.0
Norway      48.0        67000.0     nan
USA         38.6        58000.0     790.0
USA         35.0        58000.0     nan
# instantiating the strategy factory with the desired strategy
# Imputer takes the word to be replaced in the dataset and the strategy class returned by the strategy factory
# axis defines whether the values passed in the fit method apply to columns or rows
# transform simply runs the sequence of operations
# print_this controls how the results are displayed; note that it does change how the dataset is saved in the object
imput_cal = Strategy_Factory().means_of_imputation("mean")
Imput2 = Imputer(imput_cal,"nan",axis=1)
Imput2.fit(this,from_x=1,to_x=4)
Imput2.transform()
Imput2.print_this(how="float", transposed=False)
['France', 36280.0, 72000.0, 560.0]
['Spain', 27.0, 48000.0, 24013.5]
['Germany', 30.0, 101.0, 172.0]
['Spain', 38.0, 61000.0, 30519.0]
['Germany', 40.0, 'nan', 340.0]
['France', 35.0, 58000.0, 'nan']
['Spain', 'nan', 52000.0, 660.0]
['France', 48.0, 'nan', 1560.0]
['Germany', 'nan', 83000.0, 950.0]
['France', 37.0, 67000.0, 'nan']
['Sweden', 48.0, 'nan', 265.0]
['Norway', 48.0, 67000.0, 'nan']
['USA', 'nan', 58000.0, 790.0]
['USA', 35.0, 58000.0, 'nan']
Further encapsulation of the functions, isolating each in its own class. The previous part of the notebook remains the same; if you run the cell below, the output remains the same.
# Imputation settings; passes the data along to the different classes
class Imputer():
    def __init__(self, factory, replaced, axis=None):
        self.__cal_pointer = factory
        self.__replaced = replaced
        self.__processing = None
        self.__axis = axis
        self.__dataset = None  # set by fit

    # run the methods in the correct sequence
    def transform(self):
        list1 = Transpose().transpose(self.__dataset, self.__axis)
        list1 = PreProcessing().pre_processing(list1)
        list2 = Replace(self.__cal_pointer, self.__from_x, self.__to_x,
                        self.__from_y, self.__to_y).replace_imputer(self.__replaced, list1)
        list3 = PostProcessing().post_processing(list2, self.__processing)
        list3 = Transpose().transpose(list3, self.__axis)
        self.__dataset = list3

    # declares the fit properties and operational parameters
    # the bounds are given 1-based and inclusive, and stored 0-based
    def fit(self, dataset, from_x=1, to_x=9999, from_y=1, to_y=9999):
        self.__dataset = dataset
        self.__from_x = from_x - 1
        self.__to_x = to_x - 1
        self.__from_y = from_y - 1
        self.__to_y = to_y - 1

    # allows other implementations to decorate with the existing dataset
    def get_dataset(self):
        return self.__dataset
class Replace(Imputer):
    def __init__(self, cal_pointer, from_x, to_x, from_y, to_y):
        self.__cal_pointer = cal_pointer
        self.__from_x = from_x
        self.__to_x = to_x
        self.__from_y = from_y
        self.__to_y = to_y

    # finds the designated value passed by the user and replaces it with the value calculated by the chosen strategy
    def replace_imputer(self, replaced, list):
        new_list = []
        count_y = 0
        for y in list:
            temp_list = []
            count_x = 0
            # check the x and y bounds so that only the relevant values in the dataset are replaced
            for x in y:
                if (x == replaced
                        and self.__from_x <= count_x <= self.__to_x
                        and self.__from_y <= count_y <= self.__to_y):
                    temp_list.append(self.__cal_pointer.calculator(count_x, list))
                else:
                    temp_list.append(x)
                count_x += 1
            new_list.append(temp_list)
            count_y += 1
        return new_list
class Transpose():
    # transposes the dataset
    # by iterating over each item in the dataset and appending each value to a new list
    def transpose(self, list, axis):
        transposed_list = []
        count = 0
        if axis == 1:
            while count < len(list[0]):
                temp_list = []
                for x in list:
                    temp_list.append(x[count])
                count += 1
                transposed_list.append(temp_list)
            return transposed_list
        elif axis == 0 or axis is None:
            return list
        else:
            print("0=columns, 1=rows")
class PreProcessing():
    # float check that keeps "NaN" and "Inf" from being converted into numeric values
    # checks whether the string holds a numeric or an alphabetical value
    # float() would accept "nan" as a float, but for this program it needs to remain a string
    def isfloat(self, num):
        if num.lower() == "nan" or num.lower() == "inf":
            return False
        try:
            float(num)
            return True
        except ValueError:
            return False

    # preprocesses the list, converting strings with numeric values to floats
    def pre_processing(self, list):
        preprocessed_list = []
        for x in list:
            temp_list = []
            for y in x:
                if y.isnumeric():
                    y = float(y)
                    temp_list.append(y)
                elif self.isfloat(y):
                    y = float(y)
                    temp_list.append(y)
                else:
                    temp_list.append(y)
            preprocessed_list.append(temp_list)
        return preprocessed_list
class PostProcessing():
    # takes the dataset and converts each item to a string or leaves it as a float, depending on user requirements
    def post_processing(self, list, processing):
        postprocessed_list = []
        if processing == "string" or processing is None:
            for x in list:
                temp_list = []
                for y in x:
                    temp_list.append(str(y))
                postprocessed_list.append(temp_list)
            return postprocessed_list
        elif processing == "float":
            return list
        else:
            print("has to be string or float")
class Finalize():
    def __init__(self, dataset):
        self.__dataset = dataset

    # multiple options when printing the finished dataset
    def print_this(self, matrix=False, transposed=False, how=None):
        # flip y and x axis
        if transposed:
            self.__dataset = Transpose().transpose(self.__dataset, 1)
        # print in matrix format
        if matrix:
            self.__dataset = PostProcessing().post_processing(self.__dataset, "string")
            print('\n'.join([''.join(['{:12}'.format(item) for item in row]) for row in self.__dataset]))
        # print the final result with numeric values as floats
        elif how == "float":
            self.__dataset = PreProcessing().pre_processing(self.__dataset)
            for x in self.__dataset:
                print(x)
        # simple print
        else:
            for x in self.__dataset:
                print(x)
imput_cal = Strategy_Factory().means_of_imputation("mean")
Imput2 = Imputer(imput_cal,"nan")
Imput2.fit(this,from_x=1,to_x=3)
Imput2.transform()
Finalize(Imput2.get_dataset()).print_this(how="float", transposed=False ,matrix=True)
France      38.6        72000.0     560.0
Spain       27.0        48000.0     nan
Germany     30.0        62400.0     172.0
Spain       38.0        61000.0     nan
Germany     40.0        62400.0     340.0
France      35.0        58000.0     nan
Spain       38.6        52000.0     660.0
France      48.0        62400.0     1560.0
Germany     38.6        83000.0     950.0
France      37.0        67000.0     nan
Sweden      48.0        62400.0     265.0
Norway      48.0        67000.0     nan
USA         38.6        58000.0     790.0
USA         35.0        58000.0     nan