6. Data Preprocessing Using Orange library in Python
In this Data Science blog post, Data Preprocessing Using Orange library in Python, I will cover how we can use the Orange library in a Python script to perform data preprocessing tasks such as Randomization, Normalization, Discretization and Continuization with the help of various Orange functions.
Discretization
Discretization replaces continuous features with corresponding categorical features.
The variable in the new data table indicates the bin to which each original value belongs.
The default discretization method (four bins with an approximately equal number of data instances) can be replaced with other methods.
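To make the idea of "four bins with roughly the same number of instances" concrete, here is a minimal plain-Python sketch of equal-frequency binning. It is not Orange's implementation; the function names and the sample values are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def equal_frequency_cut_points(values, n_bins=4):
    """Return n_bins - 1 thresholds that split the values into roughly equal-sized bins."""
    ordered = sorted(values)
    n = len(ordered)
    # Use the value sitting at each quantile boundary as a threshold.
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


def discretize(values, cut_points):
    """Replace each continuous value with the index of the bin it falls into."""
    return [bisect_right(cut_points, v) for v in values]


# Hypothetical continuous feature.
ages = [22, 38, 26, 35, 54, 2, 27, 14]
cuts = equal_frequency_cut_points(ages)
bins = discretize(ages, cuts)

print(cuts)  # thresholds between the four bins
print(bins)  # bin index for every original value
```

With eight values and four bins, each bin ends up holding two values. Swapping in a different rule for choosing the thresholds (for example, equally wide intervals) changes the method without changing the `discretize` step.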
Continuization
Given a data table, continuization returns a new table in which the discrete attributes are replaced with continuous ones or removed:
- binary variables are transformed into 0.0/1.0 or -1.0/1.0 indicator variables, depending on the argument zero_based.
- multinomial variables are treated according to the argument multinomial_treatment.
- discrete attributes with only one possible value are removed.
zero_based
Determines the value used as the “low” value of the variable. When a binary variable is transformed into a continuous one, or when a multivalued variable is transformed into multiple variables, the transformed variable can have either the values 0.0 and 1.0 (default, zero_based=True) or -1.0 and 1.0 (zero_based=False).
multinomial_treatment
Determines how multinomial variables are treated. By default, the variable is replaced by indicator variables, each corresponding to one value of the original variable. For each value of the original attribute, only the corresponding new attribute has a value of one, and the others are zero.
Note that these variables are not independent, so they cannot be used (directly) in, for instance, linear or logistic regression.
For example, the dataset “titanic” has the feature “status” with values “crew”, “first”, “second” and “third”, in that order. Its value for the 7th row is “first”. Continuization replaces the variable with the variables “status=crew”, “status=first”, “status=second” and “status=third”. For that row, “status=first” is 1 and the other three are 0.
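The indicator encoding for the “status” example, together with the effect of zero_based, can be sketched in plain Python as follows. This is only an illustration of the idea under stated assumptions, not Orange's own code; the helper name `indicator_columns` is hypothetical.

```python
def indicator_columns(name, categories, value, zero_based=True):
    """Encode one categorical value as indicator variables, one per category."""
    # The "low" value is 0.0 when zero_based is True, otherwise -1.0.
    low = 0.0 if zero_based else -1.0
    return {f"{name}={c}": (1.0 if c == value else low) for c in categories}


status_values = ["crew", "first", "second", "third"]

# A row whose status is "first", encoded with the default low value of 0.0.
row = indicator_columns("status", status_values, "first")

# The same row with -1.0 as the low value.
row_pm = indicator_columns("status", status_values, "first", zero_based=False)

print(row)
print(row_pm)
```

Because exactly one indicator is "high" in every row, the four columns always carry redundant information, which is why they are not independent.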
Normalization
Normalization refers to rescaling real-valued numeric attributes into a 0 to 1 range. Data normalization is used in machine learning to make model training less sensitive to the scale of features. This can help a model converge to better weights and, in turn, often leads to a more accurate model.
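Rescaling into a 0 to 1 range is commonly done with min-max scaling: subtract the smallest value and divide by the range. The sketch below shows that arithmetic in plain Python; the function name and the sample values are hypothetical, and it is meant as an illustration rather than as what Orange does internally.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no spread; map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical numeric feature.
fares = [10.0, 20.0, 15.0, 30.0]
scaled = min_max_normalize(fares)

print(scaled)  # [0.0, 0.5, 0.25, 1.0]
```

After scaling, a feature measured in thousands and a feature measured in fractions both live on the same 0 to 1 scale, so neither dominates purely because of its units.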
In the broader data-management sense, data normalization is generally considered part of producing clean data. … It means organizing data so that it appears consistent across all records and fields. This increases the cohesion of entry types, which supports cleansing, lead generation, segmentation, and higher-quality data.
Randomization
With randomization, given a data table, the preprocessor returns a new table in which the data is shuffled. The Orange library provides Randomize to perform randomization.
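As a rough illustration of what "returns a new table in which the data is shuffled" can mean, the plain-Python sketch below shuffles the values of a single column and leaves the input untouched. Which part of the table gets shuffled depends on the tool and its settings; shuffling only the label column is an assumption made here, and the function name, the seed parameter and the sample rows are all hypothetical.

```python
import random


def shuffle_column(rows, column, seed=None):
    """Return new rows in which the values of one column are shuffled among the rows."""
    rng = random.Random(seed)  # a fixed seed makes the shuffle reproducible
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    # Build new dictionaries so the original table is not modified.
    return [{**row, column: value} for row, value in zip(rows, shuffled)]


# Hypothetical table: one feature and one class label per row.
table = [
    {"age": 22, "survived": "no"},
    {"age": 38, "survived": "yes"},
    {"age": 26, "survived": "yes"},
    {"age": 35, "survived": "no"},
]
randomized = shuffle_column(table, "survived", seed=42)

print(randomized)
```

The shuffled table contains the same label values as the original, only reassigned to different rows, while the "age" column stays in place.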
This was a brief introduction to data preprocessing using the Orange library in a Python script. 🙌