6. Data Preprocessing Using Orange library in Python

In this Data Science post, I cover how to use the Orange library in a Python script to perform data preprocessing tasks such as randomization, normalization, discretization, and continuization with the help of Orange's preprocessing functions.

Discretization

Discretization replaces continuous variables with discrete ones. Each variable in the new data table indicates the bin to which the original value belongs.

The default discretization method (four bins with approximately equal numbers of data instances) can be replaced with other methods.
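
To make the idea of equal-frequency bins concrete, here is a minimal plain-Python sketch. It does not call Orange and is not Orange's implementation; the helper names `equal_frequency_cut_points` and `assign_bin`, the sample `ages` list, and the choice to place each threshold halfway between neighbouring values are all my own assumptions for illustration.

```python
def equal_frequency_cut_points(values, n_bins=4):
    """Return n_bins - 1 thresholds that split values into roughly equal-sized bins.

    Assumes there are at least n_bins values.
    """
    ordered = sorted(values)
    cuts = []
    for k in range(1, n_bins):
        # Index of the first value that should fall into bin k.
        idx = k * len(ordered) // n_bins
        # Assumption: put the threshold halfway between the two neighbours.
        cuts.append((ordered[idx - 1] + ordered[idx]) / 2)
    return cuts


def assign_bin(value, cuts):
    """Return the index of the bin that value falls into (0 is the lowest bin)."""
    return sum(value >= c for c in cuts)


# Hypothetical sample data, not taken from any Orange dataset.
ages = [22, 25, 29, 31, 35, 40, 47, 52]
cuts = equal_frequency_cut_points(ages)
bins = [assign_bin(a, cuts) for a in ages]

print(cuts)  # [27.0, 33.0, 43.5]
print(bins)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

With eight values and four bins, each bin ends up with two instances, and the new column `bins` records which bin each original value belongs to.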

Continuization

Continuization returns a new data table in which discrete variables are replaced with continuous ones or removed:
  • binary variables are transformed into 0.0/1.0 or -1.0/1.0 indicator variables, depending on the argument zero_based.
  • multinomial variables are treated according to the argument multinomial_treatment.
  • discrete attributes with only one possible value are removed.

zero_based

Determines the value used as the “low” value of the variable. When a binary variable is transformed into a continuous one, or when a multivalued variable is transformed into multiple variables, the transformed variable can either have values 0.0 and 1.0 (default, zero_based=True) or -1.0 and 1.0 (zero_based=False).
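
The effect of zero_based on a binary variable can be sketched in a few lines of plain Python. This is only an illustration of the mapping described above, not Orange code: the function name `continuize_binary` is hypothetical, and treating the alphabetically first value as the “low” one is my assumption.

```python
def continuize_binary(values, zero_based=True):
    """Map a two-valued variable to 0.0/1.0 (zero_based=True) or -1.0/1.0."""
    levels = sorted(set(values))
    if len(levels) != 2:
        raise ValueError("expected exactly two distinct values")
    # The "low" value is 0.0 or -1.0; the "high" value is always 1.0.
    low = 0.0 if zero_based else -1.0
    return [1.0 if v == levels[1] else low for v in values]


# Hypothetical sample column.
survived = ["no", "yes", "yes", "no"]
print(continuize_binary(survived))                    # [0.0, 1.0, 1.0, 0.0]
print(continuize_binary(survived, zero_based=False))  # [-1.0, 1.0, 1.0, -1.0]
```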

multinomial_treatment

Determines how multinomial (multivalued) variables are treated. By default, the variable is replaced by indicator variables, each corresponding to one value of the original variable. For each value of the original attribute, only the corresponding new attribute will have a value of one, and the others will be zero.

Note that these variables are not independent, so they cannot be used (directly) in, for instance, linear or logistic regression.

For example, the dataset “titanic” has the feature “status” with values “crew”, “first”, “second” and “third”, in that order. Its value for the 7th row is “first”. Continuization replaces the variable with the variables “status=crew”, “status=first”, “status=second” and “status=third”, so in the 7th row “status=first” is 1 and the other three are 0.
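
The “status” example can be reproduced with a small plain-Python sketch of indicator encoding. It does not load the titanic data or call Orange; the helper `to_indicators` is a hypothetical name, and the list of levels is simply copied from the example above.

```python
def to_indicators(value, levels, zero_based=True):
    """Return one indicator per level: 1.0 for the matching level, "low" otherwise."""
    low = 0.0 if zero_based else -1.0
    return [1.0 if value == level else low for level in levels]


levels = ["crew", "first", "second", "third"]
columns = [f"status={level}" for level in levels]

# A row whose original value is "first", like the 7th row in the example.
row = to_indicators("first", levels)
print(dict(zip(columns, row)))
# {'status=crew': 0.0, 'status=first': 1.0, 'status=second': 0.0, 'status=third': 0.0}
```

Because exactly one indicator is 1.0 in every row, the four new columns always sum to one, which is why they are not independent of each other.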

Normalization

Data normalization is generally understood as part of producing clean data: it organizes data so that it appears consistent across all records and fields. This increases the cohesion of entry types, which supports cleansing, lead generation, segmentation, and higher-quality data. In the preprocessing sense used in this post, normalization rescales the values of continuous features to a common scale.
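
In a preprocessing context, normalization usually means rescaling numeric features so that they are comparable. Below is a minimal plain-Python sketch of two common variants: scaling by standard deviation and scaling by span (min-max). It is not Orange's implementation; the function names, the use of the population standard deviation, and the sample `fares` list are my assumptions, and both functions assume the column is not constant.

```python
from statistics import mean, pstdev


def normalize_by_sd(values):
    """Center values to mean 0 and scale them to standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)  # assumption: population SD
    return [(v - mu) / sigma for v in values]


def normalize_by_span(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample column.
fares = [10.0, 20.0, 30.0, 50.0]
print(normalize_by_span(fares))  # [0.0, 0.25, 0.5, 1.0]
print(normalize_by_sd(fares))    # values with mean 0 and standard deviation 1
```

Scaling by span keeps every value inside a fixed range, while scaling by standard deviation is less affected by where the minimum and maximum happen to fall.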

Randomization
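
Randomization, in the preprocessing sense, shuffles the values of a column (for example the class labels) across rows, so that any relationship between that column and the rest of the data is broken while its distribution stays the same. Here is a minimal plain-Python sketch of that idea. It does not call Orange; the function `randomize_column`, its `seed` parameter, and the sample rows are hypothetical.

```python
import random


def randomize_column(rows, column, seed=None):
    """Return new rows in which the values of one column are shuffled across rows."""
    rng = random.Random(seed)  # a fixed seed makes the shuffle repeatable
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    # Build new dicts so the input rows are left untouched.
    return [{**row, column: value} for row, value in zip(rows, shuffled)]


# Hypothetical sample data.
rows = [
    {"status": "crew", "survived": "no"},
    {"status": "first", "survived": "yes"},
    {"status": "second", "survived": "yes"},
    {"status": "third", "survived": "no"},
]
randomized = randomize_column(rows, "survived", seed=42)
for row in randomized:
    print(row)
```

The “status” column keeps its original order, and the “survived” column still holds two “yes” and two “no” values, only assigned to rows at random.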

This was a brief introduction to data preprocessing using the Orange library in a Python script. 🙌