09.01.2020       Issue 316 (06.01.2020 - 12.01.2020)       Articles

Xverse: extracting features from a dataset





Experimental feature:

Below is the text of the article from the link, so you can quickly judge whether it is worth reading.

Please note that the text at the link and the text here may not match.

Xverse, short for XuniVerse, is a Python package for machine learning that assists data scientists with feature transformation and feature selection. Before we get into the details of the package, let us understand why we need one.

Background

[Image. Credits: Geeksta]

I came across this image and felt there is no better way to explain this project. As funny as it sounds, it holds for most of the data science life cycle: about 80% of a data scientist's time goes into dealing with features (X), hence the name XuniVerse, a.k.a. “xverse”.

In addition, the entire project is based on the work I published on Medium in 2018. Based on the feedback I got from readers, I wanted to improve the functionality of the code. Instead of releasing it as an updated set of functions, I thought it would be better to release it as a Python package. Well, that is how “xverse” was born.

Earlier works for this package

The articles shown below are now part of this package.

“Weight of evidence (WOE) and Information value (IV) are simple, yet powerful techniques to perform variable transformation and selection. These concepts have huge connection with the logistic regression modeling technique. It is widely used in credit scoring to measure the separation of good vs bad customers.”

“The idea is to apply a variety of techniques to select variables. When an algorithm picks a variable, we give a vote for the variable. At the end, we calculate the total votes for each variables and then pick the best ones based on votes. This way, we end up picking the best variables with minimum effort in the variable selection process.”

Functionalities

The dataset used for all the examples below is in the “data” folder. In addition, you can refer to the Jupyter notebook “Xverse.ipynb” available at this link.
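Before walking through the functions, it helps to see how the X and y used in the examples could be prepared. This is only a sketch: the file name and the target column name below are placeholders for whatever is in the “data” folder, not guaranteed names.

import pandas as pd

# Placeholder path and column name: adjust to the file shipped in the "data" folder
df = pd.read_csv('data/bank.csv')
y = df['target'].values # target column as a NumPy array
X = df.drop(columns=['target']) # remaining columns as the feature DataFrame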

1. Monotonic Binning

Monotonic binning is a data preparation technique widely used in scorecard development. It converts a numerical variable into a categorical one by creating bins that have a monotonic relationship with the target. The example below demonstrates how it works.

[Chart: Balance vs Target (monotonic relationship)]

The feature ‘balance’ shown above ranges from $-3,313.001 to $71,188, which is a very wide range. If we had to bin this variable based on domain knowledge, we would not need to worry about a monotonic relationship. When no prior knowledge exists, however, it is hard to determine the bin categories. This is an ideal situation where monotonic binning can help.

It starts with 20 bins (the default, which you can change at any time) and then reduces the number of bins until it finds a monotonic relationship (either increasing or decreasing) between the X variable and the target y. If it cannot establish a monotonic relationship, it forces bins on the X variable using the force_bins option. Throughout, it uses Spearman correlation to check whether a monotonic relationship exists between X and y.
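The same idea can be sketched outside the package. This is not the library's actual implementation, only a minimal illustration of shrinking the number of quantile bins until the Spearman correlation between the bin means and the event rates is perfectly monotonic.

import pandas as pd
from scipy import stats

def monotonic_bins_sketch(x, y, max_bins=20):
    # Try max_bins quantile bins first, then fewer, until the bin means of x
    # and the event rates of y are perfectly rank-correlated (|rho| == 1)
    for n in range(max_bins, 1, -1):
        df = pd.DataFrame({'x': x, 'y': y})
        df['bin'] = pd.qcut(df['x'], n, duplicates='drop')
        grouped = df.groupby('bin', observed=True).agg(mean_x=('x', 'mean'), event_rate=('y', 'mean'))
        rho, _ = stats.spearmanr(grouped['mean_x'], grouped['event_rate'])
        if abs(rho) == 1.0:
            return df['bin'] # monotonic relationship found
    return df['bin'] # fall back to the smallest binning tried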

Pros of monotonic binning:

  1. Handles outliers
  2. Establishes a monotonic relationship between a feature and the target variable

Cons of monotonic binning:

  1. We lose information during variable binning
  2. There is no proper way to treat missing values

How to perform Monotonic Binning using “Xverse”?

from xverse.transformer import MonotonicBinning

clf = MonotonicBinning()
clf.fit(X, y)

print(clf.bins)
output_bins = clf.bins #will be used later in this exercise

Here, X is the feature dataset (a pandas DataFrame) and y is a NumPy array containing the target column. We can also apply the precomputed monotonic bins to new datasets, as shown below.

clf = MonotonicBinning(custom_binning=output_bins) #output_bins was created earlier

out_X = clf.transform(X)
out_X.head()


2. Weight of Evidence (WOE) and Information Value (IV)

The WOE transformer is available in the transformer module. The advantages of the WOE transformation are:

  1. Handles missing values
  2. Handles outliers
  3. The transformation is based on the logarithm of the distributions, which aligns with the logistic regression output function
  4. No need for dummy variables
  5. With a proper binning technique, it can establish a monotonic relationship (either increasing or decreasing) between the independent and dependent variables
  6. The IV value can be used to select variables quickly

The formulas to calculate WOE and IV are provided below. For each bin i:

WOE_i = ln( (non-events_i / total non-events) / (events_i / total events) )

or simply,

WOE = ln( % of non-events / % of events )

and the information value of the feature is

IV = Σ ( % of non-events - % of events ) × WOE
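As a rough illustration (not the package's internal code), WOE and IV for a single binned feature could be computed like this, assuming a binary target where 1 marks an event and every bin contains both events and non-events.

import numpy as np
import pandas as pd

def woe_iv_sketch(binned_x, y):
    df = pd.DataFrame({'bin': binned_x, 'y': y})
    grouped = df.groupby('bin', observed=True)['y'].agg(events='sum', total='count')
    grouped['non_events'] = grouped['total'] - grouped['events']
    pct_events = grouped['events'] / grouped['events'].sum() # % of events per bin
    pct_non_events = grouped['non_events'] / grouped['non_events'].sum() # % of non-events per bin
    woe = np.log(pct_non_events / pct_events) # WOE per bin
    iv = ((pct_non_events - pct_events) * woe).sum() # total information value for the feature
    return woe, iv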

How to perform WOE using “Xverse”?

from xverse.transformer import WOE

clf = WOE()
clf.fit(X, y)

clf.woe_df # weight of evidence transformation dataset; it will also be used for the bivariate charts below
clf.iv_df # information value dataset

To select features, use the rule of thumb below:

+-------------------+-----------------------------+
| Information Value | Variable Predictiveness     |
+-------------------+-----------------------------+
| Less than 0.02    | Not useful for prediction   |
+-------------------+-----------------------------+
| 0.02 to 0.1       | Weak predictive power       |
+-------------------+-----------------------------+
| 0.1 to 0.3        | Medium predictive power     |
+-------------------+-----------------------------+
| 0.3 to 0.5        | Strong predictive power     |
+-------------------+-----------------------------+
| > 0.5             | Suspicious predictive power |
+-------------------+-----------------------------+
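With those thresholds, weak features can be filtered straight from the IV dataset. The column names below are assumptions about how clf.iv_df is laid out, so check the actual output first.

iv = clf.iv_df # assumed columns: 'Variable_Name' and 'Information_Value'
useful = iv[iv['Information_Value'] >= 0.02] # drop features that are not useful for prediction
suspicious = iv[iv['Information_Value'] > 0.5] # unusually high IV often signals leakage; review these
print(useful.sort_values('Information_Value', ascending=False))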

To transform X based on the WOE bins calculated above, use the code below.

clf.transform(X) #apply WOE transformation on the dataset

How to apply the WOE transformation on a future dataset?

output_woe_bins = clf.woe_bins # for future transformation
output_mono_bins = clf.mono_custom_binning # for future transformation

clf = WOE(woe_bins=output_woe_bins, mono_custom_binning=output_mono_bins) # bins created earlier
out_X = clf.transform(X)

Additional features to explore in WOE package:

treat_missing: {'separate', 'mode', 'least_frequent'} (default='separate')
This parameter controls how missing values in the dataset are handled.
'separate' - Missing values are treated as their own group (category)
'mode' - Missing values are combined with the most frequent item in the dataset
'least_frequent' - Missing values are combined with the least frequent item in the dataset
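For example, to fold missing values into the most frequent category instead of keeping them as their own group (a sketch based only on the option values listed above):

clf = WOE(treat_missing='mode') # missing values merged into the most frequent category
clf.fit(X, y)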


3. Bar Charts for Bivariate distributions

Bivariate distributions help us explore the relationship between the X and y variables. This option is available in “xverse” as well.

How to chart using “Xverse”?

To make bivariate charts, we need the “woe_df” dataset generated by the WOE option shown above.

woe_df = clf.woe_df

from xverse.graph import BarCharts

clf = BarCharts(bar_type='v')
clf.plot(woe_df)

Additional features to explore in Bar charts:

plot_metric: 'count' or 'mean' (default='mean')
Metric to be used while plotting the bivariate chart.
'count' - Event counts in the particular bin
'mean' - Mean event rate in the particular bin


bar_type: 'horizontal' or 'vertical' (default='vertical')
Type of bar chart.

fig_size: figure size for each of the individual plots (default=(8, 6))

bar_color: CSS color value; use it with the hash sign in front (default='#058caa')
Bar color

num_color: CSS color value; use it with the hash sign in front (default='#ed8549')
Numbers color. It represents the numbers written on top of the bars.
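Putting those options together, a customized chart call might look like the sketch below. It reuses only the option values listed above and keeps bar_type='v' as in the earlier example.

clf = BarCharts(bar_type='v', plot_metric='count', fig_size=(10, 6),
                bar_color='#058caa', num_color='#ed8549')
clf.plot(woe_df)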



4. Voting Selector for variable selection

Variable selection is one of the key steps in the predictive modeling process. It is an art. To put it in simple terms, variable selection is like picking a soccer team to win the World Cup: you need the best player in each position, and you don't want two or more players who play the same position.

In Python, we have different techniques to select variables. Some of them include recursive feature elimination, tree-based selection, and L1-based feature selection.

The idea here is to apply a variety of techniques to select variables. When an algorithm picks a variable, we give a vote for the variable. In the end, we calculate the total votes for each variable and then pick the best ones based on votes. This way, we end up picking the best variables with minimum effort in the variable selection process.
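The voting idea itself can be sketched with plain scikit-learn, independently of xverse. The sketch below only illustrates counting votes across three selectors and assumes X is numeric with no missing values; it is not the package's internal logic.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

def vote_for_features(X, y, n_keep=5):
    votes = pd.Series(0, index=X.columns)
    # 1. Random forest importance: vote for the top n_keep features
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    votes[pd.Series(rf.feature_importances_, index=X.columns).nlargest(n_keep).index] += 1
    # 2. Recursive feature elimination with logistic regression
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=n_keep).fit(X, y)
    votes[X.columns[rfe.support_]] += 1
    # 3. L1-based selection
    l1 = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear'),
                         max_features=n_keep).fit(X, y)
    votes[X.columns[l1.get_support()]] += 1
    return votes.sort_values(ascending=False)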

from xverse.ensemble import VotingSelector

clf = VotingSelector()
clf.fit(X, y)

Now, let’s look at the feature importance using the code below.

clf.feature_importances_ # feature importance from each selection technique
clf.feature_votes_ # total votes received by each feature
clf.transform(X).head() # dataset reduced to the selected features

Additional features to explore in Voting Selector:

selection_techniques: 'all', 'quick' or list (default='all')
List of selection techniques to be applied on the data.
Available techniques - Weight of evidence ('WOE'), Random Forest ('RF'), Recursive Feature Elimination ('RFE'), Extra Trees Classifier ('ETC'), Chi Square ('CS'), L1 feature selection ('L_ONE').

'all' - Apply all selection techniques ['WOE', 'RF', 'RFE', 'ETC', 'CS', 'L_ONE']
'quick' - ['WOE','RF','ETC']
list - user provided list of feature selection techniques from available techniques

no_of_featues: 'auto', 'sqrt' or int (default='auto')
Number of features to be selected by each selection technique.
'auto' - len(features)/2
'sqrt' - sqrt(len(features)) rounded to the lowest number
int - user provided number in integer format

handle_category = 'woe' or 'le' (default='woe')
Handle categorical value transformation using the Label Encoder or Weight of Evidence option. It takes care of missing values too, treating them as a separate level.
'woe' - use weight of evidence transformation
'le' - use label encoder transformation

numerical_missing_values= 'median', 'mean' or 0 (default='median')
Handle numerical variable missing values.
'median' - use median of the column
'mean' - use mean of the column
0 - use 0 to impute the missing values

minimum_votes = int (default=0)
Minimum number of votes needed to select a variable after feature selection. Only used in the transform process. Default value is set to 0 to select all variables.
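Combining those options, a customized selector might look like the sketch below, built only from the parameter values listed above:

clf = VotingSelector(selection_techniques=['WOE', 'RF', 'RFE'],
                     no_of_featues='sqrt',
                     handle_category='woe',
                     numerical_missing_values='median',
                     minimum_votes=2)
clf.fit(X, y)
selected_X = clf.transform(X) # keeps only the features that received at least 2 votes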

5. Pipeline feature

Finally, one important thing about “Xverse” is the pipeline functionality. We can add all the steps as part of a pipeline and make our lives much easier.

The code below demonstrates the pipeline feature.

from sklearn.pipeline import Pipeline
from xverse.transformer import SplitXY # assumed import path for SplitXY, which splits the target column from the features

clf = Pipeline(steps=[('split_x_y', SplitXY(['target'])), ('feature_votes', VotingSelector())])
clf.fit(df, df['target'])

Finally, to get the transformation on a new dataset,

clf.transform(df)

Contributing to Xverse

XuniVerse is under active development. If you'd like to be involved, we'd love to have you. Please leave a response on this article and we can talk about next steps.

The code is available in a single notebook here for you to experiment with on your own dataset.

Have fun!





