08.02.2018       Issue 216 (05.02.2018 - 11.02.2018)       Articles

Predicting Starting Pitcher Salaries



With a basic one-feature linear regression in hand, I want to explore a model with the three features mentioned earlier: Age, IP, and WAR. To do this, I use K-folds cross-validation to tune my model. This approach involves splitting the data into K ‘folds’ of roughly equal size, where K is typically 5 or 10, and iterating through the folds so that each one serves once as the validation set while the remainder form the training set. Here is a nice graphic to aid in understanding:

KFolds Cross-Validation with Five Folds
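As a minimal sketch of the splitting step, assuming a toy array of ten samples rather than the pitcher data, scikit-learn's KFold yields one pair of train/validation index arrays per fold. The names `X_toy` and `seen_in_validation` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples, 1 feature (illustrative, not the pitcher data set).
X_toy = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

seen_in_validation = []
for fold, (train_index, val_index) in enumerate(kf.split(X_toy), start=1):
    # Each fold holds out a different 2 of the 10 samples for validation.
    print('fold {}: train={} val={}'.format(fold, train_index, val_index))
    seen_in_validation.extend(val_index.tolist())

# Every sample is used for validation exactly once across the five folds.
assert sorted(seen_in_validation) == list(range(10))
```

With five folds, each sample is trained on four times and validated on once.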

To apply this technique to my data, I adapt some code from David Ziganto. The first step is defining a few helper functions to calculate and return the training and validation error.

import numpy as np
from sklearn.metrics import mean_squared_error

def calc_train_error(X_train, y_train, model):
    '''returns in-sample error (MSE) for already fit model.'''
    predictions = model.predict(X_train)
    mse = mean_squared_error(y_train, predictions)
    rmse = np.sqrt(mse)  # computed in the source, but the MSE is what gets returned
    return mse

def calc_validation_error(X_test, y_test, model):
    '''returns out-of-sample error (MSE) for already fit model.'''
    predictions = model.predict(X_test)
    mse = mean_squared_error(y_test, predictions)
    rmse = np.sqrt(mse)  # computed in the source, but the MSE is what gets returned
    return mse

def calc_metrics(X_train, y_train, X_test, y_test, model):
    '''fits model and returns the in-sample and out-of-sample error (MSE)'''
    model.fit(X_train, y_train)
    train_error = calc_train_error(X_train, y_train, model)
    validation_error = calc_validation_error(X_test, y_test, model)
    return train_error, validation_error

This has an advantage over scikit-learn’s cross_val_score function: it returns both in-sample (training) and out-of-sample (validation) error, which lets me determine where my model lies on the bias-variance curve. If you are unfamiliar with the concept, I recommend investing the time to understand it. At a high level, it states that if I lower my training error past a certain point, my model starts overfitting the training data and performs worse on the validation data. The preferred region within the chart is at or near the minimum of the validation error curve, because that is where out-of-sample error is lowest.

Bias-Variance Tradeoff
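For comparison, scikit-learn also has a cross_validate function (separate from cross_val_score) that can report both training and validation scores when called with return_train_score=True. The following is only a sketch on synthetic data, not the approach used in this article; `X_demo`, `y_demo`, and the coefficient values are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data (illustrative only, not the pitcher data set).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    Lasso(alpha=0.01),
    X_demo,
    y_demo,
    cv=kf,
    scoring='neg_mean_squared_error',
    return_train_score=True,
)

# scikit-learn reports negated MSE, so flip the sign to get errors.
train_mse = -scores['train_score'].mean()
val_mse = -scores['test_score'].mean()
print('train MSE: {:.4f} | validation MSE: {:.4f}'.format(train_mse, val_mse))
```

The hand-written helpers are still convenient when extra steps, such as scaling, need to happen inside each fold.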

This concept is critical to cross-validation, and is apparent in the final output of the K-Folds approach I am using. One other important requirement is that I must scale my data so that features with large magnitudes do not outweigh the others; this matters here because the lasso penalty in the model depends on the size of the coefficients. In this case, I standardize the three features using the StandardScaler class from scikit-learn and apply K-Folds.
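A small sketch of the scaling step on its own, assuming a handful of made-up rows in the order Age, IP, WAR (the values are illustrative, not taken from the data set). The scaler learns its mean and standard deviation from the training rows and then reuses them for the validation rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows in the order Age, IP, WAR (illustrative values only).
X_train_demo = np.array([[24.0, 180.0, 3.1],
                         [31.0, 95.0, 0.8],
                         [28.0, 210.0, 5.2],
                         [35.0, 60.0, -0.2]])
X_val_demo = np.array([[27.0, 150.0, 2.0]])

scaler = StandardScaler()
# Learn the mean and standard deviation from the training rows only...
X_train_scaled = scaler.fit_transform(X_train_demo)
# ...and reuse those same statistics for the validation rows.
X_val_scaled = scaler.transform(X_val_demo)

print(X_train_scaled.mean(axis=0))  # approximately 0 for every column
print(X_train_scaled.std(axis=0))   # approximately 1 for every column
print(X_val_scaled)
```

After scaling, IP (in the hundreds) and WAR (single digits) are on a comparable footing.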

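import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

s = StandardScaler()
K = 10
kf = KFold(n_splits=K, shuffle=True, random_state=42)
alphas = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1]

for alpha in alphas:
    train_errors = []
    validation_errors = []
    for train_index, val_index in kf.split(X, y):

        # split data
        X_train, X_val = X[train_index], X[val_index]
        y_train, y_val = y.loc[train_index], y.loc[val_index]

        # standardize: fit the scaler on the training fold only, then apply
        # the same transformation to the validation fold (the source called
        # fit_transform on both and reshaped to fixed sizes, which is not needed)
        X_train = s.fit_transform(X_train)
        X_val = s.transform(X_val)

        # instantiate model
        lasso = Lasso(alpha=alpha, fit_intercept=True, random_state=77)

        # calculate errors
        train_error, val_error = calc_metrics(X_train, y_train, X_val, y_val, lasso)

        # append to appropriate list
        train_errors.append(train_error)
        validation_errors.append(val_error)

    # generate report (the source line is cut off after the format string;
    # the arguments are reconstructed from the output shown below)
    print('alpha: {:6} | mean(train_error): {:7} | mean(val_error): {}'.
          format(alpha,
                 round(np.mean(train_errors), 4),
                 round(np.mean(validation_errors), 4)))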

alpha: 1e-05 | mean(train_error): 2.5996 | mean(val_error): 2.9349
alpha: 0.0001 | mean(train_error): 2.5996 | mean(val_error): 2.9348
alpha: 0.001 | mean(train_error): 2.5996 | mean(val_error): 2.934
alpha: 0.01 | mean(train_error): 2.6002 | mean(val_error): 2.9268
alpha: 0.1 | mean(train_error): 2.6555 | mean(val_error): 2.9145
alpha: 1 | mean(train_error): 6.9193 | mean(val_error): 7.728
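A plausible reading of the jump in both errors at an alpha of 1 in the output above is that the lasso penalty has started to shrink coefficients heavily or remove features altogether. The sketch below shows that behavior on synthetic data; the coefficient values and the extra alpha of 10 are assumptions chosen for illustration and say nothing about the pitcher data itself.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic, roughly standardized data (illustrative only).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, 0.5, 0.1]) + rng.normal(scale=0.1, size=100)

nonzero_counts = {}
for alpha in [1e-5, 1e-2, 1e-1, 1, 10]:
    lasso = Lasso(alpha=alpha, fit_intercept=True)
    lasso.fit(X_demo, y_demo)
    # Count the features whose coefficient survived the penalty.
    nonzero_counts[alpha] = int(np.sum(lasso.coef_ != 0))
    print('alpha: {:6} | coefficients: {}'.format(alpha, np.round(lasso.coef_, 3)))
```

At the smallest alpha all three coefficients are kept; at the largest, every coefficient is driven to zero and the model predicts only the intercept.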

This particular cross-validation recommends an alpha for the lasso penalty term of the regression, but the most important results are the trends of the error terms. Training error decreases rapidly while moving from an alpha of 1 to an alpha of .1, then levels off. Meanwhile, the validation error forms a convex pattern as it moves across alphas in the same direction: it bottoms out at an alpha of .1 and creeps back up as alpha shrinks further. This is the bias-variance tradeoff in action! With this information, I choose an alpha of .01, where both error terms have leveled off; in the output above, an alpha of .1 has a slightly lower validation error and a slightly smaller delta between error terms, so the two are close.

At this point, I use the K-Folds cross-validation multiple times to determine the best model for the data. It turns out that WAR exhibits too much collinearity with IP and does not actually provide much value, so I drop it. Because I have two features left, I can represent the model as a plane on a 3D plot. For interpretability, I undo the log transformation on Salary and get an interesting shape.

The Resulting Model

Again, it is clear that this model has shortfalls, especially when it comes to predicting salaries for the oldest players. It predicts incredibly high salaries for players who pitch into their forties, which is simply not the case.

As it turns out, a polynomial transformation of Age and IP results in an even better model than what I had with those two features alone. Should you wish to explore polynomial feature transformation, you can visit my GitHub.
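As a rough idea of what such a transformation produces, here is a sketch using scikit-learn's PolynomialFeatures. The degree of 2 and the single made-up row are assumptions for illustration; the article does not state which degree was used.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One made-up row in the order Age, IP (small numbers so the result is easy to read).
X_demo = np.array([[2.0, 3.0]])

# degree=2 is an assumption; include_bias=False leaves out the constant column.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_demo)

# Columns: Age, IP, Age^2, Age*IP, IP^2
print(X_poly)  # [[2. 3. 4. 6. 9.]]
```

The expanded columns can then be passed to the same cross-validation loop in place of the original two features.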
