class documentation

class FeatureSelector(object):

Constructor: FeatureSelector(data, target, C, C_space, ...)


Feature selection using regularized logistic regression with automatic tuning of the C parameter.

Attributes
----------
data : array-like or pd.DataFrame
    Feature matrix (n_samples, n_features).
target : array-like
    Binary target vector (n_samples,).
C : float, optional
    Inverse of regularization strength. If None, it is determined automatically from C_space.
C_space : array-like, optional
    Search space for C values (default: 20 values from 0.0001 to 1).
C_finder_iter : int, optional
    Number of bootstrap iterations for C optimization (default: 100).
C_tol : float, optional
    Tolerance on the derivative used to detect the optimization plateau (default: 0.005).
cut_off_w_feature : float, optional
    Fraction cutoff for model coefficients (default: 0).
cut_off_w_estimation : bool, optional
    Whether to estimate the optimal cutoff fraction (default: True).
cut_off_estim_params : dict, optional
    Parameters for cutoff estimation. If None, uses:
    {'inner_loop': 10, 'max_iter': 10, 'cut_off_feature_value': 0.1, 'max_feature': None, 'optimal_method': 'first'}

    inner_loop : int
        Number of inner-loop simulations when evaluating cut_off_level, i.e. the number of train_test_split repetitions used to assess the quality of a model with a given feature set.
    max_iter : int
        Number of simulations when adding each feature during the cut-off search. The total number of simulations is max_iter * inner_loop.
    cut_off_feature_value : float
        Minimum proportion of models that a feature must be included in for it to be kept in further simulations.
    max_feature : int, optional
        Restricts the feature space to the n top features. If None, the search covers the entire feature space.
    optimal_method : str, {'first', 'median'}
        How the optimal cut-off level is chosen (default: 'first').
feature_resample : int, optional (default=0)
    Feature resampling. If 0, the full feature space is considered at each train_test_split; otherwise the feature space is also sampled, in batches of size feature_resample.
pipeline_steps : list, optional
    Custom pipeline steps for the classifier. If None, uses:
    [StandardScaler(), LogisticRegression(penalty='l1', solver='liblinear')]
scoring : str, optional
    Scoring metric for model evaluation. Must be one of sklearn.metrics.SCORERS.keys() (default: 'roc_auc').

Notes
-----
- When C=None, a bootstrap-based optimization finds the optimal regularization strength.
- The final model selects features with non-zero coefficients.

Examples
--------
>>> FS_model = FeatureSelector(X, y, C=0.04, C_space=np.linspace(0.0001, 1, 20),
...                            C_finder_iter=10, cut_off_w_estimation=False,
...                            cut_off_w_feature=0.95,
...                            cut_off_estim_params={'max_feature': 50})
>>> FS_model.fit(max_iter=32000, log=True, feature_resample=10)
>>> FS_model.best_features  # dict of selected best features {feature: w}
>>> FS_model.all_features   # dict of all features with w > cut_off_feature_value (default > 0.1)
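When pipeline_steps is None, the default classifier described above can be sketched as follows. This is a minimal illustration of the documented default steps; the class may assemble its Pipeline differently, and the name default_pipeline is illustrative:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Default steps: standardize features, then fit an L1-penalized logistic
# regression; its zero/non-zero coefficients drive the feature selection.
default_pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty='l1', solver='liblinear'),
)
```

The liblinear solver is one of the sklearn solvers that supports the L1 penalty, which is what makes coefficients exactly zero for uninformative features.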

Method __init__ Undocumented
Method fit Train the feature selection model and identify significant features.
Method get_optimal_C Searches for the optimal regularization parameter C.
Instance Variable all_features dict(feature: weight). All features with weight w > cut_off_feature_value.
Instance Variable best_features dict(feature: weight). The best features that were selected; available after fit.
Instance Variable C Inverse of regularization strength. If None, determined automatically from C_space.
Instance Variable C_finder_iter Number of bootstrap iterations for C optimization (default: 100)
Instance Variable C_space Search space for C values (default: 20 values from 0.0001 to 1)
Instance Variable cut_off_estim_params dict, optional Parameters for cutoff w estimation.
Instance Variable cut_off_w_feature Cutoff for feature weights (default: 0). All features with w > cut_off_w_feature are included in the analysis.
Instance Variable data array-like or pd.DataFrame. Feature matrix (n_samples, n_features)
Instance Variable pipeline sklearn.pipeline.Pipeline. A sequence of data transformers with an optional final predictor.
Instance Variable scoring Scoring metric for model evaluation. Must be one of sklearn.metrics.SCORERS.keys() (default: 'roc_auc').
Instance Variable target array-like. Binary target vector (n_samples,)
Method _get_optimal_cut_off_level Undocumented
Instance Variable __C_space_iter Undocumented
Instance Variable _best_feature_progress Undocumented
Instance Variable _best_features_counter Undocumented
Instance Variable _C_arg_plateau Undocumented
Instance Variable _C_space_iter Undocumented
Instance Variable _C_space_mean Undocumented
Instance Variable _n_model Undocumented
Instance Variable _old_top Undocumented
Instance Variable _plot_params Undocumented
Instance Variable _score_list Undocumented
Instance Variable _top_fearures_progress Undocumented
Instance Variable _top_update Undocumented
Instance Variable _top_weights Undocumented
def __init__(self, data, target, C=None, C_space=np.linspace(0.0001, 1, 20), C_finder_iter=100, C_tol=0.005, cut_off_w_feature=0, cut_off_w_estimation=True, cut_off_estim_params=None, pipeline_steps=None, scoring='roc_auc'):

Undocumented

def fit(self, max_iter: int = 3000, cut_off_score: float = 0.6, log: bool = True, feature_resample: int = 0):

Train the feature selection model and identify significant features.

Parameters
----------
max_iter : int, optional (default=3000)
    Maximum number of iterations for the optimization solver. Must be positive; consider increasing it for complex datasets.
cut_off_score : float, optional (default=0.6)
    Minimum importance score threshold for feature selection, in the range 0-1. The score is the scoring metric (ROC-AUC by default) multiplied by the proportion of models that include the feature. Features scoring below this threshold are discarded.
log : bool, optional (default=True)
    If True, enables verbose output during training. Recommended for monitoring convergence.
feature_resample : int, optional (default=0)
    Feature resampling. If 0, the full feature space is considered at each train_test_split; otherwise the feature space is also sampled, in batches of size feature_resample.

Returns
-------
self : FeatureSelector
    The fitted estimator instance, enabling method chaining.

Examples
--------
>>> selector = FeatureSelector(X, y)
>>> selector.fit(max_iter=5000)
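The selection rule stated in the class Notes (keep features with non-zero coefficients) can be illustrated on synthetic data. The names pipe and selected below are illustrative and not part of the class API:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem: 20 features, 5 informative.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# Same shape as the documented default pipeline, with a fixed C for the demo.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty='l1', solver='liblinear', C=0.1),
)
pipe.fit(X, y)

# The L1 penalty zeroes out weak coefficients; keep only non-zero weights,
# mirroring the {feature: weight} dicts exposed by the class.
coefs = pipe.named_steps['logisticregression'].coef_.ravel()
selected = {f'f{i}': w for i, w in enumerate(coefs) if w != 0}
```

With a small C (strong regularization), most of the 20 coefficients collapse to zero, leaving a short dict of surviving features.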

def get_optimal_C(self, tol=0.005):

Searches for the optimal regularization parameter C over C_space using bootstrap iterations. The tol argument is the derivative tolerance used to detect the score plateau (see C_tol).
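One plausible reading of the plateau criterion (C_tol is documented as a "tolerance for derivative to consider the optimization plateau") is sketched below. This is an illustration under that assumption, not the class's actual implementation, and find_plateau_C is a hypothetical helper:

```python
import numpy as np

def find_plateau_C(C_space, scores, tol=0.005):
    """Return the first C at which the score curve flattens out.

    Illustrative only: computes a finite-difference derivative of the
    score-vs-C curve and picks the first point where its magnitude
    drops below tol; falls back to the largest C if no plateau is found.
    """
    slope = np.diff(scores) / np.diff(C_space)   # finite-difference derivative
    below = np.where(np.abs(slope) < tol)[0]
    return C_space[below[0] + 1] if below.size else C_space[-1]
```

On a typical score curve that rises steeply for small C and then saturates, this returns the smallest C on the flat region, i.e. the strongest regularization that does not hurt the score.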

all_features =

dict(feature: weight). All features with weight w > cut_off_feature_value.

best_features =

dict(feature: weight). The best features that were selected; available after fit.

C =

Inverse of regularization strength. If None, determined automatically from C_space.

C_finder_iter =

Number of bootstrap iterations for C optimization (default: 100)

C_space =

Search space for C values (default: 20 values from 0.0001 to 1)

cut_off_estim_params =

dict, optional. Parameters for cutoff estimation. If None, uses:
{'inner_loop': 10, 'max_iter': 10, 'cut_off_feature_value': 0.1, 'max_feature': None, 'optimal_method': 'first'}

inner_loop : int
    Number of inner-loop simulations when evaluating cut_off_level, i.e. the number of train_test_split repetitions used to assess the quality of a model with a given feature set.
max_iter : int
    Number of simulations when adding each feature during the cut-off search. The total number of simulations is max_iter * inner_loop.
cut_off_feature_value : float
    Minimum proportion of models that a feature must be included in for it to be kept in further simulations.
max_feature : int, optional
    Restricts the feature space to the n top features. If None, the search covers the entire feature space.
optimal_method : str, {'first', 'median'}
    How the optimal cut-off level is chosen (default: 'first').
feature_resample : int, optional (default=0)
    Feature resampling. If 0, the full feature space is considered at each train_test_split; otherwise the feature space is also sampled, in batches of size feature_resample.
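A hypothetical override of these settings is shown below. The class Examples pass a partial dict ({'max_feature': 50}), which suggests omitted keys fall back to the defaults above; that merging behavior is an assumption, and the values here are illustrative:

```python
# Hypothetical cutoff-estimation settings; key names follow the
# documented defaults of cut_off_estim_params.
cut_off_estim_params = {
    'inner_loop': 20,            # train/test split repetitions per evaluation
    'max_iter': 10,              # simulations per added feature
    'cut_off_feature_value': 0.1,
    'max_feature': 50,           # restrict the search to the 50 top features
    'optimal_method': 'median',  # take the median optimal cut-off level
}
```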

cut_off_w_feature =

Cutoff for feature weights (default: 0). All features with w > cut_off_w_feature are included in the analysis.

data =

array-like or pd.DataFrame. Feature matrix (n_samples, n_features)

pipeline =

sklearn.pipeline.Pipeline. A sequence of data transformers with an optional final predictor.

scoring =

Scoring metric for model evaluation. Must be one of sklearn.metrics.SCORERS.keys() (default: 'roc_auc').

target =

array-like. Binary target vector (n_samples,)

def _get_optimal_cut_off_level(self, X=None, y=None, best_features=None, self_data=True, inner_loop=20, max_iter=10, cut_off_feature_value=0.1, max_feature=None, optimal_method='first', feature_resample=0):

Undocumented

__C_space_iter =

Undocumented

_best_feature_progress =

Undocumented

_best_features_counter =

Undocumented

_C_arg_plateau =

Undocumented

_C_space_iter =

Undocumented

_C_space_mean =

Undocumented

_n_model: int =

Undocumented

_old_top =

Undocumented

_plot_params =

Undocumented

_score_list =

Undocumented

_top_fearures_progress =

Undocumented

_top_update =

Undocumented

_top_weights =

Undocumented