class documentation

class FeatureSelector(object):

Constructor: FeatureSelector(data, target, C, C_space, ...)


Feature selection using regularized logistic regression with automatic tuning of the C parameter.

Attributes
----------
data : array-like or pd.DataFrame
    Feature matrix (n_samples, n_features).
target : array-like
    Binary target vector (n_samples,).
C : float, optional
    Inverse of regularization strength. If None, it is determined automatically from C_space.
C_space : array-like, optional
    Search space for C values (default: 20 values from 0.0001 to 1).
C_finder_iter : int, optional
    Number of bootstrap iterations for C optimization (default: 100).
C_tol : float, optional
    Tolerance on the derivative used to detect the optimization plateau (default: 0.005).
cut_off_w_feature : float, optional
    Fraction cutoff for model coefficients (default: 0).
cut_off_w_estimation : bool, optional
    Whether to estimate the optimal cutoff fraction (default: True).
cut_off_estim_params : dict, optional
    Parameters for cutoff estimation. If None, uses:
    {'inner_loop': 10, 'max_iter': 10, 'cut_off_feature_value': 0.1, 'max_feature': None, 'optimal_method': 'first'}

    inner_loop : int
        Number of inner-loop simulations when evaluating cut_off_level, i.e. the number of train_test_split repetitions used to assess the quality of a model with a given feature set.
    max_iter : int
        Number of simulations when adding each feature during the cut-off search. The total number of simulations is max_iter * inner_loop.
    cut_off_feature_value : float
        Minimum proportion of models that a feature must be included in for it to be kept in further simulations.
    max_feature : int, optional
        Restricts the feature space to the n top features. If None, the search covers the entire feature space.
    optimal_method : str, {'first', 'median'}
        How the optimal cut-off level is chosen (default: 'first').
feature_resample : int, optional (default=0)
    Feature resampling. If 0, the full feature space is considered at each train_test_split; otherwise the feature space is also sampled, in batches of size feature_resample.
pipeline_steps : list, optional
    Custom pipeline steps for the classifier. If None, uses:
    [StandardScaler(), LogisticRegression(penalty='l1', solver='liblinear')]
scoring : str, optional
    Scoring metric for model evaluation. Must be one of sklearn.metrics.SCORERS.keys() (default: 'roc_auc').

Notes
-----
- When C=None, a bootstrap-based optimization finds the optimal regularization strength.
- The final model selects features with non-zero coefficients.

Examples
--------
>>> FS_model = FeatureSelector(X, y, C=0.04, C_space=np.linspace(0.0001, 1, 20),
...                            C_finder_iter=10, cut_off_w_estimation=False,
...                            cut_off_w_feature=0.95,
...                            cut_off_estim_params={'max_feature': 50})
>>> FS_model.fit(max_iter=32000, log=True, feature_resample=10)
>>> FS_model.best_features  # dict of selected best features {feature: w}
>>> FS_model.all_features   # dict of all features with w > cut_off_feature_value (default > 0.1)
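When pipeline_steps is None, the default classifier described above can be sketched as follows. This is a minimal illustration of the documented default steps; the class may assemble its Pipeline differently, and the name default_pipeline is illustrative:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Default steps: standardize features, then fit an L1-penalized logistic
# regression; its zero/non-zero coefficients drive the feature selection.
default_pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty='l1', solver='liblinear'),
)
```

The liblinear solver is one of the sklearn solvers that supports the L1 penalty, which is what makes coefficients exactly zero for uninformative features.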

Method __init__ Undocumented
Method fit Train the feature selection model and identify significant features.
Method get_optimal_C Searches for the optimal regularization parameter C.
Instance Variable all_features dict(feature: weight). All features with weight w > cut_off_feature_value.
Instance Variable best_features dict(feature: weight). The best features that were selected; available after fit.
Instance Variable C Inverse of regularization strength. If None, determined automatically from C_space.
Instance Variable C_finder_iter Number of bootstrap iterations for C optimization (default: 100)
Instance Variable C_space Search space for C values (default: 20 values from 0.0001 to 1)
Instance Variable cut_off_estim_params dict, optional Parameters for cutoff w estimation.
Instance Variable cut_off_w_feature Cutoff for feature weights (default: 0). All features with w > cut_off_w_feature are included in the analysis.
Instance Variable data array-like or pd.DataFrame. Feature matrix (n_samples, n_features)
Instance Variable pipeline sklearn.pipeline.Pipeline. A sequence of data transformers with an optional final predictor.
Instance Variable scoring Scoring metric for model evaluation. Must be one of sklearn.metrics.SCORERS.keys() (default: 'roc_auc').
Instance Variable target array-like. Binary target vector (n_samples,)
Method _get_optimal_cut_off_level Undocumented
Instance Variable __C_space_iter Undocumented
Instance Variable _best_feature_progress Undocumented
Instance Variable _best_features_counter Undocumented
Instance Variable _C_arg_plateau Undocumented
Instance Variable _C_space_iter Undocumented
Instance Variable _C_space_mean Undocumented
Instance Variable _n_model Undocumented
Instance Variable _old_top Undocumented
Instance Variable _plot_params Undocumented
Instance Variable _score_list Undocumented
Instance Variable _top_fearures_progress Undocumented
Instance Variable _top_update Undocumented
Instance Variable _top_weights Undocumented
def __init__(self, data, target, C=None, C_space=np.linspace(0.0001, 1, 20), C_finder_iter=100, C_tol=0.005, cut_off_w_feature=0, cut_off_w_estimation=True, cut_off_estim_params=None, pipeline_steps=None, scoring='roc_auc'):

Undocumented

def fit(self, max_iter: int = 3000, cut_off_score: float = 0.6, log: bool = True, feature_resample: int = 0):

Train the feature selection model and identify significant features.

Parameters
----------
max_iter : int, optional (default=3000)
    Maximum number of iterations for the optimization solver. Must be positive; consider increasing it for complex datasets.
cut_off_score : float, optional (default=0.6)
    Minimum importance score threshold for feature selection, in the range 0-1. The score is the scoring metric (ROC-AUC by default) multiplied by the proportion of models that include the feature. Features scoring below this threshold are discarded.
log : bool, optional (default=True)
    If True, enables verbose output during training. Recommended for monitoring convergence.
feature_resample : int, optional (default=0)
    Feature resampling. If 0, the full feature space is considered at each train_test_split; otherwise the feature space is also sampled, in batches of size feature_resample.

Returns
-------
self : FeatureSelector
    The fitted estimator instance, enabling method chaining.

Examples
--------
>>> selector = FeatureSelector(X, y)
>>> selector.fit(max_iter=5000)
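The selection rule stated in the class Notes (keep features with non-zero coefficients) can be illustrated on synthetic data. The names pipe and selected below are illustrative and not part of the class API:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem: 20 features, 5 informative.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# Same shape as the documented default pipeline, with a fixed C for the demo.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty='l1', solver='liblinear', C=0.1),
)
pipe.fit(X, y)

# The L1 penalty zeroes out weak coefficients; keep only non-zero weights,
# mirroring the {feature: weight} dicts exposed by the class.
coefs = pipe.named_steps['logisticregression'].coef_.ravel()
selected = {f'f{i}': w for i, w in enumerate(coefs) if w != 0}
```

With a small C (strong regularization), most of the 20 coefficients collapse to zero, leaving a short dict of surviving features.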

def get_optimal_C(self, tol=0.005):

Searches for the optimal regularization parameter C over C_space using bootstrap iterations. The tol argument is the derivative tolerance used to detect the score plateau (see C_tol).
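One plausible reading of the plateau criterion (C_tol is documented as a "tolerance for derivative to consider the optimization plateau") is sketched below. This is an illustration under that assumption, not the class's actual implementation, and find_plateau_C is a hypothetical helper:

```python
import numpy as np

def find_plateau_C(C_space, scores, tol=0.005):
    """Return the first C at which the score curve flattens out.

    Illustrative only: computes a finite-difference derivative of the
    score-vs-C curve and picks the first point where its magnitude
    drops below tol; falls back to the largest C if no plateau is found.
    """
    slope = np.diff(scores) / np.diff(C_space)   # finite-difference derivative
    below = np.where(np.abs(slope) < tol)[0]
    return C_space[below[0] + 1] if below.size else C_space[-1]
```

On a typical score curve that rises steeply for small C and then saturates, this returns the smallest C on the flat region, i.e. the strongest regularization that does not hurt the score.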

all_features =

dict(feature: weight). All features with weight w > cut_off_feature_value.

best_features =

dict(feature: weight). The best features that were selected; available after fit.

C =

Inverse of regularization strength. If None, determined automatically from C_space.

C_finder_iter =

Number of bootstrap iterations for C optimization (default: 100)

C_space =

Search space for C values (default: 20 values from 0.0001 to 1)

cut_off_estim_params =

dict, optional. Parameters for cutoff estimation. If None, uses:
{'inner_loop': 10, 'max_iter': 10, 'cut_off_feature_value': 0.1, 'max_feature': None, 'optimal_method': 'first'}

inner_loop : int
    Number of inner-loop simulations when evaluating cut_off_level, i.e. the number of train_test_split repetitions used to assess the quality of a model with a given feature set.
max_iter : int
    Number of simulations when adding each feature during the cut-off search. The total number of simulations is max_iter * inner_loop.
cut_off_feature_value : float
    Minimum proportion of models that a feature must be included in for it to be kept in further simulations.
max_feature : int, optional
    Restricts the feature space to the n top features. If None, the search covers the entire feature space.
optimal_method : str, {'first', 'median'}
    How the optimal cut-off level is chosen (default: 'first').
feature_resample : int, optional (default=0)
    Feature resampling. If 0, the full feature space is considered at each train_test_split; otherwise the feature space is also sampled, in batches of size feature_resample.
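A hypothetical override of these settings is shown below. The class Examples pass a partial dict ({'max_feature': 50}), which suggests omitted keys fall back to the defaults above; that merging behavior is an assumption, and the values here are illustrative:

```python
# Hypothetical cutoff-estimation settings; key names follow the
# documented defaults of cut_off_estim_params.
cut_off_estim_params = {
    'inner_loop': 20,            # train/test split repetitions per evaluation
    'max_iter': 10,              # simulations per added feature
    'cut_off_feature_value': 0.1,
    'max_feature': 50,           # restrict the search to the 50 top features
    'optimal_method': 'median',  # take the median optimal cut-off level
}
```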

cut_off_w_feature =

Cutoff for feature weights (default: 0). All features with w > cut_off_w_feature are included in the analysis.

data =

array-like or pd.DataFrame. Feature matrix (n_samples, n_features)

pipeline =

sklearn.pipeline.Pipeline. A sequence of data transformers with an optional final predictor.

scoring =

Scoring metric for model evaluation. Must be one of sklearn.metrics.SCORERS.keys() (default: 'roc_auc').

target =

array-like. Binary target vector (n_samples,)

def _get_optimal_cut_off_level(self, X=None, y=None, best_features=None, self_data=True, inner_loop=20, max_iter=10, cut_off_feature_value=0.1, max_feature=None, optimal_method='first', feature_resample=0):

Undocumented

__C_space_iter =

Undocumented

_best_feature_progress =

Undocumented

_best_features_counter =

Undocumented

_C_arg_plateau =

Undocumented

_C_space_iter =

Undocumented

_C_space_mean =

Undocumented

_n_model: int =

Undocumented

_old_top =

Undocumented

_plot_params =

Undocumented

_score_list =

Undocumented

_top_fearures_progress =

Undocumented

_top_update =

Undocumented

_top_weights =

Undocumented