class FeatureSelector(object):

Constructor: FeatureSelector(data, target, C, C_space, ...)

Feature selection using regularized logistic regression with automatic C parameter tuning.

Attributes
----------
data : array-like or pd.DataFrame
    Feature matrix (n_samples, n_features).
target : array-like
    Binary target vector (n_samples,).
C : float, optional
    Regularization strength (inverse of regularization). If None, it is automatically determined from C_space.
C_space : array-like, optional
    Search space for C values (default: 20 values from 0.0001 to 1).
C_finder_iter : int, optional
    Number of bootstrap iterations for C optimization (default: 100).
C_tol : float, optional
    Tolerance for the derivative used to detect the optimization plateau (default: 0.005).
cut_off_w_feature : float, optional
    Fraction cutoff for model coefficients (default: 0).
cut_off_w_estimation : bool, optional
    Whether to estimate the optimal cutoff fraction (default: False).
cut_off_estim_params : dict, optional
    Parameters for cutoff-w estimation. If None, uses:
    {'inner_loop': 10, 'max_iter': 10, 'cut_off_feature_value': 0.1, 'max_feature': None, 'optimal_method': 'first'}

    inner_loop : int
        Number of inner-loop simulations when evaluating cut_off_level, i.e. the number of
        train_test_split repetitions used to assess the quality of a model with a given feature set.
    max_iter : int
        Number of simulations when adding each feature during the cut-off search.
        The total number of simulations is max_iter * inner_loop.
    cut_off_feature_value : float
        Minimum proportion of models a feature must be included in to be considered in further simulations.
    max_feature : int, optional
        Restricts the feature space to the n top features. If None, the search covers the entire feature space.
    optimal_method : str, one of ['first', 'median']
        How the optimal cut-off level is chosen. Default: 'first'.
feature_resample : int, optional (default=0)
    Resampling of features. If 0, the full feature space is considered at each train_test_split;
    otherwise the feature space is also sampled in batches of size feature_resample.
pipeline_steps : list, optional
    Custom pipeline steps for the classifier. If None, uses:
    [StandardScaler(), LogisticRegression(penalty='l1', solver='liblinear')]
scoring : str, optional
    Scoring metric for model evaluation. Must be one of sklearn.metrics.SCORERS.keys() (default: 'roc_auc').

Notes
-----
- When C=None, a bootstrap-based optimization finds the optimal regularization strength.
- The final model selects features with non-zero coefficients.

Examples
--------
>>> FS_model = FeatureSelector(X, y, C=0.04, C_space=np.linspace(0.0001, 1, 20),
...                            C_finder_iter=10, cut_off_w_estimation=False,
...                            cut_off_w_feature=0.95, cut_off_estim_params={'max_feature': 50})
>>> FS_model.fit(max_iter=32000, log=True, feature_resample=10)
>>> FS_model.best_features  # dict of selected best features {feature: w}
>>> FS_model.all_features   # dict of all features with w > cut_off_feature_value (default > 0.1)
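The default pipeline named in the docstring (StandardScaler followed by L1-penalized LogisticRegression) is the mechanism behind the "non-zero coefficients" note above. A minimal sketch of that selection step, using synthetic data and illustrative names (not the class's internals):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data: 30 features, only 5 informative.
X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

# Default steps from the docstring; C chosen here only for illustration.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty='l1', solver='liblinear', C=0.1),
)
pipe.fit(X, y)

# L1 regularization drives uninformative coefficients to exactly zero;
# the surviving (non-zero) coefficients define the selected features.
coefs = pipe.named_steps['logisticregression'].coef_.ravel()
selected = {f'f{i}': w for i, w in enumerate(coefs) if w != 0}  # {feature: w}
```

Lowering C strengthens the L1 penalty and shrinks the selected set; this is the knob the class tunes when C=None.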
| Kind | Name | Description |
|------|------|-------------|
| Method | `__init__` | Undocumented |
| Method | `fit` | Train the feature selection model and identify significant features. |
| Method | `get_optimal_` | Searches for the optimal regularization parameter. |
| Instance Variable | `all_features` | dict(feature: weight). All features with w > cut_off_feature_value. |
| Instance Variable | `best_features` | dict(feature: weight). The best features that were selected. Available after fit. |
| Instance Variable | `C` | Regularization strength (inverse of regularization). If None, automatically determined from C_space. |
| Instance Variable | `C_finder_iter` | Number of bootstrap iterations for C optimization (default: 100). |
| Instance Variable | `C_space` | Search space for C values (default: 20 values from 0.0001 to 1). |
| Instance Variable | `cut_off_estim_params` | dict, optional. Parameters for cutoff-w estimation. |
| Instance Variable | `cut_off_w_feature` | Cutoff for feature weights (default: 0). All features with w > cut_off_w_feature are included in the analysis. |
| Instance Variable | `data` | array-like or pd.DataFrame. Feature matrix (n_samples, n_features). |
| Instance Variable | `pipeline` | sklearn.pipeline.Pipeline. A sequence of data transformers with an optional final predictor. |
| Instance Variable | `scoring` | Scoring metric for model evaluation. Must be one of sklearn.metrics.SCORERS.keys() (default: 'roc_auc'). |
| Instance Variable | `target` | array-like. Binary target vector (n_samples,). |
| Method | `_get` | Undocumented |
| Instance Variable | `__` | Undocumented |
| Instance Variable | `_best` | Undocumented |
| Instance Variable | `_best` | Undocumented |
| Instance Variable | `_` | Undocumented |
| Instance Variable | `_` | Undocumented |
| Instance Variable | `_` | Undocumented |
| Instance Variable | `_n` | Undocumented |
| Instance Variable | `_old` | Undocumented |
| Instance Variable | `_plot` | Undocumented |
| Instance Variable | `_score` | Undocumented |
| Instance Variable | `_top` | Undocumented |
| Instance Variable | `_top` | Undocumented |
| Instance Variable | `_top` | Undocumented |
def fit(self, max_iter: int = 3000, cut_off_score: float = 0.6, log: bool = True, feature_resample: int = 0):
Train the feature selection model and identify significant features.

Parameters
----------
max_iter : int, optional (default=3000)
    Maximum number of iterations for the optimization solver. Must be positive.
    Consider increasing for complex datasets.
cut_off_score : float, optional (default=0.6)
    Minimum importance score threshold for feature selection (range: 0-1), computed as
    scorer metric [ROC-AUC] * proportion of models that include the feature.
    Features with scores below this threshold will be discarded.
log : bool, optional (default=True)
    If True, enables verbose output during training. Recommended for monitoring convergence.
feature_resample : int, optional (default=0)
    Resampling of features. If 0, the full feature space is considered at each train_test_split;
    otherwise the feature space is also sampled in batches of size feature_resample.

Returns
-------
self : FeatureSelector
    The fitted estimator instance, enabling method chaining.

Examples
--------
>>> selector = FeatureSelector(X, y)
>>> selector.fit(max_iter=5000)
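The cut_off_score filter above combines two quantities: a per-feature quality metric and how often the feature survives across bootstrap models. A hedged sketch of that importance score on synthetic bookkeeping data (names and numbers are illustrative, not the class's internal state):

```python
import numpy as np

# Pretend we ran 100 bootstrap models over 6 features and recorded,
# for each model, which features it included and its ROC-AUC.
rng = np.random.default_rng(0)
n_models, n_features = 100, 6
included = rng.random((n_models, n_features)) < 0.5   # membership mask per model
aucs = rng.uniform(0.5, 0.9, size=n_models)           # per-model ROC-AUC

# Proportion of models that include each feature.
proportion = included.mean(axis=0)

# Mean ROC-AUC over the models that actually contain the feature.
mean_auc = np.array([
    aucs[included[:, j]].mean() if included[:, j].any() else 0.0
    for j in range(n_features)
])

# Importance = scorer metric * inclusion proportion, as the docstring describes.
importance = mean_auc * proportion

cut_off_score = 0.6
kept = np.flatnonzero(importance >= cut_off_score)  # features that pass the threshold
```

Because ROC-AUC is at most 1 and the proportion is at most 1, the score lives in [0, 1], which is why cut_off_score is documented with that range.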
C
    Regularization strength (inverse of regularization). If None, it is automatically determined from C_space.

cut_off_estim_params
    dict, optional. Parameters for cutoff-w estimation. If None, uses:
    {'inner_loop': 10, 'max_iter': 10, 'cut_off_feature_value': 0.1, 'max_feature': None, 'optimal_method': 'first'}

    inner_loop : int
        Number of inner-loop simulations when evaluating cut_off_level, i.e. the number of
        train_test_split repetitions used to assess the quality of a model with a given feature set.
    max_iter : int
        Number of simulations when adding each feature during the cut-off search.
        The total number of simulations is max_iter * inner_loop.
    cut_off_feature_value : float
        Minimum proportion of models a feature must be included in to be considered in further simulations.
    max_feature : int, optional
        Restricts the feature space to the n top features. If None, the search covers the entire feature space.
    optimal_method : str, one of ['first', 'median']
        How the optimal cut-off level is chosen. Default: 'first'.
    feature_resample : int, optional (default=0)
        Resampling of features. If 0, the full feature space is considered at each train_test_split;
        otherwise the feature space is also sampled in batches of size feature_resample.
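The inner_loop / max_iter nesting described above can be sketched as two nested loops of repeated train/test evaluations, max_iter * inner_loop runs in total. This is an illustrative structure under that description, not the class's actual code:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=150, n_features=10, random_state=0)

inner_loop, max_iter = 3, 2   # small values for the sketch; defaults are 10 and 10
scores = []
for outer in range(max_iter):             # repeated per candidate feature / cut-off
    for inner in range(inner_loop):       # repeated train_test_split evaluations
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, random_state=outer * inner_loop + inner)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

total_runs = len(scores)  # max_iter * inner_loop evaluations
```

Averaging `scores` over the inner loop is what gives a stable quality estimate for one candidate cut-off level; the outer loop repeats this as features are added.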
cut_off_w_feature
    Cutoff for feature weights (default: 0). All features with w > cut_off_w_feature are included in the analysis.

pipeline
    sklearn.pipeline.Pipeline. A sequence of data transformers with an optional final predictor.

scoring
    Scoring metric for model evaluation. Must be one of sklearn.metrics.SCORERS.keys() (default: 'roc_auc').