API Guide
aggmap has two major classes: AggMap and AggMapNet.
AggMap
- class aggmap.map.AggMap(dfx, metric='correlation', by_scipy=False, n_cpus=16, info_distance=None)[source]
The feature restructuring class AggMap
- Parameters:
dfx (pandas DataFrame) – Input data frame.
metric (string, default: 'correlation') – measure of feature distance; supported values are {‘cosine’, ‘correlation’, ‘euclidean’, ‘jaccard’, ‘hamming’, ‘dice’}.
info_distance (numpy array, default: None) – a vector-form distance vector of the feature points; its shape should be (n*(n-1)/2,), where n is the number of features. Useful when you have your own vector-form distances to pass in.
by_scipy (bool, default: False) – calculate the distance with the scipy pdist function. Useful when dfx.shape[1] > 20000, i.e., when the number of features is very large; using pdist speeds up the distance calculation.
n_cpus (int, default: 16) – number of CPU cores used to calculate the distance.
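The vector-form layout expected by info_distance can be illustrated with a small pure-Python sketch. It assumes the same pair ordering as scipy's pdist, that is (0,1), (0,2), ..., (n-2,n-1), and the helper names correlation_distance and condensed_distances are hypothetical, not part of aggmap.

```python
import math
from itertools import combinations


def correlation_distance(u, v):
    """Return 1 - Pearson correlation of two equal-length sequences."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return 1.0 - num / den


def condensed_distances(feature_columns):
    """Pairwise distances for all i < j, flattened into one vector."""
    pairs = combinations(range(len(feature_columns)), 2)
    return [correlation_distance(feature_columns[i], feature_columns[j])
            for i, j in pairs]


# Four hypothetical features, each observed on four samples.
features = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 3.0, 2.0, 4.0],
]
n = len(features)
info_distance = condensed_distances(features)
print(len(info_distance), n * (n - 1) // 2)
```

A vector built this way has exactly n*(n-1)/2 entries, which is the shape the info_distance parameter describes.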
- batch_transform(array_2d, scale=True, scale_method='minmax', n_jobs=4, fillnan=0)[source]
- Parameters:
array_2d (2D numpy array of feature points, M (samples) x N (feature points)) –
scale (bool, if True, MinMax scaling is applied using the precomputed values) –
scale_method ({'minmax', 'standard'}) –
n_jobs (number of parallel jobs) –
fillnan (value used to fill NaN, default: 0) –
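What scaling "by the precomputed values" and fillnan mean can be sketched in pure Python. This is only an illustration under assumptions: the helper minmax_transform is hypothetical, and whether aggmap fills NaN before or after scaling, or how it treats constant features, is not stated in this guide.

```python
import math


def minmax_transform(rows, col_min, col_max, fillnan=0.0):
    """Scale each column with minima/maxima computed earlier (e.g. at fit time)."""
    out = []
    for row in rows:
        scaled = []
        for x, lo, hi in zip(row, col_min, col_max):
            if math.isnan(x):
                scaled.append(fillnan)          # missing value: use the fill value
            elif hi == lo:
                scaled.append(0.0)              # constant feature (assumed handling)
            else:
                scaled.append((x - lo) / (hi - lo))
        out.append(scaled)
    return out


# Hypothetical per-feature statistics stored from the training data.
col_min = [0.0, 10.0]
col_max = [4.0, 20.0]
new_rows = [[1.0, 15.0], [float("nan"), 20.0]]
scaled = minmax_transform(new_rows, col_min, col_max)
print(scaled)
```

The point of reusing stored minima and maxima is that new samples land on the same scale as the data the map was fitted on.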
- fit(feature_group_list=[], cluster_channels=5, var_thr=-1, split_channels=True, fmap_shape=None, emb_method='umap', min_dist=0.1, n_neighbors=15, a=1.576943460405378, b=0.8950608781227859, verbose=2, random_state=32, group_color_dict={}, lnk_method='complete', **kwargs)[source]
- Parameters:
feature_group_list (list of the group name for each feature point) –
cluster_channels (int, number of the channels(clusters) if feature_group_list is empty) –
var_thr (float, default: -1; a feature is included only if its variance is larger than this value. Since some features have very low variance, they can be removed by increasing this threshold) –
split_channels (bool, if True, the outputs are split into channels according to the feature types) –
fmap_shape (None or tuple, size of the feature map; if None, the size is calculated automatically) –
emb_method ({'umap', 'tsne', 'mds', 'isomap', 'random', 'lle', 'se'}, algorithm used to embed the high-dimensional data into 2D) –
min_dist (float, UMAP parameter for the effective minimum distance between embedded points) –
n_neighbors (int, UMAP parameter controlling the embedding. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved) –
a (float, UMAP parameter controlling the embedding. If None, it is determined automatically from min_dist and spread) –
b (float, UMAP parameter controlling the embedding. If None, it is determined automatically from min_dist and spread) –
group_color_dict (dict of the group colors; keys are the group names, values are the colors) –
lnk_method ({'complete', 'average', 'single', 'weighted', 'centroid'}, linkage method) –
kwargs (extra parameters for the corresponding embedding method) –
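The effect of var_thr can be pictured with a short sketch: a feature is kept only when its variance is strictly larger than the threshold, so the default of -1 keeps everything. The helper select_by_variance is hypothetical, and the use of the population variance is an assumption rather than aggmap's documented behavior.

```python
from statistics import pvariance


def select_by_variance(columns, var_thr=-1):
    """Return the names of features whose variance exceeds var_thr."""
    return [name for name, values in columns.items()
            if pvariance(values) > var_thr]


# Hypothetical feature table: name -> values over four samples.
columns = {
    "f0": [1, 1, 1, 1],      # variance 0
    "f1": [0, 1, 0, 1],      # variance 0.25
    "f2": [0, 10, 0, 10],    # variance 25
}
print(select_by_variance(columns))               # default keeps every feature
print(select_by_variance(columns, var_thr=0.1))  # drops the constant feature
```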
- plot_scatter(htmlpath='./', htmlname=None, radius=2, enabled_data_labels=False)[source]
Scatter plot. radius: the size of the scatter points; must be an int.
- plot_tree(figsize=(16, 8), add_leaf_labels=True, leaf_font_size=18, leaf_rotation=90)[source]
Diagram tree plot
- refit_c(cluster_channels=10, lnk_method='complete', group_color_dict={})[source]
re-fit the aggmap object to update the number of channels
- Parameters:
cluster_channels (int, number of the channels(clusters)) –
group_color_dict (dict of the group colors, keys are the group names, values are the colors) –
lnk_method ({'complete', 'average', 'single', 'weighted', 'centroid'}, linkage method) –
- to_nwk_tree(treefile='mytree', leaf_names=None)[source]
Convert the mp object to a Newick tree and a data file that can be submitted to the iTOL server.
AggMapNet
Created on Sun Aug 16 17:10:53 2020
@author: wanxiang.shen@u.nus.edu
- class aggmap.AggMapNet.MultiClassEstimator(epochs=200, conv1_kernel_size=13, dense_layers=[128], dense_avf='relu', batch_size=128, lr=0.0001, loss='categorical_crossentropy', batch_norm=False, n_inception=2, dropout=0.0, monitor='val_loss', metric='ACC', patience=10000, verbose=0, last_avf='softmax', random_state=32, gpuid=0)[source]
An AggMap CNN MultiClass estimator (each sample belongs to only one class)
- Parameters:
epochs (int, default = 200) – A parameter used for training epochs.
conv1_kernel_size (int, default = 13) – A parameter used for the kernel size of the first convolutional layers.
dense_layers (list, default = [128]) – A parameter used for the dense layers.
batch_size (int, default: 128) – A parameter used for the batch size.
lr (float, default: 1e-4) – A parameter used for the learning rate.
loss (string or function, default: 'categorical_crossentropy') – A parameter used for the loss function
batch_norm (bool, default: False) – batch normalization after convolution layers.
n_inception (int, default:2) – Number of the inception layers.
dense_avf (str, default is 'relu') – activation function in the dense layers.
dropout (float, default: 0) – A parameter used for the dropout of the dense layers.
monitor (str, default: 'val_loss') – {‘val_loss’, ‘val_metric’}, a monitor for model selection.
metric (str, default: 'ACC') – {‘ROC’, ‘ACC’, ‘PRC’}, a metric parameter.
patience (int, default: 10000) – A parameter used for early stopping.
gpuid (int, default: 0) – A parameter used for specific gpu card.
verbose (int, default: 0) – if positive, the log information of AggMapNet is printed; if negative, the log information of the original model is printed.
random_state (int, default: 32) – Random seed.
Examples
>>> from aggmap import AggModel
>>> clf = AggModel.MultiClassEstimator()
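How monitor and patience interact can be sketched as a generic early-stopping loop over a monitored validation loss. This is an assumption about the usual meaning of these parameters, not a copy of the AggMapNet training loop, and train_with_patience is a hypothetical helper.

```python
def train_with_patience(val_losses, patience):
    """Return (best_epoch, stopped_epoch) for a sequence of monitored losses."""
    best_loss = float("inf")
    best_epoch = -1
    stopped_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        stopped_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # no improvement for `patience` epochs
    return best_epoch, stopped_epoch


# Hypothetical validation losses, one per epoch.
losses = [0.9, 0.7, 0.8, 0.75, 0.6, 0.5]
print(train_with_patience(losses, patience=2))
print(train_with_patience(losses, patience=10000))
```

With a very large patience such as the default of 10000, the loop in this sketch never stops early and simply tracks the best epoch seen.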
- explain_model(mp, X, y, binary_task=False, explain_format='global', apply_logrithm=False, apply_smoothing=False, kernel_size=3, sigma=1.2)[source]
Feature importance calculation
- Parameters:
mp (aggmap object) –
X (training or test set X arrays) –
y (training or test set y arrays) –
binary_task ({True, False}) – whether the task is binary; if True, the feature importance is calculated for one class only
explain_format ({'local', 'global'}, default: 'global') – local or global feature importance; if local, then X must be one sample
apply_logrithm ({True, False}, default: False) – whether to apply a logarithm transformation to the importance values
apply_smoothing ({True, False}, default: False) – whether to apply a smoothing transformation to the importance values
kernel_size (odd number, the kernel size to perform the smoothing) –
sigma (float, sigma for gaussian smoothing) –
- Return type:
DataFrame of feature importance
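The kernel_size and sigma parameters describe a Gaussian smoothing of the importance map. A minimal pure-Python sketch of that idea follows; gaussian_kernel and smooth are hypothetical helpers, zero padding at the borders is an assumption, and the library's own smoothing routine may differ.

```python
import math


def gaussian_kernel(kernel_size=3, sigma=1.2):
    """Build a normalized 2D Gaussian kernel with an odd side length."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    half = kernel_size // 2
    raw = [[math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
            for dx in range(-half, half + 1)]
           for dy in range(-half, half + 1)]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]


def smooth(grid, kernel):
    """Apply the kernel to a 2D grid, treating out-of-range cells as zero."""
    half = len(kernel) // 2
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky, krow in enumerate(kernel):
                for kx, kval in enumerate(krow):
                    yy, xx = y + ky - half, x + kx - half
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kval * grid[yy][xx]
            out[y][x] = acc
    return out


kernel = gaussian_kernel(kernel_size=3, sigma=1.2)
importance = [[0.0] * 5 for _ in range(5)]
importance[2][2] = 1.0  # a single important feature point
smoothed = smooth(importance, kernel)
```

Smoothing spreads the importance of one grid cell over its neighbors, which is why an odd kernel size with a well-defined center is required.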
- predict_proba(X)[source]
Probability estimates. The returned estimates for all classes are ordered by the label of classes. For a multi_class problem, if multi_class is set to be “multinomial” the softmax function is used to find the predicted probability of each class.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Vector to be scored, where n_samples is the number of samples and n_features is the number of features.
- Returns:
T – Returns the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.
- Return type:
array-like of shape (n_samples, n_classes)
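Since the default last_avf of MultiClassEstimator is 'softmax', the per-class probabilities can be thought of as a softmax over the final-layer scores. The sketch below shows the function itself in pure Python; the scores are made-up numbers and nothing here calls the estimator.

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


proba = softmax([2.0, 1.0, 0.1])  # hypothetical scores for three classes
predicted = max(range(len(proba)), key=proba.__getitem__)
print(proba, predicted)
```

Each row of probabilities sums to one, and the predicted class is the index of the largest entry.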
- score(X, y, scoring='accuracy', sample_weight=None)[source]
Returns the score using the scoring option on the given test data and labels.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,)) – True labels for X.
scoring (str, please refer to: https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) –
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score – Score of self.predict(X) wrt. y.
- Return type:
- set_params(**parameters)[source]
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
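The <component>__<parameter> naming convention can be illustrated by splitting a parameter dictionary into the estimator's own parameters and those routed to nested components. The helper split_nested_params is hypothetical and only mimics the convention; it is not the set_params implementation.

```python
def split_nested_params(params):
    """Separate plain parameters from '<component>__<parameter>' ones."""
    own, nested = {}, {}
    for key, value in params.items():
        component, sep, sub_key = key.partition("__")
        if sep:
            # Everything after the first '__' belongs to the named component.
            nested.setdefault(component, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params(
    {"lr": 0.001, "clf__epochs": 50, "clf__dropout": 0.1}
)
print(own)
print(nested)
```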
- class aggmap.AggMapNet.MultiLabelEstimator(epochs=200, conv1_kernel_size=13, dense_layers=[128], dense_avf='relu', batch_size=128, lr=0.0001, loss=<function sigmoid_cross_entropy_with_logits_v2>, batch_norm=False, n_inception=2, dropout=0.0, monitor='val_loss', metric='ROC', patience=10000, verbose=0, random_state=32, gpuid=0)[source]
An AggMap CNN MultiLabel estimator (each sample may belong to more than one class)
- Parameters:
epochs (int, default = 200) – A parameter used for training epochs.
conv1_kernel_size (int, default = 13) – A parameter used for the kernel size of the first convolutional layers.
dense_layers (list, default = [128]) – A parameter used for the dense layers.
batch_size (int, default: 128) – A parameter used for the batch size.
lr (float, default: 1e-4) – A parameter used for the learning rate.
loss (string or function, default: tf.nn.sigmoid_cross_entropy_with_logits) – A parameter used for the loss function.
batch_norm (bool, default: False) – batch normalization after convolution layers.
n_inception (int, default:2) – Number of the inception layers.
dense_avf (str, default is 'relu') – activation function in the dense layers.
dropout (float, default: 0) – A parameter used for the dropout of the dense layers, such as 0.1, 0.3, 0.5.
monitor (str, default: 'val_loss') – {‘val_loss’, ‘val_metric’}, a monitor for model selection.
metric (str, default: 'ROC') – {‘ROC’, ‘ACC’, ‘PRC’}, a metric parameter.
patience (int, default: 10000) – A parameter used for early stopping.
gpuid (int, default: 0) – A parameter used for specific gpu card.
verbose (int, default: 0) – if positive, the log information of AggMapNet is printed; if negative, the log information of the original model is printed.
random_state (int, default: 32) – Random seed
name (str) – Model name
Examples
>>> from aggmap import AggModel
>>> clf = AggModel.MultiLabelEstimator()
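The default multi-label loss is a sigmoid cross-entropy computed on logits. As a rough illustration, the numerically stable form documented for TensorFlow's function, max(x, 0) - x*z + log(1 + exp(-|x|)), can be written in pure Python and checked against the textbook definition. The function names and sample values below are illustrative only.

```python
import math


def sigmoid_cross_entropy_with_logits(label, logit):
    """Stable form: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def naive_cross_entropy(label, logit):
    """Textbook form, which can overflow for large |logit|."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -label * math.log(p) - (1.0 - label) * math.log(1.0 - p)


# One hypothetical sample with three independent labels.
labels = [1.0, 0.0, 1.0]
logits = [2.0, -1.0, 0.5]
per_label = [sigmoid_cross_entropy_with_logits(z, x)
             for z, x in zip(labels, logits)]
print(per_label)
```

Each label is scored independently, which is what allows a sample to carry several positive labels at once.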
- explain_model(mp, X, y, explain_format='global', apply_logrithm=False, apply_smoothing=False, kernel_size=3, sigma=1.2)[source]
Feature importance calculation.
- Parameters:
mp (aggmap object) –
X (training or test set X arrays) –
y (training or test set y arrays) –
explain_format ({'local', 'global'}, default: 'global') – local or global feature importance; if local, then X must be one sample.
apply_logrithm ({True, False}, default: False) – whether to apply a logarithm transformation to the importance values.
apply_smoothing ({True, False}, default: False) – whether to apply a smoothing transformation to the importance values.
kernel_size (odd number, the kernel size used for the smoothing) –
sigma (float, sigma for Gaussian smoothing) –
- Return type:
DataFrame of feature importance
- predict_proba(X)[source]
Probability estimates. The returned estimates for all classes are ordered by the label of classes. For a multi_class problem, if multi_class is set to be “multinomial” the softmax function is used to find the predicted probability of each class.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Vector to be scored, where n_samples is the number of samples and n_features is the number of features.
- Returns:
T – Returns the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.
- Return type:
array-like of shape (n_samples, n_classes)
- score(X, y, scoring='accuracy', sample_weight=None)[source]
Returns the score using the scoring option on the given test data and labels.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,)) – True labels for X.
scoring (str, please refer to: https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) –
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score – Score of self.predict(X) wrt. y.
- Return type:
- set_params(**parameters)[source]
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- class aggmap.AggMapNet.RegressionEstimator(epochs=200, conv1_kernel_size=13, dense_layers=[128], dense_avf='relu', batch_size=128, lr=0.0001, loss='mse', batch_norm=False, n_inception=2, dropout=0.0, monitor='val_loss', metric='r2', patience=10000, verbose=0, random_state=32, gpuid=0)[source]
An AggMap CNN Regression estimator (each sample has a continuous target value)
- Parameters:
epochs (int, default = 200) – A parameter used for training epochs.
conv1_kernel_size (int, default = 13) – A parameter used for the kernel size of the first convolutional layers.
dense_layers (list, default = [128]) – A parameter used for the dense layers.
batch_size (int, default: 128) – A parameter used for the batch size.
lr (float, default: 1e-4) – A parameter used for the learning rate.
loss (string or function, default: 'mse') – A parameter used for the loss function
batch_norm (bool, default: False) – batch normalization after convolution layers.
n_inception (int, default:2) – Number of the inception layers.
dense_avf (str, default is 'relu') – activation function in the dense layers.
dropout (float, default: 0) – A parameter used for the dropout of the dense layers.
monitor (str, default: 'val_loss') – {‘val_loss’, ‘val_r2’}, a monitor for model selection.
metric (str, default: 'r2') – {‘r2’, ‘rmse’}, a metric parameter.
patience (int, default: 10000) – A parameter used for early stopping.
gpuid (int, default: 0) – A parameter used for specific gpu card.
verbose (int, default: 0) – if positive, the log information of AggMapNet is printed; if negative, the log information of the original model is printed.
random_state (int, default: 32) – random seed.
Examples
>>> from aggmap import AggModel
>>> clf = AggModel.RegressionEstimator()
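The two regression metric names, 'r2' and 'rmse', can be related to their usual formulas with a small sketch. It assumes 'r2' means the coefficient of determination and 'rmse' the root mean squared error; the exact definitions used inside AggMapNet are not stated here and may differ.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]
mean_only = [2.5, 2.5, 2.5, 2.5]  # always predicts the mean of y_true
print(r2_score(y_true, perfect), rmse(y_true, perfect))
print(r2_score(y_true, mean_only), rmse(y_true, mean_only))
```

Under these definitions a higher r2 is better while a lower rmse is better, which matters when one of them is used for model selection.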
- explain_model(mp, X, y, explain_format='global', apply_logrithm=False, apply_smoothing=False, kernel_size=3, sigma=1.2)[source]
Feature importance calculation
- Parameters:
mp (aggmap object) –
X (training or test set X arrays) –
y (training or test set y arrays) –
explain_format ({'local', 'global'}, default: 'global') – local or global feature importance; if local, then X must be one sample
apply_logrithm ({True, False}, default: False) – whether to apply a logarithm transformation to the importance values
apply_smoothing ({True, False}, default: False) – whether to apply a smoothing transformation to the importance values
kernel_size (odd number, the kernel size to perform the smoothing) –
sigma (float, sigma for gaussian smoothing) –
- Return type:
DataFrame of feature importance
- predict(X)[source]
- Parameters:
X (array-like of shape (n_samples, n_features_w, n_features_h, n_features_c)) – Vector to be scored, where n_samples is the number of samples and
- Returns:
T – Returns the predicted values
- Return type:
array-like of shape (n_samples, n_classes)
- score(X, y, scoring='r2', sample_weight=None)[source]
Returns the score using the scoring option on the given test data and labels.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,)) – True labels for X.
scoring (str, please refer to: https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) –
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score – Score of self.predict(X) wrt. y.
- Return type:
- set_params(**parameters)[source]
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as
Pipeline
). The latter have parameters of the form<component>__<parameter>
so that it’s possible to update each component of a nested object.- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- aggmap.AggMapNet.load_model(model_path, gpuid=None)[source]
gpuid: load model to specific gpu: {None, 0, 1, 2, 3,..}
A number of internal functions can also be accessed separately for more fine tuned work.
Useful Functions
Created on Fri Sep. 17 17:10:53 2021
@author: wanxiang.shen@u.nus.edu
- class aggmap.aggmodel.explainer.shapley_explainer(estimator, mp, backgroud='min', k_means_sampling=True, link='identity', **args)[source]
Kernel SHAP based model explanation. Its limitations are discussed at https://christophm.github.io/interpretable-ml-book/shapley.html#disadvantages-16 and in the paper “Problems with Shapley-value-based explanations as feature importance measures”. SHAP values do not identify causality. The global mean absolute SHAP feature importance is the average impact on the model output magnitude.
- Parameters:
estimator – model with a predict or predict_proba method
mp – aggmap object
backgroud (string or int) – {‘min’, ‘global_min’, ‘all’, int}. If ‘min’, the minimum value is used as the background data (equivalent to 1 sample). If ‘global_min’, the minimum value of all data is used as the background data. If an int K, K samples are drawn as the background data. If ‘all’, all of the training data is used as the background data for SHAP.
k_means_sampling (bool) – whether to use k-means to sample the background values
link – {“identity”, “logit”}. A generalized linear model link to connect the feature importance values to the model output. Since the feature importance values, phi, sum up to the model output, it often makes sense to connect them to the output with a link function where link(output) = sum(phi). If the model output is a probability then the LogitLink link function makes the feature importance values have log-odds units.
args – Other parameters for shap.KernelExplainer.
Examples
>>> import seaborn as sns
>>> import shap
>>> from aggmap.aggmodel.explainer import shapley_explainer
>>> ## shapley explainer
>>> shap_explainer = shapley_explainer(estimator, mp)
>>> global_imp_shap = shap_explainer.global_explain(clf.X_)
>>> local_imp_shap = shap_explainer.local_explain(clf.X_[[0]])
>>> ## S-map of shapley explainer
>>> sns.heatmap(local_imp_shap.shapley_importance_class_1.values.reshape(mp.fmap_shape),
...             cmap = 'rainbow')
>>> ## shapley plot (for a global bar plot, pass plot_type='bar')
>>> shap.summary_plot(shap_explainer.shap_values,
...                   feature_names = shap_explainer.feature_names)
>>> shap.initjs()
>>> shap.force_plot(shap_explainer.explainer.expected_value[1],
...                 shap_explainer.shap_values[1], feature_names = shap_explainer.feature_names)
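The relation link(output) = sum(phi) described for the link parameter can be checked numerically for the logit case. In this sketch the phi values are invented, and the first entry plays the role of the base (expected) value; nothing here calls shap or aggmap.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical contributions in log-odds units: base value first,
# then one contribution per feature.
phi = [logit(0.2), 0.9, -0.3, 1.1]

log_odds = sum(phi)              # additive on the link scale
probability = sigmoid(log_odds)  # back on the probability scale
print(probability, logit(probability))
```

On the probability scale the contributions are no longer additive, which is why the log-odds scale is the convenient one for a probabilistic model output.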
- global_explain(X=None, nsamples='auto', **args)[source]
Explanation of many samples.
- Parameters:
X (None or 4D array, where the shape is (n, w, h, c)) – the 4D array of AggMap multi-channel fmaps. Note that if X is None, estimator.X_ is used instead, namely the training set of the estimator is explained
nsamples ({'auto', int}) – Number of times to re-evaluate the model when explaining each prediction. More samples lead to lower variance estimates of the SHAP values. The “auto” setting uses nsamples = 2 * X.shape[1] + 2048
args (other parameters of the shap_values method of shap.KernelExplainer) –
- local_explain(X=None, idx=0, nsamples='auto', **args)[source]
Explanation of one sample only:
- Parameters:
X (None or 4D array, where the shape is (n, w, h, c)) – the 4D array of AggMap multi-channel fmaps. Note that if X is None, estimator.X_[[idx]] is used instead, namely the first sample is explained if idx=0
nsamples ({'auto', int}) – Number of times to re-evaluate the model when explaining each prediction. More samples lead to lower variance estimates of the SHAP values. The “auto” setting uses nsamples = 2 * X.shape[1] + 2048
args (other parameters of the shap_values method of shap.KernelExplainer) –
- class aggmap.aggmodel.explainer.simply_explainer(estimator, mp, backgroud='min', apply_logrithm=False, apply_smoothing=False, kernel_size=5, sigma=1.0)[source]
Simply-explainer for model explanation.
- Parameters:
estimator (object) – model with a predict or predict_proba method
mp (object) – aggmap object
backgroud ({'min', 'global_min', 'zeros'}, default: 'min') – if ‘min’, the minimum value of a vector of the training set is used; if ‘global_min’, the minimum value over the whole training set is used; if ‘zeros’, all zeros are used as the background data.
apply_logrithm (bool, default: False) – apply the logarithm to the feature importance score
apply_smoothing (bool, default: False) – apply Gaussian smoothing to the feature importance score (saliency map)
kernel_size (int, default: 5) – the kernel size for the smoothing
sigma (float, default: 1.0) – the sigma for the smoothing.
Examples
>>> import seaborn as sns
>>> from aggmap.aggmodel.explainer import simply_explainer
>>> simp_explainer = simply_explainer(estimator, mp)
>>> global_imp_simp = simp_explainer.global_explain(clf.X_, clf.y_)
>>> local_imp_simp = simp_explainer.local_explain(clf.X_[[0]], clf.y_[[0]])
>>> ## S-map of simply explainer
>>> sns.heatmap(local_imp_simp.simply_importance.values.reshape(mp.fmap_shape),
...             cmap = 'rainbow')
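The three background options can be contrasted with a toy sketch on a small sample-by-feature table. It reads 'min' as the per-feature minimum over the training samples, which is an interpretation of the description above rather than a confirmed detail, and make_background is a hypothetical helper.

```python
def make_background(X, mode="min"):
    """Build one background vector from a list of samples (rows)."""
    n_features = len(X[0])
    if mode == "min":
        # Minimum of each feature over all samples (assumed meaning of 'min').
        return [min(row[j] for row in X) for j in range(n_features)]
    if mode == "global_min":
        # A single minimum over every value, repeated for each feature.
        global_min = min(min(row) for row in X)
        return [global_min] * n_features
    if mode == "zeros":
        return [0.0] * n_features
    raise ValueError(f"unknown background mode: {mode!r}")


# Two hypothetical training samples with three features each.
X_train = [[1.0, 5.0, 3.0],
           [2.0, 4.0, 0.5]]
for mode in ("min", "global_min", "zeros"):
    print(mode, make_background(X_train, mode))
```

The choice of background sets the reference against which a feature's contribution is measured, so importance scores from different settings are not directly comparable.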
- global_explain(X=None, y=None)[source]
Explanation of many samples.
- Parameters:
X (None or 4D array, where the shape is (n, w, h, c)) – the 4D array of AggMap multi-channel fmaps
y (None or 2D array, where the shape is (n, class_num)) – the true label
Note that if X and y are None, then estimator.X_ and estimator.y_ are used instead, namely the training set of the estimator is explained.
- local_explain(X=None, y=None, idx=0)[source]
Explanation of one sample only.
- Parameters:
X (None or 4D array, where the shape is (1, w, h, c)) –
y (the true label, None or 2D array, where the shape is (1, class_num)) –
idx (int) – index of the sample to interpret. Note that if X and y are None, then estimator.X_[[idx]] and estimator.y_[[idx]] are used instead, namely the first sample is explained if idx=0.
- Return type:
Feature importance of the current class
- aggmap.utils.vismap.plot_grid(mp, htmlpath='./', htmlname=None, enabled_data_labels=False)[source]
mp: the object of mp
htmlpath: the figure path
- aggmap.utils.vismap.plot_scatter(mp, htmlpath='./', htmlname=None, radius=2, enabled_data_labels=False)[source]
mp: the object of mp
htmlpath: the figure path, not including the ‘html’ prefix
htmlname: the name
radius: int, default: 2, the radius of the scatter dot
Created on Sat Aug 17 16:54:12 2019
@author: wanxiang.shen@u.nus.edu
@usecase: statistic features’ distribution
Created on Fri Aug 27 14:06:17 2021
@author: Shen Wanxiang