API Guide

aggmap has two major classes: AggMap and AggMapNet.

AggMap

class aggmap.map.AggMap(dfx, metric='correlation', by_scipy=False, n_cpus=16, info_distance=None)[source]

The feature restructuring class AggMap

Parameters:
  • dfx (pandas DataFrame) – Input data frame.

  • metric (string, default: 'correlation') – the metric used to measure feature distance; supported values: {'cosine', 'correlation', 'euclidean', 'jaccard', 'hamming', 'dice'}

  • info_distance (numpy array, default: None) – a condensed (vector-form) distance vector of the feature points; the shape should be (n*(n-1)/2,), where n is the number of features. Useful when you want to pass your own precomputed vector-form distances.

  • by_scipy (bool, default: False) – calculate the distances using the scipy pdist function. Useful when dfx.shape[1] > 20000, i.e., when the number of features is very large; pdist speeds up the distance calculation.

  • n_cpus (int, default: 16) – number of CPU cores used to calculate the distances.
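
A minimal construction sketch, assuming AggMap can be imported from the package top level (the documented path is aggmap.map.AggMap) and that dfx is a samples x features DataFrame; the data here are hypothetical:

>>> import numpy as np
>>> import pandas as pd
>>> from aggmap import AggMap
>>> dfx = pd.DataFrame(np.random.rand(100, 25))  # hypothetical: 100 samples x 25 features
>>> mp = AggMap(dfx, metric='correlation')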

batch_transform(array_2d, scale=True, scale_method='minmax', n_jobs=4, fillnan=0)[source]
Parameters:
  • array_2d (2D numpy array of feature points, M (samples) x N (feature points)) –

  • scale (bool, if True, apply min-max scaling using the precomputed values) –

  • scale_method ({'minmax', 'standard'}) –

  • n_jobs (int, number of parallel jobs) –

  • fillnan (value used to fill NaNs, default: 0) –
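
A hedged usage sketch, assuming mp has already been fitted (see fit below); the output is assumed to be one multi-channel feature map per sample:

>>> X = mp.batch_transform(dfx.values, scale=True, n_jobs=4)  # assumed shape: (M, w, h, c)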

copy()[source]

copy self

fit(feature_group_list=[], cluster_channels=5, var_thr=-1, split_channels=True, fmap_shape=None, emb_method='umap', min_dist=0.1, n_neighbors=15, a=1.576943460405378, b=0.8950608781227859, verbose=2, random_state=32, group_color_dict={}, lnk_method='complete', **kwargs)[source]
Parameters:
  • feature_group_list (list of group names, one per feature point) –

  • cluster_channels (int, number of channels (clusters) when feature_group_list is empty) –

  • var_thr (float, default: -1, meaning a feature is included only if its variance is larger than this value. Since some features have very low variance, they can be removed by increasing this threshold) –

  • split_channels (bool, if True, the output will be split into multiple channels based on the feature types) –

  • fmap_shape (None or tuple, size of the feature map; if None, the size will be calculated automatically) –

  • emb_method ({'umap', 'tsne', 'mds', 'isomap', 'random', 'lle', 'se'}, algorithm used to embed the high-dimensional data into 2D) –

  • min_dist (float, UMAP parameter for the effective minimum distance between embedded points) –

  • n_neighbors (int, UMAP parameter controlling the embedding. Larger values give a more global view of the manifold, while smaller values preserve more of the local structure) –

  • a (float, UMAP parameter controlling the embedding. If None, it is determined automatically from min_dist and spread.) –

  • b (float, UMAP parameter controlling the embedding. If None, it is determined automatically from min_dist and spread.) –

  • group_color_dict (dict of the group colors, keys are the group names, values are the colors) –

  • lnk_method ({'complete', 'average', 'single', 'weighted', 'centroid'}, linkage method) –

  • kwargs (extra parameters for the corresponding embedding method) –
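
A minimal fit sketch with the default UMAP embedding; the parameter values are illustrative, and fit is assumed to operate on the mp object in place:

>>> mp.fit(cluster_channels=5, emb_method='umap', verbose=0)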

load(filename)[source]

load self

plot_grid(htmlpath='./', htmlname=None, enabled_data_labels=False)[source]

Grid plot

plot_scatter(htmlpath='./', htmlname=None, radius=2, enabled_data_labels=False)[source]

Scatter plot. radius: the size of the scatter points; must be an int.

plot_tree(figsize=(16, 8), add_leaf_labels=True, leaf_font_size=18, leaf_rotation=90)[source]

Diagram tree plot
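
A hedged sketch of the three plotting helpers on a fitted mp; file names and figure size are illustrative:

>>> mp.plot_grid(htmlpath='./', htmlname='fmap_grid')    # grid plot of the assigned feature positions
>>> mp.plot_scatter(htmlpath='./', radius=2)             # scatter plot of the 2D embedding
>>> mp.plot_tree(figsize=(16, 8), add_leaf_labels=True)  # dendrogram of the feature clustering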

refit_c(cluster_channels=10, lnk_method='complete', group_color_dict={})[source]

re-fit the aggmap object to update the number of channels

Parameters:
  • cluster_channels (int, number of channels (clusters)) –

  • group_color_dict (dict of the group colors, keys are the group names, values are the colors) –

  • lnk_method ({'complete', 'average', 'single', 'weighted', 'centroid'}, linkage method) –
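
A hedged sketch, assuming refit_c updates the fitted mp in place:

>>> mp.refit_c(cluster_channels=10, lnk_method='complete')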

save(filename)[source]

save self
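
A hedged save/load round trip; the file name is illustrative, and load is assumed to return the restored object since it is an instance method:

>>> mp.save('./my_aggmap.mp')
>>> mp = mp.load('./my_aggmap.mp')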

to_nwk_tree(treefile='mytree', leaf_names=None)[source]

Convert the mp object to a Newick tree, together with the data files, for submission to the iTOL server.

transform(arr_1d, scale=True, scale_method='minmax', fillnan=0)[source]
Parameters:
  • arr_1d (1D numpy array of feature points) –

  • scale (bool, if True, apply min-max scaling using the precomputed values) –

  • scale_method ({'minmax', 'standard'}) –

  • fillnan (value used to fill NaNs, default: 0) –
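
A single-sample sketch, assuming mp has been fitted and dfx is the input DataFrame:

>>> fmap = mp.transform(dfx.values[0], scale=True)  # assumed shape: (w, h, c)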

transform_mpX_to_df(X)[source]

Input: 4D array X; output: 2D DataFrame.
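
A hedged round-trip sketch, recovering a samples x features table from the 4D feature maps:

>>> X = mp.batch_transform(dfx.values)
>>> df2d = mp.transform_mpX_to_df(X)  # one row per sample, one column per feature point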

AggMapNet

class aggmap.AggMapNet.MultiClassEstimator(epochs=200, conv1_kernel_size=13, dense_layers=[128], dense_avf='relu', batch_size=128, lr=0.0001, loss='categorical_crossentropy', batch_norm=False, n_inception=2, dropout=0.0, monitor='val_loss', metric='ACC', patience=10000, verbose=0, last_avf='softmax', random_state=32, gpuid=0)[source]

An AggMap CNN MultiClass estimator (each sample belongs to only one class)

Parameters:
  • epochs (int, default = 200) – A parameter used for training epochs.

  • conv1_kernel_size (int, default = 13) – A parameter used for the kernel size of the first convolutional layer.

  • dense_layers (list, default = [128]) – A parameter used for the dense layers.

  • batch_size (int, default: 128) – A parameter used for the batch size.

  • lr (float, default: 1e-4) – A parameter used for the learning rate.

  • loss (string or function, default: 'categorical_crossentropy') – A parameter used for the loss function

  • batch_norm (bool, default: False) – batch normalization after convolution layers.

  • n_inception (int, default:2) – Number of the inception layers.

  • dense_avf (str, default is 'relu') – activation function in the dense layers.

  • dropout (float, default: 0) – A parameter used for the dropout of the dense layers.

  • monitor (str, default: 'val_loss') – {‘val_loss’, ‘val_metric’}, a monitor for model selection.

  • metric (str, default: 'ACC') – {'ROC', 'ACC', 'PRC'}, the metric parameter.

  • patience (int, default: 10000) – A parameter used for early stopping.

  • gpuid (int, default: 0) – A parameter used for specific gpu card.

  • verbose (int, default: 0) – if positive, the log information of AggMapNet will be printed; if negative, the log information of the original model will be printed.

  • random_state (int, default: 32) – Random seed.

Examples

>>> from aggmap import AggModel
>>> clf = AggModel.MultiClassEstimator()
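
A fuller hedged sketch, assuming a scikit-learn style fit(X, y) on AggMap feature maps; X (from mp.batch_transform) and the one-hot labels y are hypothetical:

>>> clf = AggModel.MultiClassEstimator(epochs=50, gpuid=0)
>>> clf.fit(X, y)                 # X: (n, w, h, c) fmaps; y: (n, n_classes) one-hot labels
>>> proba = clf.predict_proba(X)  # (n, n_classes)
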
explain_model(mp, X, y, binary_task=False, explain_format='global', apply_logrithm=False, apply_smoothing=False, kernel_size=3, sigma=1.2)[source]

Feature importance calculation

Parameters:
  • mp (aggmap object) –

  • X (training or test set X arrays) –

  • y (training or test set y arrays) –

  • binary_task ({True, False}) – whether the task is binary; if True, the feature importance will be calculated for one class only

  • explain_format ({'local', 'global'}, default: 'global') – local or global feature importance; if 'local', X must be a single sample

  • apply_logrithm ({True, False}, default: False) – whether to apply a logarithm transformation to the importance values

  • apply_smoothing ({True, False}, default: False) – whether to apply a smoothing transformation to the importance values

  • kernel_size (odd int, the kernel size used for the smoothing) –

  • sigma (float, sigma for Gaussian smoothing) –

Return type:

DataFrame of feature importance
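
A hedged call sketch for the global feature importance; mp, X and y follow the conventions above:

>>> imp_df = clf.explain_model(mp, X, y, explain_format='global')  # one importance value per feature point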

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

predict_proba(X)[source]

Probability estimates. The returned estimates for all classes are ordered by the label of classes.

Parameters:

X (array-like of shape (n_samples, n_features_w, n_features_h, n_features_c)) – the 4D feature-map array to be scored, where n_samples is the number of samples.

Returns:

T – Returns the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.

Return type:

array-like of shape (n_samples, n_classes)

score(X, y, scoring='accuracy', sample_weight=None)[source]

Returns the score using the scoring option on the given test data and labels.

Parameters:
  • X (array-like) – Test samples (AggMap feature maps).

  • y (array-like) – True labels for X.

  • scoring (str, default: 'accuracy') – the scoring method.

  • sample_weight (array-like, default: None) – sample weights.

Returns:

score – Score of self.predict(X) wrt. y.

Return type:

float

set_params(**parameters)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

class aggmap.AggMapNet.MultiLabelEstimator(epochs=200, conv1_kernel_size=13, dense_layers=[128], dense_avf='relu', batch_size=128, lr=0.0001, loss=tf.nn.sigmoid_cross_entropy_with_logits, batch_norm=False, n_inception=2, dropout=0.0, monitor='val_loss', metric='ROC', patience=10000, verbose=0, random_state=32, gpuid=0)[source]

An AggMap CNN MultiLabel estimator (each sample may belong to more than one class)

Parameters:
  • epochs (int, default = 200) – A parameter used for training epochs.

  • conv1_kernel_size (int, default = 13) – A parameter used for the kernel size of the first convolutional layer.

  • dense_layers (list, default = [128]) – A parameter used for the dense layers.

  • batch_size (int, default: 128) – A parameter used for the batch size.

  • lr (float, default: 1e-4) – A parameter used for the learning rate.

  • loss (string or function, default: tf.nn.sigmoid_cross_entropy_with_logits) – A parameter used for the loss function.

  • batch_norm (bool, default: False) – batch normalization after convolution layers.

  • n_inception (int, default:2) – Number of the inception layers.

  • dense_avf (str, default is 'relu') – activation function in the dense layers.

  • dropout (float, default: 0) – A parameter used for the dropout of the dense layers, such as 0.1, 0.3, 0.5.

  • monitor (str, default: 'val_loss') – {'val_loss', 'val_metric'}, a monitor for model selection.

  • metric (str, default: 'ROC') – {'ROC', 'ACC', 'PRC'}, the metric parameter.

  • patience (int, default: 10000) – A parameter used for early stopping.

  • gpuid (int, default: 0) – A parameter used for specific gpu card.

  • verbose (int, default: 0) – if positive, the log information of AggMapNet will be printed; if negative, the log information of the original model will be printed.

  • random_state (int, default: 32) – Random seed

  • name (str) – Model name

Examples

>>> from aggmap import AggModel
>>> clf = AggModel.MultiLabelEstimator()
explain_model(mp, X, y, explain_format='global', apply_logrithm=False, apply_smoothing=False, kernel_size=3, sigma=1.2)[source]

Feature importance calculation.

Parameters:
  • mp (aggmap object) –

  • X (training or test set X arrays) –

  • y (training or test set y arrays) –

  • explain_format ({'local', 'global'}, default: 'global') – local or global feature importance; if 'local', X must be a single sample.

  • apply_logrithm ({True, False}, default: False) – whether to apply a logarithm transformation to the importance values.

  • apply_smoothing ({True, False}, default: False) – whether to apply a smoothing transformation to the importance values.

  • kernel_size (odd int, the kernel size used for the smoothing) –

  • sigma (float, sigma for Gaussian smoothing) –

Return type:

DataFrame of feature importance

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

predict_proba(X)[source]

Probability estimates. The returned estimates for all classes are ordered by the label of classes.

Parameters:

X (array-like of shape (n_samples, n_features_w, n_features_h, n_features_c)) – the 4D feature-map array to be scored, where n_samples is the number of samples.

Returns:

T – Returns the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.

Return type:

array-like of shape (n_samples, n_classes)

score(X, y, scoring='accuracy', sample_weight=None)[source]

Returns the score using the scoring option on the given test data and labels.

Parameters:
  • X (array-like) – Test samples (AggMap feature maps).

  • y (array-like) – True labels for X.

  • scoring (str, default: 'accuracy') – the scoring method.

  • sample_weight (array-like, default: None) – sample weights.

Returns:

score – Score of self.predict(X) wrt. y.

Return type:

float

set_params(**parameters)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

class aggmap.AggMapNet.RegressionEstimator(epochs=200, conv1_kernel_size=13, dense_layers=[128], dense_avf='relu', batch_size=128, lr=0.0001, loss='mse', batch_norm=False, n_inception=2, dropout=0.0, monitor='val_loss', metric='r2', patience=10000, verbose=0, random_state=32, gpuid=0)[source]

An AggMap CNN regression estimator.

Parameters:
  • epochs (int, default = 200) – A parameter used for training epochs.

  • conv1_kernel_size (int, default = 13) – A parameter used for the kernel size of the first convolutional layer.

  • dense_layers (list, default = [128]) – A parameter used for the dense layers.

  • batch_size (int, default: 128) – A parameter used for the batch size.

  • lr (float, default: 1e-4) – A parameter used for the learning rate.

  • loss (string or function, default: 'mse') – A parameter used for the loss function

  • batch_norm (bool, default: False) – batch normalization after convolution layers.

  • n_inception (int, default:2) – Number of the inception layers.

  • dense_avf (str, default is 'relu') – activation function in the dense layers.

  • dropout (float, default: 0) – A parameter used for the dropout of the dense layers.

  • monitor (str, default: 'val_loss') – {‘val_loss’, ‘val_r2’}, a monitor for model selection

  • metric (str, default: 'r2') – {'r2', 'rmse'}, the metric parameter

  • patience (int, default: 10000) – A parameter used for early stopping

  • gpuid (int, default: 0) – A parameter used for specific gpu card

  • verbose (int, default: 0) – if positive, the log information of AggMapNet will be printed; if negative, the log information of the original model will be printed

  • random_state (int, default: 32) – random seed.

Examples

>>> from aggmap import AggModel
>>> clf = AggModel.RegressionEstimator()
explain_model(mp, X, y, explain_format='global', apply_logrithm=False, apply_smoothing=False, kernel_size=3, sigma=1.2)[source]

Feature importance calculation

Parameters:
  • mp (aggmap object) –

  • X (training or test set X arrays) –

  • y (training or test set y arrays) –

  • explain_format ({'local', 'global'}, default: 'global') – local or global feature importance; if 'local', X must be a single sample

  • apply_logrithm ({True, False}, default: False) – whether to apply a logarithm transformation to the importance values

  • apply_smoothing ({True, False}, default: False) – whether to apply a smoothing transformation to the importance values

  • kernel_size (odd int, the kernel size used for the smoothing) –

  • sigma (float, sigma for Gaussian smoothing) –

Return type:

DataFrame of feature importance

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

predict(X)[source]
Parameters:

X (array-like of shape (n_samples, n_features_w, n_features_h, n_features_c)) – the 4D feature-map array to be scored, where n_samples is the number of samples and (n_features_w, n_features_h, n_features_c) is the feature-map shape.

Returns:

T – The predicted values.

Return type:

array-like of shape (n_samples,)

score(X, y, scoring='r2', sample_weight=None)[source]

Returns the score using the scoring option on the given test data and labels.

Parameters:
  • X (array-like) – Test samples (AggMap feature maps).

  • y (array-like) – True values for X.

  • scoring (str, default: 'r2') – the scoring method.

  • sample_weight (array-like, default: None) – sample weights.

Returns:

score – Score of self.predict(X) wrt. y.

Return type:

float

set_params(**parameters)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

aggmap.AggMapNet.load_model(model_path, gpuid=None)[source]

gpuid: load the model onto a specific GPU: {None, 0, 1, 2, 3, ...}
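
A hedged sketch; the model path is illustrative:

>>> from aggmap.AggMapNet import load_model
>>> clf = load_model('./saved_model', gpuid=0)  # gpuid=None keeps the default device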

A number of internal functions can also be accessed separately for more fine-tuned work.

Useful Functions

class aggmap.aggmodel.explainer.shapley_explainer(estimator, mp, backgroud='min', k_means_sampling=True, link='identity', **args)[source]

Kernel SHAP based model explanation. Its limitations are discussed in "Problems with Shapley-value-based explanations as feature importance measures" and at https://christophm.github.io/interpretable-ml-book/shapley.html#disadvantages-16. Note that SHAP values do not identify causality. The global mean absolute SHAP feature importance is the average impact of each feature on the model output magnitude.

Parameters:
  • estimator – model with a predict or predict_proba method

  • mp – aggmap object

  • backgroud (string or int) – {'min', 'global_min', 'all', int}. If 'min', use the min value as the background data (equal to 1 sample); if 'global_min', use the min value of all data as the background data; if an int K, sample K samples as the background data; if 'all', use all of the training data as the background data for SHAP.

  • k_means_sampling (bool) – whether to use k-means to sample the background values

  • link – {“identity”, “logit”}. A generalized linear model link to connect the feature importance values to the model output. Since the feature importance values, phi, sum up to the model output, it often makes sense to connect them to the output with a link function where link(output) = sum(phi). If the model output is a probability then the LogitLink link function makes the feature importance values have log-odds units.

  • args – Other parameters for shap.KernelExplainer.

Examples

>>> import shap
>>> import seaborn as sns
>>> from aggmap.aggmodel.explainer import shapley_explainer
>>> ## shapley explainer
>>> shap_explainer = shapley_explainer(estimator, mp)
>>> global_imp_shap = shap_explainer.global_explain(clf.X_)
>>> local_imp_shap = shap_explainer.local_explain(clf.X_[[0]])
>>> ## S-map of shapley explainer
>>> sns.heatmap(local_imp_shap.shapley_importance_class_1.values.reshape(mp.fmap_shape),
...             cmap = 'rainbow')
>>> ## shapley plot
>>> shap.summary_plot(shap_explainer.shap_values,
...                   feature_names = shap_explainer.feature_names)  # global summary plot; pass plot_type='bar' for a bar chart
>>> shap.initjs()
>>> shap.force_plot(shap_explainer.explainer.expected_value[1],
...                 shap_explainer.shap_values[1], feature_names = shap_explainer.feature_names)
global_explain(X=None, nsamples='auto', **args)[source]

Explanation of many samples.

Parameters:
  • X (None or 4D array, where the shape is (n, w, h, c)) – the 4D array of AggMap multi-channel fmaps. Note that if X is None, estimator.X_ is used instead, i.e., the training set of the estimator is explained.

  • nsamples ({'auto', int}) – Number of times to re-evaluate the model when explaining each prediction. More samples lead to lower variance estimates of the SHAP values. The “auto” setting uses nsamples = 2 * X.shape[1] + 2048

  • args (other parameters of the shap_values method of shap.KernelExplainer) –

local_explain(X=None, idx=0, nsamples='auto', **args)[source]

Explanation of one sample only.

Parameters:
  • X (None or 4D array, where the shape is (n, w, h, c)) – the 4D array of AggMap multi-channel fmaps. Note that if X is None, estimator.X_[[idx]] is used instead, i.e., the first sample is explained when idx=0.

  • nsamples ({'auto', int}) – Number of times to re-evaluate the model when explaining each prediction. More samples lead to lower variance estimates of the SHAP values. The “auto” setting uses nsamples = 2 * X.shape[1] + 2048

  • args (other parameters of the shap_values method of shap.KernelExplainer) –

class aggmap.aggmodel.explainer.simply_explainer(estimator, mp, backgroud='min', apply_logrithm=False, apply_smoothing=False, kernel_size=5, sigma=1.0)[source]

Simply-explainer for model explanation.

Parameters:
  • estimator (object) – model with a predict or predict_proba method

  • mp (object) – aggmap object

  • backgroud ({'min', 'global_min', 'zeros'}, default: 'min') – if 'min', use the per-feature min value of the training set; if 'global_min', use the min value of the whole training set; if 'zeros', use all zeros as the background data.

  • apply_logrithm (bool, default: False) – apply a logarithm transformation to the feature importance scores

  • apply_smoothing (bool, default: False) – apply Gaussian smoothing to the feature importance scores (saliency map)

  • kernel_size (int, default: 5.) – the kernel size for the smoothing

  • sigma (float, default: 1.0.) – the sigma for the smoothing.

Examples

>>> import seaborn as sns
>>> from aggmap.aggmodel.explainer import simply_explainer
>>> simp_explainer = simply_explainer(estimator, mp)
>>> global_imp_simp = simp_explainer.global_explain(clf.X_, clf.y_)
>>> local_imp_simp = simp_explainer.local_explain(clf.X_[[0]], clf.y_[[0]])
>>> ## S-map of simply explainer
>>> sns.heatmap(local_imp_simp.simply_importance.values.reshape(mp.fmap_shape),
...             cmap = 'rainbow')
global_explain(X=None, y=None)[source]

Explanation of many samples.

Parameters:
  • X (None or 4D array, where the shape is (n, w, h, c)) – the 4D array of AggMap multi-channel fmaps

  • y (None or 2D array, where the shape is (n, class_num)) – the true labels

Note that if X and y are None, estimator.X_ and estimator.y_ are used instead, i.e., the training set of the estimator is explained.

local_explain(X=None, y=None, idx=0)[source]

Explanation of one sample only.

Parameters:
  • X (None or 4D array, where the shape is (1, w, h, c)) –

  • y (the true label, None or 2D array, where the shape is (1, class_num)) –

  • idx (int) – index of the sample to interpret. Note that if X and y are None, estimator.X_[[idx]] and estimator.y_[[idx]] are used instead, i.e., the first sample is explained when idx=0.

Return type:

Feature importance of the current class

aggmap.utils.vismap.plot_grid(mp, htmlpath='./', htmlname=None, enabled_data_labels=False)[source]

mp: the mp object
htmlpath: the figure path

aggmap.utils.vismap.plot_scatter(mp, htmlpath='./', htmlname=None, radius=2, enabled_data_labels=False)[source]

mp: the mp object
htmlpath: the figure path, not including the 'html' prefix
htmlname: the file name
radius: int, default: 2, the radius of the scatter dots
