aggmap package

Submodules

aggmap.AggMapNet module

Created on Sun Aug 16 17:10:53 2020

@author: wanxiang.shen@u.nus.edu

class aggmap.AggMapNet.MultiClassEstimator(epochs=200, conv1_kernel_size=13, dense_layers=[128], dense_avf='relu', batch_size=128, lr=0.0001, loss='categorical_crossentropy', batch_norm=False, n_inception=2, dropout=0.0, monitor='val_loss', metric='ACC', patience=10000, verbose=0, last_avf='softmax', random_state=32, gpuid=0)[source]

Bases: BaseEstimator, ClassifierMixin

An AggMap CNN MultiClass estimator (each sample belongs to only one class)

Parameters:
  • epochs (int, default = 200) – A parameter used for training epochs.

  • conv1_kernel_size (int, default = 13) – A parameter used for the kernel size of the first convolutional layer.

  • dense_layers (list, default = [128]) – A parameter used for the dense layers.

  • batch_size (int, default: 128) – A parameter used for the batch size.

  • lr (float, default: 1e-4) – A parameter used for the learning rate.

  • loss (string or function, default: 'categorical_crossentropy') – A parameter used for the loss function

  • batch_norm (bool, default: False) – batch normalization after convolution layers.

  • n_inception (int, default:2) – Number of the inception layers.

  • dense_avf (str, default is 'relu') – activation function in the dense layers.

  • dropout (float, default: 0) – A parameter used for the dropout of the dense layers.

  • monitor (str, default: 'val_loss') – {‘val_loss’, ‘val_metric’}, a monitor for model selection.

  • metric (str, default: 'ACC') – {‘ROC’, ‘ACC’, ‘PRC’}, the metric used for performance evaluation.

  • patience (int, default: 10000) – A parameter used for early stopping.

  • gpuid (int, default: 0) – A parameter used to select a specific GPU card.

  • verbose (int, default: 0) – if positive, the log information of AggMapNet will be printed; if negative, the log information of the original model will be printed.

  • random_state (int, default: 32) – Random seed.

Examples

>>> from aggmap import AggModel
>>> clf = AggModel.MultiClassEstimator()
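
A fuller end-to-end sketch (not part of the original docstring), assuming dfx is a pandas DataFrame of tabular features and y_onehot is a one-hot encoded label array; the AggMap object mp is documented in the aggmap.map module below:

>>> # hedged sketch: dfx and y_onehot are assumed to exist
>>> from aggmap.map import AggMap
>>> from aggmap.AggMapNet import MultiClassEstimator
>>> mp = AggMap(dfx, metric='correlation')       # pairwise feature distances
>>> mp.fit(cluster_channels=5)                   # restructure features into a 2D map
>>> X = mp.batch_transform(dfx.values)           # 4D array: (n_samples, w, h, channels)
>>> clf = MultiClassEstimator(epochs=50, gpuid=0)
>>> clf.fit(X, y_onehot)
>>> y_prob = clf.predict_proba(X)                # (n_samples, n_classes)
>>> y_pred = clf.predict(X)
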
property clean
explain_model(mp, X, y, binary_task=False, explain_format='global', apply_logrithm=False, apply_smoothing=False, kernel_size=3, sigma=1.2)[source]

Feature importance calculation

Parameters:
  • mp (aggmap object) –

  • X (training or test set X arrays) –

  • y (training or test set y arrays) –

  • binary_task ({True, False}) – whether the task is binary; if True, the feature importance will be calculated for one class only

  • explain_format ({'local', 'global'}, default: 'global') – local or global feature importance; if 'local', X must contain a single sample

  • apply_logrithm ({True, False}, default: False) – whether to apply a logarithm transformation to the importance values

  • apply_smoothing ({True, False}, default: False) – whether to apply a smoothing transformation to the importance values

  • kernel_size (odd number, the kernel size used for the smoothing) –

  • sigma (float, sigma for Gaussian smoothing) –

Return type:

DataFrame of feature importance
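
A hedged usage sketch for explain_model, assuming clf has been fitted on feature maps produced by the AggMap object mp, and X, y are the corresponding arrays:

>>> # global importance: one importance value per feature point, returned as a DataFrame
>>> df_imp = clf.explain_model(mp, X, y, explain_format='global')
>>> # local importance: X must contain a single sample
>>> df_local = clf.explain_model(mp, X[:1], y[:1], explain_format='local',
...                              apply_smoothing=True, kernel_size=3, sigma=1.2)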

fit(X, y, X_valid=None, y_valid=None, class_weight=None)[source]
get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

load_model(model_path, gpuid=None)[source]
plot_model(to_file='model.png', show_shapes=True, show_layer_names=True, rankdir='TB', expand_nested=False, dpi=96)[source]
predict(X)[source]
predict_proba(X)[source]

Probability estimates. The returned estimates for all classes are ordered by the label of classes. For this multi-class estimator, the softmax output layer (last_avf='softmax') is used to obtain the predicted probability of each class.

Parameters:

X (array-like of shape (n_samples, n_features)) – Vector to be scored, where n_samples is the number of samples and n_features is the number of features.

Returns:

T – Returns the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.

Return type:

array-like of shape (n_samples, n_classes)

save_model(model_path)[source]
score(X, y, scoring='accuracy', sample_weight=None)[source]

Returns the score using the scoring option on the given test data and labels.

Parameters:
Returns:

score – Score of self.predict(X) wrt. y.

Return type:

float

set_params(**parameters)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance
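
Since the estimator follows the scikit-learn parameter interface, get_params and set_params can be used to inspect and update the hyperparameters programmatically; a brief sketch:

>>> clf = MultiClassEstimator()
>>> clf.get_params()['lr']
0.0001
>>> clf.set_params(lr=0.001, epochs=100, dense_layers=[256, 128])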

class aggmap.AggMapNet.MultiLabelEstimator(epochs=200, conv1_kernel_size=13, dense_layers=[128], dense_avf='relu', batch_size=128, lr=0.0001, loss=<function sigmoid_cross_entropy_with_logits_v2>, batch_norm=False, n_inception=2, dropout=0.0, monitor='val_loss', metric='ROC', patience=10000, verbose=0, random_state=32, gpuid=0)[source]

Bases: BaseEstimator, ClassifierMixin

An AggMap CNN MultiLabel estimator (each sample may belong to more than one class)

Parameters:
  • epochs (int, default = 200) – A parameter used for training epochs.

  • conv1_kernel_size (int, default = 13) – A parameter used for the kernel size of the first convolutional layer.

  • dense_layers (list, default = [128]) – A parameter used for the dense layers.

  • batch_size (int, default: 128) – A parameter used for the batch size.

  • lr (float, default: 1e-4) – A parameter used for the learning rate.

  • loss (string or function, default: tf.nn.sigmoid_cross_entropy_with_logits) – A parameter used for the loss function.

  • batch_norm (bool, default: False) – batch normalization after convolution layers.

  • n_inception (int, default:2) – Number of the inception layers.

  • dense_avf (str, default is 'relu') – activation function in the dense layers.

  • dropout (float, default: 0) – A parameter used for the dropout of the dense layers, such as 0.1, 0.3, 0.5.

  • monitor (str, default: 'val_loss') – {‘val_loss’, ‘val_metric’}, a monitor for model selection.

  • metric (str, default: 'ROC') – {‘ROC’, ‘ACC’, ‘PRC’}, the metric used for performance evaluation.

  • patience (int, default: 10000) – A parameter used for early stopping.

  • gpuid (int, default: 0) – A parameter used to select a specific GPU card.

  • verbose (int, default: 0) – if positive, the log information of AggMapNet will be printed; if negative, the log information of the original model will be printed.

  • random_state (int, default: 32) – Random seed

  • name (str) – Model name

Examples

>>> from aggmap import AggModel
>>> clf = AggModel.MultiLabelEstimator()
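
A hedged sketch for the multi-label case, assuming X is a 4D feature-map array from AggMap.batch_transform and Y is a binary indicator matrix of shape (n_samples, n_labels) in which a row may contain several 1s:

>>> from aggmap.AggMapNet import MultiLabelEstimator
>>> clf = MultiLabelEstimator(epochs=50, metric='ROC')
>>> clf.fit(X, Y, X_valid=X_val, y_valid=Y_val)   # X_val, Y_val: optional validation data
>>> Y_prob = clf.predict_proba(X)                 # (n_samples, n_labels)
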
property clean
explain_model(mp, X, y, explain_format='global', apply_logrithm=False, apply_smoothing=False, kernel_size=3, sigma=1.2)[source]

Feature importance calculation.

Parameters:
  • mp (aggmap object) –

  • X (training or test set X arrays) –

  • y (training or test set y arrays) –

  • explain_format ({'local', 'global'}, default: 'global') – local or global feature importance; if 'local', X must contain a single sample.

  • apply_logrithm ({True, False}, default: False) – whether to apply a logarithm transformation to the importance values.

  • apply_smoothing ({True, False}, default: False) – whether to apply a smoothing transformation to the importance values.

  • kernel_size (odd number, the kernel size used for the smoothing.) –

  • sigma (float, sigma for Gaussian smoothing.) –

Return type:

DataFrame of feature importance

fit(X, y, X_valid=None, y_valid=None)[source]
get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

load_model(model_path, gpuid=None)[source]
plot_model(to_file='model.png', show_shapes=True, show_layer_names=True, rankdir='TB', expand_nested=False, dpi=96)[source]
predict(X)[source]
predict_proba(X)[source]

Probability estimates. The returned estimates for all classes are ordered by the label of classes.

Parameters:

X (array-like of shape (n_samples, n_features)) – Vector to be scored, where n_samples is the number of samples and n_features is the number of features.

Returns:

T – Returns the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.

Return type:

array-like of shape (n_samples, n_classes)

save_model(model_path)[source]
score(X, y, scoring='accuracy', sample_weight=None)[source]

Returns the score using the scoring option on the given test data and labels.

Parameters:
Returns:

score – Score of self.predict(X) wrt. y.

Return type:

float

set_params(**parameters)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

class aggmap.AggMapNet.RegressionEstimator(epochs=200, conv1_kernel_size=13, dense_layers=[128], dense_avf='relu', batch_size=128, lr=0.0001, loss='mse', batch_norm=False, n_inception=2, dropout=0.0, monitor='val_loss', metric='r2', patience=10000, verbose=0, random_state=32, gpuid=0)[source]

Bases: BaseEstimator, RegressorMixin

An AggMap CNN Regression estimator (for continuous targets)

Parameters:
  • epochs (int, default = 200) – A parameter used for training epochs.

  • conv1_kernel_size (int, default = 13) – A parameter used for the kernel size of the first convolutional layer.

  • dense_layers (list, default = [128]) – A parameter used for the dense layers.

  • batch_size (int, default: 128) – A parameter used for the batch size.

  • lr (float, default: 1e-4) – A parameter used for the learning rate.

  • loss (string or function, default: 'mse') – A parameter used for the loss function

  • batch_norm (bool, default: False) – batch normalization after convolution layers.

  • n_inception (int, default:2) – Number of the inception layers.

  • dense_avf (str, default is 'relu') – activation function in the dense layers.

  • dropout (float, default: 0) – A parameter used for the dropout of the dense layers.

  • monitor (str, default: 'val_loss') – {‘val_loss’, ‘val_r2’}, a monitor for model selection

  • metric (str, default: 'r2') – {‘r2’, ‘rmse’}, the metric used for performance evaluation.

  • patience (int, default: 10000) – A parameter used for early stopping.

  • gpuid (int, default: 0) – A parameter used to select a specific GPU card.

  • verbose (int, default: 0) – if positive, the log information of AggMapNet will be printed; if negative, the log information of the original model will be printed.

  • random_state (int, default: 32) – random seed.

Examples

>>> from aggmap import AggModel
>>> clf = AggModel.RegressionEstimator()
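
A hedged sketch for the regression case, assuming X is a 4D feature-map array from AggMap.batch_transform and y is a continuous target array:

>>> from aggmap.AggMapNet import RegressionEstimator
>>> reg = RegressionEstimator(epochs=50, metric='r2')
>>> reg.fit(X, y)
>>> y_pred = reg.predict(X)
>>> r2 = reg.score(X, y, scoring='r2')
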
property clean
explain_model(mp, X, y, explain_format='global', apply_logrithm=False, apply_smoothing=False, kernel_size=3, sigma=1.2)[source]

Feature importance calculation

Parameters:
  • mp (aggmap object) –

  • X (training or test set X arrays) –

  • y (training or test set y arrays) –

  • explain_format ({'local', 'global'}, default: 'global') – local or global feature importance; if 'local', X must contain a single sample

  • apply_logrithm ({True, False}, default: False) – whether to apply a logarithm transformation to the importance values

  • apply_smoothing ({True, False}, default: False) – whether to apply a smoothing transformation to the importance values

  • kernel_size (odd number, the kernel size used for the smoothing) –

  • sigma (float, sigma for Gaussian smoothing) –

Return type:

DataFrame of feature importance

fit(X, y, X_valid=None, y_valid=None)[source]
get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

load_model(model_path, gpuid=None)[source]
plot_model(to_file='model.png', show_shapes=True, show_layer_names=True, rankdir='TB', expand_nested=False, dpi=96)[source]
predict(X)[source]
Parameters:

X (array-like of shape (n_samples, n_features_w, n_features_h, n_features_c)) – Samples to be scored, where n_samples is the number of samples and (n_features_w, n_features_h, n_features_c) is the shape of the transformed feature maps (width, height, channels).

Returns:

T – Returns the predicted values

Return type:

array-like of shape (n_samples, n_classes)

save_model(model_path)[source]
score(X, y, scoring='r2', sample_weight=None)[source]

Returns the score using the scoring option on the given test data and labels.

Parameters:
Returns:

score – Score of self.predict(X) wrt. y.

Return type:

float

set_params(**parameters)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

aggmap.AggMapNet.clean(clf)[source]
aggmap.AggMapNet.load_model(model_path, gpuid=None)[source]

gpuid: load the model onto a specific GPU: {None, 0, 1, 2, 3, ...}

aggmap.AggMapNet.save_model(model, model_path)[source]
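
A hedged sketch of the module-level persistence helpers, assuming clf is a fitted estimator and './aggmapnet_model' is a hypothetical output path:

>>> from aggmap import AggMapNet
>>> AggMapNet.save_model(clf, './aggmapnet_model')
>>> clf2 = AggMapNet.load_model('./aggmapnet_model', gpuid=0)   # reload onto GPU 0
>>> AggMapNet.clean(clf)   # clean(clf): presumably frees the fitted model's resources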

aggmap.map module

Created on Sun Aug 25 20:29:36 2019

@author: wanxiang.shen@u.nus.edu

main aggmap code

class aggmap.map.AggMap(dfx, metric='correlation', by_scipy=False, n_cpus=16, info_distance=None)[source]

Bases: Base

The feature restructuring class AggMap

Parameters:
  • dfx (pandas DataFrame) – Input data frame.

  • metric (string, default: 'correlation') – measurement of feature distance; supported: {‘cosine’, ‘correlation’, ‘euclidean’, ‘jaccard’, ‘hamming’, ‘dice’}

  • info_distance (numpy array, default: None) – a vector-form (condensed) distance vector of the feature points; its shape should be (n*(n-1)/2,), where n is the number of features. It is useful when you have your own vector-form distance to pass (see the sketch after this list).

  • by_scipy (bool, default: False) – calculate the distance using the scipy pdist function. It can be useful when dfx.shape[1] > 20000, i.e., when the number of features is very large; using pdist speeds up the distance calculation.

  • n_cpus (int, default: 16) – number of cpu cores to use to calculate the distance.
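
A hedged construction sketch, assuming dfx is a pandas DataFrame of shape (n_samples, n_features); the second call passes a hypothetical precomputed condensed distance vector:

>>> import numpy as np
>>> import pandas as pd
>>> from scipy.spatial.distance import pdist
>>> from aggmap.map import AggMap
>>> dfx = pd.DataFrame(np.random.rand(100, 50))       # toy data: 100 samples, 50 features
>>> mp = AggMap(dfx, metric='correlation', n_cpus=4)
>>> # equivalently, supply your own condensed distance vector of length n*(n-1)/2
>>> d = pdist(dfx.values.T, metric='correlation')     # distances between feature columns
>>> mp2 = AggMap(dfx, info_distance=d)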

batch_transform(array_2d, scale=True, scale_method='minmax', n_jobs=4, fillnan=0)[source]
Parameters:
  • array_2d (2D numpy array of feature points, M (samples) x N (feature points)) –

  • scale (bool, if True, we will apply MinMax scaling by the precomputed values) –

  • scale_method ({'minmax', 'standard'}) –

  • n_jobs (number of parallel jobs) –

  • fillnan (fill nan value, default: 0) –
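
A hedged usage sketch, assuming mp has already been fitted (see fit below) and dfx is the original feature DataFrame:

>>> X = mp.batch_transform(dfx.values, scale=True, scale_method='minmax', n_jobs=4)
>>> X.shape   # 4D: (n_samples, fmap_w, fmap_h, n_channels)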

copy()[source]

copy self

fit(feature_group_list=[], cluster_channels=5, var_thr=-1, split_channels=True, fmap_shape=None, emb_method='umap', min_dist=0.1, n_neighbors=15, a=1.576943460405378, b=0.8950608781227859, verbose=2, random_state=32, group_color_dict={}, lnk_method='complete', **kwargs)[source]
Parameters:
  • feature_group_list (list of the group name for each feature point) –

  • cluster_channels (int, number of the channels(clusters) if feature_group_list is empty) –

  • var_thr (float, default is -1, meaning that a feature will be included only if its variance is larger than this value. Since some features have very low variance, they can be removed by increasing this threshold) –

  • split_channels (bool, if True, outputs will split into various channels using the types of feature) –

  • fmap_shape (None or tuple, size of the feature map; if None, the size of the feature map will be calculated automatically) –

  • emb_method ({'umap', 'tsne', 'mds', 'isomap', 'random', 'lle', 'se'}, algorithm used to embed the high-dimensional features into 2D) –

  • min_dist (float, UMAP parameter for the effective minimum distance between embedded points.) –

  • n_neighbors (int, UMAP parameter controlling the embedding. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved.) –

  • a (float, UMAP parameter controlling the embedding. If None, it will automatically be determined by min_dist and spread.) –

  • b (float, UMAP parameter controlling the embedding. If None, it will automatically be determined by min_dist and spread.) –

  • group_color_dict (dict of the group colors, keys are the group names, values are the colors) –

  • lnk_method ({'complete', 'average', 'single', 'weighted', 'centroid'}, linkage method) –

  • kwargs (the extra parameters for the corresponding embedding method) –
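
A hedged fitting sketch following the construction example above; the group names and colors are hypothetical:

>>> # cluster the features into 5 channels automatically (no group list supplied)
>>> mp.fit(cluster_channels=5, emb_method='umap', verbose=0)
>>> # or assign an explicit group to each feature column, with custom colors
>>> groups = ['metabolite'] * 20 + ['protein'] * 30   # one group label per feature
>>> colors = {'metabolite': '#1300ff', 'protein': '#ff0c00'}
>>> mp.fit(feature_group_list=groups, group_color_dict=colors, lnk_method='complete')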

load(filename)[source]

load self

plot_grid(htmlpath='./', htmlname=None, enabled_data_labels=False)[source]

Grid plot

plot_scatter(htmlpath='./', htmlname=None, radius=2, enabled_data_labels=False)[source]

Scatter plot; radius: the size of the scatter points, must be an int

plot_tree(figsize=(16, 8), add_leaf_labels=True, leaf_font_size=18, leaf_rotation=90)[source]

Diagram tree plot

refit_c(cluster_channels=10, lnk_method='complete', group_color_dict={})[source]

re-fit the aggmap object to update the number of channels

Parameters:
  • cluster_channels (int, number of the channels(clusters)) –

  • group_color_dict (dict of the group colors, keys are the group names, values are the colors) –

  • lnk_method ({'complete', 'average', 'single', 'weighted', 'centroid'}, linkage method) –

save(filename)[source]

save self

to_nwk_tree(treefile='mytree', leaf_names=None)[source]

convert the mp object to a Newick tree and the data files for submission to the iTOL server

transform(arr_1d, scale=True, scale_method='minmax', fillnan=0)[source]
Parameters:
  • arr_1d (1d numpy array feature points) –

  • scale (bool, if True, we will apply MinMax scaling by the precomputed values) –

  • scale_method ({'minmax', 'standard'}) –

  • fillnan (fill nan value, default: 0) –

transform_mpX_to_df(X)[source]

input 4D X, output 2D dataframe
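
A hedged sketch contrasting single-sample and batch transformation, assuming mp is fitted:

>>> x_map = mp.transform(dfx.values[0], scale=True)   # one 1D feature vector -> one feature map
>>> X_all = mp.batch_transform(dfx.values)            # all samples at once (4D)
>>> df_back = mp.transform_mpX_to_df(X_all)           # 4D feature maps back to a 2D DataFrame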

class aggmap.map.Base[source]

Bases: object

MinMaxScaleClip(x, xmin, xmax)[source]
StandardScaler(x, xmean, xstd)[source]
class aggmap.map.Random_2DEmbedding(random_state=123, n_components=2)[source]

Bases: object

fit(X)[source]

aggmap.show module

Created on Tue Aug 18 13:01:00 2020

@author: SHEN WANXIANG

aggmap.show.imshow(x_arr, ax, mode='dark', color_list=['#1300ff', '#ff0c00', '#25ff00', '#d000ff', '#e2ff00', '#00fff6', '#ff8800', '#fccde5', '#178b66', '#8a0075'], x_max=255, vmin=-1, vmax=1)[source]
aggmap.show.imshow_wrap(x, mode='dark', color_list=['#1300ff', '#ff0c00', '#25ff00', '#d000ff', '#e2ff00', '#00fff6', '#ff8800', '#fccde5', '#178b66', '#8a0075'], x_max=255, vmin=-1, vmax=1)[source]
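
A hedged plotting sketch, assuming X is the 4D output of batch_transform and x is one transformed sample; imshow_wrap is taken here to be the convenience form that manages its own axes, while imshow draws onto a given matplotlib axes:

>>> import matplotlib.pyplot as plt
>>> from aggmap.show import imshow, imshow_wrap
>>> x = X[0]                      # one sample: (fmap_w, fmap_h, n_channels)
>>> imshow_wrap(x, mode='dark')
>>> fig, ax = plt.subplots()
>>> imshow(x, ax, mode='dark')    # draw the multi-channel feature map onto ax
>>> plt.show()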

Module contents