aggmap package
Subpackages
Submodules
aggmap.AggMapNet module
Created on Sun Aug 16 17:10:53 2020
@author: wanxiang.shen@u.nus.edu
- class aggmap.AggMapNet.MultiClassEstimator(epochs=200, conv1_kernel_size=13, dense_layers=[128], dense_avf='relu', batch_size=128, lr=0.0001, loss='categorical_crossentropy', batch_norm=False, n_inception=2, dropout=0.0, monitor='val_loss', metric='ACC', patience=10000, verbose=0, last_avf='softmax', random_state=32, gpuid=0)[source]
Bases: BaseEstimator, ClassifierMixin
An AggMap CNN MultiClass estimator (each sample belongs to only one class)
- Parameters:
epochs (int, default = 200) – A parameter used for training epochs.
conv1_kernel_size (int, default = 13) – A parameter used for the kernel size of the first convolutional layer.
dense_layers (list, default = [128]) – A parameter used for the dense layers.
batch_size (int, default: 128) – A parameter used for the batch size.
lr (float, default: 1e-4) – A parameter used for the learning rate.
loss (string or function, default: 'categorical_crossentropy') – A parameter used for the loss function
batch_norm (bool, default: False) – batch normalization after convolution layers.
n_inception (int, default:2) – Number of the inception layers.
dense_avf (str, default is 'relu') – activation function in the dense layers.
dropout (float, default: 0) – A parameter used for the dropout of the dense layers.
monitor (str, default: 'val_loss') – {‘val_loss’, ‘val_metric’}, a monitor for model selection.
metric (str, default: 'ACC') – {‘ROC’, ‘ACC’, ‘PRC’}, the evaluation metric.
patience (int, default: 10000) – A parameter used for early stopping.
gpuid (int, default: 0) – A parameter used for specific gpu card.
verbose (int, default: 0) – If positive, the log information of AggMapNet is printed; if negative, the log information of the original model is printed.
random_state (int, default: 32) – Random seed.
Examples
>>> from aggmap import AggMapNet
>>> clf = AggMapNet.MultiClassEstimator()
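The monitor and patience parameters describe an early-stopping style of model selection. The following is a minimal sketch of that idea in plain Python, not AggMapNet's actual training loop; the function name best_epoch_with_patience and the loss values are hypothetical.

```python
def best_epoch_with_patience(val_losses, patience):
    """Return (best_epoch, stopped_epoch) for per-epoch validation losses.

    Training stops once the monitored loss has not improved for `patience`
    consecutive epochs. This is the usual early-stopping rule and only an
    assumption about how `monitor` and `patience` interact in AggMapNet.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best value of the monitored quantity: reset the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.50]
best, stopped = best_epoch_with_patience(losses, patience=3)
```

Under this rule, a patience larger than the number of epochs (as with the defaults of 10000 and 200) never triggers a stop, so the best epoch is simply selected over the whole run.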
- property clean
- explain_model(mp, X, y, binary_task=False, explain_format='global', apply_logrithm=False, apply_smoothing=False, kernel_size=3, sigma=1.2)[source]
Feature importance calculation
- Parameters:
mp (aggmap object) –
X (array) – Training or test set X arrays.
y (array) – Training or test set y arrays.
binary_task ({True, False}, default: False) – Whether the task is binary. If True, the feature importance is calculated for one class only.
explain_format ({'local', 'global'}, default: 'global') – Local or global feature importance. If 'local', X must be a single sample.
apply_logrithm ({True, False}, default: False) – Whether to apply a logarithm transformation to the importance values.
apply_smoothing ({True, False}, default: False) – Whether to apply a smoothing transformation to the importance values.
kernel_size (odd int, default: 3) – The kernel size used for the smoothing.
sigma (float, default: 1.2) – Sigma for the Gaussian smoothing.
- Return type:
DataFrame of feature importance
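The apply_logrithm, apply_smoothing, kernel_size and sigma options suggest a logarithm transform followed by Gaussian smoothing of the importance grid. Below is a minimal sketch under that assumption; the helper names, the use of log(1 + v), and the zero padding at the borders are illustrative choices and are not taken from the library.

```python
import math


def gaussian_kernel(kernel_size=3, sigma=1.2):
    """Build a normalized 2D Gaussian kernel; kernel_size must be odd."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    half = kernel_size // 2
    raw = [[math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
            for dx in range(-half, half + 1)]
           for dy in range(-half, half + 1)]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]


def smooth(grid, kernel):
    """Convolve a 2D list with the kernel, treating out-of-range cells as 0."""
    h, w, half = len(grid), len(grid[0]), len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di in range(-half, half + 1):
                for dj in range(-half, half + 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        out[i][j] += grid[y][x] * kernel[di + half][dj + half]
    return out


# Hypothetical importance values on a 3 x 3 feature map.
importance = [[0.0, 0.0, 0.0],
              [0.0, 9.0, 0.0],
              [0.0, 0.0, 0.0]]
# Assumed log transform: log(1 + v) keeps zero importance at zero.
logged = [[math.log1p(v) for v in row] for row in importance]
smoothed = smooth(logged, gaussian_kernel(kernel_size=3, sigma=1.2))
```

Smoothing spreads the importance of a single strong feature point over its neighbours on the map, which is why an odd kernel size (with a well-defined centre cell) is required.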
- plot_model(to_file='model.png', show_shapes=True, show_layer_names=True, rankdir='TB', expand_nested=False, dpi=96)[source]
- predict_proba(X)[source]
Probability estimates. The returned estimates for all classes are ordered by the label of classes. For a multi_class problem, if multi_class is set to be “multinomial” the softmax function is used to find the predicted probability of each class.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Vector to be scored, where n_samples is the number of samples and n_features is the number of features.
- Returns:
T – Returns the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.
- Return type:
array-like of shape (n_samples, n_classes)
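Because the probability columns follow the order of self.classes_, class labels can be recovered by taking the position of the largest probability in each row. A small sketch of that step in plain Python; the class names and probability values are hypothetical, and labels_from_proba is not part of aggmap.

```python
def labels_from_proba(proba, classes):
    """Pick, for each row, the class whose probability is highest."""
    labels = []
    for row in proba:
        best_index = max(range(len(row)), key=row.__getitem__)
        labels.append(classes[best_index])
    return labels


classes_ = ["healthy", "stage_1", "stage_2"]   # hypothetical class labels
proba = [[0.7, 0.2, 0.1],                      # one row per sample
         [0.1, 0.3, 0.6]]
predicted = labels_from_proba(proba, classes_)
```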
- score(X, y, scoring='accuracy', sample_weight=None)[source]
Returns the score using the scoring option on the given test data and labels.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,)) – True labels for X.
scoring (str, please refer to: https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) –
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score – Score of self.predict(X) with respect to y.
- Return type:
float
- set_params(**parameters)[source]
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
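The double-underscore naming for nested parameters can be illustrated with a small helper that groups keys by component. This is a sketch of the naming convention only, not the code that set_params runs; split_nested_params and the step name clf are hypothetical.

```python
def split_nested_params(params):
    """Group {'component__parameter': value} keys by component.

    Keys without a double underscore are treated as parameters of the
    estimator itself and are stored under the empty string.
    """
    grouped = {}
    for key, value in params.items():
        component, sep, name = key.partition("__")
        if not sep:                      # plain parameter such as 'lr'
            component, name = "", key
        grouped.setdefault(component, {})[name] = value
    return grouped


grouped = split_nested_params({"lr": 1e-3, "clf__epochs": 50, "clf__dropout": 0.3})
```

Here lr would be set on the estimator itself, while epochs and dropout would be forwarded to the component named clf.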
- class aggmap.AggMapNet.MultiLabelEstimator(epochs=200, conv1_kernel_size=13, dense_layers=[128], dense_avf='relu', batch_size=128, lr=0.0001, loss=<function sigmoid_cross_entropy_with_logits_v2>, batch_norm=False, n_inception=2, dropout=0.0, monitor='val_loss', metric='ROC', patience=10000, verbose=0, random_state=32, gpuid=0)[source]
Bases: BaseEstimator, ClassifierMixin
An AggMap CNN MultiLabel estimator (each sample may belong to more than one class)
- Parameters:
epochs (int, default = 200) – A parameter used for training epochs.
conv1_kernel_size (int, default = 13) – A parameter used for the kernel size of the first convolutional layer.
dense_layers (list, default = [128]) – A parameter used for the dense layers.
batch_size (int, default: 128) – A parameter used for the batch size.
lr (float, default: 1e-4) – A parameter used for the learning rate.
loss (string or function, default: tf.nn.sigmoid_cross_entropy_with_logits) – A parameter used for the loss function.
batch_norm (bool, default: False) – batch normalization after convolution layers.
n_inception (int, default:2) – Number of the inception layers.
dense_avf (str, default is 'relu') – activation function in the dense layers.
dropout (float, default: 0) – A parameter used for the dropout of the dense layers, such as 0.1, 0.3, 0.5.
monitor (str, default: 'val_loss') – {‘val_loss’, ‘val_metric’}, a monitor for model selection.
metric (str, default: 'ROC') – {‘ROC’, ‘ACC’, ‘PRC’}, the evaluation metric.
patience (int, default: 10000) – A parameter used for early stopping.
gpuid (int, default: 0) – A parameter used for specific gpu card.
verbose (int, default: 0) – If positive, the log information of AggMapNet is printed; if negative, the log information of the original model is printed.
random_state (int, default: 32) – Random seed
name (str) – Model name
Examples
>>> from aggmap import AggMapNet
>>> clf = AggMapNet.MultiLabelEstimator()
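The default loss listed above is the sigmoid cross-entropy on logits, which scores every label independently and therefore suits multi-hot targets. The sketch below re-implements the element-wise formula in plain Python for illustration; it does not call TensorFlow, and the label and logit values are made up.

```python
import math


def sigmoid_cross_entropy_with_logits(labels, logits):
    """Element-wise loss max(x, 0) - x * z + log(1 + exp(-|x|)).

    This is the numerically stable form of
    -z * log(sigmoid(x)) - (1 - z) * log(1 - sigmoid(x)).
    """
    return [max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))
            for z, x in zip(labels, logits)]


# One sample with three labels; the sample carries labels 0 and 2 at once.
y_true = [1.0, 0.0, 1.0]
logits = [2.0, -1.0, 0.0]
losses = sigmoid_cross_entropy_with_logits(y_true, logits)
```

In contrast to a softmax over classes, the per-label probabilities produced this way do not have to sum to one.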
- property clean
- explain_model(mp, X, y, explain_format='global', apply_logrithm=False, apply_smoothing=False, kernel_size=3, sigma=1.2)[source]
Feature importance calculation.
- Parameters:
mp (aggmap object) –
X (array) – Training or test set X arrays.
y (array) – Training or test set y arrays.
explain_format ({'local', 'global'}, default: 'global') – Local or global feature importance. If 'local', X must be a single sample.
apply_logrithm ({True, False}, default: False) – Whether to apply a logarithm transformation to the importance values.
apply_smoothing ({True, False}, default: False) – Whether to apply a smoothing transformation to the importance values.
kernel_size (odd int, default: 3) – The kernel size used for the smoothing.
sigma (float, default: 1.2) – Sigma for the Gaussian smoothing.
- Return type:
DataFrame of feature importance
- plot_model(to_file='model.png', show_shapes=True, show_layer_names=True, rankdir='TB', expand_nested=False, dpi=96)[source]
- predict_proba(X)[source]
Probability estimates. The returned estimates for all classes are ordered by the label of classes. For a multi_class problem, if multi_class is set to be “multinomial” the softmax function is used to find the predicted probability of each class.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Vector to be scored, where n_samples is the number of samples and n_features is the number of features.
- Returns:
T – Returns the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.
- Return type:
array-like of shape (n_samples, n_classes)
- score(X, y, scoring='accuracy', sample_weight=None)[source]
Returns the score using the scoring option on the given test data and labels.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,)) – True labels for X.
scoring (str, please refer to: https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) –
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score – Score of self.predict(X) with respect to y.
- Return type:
float
- set_params(**parameters)[source]
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- class aggmap.AggMapNet.RegressionEstimator(epochs=200, conv1_kernel_size=13, dense_layers=[128], dense_avf='relu', batch_size=128, lr=0.0001, loss='mse', batch_norm=False, n_inception=2, dropout=0.0, monitor='val_loss', metric='r2', patience=10000, verbose=0, random_state=32, gpuid=0)[source]
Bases: BaseEstimator, RegressorMixin
An AggMap CNN Regression estimator (predicts continuous target values)
- Parameters:
epochs (int, default = 200) – A parameter used for training epochs.
conv1_kernel_size (int, default = 13) – A parameter used for the kernel size of the first convolutional layer.
dense_layers (list, default = [128]) – A parameter used for the dense layers.
batch_size (int, default: 128) – A parameter used for the batch size.
lr (float, default: 1e-4) – A parameter used for the learning rate.
loss (string or function, default: 'mse') – A parameter used for the loss function
batch_norm (bool, default: False) – batch normalization after convolution layers.
n_inception (int, default:2) – Number of the inception layers.
dense_avf (str, default is 'relu') – activation function in the dense layers.
dropout (float, default: 0) – A parameter used for the dropout of the dense layers.
monitor (str, default: 'val_loss') – {‘val_loss’, ‘val_r2’}, a monitor for model selection
metric (str, default: 'r2') – {‘r2’, ‘rmse’}, the evaluation metric.
patience (int, default: 10000) – A parameter used for early stopping
gpuid (int, default: 0) – A parameter used for specific gpu card
verbose (int, default: 0) – If positive, the log information of AggMapNet is printed; if negative, the log information of the original model is printed.
random_state (int, default: 32) – random seed.
Examples
>>> from aggmap import AggMapNet
>>> clf = AggMapNet.RegressionEstimator()
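The two metric options for regression, 'r2' and 'rmse', follow the standard definitions of the coefficient of determination and the root mean squared error. A plain-Python sketch of both, using made-up target values; the library presumably computes them through its own helpers.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]   # only the last prediction is off, by 2
r2_value = r2_score(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
```

Note that r2 is better when larger while rmse is better when smaller, which matters when one of them is used for model selection.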
- property clean
- explain_model(mp, X, y, explain_format='global', apply_logrithm=False, apply_smoothing=False, kernel_size=3, sigma=1.2)[source]
Feature importance calculation
- Parameters:
mp (aggmap object) –
X (array) – Training or test set X arrays.
y (array) – Training or test set y arrays.
explain_format ({'local', 'global'}, default: 'global') – Local or global feature importance. If 'local', X must be a single sample.
apply_logrithm ({True, False}, default: False) – Whether to apply a logarithm transformation to the importance values.
apply_smoothing ({True, False}, default: False) – Whether to apply a smoothing transformation to the importance values.
kernel_size (odd int, default: 3) – The kernel size used for the smoothing.
sigma (float, default: 1.2) – Sigma for the Gaussian smoothing.
- Return type:
DataFrame of feature importance
- plot_model(to_file='model.png', show_shapes=True, show_layer_names=True, rankdir='TB', expand_nested=False, dpi=96)[source]
- predict(X)[source]
- Parameters:
X (array-like of shape (n_samples, n_features_w, n_features_h, n_features_c)) – Array to be scored, where n_samples is the number of samples.
- Returns:
T – Returns the predicted values
- Return type:
array-like of shape (n_samples, n_classes)
- score(X, y, scoring='r2', sample_weight=None)[source]
Returns the score using the scoring option on the given test data and labels.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,)) – True labels for X.
scoring (str, please refer to: https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) –
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns:
score – Score of self.predict(X) with respect to y.
- Return type:
float
- set_params(**parameters)[source]
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
aggmap.map module
Created on Sun Aug 25 20:29:36 2019
@author: wanxiang.shen@u.nus.edu
main aggmap code
- class aggmap.map.AggMap(dfx, metric='correlation', by_scipy=False, n_cpus=16, info_distance=None)[source]
Bases: Base
The feature restructuring class AggMap
- Parameters:
dfx (pandas DataFrame) – Input data frame.
metric (string, default: 'correlation') – Measure of feature distance; supported values are {‘cosine’, ‘correlation’, ‘euclidean’, ‘jaccard’, ‘hamming’, ‘dice’}.
info_distance (numpy array, default: None) – A vector-form distance vector of the feature points; its shape should be (n*(n-1)/2), where n is the number of features. It is useful when you have your own vector-form distance to pass.
by_scipy (bool, default: False) – Calculate the distance using the scipy pdist function. It can be useful when dfx.shape[1] > 20000, i.e., when the number of features is very large; using pdist speeds up the distance calculation.
n_cpus (int, default: 16) – Number of CPU cores to use to calculate the distance.
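The vector-form distance expected by info_distance has one entry per feature pair, n*(n-1)/2 in total. The sketch below builds such a vector for the 'correlation' metric in plain Python. It assumes the pairs are ordered as (0,1), (0,2), ..., (1,2), ..., which is the condensed layout used by scipy's pdist; whether AggMap requires exactly this order is an assumption, and the helper names and data are hypothetical.

```python
import math


def correlation_distance(u, v):
    """1 - Pearson correlation between two equally long feature columns."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    norm = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return 1.0 - cov / norm


def condensed_distances(columns):
    """Distances for all pairs i < j, in row-major order of the upper triangle."""
    n = len(columns)
    return [correlation_distance(columns[i], columns[j])
            for i in range(n) for j in range(i + 1, n)]


features = [
    [1.0, 2.0, 3.0, 4.0],   # feature 0
    [2.0, 4.0, 6.0, 8.0],   # feature 1, perfectly correlated with feature 0
    [4.0, 3.0, 2.0, 1.0],   # feature 2, perfectly anti-correlated with feature 0
]
info_distance = condensed_distances(features)   # pairs (0,1), (0,2), (1,2)
```

Note that the distance is computed between features (columns of dfx), not between samples.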
- batch_transform(array_2d, scale=True, scale_method='minmax', n_jobs=4, fillnan=0)[source]
- Parameters:
array_2d (2D numpy array) – Feature points, M (samples) x N (feature points).
scale (bool, default: True) – If True, scaling (MinMax by default) is applied using the precomputed values.
scale_method ({'minmax', 'standard'}, default: 'minmax') – The scaling method.
n_jobs (int, default: 4) – Number of parallel jobs.
fillnan (default: 0) – Value used to fill NaN values.
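Scaling "by the precomputed values" means that new samples are scaled with statistics obtained earlier rather than with statistics of the batch being transformed. A minimal sketch of 'minmax' scaling followed by NaN filling, written in plain Python; the function name, the statistics, and the order of the two steps are assumptions for illustration.

```python
import math


def minmax_transform(rows, col_min, col_max, fillnan=0.0):
    """Scale each column with precomputed min/max, then replace NaN by fillnan."""
    scaled = []
    for row in rows:
        out = []
        for value, lo, hi in zip(row, col_min, col_max):
            span = hi - lo
            # A zero range (constant feature) has no defined scaling: treat as NaN.
            x = (value - lo) / span if span else float("nan")
            out.append(fillnan if math.isnan(x) else x)
        scaled.append(out)
    return scaled


# Statistics that would have been computed beforehand (hypothetical values).
col_min = [0.0, 10.0, 5.0]
col_max = [4.0, 20.0, 5.0]     # third feature is constant, so its range is zero

batch = [[2.0, 15.0, 5.0],
         [float("nan"), 25.0, 5.0]]
scaled = minmax_transform(batch, col_min, col_max, fillnan=0)
```

Values outside the precomputed range, such as 25.0 in the second column, fall outside [0, 1] after scaling.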
- fit(feature_group_list=[], cluster_channels=5, var_thr=-1, split_channels=True, fmap_shape=None, emb_method='umap', min_dist=0.1, n_neighbors=15, a=1.576943460405378, b=0.8950608781227859, verbose=2, random_state=32, group_color_dict={}, lnk_method='complete', **kwargs)[source]
- Parameters:
feature_group_list (list) – The group name for each feature point.
cluster_channels (int, default: 5) – Number of channels (clusters) if feature_group_list is empty.
var_thr (float, default: -1) – A feature is included only if its variance is larger than this value. Since some features have very low variance, they can be removed by increasing this threshold.
split_channels (bool, default: True) – If True, the outputs are split into channels according to the feature types.
fmap_shape (None or tuple, default: None) – Size of the feature map. If None, the size is calculated automatically.
emb_method ({'umap', 'tsne', 'mds', 'isomap', 'random', 'lle', 'se'}, default: 'umap') – Algorithm used to embed the high-dimensional data into 2D.
min_dist (float, default: 0.1) – UMAP parameter for the effective minimum distance between embedded points.
n_neighbors (int, default: 15) – UMAP parameter controlling the embedding. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved.
a (float) – UMAP parameter controlling the embedding. If None, it is determined automatically by min_dist and spread.
b (float) – UMAP parameter controlling the embedding. If None, it is determined automatically by min_dist and spread.
group_color_dict (dict) – The group colors; keys are the group names, values are the colors.
lnk_method ({'complete', 'average', 'single', 'weighted', 'centroid'}, default: 'complete') – Linkage method.
kwargs – Extra parameters for the corresponding embedding method.
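The effect of var_thr can be sketched as a simple variance filter over the feature columns. The example uses the population variance from the standard library; which variance estimator AggMap uses internally is not stated here, so treat that choice, the helper name, and the data as assumptions.

```python
from statistics import pvariance


def select_by_variance(columns, var_thr=-1):
    """Return indices of feature columns whose variance is larger than var_thr."""
    return [i for i, col in enumerate(columns) if pvariance(col) > var_thr]


columns = [
    [1.0, 1.0, 1.0, 1.0],      # constant feature, variance 0
    [0.0, 1.0, 0.0, 1.0],      # variance 0.25
    [1.0, 5.0, 9.0, 13.0],     # variance 20
]
keep_all = select_by_variance(columns)          # default -1 keeps every feature
keep_some = select_by_variance(columns, 0.1)    # drops the constant feature
```

Since a variance is never negative, the default of -1 keeps all features, including constant ones.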
- plot_scatter(htmlpath='./', htmlname=None, radius=2, enabled_data_labels=False)[source]
Scatter plot. radius: the size of the scatter points; must be an int.
- plot_tree(figsize=(16, 8), add_leaf_labels=True, leaf_font_size=18, leaf_rotation=90)[source]
Diagram tree plot
- refit_c(cluster_channels=10, lnk_method='complete', group_color_dict={})[source]
Re-fit the aggmap object to update the number of channels.
- Parameters:
cluster_channels (int, default: 10) – Number of channels (clusters).
group_color_dict (dict) – The group colors; keys are the group names, values are the colors.
lnk_method ({'complete', 'average', 'single', 'weighted', 'centroid'}, default: 'complete') – Linkage method.
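The combination of cluster_channels and lnk_method corresponds to agglomerative (hierarchical) clustering of the feature points into a fixed number of groups. The sketch below implements the 'complete' linkage rule directly in plain Python to show the idea; it is not the library's implementation, and the function name and toy positions are hypothetical.

```python
def complete_linkage_clusters(dist, n_clusters):
    """Agglomerative clustering with complete linkage on a square distance matrix.

    Repeatedly merges the two clusters whose farthest members are closest,
    until n_clusters remain. Returns a cluster index for every point.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)
    labels = [0] * len(dist)
    for channel, members in enumerate(clusters):
        for i in members:
            labels[i] = channel
    return labels


# Four feature points on a line at positions 0, 1, 10 and 11 (hypothetical data).
positions = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in positions] for p in positions]
channels = complete_linkage_clusters(dist, n_clusters=2)
```

Changing the number of clusters only changes where the merge process stops, which is consistent with refit_c updating the channels without recomputing distances, although that internal detail is an assumption.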
- to_nwk_tree(treefile='mytree', leaf_names=None)[source]
Convert the mp object to a Newick tree and a data file for submission to the iTOL server.
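Newick is a plain-text tree format in which each internal node is written as a parenthesized, comma-separated list of its children and the tree ends with a semicolon. The sketch below serializes a small nested-tuple hierarchy to illustrate the format; the tuple representation and the leaf names are hypothetical and unrelated to how to_nwk_tree stores its tree.

```python
def to_newick(node):
    """Serialize a nested tuple tree into Newick text (without the final ';').

    A leaf is a string; an internal node is a tuple of child nodes.
    """
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


# Hypothetical hierarchy of four feature points.
tree = (("f1", "f2"), ("f3", "f4"))
newick = to_newick(tree) + ";"
```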
aggmap.show module
Created on Tue Aug 18 13:01:00 2020
@author: SHEN WANXIANG