6. Exploring the embedding & grouping in MEGMA

For the Embedding, we can use various manifold learning algorithms or random method to embed microbes in 2D; For the Grouping, we can use various clustering methods to group microbes or use the pre-defined group information to group the microbes. Now, let’s try different methods to perform the embedding and grouping operations in MEGMA.

6.1 Microbial embedding * 6.1.1 Manifold embedding * 6.1.2 Ramdom embedding

6.2 Microbial grouping * 6.2.1 Hierarchical clustering tree based grouping * 6.2.2 Taxonomic tree based grouping

6.1 Microbial embedding

In this section, we can change the embedding method on the loaded megma object. The megma object supports a refit operation to update itself, so you don’t need to reinitialize a new megma.

[22]:
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import seaborn as sns
import pandas as pd
import numpy as np
import aggmap
from aggmap import loadmap, AggMap

megma = loadmap('./megma/megma.all')

6.1.1 Manifold embedding

Let’s try several manifold methods to embed the microbes in megma.

[24]:
manifold_methods = ['mds', 'isomap', 'umap', 'tsne', 'lle', 'se']
[25]:
# using mds
megma_new = megma.copy()
megma_new.fit(emb_method = 'mds',  verbose=0)
megma_new.plot_scatter(htmlpath='./images', radius = 5)
2022-08-24 15:36:23,232 - INFO - [bidd-aggmap] - applying hierarchical clustering to obtain group information ...
2022-08-24 15:36:28,732 - INFO - [bidd-aggmap] - Applying grid assignment of feature points, this may take several minutes(1~30 min)
2022-08-24 15:36:29,802 - INFO - [bidd-aggmap] - Finished
2022-08-24 15:36:29,808 - INFO - [bidd-aggmap] - generate file: ./images/feature points_849_correlation_mds_scatter
2022-08-24 15:36:29,814 - INFO - [bidd-aggmap] - save html file to ./images/feature points_849_correlation_mds_scatter
[25]:
[26]:
# using isomap
megma_new = megma.copy()
megma_new.fit(emb_method = 'isomap',  verbose=0)
megma_new.plot_scatter(htmlpath='./images', radius = 5)
2022-08-24 15:36:29,828 - INFO - [bidd-aggmap] - applying hierarchical clustering to obtain group information ...
2022-08-24 15:36:30,074 - INFO - [bidd-aggmap] - Applying grid assignment of feature points, this may take several minutes(1~30 min)
2022-08-24 15:36:30,980 - INFO - [bidd-aggmap] - Finished
2022-08-24 15:36:30,986 - INFO - [bidd-aggmap] - generate file: ./images/feature points_849_correlation_isomap_scatter
2022-08-24 15:36:30,992 - INFO - [bidd-aggmap] - save html file to ./images/feature points_849_correlation_isomap_scatter
[26]:
[27]:
# using tsne
megma_new = megma.copy()
megma_new.fit(emb_method = 'tsne',  verbose=0)
megma_new.plot_scatter(htmlpath='./images', radius = 5)
2022-08-24 15:36:31,006 - INFO - [bidd-aggmap] - applying hierarchical clustering to obtain group information ...
2022-08-24 15:36:31,665 - INFO - [bidd-aggmap] - Applying grid assignment of feature points, this may take several minutes(1~30 min)
2022-08-24 15:36:32,025 - INFO - [bidd-aggmap] - Finished
2022-08-24 15:36:32,033 - INFO - [bidd-aggmap] - generate file: ./images/feature points_849_correlation_tsne_scatter
2022-08-24 15:36:32,041 - INFO - [bidd-aggmap] - save html file to ./images/feature points_849_correlation_tsne_scatter
[27]:

6.1.2 Ramdom embedding

Random embedding method just randomly assigns the 2D-coordinates to the microbes:

[29]:
# using tsne
megma_new = megma.copy()
megma_new.fit(emb_method = 'random', random_state=123, verbose=0)
megma_new.plot_scatter(htmlpath='./images', radius = 5)
2022-08-24 15:38:42,385 - INFO - [bidd-aggmap] - applying hierarchical clustering to obtain group information ...
2022-08-24 15:38:42,406 - INFO - [bidd-aggmap] - Applying grid assignment of feature points, this may take several minutes(1~30 min)
2022-08-24 15:38:43,505 - INFO - [bidd-aggmap] - Finished
2022-08-24 15:38:43,511 - INFO - [bidd-aggmap] - generate file: ./images/feature points_849_correlation_random_scatter
2022-08-24 15:38:43,518 - INFO - [bidd-aggmap] - save html file to ./images/feature points_849_correlation_random_scatter
[29]:

6.2 Microbial grouping

The microbes can be grouped into several groups based on their phenotype or genotype distances.

  • For the phenotype-based grouping (or metagenomic grouping), we can calculate the microbial metagenomic abundance correlation distances and then use the agglomerative hierarchical clustering to group the microbes. The default MEGMA uses this method to group the microbes (The cluster number c = 5),the number of clusters can be specified by users.

  • For the genotype-based grouping, we can build a phylogenetic tree and then we can group the microbes by truncating taxonomic levels in the phylogenetic tree, such as by truncating the Kingdom or Phylum level we can generate the different cluster numbers.

The cluster number c is the number of the channels in the feature map, e. g., c=10 means that the number of channels of the 2D-microbiomeprint is 10.

6.2.1 Hierarchical clustering tree based grouping

[41]:
# c = 10
megma_new = megma.copy()
megma_new.fit(cluster_channels = 10, verbose = 0)
fig = megma_new.plot_tree(leaf_font_size=0)
2022-08-24 16:07:38,374 - INFO - [bidd-aggmap] - applying hierarchical clustering to obtain group information ...
2022-08-24 16:07:38,997 - INFO - [bidd-aggmap] - Applying grid assignment of feature points, this may take several minutes(1~30 min)
2022-08-24 16:07:40,003 - INFO - [bidd-aggmap] - Finished
../_images/_example_MEGMA_example_05_embedding_grouping_15_1.png
[42]:
megma_new.plot_scatter(htmlpath='./images', radius = 5)
2022-08-24 16:07:50,211 - INFO - [bidd-aggmap] - generate file: ./images/feature points_849_correlation_umap_scatter
2022-08-24 16:07:50,220 - INFO - [bidd-aggmap] - save html file to ./images/feature points_849_correlation_umap_scatter
[42]:
[43]:
# c = 20
megma_new = megma.copy()
megma_new.fit(cluster_channels = 20, verbose = 0)
fig = megma_new.plot_tree(leaf_font_size=0)
2022-08-24 16:07:50,239 - INFO - [bidd-aggmap] - applying hierarchical clustering to obtain group information ...
2022-08-24 16:07:50,862 - INFO - [bidd-aggmap] - Applying grid assignment of feature points, this may take several minutes(1~30 min)
2022-08-24 16:07:51,868 - INFO - [bidd-aggmap] - Finished
../_images/_example_MEGMA_example_05_embedding_grouping_17_1.png
[44]:
megma_new.plot_scatter(htmlpath='./images', radius = 5)
2022-08-24 16:08:01,983 - INFO - [bidd-aggmap] - generate file: ./images/feature points_849_correlation_umap_scatter
2022-08-24 16:08:01,994 - INFO - [bidd-aggmap] - save html file to ./images/feature points_849_correlation_umap_scatter
[44]:

6.2.2 Taxonomic tree based grouping

Now, let’s try to group the microbes by taxonomic level. The microbes we used has no taxonomic profiles, therefore, we first need to map the taxonomic profile to each microbes based on the mOTU ID.

The parameter feature_group_list supports a customized grouping information to group the microbes, we can generate the feature group list based on the taxonomic level of kingdom, phylum, class, order, and so on.

[78]:
## get the taxonomic profiles of all mOTUs
url = 'https://raw.githubusercontent.com/shenwanxiang/bidd-aggmap/master/docs/source/_example_MEGMA/dataset/'
dfm = pd.read_csv(url + 'mOTUs_new_taxonomic_profile.txt',sep='\t')
dfm = dfm.set_index('#mOTU')[['consensus_taxonomy']]
dfm.head(5)
[78]:
consensus_taxonomy
#mOTU
ref_mOTU_v2_0001 k__Bacteria|p__Firmicutes|c__Erysipelotrichia|...
ref_mOTU_v2_0002 k__Bacteria|p__Proteobacteria|c__Betaproteobac...
ref_mOTU_v2_0003 k__Bacteria|p__Firmicutes|c__Negativicutes|o__...
ref_mOTU_v2_0004 k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactob...
ref_mOTU_v2_0005 k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactob...
[101]:
## get the mOTU id for our microbes in megma
dfs = pd.DataFrame(megma.alist, columns = ['IDs'])
dfs['mOTU'] = dfs.IDs.apply(lambda x:x.split('[')[1]).apply(lambda x:x.split(']')[0])
dfs = dfs.set_index('mOTU')
dfs.head(5)
[101]:
IDs
mOTU
ref_mOTU_v2_0473 Clostridium saccharogumia [ref_mOTU_v2_0473]
ref_mOTU_v2_0643 Clostridium innocuum [ref_mOTU_v2_0643]
ref_mOTU_v2_0860 Clostridium sp. KLE 1755 [ref_mOTU_v2_0860]
ref_mOTU_v2_0883 Clostridium scindens [ref_mOTU_v2_0883]
ref_mOTU_v2_0886 Clostridium boltae/clostridioforme [ref_mOTU_v...
[102]:
#join the taxonomic profile
dfs = dfs.join(dfm)
dfs.head(5)
[102]:
IDs consensus_taxonomy
mOTU
ref_mOTU_v2_0473 Clostridium saccharogumia [ref_mOTU_v2_0473] k__Bacteria|p__Firmicutes|c__Erysipelotrichia|...
ref_mOTU_v2_0643 Clostridium innocuum [ref_mOTU_v2_0643] k__Bacteria|p__Firmicutes|c__Erysipelotrichia|...
ref_mOTU_v2_0860 Clostridium sp. KLE 1755 [ref_mOTU_v2_0860] k__Bacteria|p__Firmicutes|c__Clostridia|o__Clo...
ref_mOTU_v2_0883 Clostridium scindens [ref_mOTU_v2_0883] k__Bacteria|p__Firmicutes|c__Clostridia|o__Clo...
ref_mOTU_v2_0886 Clostridium boltae/clostridioforme [ref_mOTU_v... k__Bacteria|p__Firmicutes|c__Clostridia|o__Clo...
[103]:
#the taxonomic level for each microbe
dft = dfs['consensus_taxonomy'].apply(lambda x: dict([i.split('__') for i in x.split('|')])).apply(pd.Series)
level_dict = {'k':'kingdom', 'p':'phylum', 'c':'class' ,'o':'order' ,'f':'family' ,'g': 'genus','s': 'species'}
dft = dft.rename(columns=level_dict)
dft.head(5)
[103]:
kingdom phylum class order family genus species
mOTU
ref_mOTU_v2_0473 Bacteria Firmicutes Erysipelotrichia Erysipelotrichales Erysipelotrichaceae Erysipelatoclostridium [Clostridium] saccharogumia
ref_mOTU_v2_0643 Bacteria Firmicutes Erysipelotrichia Erysipelotrichales Erysipelotrichaceae Erysipelatoclostridium [Clostridium] innocuum [C Clostridioides diffi...
ref_mOTU_v2_0860 Bacteria Firmicutes Clostridia Clostridiales Clostridiaceae Clostridium Clostridium sp. KLE 1755
ref_mOTU_v2_0883 Bacteria Firmicutes Clostridia Clostridiales Lachnospiraceae Lachnoclostridium [Clostridium] scindens
ref_mOTU_v2_0886 Bacteria Firmicutes Clostridia Clostridiales Lachnospiraceae Lachnoclostridium [Clostridium] sp. [C boltae/clostridioforme]
[120]:
# grouping by kingdom level
feature_group_list = dft['kingdom'].tolist()
megma_new = megma.copy()
megma_new.fit(feature_group_list = feature_group_list, verbose = 0)
megma_new.plot_scatter(htmlpath='./images', radius = 5)
2022-08-24 18:11:18,456 - INFO - [bidd-aggmap] - Applying grid assignment of feature points, this may take several minutes(1~30 min)
2022-08-24 18:11:19,461 - INFO - [bidd-aggmap] - Finished
2022-08-24 18:11:19,467 - INFO - [bidd-aggmap] - generate file: ./images/feature points_849_correlation_umap_scatter
2022-08-24 18:11:19,472 - INFO - [bidd-aggmap] - save html file to ./images/feature points_849_correlation_umap_scatter
[120]:
[107]:
# grouping by phylum level
feature_group_list = dft['phylum'].tolist()
megma_new = megma.copy()
megma_new.fit(feature_group_list = feature_group_list, verbose = 0)
megma_new.plot_scatter(htmlpath='./images', radius = 5)
2022-08-24 17:39:35,833 - INFO - [bidd-aggmap] - Applying grid assignment of feature points, this may take several minutes(1~30 min)
2022-08-24 17:39:36,838 - INFO - [bidd-aggmap] - Finished
2022-08-24 17:39:36,844 - INFO - [bidd-aggmap] - generate file: ./images/feature points_849_correlation_umap_scatter
2022-08-24 17:39:36,853 - INFO - [bidd-aggmap] - save html file to ./images/feature points_849_correlation_umap_scatter
[107]:
[106]:
# grouping by class level
feature_group_list = dft['class'].tolist()
megma_new = megma.copy()
megma_new.fit(feature_group_list = feature_group_list, verbose = 0)
megma_new.plot_scatter(htmlpath='./images', radius = 5)
2022-08-24 17:39:05,503 - INFO - [bidd-aggmap] - Applying grid assignment of feature points, this may take several minutes(1~30 min)
2022-08-24 17:39:06,512 - INFO - [bidd-aggmap] - Finished
2022-08-24 17:39:06,518 - INFO - [bidd-aggmap] - generate file: ./images/feature points_849_correlation_umap_scatter
2022-08-24 17:39:06,530 - INFO - [bidd-aggmap] - save html file to ./images/feature points_849_correlation_umap_scatter
[106]: