6. Exploring the embedding & grouping in MEGMA

For the Embedding, we can use various manifold learning algorithms or random method to embed microbes in 2D; For the Grouping, we can use various clustering methods to group microbes or use the pre-defined group information to group the microbes. Now, let’s try different methods to perform the embedding and grouping operations in MEGMA.

6.1 Microbial embedding * 6.1.1 Manifold embedding * 6.1.2 Ramdom embedding

6.2 Microbial grouping * 6.2.1 Hierarchical clustering tree based grouping * 6.2.2 Taxonomic tree based grouping

6.1 Microbial embedding

In this section, we can change the embedding method on the loaded megma object. The megma object supports a refit operation to update itself, so you don’t need to reinitialize a new megma.

[22]:

import matplotlib.pyplot as plt
import matplotlib.patches as patches
import seaborn as sns
import pandas as pd
import numpy as np
import aggmap
from aggmap import loadmap, AggMap

megma = loadmap('./megma/megma.all')

6.1.1 Manifold embedding

Let’s try several manifold methods to embed the microbes in megma.

[24]:

manifold_methods = ['mds', 'isomap', 'umap', 'tsne', 'lle', 'se']

[25]:

# using mds
megma_new = megma.copy()
megma_new.fit(emb_method = 'mds',  verbose=0)
megma_new.plot_scatter(htmlpath='./images', radius = 5)

2022-08-24 15:36:23,232 - INFO - [bidd-aggmap] - applying hierarchical clustering to obtain group information ...
2022-08-24 15:36:28,732 - INFO - [bidd-aggmap] - Applying grid assignment of feature points, this may take several minutes(1~30 min)
2022-08-24 15:36:29,802 - INFO - [bidd-aggmap] - Finished
2022-08-24 15:36:29,808 - INFO - [bidd-aggmap] - generate file: ./images/feature points_849_correlation_mds_scatter
2022-08-24 15:36:29,814 - INFO - [bidd-aggmap] - save html file to ./images/feature points_849_correlation_mds_scatter

[25]:

[26]:

# using isomap
megma_new = megma.copy()
megma_new.fit(emb_method = 'isomap',  verbose=0)
megma_new.plot_scatter(htmlpath='./images', radius = 5)

2022-08-24 15:36:29,828 - INFO - [bidd-aggmap] - applying hierarchical clustering to obtain group information ...
2022-08-24 15:36:30,074 - INFO - [bidd-aggmap] - Applying grid assignment of feature points, this may take several minutes(1~30 min)
2022-08-24 15:36:30,980 - INFO - [bidd-aggmap] - Finished
2022-08-24 15:36:30,986 - INFO - [bidd-aggmap] - generate file: ./images/feature points_849_correlation_isomap_scatter
2022-08-24 15:36:30,992 - INFO - [bidd-aggmap] - save html file to ./images/feature points_849_correlation_isomap_scatter

[26]:

[27]:

# using tsne
megma_new = megma.copy()
megma_new.fit(emb_method = 'tsne',  verbose=0)
megma_new.plot_scatter(htmlpath='./images', radius = 5)

2022-08-24 15:36:31,006 - INFO - [bidd-aggmap] - applying hierarchical clustering to obtain group information ...
2022-08-24 15:36:31,665 - INFO - [bidd-aggmap] - Applying grid assignment of feature points, this may take several minutes(1~30 min)
2022-08-24 15:36:32,025 - INFO - [bidd-aggmap] - Finished
2022-08-24 15:36:32,033 - INFO - [bidd-aggmap] - generate file: ./images/feature points_849_correlation_tsne_scatter
2022-08-24 15:36:32,041 - INFO - [bidd-aggmap] - save html file to ./images/feature points_849_correlation_tsne_scatter

[27]:

6.1.2 Ramdom embedding

Random embedding method just randomly assigns the 2D-coordinates to the microbes:

[29]:

# using tsne
megma_new = megma.copy()
megma_new.fit(emb_method = 'random', random_state=123, verbose=0)
megma_new.plot_scatter(htmlpath='./images', radius = 5)

2022-08-24 15:38:42,385 - INFO - [bidd-aggmap] - applying hierarchical clustering to obtain group information ...
2022-08-24 15:38:42,406 - INFO - [bidd-aggmap] - Applying grid assignment of feature points, this may take several minutes(1~30 min)
2022-08-24 15:38:43,505 - INFO - [bidd-aggmap] - Finished
2022-08-24 15:38:43,511 - INFO - [bidd-aggmap] - generate file: ./images/feature points_849_correlation_random_scatter
2022-08-24 15:38:43,518 - INFO - [bidd-aggmap] - save html file to ./images/feature points_849_correlation_random_scatter

[29]:

6.2 Microbial grouping

The microbes can be grouped into several groups based on their phenotype or genotype distances.

For the phenotype-based grouping (or metagenomic grouping), we can calculate the microbial metagenomic abundance correlation distances and then use the agglomerative hierarchical clustering to group the microbes. The default MEGMA uses this method to group the microbes (The cluster number c = 5)，the number of clusters can be specified by users.
For the genotype-based grouping, we can build a phylogenetic tree and then we can group the microbes by truncating taxonomic levels in the phylogenetic tree, such as by truncating the Kingdom or Phylum level we can generate the different cluster numbers.

The cluster number c is the number of the channels in the feature map, e. g., c=10 means that the number of channels of the 2D-microbiomeprint is 10.

6.2.1 Hierarchical clustering tree based grouping

[41]:

# c = 10
megma_new = megma.copy()
megma_new.fit(cluster_channels = 10, verbose = 0)
fig = megma_new.plot_tree(leaf_font_size=0)

2022-08-24 16:07:38,374 - INFO - [bidd-aggmap] - applying hierarchical clustering to obtain group information ...
2022-08-24 16:07:38,997 - INFO - [bidd-aggmap] - Applying grid assignment of feature points, this may take several minutes(1~30 min)
2022-08-24 16:07:40,003 - INFO - [bidd-aggmap] - Finished

../_images/_example_MEGMA_example_05_embedding_grouping_15_1.png

[42]:

megma_new.plot_scatter(htmlpath='./images', radius = 5)

2022-08-24 16:07:50,211 - INFO - [bidd-aggmap] - generate file: ./images/feature points_849_correlation_umap_scatter
2022-08-24 16:07:50,220 - INFO - [bidd-aggmap] - save html file to ./images/feature points_849_correlation_umap_scatter

[42]:

[43]:

# c = 20
megma_new = megma.copy()
megma_new.fit(cluster_channels = 20, verbose = 0)
fig = megma_new.plot_tree(leaf_font_size=0)

2022-08-24 16:07:50,239 - INFO - [bidd-aggmap] - applying hierarchical clustering to obtain group information ...
2022-08-24 16:07:50,862 - INFO - [bidd-aggmap] - Applying grid assignment of feature points, this may take several minutes(1~30 min)
2022-08-24 16:07:51,868 - INFO - [bidd-aggmap] - Finished

../_images/_example_MEGMA_example_05_embedding_grouping_17_1.png

[44]:

megma_new.plot_scatter(htmlpath='./images', radius = 5)

2022-08-24 16:08:01,983 - INFO - [bidd-aggmap] - generate file: ./images/feature points_849_correlation_umap_scatter
2022-08-24 16:08:01,994 - INFO - [bidd-aggmap] - save html file to ./images/feature points_849_correlation_umap_scatter

[44]:

6.2.2 Taxonomic tree based grouping

Now, let’s try to group the microbes by taxonomic level. The microbes we used has no taxonomic profiles, therefore, we first need to map the taxonomic profile to each microbes based on the mOTU ID.

The parameter feature_group_list supports a customized grouping information to group the microbes, we can generate the feature group list based on the taxonomic level of kingdom, phylum, class, order, and so on.

[78]:

## get the taxonomic profiles of all mOTUs
url = 'https://raw.githubusercontent.com/shenwanxiang/bidd-aggmap/master/docs/source/_example_MEGMA/dataset/'
dfm = pd.read_csv(url + 'mOTUs_new_taxonomic_profile.txt',sep='\t')
dfm = dfm.set_index('#mOTU')[['consensus_taxonomy']]
dfm.head(5)

[78]:

	consensus_taxonomy
#mOTU
ref_mOTU_v2_0001	k__Bacteria\|p__Firmicutes\|c__Erysipelotrichia\|...
ref_mOTU_v2_0002	k__Bacteria\|p__Proteobacteria\|c__Betaproteobac...
ref_mOTU_v2_0003	k__Bacteria\|p__Firmicutes\|c__Negativicutes\|o__...
ref_mOTU_v2_0004	k__Bacteria\|p__Firmicutes\|c__Bacilli\|o__Lactob...
ref_mOTU_v2_0005	k__Bacteria\|p__Firmicutes\|c__Bacilli\|o__Lactob...

[101]:

## get the mOTU id for our microbes in megma
dfs = pd.DataFrame(megma.alist, columns = ['IDs'])
dfs['mOTU'] = dfs.IDs.apply(lambda x:x.split('[')[1]).apply(lambda x:x.split(']')[0])
dfs = dfs.set_index('mOTU')
dfs.head(5)

[101]:

	IDs
mOTU
ref_mOTU_v2_0473	Clostridium saccharogumia [ref_mOTU_v2_0473]
ref_mOTU_v2_0643	Clostridium innocuum [ref_mOTU_v2_0643]
ref_mOTU_v2_0860	Clostridium sp. KLE 1755 [ref_mOTU_v2_0860]
ref_mOTU_v2_0883	Clostridium scindens [ref_mOTU_v2_0883]
ref_mOTU_v2_0886	Clostridium boltae/clostridioforme [ref_mOTU_v...

[102]:

#join the taxonomic profile
dfs = dfs.join(dfm)
dfs.head(5)

[102]:

	IDs	consensus_taxonomy
mOTU
ref_mOTU_v2_0473	Clostridium saccharogumia [ref_mOTU_v2_0473]	k__Bacteria\|p__Firmicutes\|c__Erysipelotrichia\|...
ref_mOTU_v2_0643	Clostridium innocuum [ref_mOTU_v2_0643]	k__Bacteria\|p__Firmicutes\|c__Erysipelotrichia\|...
ref_mOTU_v2_0860	Clostridium sp. KLE 1755 [ref_mOTU_v2_0860]	k__Bacteria\|p__Firmicutes\|c__Clostridia\|o__Clo...
ref_mOTU_v2_0883	Clostridium scindens [ref_mOTU_v2_0883]	k__Bacteria\|p__Firmicutes\|c__Clostridia\|o__Clo...
ref_mOTU_v2_0886	Clostridium boltae/clostridioforme [ref_mOTU_v...	k__Bacteria\|p__Firmicutes\|c__Clostridia\|o__Clo...

[103]:

#the taxonomic level for each microbe
dft = dfs['consensus_taxonomy'].apply(lambda x: dict([i.split('__') for i in x.split('|')])).apply(pd.Series)
level_dict = {'k':'kingdom', 'p':'phylum', 'c':'class' ,'o':'order' ,'f':'family' ,'g': 'genus','s': 'species'}
dft = dft.rename(columns=level_dict)
dft.head(5)

[103]:

	kingdom	phylum	class	order	family	genus	species
mOTU
ref_mOTU_v2_0473	Bacteria	Firmicutes	Erysipelotrichia	Erysipelotrichales	Erysipelotrichaceae	Erysipelatoclostridium	[Clostridium] saccharogumia
ref_mOTU_v2_0643	Bacteria	Firmicutes	Erysipelotrichia	Erysipelotrichales	Erysipelotrichaceae	Erysipelatoclostridium	[Clostridium] innocuum [C Clostridioides diffi...
ref_mOTU_v2_0860	Bacteria	Firmicutes	Clostridia	Clostridiales	Clostridiaceae	Clostridium	Clostridium sp. KLE 1755
ref_mOTU_v2_0883	Bacteria	Firmicutes	Clostridia	Clostridiales	Lachnospiraceae	Lachnoclostridium	[Clostridium] scindens
ref_mOTU_v2_0886	Bacteria	Firmicutes	Clostridia	Clostridiales	Lachnospiraceae	Lachnoclostridium	[Clostridium] sp. [C boltae/clostridioforme]

[120]:

# grouping by kingdom level
feature_group_list = dft['kingdom'].tolist()
megma_new = megma.copy()
megma_new.fit(feature_group_list = feature_group_list, verbose = 0)
megma_new.plot_scatter(htmlpath='./images', radius = 5)

2022-08-24 18:11:18,456 - INFO - [bidd-aggmap] - Applying grid assignment of feature points, this may take several minutes(1~30 min)
2022-08-24 18:11:19,461 - INFO - [bidd-aggmap] - Finished
2022-08-24 18:11:19,467 - INFO - [bidd-aggmap] - generate file: ./images/feature points_849_correlation_umap_scatter
2022-08-24 18:11:19,472 - INFO - [bidd-aggmap] - save html file to ./images/feature points_849_correlation_umap_scatter

[120]:

[107]:

# grouping by phylum level
feature_group_list = dft['phylum'].tolist()
megma_new = megma.copy()
megma_new.fit(feature_group_list = feature_group_list, verbose = 0)
megma_new.plot_scatter(htmlpath='./images', radius = 5)

2022-08-24 17:39:35,833 - INFO - [bidd-aggmap] - Applying grid assignment of feature points, this may take several minutes(1~30 min)
2022-08-24 17:39:36,838 - INFO - [bidd-aggmap] - Finished
2022-08-24 17:39:36,844 - INFO - [bidd-aggmap] - generate file: ./images/feature points_849_correlation_umap_scatter
2022-08-24 17:39:36,853 - INFO - [bidd-aggmap] - save html file to ./images/feature points_849_correlation_umap_scatter

[107]:

[106]:

# grouping by class level
feature_group_list = dft['class'].tolist()
megma_new = megma.copy()
megma_new.fit(feature_group_list = feature_group_list, verbose = 0)
megma_new.plot_scatter(htmlpath='./images', radius = 5)

2022-08-24 17:39:05,503 - INFO - [bidd-aggmap] - Applying grid assignment of feature points, this may take several minutes(1~30 min)
2022-08-24 17:39:06,512 - INFO - [bidd-aggmap] - Finished
2022-08-24 17:39:06,518 - INFO - [bidd-aggmap] - generate file: ./images/feature points_849_correlation_umap_scatter
2022-08-24 17:39:06,530 - INFO - [bidd-aggmap] - save html file to ./images/feature points_849_correlation_umap_scatter

[106]: