Tutorial

We start by importing Bacdiving:

from bacdiving import bacdive_caller as bc
from bacdiving import treeplots_maker as tm
from bacdiving import visualizations_maker as vm

Now assume we have the following taxonomy table taxtab.tsv (which we prior extracted from a phyloseq-object):

taxtab.tsv
	Kingdom	Phylum	Class	Order	Family	Genus	Species
ASV1	Bacteria	Bacteroidota	Bacteroidia	Bacteroidales	Bacteroidaceae	Bacteroides	vulgatus
ASV2	Bacteria	Firmicutes	Clostridia	Lachnospirales	Lachnospiraceae	Blautia	NA
ASV3	Bacteria	Bacteroidota	Bacteroidia	Bacteroidales	Bacteroidaceae	Bacteroides	NA
ASV4	Bacteria	Bacteroidota	Bacteroidia	Bacteroidales	Bacteroidaceae	Bacteroides	uniformis
ASV5	Bacteria	Firmicutes	Clostridia	Oscillospirales	Ruminococcaceae	Faecalibacterium	NA

Note

This taxonomy table stems from Nagel et al. (2016) but you can use any taxonomy table in this .tsv format to follow along with this tutorial. For demonstration purposes this table only shows the first 5 rows of the taxonomy table.

Warning

Note how species-level information is simply not always known for all ASVs in a taxonomy table. On average, you can expect 80% - 95% of all species from a given taxonomy table to be documented in BacDive. The exact percentage is mentioned in the access statistics output file which is generated after running bacdiving.bacdive_call().

To get the resulting dataframe with all strain-level BacDive information for all the species in this taxonomy table you can run:

# Run for single taxonomy table input (e.g. as extracted from phyloseq-object)
resulting_list_with_all_res_dfs = bc.bacdive_call(bacdive_id="<your ID>", bacdive_password="<your password>", inputs_list=["./taxtab.tsv taxtable_input"], sample_names=["taxtab"], print_res_df_ToFile = True, print_access_stats = True, print_flattened_file=True, columns_of_interest=["Physiology and metabolism.oxygen tolerance.oxygen tolerance", "Culture and growth conditions.culture temp.temperature", "Isolation, sampling and environmental information.isolation.origin.country","Morphology.cell morphology.motility"], output_dir="./")
resulting_df = resulting_list_with_all_res_dfs[0]

Warning

If you set "print_flattened_file" to True, then make sure your "columns_of_interest" contain meaningful BacDive features. For example, “General.BacDive-ID” or “General.description” would not be a smart choice as the majority value over all strains per species will be written to the flattened output file. Also, columns like “Culture and growth conditions.culture pH” may not be a good choice as this is typically a column with mixed types (neither fully categorical nor numerical). If you are interested in obtaining information from such columns with mixed types, use rather other functions like bacdiving.get_resulting_df_values(). Thus, you should use columns like the ones above instead, which will print the majority value (which is not nan) for a given BacDive feature across all strains per species.

Assuming you do not have a taxonomy table, but have a simple file as input instead which looks something like this:

SILVA_ids.txt
AB681649
AB121974
EU847536
D30768
L35516
AB681086
AB052706
AF144407
AF363064
AJ430586

Note

For demonstration purposes this SILVA_ids.txt file only contains 10 SILVA-ids. Note that other input types for querying BacDive are possible as well, e.g. taxonomy (as in a list of all species names of interest), Bacdive-ids, culture collection ids or genome accession ids. However, it is important that a single file can not contain multiple input types.

Given your input file, you can run the following, depending on your input type:

# Run for a single input from text file for SILVA id queries
resulting_list_with_all_res_dfs = bc.bacdive_call(bacdive_id="<your ID>", bacdive_password="<your password>", inputs_list=["./SILVA_ids.txt input_via_file search_by_16S_seq_accession"], sample_names=["SILVA"], output_dir="./")
resulting_df = resulting_list_with_all_res_dfs[0]

# Run for a single input from text file for taxonomy queries
resulting_list_with_all_res_dfs = bc.bacdive_call(inputs_list=["./taxonomy_ids.txt input_via_file search_by_taxonomy"], sample_names=["taxonomy"], output_dir="./results/") # if credentials are not given via parameters, you will get prompted
resulting_df = resulting_list_with_all_res_dfs[0]

# Run for a single input from text file for BacDive id queries
resulting_list_with_all_res_dfs = bc.bacdive_call(bacdive_id="<your ID>", bacdive_password="<your password>", inputs_list=["./bacdive_ids.txt input_via_file search_by_id"], sample_names=["bacdive"], output_dir="./")
resulting_df = resulting_list_with_all_res_dfs[0]

# Run for a single input from text file for culture collection queries
resulting_list_with_all_res_dfs = bc.bacdive_call(bacdive_id="<your ID>", bacdive_password="<your password>", inputs_list=["./culture_col_ids.txt input_via_file search_by_culture_collection"], sample_names=["culturecol"], output_dir="./")
resulting_df = resulting_list_with_all_res_dfs[0]

# Run for a single input from text file for genome accession queries
resulting_list_with_all_res_dfs = bc.bacdive_call(bacdive_id="<your ID>", bacdive_password="<your password>", inputs_list=["./genome_ids.txt input_via_file search_by_genome_accession"], sample_names=["genomecol"], output_dir="./")
resulting_df = resulting_list_with_all_res_dfs[0]

If you have multiple inputs of possibly different input types, you can run the following command:

# Run for multiple inputs (of possibly different input types)
resulting_list_with_all_res_dfs = bc.bacdive_call(bacdive_id="<your ID>", bacdive_password="<your password>", inputs_list=["./SILVA_ids.txt input_via_file search_by_16S_seq_accession", "./taxonomy_ids.txt input_via_file search_by_taxonomy", "./taxtab1.tsv taxtable_input", "./taxtab2.tsv taxtable_input"], sample_names=["sample1", "sample2", "sample3", "sample4"], print_flattened_file=True, columns_of_interest=["Physiology and metabolism.oxygen tolerance.oxygen tolerance", "Culture and growth conditions.culture temp.temperature"])
resulting_df = resulting_list_with_all_res_dfs[1]  # pick your dataframe of interest from this list

Now that you have the resulting dataframe at hand, you are almost ready to start visualizing the data.

Warning

If you try to plot information for a column which is not present in the resulting dataframe or if your parameters are set incorrectly or do not match the resulting dataframe, you may get an error. So, make sure to get to know your resulting dataframe (and especially its columns) beforehand.

Let’s first take a look at the resulting dataframe:

#Dataframe exploration
print(resulting_df.head()) # prints head of resulting dataframe
print(len(resulting_df.index)) #print number of resulting_df rows
print(resulting_df.keys())  #print resulting_df column names
print(resulting_df.iloc[0:5, 2:4]) #print all specific column information via column index
print(resulting_df["Physiology and metabolism.oxygen tolerance.oxygen tolerance"].unique()) #print unique values in a given column
print(resulting_df.loc[resulting_df["Name and taxonomic classification.species" ] == "Bacteroides uniformis"])  # print all strains and all columns for Bacteroides uniformis from resulting_df
print(resulting_df.loc[resulting_df["Name and taxonomic classification.species"] == "Helicobacter pylori"]["Culture and growth conditions.culture temp"])  # print all strains for column "Culture and growth conditions.culture temp" for Helicobacter pylori from resulting_df
print(resulting_df.loc[(resulting_df["Name and taxonomic classification.species"] == "Helicobacter pylori") & (resulting_df["Isolation, sampling and environmental information.isolation.country"] == "Germany")]) #Subset resulting_df to certain parameters
print(len(resulting_df.loc[resulting_df["Name and taxonomic classification.species"] == 'Zhihengliuella alba'].index)) #find out how many strains are present for a given species
print(get_resulting_df_values(resulting_df, "Culture and growth conditions.culture pH", "pH", species_list=["Helicobacter pylori", "Bacteroides clarus", "Actinomyces odontolyticus", "Bacteroides salyersiae", "Zhihengliuella alba"])) #Given a list of species of interest, access elements in resulting_df which are nested

Great, now we know the basic structure of our resulting dataframe and what kind of BacDive information we have, so it is finally time to start plotting!

Note

There are many possibilites for which columns from the resulting dataframe can be plotted for each plotting function. This tutorial shall only demonstrate a few examples.

Warning

Since we are dealing with strain-level information per species, please be aware when investigating your plotting results that BacDive information does not neccessarily exist for all strains.

In order to first gain some overview of our data, let us start with Bacdiving’s overview treeplot. Assume we want to know for which species BacDive information on temperature and oxygen tolerance is known or not. We can do this by running the following command:

#Overview treeplot
tm.overview_treeplot(resulting_df, label_name1="Temperature", label_name2="Oxygen tolerance", saveToFile=True, output_dir="./")