Usage

About Bacdiving

Bacdiving is a Python package that retrieves information from BacDive, the world's largest database of standardized bacterial phenotypic information. It also reports access statistics and offers options for visualizing the retrieved information.

The following figure gives an overview of Bacdiving and how this package could be used:

[Figure: overview of the Bacdiving workflow]

As depicted in this workflow, Bacdiving can deal with two types of inputs: either a taxonomy table (e.g. resulting from a phyloseq-object) or a file input.

Let us begin with the file input type (in p x 1 format). The file contains p rows, all of the same kind: BacDive-IDs, culture collection numbers, taxonomy information (either as a full name or as a list with genus name, species with optional epithet, and optional subspecies) or sequence accession numbers (16S sequences, SILVA-IDs or genome sequences). BacDive is then queried for each of the p rows, and all strain-level information is stored in a single dataframe. The resulting dataframe has the format p x r, where r is the number of BacDive columns for which information is available. This dataframe, along with the corresponding access statistics, can then be written to file. All other core functions in Bacdiving rely on the resulting dataframe.
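As a minimal sketch of preparing such a file input, the snippet below writes a p x 1 file of taxonomy names using only the standard library. It assumes one entry per line and no header row, which the description above does not spell out; the helper name write_input_file and the example species are purely illustrative and not part of Bacdiving.

```python
import tempfile
from pathlib import Path


def write_input_file(entries, path):
    """Write one query entry per line (assumed p x 1 layout, no header)."""
    cleaned = [e.strip() for e in entries if e.strip()]
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)  # p, the number of rows BacDive would be queried for


# Illustrative names; a file should hold a single kind of entry
# (only IDs, only names, or only accession numbers).
species = ["Bacteroides vulgatus", "Escherichia coli", "  ", "Bacillus subtilis"]

out_path = Path(tempfile.mkdtemp()) / "species_names.txt"
p = write_input_file(species, out_path)
rows = out_path.read_text(encoding="utf-8").splitlines()
print(p, rows)
```

Blank entries are dropped before writing so that every row of the file corresponds to exactly one query.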

The second possible input type is a taxonomy table (in p x 7 format) with the following 7 taxonomic ranks: kingdom, phylum, class, order, family, genus, species. Bacdiving first filters out all rows for which the species is unknown, which yields a "new" taxonomy table (in m x 7 format). Each of these m species is then looked up on BacDive. If information is available for a given species, the BacDive data for all of its known strains is appended to a single dataframe. In the end, this resulting dataframe (m x r) contains all strain-level information for all species of the taxonomy table. As with file input, the resulting dataframe and the corresponding BacDive access statistics can be written to file. Additionally, given a taxonomy table as input, one may wish to summarize BacDive information across all strains of a given species in order to perform downstream statistical tasks (e.g. matrix completion or regression). For this purpose, given specific BacDive features of interest, a flattened file (in p x r' format) can be written which contains the number of strains per species found on BacDive as well as the majority value across all strains for each feature of interest.
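The two ideas in this paragraph, dropping rows with unknown species and flattening strain-level records to a majority value per species, can be illustrated with a small standalone sketch. This is not Bacdiving's implementation: it assumes an unknown species is encoded as an empty string or None, that ties are resolved in favour of the value seen first, and it uses made-up toy records rather than real BacDive data. All function names are hypothetical.

```python
from collections import Counter

RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def drop_unknown_species(tax_rows):
    """Keep only rows whose species field is filled in (assumed: unknown = empty/None)."""
    return [row for row in tax_rows if row.get("species")]


def majority_value(values):
    """Most common non-missing value; ties resolve to the value seen first."""
    present = [v for v in values if v is not None]
    return Counter(present).most_common(1)[0][0] if present else None


def flatten(strain_records, feature):
    """Collapse strain-level records into one entry per species for a single feature."""
    by_species = {}
    for rec in strain_records:
        by_species.setdefault(rec["species"], []).append(rec.get(feature))
    return {
        sp: {"n_strains": len(vals), feature: majority_value(vals)}
        for sp, vals in by_species.items()
    }


# Toy taxonomy table (p = 3 rows); the last row has no species and is filtered out.
tax_table = [
    dict(zip(RANKS, ["Bacteria", "", "", "", "", "Escherichia", "Escherichia coli"])),
    dict(zip(RANKS, ["Bacteria", "", "", "", "", "Bacteroides", "Bacteroides vulgatus"])),
    dict(zip(RANKS, ["Bacteria", "", "", "", "", "Bacteroides", ""])),
]
kept = drop_unknown_species(tax_table)  # m = 2 rows remain

# Toy strain-level records (not real BacDive entries).
strains = [
    {"species": "Escherichia coli", "oxygen tolerance": "facultative anaerobe"},
    {"species": "Escherichia coli", "oxygen tolerance": "facultative anaerobe"},
    {"species": "Escherichia coli", "oxygen tolerance": "aerobe"},
    {"species": "Bacteroides vulgatus", "oxygen tolerance": "anaerobe"},
    {"species": "Bacteroides vulgatus", "oxygen tolerance": None},
]
flat = flatten(strains, "oxygen tolerance")
print(len(kept), flat)
```

In this sketch a strain with a missing value still counts towards the number of strains but is ignored when the majority value is chosen; whether Bacdiving treats missing values the same way is not stated above.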

Depending on the research question, you can either use Bacdiving's visualization options or build custom visualizations from the resulting dataframe, which can be extended and reused freely.

For instance, the resulting dataframe, together with metadata and the corresponding ASV table, could be used in tools like NetCoMi to construct various types of networks. The nodes of these networks could be colored with specific phylogenetic information from BacDive as stored in the resulting dataframe file, which may help explain why a given network looks the way it does. In other words, coloring the nodes of a network by phylogenetic information may help explain the correlation between various features.

Accessing BacDive

As soon as you have registered on BacDive, you can use your credentials to run Bacdiving’s most central function bacdiving.bacdive_call():

bacdiving.bacdive_call(bacdive_id=' ', bacdive_password=' ', inputs_list=[' '], sample_names=[' '], print_res_df_ToFile=True, print_access_stats=True, print_flattened_file=False, columns_of_interest=[' '], output_dir='./')

Reads the inputs, queries the BacDive database and stores resulting dataframe(s) and access statistics.

Parameters
  • bacdive_id (str) – Log in credential: BacDive id.

  • bacdive_password (str) – Log in credential: BacDive password.

  • inputs_list (list[str]) – List of one or more strings. Each string has the structure “<file-path> <file-type> (<content-type>)”, with the parts separated by space(s). The content type is only required if the file type is input_via_file; it can have one of the following values: “search_by_id”, “search_by_culture_collection”, “search_by_taxonomy”, “search_by_16S_seq_accession” or “search_by_genome_accession”.

  • sample_names (list[str]) – List of sample names.

  • print_res_df_ToFile (bool) – Print the resulting dataframe with all BacDive information to file or not.

  • print_access_stats (bool) – Print the BacDive access statistics to file or not.

  • print_flattened_file (bool) – Print the flattened BacDive information for certain columns of interest to file or not.

  • columns_of_interest (list[str]) – Specify in this list which columns from BacdiveInformation.tsv you want to include in the flattened file.

  • output_dir (str) – Path to where resulting dataframe should be saved.

Returns

List containing the resulting dataframe(s) with all strain-level BacDive information for all inputs.

Return type

list[pandas.DataFrame]

bacdiving.bacdive_call() first prompts you for your login credentials before querying BacDive, unless you have already supplied them via the function parameters "bacdive_id" and "bacdive_password".

After that, it generates the resulting dataframe(s) (BacdiveInformation.tsv) with all strain-level information. If print_access_stats is set, it also writes the BacDive access statistics to a .txt file, which reports the percentage of input species found on BacDive and lists all species that could not be found. Additional files (such as Species_names_from_taxtable_file.csv or Flattened_Bacdive_data.tsv) may also be written if your input was a taxonomy table. Note that Species_names_from_taxtable_file.csv lists all species from the taxonomy table, as extracted before BacDive is queried.
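The following sketch shows one way to assemble an inputs_list entry for a file input and hand it to bacdive_call(). The helper names file_input_spec and run_query are hypothetical, and the import line assumes the functions are reachable as bacdiving.<name>, as the signatures on this page are written; adjust it to your installation. The check against spaces in the path is also an assumption, made because the entry itself is space-separated.

```python
CONTENT_TYPES = {
    "search_by_id",
    "search_by_culture_collection",
    "search_by_taxonomy",
    "search_by_16S_seq_accession",
    "search_by_genome_accession",
}


def file_input_spec(path, content_type):
    """Build one inputs_list entry: '<file-path> input_via_file <content-type>'."""
    if content_type not in CONTENT_TYPES:
        raise ValueError(f"unknown content type: {content_type!r}")
    if " " in path:
        # Assumption: a path containing spaces would be split incorrectly.
        raise ValueError("path must not contain spaces")
    return f"{path} input_via_file {content_type}"


def run_query(inputs_list, sample_names, output_dir="./"):
    """Call bacdive_call() with documented parameters (needs bacdiving and a BacDive account)."""
    import bacdiving  # assumption: functions are reachable as bacdiving.<name>

    # Credentials are not passed here, so bacdive_call() prompts for them.
    return bacdiving.bacdive_call(
        inputs_list=inputs_list,
        sample_names=sample_names,
        print_res_df_ToFile=True,
        print_access_stats=True,
        output_dir=output_dir,
    )


specs = [file_input_spec("./species_names.txt", "search_by_taxonomy")]
print(specs)
```

run_query() is only defined here, not called, because it requires network access and valid BacDive credentials. According to the Returns section above, its result would be a list of dataframes, one per input.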

For accessing specific data entries in your resulting dataframe you can either run bacdiving.get_resulting_df_values() or bacdiving.access_list_df_objects().

bacdiving.get_resulting_df_values(resulting_df, plot_column=' ', plot_category=' ', species_list=[' '])

Access all categories of interest only for a column of interest from the resulting dataframe.

Parameters
  • resulting_df (pandas.DataFrame) – Resulting dataframe as outputted by bacdive_call().

  • plot_column (str) – Column of interest from resulting_df.

  • plot_category (str) – Category of interest from column of interest from resulting_df.

  • species_list (list[str]) – List of species.

Returns

Dictionary: <species> : <values>

bacdiving.access_list_df_objects(resulting_df, plot_column=' ', plot_category=' ', temp=0, pH=0, halophily=0, species_list=[' '])

Access all categories of interest only for the pH, temperature and halophily columns from the resulting dataframe.

Parameters
  • resulting_df (pandas.DataFrame) – Resulting dataframe as outputted by bacdive_call().

  • plot_column (str) – Column of interest from resulting_df.

  • plot_category (str) – Category of interest from column of interest from resulting_df.

  • temp (int) – Either one of temp, pH or halophily can be accessed. If temp = 1, temp will be accessed.

  • pH (int) – Either one of temp, pH or halophily can be accessed. If pH = 1, pH will be accessed.

  • halophily (int) – Either one of temp, pH or halophily can be accessed. If halophily = 1, halophily will be accessed.

  • species_list (list[str]) – List of species.

Returns

Dictionary: <species> : <values>

However, bacdiving.access_list_df_objects() is only designed to be used if you are interested in retrieving information for either pH, temperature or halophily (e.g. prior to making a box plot), whereas bacdiving.get_resulting_df_values() is more generic.
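Because access_list_df_objects() expects exactly one of temp, pH or halophily to be set to 1, a small wrapper can make that choice explicit. The sketch below is only an illustration: selector_flags and fetch_values are hypothetical helpers, the import assumes the functions are reachable as bacdiving.<name>, and plot_column and plot_category are passed through unchanged because their valid values depend on your resulting dataframe.

```python
def selector_flags(kind):
    """Return temp/pH/halophily keyword arguments with exactly one flag set to 1."""
    names = ("temp", "pH", "halophily")
    if kind not in names:
        raise ValueError(f"kind must be one of {names}")
    return {name: int(name == kind) for name in names}


def fetch_values(resulting_df, kind, plot_column, plot_category, species_list):
    """Forward to access_list_df_objects() with the flag for `kind` switched on."""
    import bacdiving  # assumption: functions are reachable as bacdiving.<name>

    return bacdiving.access_list_df_objects(
        resulting_df,
        plot_column=plot_column,
        plot_category=plot_category,
        species_list=species_list,
        **selector_flags(kind),
    )


print(selector_flags("pH"))
```

The returned dictionary has the documented <species> : <values> shape and could, for example, be passed on to a box plot.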

Visualizations

Bacdiving supports 8 different visualization types:

  1. Circular hierarchical taxonomic tree plot (also referred to as overview tree plot since it gives information on which species have what kind of BacDive information):

bacdiving.overview_treeplot(resulting_df, pallete='brg', colormap1='bwr', column_name1='Culture and growth conditions.culture temp.temperature', column_name2='Physiology and metabolism.oxygen tolerance.oxygen tolerance', label_name1='Category1', label_name2='Category2', colormap2='Wistia', fontsize=14, figsize=[20, 10], saveToFile=True, output_dir='./')

Makes an overview tree plot showing the hierarchical tree structure for all species of the input as well as at most 2 BacDive columns of interest.

Parameters
  • resulting_df (pandas.DataFrame) – Resulting dataframe as outputted by bacdive_call().

  • pallete (str) – Color palette used.

  • colormap1 (str) – Color map used for first column of interest.

  • column_name1 (str) – First column of interest from resulting_df to plot.

  • column_name2 (str) – Second column of interest from resulting_df to plot.

  • label_name1 (str) – Legend label for first column of interest.

  • label_name2 (str) – Legend label for second column of interest.

  • colormap2 (str) – Color map for second column of interest.

  • fontsize (int) – Size of font.

  • figsize ([x, y] array-like of floats) – Size of plot.

  • saveToFile (bool) – Save plot or not.

  • output_dir (str) – Path where plot should be saved.

Returns

Overview plot

A similar circular hierarchical tree plot but without showing BacDive information can be created as well:

bacdiving.circular_treeplot(resulting_df, width=1400, height=1400, saveToFile=True, output_format='pdf', output_dir='./')

Makes tree plot showing hierarchical tree structure for all species of input.

Parameters
  • resulting_df (pandas.DataFrame) – Resulting dataframe as outputted by bacdive_call().

  • width (int) – Width of tree plot.

  • height (int) – Height of tree plot.

  • saveToFile (bool) – Save plot or not.

  • output_format (str) – Output file type. Possible file formats include: pdf, svg and html.

  • output_dir (str) – Path where plot should be saved.

Returns

Circular treeplot

  2. Stacked bar plot to show relative abundance (of e.g. different genera) per sample:

bacdiving.stacked_barplot_relative_abundance(resulting_df, top_x=15, sample_names=[' '], plot_column=' ', title=' ', title_label=' ', saveToFile=True, output_dir='./', figsize=[15, 10])

Makes stacked bar plot for any taxonomy level from resulting dataframe.

Parameters
  • resulting_df (pandas.DataFrame) – Resulting dataframe as outputted by bacdive_call().

  • top_x (int) – Limit for how many different color categories should be seen in the plot.

  • sample_names (list[str]) – List of names for each sample.

  • plot_column (str) – Taxonomy level of interest (e.g. Name and taxonomic classification.genus).

  • title (str) – Title for this plot.

  • title_label (str) – Title for legend (e.g. Genus).

  • saveToFile (bool) – Save plot or not.

  • output_dir (str) – Path where plot should be saved.

  • figsize ([x, y] array-like of floats) – Size of the resulting plot.

Returns

Stacked bar plot
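To make the role of top_x concrete, the sketch below computes relative abundances for a single sample and limits the number of categories. It is a conceptual illustration only, not the code behind stacked_barplot_relative_abundance(): pooling the remaining categories under an "Other" label is an assumption, the function name is hypothetical, and the genus counts are toy values.

```python
from collections import Counter


def relative_abundance(labels, top_x):
    """Share of each label in one sample, keeping the top_x labels and pooling the rest."""
    counts = Counter(labels)
    total = sum(counts.values())
    top = counts.most_common(top_x)
    shares = {label: n / total for label, n in top}
    rest = total - sum(n for _, n in top)
    if rest:
        # Assumption: everything outside the top_x categories is pooled.
        shares["Other"] = rest / total
    return shares


# Toy sample: one genus label per strain.
sample = ["Bacteroides"] * 5 + ["Escherichia"] * 3 + ["Bacillus", "Clostridium"]
shares = relative_abundance(sample, top_x=2)
print(shares)
```

Each sample in sample_names would correspond to one such set of shares, drawn as one stacked bar.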

  3. Pie chart to plot information like oxygen tolerance:

bacdiving.pieplot_maker(resulting_df, plot_column, title=' ', ylabel_name=' ', saveToFile=False, output_dir='./', figsize=[6.4, 4.8])

Makes pie plot for columns of interest from resulting dataframe.

Parameters
  • resulting_df (pandas.DataFrame) – Resulting dataframe as outputted by bacdive_call().

  • plot_column (str) – Column of interest from resulting_df.

  • title (str) – Title for this plot.

  • ylabel_name (str) – y-axis label name.

  • saveToFile (bool) – Save plot or not.

  • output_dir (str) – Path where plot should be saved.

  • figsize ([x, y] array-like of floats) – Size of the resulting plot.

Returns

Pie plot

  4. World map to show all countries (not water bodies!) of origin for a given set of species:

bacdiving.worldmap_maker(resulting_df)

Makes world map displaying all countries from which species in the input originate.

Parameters
  • resulting_df (pandas.DataFrame) – Resulting dataframe as outputted by bacdive_call().

Returns

World map

  5. Fatty acid profile plot for a fatty acid of interest:

bacdiving.fatty_acid_profile(resulting_df, species='', title='Fatty acid profile plot', figsize=[10, 10], barwidth=0.05, fontsize=6, saveToFile=True, output_dir='./')

Makes fatty acid profile plot for any one fatty acid of interest from resulting dataframe.

Parameters
  • resulting_df (pandas.DataFrame) – Resulting dataframe as outputted by bacdive_call().

  • species (str) – Species of interest (e.g. Bacteroides vulgatus).

  • title (str) – Title for this plot.

  • figsize ([x, y] array-like of floats) – Size of the resulting plot.

  • barwidth (float) – Width of the bars.

  • fontsize (int) – Size of the font.

  • saveToFile (bool) – Save plot or not.

  • output_dir (str) – Path where plot should be saved.

Returns

Fatty acid profile plot

  6. Frequency plot (of e.g. most frequent sampling type):

bacdiving.freqplot_maker(resulting_df, plot_column=' ', title=' ', ylabel_name=' ', saveToFile=False, output_dir='./', figsize=[15, 10])

Makes frequency plot for column of interest from resulting dataframe.

Parameters
  • resulting_df (pandas.DataFrame) – Resulting dataframe as outputted by bacdive_call().

  • plot_column (str) – Column of interest from resulting_df.

  • title (str) – Title for this plot.

  • ylabel_name (str) – y-axis label name.

  • saveToFile (bool) – Save plot or not.

  • output_dir (str) – Path where plot should be saved.

  • figsize ([x, y] array-like of floats) – Size of the resulting plot.

Returns

Frequency plot

  7. Box plot to compare e.g. optimal temperature ranges for various species:

bacdiving.boxplot_maker(resulting_dict, title=' ', xlabel_name=' ', ylabel_name=' ', saveToFile=False, output_dir='./', figsize=[15, 10])

Makes box plot given a dictionary with values of interest.

Parameters
  • resulting_dict (dict) – Dictionary input with values (e.g. temperature or pH).

  • title (str) – Title for this plot.

  • xlabel_name (str) – x-axis label name.

  • ylabel_name (str) – y-axis label name.

  • saveToFile (bool) – Save plot or not.

  • output_dir (str) – Path where plot should be saved.

  • figsize ([x, y] array-like of floats) – Size of the resulting plot.

Returns

Box plot
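As a sketch of the kind of input boxplot_maker() works with, the example below builds a dictionary in the <species> : <values> shape and computes the five-number summary that a box plot visualizes. It assumes that each value is a list of numbers, which is not stated explicitly above; the numbers are made up, five_number_summary and draw_boxplot are hypothetical helpers, and the import assumes the functions are reachable as bacdiving.<name>.

```python
import statistics


def five_number_summary(values):
    """Min, quartiles and max of one species' values (what a box summarizes)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}


def draw_boxplot(resulting_dict, output_dir="./"):
    """Pass the dictionary on to boxplot_maker() using its documented parameters."""
    import bacdiving  # assumption: functions are reachable as bacdiving.<name>

    return bacdiving.boxplot_maker(
        resulting_dict,
        title="Temperature per species",
        xlabel_name="Species",
        ylabel_name="Temperature",
        saveToFile=False,
        output_dir=output_dir,
    )


# Made-up numbers in the assumed <species> : <list of values> shape.
toy_dict = {
    "species A": [25.0, 30.0, 37.0, 40.0, 45.0],
    "species B": [20.0, 28.0, 30.0],
}
summary = {sp: five_number_summary(vals) for sp, vals in toy_dict.items()}
print(summary["species A"])
```

In practice the dictionary would come from bacdiving.access_list_df_objects() or bacdiving.get_resulting_df_values() rather than being typed by hand.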

  8. Bar plot to compare e.g. cell length of different species:

bacdiving.barplot_maker(resulting_df, plot_column=' ', title=' ', ylabel_name=' ', xlabel_name=' ', color='green', species_list=[], saveToFile=False, output_dir='./', figsize=[15, 10])

Makes bar plot for any continuous column of interest from resulting dataframe.

Parameters
  • resulting_df (pandas.DataFrame) – Resulting dataframe as outputted by bacdive_call().

  • plot_column (str) – Continuous column of interest from resulting_df.

  • title (str) – Title for this plot.

  • ylabel_name (str) – y-axis label name.

  • xlabel_name (str) – x-axis label name.

  • color (str) – Color of bars.

  • species_list (list[str]) – List of species of interest.

  • saveToFile (bool) – Save plot or not.

  • output_dir (str) – Path where plot should be saved.

  • figsize ([x, y] array-like of floats) – Size of the resulting plot.

Returns

Bar plot