Code and data for "Changes in carbon oxidation state of metagenomes along geochemical redox gradients"

This deposition contains the code and data files used to make the figures in the paper "Changes in carbon oxidation state of metagenomes along geochemical redox gradients".

The software requirements are R version 3.6.0 (R Core Team, 2018) and the CHNOSZ package version 1.1.3 or later (Dick, 2008). The files are described below:

  • plot.R - High-level functions for making the plots. See the usage instructions at the top of the file.
  • MG.R - Underlying functions for making the plots, as well as functions for data preparation. The data preparation functions were used to sample sequences from the source FASTA files (not included in this deposition) and to generate the tables of DNA and protein composition (included in this deposition). See the usage instructions at the top of the file.
  • ARAST.R - ("Abbreviated RAST") Functions that implement a sequence-processing workflow similar to the first few steps of the MG-RAST pipeline (Meyer et al., 2008). They depend on additional software, which is described at the top of the file.
  • arast_sortme_rna.pl, sort_helper.sh - Non-R code used by ARAST. The first script is taken from MG-RAST, with modifications described at the top of the file.
  • kraken.R - Implements the workflow for taxonomic classification of sequences with Kraken. See the usage instructions at the top of the file.
  • AA_codon.csv - Codon usage data for two organisms, required by Fig1() in plot.R.
  • data/CUB/ - Source data for AA_codon.csv, downloaded from the Codon Usage Database (Nakamura et al., 2000).
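
The software requirements stated above can be checked from an R session before running any of the scripts. The following is a minimal sketch, not part of the deposited code. It assumes that R versions later than 3.6.0 are also acceptable, which this README does not state explicitly; the variable names are made up for illustration.

```r
# Compare the running R version with the version named in this README.
# Assumption: a later R version is treated as acceptable.
r_ok <- getRversion() >= "3.6.0"

# Check whether CHNOSZ is installed, and if so whether it is at least 1.1.3.
chnosz_installed <- requireNamespace("CHNOSZ", quietly = TRUE)
chnosz_ok <- chnosz_installed && packageVersion("CHNOSZ") >= "1.1.3"

if (!r_ok) message("R is older than 3.6.0")
if (!chnosz_installed) {
  message("CHNOSZ is not installed; try install.packages(\"CHNOSZ\")")
} else if (!chnosz_ok) {
  message("CHNOSZ is older than 1.1.3")
}
```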

The following files were generated by processing the source FASTA files using the functions in MG.R. These files are required for Figures 2 to 5 in the paper:

  • data/MGD/ - DNA nucleobase compositions for groups of sequences sampled from source FASTA files that have been trimmed, normalized, and dereplicated as part of the ARAST workflow. For metatranscriptomes, the sequences are further limited to those not classified as rRNA by FragGeneScan (Rho et al., 2010). Files whose names contain "MGD" or "MTD" are derived from metagenomic or metatranscriptomic datasets, respectively.
  • data/MGR/ - RNA nucleobase compositions for groups of sequences sampled from coding DNA FASTA files generated by FragGeneScan. Files whose names contain "MGR" or "MTR" are derived from metagenomic or metatranscriptomic datasets, respectively.
  • data/MGP/ - Protein amino acid compositions for groups of sequences sampled from protein FASTA files generated by FragGeneScan. Files whose names contain "MGP" or "MTP" are derived from metagenomic or metatranscriptomic datasets, respectively.
  • one_percent/* - Files generated using the find_species() function in kraken.R. That function uses the Kraken (Wood and Salzberg, 2014) report files to calculate taxon abundances as a percentage of the number of classified reads, then extracts species that make up at least 1% of the classified reads.
  • embed/* - PDF files of all the figures with embedded fonts.
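
The abundance filter described for find_species() can be illustrated with a small, self-contained sketch. This is not the code in kraken.R. The toy table, its column names (clade_reads, rank, name), and the variable names are hypothetical stand-ins for a parsed Kraken report. The sketch expresses each species' read count as a percentage of the classified reads (the reads under the root), then keeps the species at 1% or more.

```r
# Toy stand-in for a parsed Kraken report (hypothetical column names).
# clade_reads: number of reads in the clade rooted at each taxon.
report <- data.frame(
  clade_reads = c(200, 800, 500, 294, 6),
  rank = c("U", "-", "S", "S", "S"),
  name = c("unclassified", "root", "Species A", "Species B", "Species C"),
  stringsAsFactors = FALSE
)

# All classified reads fall under the root, so use its count as the denominator.
classified <- report$clade_reads[report$name == "root"]

# Keep species-level rows and compute their share of the classified reads.
species <- report[report$rank == "S", ]
species$percent <- 100 * species$clade_reads / classified

# Retain species that make up at least 1% of the classified reads.
abundant <- species[species$percent >= 1, c("name", "percent")]
print(abundant)
```

In this toy table, Species C accounts for 6 of 800 classified reads (0.75%) and is dropped, while Species A and Species B are retained. Unclassified reads are excluded from the denominator, matching the description above.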