A personal compilation of QIIME script command usage, by peterrjp

add_alpha_to_mapping_file.py – Add alpha diversity data to a metadata mapping file

Description: Add alpha diversity data to a mapping file for use with other QIIME scripts, i.e. make_3d_plots.py. The resulting mapping file will contain three new columns per metric in the alpha diversity data: the first column holds the raw value, the second a normalized raw value, and the third a label classifying the bin into which this value falls based on the normalized value.

Usage: add_alpha_to_mapping_file.py [options]

Input Arguments:
[REQUIRED]
-i, --alpha_fps Alpha diversity data with one or multiple metrics, i.e. the output of alpha_diversity.py. This can also be a comma-separated list of collated alpha diversity file paths, i.e. the output of collate_alpha.py; when using collated alpha diversity data the --depth option is required.
-m, --mapping_fp Mapping file to modify by adding the alpha diversity data.
[OPTIONAL]
-o, --output_mapping_fp Filepath for the modified mapping file [default: mapping_file_with_alpha.txt]
-b, --number_of_bins Number of bins [default: 4]
-x, --missing_value_name Bin prefix name for the sample identifiers that exist in the mapping file (mapping_fp) but not in the alpha diversity file (alpha_fp) [default: N/A]
--binning_method Method used to create the bins; the options are 'equal' and 'quantile'. Both methods work over the normalized alpha diversity values. 'equal' assigns the bins on equally spaced limits, depending on the value of --number_of_bins, i.e. if you select 4 the limits will be [0.25, 0.50, 0.75]. 'quantile' selects the limits based on --number_of_bins, i.e. the limits will be the quartiles if 4 is selected [default: equal]
--depth The rarefaction depth to use when alpha_fps refers to collated alpha diversity file(s), i.e. the output of collate_alpha.py. All the iterations contained at this depth will be averaged to form a single mean value [default: highest depth available]
--collated_input Use to specify that the -i option is composed of collated alpha diversity data.

Output: The result of running this script is a metadata mapping file that includes three new columns per alpha diversity metric in the alpha diversity file. For example, with an alpha diversity file containing only PD_whole_tree, the new columns will be PD_whole_tree_alpha, PD_whole_tree_normalized and PD_whole_tree_bin.

Adding alpha diversity data: Add the alpha diversity values to a mapping file and classify the normalized values into 4 bins, where the limits will be 0 < x <= 0.25 for the first bin, 0.25 < x <= 0.5 for the second bin, 0.5 < x <= 0.75 for the third bin and 0.75 < x <= 1 for the fourth bin.

Adding alpha diversity data with the quantile method: Add the alpha diversity values to a mapping file and classify the normalized values using the quartiles of the distribution of these values.

Adding collated alpha diversity data: Add the mean of the alpha diversity values at a specified rarefaction depth; this case is for use with the output of collate_alpha.py. It is recommended that the filenames are the name of the metric used in each file. Illustrative command lines for these examples are sketched below.

add_qiime_labels.py – Takes a directory, a metadata mapping file, and a column name that contains the fasta file names that SampleIDs are associated with, combines all files that have valid fasta extensions into a single fasta file, with valid QIIME fasta labels.
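For the add_alpha_to_mapping_file.py entry above, the following invocations are illustrative sketches only, built from the options documented here; the input and output file names (adiv_pd.txt, mapping.txt, PD_whole_tree.txt and the modified mapping files) are placeholders, not files referenced by this compilation:

# bin the normalized alpha diversity values into 4 equally spaced bins (default method)
add_alpha_to_mapping_file.py -i adiv_pd.txt -m mapping.txt -b 4 -o mapping_with_alpha.txt
# bin by quartiles of the normalized values instead
add_alpha_to_mapping_file.py -i adiv_pd.txt -m mapping.txt -b 4 --binning_method quantile -o mapping_with_alpha_quantile.txt
# collated input (output of collate_alpha.py) needs --collated_input and a rarefaction depth
add_alpha_to_mapping_file.py -i PD_whole_tree.txt -m mapping.txt --collated_input --depth 100 -o mapping_with_alpha_collated.txt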
Description: A metadata mapping file with SampleIDs and fasta file names (just the file name itself, not the full or relative filepath) is used to generate a combined fasta file with valid QIIME labels based upon the SampleIDs specified in the mapping file. See: #metadata-mapping-files for details about the metadata file format. Example mapping file: #SampleID BarcodeSequence LinkerPrimerSequence InputFileName Description Sample.1 AAAACCCCGGGG CTACATAATCGGRATT seqs1.fna sample.1 Sample.2 TTTTGGGGAAAA CTACATAATCGGRATT seqs2.fna sample.2 This script is to handle situations where fasta data comes already demultiplexed into a one fasta file per sample basis. Only alters the fasta label to add a QIIME compatible label at the beginning. Example: With the metadata mapping file above, and an specified directory containing the files seqs1.fna and seqs2.fna, the first line from the seqs1.fna file might look like this: >FLP3FBN01ELBSX length=250 xy=1766_0111 region=1 run=R_2008_12_09_13_51_01_ AACAGATTAGACCAGATTAAGCCGAGATTTACCCGA and in the output combined fasta file would be written like this >Sample.1_0 FLP3FBN01ELBSX length=250 xy=1766_0111 region=1 run=R_2008_12_09_13_51_01_ AACAGATTAGACCAGATTAAGCCGAGATTTACCCGA No changes are made to the sequences. add_qiime_labels.py [options] Usage: Input Arguments: [REQUIRED] -m, --mapping_fp SampleID to fasta file name mapping file filepath -i, --fasta_dir Directory of fasta files to combine and label. -c, --filename_column Specify column used in metadata mapping file for fasta file names. [OPTIONAL] -o, --output_dir Required output directory for log file and corrected mapping file, log file, and html file. [default: .] -n, --count_start Specify the number to start enumerating sequence labels with. [default: 0] Output: A combined_seqs.fasta file will be created in the output directory, with the sequences assigned to the SampleID given in the metadata mapping file. Example: Specify fasta_dir as the input directory of fasta files, use the metadata mapping file example_mapping.txt, with the metadata fasta file name column specified as InputFileName, start enumerating with 1000000, and output the data to the directory combined_fasta adjust_seq_orientation.py – Get the reverse complement of all sequences Description: Write the reverse complement of all seqs in seqs.fasta (-i) to seqs_rc.fasta (default, change output_fp with -o). Each sequence description line will have „ RC? appended to the end of it (default, leave sequence description lines untouched by passing -r): Usage: adjust_seq_orientation.py [options] Input Arguments: [REQUIRED] -i, --input_fasta_fp Path to the input fasta file [OPTIONAL] -o, --output_fp The output filepath -r, --retain_seq_id Leave seq description lines untouched [default: append ” RC” to seq description lines] Output: Example: Reverse complement all sequences in seqs.fna and write result to seqs_rc.fna align_seqs.py – Align sequences using a variety of alignment methods Description: This script aligns the sequences in a FASTA file to each other or to a template sequence alignment, depending on the method chosen. Currently, there are three methods which can be used by the user: 1. PyNAST (Caporaso et al., 2009) - The default alignment method is PyNAST, a python implementation of the NAST alignment algorithm. The NAST algorithm aligns each provided sequence (the “candidate” sequence) to the best-matching sequence in a pre-aligned database of sequences (the “template” sequence). 
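Hedged example invocations for add_qiime_labels.py and adjust_seq_orientation.py, as described above (fasta_dir, example_mapping.txt, InputFileName, 1000000, combined_fasta, seqs.fna and seqs_rc.fna are the names used in the surrounding text):

# combine per-sample fasta files named in the InputFileName column, starting labels at 1000000
add_qiime_labels.py -i fasta_dir -m example_mapping.txt -c InputFileName -n 1000000 -o combined_fasta
# write the reverse complement of every sequence in seqs.fna to seqs_rc.fna
adjust_seq_orientation.py -i seqs.fna -o seqs_rc.fna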
Candidate sequences are not permitted to introduce new gap characters into the template database, so the algorithm introduces local mis-alignments to preserve the existing template sequence. 2. MUSCLE (Edgar, 2004) - MUSCLE is an alignment method which stands for MUltiple Sequence Comparison by Log-Expectation. 3. INFERNAL (Nawrocki, Kolbe, & Eddy, 2009) - Infernal (“INFERence of RNA ALignment”) is for an alignment method for using RNA structure and sequence similarities. align_seqs.py [options] Usage: Input Arguments: [REQUIRED] -i, --input_fasta_fp Path to the input fasta file [OPTIONAL] -m, --alignment_method Method for aligning sequences. Valid choices are: pynast, infernal, clustalw, muscle, infernal, mafft [default: pynast] -a, --pairwise_alignment_method Method for performing pairwise alignment in PyNAST. Valid choices are muscle, pair_hmm, clustal, blast, uclust, mafft [default: uclust] -t, --template_fp Filepath for template against [default: /Users/caporaso/data/greengenes_core_sets/core_set_aligned_imputed.fasta_11_8_07.no_dots] -e, --min_length Minimum sequence length to include in alignment [default: 75% of the median input sequence length] -p, --min_percent_id Minimum percent sequence identity to closest blast hit to include sequence in alignment [default: 0.75] -d, --blast_db Database to blast against when -m pynast [default: created on-the-fly from template_alignment] --muscle_max_memory Maximum memory allocation for the muscle alignment method (MB) [default: 80% of available memory, as detected by MUSCLE] -o, --output_dir Path to store result file [default: _aligned] Output: All aligners will output a fasta file containing the alignment and log file in the directory specified by --output_dir (default _aligned). PyNAST additionally outputs a failures file, containing the sequences which failed to align. So the result of align_seqs.py will be up to three files, where the prefix of each file depends on the user supplied FASTA file: 1. ”..._aligned.fasta” - This is a FASTA file containing all aligned sequences. 2. ”..._failures.fasta” - This is a FASTA file containing all sequences which did not meet all the criteria specified. (PyNAST only) 3. ”..._log.txt” - This is a log file containing information pertaining to the results obtained from a particular method (e.g. BLAST percent identity, etc.). Alignment with PyNAST: The default alignment method is PyNAST, a python implementation of the NAST alignment algorithm. The NAST algorithm aligns each provided sequence (the “candidate” sequence) to the best-matching sequence in a pre-aligned database of sequences (the “template” sequence). Candidate sequences are not permitted to introduce new gap characters into the template database, so the algorithm introduces local mis-alignments to preserve the existing template sequence. The quality thresholds are the minimum requirements for matching between a candidate sequence and a template sequence. The set of matching template sequences will be searched for a match that meets these requirements, with preference given to the sequence length. By default, the minimum sequence length is 150 and the minimum percent id is 75%. The minimum sequence length is much too long for typical pyrosequencing reads, but was chosen for compatibility with the original NAST tool. The following command can be used for aligning sequences using the PyNAST method, where we supply the program with a FASTA file of unaligned sequences (i.e. 
the resulting FASTA file from pick_rep_set.py), a FASTA file of pre-aligned sequences (this is the template file, which is typically the Greengenes core set, available from the Greengenes website), and the results will be written to the directory "pynast_aligned/". Alternatively, one could change the minimum sequence length ("-e") requirement and the minimum sequence identity ("-p"), using the following command:

Alignment with MUSCLE: One could also use the MUSCLE algorithm. The following command can be used to align sequences (i.e. the resulting FASTA file from pick_rep_set.py), where the output is written to the directory "muscle_alignment/":

Alignment with Infernal: An alternative alignment method is to use Infernal. Infernal is similar to the PyNAST method in that you supply a template alignment, although Infernal has several distinct differences. Infernal takes a multiple sequence alignment with a corresponding secondary structure annotation; this input file must be in Stockholm alignment format. A fairly good description of the Stockholm format rules is available online. Infernal will use the sequence and secondary structural information to align the candidate sequences to the full reference alignment. Similar to PyNAST, Infernal will not allow gaps to be inserted into the reference alignment. Using Infernal is slower than the other methods, and it is therefore best used with sequences that do not align well using PyNAST. The following command can be used for aligning sequences using the Infernal method, where we supply the program with a FASTA file of unaligned sequences and a Stockholm file of pre-aligned sequences and secondary structure (this is the template file; an example file is available online), and the results will be written to the directory "infernal_aligned/":

alpha_diversity.py – Calculate alpha diversity on each sample in an otu table, using a variety of alpha diversity metrics

Description: This script calculates alpha diversity, or within-sample diversity, using an otu table. The QIIME pipeline allows users to conveniently calculate more than two dozen different diversity metrics. The full list of available metrics can be shown by passing the option -s to alpha_diversity.py. Each metric has different strengths and limitations; technical discussion of each metric is readily available online and in ecology textbooks, but is beyond the scope of this document.

Usage: alpha_diversity.py [options]

Input Arguments:
[OPTIONAL]
-i, --input_path Input OTU table filepath or input directory containing OTU tables for batch processing. [default: None]
-o, --output_path Output filepath, or output directory to store the result files when batch processing. [default: None]
-m, --metrics Alpha-diversity metric(s) to use. A comma-separated list should be provided when multiple metrics are specified. [default: PD_whole_tree,chao1,observed_species]
-s, --show_metrics Show the available alpha-diversity metrics and exit.
-t, --tree_path Input newick tree filepath. [default: None; REQUIRED for phylogenetic metrics]

Output: The resulting file(s) is a tab-delimited text file where the columns correspond to alpha diversity metrics and the rows correspond to samples and their calculated diversity measurements. When a folder is given as input (-i), the script processes every otu table file in the given folder and creates a corresponding file in the output directory.
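Illustrative invocations for the align_seqs.py examples above and the alpha_diversity.py examples that follow; these are sketches assembled from the documented options, not commands reproduced from the original documentation. The output names pynast_aligned/, muscle_alignment/, alpha_div.txt and alpha_div_chao1_PD/ come from the text; the input names (rep_set.fna, core_set_aligned.fasta.imputed, otu_table.biom, rep_set.tre, rarefied_otu_tables/) and the -e/-p values are placeholders:

# PyNAST alignment of representative sequences against a template alignment (default method)
align_seqs.py -i rep_set.fna -t core_set_aligned.fasta.imputed -o pynast_aligned/
# relax the minimum length (-e) and minimum percent identity (-p) thresholds
align_seqs.py -i rep_set.fna -t core_set_aligned.fasta.imputed -e 60 -p 0.60 -o pynast_aligned_relaxed/
# MUSCLE alignment of the same input
align_seqs.py -i rep_set.fna -m muscle -o muscle_alignment/
# single-file alpha diversity with a non-phylogenetic metric
alpha_diversity.py -i otu_table.biom -m chao1 -o alpha_div.txt
# a phylogenetic metric additionally requires a newick tree
alpha_diversity.py -i otu_table.biom -m PD_whole_tree -t rep_set.tre -o alpha_div_pd.txt
# several metrics at once, batch mode over a directory of rarefied OTU tables
alpha_diversity.py -i rarefied_otu_tables/ -m chao1,PD_whole_tree,observed_species -t rep_set.tre -o alpha_div_chao1_PD/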
Example Output:

          simpson   PD_whole_tree   observed_species
PC.354    0.925     2.83739         16.0
PC.355    0.915     3.06609         14.0
PC.356    0.945     3.10489         19.0
PC.481    0.945     3.65695         19.0
PC.593    0.91      3.3776          15.0
PC.607    0.92      4.13397         16.0
PC.634    0.9       3.71369         14.0
PC.635    0.94      4.20239         18.0
PC.636    0.925     3.78882         16.0

Single File Alpha Diversity Example (non-phylogenetic): To perform alpha diversity (e.g. chao1) on a single OTU table, where the results are output to "alpha_div.txt", you can use the following command:

Single File Alpha Diversity Example (phylogenetic): In the case that you would like to perform alpha diversity using a phylogenetic metric (e.g. PD_whole_tree), you can use the following command:

Single File Alpha Diversity Example with multiple metrics: You can use the following idiom to run multiple metrics at once (comma-separated):

Multiple File (batch) Alpha Diversity: To perform alpha diversity on multiple OTU tables (e.g. rarefied otu tables resulting from multiple_rarefactions.py), specify an input directory instead of a single otu table, and an output directory (e.g. "alpha_div_chao1_PD/") as shown by the following command:

alpha_diversity_metrics – List of available metrics

Non-phylogeny based metrics:
berger_parker_d
brillouin_d
chao1
chao1_confidence
dominance
doubles (# OTUs with exactly two individuals in sample)
equitability
fisher_alpha
gini_index
goods_coverage
heip_e (note: using heip_e at low (<5) individuals may cause errors)
kempton_taylor_q
margalef
mcintosh_d
mcintosh_e
menhinick
michaelis_menten_fit
observed_species
osd (observed # OTUs, singleton OTUs, doubleton OTUs)
robbins
shannon (base 2 is used in the logarithms)
simpson (1 - Dominance)
simpson_reciprocal (1 / Dominance)
simpson_e
singles (# OTUs with exactly one individual present in sample)
strong

Phylogeny based metrics:
PD_whole_tree

alpha_rarefaction.py – A workflow script for performing alpha rarefaction

Description: The steps performed by this script are: generate rarefied OTU tables; compute alpha diversity metrics for each rarefied OTU table; collate the alpha diversity results; and generate alpha rarefaction plots.

Usage: alpha_rarefaction.py [options]

Input Arguments:
[REQUIRED]
-i, --otu_table_fp The input otu table
-m, --mapping_fp Path to the mapping file
-o, --output_dir The output directory
[OPTIONAL]
-p, --parameter_fp Path to the parameter file, which specifies changes to the default behavior. See #qiime-parameters. [if omitted, default values will be used]
-n, --num_steps Number of steps (or rarefied OTU table sizes) to make between min and max counts [default: 10]
-f, --force Force overwrite of existing output directory (note: existing files in output_dir will not be removed) [default: None]
-w, --print_only Print the commands but don't call them, useful for debugging [default: False]
-a, --parallel Run in parallel where available [default: False]
-t, --tree_fp Path to the tree file [default: None; REQUIRED for phylogenetic measures]
--min_rare_depth The lower limit of rarefaction depths [default: 10]
-e, --max_rare_depth The upper limit of rarefaction depths [default: median sequence/sample count]
-O, --jobs_to_start Number of jobs to start.
NOTE: you must also pass -a to run in parallel, this defines the number of jobs to be started if and only if -a is passed [default: 2] Output: The primary interface for the results will be OUTPUT_DIR/alpha_rarefaction_plots/rarefaction_plots.html where OUTPUT_DIR is the value you specify with -o. You can open this in a web browser for interactive alpha rarefaction plots. Example: Given an OTU table, a phylogenetic tree, a mapping file, and a max sample depth, compute alpha rarefaction plots for the PD, observed species and chao1 metrics. To specify alternative metrics pass a parameter file via -p. We generally recommend that the max depth specified here (-e) is the same as the even sampling depth provided to beta_diversity_through_plots (also -e). ampliconnoise.py – Run AmpliconNoise Description: The steps performed by this script are: 1. Split input sff.txt file into one file per sample 2. Run scripts required for PyroNoise 3. Run scripts required for SeqNoise 4. Run scripts requred for Perseus (chimera removal) 5. Merge output files into one file similar to the output of split_libraries.py This script produces a denoised fasta sequence file such as: >PC.355_41 CATGCTGCCTC... ... >PC.636_23 CATGCTGCCTC... ... Additionally, the intermediate results of the ampliconnoise pipeline are written to an output directory. Ampliconnoise must be installed and correctly configured, and parallelized steps will be called with mpirun, not qiime?s start_parallel_jobs_torque.py script. Usage: ampliconnoise.py [options] Input Arguments: [REQUIRED] -m, --mapping_fp The mapping filepath -i, --sff_filepath Sff.txt filepath -o, --output_filepath The output file [OPTIONAL] -n, --np Number of processes to use for mpi steps. Default: 2 --chimera_alpha Alpha value to Class.pl used for chimera removal Default: -3.8228 --chimera_beta Beta value to Class.pl used for chimera removal Default: 0.62 --seqnoise_resolution -s parameter passed to seqnoise. Default is 25.0 for titanium, 30.0 for flx -d, --output_dir Directory for ampliconnoise intermediate results. Default is output_filepath_dir -p, --parameter_fp Path to the parameter file, which specifies changes to the default behavior. See #qiime-parameters. [if omitted, default values will be used] -f, --force Force overwrite of existing output directory (note: existing files in output_dir will not be removed) [default: False] -w, --print_only Print the commands but don?t call them – useful for debugging [default: False] --suppress_perseus Omit perseus from ampliconnoise workflow --platform Sequencing technology, options are „titanium?,?flx?. [default: flx] --truncate_len Specify a truncation length for ampliconnoise. Note that is this is not specified, the truncate length is chosen by the –platform option (220 for FLX, 400 for Titanium) [default: None] Output: a fasta file of sequences, with labels as:?>sample1_0? , „>sample1_1? ... Run ampliconnoise, write output to anoise_out.fna, compatible with output of split_libraries.py assign_taxonomy.py – Assign taxonomy to each sequence Description: Contains code for assigning taxonomy, using several techniques. Given a set of sequences, assign_taxonomy.py attempts to assign the taxonomy of each sequence. Currently there are three methods implemented: assignment with BLAST, assignment with the RDP classifier, and assignment with the RTAX classifier. The output of this step is a mapping of input sequence identifiers (1st column of output file) to taxonomy (2nd column) and quality score (3rd column). 
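Hedged sketches of the alpha_rarefaction.py and ampliconnoise.py runs described above (otu_table.biom, rep_set.tre, Fasting_Map.txt and Fasting_subset.sff.txt are placeholder inputs; anoise_out.fna is the output name given in the text):

# rarefaction plots for the default metrics, at a maximum depth of 100 sequences per sample
alpha_rarefaction.py -i otu_table.biom -m Fasting_Map.txt -t rep_set.tre -e 100 -o alpha_rarefaction_out/
# run the AmpliconNoise pipeline and write split_libraries-style output to anoise_out.fna
ampliconnoise.py -m Fasting_Map.txt -i Fasting_subset.sff.txt -o anoise_out.fna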
The sequence identifier of the best BLAST hit is also included if the blast method is used (4th column). Example reference data sets and id_to_taxonomy maps can be found in the Greengenes OTUs. To get the latest build of those click the “Most recent Greengenes OTUs” link on the top right of. After downloading and unzipping you can use the following following files as -r and -t. As of this writing the latest build was gg_otus_4feb2011, but that portion of path to these files will change with future builds. Modify these paths accordining when calling assign_taxonomy.py. -r gg_otus_4feb2011/rep_set/gg_97_otus_4feb2011.fasta -t gg_otus_4feb2011/taxonomies/greengenes_tax_rdp_train.txt (best for retraining the RDP classifier) -t gg_otus_4feb2011/taxonomies/greengenes_tax.txt (best for BLAST taxonomy assignment) assign_taxonomy.py [options] Usage: Input Arguments: [REQUIRED] -i, --input_fasta_fp Path to the input fasta file [OPTIONAL] -t, --id_to_taxonomy_fp Path to tab-delimited file mapping sequences to assigned taxonomy. Each assigned taxonomy is provided as a semicolon-separated list. For assignment with rdp, each assigned taxonomy must be exactly 6 levels deep. [default: /Users/caporaso/data/gg_12_10_otus/taxonomy/97_otu_taxonomy.txt; REQUIRED when method is blast] -r, --reference_seqs_fp Path to reference sequences. For assignment with blast, these are used to generate a blast database. For assignment with rdp, they are used as training sequences for the classifier. [default: /Users/caporaso/data/gg_12_10_otus/rep_set/97_otus.fasta; REQUIRED if -b is not provided when method is blast] -p, --training_data_properties_fp Path to ”.properties” file in pre-compiled training data for the RDP Classifier. This option is overridden by the -t and -r options. [default: None] --read_1_seqs_fp Path to fasta file containing the first read from paired-end sequencing, prior to OTU clustering (used for RTAX only). [default: None] --read_2_seqs_fp Path to fasta file containing a second read from paired-end sequencing, prior to OTU clustering (used for RTAX only). [default: None] --single_ok When classifying paired ends, allow fallback to single-ended classification when the mate pair is lacking (used for RTAX only). [default: False] --no_single_ok_generic When classifying paired ends, do not allow fallback to single-ended classification when the mate pair is overly generic (used for RTAX only). [default: False] --read_id_regex Used to parse the result of OTU clustering, to get the read_1_id for each clusterID. (used for RTAX only). [default: S+s+(S+)] --amplicon_id_regex Used to parse the result of split_libraries, to get the ampliconID for each read_1_id. Two groups capture read_1_id and ampliconID, respectively. (used for RTAX only). [default: (S+)s+(S+?)/] --header_id_regex Used to choose the part of the header in the OTU clustering file that Rtax reports back as the ID. The default uses the amplicon ID, not including /1 or /3, as the primary key for the query sequences. (used for RTAX only). [default: S+s+(S+?)/] -m, --assignment_method Taxon assignment method, must be one of rdp, blast, rtax, mothur, tax2tree [default: rdp] -b, --blast_db Database to blast against. Must provide either –blast_db or –reference_seqs_db for assignment with blast [default: None] -c, --confidence Minimum confidence to record an assignment, only used for rdp and mothur methods [default: 0.8] --rdp_max_memory Maximum memory allocation, in MB, for Java virtual machine when using the rdp method. 
Increase for large training sets [default: 1500] -e, --e_value Maximum e-value to record an assignment, only used for blast method [default: 0.001] --tree_fp The filepath to a prebuilt tree containing both the representative and reference sequences. Required for Tax2Tree assignment. -o, --output_dir Path to store result file [default: _assigned_taxonomy] Output: The consensus taxonomy assignment implemented here is the most detailed lineage description shared by 90% or more of the sequences within the OTU (this level of agreement can be adjusted by the user). The full lineage information for each sequence is one of the output files of the analysis. In addition, a conflict file records cases in which a phylum-level taxonomy assignment disagreement exists within an OTU (such instances are rare and can reflect sequence misclassification within the greengenes database). Sample Assignment with BLAST: Taxonomy assignments are made by searching input sequences against a blast database of pre-assigned reference sequences. If a satisfactory match is found, the reference assignment is given to the input sequence. This method does not take the hierarchical structure of the taxonomy into account, but it is very fast and flexible. If a file of reference sequences is provided, a temporary blast database is built on-the-fly. The quality scores assigned by the BLAST taxonomy assigner are e-values. To assign the sequences to the representative sequence set, using a reference set of sequences and a taxonomy to id assignment text file, where the results are output to default directory “blast_assigned_taxonomy”, you can run the following command: Optionally, the user could changed the E-value (“-e”), using the following command: Assignment with the RDP Classifier: The RDP Classifier program (Wang, Garrity, Tiedje, & Cole, 2007) assigns taxonomies by matching sequence segments of length 8 to a database of previously assigned sequences. It uses a naive bayesian algorithm, which means that for each potential assignment, it attempts to calculate the probability of the observed matches, assuming that the assignment is correct and that the sequence segments are completely independent. The RDP Classifier is distributed with a pre-built database of assigned sequence, which is used by default. The quality scores provided by the RDP classifier are confidence values. Note: If a reference set of sequences and taxonomy to id assignment file are provided, the script will use them to generate a new training dataset for the RDP Classifier on-the-fly. Because of the RDP Classifier?s implementation, all lineages in the training dataset must contain the same number of ranks. To assign the representative sequence set, where the output directory is “rdp_assigned_taxonomy”, you can run the following command: Alternatively, the user could change the minimum confidence score (“-c”), using the following command: Sample Assignment with RTAX: Taxonomy assignments are made by searching input sequences against a fasta database of pre-assigned reference sequences. All matches are collected which match the query within 0.5% identity of the best match. A taxonomy assignment is made to the lowest rank at which more than half of these hits agree. Note that both unclustered read fasta files are required as inputs in addition to the representative sequence file. 
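Hedged sketches of the BLAST and RDP assignment commands described above (rep_set.fna, ref_seqs.fasta and id_to_taxonomy.txt are placeholder names; the output directories are the defaults named in the text, and the -e and -c values are illustrative):

# BLAST assignment against a reference set and id-to-taxonomy map, with a custom e-value
assign_taxonomy.py -i rep_set.fna -r ref_seqs.fasta -t id_to_taxonomy.txt -m blast -e 0.01 -o blast_assigned_taxonomy/
# RDP assignment with the bundled training data and a stricter confidence threshold
assign_taxonomy.py -i rep_set.fna -m rdp -c 0.85 -o rdp_assigned_taxonomy/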
To make taxonomic classifications of the representative sequences, using a reference set of sequences and a taxonomy to id assignment text file, where the results are output to default directory “rtax_assigned_taxonomy”, you can run the following command: Sample Assignment with Mothur: The Mothur software provides a naive bayes classifier similar to the RDP Classifier. A set of training sequences and id-to-taxonomy assignments must be provided. Unlike the RDP Classifier, sequences in the training set may be assigned at any level of the taxonomy. To make taxonomic classifications of the representative sequences, where the results are output to default directory “mothur_assigned_taxonomy”, you can run the following command: beta_diversity.py – Calculate beta diversity (pairwise sample dissimilarity) on one or many otu tables Description: The input for this script is the OTU table containing the number of sequences observed in each OTU (rows) for each sample (columns). For more information pertaining to the OTU table refer to the documentation for make_otu_table. If the user would like phylogenetic beta diversity metrics using UniFrac, a phylogenetic tree must also be passed as input (see make_phylogeny.py). The output of this script is a distance matrix containing a dissimilarity value for each pairwise comparison. A number of metrics are currently supported, including unweighted and weighted UniFrac (pass the -s option to see available metrics). In general, because unifrac uses phylogenetic information, one of the unifrac metrics is recommended, as results can be vastly more useful (Hamady & Knight, 2009). Quantitative measures (e.g. weighted unifrac) are ideally suited to revealing community differences that are due to changes in relative taxon abundance (e.g., when a particular set of taxa flourish because a limiting nutrient source becomes abundant). Qualitative measures (e.g. unweighted unifrac) are most informative when communities differ primarily by what can live in them (e.g., at high temperatures), in part because abundance information can obscure significant patterns of variation in which taxa are present (Lozupone et al., 2007). Most qualitative measures are referred to here e.g. “binary_jaccard”. Typically both weighted and unweighted unifrac are used. Usage: beta_diversity.py [options] Input Arguments: [OPTIONAL] -i, --input_path Input OTU table in biom format or input directory containing OTU tables in biom format for batch processing. -r, --rows Compute for only these rows of the distance matrix. User should pass a list of sample names (e.g. “s1,s3”) [default: None; full n x n matrix is generated] -o, --output_dir Output directory. One will be created if it doesn?t exist. -m, --metrics Beta-diversity metric(s) to use. A comma-separated list should be provided when multiple metrics are specified. [default: unweighted_unifrac,weighted_unifrac] -s, --show_metrics Show the available beta-diversity metrics and exit. Metrics starting with “binary...” specifies that a metric is qualitative, and considers only the presence or absence of each taxon [default: False] -t, --tree_path Input newick tree filepath, which is required when phylogenetic metrics are specified. [default: None] -f, --full_tree By default, tips not corresponding to OTUs in the OTU table are removed from the tree for diversity calculations. Pass to skip this step if you?re already passing a minimal tree. 
Beware with “full_tree” metrics, as extra tips in the tree change the result Output: Each file in the input directory should be an otu table, and the output of beta_diversity.py is a folder containing text files, each a distance matrix between samples corresponding to an input otu table. Single File Beta Diversity (non-phylogenetic): To perform beta diversity (using e.g. euclidean distance) on a single OTU table, where the results are output to beta_div/, use the following command: Single File Beta Diversity (phylogenetic): In the case that you would like to perform beta diversity using a phylogenetic metric (e.g. weighted_unifrac), you can use the following command: Multiple File (batch) Beta Diversity (phylogenetic): To perform beta diversity on multiple OTU tables (e.g., resulting files from multiple_rarefactions.py), specify an input directory (e.g. otu_tables/) as shown by the following command: beta_diversity_metrics – List of available metrics Non-phylogenetic beta diversity metrics. These are count based metrics which is based on the OTU table: , abund_jaccard: abundance weighted Jaccard distance , binary_dist_chisq: Binary chi-square distance , binary_dist_chord: Binary chord distance , binary_dist_euclidean: Binary euclidean distance , binary_dist_hamming: Binary Hamming distance (binary Manhattan distance) , binary_dist_jaccard: Binary Jaccard distance (binary Soergel distance) , binary_dist_lennon: Binary Lennon distance , binary_dist_ochiai: Binary Ochiai distance , binary_otu_gain: Binary distance similar to Unifrac G , binary_dist_pearson: Binary Pearson distance , binary_dist_sorensen_dice: Binary Sörensen-Dice distance (binary Bray-Curtis distance or binary Whittaker distance) , dist_bray_curtis: Bray-Curtis distance (normalized Manhattan distance) , dist_canberra: Canberra distance , dist_chisq: Chi-square distance , dist_chord: Chord distance , dist_euclidean: Euclidean distance , dist_gower: Gower distance , dist_hellinger: Hellinger distance , dist_kulczynski: Kulczynski distance , dist_manhattan: Manhattan distance , dist_morisita_horn: Morisita-Horn distance , dist_pearson: Pearson distance , dist_soergel: Soergel distance , dist_spearman_approx: Spearman rank distance , dist_specprof: Species profile distance Phylogenetic beta diversity metrics. These metrics are based on UniFrac, which takes into account the evolutionary relationship between sequences: , dist_unifrac_G: The G metric calculates the fraction branch length in the sample i + sample j tree that is exclusive to sample i and it is asymmetric. , dist_unifrac_G_full_tree: The full_tree version calculates the fraction of branch length in the full tree that is exclusive to sample i and it is asymmetric. , dist_unweighted_unifrac: This is the standard unweighted UniFrac, which is used to assess „who?s there? without taking in account the relative abundance of identical sequences. , dist_unweighted_unifrac_full_tree: Typically, when computing the dissimilarity between two samples, unifrac considers only the parts of the phylogenetic tree contained by otus in either sample. The full_tree modification considers the entire supplied tree. , dist_weighted_normalized_unifrac: Weighted UniFrac with normalized values and is used to include abundance information. The normalization adjusts for varying root-to-tip distances. , dist_weighted_unifrac: Weighted UniFrac. 
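Hedged sketches of the beta_diversity.py commands described above (otu_table.biom and rep_set.tre are placeholder inputs; beta_div/ and otu_tables/ are the names used in the text):

# single OTU table, non-phylogenetic metric
beta_diversity.py -i otu_table.biom -m euclidean -o beta_div/
# single OTU table, phylogenetic metric (requires a tree)
beta_diversity.py -i otu_table.biom -m weighted_unifrac -t rep_set.tre -o beta_div/
# batch mode over a directory of (e.g. rarefied) OTU tables
beta_diversity.py -i otu_tables/ -m unweighted_unifrac,weighted_unifrac -t rep_set.tre -o beta_div_batch/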
beta_diversity_through_plots.py – A workflow script for computing beta diversity distance matrices and generating PCoA plots Description: This script will perform beta diversity, principal coordinate anlalysis, and generate a preferences file along with 3D PCoA Plots. Usage: beta_diversity_through_plots.py [options] Input Arguments: [REQUIRED] -i, --otu_table_fp The input biom table [REQUIRED] -m, --mapping_fp Path to the mapping file [REQUIRED] -o, --output_dir The output directory [REQUIRED] [OPTIONAL] -t, --tree_fp Path to the tree file [default: None; REQUIRED for phylogenetic measures] -p, --parameter_fp Path to the parameter file, which specifies changes to the default behavior. See #qiime-parameters . [if omitted, default values will be used] --color_by_all_fields Plots will have coloring for all mapping fields [default: False; only include fields with greater than one value and fewer values than the number of samples] -c, --histogram_categories Mapping fields to use when plotting distance histograms [default: None] -f, --force Force overwrite of existing output directory (note: existing files in output_dir will not be removed) [default: None] -w, --print_only Print the commands but don?t call them – useful for debugging [default: False] -a, --parallel Run in parallel where available [default: False] -e, --seqs_per_sample Depth of coverage for even sampling [default: None] --suppress_2d_plots Do not generate 2D plots [default: False] --suppress_3d_plots Do not generate 3D plots [default: False] -O, --jobs_to_start Number of jobs to start. NOTE: you must also pass -a to run in parallel, this defines the number of jobs to be started if and only if -a is passed [default: 2] Output: This script results in a distance matrix (from beta_diversity.py), a principal coordinates file (from principal_coordinates.py), a preferences file (from make_prefs_file.py) and folders containing the resulting PCoA plots (accessible through html files). Example: Given an OTU table, a phylogenetic tree, an even sampling depth, and a mapping file, perform the following steps: 1. Randomly subsample otu_table.biom to even number of sequences per sample (100 in this case); 2. Compute a weighted and unweighted unifrac distance matrcies (can add additional metrics by passing a parameters file via -p); 3. Peform a principal coordinates analysis on the result of Step 2; 4. Generate a 2D and 3D plots for all mapping fields. blast_wrapper.py – Blast Interface Description: This script is a functionally-limited interface to the qiime.util.qiime_blast_seqs function, primarily useful for testing purposes. Once that function has been integrated into qiime as the primary blast interface it will move to PyCogent. An expanded version of this command line interface may replace the script functionality of cogent.app.blast at that point. Usage: blast_wrapper.py [options] Input Arguments: [REQUIRED] -i, --input_fasta_fp Path to the input fasta file -r, --refseqs_fp Path to blast database as a fasta file [OPTIONAL] -n, --num_seqs_per_blast_run Number of sequences passed to each blast call - useful for very large sequence collections [default: 1000] Output: This is a utility program, which returns BLAST results. Example: Blast all sequences in inseqs.fasta (-i) against a BLAST db constructed from refseqs.fasta (-r). categorized_dist_scatterplot.py – makes a figure representing average distances between samples, broken down by categories. 
I call it a ‘categorized distance scatterplot’ Description: makes a figure representing average distances between samples, broken down by categories. I call it a „categorized distance scatterplot?. See script usage for more details. The mapping file specifies the relavent data - if you have e.g. „N/A? values or samples you don?t want included, first use filter_samples_from_otu_table.py to remove unwanted samples from the mapping file, and thus the analysis. Note that the resulting plot will include only samples in both the mapping file AND the distance matrix. Usage: categorized_dist_scatterplot.py [options] Input Arguments: [REQUIRED] -m, --map Mapping file -d, --distance_matrix Distance matrix -p, --primary_state Samples matching this state will be plotted. E.g.: AgeCategory:Child . See qiime?s filter_samples_from_otu_table.py for more syntax options -a, --axis_category This will form the horizontal axis of the figure, e.g.: AgeYears . Must be numbers -o, --output_path Output figure, filename extention determines format. E.g.: “fig1.png” or similar. A “fig1.txt” or similar will also be created with the data underlying the figure [OPTIONAL] -c, --colorby Samples will first be separated by this column of the mapping file. They will be colored by this column of the mapping file, and all comparisons will be done only among samples with the same value in this column. e.g.: Country. You may omit -c, and the samples will not be separated -s, --secondary_state All samples matching the primary state will be compared to samples matcthing this secondary state. E.g.: AgeCategory:Adult Output: a figure and the text dat for that figure Canonical Example: Split samples by country. Within each country compare each child to all adults. Plot the average distance from that child to all adults, vs. the age of that child Example 2: Same as above, but compares Child with all other categories (e.g.: NA, Infant, etc.) check_id_map.py – Checks user’s metadata mapping file for required data, valid format Description: Specifically, we check that: , The BarcodeSequence, LinkerPrimerSequences, and ReversePrimer fields have valid IUPAC DNA characters, and BarcodeSequence characters are non-degenerate (error) , The SampleID, BarcodeSequence, LinkerPrimerSequence, and Description headers are present. (error) , There are not duplicate header fields (error) , There are not duplicate barcodes (error) , Barcodes are of the same length. Suppressed when variable_len_barcode flag is passed (warning) , The headers do not contain invalid characters (alphanumeric and underscore only) (warning) , The data fields do not contain invalid characters (alphanumeric, underscore, space, and +-%./:,; characters) (warning) , SampleID fields are MIENS compliant (only alphanumeric and . characters). (warning) , There are no duplicates when the primer and variable length barcodes are appended (error) , There are no duplicates when barcodes and added demultiplex fields (-j option) are combined (error) , Data fields are not found beyond the Description column (warning) Details about the metadata mapping file format can be found here: #metadata-mapping-files Errors and warnings are saved to a log file. Errors can be caused by problems with the headers, invalid characters in barcodes or primers, or by duplications in SampleIDs or barcodes. Warnings can arise from invalid characters and variable length barcodes that are not specified with the –variable_len_barcode. 
Warnings will contain a reference to the cell (row,column) that the warning arose from. In addition to the log file, a “corrected_mapping” file will be created. Any invalid characters will be replaced with „.? characters in the SampleID fields (to enforce MIENS compliance) and text in other data fields will be replaced with the character specified by the -c parameter, which is an underscore “_” by default. A html file will be created as well, which will show locations of warnings and errors, highlighted in yellow and red respectively. If no errors or warnings were present the file will display a message saying such. Header errors can mask other errors, so these should be corrected first. If pooled primers are used, separate with a comma. For instance, a pooled set of three 27f primers (used to increase taxonomic coverage) could be specified in the LinkerPrimerSequence fields as such: AGGGTTCGATTCTGGCTCAG,AGAGTTTGATCCTGGCTTAG,AGAATTTGA TCTTGGTTCAG Usage: check_id_map.py [options] Input Arguments: [REQUIRED] -m, --mapping_fp Metadata mapping filepath [OPTIONAL] -o, --output_dir Required output directory for log file, corrected mapping file, and html file. [default: ./] -v, --verbose Enable printing information to standard out [default: True] -c, --char_replace Changes the default character used to replace invalid characters found in the mapping file. Must be a valid character (alphanumeric, period, or underscore).[default: _] -b, --not_barcoded Use -b if barcodes are not present. BarcodeSequence header still required. [default: False] -B, --variable_len_barcodes Use -B if variable length barcodes are present to suppress warnings about barcodes of unequal length. [default: False] -p, --disable_primer_check Use -p to disable checks for primers. LinkerPrimerSequence header still required. [default: False] -j, --added_demultiplex_field Use -j to add a field to use in the mapping file as additional demultiplexing (can be used with or without barcodes). All combinations of barcodes/primers and the these fields must be unique. The fields must contain values that can be parsed from the fasta labels such as “plate=R_2008_12_09”. In this case, “plate” would be the column header and “R_2008_12_09” would be the field data (minus quotes) in the mapping file. To use the run prefix from the fasta label, such as “>FLP3FBN01ELBSX”, where “FLP3FBN01” is generated from the run ID, use “-j run_prefix” and set the run prefix to be used as the data under the column header “run_prefix”. [default: None] -s, --suppress_html Use -s to disable html file generation, can be useful for extremely large mapping files. [default: False] Output: A log file, html file, and corrected_mapping.txt file will be written to the current output directory. Example: Check the Fasting_Map.txt mapping file for problems, supplying the required mapping file, and output the results in the check_id_map_output directory clean_raxml_parsimony_tree.py – Remove duplicate tips from Raxml Tree Description: This script allows the user to remove specific duplicate tips from a Raxml tree. Usage: clean_raxml_parsimony_tree.py [options] Input Arguments: [REQUIRED] -i, --input_tree The input raxml parsimony tree -t, --tips_to_keep The input tips to score and retain (comma-separated list) -o, --output_fp The output filepath [OPTIONAL] -s, --scoring_method The scoring method either depth or numtips [default: depth] Output: Example (depth): For this case the user can pass in input Raxml tree, duplicate tips, and define an output filepath. 
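A hedged sketch of the check_id_map.py example described above, using the Fasting_Map.txt mapping file and check_id_map_output directory named in the text:

# validate the mapping file; the log, corrected mapping file and html report go to check_id_map_output/
check_id_map.py -m Fasting_Map.txt -o check_id_map_output/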
When using the depth option, only the deepest replicate is kept.

Example (numtips): For this case the user can pass in an input Raxml tree, duplicate tips, and define an output filepath. When using the numtips option, the replicate with the fewest siblings is kept.

cluster_quality.py – Compute the quality of a cluster

Description: The input is a distance matrix (i.e. the resulting file from beta_diversity.py).

Usage: cluster_quality.py [options]

Input Arguments:
[REQUIRED]
-i, --input_path Input distance matrix file
-m, --map Mapping file
-c, --category Column of mapping file delimiting clusters
[OPTIONAL]
-o, --output_path Output path, prints to stdout if omitted
-s, --short Print only the ratio of mean dissimilarities between/within clusters instead of more detailed output
--metric Choice of quality metric to apply. Currently only one option exists, the ratio of mean(distances between samples from different clusters) to mean(distances between samples from the same cluster). Default: ratio

Output: The output is either a single number (with -s), or a more detailed report of the similarity between and within clusters.

Cluster quality based on the Treatment category: to compute the quality of clusters and print to stdout, use the following idiom:

Cluster quality based on the DOB category: to compute the quality of clusters and print to stdout, use the following idiom:

collate_alpha.py – Collate alpha diversity results

Description: When performing batch analyses on the OTU table (e.g. rarefaction followed by alpha diversity), the result of alpha_diversity.py comprises many files, which need to be concatenated into a single file for generating rarefaction curves. This script joins those files. Input files: each file represents one (rarefied) otu table, each row in a file represents one sample, and each column in a file represents one diversity metric. Output files: each file represents one diversity metric, each row in a file represents one (rarefied) otu table, and each column in a file represents one sample. The input directory should contain only otu tables. The output directory should be empty or nonexistent, and the example file is optional. If you have a set of rarefied OTU tables, make sure the example file contains every sample present in the otu tables. You should typically choose the file with the fewest sequences per sample, to avoid files with sparse samples omitted.

Usage: collate_alpha.py [options]

Input Arguments:
[REQUIRED]
-i, --input_path Input path (a directory)
-o, --output_path Output path (a directory); will be created if needed
[OPTIONAL]
-e, --example_path Example alpha_diversity analysis file, containing all samples and all metrics to be included in the collated result [Default: chosen automatically (see usage string)]

Output: This script takes the resulting files from batch alpha diversity and collates them into one file per metric used. It transforms a series of files named, e.g., alpha_rarefaction_20_0.txt, alpha_rarefaction_20_1.txt, etc. into a (usually much smaller) set of files named, e.g., chao1.txt, PD_whole_tree.txt, etc., where the columns correspond to samples and the rows to the input rarefaction files, as shown by the following:

                             sequences per sample   iteration   PC.354   PC.355
alpha_rarefaction_20_0.txt   20                     0           0.925    0.915
alpha_rarefaction_20_1.txt   20                     1           0.9      0.89
alpha_rarefaction_20_2.txt   20                     2           0.88     0.915
alpha_rarefaction_20_3.txt   20                     3           0.91     0.93
...                          ...                    ...         ...      ...

Example: The user inputs the results from batch alpha diversity (e.g.
alpha_div/) and the location where the results should be written (e.g. collated_alpha/), as shown by the following command: compare_3d_plots.py – Plot several PCoA files on the same 3D plot Description: This script generates a 3D plot comparing two or more sets of principal coordinates using as input two or more principal coordinates files. Edges are drawn in the plot connecting samples with the same ID across different principal coordinates files. The user can also include a file listing the edges to be drawn in the plot, in which case the user may submit any number of principal coordinates files (including one). If the user includes the edges file, the sample IDs need not match between principal coordinates files. The principal_coordinates coordinates files are obtained by applying “principal_coordinates.py” to a file containing beta diversity measures. The beta diversity files are optained by applying “beta_diversity.py” to an OTU table. One may apply “transform_coordinate_matrices.py” to the principal_coordinates coordinates files before using this script to compare them. Usage: compare_3d_plots.py [options] Input Arguments: [REQUIRED] -i, --coord_fnames This is comma-separated list of the paths to the principal coordinates files (i.e., resulting file from principal_coordinates.py), e.g „pcoa1.txt,pcoa2.txt? -m, --map_fname This is the user-generated mapping file [default=None] [OPTIONAL] -b, --colorby This is a list of the categories to color by in the plots from the user-generated mapping file. The categories must match the name of a column header in the mapping file exactly and multiple categories can be list by comma separating them without spaces. The user can also combine columns in the mapping file by separating the categories by “&&” without spaces [default=None] -a, --custom_axes This is a category or list of categories from the user-generated mapping file to use as a custom axis in the plot. For instance, if there is a pH category and one would like to see the samples plotted on that axis instead of PC1, PC2, etc., one can use this option. It is also useful for plotting time-series data [default: None] -p, --prefs_path This is the user-generated preferences file. NOTE: This is a file with a dictionary containing preferences for the analysis. See make_prefs_file.py. [default: None] -k, --background_color This is the background color to use in the plots (Options are „black? or „white?. [default: None] -e, --edges_file A file where each line contains two sample IDs separated by a whitespace character; for each pair of sample IDs, an edge will be drawn from the first sample to the second sample. [default: None] --serial Connect the 1st set of points to the 2nd, the 2nd to the 3rd, etc. Default behavior is to connect each set of points back to the 1st set. This flag is ignored if the user specifies an edges file. -o, --output_dir Path to the output directory Output: This script results in a folder containing an html file which displays the 3D Plots generated. 
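Hedged sketches of the collate_alpha.py example above and of a compare_3d_plots.py invocation corresponding to the examples listed below (alpha_div/, collated_alpha/, pcoa1.txt, pcoa2.txt, Treatment and DOB are names used in the surrounding text; Fasting_Map.txt and 3d_comparison_out/ are placeholders):

# collate the per-table alpha diversity files into one file per metric
collate_alpha.py -i alpha_div/ -o collated_alpha/
# overlay two principal coordinates files in one 3D plot, colored by Treatment and DOB
compare_3d_plots.py -i pcoa1.txt,pcoa2.txt -m Fasting_Map.txt -b Treatment,DOB -o 3d_comparison_out/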
Example 1: Compare two pca/pcoa files in the same 3d plot where each sample ID is assigned its own color: Example 2: Compare two pca/pcoa files in the same 3d plot with two coloring schemes (Treatment and DOB): Example 3: Compare two pca/pcoa files in the same 3d plot for a combination of label headers from a mapping file: Example 4: Pass in a list of desired edges and only one pca/pcoa file: Example 5: Pass in a list of desired edges and only one pca/pcoa file: compare_alpha_diversity.py – This script compares alpha diversities based on a two-sample t-test using either parametric or non-parametric (Monte Carlo) methods. Description: This script compares the alpha diversity of samples found in a collated alpha diversity file. The comparison is done not between samples, but between groups of samples. The groupings are created via the input category passed via -c/–category. Any samples which have the same value under the catgory will be grouped. For example, if your mapping file had a category called „Treatment? that separated your samples into three groups (Treatment=?Control?, Treatment=?Drug?, Treatment=„2xDose?), passing „Treatment? to this script would cause it to compare (Control,Drug), (Control,2xDose), (2xDose, Drug) alpha diversity values. By default the two-sample t-test will be nonparametric (i.e. using Monte Carlo permutations to calculate the p-value), though the user has the option to make the test a parametric t-test. The script creates an output file in tab-separated format where each row is a different group comparison. The columns in each row denote which two groups of samples are being compared, as well as the mean and standard deviation of each group?s alpha diversity. Finally, the t-statistic and p-value are reported for the comparison. This file can be most easily viewed in a spreadsheet program such as Excel. Note: Any iterations of a rarefaction at a given depth will be averaged. For instance, if your collated_alpha file had 10 iterations of the rarefaction at depth 480, the scores for the alpha diversity metrics of those 10 iterations would be averaged (within sample). The iterations are not controlled by this script; when multiple_rarefactions.py is called, the -n option specifies the number of iterations that have occurred. The multiple comparison correction takes into account the number of between group comparisons. If you do not know the rarefaction depth available or you want to use the deepest rarefaction level available then do not pass -d/–depth and it will default to using the deepest available. If t-statistics and/or p-values are None for any of your comparisons, there are three possible reasons. The first is that there were undefined values in your collated alpha diversity input file. This occurs if there were too few sequences in one or more of the samples in the groups involved in those comparisons to compute alpha diversity at that depth. You can either rerun compare_alpha_diversity.py passing a lower value for –depth, or you can re-run alpha diversity after filtering samples with too few sequences. The second is that you had some comparison where each treatment was represented by only a single sample. It is not possible to perform a two-sample t-test on two samples each of length 1, so None will be reported instead. The third possibility occurs when using the nonparamteric t-test with small datasets where the Monte Carlo permutations don?t return a p-value because the distribution of the data has no variance. 
The multiple comparisons correction will not penalize you for comparisons that return as None regardless of origin. If the means/standard deviations are None for any treatment group, the likely cause is that there is an „na? value in the collated_alpha file that was passed. Usage: compare_alpha_diversity.py [options] Input Arguments: [REQUIRED] -i, --alpha_diversity_filepath Path to collated alpha diversity file (as generated by collate_alpha.py) [REQUIRED] -m, --mapping_filepath Path to the mapping file [REQUIRED] -c, --category Category for comparison [REQUIRED] -o, --output_fp Location of output file to be created [REQUIRED] [OPTIONAL] -t, --test_type The type of test to perform when calculating the p-values. Valid choices: parametric, nonparametric. If test_type is nonparametric, Monte Carlo permutations will be used to determine the p-value. If test_type is parametric, the num_permutations option will be ignored and the t-distribution will be used instead [default: nonparametric] -n, --num_permutations The number of permutations to perform when calculating the p-value. Must be greater than 10. Only applies if test_type is nonparametric [default: 999] -p, --correction_method Method to use for correcting multiple comparisons. Available methods are bonferroni, fdr, or none. [default: bonferroni] -d, --depth Depth of rarefaction file to use [default: greatest depth] Output: The script generates an output file that is a TSV table. Each row corresponds to a comparison between two groups of treatment values, and includes the means and standard deviations of the two groups? alpha diversities, along with the results of the two-sample t-test. Comparing alpha diversities: The following command takes the following input: a mapping file (which associaties each sample with a number of characteristics), alpha diversity metric (the results of collate_alpha for an alpha diverity metric, like PD_whole_tree), depth (the rarefaction depth to use for comparison), category (the category in the mapping file to determine which samples to compare to each other), and output filepath (a path to the output file to be created). A nonparametric two sample t-test is run to compare the alpha diversities using the default number of Monte Carlo permutations (999). Parametric t-test: The following command runs a parametric two sample t-test using the t-distribution instead of Monte Carlo permutations at rarefaction depth 100. Parametric t-test: The following command runs a parametric two sample t-test using the t-distribution instead of Monte Carlo permutations at the greatest depth available. compare_categories.py – Analyzes statistical significance of sample groupings using distance matrices Description: This script allows for the analysis of the strength and statistical significance of sample groupings using a distance matrix as the primary input. Several statistical methods are available: adonis, ANOSIM, BEST, Moran?s I, MRPP, PERMANOVA, PERMDISP, and db-RDA. Note: R?s vegan and ape packages are used to compute many of these methods, and for the ones that are not, their implementations are based on the implementations found in those packages. It is recommended to read through the detailed descriptions provided by the authors (they are not reproduced here) and to refer to the primary literature for complete details, including the methods? assumptions. To view the documentation of a method in R, prepend a question mark before the method name. 
compare_categories.py – Analyzes statistical significance of sample groupings using distance matrices Description: This script allows for the analysis of the strength and statistical significance of sample groupings using a distance matrix as the primary input. Several statistical methods are available: adonis, ANOSIM, BEST, Moran's I, MRPP, PERMANOVA, PERMDISP, and db-RDA. Note: R's vegan and ape packages are used to compute many of these methods, and for the ones that are not, their implementations are based on the implementations found in those packages. It is recommended to read through the detailed descriptions provided by the authors (they are not reproduced here) and to refer to the primary literature for complete details, including the methods' assumptions. To view the documentation of a method in R, prepend a question mark before the method name. For example: ?vegan::adonis The following are brief descriptions of the available methods: adonis - Partitions a distance matrix among sources of variation in order to describe the strength and significance that a categorical or continuous variable has in determining variation of distances. This is a nonparametric method and is nearly equivalent to db-RDA (see below) except when distance matrices constructed with semi-metric or non-metric dissimilarities are provided, which may result in negative eigenvalues. adonis is very similar to PERMANOVA, though it is more robust in that it can accept either categorical or continuous variables in the metadata mapping file, while PERMANOVA can only accept categorical variables. See vegan::adonis for more details. ANOSIM - Tests whether two or more groups of samples are significantly different based on a categorical variable found in the metadata mapping file. You can specify a category in the metadata mapping file to separate samples into groups and then test whether there are significant differences between those groups. For example, you might test whether 'Control' samples are significantly different from 'Fast' samples. Since ANOSIM is nonparametric, significance is determined through permutations. See vegan::anosim for more details. BEST - This method looks at the numerical environmental variables relating samples in a distance matrix. For instance, given a UniFrac distance matrix and pH and latitude (or any other number of variables) in soil samples, BEST will rank them in order of which best explain patterns in the communities. This method will only accept categories that are numerical (continuous or discrete). This is currently the only method in this script that accepts more than one category (via -c). See vegan::bioenv for more details. Moran's I - This method uses the numerical (e.g. geographical) data supplied to identify what type of spatial configuration occurs in the samples. For example, are they dispersed, clustered, or of no distinctly noticeable configuration when compared to each other? This method will only accept a category that is numerical. See ape::Moran.I for more details. MRPP - This method tests whether two or more groups of samples are significantly different based on a categorical variable found in the metadata mapping file. You can specify a category in the metadata mapping file to separate samples into groups and then test whether there are significant differences between those groups. For example, you might test whether 'Control' samples are significantly different from 'Fast' samples. Since MRPP is nonparametric, significance is determined through permutations. See vegan::mrpp for more details. PERMANOVA - This method is very similar to adonis except that it only accepts a categorical variable in the metadata mapping file. It uses an ANOVA experimental design and returns a pseudo-F value and a p-value. Since PERMANOVA is nonparametric, significance is determined through permutations. PERMDISP - This method analyzes the multivariate homogeneity of group dispersions (variances). In essence, it determines whether the variances of groups of samples are significantly different. The results of both parametric and nonparametric significance tests are provided in the output. This method is generally used as a companion to PERMANOVA. See vegan::betadisper for more details.
db-RDA - This method is very similar to adonis and will only differ if certain non-Euclidean semi- or non-metrics are used to generate the input distance matrix, and negative eigenvalues are encountered. The only difference then will be in the p-values, not the R^2 values. As part of the output, an ordination plot is also generated that shows grouping/clustering of samples based on a category in the metadata mapping file. This category is used to explain the variability between samples. Thus, the ordination output of db-RDA is similar to PCoA except that it is constrained, while PCoA is unconstrained (i.e. with db-RDA, you must specify which category should be used to explain the variability in your data). See vegan::capscale for more details. For more information and examples pertaining to this script, please refer to the accompanying tutorial, which can be found at Usage: compare_categories.py [options] Input Arguments: [REQUIRED] --method The statistical method to use. Valid options: adonis, anosim, best, morans_i, mrpp, permanova, permdisp, dbrda -i, --input_dm The input distance matrix. WARNING: Only symmetric, hollow distance matrices may be used as input. Asymmetric distance matrices, such as those obtained by the UniFrac Gain metric (i.e. beta_diversity.py -m unifrac_g), should not be used as input -m, --mapping_file The metadata mapping file -c, --categories A comma-delimited list of categories from the mapping file. Note: all methods except for BEST take just a single category. If multiple categories are provided, only the first will be used -o, --output_dir Path to the output directory [OPTIONAL] -n, --num_permutations The number of permutations to use when calculating statistical significance. Only applies to adonis, ANOSIM, MRPP, PERMANOVA, PERMDISP, and db-RDA. Must be greater than or equal to zero [default: 999] Output: At least one file will be created in the output directory specified by -o. For most methods, a single output file containing the results of the test (e.g. the effect size statistic and p-value) will be created. The format of the output files will vary between methods as some are generated by native QIIME code, while others are generated by R's vegan or ape packages. Please refer to the script description for details on how to access additional information for these methods, including what information is included in the output files. db-RDA is the only exception in that two output files are created: a results text file and a PDF of the ordination plot. adonis example: Runs the adonis statistical method on a distance matrix and mapping file using the Treatment category and 999 permutations, writing the output to the 'adonis_out' directory. ANOSIM example: Runs the ANOSIM statistical method on a distance matrix and mapping file using the Treatment category and 99 permutations, writing the output to the 'anosim_out' directory.
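For instance, the adonis example above might look like the following sketch (the distance matrix and mapping file names are placeholders, not from the original documentation):

compare_categories.py --method adonis -i unweighted_unifrac_dm.txt -m map.txt -c Treatment -n 999 -o adonis_out  # file names are placeholders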
compare_distance_matrices.py – Computes Mantel correlation tests between sets of distance matrices Description: This script compares two or more distance/dissimilarity matrices for correlation by providing the Mantel, partial Mantel, and Mantel correlogram matrix correlation tests. The Mantel test will test the correlation between two matrices. The data often represents the "distance" between objects or samples. The partial Mantel test is a first-order correlation analysis that utilizes three distance (dissimilarity) matrices. This test builds on the traditional Mantel test, which tests the hypothesis that the distances between the objects within a given matrix are linearly independent of the distances between those same objects in a separate matrix, by adding a third "control" matrix. Mantel correlogram produces a plot of distance classes versus Mantel statistics. Briefly, an ecological distance matrix (e.g. UniFrac distance matrix) and a second distance matrix (e.g. spatial distances, pH distances, etc.) are provided. The second distance matrix has its distances split into a number of distance classes (the number of classes is determined by Sturge's rule). A Mantel test is run over these distance classes versus the ecological distance matrix. The Mantel statistics obtained from each of these tests are then plotted in a correlogram. A filled-in point on the plot indicates that the Mantel statistic was statistically significant (you may specify which alpha value to use). For more information and examples pertaining to this script, please refer to the accompanying tutorial, which can be found at Usage: compare_distance_matrices.py [options] Input Arguments: [REQUIRED] --method Matrix correlation method to use. Valid options: [mantel, partial_mantel, mantel_corr] -i, --input_dms The input distance matrices, comma-separated. WARNING: Only symmetric, hollow distance matrices may be used as input. Asymmetric distance matrices, such as those obtained by the UniFrac Gain metric (i.e. beta_diversity.py -m unifrac_g), should not be used as input -o, --output_dir Path to the output directory [OPTIONAL] -n, --num_permutations The number of permutations to perform when calculating the p-value [default: 100] -s, --sample_id_map_fp Map of original sample ids to new sample ids [default: None] -t, --tail_type The type of tail test to perform when calculating the p-value. Valid options: [two sided, less, greater] Two sided is a two-tailed test, while less tests for r statistics less than the observed r statistic, and greater tests for r statistics greater than the observed r statistic. Only applies when method is mantel [default: two sided] -a, --alpha The value of alpha to use when denoting significance in the correlogram plot. Only applies when method is mantel_corr -g, --image_type The type of image to produce. Valid options: [png, svg, pdf]. Only applies when method is mantel_corr [default: pdf] --variable_size_distance_classes If this option is supplied, each distance class will have an equal number of distances (i.e. pairwise comparisons), which may result in variable sizes of distance classes (i.e. each distance class may span a different range of distances). If this option is not supplied, each distance class will have the same width, but may contain varying numbers of pairwise distances in each class. This option can help maintain statistical power if there are large differences in the number of distances in each class. See Darcy et al. 2011 (PLoS ONE) for an example of this type of correlogram. Only applies when method is mantel_corr [default: False] -c, --control_dm The control matrix. Only applies (and is required) when method is partial_mantel. [default: None] Output: Mantel: One file is created containing the Mantel 'r' statistic and p-value. Partial Mantel: One file is created in the output directory, which contains the partial Mantel statistic and p-value.
Mantel Correlogram: Two files are created in the output directory: a text file containing information about the distance classes, their associated Mantel statistics and p-values, etc. and an image of the correlogram plot. Partial Mantel: Performs a partial Mantel test on two distance matrices, using a third matrix as a control. Runs 99 permutations to calculate the p-value. Mantel: Performs a Mantel test on all pairs of four distance matrices, including 999 permutations for each test. Mantel Correlogram: This example computes a Mantel correlogram on two distance matrices using 999 permutations in each Mantel test. Output is written to the mantel_correlogram_out directory. compare_taxa_summaries.py – Compares taxa summary files Description: This script compares two taxa summary files by computing the correlation coefficient between pairs of samples. This is useful, for example, if you want to compare the taxonomic composition of mock communities that were assigned using different taxonomy assigners in order to see if they are correlated or not. Another example use-case is to compare the taxonomic composition of several mock community replicate samples to a single expected, or known, sample community. This script is also useful for sorting and filling taxa summary files so that each sample has the same taxa listed in the same order (with missing taxa reporting an abundance of zero). The sorted and filled taxa summary files can then be passed to a script, such as plot_taxa_summary.py, to visually compare the differences using the same taxa coloring scheme. For more information and examples pertaining to this script, please refer to the accompanying tutorial, which can be found at Usage: compare_taxa_summaries.py [options] Input Arguments: [REQUIRED] -i, --taxa_summary_fps The two input taxa summary filepaths, comma-separated. These will usually be the files that are output by summarize_taxa.py. These taxa summary files do not need to have the same taxa in the same order, as the script will make them compatible before comparing them -o, --output_dir Path to the output directory -m, --comparison_mode The type of comparison to perform. Valid choices: paired or expected. “paired” will compare each sample in the taxa summary files that match based on sample ID, or that match given a sample ID map (see the –sample_id_map_fp option for more information). “expected” will compare each sample in the first taxa summary file to an expected sample (contained in the second taxa summary file). If “expected”, the second taxa summary file must contain only a single sample that all other samples will be compared to (unless the –expected_sample_id option is provided) [OPTIONAL] -c, --correlation_type The type of correlation coefficient to compute. Valid choices: pearson or spearman [default: pearson] -t, --tail_type The type of tail test to compute when calculating the p-values. “high” specifies a one-tailed test for values greater than the observed correlation coefficient (positive association), while “low” specifies a one-tailed test for values less than the observed correlation coefficient (negative association). “two-sided” specifies a two-tailed test for values greater in magnitude than the observed correlation coefficient. Valid choices: low or high or two-sided [default: two-sided] -n, --num_permutations The number of permutations to perform when calculating the nonparametric p-value. Must be an integer greater than or equal to zero. 
If zero, the nonparametric test of significance will not be performed and the nonparametric p-value will be reported as "N/A" [default: 999] -l, --confidence_level The confidence level of the correlation coefficient confidence interval. Must be a value between 0 and 1 (exclusive). For example, a 95% confidence interval would be 0.95 [default: 0.95] -s, --sample_id_map_fp Map of original sample IDs to new sample IDs. Use this to match up sample IDs that should be compared between the two taxa summary files. Each line should contain an original sample ID, a tab, and the new sample ID. All original sample IDs from the two input taxa summary files must be mapped. This option only applies if the comparison mode is "paired". If not provided, only sample IDs that exist in both taxa summary files will be compared [default: None] -e, --expected_sample_id The sample ID in the second "expected" taxa summary file to compare all samples to. This option only applies if the comparison mode is "expected". If not provided, the second taxa summary file must have only one sample [default: None] --perform_detailed_comparisons Perform a comparison for each sample pair in addition to the single overall comparison. The results will include the Bonferroni-corrected p-values in addition to the original p-values [default: False] Output: The script will always output at least three files to the specified output directory. Two files will be the sorted and filled versions of the input taxa summary files, which can then be used in plot_taxa_summary.py to visualize the differences in taxonomic composition. These files will be named based on the basename of the input files. If the input files' basenames are the same, the output files will have '0' and '1' appended to their names to keep the filenames unique. The first input taxa summary file will have '0' in its filename and the second input taxa summary file will have '1' in its filename. The third output file will contain the results of the overall comparison of the input taxa summary files using the specified sample pairings. The correlation coefficient, parametric p-value, nonparametric p-value, and a confidence interval for the correlation coefficient will be included. If --perform_detailed_comparisons is specified, the fourth output file is a tab-separated file containing the correlation coefficients that were computed between each of the paired samples. Each line will contain the sample IDs of the samples that were compared, followed by the correlation coefficient that was computed, followed by the parametric and nonparametric p-values (uncorrected and Bonferroni-corrected) and a confidence interval for the correlation coefficient. The output files will contain comments at the top explaining the types of tests that were performed. Paired sample comparison: Compare all samples that have matching sample IDs between the two input taxa summary files using the pearson correlation coefficient. The first input taxa summary file is from the overview tutorial, using the RDP classifier with a confidence level of 0.60 and the gg_otus_4feb2011 97% representative set. The second input taxa summary file was generated the same way, except for using a confidence level of 0.80. Paired sample comparison with sample ID map: Compare samples based on the mappings in the sample ID map using the spearman correlation coefficient. The second input taxa summary file is simply the original ts_rdp_0.60.txt file with all sample IDs containing 'PC.' renamed to 'S.'.
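A hedged sketch of that sample-ID-map comparison, using the documented options; the renamed summary file, sample ID map, and output directory names are placeholders:

compare_taxa_summaries.py -i ts_rdp_0.60.txt,ts_rdp_0.60_renamed.txt -m paired -c spearman -s sample_id_map.txt -o taxa_comparison_out  # second input, map, and output names are placeholders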
Detailed paired sample comparison: Compare all samples that have matching sample IDs between the two input taxa summary files using the pearson correlation coefficient. Additionally, compute the correlation coefficient between each pair of samples individually. One-tailed test: Compare all samples that have matching sample IDs between the two input taxa summary files using the pearson correlation coefficient. Perform a one-tailed (negative association) test of significance for both parametric and nonparametric tests. Additionally, compute a 90% confidence interval for the correlation coefficient. Note that the confidence interval will still be two-sided.
compute_core_microbiome.py – Identify the core microbiome. Description: Usage: compute_core_microbiome.py [options] Input Arguments: [REQUIRED] -i, --input_fp The input otu table in BIOM format -o, --output_dir Directory to store output data [OPTIONAL] --max_fraction_for_core The maximum fraction of samples that an OTU must be observed in to be considered part of the core, as a number in the range [0,1] [default: 1.0] --min_fraction_for_core The minimum fraction of samples that an OTU must be observed in to be considered part of the core, as a number in the range [0,1] [default: 0.5] --num_fraction_for_core_steps The number of evenly sized steps to take between min_fraction_for_core and max_fraction_for_core [default: 11] --otu_md The otu metadata category to write to the output file [default: taxonomy] --mapping_fp Mapping file path (for use with –valid_states) [default: None] --valid_states Description of sample ids to retain (for use with –mapping_fp) [default: None] Output: Identify the core OTUs in otu_table.biom, defined as the OTUs that are present in at least 50% of the samples. Write the list of core OTUs to a text file, and a new BIOM file containing only the core OTUs. Identify the core OTUs in otu_table.biom, defined as the OTUs that are present in all of the samples in the 'Fast' treatment (as specified in the mapping file). Write the list of core OTUs to a text file.
conditional_uncovered_probability.py – Calculate the conditional uncovered probability on each sample in an otu table. Description: This script calculates the conditional uncovered probability for each sample in an OTU table. It uses the methods introduced in Lladser, Gouet, and Reeder, "Extrapolation of Urn Models via Poissonization: Accurate Measurements of the Microbial Unknown" PLoS 2011. Specifically, it computes a point estimate and a confidence interval using two different methods. Thus it can happen that the PE is actually outside of the CI. The CI method requires precomputed constants that depend on the lookahead, the upper-to-lower bound ratio and the desired confidence. We only provide these constants for some frequently used combinations. These are (alpha: 0.95, r=1..25) for the L and U interval types, and (alpha: 0.9, 0.95, 0.99; f=10; r=3..25,30,40,50). Also, there are a few hand-picked special cases: f=2, r=50, alpha=0.95; f=2, r=33, alpha=0.95; f=1.5, r=100, alpha=0.95; f=1.5, r=94, alpha=0.95; f=2.5, r=19, alpha=0.95. Usage: conditional_uncovered_probability.py [options] Input Arguments: [OPTIONAL] -i, --input_path Input OTU table filepath. [default: None] -o, --output_path Output filepath to store the predictions. [default: None] -r, --look_ahead Number of unobserved, new colors necessary for prediction. [default: 25] -c, --ci_type Type of confidence interval.
Choice of ULCL, ULCU, U, L [default: ULCL] -a, --alpha Desired confidence level for CI prediction. [default: 0.95] -f, --f_ratio Upper to lower bound ratio for CI prediction. [default: 10.0] -m, --metrics CUP metric(s) to use. A comma-separated list should be provided when multiple metrics are specified. [default: lladser_pe,lladser_ci] -s, --show_metrics Show the available CUP metrics and exit. Output: Each resulting file is a tab-delimited text file, where the columns correspond to estimates of the conditional uncovered probability and the rows correspond to samples. The output file is compatible with the alpha_diversity output files and thus can be tied into the rarefaction workflow. Example Output:
        PE      Lower Bound     Upper Bound
PC.354  0.111   0.0245          0.245
PC.124  0.001   0.000564        0.00564
Default case: To calculate the conditional uncovered probability with the default values, you can use the following command: Change lookahead: To change the accuracy of the prediction, change the lookahead value. Larger values of r lead to more precise predictions, but might be infeasible for small samples. For deeply sequenced samples, try increasing r to 50: Change the interval type: To change the confidence interval type to a lower bound prediction, while the upper bound is set to 1, use:
consensus_tree.py – This script outputs a majority consensus tree given a collection of input trees. Description: Usage: consensus_tree.py [options] Input Arguments: [REQUIRED] -i, --input_dir Input folder containing trees -o, --output_fname The output consensus tree filepath [OPTIONAL] -s, --strict Use only nodes occurring >50% of the time [default: False] Output: The output is a newick formatted tree compatible with most standard tree viewing programs. Basic usage: given a directory of trees 'jackknifed_trees', compute the majority consensus and save as a newick formatted text file:
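A sketch of that basic usage, assuming 'jackknifed_trees' is the input directory and a placeholder output filename:

consensus_tree.py -i jackknifed_trees -o consensus_tree.tre  # output filename is a placeholder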
convert_fastaqual_fastq.py – From a FASTA file and a matching QUAL file, generates a FASTQ file. From a FASTQ file, generates a FASTA file and matching QUAL file. Description: From a FASTA file and a matching QUAL file, generates a FASTQ file. A minimal FASTQ file omits the redundant sequence label on the quality scores; the quality scores for a sequence are assumed to follow immediately after the sequence with which they are associated. The output FASTQ file will be generated in the specified output directory with the same name as the input FASTA file, suffixed with '.fastq'. A FASTQ file will be split into FASTA and QUAL files, and generated in the designated output directory. Usage: convert_fastaqual_fastq.py [options] Input Arguments: [REQUIRED] -f, --fasta_file_path Input FASTA or FASTQ file. [OPTIONAL] -q, --qual_file_path Required input QUAL file if converting to FASTQ. -o, --output_dir Output directory. Will be created if does not exist. [default: .] -c, --conversion_type Type of conversion: fastaqual_to_fastq or fastq_to_fastaqual [default: fastaqual_to_fastq] -a, --ascii_increment The number to add (subtract if converting from FASTQ) to the quality score to get the ASCII character (or numeric quality score). [default: 33] -F, --full_fasta_headers Include full FASTA headers in output file(s) (as opposed to merely the sequence label). [default: False] -b, --full_fastq Include identifiers on quality lines in the FASTQ file (those beginning with a "+"). Irrelevant when converting from FASTQ. [default=False] -m, --multiple_output_files Create multiple FASTQ files, one for each sample, or create multiple matching FASTA/QUAL for each sample. [default=False] Output: Outputs a complete or minimal FASTQ file, which omits the redundant sequence label on the quality scores, or splits a FASTQ file into matching FASTA/QUAL files. Example: Using the input files seqs.fna and seqs.qual, generate seqs.fastq in the fastq_files directory: Example: Using input seqs.fastq generate fasta and qual files in fastaqual directory:
convert_fastaqual_to_fastq.py – From a FASTA file and a matching QUAL file, generates a minimal FASTQ file. Description: From a FASTA file and a matching QUAL file, generates a minimal FASTQ file. A minimal FASTQ file omits the redundant sequence label on the quality scores; the quality scores for a sequence are assumed to follow immediately after the sequence with which they are associated. The output FASTQ file will be generated in the specified output directory with the same name as the input FASTA file, suffixed with '.fastq'. Usage: convert_fastaqual_to_fastq.py [options] Input Arguments: [REQUIRED] -f, --fasta_fp Input FASTA file. -q, --qual_fp Input QUAL file. [OPTIONAL] -o, --output_dir Output directory. Will be created if does not exist. [default: .] -a, --ascii_increment The number to add to the quality score to get the ASCII character. [default: 33] -F, --full_fasta_headers Include full FASTA headers in FASTQ file (as opposed to merely the sequence label). [default: False] -b, --full_fastq Include identifiers on quality lines in the FASTQ file (those beginning with a "+") [default=False] -m, --multiple_output_files Create multiple FASTQ files, one for each sample. [default=False] Output: Outputs a minimal FASTQ file, which omits the redundant sequence label on the quality scores. Example: Using the input files seqs.fna and seqs.qual, generate seqs.fastq in the fastq_files directory:
convert_otu_table_to_unifrac_sample_mapping.py – Convert a QIIME OTU table to a UniFrac sample mapping file Description: This script allows users who have picked OTUs in QIIME to convert their OTU table to a sample mapping (environment) file for use with the Unifrac web interface. Usage: convert_otu_table_to_unifrac_sample_mapping.py [options] Input Arguments: [REQUIRED] -i, --otu_table_fp Path to the otu table -o, --output_fp Path to output file Output: The result of this script is a sample mapping file for the UniFrac web interface. Example: Convert a biom-formatted OTU table to a unifrac sample mapping (environment) file:
convert_unifrac_sample_mapping_to_otu_table.py – Convert a UniFrac sample mapping file to an OTU table Description: This script allows users that have already created sample mapping (environment) files for use with the Unifrac web interface to use QIIME. QIIME records this data in an OTU table. Usage: convert_unifrac_sample_mapping_to_otu_table.py [options] Input Arguments: [REQUIRED] -i, --sample_mapping_fp Path to the sample mapping file -o, --output_fp Path to output file Output: The result of this script is an OTU table. Example: Convert a UniFrac sample mapping (environment) file into a biom-formatted OTU table:
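A minimal sketch of that conversion, with placeholder file names:

convert_unifrac_sample_mapping_to_otu_table.py -i sample_mapping.txt -o otu_table_from_mapping.biom  # file names are placeholders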
core_diversity_analyses.py – A workflow for running a core set of QIIME diversity analyses. Description: This script plugs several QIIME diversity analyses together to form a basic workflow beginning with a BIOM table, mapping file, and optional phylogenetic tree. The included scripts are those run by the workflow scripts alpha_rarefaction.py, beta_diversity_through_plots.py, summarize_taxa_through_plots.py, plus the (non-workflow) scripts make_distance_boxplots.py, compare_alpha_diversity.py, and otu_category_significance.py. To update parameters to the workflow scripts, you should pass the same parameters file that you would pass if calling the workflow script directly. Usage: core_diversity_analyses.py [options] Input Arguments: [REQUIRED] -i, --input_biom_fp The input biom file [REQUIRED] -o, --output_dir The output directory [REQUIRED] -m, --mapping_fp The mapping filepath [REQUIRED] -e, --sampling_depth Sequencing depth to use for even sub-sampling and maximum rarefaction depth. You should review the output of print_biom_table_summary.py to decide on this value. [OPTIONAL] -p, --parameter_fp Path to the parameter file, which specifies changes to the default behavior. For more information, see www.qiime.org/documentation/qiime_parameters_files.html [if omitted, default values will be used] -a, --parallel Run in parallel where available. Specify number of jobs to start with -O or in the parameters file. [default: False] --nonphylogenetic_diversity Apply non-phylogenetic alpha (chao1 and observed_species) and beta (bray_curtis) diversity calculations. This is useful if, for example, you are working with non-amplicon BIOM tables, or if a reliable tree is not available (e.g., if you're working with ITS amplicons) [default: False] --suppress_taxa_summary Suppress generation of taxa summary plots. [default: False] --suppress_beta_diversity Suppress beta diversity analyses. [default: False] --suppress_alpha_diversity Suppress alpha diversity analyses. [default: False] --suppress_otu_category_significance Suppress OTU/category significance analysis. [default: False] -t, --tree_fp Path to the tree file if one should be used. [default: no tree will be used] -c, --categories The metadata category or categories to compare (i.e., column headers in the mapping file) for categorical analyses. These should be passed as a comma-separated list. [default: None; do not perform categorical analyses] -w, --print_only Print the commands but don't call them – useful for debugging or recovering from failed runs. [default: False] -O, --jobs_to_start Number of jobs to start. NOTE: you must also pass -a to run in parallel; this defines the number of jobs to be started if and only if -a is passed [default: 2] Output: Run diversity analyses at 20 sequences/sample, with categorical analyses focusing on the SampleType and day categories. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/).
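A hedged sketch of that run; the BIOM table, mapping file, and tree names below are placeholders, and $PWD stands in for an absolute path as noted above:

core_diversity_analyses.py -i $PWD/otu_table.biom -o $PWD/core_diversity -m $PWD/map.txt -c SampleType,day -t $PWD/rep_set.tre -e 20  # file names are placeholders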
core_qiime_analyses.py – A workflow for running a core set of QIIME analyses. Description: This script plugs several QIIME steps together to form a basic full data analysis workflow. The steps include quality filtering and demultiplexing sequences (optional), running the pick_de_novo_otus.py workflow (pick otus and representative sequences, assign taxonomy, align representative sequences, build a tree, and build the OTU table), generating 2d and 3d beta diversity PCoA plots, generating alpha rarefaction plots, identifying OTUs that are differentially represented in different categories, and several additional analyses. Beta diversity calculations will be run both with and without an even sampling step, where the depth of sampling can either be passed to the script or QIIME will try to make a reasonable guess. Usage: core_qiime_analyses.py [options] Input Arguments: [REQUIRED] -i, --input_fnas The input fasta file(s) – comma-separated if more than one [REQUIRED] -o, --output_dir The output directory [REQUIRED] -m, --mapping_fp The mapping filepath [REQUIRED] [OPTIONAL] -p, --parameter_fp Path to the parameter file, which specifies changes to the default behavior. See #qiime-parameters. [if omitted, default values will be used] -q, --input_quals The 454 qual files. Comma-separated if more than one, and must correspond to the order of the fasta files. Not relevant if passing –suppress_split_libraries. [default: None] -f, --force Force overwrite of existing output directory (note: existing files in output_dir will not be removed) [default: None] -a, --parallel Run in parallel where available. Specify number of jobs to start with -O or in the parameters file. [default: False] -e, --seqs_per_sample Depth of coverage for diversity analyses that incorporate subsampling the OTU table to an equal number of sequences per sample. [default: determined automatically - bad choices can be made in some circumstances] --even_sampling_keeps_all_samples If -e/–seqs_per_sample is not provided, choose the even sampling depth so that all samples are retained (rather than the default, which will choose a sampling depth that may favor keeping more sequences by excluding some samples) [default: False] -t, --reference_tree_fp Path to the tree file if one should be used. Relevant for closed-reference-based OTU picking methods only (i.e., uclust_ref -C and BLAST) [default: de novo tree will be used] -c, --categories The metadata category or categories to compare (i.e., column headers in the mapping file) for the otu_category_significance, supervised_learning.py, and cluster_quality.py steps. Pass a comma-separated list if more than one category [default: None; skip these steps] --suppress_split_libraries Skip demultiplexing/quality filtering (i.e. split_libraries). This assumes that sequence identifiers are in post-split_libraries format (i.e., sampleID_seqID) [default: False] -O, --jobs_to_start Number of jobs to start. NOTE: you must also pass -a to run in parallel; this defines the number of jobs to be started if and only if -a is passed [default: 4] Output: Run serial analysis using a custom parameters file (-p), and guess the even sampling depth (no -e provided). ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). Run serial analysis using a custom parameters file (-p), and guess the even sampling depth (no -e provided). Skip split libraries by starting with already demultiplexed sequences. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/).
count_seqs.py – Description: Usage: count_seqs.py [options] Input Arguments: [REQUIRED] -i, --input_fps The input filepaths (comma-separated) [OPTIONAL] -o, --output_fp The output filepath [default: write to stdout] --suppress_errors Suppress warnings about missing files [default: False] Output: Count the sequences in a fasta file and write results to stdout. Count the sequences in a fasta file and a fastq file and write results to file.
Note that fastq files can only be processed if they end with .fastq – all other files are assumed to be fasta. Count the sequences in all .fasta files in the current directory and write results to stdout. Note that the -i option must be quoted.
demultiplex_fasta.py – Demultiplex fasta data according to barcode sequences or data supplied in fasta labels. Description: Using barcodes and/or data from fasta labels provided in a mapping file, will demultiplex sequences from an input fasta file. Barcodes will be removed from the sequences in the output fasta file by default. If a quality scores file is supplied, the quality score file will be truncated to match the output fasta file. The default barcode type is 12 base pair Golay codes. Alternative barcodes allowed are 8 base pair Hamming codes, variable_length, or generic barcodes of a specified length. Generic barcodes utilize mismatch counts for correction. One can also use an added demultiplex field (-j option) to specify data in the fasta labels that can be used alone or in conjunction with barcode sequences for demultiplexing. All barcode correction is disabled when variable length barcodes are used. Usage: demultiplex_fasta.py [options] Input Arguments: [REQUIRED] -m, --map Name of mapping file. NOTE: Must contain a header line indicating SampleID in the first column and BarcodeSequence in the second, LinkerPrimerSequence in the third. -f, --fasta Names of fasta files, comma-delimited [OPTIONAL] -q, --qual File paths of qual files, comma-delimited [default: None] -B, --keep_barcode Do not remove barcode from sequences -b, --barcode_type Barcode type, hamming_8, golay_12, variable_length (will disable any barcode correction if variable_length set), or a number representing the length of the barcode, such as -b 4. The max barcode errors (-e) should be lowered for short barcodes. [default: golay_12] -o, --dir_prefix Directory prefix for output files [default: .] -e, --max_barcode_errors Maximum number of errors in barcode. If using generic barcodes every 0.5 specified counts as a primer mismatch. [default: 1.5] -n, --start-numbering-at Seq id to use for the first sequence [default: 1] --retain_unassigned_reads Retain sequences which cannot be demultiplexed in a separate output sequence file [default: False] -c, --disable_bc_correction Disable attempts to find nearest corrected barcode. Can improve performance. [default: False] -F, --save_barcode_frequencies Save frequencies of barcodes as they appear in the given sequences. Sorts in order of largest to smallest. Will do nothing if barcode type is 0 or variable_length. [default: False] -j, --added_demultiplex_field Use -j to add a field to use in the mapping file as an additional demultiplexing option to the barcode. All combinations of barcodes and the values in these fields must be unique. The fields must contain values that can be parsed from the fasta labels such as "plate=R_2008_12_09". In this case, "plate" would be the column header and "R_2008_12_09" would be the field data (minus quotes) in the mapping file. To use the run prefix from the fasta label, such as ">FLP3FBN01ELBSX", where "FLP3FBN01" is generated from the run ID, use "-j run_prefix" and set the run prefix to be used as the data under the column header "run_prefix". [default: None] Output: Four files can be generated by demultiplex_fasta.py: 1. seqs.fna - This contains the fasta sequences, demultiplexed according to barcodes and/or the added demultiplex field. 2.
demultiplexed_sequences.log - Contains details about demultiplexing stats 3. seqs.qual - If quality score file(s) are supplied, these will be truncated to match the seqs.fna file after barcode removal, if barcode removal is enabled. 4. seqs_not_assigned.fna - If --retain_unassigned_reads is enabled, will write all sequences that cannot be demultiplexed to this file. Also will create a seqs_not_assigned.qual file if a quality file is supplied. Standard Example: Using a single 454 run, which contains a single FASTA, QUAL, and mapping file while using default parameters and outputting the data into the directory "demultiplexed_output": For the case where there are multiple FASTA and QUAL files, the user can run the following command as long as there are not duplicate barcodes listed in the mapping file: Duplicate Barcode Example: An example of this situation would be a study with 1200 samples. You wish to have 400 samples per run, so you split the analysis into three runs and reuse barcodes (you only have 600). After initial analysis you determine a small subset is underrepresented (<500 sequences per sample) and you boost the number of sequences per sample for this subset by running a fourth run. Since the same sample IDs are in more than one run, it is likely that some sequences will be assigned the same unique identifier by demultiplex_fasta.py when it is run separately on the four different runs, each with their own barcode file. This will cause a problem in file concatenation of the four different runs into a single large file. To avoid this, you can use the '-n' parameter, which defines a start index for demultiplex_fasta.py fasta label enumeration. From experience, most 454 runs (when combining both files for a single plate) will have 350,000 to 650,000 sequences. Thus, if Run 1 for demultiplex_fasta.py uses '-n 1000000', Run 2 uses '-n 2000000', etc., then you are guaranteed to have unique identifiers after concatenating the results of multiple 454 runs. With newer technologies you will just need to make sure that your start index spacing is greater than the potential number of sequences. To run demultiplex_fasta.py, you will need two or more (depending on the number of times the barcodes were reused) separate mapping files (one for each run, for example one for Run 1 and another for Run 2), then you can run demultiplex_fasta.py using the FASTA and mapping file for Run 1 and the FASTA and mapping file for Run 2. Once you have independently run demultiplex_fasta on each file, followed by quality filtering, you can concatenate (cat) the sequence files generated. You can also concatenate the mapping files, since the barcodes are not necessary for downstream analyses, unless the same sample ids are found in multiple mapping files. Run demultiplex_fasta.py on Run 1: Run demultiplex_fasta.py on Run 2: Barcode Decoding Example: The standard barcode types supported by demultiplex_fasta.py are golay (Length: 12 NTs) and hamming (Length: 8 NTs). For situations where the barcodes are of a different length than golay and hamming, the user can define a generic barcode type "-b" as an integer, where the integer is the length of the barcode used in the study.
For the case where the generic 8 base pair barcodes were used, you can use the following command: To use the run prefix at the beginning of the fasta label for demultiplexing, there has to be a field in the mapping file labeled "run_prefix", which can be used with the following command:
denoise_wrapper.py – Denoise a flowgram file Description: This script will denoise a flowgram file in .sff.txt format, which is the output of sffinfo. Usage: denoise_wrapper.py [options] Input Arguments: [REQUIRED] -i, --input_file Path to flowgram files (.sff.txt), comma separated -f, --fasta_file Path to fasta file from split_libraries.py [OPTIONAL] -o, --output_dir Path to output directory [default: denoised_seqs/] -n, --num_cpus Number of CPUs [default: 1] --force_overwrite Overwrite files in output directory [default: False] -m, --map_fname Name of mapping file. Has to contain the field LinkerPrimerSequence. [REQUIRED unless –primer specified] -p, --primer Primer sequence [REQUIRED unless –map_fname specified] --titanium Select Titanium defaults for denoiser, otherwise use FLX defaults [default: False] Output: This script produces an OTU-like mapping file along with a FASTA-format file of denoised sequences. Note that the sequences resulting from denoising are not real OTUs, and have to be sent to pick_otus.py if the user wishes to have a defined similarity threshold. Example: Denoise flowgrams in file 454Reads.sff.txt, discard flowgrams not in seqs.fna, and extract primer from map.txt: Multi-core Example: Denoise flowgrams in file 454Reads.sff.txt using 2 cores on your machine in parallel:
denoiser.py – Remove noise from 454 sequencing data Description: The denoiser removes sequencing noise characteristic of pyrosequencing by flowgram clustering. For a detailed explanation of the underlying algorithm see (Reeder and Knight, Nature Methods 7(9), 2010). Usage: denoiser.py [options] Input Arguments: [REQUIRED] -i, --input_file Path to flowgram file. Separate several files by commas [REQUIRED] [OPTIONAL] -f, --fasta_fp Path to fasta input file. Reads not in the fasta file are filtered out before denoising. File format is as produced by split_libraries.py [default: None] -o, --output_dir Path to output directory [default: random dir in ./] -c, --cluster Use cluster/multiple CPUs for flowgram alignments [default: False] -p, --preprocess_fp Do not do preprocessing (phase I), instead use already preprocessed data in PREPROCESS_FP --checkpoint_fp Resume denoising from checkpoint. Be careful when changing parameters for a resumed run. Requires -p option. [default: None] -s, --squeeze Use run-length encoding for prefix filtering in phase I [default: False] -S, --split Split input into per library sets and denoise separately [default: False] --force Force overwrite of existing directory [default: False] --primer Primer sequence [default: CATGCTGCCTCCCGTAGGAGT] -n, --num_cpus Number of CPUs, requires -c [default: 1] -m, --max_num_iterations Maximal number of iterations in phase II.
None means unlimited iterations [default: None] -b, --bail_out Stop clustering in phase II with clusters smaller than or equal to this size [default: 1] --percent_id Sequence similarity clustering threshold [default: 0.97] --low_cut-off Low clustering threshold for phase II [default: 3.75] --high_cut-off High clustering threshold for phase III [default: 4.5] --low_memory Use slower, low memory method [default: False] -e, --error_profile Path to error profile [default= /Users/caporaso/.virtualenvs/qiime/lib/python2.7/site-packages/qiime/support_files/denoiser/Data/FLX_error_profile.dat] --titanium Shortcut for -e /Users/caporaso/.virtualenvs/qiime/lib/python2.7/site-packages/qiime/support_files/denoiser/Data//Titanium_error_profile.dat –low_cut-off=4 –high_cut_off=5. Warning: overwrites all previous cut-off values [DEFAULT: False] Output: centroids.fasta: The cluster representatives of each cluster. singletons.fasta: Contains all unclustered reads. denoiser_mapping.txt: This file contains the actual clusters. The cluster centroid is given first, the cluster members follow after the ':'. checkpoints/ : directory with checkpoints. Note that the centroids and singleton files are disjoint. For most downstream analyses one wants to cat the two files. Run denoiser on flowgrams in 454Reads.sff.txt with read-to-barcode mapping in seqs.fna, put results into Outdir, log progress in Outdir/denoiser.log Multiple sff.txt files: Run denoiser on two flowgram files in 454Reads_1.sff.txt and 454Reads_2.sff.txt with read-to-barcode mapping in seqs.fna, put results into Outdir, log progress in Outdir/denoiser.log Denoise multiple libraries separately: Run denoiser on flowgrams in 454Reads.sff.txt with read-to-barcode mapping in seqs.fna, split input files into libraries and process each library separately, put results into Outdir, log progress in Outdir/denoiser.log Resuming a failed run: Resume a previous denoiser run from breakpoint stored in Outdir_from_failed_run/checkpoints/checkpoint100.pickle. The checkpoint option requires the -p or –preprocess option, which usually can be set to the output dir of the failed run. All other arguments must be identical to the failed run.
denoiser_preprocess.py – Run the first phase of the denoiser algorithm: prefix clustering Description: The script denoiser_preprocess.py runs the first clustering phase, which groups reads based on common prefixes. Usage: denoiser_preprocess.py [options] Input Arguments: [REQUIRED] -i, --input_file Path to flowgram file [REQUIRED] [OPTIONAL] -f, --fasta_file Path to fasta input file [default: None] -s, --squeeze Use run-length encoding for prefix filtering [default: False] -l, --log_file Path to log file [default: preprocess.log] -p, --primer Primer sequence used for the amplification [default: CATGCTGCCTCCCGTAGGAGT] -o, --output_dir Path to output directory [default: /tmp/] Output: prefix_dereplicated.sff.txt: human readable sff file containing the flowgram of the cluster representative of each cluster. prefix_dereplicated.fasta: Fasta file containing the cluster representative of each cluster. prefix_mapping.txt: This file contains the actual clusters. The cluster centroid is given first, the cluster members follow after the ':'. Run program on flowgrams in 454Reads.sff. Remove reads which are not in split_lib_filtered_seqs.fasta. Remove primer CATGCTGCCTCCCGTAGGAGT from reads before running phase I.
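A sketch of that preprocessing run, using the documented options and the file names mentioned above; the output directory name is a placeholder:

denoiser_preprocess.py -i 454Reads.sff -f split_lib_filtered_seqs.fasta -p CATGCTGCCTCCCGTAGGAGT -o preprocessed/  # output directory name is a placeholder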
denoiser_worker.py – Start a denoiser worker process Description: The workers are automatically started by the denoiser.py script. You usually never need to use this script yourself. A worker waits for data and does flowgram alignments once it receives it. Usage: denoiser_worker.py [options] Input Arguments: [REQUIRED] -f, --file_path Path used as prefix for worker data files [REQUIRED] -p, --port Server port [REQUIRED] -s, --server_address Server address [REQUIRED] [OPTIONAL] -e, --error_profile Path to error profile [DEFAULT: /Users/caporaso/.virtualenvs/qiime/lib/python2.7/site-packages/qiime/support_files/denoiser/Data/FLX_error_profile.dat] -c, --counter Round counter to start this worker with [default: 0] Output: The denoiser worker writes a log file if the verbose flag is set. Start worker and connect to server listening on port 12345 on the same machine (localhost)
detrend.py – Detrend Principal Coordinates Description: Ordination plots (e.g. principal coordinates analysis) of samples that lie along a naturally occurring gradient (e.g. depth, time, pH) often exhibit a curved shape known as the "arch" or "horseshoe" effect. This can cause samples near the endpoints of the gradient to appear closer to one another than would be expected. This script will attempt to remove any (compounded) quadratic curvature in a set of 2D coordinates. If requested, it will also report an evaluation of the association of the transformed coordinates with a known gradient. Usage: detrend.py [options] Input Arguments: [REQUIRED] -i, --input_fp Path to read PCoA/PCA/ordination table [OPTIONAL] -o, --output_dir Path to output directory [default: .] -m, --map_fp Path to metadata file [default: None] -c, --gradient_variable Column header for gradient variable in metadata table [default: None] -r, --suppress_prerotate Suppress pre-rotation of the coordinates for optimal detrending; not pre-rotating assumes that the curvature is symmetrical across the vertical axis [default: False] Output: The output is detrended PCoA matrices. Examples: The simplest usage takes as input only a table of principal coordinates: One may also include a metadata file with a known real-valued gradient as one of the columns. In this case, the output folder will include a text file providing a summary of how well the analysis fit with the hypothesis that the primary variation is due to the gradient (in this case, "DEPTH"): Note that if you provide a real-valued known gradient the script will prerotate the first two axes of the PCoA coords in order to achieve optimal alignment with that gradient. This can be disabled with "-r":
dissimilarity_mtx_stats.py – Calculate mean, median and standard deviation from a set of distance matrices Description: This script reads in all (dis)similarity matrices from an input directory (input_dir), then calculates and writes the mean, median, and standard deviation (stdev) to an output folder. The input_dir must contain only (dis)similarity matrices, and only those you wish to perform statistical analyses on. Usage: dissimilarity_mtx_stats.py [options] Input Arguments: [REQUIRED] -i, --input_dir Path to input directory -o, --output_dir Path to store result files Output: The outputs are in distance matrix format, where each value is the mean, median, or stdev of that element in all the input distance matrices. Example: This example takes the "dists/" directory as input and returns the results in the "dist_stats/" directory.
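That example corresponds to a command along these lines:

dissimilarity_mtx_stats.py -i dists/ -o dist_stats/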
distance_matrix_from_mapping.py – Calculate the pairwise dissimilarity on one column of a mapping file Description: The input for this script is a mapping file and the name of a column (which must be numeric) from which a distance matrix will be created. The output of this script is a distance matrix containing a dissimilarity value for each pairwise comparison. As this is a univariate procedure, only one metric is supported: d = c-b. Usage: distance_matrix_from_mapping.py [options] Input Arguments: [REQUIRED] -i, --input_path Mapping filepath. -c, --column String containing the name of the column in the mapping file, e.g. 'DOB'. If you pass two columns separated by a comma (e.g. 'Latitude,Longitude') the script will calculate the Vincenty formula (WGS-84) for distance between two Latitude/Longitude points. [OPTIONAL] -o, --output_fp Output directory. One will be created if it doesn't exist. [default=map_distance_matrix.txt] Output: The output of distance_matrix_from_mapping.py is a file containing a distance matrix between rows corresponding to a pair of columns in a mapping file. Pairwise dissimilarity: To calculate the distance matrix (using euclidean distance) on a column of the mapping file, where the results are output to DOB.txt, use the following command: Pairwise dissimilarity using the Vincenty formula for distance between two Latitude/Longitude points: To calculate the distance matrix (using the Vincenty formula) on a column of the mapping file, where the results are output to lat_long.txt, use the following command:
exclude_seqs_by_blast.py – Exclude contaminated sequences using BLAST Description: This code is designed to allow users of the QIIME workflow to conveniently exclude unwanted sequences from their data. This is mostly useful for excluding human sequences from runs to comply with Internal Review Board (IRB) requirements, but may also have other uses (e.g. perhaps excluding a major bacterial contaminant). Sequences from a run are searched against a user-specified subject database, where BLAST hits are screened by e-value and the percentage of the query that aligns to the sequence. For human screening THINK CAREFULLY about the data set that you screen against. Are you excluding human non-coding sequences? What about mitochondrial sequences? This point is CRITICAL because submitting human sequences that are not IRB-approved is BAD. (e.g., you would NOT want to screen against just the coding sequences of the human genome as found in the KEGG .nuc files) One valid approach is to screen all putative 16S rRNA sequences against greengenes to ensure they are bacterial rather than human. WARNING: You cannot use this script if there are spaces in the path to the database of fasta files because formatdb cannot handle these paths (this is a limitation of NCBI's tools and we have no control over it). Usage: exclude_seqs_by_blast.py [options] Input Arguments: [REQUIRED] -i, --querydb The path to a FASTA file containing query sequences -d, --subjectdb The path to a FASTA file to BLAST against -o, --outputdir The output directory [OPTIONAL] -e, --e_value The e-value cutoff for blast queries [default: 1e-10]. -p, --percent_aligned The % alignment cutoff for blast queries [default: 0.97]. --no_clean If set, don't delete files generated by formatdb after running [default: False]. --blastmatroot Path to a folder containing blast matrices [default: None]. --working_dir Working dir for BLAST [default: /Users/caporaso/temp]. -m, --max_hits Max hits parameter for BLAST.
CAUTION: Because filtering on alignment percentage occurs after BLAST, a max hits value of 1 in combination with an alignment percent filter could miss valid contaminants. [default: 100] -w, --word_size Word size to use for BLAST search [default: 28] -n, --no_format_db If this flag is specified, format_db will not be called on the subject database (formatdb will be set to False). This is useful if you have already formatted the database and a) it took a very long time or b) you want to run the script in parallel on the pre-formatted database [default: False] Output: Four output files are generated based on the supplied outputpath + unique suffixes: 1. "filename_prefix".matching: A FASTA file of sequences that did pass the screen (i.e. matched the database and passed all filters). 2. "filename_prefix".non-matching: A FASTA file of sequences that did not pass the screen. 3. "filename_prefix".raw_blast_results: Contains the raw BLAST results from the screening. 4. "filename_prefix".sequence_exclusion_log: A log file summarizing the options used and results obtained. In addition, if the --no_clean option is passed, the files generated by formatdb will be kept in the same directory as subjectdb. Examples: The following is a simple example, where the user can take a given FASTA file (i.e. resulting FASTA file from pick_rep_set.py) and blast those sequences against a reference FASTA file containing the set of sequences which are considered contaminated: Alternatively, if the user would like to change the percent of aligned sequence coverage ("-p") or the maximum E-value ("-e"), they can use the following command:
extract_seqs_by_sample_id.py – Extract sequences based on the SampleID Description: This script creates a fasta file which will contain only sequences that ARE associated with a set of sample IDs, OR all sequences that are NOT associated with a set of sample IDs (-n). Usage: extract_seqs_by_sample_id.py [options] Input Arguments: [REQUIRED] -i, --input_fasta_fp Path to the input fasta file -o, --output_fasta_fp The output fasta file [OPTIONAL] -n, --negate Negate the sample ID list (i.e., output sample ids not passed via -s) [default: False] -s, --sample_ids Comma-separated sample_ids to include in output fasta file (or exclude if –negate), or string describing mapping file states defining sample ids (mapping_fp must be provided for the latter) -m, --mapping_fp The mapping filepath Output: The script produces a fasta file containing only sequences from the specified SampleIDs. Examples: Create the file outseqs.fasta (-o), which will be a subset of inseqs.fasta (-i) containing only the sequences THAT ARE associated with sample ids S2, S3, S4 (-s). As always, sample IDs are case-sensitive: Create the file outseqs.fasta (-o), which will be a subset of inseqs.fasta (-i) containing only the sequences THAT ARE NOT (-n) associated with sample ids S2, S3, S4 (-s). As always, sample IDs are case-sensitive: Create the file outseqs.fasta (-o), which will be a subset of inseqs.fasta (-i) containing only the sequences THAT ARE associated with sample ids whose "Treatment" value is "Fast" in the mapping file:
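A sketch of that mapping-file-based extraction, assuming the mapping file is named map.txt (a placeholder) and that the mapping-state string follows the usual 'CATEGORY:value' form:

extract_seqs_by_sample_id.py -i inseqs.fasta -o outseqs.fasta -m map.txt -s "Treatment:Fast"  # mapping file name is a placeholder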
Additionally, the user can supply a lanemask file that defines which positions should be included when building the tree, and which should be ignored. Typically, this will differentiate between non-conserved positions, which are uninformative for tree building, and conserved positions, which are informative for tree building. FILTERING ALIGNMENTS WHICH WERE BUILT WITH PYNAST AGAINST THE GREENGENES CORE SET ALIGNMENT SHOULD BE CONSIDERED AN ESSENTIAL STEP. Usage: filter_alignment.py [options] Input Arguments: [REQUIRED] -i, --input_fasta_file The input fasta file [OPTIONAL] -o, --output_dir The output directory [default: .] -m, --lane_mask_fp Path to lanemask file [default: /Users/caporaso/data/greengenes_core_sets/lanemask_in_1s_and_0s.txt] -s, --suppress_lane_mask_filter Suppress lane mask filtering (necessary to turn off lane-mask-based filtering when a qiime_config default is provided for --lane_mask_fp) [default: False] -g, --allowed_gap_frac Gap filter threshold, filters positions which are gaps in > allowed_gap_frac of the sequences [default: 0.999999] -r, --remove_outliers Remove seqs very dissimilar to the alignment consensus (see --threshold). [default: False] -t, --threshold With -r, remove seqs whose dissimilarity to the consensus sequence is approximately > x standard deviations above the mean of the sequences [default: 3.0] -e, --entropy_threshold Sets percent threshold for removing base positions with the highest entropy. For example, if 0.10 were specified, the top 10% most entropic base positions would be filtered. If this value is used, any lane mask supplied will be ignored. Entropy filtering occurs after gap filtering. [default: None] Output: The output of filter_alignment.py consists of a single FASTA file, which ends with “pfiltered.fasta”, where the “p” stands for positional filtering of the columns. Examples: As a simple example of this script, the user can use the following command, which consists of an input FASTA file (i.e. the resulting file from align_seqs.py), a lanemask template file and the output directory “filtered_alignment/”: Alternatively, if the user would like to use a different gap fraction threshold (“-g”), they can use the following command: filter_distance_matrix.py – Filter a distance matrix to contain only a specified set of samples. Description: Remove samples from a distance matrix based on a mapping file, an otu table or a list of sample ids. Usage: filter_distance_matrix.py [options] Input Arguments: [REQUIRED] -i, --input_distance_matrix The input distance matrix -o, --output_distance_matrix Path to store the output distance matrix [OPTIONAL] --sample_id_fp A list of sample identifiers (or tab-delimited lines with a sample identifier in the first field) which should be retained -t, --otu_table_fp The otu table filepath -m, --mapping_fp Path to the mapping file -s, --valid_states String containing valid states, e.g. 'STUDY_NAME:DOB' --negate Discard specified samples (instead of keeping them) [default: False] Output: Filter sample ids listed in sample_id_list.txt from dm.txt Filter sample ids in otu_table.biom from dm.txt Filter sample ids where DOB is 20061218 in Fasting_Map.txt. (Run “filter_samples_from_otu_table.py -h” for additional information on how metadata filtering can be specified.)
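The example command blocks were not preserved in this copy of the documentation. Based on the options listed above, the three filter_distance_matrix.py cases correspond to invocations along the following lines (file names are illustrative, not prescribed by the script):
filter_distance_matrix.py -i dm.txt -o dm_filtered.txt --sample_id_fp sample_id_list.txt
filter_distance_matrix.py -i dm.txt -o dm_filtered.txt -t otu_table.biom
filter_distance_matrix.py -i dm.txt -o dm_filtered.txt -m Fasting_Map.txt -s 'DOB:20061218'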
filter_fasta.py – This script can be applied to remove sequences from a fasta or fastq file based on input criteria. Description: Usage: filter_fasta.py [options] Input Arguments: [REQUIRED] -f, --input_fasta_fp Path to the input fasta file -o, --output_fasta_fp The output fasta filepath [OPTIONAL] -m, --otu_map An OTU map where sequence ids are those which should be retained -s, --seq_id_fp A list of sequence identifiers (or tab-delimited lines with a seq identifier in the first field) which should be retained -b, --biom_fp A biom file where otu identifiers should be retained -a, --subject_fasta_fp A fasta file where the seq ids should be retained. -p, --seq_id_prefix Keep seqs where seq_id starts with this prefix --sample_id_fp Keep seqs where seq_id starts with a sample id listed in this file -n, --negate Discard passed seq ids rather than keep passed seq ids [default: False] --mapping_fp Mapping file path (for use with --valid_states) [default: None] --valid_states Description of sample ids to retain (for use with --mapping_fp) [default: None] Output: OTU map-based filtering: Keep all sequences that show up in an OTU map. Chimeric sequence filtering: Discard all sequences that show up in chimera checking output. NOTE: It is very important to pass -n here as this tells the script to negate the request, i.e., discard all sequences that are listed via -s. This is necessary to remove the identified chimeras from inseqs.fasta. Sequence list filtering: Keep all sequences from a fasta file that are listed in a text file. biom-based filtering: Keep all sequences that are listed as observations in a biom file. fastq filtering: Keep all sequences from a fastq file that are listed in a text file (note: file name must end with .fastq to support fastq filtering). sample id list filtering: Keep all sequences from a fasta file where the sample id portion of the sequence identifier is listed in a text file (sequence identifiers in the fasta file must be in post-split libraries format: sampleID_seqID). filter_otus_by_sample.py – Filter OTU mapping file and sequences by SampleIDs Description: This filter allows for the removal of sequences and OTUs containing user-specified Sample IDs, for instance, the removal of negative control samples. This script identifies OTUs containing the specified Sample IDs and removes their corresponding sequences from the sequence collection. Usage: filter_otus_by_sample.py [options] Input Arguments: [REQUIRED] -i, --otu_map_fp Path to the input OTU map (i.e., the output from pick_otus.py) -f, --input_fasta_fp Path to the input fasta file -s, --samples_to_extract This is a list of sample ids, which should be removed from the OTU file [OPTIONAL] -o, --output_dir Path to the output directory Output: As a result, a new OTU and sequence file is generated and written to a randomly generated folder whose name starts with “filter_by_otus”. Also included in the folder is another FASTA file containing the removed sequences, leaving the user with 3 files. Example: The following command can be used, where all options are passed (using the resulting OTU file from pick_otus.py, the FASTA file from split_libraries.py and removal of sample 'PC.636') with the resulting data being written to the output directory “filtered_otus/”:
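The original command block is not preserved here; given the options above, a plausible invocation matching this example (file names illustrative) would be:
filter_otus_by_sample.py -i seqs_otus.txt -f seqs.fna -s PC.636 -o filtered_otus/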
filter_otus_from_otu_table.py – Filter OTUs from an OTU table based on their observation counts or identifier. Description: Usage: filter_otus_from_otu_table.py [options] Input Arguments: [REQUIRED] -i, --input_fp The input otu table filepath in biom format -o, --output_fp The output filepath in biom format [OPTIONAL] --negate_ids_to_exclude Keep OTUs in otu_ids_to_exclude_fp rather than discard them [default: False] -n, --min_count The minimum total observation count of an otu for that otu to be retained [default: 0] --min_count_fraction Fraction of the total observation (sequence) count to apply as the minimum total observation count of an otu for that otu to be retained. This is a fraction, not a percent, so if you want to filter to 1%, you specify 0.01. [default: 0] -x, --max_count The maximum total observation count of an otu for that otu to be retained [default: infinity] -s, --min_samples The minimum number of samples an OTU must be observed in for that otu to be retained [default: 0] -y, --max_samples The maximum number of samples an OTU must be observed in for that otu to be retained [default: infinity] -e, --otu_ids_to_exclude_fp File containing list of OTU ids to exclude: can be a text file with one id per line, a text file where id is the first value in a tab-separated line, or can be a fasta file (extension must be .fna or .fasta) [default: None] Output: Singleton filtering: Discard all OTUs that are observed fewer than 2 times (i.e., singletons) Abundance filtering: Discard all OTUs that are observed greater than 100 times (e.g., if you want to look at low abundance OTUs only) Chimera filtering: Discard all OTUs listed in chimeric_otus.txt (e.g., to remove chimeric OTUs from an OTU table)
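The command blocks for these examples were not carried over; based on the options above, the three cases correspond to invocations of roughly the following form (output file names illustrative):
filter_otus_from_otu_table.py -i otu_table.biom -o otu_table_no_singletons.biom -n 2
filter_otus_from_otu_table.py -i otu_table.biom -o otu_table_low_abundance.biom -x 100
filter_otus_from_otu_table.py -i otu_table.biom -o otu_table_non_chimeric.biom -e chimeric_otus.txt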
filter_samples_from_otu_table.py – Filters samples from an OTU table on the basis of the number of observations in that sample, or on the basis of sample metadata. Mapping file can also be filtered to the resulting set of sample ids. Description: Usage: filter_samples_from_otu_table.py [options] Input Arguments: [REQUIRED] -i, --input_fp The input otu table filepath in biom format -o, --output_fp The output filepath in biom format [OPTIONAL] -m, --mapping_fp Path to the map file [default: None] --output_mapping_fp Path to write filtered mapping file [default: filtered mapping file is not written] --sample_id_fp Path to file listing sample ids to keep [default: None] -s, --valid_states String describing valid states (e.g. 'Treatment:Fasting') [default: None] -n, --min_count The minimum total observation count in a sample for that sample to be retained [default: 0] -x, --max_count The maximum total observation count in a sample for that sample to be retained [default: infinity] Output: Abundance filtering (low coverage): Filter samples with fewer than 150 observations from the otu table. Abundance filtering (high coverage): Filter samples with greater than 149 observations from the otu table. Metadata-based filtering (positive): Filter samples from the table, keeping samples where the value for 'Treatment' in the mapping file is 'Control'. Metadata-based filtering (negative): Filter samples from the table, keeping samples where the value for 'Treatment' in the mapping file is not 'Control'. List-based filtering: Filter samples where the id is listed in samples_to_keep.txt. filter_taxa_from_otu_table.py – Filter taxa from an OTU table Description: This script filters an OTU table based on taxonomic metadata. It can be applied for positive filtering (i.e., keeping only certain taxa), negative filtering (i.e., discarding only certain taxa), or both at the same time. Usage: filter_taxa_from_otu_table.py [options] Input Arguments: [REQUIRED] -i, --input_otu_table_fp The input otu table filepath -o, --output_otu_table_fp The output otu table filepath [OPTIONAL] -p, --positive_taxa Comma-separated list of taxa to retain [default: None; retain all taxa] -n, --negative_taxa Comma-separated list of taxa to discard [default: None; retain all taxa] --metadata_field Observation metadata identifier to filter based on [default: taxonomy] Output: Filter otu_table.biom to include only OTUs identified as p__Bacteroidetes or p__Firmicutes. Filter otu_table.biom to exclude OTUs identified as p__Bacteroidetes or p__Firmicutes. Filter otu_table.biom to include OTUs identified as p__Firmicutes but not c__Clostridia. filter_tree.py – This script prunes a tree based on a set of tip names Description: This script takes a tree and a list of OTU IDs (in one of several supported formats) and outputs a subtree retaining only the tips on the tree which are found in the input list of OTUs (or not found, if the --negate option is provided). Usage: filter_tree.py [options] Input Arguments: [REQUIRED] -i, --input_tree_filepath Input tree filepath -o, --output_tree_filepath Output tree filepath [OPTIONAL] -n, --negate If negate is True will remove input tips/seqs, if negate is False, will retain input tips/seqs [default: False] -t, --tips_fp A list of tips (one tip per line) or sequence identifiers (tab-delimited lines with a seq identifier in the first field) which should be retained [default: None] -f, --fasta_fp A fasta file where the seq ids should be retained [default: None] Output: Output is a pruned tree in newick format. Prune a tree to include only the tips in tips_to_keep.txt: Prune a tree to remove the tips in tips_to_remove.txt. Note that the -n/--negate option must be passed for this functionality: Prune a tree to include only the tips found in the fasta file provided: fix_arb_fasta.py – Reformat ARB FASTA files Description: This script fixes ARB FASTA formatting by repairing incorrect line break characters, stripping spaces and replacing “.” with “-” characters. Usage: fix_arb_fasta.py [options] Input Arguments: [REQUIRED] -f, --input_fasta_fp Path to the input fasta file [OPTIONAL] -o, --output_fp Path where output will be written [default: print to screen] Output: The reformatted sequences are written to stdout or to the file path provided with -o. Example: Fix the input ARB FASTA format file arb.fasta and print the result to stdout: Example saving to an output file: Fix the input ARB FASTA format file arb.fasta and write the result to fixed.fasta:
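The command blocks for these two examples were lost in this copy; from the -f and -o options above, they correspond to:
fix_arb_fasta.py -f arb.fasta
fix_arb_fasta.py -f arb.fasta -o fixed.fasta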
identify_chimeric_seqs.py – Identify chimeric sequences in input FASTA file Description: A FASTA file of sequences can be screened to remove chimeras (sequences generated due to the PCR amplification of multiple templates or parent sequences). QIIME currently includes a taxonomy-assignment-based approach, blast_fragments, for identifying sequences as chimeric, as well as the ChimeraSlayer and usearch61 algorithms. 1. Blast_fragments approach: The reference sequences (-r) and id-to-taxonomy map (-t) provided are the same format as those provided to assign_taxonomy.py. The reference sequences are in fasta format, and the id-to-taxonomy map contains tab-separated lines where the first field is a sequence identifier, and the second field is the taxonomy separated by semi-colons (e.g., Archaea;Euryarchaeota;Methanobacteriales;Methanobacterium). The reference collection should be derived from a chimera-checked database (such as the full greengenes database), and filtered to contain only sequences at, for example, a maximum of 97% sequence identity. 2. ChimeraSlayer: ChimeraSlayer uses BLAST to identify potential chimera parents and computes the optimal branching alignment of the query against two parents. We suggest using the PyNAST-aligned representative sequences as input. 3. usearch61: usearch61 performs both de novo (abundance based) and reference based chimera detection. Unlike the other two chimera checking methods, unclustered sequences should be used as input rather than a representative sequence set, as these sequences need to be clustered to get abundance data. The results can be taken as the union or intersection of all input sequences not flagged as chimeras. For details, see the usearch61 documentation. Usage: identify_chimeric_seqs.py [options] Input Arguments: [REQUIRED] -i, --input_fasta_fp Path to the input fasta file [OPTIONAL] -t, --id_to_taxonomy_fp Path to tab-delimited file mapping sequences to assigned taxonomy. Each assigned taxonomy is provided as a comma-separated list. [default: None; REQUIRED when method is blast_fragments] -r, --reference_seqs_fp Path to reference sequences (used to build a blast db when method is blast_fragments, or as the reference database for usearch61). [default: None; REQUIRED when method is blast_fragments if no blast_db is provided; suppress the requirement for usearch61 with --suppress_usearch61_ref] -a, --aligned_reference_seqs_fp Path to (Py)Nast aligned reference sequences. REQUIRED when method is ChimeraSlayer [default: None] -b, --blast_db Database to blast against. Must provide either --blast_db or --reference_seqs_fp when method is blast_fragments [default: None] -m, --chimera_detection_method Chimera detection method. Choices: blast_fragments or ChimeraSlayer or usearch61. [default: ChimeraSlayer] -n, --num_fragments Number of fragments to split sequences into (i.e., number of expected breakpoints + 1) [default: 3] -d, --taxonomy_depth Number of taxonomic divisions to consider when comparing taxonomy assignments [default: 4] -e, --max_e_value Max e-value to assign taxonomy [default: 1e-30] -R, --min_div_ratio Min divergence ratio (passed to ChimeraSlayer). If set to None uses ChimeraSlayer default value.
[default: None] -k, --keep_intermediates Keep intermediate files, useful for debugging [default: False] --suppress_usearch61_intermediates Use to suppress retention of usearch intermediate files/logs. [default: False] --suppress_usearch61_ref Use to suppress reference based chimera detection with usearch61 [default: False] --suppress_usearch61_denovo Use to suppress de novo based chimera detection with usearch61 [default: False] --split_by_sampleid Enable to split sequences by initial SampleID, requires that fasta be in demultiplexed format, e.g., >Sample.1_0, >Sample.2_1, >Sample.1_2, with the initial string before the first underscore matching SampleIDs. If not in this format, could cause unexpected errors. [default: False] --non_chimeras_retention Usearch61 only - selects subsets of sequences detected as non-chimeras to retain after de novo and reference based chimera detection. Options are intersection or union. union will retain sequences that are flagged as non-chimeric from either filter, while intersection will retain only those sequences that are flagged as non-chimeras from both detection methods. [default: union] --usearch61_minh Minimum score (h) to be classified as chimera. Increasing this value tends to reduce the number of false positives (and also sensitivity). [default: 0.28] --usearch61_xn Weight of 'no' vote. Increasing this value tends to reduce the number of false positives (and also sensitivity). Must be > 1. [default: 8.0] --usearch61_dn Pseudo-count prior for 'no' votes (n). Increasing this value tends to reduce the number of false positives (and also sensitivity). Must be > 0. [default: 1.4] --usearch61_mindiffs Minimum number of diffs in a segment. Increasing this value tends to reduce the number of false positives while reducing sensitivity to very low-divergence chimeras. Must be > 0. [default: 3] --usearch61_mindiv Minimum divergence, i.e. 100% - identity between the query and closest reference database sequence. Expressed as a percentage, so the default is 0.8%, which allows chimeras that are up to 99.2% similar to a reference sequence. This value is chosen to improve sensitivity to very low-divergence chimeras. Must be > 0. [default: 0.8] --usearch61_abundance_skew Abundance skew setting for de novo chimera detection with usearch61. Must be > 0. [default: 2.0] --percent_id_usearch61 Percent identity threshold for clustering with usearch61. [default: 0.97] --minlen Minimum length of sequence allowed for usearch61 [default: 64] --word_length Word length value for usearch61. [default: 8] --max_accepts Max_accepts value for usearch61. [default: 1] --max_rejects Max_rejects value for usearch61. [default: 8] -o, --output_fp Path to store output: an output filepath in the case of blast_fragments and ChimeraSlayer, or a directory in the case of usearch61 [default: derived from input_seqs_fp] Output: The result of identify_chimeric_seqs.py is a text file that identifies which sequences are chimeric. blast_fragments example: For each sequence provided as input, the blast_fragments method splits the input sequence into n roughly-equal-sized, non-overlapping fragments, and assigns taxonomy to each fragment against a reference database. The BlastTaxonAssigner (implemented in assign_taxonomy.py) is used for this. The taxonomies of the fragments are compared with one another (at a default depth of 4), and if contradictory assignments are returned the sequence is identified as chimeric.
For example, if an input sequence was split into 3 fragments, and the following taxon assignments were returned:
fragment1: Archaea;Euryarchaeota;Methanobacteriales;Methanobacterium
fragment2: Archaea;Euryarchaeota;Halobacteriales;uncultured
fragment3: Archaea;Euryarchaeota;Methanobacteriales;Methanobacterium
The sequence would be considered chimeric at a depth of 3 (Methanobacteriales vs. Halobacteriales), but non-chimeric at a depth of 2 (all Euryarchaeota). blast_fragments begins with the assumption that a sequence is non-chimeric, and looks for evidence to the contrary. This is important when, for example, no taxonomy assignment can be made because no blast result is returned. If a sequence is split into three fragments, and only one returns a blast hit, that sequence would be considered non-chimeric. This is because there is no evidence (i.e., contradictory blast assignments) for the sequence being chimeric. This script can be run by the following command, where the resulting data is written to the directory “identify_chimeras/” and default parameters are used (e.g. chimera detection method (“-m blast_fragments”), number of fragments (“-n 3”), taxonomy depth (“-d 4”) and maximum E-value (“-e 1e-30”)): ChimeraSlayer Example: Identify chimeric sequences using the ChimeraSlayer algorithm against a user provided reference database. The input sequences need to be provided in aligned (Py)Nast format. The reference database needs to be provided as aligned FASTA (-a). Note that the reference database needs to be the same as that used to build the alignment of the input sequences! usearch61 Example: Identify chimeric sequences using the usearch61 algorithm against a user provided reference database. The input sequences should be the demultiplexed (not clustered rep set!) sequences, such as those output from split_libraries.py. The input sequences need to be provided as unaligned fasta in the same orientation as the query sequences. identify_missing_files.py – This script checks for the existence of expected files in parallel runs. Description: This script checks for the existence of expected files in parallel runs, and is useful for checking the status of a parallel run or for finding out what poller.py is waiting on in a possibly failed run. Usage: identify_missing_files.py [options] Input Arguments: [REQUIRED] -e, --expected_out_fp The list of expected output files Output: This script does not create any output files. Example: Check for the existence of files listed in expected_out_files.txt from a PyNAST alignment run, and print a warning for any that are missing. inflate_denoiser_output.py – Inflate denoiser results so they can be passed directly to OTU pickers. Description: Inflate denoiser results so they can be passed directly to pick_otus.py, parallel_pick_otus_uclust_ref.py, or pick_de_novo_otus.py. Note that the results of this script have not been abundance sorted, so they must be sorted before being passed to the OTU picker. The uclust OTU pickers incorporate this abundance presorting by default. The inflation process writes each centroid sequence n times, where n is the number of reads that cluster to that centroid, and writes each singleton once. Flowgram identifiers are mapped back to post-split_libraries identifiers in this process (i.e., identifiers in fasta fps).
Usage: inflate_denoiser_output.py [options] Input Arguments: [REQUIRED] -c, --centroid_fps The centroid fasta filepaths -s, --singleton_fps The singleton fasta filepaths -f, --fasta_fps The input (to denoiser) fasta filepaths -d, --denoiser_map_fps The denoiser map filepaths -o, --output_fasta_fp The output fasta filepath Output: Inflate the results of a single denoiser run. Inflate the results of multiple denoiser runs to a single inflated_seqs.fna file. insert_seqs_into_tree.py – Tree Insertion Description: This script takes a set of aligned sequences (query) either in the same file as the aligned reference set or separated (depending on method), along with a starting tree, and produces a new tree containing the query sequences. This script requires that the user is running RAxML v7.3.0, the PPlacer git repository version and ParsInsert 1.0.4. Usage: insert_seqs_into_tree.py [options] Input Arguments: [REQUIRED] -i, --input_fasta_fp Path to the input fasta file -o, --output_dir Path to the output directory -t, --starting_tree_fp Starting tree which you would like to insert into. -r, --refseq_fp Filepath for reference alignment [OPTIONAL] -m, --insertion_method Method for aligning sequences. Valid choices are: pplacer, raxml_v730, parsinsert [default: raxml_v730] -s, --stats_fp Stats file produced by tree-building software. REQUIRED if -m pplacer [default: None] -p, --method_params_fp Parameters file containing method-specific parameters to use. [default: None] Output: The result of this script is a tree file (in Newick format) along with a log file containing the output from the underlying tool used for tree insertion. RAxML Example (default): If you just want to use the default options, you can supply an alignment file where the query and reference sequences are included, along with a starting tree as follows: ParsInsert Example: If you want to insert sequences using ParsInsert, you can supply a fasta file containing query sequences (aligned to reference sequences) along with the reference alignment and a starting tree as follows: Pplacer Example: If you want to insert sequences using pplacer, you can supply a fasta file containing query sequences (aligned to reference sequences) along with the reference alignment, a starting tree and the stats file produced when building the starting tree via pplacer as follows: Parameters file: Additionally, users can supply a parameters file to change the options of the underlying tools as follows: jackknifed_beta_diversity.py – A workflow script for performing jackknifed UPGMA clustering and building jackknifed 2D and 3D PCoA plots. Description: To directly measure the robustness of individual UPGMA clusters and clusters in PCoA plots, one can perform jackknifing (repeatedly resampling a subset of the available data from each sample). Usage: jackknifed_beta_diversity.py [options] Input Arguments: [REQUIRED] -i, --otu_table_fp The input otu table in biom format [REQUIRED] -o, --output_dir The output directory [REQUIRED] -e, --seqs_per_sample Number of sequences to include in each jackknifed subset [REQUIRED] -m, --mapping_fp Path to the mapping file [REQUIRED] [OPTIONAL] -t, --tree_fp Path to the tree file [default: None; REQUIRED for phylogenetic measures] -p, --parameter_fp Path to the parameter file, which specifies changes to the default behavior. See #qiime-parameters . [if omitted, default values will be used] --master_tree Method for computing master trees in jackknife analysis.
“consensus”: consensus of trees from jackknifed otu tables. “full”: tree generated from input (unsubsampled) otu table. [default: consensus] -f, --force Force overwrite of existing output directory (note: existing files in output_dir will not be removed) [default: None] -w, --print_only Print the commands but don't call them -- useful for debugging [default: False] -a, --parallel Run in parallel where available [default: False] -O, --jobs_to_start Number of jobs to start. NOTE: you must also pass -a to run in parallel, this defines the number of jobs to be started if and only if -a is passed [default: 2] Output: This script results in several distance matrices (from beta_diversity.py), several rarefied otu tables (from multiple_rarefactions.py), several UPGMA trees (from upgma_cluster.py), a supporting file and newick tree with support values (from tree_compare.py), and 2D and 3D PCoA plots. Example: These steps are performed by the following command: compute the beta diversity distance matrix from the otu table (and tree, if applicable); build rarefied OTU tables by evenly sampling to the specified depth (-e); build a UPGMA tree from the full distance matrix; compute distance matrices for the rarefied OTU tables; build UPGMA trees from the rarefied OTU table distance matrices; build a consensus tree from the rarefied UPGMA trees; compare the rarefied OTU table distance matrix UPGMA trees to either (full or consensus) tree for jackknife support of tree nodes; perform principal coordinates analysis on the distance matrices generated from rarefied OTU tables; generate 2D and 3D PCoA plots with jackknifed support. load_remote_mapping_file.py – Downloads and saves a remote mapping file Description: This script exports, downloads, and saves a mapping file that is stored remotely. Currently, the only type of remote mapping file that is supported is a Google Spreadsheet, though other methods of remote storage may be supported in the future. For more information and examples pertaining to this script and remote mapping files in general, please refer to the accompanying tutorial. Usage: load_remote_mapping_file.py [options] Input Arguments: [REQUIRED] -k, --spreadsheet_key The spreadsheet key that will be used to identify the Google Spreadsheet to load. This is the part of the Google Spreadsheet URL that comes after 'key='. You may instead provide the entire URL and the key will be extracted from it. If you provide the entire URL, you may need to enclose it in single quotes -o, --output_fp The output filepath [OPTIONAL] -w, --worksheet_name The name of the worksheet in the Google Spreadsheet that contains the mapping file. If the worksheet name contains spaces, please include quotes around the name. [default: the first worksheet in the Google Spreadsheet will be used] Output: The script outputs a single file, which is the metadata mapping file obtained from the remote location (in QIIME-compatible format). Load mapping file from Google Spreadsheet: The following command exports and downloads a QIIME metadata mapping file from a Google Spreadsheet, using the data found in the first worksheet of the spreadsheet. Load specific worksheet: The following command exports from a worksheet named 'Fasting_Map'.
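The commands themselves were dropped from this copy; based on the options above, they would look roughly like the following, where <spreadsheet key> stands for the key (or full URL) of your Google Spreadsheet and the output filename is illustrative:
load_remote_mapping_file.py -k <spreadsheet key> -o remote_mapping_file.txt
load_remote_mapping_file.py -k <spreadsheet key> -o remote_mapping_file.txt -w Fasting_Map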
make_2d_plots.py – Make 2D PCoA Plots Description: This script generates 2D PCoA plots using the principal coordinates file generated by performing beta diversity measures of an OTU table. Usage: make_2d_plots.py [options] Input Arguments: [REQUIRED] -i, --coord_fname Input principal coordinates filepath (i.e., resulting file from principal_coordinates.py). Alternatively, a directory containing multiple principal coordinates files for jackknifed PCoA results. -m, --map_fname Input metadata mapping filepath [OPTIONAL] -b, --colorby Comma-separated list of metadata categories (column headers) to color by in the plots. The categories must match the name of a column header in the mapping file exactly. Multiple categories can be listed by comma-separating them without spaces. The user can also combine columns in the mapping file by separating the categories by “&&” without spaces. [default=color by all] -p, --prefs_path Input user-generated preferences filepath. NOTE: This is a file with a dictionary containing preferences for the analysis. [default: None] -k, --background_color Background color to use in the plots. [default: white] --ellipsoid_opacity Used only when plotting ellipsoids for jackknifed beta diversity (i.e. using a directory of coord files instead of a single coord file). The valid range is between 0-1. 0 produces completely transparent (invisible) ellipsoids and 1 produces completely opaque ellipsoids. [default=0.33] --ellipsoid_method Used only when plotting ellipsoids for jackknifed beta diversity (i.e. using a directory of coord files instead of a single coord file). Valid values are “IQR” and “sdev”. [default=IQR] --master_pcoa Used only when plotting ellipsoids for jackknifed beta diversity (i.e. using a directory of coord files instead of a single coord file). These coordinates will be the center of each ellipsoid. [default: None; arbitrarily chosen PC matrix will define the center point] --scree Generate the scree plot [default: False] -o, --output_dir Path to the output directory Output: This script generates an output folder, which contains several files. To best view the 2D plots, it is recommended that the user views the _pcoa_2D.html file. Default Example: If you just want to use the default output, you can supply the principal coordinates file (i.e., the resulting file from principal_coordinates.py), where the default coloring will be based on the SampleID as follows: Output Directory Usage: If you want to give a specific output directory (e.g. “2d_plots”), use the following code. Mapping File Usage: Additionally, the user can supply their mapping file ('-m') and a specific category to color by ('-b') or any combination of categories. When using the -b option, the user can specify the coloring for multiple mapping labels, where each mapping label is separated by a comma, for example: -b 'mapping_column1,mapping_column2'. The user can also combine mapping labels and color by the combined label that is created by inserting an '&&' between the input columns, for example: -b 'mapping_column1&&mapping_column2'. If the user wants to color by specific mapping labels, they can use the following code: Scree plot Usage: A scree plot can tell you how many axes are likely to be important and help determine how many 'real' underlying gradients there might be in your data as well as their relative 'strength'. If you want to generate a scree plot, use the following code. Color by all categories: If the user would like to color all categories in their metadata mapping file, they should not pass -b. Color by all is the default behavior.
Prefs File: The user can supply a prefs file to color by, as follows: Jackknifed Principal Coordinates (w/ confidence intervals): If you have created jackknifed PCoA files, you can pass the folder containing those files, instead of a single file. The user can also specify the opacity of the ellipses around each point ('--ellipsoid_opacity'), which is a value from 0-1. Currently there are two metrics ('--ellipsoid_method') that can be used for generating the ellipsoids, which are 'IQR' and 'sdev'. The user can specify all of these options as follows: make_3d_plots.py – Make 3D PCoA plots Description: This script automates the construction of 3D plots (kinemage format) from the PCoA output file generated by principal_coordinates.py (e.g. P1 vs. P2 vs. P3, P2 vs. P3 vs. P4, etc., where P1 is the first component). Usage: make_3d_plots.py [options] Input Arguments: [REQUIRED] -i, --coord_fname Input principal coordinates filepath (i.e., resulting file from principal_coordinates.py). Alternatively, a directory containing multiple principal coordinates files for jackknifed PCoA results. -m, --map_fname Input metadata mapping filepath [OPTIONAL] -b, --colorby Comma-separated list of metadata categories (column headers) to color by in the plots. The categories must match the name of a column header in the mapping file exactly. Multiple categories can be listed by comma-separating them without spaces. The user can also combine columns in the mapping file by separating the categories by “&&” without spaces. [default=color by all] -s, --scaling_method Comma-separated list of scaling methods (i.e. scaled or unscaled) [default=unscaled] -a, --custom_axes This is the category from the metadata mapping file to use as a custom axis in the plot. For instance, if there is a pH category and you would like to see the samples plotted on that axis instead of PC1, PC2, etc., one can use this option. It is also useful for plotting time-series data. Note: if there is any non-numeric data in the column, it will not be plotted [default: None] -p, --prefs_path Input user-generated preferences filepath. NOTE: This is a file with a dictionary containing preferences for the analysis. [default: None] -k, --background_color Background color to use in the plots. [default: black] --ellipsoid_smoothness Used only when plotting ellipsoids for jackknifed beta diversity (i.e. using a directory of coord files instead of a single coord file). Valid choices are 0-3. A value of 0 produces very coarse “ellipsoids” but is fast to render. If you encounter a memory error when generating or displaying the plots, try including just one metadata column in your plot. If you still have trouble, reduce the smoothness level to 0. [default: 1] --ellipsoid_opacity Used only when plotting ellipsoids for jackknifed beta diversity (i.e. using a directory of coord files instead of a single coord file). The valid range is between 0-1. 0 produces completely transparent (invisible) ellipsoids and 1 produces completely opaque ellipsoids. [default=0.33] --ellipsoid_method Used only when plotting ellipsoids for jackknifed beta diversity (i.e. using a directory of coord files instead of a single coord file). Valid values are “IQR” and “sdev”. [default=IQR] --master_pcoa Used only when plotting ellipsoids for jackknifed beta diversity (i.e. using a directory of coord files instead of a single coord file). These coordinates will be the center of each ellipsoid.
[default: None; arbitrarily chosen PC matrix will define the center point] -t, --taxa_fname Used only when generating BiPlots. Input summarized taxa filepath (i.e., from summarize_taxa.py). Taxa will be plotted with the samples. [default=None] --n_taxa_keep Used only when generating BiPlots. This is the number of taxa to display. Use -1 to display all. [default: 10] --biplot_output_file Used only when generating BiPlots. Output coordinates filepath when generating a biplot. [default: None] --output_format Output format. If this option is set to invue you will need to also use the option -b to define which column(s) from the metadata file the script should use when writing an output file. [default: king] -n, --interpolation_points Used only when generating inVUE plots. Number of points between samples for interpolation. [default: 0] --polyhedron_points Used only when generating inVUE plots. The number of points to be generated when creating a frame around the PCoA plots. [default: 4] --polyhedron_offset Used only when generating inVUE plots. The offset to be added to each point created when using the --polyhedron_points option. This is only used when using the invue output_format. [default: 1.5] --add_vectors Create vectors based on a column of the mapping file. This parameter accepts up to 2 columns: (1) create the vectors, (2) sort them. If you wanted to group by Species and order by SampleID you would pass --add_vectors=Species, but if you wanted to group by Species and order by DOB you would pass --add_vectors=Species,DOB; this is useful when you use the --custom_axes param [default: None] --vectors_algorithm The algorithm used to create the vectors. The method used can be RMS (either using 'avg' or 'trajectory'); or the first difference (using 'diff'); or 'wdiff' for a modified first difference algorithm (see --window_size). The aforementioned methods use all the dimensions and weight them using their percentage explained; they return the norm of the created vectors and their confidence using ANOVA. The vectors are created as follows: for 'avg' it calculates the average at each timepoint (averaging within a group), then calculates the norm of each point; for 'trajectory' it calculates the norm for the 1st-2nd, 2nd-3rd, etc.; for 'diff' it calculates the norm for all the time-points and then calculates the first difference for each resulting point; for 'wdiff' it uses the same procedure as the previous method but the subtraction will be between the mean of the next number of elements specified in --window_size and the current element. Both methods ('wdiff' and 'diff') will also include the mean and the standard deviation of the calculations [default: None] --vectors_axes The number of axes to take into account when doing the vector-specific calculations. We suggest using 3 because those are the ones being displayed in the plots, but you could use any number between 1 and the number of samples - 1. To use all of them pass 0. [default: 3] --vectors_path Name of the file to save the first difference, or the root mean square (RMS), of the vectors grouped by the column used with the --add_vectors option. Note that this option only works with --add_vectors. The file is going to be created inside the output_dir and its name will start with the word 'Vectors'. [default: vectors_output.txt] -w, --weight_by_vector Use -w when you want the output created in the --vectors_path to be weighted by the space between samples in the --add_vectors sorting column, i.e.
days between samples [default: False] --window_size Use --window_size when selecting the modified first difference ('wdiff') option for --vectors_algorithm. This integer determines the number of elements to be averaged per element subtraction in the resulting vector. [default: None] -o, --output_dir Path to the output directory Output: By default, the script will plot the first three dimensions in your file. Other combinations can be viewed using the “Views:Choose viewing axes” option in the KiNG viewer (Chen, Davis, & Richardson, 2009), which may require the installation of kinemage software. The first 10 components can be viewed using the “Views:Parallel coordinates” option or typing “/”. The mouse can be used to modify display parameters, to click and rotate the viewing axes, to select specific points (clicking on a point shows the sample identity in the lower left corner), or to select different analyses (upper right window). Although samples are most easily viewed in 2D, the third dimension is indicated by coloring each sample (dot/label) along a gradient corresponding to the depth along the third component (bright colors indicate points close to the viewer). Default Usage: If you just want to use the default output, you can supply the principal coordinates file (i.e., the resulting file from principal_coordinates.py) and a user-generated mapping file, where the default coloring will be based on the SampleID as follows: Mapping File Usage by Category: Additionally, the user can supply their mapping file ('-m') and a specific category to color by ('-b') or any combination of categories. When using the -b option, the user can specify the coloring for multiple mapping labels, where each mapping label is separated by a comma, for example: -b 'mapping_column1,mapping_column2'. The user can also combine mapping labels and color by the combined label that is created by inserting an '&&' between the input columns, for example: -b 'mapping_column1&&mapping_column2'. Color All Categories: If the user would like to color all categories in their metadata mapping file they should not pass -b (default is color by all categories). Prefs File Example: As an alternative, the user can supply a preferences (prefs) file, using the -p option. The prefs file allows the user to give specific samples their own columns within a given mapping column. This file also allows the user to perform a color gradient, given a specific mapping column. If the user wants to color by using the prefs file (e.g. prefs.txt), they can use the following code: Output Directory: If you want to give a specific output directory (e.g. '3d_plots'), use the following code: Background Color Example: If the user would like to color the background white they can use the '-k' option as follows: Jackknifed Principal Coordinates (w/ confidence intervals): If you have created jackknifed PCoA files, you can pass the folder containing those files, instead of a single file. The user can also specify the opacity of the ellipses around each point ('--ellipsoid_opacity'), which is a value from 0-1. Currently there are two metrics ('--ellipsoid_method') that can be used for generating the ellipsoids, which are 'IQR' and 'sdev'. The user can specify all of these options as follows: Bi-Plots: If the user would like to see which taxa are more prevalent in different areas of the PCoA plot, they can generate Bi-Plots, by passing a principal coordinates file or folder ('-i'), a mapping file ('-m'), and a summarized taxa file ('-t') from summarize_taxa.py. Bi-Plots can be combined with jackknifed principal coordinates.
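The example commands themselves are not included in this copy; for the Bi-Plots case, an invocation consistent with the options above (file names illustrative) would be:
make_3d_plots.py -i pcoa.txt -m map.txt -t summarized_taxa.txt -o 3d_biplots/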
make_bipartite_network.py – This script makes a bipartite network connecting samples to observations. It is most suitable for visualization with Cytoscape. Description: This script was created to ease the process of making bipartite networks that have appeared in high profile publications including 10.1073/pnas.1217767110 and 10.1126/science.1198719. The script will take a biom table and a mapping file and produce an edge table which connects each sample in the biom table to the observations found in that sample. It is bipartite because there are two distinct node classes – OTU and Sample. The 'OTU' node class does not have to be an operational taxonomic unit; it can be a KEGG category or metabolite etc. – anything that is an observation. The edges are weighted by the abundance of the observation in the sample to which it is connected. The output files of this script are intended to be loaded into Cytoscape. The EdgeTable should be uploaded first, and then the NodeAttrTable file can be uploaded as node attributes to control coloring, sizing, and shaping as the user desires. The overall idea behind this script is to make bipartite network creation easier. To that end, the color, size, and shape options are used to provide fields in the NetworkViz tab of Cytoscape so that nodes can be appropriately presented. Those options are passed via comma separated strings (as in the example below). The most common visualization strategy is to color sample nodes by a metadata category like timepoint or pH, color OTU nodes by one of their taxonomic levels, and to scale OTU node size by abundance. This script makes this process easy (as well as a myriad of other visualization strategies). Once the tables are created by this script they must be opened in Cytoscape. This process is described in detail in the QIIME bipartite network tutorial. The color, size, and shape options in this script default to 'NodeType'. OTU nodes have NodeType: otu, sample nodes have NodeType: sample. Thus, if you ran this script with defaults, you would only be able to change the shape, size, and color of the nodes depending on whether or not they were observations or samples. You would not be able to distinguish between two observations based on color, shape, or size. The script is flexible in that it allows you to pass any number of fields for the --{s,o}{shape,size,color} options. This will allow you to distinguish between OTU and sample nodes in a huge number of different ways. The usage examples below show some of the common use cases and what options you would pass to emulate them. There are a couple of important considerations for using this script: Note that the --md_fields option has a different meaning depending on the type of metadata in the biom table. Regardless of type, the md_fields will be the headers in the OTUNodeTable.txt. If the metadata is a dict or default dict, the md_fields will be used as keys to extract data from the biom file metadata. If the metadata is a list or a string, then the md_fields will have no intrinsic relation to the columns they head. For example, if md_fields=['k','p','c'] and the metadata contained in a given OTU was 'k__Bacteria;p__Actinobacter;c__Actino', the resulting OTUNodeTable would have k__Bacteria in the 'k' column, p__Actinobacter in the 'p' column, and c__Actino in the 'c' column. If one passed md_fields=['1.0','XYZ','Five']
then the OTUNodeTable would have columns headed by ['1.0','XYZ','Five'], but the metadata values in those columns would be the same (e.g. the '1.0' column entry would be k__Bacteria etc.). If the number of elements in the metadata for a given OTU is not equal to the number of headers provided, the script will adjust the OTU metadata. In the case where the metadata is too short, it will add 'Other' into the OTU metadata until the required length is reached. In the case where the metadata is too long it will simply remove extra entries. This means you can end up with many observations which have the value of 'Other' if you have short taxonomic strings/lists for your observations. The available fields for both sample and otu nodes are: [NodeType, Abundance]. For observation nodes the additional fields available are any fields you passed for the md_fields. For sample nodes the additional fields available are any fields found in the mapping file headers. If multiple fields are passed for a given option, they will be concatenated in the output with a '_' character. Usage: make_bipartite_network.py [options] Input Arguments: [REQUIRED] -i, --biom_fp The input file path for biom table. -m, --map_fp The input file path for mapping file. -o, --output_dir Directory that will be created for storing the results. -k, --observation_md_header_key Key to retrieve metadata (usually taxonomy) from the biom file. --md_fields Metadata fields that will be the headers of the OTUNodeTable. If the biom table has metadata dictionaries, md_fields will be the keys extracted from the biom table metadata. Passed like “kingdom,phylum,class”. [OPTIONAL] --scolors Comma-separated string specifying fields of interest for sample node coloring [default: NodeType]. --ocolors Comma-separated string specifying fields of interest for observation node coloring [default: NodeType]. --sshapes Comma-separated string specifying fields of interest for sample node shape [default: NodeType]. --oshapes Comma-separated string specifying fields of interest for observation node shape [default: NodeType]. --ssizes Comma-separated string specifying fields of interest for sample node size [default: NodeType]. --osizes Comma-separated string specifying fields of interest for observation node size [default: NodeType]. Output: The output of this script is four files: 1. EdgeTable - table with connections between samples and observations. 2. OTUNodeTable - table with observations and their associated metadata. 3. SampleNodeTable - table with samples and their associated metadata. 4. NodeAttrTable - table with the node attributes specified by the user with the given options. Create an EdgeTable and NodeAttrTable that allow you to color sample nodes with one of their metadata categories (Treatment for our example), observation nodes (in this case OTUs) by their taxonomic level (class for our example), control observation node size by their abundance, and control node shape by whether it's an observation or sample. Create an EdgeTable and NodeAttrTable that allow you to color sample nodes by a combination of their time point and diet, color observation nodes by their abundance and family, and node shape by whether the node is an observation or sample. Note that the names in the --md_fields are irrelevant as long as the field passed for --ocolors is available. The length is important, however, since there are 5 levels in our OTU table. If fewer than 5 fields were passed for --md_fields we would get an error.
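The example command lines are not reproduced in this copy. For the first case (color sample nodes by Treatment, color observation nodes by class, size observation nodes by Abundance, shape by NodeType), a plausible invocation based on the options above (file names and taxonomic field names illustrative) is:
make_bipartite_network.py -i otu_table.biom -m map.txt -o bipartite_network/ -k taxonomy --md_fields 'k,p,c,o,f' --scolors 'Treatment' --ocolors 'c' --osizes 'Abundance'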
make_bootstrapped_tree.py – Make bootstrapped tree Description: This script takes a tree and a bootstrap support file and creates a pdf, colored by bootstrap support. Usage: make_bootstrapped_tree.py [options] Input Arguments: [REQUIRED] -m, --master_tree This is the path to the master tree -s, --support This is the path to the bootstrap support file -o, --output_file This is the filename where the output should be written. Output: The result of this script is a pdf file. Example: In this example, the user supplies a tree file and a text file containing the jackknife support information, which results in a pdf file: make_distance_boxplots.py – Creates boxplots to compare distances between categories Description: This script creates boxplots that allow for the comparison between different categories found within the mapping file. The boxplots that are created compare distances within all samples of a field value, as well as between different field values. Individual within and between distances are also plotted. The script also performs two-sample t-tests for all pairs of boxplots to help determine which boxplots (distributions) are significantly different. Tip: the script tries its best to fit everything into the plot, but there are cases where plot elements may get cut off (e.g. if axis labels are extremely long), or things may appear squashed, cluttered, or too small (e.g. if there are many boxplots in one plot). Increasing the width and/or height of the plot (using --width and --height) usually fixes these problems. For more information and examples pertaining to this script, please refer to the accompanying tutorial. Usage: make_distance_boxplots.py [options] Input Arguments: [REQUIRED] -m, --mapping_fp The mapping filepath -o, --output_dir Path to the output directory -d, --distance_matrix_fp Input distance matrix filepath (i.e. the result of beta_diversity.py). WARNING: Only symmetric, hollow distance matrices may be used as input. Asymmetric distance matrices, such as those obtained by the UniFrac Gain metric (i.e. beta_diversity.py -m unifrac_g), should not be used as input -f, --fields Comma-separated list of fields to compare, where the list of fields should be in quotes (e.g. “Field1,Field2,Field3”) [OPTIONAL] -g, --imagetype Type of image to produce (i.e. png, svg, pdf) [default: pdf] --save_raw_data Store raw data used to create boxplots in tab-delimited files [default: False] --suppress_all_within Suppress plotting of “all within” boxplot [default: False] --suppress_all_between Suppress plotting of “all between” boxplot [default: False] --suppress_individual_within Suppress plotting of individual “within” boxplot(s) [default: False] --suppress_individual_between Suppress plotting of individual “between” boxplot(s) [default: False] --suppress_significance_tests Suppress performing significance tests between each pair of boxplots [default: False] -n, --num_permutations The number of Monte Carlo permutations to perform when calculating the nonparametric p-value in the significance tests. Must be an integer greater than or equal to zero. If zero, the nonparametric p-value will not be calculated and will instead be reported as “N/A”. This option has no effect if --suppress_significance_tests is supplied [default: 0] -t, --tail_type The type of tail test to compute when calculating the p-values in the significance tests.
“high” specifies a one-tailed test for values greater than the observed t statistic, while “low” specifies a one-tailed test for values less than the observed t statistic. “two-sided” specifies a two-tailed test for values greater in magnitude than the observed t statistic. This option has no effect if --suppress_significance_tests is supplied. Valid choices: low or high or two-sided [default: two-sided] --y_min The minimum y-axis value in the resulting plot. If “auto”, it is automatically calculated [default: 0] --y_max The maximum y-axis value in the resulting plot. If “auto”, it is automatically calculated [default: 1] --width Width of the output image in inches. If not provided, a “best guess” width will be used [default: auto] --height Height of the output image in inches [default: 6] --transparent Make output images transparent (useful for overlaying an image on top of a colored background) [default: False] --whisker_length Length of the whiskers as a function of the IQR. For example, if 1.5, the whiskers extend to 1.5 * IQR. Anything outside of that range is seen as an outlier [default: 1.5] --box_width Width of each box in plot units [default: 0.5] --box_color The color of the boxes. Can be any valid matplotlib color string, such as “black”, “magenta”, “blue”, etc. See the matplotlib documentation for valid color strings that may be used. Will be ignored if --color_individual_within_by_field is supplied [default: same as plot background, which is white unless --transparent is enabled] --color_individual_within_by_field Field in the mapping file to color the individual “within” boxes by. A legend will be provided to match boxplot colors to field states. A one-to-one mapping must exist between the field to be colored and the field to color by, otherwise the coloring will be ambiguous. If this option is supplied, --box_color will be ignored. If --suppress_individual_within is supplied, this option will be ignored [default: None] --sort Sort boxplots by increasing median. If no sorting is applied, boxplots will be grouped logically as follows: all within, all between, individual within, and individual between [default: False] Output: Images of the plots are written to the specified output directory (one image per field). The raw data used in the plots and the results of significance tests can optionally be written into tab-delimited files (one file per field) that are most easily viewed in a spreadsheet program such as Microsoft Excel. Compare distances between Fast and Control samples: This example will generate an image with boxplots for all within and all between distances for the field Treatment, and will also include plots for individual within (e.g. Control vs. Control, Fast vs. Fast) and individual between (e.g. Control vs. Fast). The generated plot PDF and significance testing results will be written to the output directory 'out1'. Only plot individual field value distances: This example will generate a PNG of all individual field value distances (within and between) for the Treatment field. Save raw data: This example will generate an SVG image of the boxplots and also output the plotting data to a tab-delimited file. Suppress significance tests: This example will only generate a plot and skip the significance testing step. This can be useful if you are operating on a large dataset and are not interested in performing the statistical tests (or at least not initially).
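The commands for these examples were not carried over; based on the options above, the first two cases correspond to invocations along these lines (the distance matrix and output directory names are illustrative):
make_distance_boxplots.py -m map.txt -d dm.txt -f Treatment -o out1
make_distance_boxplots.py -m map.txt -d dm.txt -f Treatment -o out2 -g png --suppress_all_within --suppress_all_between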
make_distance_comparison_plots.py – Creates plots comparing distances between sample groupings Description: This script creates plots (bar charts, scatter plots, or box plots) that allow for the comparison between samples grouped at different field states of a mapping file field. This script can work with any field in the mapping file, and it can compare any number of field states to all other field states within that field. This script may be especially useful for fields that represent a time series, because a plot can be generated showing the distances between samples at certain timepoints against all other timepoints. For example, a time field might contain the values 1, 2, 3, 4, and 5, which label samples that are from day 1, day 2, day 3, and so on. This time field can be specified when the script is run, as well as the timepoint(s) to compare to every other timepoint. For example, two comparison groups might be timepoints 1 and 2. The resulting plot would contain timepoints for days 3, 4, and 5 along the x-axis, and at each of those timepoints, the distances between day 1 and that timepoint would be plotted, as well as the distances between day 2 and the timepoint. The script also performs two-sample t-tests for all pairs of distributions to help determine which distributions are significantly different from each other. Tip: the script tries its best to fit everything into the plot, but there are cases where plot elements may get cut off (e.g. if axis labels are extremely long), or things may appear squashed, cluttered, or too small (e.g. if there are many boxplots in one plot). Increasing the width and/or height of the plot (using –width and –height) usually fixes these problems. For more information and examples pertaining to this script, please refer to the accompanying tutorial, which can be found at Usage: make_distance_comparison_plots.py [options] Input Arguments: [REQUIRED] -m, --mapping_fp The mapping filepath -o, --output_dir Path to the output directory -d, --distance_matrix_fp Input distance matrix filepath (i.e. the result of beta_diversity.py). WARNING: Only symmetric, hollow distance matrices may be used as input. Asymmetric distance matrices, such as those obtained by the UniFrac Gain metric (i.e. beta_diversity.py -m unifrac_g), should not be used as input -f, --field Field in the mapping file to make comparisons on -c, --comparison_groups Comma-separated list of field states to compare to every other field state, where the list of field states should be in quotes (e.g. “FieldState1,FieldState2,FieldState3”) [OPTIONAL] -t, --plot_type Type of plot to produce (“bar” is bar chart, “scatter” is scatter plot, and “box” is box plot) [default: bar] -g, --imagetype Type of image to produce (i.e. png, svg, pdf) [default: pdf] --save_raw_data Store raw data used to create plot in a tab-delimited file [default: False] --suppress_significance_tests Suppress performing signifance tests between each pair of distributions [default: False] -n, --num_permutations The number of Monte Carlo permutations to perform when calculating the nonparametric p-value in the significance tests. Must be an integer greater than or equal to zero. If zero, the nonparametric p-value will not be calculated and will instead be reported as “N/A”. This option has no effect if –suppress_significance_tests is supplied [default: 0] --tail_type The type of tail test to compute when calculating the p-values in the significance tests. 
“high” specifies a one-tailed test for values greater than the observed t statistic, while “low” specifies a one-tailed test for values less than the observed t statistic. “two-sided” specifies a two-tailed test for values greater in magnitude than the observed t statistic. This option has no effect if --suppress_significance_tests is supplied. Valid choices: low or high or two-sided [default: two-sided] --width Width of the output image in inches [default: 12] --height Height of the output image in inches [default: 6] --x_tick_labels_orientation Type of orientation for x-axis tick labels [default: vertical] -a, --label_type Label type (“numeric” or “categorical”). If the label type is defined as numeric, the x-axis will be scaled accordingly. Otherwise the x-values will be treated categorically and will be evenly spaced [default: categorical]. --y_min The minimum y-axis value in the resulting plot. If “auto”, it is automatically calculated [default: 0] --y_max The maximum y-axis value in the resulting plot. If “auto”, it is automatically calculated [default: 1] --transparent Make output images transparent (useful for overlaying an image on top of a colored background) [default: False] --whisker_length If --plot_type is “box”, determines the length of the whiskers as a function of the IQR. For example, if 1.5, the whiskers extend to 1.5 * IQR. Anything outside of that range is seen as an outlier. If --plot_type is not “box”, this option is ignored [default: 1.5] --error_bar_type If --plot_type is “bar”, determines the type of error bars to use. “stdv” is standard deviation and “sem” is the standard error of the mean. If --plot_type is not “bar”, this option is ignored [default: stdv] --distribution_width Width (in plot units) of each individual distribution (e.g. each bar if the plot type is a bar chart, or the width of each box if the plot type is a boxplot) [default: auto] Output: An image of the plot is written to the specified output directory. The raw data used in the plots and the results of significance tests can optionally be written into tab-delimited files that are most easily viewed in a spreadsheet program such as Microsoft Excel. Compare distances between Native and Input samples for each timepoint in the Time field: This example will generate a PDF containing a bar chart with the distances between Native samples and every other timepoint, as well as the distances between Input samples and every other timepoint. The output image will be put in the 'out1' directory. For more details about this example input data, please refer to the accompanying tutorial. make_distance_histograms.py – Make distance histograms Description: To visualize the distance between samples and/or categories in the metadata mapping file, the user can generate histograms to represent the distances between samples. This script generates an HTML file, where the user can compare the distances between samples based on the different categories associated with each sample in the metadata mapping file. Distance histograms provide a way to compare different categories and see which tend to have larger/smaller distances than others. For example, in a hand study, you may want to compare the distances between hands to the distances between individuals (with the file “hand_distances.txt” using the parameter -d hand_distances.txt). The categories are defined in the metadata mapping file (specified using the parameter -m hand_map.txt).
If you want to look at the distances between hands and individuals, choose the “Hand” field and “Individual” field (using the parameter --fields Hand,Individual (notice the fields are comma-delimited)). For each of these groups of distances a histogram is made. The output is an HTML file which is created in the “Distance_Histograms” directory (using the parameter -o Distance_Histograms to specify the output directory) where you can look at all the distance histograms individually, and compare them between each other. Usage: make_distance_histograms.py [options] Input Arguments: [REQUIRED] -d, --distance_matrix_file Input distance matrix filepath (i.e. the result of beta_diversity.py). WARNING: Only symmetric, hollow distance matrices may be used as input. Asymmetric distance matrices, such as those obtained by the UniFrac Gain metric (i.e. beta_diversity.py -m unifrac_g), should not be used as input -m, --map_fname Input metadata mapping filepath. [OPTIONAL] -p, --prefs_path Input user-generated preferences filepath. NOTE: This is a file with a dictionary containing preferences for the analysis. This dictionary must have a “Fields” key mapping to a list of desired fields. [default: None] -o, --dir_path Output directory. [default: ./] -k, --background_color Background color for use in the plots (black or white) [default: white] --monte_carlo Deprecated: pass --monte_carlo_iters > 0 to enable --suppress_html_output Suppress HTML output. [default: False] -f, --fields Comma-separated list of fields to compare, where the list of fields should be in quotes (e.g. “Field1,Field2,Field3”). Note: if this option is passed on the command-line, it will overwrite the fields in the prefs file. [default: first field in mapping file is used] --monte_carlo_iters Number of iterations to perform for Monte Carlo analysis. [default: 0; no Monte Carlo simulation performed] Output: The result of this script will be a folder containing images and/or an HTML file (with appropriate javascript files), depending on the user-defined parameters. Distance histograms example: In the following command, the user supplies a distance matrix (i.e. the resulting file from beta_diversity.py), the user-generated metadata mapping file and one category “Treatment” to generate distance histograms. Multiple categories: For comparison of multiple categories (e.g. Treatment, DOB), you can use the following command (separating each category with a comma). Suppress HTML output: By default, HTML output is automatically generated. If you would like to suppress the HTML output, you can use the following command. Preferences file: You can provide your own preferences file (prefs.txt) with the following command. If a preferences file is supplied, you do not need to supply fields on the command-line. make_fastq.py – Make FASTQ file for ERA submission from paired FASTA and QUAL files Description: The ERA currently requires a separate FASTQ file for each library, split by library id. This code takes the output from split_libraries.py and the corresponding QUAL files and produces ERA-compatible FASTQ files. Usage: make_fastq.py [options] Input Arguments: [REQUIRED] -f, --input_fasta_fp Path to the input fasta file -q, --qual Names of QUAL files, comma-delimited [OPTIONAL] -o, --result_fp Path to store results [default: .fastq] -s, --split Make separate file for each library [default: False] Output: Matches QUAL info to FASTA entries by id, and writes FASTQ output to one file or to per-library files.
The FASTQ format for each record is as follows: @seq_id [and optional description] seq as bases + [and optionally with repeat of seq_id and repeat line] qual scores as string of chr(33+qual) Example: Take input FASTA file input_fasta_filepath and QUAL file input_qual_filepath: make a separate file for each library (with the -s option; assumes that the FASTA file is the output of split_libraries.py or a similar script): make_library_id_lists.py – Make library id lists Description: Makes a list of the ids corresponding to each library represented in the input fasta file. Assumes that the libraries are the output of split_libraries.py and that they contain the 454 read id for each sequence as is standard in the split_libraries.py output. Produces a separate file for each library. Usage: make_library_id_lists.py [options] Input Arguments: [REQUIRED] -i, --input_fasta The path to a FASTA file containing input sequences [OPTIONAL] -s, --screened_rep_seqs The path to a FASTA file containing screened representative seqs [DEFAULT: None] -u, --otus The path to an OTU file mapping OTUs onto rep seqs [DEFAULT: None] -o, --outdir The base directory to save results (one file per library). -f, --field Index of space-delimited field to read id from [DEFAULT: 1] --debug Show debug output. Output: This script produces a separate file for each library. Example: Create a list containing library ids for a fasta file (seqs.fna): make_otu_heatmap.py – Make heatmap of OTU table Description: Once an OTU table has been generated, it can be visualized using a heatmap. In these heatmaps each row corresponds to an OTU, and each column corresponds to a sample. The higher the relative abundance of an OTU in a sample, the more intense the color at the corresponding position in the heatmap. By default, the OTUs (rows) will be clustered by UPGMA hierarchical clustering, and the samples (columns) will be presented in the order in which they appear in the OTU table. Alternatively, the user may pass in a tree to sort the OTUs (rows) or samples (columns), or both. For samples, the user may also pass in a mapping file. If the user passes in a mapping file and a metadata category, samples (columns in the heatmap) will be grouped by category value and subsequently clustered within each group. Usage: make_otu_heatmap.py [options] Input Arguments: [REQUIRED] -i, --otu_table_fp Path to the input OTU table (i.e., the output from make_otu_table.py) [OPTIONAL] -o, --output_dir Path to the output directory -t, --otu_tree Tree file to be used for sorting OTUs in the heatmap -m, --map_fname Metadata mapping file to be used for sorting Samples in the heatmap. -c, --category Metadata category for sorting samples. Samples will be clustered within each category level using euclidean UPGMA. -s, --sample_tree Tree file to be used for sorting samples (e.g., output from upgma_cluster.py). If both this and the sample mapping file are provided, the mapping file is ignored. --no_log_transform Data will not be log-transformed. Without this option, all zeros will be set to a small value (default is 1/2 the smallest non-zero entry). Data will be translated to be non-negative after log transform, and num_otu_hits will be set to 0. --suppress_row_clustering No UPGMA clustering of OTUs (rows) is performed. If --otu_tree is provided, this flag is ignored. --suppress_column_clustering No UPGMA clustering of Samples (columns) is performed. If --map_fname is provided, this flag is ignored.
--absolute_abundance Do not normalize samples to sum to 1. [default: False] --log_eps Small value to replace zeros for log transform. [default: 1/2 the smallest non-zero entry]. Output: The heatmap image is located in the specified output directory. It is formatted as a PDF file. Examples: Using default values: Different output directory (i.e., “otu_heatmap”): Sort the heatmap columns by the order in a mapping file, as follows: To sort the heatmap columns by Sample IDs and the heatmap rows by the order of tips in the tree, you can supply a tree as follows: Group the heatmap columns by metadata category (e.g., GENDER), then cluster within each group: make_otu_heatmap_html.py – Make heatmap of OTU table Description: Create an interactive OTU heatmap from an OTU table. This script parses the OTU count table and filters the table by counts per otu (user-specified), then converts the table into a javascript array, which can be loaded into a web application. The OTU heatmap displays raw OTU counts per sample, where the counts are colored based on the contribution of each OTU to the total OTU count present in that sample (blue: contributes low percentage of OTUs to sample; red: contributes high percentage of OTUs). This web application allows the user to filter the otu table by number of counts per otu. The user also has the ability to view the table based on taxonomy assignment. Additional features include: the ability to drag rows (up and down) by clicking and dragging on the row headers; and the ability to zoom in on parts of the heatmap by clicking on the counts within the heatmap. Usage: make_otu_heatmap_html.py [options] Input Arguments: [REQUIRED] -i, --otu_table_fp Path to the input OTU table (i.e., the output from make_otu_table.py) -o, --output_dir Path to the output directory [OPTIONAL] -n, --num_otu_hits Only include OTUs with at least this many sequences. [default: 5] -t, --tree Path to newick tree where OTUs are tips, used for sorting OTUs in the heatmap -m, --map_fname Input metadata mapping filepath, used for sorting samples in the heatmap --sample_tree Path to newick tree where samples are tips (e.g., output from upgma_cluster.py) used for sorting samples in the heatmap. If both this and the metadata mapping file are provided, the mapping file will be ignored. --log_transform Log-transform the data. All zeros will be set to a small value (default is 1/2 of the smallest non-zero entry). Data will be translated to be non-negative after log transform and the num_otu_hits will be set to 0. --log_eps Small value to replace zeros when performing log transformation. [default: 1/2 the smallest non-zero entry]. Output: The interactive heatmap is located in OUTPUT_DIR/otu_table.html where OUTPUT_DIR is specified as -o. Safari is recommended for viewing the OTU Heatmap, since the HTML table generation is much faster than in Firefox (as of this writing).
Generate an OTU heatmap: By using the default values (-n 5), you can then use the code as follows: Generate a filtered OTU heatmap: If you would like to filter the OTU table by a different number of counts per OTU (i.e., 10): Generate a sample-sorted OTU heatmap: If you would like to sort the heatmap by Sample IDs then you should supply the mapping file, as follows: Generate a sample and OTU-sorted OTU heatmap: If you would like to sort the heatmap by Sample IDs and the tips in the tree, you can supply a tree as follows: make_otu_network.py – Make an OTU network and calculate statistics Description: This script generates the OTU network files to be passed into Cytoscape, along with statistics for those networks. It uses the OTU file and the user metadata mapping file. Network-based analysis is used to display and analyze how OTUs are partitioned between samples. This is a powerful way to visually display large and highly complex datasets in such a way that similarities and differences between samples are emphasized. The visual output of this analysis is a clustering of samples according to their shared OTUs - samples that share more OTUs cluster closer together. The degree to which samples cluster is based on the number of OTUs shared between samples (when OTUs are found in more than one sample) and this is weighted according to the number of sequences within an OTU. In the network diagram, there are two kinds of “nodes” represented, OTU-nodes and sample-nodes. These are shown with symbols such as filled circles and filled squares. If an OTU is found within a sample, the two nodes are connected with a line (an “edge”). (OTUs found only in one sample are given a second, distinct OTU-node shape.) The nodes and edges can then be colored to emphasize certain aspects of the data. For instance, in the initial application of this analysis in a microbial ecology study, the gut bacteria of a variety of mammals were surveyed, and the network diagrams were colored according to the diets of the animals, which highlighted the clustering of hosts by diet category (herbivores, carnivores, omnivores). In a meta-analysis of bacterial surveys across habitat types, the networks were colored in such a way that the phylogenetic classification of the OTUs was highlighted: this revealed the dominance of shared Firmicutes in vertebrate gut samples versus a much higher diversity of phyla represented amongst the OTUs shared by environmental samples. Not just pretty pictures: the connections within the network are analyzed statistically to provide support for the clustering patterns displayed in the network. A G-test for independence is used to test whether sample-nodes within categories (such as diet group for the animal example used above) are more connected within a group than expected by chance. Each pair of samples is classified according to whether its members shared at least one OTU, and whether they share a category. Pairs are then tested for independence in these categories (this asks whether pairs that share a category are also equally likely to share an OTU). This statistical test can also provide support for an apparent lack of clustering when it appears that a parameter is not contributing to the clustering. This OTU-based approach to comparisons between samples provides a counterpoint to the tree-based PCoA graphs derived from the UniFrac analyses. In most studies, the two approaches reveal the same patterns. They can reveal different aspects of the data, however.
The network analysis can provide phylogenetic information in a visual manner, whereas PCoA-UniFrac clustering can reveal subclusters that may be obscured in the network. The PCs can be pulled out individually and regressed against other metadata; the network analysis can provide a visual display of shared versus unique OTUs. Thus, together these tools can be used to draw attention to disparate aspects of a dataset, as desired by the author. In more technical language: OTUs and samples are designated as two types of nodes in a bipartite network in which OTU-nodes are connected via edges to sample-nodes in which their sequences are found. Edge weights are defined as the number of sequences in an OTU. To cluster the OTUs and samples in the network, a stochastic spring-embedded algorithm is used, where nodes act like physical objects that repel each other, and connections act as springs with a spring constant and a resting length: the nodes are organized in a way that minimizes forces in the network. These algorithms are implemented in Cytoscape (Shannon et al., 2003). Usage: make_otu_network.py [options] Input Arguments: [REQUIRED] -i, --input_fp Name of otu table file in biom format [REQUIRED] -m, --map_fname Name of input map file [REQUIRED] -o, --output_dir Output directory for all analyses [REQUIRED] [OPTIONAL] -b, --colorby These are the categories to color by in the plots from the user-generated mapping file. The categories must match the name of a column header in the mapping file exactly and multiple categories can be listed by comma-separating them without spaces. The user can also combine columns in the mapping file by separating the categories by “&&” without spaces [default=None] -p, --prefs_path This is the user-generated preferences file. NOTE: This is a file with a dictionary containing preferences for the analysis [default: None] -k, --background_color This is the background color to use in the plots. [default: None] Output: The result of make_otu_network.py consists of a folder which contains edge and node files to be loaded into Cytoscape along with props files labeled by category, which can be used for coloring. Example: Create Cytoscape network and statistics files in a user-specified output directory. This example uses an OTU table (-i) and the metadata mapping file (-m), and the results are written to the “otu_network/” folder. make_otu_table.py – Make OTU table Description: The script make_otu_table.py tabulates the number of times an OTU is found in each sample, and adds the taxonomic predictions for each OTU in the last column if a taxonomy file is supplied. Usage: make_otu_table.py [options] Input Arguments: [REQUIRED] -i, --otu_map_fp Path to the input OTU map (i.e., the output from pick_otus.py) -o, --output_biom_fp The output otu table in biom format (recommended extension: .biom) [OPTIONAL] -t, --taxonomy Path to taxonomy assignment, containing the assignments of taxons to sequences (i.e., resulting txt file from assign_taxonomy.py) [default: None] -e, --exclude_otus_fp Path to a file listing OTU identifiers that should not be included in the OTU table (e.g., the output of identify_chimeric_seqs.py) or a fasta file where seq ids should be excluded (e.g., the failures fasta file from align_seqs.py) Output: The output of make_otu_table.py is a biom file, where the columns correspond to samples, the rows correspond to OTUs, and the values are the number of times each OTU is observed in each sample.
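As a rough illustration of the basic invocation described in the examples that follow (the input file names seqs_otus.txt and tax_assignments.txt are hypothetical; only flags documented above are used), an OTU map and a taxonomy assignment file might be combined as:
# build a biom-format OTU table with taxonomy from an OTU map
make_otu_table.py -i seqs_otus.txt -t tax_assignments.txt -o otu_table.biom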
Make OTU table: Make an OTU table from an OTU map (i.e., result from pick_otus.py) and a taxonomy assignment file (i.e., result from assign_taxonomy.py). Write the output file to otu_table.biom. Make OTU table, excluding OTU ids listed in a fasta file: Make an OTU table, excluding the sequences listed in pynast_failures.fna. Note that the file passed as -e must end with either '.fasta' or '.fna'. Make OTU table, excluding a list of OTU ids: Make an OTU table, excluding the sequences listed in chimeric_seqs.txt. make_per_library_sff.py – Make per-library sff files from ID lists Description: This script generates per-library sff files using a directory of text files, one per library, which list the read IDs to be included. The ID list files should contain one read ID per line. If a line contains multiple words (separated by whitespace), then only the first word is used. A '>' character is stripped from the beginning of the line, if present. Blank lines in the file are skipped. Usage: make_per_library_sff.py [options] Input Arguments: [REQUIRED] -i, --input_sff Input sff file (separate multiple files w/ comma) -l, --libdir Directory containing ID list text files, one per library [OPTIONAL] -p, --sfffile_path Path to sfffile binary [default: use sfffile in $PATH] --use_sfftools Use external sfffile program instead of equivalent Python routines. --debug Print debugging output to stdout [default: False] Output: This script generates sff files for each library. Example: Make per-library sff files using input.sff and a directory of libs where each file in the directory contains the id lists for each library: make_phylogeny.py – Make Phylogeny Description: Many downstream analyses require that the phylogenetic tree relating the OTUs in a study be present. The script make_phylogeny.py produces this tree from a multiple sequence alignment. Trees are constructed with a set of sequences representative of the OTUs, by default using FastTree (Price, Dehal, & Arkin, 2009). Usage: make_phylogeny.py [options] Input Arguments: [REQUIRED] -i, --input_fp Path to read input fasta alignment; only the first word in the defline will be considered [OPTIONAL] -t, --tree_method Method for tree building. Valid choices are: clearcut, clustalw, fasttree_v1, fasttree, raxml_v730, muscle [default: fasttree] -o, --result_fp Path to store result file [default: .tre] -l, --log_fp Path to store log file [default: No log file created.] -r, --root_method Method for choosing the root of the phylogenetic tree. Valid choices are: midpoint, tree_method_default [default: tree_method_default] Output: The result of make_phylogeny.py consists of a Newick-formatted tree file (.tre) and optionally a log file. The tree file can be viewed using most tree visualization tools, such as TopiaryTool, FigTree, etc. The tips of the tree are the first word from the input sequences from the fasta file, e.g.: '>101 PC.481_71 RC:1..220' is represented in the tree as '101'. Examples: A simple example of make_phylogeny.py is shown by the following command, where we use the default tree building method (fasttree) and write the file to the current working directory without a log file: Alternatively, if the user would prefer using another tree building method (i.e.
clearcut (Sheneman, Evans, & Foster, 2006)), then they could use the following command: make_prefs_file.py – Generate preferences file Description: This script generates a preferences (prefs) file, which can be passed to make_distance_histograms.py, make_2d_plots.py and make_3d_plots.py. The prefs file allows for defining the monte_carlo distance, gradient coloring of continuous values in the 2D and 3D plots, the ball size scale for all the samples, and the color of the arrow head and arrow line for the procrustes analysis. Currently there is only one color gradient: red to blue. Usage: make_prefs_file.py [options] Input Arguments: [REQUIRED] -m, --map_fname This is the metadata mapping file [default=None] -o, --output_fp The output filepath [OPTIONAL] -b, --mapping_headers_to_use Mapping fields to use in prefs file [default: ALL] -k, --background_color This is the background color to use in the plots. [default: black] -d, --monte_carlo_dists Monte carlo distance to use for each sample header [default: 10] -i, --input_taxa_file Summarized taxa file with sample counts by taxonomy (resulting file from summarize_taxa.py) -s, --ball_scale Scale factor for the size of each ball in the plots [default: 1.0] -l, --arrow_line_color Arrow line color for procrustes analysis. [default: white] -a, --arrow_head_color Arrow head color for procrustes analysis. [default: red] Output: The result of this script is a text file, containing coloring preferences to be used by make_distance_histograms.py, make_2d_plots.py and make_3d_plots.py. Examples: To make a prefs file, the user is required to pass in a user-generated mapping file using “-m” and an output filepath, using “-o”. When using the defaults, the script will use ALL categories from the mapping file, set the background to black and the monte_carlo distances to 10. If the user would like to use specified categories ('SampleID,Individual') or combinations of categories ('SampleID&&Individual'), they will need to use the -b option, where each category is comma delimited, as follows: If the user would like to change the background color for their plots, they can pass the '-k' option, where the colors black and white can be used for 3D plots and many additional colors can be used for the 2D plots, such as cyan, pink, yellow, etc.: If the user would like to change the monte_carlo distances, they can pass the '-d' option as follows: If the user would like to add a list of taxa, they can pass the '-i' option, which is the resulting taxa file from summarize_taxa.py, as follows: If the user would like to add the ball size scale they can pass the '-s' option as follows: If the user would like to add the head and line color for the arrows in the procrustes analysis plot they can pass the '-a' and '-l' options as follows: make_qiime_py_file.py – Create python file Description: This is a script which will add headers and footers to new python files and make them executable. Usage: make_qiime_py_file.py [options] Input Arguments: [REQUIRED] -o, --output_fp The output filepath [OPTIONAL] -s, --script Pass if creating a script to include option parsing framework [default: False]. -t, --test Pass if creating a unit test file to include relevant information [default: False]. -a, --author_name The script author's (probably you) name to be included in the header variables. This will typically need to be enclosed in quotes to handle spaces. [default: AUTHOR_NAME] -e, --author_email The script author's (probably you) e-mail address to be included in the header variables.
[default: AUTHOR_EMAIL] -c, --copyright The copyright information to be included in the header variables. [default: Copyright 2011, The QIIME project] Output: The result of this script is either a python script, test, or library file, depending on the input parameters. Example usage: Create a new script: Create a new test file: Create a basic file (e.g., for library code): make_qiime_rst_file.py – Make Sphinx RST file Description: This script will take a script file and convert the usage strings and options to generate a documentation .rst file. Usage: make_qiime_rst_file.py [options] Input Arguments: [REQUIRED] -i, --input_script This is the input script for which to make a .rst file -o, --output_dir Path to the output directory Output: This will output a Sphinx rst-formatted file. Example: make_rarefaction_plots.py – Generate Rarefaction Plots Description: Once the batch alpha diversity files have been collated, you may want to compare the diversity using plots. Using the results from collate_alpha.py, you can plot the samples individually and/or by category in the mapping file using this script. This script creates an html file of rarefaction plots based on the supplied collated alpha-diversity files in a folder or a comma-separated list of files, by passing the “-i” option. Be aware that this script produces many images for the interactive html pages, so you may choose to not create these pages. The user may also supply optional arguments like an image type (-g), and a resolution (-d). Usage: make_rarefaction_plots.py [options] Input Arguments: [REQUIRED] -i, --input_dir Input directory containing results from collate_alpha.py. [REQUIRED] -m, --map_fname Input metadata mapping filepath. [REQUIRED] [OPTIONAL] -b, --colorby Comma-separated list of metadata categories (column headers) to color by in the plots. The categories must match the name of a column header in the mapping file exactly. Multiple categories can be listed by comma-separating them without spaces. The user can also combine columns in the mapping file by separating the categories by “&&” without spaces. [default=color by all] -p, --prefs_path Input user-generated preferences filepath. NOTE: This is a file with a dictionary containing preferences for the analysis. [default: None] -k, --background_color Background color to use in the plots [default: white] -g, --imagetype Type of image to produce (i.e. png, svg, pdf). WARNING: Some formats may not properly open in your browser! [default: png] -d, --resolution Resolution of the plot. [default: 75] -y, --ymax Maximum y-value to be used for the plots. Allows for directly comparable rarefaction plots between analyses [default: None] -w, --webpage DEPRECATED: Suppress HTML output. [default: True] -s, --suppress_html_output Suppress HTML output. [default: False] -e, --std_type Calculation to perform for generating error bars. Options are standard deviation (stddev) or standard error (stderr). [default: stddev] -o, --output_dir Path to the output directory --output_type Write the HTML output as one file, images embedded, or several. Options are file_creation, multiple files, and memory. [default: file_creation] Output: The result of this script produces a folder and within that folder there is a sub-folder containing image files. Within the main folder, there is an html file.
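A minimal sketch of the default invocation described in the examples below, assuming a hypothetical directory of collated alpha diversity files (alpha_div_collated/) and mapping file (mapping_file.txt); only flags documented above are used:
# plot collated rarefaction results, colored by all mapping file categories
make_rarefaction_plots.py -i alpha_div_collated/ -m mapping_file.txt -o rarefaction_plots/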
Default Example: To generate rarefaction plots using the default parameters, including the mapping file and one rarefaction file, you can use the following command: Specify Image Type and Resolution: Optionally, you can change the resolution ('-d') and the type of image created ('-g'), by using the following command: Use Prefs File: You can also supply a preferences file ('-p'), as follows: Set Background Color: Alternatively, you can set the plot background color ('-k'): Generate raw data without interactive webpages: The user can choose to not create an interactive webpage ('-w' option). This is for the case where the user just wants the average plots and the raw average data. make_tep.py – Makes TopiaryExplorer project file Description: This script makes a TopiaryExplorer project file (.tep) and a jnlp file with the data location preloaded. WARNING: The jnlp file relies on an absolute path; if you move the .tep file, the generated jnlp will no longer work. However, you can still open the .tep file from your normal TopiaryExplorer install. Usage: make_tep.py [options] Input Arguments: [REQUIRED] -i, --otu_table_fp Path to the input OTU table (i.e., the output from make_otu_table.py) -m, --mapping_fp The mapping filepath -t, --tree_fp Path to tree [OPTIONAL] -o, --output_dir Path to the output directory -p, --prefs_file_fp Path to prefs file -w, --web_flag Web codebase jnlp flag [default: False] -u, --url Url path for the tep file. Note: when passing this flag, it will overwrite the supplied OTU table, Mapping and Tree files. Output: The result of this script is written to a .tep file and a .jnlp file, both with the name supplied by -o. Example: Create .tep file and .jnlp file: map_reads_to_reference.py – Script for performing assignment of reads against a reference database Description: Usage: map_reads_to_reference.py [options] Input Arguments: [REQUIRED] -i, --input_seqs_filepath Path to input sequences file -r, --refseqs_fp Path to reference sequences to search against [default: None] [OPTIONAL] -m, --assignment_method Method for picking OTUs. Valid choices are: bwa-short, usearch, bwa-sw, blat, blat-nt. [default: usearch] -t, --observation_metadata_fp Path to observation metadata (e.g., taxonomy, EC, etc) [default: None] -o, --output_dir Path to store result file [default: ./_mapped/] -e, --evalue Max e-value to consider a match [default: 1e-10] -s, --min_percent_id Min percent id to consider a match [default: 0.75] --max_diff MaxDiff to consider a match (applicable for -m bwa-short) - see the aln section of “man bwa” for details [default (defined by bwa): 0.04] --queryalnfract Min percent of the query seq that must match to consider a match (usearch only) [default: 0.35] --targetalnfract Min percent of the target/reference seq that must match to consider a match (usearch only) [default: 0.0] --max_accepts Max_accepts value (usearch only) [default: 1] --max_rejects Max_rejects value (usearch only) [default: 32] Output: Run assignment with usearch using default parameters Run nucleotide versus protein BLAT using default parameters Run nucleotide versus protein BLAT using a stricter e-value threshold Run nucleotide versus nucleotide BLAT with default parameters Run assignment with bwa-short using default parameters. bwa-short is intended to be used for reads up to 200bp. WARNING: reference sequences must be dereplicated! No matches will be found to reference sequences which show up multiple times (even if their sequence identifiers are different)!
Run assignment with bwa-sw using default parameters. WARNING: reference sequences must be dereplicated! No matches will be found to reference sequences which show up multiple times (even if their sequence identifiers are different)! merge_mapping_files.py – Merge mapping files Description: This script provides a convenient interface for merging mapping files which contain data on different samples. Usage: merge_mapping_files.py [options] Input Arguments: [REQUIRED] -m, --mapping_fps The input mapping files in a comma-separated list -o, --output_fp The output mapping file to write [OPTIONAL] -n, --no_data_value Value to represent missing data (i.e., when all fields are not defined in all mapping files) [default: no_data] Output: The result of this script is a merged mapping file (tab-delimited). Example: Merge two mapping files into a new mapping file (merged_mapping.txt). In cases where a mapping field is not provided for some samples, add the value 'Data not collected'. merge_otu_maps.py – Merge OTU mapping files Description: This script merges OTU mapping files generated by denoise_wrapper.py and/or pick_otus.py. For example, if otu_map1.txt contains:
0 seq1 seq2 seq5
1 seq3 seq4
2 seq6 seq7 seq8
and otu_map2.txt contains:
110 0 2
221 1
The resulting OTU map will be:
110 seq1 seq2 seq5 seq6 seq7 seq8
221 seq3 seq4
Usage: merge_otu_maps.py [options] Input Arguments: [REQUIRED] -i, --otu_map_fps The otu map filepaths, comma-separated and ordered as the OTU pickers were run [REQUIRED] -o, --output_fp Path to write output OTU map [REQUIRED] [OPTIONAL] -f, --failures_fp Failures filepath, if applicable Output: The result of this script is an OTU mapping file. Expand an OTU map: If the seq_ids in otu_map2.txt are otu_ids in otu_map1.txt, expand the seq_ids in otu_map2.txt to be the full list of associated seq_ids from otu_map1.txt. Write the resulting otu map to otu_map.txt (-o). Expand a failures file: Some OTU pickers (e.g. uclust_ref) will generate a list of failures for sequences which could not be assigned to OTUs. If this occurs in a chained OTU picking process, the failures file will need to be expanded to include the original sequence ids. To do this, pass the failures file via -f, and the otu maps up to, but not including, the step that generated the failures file. merge_otu_tables.py – Merge two or more OTU tables into a single OTU table. Description: This script merges two or more OTU tables into a single OTU table. This is useful, for example, when you've created several reference-based OTU tables for different analyses and need to combine them for a larger analysis. Requirements: It is very important that your OTUs are consistent across the different OTU tables. For example, you cannot safely merge OTU tables from two independent de novo OTU picking runs. Finally, either all or none of the OTU tables can contain taxonomic information: you can't merge some OTU tables with taxonomic data and some without taxonomic data. Usage: merge_otu_tables.py [options] Input Arguments: [REQUIRED] -i, --input_fps The otu tables in biom format (comma-separated) -o, --output_fp The output otu table filepath Output: The result of this script is a single merged OTU table in biom format. Example: Merge two OTU tables into a single OTU table. multiple_rarefactions.py – Perform multiple subsamplings/rarefactions on an otu table Description: To perform bootstrap, jackknife, and rarefaction analyses, the otu table must be subsampled (rarefied). This script rarefies, or subsamples, OTU tables. This does not provide curves of diversity by number of sequences in a sample.
Rather, it creates a series of subsampled OTU tables by random sampling (without replacement) of the input OTU table. Samples that have fewer sequences than the requested rarefaction depth for a given output otu table are omitted from those output otu tables. The pseudo-random number generator used for rarefaction by subsampling is NumPy's default - an implementation of the Mersenne twister PRNG. Usage: multiple_rarefactions.py [options] Input Arguments: [REQUIRED] -i, --input_path Input OTU table filepath. -o, --output_path Output directory. -m, --min Minimum number of seqs/sample for rarefaction. -x, --max Maximum number of seqs/sample (inclusive) for rarefaction. -s, --step Size of each step between the min/max of seqs/sample (e.g. min, min+step... for level <= max). [OPTIONAL] -n, --num-reps The number of iterations at each step. [default: 10] --lineages_included Retain taxonomic (lineage) information for each OTU. Note: this will only work if lineage information is in the input OTU table. [default: False] -k, --keep_empty_otus Retain OTUs of all zeros, which are usually omitted from the output OTU tables. [default: False] --subsample_multinomial Subsample using subsampling with replacement [default: False] Output: The result of multiple_rarefactions.py consists of a number of biom files, which depend on the minimum/maximum number of sequences per sample, steps, and iterations. The files have the same otu table format as the input otu_table.biom, and are named in the following way: rarefaction_100_0.biom, where “100” corresponds to the sequences per sample and “0” the iteration. Generate rarefied OTU tables: Generate rarefied OTU tables beginning with 10 (-m) sequences/sample through 140 (-x) sequences per sample in steps of 10 (-s), performing 2 iterations at each sampling depth (-n). All resulting OTU tables will be written to 'rarefied_otu_tables' (-o). Any sample containing fewer sequences in the input file than the requested number of sequences per sample is removed from the output rarefied otu table. multiple_rarefactions_even_depth.py – Perform multiple rarefactions on a single otu table, at one depth of sequences/sample Description: To perform bootstrap, jackknife, and rarefaction analyses, the otu table must be subsampled (rarefied). This script rarefies, or subsamples, an OTU table. This does not provide curves of diversity by number of sequences in a sample. Rather, it creates a subsampled OTU table by random sampling (without replacement) of the input OTU table. Samples that have fewer sequences than the requested rarefaction depth are omitted from the output otu tables. The pseudo-random number generator used for rarefaction by subsampling is NumPy's default - an implementation of the Mersenne twister PRNG.
Usage: multiple_rarefactions_even_depth.py [options] Input Arguments: [REQUIRED] -i, --input_path Input otu table filepath -o, --output_path Write output rarefied otu table files to this dir (makes dir if it doesn't exist) -d, --depth Sequences per sample to subsample [OPTIONAL] -n, --num-reps Num iterations at each seqs/sample level [default: 10] --lineages_included Output rarefied otu tables will include taxonomic (lineage) information for each otu, if present in input otu table [default: False] -k, --keep_empty_otus Otus (rows) of all zeros are usually omitted from the output otu tables; with -k they will not be removed from the output files [default: False] Output: The results of this script consist of n subsampled OTU tables, written to the directory specified by -o. The files have the same otu table format as the input otu_table.biom. Note: if the output files would be empty, no files are written. Example: Subsample otu_table.biom at 100 seqs/sample (-d) 10 times (-n) and write results to files (e.g., rarefaction_100_0.biom) in 'rarefied_otu_tables/' (-o). neighbor_joining.py – Build a neighbor joining tree comparing samples Description: The input to this step is a distance matrix (i.e. resulting file from beta_diversity.py). Usage: neighbor_joining.py [options] Input Arguments: [REQUIRED] -i, --input_path Input path: directory for batch processing, filename for single file operation -o, --output_path Output path: directory for batch processing, filename for single file operation Output: The output is a newick formatted tree compatible with most standard tree viewing programs. Batch processing is also available, allowing the analysis of an entire directory of distance matrices. neighbor joining (nj) cluster (Single File): To perform nj clustering on a single distance matrix (e.g.: beta_div.txt, a result file from beta_diversity.py) use the following idiom: neighbor joining (Multiple Files): The script also functions in batch mode if a folder is supplied as input. This script operates on every file in the input directory and creates a corresponding neighbor joining tree file in the output directory, e.g.: nmds.py – Nonmetric Multidimensional Scaling (NMDS) Description: Nonmetric Multidimensional Scaling (NMDS) is commonly used to compare groups of samples based on phylogenetic or count-based distance metrics (see section on beta_diversity.py). Usage: nmds.py [options] Input Arguments: [REQUIRED] -i, --input_path Path to the input distance matrix file(s) (i.e., the output from beta_diversity.py). Is a directory for batch processing and a filename for a single file operation. -o, --output_path Output path: directory for batch processing, filename for single file operation [OPTIONAL] -d, --dimensions Number of dimensions of NMDS space [default: 3] Output: The resulting output file consists of the NMDS axes (columns) for each sample (rows). Pairs of NMDS axes can then be graphed to view the relationships between samples. The bottom of the output file contains the stress of the ordination. NMDS (Single File): For this script, the user supplies a distance matrix (i.e. resulting file from beta_diversity.py), along with the output filename (e.g. beta_div_coords.txt), as follows: NMDS (Dimensions): For this script, the user supplies a distance matrix (i.e. resulting file from beta_diversity.py), the number of dimensions of NMDS space and the output filename (e.g. beta_div_coords.txt), as follows: NMDS (Multiple Files): The script also functions in batch mode if a folder is supplied as input (e.g.
from beta_diversity.py run in batch). No other files should be present in the input folder - only the distance matrix files to be analyzed. This script operates on every distance matrix file in the input directory and creates a corresponding nmds results file in the output directory, e.g.: otu_category_significance.py – OTU significance and co-occurence analysis Description: The script otu_category_significance.py tests whether any of the OTUs in an OTU table are significantly associated with a category in the category mapping file. This code uses ANOVA, the G test of independence, Pearson correlation, or a paired t-test to find OTUs that are differentially represented across experimental treatments or measured variables. The script can also be used to measure co-occurrence. For instance it can also be used with presence/absence or abundance data for a phylogenetic group (such as that determined with quantitative PCR) to determine if any OTUs co-occur with a taxon of interest, using the ANOVA, G test of Independence, or correlation. One useful feature is to be able to run otu category significance across all taxonomic levels of an OTU table. For example, you can take your otu_table.biom and run summarize_taxa.py on it, which will create an output directory that contains your OTU table summarized at different taxonomic levels, with the file names containing L2, L3, L4 etc. to designate different taxonomic resolutions. This script allows you to “sweep” over those taxonomic levels contained within the directory by passing -i input_directory. This output will then contain an individual results file that is written for each summarized taxa table. Note for QIIME 1.6: the input directory that you pass to this script must contain biom tables. Thus, you might need to convert your summarized taxa tables that are in the classic OTU table form to biom tables using the convert_biom.py script. If you run multiple_rarefactions_even_depth.py on your OTU table, you may want to run otu_category_significance on each table and then collate the results achieved from each table. In order to do this, simply pass the directory as the input filepath, and then pass the -w collate option. See usage examples below. The statistical test to be run is designated with the -s option, and includes the following options: The G test of independence (g_test): determines whether OTU presence/absence is associated with a category (e.g. if an OTU is more or less likely to be present in samples from people with a disease vs healthy controls). ANOVA (ANOVA): determines whether OTU relative abundance is different between categories (e.g. if any OTUs are increased or decreased in relative abundance in the gut microbiota of obese versus lean individuals). Pearson correlation (correlation): determines whether OTU abundance is correlated with a continuous variable in the category mapping file. (e.g. which OTUs are positively or negatively correlated with measured pH across soil samples) The tests also include options for longitudinal data (i.e. datasets in which multiple samples are collected from a single individual or site.) The composition of microbes may differ substantially across samples for reasons that do not relate to a study treatment. For instance, a given OTU may not be in an individual or study site for historical reasons, and so cannot change as a result of a treatment. The longitudinal tests thus ignore samples from individuals in which a particular OTU has never been observed across samples. 
The category mapping file must have an “individual” column indicating which sample is from which individual or site, and a “reference_sample” column, indicating which sample is the reference sample for an individual or site (e.g. time point zero in a timeseries experiment). The longitudinal options include: Pearson correlation (longitudinal_correlation): determines whether OTU relative abundance is correlated with a continuous variable in the category mapping file while accounting for an experimental design where multiple samples are collected from the same individual or site. Uses the change in relative abundance for each sample from the reference sample (e.g. timepoint zero in a timeseries analysis) rather than the absolute relative abundances in the correlation (e.g. if the relative abundance before the treatment was 0.2, and after the treatment was 0.4, the new values for the OTU relative abundance will be 0.0 for the before sample, and 0.2 for the after, thus indicating that the OTU went up in response to the treatment.) Paired t-test (paired_T): This option is for when measurements were taken “before” and “after” a treatment. There must be exactly two measurements for each individual/site. The category mapping file must again have an individual column, indicating which sample is from which individual, and a reference_sample column that has a 1 for the before time point and a 0 for the after. With the exception of longitudinal correlation and paired_T, this script can be performed on a directory of OTU tables (for example, the output of multiple_rarefactions_even_depth.py), in addition to on a single OTU table. If the script is called on a directory, the resulting p-values are the average of the p-values observed when running a single test on each otu_table separately. It is generally a good practice to rarefy the OTU table (e.g. with single_rarefaction.py) prior to running these significance tests in order to avoid artifacts or biases from unequal sample sizes. Usage: otu_category_significance.py [options] Input Arguments: [REQUIRED] -i, --otu_table_fp Path to the otu table in biom format, or to a directory containing OTU tables -m, --category_mapping_fp Path to category mapping file -o, --output_fp Path to the output file or directory [OPTIONAL] -c, --category Name of the category over which to run the analysis -s, --test The type of statistical test to run. Options are: g_test: determines whether OTU presence/absence is associated with a category using the G test of Independence. ANOVA: determines whether OTU abundance is associated with a category. correlation: determines whether OTU abundance is correlated with a continuous variable in the category mapping file. longitudinal_correlation: determines whether OTU relative abundance is correlated with a continuous variable in the category mapping file in longitudinal study designs such as with timeseries data. paired_T: determines whether OTU relative abundance goes up or down in response to a treatment. [default: ANOVA] -f, --filter Minimum fraction of samples that must contain the OTU for the OTU to be included in the analysis. For longitudinal options, this is the fraction of individuals/sites that were not ignored because of the OTU not being observed in any of the samples from that individual/site. [default: 0.25] -t, --threshold Threshold under which to consider something absent: Only used if you have numerical data that should be converted to present or absent based on a threshold.
Should be None for categorical data or with the correlation test. default value is None -l, --otu_include_fp Path to a file with a list of OTUs to evaluate. By default evaluates all OTUs that pass the minimum sample filter. If a filepath is given here in which each OTU name one wishes to evaluate is on a separate line, will apply this additional filter -z, --reference_sample_column This column specifies the sample to which all other samples within an individual are compared. For instance, for timeseries data, it would usually be the initial timepoint before a treatment began. The reference samples should be marked with a 1, and other samples with a 0. -n, --individual_column Name of the column in the category mapping file that designates which sample is from which individual. -b, --converted_otu_table_output_fp The test options longitudinal_correlation and paired_T convert the original OTU table into one in which samples that are ignored because they are never observed in an individual are replaced with the ignore number 999999999 and the OTU counts are the change in relative abundance compared to the designated reference sample. If a filepath is given with the -b option this converted OTU table will be saved to this path. --relative_abundance Some of the statistical tests, such as Pearson correlation and ANOVA, convert the OTU counts to relative abundances prior to performing the calculations. This parameter can be set if a user wishes to disable this step. (e.g. if an OTU table has already been converted to relative abundances.) -w, --collate_results When passing in a directory of OTU tables, this parameter gives you the option of collating those resulting values. For example, if your input directory contained multiple rarefied OTU tables at the same depth, pass the -w option in order to find the average p-value for your statistical test over all rarefied tables and collate the results into one file. If your input directory contained OTU tables that contained different taxonomic levels, filtering levels, etc then do not pass the -w option so that an individual results file is created for every input OTU table. [default: False] Output: The G test results are output as tab delimited text, which can be examined in Excel. The output has the following columns: , OTU: The name of the OTU. , g_val: The raw test statistic. , g_prob: The probability that this OTU is non-randomly distributed across the categories. , Bonferroni_corrected: The probability after correction for multiple comparisons with the Bonferroni correction. In this correction, the p-value is multiplied by the number of comparisons performed (the number of OTUs remaining after applying the filter). , FDR_corrected: The probability after correction with the “false discovery rate” method. In this method, the raw p-values are ranked from low to high. Each p-value is multiplied by the number of comparisons divided by the rank. This correction is less conservative than the Bonferroni correction. The list of significant OTUs is expected to have the percent of false positives predicted by the p value. , Contingency table columns: The next columns give the information in the contingency table and will vary in number and name based on the number of categories and their names. The two numbers in brackets represent the number of samples that were observed in those categories and the number that would be expected if the OTU members were randomly distributed across samples in the different categories. 
These columns can be used to evaluate the nature of a non-random association (e.g. if that OTU is always present in a particular category or if it is never present). , Consensus lineage: The consensus lineage for that OTU will be listed in the last column if it was present in the input OTU table. The ANOVA results are output as tab delimited text that can be examined in Excel. The output has the following columns: , OTU: The name of the OTU. , prob: The raw probability from the ANOVA , Bonferroni_corrected: The probability after correction for multiple comparisons with the Bonferroni correction. In this correction, the p-value is multiplied by the number of comparisons performed (the number of OTUs remaining after applying the filter). , FDR_corrected: The probability after correction with the “false discovery rate” method. In this method, the raw p-values are ranked from low to high. Each p-value is multiplied by the number of comparisons divided by the rank. This correction is less conservative than the Bonferroni correction. The list of significant OTUs is expected to have the percent of false positives predicted by the p value. , Category Mean Columns: Contains one column for each category reporting the mean count of the OTU in that category. , Consensus lineage: The consensus lineage for that OTU will be listed in the last column if it was present in the input OTU table. The correlation and longitudinal_correlation test results are output as tab delimited text, which can be examined in Excel. The output has the following columns: , OTU: The name of the OTU. , prob: The probability that the OTU relative abundance is correlated with the category values across samples. , otu_values_y: a list of the values (relative abundance) of the OTU across the samples that were plotted on the y axis for the correlation. , cat_values_x: a list of the values of the selected category that were plotted on the x axis for the correlation. , Bonferroni_corrected: The probability after correction for multiple comparisons with the Bonferroni correction. In this correction, the p-value is multiplied by the number of comparisons performed (the number of OTUs remaining after applying the filter). , FDR_corrected: The probability after correction with the “false discovery rate” method. In this method, the raw p-values are ranked from low to high. Each p-value is multiplied by the number of comparisons divided by the rank. This correction is less conservative than the Bonferroni correction. The list of significant OTUs is expected to have the percent of false positives predicted by the p value. , r: Pearson?s r. This value ranges from -1 to +1, with -1 indicating a perfect negative correlation, +1 indicating a perfect positive correlation, and 0 indicating no relationship. , Consensus lineage: The consensus lineage for that OTU will be listed in the last column if it was present in the input OTU table. The paired_T results are output as tab delimited text that can be examined in Excel. The output has the following columns: , OTU: The name of the OTU. , prob: The raw probability from the paired T test , T stat: The raw T value , average_diff: The average difference between the before and after samples in the individuals in which the OTU was observed. , num_pairs: The number of sample pairs (individuals) in which the OTU was observed. , Bonferroni_corrected: The probability after correction for multiple comparisons with the Bonferroni correction. 
In this correction, the p-value is multiplied by the number of comparisons performed (the number of OTUs remaining after applying the filter). , FDR_corrected: The probability after correction with the "false discovery rate" method. In this method, the raw p-values are ranked from low to high. Each p-value is multiplied by the number of comparisons divided by the rank. This correction is less conservative than the Bonferroni correction. The list of significant OTUs is expected to have the percent of false positives predicted by the p value. , Consensus lineage: The consensus lineage for that OTU will be listed in the last column if it was present in the input OTU table. G-test: Perform a G test on otu_table.biom testing OTUs for differences in the abundance across the category "Treatment": ANOVA: Perform an ANOVA on otu_table.biom testing OTUs for differences in the abundance across the category "Treatment": ANOVA on multiple OTU tables and collate results: Perform an ANOVA on all OTU tables in rarefied_otu_tables testing OTUs for differences in the abundance across the category "Treatment" and collate the results into one file: ANOVA on multiple OTU tables and write out separate results files: Perform an ANOVA on all OTU tables in rarefied_otu_tables testing OTUs for differences in the abundance across the category "Treatment" and produce a results file for each OTU table: parallel_align_seqs_pynast.py – Parallel sequence alignment using PyNAST Description: A wrapper for the align_seqs.py PyNAST option, intended to make use of multicore/multiprocessor environments to perform analyses in parallel. Usage: parallel_align_seqs_pynast.py [options] Input Arguments: [REQUIRED] -i, --input_fasta_fp Path to the input fasta file -o, --output_dir Path to the output directory [OPTIONAL] -a, --pairwise_alignment_method Method to use for pairwise alignments [default: uclust] -d, --blast_db Database to blast against [default: created on-the-fly from template_alignment] -e, --min_length Minimum sequence length to include in alignment [default: 75% of the median input sequence length] -p, --min_percent_id Minimum percent sequence identity to closest blast hit to include sequence in alignment [default: 75.0] -O, --jobs_to_start Number of jobs to start [default: 2] -R, --retain_temp_files Retain temporary files after runs complete (useful for debugging) [default: False] -S, --suppress_submit_jobs Only split input and write commands file - don't submit jobs [default: False] -T, --poll_directly Poll directly for job completion rather than running poller as a separate job. If -T is specified this script will not return until all jobs have completed. [default: False] -U, --cluster_jobs_fp Path to cluster jobs script (defined in qiime_config) [default: start_parallel_jobs.py] -W, --suppress_polling Suppress polling of jobs and merging of results upon completion [default: False] -X, --job_prefix Job prefix [default: descriptive prefix + random chars] -Z, --seconds_to_sleep Number of seconds to sleep between checks for run completion when polling runs [default: 1] -t, --template_fp Filepath for the template alignment [default: /Users/caporaso/data/greengenes_core_sets/core_set_aligned_imputed.fasta_11_8_07.no_dots] Output: This results in a multiple sequence alignment (FASTA-formatted). Example: Align the input file (-i) against the template alignment (-t) using PyNAST and write the output (-o) to $PWD/pynast_aligned_seqs/.
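A minimal invocation of that example might look like the following (a sketch only; the input file name inseqs.fasta is illustrative, and -t can be added to point at a specific template alignment):
parallel_align_seqs_pynast.py -i $PWD/inseqs.fasta -o $PWD/pynast_aligned_seqs/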
ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). parallel_alpha_diversity.py – Parallel alpha diversity Description: This script performs like the alpha_diversity.py script, but is intended to make use of multicore/multiprocessor environments to perform analyses in parallel. Usage: parallel_alpha_diversity.py [options] Input Arguments: [REQUIRED] -i, --input_path Input path, must be directory [REQUIRED] -o, --output_path Output path, must be directory [REQUIRED] [OPTIONAL] -t, --tree_path Path to newick tree file, required for phylogenetic metrics [default: None] -m, --metrics Metrics to use, comma delimited -R, --retain_temp_files Retain temporary files after runs complete (useful for debugging) [default: False] -S, --suppress_submit_jobs Only split input and write commands file - don?t submit jobs [default: False] -T, --poll_directly Poll directly for job completion rather than running poller as a separate job. If -T is specified this script will not return until all jobs have completed. [default: False] -U, --cluster_jobs_fp Path to cluster jobs script (defined in qiime_config) [default: start_parallel_jobs.py] -W, --suppress_polling Suppress polling of jobs and merging of results upon completion [default: False] -X, --job_prefix Job prefix [default: descriptive prefix + random chars] -Z, --seconds_to_sleep Number of seconds to sleep between checks for run completion when polling runs [default: 1] -O, --jobs_to_start Number of jobs to start [default: 2] Output: The resulting output will be the same number of files as supplied by the user. The resulting files are tab-delimited text files, where the columns correspond to alpha diversity metrics and the rows correspond to samples and their calculated diversity measurements. Example: Apply the observed_species, chao1, PD_whole_tree metrics (-m) to all otu tables in rarefied_otu_tables/ (-i) and write the resulting output files to adiv/ (-o, will be created if it doesn?t exist). Use the rep_set.tre (-t) to compute phylogenetic diversity metrics. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). parallel_assign_taxonomy_blast.py – Parallel taxonomy assignment using BLAST Description: This script performs like the assign_taxonomy.py script, but is intended to make use of multicore/multiprocessor environments to perform analyses in parallel. Usage: parallel_assign_taxonomy_blast.py [options] Input Arguments: [REQUIRED] -i, --input_fasta_fp Full path to input_fasta_fp [REQUIRED] -o, --output_dir Full path to store output files [REQUIRED] [OPTIONAL] -r, --reference_seqs_fp Ref seqs to blast against. Must provide either –blast_db or –reference_seqs_db for assignment with blast [default: /Users/caporaso/data/gg_12_10_otus/rep_set/97_otus.fasta] -b, --blast_db Database to blast against. 
Must provide either –blast_db or –reference_seqs_db for assignment with blast [default: None] -e, --e_value Maximum e-value to record an assignment, only used for blast method [default: 0.001] -B, --blastmat_dir Full path to directory containing blastmat file [default: /Applications/blast-2.2.22/data/] -O, --jobs_to_start Number of jobs to start [default: 2] -R, --retain_temp_files Retain temporary files after runs complete (useful for debugging) [default: False] -S, --suppress_submit_jobs Only split input and write commands file - don?t submit jobs [default: False] -T, --poll_directly Poll directly for job completion rather than running poller as a separate job. If -T is specified this script will not return until all jobs have completed. [default: False] -U, --cluster_jobs_fp Path to cluster jobs script (defined in qiime_config) [default: start_parallel_jobs.py] -W, --suppress_polling Suppress polling of jobs and merging of results upon completion [default: False] -X, --job_prefix Job prefix [default: descriptive prefix + random chars] -Z, --seconds_to_sleep Number of seconds to sleep between checks for run completion when polling runs [default: 1] -t, --id_to_taxonomy_fp Full path to id_to_taxonomy mapping file [default: /Users/caporaso/data/gg_12_10_otus/taxonomy/97_otu_taxonomy.txt] Output: Mapping of sequence identifiers to taxonomy and quality scores. Example: Assign taxonomy to all sequences in the input file (-i) using BLAST with the id to taxonomy mapping file (-t) and reference sequences file (-r), and write the results (-o) to $PWD/blast_assigned_taxonomy/. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). parallel_assign_taxonomy_rdp.py – Parallel taxonomy assignment using RDP Description: This script performs like the assign_taxonomy.py script, but is intended to make use of multicore/multiprocessor environments to perform analyses in parallel. Usage: parallel_assign_taxonomy_rdp.py [options] Input Arguments: [REQUIRED] -i, --input_fasta_fp Full path to input_fasta_fp [REQUIRED] -o, --output_dir Path to store output files [REQUIRED] [OPTIONAL] --rdp_classifier_fp Full path to rdp classifier jar file [default: /Applications/rdp_classifier_2.2/rdp_classifier-2.2.jar] -c, --confidence Minimum confidence to record an assignment [default: 0.8] -t, --id_to_taxonomy_fp Full path to id_to_taxonomy mapping file [default: /Users/caporaso/data/gg_12_10_otus/taxonomy/97_otu_taxonomy.txt] -r, --reference_seqs_fp Ref seqs to rdp against. [default: /Users/caporaso/data/gg_12_10_otus/rep_set/97_otus.fasta] --rdp_max_memory Maximum memory allocation, in MB, for Java virtual machine when using the rdp method. Increase for large training sets [default: 1500] -O, --jobs_to_start Number of jobs to start [default: 2] -R, --retain_temp_files Retain temporary files after runs complete (useful for debugging) [default: False] -S, --suppress_submit_jobs Only split input and write commands file - don?t submit jobs [default: False] -T, --poll_directly Poll directly for job completion rather than running poller as a separate job. If -T is specified this script will not return until all jobs have completed. 
[default: False] -U, --cluster_jobs_fp Path to cluster jobs script (defined in qiime_config) [default: start_parallel_jobs.py] -W, --suppress_polling Suppress polling of jobs and merging of results upon completion [default: False] -X, --job_prefix Job prefix [default: descriptive prefix + random chars] -Z, --seconds_to_sleep Number of seconds to sleep between checks for run completion when polling runs [default: 1] Output: Mapping of sequence identifiers to taxonomy and quality scores. Example: Assign taxonomy to all sequences in the input file (-i) using the RDP classifier and write the results (-o) to $PWD/rdp_assigned_taxonomy/. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). parallel_beta_diversity.py – Parallel beta diversity Description: This script performs like the beta_diversity.py script, but is intended to make use of multicore/multiprocessor environments to perform analyses in parallel. Usage: parallel_beta_diversity.py [options] Input Arguments: [REQUIRED] -i, --input_path Input path, must be directory [REQUIRED] -o, --output_path Output path, must be directory [REQUIRED] [OPTIONAL] -m, --metrics Beta-diversity metric(s) to use. A comma-separated list should be provided when multiple metrics are specified. [default: unweighted_unifrac,weighted_unifrac] -t, --tree_path Path to newick tree file, required for phylogenetic metrics [default: None] -R, --retain_temp_files Retain temporary files after runs complete (useful for debugging) [default: False] -S, --suppress_submit_jobs Only split input and write commands file - don't submit jobs [default: False] -T, --poll_directly Poll directly for job completion rather than running poller as a separate job. If -T is specified this script will not return until all jobs have completed. [default: False] -U, --cluster_jobs_fp Path to cluster jobs script (defined in qiime_config) [default: start_parallel_jobs.py] -W, --suppress_polling Suppress polling of jobs and merging of results upon completion [default: False] -X, --job_prefix Job prefix [default: descriptive prefix + random chars] -Z, --seconds_to_sleep Number of seconds to sleep between checks for run completion when polling runs [default: 1] -O, --jobs_to_start Number of jobs to start [default: 2] -f, --full_tree By default, each job calls _fast_unifrac_setup to remove unused parts of the tree. Pass -f if you already have a minimal tree, and this script will run faster Output: The output of parallel_beta_diversity.py is a folder containing text files, each a distance matrix between samples. Apply beta_diversity.py in parallel to multiple otu tables: Apply the unweighted_unifrac and weighted_unifrac metrics (modify with -m) to all otu tables in rarefied_otu_tables (-i) and write the resulting output files to bdiv/ (-o, will be created if it doesn't exist). Use the rep_set.tre (-t) to compute phylogenetic diversity metrics. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). Apply beta_diversity.py in parallel to a single otu table: parallel_blast.py – Parallel BLAST Description: This script performs BLAST while making use of multicore/multiprocessor environments to perform analyses in parallel.
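For orientation, a typical invocation (a sketch corresponding to the Example described at the end of this entry; inseqs.fasta and refseqs.fasta are the illustrative file names used there) might be:
parallel_blast.py -i $PWD/inseqs.fasta -r $PWD/refseqs.fasta -o $PWD/blast_out/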
Usage: parallel_blast.py [options] Input Arguments: [REQUIRED] -i, --infile_path Path of sequences to use as queries [REQUIRED] -o, --output_dir Name of output directory for blast jobs [REQUIRED] [OPTIONAL] -c, --disable_low_complexity_filter Disable filtering of low-complexity sequences (i.e., -F F is passed to blast) [default: False] -e, --e_value E-value threshold for blasts [default: 1e-30] -n, --num_hits Number of hits per query for blast results [default: 1] -w, --word_size Word size for blast searches [default: 30] -a, --blastmat_dir Full path to directory containing blastmat file [default: /Applications/blast-2.2.22/data/] -r, --refseqs_path Path to fasta sequences to search against. Required if -b is not provided. -b, --blast_db Name of pre-formatted BLAST database. Required if -r is not provided. -O, --jobs_to_start Number of jobs to start [default: 2] -R, --retain_temp_files Retain temporary files after runs complete (useful for debugging) [default: False] -S, --suppress_submit_jobs Only split input and write commands file - don?t submit jobs [default: False] -T, --poll_directly Poll directly for job completion rather than running poller as a separate job. If -T is specified this script will not return until all jobs have completed. [default: False] -U, --cluster_jobs_fp Path to cluster jobs script (defined in qiime_config) [default: start_parallel_jobs.py] -W, --suppress_polling Suppress polling of jobs and merging of results upon completion [default: False] -X, --job_prefix Job prefix [default: descriptive prefix + random chars] -Z, --seconds_to_sleep Number of seconds to sleep between checks for run completion when polling runs [default: 1] Output: Example: BLAST $PWD/inseqs.fasta (-i) against a blast database created from $PWD/refseqs.fasta (-r). Store the results in $PWD/blast_out/ (-o). ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). parallel_identify_chimeric_seqs.py – Parallel chimera detection Description: This script works like the identify_chimeric_seqs.py script, but is intended to make use of multicore/multiprocessor environments to perform analyses in parallel. parallel_identify_chimeric_seqs.py [options] Usage: Input Arguments: [REQUIRED] -i, --input_fasta_fp Path to the input fasta file [OPTIONAL] -a, --aligned_reference_seqs_fp Path to (Py)Nast aligned reference sequences. REQUIRED when method ChimeraSlayer [default: /Users/caporaso/data/greengenes_core_sets/core_set_aligned_imputed.fasta_11_8_07.no_dots] -t, --id_to_taxonomy_fp Path to tab-delimited file mapping sequences to assigned taxonomy. Each assigned taxonomy is provided as a comma-separated list. [default: None; REQUIRED when method is blast_fragments] -r, --reference_seqs_fp Path to reference sequences (used to build a blast db when method blast_fragments). [default: None; REQUIRED when method blast_fragments if no blast_db is provided;] -b, --blast_db Database to blast against. Must provide either –blast_db or –reference_seqs_fp when method is blast_fragments [default: None] -m, --chimera_detection_method Chimera detection method. Choices: blast_fragments or ChimeraSlayer. 
[default:ChimeraSlayer] -n, --num_fragments Number of fragments to split sequences into (i.e., number of expected breakpoints + 1) [default: 3] -d, --taxonomy_depth Number of taxonomic divisions to consider when comparing taxonomy assignments [default: 4] -e, --max_e_value Max e-value to assign taxonomy [default: 1e-30] --min_div_ratio Min divergence ratio (passed to ChimeraSlayer). If set to None uses ChimeraSlayer default value. [default: None] -o, --output_fp Path to store output [default: derived from input_seqs_fp] -O, --jobs_to_start Number of jobs to start [default: 2] -R, --retain_temp_files Retain temporary files after runs complete (useful for debugging) [default: False] -S, --suppress_submit_jobs Only split input and write commands file - don?t submit jobs [default: False] -T, --poll_directly Poll directly for job completion rather than running poller as a separate job. If -T is specified this script will not return until all jobs have completed. [default: False] -U, --cluster_jobs_fp Path to cluster jobs script (defined in qiime_config) [default: start_parallel_jobs.py] -W, --suppress_polling Suppress polling of jobs and merging of results upon completion [default: False] -X, --job_prefix Job prefix [default: descriptive prefix + random chars] -Z, --seconds_to_sleep Number of seconds to sleep between checks for run completion when polling runs [default: 1] Output: The result of parallel_identify_chimeric_seqs.py is a text file that identifies which sequences are chimeric. blast_fragments example: For each sequence provided as input, the blast_fragments method splits the input sequence into n roughly-equal-sized, non-overlapping fragments, and assigns taxonomy to each fragment against a reference database. The BlastTaxonAssigner (implemented in assign_taxonomy.py) is used for this. The taxonomies of the fragments are compared with one another (at a default depth of 4), and if contradictory assignments are returned the sequence is identified as chimeric. For example, if an input sequence was split into 3 fragments, and the following taxon assignments were returned: fragment1: Archaea;Euryarchaeota;Methanobacteriales;Methanobacterium fragment2: Archaea;Euryarchaeota;Halobacteriales;uncultured fragment3: Archaea;Euryarchaeota;Methanobacteriales;Methanobacterium The sequence would be considered chimeric at a depth of 3 (Methanobacteriales vs. Halobacteriales), but non-chimeric at a depth of 2 (all Euryarchaeota). blast_fragments begins with the assumption that a sequence is non-chimeric, and looks for evidence to the contrary. This is important when, for example, no taxonomy assignment can be made because no blast result is returned. If a sequence is split into three fragments, and only one returns a blast hit, that sequence would be considered non-chimeric. This is because there is no evidence (i.e., contradictory blast assignments) for the sequence being chimeric. This script can be run by the following command, where the resulting data is written to $PWD/blast_fragments_chimeric_seqs.txt and using default parameters (i.e., number of fragments (“-n 3”), taxonomy depth (“-d 4”) and maximum E-value (“-e 1e-30”)). ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). ChimeraSlayer Example: Identify chimeric sequences using the ChimeraSlayer algorithm against a user provided reference database. 
The input sequences need to be provided in aligned (Py)Nast format and the reference database needs to be provided as aligned FASTA (-a). Note that the reference database needs to be the same that was used to build the alignment of the input sequences! ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). parallel_map_reads_to_reference.py – Description: Usage: parallel_map_reads_to_reference.py [options] Input Arguments: [REQUIRED] -i, --input_seqs_filepath Path to input sequences file -o, --output_dir Directory to store results -r, --refseqs_fp Path to reference sequences [OPTIONAL] -t, --observation_metadata_fp Path to observation metadata (e.g., taxonomy, EC, etc) [default: None] -m, --assignment_method Method for picking OTUs. Valid choices are: usearch blat bwa-short. [default: usearch] -e, --evalue Max e-value to consider a match [default: 1e-10] -s, --min_percent_id Min percent id to consider a match [default: 0.75] --max_diff MaxDiff to consider a match (applicable for -m bwa) – see the aln section of “man bwa” for details [default (defined by bwa): 0.04] --queryalnfract Min percent of the query seq that must match to consider a match (usearch only) [default: 0.35] --targetalnfract Min percent of the target/reference seq that must match to consider a match (usearch only) [default: 0.0] --max_accepts Max_accepts value (usearch only) [default: 1] --max_rejects Max_rejects value to (usearch only) [default: 32] -O, --jobs_to_start Number of jobs to start [default: 2] -R, --retain_temp_files Retain temporary files after runs complete (useful for debugging) [default: False] -S, --suppress_submit_jobs Only split input and write commands file - don?t submit jobs [default: False] -T, --poll_directly Poll directly for job completion rather than running poller as a separate job. If -T is specified this script will not return until all jobs have completed. [default: False] -U, --cluster_jobs_fp Path to cluster jobs script (defined in qiime_config) [default: start_parallel_jobs.py] -W, --suppress_polling Suppress polling of jobs and merging of results upon completion [default: False] -X, --job_prefix Job prefix [default: descriptive prefix + random chars] -Z, --seconds_to_sleep Number of seconds to sleep between checks for run completion when polling runs [default: 1] Output: parallel_multiple_rarefactions.py – Parallel multiple file rarefaction Description: This script performs like the multiple_rarefactions.py script, but is intended to make use of multicore/multiprocessor environments to perform analyses in parallel. Usage: parallel_multiple_rarefactions.py [options] Input Arguments: [REQUIRED] -i, --input_path Input filepath, (the otu table) [REQUIRED] -o, --output_path Write output rarefied otu tables here makes dir if it doesn?t exist [REQUIRED] -m, --min Min seqs/sample [REQUIRED] -x, --max Max seqs/sample (inclusive) [REQUIRED] [OPTIONAL] -n, --num-reps Num iterations at each seqs/sample level [default: 10] --suppress_lineages_included Exclude taxonomic (lineage) information for each OTU. -s, --step Levels: min, min+step... 
for level <= max [default: 1] --subsample_multinomial Subsample using subsampling with replacement [default: False] -R, --retain_temp_files Retain temporary files after runs complete (useful for debugging) [default: False] -S, --suppress_submit_jobs Only split input and write commands file - don't submit jobs [default: False] -T, --poll_directly Poll directly for job completion rather than running poller as a separate job. If -T is specified this script will not return until all jobs have completed. [default: False] -U, --cluster_jobs_fp Path to cluster jobs script (defined in qiime_config) [default: start_parallel_jobs.py] -W, --suppress_polling Suppress polling of jobs and merging of results upon completion [default: False] -X, --job_prefix Job prefix [default: descriptive prefix + random chars] -Z, --seconds_to_sleep Number of seconds to sleep between checks for run completion when polling runs [default: 1] -O, --jobs_to_start Number of jobs to start [default: 2] Output: The result of parallel_multiple_rarefactions.py consists of a number of files, which depend on the minimum/maximum number of sequences per sample, steps, and iterations. The files have the same otu table format as the input otu_table.biom, and are named in the following way: rarefaction_100_0.txt, where "100" corresponds to the sequences per sample and "0" to the iteration. OTU tables of different depths: Build rarefied otu tables containing 10 (-m) to 140 (-x) sequences in steps of 10 (-s) with 2 (-n) repetitions per number of sequences, from otu_table.biom (-i). Write the output files to the rarefied_otu_tables directory (-o, will be created if it doesn't exist). The name of the output files will be of the form rarefaction__.biom. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). OTU tables of the same depth: Build 8 rarefied otu tables each containing exactly 100 sequences per sample (even depth rarefaction). ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). parallel_pick_otus_blast.py – Parallel pick otus using BLAST Description: This script performs like the pick_otus.py script, but is intended to make use of multicore/multiprocessor environments to perform analyses in parallel. Usage: parallel_pick_otus_blast.py [options] Input Arguments: [REQUIRED] -i, --input_fasta_fp Full path to input_fasta_fp -o, --output_dir Path to store output files [OPTIONAL] -e, --max_e_value Max E-value [default: 1e-10] -s, --similarity Sequence similarity threshold [default: 0.97] -r, --refseqs_fp Full path to template alignment [default: None] -b, --blast_db Database to blast against [default: None] --min_aligned_percent Minimum percent of query sequence that can be aligned to consider a hit (BLAST OTU picker only) [default: 0.5] -O, --jobs_to_start Number of jobs to start [default: 2] -R, --retain_temp_files Retain temporary files after runs complete (useful for debugging) [default: False] -S, --suppress_submit_jobs Only split input and write commands file - don't submit jobs [default: False] -T, --poll_directly Poll directly for job completion rather than running poller as a separate job. If -T is specified this script will not return until all jobs have completed.
[default: False] -U, --cluster_jobs_fp Path to cluster jobs script (defined in qiime_config) [default: start_parallel_jobs.py] -W, --suppress_polling Suppress polling of jobs and merging of results upon completion [default: False] -X, --job_prefix Job prefix [default: descriptive prefix + random chars] -Z, --seconds_to_sleep Number of seconds to sleep between checks for run completion when polling runs [default: 1] Output: The output consists of two files (i.e. seqs_otus.txt and seqs_otus.log). The .txt file is composed of tab-delimited lines, where the first field on each line corresponds to an (arbitrary) cluster identifier, and the remaining fields correspond to sequence identifiers assigned to that cluster. Sequence identifiers correspond to those provided in the input FASTA file. The resulting .log file contains a list of parameters passed to this script along with the output location of the resulting .txt file. Example: Pick OTUs by blasting $PWD/inseqs.fasta against $PWD/refseqs.fasta and write the output to the $PWD/blast_otus/ directory. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). parallel_pick_otus_trie.py – Parallel pick otus using a trie Description: This script performs like the pick_otus.py script, but is intended to make use of multicore/multiprocessor environments to perform analyses in parallel. The script uses the first p bases of each read to sort all reads into separate buckets and then each bucket is processed separately. Note that in cases of amplicon sequencing we do not expect the buckets to be even sized, but rather a few buckets make up the majority of reads. Thus, not all combinations of prefix length p and number of CPUs -O make sense. Good combinations for a small desktop multicore system would be -p 5 (default) and -O 4. For larger clusters, we suggest -p 10 and -O 20. Increasing -p to a value much larger than 10 will lead to lots of temporary files and many small jobs, so likely will not speed up the OTU picking. On the other hand, the max speed-up is bounded by the size of the largest buckets, so adding more cores will not always increase efficiency. Usage: parallel_pick_otus_trie.py [options] Input Arguments: [REQUIRED] -i, --input_fasta_fp Full path to input_fasta_fp -o, --output_dir Path to store output files [OPTIONAL] -p, --prefix_length Prefix length used to split the input. Must be smaller than the shortest seq in input! [default: 5] -O, --jobs_to_start Number of jobs to start [default: 2] -R, --retain_temp_files Retain temporary files after runs complete (useful for debugging) [default: False] -S, --suppress_submit_jobs Only split input and write commands file - don't submit jobs [default: False] -T, --poll_directly Poll directly for job completion rather than running poller as a separate job. If -T is specified this script will not return until all jobs have completed. [default: False] -U, --cluster_jobs_fp Path to cluster jobs script (defined in qiime_config) [default: start_parallel_jobs.py] -W, --suppress_polling Suppress polling of jobs and merging of results upon completion [default: False] -X, --job_prefix Job prefix [default: descriptive prefix + random chars] -Z, --seconds_to_sleep Number of seconds to sleep between checks for run completion when polling runs [default: 1] Output: The output consists of two files (i.e. seqs_otus.txt and seqs_otus.log).
The .txt file is composed of tab-delimited lines, where the first field on each line corresponds to an (arbitrary) cluster identifier, and the remaining fields correspond to sequence identifiers assigned to that cluster. Sequence identifiers correspond to those provided in the input FASTA file. The resulting .log file contains a list of parameters passed to this script along with the output location of the resulting .txt file. Example: Pick OTUs by building a trie out of $PWD/inseqs.fasta and write the output to the $PWD/trie_otus/ directory. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). Example: Pick OTUs by building a trie out of $PWD/inseqs.fasta and write the output to the $PWD/trie_otus/ directory. Split the input according to the first 10 bases of each read and process each set independently. parallel_pick_otus_uclust_ref.py – Parallel pick otus using uclust_ref Description: This script works like the pick_otus.py script, but is intended to make use of multicore/multiprocessor environments to perform analyses in parallel. Usage: parallel_pick_otus_uclust_ref.py [options] Input Arguments: [REQUIRED] -i, --input_fasta_fp Full path to input_fasta_fp -o, --output_dir Path to store output files -r, --refseqs_fp Full path to reference collection [OPTIONAL] -s, --similarity Sequence similarity threshold [default: 0.97] -z, --enable_rev_strand_match Enable reverse strand matching for uclust otu picking, will double the amount of memory used. [default: False] -A, --optimal_uclust Pass the –optimal flag to uclust for uclust otu picking. [default: False] -E, --exact_uclust Pass the –exact flag to uclust for uclust otu picking. [default: False] --max_accepts Max_accepts value to uclust and uclust_ref [default: 20] --max_rejects Max_rejects value to uclust and uclust_ref [default: 500] --stepwords Stepwords value to uclust and uclust_ref [default: 20] --word_length W value to uclust and uclust_ref [default: 12] --uclust_stable_sort Deprecated: stable sort enabled by default, pass –uclust_suppress_stable_sort to disable [default: True] --suppress_uclust_stable_sort Don?t pass –stable-sort to uclust [default: False] -d, --save_uc_files Enable preservation of intermediate uclust (.uc) files that are used to generate clusters via uclust. [default: True] --uclust_otu_id_prefix OTU identifier prefix (string) for the de novo uclust OTU picker [default: None, OTU ids are ascending integers] -O, --jobs_to_start Number of jobs to start [default: 2] -R, --retain_temp_files Retain temporary files after runs complete (useful for debugging) [default: False] -S, --suppress_submit_jobs Only split input and write commands file - don?t submit jobs [default: False] -T, --poll_directly Poll directly for job completion rather than running poller as a separate job. If -T is specified this script will not return until all jobs have completed. [default: False] -U, --cluster_jobs_fp Path to cluster jobs script (defined in qiime_config) [default: start_parallel_jobs.py] -W, --suppress_polling Suppress polling of jobs and merging of results upon completion [default: False] -X, --job_prefix Job prefix [default: descriptive prefix + random chars] -Z, --seconds_to_sleep Number of seconds to sleep between checks for run completion when polling runs [default: 1] Output: The output consists of two files (i.e. seqs_otus.txt and seqs_otus.log). 
The .txt file is composed of tab-delimited lines, where the first field on each line corresponds to an OTU identifier which is the reference sequence identifier, and the remaining fields correspond to sequence identifiers assigned to that OTU. The resulting .log file contains a list of parameters passed to this script along with the output location of the resulting .txt file. Example: Pick OTUs by searching $PWD/inseqs.fasta against $PWD/refseqs.fasta with reference-based uclust and write the output to the $PWD/blast_otus/ directory. This is a closed-reference OTU picking process. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). parallel_pick_otus_usearch61_ref.py – Parallel pick otus using usearch_ref Description: This script works like the pick_otus.py script, but is intended to make use of multicore/multiprocessor environments to perform analyses in parallel. Usage: parallel_pick_otus_usearch61_ref.py [options] Input Arguments: [REQUIRED] -i, --input_fasta_fp Full path to input_fasta_fp -o, --output_dir Path to store output files -r, --refseqs_fp Full path to reference collection [OPTIONAL] -s, --similarity Sequence similarity threshold [default: 0.97] -z, --enable_rev_strand_match Enable reverse strand matching for uclust, uclust_ref, usearch, usearch_ref, usearch61, or usearch61_ref otu picking, will double the amount of memory used. [default: False] --max_accepts Max_accepts value to uclust, uclust_ref, usearch61, and usearch61_ref. By default, will use value suggested by method (uclust: 20, usearch61: 1) [default: default] --max_rejects Max_rejects value for uclust, uclust_ref, usearch61, and usearch61_ref. With default settings, will use value recommended by clustering method used (uclust: 500, usearch61: 8 for usearch_fast_cluster option, 32 for reference and smallmem options) [default: default] --word_length Word length value for uclust, uclust_ref, and usearch, usearch_ref, usearch61, and usearch61_ref. With default setting, will use the setting recommended by the method (uclust: 12, usearch: 64, usearch61: 8). int value can be supplied to override this setting. [default: default] --minlen Minimum length of sequence allowed for usearch, usearch_ref, usearch61, and usearch61_ref. [default: 64] --usearch_fast_cluster Use fast clustering option for usearch or usearch61_ref with new clusters. –enable_rev_strand_match can not be enabled with this option, and the only valid option for usearch61_sort_method is „length?. This option uses more memory than the default option for de novo clustering. [default: False] --usearch61_sort_method Sorting method for usearch61 and usearch61_ref. Valid options are abundance, length, or None. If the –usearch_fast_cluster option is enabled, the only sorting method allowed in length. [default: abundance] --sizeorder Enable size based preference in clustering with usearch61. Requires that –usearch61_sort_method be abundance. [default: False] -O, --jobs_to_start Number of jobs to start [default: 2] -R, --retain_temp_files Retain temporary files after runs complete (useful for debugging) [default: False] -S, --suppress_submit_jobs Only split input and write commands file - don?t submit jobs [default: False] -T, --poll_directly Poll directly for job completion rather than running poller as a separate job. If -T is specified this script will not return until all jobs have completed. 
[default: False] -U, --cluster_jobs_fp Path to cluster jobs script (defined in qiime_config) [default: start_parallel_jobs.py] -W, --suppress_polling Suppress polling of jobs and merging of results upon completion [default: False] -X, --job_prefix Job prefix [default: descriptive prefix + random chars] -Z, --seconds_to_sleep Number of seconds to sleep between checks for run completion when polling runs [default: 1] Output: Example: Pick OTUs by searching $PWD/inseqs.fasta against $PWD/refseqs.fasta with reference-based usearch and write the output to the $PWD/usearch_ref_otus/ directory. This is a closed-reference OTU picking process. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). pick_closed_reference_otus.py – Closed-reference OTU picking/Shotgun UniFrac workflow. Description: This script picks OTUs using a closed reference and constructs an OTU table. Taxonomy is assigned using a pre-defined taxonomy map of reference sequence OTU to taxonomy. If full-length genomes are provided as the reference sequences, this script applies the Shotgun UniFrac method. Usage: pick_closed_reference_otus.py [options] Input Arguments: [REQUIRED] -i, --input_fp The input sequences -r, --reference_fp The reference sequences -o, --output_dir The output directory [OPTIONAL] -p, --parameter_fp Path to the parameter file, which specifies changes to the default behavior. See #qiime-parameters . [if omitted, default values will be used] -t, --taxonomy_fp The taxonomy map [default: None] -f, --force Force overwrite of existing output directory (note: existing files in output_dir will not be removed) [default: None] -w, --print_only Print the commands but don?t call them – useful for debugging [default: False] -a, --parallel Run in parallel where available [default: False] -O, --jobs_to_start Number of jobs to start. NOTE: you must also pass -a to run in parallel, this defines the number of jobs to be started if and only if -a is passed [default: 2] Output: Pick OTUs, assign taxonomy, and create an OTU table against a reference set of OTUs. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). Pick OTUs and create an OTU table against a reference set of OTUs without adding taxonomy assignments. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). Pick OTUs, assign taxonomy, and create an OTU table against a reference set of OTUs using usearch_ref. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). pick_de_novo_otus.py – A workflow for de novo OTU picking, taxonomy assignment, phylogenetic tree construction, and OTU table construction. Description: This script takes a sequence file and performs all processing steps through building the OTU table. Usage: pick_de_novo_otus.py [options] Input Arguments: [REQUIRED] -i, --input_fp The input fasta file [REQUIRED] -o, --output_dir The output directory [REQUIRED] [OPTIONAL] -p, --parameter_fp Path to the parameter file, which specifies changes to the default behavior. See #qiime-parameters . 
[if omitted, default values will be used] -f, --force Force overwrite of existing output directory (note: existing files in output_dir will not be removed) [default: None] -w, --print_only Print the commands but don't call them -- useful for debugging [default: False] -a, --parallel Run in parallel where available [default: False] -O, --jobs_to_start Number of jobs to start. NOTE: you must also pass -a to run in parallel, this defines the number of jobs to be started if and only if -a is passed [default: 2] Output: This script will produce an OTU mapping file (pick_otus.py), a representative set of sequences (FASTA file from pick_rep_set.py), a sequence alignment file (FASTA file from align_seqs.py), a taxonomy assignment file (from assign_taxonomy.py), a filtered sequence alignment (from filter_alignment.py), a phylogenetic tree (Newick file from make_phylogeny.py) and a biom-formatted OTU table (from make_otu_table.py). Simple example: The following command will start an analysis on seqs.fna (-i), which is a post-split_libraries fasta file. The sequence identifiers in this file should be of the form _. The following steps, corresponding to the preliminary data preparation, are applied: Pick de novo OTUs at 97%; pick a representative sequence for each OTU (the OTU centroid sequence); align the representative set with PyNAST; assign taxonomy with RDP classifier; filter the alignment prior to tree building - remove positions which are all gaps, and specified as 0 in the lanemask; build a phylogenetic tree with FastTree; build an OTU table. All output files will be written to the directory specified by -o, and subdirectories as appropriate. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). pick_open_reference_otus.py – Description: Usage: pick_open_reference_otus.py [options] Input Arguments: [REQUIRED] -i, --input_fps The input sequences filepath or comma-separated list of filepaths -r, --reference_fp The reference sequences -o, --output_dir The output directory [OPTIONAL] -m, --otu_picking_method The OTU picking method to use for reference and de novo steps. Passing usearch61, for example, means that usearch61 will be used for the de novo steps and usearch61_ref will be used for reference steps. [default: uclust] -p, --parameter_fp Path to the parameter file, which specifies changes to the default behavior. See #qiime-parameters . [if omitted, default values will be used] --prefilter_refseqs_fp The reference sequences to use for the prefilter, if different from the reference sequences to use for the OTU picking [default: same as passed for --reference_fp] -n, --new_ref_set_id Unique identifier for OTUs that get created in this ref set (this is useful to support combining of reference sets) [default: New] -f, --force Force overwrite of existing output directory (note: existing files in output_dir will not be removed) [default: None] -a, --parallel Run in parallel where available [default: False] -O, --jobs_to_start Number of jobs to start.
NOTE: you must also pass -a to run in parallel; this defines the number of jobs to be started if and only if -a is passed [default: 2] -s, --percent_subsample Percent of failure sequences to include in the subsample to cluster de novo (larger numbers should give more comprehensive results but will be slower) [default: 0.001] --prefilter_percent_id Sequences are pre-clustered at this percent id against the reference and any reads which fail to hit are discarded (a quality filter); pass 0.0 to disable [default: 0.60] --step1_otu_map_fp Reference OTU picking OTU map (to avoid rebuilding if one has already been built) --step1_failures_fasta_fp Reference OTU picking failures fasta filepath (to avoid rebuilding if one has already been built) --suppress_step4 Suppress the final de novo OTU picking step (may be necessary for extremely large data sets) [default: False] --min_otu_size The minimum otu size (in number of sequences) to retain the otu [default: 2] --suppress_taxonomy_assignment Skip the taxonomy assignment step, resulting in an OTU table without taxonomy [default: False] --suppress_align_and_tree Skip the sequence alignment and tree-building steps [default: False] Output: Run the subsampled open-reference OTU picking workflow on seqs1.fna using refseqs.fna as the reference collection. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). Run the subsampled open-reference OTU picking workflow on seqs1.fna using refseqs.fna as the reference collection and using usearch61 and usearch61_ref as the OTU picking methods. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). Run the subsampled open-reference OTU picking workflow in iterative mode on seqs1.fna and seqs2.fna using refseqs.fna as the initial reference collection. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). Run the subsampled open-reference OTU picking workflow in iterative mode on seqs1.fna and seqs2.fna using refseqs.fna as the initial reference collection, suppressing alignment and tree building. This is useful if you're working with marker genes that do not result in useful alignment (e.g., fungal ITS). ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). Run the subsampled open-reference OTU picking workflow in iterative mode on seqs1.fna and seqs2.fna using refseqs.fna as the initial reference collection, suppressing assignment of taxonomy. This is useful if you're working with a reference collection without associated taxonomy. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). pick_otus.py – OTU picking Description: The OTU picking step assigns similar sequences to operational taxonomic units, or OTUs, by clustering sequences based on a user-defined similarity threshold. Sequences which are similar at or above the threshold level are taken to represent the presence of a taxonomic unit (e.g., a genus, when the similarity threshold is set at 0.94) in the sequence collection. Currently, the following clustering methods have been implemented in QIIME: 1. cd-hit (Li & Godzik, 2006; Li, Jaroszewski, & Godzik, 2001), which applies a "longest-sequence-first list removal algorithm" to cluster sequences. 2.
blast (Altschul, Gish, Miller, Myers, & Lipman, 1990), which compares and clusters each sequence against a reference database of sequences. 3. Mothur (Schloss et al., 2009), which requires an input file of aligned sequences. The input file of aligned sequences may be generated from an input file like the one described below by running align_seqs.py. For the Mothur method, the clustering algorithm may be specified as nearest-neighbor, furthest-neighbor, or average-neighbor. The default algorithm is furthest-neighbor. 4. prefix/suffix [Qiime team, unpublished], which will collapse sequences which are identical in their first and/or last bases (i.e., their prefix and/or suffix). The prefix and suffix lengths are provided by the user and default to 50 each. 5. Trie [Qiime team, unpublished], which collapses identical sequences and sequences which are subsequences of other sequences. 6. uclust (Edgar, RC 2010), creates "seeds" of sequences which generate clusters based on percent identity. 7. uclust_ref (Edgar, RC 2010), as uclust, but takes a reference database to use as seeds. New clusters can be toggled on or off. 8. usearch (Edgar, RC 2010, version v5.2.236), creates "seeds" of sequences which generate clusters based on percent identity, filters low abundance clusters, performs de novo and reference based chimera detection. 9. usearch_ref (Edgar, RC 2010, version v5.2.236), as usearch, but takes a reference database to use as seeds. New clusters can be toggled on or off. The quality filtering pipeline with usearch 5.X is described as usearch_qf ("usearch quality filter"). 10. usearch61 (Edgar, RC 2010, version v6.1.544), creates "seeds" of sequences which generate clusters based on percent identity. 11. usearch61_ref (Edgar, RC 2010, version v6.1.544), as usearch61, but takes a reference database to use as seeds. New clusters can be toggled on or off. Chimera checking with usearch 6.X is implemented in identify_chimeric_seqs.py. Chimera checking should be done first with usearch 6.X, and the filtered resulting fasta file can then be clustered. The primary inputs for pick_otus.py are: 1. A FASTA file containing sequences to be clustered; 2. An OTU threshold (default is 0.97, roughly corresponding to species-level OTUs); 3. The method to be applied for clustering sequences into OTUs. pick_otus.py takes a standard fasta file as input. Usage: pick_otus.py [options] Input Arguments: [REQUIRED] -i, --input_seqs_filepath Path to input sequences file [OPTIONAL] -m, --otu_picking_method Method for picking OTUs. Valid choices are: mothur, trie, uclust_ref, usearch, usearch_ref, blast, usearch61, usearch61_ref, prefix_suffix, cdhit, uclust. The mothur method requires an input file of aligned sequences. usearch will enable the usearch quality filtering pipeline. [default: uclust] -c, --clustering_algorithm Clustering algorithm for mothur otu picking method. Valid choices are: furthest, nearest, average.
[default: furthest] -M, --max_cdhit_memory Maximum available memory to cd-hit-est (via the program?s -M option) for cdhit OTU picking method (units of Mbyte) [default: 400] -o, --output_dir Path to store result file [default: ./_picked_otus/] -r, --refseqs_fp Path to reference sequences to search against when using -m blast, -m uclust_ref, -m usearch_ref, or -m usearch61_ref [default: None] -b, --blast_db Pre-existing database to blast against when using -m blast [default: None] --min_aligned_percent Minimum percent of query sequence that can be aligned to consider a hit (BLAST OTU picker only) [default: 0.5] -s, --similarity Sequence similarity threshold (for blast, cdhit, uclust, uclust_ref, usearch, usearch_ref, usearch61, or usearch61_ref) [default: 0.97] -e, --max_e_value Max E-value when clustering with BLAST [default: 1e-10] -q, --trie_reverse_seqs Reverse seqs before picking OTUs with the Trie OTU picker for suffix (rather than prefix) collapsing [default: False] -n, --prefix_prefilter_length Prefilter data so seqs with identical first prefix_prefilter_length are automatically grouped into a single OTU. This is useful for large sequence collections where OTU picking doesn?t scale well [default: None; 100 is a good value] -t, --trie_prefilter Prefilter data so seqs which are identical prefixes of a longer seq are automatically grouped into a single OTU; useful for large sequence collections where OTU picking doesn?t scale well [default: False] -p, --prefix_length Prefix length when using the prefix_suffix otu picker; WARNING: CURRENTLY DIFFERENT FROM prefix_prefilter_length (-n)! [default: 50] -u, --suffix_length Suffix length when using the prefix_suffix otu picker [default: 50] -z, --enable_rev_strand_match Enable reverse strand matching for uclust, uclust_ref, usearch, usearch_ref, usearch61, or usearch61_ref otu picking, will double the amount of memory used. [default: False] -D, --suppress_presort_by_abundance_uclust Suppress presorting of sequences by abundance when picking OTUs with uclust or uclust_ref [default: False] -A, --optimal_uclust Pass the –optimal flag to uclust for uclust otu picking. [default: False] -E, --exact_uclust Pass the –exact flag to uclust for uclust otu picking. [default: False] -B, --user_sort Pass the –user_sort flag to uclust for uclust otu picking. [default: False] -C, --suppress_new_clusters Suppress creation of new clusters using seqs that don?t match reference when using -m uclust_ref, -m usearch61_ref, or -m usearch_ref [default: False] --max_accepts Max_accepts value to uclust, uclust_ref, usearch61, and usearch61_ref. By default, will use value suggested by method (uclust: 20, usearch61: 1) [default: default] --max_rejects Max_rejects value for uclust, uclust_ref, usearch61, and usearch61_ref. With default settings, will use value recommended by clustering method used (uclust: 500, usearch61: 8 for usearch_fast_cluster option, 32 for reference and smallmem options) [default: default] --stepwords Stepwords value to uclust and uclust_ref [default: 20] --word_length Word length value for uclust, uclust_ref, and usearch, usearch_ref, usearch61, and usearch61_ref. With default setting, will use the setting recommended by the method (uclust: 12, usearch: 64, usearch61: 8). int value can be supplied to override this setting. 
[default: default] --uclust_otu_id_prefix OTU identifier prefix (string) for the de novo uclust OTU picker and for new clusters when uclust_ref is used without -C [default: denovo, OTU ids are ascending integers] --suppress_uclust_stable_sort Don?t pass –stable-sort to uclust [default: False] --suppress_uclust_prefilter_exact_match Don?t collapse exact matches before calling uclust [default: False] -d, --save_uc_files Enable preservation of intermediate uclust (.uc) files that are used to generate clusters via uclust. Also enables preservation of all intermediate files created by usearch and usearch61. [default: True] -j, --percent_id_err Percent identity threshold for cluster error detection with usearch. [default: 0.97] -g, --minsize Minimum cluster size for size filtering with usearch. [default: 4] -a, --abundance_skew Abundance skew setting for de novo chimera detection with usearch. [default: 2.0] -f, --db_filepath Reference database of fasta sequences for reference based chimera detection with usearch. [default: None] --perc_id_blast Percent ID for mapping OTUs created by usearch back to original sequence IDs [default: 0.97] --de_novo_chimera_detection Deprecated: de novo chimera detection performed by default, pass –suppress_de_novo_chimera_detection to disable. [default: None] -k, --suppress_de_novo_chimera_detection Suppress de novo chimera detection in usearch. [default: False] --reference_chimera_detection Deprecated: Reference based chimera detection performed by default, pass –supress_reference_chimera_detection to disable [default: None] -x, --suppress_reference_chimera_detection Suppress reference based chimera detection in usearch. [default: False] --cluster_size_filtering Deprecated, cluster size filtering enabled by default, pass –suppress_cluster_size_filtering to disable. [default: None] -l, --suppress_cluster_size_filtering Suppress cluster size filtering in usearch. [default: False] --remove_usearch_logs Disable creation of logs when usearch is called. Up to nine logs are created, depending on filtering steps enabled. [default: False] --derep_fullseq Dereplication of full sequences, instead of subsequences. Faster than the default –derep_subseqs in usearch. [default: False] -F, --non_chimeras_retention Selects subsets of sequences detected as non-chimeras to retain after de novo and reference based chimera detection. Options are intersection or union. union will retain sequences that are flagged as non-chimeric from either filter, while intersection will retain only those sequences that are flagged as non-chimeras from both detection methods. [default: union] --minlen Minimum length of sequence allowed for usearch, usearch_ref, usearch61, and usearch61_ref. [default: 64] --usearch_fast_cluster Use fast clustering option for usearch or usearch61_ref with new clusters. –enable_rev_strand_match can not be enabled with this option, and the only valid option for usearch61_sort_method is „length?. This option uses more memory than the default option for de novo clustering. [default: False] --usearch61_sort_method Sorting method for usearch61 and usearch61_ref. Valid options are abundance, length, or None. If the –usearch_fast_cluster option is enabled, the only sorting method allowed in length. [default: abundance] --sizeorder Enable size based preference in clustering with usearch61. Requires that –usearch61_sort_method be abundance. [default: False] Output: The output consists of two files (i.e. seqs_otus.txt and seqs_otus.log). 
The .txt file is composed of tab-delimited lines, where the first field on each line corresponds to an (arbitrary) cluster identifier, and the remaining fields correspond to sequence identifiers assigned to that cluster. Sequence identifiers correspond to those provided in the input FASTA file. Usearch (i.e. usearch quality filter) can additionally have log files for each intermediate call to usearch. Example lines from the resulting .txt file:
0	seq1	seq5
1	seq2
2	seq3
3	seq4	seq6	seq7
This result implies that four clusters were created based on 7 input sequences. The first cluster (cluster id 0) contains two sequences, sequence ids seq1 and seq5; the second cluster (cluster id 1) contains one sequence, sequence id seq2; the third cluster (cluster id 2) contains one sequence, sequence id seq3; and the final cluster (cluster id 3) contains three sequences, sequence ids seq4, seq6, and seq7. The resulting .log file contains a list of parameters passed to the pick_otus.py script along with the output location of the resulting .txt file. Example (uclust method, default): Using the seqs.fna file generated from split_libraries.py and outputting the results to the directory "picked_otus_default/", while using default parameters (0.97 sequence similarity, no reverse strand matching): To change the percent identity to a lower value, such as 90%, and also enable reverse strand matching, the command would be the following: Uclust Reference-based OTU picking example: uclust_ref can be passed via -m to pick OTUs against a reference set, where sequences within the similarity threshold to a reference sequence will cluster to an OTU defined by that reference sequence, and sequences outside of the similarity threshold to a reference sequence will form new clusters. OTU identifiers will be set to reference sequence identifiers when sequences cluster to reference sequences, and 'qiime_otu_' for new OTUs. Creation of new clusters can be suppressed by passing -C, in which case sequences outside of the similarity threshold to any reference sequence will be listed as failures in the log file, and not included in any OTU. Example (cdhit method): Using the seqs.fna file generated from split_libraries.py and outputting the results to the directory "cdhit_picked_otus/", while using default parameters (0.97 sequence similarity, no prefix filtering): Currently the cd-hit OTU picker allows users to perform a pre-filtering step, so that highly similar sequences are clustered prior to OTU picking. This works by collapsing sequences which begin with an identical n-base prefix, where n is specified by the -n parameter. A commonly used value here is 100 (e.g., -n 100). So, if using this filter with -n 100, all sequences which are identical in their first 100 bases will be clustered together, and only one representative sequence from each cluster will be passed to cd-hit. This is used to greatly decrease the run-time of cd-hit-based OTU picking when working with very large sequence collections, as shown by the following command: Alternatively, if the user would like to collapse identical sequences, or those which are subsequences of other sequences, prior to OTU picking, they can use the trie prefiltering ("-t") option as shown by the following command. Note: It is highly recommended to use one of the prefiltering methods when analyzing large datasets (>100,000 seqs) to reduce run-time.
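For illustration, plausible forms of the uclust and cd-hit invocations described above (the flags are taken from the option list in this entry, but the file and directory names are placeholders, not the original example paths):
pick_otus.py -i seqs.fna -o picked_otus_default/
pick_otus.py -i seqs.fna -o picked_otus_90_rev/ -s 0.90 -z
pick_otus.py -i seqs.fna -m cdhit -o cdhit_picked_otus/ -n 100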
BLAST OTU-Picking Example: OTUs can be picked against a reference database using the BLAST OTU picker. This is useful, for example, when different regions of the SSU rRNA have been sequenced and a sequence similarity based approach like cd-hit therefore wouldn't work. When using the BLAST OTU picking method, the user must supply either a reference set of sequences or a reference database to compare against. The OTU identifiers resulting from this step will be the sequence identifiers in the reference database. This allows for use of a pre-existing tree in downstream analyses, which again is useful in cases where different regions of the 16S gene have been sequenced. The following command can be used to blast against a reference sequence set, using the default E-value and sequence similarity (0.97) parameters: If you already have a pre-built BLAST database, you can pass the database prefix as shown by the following command: If the user would like to change the sequence similarity ("-s") and/or the E-value ("-e") for the blast method, they can use the following command: Prefix-suffix OTU Picking Example: OTUs can be picked by collapsing sequences which begin and/or end with identical bases (i.e., identical prefixes or suffixes). This OTU picker is currently likely to be of limited use on its own, but will be very useful in collapsing very similar sequences in a chained OTU picking strategy that is currently in development. For example, the user will be able to pick OTUs with this method, followed by representative set picking, and then re-pick OTUs on their representative set. This will allow for highly similar sequences to be collapsed, followed by running a slower OTU picker. This ability to chain OTU pickers is not yet supported in QIIME. The following command illustrates how to pick OTUs by collapsing sequences which are identical in their first 50 and last 25 bases: Mothur OTU Picking Example: The Mothur program provides three clustering algorithms for OTU formation: furthest-neighbor (complete linkage), average-neighbor (group average), and nearest-neighbor (single linkage). Details on the algorithms may be found on the Mothur website and publications (Schloss et al., 2009). However, the running times of Mothur's clustering algorithms scale with the number of sequences squared, so the program may not be feasible for large data sets. The following command may be used to create OTUs based on a furthest-neighbor algorithm (the default setting) using aligned sequences as input: If you prefer to use a nearest-neighbor algorithm instead, you may specify this with the '-c' flag: The sequence similarity parameter may also be specified. For example, the following command may be used to create OTUs at the level of 90% similarity: usearch: Usearch provides clustering, chimera checking, and quality filtering. The following command specifies a minimum cluster size of 2 to be used during cluster size filtering: usearch example where reference-based chimera detection is disabled, and minimum cluster size filter is reduced from default (4) to 2:
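The usearch examples just described might look roughly like the following (a hedged sketch; seqs.fna and refseqs.fasta are placeholder filenames, and the flags are the ones documented in this entry):
pick_otus.py -i seqs.fna -m usearch -o usearch_otus/ -g 2 --db_filepath refseqs.fasta
pick_otus.py -i seqs.fna -m usearch -o usearch_otus_denovo_chimera_only/ -g 2 --suppress_reference_chimera_detection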
pick_otus_through_otu_table.py – A workflow script for picking OTUs through building OTU tables Description: This script takes a sequence file and performs all processing steps through building the OTU table. Usage: pick_otus_through_otu_table.py [options] Input Arguments: [REQUIRED] -i, --input_fp The input fasta file [REQUIRED] -o, --output_dir The output directory [REQUIRED] [OPTIONAL] -p, --parameter_fp Path to the parameter file, which specifies changes to the default behavior. See #qiime-parameters. [if omitted, default values will be used] -f, --force Force overwrite of existing output directory (note: existing files in output_dir will not be removed) [default: None] -w, --print_only Print the commands but don't call them; useful for debugging [default: False] -a, --parallel Run in parallel where available [default: False] -O, --jobs_to_start Number of jobs to start. NOTE: you must also pass -a to run in parallel; this defines the number of jobs to be started if and only if -a is passed [default: 4] Output: This script will produce an OTU mapping file (pick_otus.py), a representative set of sequences (FASTA file from pick_rep_set.py), a sequence alignment file (FASTA file from align_seqs.py), a taxonomy assignment file (from assign_taxonomy.py), a filtered sequence alignment (from filter_alignment.py), a phylogenetic tree (Newick file from make_phylogeny.py) and a biom-formatted OTU table (from make_otu_table.py). Simple example: The following command will start an analysis on seqs.fna (-i), which is a post-split_libraries fasta file. The sequence identifiers in this file should be of the form SampleID_SeqID. The following steps, corresponding to the preliminary data preparation, are applied: pick de novo OTUs at 97%; pick a representative sequence for each OTU (the OTU centroid sequence); align the representative set with PyNAST; assign taxonomy with the RDP classifier; filter the alignment prior to tree building (remove positions which are all gaps, and positions specified as 0 in the lanemask); build a phylogenetic tree with FastTree; build an OTU table. All output files will be written to the directory specified by -o, and subdirectories as appropriate. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). pick_reference_otus_through_otu_table.py – Reference OTU picking/Shotgun UniFrac workflow. Description: This script picks OTUs using a reference-based method and constructs an OTU table. Taxonomy is assigned using a pre-defined taxonomy map of reference sequence OTU to taxonomy. If full-length genomes are provided as the reference sequences, this script applies the Shotgun UniFrac method. Usage: pick_reference_otus_through_otu_table.py [options] Input Arguments: [REQUIRED] -i, --input_fp The input sequences -r, --reference_fp The reference sequences -o, --output_dir The output directory [OPTIONAL] -p, --parameter_fp Path to the parameter file, which specifies changes to the default behavior. See #qiime-parameters. [if omitted, default values will be used] -t, --taxonomy_fp The taxonomy map [default: None] -f, --force Force overwrite of existing output directory (note: existing files in output_dir will not be removed) [default: None] -w, --print_only Print the commands but don't call them; useful for debugging [default: False] -a, --parallel Run in parallel where available [default: False] -O, --jobs_to_start Number of jobs to start. NOTE: you must also pass -a to run in parallel; this defines the number of jobs to be started if and only if -a is passed [default: 4] Output: Pick OTUs, assign taxonomy, and create an OTU table against a reference set of OTUs. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). Pick OTUs and create an OTU table against a reference set of OTUs without adding taxonomy assignments.
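A plausible pair of invocations for the reference-based workflow just described (paths are placeholders; -t supplies the taxonomy map and is omitted when taxonomy should not be assigned):
pick_reference_otus_through_otu_table.py -i $PWD/seqs.fna -r $PWD/refseqs.fna -t $PWD/taxa.txt -o $PWD/reference_otus/
pick_reference_otus_through_otu_table.py -i $PWD/seqs.fna -r $PWD/refseqs.fna -o $PWD/reference_otus_no_tax/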
ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). pick_rep_set.py – Pick representative set of sequences Description: After picking OTUs, you can then pick a representative set of sequences. For each OTU, you will end up with one sequence that can be used in subsequent analyses. By default, the representative sequence for an OTU is chosen as the most abundant sequence showing up in that OTU. This is computed by collapsing identical sequences, and choosing the one that was read the most times as the representative sequence (note that each of these would have a different sequence identifier in the FASTA provided as input). Usage: pick_rep_set.py [options] Input Arguments: [REQUIRED] -i, --input_file Path to input otu mapping file [REQUIRED] [OPTIONAL] -f, --fasta_file Path to input fasta file [REQUIRED if not picking against a reference set; default: None] -m, --rep_set_picking_method Method for picking representative sets. Valid choices are random, longest, most_abundant, first [default: first (first chooses cluster seed when picking otus with uclust)] -o, --result_fp Path to store result file [default: _rep_set.fasta] -l, --log_fp Path to store log file [default: No log file created.] -s, --sort_by Sort by otu or seq_id [default: otu] -r, --reference_seqs_fp Collection of preferred representative sequences [default: None] Output: The output from pick_rep_set.py is a single FASTA file containing one sequence per OTU. The FASTA header lines will be the OTU identifier (from here on used as the unique sequence identifier) followed by a space, followed by the sequence identifier originally associated with the representative sequence. The name of the output FASTA file will be _rep_set.fasta by default, or can be specified via the "-o" parameter. Simple example: picking a representative set for de novo-picked OTUs: The script pick_rep_set.py takes as input an 'OTU map' (via the "-i" parameter) which maps OTU identifiers to sequence identifiers. Typically, this will be the output file provided by pick_otus.py. Additionally, a FASTA file is required, via "-f", which contains all of the sequences whose identifiers are listed in the OTU map. By default, a representative sequence will be chosen as the most abundant sequence in the OTU. This can be changed to, for example, choose the first sequence listed in each OTU by passing -m first. Picking OTUs with "preferred representative" sequences: Under some circumstances you may have a fasta file of "preferred representative" sequences. An example of this is if you were to pick OTUs against a reference collection with uclust_ref. In this case you may want your representative sequences to be the sequences from the reference collection, rather than the sequences from your sequencing run. To achieve this, you can pass the original reference collection via -r. If you additionally allowed for new clusters (i.e., sequences which don't match a reference sequence are used as seeds for new OTUs), you'll also need to pass the original sequence collection, so that a representative sequence from the sequencing run can be picked in that case.
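Hedged sketches of the two pick_rep_set.py use cases described above (filenames are placeholders): a de novo representative set picked from the OTU map and input FASTA, and a reference-based representative set where the reference collection is also supplied:
pick_rep_set.py -i seqs_otus.txt -f seqs.fna -o rep_set.fasta
pick_rep_set.py -i seqs_otus.txt -f seqs.fna -r refseqs.fasta -o rep_set.fasta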
pick_subsampled_reference_otus_through_otu_table.py – Subsampled open-reference OTU picking workflow. Description: This script runs the subsampled open-reference OTU picking workflow, producing an OTU table (and, unless suppressed, taxonomy assignments, an alignment, and a tree). Usage: pick_subsampled_reference_otus_through_otu_table.py [options] Input Arguments: [REQUIRED] -i, --input_fps The input sequences filepath or comma-separated list of filepaths -r, --reference_fp The reference sequences -o, --output_dir The output directory [OPTIONAL] -p, --parameter_fp Path to the parameter file, which specifies changes to the default behavior. See #qiime-parameters. [if omitted, default values will be used] --prefilter_refseqs_fp The reference sequences to use for the prefilter, if different from the reference sequences to use for the OTU picking [default: same as passed for --reference_fp] -n, --new_ref_set_id Unique identifier for OTUs that get created in this ref set (this is useful to support combining of reference sets) [default: New] -f, --force Force overwrite of existing output directory (note: existing files in output_dir will not be removed) [default: None] -a, --parallel Run in parallel where available [default: False] -O, --jobs_to_start Number of jobs to start. NOTE: you must also pass -a to run in parallel; this defines the number of jobs to be started if and only if -a is passed [default: 4] -s, --percent_subsample Percent of failure sequences to include in the subsample to cluster de novo (larger numbers should give more comprehensive results but will be slower) [default: 0.001] --prefilter_percent_id Sequences are pre-clustered at this percent id against the reference and any reads which fail to hit are discarded (a quality filter); pass 0.0 to disable [default: 0.60] --step1_otu_map_fp Reference OTU picking OTU map (to avoid rebuilding if one has already been built) --step1_failures_fasta_fp Reference OTU picking failures fasta filepath (to avoid rebuilding if one has already been built) --suppress_step4 Suppress the final de novo OTU picking step (may be necessary for extremely large data sets) [default: False] --min_otu_size The minimum otu size (in number of sequences) to retain the otu [default: 2] --suppress_taxonomy_assignment Skip the taxonomy assignment step, resulting in an OTU table without taxonomy [default: False] --suppress_align_and_tree Skip the sequence alignment and tree-building steps [default: False] Output: Run the subsampled open-reference OTU picking workflow on seqs1.fna using refseqs.fna as the reference collection. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). Run the subsampled open-reference OTU picking workflow in iterative mode on seqs1.fna and seqs2.fna using refseqs.fna as the initial reference collection. ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). Run the subsampled open-reference OTU picking workflow in iterative mode on seqs1.fna and seqs2.fna using refseqs.fna as the initial reference collection, suppressing alignment and tree building. This is useful if you're working with marker genes that do not result in a useful alignment (e.g., fungal ITS). ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). Run the subsampled open-reference OTU picking workflow in iterative mode on seqs1.fna and seqs2.fna using refseqs.fna as the initial reference collection, suppressing assignment of taxonomy. This is useful if you're working with a reference collection without associated taxonomy.
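Plausible invocations of this workflow corresponding to the basic and taxonomy-suppressed cases described above (paths are placeholders):
pick_subsampled_reference_otus_through_otu_table.py -i $PWD/seqs1.fna -r $PWD/refseqs.fna -o $PWD/subsampled_otus/
pick_subsampled_reference_otus_through_otu_table.py -i $PWD/seqs1.fna,$PWD/seqs2.fna -r $PWD/refseqs.fna -o $PWD/subsampled_otus_iter/ --suppress_taxonomy_assignment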
ALWAYS SPECIFY ABSOLUTE FILE PATHS (absolute path represented here as $PWD, but will generally look something like /home/ubuntu/my_analysis/). plot_rank_abundance_graph.py – plot rank-abundance curve Description: Plot a set of rank-abundance graphs from an OTU table and a set of sample names. Multiple graphs will be plotted into the same figure, in order to allow for an easy comparison across samples. Usage: plot_rank_abundance_graph.py [options] Input Arguments: [REQUIRED] -i, --otu_table_fp Path to the input OTU table (i.e., the output from make_otu_table.py) -s, --sample_name Name of the sample to plot. Use "*" to plot all. -o, --result_fp Path to store the resulting figure file. The file extension will be appended if not supplied (e.g.: rankfig -> rankfig.pdf). Additionally, a log file rankfig_log.txt will be created [OPTIONAL] -a, --absolute_counts Plot absolute abundance values instead of relative [default: False] -n, --no-legend Do not draw a legend [default: False] -x, --x_linear_scale Draw x axis in linear scale [default: False] -y, --y_linear_scale Draw y axis in linear scale [default: False] -f, --file_type Save plot using this image type. Choice of pdf, svg, png, eps [default: pdf] Output: Single graph example: Plot the rank-abundance curve of one sample using a linear scale for the x_axis: Multiple graph example: Plot the rank-abundance curve of several samples: Multiple graph example: Plot the rank-abundance curve of all samples in an OTU table: plot_semivariogram.py – Fits a model between two distance matrices and plots the result Description: Fits a spatial autocorrelation model between two matrices and plots the result. This script will work with two distance matrices but will ignore the 0s at the diagonal and the values that go to N/A. Usage: plot_semivariogram.py [options] Input Arguments: [REQUIRED] -x, --input_path_x Path to distance matrix to be displayed in the x axis -y, --input_path_y Path to distance matrix to be displayed in the y axis -o, --output_path Output path. Directory for batch processing, filename for single file operation [OPTIONAL] -b, --binning Binning ranges. Format: [increment,top_limit], where a top_limit of -1 means infinity; you can specify several ranges using the same format, i.e. [2.5,10][50,-1] will set two bins, one from 0-10 using 2.5 size steps and one from 10-inf using 50 size steps. Note that the binning is used to clean the plots (reduce the number of points) but is ignored when fitting the model. [default: None] --ignore_missing_samples This will bypass the error raised when the matrices have different sizes/samples --x_max X axis max limit [default: auto] --x_min X axis min limit [default: auto] --y_max Y axis max limit [default: auto] --y_min Y axis min limit [default: auto] -X, --x_label Label for the x axis [default: Distance Dissimilarity (m)] -Y, --y_label Label for the y axis [default: Community Dissimilarity] -t, --fig_title Title of the plot [default: Semivariogram] --dot_color Dot color for plot [default: white] --dot_marker Dot marker for plot [default: o] --line_color Line color for plot [default: blue] --dot_alpha Alpha (transparency) for dots [default: 1] --line_alpha Alpha (transparency) for lines [default: 1] -m, --model Model to be fitted to the data. Valid choices are: nugget, exponential, gaussian, periodic, linear. [default: exponential] -p, --print_model Print in the title of the plot the function of the fit.
[default: False] Output: The resulting output file consists of a pdf image containing the plot between the two distance matrices and the fitted model. Fitting: For this script, the user supplies two distance matrices (i.e. resulting files from beta_diversity.py), along with the output filename (e.g. semivariogram) and the model to fit, as follows: Modify the default method to gaussian: plot_taxa_summary.py – Make taxonomy summary charts based on taxonomy assignment Description: This script automates the construction of pie, bar and area charts showing the breakdown of taxonomy by given levels. The script creates an html file for each chart type for easy visualization. It uses the taxonomy or category counts from summarize_taxa.py for combined samples by level (-i) and user-specified labels for each file passed in (-l). Output will be written to the user-specified folder (-o), where the default is the current working directory. The user can also specify the number of categories displayed within a single pie chart, where the rest are grouped together as the 'other' category, using the (-n) option; the default is 20. Usage: plot_taxa_summary.py [options] Input Arguments: [REQUIRED] -i, --counts_fname Input comma-separated list of summarized taxa filepaths (i.e. results from summarize_taxa.py) [REQUIRED] [OPTIONAL] -l, --labels Comma-separated list of taxonomic levels (e.g. Phylum,Class,Order) [default=None] -n, --num_categories The maximum number of taxonomies to show in each pie chart. All additional taxonomies are grouped into an "other" category. NOTE: this functionality only applies to the pie charts. [default: 20] -o, --dir_path Output directory -b, --colorby The categories to color by in the plots from the metadata mapping file. The categories must match the name of a column header in the mapping file exactly, and multiple categories can be listed by comma-separating them without spaces. [default=None] -p, --prefs_path Input user-generated preferences filepath. NOTE: This is a file with a dictionary containing preferences for the analysis. The key taxonomy_coloring is used for the coloring. [default: None] -k, --background_color This is the background color to use in the plots (black or white) [default: white] -d, --dpi This is the resolution of the plot. [default: 80] -x, --x_width This is the width of the x-axis to use in the plots. [default: 12] -y, --y_height This is the height of the y-axis to use in the plots. [default: 6] -w, --bar_width This is the width of the bars in the bar graph and should be a number between 0 and 1. NOTE: this only applies to the bar charts. [default: 0.75] -t, --type_of_file This is the type of image to produce (i.e. pdf, svg, png). [default: pdf] -c, --chart_type This is the type of chart to plot (i.e. pie, bar or area). The user has the ability to plot multiple types by using a comma-separated list (e.g. area,pie) [default: area,bar] -r, --resize_nth_label Make every nth label larger than the other labels. This is for large area and bar charts where the font on the x-axis is small. This requires an integer value greater than 0. [default: 0] -s, --include_html_legend Include HTML legend. If present, the writing of the legend in the html page is included. [default: False] -m, --include_html_counts Include HTML counts. If present, the writing of the counts in the html table is included [default: False] -a, --label_type Label type ("numeric" or "categorical"). If the label type is defined as numeric, the x-axis will be scaled accordingly.
Otherwise the x-values will be treated categorically and be evenly spaced [default: categorical]. Output: The script generates an output folder, which contains several files. For each pie chart there is a png and a pdf file. The best way to view all of the pie charts is by opening up the file taxonomy_summary_pie_chart.html. Examples: If you wish to run the code using default parameters, you must supply a counts file (phylum.txt) along with the taxon level label (Phylum), the type(s) of charts to produce, and an output directory, by using the following command: If you want to make charts for multiple levels at a time (phylum.txt,class.txt,genus.txt) use the following command: Additionally, if you would like to display only a set number of taxa ("-n 10") in the pie charts, you can use the following command: If you would like to generate pie charts for specific samples, i.e. sample 'PC.636' and sample 'PC.635' that are in the counts file header, you can use the following command: poller.py – Poller for parallel QIIME scripts. Description: Script for polling parallel runs to check completion. Usage: poller.py [options] Input Arguments: [REQUIRED] -f, --check_run_complete_file Path to file containing a list of files that must exist to declare a run complete [REQUIRED] [OPTIONAL] -r, --check_run_complete_f Function which returns True when run is completed [default: qiime.parallel.poller.basic_check_run_complete_f] -p, --process_run_results_f Function to be called when runs complete [default: qiime.parallel.poller.basic_process_run_results_f] -m, --process_run_results_file Path to file containing a map of tmp filepaths which should be written to final output filepaths [default: None] -c, --clean_up_f Function called after processing result [default: qiime.parallel.poller.basic_clean_up_f] -d, --clean_up_file List of files and directories to remove after run [default: None] -t, --time_to_sleep Time to wait between calls to status_callback_f (in seconds) [default: 3] Output: No output created. Poller example: Runs the poller, which checks for the existence of two input files (file1.txt and file2.txt) and merges their contents. A cleanup file is provided that instructs the poller to remove the newly merged file. poller_example.py – Create python file Description: This script is designed for use with parallel jobs to wait for their completion, and subsequently process the results and clean up. This script allows users to see it in action, and also allows manual testing, as this is a difficult process to unit test. To test, call the example command below. The poller will begin running, at which time you can create the three polled files in POLLED_DIR. When all three are created, the poller will process the results, clean up, and exit. Usage: poller_example.py [options] Input Arguments: [REQUIRED] -d, --polled_dir Path to directory to poll [OPTIONAL] -P, --poller_fp Full path to qiime/parallel/poller.py [default: /Users/caporaso/code/Qiime/scripts/poller.py] -Y, --python_exe_fp Full path to python executable [default: python] -c, --suppress_custom_functions Use the default functions for checking run completion, processing results, and cleaning up (these are quiet) [default: False] Output: The poller waits for three files to be created: POLLED_DIR/poller_test_0.txt, POLLED_DIR/poller_test_1.txt, and POLLED_DIR/poller_test_2.txt, where POLLED_DIR is defined via -d. Existence of these three files is checked every 5 seconds with verbose_check_run_complete_f.
When all three exist, verbose_process_run_results_f is called, which cats all the files into a single file: POLLED_DIR/poller_test_completed.txt. Finally, verbose_clean_up_f is called, which removes the original three files the poller was waiting on. Example usage: The actual call to the polling command is printed for reference just prior to calling it. This illustrates how to pass both functions and filepaths to the poller. For an example where the default (non-verbose) check_run_complete_f, process_run_results_f, and clean_up_f are used, pass -c. Again, the polling command will be printed just prior to calling it. pool_by_metadata.py – pool samples in OTU table and mapping file based on sample metadata from mapping file Description: This script outputs a new otu table and mapping file with some samples removed and replaced with one pooled sample. The new pooled sample will have fields in the mapping file the same as its constituent samples, if all are identical; else it will just say 'multipleValues'. Usage: pool_by_metadata.py [options] Input Arguments: [REQUIRED] -i, --otu_table_fp Path to the input OTU table (i.e., the output from make_otu_table.py) -m, --map Path to the map file [REQUIRED] -s, --states String containing valid states, e.g. 'STUDY_NAME:DOG'. Setting just 'STUDY_NAME' will bin samples by all unique category values, e.g. it will pool all samples marked DOG into a sample called STUDY_NAME.DOG, all CAT into STUDY_NAME.CAT, etc. [OPTIONAL] -o, --otu_outfile Name of otu output file, default is otu_filename.pooled.txt -p, --map_outfile Name of map output file, default is map_filename.pooled.txt -l, --pooled_sample_name New sample name used in new mapping file and new otu table Output: The result is a pooled OTU table and mapping file. Examples: The following command pools all the Control samples into one sample named 'pooledControl'. The resulting data is written to seqs_otus.txt.filtered.xls and Fasting_Map.txt.filtered.xls: Some variations are: Pooling all samples in both Control and Fast in the Treatment field (i.e. pooling everything): Excluding Fast in the Treatment field - the syntax here is "*" to keep everything, then !Fast to eliminate the Fast group: preferences_file – Example of a prefs file This shows a generic overview of a preferences file (.txt):
{
 'background_color':'color_name',
 'sample_coloring':
  {
   'Name_for_color_scheme1':
    {
     'column':'mapping_column1',
     'colors':{'Sample1':'red','Sample2':'blue'},
    }
  },
 'MONTE_CARLO_GROUP_DISTANCES':
  {
   'mapping_column1': distance_to_use1,
   'mapping_column2': distance_to_use2,
  },
 'FIELDS':
  [
   'mapping_column1',
   'mapping_column2',
  ],
 'taxonomy_coloring':
  {
   'Taxonomy_Level':
    {
     'column':'summarized_otu_table_column_number2',
     'colors':
      {
       "Lineage":('color_name1',"color1_in_hex"),
       "Lineage":('color_name2',"color2_in_hsv"),
      }
    }
  }
}
This shows an example of a prefs file (.txt):
{
 'background_color':'black',
 'sample_coloring':
  {
   'Samples':
    {
     'column':'SampleID',
     'colors':{'Sample1':'red','Sample2':'blue'},
    },
   'TreatmentType':
    {
     'column':'Treatment',
     'colors':(('red',(0,100,100)),('blue',(240,100,100)))
    }
  },
 'MONTE_CARLO_GROUP_DISTANCES':
  {
   'SampleID': 10,
   'Treatment': 10
  },
 'FIELDS':
  [
   'SampleID',
   'Treatment'
  ],
 'taxonomy_coloring':
  {
   'Level_3':
    {
     'column':'3',
     'colors':
      {
       'Root;Bacteria;Bacteroidetes;Flavobacteria':('red',(0,100,100)),
       'Root;Bacteria;Bacteroidetes;Sphingobacteria':('blue',(240,100,100))
      }
    }
  }
}
principal_coordinates.py – Principal Coordinates Analysis (PCoA) Description: Principal Coordinate Analysis (PCoA) is commonly used to compare groups of samples based on phylogenetic or count-based distance metrics (see section on beta_diversity.py). Usage: principal_coordinates.py [options] Input Arguments: [REQUIRED] -i, --input_path Path to the input distance matrix file(s) (i.e., the output from beta_diversity.py). Is a directory for batch processing and a filename for a single file operation. -o, --output_path Output path. Directory for batch processing, filename for single file operation. Output: The resulting output file consists of the principal coordinate (PC) axes (columns) for each sample (rows). Pairs of PCs can then be graphed to view the relationships between samples. The bottom of the output file contains the eigenvalues and % variation explained for each PC. PCoA (Single File): For this script, the user supplies a distance matrix (i.e. the resulting file from beta_diversity.py), along with the output filename (e.g. beta_div_coords.txt), as follows: PCoA (Multiple Files): The script also functions in batch mode if a folder is supplied as input (e.g. from beta_diversity.py run in batch). No other files should be present in the input folder - only the distance matrix files to be analyzed. This script operates on every distance matrix file in the input directory and creates a corresponding principal coordinates results file in the output directory, e.g.: print_metadata_stats.py – Count the number of samples associated with a category value Description: Sum up the number of samples with each category value and print this information. Usage: print_metadata_stats.py [options] Input Arguments: [REQUIRED] -m, --mapping_file The input metadata file -c, --category The category to examine [OPTIONAL] -o, --output_fp Path where output will be written [default: print to screen] Output: Two columns, the first being the category value and the second being the count. Output is to standard out. If there are unspecified values, the output category is identified as *UNSPECIFIED*. Example: Count the number of samples associated with Treatment. Example writing the output to a file: Count the number of samples associated with Treatment and save them to a file called stats.txt. print_qiime_config.py – Print out the qiime config settings. Description: A simple script that prints out the qiime config settings and does some sanity checks. Usage: print_qiime_config.py [options] Input Arguments: [OPTIONAL] -t, --test Test the qiime config for sanity [default: False] Output: This prints the qiime_config to stdout. Example 1: Print qiime config settings. Example 2: Print and check qiime config settings for sanity. process_iseq.py – Given a directory of per-swath iseq files, this script generates a single fastq per lane. Description: Usage: process_iseq.py [options] Input Arguments: [REQUIRED] -i, --input_fps The input filepaths (either iseq or gzipped iseq format; comma-separated if more than one). See the Processing Illumina Data tutorial for a description of the iseq file type.
-o, --output_dir The output directory -b, --barcode_length Length of the barcode [OPTIONAL] --barcode_in_header Pass if barcode is in the header index field (rather than at the beginning of the sequence) --barcode_qual_c If no barcode quality string is available, score each base with this quality [default: b] Output: Generate fastq files from lanes 1 and 2 (read 1 data) where barcodes are contained as the first twelve bases of the sequences. Generate fastq files from the gzipped lanes 1 and 2 (read 1 data) where barcodes are contained as the first twelve bases of the sequences. process_qseq.py – Given a directory of per-swath qseq files, this script generates a single fastq per lane. Description: Usage: process_qseq.py [options] Input Arguments: [REQUIRED] -i, --input_dir The input directory -o, --output_dir The output directory -r, --read The read number to consider [OPTIONAL] -l, --lanes The lane numbers to consider, comma-separated [default: 1,2,3,4,5,6,7,8] -b, --bases The number of bases to include (useful for slicing a barcode) [default: all] --ignore_pass_filter Ignore the illumina pass filter [default: False; reads with 0 in pass filter field are discarded] Output: Generate fastq files from all lanes of read 1 data in the current directory. Generate fastq files from all lanes of read 2 data in the current directory, truncating the sequences after the first 12 bases. process_sff.py – Convert sff to FASTA and QUAL files Description: This script converts a directory of sff files into FASTA, QUAL and flowgram files. Usage: process_sff.py [options] Input Arguments: [REQUIRED] -i, --input_dir Input directory of sff files or a single sff filepath [OPTIONAL] --no_trim Do not trim sequence/qual (requires --use_sfftools option) [default: False] -f, --make_flowgram Generate a flowgram file. [default: False] -t, --convert_to_FLX Convert Titanium reads to FLX length. [default: False] --use_sfftools Use the external programs sfffile and sffinfo for processing, instead of the equivalent python implementation -o, --output_dir Output directory for the converted files [default: same as input dir] Output: This script results in FASTA and QUAL formatted files. Simple example: Convert all the sffs in directory "sffs/" to fasta and qual. Convert a single sff to fasta and qual. Flowgram example: Convert all the sffs in directory "sffs/" to fasta and qual, along with a flowgram file. Convert a single sff to fasta and qual, along with a flowgram file. Output example: Convert all the sffs in directory "sffs/" to fasta and qual, along with a flowgram file, and write them to another directory. quality_scores_plot.py – Generates histograms of sequence quality scores and number of nucleotides recorded at a particular index Description: Two plots are generated by this module. The first shows line plots indicating the average and standard deviations for the quality scores of the input quality score file, starting with the first nucleotide and ending with the final nucleotide of the largest sequence. A second histogram shows a line plot with the nucleotide count for each position, so that one may easily visualize how sequence length drops off. A dotted line shows the cut-off point for a score to be acceptable (default is 25). A text file logging the average, standard deviation, and base count for each base position is also generated. These three sections are comma separated. The truncate_fasta_qual_files.py module can be used to create truncated versions of the input fasta and quality score files.
By using this module to assess where poor quality base calls begin, one can determine the base position at which to begin truncating sequences. Usage: quality_scores_plot.py [options] Input Arguments: [REQUIRED] -q, --qual_fp Quality score file used to generate histogram data. [OPTIONAL] -o, --output_dir Output directory. Will be created if it does not exist. [default: .] -s, --score_min Minimum quality score to be considered acceptable. Used to draw a dotted line on the histogram for easy visualization of poor quality scores. [default: 25] -v, --verbose Turn on this flag to disable verbose output. [default: True] Output: A .pdf file with the two plots will be created in the output directory. Example: Generate plots and output to the quality_histograms folder. relatedness.py – Calculate NRI (net relatedness index) and NTI (nearest taxon index) using the formulas from Phylocom 4.2/3.41 and Webb 2002. Description: This script calculates NRI and NTI from a path to a Newick formatted tree and a path to a comma-separated list of ids in that tree that form the group whose NRI/NTI you want to test. The tree is not required to have distances. If none are found, the script will use the number of nodes (self inclusive) as their distance from one another. NRI and NTI are calculated as described in the Phylocom manual (which is a slightly modified version of that found in Webb 2002, and Webb 2000). The Phylocom manual is freely available on the web, and Webb 2002 can be found in the Annual Review of Ecology and Systematics: Phylogenies and Community Ecology, Webb 2002. Usage: relatedness.py [options] Input Arguments: [REQUIRED] -t, --tree_fp The tree filepath -g, --taxa_fp Taxa list filepath [OPTIONAL] -i, --iters Number of iterations to use for sampling tips without replacement (null model 2 community sampling; see the Phylocom manual). [default: 1000] -m, --methods Comma-separated list of metrics to calculate. [default: nri,nti] -o, --output_fp Path where output will be written [default: print to screen] Output: Outputs a value for the specified tests. Calculate both NRI and NTI from the given tree and group of taxa: Calculate only NRI: Calculate only NTI using a different number of iterations: Calculate only NTI using a different number of iterations and save the results into a file called output.txt: shared_phylotypes.py – Compute shared OTUs between all pairs of samples Description: This script computes from an OTU table a matrix with the number of shared phylotypes between all pairs of samples. Usage: shared_phylotypes.py [options] Input Arguments: [REQUIRED] -i, --otu_table_fp Path to the input OTU table in biom format or a directory containing OTU tables -o, --output_fp The output filepath [OPTIONAL] -r, --reference_sample Name of reference sample to which all pairs of samples should be compared [default: None] Output: Single example: Compute shared OTUs on one OTU table for all samples. Reference sample example: Compute shared OTUs with respect to a reference sample. Computes shared OTUs between all pairs of samples and the reference sample. E.g. in a transplant study this can be used to establish a baseline count of shared OTUs with the Donor sample before and after the transplant. Batch mode example: Compute shared OTUs for a set of OTU tables, e.g. from running multiple_rarefactions.py, with an even number of sequences per sample. The resulting directory can be fed to dissimilarity_mtx_stats.py, which computes the mean, median and standard deviation of the provided tables.
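Hedged sketches of the three shared_phylotypes.py examples described above (file and sample names are placeholders; 'Donor' stands in for the reference sample):
shared_phylotypes.py -i otu_table.biom -o shared_otus.txt
shared_phylotypes.py -i otu_table.biom -r Donor -o shared_otus_with_donor.txt
shared_phylotypes.py -i rarefied_otu_tables/ -o shared_otus_tables/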
simsam.py – Simulate samples for each sample in an OTU table, using a phylogenetic tree. Description: This script makes n samples related to each sample in an input otu table. An input OTU table with 3 samples and n=2 will result in an output OTU table with 6 samples total: 3 clusters of 2 related samples. To simulate each of the new samples, this script uses a sample in the input OTU table, and for each OTU in that sample the script traverses rootward on the tree a distance specified by '-d' to a point x. It then randomly selects a tip that descends from x (call that new tip 'o2'), and reassigns all observations of the original OTU to the tip/OTU 'o2'. Usage: simsam.py [options] Input Arguments: [REQUIRED] -i, --otu_table The input otu table -t, --tree_file Tree file -o, --output_dir Path to the output directory -d, --dissim Dissimilarity between nodes up the tree, as a single value or comma-separated list of values -n, --num Number of simulated samples per input sample, as a single value or comma-separated list of values [OPTIONAL] -m, --mapping_fp The mapping filepath. If provided, an output mapping file containing the replicated sample IDs (with all other metadata columns copied over) will also be created [default: None] Output: The output directory will contain an OTU table with samples named 'original_sample_0', 'original_sample_1', and so on. If a mapping file is provided via -m, an output mapping file containing the replicated sample IDs (with all other metadata columns copied over) will also be created. Create an OTU table with 3 related samples for each sample in otu_table.biom with dissimilarities of 0.001. Create OTU tables with 2, 3 and 4 related samples for each sample in otu_table.biom with dissimilarities of 0.001 and 0.01. Additionally create new mapping files with metadata for each of the new samples for use in downstream analyses. single_rarefaction.py – Perform rarefaction on an otu table Description: To perform bootstrap, jackknife, and rarefaction analyses, the otu table must be subsampled (rarefied). This script rarefies, or subsamples, an OTU table. This does not provide curves of diversity by number of sequences in a sample. Rather it creates a subsampled OTU table by random sampling (without replacement) of the input OTU table. Samples that have fewer sequences than the requested rarefaction depth are omitted from the output otu tables. The pseudo-random number generator used for rarefaction by subsampling is NumPy's default - an implementation of the Mersenne twister PRNG. Usage: single_rarefaction.py [options] Input Arguments: [REQUIRED] -i, --input_path Input OTU table filepath. -o, --output_path Output OTU table filepath. -d, --depth Number of sequences to subsample per sample. [OPTIONAL] --lineages_included Deprecated: lineages are now included by default. Pass --suppress_lineages_included to prevent output OTU tables from including taxonomic (lineage) information for each OTU. Note: this will only work if lineage information is in the input OTU table. --suppress_lineages_included Exclude taxonomic (lineage) information for each OTU. -k, --keep_empty_otus Retain OTUs of all zeros, which are usually omitted from the output OTU tables. [default: False] --subsample_multinomial Subsample using subsampling with replacement [default: False] Output: The results of single_rarefaction.py consist of a single subsampled OTU table. The file has the same otu table format as the input otu_table.biom.
note: if the output file would be empty, no file is written Example: subsample otu_table.biom (-i) at 100 seqs/sample (-d), write results to otu_table_even100.txt (-o). sort_otu_table.py – Script for sorting the sample IDs in an OTU table based on a specified value in a mapping file. Description: Usage: sort_otu_table.py [options] Input Arguments: [REQUIRED] -i, --input_otu_table Input OTU table filepath in BIOM format. -o, --output_fp Output OTU table filepath. [OPTIONAL] -m, --mapping_fp Input metadata mapping filepath. [default: None] -s, --sort_field Category to sort OTU table by. [default: None] -l, --sorted_sample_ids_fp Sorted sample id filepath [default: None] Output: sort samples by the age field in the mapping file sort samples based on order in a file where each line starts with a sample id split_fasta_on_sample_ids.py – Split a single post-split_libraries.py fasta file into per-sample fasta files. Description: Split a single post-split_libraries.py fasta file into per-sample fasta files. This script requires that the sequences identitifers are in post-split_libraries.py format (i.e., SampleID_SeqID). A fasta file will be created for each unique SampleID. Usage: split_fasta_on_sample_ids.py [options] Input Arguments: [REQUIRED] -i, --input_fasta_fp The input fasta file to split -o, --output_dir The output directory [default: None] [OPTIONAL] --buffer_size The number of sequences to read into memory before writing to file (you usually won?t need to change this) [default: 500] Output: This script will produce an output directory with as many files as samples. Split seqs.fna into one fasta file per sample and store the resulting fasta files in „out? split_libraries.py – Split libraries according to barcodes specified in mapping file Description: Since newer sequencing technologies provide many reads per run (e.g. the 454 GS FLX Titanium series can produce 400-600 million base pairs with 400-500 base pair read lengths) researchers are now finding it useful to combine multiple samples into a single 454 run. This multiplexing is achieved through the application of a pyrosequencing-tailored nucleotide barcode design (described in (Parameswaran et al., 2007)). By assigning individual, unique sample specific barcodes, multiple sequencing runs may be performed in parallel and the resulting reads can later be binned according to sample. The script split_libraries.py performs this task, in addition to several quality filtering steps including user defined cut-offs for: sequence lengths; end-trimming; minimum quality score. To summarize, by using the fasta, mapping, and quality files, the program split_libraries.py will parse sequences that meet user defined quality thresholds and then rename each read with the appropriate Sample ID, thus formatting the sequence data for downstream analysis. If a combination of different sequencing technologies are used in any particular study, split_libraries.py can be used to perform the quality-filtering for each library individually and the output may then be combined. Sequences from samples that are not found in the mapping file (no corresponding barcode) and sequences without the correct primer sequence will be excluded. Additional scripts can be used to exclude sequences that match a given reference sequence (e.g. the human genome; exclude_seqs_by_blast.py) and/or sequences that are flagged as chimeras (identify_chimeric_seqs.py). Usage: split_libraries.py [options] Input Arguments: [REQUIRED] -m, --map Name of mapping file. 
NOTE: Must contain a header line indicating SampleID in the first column and BarcodeSequence in the second, LinkerPrimerSequence in the third. -f, --fasta Names of fasta files, comma-delimited [OPTIONAL] -q, --qual Names of qual files, comma-delimited [default: None] -r, --remove_unassigned DEPRECATED: pass --retain_unassigned_reads to keep unassigned reads [default: None] -l, --min-seq-length Minimum sequence length, in nucleotides [default: 200] -L, --max-seq-length Maximum sequence length, in nucleotides [default: 1000] -t, --trim-seq-length Calculate sequence lengths after trimming primers and barcodes [default: False] -s, --min-qual-score Min average qual score allowed in read [default: 25] -k, --keep-primer Do not remove primer from sequences -B, --keep-barcode Do not remove barcode from sequences -a, --max-ambig Maximum number of ambiguous bases [default: 6] -H, --max-homopolymer Maximum length of homopolymer run [default: 6] -M, --max-primer-mismatch Maximum number of primer mismatches [default: 0] -b, --barcode-type Barcode type, hamming_8, golay_12, variable_length (will disable any barcode correction if variable_length is set), or a number representing the length of the barcode, such as -b 4. [default: golay_12] -o, --dir-prefix Directory prefix for output files [default: .] -e, --max-barcode-errors Maximum number of errors in barcode [default: 1.5] -n, --start-numbering-at Seq id to use for the first sequence [default: 1] --retain_unassigned_reads Retain sequences which are Unassigned in the output sequence file [default: False] -c, --disable_bc_correction Disable attempts to find nearest corrected barcode. Can improve performance. [default: False] -w, --qual_score_window Enable sliding window test of quality scores. If the average score of a continuous set of w nucleotides falls below the threshold (see -s for default), the sequence is discarded. A good value would be 50. 0 (zero) means no filtering. Must pass a .qual file (see -q parameter) if this functionality is enabled. Default behavior for this function is to truncate the sequence at the beginning of the poor quality window, and test for minimal length (-l parameter) of the resulting sequence. [default: 0] -g, --discard_bad_windows If the qual_score_window option (-w) is enabled, this will override the default truncation behavior and discard any sequences where a bad window is found. [default: False] -p, --disable_primers Disable primer usage when demultiplexing. Should be enabled for unusual circumstances, such as analyzing Sanger sequence data generated with different primers. [default: False] -z, --reverse_primers Enable removal of the reverse primer and any subsequent sequence from the end of each read. To enable this, there has to be a "ReversePrimer" column in the mapping file. Primers are required to be in IUPAC format and written in the 5' to 3' direction. Valid options are 'disable', 'truncate_only', and 'truncate_remove'. 'truncate_only' will remove the primer and subsequent sequence data from the output read and will not alter output of sequences where the primer cannot be found. 'truncate_remove' will flag sequences where the primer cannot be found to not be written and will record the quantity of such failed sequences in the log file. [default: disable] --reverse_primer_mismatches Set number of allowed mismatches for reverse primers (option -z). [default: 0] -d, --record_qual_scores Enables recording of quality scores for all sequences that are recorded.
If this option is enabled, a file named seqs_filtered.qual will be created in the output directory, and will contain the same sequence IDs in the seqs.fna file and sequence quality scores matching the bases present in the seqs.fna file. [default: False] -i, --median_length_filtering Disables minimum and maximum sequence length filtering, and instead calculates the median sequence length and filters the sequences based upon the number of median absolute deviations specified by this parameter. Any sequences with lengths outside the number of deviations will be removed. [default: None] -j, --added_demultiplex_field Use -j to add a field to use in the mapping file as an additional demultiplexing option to the barcode. All combinations of barcodes and the values in these fields must be unique. The fields must contain values that can be parsed from the fasta labels such as “plate=R_2008_12_09”. In this case, “plate” would be the column header and “R_2008_12_09” would be the field data (minus quotes) in the mapping file. To use the run prefix from the fasta label, such as “>FLP3FBN01ELBSX”, where “FLP3FBN01” is generated from the run ID, use “-j run_prefix” and set the run prefix to be used as the data under the column headerr “run_prefix”. [default: None] -x, --truncate_ambi_bases Enable to truncate at the first “N” character encountered in the sequences. This will disable testing for ambiguous bases (-a option) [default: False] Output: Three files are generated by split_libraries.py: 1) .fna file (e.g. seqs.fna) - This is a FASTA file containing all sequences which meet the user-defined parameters, where each sequence identifier now contains its corresponding sample id from mapping file. 2) histograms.txt- This contains the counts of sequences with a particular length. 3) split_library_log.txt - This file contains a summary of the split_libraries.py analysis. Specifically, this file includes information regarding the number of sequences that pass quality control (number of seqs written) and how these are distributed across the different samples which, through the use of bar-coding technology, would have been pooled into a single 454 run. The number of sequences that pass quality control will depend on length restrictions, number of ambiguous bases, max homopolymer runs, barcode check, etc. All of these parameters are summarized in this file. If raw sequences do not meet the specified quality thresholds they will be omitted from downstream analysis. Since we never see a perfect 454 sequencing run, the number of sequences written should always be less than the number of raw sequences. The number of sequences that are retained for analysis will depend on the quality of the 454 run itself in addition to the default data filtering thresholds in the split_libraries.py script. The default parameters (minimum quality score = 25, minimum/maximum length = 200/1000, no ambiguous bases allowed, no mismatches allowed in primer sequence) can be adjusted to meet the user?s needs. Standard Example: Using a single 454 run, which contains a single FASTA, QUAL, and mapping file while using default parameters and outputting the data into the Directory “Split_Library_Output”: Multiple FASTA and QUAL Files Example: For the case where there are multiple FASTA and QUAL files, the user can run the following comma-separated command as long as there are not duplicate barcodes listed in the mapping file: Duplicate Barcode Example: An example of this situation would be a study with 1200 samples. 
You wish to have 400 samples per run, so you split the analysis into three runs and reuse barcoded primers (you only have 600). After initial analysis you determine a small subset is underrepresented (<500 sequences per samples) and you boost the number of sequences per sample for this subset by running a fourth run. Since the same sample IDs are in more than one run, it is likely that some sequences will be assigned the same unique identifier by split_libraries.py when it is run separately on the four different runs, each with their own barcode file. This will cause a problem in file concatenation of the four different runs into a single large file. To avoid this, you can use the „-n? parameter which defines a start index for split_libraries.py. From experience, most FLX runs (when combining both files for a single plate) will have 350,000 to 650,000 sequences. Thus, if Run 1 for split_libraries.py uses „-n 1000000?, Run 2 uses „-n 2000000?, etc., then you are guaranteed to have unique identifiers after concatenating the results of multiple FLX runs. With newer technologies you will just need to make sure that your start index spacing is greater than the potential number of sequences. To run split_libraries.py, you will need two or more (depending on the number of times the barcodes were reused) separate mapping files (one for each Run, for example one for Run1 and another one for Run2), then you can run split_libraries.py using the FASTA and mapping file for Run1 and FASTA and mapping file for Run2. Once you have run split libraries on each file independently, you can concatenate (e.g. using the „cat? command) the sequence files that were generated by split_libraries.py. You can also concatenate the mapping files, since the barcodes are not necessary for downstream analyses, unless the same sample IDs are found in multiple mapping files. Run split_libraries.py on Run 1: Run split_libraries.py on Run 2. The resulting FASTA files from Run 1 and Run 2 can then be concatenated using the „cat? command (e.g. cat Split_Library_Run1_Output/seqs.fna Split_Library_Run2_Output/seqs.fna > Combined_seqs.fna) and used in downstream analyses. Barcode Decoding Example: The standard barcode types supported by split_libraries.py are golay (Length: 12 NTs) and hamming (Length: 8 NTs). For situations where the barcodes are of a different length than golay and hamming, the user can define a generic barcode type “-b” as an integer, where the integer is the length of the barcode used in the study. Note: When analyzing large datasets (>100,000 seqs), users may want to use a generic barcode type, even for length 8 and 12 NTs, since the golay and hamming decoding processes can be computationally intensive, which causes the script to run slow. Barcode correction can be disabled with the -c option if desired. For the case where the 8 base pair barcodes were used, you can use the following command: Linkers and Primers: The linker and primer sequence (or all the degenerate possibilities) are associated with each barcode from the mapping file. If a barcode cannot be identified, all the possible primers in the mapping file are tested to find a matching sequence. Using truncated forms of the same primer can lead to unexpected results for rare circumstances where the barcode cannot be identified and the sequence following the barcode matches multiple primers. In many cases, sequence reads are long enough to sequence through the reverse primer and sequencing adapter. 
To remove these primers and all following sequences, the -z option can be used. By default, this option is set to 'disable'. If it is set to 'truncate_only', split_libraries.py will trim the primer and any sequence following it if the primer is found. If the 'truncate_remove' option is set, split_libraries.py will trim the primer if found, and will not write the sequence if the primer is not found. The allowed mismatches for the reverse primer are set with the --reverse_primer_mismatches parameter (default 0). To use reverse primer removal, one must include a 'ReversePrimer' column in the mapping file, with the reverse primer recorded in the 5' to 3' orientation.
Example of reverse primer removal, where primers are trimmed if found and the sequence is written unchanged if not found; allowed mismatches are increased to 1 from the default 0:
split_libraries_fastq.py – This script performs demultiplexing of Fastq sequence data where barcodes and sequences are contained in two separate fastq files (common on Illumina runs).
Description:
Usage: split_libraries_fastq.py [options]
Input Arguments:
[REQUIRED]
-i, --sequence_read_fps The sequence read fastq files (comma-separated if more than one)
-o, --output_dir Directory to store output files
-m, --mapping_fps Metadata mapping files (comma-separated if more than one)
[OPTIONAL]
-b, --barcode_read_fps The barcode read fastq files (comma-separated if more than one) [default: None]
--store_qual_scores Store qual strings in .qual files [default: False]
--sample_id Single sample id to be applied to all sequences (used when data is not multiplexed) [default: None]
--store_demultiplexed_fastq Write demultiplexed fastq files [default: False]
--retain_unassigned_reads Retain sequences which don't map to a barcode in the mapping file (sample ID will be “Unassigned”) [default: False]
-r, --max_bad_run_length Max number of consecutive low quality base calls allowed before truncating a read [default: 3]
-p, --min_per_read_length_fraction Min number of consecutive high quality base calls to include a read (per single end read) as a fraction of the input read length [default: 0.75]
-n, --sequence_max_n Maximum number of N characters allowed in a sequence to retain it; this is applied after quality trimming, and is total over combined paired end reads if applicable [default: 0]
-s, --start_seq_id Start seq_ids as ascending integers beginning with start_seq_id [default: 0]
--rev_comp_barcode Reverse complement barcode reads before lookup [default: False]
--rev_comp_mapping_barcodes Reverse complement barcodes in the mapping file before lookup (useful if barcodes in the mapping file are reverse complements of golay codes) [default: False]
--rev_comp Reverse complement sequence before writing to output file (useful for reverse-orientation reads) [default: False]
-q, --phred_quality_threshold The minimum acceptable Phred quality score (e.g., for Q20 and better, specify -q 20) [default: 3]
--last_bad_quality_char DEPRECATED: use -q instead. This method of setting is not robust to different versions of CASAVA.
--barcode_type The type of barcode used. This can be an integer, e.g. 6 for length 6 barcodes, or golay_12 for golay error-correcting barcodes. Error correction will only be applied for golay_12 barcodes. [default: golay_12]
--max_barcode_errors Maximum number of errors in barcode [default: 1.5]
--phred_offset The ASCII offset to use when decoding phred scores - warning: in most cases you don't need to pass this value [default: determined automatically]
Output:
Demultiplex and quality filter (at Phred Q20) one lane of Illumina fastq data and write results to ./slout_q20:
Demultiplex and quality filter (at Phred Q20) one lane of Illumina fastq data and write results to ./slout_q20, storing trimmed quality scores in addition to sequence data:
Demultiplex and quality filter (at Phred Q20) two lanes of Illumina fastq data and write results to ./slout_q20:
Quality filter (at Phred Q20) one non-multiplexed lane of Illumina fastq data and write results to ./slout_single_sample_q20:
Quality filter (at Phred Q20) two non-multiplexed lanes of Illumina fastq data and write results to ./slout_single_sample_q20:
split_otu_table.py – Split a single OTU table into one OTU table per value in a specified field of the mapping file.
Description:
Usage: split_otu_table.py [options]
Input Arguments:
[REQUIRED]
-i, --otu_table_fp The input otu table
-m, --mapping_fp The mapping file path
-f, --mapping_field Mapping column to split otu table on
-o, --output_dir The output directory
Output: Split otu_table.biom into per-study OTU tables, and store the results in ./per_study_otu_tables/
split_otu_table_by_taxonomy.py – Script to split a single OTU table into multiple tables based on the taxonomy at some user-specified depth.
Description:
Usage: split_otu_table_by_taxonomy.py [options]
Input Arguments:
[REQUIRED]
-i, --input_fp The input otu table in BIOM format
-o, --output_dir The output directory
-L, --level The taxonomic level to split at
[OPTIONAL]
--md_identifier The relevant observation metadata key [default: taxonomy]
--md_as_string Metadata is included as string [default: metadata is included as list]
Output: Split seqs_otu_table.biom into taxon-specific OTU tables based on the third level in the taxonomy, and write the taxon-specific OTU tables to ./L3/
start_parallel_jobs.py – Starts multiple jobs in parallel on multicore or multiprocessor systems.
Description: This script is designed to start multiple jobs in parallel on systems with no queueing system, for example a multiple processor or multiple core laptop/desktop machine. This also serves as an example 'cluster_jobs' script which users can use as a template to define scripts to start parallel jobs in their environment.
Usage: start_parallel_jobs.py [options]
Input Arguments:
[OPTIONAL]
-m, --make_jobs Make the job files [default: None]
-s, --submit_jobs Submit the job files [default: None]
Output: No output is created.
Example: Start each command listed in test_jobs.txt in parallel. The run ID for these jobs will be RUNID.
start_parallel_jobs_sc.py – Starts parallel jobs on Sun GridEngine queueing systems.
Description: Starts multiple jobs in parallel on Sun GridEngine systems. This is designed to work with StarCluster EC2 instances, but may be applicable elsewhere as well.
Usage: start_parallel_jobs_sc.py [options]
Input Arguments:
[OPTIONAL]
-m, --make_jobs Make the job files [default: None]
-s, --submit_jobs Submit the job files [default: None]
-q, --queue_name The queue to submit jobs to [default: all.q]
Output: No output is created.
Job submission example: Start each command listed in test_jobs.txt in parallel. The run ID for these jobs will be RUNID.
Queue specification example: Submit the commands listed in test_jobs.txt to the specified queue.
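A plausible sketch of these two start_parallel_jobs_sc.py invocations, based only on the options listed above and assuming the jobs file test_jobs.txt and run ID RUNID from the example descriptions (the queue name is illustrative):
start_parallel_jobs_sc.py -ms test_jobs.txt RUNID
start_parallel_jobs_sc.py -ms test_jobs.txt -q all.q RUNID
Here -m and -s are combined so that the job files are both created and submitted in a single call.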
start_parallel_jobs_torque.py – Starts multiple jobs in parallel on torque/qsub based multiprocessor systems.
Description: This script is designed to start multiple jobs in parallel on cluster systems with a torque/qsub based scheduling system.
Usage: start_parallel_jobs_torque.py [options]
Input Arguments:
[OPTIONAL]
-m, --make_jobs Make the job files [default: None]
-s, --submit_jobs Submit the job files [default: None]
-q, --queue Name of queue to submit to [default: friendlyq]
-j, --job_dir Directory to store the jobs [default: jobs/]
Output: No output is created.
Job submission example: Start each command listed in test_jobs.txt in parallel. The run ID for these jobs will be RUNID.
Queue specification example: Submit the commands listed in test_jobs.txt to the specified queue.
Jobs output directory specification example: Submit the commands listed in test_jobs.txt, with the jobs put under the my_jobs/ directory.
submit_to_mgrast.py – This script submits a FASTA file to MG-RAST
Description: This script takes a split-library FASTA file and generates individual FASTA files for each sample, then submits each sample FASTA file to MG-RAST, given that the user provides an MG-RAST web-services authorization key and Project ID. To get a web-services authorization key, the user should have an account on MG-RAST. Once logged in, the user can go to their Account Management page and under Preferences they should click 'here', where they will see a Web Services section where they can click on 'generate new key' if they have not already been provided one.
Usage: submit_to_mgrast.py [options]
Input Arguments:
[REQUIRED]
-i, --input_fasta_fp Path to the input fasta file
-w, --web_key_auth The web services authorization key from MG-RAST
-p, --project_id The title to be used for the project
-o, --output_dir Path to the output directory
Output: The resulting directory will contain all of the sample-separated FASTA files, along with an HTML log file, which informs the user of the jobs started on MG-RAST
Example: The user can submit a post-split-library FASTA file, which will be loaded and processed into MG-RAST under the user's account ('-w') and project ('-p'), as follows:
subsample_fasta.py – Randomly subsample sequences from a given fasta file
Description: Subsample the seqs.fna file, randomly selecting 5% of the sequences:
Usage: subsample_fasta.py [options]
Input Arguments:
[REQUIRED]
-i, --input_fasta_fp Path to the input fasta file
-p, --percent_subsample Specify the percentage of sequences to subsample
[OPTIONAL]
-o, --output_fp The output filepath
Output:
Example: Subsample seqs.fasta to approximately 5%
summarize_otu_by_cat.py – Summarize an OTU table by a single column in the mapping file.
Description: Collapse an OTU table based on values in a single column in the mapping file. For example, if you have 10 samples, five of which are from females and five of which are from males, you could use this script to collapse the ten samples into two samples corresponding to their values in a 'Sex' column in your mapping file.
Usage: summarize_otu_by_cat.py [options]
Input Arguments:
[REQUIRED]
-i, --mapping_fp Input metadata mapping filepath [REQUIRED]
-c, --otu_table_fp Input OTU table filepath. [REQUIRED]
-m, --mapping_category Summarize OTU table using this category. The user can also combine columns in the mapping file by separating the categories by “&&” without spaces. [REQUIRED]
-o, --output_fp Output OTU table filepath. [REQUIRED]
[OPTIONAL]
-n, --normalize Normalize OTU counts, where the OTU table columns sum to 1.
Output:
Example: Collapse otu_table.biom on the 'Treatment' column in Fasting_Map.txt and write the resulting OTU table to otu_table_by_treatment.txt
Combine two categories and collapse otu_table.biom on the 'Sex' and 'Age' columns in map.txt and write the resulting OTU table to otu_table_by_sex_and_age.txt
summarize_taxa.py – Summarize taxa and store results in a new table or appended to an existing mapping file.
Description: The summarize_taxa.py script provides summary information of the representation of taxonomic groups within each sample. It takes an OTU table that contains taxonomic information as input. The taxonomic level for which the summary information is provided is designated with the -L option. The meaning of this level will depend on the format of the taxon strings that are returned from the taxonomy assignment step. The taxonomy strings that are most useful are those that standardize the taxonomic level with the depth in the taxonomic strings. For instance, for the RDP classifier taxonomy, Level 2 = Domain (e.g. Bacteria), 3 = Phylum (e.g. Firmicutes), 4 = Class (e.g. Clostridia), 5 = Order (e.g. Clostridiales), 6 = Family (e.g. Clostridiaceae), and 7 = Genus (e.g. Clostridium). By default, the relative abundance of each taxonomic group will be reported, but the raw counts can be returned if -a is passed. By default, taxa summary tables will be output in both classic (tab-separated) and BIOM formats. The BIOM-formatted taxa summary tables can be used as input to other QIIME scripts that accept BIOM files.
Usage: summarize_taxa.py [options]
Input Arguments:
[REQUIRED]
-i, --otu_table_fp Input OTU table filepath [REQUIRED]
[OPTIONAL]
-L, --level Taxonomic level to summarize by. [default: 2,3,4,5,6]
-m, --mapping Input metadata mapping filepath. If supplied, then the taxon information will be added to this file. This option is useful for coloring PCoA plots by taxon abundance or to perform statistical tests of taxon/mapping associations.
--md_identifier The relevant observation metadata key [default: taxonomy]
--md_as_string Metadata is included as string [default: metadata is included as list]
-d, --delimiter Delimiter separating taxonomy levels. [default: ;]
-r, --relative_abundance DEPRECATED: please use -a/--absolute_abundance to disable relative abundance [default: ]
-a, --absolute_abundance If present, the absolute abundance of the lineage in each sample is reported. By default, this script uses relative abundance [default: False]
-l, --lower_percentage If present, OTUs having higher absolute abundance are trimmed. To remove OTUs that make up more than 5% of the total dataset you would pass 0.05. [default: None]
-u, --upper_percentage If present, OTUs having lower absolute abundance are trimmed. To remove OTUs that make up less than 45% of the total dataset you would pass 0.45. [default: None]
-t, --transposed_output If present, the output will be written transposed from the regular output. This is helpful in cases when you want to use Site Painter to visualize your data [default: False]
-o, --output_dir Path to the output directory
--suppress_classic_table_output If present, the classic (TSV) format taxon table will not be created in the output directory. This option is ignored if -m/--mapping is present [default: False]
--suppress_biom_table_output If present, the BIOM-formatted taxon table will not be created in the output directory. This option is ignored if -m/--mapping is present [default: False]
Output: There are two possible output formats depending on whether or not a mapping file is provided with the -m option. If a mapping file is not provided, a table is returned where the taxonomic groups are each in a row and there is a column for each sample. If a mapping file is provided, the summary information will be appended to this file. Specifically, a new column will be made for each taxonomic group, and the relative abundances or raw counts will be added to the existing rows for each sample. The addition of the taxonomic information to the mapping file allows for taxonomic coloration of principal coordinates plots in the 3D viewer. As described in the make_3d_plots.py section, principal coordinates plots can be dynamically colored based on any of the metadata columns in the mapping file. Dynamic coloration of the plots by the relative abundances of each taxonomic group can help to distinguish which taxonomic groups are driving the clustering patterns.
Examples: Summarize taxa at taxonomic levels 2, 3, 4, 5, and 6, and write the resulting taxa tables to the directory “./tax”.
Summarize taxa at taxonomic levels 2, 3, 4, 5, and 6, and write the resulting mapping files to the directory “./tax”.
summarize_taxa_through_plots.py – A workflow script for performing taxonomy summaries and plots
Description: The steps performed by this script are: Summarize OTU by Category (optional, pass -c); Summarize Taxonomy; and Plot Taxonomy Summary
Usage: summarize_taxa_through_plots.py [options]
Input Arguments:
[REQUIRED]
-i, --otu_table_fp The input otu table [REQUIRED]
-o, --output_dir The output directory [REQUIRED]
[OPTIONAL]
-p, --parameter_fp Path to the parameter file, which specifies changes to the default behavior. See #qiime-parameters. [if omitted, default values will be used]
-m, --mapping_fp Path to the mapping file [REQUIRED if passing -c]
-f, --force Force overwrite of existing output directory (note: existing files in output_dir will not be removed) [default: None]
-w, --print_only Print the commands but don't call them; useful for debugging [default: False]
-c, --mapping_category Summarize OTU table using this category. [default: None]
-s, --sort Sort the OTU table [default: False]
Output: The result of this script is a folder (specified by -o) containing taxonomy summary files (at different levels) and a folder containing taxonomy summary plots. Additionally, if a mapping_category is supplied there will be a summarized OTU table. The primary interface for this output is the OUTPUT_DIR/taxa_summary_plots/*html files, which are interactive plots that can be opened in a web browser (see the mouse-overs for interactivity).
Plot taxa summaries for all samples:
Plot taxa summaries on a categorical basis: Alternatively, the user can supply a mapping_category, where the OTU table is summarized based on a sample metadata category:
supervised_learning.py – Run supervised classification using OTUs as predictors and a mapping file category as class labels.
Description: This script trains a supervised classifier using OTUs (or other continuous input sample x observation data) as predictors, and a mapping file column containing discrete values as the class labels.
Outputs: cv_probabilities.txt: the label probabilities for each of the given samples (if available); mislabeling.txt: a convenient presentation of cv_probabilities for mislabeling detection;
confusion_matrix.txt: confusion matrix for hold-out predictions; summary.txt: a summary of the results, including the expected generalization error of the classifier; feature_importance_scores.txt: a list of discriminative OTUs with their associated importance scores (if available).
It is recommended that you remove low-depth samples and rare OTUs before running this script. This can drastically reduce the run-time, and in many circumstances will not hurt performance. It is also recommended to perform rarefaction to control for sampling effort before running this script. For example, to rarefy at depth 200, then remove OTUs present in < 10 samples, run:
single_rarefaction.py -i otu_table_filtered.txt -d 200 -o otu_table_rarefied200.txt
filter_otus_from_otu_table.py -i otu_table_rarefied200.txt -s 10
For an overview of the application of supervised classification to microbiota, see PubMed ID 21039646.
This script also has the ability to collate the supervised learning results produced on an input directory. For example, in order to reduce any variation introduced through producing a rarefied OTU table, the user can run multiple_rarefactions_even_depth.py on the OTU table, and then pass that directory into supervised_learning.py. The user can then pass a -w collate_results filepath to produce a single results file that contains the average estimated generalization error of the classifier and the pooled standard deviation (for the cv5 and cv10 error types).
This script requires that R be installed and in the search path. To install R, see the R project website; once R is installed, run R and execute the command install.packages("randomForest"), then type q() to exit.
Usage: supervised_learning.py [options]
Input Arguments:
[REQUIRED]
-i, --input_data Input data file containing predictors (e.g. otu table) or a directory of otu tables
-m, --mapping_file File containing meta data (response variables)
-c, --category Name of meta data category to predict
-o, --output_dir The output directory
[OPTIONAL]
-f, --force Force overwrite of existing output directory (note: existing files in output_dir will not be removed) [default: None]
--ntree Number of trees in forest (more is better but slower) [default: 500]
-e, --errortype Type of error estimation. Valid choices are: oob, loo, cv5, cv10. oob: out-of-bag, fastest, only builds one classifier, use for quick estimates; cv5: 5-fold cross validation, provides mean and standard deviation of error, use for good estimates on very large data sets; cv10: 10-fold cross validation, provides mean and standard deviation of error, use for best estimates; loo: leave-one-out cross validation, use for small data sets (less than ~30-50 samples) [default: oob]
-w, --collate_results_fp When passing in a directory of OTU tables that are rarefied at an even depth, this option will collate the results into a single specified output file, averaging the estimated errors and standard deviations. [default: None]
Output: Outputs a ranking of features (e.g. OTUs) by importance, an estimation of the generalization error of the classifier, and the predicted class labels and posterior class probabilities according to the classifier.
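As a minimal sketch of the basic invocation described in the examples below, assuming an OTU table otu_table.biom and a mapping file map.txt with a 'Treatment' column (file and column names are illustrative):
supervised_learning.py -i otu_table.biom -m map.txt -c Treatment -o rf_output/
Adding -e cv10 would give the 10-fold cross-validation variant, and --ntree 1000 the larger forest.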
Simple example of random forests classifier:
Running with 10-fold cross-validation for improved estimates of generalization error and feature importances:
Running with 1,000 trees for improved generalization error:
Run 10-fold cross validation on a directory of OTU tables rarefied at an even depth:
Run 10-fold cross validation on a directory of OTU tables rarefied at an even depth and collate the results into a single file:
transform_coordinate_matrices.py – Transform 2 coordinate matrices
Description: This script transforms 2 coordinate matrices (e.g., the output of principal_coordinates.py) using Procrustes analysis to minimize the distances between corresponding points. Monte Carlo simulations can additionally be performed (-r random trials are run) to estimate the probability of seeing an M^2 value as extreme as the actual M^2.
Usage: transform_coordinate_matrices.py [options]
Input Arguments:
[REQUIRED]
-i, --input_fps Comma-separated input files
-o, --output_dir The output directory
[OPTIONAL]
-r, --random_trials Number of random permutations of matrix2 to perform. [default: (no Monte Carlo analysis performed)]
-d, --num_dimensions Number of dimensions to include in output matrices [default: 3]
-s, --sample_id_map_fp Map of original sample ids to new sample ids [default: None]
--store_trial_details Store PC matrices for individual trials [default: False]
Output: Two transformed coordinate matrices corresponding to the two input coordinate matrices, and (if -r was specified) a text file summarizing the results of the Monte Carlo simulations.
Write the transformed Procrustes matrices to file:
Generate transformed Procrustes matrices and Monte Carlo p-values:
tree_compare.py – Compare jackknifed/bootstrapped trees
Description: Compares jackknifed/bootstrapped trees (support trees) with a master tree typically constructed from the entire dataset (e.g., a resulting file from upgma_cluster.py) and outputs support for nodes. If support trees do not have all tips that the master has (e.g., because samples with few sequences were dropped during a jackknifing analysis), the output master tree will have only those tips included in all support trees. If support trees have tips that the master tree does not, those tips will be ignored (removed from the support tree during analysis).
Usage: tree_compare.py [options]
Input Arguments:
[REQUIRED]
-m, --master_tree Master tree filepath
-s, --support_dir Path to dir containing support trees
-o, --output_dir Output directory; writes three files here, and makes the directory if it doesn't exist
Output: The result of tree_compare.py contains the master tree, now with internal nodes uniquely named, a separate bootstrap/jackknife support file, listing the support for each internal node, and a jackknife_named_nodes.tre tree, where internal nodes are named with their support values from 0 to 1.0, for use with tree visualization software (e.g. FigTree).
Example: Given the sample UPGMA tree generated by the user for the entire dataset, the directory of bootstrap/jackknife supported trees (e.g., the resulting directory from upgma_cluster.py) and the directory to write the results for the tree comparisons, the following command compares the support trees with the master:
trflp_file_to_otu_table.py – Convert TRFLP text file to an OTU table
Description: The input for this script is a TRFLP text file.
The output of this script is an OTU table text file that can be used with QIIME for further analysis.
Usage: trflp_file_to_otu_table.py [options]
Input Arguments:
[REQUIRED]
-i, --input_path Input path: TRFLP text file
-o, --output_path Output file: OTU table
Output: You need to pass a TRFLP text file; the script will remove unwanted characters from sample and OTU names, and will pad with zeros as needed.
trim_sff_primers.py – Trim sff primers
Description: Finds the technical read regions for each library, and resets the left trim.
Usage: trim_sff_primers.py [options]
Input Arguments:
[REQUIRED]
-l, --libdir The directory containing per-library sff files
-m, --input_map Path to the input mapping file describing the libraries
[OPTIONAL]
-p, --sfffile_path Path to sfffile binary [default: sfffile]
-q, --sffinfo_path Path to sffinfo binary [default: sffinfo]
--use_sfftools Use the external sffinfo and sfffile programs instead of the equivalent Python implementation.
--debug Print command-line output for debugging [default: False]
Output: This script replaces the original sff files with the trimmed versions.
Simple example: Trim a directory of per-library sff files in sff_dir (-l sff_dir/) using an input map (-m input_map.txt). This script uses the sff utility binaries, which must be in your path.
truncate_fasta_qual_files.py – Generates filtered fasta and quality score files by truncating at the specified base position.
Description: This module is designed to remove regions of poor quality in 454 sequence data. Drops in quality can be visualized with the quality_scores_plot.py module. The base position specified will be used as an index to truncate the sequence and quality scores, and all data at that base position and to the end of the sequence will be removed in the output filtered files.
Usage: truncate_fasta_qual_files.py [options]
Input Arguments:
[REQUIRED]
-f, --fasta_fp Input fasta filepath to be truncated.
-q, --qual_fp Input quality scores filepath to be truncated.
-b, --base_pos Nucleotide position to truncate the fasta and quality score files at.
[OPTIONAL]
-o, --output_dir Output directory. Will be created if it does not exist. [default: .]
Output: Filtered versions of the input fasta and qual files (based on the input names with '_filtered' appended) will be generated in the output directory
Example: Truncate the input fasta and quality files at base position 100, output to the filtered_seqs directory:
truncate_reverse_primer.py – Takes a demultiplexed fasta file, finds a specified reverse primer sequence, and truncates this primer and subsequent sequences following the reverse primer.
Description: Takes an input mapping file and fasta sequences which have already been demultiplexed (via split_libraries.py, denoise_wrapper.py, ampliconnoise.py, etc.) with fasta labels that are in QIIME format, i.e., SampleID_#. This script will use the SampleID and a mapping file with a ReversePrimer column to find the reverse primer by local alignment and remove this and any subsequent sequence in a filtered output fasta file.
Usage: truncate_reverse_primer.py [options]
Input Arguments:
[REQUIRED]
-f, --fasta_fp Fasta file. Needs to have fasta labels in proper demultiplexed format.
-m, --mapping_fp Mapping filepath. ReversePrimer field required. Reverse primers need to be in 5'->3' orientation.
[OPTIONAL]
-o, --output_dir Output directory. Will be created if it does not exist. [default: .]
-z, --truncate_option Truncation option.
The default option, “truncate_only”, will try to find the reverse primer to truncate, and if not found, will write the sequence unchanged. If set to “truncate_remove”, sequences where the reverse primer is not found will not be written. [default: truncate_only]
-M, --primer_mismatches Number of mismatches allowed in the reverse primer. [default: 2]
Output: A truncated version of the input fasta file (based on the input name with 'seqs_rev_primer_truncated' appended) will be generated in the output directory, along with a .log file.
Example: Find and truncate reverse primers in the fasta file seqs.fna, with the SampleIDs and reverse primers specified in Mapping_File_Rev_Primer.txt, and write the output fasta file to the reverse_primer_removed directory:
unweight_fasta.py – Transform fasta files with abundance weighting into unweighted fasta files
Description: E.g. makes 3 fasta records from a weighted input fasta file containing the following record:
>goodsample1_12_3 bc_val=20
AATGCTTGTCACATCGATGC
Usage: unweight_fasta.py [options]
Input Arguments:
[REQUIRED]
-i, --input_fasta The input fasta file
-o, --output_file The output fasta filepath
-l, --label Sequence label used for all records. Fasta label lines will look like: >label_423
Output: a .fasta file
Example: make 3 fasta records from the following record:
>goodsample1_12_3 bc_val=20
AATGCTTGTCACATCGATGC
resulting in:
>goodsample_0
AATGCTTGTCACATCGATGC
>goodsample_1
AATGCTTGTCACATCGATGC
>goodsample_2
AATGCTTGTCACATCGATGC
upgma_cluster.py – Build a UPGMA tree comparing samples
Description: In addition to using PCoA, it can be useful to cluster samples using UPGMA (Unweighted Pair Group Method with Arithmetic mean, also known as average linkage). As with PCoA, the input to this step is a distance matrix (i.e. a resulting file from beta_diversity.py).
Usage: upgma_cluster.py [options]
Input Arguments:
[REQUIRED]
-i, --input_path Input path: directory for batch processing, filename for single file operation
-o, --output_path Output path: directory for batch processing, filename for single file operation
Output: The output is a newick formatted tree compatible with most standard tree viewing programs. Batch processing is also available, allowing the analysis of an entire directory of distance matrices.
UPGMA Cluster (Single File): To perform UPGMA clustering on a single distance matrix (e.g., beta_div.txt, a result file from beta_diversity.py) use the following idiom:
UPGMA Cluster (Multiple Files): The script also functions in batch mode if a folder is supplied as input. This script operates on every file in the input directory and creates a corresponding UPGMA tree file in the output directory, e.g.:
validate_demultiplexed_fasta.py – Checks a fasta file to verify if it has been properly demultiplexed, i.e., it is in QIIME compatible format.
Description: Checks that the file is a valid fasta file, does not contain gaps ('.' or '-' characters), contains only valid nucleotide characters, that no fasta label is duplicated, that SampleIDs match those in a provided mapping file, that fasta labels are formatted to have SampleID_X as normally generated by QIIME demultiplexing, and that the BarcodeSequence/LinkerPrimerSequences are not found in the fasta sequences. Optionally this script can also verify that the SampleIDs in the fasta sequences are also present in the tip IDs of a provided newick tree file, can test for equal sequence lengths across all sequences, and can test that all SampleIDs in the mapping file are represented in the fasta file labels.
Usage: validate_demultiplexed_fasta.py [options]
Input Arguments:
[REQUIRED]
-m, --mapping_fp Name of mapping file. NOTE: Must contain a header line indicating SampleID in the first column and BarcodeSequence in the second, LinkerPrimerSequence in the third. If no barcode or linker primer sequence is present, leave the data fields empty.
-i, --input_fasta_fp Path to the input fasta file
[OPTIONAL]
-o, --output_dir Directory prefix for output files [default: .]
-t, --tree_fp Path to the tree file; needed to test if sequence IDs are a subset or exact match to the tree tips, options -s and -e [default: None]
-s, --tree_subset Determine if sequence IDs are a subset of the tree tips; a newick tree must be passed with the -t option. [default: False]
-e, --tree_exact_match Determine if sequence IDs are an exact match to tree tips; a newick tree must be passed with the -t option. [default: False]
-l, --same_seq_lens Determine if sequences are all the same length. [default: False]
-a, --all_ids_found Determine if all SampleIDs provided in the mapping file are represented in the fasta file labels. [default: False]
Output:
Example:
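A minimal sketch of such a validation run, based on the options listed above and assuming a mapping file Mapping_File.txt and a demultiplexed fasta file seqs.fna (file names are illustrative):
validate_demultiplexed_fasta.py -m Mapping_File.txt -i seqs.fna -o validate_output/
Adding -t rep_set.tre -s would additionally check that the sequence IDs are a subset of the tree tips.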