“ How much intelligence does one need to sneak upon lettuce? ” - Solomon Short This 'book' is actually the result of an exercise in self-defense. It contains texts from several years of help files, mails, postings, questions, answers etc.pp concerning MIRA and assembly projects one can do with it. I never really intended to push MIRA. It started out as a PhD thesis and I subsequently continued development when I needed something to be done which other programs couldn't do at the time.
But MIRA has always been available as binary on the Internet since 1999. And as Open Source since 2007. Somehow, MIRA seems to have caught the attention of more than just a few specialised sequencing labs and over the years I've seen an ever growing number of mails in my inbox and on the MIRA mailing list. Both from people having been 'since ever' in the sequencing business as well as from labs or people just getting their feet wet in the area. The help files - and through them this book - sort of reflect this development. Most of the chapters contain both very specialised topics as well as step-by-step walk-throughs intended to help people to get their assembly projects going. Some parts of the documentation are written in a decidedly non-scientific way.
Please excuse, time for rewriting mails somewhat lacking, some texts were re-used almost verbatim. The last few years have seen tremendous change in the sequencing technologies and MIRA 4 reflects that: core data structures and routines had to be thrown overboard and replaced with faster and/or more versatile versions suited for the broad range of technologies and use-cases I am currently running MIRA with. Nothing is perfect, and both MIRA and this documentation (even if it is rather pompously called Definitive Guide) are far from it. If you spot an error either in MIRA or this manual, feel free to report it. Or, even better, correct it if you can.
At least with the manual files it should be easy: they're basically just some decorated text files. I hope that MIRA will be as useful to you as it has been to me. Have a lot of fun with it. Rheinfelden, February 2014 Bastien Chevreux.
MIRA assembles reads gained by electrophoresis sequencing (aka Sanger sequencing), 454 pyro-sequencing (GS20, FLX or Titanium), Ion Torrent, Solexa (Illumina) sequencing and (in development) Pacific Biosciences sequencing into contiguous sequences (called contigs). One can use the sequences of different sequencing technologies either in a single assembly run (a true hybrid assembly), or by mapping one type of data to an assembly of another sequencing type (a semi-hybrid assembly (or mapping)), or by mapping data against consensus sequences of other assemblies (a simple mapping).
The MIRA acronym stands for Mimicking Intelligent Read Assembly and the program pretty well does what its acronym says (well, most of the time anyway). It is the Swiss army knife of sequence assembly that I've used and developed during the past 14 years to get assembly jobs I work on done efficiently - and especially accurately. That is, without me actually putting too much manual work into it. Over time, other labs and sequencing providers have found MIRA useful for assembly of extremely 'unfriendly' projects containing lots of repetitive sequences.
As always, your mileage may vary.

Parts worth reading first: the part with the MIRA quick tour; the part which gives a quick overview for which data sets to use MIRA and for which not; the part which showcases different features of MIRA (lots of screen shots!); and where and how to get help if things don't work out as you expected.

After that, reading should depend on the type of data you intend to work with: there are specific chapters for assembly of de-novo, of mapping and of EST / RNASeq projects. They all contain an overview on how to define your data and how to launch MIRA for these data sets. There is also a chapter on how to prepare data sets from specific sequencing technologies. The chapter on working with results of MIRA should again be of general interest to everyone. It describes the structure of output directories and files and gives first pointers on what to find where. Also, converting results into different formats - with and without filtering for specific needs - is covered there. As the previously cited chapters are more introductory in their nature, they do not go into the details of MIRA parametrisation.
While MIRA has a comprehensive set of standard settings which should be suited for a majority of assembly tasks, there are more than 150 switches / parameters with which one can fine tune almost every aspect of an assembly. A complete description for each and every parameter and how to correctly set parameters for different use cases and sequencing technologies can be found in the reference chapter. As not every assembly project is simple, there is also a chapter with tips on how to deal with projects which turn out to be 'hard.' It certainly helps if you at least skim through it even if you do not expect to have problems with your data. It contains a couple of tricks on what one can see in result files as well as in temporary and log files which are not explained elsewhere. MIRA comes with a number of additional utilities which are described in their own chapter. While the purpose of miraconvert should be quite clear quite quickly, the versatility of use cases for mirabait might surprise more than one.
Be sure to check it out. As from time to time some general questions on sequencing are popping up on the MIRA talk mailing list, I have added a chapter with some general musings on what to consider when going into sequencing projects. This should be in no way a replacement for an exhaustive talk with a sequencing provider, but it can give a couple of hints on what to take care of. There is also a FAQ chapter with some of the more frequently asked questions which popped up in the past few years. Finally, there are also chapters covering some more technical aspects of MIRA: the MAF format and the structure / content of the tmp directory have their own chapters.

Complete walkthroughs are lacking at the moment for MIRA 4. In the MIRA 3 manual I had them, but so many things have changed (at all levels: MIRA, the sequencing technologies, data repositories) that I did not have time to update them. I probably will need quite some time to write new ones. Feel free to send me some if you are inclined to help fellow scientists.
The MIRA quick tour

Input can be in various formats like Staden experiment (EXP), Sanger CAF, FASTA, FASTQ or PHD files. Ancillary data containing additional information helpful to the assembly, as is contained in, e.g., NCBI traceinfo XML files or Staden EXP files, is also honoured. If present, base qualities in phred style and SCF signal electrophoresis trace files are used to adjudicate between or even correct contradictory stretches of bases in reads by either the integrated automatic EdIt editor (written by Thomas Pfisterer) or the assembler itself. MIRA was conceived especially with the problem of repeats in genomic data and SNPs in transcript (EST / RNASeq) data in mind. Considerable effort was made to develop a number of strategies - ranging from standard clone-pair size restrictions to discovery and marking of base positions discriminating the different repeats / SNPs - to ensure that repetitive elements are correctly resolved and that misassemblies do not occur.
The resulting assembly can be written in different standard formats like CAF, Staden GAP4 directed assembly, ACE, HTML, FASTA, simple text or transposed contig summary (TCS) files. These can easily be imported into numerous finishing tools or further evaluated with simple scripts. The aim of MIRA is to build the best possible assembly by:

- having a more or less full overview on the whole project at any time of the assembly, i.e., knowledge of almost all possible read-pairs in a project,
- using high confidence regions (HCRs) of several aligned read-pairs to start contig building at a good anchor point of a contig, extending clipped regions of reads on a 'can be justified' basis,
- using all available data present at the time of assembly, i.e., instead of relying on sequence and base confidence values only, the assembler will profit from trace files containing electrophoresis signals, tags marking possible special attributes of DNA, information on specific insert sizes of read-pairs etc.,
- having 'intelligent' contig objects accept or refuse reads based on the rate of unexplainable errors introduced into the consensus,
- learning from mistakes by discovering and analysing possible repeats differentiated only by single nucleotide polymorphisms. The important bases for discriminating different repetitive elements are tagged and used as new information,
- using the possibility given by the integrated automatic editor to correct errors present in contigs (and subsequently) reads by generating and verifying complex error hypotheses through analysis of trace signals in several reads covering the same area of a consensus,
- iteratively extending reads (and subsequently) contigs based on.

Genome mapping

As the complexity of mapping is a lot lower than de-novo, one can basically double (perhaps even triple) the number of reads compared to 'de-novo'. The limiting factor will be the amount of RAM though, and MIRA will also need lots of it if you go into eukaryotes. The main limiting factor regarding time will be the number of reference sequences (backbones) you are using.
MIRA being pedantic during the mapping process, it might be a rather long wait if you have more than 500 to 1000 reference sequences.

ESTs / RNASeq

The default values for MIRA should allow it to work with many EST and RNASeq data sets, sometimes even from non-normalised libraries. For extreme coverage cases however (like, something with a lot of cases at and above 10k coverage), one would perhaps want to resort to data reduction routines before feeding the sequences to MIRA. On the other hand, recent developments of MIRA were targeted at making de-novo RNASeq assembly of non-normalised libraries liveable, and indeed I now regularly use MIRA for data sets with up to 50 million Illumina 100bp reads.
MIRA learns to discern non-perfect repeats, leading to better assemblies

MIRA is an iterative assembler (it works in several passes) and acts a bit like a child when exploring the world: it explores the assembly space and is specifically parameterised to allow a couple of assembly errors during the first passes. But after each pass some routines (the 'parents', if you like) check the result, search for assembly errors and deduce knowledge about specific assemblies MIRA should not have ventured into. MIRA will then prevent these errors from re-occurring in subsequent passes. As an example, consider the following multiple alignment.
With the introduction of 454 reads, MIRA also got in 2007 specialised editors to search and correct for typical 454 sequencing problems like the homopolymer run over-/undercalls. These editors are now integrated into MIRA itself and are not part of EdIt anymore.
While not being paramount to the assembly quality, both editors provide additional layers of safety for the MIRA learning algorithm to discern non-perfect repeats even on a single base discrepancy. Furthermore, the multiple alignments generated by these two editors are way more pleasant to look at (or automatically analyse) than the ones containing all kinds of gaps, insertions, deletions etc.pp.

MIRA lets you see why contigs end where they end

A very useful feature for finishing are hash frequency (HAF) tags which MIRA sets in the assembly. Provided your finishing editor understands those tags (gap4, gap5 and consed are fine but there may be others), they'll give you precious insight where you might want to be cautious when joining two contigs or where you would need to perform some primer walking. MIRA colourises the assembly with the HAF tags to show repetitiveness.
You will need to read about the HAF tags in the reference manual, but in a nutshell: the HAF5, HAF6 and HAF7 tags tell you that you potentially have repetitive to very repetitive read areas in the genome, while HAF2 tags will tell you that these areas in the genome have not been covered as well as they should have been. As an example, the following figure shows the coverage of a contig. The question is now: why did MIRA stop building this contig on the left end (left oval) and why on the right end (right oval)? Looking at the HAF tags in the contig, the answer becomes quickly clear: the left contig end has HAF5 tags in the reads (shown in bright red in the following figure). This tells you that MIRA stopped because it probably could not unambiguously continue building this contig. Indeed, if you BLAST the sequence at the NCBI, you will find out that this is an rRNA area of a bacterium, of which bacteria normally have several copies in the genome.

MIRA tags problematic decisions in hybrid assemblies

Many people combine Sanger & 454 - or nowadays more 454 & Solexa - to improve the sequencing quality of their project through two (or more) sequencing technologies.
To reduce time spent in finishing, MIRA automatically tags those bases in a consensus of a hybrid assembly where reads from different sequencing technologies severely contradict each other. The following example shows a hybrid 454 / Solexa assembly where reads from 454 (highlighted read names in following figure) were not sure whether to have one or two 'G' at a certain position. The consensus algorithm would have chosen 'two Gs' for 454, obviously a wrong decision as all Solexa reads at the same spot (the reads which are not highlighted) show only one 'G' for the given position.
While MIRA chose to believe Solexa in this case, it tagged the position anyway in case someone chooses to check these kinds of things.

MIRA allows older finishing programs to cope with the amount of data in Solexa mapping projects

Quality control is paramount when you do mutation analysis for biologists: I know they'll be on my doorstep the very next minute they find out one of the SNPs in the resequencing data wasn't a SNP, but a sequencing artefact. And I can understand them: why should they invest - per SNP - hours in the wet lab if I can invest a couple of minutes to get them data with false negative rates (and false discovery rates) way below 1%? So, finishing and quality control for any mapping project is a must.
Both gap4 and consed start to have a couple of problems when projects have millions of reads: you need lots of RAM and scrolling around the assembly becomes a test of your patience. Still, these two assembly finishing programs are amongst the better ones out there, although gap5 is quickly arriving at a state in which it can substitute for gap4. So, MIRA reduces the number of reads in Solexa mapping projects without sacrificing information on coverage. The principle is pretty simple: for 100% matching reads, MIRA tracks coverage of every reference base and creates long synthetic, coverage equivalent reads (CERs) in exchange for the Solexa reads. Reads that do not match 100% are kept as own entities, so that no information gets lost.
The following figure illustrates this. Coverage equivalent reads (CERs) explained. Left side of the figure: a conventional mapping with eleven reads of size 4 against a consensus (in uppercase). The inversed base in the lowest read depicts a sequencing error. Right side of the figure: the same situation, but with coverage equivalent reads (CERs). Note that there are fewer reads, but no information is lost: the coverage of each reference base is equivalent to the left side of the figure and reads with differences to the reference are still present.
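The CER principle can be illustrated with a small Python sketch. This is not MIRA's implementation: the function name, the (start, sequence) read representation and the layer-by-layer strategy for building the synthetic reads are assumptions made purely for illustration. The sketch counts coverage from reads that match the reference 100%, replaces them with long reads yielding the same per-base coverage, and keeps non-matching reads untouched.

```python
def coverage_equivalent_reads(reference, reads):
    """Illustrative sketch only (not MIRA code).

    reads: list of (start, sequence) tuples mapped against `reference`.
    Returns (cers, kept, coverage):
      cers     - synthetic reads reproducing the coverage of all 100% matches
      kept     - reads with at least one difference, kept as own entities
      coverage - per-base coverage contributed by the 100% matching reads
    """
    coverage = [0] * len(reference)
    kept = []
    for start, seq in reads:
        if reference[start:start + len(seq)] == seq:
            # 100% match: only its contribution to the coverage is recorded
            for pos in range(start, start + len(seq)):
                coverage[pos] += 1
        else:
            # mismatching read (sequencing error or true SNP): keep it
            kept.append((start, seq))

    # Peel the coverage off one layer at a time: every maximal run of
    # still-covered positions becomes one long synthetic read.
    remaining = list(coverage)
    cers = []
    while any(remaining):
        pos = 0
        while pos < len(remaining):
            if remaining[pos] == 0:
                pos += 1
                continue
            run_start = pos
            while pos < len(remaining) and remaining[pos] > 0:
                remaining[pos] -= 1
                pos += 1
            cers.append((run_start, reference[run_start:pos]))
    return cers, kept, coverage


reference = "ACGTACGTAC"
reads = [(0, "ACGT"), (1, "CGTA"), (2, "GTAC"), (4, "ACGT"), (6, "GTAC"),
         (3, "TAGG")]  # the last read differs from the reference
cers, kept, coverage = coverage_equivalent_reads(reference, reads)
print(len(cers), "CERs replace", len(reads) - len(kept), "matching reads")
```

Summing the coverage of the returned CERs gives back exactly the coverage of the original matching reads, which is the property the figure describes.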
This strategy is very effective in reducing the size of a project. As an example, in a mapping project with 9 million Solexa 36mers, MIRA created a project with 1.7m reads: 700k CER reads representing 8 million 100% matching Solexa reads, and it kept 950k mapped reads as they had ≥ 1 mismatch (be it sequencing error or true SNP) to the reference. A reduction of 80%, and numbers for mapping projects with Solexa 100bp reads are in a similar range. Also, mutations of the resequenced strain now really stand out in the assembly viewer as the following figure shows.
MIRA tags SNPs and other features, outputs result files for biologists

Want to assemble two or several very closely related genomes without reference, but finding SNPs or differences between them? Tired of looking at some text output from mapping programs and guessing whether a SNP is really a SNP or just some random junk? MIRA tags all SNPs (and other features like missing coverage etc.) it finds so that - when using a finishing viewer like gap4 or consed - one can quickly jump from tag to tag and perform quality control. This works both in de-novo assembly and in mapping assembly, all MIRA needs is the information which read comes from which strain. The following figure shows a mapping assembly of Solexa 36mers against a bacterial reference sequence, where a mutant has an indel position in a gene.

- Extensive possibilities to clip data if needed: by quality, by masked bases, by A/T stretches, by evidence from other reads.
- Routines to re-extend reads into clipped parts if multiple alignment allows for it.
- Read in ancillary data in different formats: EXP, NCBI TRACEINFO XML, SSAHA2, SMALT result files and text files.
- Detection of chimeric reads.
- Pipeline to discover SNPs in ESTs from different strains (miraSearchESTSNPs).
- Support for many different input and output formats (FASTA, EXP, FASTQ, CAF, MAF, ...).
- Automatic memory management (when RAM is tight).
- Over 150 parameters to tune the assembly for a lot of use cases, many of these parameters being tunable individually depending on the sequencing technology they apply to.

Versions

There are two kinds of versions for MIRA that can be compiled from source files: production and development. Production versions are from the stable branch of the source code. These versions are available for download from SourceForge. Development versions are from the development branch of the source tree.
These are also made available to the public and should be compiled by users who want to test out new functionality or to track down bugs or errors that might arise at a given location. Release candidates (rc) also fall into the development versions: they are usually the last versions of a given development branch before being folded back into the production branch.

MIRA

MIRA has been put under the GPL version 2. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA, USA. You may also visit the Open Source Initiative for a copy of this licence.
Getting help / Mailing lists / Reporting bugs Please try to find an answer to your question by first reading the documents provided with the MIRA package (FAQs, READMEs, usage guide, guides for specific sequencing technologies etc.). It's a lot, but then again, they hopefully should cover 90% of all questions.
If you have a tough nut to crack or simply could not find what you were searching for, you can subscribe to the MIRA talk mailing list and send in your question (or comment, or suggestion), see for more information on that. Now that the number of subscribers has reached a good level, there's a fair chance that someone could answer your question before I have the opportunity or while I'm away from mail for a certain time.
Note Please very seriously consider using the mailing list before mailing me directly. Every question which can be answered by participants of the list is time I can invest in development and documentation of MIRA. I have a day job as bioinformatician which has nothing to do with MIRA and after work hours are rare enough nowadays. Furthermore, Google indexes the mailing list and every discussion / question asked on the mailing list helps future users as they show up in Google searches. Only mail me directly ([email protected]) if you feel that there's some information you absolutely do not want to share publicly.
Note Subscribing to the list before sending mails to it is necessary as messages from non-subscribers will be stopped by the system to keep the spam level low. To report bugs or ask for new features, please use the SourceForge ticketing system at:. This ensures that requests do not get lost and you get the additional benefit to automatically know when a bug has been fixed as I will not send separate emails, that's what bug trackers are there for. Finally, new or intermediate versions of MIRA will be announced on the separate MIRA announce mailing list. Traffic is very low there as the only one who can post there is me.
Subscribe if you want to be informed automatically on new releases of MIRA.

For miraversion, the stable versions of MIRA with the general public as audience usually have a version number in three parts, like 3.0.5, sometimes also followed by some postfix like in 3.2.0rc1 to denote release candidate 1 of the 3.2.0 version of MIRA. On very rare occasions, stable versions of MIRA can have four parts like in, e.g., 3.4.0.1: these versions create identical binaries to their parent version (3.4.0) and just contain fixes to the source build machinery. The version string sometimes can have a different format: sometext-0-gsomehexnumber like in, e.g., ftfastercontig-0-g4a27c91. These versions of MIRA are snapshots from the development tree of MIRA and usually contain new functionality which may not be as well tested as the rest of MIRA, hence they contain more checks and more debugging output to catch potential errors. OS-and-binarytype finally defines for which operating system and which processor class the package is destined.
E.g., linux-gnu_x86_64_static contains static binaries for Linux running on a 64 bit processor. Source packages are usually named mira-miraversion.tar.bz2. Examples for packages at SourceForge.
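The version string conventions described above can be told apart mechanically. The following Python sketch is only an illustration: the regular expressions, the function name and the labels it returns are assumptions derived from the four example strings in the text, not an official specification of MIRA version strings.

```python
import re

# Three- or four-part numeric versions with an optional postfix such as "rc1".
NUMERIC = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:\.(\d+))?([A-Za-z]\w*)?$")
# Development snapshots of the form sometext-0-gsomehexnumber.
SNAPSHOT = re.compile(r"^(?P<label>.+)-(?P<count>\d+)-g(?P<hash>[0-9a-f]+)$")


def classify_version(version):
    """Return a dict describing a MIRA-style version string (hypothetical helper)."""
    snap = SNAPSHOT.match(version)
    if snap:
        return {"kind": "snapshot", "label": snap.group("label"),
                "hash": snap.group("hash")}
    num = NUMERIC.match(version)
    if num:
        major, minor, patch, build, postfix = num.groups()
        return {
            "kind": "four-part" if build is not None else "three-part",
            "parent": f"{major}.{minor}.{patch}",  # e.g. 3.4.0 for 3.4.0.1
            "postfix": postfix,                    # e.g. "rc1", or None
        }
    return {"kind": "unknown"}


for v in ("3.0.5", "3.2.0rc1", "3.4.0.1", "ftfastercontig-0-g4a27c91"):
    print(v, "->", classify_version(v))
```

Snapshots are checked first because their free-text label could otherwise be mistaken for something else; anything matching neither pattern is reported as unknown rather than guessed at.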
Installing from a precompiled binary package Download the package, unpack it. Inside, there is - beside other directories - a bin. Copy or move the files and soft-links inside this directory to a directory in your $PATH variable.
Additional scripts for special purposes are in the scripts directory. You might or might not want to have them in your $PATH. Scripts and programs for MIRA from other authors are in the 3rdparty directory.
Here too, you may or may not want to have (some of) them in your $PATH.

Integration with third party programs (gap4, consed)

MIRA sets tags in the assemblies that can be read and interpreted by the Staden gap4 package or consed. These tags are extremely useful to efficiently find places of interest in an assembly (be it de-novo or mapping), but both gap4 and consed need to be told about these tags. Data files for a correct integration are delivered in the support directory of the distribution. Please consult the README in that directory for more information on how to integrate this information in either of these packages.

gcc ≥ 4.6.2, with libstdc++6.
You really want to use a simple installation package pre-configured for your system, but in case you want or have to install gcc yourself, please refer to for more information on the GNU compiler collection. BOOST library ≥ 1.46.
Lower versions might work, but untested. You would need to change the checking in the configure script for this to run through. You really want to use a simple installation package pre-configured for your system, but in case you want or have to install BOOST yourself, please refer to for more information on the BOOST library. Warning Do NOT use a so called staged BOOST library, only a fully installed library will work at the moment. zlib. Should your system not have zlib installed or available as simple installation package, please see for more information regarding zlib. Should your system not have gmake installed or available as simple installation package, please see for more information regarding GNU make.
GNU flex ≥ 2.5.33. Should your system not have flex installed or available as simple installation package, please see for more information regarding flex. Expat library ≥ 2.0.1. Should your system not have the Expat library and header files already installed or available as simple installation package, you will need to download and install it yourself.
Please see and for information on how to do this. A small utility from the vim package. TCmalloc library ≥ 1.6.
Not a prerequisite per se, but highly recommended: MIRA will also work without it, but memory requirements may then be a lot higher (40% and more). TCmalloc is part of the Google perftools library, version 1.6 or higher; lower versions might work, but are untested. Should your system not have the perftools library and header files already installed or available as simple installation package, you will need to download and install it yourself. Note that Google perftools itself needs libunwind.

For building the documentation, additional prerequisites are from the DocBook tool chain.
Warning: As of MIRA 3.9.0, support for 32 bit platforms is being slowly phased out. While MIRA should compile and also run fine on 32 bit platforms, I do not guarantee it anymore as I haven't used 32 bit systems in the last 5 years.

--enable-warnings
    Enables compiler warnings, useful only for developers, not for users.
--enable-debug
    Lets the MIRA binary contain C/C++ debug symbols.
--enable-mirastatic
    Builds static binaries which are easier to distribute. Some platforms (like OpenSolaris) might not like this and you will get an error from the linker.
--enable-optimisations
    Instructs the configure script to set optimisation switches for compiling (on by default). Switching optimisations off (warning, high impact on run-time) might be interesting only for, e.g., debugging with valgrind.
--enable-publicquietmira
    Some parts of MIRA can dump additional debug information during assembly, setting this switch to 'no' performs this. Warning: MIRA will be a bit chatty, using this is not recommended for public usage.
--enable-developmentversion
    Using MIRA with enabled development mode may lead to extra output on stdout as well as some additional data in the results which should not appear in real world data.
--enable-boundtracking
--enable-bugtracking
    Both flags above compile some basic checks into mira that look for sanity within some functions. Leaving this on 'yes' (default) is encouraged, impact on run time is minimal.

(K)Ubuntu 12.04

You will need to install a couple of tools and libraries before compiling MIRA. Here's the recipe:

    sudo apt-get install make flex libgoogle-perftools-dev
    sudo apt-get install libboost-doc libboost.1.48-dev libboost.1.48.0

Once this is done, you can unpack and compile MIRA. For a dynamically linked version, use:

    tar xvjf mira-4.0.0.tar.bz2
    cd mira-4.0.0
    ./configure
    make && make install

For a statically linked version, just change the configure line from above into

    ./configure --enable-mirastatic

In case you also want to build documentation yourself, you will need this in addition:

    sudo apt-get install xsltproc docbook-xsl dblatex
OpenSUSE 12.1

You will need to install a couple of tools and libraries before compiling MIRA. Here's the recipe:

    sudo zypper install gcc-c++ boost-devel
    sudo zypper install flex libexpat-devel google-perftools-devel zlib-devel

Once this is done, you can unpack and compile MIRA. For a dynamically linked version, use:

    tar xvjf mira-4.0.0.tar.bz2
    cd mira-4.0.0
    ./configure
    make && make install

For a statically linked version you will need to compile and install the Google perftools library yourself as the package delivered by Fedora contains only dynamic libraries. In case you also want to build documentation yourself, you will need this in addition:

    sudo zypper install docbook-xsl-stylesheets dblatex

Fedora 17

You will need to install a couple of tools and libraries before compiling MIRA. Here's the recipe:

    sudo yum -y install gcc-c++ boost-devel
    sudo yum install flex expat-devel google-perftools-devel vim-common zlib-devel

Once this is done, you can unpack and compile MIRA. For a dynamically linked version, use:

    tar xvjf mira-4.0.0.tar.bz2
    cd mira-4.0.0
    ./configure
    make && make install

For a statically linked version you will need to compile and install the Google perftools library yourself as the package delivered by Fedora contains only dynamic libraries.
In case you also want to build documentation yourself, you will need this in addition:

    sudo yum -y install docbook-xsl dblatex

Compile everything from scratch

This lets you build a self-contained static MIRA binary. The only prerequisite here is that you have a working gcc ≥ 4.6.2. Please download all necessary files (expat, flex, etc.pp) and then simply follow the script below. The only things that you will want to change are the path used and, maybe, the name of some packages in case they were bumped up a version or revision.
Contributed by Sven Klages. Download and install a current XCode. Download and compile a current GCC (≥ 4.8.2). Do NOT use a GCC from MacPorts, it lacks a vitally important library (libstdc++.a).
Download, compile with GCC and install a current BOOST library. Download, compile with GCC and install all libraries MIRA needs (flex, etc.pp). Follow the directions given in and. Download the MIRA source package and unpack it. In the unpacked MIRA directory, create a directory called OSXstatlibs. Into this directory, you need to softlink all needed static libraries from GCC, BOOST, flex, etc.pp. E.g., I have GCC installed in /opt/localwgcc48/ and therefore I need to use the following to link GCC static libraries:

    $ ln -s /opt/localwgcc48/lib/*.a .

I have all the other libraries (BOOST, flex, etc.pp) installed in /opt/biosw/, therefore I also need to link these libraries:

    $ ln -s /opt/biosw/lib/*.a .

Run ./configure --enable-mirastatic, followed by any additional configure parameters needed, and then run make.
NetBSD 5 (i386) Contributed by Thomas Vaughan The system flex (/usr/bin/flex) is too old, but the devel/flex package from a recent pkgsrc works fine. BSD make doesn't like one of the lines in src/progs/Makefile, so use GNU make instead (available from pkgsrc as devel/gmake). Other relevant pkgsrc packages: devel/boost-libs, devel/boost-headers and textproc/expat.
The configure script has to be told about these pkgsrc prerequisites (they are usually rooted at /usr/pkg but other locations are possible):

    FLEX=/usr/pkg/bin/flex ./configure --with-expat=/usr/pkg --with-boost=/usr/pkg

If attempting to build a pkgsrc package of MIRA, note that the LDFLAGS passed by the pkgsrc mk files don't remove the need for the --with-boost option. The configure script complains about flex being too old, but this is harmless because it honours the $FLEX variable when writing out makefiles.
- mira for assembly of genomic data as well as assembly of EST data from one or multiple strains / organisms, and
- miraSearchESTSNPs for assembly of EST data from different strains (or organisms) and SNP detection within this assembly. This is the former miraEST program which was renamed as many people got confused regarding whether to use MIRA in est mode or miraEST.

Note that miraSearchESTSNPs is usually realised as a link to the mira executable; the executable decides by the name it was called with which module to start.

The manifest file: introduction

A manifest file can be seen as a two part configuration file for an assembly: the first part contains some general information while the second part contains information about the sequencing data to be loaded.
project = tells the assembler the name you wish to give to the whole assembly project. MIRA will use that name throughout the whole assembly for naming directories, files and a couple of other things.
You can name the assembly anyway you want, you should however restrain yourself and use only alphanumeric characters and perhaps the characters plus, minus and underscore. Using slashes or backslashes here is a recipe for catastrophe.
job = tells the assembler what kind of data it should expect and how it should assemble it. You need to make your choice mainly in three steps and in the end concatenate your choices to the job= entry of the manifest. are you building an assembly from scratch (choose: denovo) or are you mapping reads to an existing backbone sequence (choose: mapping)? Leaving this out automatically chooses denovo as default.
are the data you are assembling forming a larger contiguous sequence (choose: genome) or are you assembling small fragments like in EST or mRNA libraries (choose: est)? Leaving this out automatically chooses genome as default. do you want a quick and dirty assembly for first insights (choose: draft) or an assembly that should be able to tackle even most nasty cases (choose: accurate)? Leaving this out automatically chooses accurate as default. Once you're done with your choices, concatenate everything with commas and you're done. E.g.: 'job=mapping,genome,draft' will give you a mapping assembly of a genome in draft quality. Note: For de-novo assembly of genomes, these switches are optimised for 'decent' coverages that are commonly seen to get you something useful, i.e., ≥ 7x for Sanger, ≥ 18x for 454 FLX or Titanium, ≥ 25x for 454 GS20 and ≥ 30x for Solexa.
Should you venture into lower coverage or extremely high coverage (say, ≥ 60x for 454), you will need to adapt a few parameters via extended switches.

parameters = is used in case you want to change one of the 150+ extended parameters MIRA has to offer to control almost every aspect of an assembly. This is described in more detail in a separate section below.

Understanding readgroups and DNA templates

When you send away your DNA for sequencing, it is going to be prepared for sequencing according to your wishes. Sequencing providers call this 'constructing a library' and regardless of whether you sequence with Sanger, 454, Illumina, Ion Torrent, Pacific Biosciences or other technologies, the 'library prep' is always there. With most library preps, your DNA is first amplified and then cut into small pieces. These pieces are called templates and their length can be anywhere between a few dozen bases, a few hundred bases or even a couple of dozen or even hundred kilobases.
The important thing is that these templates can be much bigger in s.
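As an illustration of the general part of a manifest described earlier (the project and job entries), here is a minimal Python sketch. The helper names are hypothetical and nothing here is part of MIRA itself; it only encodes the rules stated in the text: project names restricted to alphanumeric characters plus '+', '-' and '_', and a job entry concatenated with commas from three choices whose defaults are denovo, genome and accurate. The sketch writes the defaults out explicitly, which per the text is equivalent to leaving them out.

```python
import re

# Allowed choices per step, default first (taken from the text above).
JOB_STEPS = {
    "mode": ("denovo", "mapping"),
    "data": ("genome", "est"),
    "quality": ("accurate", "draft"),
}


def is_safe_project_name(name):
    """True if name uses only alphanumerics, plus, minus and underscore."""
    return re.fullmatch(r"[A-Za-z0-9+_-]+", name) is not None


def build_job(mode="denovo", data="genome", quality="accurate"):
    """Concatenate the three job choices with commas."""
    chosen = {"mode": mode, "data": data, "quality": quality}
    for step, value in chosen.items():
        if value not in JOB_STEPS[step]:
            raise ValueError(f"{value!r} is not a valid choice for {step}")
    return ",".join(chosen.values())


def manifest_header(project, **job_choices):
    """Return the general part of a manifest: project and job entries only."""
    if not is_safe_project_name(project):
        raise ValueError(f"unsafe project name: {project!r}")
    return f"project = {project}\njob = {build_job(**job_choices)}\n"


print(manifest_header("ecoli_k12", mode="mapping", quality="draft"))
```

The second part of a manifest, which describes the sequencing data to load, is deliberately left out of this sketch.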