Some Theory on RNA Strings
Added page:
http://www.rnaparse.com/about.html
Current events
We are juggling several software applications in bioinformatics, as well as some more commercial apps that are not specifically associated with science research or mentioned on www.rnaparse.com. Research into the specific problems of RNA folding has led to the deployment of more general fast pattern-matching algorithms and some “neural network-like” programs (for lack of a better term) that parse raw data several times over, passing the results between any number of data or text files while re-parsing and adding or removing information as needed. Some example .exe files may be made available in the coming months.
James
Attenuators (cont’d), Pseudoknot application.
Matching attenuation sequences has turned out to be a challenging problem. The simplest way I’ve discovered is to write a grammar in two parts: the first matches the first stem-loop, and the second checks for direct repeats in the correct positions. Furthermore, this specific configuration is complex (and thus rare in random data) and seems to be rare in genomic sequences as well.
I’ve located the following attenuator-like structures in two related genomes:
GTGGTCGGCCACAGGCGTGG
((((::::))))::::::::
::::::::((((::::))))

>emb|V01174.1| Avian myelocytomatosis virus 5' LTR and gene for p96 polyprotein, proviral DNA in Gallus gallus genomic DNA
Length=3780
GENE ID: 1491913 Amvgp1 | p110 [Avian myelocytomatosis virus]
459 GTGGTCGGCCACAGGCGTGG 478

>gb|AF033809.1|AF033809 Avian myelocytomatosis virus, complete genome
Length=3392
GENE ID: 1491913 Amvgp1 | p110 [Avian myelocytomatosis virus]
72 GTGGTCGGCCACAGGCGTGG 91

Turns out this configuration is more common in bacterial genomes.

MTB DS016976 et al.
*note several repeats throughout
((((::::))))::::::::
::::::::((((::::))))
ACGCTGTCGCGTGCCGACGC
GCCGCAGACGGCTAAAGCCG
GCTGGCCGCAGCCAGCGCTG
GTGCACGCGCACAACGGTGC
CGGTGGTCACCGTCGGCGGT
GGTGTCGGCACCGGCCGGTG
GCTCGTCGGAGCCAAAGCTC
GCTCACCCGAGCGGCAGCTC
GGCGATTCCGCCGTCGGGCG
CCGCCCGGGCGGGGCGCCGC
CCGGCAATCCGGCGTGCCGG
TGTTCGGCAACAAGTATGTT
CGCAAACATGCGGGAGCGCA
GCGCGTCGGCGCGGGAGCGC
GCGGACAGCCGCGGGAGCGG
CGCGCGGCCGCGCTGCCGCG
CCCGGAGCCGGGCATCCCCG
CGCCCGGCGGCGCGCTCGCC
GGCCCCACGGCCACCAGGCC
CCGGCGTCCCGGGAGGCCGG
CGGGAAAGCCCGCCATCGGG
CGTGCTGGCACGAAGTCGTG
CGGCTCTGGCCGACATCGGC
GCCGCATCCGGCGGAGGCCG
GGCCTGATGGCCAATCGGCC
GGCGTGATCGCCAAGAGGCG
ACCGGCGCCGGTGATTACCG
TATATAGATATATAGATATA
TATATACATATACAAATATA
TATATATATATATAGATATA
ATATATAAATATAAATATAT
ATATATATATATAGATATAT
GCCGTTGCCGGCCTGGGCCG
GCCGAGGGCGGCTTCGGCCG
GCGCATAAGCGCGAGAGCGC
GCACTCCGGTGCTGCTGCAC
GCGCGTGAGCGCCCGGGCGC
CGGCGCGGGCCGGTTCCGGC
CGGCAGGAGCCGGCGCCGGC
TGGTACCGACCAGATTTGGT
GGCGCCAGCGCCTGGCGGCG
ACCGCATACGGTCCCAACCG
CCGCGGTTGCGGTAGGCCGC
GGCGCCGGCGCCGTCTGGCG
CGGTGGCAACCGTCTGCGGT
CGGCGATGGCCGGTCTCGGC
CCTTCGGTAAGGCAACCCTT
GCCGGGTGCGGCTGCTGCCG
CCGCGGTAGCGGATCACCGC
CCACCGGAGTGGTTGGCCAC
GGCTGGCTAGCCAGCCGGCT
CGACCGTTGTCGGGGCCGAC
CGGGACCACCCGAAGGCGGG
GCCGGAGTCGGCAGTGGCCG
GCGCCGCTGCGCAGCAGCGC
GCGCGGCGGCGCGTTGGCGC
GCGCCGGTGCGCCGCGGCGC
CATCACTTGATGAAATCATC
GTGCTGTTGCACGTTGGTGC
CCCGCCGCCGGGTGATCCCG
GGGAGCTATCCCCCGGGGGA
CGGTGGGCACCGCCCCCGGT
CGGTGGAAACCGAACCCGGT
GCGGGGACCCGCCGAGGCGG
GCGGCCAACCGCAACAGCGG
GCGCGGCAGCGCTGCTGCGC
CCGCCGGTGCGGTGTCCCGC
GCCGGATGCGGCTCCCGCCG
CCCGCACACGGGAGAGCCCG
CGCCCGTAGGCGAATCCGCC
CGGACCAGTCCGACCACGGA
CTGGATGTCCAGCGCGCTGG
GGCGCGGCCGCCTGCTGGCG
CGGCTGTGGCCGCGGTCGGC
ACCCTGGTGGGTACCAACCC
GTGCGTTAGCACCTCGGTGC
CGGCGCGCGCCGTTACCGGC
CGGCCGCAGCCGGGACCGGC
GCCGTGTGCGGCCAGTGCCG
CGGGCGTGCCCGCTAACGGG
CGACGGTGGTCGACACCGAC
GCGAGCGGTCGCGGCCGCGA
CGCGGAGTCGCGGGTGCGCG
CGGCCCAAGCCGAGCGCGGC
CGGTTCCGACCGGATCCGGT
CAGGCCGTCCTGGCGCCAGG
CCGGCGCACCGGCGAACCGG
CATCGTCGGATGGTTTCATC
CCACCAGGGTGGCCGTCCAC
//*************************

I've also developed a pseudoknot grammar application that allows the investigator to vary the length of stems 1 and 2 and loops 1, 2 and 3. This will be available as grantware within the next 2-3 months for public download. If you wish a beta copy contact me.
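To make the idea of variable stem and loop lengths concrete, here is a minimal sketch of a fixed-length pseudoknot scan. It is not the grammar application described above. It assumes the common H-type layout stem1 / loop1 / stem2 / loop2 / stem1' / loop3 / stem2', Watson-Crick pairs only, and a DNA alphabet. The function name `find_pseudoknots` and its parameters are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(s):
    """Reverse complement of a DNA string."""
    return s.translate(COMPLEMENT)[::-1]


def find_pseudoknots(seq, s1=3, s2=3, l1=1, l2=2, l3=3):
    """Scan for windows laid out as stem1 loop1 stem2 loop2 stem1' loop3 stem2'.

    s1, s2 are the stem lengths; l1, l2, l3 are the loop lengths.
    All five are fixed per scan (an assumption made for simplicity).
    """
    seq = seq.upper()
    total = 2 * s1 + 2 * s2 + l1 + l2 + l3
    hits = []
    for i in range(len(seq) - total + 1):
        b = i + s1 + l1          # start of stem 2, 5' strand
        c = b + s2 + l2          # start of stem 1, 3' strand
        d = c + s1 + l3          # start of stem 2, 3' strand
        stem1_ok = seq[c:c + s1] == revcomp(seq[i:i + s1])
        stem2_ok = seq[d:d + s2] == revcomp(seq[b:b + s2])
        if stem1_ok and stem2_ok:
            hits.append((i, seq[i:i + total]))
    return hits


# Hand-built test string: GGG A CAC AA CCC AAA GTG, padded with TT on each side.
core = "GGG" + "A" + "CAC" + "AA" + "CCC" + "AAA" + "GTG"
hits = find_pseudoknots("TT" + core + "TT")
```

Each scan costs time proportional to the sequence length times the stem lengths. Varying the five lengths means calling the function once per combination, which is the brute-force equivalent of changing the grammar's parameters.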
Attenuators – preliminary results
Very roughly, attenuators switch between two different stem-loop systems depending on how a gene is to be expressed.
I have completed a grammar that locates potential attenuators in the form:
L = {axbxa}
where a and b are complements and x is any run of nucleotides.
Note that two stems are possible: a-b and b-a.
The grammar first parses a stem and loop of some given size plus a tail.
Results are sent to a file, where the first and last few nts of each match are checked for a direct repeat.
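The a-x-b-x-a form can be sketched in a few lines of Python. This is a brute-force illustration, not the grammar itself. It assumes b is the reverse complement of a, a fixed stem length of 4, and filler runs of 3 to 16 nts. The names `find_axbxa` and `render` and all default values are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s):
    return s.translate(COMPLEMENT)[::-1]


def find_axbxa(seq, stem=4, min_gap=3, max_gap=16):
    """Return (i, j, k) triples: a at i, its reverse complement b at j, a again at k."""
    seq = seq.upper()
    n = len(seq)
    hits = []
    for i in range(n - stem + 1):
        a = seq[i:i + stem]
        if set(a) - set("ACGT"):
            continue  # skip ambiguous bases such as N
        b = reverse_complement(a)
        j_last = min(i + stem + max_gap, n - stem)
        for j in range(i + stem + min_gap, j_last + 1):
            if seq[j:j + stem] != b:
                continue
            k_last = min(j + stem + max_gap, n - stem)
            for k in range(j + stem + min_gap, k_last + 1):
                if seq[k:k + stem] == a:
                    hits.append((i, j, k))
    return hits


def render(length, open_at, close_at, stem=4):
    """Draw one stem in the bracket/colon notation used on this page."""
    s = [":"] * length
    s[open_at:open_at + stem] = "(" * stem
    s[close_at:close_at + stem] = ")" * stem
    return "".join(s)


amv = "GTGGTCGGCCACAGGCGTGG"
amv_hits = find_axbxa(amv)
config_a = render(len(amv), 0, 8)    # stem a-b plus tail
config_b = render(len(amv), 8, 16)   # head plus stem b-a
```

For the 20-nt Avian myelocytomatosis virus sequence, the hit list includes (0, 8, 16), and the two rendered strings correspond to configurations a and b.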
Two examples:
Hepatitis C virus
GAAGACATCTCATCTTCTGCCACTCAAAGAAG
((((:::::::::))))::::::::::::::: stem/loop + tail. configuration a
{{{{::::::::::::::::::::::::}}}} repeat GAAG
:::::::::::::((((:::::::::::)))) head + stem/loop. configuration b
s/r          s'             r'
Hepatitis C virus
CCGGTGAGTACACCGGAATTGCCAGGACGACCGG
((((::::::::)))):::::::::::::::::: stem/loop + tail. configuration a
{{{{::::::::::::::::::::::::::}}}} repeat CCGG
::::::::::::((((::::::::::::::)))) head + stem/loop. configuration b
s/r         s'                r'
James F. Lynn
Staged Grammars
We have successfully implemented and coded grammars that are “staged”, meaning a dataset is first parsed with regular expressions (REs), context-free grammars, or a combination of both, and the results are then re-parsed from a database or file with more computationally expensive algorithms.
Example: I parsed a 22,000,000 character file for certain structures that may or may not contain a specific secondary structure and sent the matches to a text file, which was automatically re-parsed for a very specific structure. File 2 in this hierarchy was approximately 100,000 characters long.
The original 22 million char. file was parsed in ~2 seconds, while the second file of 100k char. parsed in ~55 minutes. (Since these are linear parses, matching the second structure directly against the original file would have taken roughly 200 hours: 220 times as much input at ~55 minutes per 100k characters.) A huge advantage!
Several grammars coupled with results may be chained and/or branched together, working from less-specific to highly-specific patterns.
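The staging idea can be sketched as two chained passes. This is an illustration under stated assumptions, not the code used for the timings above. Stage 1 is a cheap regular expression for a hypothetical loop motif (G, any base, A or G, then A). Stage 2 stands in for the expensive parse and only checks that the flanks of each candidate pair up as a stem. An in-memory generator replaces the intermediate text file.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")
# Lookahead so that overlapping loop candidates are all reported.
LOOP = re.compile(r"(?=(G[ACGT][AG]A))")


def stage1(seq, flank=4):
    """Cheap pass: yield (start, window) for each loop motif with full flanks."""
    for m in LOOP.finditer(seq):
        start = m.start(1) - flank
        end = m.end(1) + flank
        if start >= 0 and end <= len(seq):
            yield start, seq[start:end]


def stage2(candidates, flank=4):
    """Stricter pass, run only on stage-1 output: the flanks must form a stem."""
    for start, window in candidates:
        left = window[:flank]
        right = window[-flank:]
        if left == right.translate(COMPLEMENT)[::-1]:
            yield start, window


seq = "TTTGCGCGAAAGCGCTTT"
candidates = list(stage1(seq))
confirmed = list(stage2(candidates))
```

Here stage 1 keeps two candidate windows and stage 2 confirms one of them. The saving comes from the expensive stage seeing only the small candidate set rather than the whole input.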
I’ll post some example .exe files next month, after some grant work is completed.
New .exe’s
We have built several new applications for certain hard-to-parse structures, such as ‘kissing hairpins’ in the general form ([)(]), as well as ORF-finding software. Some of these will be released as grantware while some remain private and in development. The following is an example of a kissing hairpin found with our new application:
Parse time is linear at ~ 3000 nts/sec.
DNA Direct Repeats
A demo of how a grammar may match repeating nts in a DNA string. On my machine it processes 6 million nts in ~37 minutes.
http://www.rnaparse.com/Downloads/repeats.exe
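For readers without the demo, a direct repeat search can be sketched with a dictionary of k-mer positions. This is a swapped-in technique (hash lookup), not the grammar used by repeats.exe. The function name `direct_repeats` and the parameters `k` and `max_gap` are hypothetical.

```python
def direct_repeats(seq, k=4, max_gap=50):
    """Report (first_start, second_start, kmer) for nearby repeated k-mers.

    Each k-mer is compared only against its most recent earlier occurrence,
    so this sketch can miss repeats that a full search would find.
    """
    seq = seq.upper()
    last_seen = {}
    hits = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        j = last_seen.get(kmer)
        # Require the two copies not to overlap and to lie within max_gap nts.
        if j is not None and j + k <= i and i - (j + k) <= max_gap:
            hits.append((j, i, kmer))
        last_seen[kmer] = i
    return hits


hits = direct_repeats("GAAGTTTTGAAG")
```

The loop visits each position once and does one dictionary lookup per position, so running time grows roughly linearly with sequence length for a fixed k.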
New TCR app
http://www.rnaparse.com/Downloads.html
For parsing repeats of length n = 12-20, with or without filler: {{{{{{…}}}}}} – {{{{{{{{{{…}}}}}}}}}}
This tandem complement repeat finder runs in linear time, O(n), and has a current limit of 100 million bases.
Output is to screen and also to a file called dna_results.
TCR Demo available for download
Tandem complement repeats are examples of multiple-crossing structures (multi-context-sensitive languages). Examples are: AGCT.TCGA (this example is also a mirror repeat) and GCTC.CGAG (1st G paired with 5th position C, 2nd C paired with 6th position G…)
The download demo is limited to scanning 2000 bp at a time in 2 files and finding perfect TCRs of length 12, e.g. ATCGAT.TAGCTA
Create two simple text files called “DNA1” and “DNA2” and place them in the same directory as TCR6.exe
http://www.rnaparse.com/Downloads.html
Experiments give O(n) time and space, at ~30 seconds per 1,000,000 characters.
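A perfect TCR check can be sketched directly from the definition: the second half is the base-by-base complement of the first half, read in the same direction. This naive scan is an illustration, not the TCR6.exe algorithm. The name `find_tcrs` and the parameters `half` and `filler` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_tcrs(seq, half=6, filler=0):
    """Return (start, left, right) for each perfect tandem complement repeat.

    half   -- length of each half (half=6 gives TCRs of total length 12)
    filler -- number of arbitrary nts allowed between the two halves
    """
    seq = seq.upper()
    span = 2 * half + filler
    hits = []
    for i in range(len(seq) - span + 1):
        left = seq[i:i + half]
        right = seq[i + half + filler:i + span]
        if set(left) - set("ACGT"):
            continue  # ignore windows with ambiguous bases
        if right == left.translate(COMPLEMENT):  # complement, not reversed
            hits.append((i, left, right))
    return hits


hits = find_tcrs("GGATCGATTAGCTACC")
```

Unlike a stem, where position 1 pairs with the last position, here position 1 pairs with position half+1, position 2 with half+2, and so on, which is why the pairings cross. For a fixed half length the scan does a constant amount of work per position.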
DARPA Viral Mutation – Prophecy
DARPA Wants to Create Almanacs of Every Possible Virus Mutation
June 2nd, 2010
If ever a story belonged in the Kill Off category, this is it.
Via: Wired:
Right now, preparing for new viral threats means looking to the past, creating hypotheses based on how pathogens have changed before. Now Darpa wants to reverse that strategy: test every possible outcome, to create a prophetic almanac that warns of viral mutations and outbreaks in advance — giving scientists the chance to change the course of the future before illness strikes.
The Pentagon’s far-out research arm has been zeroing in on the danger of mutating pathogens, and the corresponding problem of drug resistance, for years now. The agency is already funding tobacco-based vaccine production, a seven-day plan to thwart biothreats, and prescient viral infection detectors. And they’ve even set their sights on psychic medics, with a 2007 program that sought to turn docs into all-knowing illness predictors.
Now, Darpa wants the powers of premonition to wipe out viral threats altogether. They’re hosting a workshop for a new program, called “Prophecy,” that’ll develop methods to predict the rate, location and likely mutations of viral agents.
First, the agency wants novel lab-based methods to reproduce “virus-host interactions,” in different environments. After that, researchers will sequence different viral genomes, and test how they adapt and change under diverse conditions.
Ideally, that’ll yield a host of algorithms, capable of accurately predicting “the rate, direction and phenotype of viral mutations.” From there, scientists will be able to develop appropriate attack strategies in the right geographic locations. Most notably, Darpa wants to see mere mortals outdo the forces of nature, by creating “high energy evolutionary boundaries” that keep genetic mutations at bay.
Even if Darpa’s program doesn’t result in omniscient predictive powers, the possibility of more accurately anticipating viral mutations would have widespread implications. Health agencies could prep for looming outbreaks, new vaccines could be fast-tracked — and if scientists do manage to thwart evolution, the threat of resistance to antibiotic and antiviral meds could be all but eliminated.