
Attenuators (cont’d), Pseudoknot application

Matching attenuator sequences has turned out to be a challenging problem. The simplest way I’ve found is to write a grammar in two parts: the first matches the first stem and loop, and the second checks for direct repeats in the correct positions. Because this specific configuration is complex, it is rare in random data, and it also seems rare in genomic sequences.

I’ve located the following attenuator-like structures in two related genomes:

GTGGTCGGCCACAGGCGTGG
((((::::))))::::::::
::::::::((((::::))))
>emb|V01174.1|  Avian myelocytomatosis virus 5' LTR and gene for p96 polyprotein,
proviral DNA in Gallus gallus genomic DNA
Length=3780 GENE ID: 1491913 Amvgp1 | p110 [Avian myelocytomatosis virus]
459  GTGGTCGGCCACAGGCGTGG  478

>gb|AF033809.1|AF033809  Avian myelocytomatosis virus, complete genome
Length=3392 GENE ID: 1491913 Amvgp1 | p110 [Avian myelocytomatosis virus]
72  GTGGTCGGCCACAGGCGTGG  91

It turns out this configuration is more common in bacterial genomes.
MTB DS016976 et al. (note the several repeats throughout):

((((::::))))::::::::
::::::::((((::::))))
ACGCTGTCGCGTGCCGACGC
GCCGCAGACGGCTAAAGCCG
GCTGGCCGCAGCCAGCGCTG
GTGCACGCGCACAACGGTGC
CGGTGGTCACCGTCGGCGGT
GGTGTCGGCACCGGCCGGTG
GCTCGTCGGAGCCAAAGCTC
GCTCACCCGAGCGGCAGCTC
GGCGATTCCGCCGTCGGGCG
CCGCCCGGGCGGGGCGCCGC
CCGGCAATCCGGCGTGCCGG
TGTTCGGCAACAAGTATGTT
CGCAAACATGCGGGAGCGCA
GCGCGTCGGCGCGGGAGCGC
GCGGACAGCCGCGGGAGCGG
CGCGCGGCCGCGCTGCCGCG
CCCGGAGCCGGGCATCCCCG
CGCCCGGCGGCGCGCTCGCC
GGCCCCACGGCCACCAGGCC
CCGGCGTCCCGGGAGGCCGG
CGGGAAAGCCCGCCATCGGG
CGTGCTGGCACGAAGTCGTG
CGGCTCTGGCCGACATCGGC
GCCGCATCCGGCGGAGGCCG
GGCCTGATGGCCAATCGGCC
GGCGTGATCGCCAAGAGGCG
ACCGGCGCCGGTGATTACCG
TATATAGATATATAGATATA
TATATACATATACAAATATA
TATATATATATATAGATATA
ATATATAAATATAAATATAT
ATATATATATATAGATATAT
GCCGTTGCCGGCCTGGGCCG
GCCGAGGGCGGCTTCGGCCG
GCGCATAAGCGCGAGAGCGC
GCACTCCGGTGCTGCTGCAC
GCGCGTGAGCGCCCGGGCGC
CGGCGCGGGCCGGTTCCGGC
CGGCAGGAGCCGGCGCCGGC
TGGTACCGACCAGATTTGGT
GGCGCCAGCGCCTGGCGGCG
ACCGCATACGGTCCCAACCG
CCGCGGTTGCGGTAGGCCGC
GGCGCCGGCGCCGTCTGGCG
CGGTGGCAACCGTCTGCGGT
CGGCGATGGCCGGTCTCGGC
CCTTCGGTAAGGCAACCCTT
GCCGGGTGCGGCTGCTGCCG
CCGCGGTAGCGGATCACCGC
CCACCGGAGTGGTTGGCCAC
GGCTGGCTAGCCAGCCGGCT
CGACCGTTGTCGGGGCCGAC
CGGGACCACCCGAAGGCGGG
GCCGGAGTCGGCAGTGGCCG
GCGCCGCTGCGCAGCAGCGC
GCGCGGCGGCGCGTTGGCGC
GCGCCGGTGCGCCGCGGCGC
CATCACTTGATGAAATCATC
GTGCTGTTGCACGTTGGTGC
CCCGCCGCCGGGTGATCCCG
GGGAGCTATCCCCCGGGGGA
CGGTGGGCACCGCCCCCGGT
CGGTGGAAACCGAACCCGGT
GCGGGGACCCGCCGAGGCGG
GCGGCCAACCGCAACAGCGG
GCGCGGCAGCGCTGCTGCGC
CCGCCGGTGCGGTGTCCCGC
GCCGGATGCGGCTCCCGCCG
CCCGCACACGGGAGAGCCCG
CGCCCGTAGGCGAATCCGCC
CGGACCAGTCCGACCACGGA
CTGGATGTCCAGCGCGCTGG
GGCGCGGCCGCCTGCTGGCG
CGGCTGTGGCCGCGGTCGGC
ACCCTGGTGGGTACCAACCC
GTGCGTTAGCACCTCGGTGC
CGGCGCGCGCCGTTACCGGC
CGGCCGCAGCCGGGACCGGC
GCCGTGTGCGGCCAGTGCCG
CGGGCGTGCCCGCTAACGGG
CGACGGTGGTCGACACCGAC
GCGAGCGGTCGCGGCCGCGA
CGCGGAGTCGCGGGTGCGCG
CGGCCCAAGCCGAGCGCGGC
CGGTTCCGACCGGATCCGGT
CAGGCCGTCCTGGCGCCAGG
CCGGCGCACCGGCGAACCGG
CATCGTCGGATGGTTTCATC
CCACCAGGGTGGCCGTCCAC
//*************************
I've also developed a pseudoknot grammar application that allows the investigator to vary the lengths of stems 1 and 2 and of loops 1, 2 and 3. It will be available for public download as grantware within the next 2-3 months. If you would like a beta copy, contact me.
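To give a rough idea of what a length-parameterized pseudoknot search involves, here is a minimal Python sketch. It is not the grammar application itself. It assumes the layout stem 1, loop 1, stem 2, loop 2, stem 1', loop 3, stem 2', Watson-Crick pairs only, and fixed lengths per call. The names find_pseudoknots, s1, s2, l1, l2 and l3 are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    """Reverse complement of a DNA string (Watson-Crick pairs only)."""
    return s.translate(COMPLEMENT)[::-1]

def find_pseudoknots(seq, s1, s2, l1, l2, l3):
    """Return (start, subsequence, brackets) for every window laid out as
    stem1 | loop1 | stem2 | loop2 | stem1' | loop3 | stem2'."""
    seq = seq.upper()
    total = 2 * s1 + 2 * s2 + l1 + l2 + l3
    # Annotation: ( ) marks stem 1, [ ] marks stem 2, : marks unpaired nts.
    brackets = ("(" * s1 + ":" * l1 + "[" * s2 + ":" * l2 +
                ")" * s1 + ":" * l3 + "]" * s2)
    hits = []
    for i in range(len(seq) - total + 1):
        p = i
        stem1 = seq[p:p + s1]
        p += s1 + l1
        stem2 = seq[p:p + s2]
        p += s2 + l2
        stem1_mate = seq[p:p + s1]
        p += s1 + l3
        stem2_mate = seq[p:p + s2]
        # Both stems must pair, and they cross each other in the layout.
        if stem1_mate == revcomp(stem1) and stem2_mate == revcomp(stem2):
            hits.append((i, seq[i:i + total], brackets))
    return hits
```

For example, find_pseudoknots("TTGGCAAACTAGCCTTAGTCC", 3, 3, 2, 1, 2) reports one hit at offset 2, GGCAAACTAGCCTTAGT, annotated (((::[[[:)))::]]]. Scanning a range of lengths would mean calling the function once per combination.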

Attenuators – preliminary results

Very roughly, attenuators switch between two different stem-loop systems depending on how a gene is to be expressed.

I have completed a grammar that locates potential attenuators in the form:

L= {axbxa}

where a and b are reverse complements and x is any run of nucleotides (the two x segments need not match).
Note that two stems are possible: a-b and b-a.

The grammar first parses a stem and loop of a given size, plus a tail.
The results are sent to a file, where each match is checked for a direct repeat between its first few and last few nts.
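The two steps just described can be sketched in Python as follows. This is only an illustration under stated assumptions, not the grammar itself: the stem length, the loop range and the tail length are hypothetical defaults, pairing is Watson-Crick only, and candidates are kept in memory instead of being written to a file.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    """Reverse complement of a DNA string (Watson-Crick pairs only)."""
    return s.translate(COMPLEMENT)[::-1]

def stem_loops_with_tail(seq, stem=4, min_loop=4, max_loop=12, tail=20):
    """Stage 1: yield (start, loop_len, window) for each stem/loop,
    where window also carries up to `tail` trailing nts."""
    for i in range(len(seq) - (2 * stem + min_loop) + 1):
        mate = revcomp(seq[i:i + stem])
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop
            if seq[j:j + stem] == mate:
                yield i, loop, seq[i:j + stem + tail]

def attenuator_candidates(seq, stem=4, min_loop=4, max_loop=12, tail=20):
    """Stage 2: keep stage-1 windows whose first `stem` nts recur in the tail."""
    hits = []
    for start, loop, window in stem_loops_with_tail(
            seq.upper(), stem, min_loop, max_loop, tail):
        a = window[:stem]
        # Leave at least min_loop nts after the first stem/loop so that
        # b can also pair with the second copy of a (the b-a stem).
        k = window.find(a, 2 * stem + loop + min_loop)
        if k != -1:
            hits.append((start, window[:k + stem]))
    return hits
```

Calling attenuator_candidates on GTGGTCGGCCACAGGCGTGG, for instance, includes the full 20-nt sequence at offset 0 among its hits. Overlapping stems can produce additional hits for the same region, so the output would normally be filtered further.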

Two examples:

Hepatitis C virus 

GAAGACATCTCATCTTCTGCCACTCAAAGAAG

((((:::::::::)))):::::::::::::::  stem/loop + tail. configuration a
{{{{::::::::::::::::::::::::}}}}  repeat GAAG
:::::::::::::((((:::::::::::))))  head + stem/loop. configuration b
 s/r          s'              r'

Hepatitis C virus 

CCGGTGAGTACACCGGAATTGCCAGGACGACCGG

((((::::::::))))::::::::::::::::::  stem/loop + tail. configuration a
{{{{::::::::::::::::::::::::::}}}}  repeat CCGG
::::::::::::((((::::::::::::::))))  head + stem/loop. configuration b
 s/r          s'               r'

James F. Lynn


Staged Grammars

We have successfully implemented grammars that are “staged”: a dataset is first parsed with regular expressions (REs), context-free grammars, or a combination of both, and the results are then re-parsed from a database or file with more computationally expensive algorithms.

Example: I parsed a 22,000,000-character file for certain structures that may or may not contain a specific secondary structure and sent the matches to a text file, which was automatically re-parsed for a very specific structure. This second file in the hierarchy was approximately 100,000 characters.

The original 22-million-character file was parsed in ~2 seconds, while the second file of 100,000 characters was parsed in ~55 minutes. Since these are linear parses, matching the second structure directly against the original file, which is 220 times larger, would have taken roughly 200 hours (220 × 55 minutes). A huge advantage!

Several grammars, coupled with their results, may be chained and/or branched together, working from less-specific to highly-specific patterns.
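A minimal Python sketch of the staging idea follows. The two patterns are stand-ins chosen for illustration (a 20-nt window with matching 4-nt ends, then a check that the middle block pairs with the ends). They are not the structures from the example above, and the names stage1, stage2 and staged_parse are hypothetical.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    """Reverse complement of a DNA string (Watson-Crick pairs only)."""
    return s.translate(COMPLEMENT)[::-1]

# Stage 1 (cheap): keep any 20-nt window whose first and last 4 nts match.
# The lookahead lets overlapping windows be reported.
STAGE1 = re.compile(r"(?=(([ACGT]{4})[ACGT]{12}\2))")

def stage1(seq):
    """Return (offset, window) pairs that pass the cheap filter."""
    return [(m.start(), m.group(1)) for m in STAGE1.finditer(seq)]

def stage2(candidates):
    """Stand-in for the expensive parse: keep windows whose third
    4-nt block is the reverse complement of the first."""
    return [(pos, w) for pos, w in candidates if w[8:12] == revcomp(w[:4])]

def staged_parse(seq):
    # In a file-based workflow the candidates would be written out here
    # and re-read by the second stage; a list stands in for that file.
    candidates = stage1(seq.upper())
    return stage2(candidates)
```

The point of the design is that the second stage only ever sees the windows that survive the first, so its cost scales with the size of the candidate set and not with the original input.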

I’ll post some example .exe’s next month, after some grant work is completed.

New .exe’s

We have built several new applications for certain hard-to-parse structures, such as ‘kissing hairpins’ in the general form ([)(]), as well as ORF-finding software. Some of these will be released as grantware, while others remain private and in development. The following is an example of a kissing hairpin found with our new application: Parse time is linear at ~3000 nts/sec.

DNA Direct Repeats

A demo of how a grammar may match repeating nts in a DNA string. On my machine it processes 6 million nts in ~37 minutes.
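For readers who want to experiment without the .exe, here is a minimal Python sketch of direct-repeat finding. It uses a plain hash table of k-mers, not a grammar, so it says nothing about how the demo works internally. The function name and the parameter k are hypothetical.

```python
from collections import defaultdict

def find_direct_repeats(seq, k):
    """Map each k-mer that occurs more than once to its start positions."""
    seq = seq.upper()
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    # Keep only k-mers seen at two or more positions.
    return {kmer: starts for kmer, starts in positions.items()
            if len(starts) > 1}
```

For example, find_direct_repeats("GTGGTCGGCCACAGGCGTGG", 4) returns {"GTGG": [0, 16]}.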

http://www.rnaparse.com/Downloads/repeats.exe

New TCR app

http://www.rnaparse.com/Downloads.html

For parsing n=12-20 with or without filler: {{{{{{…}}}}}} – {{{{{{{{{{…}}}}}}}}}}

This tandem complement repeat finder runs in linear time, O(n), and has a current limit of 100 million bases.

Output is to screen and also to a file called dna_results.

TCR Demo available for download

Tandem complement repeats are examples of multiple-crossing structures (multi-context-sensitive languages). Examples are AGCT.TCGA (which is also a mirror repeat) and GCTC.CGAG (the 1st-position G pairs with the 5th-position C, the 2nd-position C with the 6th-position G, and so on).
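The pairing rule in these examples (position i pairs with position i plus the half length) can be checked in a single left-to-right pass. The following Python sketch is one way to do that and is not the TCR application itself. The names find_tcrs, half and gap are hypothetical, and gap models an optional fixed-length filler between the two halves.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def find_tcrs(seq, half, gap=0):
    """Return (start, first_half, second_half) for every perfect tandem
    complement repeat with halves of length `half` separated by `gap` nts."""
    seq = seq.upper()
    offset = half + gap
    hits = []
    run = 0  # consecutive positions i where seq[i] pairs with seq[i + offset]
    for i in range(len(seq) - offset):
        if COMPLEMENT.get(seq[i]) == seq[i + offset]:
            run += 1
        else:
            run = 0
        if run >= half:
            start = i - half + 1
            hits.append((start,
                         seq[start:start + half],
                         seq[start + offset:start + offset + half]))
    return hits
```

With half=4, both AGCTTCGA and GCTCCGAG yield a single hit at offset 0. Each position is visited once for a given half and gap, so one call is linear in the sequence length; covering several lengths means one pass per length.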

The download demo is limited to scanning 2000 bp at a time in 2 files and finding perfect TCRs of length 12, e.g. ATCGAT.TAGCTA

Create two simple text files called “DNA1” and “DNA2” and place them in the same directory as TCR6.exe

http://www.rnaparse.com/Downloads.html

Experiments show O(n) time and space, at about 1,000,000 characters in ~30 seconds.