Skip to content

Machine Learning Grammars

I have completed a program that uses system of grammars that learn from input, and using what they learn to parse a set of unknown words to see if they check against the original set. 

There are 3 files needed:

Learn set: containing words, letters, numbers, names etc. 
Rule Set: empty until filled with automatically written rules. 
Test Set: containing words, letters, numbers, names etc. 

The algorithm is roughly as follows using as simple as possible examples as I can muster. 

1) First grammar reads a learn.txt file using rule::= (“%^[+’a-zA-Z’#%’word’^”)
e.g. the learn file contains the word “hello” 
and casts that into “word”

2) At the same time print the head of the new rule: (“%^[‘
The body of the new rule: hello
And the tail of the new rule: ‘$%’x’^”)
and store that result into x e.g rule::= (“%^[‘hello’$%’x’^”)

3) The 3rd grammar then reads the new rule from rules.txt 
and parses a test.txt file in which it either matches or doesn’t match-
placing the result into a file called results.txt 

In other words, the grammars read a file of words, forms a set of rules from what it reads such that another grammar uses the new rules to parse a set of test words.

Evolving Grammar Rules

2014

Work continues on several fronts. I have designed an application that reads my grammars from a file and modifies them as needed. 
A sample rule set printed to a file looks like this:

rule –> “%^%^%”$%’x’^%4(%(%([‘T’$%’RHO’;’x’;’`\”A\”+x’.%)%|%(%([‘A’$%’RHO’;’x’;’`\”T\”+x’.%)%|%(%([‘G’$%’RHO’;’x’;’`\”C\”+x’.%)%|%(%([‘C’$%’RHO’;’x’;’`\”G\”+x’.%)%)%)%)%)%)[6’AGCT’#%’x’!%’stem’^”

(This simple grammar matches an rna stem-loop structure where stem size = 4, loop size =6)

Variables may be placed anywhere in the grammar and manipulated by the accompanying C++ code. Combined with ML and/or GA algorithms, this makes for an extremely powerful pattern matching machine. 

November 2012

Research continues and is yielding positive results. Unfortunately, its not a priority to update all details here on wordpress because I doubt its widely read anyway. An entire series of grammars have been packaged as windows apps/unix and will soon be available on my website for download as grantware.

April 2012

A March 2012 generous personal grant will be used to purchase new computer equipment and software. The remainder will support new projects having to do with GA and advanced CS decision tree methods. Thank you for supporting this research.

Genetic Algorithm + Grammars

I am working on combining  Nevill-Manning like GA’s and context-free grammars (context-sensitive grammars later) as an experiment where the program selects rule sets from a database constellation in an attempt to match looser patterns without losing too much original form.

I’ve also added a  filter to certain RNA secondary structure paring grammars: I’m quite pleased that a bioinformatician at the Craig Venter Institute (Rockville, MD) sent me C++ source code for the Nussinov-Jacobson algorithm which with some modification, works perfectly as an adjunct to my intended application. This code will help verify and refine results obtained from my grammatical algorithms for structural-consensus RNA secondary structures.

Some Theory on RNA Strings

Added page:

http://www.rnaparse.com/about.html

Current events

We are juggling with several software applications in bioinformatics as well as some other more-commercial apps not specifically associated with science research or mentioned in http://www.rnaparse.com.  Research into the specific problems of RNA folding has led to the deployment of more general fast pattern matching algorithms and some “neural network-like” programs (for lack of another word.) that parse raw data several times over and pass the results between n number of data or text files while re-parsing, adding or removing information as needed.  Some example exe’s may be made available in the coming months.

James

Attenuators (con’t), Pseudoknot application.

Matching attenuation sequences has turned out to be a challenging problem. The simplest way I’ve discovered is to write a grammar in two parts, one matching the first loop, and two, checking for direct repeats in the correct positions.  Furthermore, this specific configuration is complex (Thus rare in random data.) and seems rare in genomic sequences.

I’ve located the following attenuator-like structures in two related genomes:

GTGGTCGGCCACAGGCGTGG
((((::::))))::::::::
::::::::((((::::))))
>emb|V01174.1|  Avian myelocytomatosis virus 5' LTR and gene for p96 polyprotein,
proviral DNA in Gallus gallus genomic DNA
Length=3780 GENE ID: 1491913 Amvgp1 | p110 [Avian myelocytomatosis virus]
459  GTGGTCGGCCACAGGCGTGG  478

>gb|AF033809.1|AF033809  Avian myelocytomatosis virus, complete genome
Length=3392 GENE ID: 1491913 Amvgp1 | p110 [Avian myelocytomatosis virus]
72  GTGGTCGGCCACAGGCGTGG  91

Turns out this configuration is more common in bacterial genomes.
MTB DS016976 et al. *note several repeats throughout

((((::::))))::::::::
::::::::((((::::))))
ACGCTGTCGCGTGCCGACGC
GCCGCAGACGGCTAAAGCCG
GCTGGCCGCAGCCAGCGCTG
GTGCACGCGCACAACGGTGC
CGGTGGTCACCGTCGGCGGT
GGTGTCGGCACCGGCCGGTG
GCTCGTCGGAGCCAAAGCTC
GCTCACCCGAGCGGCAGCTC
GGCGATTCCGCCGTCGGGCG
CCGCCCGGGCGGGGCGCCGC
CCGGCAATCCGGCGTGCCGG
TGTTCGGCAACAAGTATGTT
CGCAAACATGCGGGAGCGCA
GCGCGTCGGCGCGGGAGCGC
GCGGACAGCCGCGGGAGCGG
CGCGCGGCCGCGCTGCCGCG
CCCGGAGCCGGGCATCCCCG
CGCCCGGCGGCGCGCTCGCC
GGCCCCACGGCCACCAGGCC
CCGGCGTCCCGGGAGGCCGG
CGGGAAAGCCCGCCATCGGG
CGTGCTGGCACGAAGTCGTG
CGGCTCTGGCCGACATCGGC
GCCGCATCCGGCGGAGGCCG
GGCCTGATGGCCAATCGGCC
GGCGTGATCGCCAAGAGGCG
ACCGGCGCCGGTGATTACCG
TATATAGATATATAGATATA
TATATACATATACAAATATA
TATATATATATATAGATATA
ATATATAAATATAAATATAT
ATATATATATATAGATATAT
GCCGTTGCCGGCCTGGGCCG
GCCGAGGGCGGCTTCGGCCG
GCGCATAAGCGCGAGAGCGC
GCACTCCGGTGCTGCTGCAC
GCGCGTGAGCGCCCGGGCGC
CGGCGCGGGCCGGTTCCGGC
CGGCAGGAGCCGGCGCCGGC
TGGTACCGACCAGATTTGGT
GGCGCCAGCGCCTGGCGGCG
ACCGCATACGGTCCCAACCG
CCGCGGTTGCGGTAGGCCGC
GGCGCCGGCGCCGTCTGGCG
CGGTGGCAACCGTCTGCGGT
CGGCGATGGCCGGTCTCGGC
CCTTCGGTAAGGCAACCCTT
GCCGGGTGCGGCTGCTGCCG
CCGCGGTAGCGGATCACCGC
CCACCGGAGTGGTTGGCCAC
GGCTGGCTAGCCAGCCGGCT
CGACCGTTGTCGGGGCCGAC
CGGGACCACCCGAAGGCGGG
GCCGGAGTCGGCAGTGGCCG
GCGCCGCTGCGCAGCAGCGC
GCGCGGCGGCGCGTTGGCGC
GCGCCGGTGCGCCGCGGCGC
CATCACTTGATGAAATCATC
GTGCTGTTGCACGTTGGTGC
CCCGCCGCCGGGTGATCCCG
GGGAGCTATCCCCCGGGGGA
CGGTGGGCACCGCCCCCGGT
CGGTGGAAACCGAACCCGGT
GCGGGGACCCGCCGAGGCGG
GCGGCCAACCGCAACAGCGG
GCGCGGCAGCGCTGCTGCGC
CCGCCGGTGCGGTGTCCCGC
GCCGGATGCGGCTCCCGCCG
CCCGCACACGGGAGAGCCCG
CGCCCGTAGGCGAATCCGCC
CGGACCAGTCCGACCACGGA
CTGGATGTCCAGCGCGCTGG
GGCGCGGCCGCCTGCTGGCG
CGGCTGTGGCCGCGGTCGGC
ACCCTGGTGGGTACCAACCC
GTGCGTTAGCACCTCGGTGC
CGGCGCGCGCCGTTACCGGC
CGGCCGCAGCCGGGACCGGC
GCCGTGTGCGGCCAGTGCCG
CGGGCGTGCCCGCTAACGGG
CGACGGTGGTCGACACCGAC
GCGAGCGGTCGCGGCCGCGA
CGCGGAGTCGCGGGTGCGCG
CGGCCCAAGCCGAGCGCGGC
CGGTTCCGACCGGATCCGGT
CAGGCCGTCCTGGCGCCAGG
CCGGCGCACCGGCGAACCGG
CATCGTCGGATGGTTTCATC
CCACCAGGGTGGCCGTCCAC
//*************************
I've also developed a pseudoknot grammar application that allows the investigator to vary the length of stems 1 and 2 and loops 1,2 and 3. This will be available as grantware within the next 2-3 months for public download. If you wish a beta copy contact me.

Attenuators – preliminary results

Very roughly, attenuators switch between two different stem-loop systems depending on how a gene is to be expressed.

I have completed a grammar that locates potential attenuators in the form:

L= {axbxa}

where a and b are compliments and x is any nucleotide.
note two stems are possible, a-b and b-a

The grammar first parses a stem and loop of some given size plus a tail.
Results are sent to a file where the first few nts. are checked for repeats in the first and last nts.

Two examples:

Hepatitis C virus 

GAAGACATCTCATCTTCTGCCACTCAAAGAAG

((((:::::::::)))):::::::::::::::  stem/loop + tail. configuration a
{{{{::::::::::::::::::::::::}}}}  repeat GAAG
:::::::::::::((((:::::::::::))))  head + stem/loop. configuration b
 s/r          s'              r'

Hepatitis C virus 

CCGGTGAGTACACCGGAATTGCCAGGACGACCGG

((((::::::::))))::::::::::::::::::  stem/loop + tail. configuration a
{{{{::::::::::::::::::::::::::}}}}  repeat CCGG
::::::::::::((((::::::::::::::))))  head + stem/loop. configuration b
 s/r          s'               r'

James F. Lynn


Staged Grammars

We have successfully implemented and coded grammars that are “staged” – meaning a dataset is parsed either with RE’s or context-free or combinations of both then re-parsed from a database or file with more computationally expensive algorithms.

Example: I parsed a 22,000,000 character file for certain structures that may or may not contain a specific secondary structure and sent that to a text file which was automatically re-parsed for a very specific structure. File 2  in this hierarchy  was of approximately 100,000 characters.

The original 22 million char. file was parsed in ~ 2 seconds while the second file of 100k char. parsed in ~55 minutes. (Noting that these are linear parses, matching the 2nd structure would have taken roughly 220 hours.) – A huge advantage !

Several grammars coupled with results may be chained and/or  branched together,  working from less-specific to highly-specific patterns.

I’ll get to posting up  some example .exe’s  next month after some grant work is completed.