gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback Machine copies. A new forum can be found here.

Coding > Ye Mighty Lexer Compiler ^_^

#145738 - keldon - Wed Nov 21, 2007 7:17 am

---=== POST REMOVED (ORIGINAL CONTENT NO LONGER EXISTS) DUE TO CONFUSION CAUSED BY ITS DESCRIPTION ===---

My library/tool is called Ye Mighty Lexer Compiler (currently in Java).
The input it takes is a grammar (or a set of grammars) and text to parse.
The output it produces is the largest string it can extract from the input string using the set of grammars, and also which grammar it used.
Before this tool existed, life sucked because you had to manually code a unique parser.
Now that this tool exists, life is better because you can define a simple grammar to parse and lex text automatically; it also shows you how easy the code is to create.


Last edited by keldon on Fri Dec 14, 2007 8:23 am; edited 4 times in total

#145761 - Lick - Wed Nov 21, 2007 7:15 pm

You mean auto-completion?

T9 is more of a system that works with multiple characters mapped to a smaller number of keys/buttons. It then makes sense to guess which combination of characters could form the most-used word (using a tree or whatever).

On a system with only one or two characters mapped to each key/button, it wouldn't make sense to implement T9. Auto-completion, on the other hand, is what lots of websites have already implemented using Javascript and XHRs (XMLHttpRequests).
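The digit-to-word lookup described here can be sketched like this (a minimal illustration of the idea; the dictionary, class and method names are my own):

```java
import java.util.*;

public class T9Lookup {
    // standard phone keypad: digit -> letters
    static final String[] KEYS = {"", "", "abc", "def", "ghi", "jkl",
                                  "mno", "pqrs", "tuv", "wxyz"};

    // map a word to its keypress sequence, e.g. "home" -> "4663"
    static String toDigits(String word) {
        StringBuilder sb = new StringBuilder();
        for (char c : word.toCharArray())
            for (int d = 2; d <= 9; d++)
                if (KEYS[d].indexOf(c) >= 0) sb.append(d);
        return sb.toString();
    }

    // all dictionary words whose keypress sequence matches the input digits;
    // a real implementation would precompute a trie and rank by usage
    static List<String> candidates(String digits, List<String> dict) {
        List<String> out = new ArrayList<>();
        for (String w : dict)
            if (toDigits(w).equals(digits)) out.add(w);
        return out;
    }

    public static void main(String[] args) {
        List<String> dict = Arrays.asList("good", "home", "tree");
        System.out.println(candidates("4663", dict)); // [good, home]
    }
}
```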
_________________
http://licklick.wordpress.com

#145778 - keldon - Wed Nov 21, 2007 11:14 pm

POST REMOVED DUE TO CONFUSIONS CAUSED BY IT, FORGET EVERYTHING BEFORE

Last edited by keldon on Fri Dec 14, 2007 10:04 am; edited 2 times in total

#145827 - keldon - Fri Nov 23, 2007 10:44 am

I've done it!

You feed it a token grammar and a string, and this son-of-a-gun will give you the next token. Warning: like any declarative language this will do what you tell it to do, so do not give it faulty grammars (like ones that accept nothing)!

Code:

      ArrayList<Node> alphaNumericList = new ArrayList<Node>();
      ArrayList<Node> sentenceList = new ArrayList<Node>();
      ArrayList<Node> anotherWordList = new ArrayList<Node>();
      ArrayList<Node> wordList = new ArrayList<Node>();
      ArrayList<Node> whitespaceList = new ArrayList<Node>();

      Node whitespace;
      Node word;
      Node anotherWord;
      Node sentence;

      // whitespace = " " {" "} .
      whitespaceList.add(TerminalFactory.createTerminalString(" "));
      whitespaceList.add(RepetitionFactory.createRepetition(TerminalFactory.createTerminalString(" ")));
      whitespace = ListFactory.createList(whitespaceList);

      // alphaNumeric = NUMBER | LETTER . (built here but not used below)
      alphaNumericList.add(number());
      alphaNumericList.add(letter());

      // word = LETTER {LETTER} .
      wordList.add(letter());
      wordList.add(RepetitionFactory.createRepetition(letter()));
      word = ListFactory.createList(wordList);

      // anotherWord = whitespace word .
      anotherWordList.add(whitespace);
      anotherWordList.add(word);
      anotherWord = ListFactory.createList(anotherWordList);

      // sentence = word {anotherWord} ["."] .
      sentenceList.add(word);
      sentenceList.add(RepetitionFactory.createRepetition(anotherWord));
      sentenceList.add(OptionFactory.createOptional(TerminalFactory.createTerminalString(".")));
      sentence = ListFactory.createList(sentenceList);

      // each call prints the length of the longest prefix accepted as a sentence
      System.out.println(Parser.parse("The dirty fairies are dead", sentence));
      System.out.println(Parser.parse("The dirty fairies are dead.", sentence));
      System.out.println(Parser.parse("The dirty fai.ries are dead.", sentence));


Output:
Code:
26
27
14


Pretty neat, eh? I'll upload code to my server later today when I get into the office and cure my sleep deprivation with 900% of my recommended daily allowance of glucose and vitamins from a block of Berocca.
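For comparison, the sentence grammar in this example accepts the same prefixes as a greedy regular expression, so the three results above can be reproduced in a few lines of plain Java (a sketch of the behaviour only, not of the library's implementation):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SentenceSketch {
    // regex form of:  sentence = word {whitespace word} ["."] .
    static final Pattern SENTENCE =
            Pattern.compile("[A-Za-z]+( +[A-Za-z]+)*\\.?");

    // length of the longest prefix accepted as a sentence, or -1
    static int parse(String input) {
        Matcher m = SENTENCE.matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        System.out.println(parse("The dirty fairies are dead"));   // 26
        System.out.println(parse("The dirty fairies are dead."));  // 27
        System.out.println(parse("The dirty fai.ries are dead.")); // 14
    }
}
```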

P.s. it is possible to code efficiently in C without garbage collection; I've just used these constructs for simplicity.

#145831 - keldon - Fri Nov 23, 2007 1:30 pm

Legal disclaimer:
By downloading the said file, knowingly or not, you agree to have no rights to its code or your knowledge of the knowledge gained upon mentally processing it. You have no copying rights, understanding rights, or right to process any thoughts derived from the knowledge of the said file. You are however given the right to live and breathe under the condition you do so without violation of any stated rule in this disclaimer.


Code available at https://github.com/keldon85-avasopht/mighty-parser

As for the approach, I've never done anything like this before; I just looked at the rules for grammars and created factories for them. The parser does most of the work, but you can see from the [poorly (un)commented] code how it works.

Edit: I've made a further update to have it select from a list of grammars, returning the grammar that gave the best match. Note that ambiguous grammars will give ambiguous results, such as one grammar accepting only white space and another that can begin by accepting white space. On another note, I've added detection of faulty grammars!

Edit: I've made the next update: the parser will now return the largest valid token from a string, which it was not doing before. Previously the code could only determine whether a string completely satisfied a grammar, or extract the largest accepted string. Now it extracts the largest valid token, and hence is a complete tokenizer.
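The select-from-several-grammars behaviour can be sketched with regexes standing in for grammars (my framing, chosen for brevity; the library itself uses the Node structures above, not regexes):

```java
import java.util.*;
import java.util.regex.*;

public class Tokenizer {
    // stand-in "grammars": each is a name plus a regex
    static final LinkedHashMap<String, Pattern> GRAMMARS = new LinkedHashMap<>();
    static {
        GRAMMARS.put("word", Pattern.compile("[A-Za-z]+"));
        GRAMMARS.put("whitespace", Pattern.compile(" +"));
        GRAMMARS.put("number", Pattern.compile("[0-9]+"));
    }

    // returns {grammarName, lexeme} for the longest match at 'pos', or null
    static String[] next(String input, int pos) {
        String bestName = null;
        int bestEnd = pos;
        for (Map.Entry<String, Pattern> g : GRAMMARS.entrySet()) {
            Matcher m = g.getValue().matcher(input).region(pos, input.length());
            if (m.lookingAt() && m.end() > bestEnd) {
                bestName = g.getKey();
                bestEnd = m.end();
            }
        }
        return bestName == null ? null
                : new String[]{bestName, input.substring(pos, bestEnd)};
    }

    public static void main(String[] args) {
        String s = "abc 42";
        int pos = 0;
        String[] t;
        while ((t = next(s, pos)) != null) {   // repeatedly take the next token
            System.out.println(t[0] + ": \"" + t[1] + "\"");
            pos += t[1].length();
        }
    }
}
```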

---=== THIS POST CONTAINS A SIMPLE EXPLANATION AND TEST CASE ===---

DzzD wrote:

sry, I dont understand anything, could you explain a simple test case please ?


Okay, basically this code can automatically tokenize any string based on the [CFG] rules you give it. And for free it also doubles up as an automatic lexer.

So as a test case, consider the following grammar (shown in the last post, and also in the code example) ...
Code:
word = LETTER {LETTER} .
anotherWord = whitespace word .
sentence = word {anotherWord} ["."] .

... and also the following rule (also defined in the code) ...
Code:
whitespace = " " {" "} .

... if you were to call Parser.parse ("This is a lovely valid sentence. This is the next sentence!", sentence), it would return the number of characters that form a valid sentence, which would be "This is a lovely valid sentence."

If you then were to call Parser.parse (" This is not a valid sentence because sentences do not begin with whitespace.", sentence), it would return -1 to tell you that it could not return a token. If you were to pass whitespace instead of sentence then it would return 7, i.e. " "!

My latest code update will also return a lexeme (as com.avasopht.mightyParser.traversing.Token).

Edit: I did a little speed test with a valid sentence of 12,031 characters; it searched 2,140,577 nodes and took 1.469 seconds to parse. Bear in mind that the grammar depicting sentences involves branching at each character, which increases the search space. If your grammar defines a specific word, it generates a list that does not require so many nodes; so if your grammar defines the text "some_reserved_keyword", it will not need to branch at each letter.

Now for a C++ version.


Last edited by keldon on Fri Jul 20, 2012 12:32 pm; edited 1 time in total

#146071 - keldon - Wed Nov 28, 2007 12:45 pm

I've made a further advancement in the code, which I need to upload some time. Basically, saving the parser's stack can be eliminated, as it is redundant in the presence of the "choice stack". This [should] give it a significant speed improvement.

It will also make a C++ conversion 10 times simpler, in fact there's practically nothing to it!

#146078 - Lick - Wed Nov 28, 2007 5:59 pm

I fail to see the use of such a system, unless you're a language teacher?
_________________
http://licklick.wordpress.com

#146085 - keldon - Wed Nov 28, 2007 7:01 pm

Here's why I did it
- fun (honestly)
- frustration (also)
- tokenizing to regular expressions (if you wanted to add it in)
- flexible lexing and ability to change lex and tokenizing rules easily
- just to see how it's done (or if I could do it)

I could have performed lexing using regular expressions, but gained much more by creating my own.

#146467 - keldon - Tue Dec 04, 2007 5:00 pm

Ok, I've also found a way that this code can tidily generate a parse tree! The 'choice stack' holds the key to it; what I think I'll do is dedicate my site to explaining these algorithms or something.

Plus it lets me do a little less work for the site, but still have a fair amount of 'kick' behind it!

#147061 - elyk1212 - Thu Dec 13, 2007 10:13 pm

I am not sure what this is for exactly, but why not consider a recursive descent parser if using this for sentence syntax? They are simple to write, easy to extend, and can be based directly on the EBNF grammar you specify.

A thought...:
Else, if you are using it for T9 (that's the text intuition/prediction thing?),

you could use pattern-based heuristics with IDA* (or something like it) on a search tree constructed from all words in your dictionary. It is an A.I. approach, and I am not completely sure how to get pattern values (usage data) for each word.

You would have to either be given this data or write a scraper that searched relevant papers/books and tallied word-usage statistics. It would need some modern samples in there to pick up all the crazy slang/abbreviations, though.


Anyway, without some kind of search heuristic, if your dictionary is fairly large, your users could see significant delays.

#147070 - keldon - Fri Dec 14, 2007 1:00 am

Well this allows you to write a grammar as opposed to code that reads a grammar! If you look at the more recent posts you'll see what it does.

#147073 - sajiimori - Fri Dec 14, 2007 1:28 am

Maybe you should write a post that explains it a bit more plainly, since nobody seems to understand what this thread is about. :)

My program is called ________.
The input it takes is __________, and it looks like this: _______.
The output it produces is ________, and it looks like this: _________.
Before this tool existed, life sucked because you had to __________.
Now that this tool exists, life is better because you can _________.

#147076 - keldon - Fri Dec 14, 2007 1:51 am

sajiimori wrote:
Maybe you should write a post that explains it a bit more plainly, since nobody seems to understand what this thread is about. :)

My program is called ________.
The input it takes is __________, and it looks like this: _______.
The output it produces is ________, and it looks like this: _________.
Before this tool existed, life sucked because you had to __________.
Now that this tool exists, life is better because you can _________.


Good call; I've edited the original post accordingly. On another note I have an alternative method of walking the nodes. My view on parsing is that once you create the structure there are 3 main types of node properties.
    Choices
    When a node contains more than one out edge, it creates a choice. Choices should not contain terminals or children.

    Children
    When a node contains a child, you cannot get past it without first going through the child node! Nodes with children should not contain choices or terminals.

    Terminal
    When a node contains a character it is a terminal node. Terminal nodes should have exactly one out edge and no children.

A node that is neither a choice, a parent, nor a terminal is an end node. Essentially the code tokenizes based on a grammar, but such a task can also lex and parse. It works by traversing the structure using the following behaviour:
    - grammar is a graph: every grammar is equivalent to a graph where nodes can contain terminals or child nodes. A grammar is passed as a node!
    - track 'parents': every time you enter a child node, store the parent node in a stack-like manner, popping it (to return to the node) when you reach an end node
    - track 'choice points': every time you have a choice, store the 'cur_pos', 'parent stack' and which choice you took (in a stack)
    - store 'cur_pos': keep track of how many letters have been accepted (i.e. what character we are up to)
    - only one end ('last_char_read'): there is only one valid end node: the node that can be reached directly from the 'grammar node' without going through any children. When it is reached, store 'cur_pos' as the new 'last_char_read'
    - terminals terminate: when a terminal is reached, it either matches the character at 'cur_pos', or you have to pop your choice stack until you reach a point where you have another choice, storing 'cur_pos', the 'parent stack' and the new choice you took. If you cannot, then return 'last_char_read'. Also perform the same action when you have read every character.
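The traversal described above can be sketched as follows, with recursion standing in for the explicit parent and choice stacks, and with made-up class names rather than the library's actual ones:

```java
import java.util.ArrayList;
import java.util.List;

class Node {
    Character terminal;                   // set on terminal nodes only
    Node child;                           // set on parent nodes only
    List<Node> next = new ArrayList<>();  // more than one out edge makes a choice

    static Node term(char c) { Node n = new Node(); n.terminal = c; return n; }
}

public class GraphWalker {
    // Longest prefix of 'input' accepted from node 'n' starting at 'pos',
    // or -1 if no branch matches.
    static int walk(Node n, String input, int pos) {
        if (n.terminal != null) {         // terminal: must match the char at cur_pos
            if (pos >= input.length() || input.charAt(pos) != n.terminal)
                return -1;
            pos++;
        } else if (n.child != null) {     // parent: go through the child first
            pos = walk(n.child, input, pos);
            if (pos < 0) return -1;
        }
        if (n.next.isEmpty()) return pos; // end node reached
        int best = -1;
        for (Node out : n.next)           // try every choice, keep the best
            best = Math.max(best, walk(out, input, pos));
        return best;
    }

    public static void main(String[] args) {
        // grammar: "a" {"b"} "c"
        Node a = Node.term('a'), b = Node.term('b'), c = Node.term('c');
        Node choice = new Node();
        a.next.add(choice);
        choice.next.add(b);               // repeat another 'b' ...
        choice.next.add(c);               // ... or finish with 'c'
        b.next.add(choice);

        System.out.println(walk(a, "abbbc", 0)); // 5
        System.out.println(walk(a, "ac", 0));    // 2
        System.out.println(walk(a, "abx", 0));   // -1
    }
}
```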


So my current method walks the nodes one by one, but my alternative method (which was actually the first approach I thought up) involves working with a single "walker". The walker just tries to find its way around the nodes, keeping track of the parents it finds; but every time it reaches a choice point, it duplicates itself and the duplicates follow each choice!

The walkers will die whenever they reach a terminal that does not satisfy their cur_pos.


Last edited by keldon on Fri Dec 14, 2007 9:21 am; edited 1 time in total

#147089 - elyk1212 - Fri Dec 14, 2007 8:30 am

Quote:
POST REMOVED DUE TO COMPLICATIONS, FORGET EVERYTHING BEFORE


LOL, That's awesome.

Dude you sound like me (especially when I have too many projects). ;)

#147092 - keldon - Fri Dec 14, 2007 10:04 am

elyk1212 wrote:
Quote:
POST REMOVED DUE TO COMPLICATIONS, FORGET EVERYTHING BEFORE


LOL, That's awesome.

Dude you sound like me (especially when I have too many projects). ;)

That was a typo, I meant confusions ^_^