gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback Machine copies. A new forum can be found here.

Coding > Ye Mighty Lexer Compiler ^_^

#145738 - keldon - Wed Nov 21, 2007 7:17 am

---=== POST REMOVED (ORIGINAL CONTENT NO LONGER EXISTS) DUE TO CONFUSION CAUSED BY ITS DESCRIPTION ===---

My library/tool is called Ye Mighty Lexer Compiler (currently in Java).
The input it takes is a grammar (or a set of grammars) and text to parse.
The output it produces is the largest string it can extract from the input string using the set of grammars, and also which grammar it used.
Before this tool existed, life sucked because you had to manually code a unique parser.
Now that this tool exists, life is better because you can define a simple grammar to parse and lex text automatically; it also shows you how easy the code is to create.


Last edited by keldon on Fri Dec 14, 2007 8:23 am; edited 4 times in total

#145761 - Lick - Wed Nov 21, 2007 7:15 pm

You mean auto-completion?

T9 is more of a system that works with multiple characters mapped to a smaller number of keys/buttons. It then makes sense to guess which combination of characters could form the most-used word (using a tree or whatever).

On a system with only one or two characters mapped to each key/button, it wouldn't make sense to implement T9. Auto-completion, on the other hand, is what lots of websites have already implemented using Javascript and XHRs (XMLHttpRequests).
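The digit-to-word lookup described here can be sketched like this (a minimal illustration of the idea; the dictionary, class and method names are my own):

```java
import java.util.*;

public class T9Lookup {
    // standard phone keypad: digit -> letters
    static final String[] KEYS = {"", "", "abc", "def", "ghi", "jkl",
                                  "mno", "pqrs", "tuv", "wxyz"};

    // map a word to its keypress sequence, e.g. "home" -> "4663"
    static String toDigits(String word) {
        StringBuilder sb = new StringBuilder();
        for (char c : word.toCharArray())
            for (int d = 2; d <= 9; d++)
                if (KEYS[d].indexOf(c) >= 0) sb.append(d);
        return sb.toString();
    }

    // all dictionary words whose keypress sequence matches the input digits;
    // a real implementation would precompute a trie and rank by usage
    static List<String> candidates(String digits, List<String> dict) {
        List<String> out = new ArrayList<>();
        for (String w : dict)
            if (toDigits(w).equals(digits)) out.add(w);
        return out;
    }

    public static void main(String[] args) {
        List<String> dict = Arrays.asList("good", "home", "tree");
        System.out.println(candidates("4663", dict)); // [good, home]
    }
}
```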
_________________
http://licklick.wordpress.com

#145778 - keldon - Wed Nov 21, 2007 11:14 pm

POST REMOVED DUE TO CONFUSIONS CAUSED BY IT, FORGET EVERYTHING BEFORE

Last edited by keldon on Fri Dec 14, 2007 10:04 am; edited 2 times in total

#145827 - keldon - Fri Nov 23, 2007 10:44 am

I've done it!

You feed it a token grammar and a string, and this son-of-a-gun will give you the next token. Warning: like any declarative language this will do what you tell it to do, so do not give it faulty grammars (like ones that accept nothing)!

Code:

      ArrayList<Node> alphaNumericList = new ArrayList<Node>();
      ArrayList<Node> sentenceList = new ArrayList<Node>();
      ArrayList<Node> anotherWordList = new ArrayList<Node>();
      ArrayList<Node> wordList = new ArrayList<Node>();
      ArrayList<Node> whitespaceList = new ArrayList<Node>();

      Node whitespace;
      Node word;
      Node anotherWord;
      Node sentence;

      // whitespace = " " {" "} .
      whitespaceList.add(TerminalFactory.createTerminalString(" "));
      whitespaceList.add(RepetitionFactory.createRepetition(TerminalFactory.createTerminalString(" ")));
      whitespace = ListFactory.createList(whitespaceList);

      // alphaNumeric = NUMBER | LETTER . (built here but not used below)
      alphaNumericList.add(number());
      alphaNumericList.add(letter());

      // word = LETTER {LETTER} .
      wordList.add(letter());
      wordList.add(RepetitionFactory.createRepetition(letter()));
      word = ListFactory.createList(wordList);

      // anotherWord = whitespace word .
      anotherWordList.add(whitespace);
      anotherWordList.add(word);
      anotherWord = ListFactory.createList(anotherWordList);

      // sentence = word {anotherWord} ["."] .
      sentenceList.add(word);
      sentenceList.add(RepetitionFactory.createRepetition(anotherWord));
      sentenceList.add(OptionFactory.createOptional(TerminalFactory.createTerminalString(".")));
      sentence = ListFactory.createList(sentenceList);

      // each call prints the length of the longest prefix accepted as a sentence
      System.out.println(Parser.parse("The dirty fairies are dead", sentence));
      System.out.println(Parser.parse("The dirty fairies are dead.", sentence));
      System.out.println(Parser.parse("The dirty fai.ries are dead.", sentence));


Output:
Code:
26
27
14


Pretty neat, eh? I'll upload code to my server later today when I get into the office and cure my sleep deprivation with 900% of my recommended daily allowance of glucose and vitamins from a block of Berocca.
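For comparison, the sentence grammar in this example accepts the same prefixes as a greedy regular expression, so the three results above can be reproduced in a few lines of plain Java (a sketch of the behaviour only, not of the library's implementation):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SentenceSketch {
    // regex form of:  sentence = word {whitespace word} ["."] .
    static final Pattern SENTENCE =
            Pattern.compile("[A-Za-z]+( +[A-Za-z]+)*\\.?");

    // length of the longest prefix accepted as a sentence, or -1
    static int parse(String input) {
        Matcher m = SENTENCE.matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        System.out.println(parse("The dirty fairies are dead"));   // 26
        System.out.println(parse("The dirty fairies are dead."));  // 27
        System.out.println(parse("The dirty fai.ries are dead.")); // 14
    }
}
```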

P.s. it is possible to code efficiently in C without garbage collection; I've just used these constructs for simplicity.

#145831 - keldon - Fri Nov 23, 2007 1:30 pm

Legal disclaimer:
By downloading the said file, knowingly or not, you agree to have no rights to its code or your knowledge of the knowledge gained upon mentally processing it. You have no copying rights, understanding rights, or right to process any thoughts derived from the knowledge of the said file. You are however given the right to live and breathe under the condition you do so without violation of any stated rule in this disclaimer.


Code available at https://github.com/keldon85-avasopht/mighty-parser

As for the approach, I've never done anything like this before; I just looked at the rules for grammars and created factories for them. The parser does most of the work, but you can see from the [poorly (un)commented] code how it works.

Edit: I've made a further update to have it select from a list of grammars, returning the grammar that gave the best match. Note that ambiguous grammars will give ambiguous results, such as one grammar accepting only white space and another that can begin by accepting white space. On another note, I've added detection of faulty grammars!

Edit: I've made the next update: the parser will now return the largest valid token from a string, which it was not doing before. Previously the code could only determine whether a string completely satisfied a grammar, or extract the largest accepted string. Now it extracts the largest valid token, and hence is a complete tokenizer.
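The select-from-several-grammars behaviour can be sketched with regexes standing in for grammars (my framing, chosen for brevity; the library itself uses the Node structures above, not regexes):

```java
import java.util.*;
import java.util.regex.*;

public class Tokenizer {
    // stand-in "grammars": each is a name plus a regex
    static final LinkedHashMap<String, Pattern> GRAMMARS = new LinkedHashMap<>();
    static {
        GRAMMARS.put("word", Pattern.compile("[A-Za-z]+"));
        GRAMMARS.put("whitespace", Pattern.compile(" +"));
        GRAMMARS.put("number", Pattern.compile("[0-9]+"));
    }

    // returns {grammarName, lexeme} for the longest match at 'pos', or null
    static String[] next(String input, int pos) {
        String bestName = null;
        int bestEnd = pos;
        for (Map.Entry<String, Pattern> g : GRAMMARS.entrySet()) {
            Matcher m = g.getValue().matcher(input).region(pos, input.length());
            if (m.lookingAt() && m.end() > bestEnd) {
                bestName = g.getKey();
                bestEnd = m.end();
            }
        }
        return bestName == null ? null
                : new String[]{bestName, input.substring(pos, bestEnd)};
    }

    public static void main(String[] args) {
        String s = "abc 42";
        int pos = 0;
        String[] t;
        while ((t = next(s, pos)) != null) {   // repeatedly take the next token
            System.out.println(t[0] + ": \"" + t[1] + "\"");
            pos += t[1].length();
        }
    }
}
```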

---=== THIS POST CONTAINS A SIMPLE EXPLANATION AND TEST CASE ===---

DzzD wrote:

sry, I dont understand anything, could you explain a simple test case please ?


Okay, basically this code can automatically tokenize any string based on the [CFG] rules you give it. And for free it also doubles up as an automatic lexer.

So as a test case, consider the following grammar (shown in the last post, and also in the code example) ...
Code:
word = LETTER {LETTER} .
anotherWord = whitespace word .
sentence = word {anotherWord} ["."] .

... and also the following rule (also defined in the code) ...
Code:
whitespace = " " {" "} .

... if you were to call Parser.parse ("This is a lovely valid sentence. This is the next sentence!", sentence), it would return the number of characters that form a valid sentence, which would be "This is a lovely valid sentence."

If you then were to call Parser.parse (" This is not a valid sentence because sentences do not begin with whitespace.", sentence), it would return -1 to tell you that it could not return a token. If you were to pass whitespace instead of sentence then it would return 7, i.e. " "!

My latest code update will also return a lexeme (as com.avasopht.mightyParser.traversing.Token).

Edit: I did a little speed test with a valid sentence of 12,031 characters; it searched 2,140,577 nodes and took 1.469 seconds to parse. Bear in mind that the grammar depicting sentences involves branching at each character, which increases the search space. If your grammar defines a specific word, it generates a list that does not require so many nodes; so if your grammar defines the text "some_reserved_keyword", it will not need to branch at each letter.

Now for a C++ version.


Last edited by keldon on Fri Jul 20, 2012 12:32 pm; edited 1 time in total

#146071 - keldon - Wed Nov 28, 2007 12:45 pm

I've made a further advancement in the code, which I need to upload some time. Basically, saving the parser's stack can be eliminated, as it is redundant in the presence of the "choice stack". This [should] give it a significant speed improvement.

It will also make a C++ conversion 10 times simpler, in fact there's practically nothing to it!

#146078 - Lick - Wed Nov 28, 2007 5:59 pm

I fail to see the use of such a system, unless you're a language teacher?
_________________
http://licklick.wordpress.com

#146085 - keldon - Wed Nov 28, 2007 7:01 pm

Here's why I did it
- fun (honestly)
- frustration (also)
- tokenizing to regular expressions (if you wanted to add it in)
- flexible lexing and ability to change lex and tokenizing rules easily
- just to see how it's done (or if I could do it)

I could have performed lexing using regular expressions, but gained much more by creating my own.

#146467 - keldon - Tue Dec 04, 2007 5:00 pm

Ok, I've also found a way that this code can tidily generate a parse tree! The 'choice stack' holds the key to it; what I think I'll do is dedicate my site to explaining these algorithms or something.

Plus it lets me do a little less work for the site, but still have a fair amount of 'kick' behind it!

#147061 - elyk1212 - Thu Dec 13, 2007 10:13 pm

I am not sure what this is for exactly, but why not consider a recursive descent parser if using this for sentence syntax? They are simple to write, easy to extend, and can be based directly on the EBNF grammar you specify.

A thought...:
Else, if you are using it for T9 (that's the text intuition/prediction thing?),

you could use pattern-based heuristics with IDA* (or something like it) on a search tree constructed from all words in your dictionary. It is an A.I. approach, and I am not completely sure how to get pattern values (usage data) for each word.

You would have to either be given this data or write a scraper that searched relevant papers/books and tallied word-usage statistics. It would need some modern samples in there to pick up all the crazy slang/abbreviations, though.


Anyway, without some kind of search heuristic, if your dictionary is fairly large, your users could see significant delays.

#147070 - keldon - Fri Dec 14, 2007 1:00 am

Well this allows you to write a grammar as opposed to code that reads a grammar! If you look at the more recent posts you'll see what it does.

#147073 - sajiimori - Fri Dec 14, 2007 1:28 am

Maybe you should write a post that explains it a bit more plainly, since nobody seems to understand what this thread is about. :)

My program is called ________.
The input it takes is __________, and it looks like this: _______.
The output it produces is ________, and it looks like this: _________.
Before this tool existed, life sucked because you had to __________.
Now that this tool exists, life is better because you can _________.

#147076 - keldon - Fri Dec 14, 2007 1:51 am

sajiimori wrote:
Maybe you should write a post that explains it a bit more plainly, since nobody seems to understand what this thread is about. :)

My program is called ________.
The input it takes is __________, and it looks like this: _______.
The output it produces is ________, and it looks like this: _________.
Before this tool existed, life sucked because you had to __________.
Now that this tool exists, life is better because you can _________.


Good call; I've edited the original post accordingly. On another note I have an alternative method of walking the nodes. My view on parsing is that once you create the structure there are 3 main types of node properties.
    Choices
    When a node contains more than one out edge, it creates a choice. Choices should not contain terminals or children.

    Children
    When a node contains a child, you cannot get past it without first going through the child node! Nodes with children should not contain choices or terminals.

    Terminal
    When a node contains a character it is a terminal node. Terminal nodes should have exactly one out edge and no children.

A node that is neither a choice, a parent, nor a terminal is an end node. Essentially the code tokenizes based on a grammar, but such a task can also lex and parse. It works by traversing the structure using the following behaviour:
    - grammar is a graph: every grammar is equivalent to a graph where nodes can contain terminals or child nodes. A grammar is passed as a node!
    - track 'parents': every time you enter a child node, store the parent node in a stack-like manner, popping it (to return to the node) when you reach an end node
    - track 'choice points': every time you have a choice, store the 'cur_pos', 'parent stack' and which choice you took (in a stack)
    - store 'cur_pos': keep track of how many letters have been accepted (i.e. what character we are up to)
    - only one end ('last_char_read'): there is only one valid end node: the node that can be reached directly from the 'grammar node' without going through any children. When it is reached, store 'cur_pos' as the new 'last_char_read'
    - terminals terminate: when a terminal is reached, it either matches the character at 'cur_pos', or you have to pop your choice stack until you reach a point where you have another choice, storing 'cur_pos', the 'parent stack' and the new choice you took. If you cannot, then return 'last_char_read'. Also perform the same action when you have read every character.
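The traversal described above can be sketched as follows, with recursion standing in for the explicit parent and choice stacks, and with made-up class names rather than the library's actual ones:

```java
import java.util.ArrayList;
import java.util.List;

class Node {
    Character terminal;                   // set on terminal nodes only
    Node child;                           // set on parent nodes only
    List<Node> next = new ArrayList<>();  // more than one out edge makes a choice

    static Node term(char c) { Node n = new Node(); n.terminal = c; return n; }
}

public class GraphWalker {
    // Longest prefix of 'input' accepted from node 'n' starting at 'pos',
    // or -1 if no branch matches.
    static int walk(Node n, String input, int pos) {
        if (n.terminal != null) {         // terminal: must match the char at cur_pos
            if (pos >= input.length() || input.charAt(pos) != n.terminal)
                return -1;
            pos++;
        } else if (n.child != null) {     // parent: go through the child first
            pos = walk(n.child, input, pos);
            if (pos < 0) return -1;
        }
        if (n.next.isEmpty()) return pos; // end node reached
        int best = -1;
        for (Node out : n.next)           // try every choice, keep the best
            best = Math.max(best, walk(out, input, pos));
        return best;
    }

    public static void main(String[] args) {
        // grammar: "a" {"b"} "c"
        Node a = Node.term('a'), b = Node.term('b'), c = Node.term('c');
        Node choice = new Node();
        a.next.add(choice);
        choice.next.add(b);               // repeat another 'b' ...
        choice.next.add(c);               // ... or finish with 'c'
        b.next.add(choice);

        System.out.println(walk(a, "abbbc", 0)); // 5
        System.out.println(walk(a, "ac", 0));    // 2
        System.out.println(walk(a, "abx", 0));   // -1
    }
}
```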


So my current method walks the nodes one by one, but my alternative method (which was actually the first approach I thought up) involves working with a single "walker". The walker just tries to find its way around the nodes, keeping track of the parents it finds; but every time it reaches a choice point, it duplicates itself and the duplicates follow each choice!

The walkers will die whenever they reach a terminal that does not satisfy their cur_pos.


Last edited by keldon on Fri Dec 14, 2007 9:21 am; edited 1 time in total

#147089 - elyk1212 - Fri Dec 14, 2007 8:30 am

Quote:
POST REMOVED DUE TO COMPLICATIONS, FORGET EVERYTHING BEFORE


LOL, That's awesome.

Dude you sound like me (especially when I have too many projects). ;)

#147092 - keldon - Fri Dec 14, 2007 10:04 am

elyk1212 wrote:
Quote:
POST REMOVED DUE TO COMPLICATIONS, FORGET EVERYTHING BEFORE


LOL, That's awesome.

Dude you sound like me (especially when I have too many projects). ;)

That was a typo, I meant confusions ^_^