gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

Help Wanted > String Isolation, Automated Retrieval

#104236 - sgeos - Wed Sep 27, 2006 10:31 am

If you want to localize your project now or in the future, you need to isolate all of your text strings and put them in one place. The logical place is a spreadsheet, because translators can easily work with spreadsheets. In theory, a spreadsheet can also hold more than one language (provided it uses a universal encoding).

Given a spreadsheet with string names and strings:
Code:
TXT_LANG      en              fr      ja
TXT_CHARSET   iso-8859-1      ...     ...
TXT_MENU      Menu            ...     ...
TXT_M_MANUAL  Online Help     ...     ...

I'd like to create a general solution that:
A) Retrieves the data from the spreadsheet
B) Inserts strings into a project based on a language setting

Please post here or send me a private message if you would like to help. I'm using OpenOffice.org.

-Brendan

#104291 - tepples - Wed Sep 27, 2006 8:48 pm

To get the data out of your spreadsheet, tell OpenOffice.org to export it in tab-separated format, which is easy for a program to parse.

By "inserts strings into a project" what do you mean? A project on which platform? Does this platform have a file system? Do you want translators to be able to see their work on hardware without having to recompile the program? Do you want the player to be able to select a language at runtime?

Without any other sort of requirement guidance, I'd come up with something that takes your tab-separated file and spits out something like this:
Code:
/* begin strings-en.c */
const char TXT_MENU[] = "Menu";
const char TXT_M_MANUAL[] = "Online Help";
/* EOF */

/* begin strings-tokipona.c */
const char TXT_MENU[] = "lipu wile";
const char TXT_M_MANUAL[] = "lipu pi kama sona";
/* EOF */
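
For illustration, the conversion step described above could be sketched in C roughly as follows. The function name tsvLineToC and its buffer-based interface are assumptions made for this example, not something from the thread. It reads one tab-separated line, uses only the first text column, and does not check that the name is a valid C identifier.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: converts one "NAME<TAB>text[<TAB>...]" line into a
 * C definition. Only the first text column is used. Returns 0 on success,
 * -1 if the line has no name or tab, or the output buffer is too small. */
int tsvLineToC(const char *pLine, char *pOut, size_t pOutSize)
{
   const char *tab;
   const char *s;
   size_t used;
   int n;

   assert(pLine != NULL && pOut != NULL);
   tab = strchr(pLine, '\t');
   if (tab == NULL || tab == pLine)
      return -1;

   n = snprintf(pOut, pOutSize, "const char %.*s[] = \"",
                (int)(tab - pLine), pLine);
   if (n < 0 || (size_t)n >= pOutSize)
      return -1;
   used = (size_t)n;

   /* Copy the text column, escaping characters that would break a C literal. */
   for (s = tab + 1; *s != '\0' && *s != '\t' && *s != '\n' && *s != '\r'; s++)
   {
      if (*s == '"' || *s == '\\')
      {
         if (used + 1 >= pOutSize)
            return -1;
         pOut[used++] = '\\';
      }
      if (used + 1 >= pOutSize)
         return -1;
      pOut[used++] = *s;
   }

   if (used + 3 > pOutSize)   /* closing quote, semicolon, terminator */
      return -1;
   pOut[used++] = '"';
   pOut[used++] = ';';
   pOut[used] = '\0';
   return 0;
}
```

Given the line "TXT_MENU<TAB>Menu<TAB>lipu wile", this fills the buffer with const char TXT_MENU[] = "Menu"; and a real tool would loop over the file and pick the column for the requested language.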

"ja" scares me. I've never tried exporting a spreadsheet containing kanji (Chinese ideograms) from OOo. Can your tools handle all strings using UTF-8 encoding?
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

#104343 - sgeos - Thu Sep 28, 2006 5:40 am

tepples wrote:
To get the data out of your spreadsheet, tell OpenOffice.org to export it in tab-separated format,

A tab delimited csv file with nothing surrounding the text?

tepples wrote:
which is easy for a program to parse.

The ideal solution doesn't involve the user manually opening OOo. Admittedly, one assumes that you don't get new translations or updates multiple times a day- a manual step might not be a big deal.

It looks like OOo macros can be run from the command line, so automation amounts to writing a macro.

tepples wrote:
By "inserts strings into a project" what do you mean? A project on which platform?

I'm looking for a general solution that is not platform specific. I might be working with C or HTML or even something weird and proprietary. I think that there are two types of general scenarios:

A) Template/Non-Centralized
spreadsheet -(conversion)-> data table -(preprocessor)-> embedded text

B) Centralized File
spreadsheet -(conversion)-> data table -(filter)-> useful strings file(s)

The "data table" can be a tab delimited file unless something else is "easier" to work with. The C preprocessor can be used for the template method (with a step or two in between).

Centralized file output is specific to what you happen to be doing. A generic solution for C could be written if that would be useful.

tepples wrote:
Does this platform have a file system? Do you want translators to be able to see their work on hardware without having to recompile the program?

Being able to view translations on hardware without recompiling would be fantastic. It is not something I had even considered.

tepples wrote:
Do you want the player to be able to select a language at runtime?

This option should certainly be kept open.

tepples wrote:
Without any other sort of requirement guidance, I'd come up with something that takes your tab-separated file and spits out something like this:

*snip code*

Which is a fine solution if your build only needs one language at a time. Can the language be changed at runtime using that solution?

As far as C goes, I'd work through an API like this:
Code:
void addLanguage(int pLanguage, const char **pMessageData);
void setLanguage(int pLanguage);
const char *getMessage(int pMessage);
const char *getMessageL(int pMessage, int pLanguage);

And spit out something like this:
Code:
// Language Codes
#define LANG_EN 0
#define LANG_TOKIPONA 1

// Message ID Codes
#define ID_TXT_MENU 0
#define ID_TXT_M_MANUAL 1

// Messages (Default Language)
#define TXT_MENU getMessage(ID_TXT_MENU)
#define TXT_M_MANUAL getMessage(ID_TXT_M_MANUAL)

// Messages (Specified Language)
#define LTXT_MENU(LTXT_a) getMessageL(ID_TXT_MENU, LTXT_a)
#define LTXT_M_MANUAL(LTXT_a) getMessageL(ID_TXT_M_MANUAL, LTXT_a)

// en Message Table
const char *gTxtEn[] =
{
   "Menu",
   "Online Help",
   NULL
};

// tokipona Message Table
const char *gTxtTokipona[] =
{
   "lipu wile",
   "lipu pi kama sona",
   NULL
};

You might not want to actually go through a function to get your messages.
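
One possible implementation of the four API functions, as a minimal sketch. LANG_MAX and the static variables are assumptions made for this example, and each table is expected to end with NULL like the generated tables above.

```c
#include <assert.h>
#include <stddef.h>

#define LANG_MAX 8   /* assumed upper bound on registered languages */

static const char **sTables[LANG_MAX];   /* one message table per language */
static int sLanguage = 0;                /* current default language */

void addLanguage(int pLanguage, const char **pMessageData)
{
   /* Registering an out-of-range language is treated as a programmer error. */
   assert(pLanguage >= 0 && pLanguage < LANG_MAX);
   sTables[pLanguage] = pMessageData;
}

void setLanguage(int pLanguage)
{
   /* Requests for a language that was never registered are ignored. */
   if (pLanguage >= 0 && pLanguage < LANG_MAX && sTables[pLanguage] != NULL)
      sLanguage = pLanguage;
}

const char *getMessageL(int pMessage, int pLanguage)
{
   const char **table;
   int i;

   if (pMessage < 0 || pLanguage < 0 || pLanguage >= LANG_MAX)
      return NULL;
   table = sTables[pLanguage];
   if (table == NULL)
      return NULL;

   /* Walk up to the entry so we never read past the NULL terminator. */
   for (i = 0; i < pMessage; i++)
      if (table[i] == NULL)
         return NULL;
   return table[pMessage];
}

const char *getMessage(int pMessage)
{
   return getMessageL(pMessage, sLanguage);
}
```

The linear walk keeps the sketch safe without knowing the table size. If the generator also emitted a message count, the lookup could be a plain bounds check and array index.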

tepples wrote:
"ja" scares me. I've never tried exporting a spreadsheet containing kanji (Chinese ideograms) from OOo.

Have you ever tried looking at the source of a Japanese web page? Copy (from the net), paste, export works fine for me.

tepples wrote:
Can your tools handle all strings using UTF-8 encoding?

If your tools can only handle ASCII, you can convert the non-ASCII characters to numeric character references before mucking around with them. I've found that this is the best way to make things not break in strange places. =) Many Cygwin tools seem to assume ASCII. You always convert your data back to UTF-8 or whatnot as one of your last steps.

Instead, you could just convert the data into raw bytes in the first place. The data only needs to be human readable in the spreadsheet and on the target platform.
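
As a rough sketch of the numeric character reference step, assuming the input is UTF-8: ASCII bytes pass through unchanged and every other code point becomes a decimal reference. The function name utf8ToNcr is made up for this example, and the decoder only does basic validation (it does not reject surrogates or every overlong form).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: rewrites UTF-8 text as pure ASCII, replacing each
 * non-ASCII code point with a decimal reference such as &#233;.
 * Returns the number of bytes written, or -1 on malformed input or when
 * the output buffer is too small. */
int utf8ToNcr(const char *pIn, char *pOut, size_t pOutSize)
{
   const unsigned char *s;
   size_t used = 0;

   assert(pIn != NULL && pOut != NULL);
   if (pOutSize == 0)
      return -1;
   pOut[0] = '\0';

   for (s = (const unsigned char *)pIn; *s != '\0'; )
   {
      unsigned long cp;
      int extra, i, n;

      /* The lead byte tells us how many continuation bytes follow. */
      if (*s < 0x80)                { cp = *s;        extra = 0; }
      else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
      else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
      else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
      else return -1;               /* stray continuation or invalid byte */
      s++;

      for (i = 0; i < extra; i++, s++)
      {
         if ((*s & 0xC0) != 0x80)
            return -1;              /* truncated sequence */
         cp = (cp << 6) | (*s & 0x3F);
      }
      if (extra > 0 && cp < 0x80)
         return -1;                 /* overlong encoding of an ASCII character */

      if (cp < 0x80)
         n = snprintf(pOut + used, pOutSize - used, "%c", (int)cp);
      else
         n = snprintf(pOut + used, pOutSize - used, "&#%lu;", cp);
      if (n < 0 || (size_t)n >= pOutSize - used)
         return -1;
      used += (size_t)n;
   }
   return (int)used;
}
```

For example, the UTF-8 bytes for "café" come out as "caf&#233;". The reverse conversion would run as one of the last build steps.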

-Brendan

#106350 - ecurtz - Wed Oct 18, 2006 5:44 am

How's this going? I'm interested in seeing how it comes out, and here are some semi-random thoughts...

You may want to use a byte or short at the beginning to mark the strings for length (pascal style, for us old-time Mac types.)

Have you looked at gettext? There may be some useful utilities there for yanking out your strings and cataloging them.

Are you going to assign a category or section id or something to the strings to aid in swapping sections into RAM? I suppose this doesn't matter much on cart systems.

Consider embedding the layout in the strings (spacing, line breaks, etc.) yes it's an ugly hack and "should" be done at runtime, but on the other hand more code on the computer means more control over output and less work on the final device.

#106380 - sgeos - Wed Oct 18, 2006 3:12 pm

ecurtz wrote:
How's this going?

I have an implementation specific solution that converts tab delimited spreadsheet files into an import table.

ecurtz wrote:
I'm interested in seeing how it comes out,

This is the Perl script I'm using. You might want to have it spit out ifdef, define, and endif. ...I might just add that...
Code:
#!/usr/bin/perl -w

while (defined($line = <STDIN>))
{
        if ($line =~ s/([^\t\n]+)([\t])([^\t\n]+|)(?:[\t]|)([^\t\n]+|)(?:[\t]|)[\n]/#define\t$1$2$3\n/g)
        {
                print STDOUT $line;
        }
        else
        {
                print STDOUT "\n";
        }
}

The final $3 represents the second column in the spreadsheet. $4 would be the third column. The script would have to be enhanced to handle more columns. Usage:
Code:
cat in.csv | csv2h.pl > out.h


ecurtz wrote:
You may want to use a byte or short at the beginning to mark the strings for length (pascal style, for us old-time Mac types.)

I could see a length marker being helpful. Depends on what you want to do.
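
A minimal sketch of the Pascal-style layout, assuming a single length byte (so at most 255 bytes of text) and no terminator. The name toPascal is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: writes pText as one length byte followed by the
 * text itself, with no terminating NUL. Returns the number of bytes
 * written, or -1 if the text is longer than 255 bytes or does not fit. */
int toPascal(const char *pText, unsigned char *pOut, size_t pOutSize)
{
   size_t len;

   assert(pText != NULL && pOut != NULL);
   len = strlen(pText);
   if (len > 255 || len + 1 > pOutSize)
      return -1;

   pOut[0] = (unsigned char)len;   /* length prefix */
   memcpy(pOut + 1, pText, len);   /* text bytes, no terminator */
   return (int)(len + 1);
}
```

Using a short instead of a byte for the prefix would raise the limit at the cost of one more byte per string.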

ecurtz wrote:
Have you looked at gettext? There may be some useful utilities there for yanking out your strings and cataloging them.

I had not. The goal is actually to yank all strings before starting. gettext appears to be for "fixing" already written programs.

ecurtz wrote:
Are you going to assign a category or section id or something to the strings to aid in swapping sections into RAM? I suppose this doesn't matter much on cart systems.

I'd be more inclined to use a different string table for each language and let the compiler worry about the details unless they become a problem. See the example above.

Here is another data table communication example:
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// txt.h - autogenerated
#define TXT_NEWLINE   "\n"
#define TXT_DUMMY   "---"
#define TXT_PLAYER   "Exandris"
#define TXT_HELPME   "Help me!  I'm in the castle!"
#define TXT_AREYOU   "Are you %s?"
#define TXT_CASH   "You have %s gold."

// tid.h - would also be autogenerated; different script
#define TID_NEWLINE   0
#define TID_DUMMY   1
#define TID_PLAYER   2
#define TID_HELPME   3
#define TID_AREYOU   4
#define TID_CASH   5
#define TID_MAX      6

// txt.c - would also be autogenerated; different script
const char *gText[] =
{
   TXT_NEWLINE,
   TXT_DUMMY,
   TXT_PLAYER,
   TXT_HELPME,
   TXT_AREYOU,
   TXT_CASH,
   NULL
};

// main.h
#define BUFFER_MAX 82

// main.c
int gGold = 1234;

const char *getString(int pId)
{
   return gText[pId];
}

void printStringBase(const char *pFormat, const char *pParam)
{
   printf(pFormat, pParam);
   printf(TXT_NEWLINE);
}

void printString(int pId)
{
   const char *format = getString(pId);
   printStringBase(format, TXT_DUMMY);
}

void printStringS(int pId, int pParamId)
{
   const char *format = getString(pId);
   const char *param  = getString(pParamId);
   printStringBase(format, param);
}

void printStringI(int pId, int pParam)
{
   const char *format = getString(pId);
   char param[BUFFER_MAX];
   sprintf(param, "%d", pParam);
   printStringBase(format, param);
}

void messageTest(void)
{
   int i;
   for (i = 0; i < TID_MAX; i++)
      printString(i);
}

int main(void)
{
   // real calls
   printString(TID_PLAYER);
   printString(TID_HELPME);
   printStringS(TID_AREYOU, TID_PLAYER);
   printStringI(TID_CASH, gGold);

   messageTest();
   return 0;
}

A bunch of variations of this theme could be written. You could put different language versions of the print functions in function tables and then use the API above. There is a good chance you'll need to write new functions for different languages.

Note that no message has more than one parameter. This is a Good Thing when dealing with translation. This could happen to you- "You kick the apple and a tree falls on your head." This is much better- "You kick the tree." "An apple falls on your head." Sure, not as much fun, but it can't be botched by the translator.

Note that all of the messages use %s for parameters. This is not a must, but I believe that you should write your messages for your offline proofreaders and translators. Even a non-tech savvy proofreader should be able to understand- "%s means the program puts something like a name or number there." This might not fly- "A % followed by another character will be replaced with a program supplied value." Then again I might be too afraid of how non-tech savvy people can get. =) The online message test also takes advantage of the strings-only approach.

You might want to store the string length if your display system needs that information. I'd probably put it in an array of its own and use the message id as an index.
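
A small sketch of the separate length array, indexed by message id. The trimmed-down table and the names gTextLength, initTextLengths and getStringLength are assumptions for this example. The lengths are computed once at startup here, although a generator script could just as well emit them as constants.

```c
#include <assert.h>
#include <string.h>

/* Trimmed-down message ids and table, in the style of the example above. */
#define TID_DUMMY   0
#define TID_PLAYER  1
#define TID_AREYOU  2
#define TID_MAX     3

static const char *gText[] = { "---", "Exandris", "Are you %s?", NULL };

/* Parallel array: gTextLength[id] is the length of gText[id] in bytes. */
static unsigned short gTextLength[TID_MAX];

void initTextLengths(void)
{
   int i;
   for (i = 0; i < TID_MAX; i++)
      gTextLength[i] = (unsigned short)strlen(gText[i]);
}

unsigned short getStringLength(int pId)
{
   assert(pId >= 0 && pId < TID_MAX);
   return gTextLength[pId];
}
```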

All of these files should be autogenerated. The thought of handing a header over to a proofreader or translator frightens me.

ecurtz wrote:
Consider embedding the layout in the strings (spacing, line breaks, etc.) yes it's an ugly hack and "should" be done at runtime, but on the other hand more code on the computer means more control over output and less work on the final device.

Do you have an example? Precalculation is often a very good thing.

This may sound crazy, but you might want to consider not using ASCII encoding. Your font starts with glyph 0, so you may wish to use a proprietary encoding that takes advantage of that. I'd probably use high values for control codes. This way your text display system can use built in byte code functions. You'll need to preprocess all of your messages to take advantage of this.

If you are using ASCII, 0x80 to 0xFF could be used as byte codes. You may back yourself into a system that is hard to extend to multibyte encoding formats. Although I suppose that that is the case with any program that assumes 1 byte == 1 glyph or control code.
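
To make the byte code idea concrete, here is a rough sketch in which bytes below 0x80 are glyph indices and bytes from 0x80 up index a table of control functions. TextState, the three handlers and drawText are all invented for this example. Because glyph 0 is a valid character in such an encoding, the text is passed with an explicit length instead of a NUL terminator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical display state; a real system would track much more. */
typedef struct { int x; int y; int color; int glyphs; } TextState;

typedef void (*ControlFn)(TextState *);

static void ctlNewline(TextState *pState) { pState->x = 0; pState->y++; }
static void ctlRed(TextState *pState)     { pState->color = 1; }
static void ctlDefault(TextState *pState) { pState->color = 0; }

/* Control code 0x80 + n calls sControl[n]. */
static const ControlFn sControl[] = { ctlNewline, ctlRed, ctlDefault };
#define CONTROL_COUNT (sizeof sControl / sizeof sControl[0])

void drawText(const unsigned char *pText, size_t pLength, TextState *pState)
{
   size_t i;

   assert(pText != NULL && pState != NULL);
   for (i = 0; i < pLength; i++)
   {
      unsigned char c = pText[i];
      if (c < 0x80)
      {
         /* A real system would draw glyph number c here. */
         pState->x++;
         pState->glyphs++;
      }
      else if ((size_t)(c - 0x80) < CONTROL_COUNT)
      {
         sControl[c - 0x80](pState);
      }
      /* Unassigned control codes are silently ignored in this sketch. */
   }
}
```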

-Brendan

#106398 - tepples - Wed Oct 18, 2006 6:05 pm

Or you can use 0x01 through 0x1F (or even just 0x1B) followed by one or more ASCII characters for your control codes, and then use UTF-8 for text.
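
This works because every byte of a multibyte UTF-8 sequence is 0x80 or higher, so low control bytes can never be mistaken for part of a character. A minimal sketch of a scanner for the 0x1B variant, assuming exactly one ASCII command character after each escape; ScanResult and scanMessage are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of one message. */
typedef struct { int codePoints; int commands; char lastCommand; } ScanResult;

/* Scans text in which ESC (0x1B) is followed by one ASCII command
 * character and everything else is UTF-8. Returns 0 on success, -1 if an
 * escape is not followed by an ASCII character. */
int scanMessage(const char *pText, ScanResult *pResult)
{
   const unsigned char *s = (const unsigned char *)pText;

   assert(pText != NULL && pResult != NULL);
   pResult->codePoints = 0;
   pResult->commands = 0;
   pResult->lastCommand = '\0';

   while (*s != '\0')
   {
      if (*s == 0x1B)
      {
         if (s[1] == '\0' || s[1] >= 0x80)
            return -1;
         pResult->lastCommand = (char)s[1];
         pResult->commands++;
         s += 2;
      }
      else
      {
         /* Continuation bytes look like 10xxxxxx; count only lead bytes. */
         if ((*s & 0xC0) != 0x80)
            pResult->codePoints++;
         s++;
      }
   }
   return 0;
}
```

A real display routine would dispatch on the command character instead of just counting it.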