gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

C/C++ > String parsing bug I can't figure out

#163374 - NovaYoshi - Tue Sep 30, 2008 11:29 pm

I'm trying to write some code to let me split a string into sections, each section going into a new string, using a space to separate each. I'm trying to get this right so I can add argument support to my IRC bot's commands...
Code:

#include <stdio.h>


int main()
{
   system("cls"); //clear the screen
   char buffer[1024];
   char output[128][4]={"","","",""};   //where the split up text will go
   int x=0;
   int y=0;
   int z=0;
   char c=' ';

   fgets(buffer, 1024, stdin); //so spaces are read


   while(c!='\0')//until we go and find a null byte in the input string
   {
      puts("------------------------------");
      c=buffer[x];
      printf("(C:\'%c\' X:%d Y:%d Z:%d)", c,x,y,z);

      if(c==' ')   //go and start filling the next buffer if we encounter a space
      {
         puts("Space encountered...");
         output[z][y+1]='\0';   //finish the string with a null byte
         y=0;   //We want to start filling from the beginning of the next buffer
         z++;   //We want to start filling the next buffer
      }
      else
         if(isalpha(c))   //so returns and crap are caught
         {
            output[z][y]=c;   //put the character in
         }
         else
         {
            puts("Was not alphanumeric... ignoring...");
         }
      x++;
      y++;
      putchar('\n');
   printf("Out1: %s\n",output[0]);
   printf("Out2: %s\n",output[1]);
   printf("Out3: %s\n",output[2]);
   printf("Out4: %s\n",output[3]);
      putchar('\n');
   }
   output[z][y+1]='\0';   //add a null byte to that last buffer
   //Show the final results
   printf("Buffer: %s\n",buffer);
   printf("Out1: %s\n",output[0]);
   printf("Out2: %s\n",output[1]);
   printf("Out3: %s\n",output[2]);
   printf("Out4: %s\n",output[3]);

   return(1);
}


[edit] I asked on #C of Rizon, and Shahid helped me. He found what was wrong... and gave me the fixed source:
Code:

#include <ctype.h>
#include <stdio.h>
 
 
int main()
{
   system("cls");
   char buffer[1024];
   char output[4][128] = {'\0' };   //where the split up text will go
   int x=0;
   int y=0;
   int word=0;
   char c=' ';
 
   fgets(buffer, 1024, stdin);
 
 
   while(c != '\0')//until we go and find a null byte in the input string
   {
      c=buffer[x];
      if(c==' ')   //go and start filling the next buffer if we encounter a space
      {
         puts("Space encountered...");
         output[word][y+1]='\0';   //finish the string with a null byte
         y=0;   //We want to start filling from the beginning of the next buffer
         word++;   //We want to start filling the next buffer
      }
      else
      {
         if(isalpha(c))   //so returns and crap are caught
         {
            output[word][y]=c;   //put the character in
         }
         else
         {
            puts("Was not alphanumeric... ignoring...");
         }
         y++;
      }
      x++;
   }
   output[word][y+1]='\0';   //add a null byte to that last buffer
 
   printf("Buffer: %s\n",buffer);
   printf("Out1: %s\n",output[0]);
   printf("Out2: %s\n",output[1]);
   printf("Out3: %s\n",output[2]);
   printf("Out4: %s\n",output[3]);
 
   return 0;
}

#163381 - nanou - Wed Oct 01, 2008 5:13 am

... or you could use strtok().

It's frowned upon by some (but no more so than some of other things you're doing anyway.)
_________________
- nanou

#163387 - mml - Wed Oct 01, 2008 9:08 am

Use strtok, or if you're worried about the frowning (and your C library has it), use strtok_r

#163394 - Miked0801 - Wed Oct 01, 2008 5:23 pm

That there is some dangerous code. You are likely to overwrite at least 2 arrays that I can see and probably more. Why 1024 on input via fgets? You could fgetc the buffer instead and step individually. You have no word cout. More than 4 words and boom. You have no character count. More than 128 letters and boom. Everytime I see [z][y+1] I cringe as that is another case of memory corruption waitng to occur. Single letter variable names will destroy you when you pick this code up later on and try to read it.

But, you get the idea.

#163405 - NovaYoshi - Wed Oct 01, 2008 8:28 pm

Right, and I've been fixing up the code in implementing it into my bot, so that my bot won't crash with input like that...
[edit] Of course, I COULD be sloppy and have a second bot check to see if the first bot stops running, and restart it...
Miked0801 wrote:
Why 1024 on input via fgets? You could fgetc the buffer instead and step individually.

With the plugin system that XChat uses, the input string is already defined, and I don't know if fgetc works with strings and not just files...
And I don't need that big buffer anymore, for the same reason. I just have it read from the command handler's arguments.

#163485 - cheesethulhu - Fri Oct 03, 2008 6:30 am

I have to second what other posters have said. If you're allowed to modify the string passed to your code, then you should definitely use strtok_r(). If you aren't allowed to modify it, use strdup() and then strtok_r() (and of course eventually free()).

Regardless of what direction you decide to go with the code, here are some changes I would make to what you posted (the version edited by Shahid).

Code:

#include <ctype.h>
#include <stdio.h>

// (note #6)
#define MAX_ARGUMENTS         (4)
#define MAX_ARGUMENT_LENGTH (128)

int main()
{
   system("cls");
   char buffer[1024];
   char output[MAX_ARGUMENTS][MAX_ARGUMENT_LENGTH] = {'\0' };   //where the split up text will go
   int x=0;
   int y=0;
   int word=0;
   char c=' ';
   int i;
 
   fgets(buffer, 1024, stdin);
 
 
   while((c != '\0') && (word < MAX_ARGUMENTS))//until we go and find a null byte in the input string (note #6)
   {
      c=buffer[x];
      if(isspace(c) || ('\0' == c))   //go and start filling the next buffer if we encounter a space (note #1)
      {
         puts("Space encountered...");
         if(y > 0) // (note #2)
         {
            output[word][y]='\0';   //finish the string with a null byte (note #3)
            y=0;   //We want to start filling from the beginning of the next buffer
            word++;   //We want to start filling the next buffer
         }
      }
      else
      {
         if(isprint(c))   //so returns and crap are caught (note #4)
         {
            if((y + 1) < MAX_ARGUMENT_LENGTH)
            {
               output[word][y]=c;   //put the character in
               y++; // (note #5)
            }
            else {
               // (note #6)
               // error handling for exceeding the maximum argument length
            }
         }
         else
         {
            puts("Was not alphanumeric... ignoring...");
         }
      }
      x++;
   }
   // (note #1)
 
   printf("Buffer: \"%s\"\n",buffer);
   for(i = 0; i < word; ++i) {
      printf("Out%d: %s\n", i + 1, output[i]);
   }
 
   return 0;
}


1. Using isspace() will catch tabs as well. Checking for '\0' here will prevent you from having to duplicate code at the end of the while() loop. Also word will be an accurate count of the number of arguments parsed.

2. Multiple spaces in a row or leading/trailing spaces will result in extra empty output strings unless you only start the next word after finding at least one character of the current word.

3. The index y is already one past the last character you have inserted into the string, so you don't need to add 1 when appending the null byte. Masked by the string having been initialized to null bytes or else you would be encountering an extra byte of garbage at the end of each output string.

4. Using isalpha() does not agree with your informational message. You at least want isalnum() and most likely want isprint(). That way if the user passes an invalid argument like "bad^%$#@!argument", you won't tell them they passed "badargument".

5. You should only increment the output string index when you put a character in the output string. Otherwise, when you encounter an invalid character, your string will contain garbage or (in this case) get truncated (due to skipping over a null byte).

6. Makes the code less vulnerable to buffer overflow.

I'm sure there are other things that can be improved, but those are the ones that jump out at me.