Case Insensitive Matching in C++

I had this epiphany yesterday while working on my new command line note-taking project, and I wanted to write a blog post about it since I haven’t seen anyone on the internet take this approach yet (though there aren’t exactly a lot of blog posts on programming theory of this kind in general).

My program is written in C. It provides search functionality very similar to the case-insensitive matching of grep -i (you 'nix users should know what I’m talking about). If you’ve done much in C, you likely know that string parsing is not so easy (or is it just different?). Thus the question… how to perform case-insensitive text searching in C. The examples in this post use C++ streams for printing, but the idea is the same in both languages.

A few notes though before we proceed. I’m fairly new to C (about 1 year as a hobby), so everything I say here might not be entirely right (it’ll work, it just might not be the best way). If you catch something that’s wrong or could use improvement, please send me an email. Secondly, since this is probably something the C gods have already mastered, I will be writing this post aimed at the newer folk (since I myself am one), so bear with me if you already know how to do this. One final note: I am still ceaselessly amazed at how computers work, so I get fairly giddy when it comes to actual memory management and whatnot. Brace yourselves…

Chars == Ints (kind of)

To continue, we need to understand a few things about base data types in memory.

  • Ints: An int is just a few bytes of memory holding a whole number (at least 16 bits, and typically 32 on modern systems, but we don’t need to cover that here).

  • Chars: Chars are just small ints (one byte, typically 8 bits), but marked as chars. Effectively, a number has been assigned to each letter and symbol (including uppercase and lowercase), which is where integers meet chars. The integer determines which char is selected.

To demonstrate those two data types, let’s take a look at some sample code.

#include <iostream>
using namespace std;

int main( int argc, char** argv ) {
  int i = 72;
  char c = i;
  cout << "The integer " << i;
  cout << " is the same as char " << c << "!" <<  endl;
  return 0;
}

What we do here is create int i with the value of 72. We then create char c and assign it the value of i (still 72). Finally, we print both int i and char c and get…

The integer 72 is the same as char H!

If you’re wondering, we could have also just assigned char c the value of 72 explicitly and it would have still printed the letter H.

Now that that’s out of the way…

A Short Char - Integer List

  • ! " # $ % & ' ( ) * + , - . /: 33 - 47

  • 0-9: 48 - 57

  • : ; < = > ? @: 58 - 64

  • A - Z (uppercase): 65 - 90

  • [ \ ] ^ _ `: 91 - 96

  • a - z (lowercase): 97 - 122

Lowercase == Uppercase + 32

You may have noticed an interesting fact about the numbers assigned to characters in [English] computing: uppercase and lowercase letters don’t have the same integers.

These separations between the character integer ranges are key to performing a case-insensitive string search in C++. What they mean is, if you happen upon the letter a, which is integer 97, then you know that its capital equivalent is going to be 32 lower (int 65). Suddenly parsing text just got a lot easier.

Piecing it all together

Since characters are simply just integers, we can perform text matching via number ranges and math operators. For instance…

Suppose you want to build a password validator that allows numbers, upper case, lower case, and : ; < = > ? @ [ \ ] ^ _ `. That is the integer range 48 - 57 (the digit characters), 58 - 64 (the first symbols), 65 - 90 (the uppercase), 91 - 96 (the second set of symbols), and 97 - 122 (the lowercase). Combining those ranges, the allowable characters make up the single integer range of 48 - 122. Thus, our program might look something like…

#include <iostream>
using namespace std;

int validate_pass( const char* pass ) {
  long i = 0;
  while( pass[i] ) {
    if( pass[i] < 48 || pass[i] > 122 ) {
      return 0;
    }
    i++;
  }
  return 1;
}

int main( int argc, char** argv ) {
  // The first password that meets the requirements
  const char* pass = "good_password123";
  cout << pass;
  if( validate_pass( pass ) ) {
    cout << " is valid." << endl;
  } else {
    cout << " is not valid." << endl;
  }

  // The second password fails because ! is int 33, which is out of range
  const char* pass2 = "bad_password!";
  cout << pass2;
  if( validate_pass( pass2 ) ) {
    cout << " is valid." << endl;
  } else {
    cout << " is not valid." << endl;
  }
  return 0;
}

Will output…

good_password123 is valid.
bad_password! is not valid.

The first password succeeds because all of its characters are within the range of 48 - 122. The second password fails because its final character, the "!", is int 33, which is outside of the allowable character range of 48 - 122. That brings a whole new meaning to the out_of_range exception, doesn’t it?

That’s just one simple example of how this could work. One personal note: please don’t put that restriction of 48 and up on your users if you write a validator. Not having access to the more common symbols is a nightmare for users.

If you would like to see another example, the one I wrote for case-insensitive matching in my note program can be found at https://oper.io/src/nullspoon/noteless.git/tree/src/common.c#n197 in the str_contains_case_insensitive function.

Hopefully this is useful for someone besides myself. Either way though, I’m still super excited about the ease of making real-life data programmatically usable through conversion to integers. It makes me want to see what other real-life data I can convert to numbers for easier parsing. Images? Chemistry notation?

I do say my good man, Why, then the world’s mine oyster, Which I with numbers will open. (okay, I may have modified the quote a tad)
