Wednesday, April 16, 2008

Case insensitive string comparison

While looking for a case-insensitive comparison function, I came across this nice article by Matt Austern: Case Insensitive String Comparison. I decided to try to write a sample that does that. Below is the result. I am not sure it is perfect and free of issues, but it is the best I could do. At least it is better than ignoring the locale completely! Yes, ignoring it would work most of the time, since locale awareness is rarely needed, but what would you do if the need did come up?

I tested the code with VS 2005. I couldn't test it with gcc because my installation had no support for the German locale, and the only sample strings I had were the German ones from the article above; I was too lazy to look for samples in other languages. If you find an issue with words from some other language for which your compiler provides locale support, please point it out.

Code:

    #include <iostream>
    #include <locale>
    #include <string>
    #include <algorithm>
    #include <functional>
    #include <iterator>   // std::back_inserter
    // used boost::bind due to buggy bind2nd
    //#include <tr1/bind.hpp>
    #include <boost/bind.hpp>

    //using namespace std;
    //using namespace std::tr1;
    //using namespace std::tr1::placeholders;

    using namespace boost;

    // Less-than comparison of two strings after lowercasing both with a locale.
    struct CaseInsensitiveCompare
    {
        // const: the comparison does not modify the functor
        bool operator()(const std::string& lhs, const std::string& rhs) const
        {
            std::string lhs_lower;
            std::string rhs_lower;
            //std::transform(lhs.begin(), lhs.end(), std::back_inserter(lhs_lower), std::bind2nd(std::ptr_fun(std::tolower<char>), loc));
            //std::transform(rhs.begin(), rhs.end(), std::back_inserter(rhs_lower), std::bind2nd(std::ptr_fun(std::tolower<char>), loc));
            // lowercase a copy of each string using the locale-aware std::tolower
            std::transform(lhs.begin(), lhs.end(), std::back_inserter(lhs_lower), bind(std::tolower<char>, _1, loc));
            std::transform(rhs.begin(), rhs.end(), std::back_inserter(rhs_lower), bind(std::tolower<char>, _1, loc));
            return lhs_lower < rhs_lower;
        }
        CaseInsensitiveCompare(const std::locale& loc_): loc(loc_){}
    private:
        std::locale loc;
    };

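    int main()
    {
        // \334 and \374 are the Latin-1 codes for Ü and ü
        std::string lhs = "GEW\334RZTRAMINER";
        std::string rhs = "gew\374rztraminer";
        std::cout << "lhs : " << lhs << std::endl;
        std::cout << "rhs : " << rhs << std::endl;
        // "German_germany" is a Windows-style locale name; other platforms use
        // different names, and std::locale throws std::runtime_error if the
        // requested name is not available.
        CaseInsensitiveCompare cis((std::locale("German_germany")));
        //CaseInsensitiveCompare cis((std::locale()));
        // operator() is a less-than, so 0 means lhs does not sort before rhs
        std::cout << "compare result : " << cis(lhs,rhs) << std::endl;
    }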

One obvious improvement (or alternative) would be to avoid copying the strings and instead apply tolower/toupper character by character while comparing. This has a second advantage: the comparison stops at the first mismatching character, without converting both strings to a common case in full.
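
The per-character idea could look roughly like the sketch below. It is only a sketch and I have not tried it beyond the plain "C" locale: the names CaseInsensitiveLess and CharLess are made up for illustration, and it assumes a single-byte encoding where lowercasing one char at a time makes sense. It fetches the locale's ctype<char> facet once and lets std::lexicographical_compare stop at the first character that differs.

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Hypothetical functor (not part of the sample above): less-than comparison
// that lowercases one character at a time and stops at the first mismatch.
class CaseInsensitiveLess
{
    // Compares two chars after lowercasing them through the ctype facet.
    class CharLess
    {
    public:
        explicit CharLess(const std::ctype<char>& ct) : ct_(ct) {}
        bool operator()(char a, char b) const
        {
            // Compare as unsigned char so that codes above 127 order the
            // same way std::string's own operator< orders them.
            return static_cast<unsigned char>(ct_.tolower(a))
                 < static_cast<unsigned char>(ct_.tolower(b));
        }
    private:
        const std::ctype<char>& ct_;
    };

public:
    explicit CaseInsensitiveLess(const std::locale& loc) : loc_(loc) {}

    bool operator()(const std::string& lhs, const std::string& rhs) const
    {
        // Look the facet up once per call, not once per character.
        const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc_);
        // No temporary strings are built; the algorithm returns as soon as
        // it finds a pair of characters that differ.
        return std::lexicographical_compare(lhs.begin(), lhs.end(),
                                            rhs.begin(), rhs.end(),
                                            CharLess(ct));
    }

private:
    std::locale loc_; // keeps the facet alive while the functor exists
};
```

It is used the same way as the functor in the sample: construct it with a locale and call it with two strings. Two strings are equal ignoring case when neither sorts before the other.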

Herb Sutter, in one of his GotWs, writes about a case-insensitive string class and also shows the basic problems such a class would have. Put simply, it doesn't work with iostreams (cout/cerr etc.). Here: Strings: A case insensitive string class.
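
For reference, a traits-based string of that kind could be sketched as below. This is my own rough version and not a copy of the GotW code; the names ci_char_traits and ci_string are picked for illustration. Note that it uses the plain C std::toupper, so it ignores std::locale completely, unlike the sample above.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical traits class: replaces the comparison-related members of
// std::char_traits<char> with versions that ignore case.
struct ci_char_traits : public std::char_traits<char>
{
    // std::toupper needs a value representable as unsigned char
    static int up(char c) { return std::toupper(static_cast<unsigned char>(c)); }

    static bool eq(char a, char b) { return up(a) == up(b); }
    static bool lt(char a, char b) { return up(a) < up(b); }

    // Three-way comparison of the first n characters
    static int compare(const char* s1, const char* s2, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
        {
            if (lt(s1[i], s2[i])) return -1;
            if (lt(s2[i], s1[i])) return 1;
        }
        return 0;
    }

    // First position in [s, s + n) that matches a ignoring case, or null
    static const char* find(const char* s, std::size_t n, char a)
    {
        for (std::size_t i = 0; i < n; ++i)
            if (eq(s[i], a)) return s + i;
        return 0;
    }
};

// A string type whose ==, <, find etc. all ignore case
typedef std::basic_string<char, ci_char_traits> ci_string;
```

The iostreams problem shows up right away: for a ci_string s, std::cout << s does not compile, because the standard operator<< for strings expects the string and the stream to use the same traits type. Writing std::cout << s.c_str() works, as does providing your own operator<<.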
