Saturday, May 03, 2008

comparing structs with memcmp

Can you use memcmp to compare C style structs reliably? Let's first see what the C-standards has to say about the function: memcmp.

From the C-standards, 7.21.4.1/The memcmp function:

[QUOTE]
Synopsis
1 #include <string.h>
int memcmp(const void *s1, const void *s2, size_t n);

Description
2 The memcmp function compares the first n characters of the object pointed to by s1 to the first n characters of the object pointed to by s2.(248))

Returns
3 The memcmp function returns an integer greater than, equal to, or less than zero, accordingly as the object pointed to by s1 is greater than, equal to, or less than the object pointed to by s2.
[QUOTE END]

The footnote (248) mentioned above is important and interesting:

[QUOTE]
The contents of ‘‘holes’’ used as padding for purposes of alignment within structure objects are indeterminate. <snipped>
[QUOTE END]

There's our trouble. Indeterminate values. There can be unused padding bytes with structures as need by alignment requirements for a platform and how they are filled is not defined by the standard. It is not even mentioned if they are going to be initialized or how they get initialized. You might get lucky with the indeterminate values but in the programming world, we always consider ourselves unlucky with something like that. So, unless you are very sure that there is no padding memcmp is rendered useless. You can find if your struct has padding or not as below :

[CODE]
    if (sizeof(mystruct) == sizeof(member1) + sizeof(member2) /*+ ..so on...*/)
    {
        //no padding - probably can use memcmp - but are you really sure!!!???
    }
    else
    {
        //has definitely some padding - dont use memcmp
    }
[CODE END]

Even though you find no padding, it is not safe to do memcmp in all cases. One reason that I can make out is because +0 and -0 for us would be same. They should evaluate equal but memcmp would not think the same way. To generalize it, whenever there are more than one bit representation of a particular value, memcmp may report false negatives, when the values might be equal but comparing their bit pattern comes unequal.

Below are few points of concern would prevent reliable use of memcmp to comparing 2 structs or even 2 basic fundamental datatypes or an array of those:

1. Unused bits : Not all bits for a type (except unsigned char) are used for its value representation. For example, on a platform where an int is 4 bytes and considering that for the platform, a byte has 8 bits, not all 32 bits be used to represent the value held by an int. So, on such platforms the largest possible value of an int may not be 2^(32-1) but lesser. How the unused bits are filled is implementation specific and hence memcmp would not be reliable to even compare the basic fundamental types like int even when the values are equal. The unused bits are called holes.

2. Padding bytes : Padding bytes within a struct between 2 members of the struct to cater to alignment requirements. The padded bytes will have indeterminate values that may compare equal or unequal in a non-deterministic way. So, memcmp would lead to false negatives for comparison. However, the positives returned would be reliable but that would be a smaller set out of the possible positives of the comparison.

3. Unreliable treatment of unused bits and padding bytes after memset : memset on the types or structs can probably work around the uninitialized padding bytes/holes indeterminate values (as it treats the data as a sequence of unsigned chars which have no holes) but there can be other reasons to failure and who knows if the values of those padding bytes might change during the process of the program between a memset call and a memcmp call. If the standards explicitly prohibits an implementation from doing so, please let me know and I shall correct myself. The relevant quote would be fantastic!

4. Floating types : floating point numbers cannot even be compared reliably using ==, forget about memcmp. They always need to be compared to be a around each other but varying by some value of epsilon() as defined by the compiler.

5. Pointer representation : Pointer's bit representations can be different for the same linear address and two completely different pointers may compare equal. Of particular peculiarity is the segment:offset representation where two different segment::offset pair values stored in a pointer's representation may actually by referring to the same linear address. Linear address here = segment*16 +offset. You will get 2 different pairs of segment and offset for which the equation would evaluate to the same linear address. The inbuilt == operator is guaranteed to work such that two pointers of the same type pointing to the same object will always compare equal. This is also considering you are just wanting to compare the pointers and not the values pointed to by them.


Considering the above cases when memcmp would be unreliable, I would choose member wise comparison to be the solution to choose to compare 2 structs of same type. If the structs are different types with comparable members (but could be of different types), memcmp would probably fail more miserably. If it is a non-POD struct, it is simply undefined behaviour.

It is worth going through the following discussion on comp.lang.c that discuss the above in greater detail : Expert-Q: (a!=b) != memcmp(&a, &b, sizeof a) ?.

2 comments:

CodeWizard said...

It's not just padding. If you have an array of characters, any space after the null termination character are undefined as well. So two strings could be equal but the memcmp would still fail.

Anonymous said...

Great point about strings there!