Here's a short one on string literals in C++. Ask yourself: what is their type? It is 'array of n const char'. So we might think that:
char* literal = "Hello World!";
would be invalid in C++. Surprisingly, it is not: even Comeau online compiles it successfully, without so much as a warning.
The C++ standard, however, tries to protect you. It hints that the above is wrong by marking the conversion as a deprecated feature, one that was probably allowed only to keep backward compatibility with C.
Here is what the standard says as part of section [2.13.4/2]:
[quote]
A string literal that does not begin with u, U, or L is an ordinary string literal, also referred to as a narrow string literal. An ordinary string literal has type “array of n const char”, where n is the size of the string as defined below; it has static storage duration (3.7) and is initialized with the given characters.
[/quote]
So, going by the type alone, the following would definitely have been invalid in C++:
char* literal = "Hello World!";
"Hello World!" is an array of 13 [spooky :-)] constant characters: 12 visible characters plus the terminating '\0'. 'literal' points to the first element of that array, and since that element is const, the pointer should not be declarable as a non-const char*. It has to be of the type 'pointer to const char'.
But as mentioned above, to keep backward compatibility with C, where the above works, the array-to-pointer conversion includes an implicit conversion by which a string literal can be converted to an rvalue of type "pointer to char". The standard mentions this in section [4.2/2]:
[quote]
A string literal (2.13.4) with no prefix, with a u prefix, with a U prefix, or with an L prefix can be converted to an rvalue of type “pointer to char”, “pointer to char16_t”, “pointer to char32_t”, or “pointer to wchar_t”, respectively. In any case, the result is a pointer to the first element of the array. This conversion is considered only when there is an explicit appropriate pointer target type, and not when there is a general need to convert from an lvalue to an rvalue. [ Note: this conversion is deprecated. See Annex D. —end note ]
[/quote]
The thing to be happy about is the note above, which is reiterated in Annex D, section [D.4/1]:
[quote]
The implicit conversion from const to non-const qualification for string literals (4.2) is deprecated.
[/quote]
So it is best to keep the good habit of declaring a pointer to a literal as a pointer to const. :-)
[The draft of the C++ standard used for the quotes above has document number N2315=07-0175, dated 2007-06-25.]
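As a minimal sketch of the points above (the helper name literal_size is made up for illustration, and a C++11 or later compiler is assumed for static_assert and decltype), the type of the literal can be checked at compile time and the const-correct declaration used instead of the deprecated one:

```cpp
#include <cstddef>
#include <type_traits>

// A narrow string literal is an lvalue of type "array of n const char".
// For "Hello World!" n is 13: 12 visible characters plus the terminating '\0'.
static_assert(std::is_same<decltype("Hello World!"), const char (&)[13]>::value,
              "a narrow string literal is an array of const char");

// Const-correct declaration: only the ordinary array-to-pointer conversion
// is needed, not the deprecated conversion to a non-const char*.
const char* literal = "Hello World!";

// Hypothetical helper: size of the literal's array type, including '\0'.
std::size_t literal_size() { return sizeof("Hello World!"); }
```

Writing through a char* obtained from a literal is undefined behaviour, which is why keeping the const in the declaration lets the compiler reject such writes up front.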
Tuesday, July 28, 2009
String literals in C++
Posted by abnegator at 7/28/2009 09:29:00 PM
Labels: backward compatibility, C, C++, const, function pointers, literals, rvalue, standard conversions, string, undefined behaviour


Saturday, May 03, 2008
comparing structs with memcmp
Can you use memcmp to compare C-style structs reliably? Let's first see what the C standard has to say about the memcmp function.
From the C standard, 7.21.4.1, The memcmp function:
[QUOTE]
Synopsis
1 #include <string.h>
int memcmp(const void *s1, const void *s2, size_t n);
Description
2 The memcmp function compares the first n characters of the object pointed to by s1 to the first n characters of the object pointed to by s2.(248)
Returns
3 The memcmp function returns an integer greater than, equal to, or less than zero, accordingly as the object pointed to by s1 is greater than, equal to, or less than the object pointed to by s2.
[QUOTE END]
The footnote (248) mentioned above is important and interesting:
[QUOTE]
The contents of ‘‘holes’’ used as padding for purposes of alignment within structure objects are indeterminate. <snipped>
[QUOTE END]
There's our trouble: indeterminate values. A structure can contain unused padding bytes, as needed by the alignment requirements of a platform, and how they are filled is not defined by the standard. It does not even say whether they are initialized, let alone how. You might get lucky with the indeterminate values, but in the programming world we always assume we will be unlucky with something like that. So, unless you are very sure that there is no padding, memcmp is rendered useless. You can find out whether your struct has padding as below:
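[CODE]
#include <stdio.h>

/* member types are chosen only for illustration */
struct mystruct { char member1; int member2; };

int main(void)
{
    struct mystruct s;
    /* no padding only if the members account for every byte of the struct */
    if (sizeof s == sizeof s.member1 + sizeof s.member2 /* + ...so on... */)
        puts("no padding - probably can use memcmp - but are you really sure?");
    else
        puts("definitely has some padding - don't use memcmp");
    return 0;
}
[CODE END]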
Even if you find no padding, it is not safe to use memcmp in all cases. One reason I can make out is that +0 and -0 are the same value for us. They should evaluate as equal, but memcmp would not think the same way. To generalize: whenever a particular value has more than one bit representation, memcmp may report false negatives, where the values are equal but their bit patterns compare unequal.
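The +0 / -0 case can be sketched as follows. This assumes doubles use the IEEE 754 format (where the two zeros differ only in the sign bit); the helper name same_bytes is made up for illustration.

```cpp
#include <cstring>

// Hypothetical helper: true when the object representations of a and b
// are byte-for-byte identical, which is all that memcmp can tell us.
bool same_bytes(double a, double b)
{
    return std::memcmp(&a, &b, sizeof(double)) == 0;
}
```

Under that assumption, 0.0 == -0.0 evaluates to true while same_bytes(0.0, -0.0) returns false: equal values, unequal bit patterns.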
Below are a few points of concern that would prevent reliable use of memcmp to compare two structs, or even two fundamental data types or arrays of those:
1. Unused bits: Not all bits of a type (except unsigned char) need be used for its value representation. For example, on a platform where an int is 4 bytes and a byte has 8 bits, not all 32 bits have to be used to represent the value held by an int. So, on such platforms the largest possible value of an int may be less than 2^31 - 1. How the unused bits are filled is implementation specific, and hence memcmp would not be reliable even for comparing fundamental types like int when the values are equal. The unused bits are called holes (the C standard's own term is padding bits).
2. Padding bytes: A struct may have padding bytes between two of its members to cater to alignment requirements. The padding bytes have indeterminate values that may compare equal or unequal in a non-deterministic way, so memcmp would lead to false negatives. The positives it returns would be reliable, but they would be only a subset of the cases in which the structs are really equal.
3. Unreliable treatment of unused bits and padding bytes after memset: Calling memset on the types or structs can probably work around the indeterminate values of uninitialized padding bytes and holes (as it treats the data as a sequence of unsigned chars, which have no holes), but there can be other reasons for failure, and who knows whether the values of those padding bytes might change while the program runs between the memset call and the memcmp call. If the standard explicitly prohibits an implementation from doing so, please let me know and I shall correct myself. The relevant quote would be fantastic!
4. Floating types: Floating-point numbers that result from computation cannot be compared reliably even with ==, forget about memcmp. They need to be checked for being close to each other, within a tolerance based on an epsilon such as DBL_EPSILON from <float.h> (std::numeric_limits<T>::epsilon() in C++).
5. Pointer representation: A pointer's bit representation can be different for the same linear address, so two pointers with completely different bit patterns may still compare equal. Of particular peculiarity is the segment:offset representation, where two different segment:offset pairs stored in a pointer's representation may actually be referring to the same linear address. Linear address here = segment*16 + offset. For example, 0x1234:0x0010 and 0x1235:0x0000 both evaluate to the linear address 0x12350. The built-in == operator is guaranteed to work such that two pointers of the same type pointing to the same object always compare equal. This is also assuming you just want to compare the pointers and not the values pointed to by them.
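Point 4 can be sketched roughly as below. The helper name nearly_equal and the default factor of 4 are assumptions made for illustration; a suitable tolerance depends on the computation that produced the values.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical helper: true when a and b differ by no more than a few
// machine epsilons, scaled by the larger of the two magnitudes.
bool nearly_equal(double a, double b, double factor = 4.0)
{
    const double scale = std::max(std::fabs(a), std::fabs(b));
    const double tolerance = factor * std::numeric_limits<double>::epsilon() * scale;
    return std::fabs(a - b) <= tolerance;
}
```

With this sketch, nearly_equal(0.1 + 0.2, 0.3) holds even though the two sides are usually not bit-identical. Because the tolerance is relative, it shrinks toward zero for values near zero, so such comparisons may need an additional absolute tolerance.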
Considering the above cases in which memcmp would be unreliable, I would choose member-wise comparison as the way to compare two structs of the same type. If the structs are of different types with comparable members (which could themselves be of different types), memcmp would probably fail even more miserably. If it is a non-POD struct, it is simply undefined behaviour.
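A member-wise comparison might look like the sketch below. The struct Sample and its members are hypothetical and chosen only so that padding between the members is likely on common platforms.

```cpp
#include <cstring>

// Hypothetical POD struct: most platforms insert padding bytes
// between `tag` and `value` to align the int.
struct Sample {
    char tag;
    int  value;
};

// Member-wise equality: looks only at the members, never at padding.
bool equal(const Sample& lhs, const Sample& rhs)
{
    return lhs.tag == rhs.tag && lhs.value == rhs.value;
}

// Byte-wise equality, shown only for contrast: it also compares whatever
// happens to be stored in the padding bytes.
bool bytes_equal(const Sample& lhs, const Sample& rhs)
{
    return std::memcmp(&lhs, &rhs, sizeof(Sample)) == 0;
}
```

Two Sample objects with the same tag and value always satisfy equal, whereas the result of bytes_equal for them depends on the contents of the padding and so cannot be relied upon.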
It is worth going through the following discussion on comp.lang.c, which discusses the above in greater detail: Expert-Q: (a!=b) != memcmp(&a, &b, sizeof a) ?.
Posted by abnegator at 5/03/2008 05:31:00 PM
Labels: C, memcmp, memset, padding, sizeof, structs


Sunday, February 17, 2008
const in C and C++
Globals are usually bad, but not always. Consider C code that is to be migrated to C++ and has a bunch of global constants. As soon as that code is worked upon and compiled as C++, those const globals will cause the build to start failing with "undefined reference" or similar errors.
This is because of a fundamental difference between how const is treated in C and C++ in the context of linkage.
In C, apart from the fact that const variables are non-modifiable, they are the same as any other variables. In C++, const affects linkage as well. C++ const objects/variables at namespace scope have internal linkage by default, and hence you get an unresolved symbol error when one compilation unit tries to access something that has internal linkage in a different compilation unit.
The solution is either to make the extern declaration (the same one that you put in the other files) visible before you define the const variables, or simply to define the variable as extern const instead of just const.
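A minimal sketch of the second option follows. The file names and the constant names kInternalLimit and kSharedLimit are hypothetical.

```cpp
// constants.cpp (hypothetical file name)

// Plain const at namespace scope: internal linkage in C++, so other
// compilation units cannot refer to it.
const int kInternalLimit = 10;

// extern const with an initializer: still a definition, but now with
// external linkage, so other compilation units can refer to it.
extern const int kSharedLimit = 100;

// Another compilation unit (say, user.cpp) would only declare it:
//     extern const int kSharedLimit;
// and can then use kSharedLimit once both files are linked together.
```

Putting the extern declaration in a header that constants.cpp itself includes before the definition has the same effect, because the definition then inherits the external linkage of the earlier declaration.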