Unicode does not support multibyte languages.

damingyipai · March 2004

For the code below, the assert fails.

std::string ss = "´ïÃ÷Ò»ÅÅ=damingyipai";
std::wstring ws = Ice::stringToWstring( ss );
std::string ss2 = Ice::wstringToString( ws );
assert( ss2 == "´ïÃ÷Ò»ÅÅ=damingyipai" );

I think it would work better if Ice called the Windows API directly:

namespace jf
{

namespace unicode
{

#if defined(_WIN32) || defined(WIN32) || defined(WIN64)

#include <windows.h>

// Returns the number of wide characters (including the terminating null)
// needed to hold srcText converted from the ANSI code page, or 0 on error.
unsigned int calcRequiredSize(const char* const srcText)
{
    if (!srcText)
        return 0;

    // Replaces ::mbstowcs(0, srcText, 0).
    // MultiByteToWideChar returns 0 on failure, never -1.
    const int count = MultiByteToWideChar(GetACP(),
                                          MB_PRECOMPOSED | MB_ERR_INVALID_CHARS,
                                          srcText,
                                          -1,
                                          NULL,
                                          0);

    return count > 0 ? (unsigned int)count : 0;
}

// Returns the number of bytes (including the terminating null) needed to
// hold srcText converted to the ANSI code page, or 0 on error.
unsigned int calcRequiredSize(const wchar_t* const srcText)
{
    if (!srcText)
        return 0;

    BOOL defUsed = FALSE;
    const int count = WideCharToMultiByte(GetACP(),
                                          0,
                                          srcText,
                                          -1,
                                          NULL,
                                          0,
                                          NULL,
                                          &defUsed);

    return count > 0 ? (unsigned int)count : 0;
}

// Converts a wide string to the ANSI code page.
// The caller owns the result and must release it with delete[].
char* transcode(const wchar_t* const toTranscode)
{
    if (!toTranscode)
        return 0;

    char* retVal = 0;
    if (*toTranscode)
    {
        // Calculate the needed size (replaces ::wcstombs(0, toTranscode, 0)).
        unsigned int count = calcRequiredSize(toTranscode);
        if (count == 0)
            return 0;

        // calcRequiredSize counts the terminating null; drop it here.
        count -= 1;

        // Allocate a buffer of that size plus one for the null and transcode.
        retVal = new char[count + 1];
        BOOL defUsed = FALSE;
        WideCharToMultiByte(GetACP(),
                            0,
                            toTranscode,
                            -1,
                            retVal,
                            (int)count + 1,
                            NULL,
                            &defUsed);

        // And cap it off anyway, just to make sure.
        retVal[count] = 0;
    }
    else
    {
        retVal = new char[1];
        retVal[0] = 0;
    }
    return retVal;
}

// Converts an ANSI code page string to a wide string.
// The caller owns the result and must release it with delete[].
wchar_t* transcode(const char* const toTranscode)
{
    if (!toTranscode)
        return 0;

    wchar_t* retVal = 0;
    if (*toTranscode)
    {
        // Calculate the buffer size required.
        const unsigned int neededLen = calcRequiredSize(toTranscode);
        if (neededLen == 0)
        {
            // Conversion failed (for example, invalid input): return an empty string.
            retVal = new wchar_t[1];
            retVal[0] = 0;
            return retVal;
        }

        // Allocate a buffer of that size plus one for the null and transcode
        // (replaces ::mbstowcs(retVal, toTranscode, neededLen + 1)).
        retVal = new wchar_t[neededLen + 1];
        MultiByteToWideChar(GetACP(),
                            MB_PRECOMPOSED,
                            toTranscode,
                            -1,
                            retVal,
                            (int)neededLen);

        // Cap it off just to make sure. We are so paranoid!
        retVal[neededLen] = 0;
    }
    else
    {
        retVal = new wchar_t[1];
        retVal[0] = 0;
    }
    return retVal;
}

#else
#error other platform
#endif

} // namespace unicode

} // namespace jf

thanks

marc · March 2004

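The string you are trying to convert is not valid UTF-8. It starts with 0xB4. However, the first byte of a UTF-8 multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD (0xC2 to 0xF4 under RFC 3629). A byte such as 0xB4, in the range 0x80 to 0xBF, can only appear as a continuation byte inside a sequence, never at the start of one.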

Have a look at: http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
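
To make the lead-byte rule concrete, here is a minimal sketch of a structural UTF-8 check. The function name looksLikeUtf8 is made up for this illustration and is not part of Ice; it assumes the RFC 3629 ranges (lead bytes 0xC2 to 0xF4, continuation bytes 0x80 to 0xBF) and does not try to reject every overlong or surrogate encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not an Ice API): returns true if s is structurally
// valid UTF-8, i.e. every lead byte is followed by the right number of
// continuation bytes. Overlong forms and surrogates are not fully checked.
bool looksLikeUtf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size())
    {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (c < 0x80)
            extra = 0;                      // plain ASCII
        else if (c >= 0xC2 && c <= 0xDF)
            extra = 1;                      // two-byte sequence
        else if (c >= 0xE0 && c <= 0xEF)
            extra = 2;                      // three-byte sequence
        else if (c >= 0xF0 && c <= 0xF4)
            extra = 3;                      // four-byte sequence
        else
            return false;                   // 0x80..0xC1 and 0xF5..0xFF cannot start a sequence

        if (s.size() - i - 1 < extra)
            return false;                   // sequence is truncated

        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char t = static_cast<unsigned char>(s[i + k]);
            if (t < 0x80 || t > 0xBF)
                return false;               // not a continuation byte
        }
        i += extra + 1;
    }
    return true;
}
```

With this check, a string whose first byte is 0xB4 is rejected immediately, while the same text re-encoded as UTF-8 passes.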

damingyipai · March 2004

That is rather inconvenient.

We do not have good tools for generating UTF-8 strings, and this problem can arise in every country that uses a multibyte language.
Can Ice give us UTF-16 or MBCS support?

Also, Slice has the same problem: strings only support UTF-8. <:-( So I can't write Chinese words in a Slice constant definition.

Of course, this is not urgent, but I hope Ice gives us more powerful support in the future, because I like it. :-)

marc · March 2004

Of course you can use UTF-16. Just convert your UTF-16 strings to UTF-8 before transmission, using wstringToString(). In fact, you can use any string encoding, as long as you have a converter for this encoding to and from UTF-8.
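
As an illustration of what such a converter does, here is a minimal, portable sketch that encodes UTF-16 as UTF-8 by hand. The name utf16ToUtf8 is hypothetical and this is not how Ice implements wstringToString(); it assumes a C++11 compiler and replaces unpaired surrogates with U+FFFD.

```cpp
#include <cstddef>
#include <string>

// Hypothetical converter (not Ice's implementation): encodes UTF-16 code
// units as UTF-8. Unpaired surrogates are replaced with U+FFFD.
std::string utf16ToUtf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF)
        {
            // High surrogate: combine with the following low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            }
            else
            {
                cp = 0xFFFD;
            }
        }
        else if (cp >= 0xDC00 && cp <= 0xDFFF)
        {
            cp = 0xFFFD; // low surrogate without a preceding high surrogate
        }

        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

A converter for a legacy code page would follow the same pattern: decode the input to Unicode code points first, then emit UTF-8 as above.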

Regarding string constants: You could use escape sequences to express UTF-8 strings in Slice. I admit that this is not very convenient, but on the other hand, string constants in Slice are used very rarely.
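
Writing the escapes by hand is tedious, so a small helper can generate them. The sketch below is only an illustration: escapeNonAscii is a made-up name, and it assumes that C-style three-digit octal escapes are what the Slice compiler accepts, which should be checked against the Slice documentation for your Ice version. It does not escape quotes or backslashes already present in the input.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: rewrites every non-ASCII byte of a UTF-8 string as a
// three-digit octal escape, so the result is pure ASCII. ASCII bytes are
// copied unchanged (quotes and backslashes are NOT escaped).
std::string escapeNonAscii(const std::string& utf8)
{
    std::string out;
    for (std::string::size_type i = 0; i < utf8.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        if (c < 0x80)
        {
            out += static_cast<char>(c);
        }
        else
        {
            char buf[5]; // backslash + three octal digits + terminating null
            std::snprintf(buf, sizeof buf, "\\%03o", static_cast<unsigned int>(c));
            out += buf;
        }
    }
    return out;
}
```

The output can then be pasted between the quotes of a string constant.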

michi · March 2004

This is currently a shortcoming in the Slice compilers. What really should happen is that characters outside the ASCII set, as well as characters written as universal character names, are UTF-8 encoded automatically.

I have this on my todo list.

Cheers,

Michi.
