I'm in the process of trying to convert our database from latin1 to UTF-8. Unfortunately I can't do a massive single switchover as the application needs to stay online and we have 700GB of database to convert.
So I'm trying to leverage a little mysql hack of converting tables to UTF-8 however not the data. I'd like the data to be read, converted, and replaced in real time. (A JIT conversion if you will)
Our php app currently uses all of the defaults so it's connecting to mysql using the latin1 character set and it drops UTF-8 data encoded in latin1. When you view the data with latin1 the UTF-8 characters show up as expected. When you view the data with UTF-8 things get jumbled up.
So I propose forcing the mysql character set to UTF-8 and then doing a just in time conversion of the data if necessary. Now, seeing as cp1252/windows-1252 is a subset of UTF-8 it's not so straight forward (as far as I can see) to detect the cp1252/windows-1252 encoding.
I've written the following code that attempts to detect cp1252/windows-1252 encoding and convert as necessary. It should also detect properly encoded UTF-8 characters and do nothing.
$a = 'Card☃'; //cp1252 encoded
$a_test = '☃'.$a; //add known UTF8 character
$c = mb_convert_encoding($a_test, 'cp1252', 'UTF-8');
// attempt to detect known utf8 character after conversion
if (mb_strpos($c, '☃') === false) {
// not found, original string was not cp1252 encoded, so print
var_dump($a);
} else {
// found, original string was cp1252 encoded, remove test character and print
// This case runs
$c = mb_strcut($c, 1);
var_dump($c);
}
$a = 'COD☃'; //proper UTF8 encoded
$a_test = '☃'.$a; //add known UTF8 character
$c = mb_convert_encoding($a_test, 'cp1252', 'UTF-8');
// attempt to detect known utf8 character after conversion
if (mb_strpos($c, '☃') === false) {
// not found, original string was not cp1252 encoded, so print
// This case runs
var_dump($a);
} else {
// found, original string was cp1252 encoded, remove test character and print
$c = mb_strcut($c, 1);
var_dump($c);
}
The output of running this code is:
string 'Card☃' (length=7)
string 'COD☃' (length=6)
I understand that running this on all strings coming out of the database will have a performance impact, yet to be measured, but if I can do a JIT conversion before switching everything completely it's worth it to me.
Does anyone have any pointers on how to optimize this?
Firstly, Windows-1252 is not a subset of UTF-8. You could argue that ASCII is a subset of UTF-8, but that is usually more of an ideological debate.
Secondly, it is impossible to handle strings with both CP1252 and UTF-8 "characters" in them (really for CP1252 it's a byte and for Unicode it's a code point). Either you try to read it as CP1252, and see all the Unicode characters as single bytes, or you read it as UTF-8 and it cuts out any invalid byte sequences (or creates random characters if the CP1252 characters match a Unicode code point). You are not removing the test character with $c = mb_strcut($c, 1);
, you are removing a question mark created by mb_convert_encoding because it could not convert that Unicode character into a CP1252 character.
Thirdly, you should never convert a String, and then after the fact try to determine the encoding. After you converted your second test string, it was ?COD?
. There is no reason to check if a Unicode character exists in it, because you converted it to CP1252. There can't be Unicode characters in it. As the programmer, you have to know what the output is.
The only solution is to check if the string is CP1252, convert the offending characters to placeholders, and then convert that string to Unicode:
function convert_cp1252_to_utf8($input, $default = '', $replace = array()) {
if ($input === null || $input == '') {
return $default;
}
// https://en.wikipedia.org/wiki/UTF-8
// https://en.wikipedia.org/wiki/ISO/IEC_8859-1
// https://en.wikipedia.org/wiki/Windows-1252
// http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
$encoding = mb_detect_encoding($input, array('Windows-1252', 'ISO-8859-1'), true);
if ($encoding == 'ISO-8859-1' || $encoding == 'Windows-1252') {
/*
* Use the search/replace arrays if a character needs to be replaced with
* something other than its Unicode equivalent.
*/
/*$replace = array(
128 => "€", // http://www.fileformat.info/info/unicode/char/20AC/index.htm EURO SIGN
129 => "", // UNDEFINED
130 => "‚", // http://www.fileformat.info/info/unicode/char/201A/index.htm SINGLE LOW-9 QUOTATION MARK
131 => "ƒ", // http://www.fileformat.info/info/unicode/char/0192/index.htm LATIN SMALL LETTER F WITH HOOK
132 => "„", // http://www.fileformat.info/info/unicode/char/201e/index.htm DOUBLE LOW-9 QUOTATION MARK
133 => "…", // http://www.fileformat.info/info/unicode/char/2026/index.htm HORIZONTAL ELLIPSIS
134 => "†", // http://www.fileformat.info/info/unicode/char/2020/index.htm DAGGER
135 => "‡", // http://www.fileformat.info/info/unicode/char/2021/index.htm DOUBLE DAGGER
136 => "ˆ", // http://www.fileformat.info/info/unicode/char/02c6/index.htm MODIFIER LETTER CIRCUMFLEX ACCENT
137 => "‰", // http://www.fileformat.info/info/unicode/char/2030/index.htm PER MILLE SIGN
138 => "Š", // http://www.fileformat.info/info/unicode/char/0160/index.htm LATIN CAPITAL LETTER S WITH CARON
139 => "‹", // http://www.fileformat.info/info/unicode/char/2039/index.htm SINGLE LEFT-POINTING ANGLE QUOTATION MARK
140 => "Œ", // http://www.fileformat.info/info/unicode/char/0152/index.htm LATIN CAPITAL LIGATURE OE
141 => "", // UNDEFINED
142 => "Ž", // http://www.fileformat.info/info/unicode/char/017d/index.htm LATIN CAPITAL LETTER Z WITH CARON
143 => "", // UNDEFINED
144 => "", // UNDEFINED
145 => "‘", // http://www.fileformat.info/info/unicode/char/2018/index.htm LEFT SINGLE QUOTATION MARK
146 => "’", // http://www.fileformat.info/info/unicode/char/2019/index.htm RIGHT SINGLE QUOTATION MARK
147 => "“", // http://www.fileformat.info/info/unicode/char/201c/index.htm LEFT DOUBLE QUOTATION MARK
148 => "”", // http://www.fileformat.info/info/unicode/char/201d/index.htm RIGHT DOUBLE QUOTATION MARK
149 => "•", // http://www.fileformat.info/info/unicode/char/2022/index.htm BULLET
150 => "–", // http://www.fileformat.info/info/unicode/char/2013/index.htm EN DASH
151 => "—", // http://www.fileformat.info/info/unicode/char/2014/index.htm EM DASH
152 => "˜", // http://www.fileformat.info/info/unicode/char/02DC/index.htm SMALL TILDE
153 => "™", // http://www.fileformat.info/info/unicode/char/2122/index.htm TRADE MARK SIGN
154 => "š", // http://www.fileformat.info/info/unicode/char/0161/index.htm LATIN SMALL LETTER S WITH CARON
155 => "›", // http://www.fileformat.info/info/unicode/char/203A/index.htm SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
156 => "œ", // http://www.fileformat.info/info/unicode/char/0153/index.htm LATIN SMALL LIGATURE OE
157 => "", // UNDEFINED
158 => "ž", // http://www.fileformat.info/info/unicode/char/017E/index.htm LATIN SMALL LETTER Z WITH CARON
159 => "Ÿ", // http://www.fileformat.info/info/unicode/char/0178/index.htm LATIN CAPITAL LETTER Y WITH DIAERESIS
);*/
if (count($replace) != 0) {
$find = array();
foreach (array_keys($replace) as $key) {
$find[] = chr($key);
}
$input = str_replace($find, array_values($replace), $input);
}
/*
* Because ISO-8859-1 and CP1252 are identical except for 0x80 through 0x9F
* and control characters, always convert from Windows-1252 to UTF-8.
*/
$input = iconv('Windows-1252', 'UTF-8//IGNORE', $input);
if (count($replace) != 0) {
$input = html_entity_decode($input);
}
}
return $input;
}
The trick is that you have to check for both ISO-8859-1
and CP1252
because they are so similar. I found this out the hard way after hours of playing around with this function, only to have this answer save me. If you found this function helpful, go +1 that answer.
Basically, this function replaces all those bad CP1252 bytes with HTML entities representing the Unicode characters. We then convert the string from ISO-8859-1
/CP1252
to UTF-8
, while none of our new Unicode characters are mangled because they are simple ASCII characters. Finally, we decode the HTML entities and finally have a 100% Unicode string.