So, I stayed up bashing my head against a brick wall all last night! PHP's string functions aren't (yet) UTF-8 aware.
This is a replacement for substr_replace which should work on UTF-8 strings:
function utf8_substr_replace($original, $replacement, $position, $length)
{
    // Split around the span, counting characters rather than bytes.
    $startString = mb_substr($original, 0, $position, "UTF-8");
    $endString = mb_substr($original, $position + $length, mb_strlen($original, "UTF-8"), "UTF-8");
    $out = $startString . $replacement . $endString;
    return $out;
}
Take this typical string from App.net:
» Hello @bob how are you?
According to App.net's entities, @bob occurs at position 9 and has a length of 3.
Normally, we would just use substr_replace.
However, PHP's standard string functions count bytes, not characters, and a non-ASCII character like "»" takes two bytes in UTF-8 (other characters can take up to four). So PHP thinks that the position of @bob is 10.
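To see the mismatch for yourself, here is a quick sketch comparing byte offsets with character offsets on the sample string. It assumes the file is saved as UTF-8 and that the mbstring extension is loaded; the variable names are just for illustration.

```php
<?php
// Assumption: this source file is UTF-8 encoded and mbstring is enabled.
$text = "» Hello @bob how are you?";

// strlen() counts bytes; mb_strlen() counts characters.
$byteLength = strlen("»");
$charLength = mb_strlen("»", "UTF-8");

// strpos() returns a byte offset; mb_strpos() returns a character offset.
$bytePos = strpos($text, "bob");
$charPos = mb_strpos($text, "bob", 0, "UTF-8");

echo "bytes in »: $byteLength, characters in »: $charLength\n";
echo "byte offset: $bytePos, character offset: $charPos\n";
```

The two offsets differ by exactly one here because there is a single two-byte character in front of the match. Every extra multibyte character earlier in the string would widen the gap.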
So, given we have the position of the substring, and its length, we can use PHP's multibyte functions to split the string in two.
$startString = mb_substr($originalString, 0, $position, "UTF-8");
» Hello @
$endString = mb_substr($originalString, $position + $length, mb_strlen($originalString, "UTF-8"), "UTF-8");
how are you?
Finally, we stitch them back together:
$newString = $startString . $replacement . $endString;
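Putting the steps together, here is a rough end-to-end sketch that runs both approaches on the sample string. The position (9) and length (3) are the values quoted above; the replacement "alice" is made up for the example. It assumes a UTF-8 source file and the mbstring extension.

```php
<?php
// Assumption: this source file is UTF-8 encoded and mbstring is enabled.
$originalString = "» Hello @bob how are you?";
$position = 9;          // character offset quoted in the post
$length = 3;            // character length quoted in the post
$replacement = "alice"; // hypothetical replacement, for illustration only

// Byte-based: "»" occupies two bytes, so offset 9 lands one character early.
$wrong = substr_replace($originalString, $replacement, $position, $length);

// Character-based: split with mb_substr, then stitch the pieces together.
$startString = mb_substr($originalString, 0, $position, "UTF-8");
$endString = mb_substr(
    $originalString,
    $position + $length,
    mb_strlen($originalString, "UTF-8"),
    "UTF-8"
);
$newString = $startString . $replacement . $endString;

echo $wrong . "\n";
echo $newString . "\n";
```

With these values the three characters replaced are "bob", so the "@" stays in place. The byte-based version swallows the "@" and leaves a stray "b" behind instead.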