A UTF-8 Aware substr_replace (for use in App.net)


So, I stayed up bashing my head against a brick wall all last night! PHP's string functions aren't (yet) UTF-8 aware.

This is a replacement for substr_replace which should work on UTF-8 strings:

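function utf8_substr_replace($original, $replacement, $position, $length)
{
    // Everything before the replaced span, counted in characters rather than bytes
    $startString = mb_substr($original, 0, $position, "UTF-8");
    // Everything after the replaced span, through to the end of the string
    $endString = mb_substr($original, $position + $length, mb_strlen($original, "UTF-8"), "UTF-8");

    $out = $startString . $replacement . $endString;

    return $out;
}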

Take this typical string from App.net:

» Hello @bob how are you?

According to App.net's entities, the bob in @bob starts at position 9 and has a length of 3.

Normally, we would just use substr_replace.

However, PHP's ordinary string functions count bytes, not characters. In UTF-8, a character like "»" takes up two bytes, so as far as substr_replace is concerned, the position we want is 10, not 9.

Arse.
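To see the mismatch for yourself, here is a small sketch comparing the byte-based and multibyte functions on the example string. It assumes the file is saved as UTF-8 and the mbstring extension is loaded; "alice" is just a made-up replacement for illustration.

```php
<?php
// Assumes this file is saved as UTF-8 and the mbstring extension is available.
$text = "» Hello @bob how are you?";

// Byte-oriented functions count bytes: "»" (U+00BB) is two bytes in UTF-8.
$byteLength = strlen($text);        // 26
$bytePos    = strpos($text, "bob"); // 10

// Multibyte functions count characters.
$charLength = mb_strlen($text, "UTF-8");           // 25
$charPos    = mb_strpos($text, "bob", 0, "UTF-8"); // 9

// Feeding the character position to the byte-based function starts one byte
// too early, so it eats the "@" and leaves a stray "b" behind.
$broken = substr_replace($text, "alice", 9, 3);
echo $broken, "\n"; // » Hello aliceb how are you?
```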

So, given we have the position of the substring, and its length, we can use PHP's multibyte functions to split the string in two.

First,

$startString = mb_substr($originalString, 0, $position, "UTF-8");

Gives us:

» Hello @

Secondly,

$endString = mb_substr($originalString, $position + $length, mb_strlen($originalString, "UTF-8"), "UTF-8");

Gives us:

 how are you?

Finally, we stitch them back together:

$newString = $startString . $replacement . $endString;
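Putting it all together, here is a rough end-to-end sketch using the example string. The function is repeated so the snippet runs on its own, with the encoding also passed to mb_strlen as a precaution. The link markup and the example.com URL are invented for illustration and are not anything App.net produces.

```php
<?php
// The function from this post, repeated so the example is self-contained.
function utf8_substr_replace($original, $replacement, $position, $length)
{
    $startString = mb_substr($original, 0, $position, "UTF-8");
    $endString = mb_substr($original, $position + $length, mb_strlen($original, "UTF-8"), "UTF-8");

    return $startString . $replacement . $endString;
}

$post = "» Hello @bob how are you?";

// Hypothetical replacement markup; the URL is made up.
$link = '<a href="https://example.com/bob">bob</a>';

// Position 9 and length 3 are character counts, as in the walkthrough above.
$linked = utf8_substr_replace($post, $link, 9, 3);
echo $linked, "\n";
// » Hello @<a href="https://example.com/bob">bob</a> how are you?
```

If a post contains several entities, one option is to replace them from last to first, so that the earlier positions stay valid as the string grows.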
