Class ActiveSupport::Multibyte::Handlers::UTF8Handler
In: lib/active_support/multibyte/handlers/utf8_handler.rb
Parent: Object

UTF8Handler implements Unicode-aware operations for strings. The Chars proxy uses these operations when $KCODE is set to 'UTF8'.

Methods

Constants

HANGUL_SBASE = 0xAC00   Hangul character boundaries and properties
HANGUL_LBASE = 0x1100
HANGUL_VBASE = 0x1161
HANGUL_TBASE = 0x11A7
HANGUL_LCOUNT = 19
HANGUL_VCOUNT = 21
HANGUL_TCOUNT = 28
HANGUL_NCOUNT = HANGUL_VCOUNT * HANGUL_TCOUNT
HANGUL_SCOUNT = 11172
HANGUL_SLAST = HANGUL_SBASE + HANGUL_SCOUNT
HANGUL_JAMO_FIRST = 0x1100
HANGUL_JAMO_LAST = 0x11FF
UNICODE_WHITESPACE = [
  (0x0009..0x000D).to_a, # White_Space # Cc [5] <control-0009>..<control-000D>
  0x0020,                # White_Space # Zs SPACE
  0x0085,                # White_Space # Cc <control-0085>
  0x00A0,                # White_Space # Zs NO-BREAK SPACE
  0x1680,                # White_Space # Zs OGHAM SPACE MARK
  0x180E,                # White_Space # Zs MONGOLIAN VOWEL SEPARATOR
  (0x2000..0x200A).to_a, # White_Space # Zs [11] EN QUAD..HAIR SPACE
  0x2028,                # White_Space # Zl LINE SEPARATOR
  0x2029,                # White_Space # Zp PARAGRAPH SEPARATOR
  0x202F,                # White_Space # Zs NARROW NO-BREAK SPACE
  0x205F,                # White_Space # Zs MEDIUM MATHEMATICAL SPACE
  0x3000,                # White_Space # Zs IDEOGRAPHIC SPACE
].flatten.freeze   All the Unicode whitespace
UNICODE_LEADERS_AND_TRAILERS = UNICODE_WHITESPACE + [65279]   The BOM (byte order mark, U+FEFF) can also be seen as whitespace. It is a non-rendering character used to distinguish between little and big endian. This is not an issue in UTF-8, so it must be ignored.
UTF8_PAT = /\A(?:
  [\x00-\x7f] |
  [\xc2-\xdf] [\x80-\xbf] |
  \xe0 [\xa0-\xbf] [\x80-\xbf] |
  [\xe1-\xef] [\x80-\xbf] [\x80-\xbf] |
  \xf0 [\x90-\xbf] [\x80-\xbf] [\x80-\xbf] |
  [\xf1-\xf3] [\x80-\xbf] [\x80-\xbf] [\x80-\xbf] |
  \xf4 [\x80-\x8f] [\x80-\xbf] [\x80-\xbf]
)*\z/xn   Borrowed from the Kconv library by Shinji KONO (also as seen on the W3C site)
UNICODE_TRAILERS_PAT = /(#{codepoints_to_pattern(UNICODE_LEADERS_AND_TRAILERS)})+\Z/
UNICODE_LEADERS_PAT = /\A(#{codepoints_to_pattern(UNICODE_LEADERS_AND_TRAILERS)})+/
UCD = UnicodeDatabase.new   Unicode database
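
The Hangul constants above match the arithmetic syllable decomposition described in the Unicode Standard. The following is a minimal sketch of that arithmetic, not the handler's own code; the module and method names are hypothetical, and the values are copied from the constants listed above.

```ruby
# Hypothetical namespace; values mirror the HANGUL_* constants listed above.
module HangulSketch
  SBASE  = 0xAC00
  LBASE  = 0x1100
  VBASE  = 0x1161
  TBASE  = 0x11A7
  VCOUNT = 21
  TCOUNT = 28
  NCOUNT = VCOUNT * TCOUNT # 588
  SCOUNT = 11172

  # Splits one precomposed Hangul syllable into its jamo codepoints.
  # Any codepoint outside the syllable block is returned unchanged.
  def self.decompose(codepoint)
    index = codepoint - SBASE
    return [codepoint] unless (0...SCOUNT).cover?(index)

    lead  = LBASE + index / NCOUNT
    vowel = VBASE + (index % NCOUNT) / TCOUNT
    trail = TBASE + index % TCOUNT
    # A trailing value equal to TBASE means "no trailing consonant".
    trail == TBASE ? [lead, vowel] : [lead, vowel, trail]
  end
end

HangulSketch.decompose(0xD55C) # => [0x1112, 0x1161, 0x11AB]
```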

External Aliases

size -> length
slice -> []

Public Class methods

Works just like the indexed replace method on string, except instead of byte offsets you specify character offsets.

Example:

  s = "Müller"
  s.chars[2] = "e" # Replace character with offset 2
  s # => "Müeler"

  s = "Müller"
  s.chars[1, 2] = "ö" # Replace 2 characters at character offset 1
  s # => "Möler"

Returns a copy of str with the first character converted to uppercase and the remainder to lowercase

Works just like String#center, only integer specifies characters instead of bytes.

Example:

  "¾ cup".chars.center(8).to_s
  # => " ¾ cup  "

  "¾ cup".chars.center(8, " ").to_s # Use non-breaking whitespace
  # => " ¾ cup  "

Perform composition on the characters in the string

Checks if the string is valid UTF-8.
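
As a rough illustration of what such a check decides, here is a sketch for current Ruby versions that relies on String#valid_encoding? instead of the UTF8_PAT regular expression. The method name is hypothetical and this is not the handler's implementation.

```ruby
# Hypothetical helper: treats the given bytes as UTF-8 and reports
# whether they form a well-formed UTF-8 sequence.
def valid_utf8_sketch?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

valid_utf8_sketch?("M\xC3\xBCller".b) # => true  (ü encoded as two UTF-8 bytes)
valid_utf8_sketch?("M\xFCller".b)     # => false (ü as a single ISO-8859-1 byte)
```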

Perform decomposition on the characters in the string

Convert characters in the string to lowercase

Returns the number of grapheme clusters in the string. This method is very likely to be moved or renamed in future versions.

Returns the position of the passed argument in the string, counting in codepoints

Inserts the passed string at specified codepoint offsets

Works just like String#ljust, only integer specifies characters instead of bytes.

Example:

  "¾ cup".chars.ljust(8).to_s
  # => "¾ cup   "

  "¾ cup".chars.ljust(8, " ").to_s # Use non-breaking whitespace
  # => "¾ cup   "

Does Unicode-aware lstrip

Returns the KC normalization of the string by default. NFKC is considered the best normalization form for passing strings to databases and validations.

  • str - The string to perform normalization on.
  • form - The form you want to normalize in. Should be one of the following: :c, :kc, :d, or :kd. Default is ActiveSupport::Multibyte::DEFAULT_NORMALIZATION_FORM.
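
To show how the four form symbols relate to the standard normalization forms, here is a sketch built on String#unicode_normalize from current Ruby versions. The mapping table and method name are hypothetical, and the :kc default is an assumption taken from the description above.

```ruby
# Hypothetical mapping from the short form symbols to Ruby's own names.
FORM_MAP_SKETCH = { c: :nfc, kc: :nfkc, d: :nfd, kd: :nfkd }.freeze

# Assumes :kc as the default, as the description above states.
def normalize_sketch(str, form = :kc)
  str.unicode_normalize(FORM_MAP_SKETCH.fetch(form))
end

normalize_sketch("\uFB01")                  # => "fi" (the fi ligature is split under KC)
normalize_sketch("e\u0301", :c)             # => "\u00E9" (composed é)
normalize_sketch("\u00E9", :d).unpack("U*") # => [101, 769]
```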

Reverses codepoints in the string.

Works just like String#rjust, only integer specifies characters instead of bytes.

Example:

  "¾ cup".chars.rjust(8).to_s
  # => "   ¾ cup"

  "¾ cup".chars.rjust(8, " ").to_s # Use non-breaking whitespace
  # => "   ¾ cup"

Does Unicode-aware rstrip

Returns the number of codepoints in the string

Implements Unicode-aware slice with codepoints. Slicing on one point returns the codepoints for that character.

Removes leading and trailing whitespace
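
The three strip variants can be pictured as two anchored patterns built from the whitespace codepoints plus the BOM. The sketch below rebuilds such patterns with Regexp.union; all names are hypothetical and the construction is an assumption, not the handler's codepoints_to_pattern.

```ruby
# Codepoints copied from UNICODE_WHITESPACE above, plus the BOM (65279).
STRIP_CODEPOINTS_SKETCH = [
  (0x0009..0x000D).to_a, 0x0020, 0x0085, 0x00A0, 0x1680, 0x180E,
  (0x2000..0x200A).to_a, 0x2028, 0x2029, 0x202F, 0x205F, 0x3000,
  0xFEFF
].flatten.freeze

# One alternation that matches any single strippable character.
STRIP_CHAR_SKETCH = Regexp.union(STRIP_CODEPOINTS_SKETCH.map { |cp| [cp].pack("U") })
LEADERS_SKETCH    = /\A(?:#{STRIP_CHAR_SKETCH})+/
TRAILERS_SKETCH   = /(?:#{STRIP_CHAR_SKETCH})+\z/

def lstrip_sketch(str)
  str.sub(LEADERS_SKETCH, "")
end

def rstrip_sketch(str)
  str.sub(TRAILERS_SKETCH, "")
end

def strip_sketch(str)
  rstrip_sketch(lstrip_sketch(str))
end

strip_sketch("\uFEFF\u00A0 Müller\u3000\n") # => "Müller"
```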

Replaces all non-UTF-8 bytes with their ISO-8859-1 or CP1252 equivalents, resulting in a valid UTF-8 string
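
One way to approximate this behaviour in current Ruby versions is String#scrub with a block that reinterprets each invalid byte run as Windows-1252. This is a simplified sketch under that assumption; the method name is hypothetical and the handler's actual byte-by-byte rules may differ.

```ruby
# Hypothetical helper: keeps valid UTF-8 sequences and re-encodes every
# invalid byte run as if it were Windows-1252 (CP1252) text.
def tidy_bytes_sketch(str)
  str.dup.force_encoding(Encoding::UTF_8).scrub do |bad|
    # Bytes that CP1252 leaves undefined become the replacement character.
    bad.encode(Encoding::UTF_8, Encoding::Windows_1252, undef: :replace)
  end
end

tidy_bytes_sketch("M\xFCller".b) # => "Müller"
```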

Used to translate an offset from bytes to characters, for instance one received from a regular expression match
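
A minimal sketch of the translation, assuming the byte offset falls on a character boundary: count the characters in the bytes that precede the offset. The method name is hypothetical.

```ruby
# Hypothetical helper: converts a byte offset into a character offset by
# measuring the length, in characters, of the prefix before that offset.
def translate_offset_sketch(str, byte_offset)
  str.byteslice(0, byte_offset).length
end

text = "Müller"
byte_offset = text.b.index("ll")            # => 3, because ü occupies two bytes
translate_offset_sketch(text, byte_offset)  # => 2
```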

Convert characters in the string to uppercase
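
The difference between byte-oriented and Unicode-aware case conversion can be seen in current Ruby versions, where String#upcase accepts an :ascii option. This only illustrates the distinction and does not use the handler.

```ruby
word = "müller"

ascii_only = word.upcase(:ascii) # => "MüLLER", ü is left untouched
unicode    = word.upcase         # => "MÜLLER"
```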

Protected Class methods

Compose decomposed characters to the composed form

Decompose composed characters to the decomposed form

Reverse operation of g_unpack

Unpack the string at grapheme boundaries instead of codepoint boundaries
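
To make the grapheme-level unpack and its reverse concrete, here is a sketch that uses String#grapheme_clusters from current Ruby versions in place of the handler's own boundary detection. Both method names are hypothetical.

```ruby
# Hypothetical helper: one array of codepoints per grapheme cluster.
def g_unpack_sketch(str)
  str.grapheme_clusters.map { |cluster| cluster.unpack("U*") }
end

# Hypothetical helper: the reverse operation, back to a UTF-8 string.
def g_pack_sketch(clusters)
  clusters.flatten.pack("U*")
end

text = "e\u0301a"                 # "e" + combining acute accent, then "a"
clusters = g_unpack_sketch(text)  # => [[101, 769], [97]]
clusters.length                   # => 2 grapheme clusters from 3 codepoints
g_pack_sketch(clusters) == text   # => true
```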

Detect whether the codepoint is in a certain character class. Primarily used by the grapheme cluster support.

Justifies a string in a certain way. Valid values for way are :right, :left and :center. Is primarily used as a helper method by rjust, ljust and center.

Generates a padding string of a certain size.
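
The relationship between the padding helper and the three justification modes can be sketched as follows, counting codepoints instead of bytes. The method names and signatures are hypothetical and need not match the handler's.

```ruby
# Hypothetical helper: builds a padding string of exactly `size` codepoints
# by cycling through the codepoints of padstr.
def padding_sketch(size, padstr = " ")
  pad = padstr.unpack("U*")
  return "" if size <= 0 || pad.empty?

  Array.new(size) { |i| pad[i % pad.length] }.pack("U*")
end

# Hypothetical helper: way is :left, :right or :center.
def justify_sketch(str, width, way, padstr = " ")
  missing = width - str.unpack("U*").length
  return str.dup if missing <= 0

  case way
  when :right then padding_sketch(missing, padstr) + str
  when :left  then str + padding_sketch(missing, padstr)
  when :center
    left = missing / 2 # any odd remainder goes to the right side
    padding_sketch(left, padstr) + str + padding_sketch(missing - left, padstr)
  else
    raise ArgumentError, "unknown way: #{way.inspect}"
  end
end

justify_sketch("¾ cup", 8, :center) # => " ¾ cup  "
```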

Re-order codepoints so the string becomes canonical
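
Canonical reordering sorts each run of combining marks by combining class while leaving base characters in place. The sketch below assumes a tiny hard-coded class table with only two entries (the handler reads classes from the Unicode database); the names are hypothetical.

```ruby
# Illustration only: combining classes for two marks.
# U+0301 COMBINING ACUTE ACCENT = 230, U+0323 COMBINING DOT BELOW = 220.
COMBINING_CLASS_SKETCH = { 0x0301 => 230, 0x0323 => 220 }.freeze

def combining_class_sketch(codepoint)
  COMBINING_CLASS_SKETCH.fetch(codepoint, 0)
end

# Stable-sorts every run of consecutive combining marks by class.
def reorder_sketch(codepoints)
  runs = codepoints.chunk_while do |a, b|
    combining_class_sketch(a) > 0 && combining_class_sketch(b) > 0
  end
  runs.flat_map do |run|
    run.each_with_index
       .sort_by { |cp, i| [combining_class_sketch(cp), i] }
       .map(&:first)
  end
end

reorder_sketch([0x61, 0x0301, 0x0323]) # => [0x61, 0x0323, 0x0301]
```

For this input the result agrees with what "a\u0301\u0323".unicode_normalize(:nfd) produces in current Ruby versions.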

Convert characters to a different case

Unpack the string at codepoint boundaries
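
Unpacking at codepoint boundaries can be illustrated with Ruby's String#unpack and the "U*" directive. This is a sketch of the idea, not necessarily how the handler performs it.

```ruby
text = "Müller"

codepoints = text.unpack("U*") # => [77, 252, 108, 108, 101, 114]
codepoints.length              # => 6 codepoints
text.bytes.length              # => 7 bytes, because ü takes two
codepoints.pack("U*") == text  # => true, packing reverses the operation
```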
