mlocati/unipoints

A Unicode Codepoint library for PHP

Simplified Unicode Terminology

Codepoints

Codepoints are characters, spaces, symbols, punctuation marks, separators, and so on: the single units that make up a text.
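To make this concrete, the following sketch (plain PHP, independent of this library) splits a UTF-8 string into its codepoints and compares the byte count with the codepoint count. The helper name `demoSplitCodepoints` is an assumption made for this example, and the input is assumed to be valid UTF-8.

```php
<?php

/**
 * Split a UTF-8 string into an array of single-codepoint strings.
 * Hypothetical helper for illustration, not part of MLUnipoints.
 *
 * @return string[]
 */
function demoSplitCodepoints(string $text): array
{
    // The "u" modifier makes PCRE treat the subject as UTF-8, so the empty
    // pattern matches between codepoints instead of between bytes.
    $parts = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false when the input is not valid UTF-8.
    return $parts === false ? [] : $parts;
}

// "a", GREEK SMALL LETTER ALPHA (U+03B1), UMBRELLA (U+2602)
$text = "a\u{3B1}\u{2602}";

echo strlen($text), " bytes\n";                         // 6 bytes
echo count(demoSplitCodepoints($text)), " codepoints\n"; // 3 codepoints
```

The three codepoints take 1, 2 and 3 bytes respectively in UTF-8, which is why the two counts differ.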

Blocks

Codepoints are grouped in blocks, that is, groups of contiguous codepoints that are part of a common set.

Examples:

  • a is contained in the Basic Latin block
  • α is contained in the Greek and Coptic block
  • 𝅘𝅥𝅮 is contained in the Musical Symbols block
  • ↩ is contained in the Arrows block
  • ☂ is contained in the Miscellaneous Symbols block
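As a rough illustration of what "contiguous" means here, the sketch below looks up a block by numeric range. It is plain PHP and does not use this library: the `DEMO_BLOCKS` table and the `demoBlockOf` function are hypothetical names, and the table covers only the five blocks mentioned above, with the ranges defined by the Unicode standard.

```php
<?php

// Illustrative subset of Unicode block ranges (first and last codepoint,
// both inclusive). This table is only a stand-in for the example.
const DEMO_BLOCKS = [
    'Basic Latin'           => [0x0000, 0x007F],
    'Greek and Coptic'      => [0x0370, 0x03FF],
    'Arrows'                => [0x2190, 0x21FF],
    'Miscellaneous Symbols' => [0x2600, 0x26FF],
    'Musical Symbols'       => [0x1D100, 0x1D1FF],
];

/**
 * Return the name of the demo block containing the codepoint,
 * or null if the codepoint is outside the blocks listed above.
 */
function demoBlockOf(int $codepoint): ?string
{
    foreach (DEMO_BLOCKS as $name => [$first, $last]) {
        if ($codepoint >= $first && $codepoint <= $last) {
            return $name;
        }
    }

    return null;
}

echo demoBlockOf(0x0061), "\n";  // Basic Latin (a)
echo demoBlockOf(0x03B1), "\n";  // Greek and Coptic (GREEK SMALL LETTER ALPHA)
echo demoBlockOf(0x1D160), "\n"; // Musical Symbols (MUSICAL SYMBOL EIGHTH NOTE)
```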

Planes

Planes are ranges of 65,536 contiguous codepoints. Each plane may contain zero, one or many blocks.
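Because every plane spans exactly 65,536 (0x10000) codepoints, the plane number follows directly from the numeric value of a codepoint. A minimal sketch in plain PHP, where `demoPlaneOf` is a hypothetical helper and not part of this library:

```php
<?php

/**
 * Return the plane number (0 to 16) of a codepoint.
 * Hypothetical helper for illustration, not part of MLUnipoints.
 */
function demoPlaneOf(int $codepoint): int
{
    if ($codepoint < 0 || $codepoint > 0x10FFFF) {
        throw new InvalidArgumentException('Codepoint outside the Unicode range');
    }

    // Dividing by 0x10000 (65,536) is the same as shifting right by 16 bits.
    return $codepoint >> 16;
}

echo demoPlaneOf(0x0061), "\n";  // 0 (Basic Multilingual Plane)
echo demoPlaneOf(0x1D160), "\n"; // 1 (Supplementary Multilingual Plane)
```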

General Category

This library also provides the general category of every codepoint: you can tell whether a codepoint is a lowercase letter, a symbol, a punctuation mark, and so on.
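For comparison, PHP's regular expressions already expose general categories through `\p{..}` classes. The sketch below is plain PHP and does not use this library: `demoCategoryOf` and the short list of category codes are assumptions made for the example, and it relies on PCRE being built with Unicode property support, as is usual for PHP.

```php
<?php

// A few two-letter general category codes defined by Unicode.
const DEMO_CATEGORY_CODES = ['Lu', 'Ll', 'Nd', 'Po', 'Zs', 'So'];

/**
 * Return the general category code of a single-codepoint string,
 * or null if it is not in the short demo list above.
 * Hypothetical helper for illustration, not part of MLUnipoints.
 */
function demoCategoryOf(string $char): ?string
{
    foreach (DEMO_CATEGORY_CODES as $code) {
        // ^...\z anchors the match so that exactly one codepoint is accepted.
        if (preg_match('/^\p{' . $code . '}\z/u', $char) === 1) {
            return $code;
        }
    }

    return null;
}

echo demoCategoryOf('a'), "\n";        // Ll (lowercase letter)
echo demoCategoryOf('!'), "\n";        // Po (other punctuation)
echo demoCategoryOf("\u{2602}"), "\n"; // So (other symbol)
```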

Surrogate Codepoints

To represent codepoints beyond the range that fits in 16 bits, Unicode reserves a range of "surrogate" codepoints. In 16-bit encodings such as UTF-16, a single character (or punctuation mark, ...) outside that range is represented by two consecutive surrogates: a "high surrogate" followed by a "low surrogate". Surrogate codepoints are therefore meaningful only in pairs.
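The arithmetic behind a surrogate pair is short enough to sketch. The code below is plain PHP and independent of this library; `demoToSurrogatePair` and `demoFromSurrogatePair` are hypothetical helper names. It applies the standard UTF-16 rule: subtract 0x10000, then put the upper 10 bits in the high surrogate (starting at U+D800) and the lower 10 bits in the low surrogate (starting at U+DC00).

```php
<?php

/**
 * Split a codepoint above U+FFFF into its UTF-16 surrogate pair.
 *
 * @return int[] the high surrogate followed by the low surrogate
 */
function demoToSurrogatePair(int $codepoint): array
{
    if ($codepoint < 0x10000 || $codepoint > 0x10FFFF) {
        throw new InvalidArgumentException('Only codepoints above U+FFFF need surrogates');
    }

    $offset = $codepoint - 0x10000; // a 20-bit value

    return [
        0xD800 + ($offset >> 10),   // upper 10 bits
        0xDC00 + ($offset & 0x3FF), // lower 10 bits
    ];
}

/**
 * Combine a high and a low surrogate back into the original codepoint.
 */
function demoFromSurrogatePair(int $high, int $low): int
{
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

[$high, $low] = demoToSurrogatePair(0x1D160); // MUSICAL SYMBOL EIGHTH NOTE
printf("U+%04X U+%04X\n", $high, $low);       // U+D834 U+DD60
```

Neither half identifies a character on its own, which is why a lone surrogate carries no meaning.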

Sample Usage

Codepoints are listed in the string-backed MLUnipoints\Codepoint enum. The value of each enum case is a string containing the Unicode symbol itself, so to get the case for a you can simply write:

use MLUnipoints\Codepoint;

$codepoint = Codepoint::from('a');
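Since the enum is string-backed, the standard PHP backed-enum methods apply: `from()` throws a `ValueError` when no case matches, while `tryFrom()` returns null. The sketch below uses a tiny stand-in enum, `DemoCodepoint`, so that it runs without the library installed. The stand-in and its `LATIN_SMALL_LETTER_A` case name are assumptions for illustration; with the library you would call the same methods on MLUnipoints\Codepoint.

```php
<?php

// Tiny stand-in for the real enum, so the snippet is self-contained.
enum DemoCodepoint: string
{
    case LATIN_SMALL_LETTER_A = 'a';
    case SUN_BEHIND_CLOUD = "\u{26C5}";
}

// tryFrom() is the safe choice for untrusted input: it returns null
// instead of throwing when the string is not the value of any case.
$found = DemoCodepoint::tryFrom('a');
$missing = DemoCodepoint::tryFrom('ab');

echo $found?->name ?? '(no match)', "\n";   // LATIN_SMALL_LETTER_A
echo $missing?->name ?? '(no match)', "\n"; // (no match)

// from() throws a ValueError for the same input.
try {
    DemoCodepoint::from('ab');
} catch (ValueError $error) {
    echo "from() rejected the input\n";
}
```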

Since the MLUnipoints\Codepoint enum is rather big (it can use tens of MB of memory when it is autoloaded), you can also use the block-specific enums defined under the MLUnipoints\Codepoint namespace. This requires knowing the block in advance. For example:

use MLUnipoints\Codepoint;

$codepoint = Codepoint\Basic_Latin::from('a');

Every case of the MLUnipoints\Codepoint enum has a MLUnipoints\Info\CodepointInfo attribute. You can easily retrieve this attribute by writing:

use MLUnipoints\Codepoint;
use MLUnipoints\Info\CodepointInfo;

$codepoint = Codepoint::from('a');
$codepointInfo = CodepointInfo::from($codepoint);

This attribute provides the numeric value of the codepoint, the Unicode name, the general category, and (if you don't use the block-specific enums) the block.
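For readers unfamiliar with attributes on enum cases, the sketch below shows one way such metadata can be attached and read back with PHP reflection. It is a self-contained illustration under stated assumptions: `DemoInfo` and `DemoAnnotated` are hypothetical stand-ins, and the library's own CodepointInfo::from() may be implemented differently.

```php
<?php

// Hypothetical attribute class holding per-codepoint metadata.
// Enum cases are class constants, hence TARGET_CLASS_CONSTANT.
#[Attribute(Attribute::TARGET_CLASS_CONSTANT)]
final class DemoInfo
{
    public function __construct(
        public readonly int $id,
        public readonly string $name,
    ) {
    }

    /**
     * Read the DemoInfo attribute attached to an enum case.
     */
    public static function from(UnitEnum $case): self
    {
        $reflection = new ReflectionEnumUnitCase($case::class, $case->name);
        $attributes = $reflection->getAttributes(self::class);
        if ($attributes === []) {
            throw new LogicException("No DemoInfo attribute on {$case->name}");
        }

        // newInstance() calls the constructor with the attribute arguments.
        return $attributes[0]->newInstance();
    }
}

// Stand-in enum with a single annotated case.
enum DemoAnnotated: string
{
    #[DemoInfo(97, 'LATIN SMALL LETTER A')]
    case LATIN_SMALL_LETTER_A = 'a';
}

$info = DemoInfo::from(DemoAnnotated::from('a'));
echo $info->id, ' ', $info->name, "\n"; // 97 LATIN SMALL LETTER A
```

Keeping metadata in attributes means it is only instantiated when requested, rather than stored in every enum case.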

Similarly, you can retrieve the details of the block, the plane and the general category.

For example, this code:

use MLUnipoints\Codepoint;
use MLUnipoints\Info\BlockInfo;
use MLUnipoints\Info\CategoryInfo;
use MLUnipoints\Info\CodepointInfo;
use MLUnipoints\Info\PlaneInfo;

$codepoint = Codepoint::from('a');
$codepointInfo = CodepointInfo::from($codepoint);
$categoryInfo = CategoryInfo::from($codepointInfo->category);
$blockInfo = BlockInfo::from($codepointInfo->block);
$planeInfo = PlaneInfo::from($blockInfo->plane);

echo 'Codepoint: ', $codepointInfo->id, "\n";

echo 'Codepoint name: ', $codepointInfo->name, "\n";

echo 'Codepoint general category: ', $categoryInfo->description, "\n";

foreach ($categoryInfo->parentCategories as $parentCategory) {
    echo 'Codepoint parent general category: ', CategoryInfo::from($parentCategory)->description, "\n";
}

echo 'Block name: ', $blockInfo->name, "\n";

echo 'Plane name: ', $planeInfo->name, "\n";

echo 'Plane short name: ', $planeInfo->shortName, "\n";

will output:

Codepoint: 97
Codepoint name: LATIN SMALL LETTER A
Codepoint general category: a lowercase letter
Codepoint parent general category: a cased letter
Codepoint parent general category: a letter
Block name: Basic Latin
Plane name: Basic Multilingual Plane
Plane short name: BMP

You can also use the Unicode enums to print out characters and symbols.

For example:

use MLUnipoints\Codepoint;

echo Codepoint::SUN_BEHIND_CLOUD->value;

will print ⛅

Do you really want to say thank you?

You can offer me a monthly coffee or a one-time coffee 😉