WTF UTF8?
There are many good tutorials on how Elixir deals with UTF8 and why it is the only language so far that properly converts “José” string to capitals out of the box. If you are unfamiliar with the subject, I would recommend the brilliant writing Unicode and UTF-8 Explained by Nathan Long.
In a nutshell, Elixir’s String.upcase/1
walks through the string, performs a normalization (for accents and other combined
symbols), and finally converts graphemes to their capital representation. This
process is fully automated and—which is more important—is always up-to-date,
since it reads the conversion rules directly from Consortium’s
definition files. Two used in
case conversion are UnicodeData.txt
and SpecialCasing.txt
, if you one’s curious.
In this post we’ll build the relatively same stuff to address fancy UTF8 symbols by their names. The whole codebase contains 119 LOCs. In the end we’ll yield a bundle of modules, providing functions returning the UTF8 symbol given it’s name as a function name. The practical use of this package is beyond the scope of this post, but I am positive there could be many.
Bruteforce approach
So far, so good. Our goal is to end up with something like:
iex|1 ▶ StringNaming.AnimalSymbols.monkey
"🐒"
The gracefully stolen from native Elixir unicode/properties.ex
approach would be to:
— parse text file provided by Consortium;
— prepare the data structure to walk through;
— build the functions in compile time using meta programming features.
This is all mighty and will prevail, except it ain’t so.
There is no problem reading the file, as well as the data structure needed is to be built without a glitch. The problem is that we need nested modules to be produced on the fly. Look:
iex|2 ▶ StringNaming.AnimalSymbols.<TAB>
Baby Bactrian Dromedary Fox
Front Hatching Lady Lion
Paw Spiral Tropical Unicorn
Water ant/0 bat/0 bird/0
blowfish/0 boar/0 bug/0 butterfly/0
cat/0 chicken/0 chipmunk/0 cow/0
crab/0 crocodile/0 deer/0 dog/0
dolphin/0 dragon/0 duck/0 eagle/0
elephant/0 fish/0 goat/0 gorilla/0
honeybee/0 horse/0 koala/0 leopard/0
lizard/0 monkey/0 mouse/0 octopus/0
owl/0 ox/0 penguin/0 pig/0
poodle/0 rabbit/0 ram/0 rat/0
rhinoceros/0 rooster/0 scorpion/0 shark/0
sheep/0 shrimp/0 snail/0 snake/0
spider/0 squid/0 tiger/0 turkey/0
turtle/0 whale/0
iex|2 ▶ StringNaming.AnimalSymbols.Baby.chick
"🐤"
From the above we see that there are nested modules along with plain functions inside nearly each module. And, unluckily, Elixir does not allow to re-open modules as Ruby does. So, we are to use recursion in our metaprogramming adventure. Exciting?
Reading the file
There is nothing special with reading the file: it’s plain text in very
simple format. Category names are prepended with "@\t"
making it easy
to pattern match, codepoints are followed by their names: take and parse.
On that stage we simply collect everything in the list of tuples
{code, name, category}
. E.g. for the monkey face above, this tuple would
be {"1F412", "MONKEY", "AnimalSymbols"}
.
The boring code, in case anybody is curious, may be found here.
Converting the list to nested map
Second stage is to convert everything to the nested map, so that later on we could recursively iterate it. This is plain old good Elixir as well: no tricks, no exciting shining ideas.
Codepoint is the leaf in each nested map.
Building modules
That is the part the whole post was written for. Impatient readers might just read these 36 LOCs, for others we’ll walk through step by step.
The first pitfall is that we need to call to produce the nested modules from inside
the definition of the currently operated one. That said, we need to declare
the dedicated module to deal with that, otherwise scopes won’t allow us to do that.
BTW, we’ll :code.delete
and :code.purge
this helper module afterward.
The first step is to write a flat level iteration. That is relatively easy:
def nesteds(nested, %{} = map) do
Enum.each(map, fn
{_key, code} when is_binary(code) -> :ok # leaf, skip it
{k, v} ->
mod = :lists.reverse([k | :lists.reverse(nested)])
StringNaming.H.nested_module(mod, v)
end)
end
The above just iterates the map and calls a producer for all the nested
modules. That simple. StringNaming.H.nested_module/2
is where the deal
happens. First of all, we are to split functions (leaves) and modules (branches).
We could not prepare this in advance, since we had no clue at parsing stage
whether this would be a leaf or not.
[funs, mods] = Enum.reduce(children, [%{}, %{}], fn
{k, v}, [funs, mods] when is_binary(v) ->
[Map.put(funs, k, v), mods]
{k, v}, [funs, mods] ->
[funs, Map.put(mods, k, v)]
end)
We consider binaries to be a codepoint value, and, hence, a leaf. Yes, I am aware
of Enum.split_with/2
, but
here it’s simpler (and faster) to produce maps explicitly. Now we have two maps.
It’s time to rock!
The first—hacky and basically wrong—approach was to use
Code.eval_quoted/3
,
since I could not figure out how to dynamically create a module inside other module:
defmodule Module.concat(mod) do
Enum.each(funs, fn {name, value} ->
# name might be numeric, e.g. 1 ⇒ make it a proper atom here
name = name
|> String.replace(~r/\A(\d)/, "N_\\1")
|> Macro.underscore
|> String.to_atom
# abstract syntax tree of
# ★ def monkey, do: "🐒"
# value is a codepoint
ast = quote do
def unquote(name)() do
<<String.to_integer(unquote(value), 16)::utf8>>
end
end
Code.eval_quoted(ast, [name: name, value: value], __ENV__)
end)
# TODO: def __all__
StringNaming.H.nesteds(mod, mods) # call back for the nesteds
end
I have posted a question on SO and
Dogbert helped me to make
the code clean with Module.create/3
:
ast = for {name, value} <- funs do
name = name |> String.replace(~r/\A(\d)/, "N_\\1") |> Macro.underscore |> String.to_atom
quote do: def unquote(name)(), do: <<String.to_integer(unquote(value), 16)::utf8>>
end
# TODO: def __all__
Module.create(Module.concat(mod), ast, Macro.Env.location(__ENV__))
StringNaming.H.nesteds(mod, mods)
That is basically it. The only thing left is to implement __MODULE__.__all__/0
function
to return a keyword list of all the functions available, with their values:
def __all__ do
:functions
|> __MODULE__.__info__()
|> Enum.map(fn
{:__all__, 0} -> nil
{k, 0} -> {k, apply(__MODULE__, k, [])}
_ -> nil
end)
|> Enum.filter(& &1)
end
Now we just call
StringNaming.H.nesteds(["String", "Naming"], names_tree)
on the top level, and the tree of modules is built under StringNaming
namespace. Enjoy:
StringNaming.ChessSymbols.Black.Chess.king
"♚"
Get the pill
There is not much more code in the package, besides the above, but for those picky persons, we have string_naming @ github, also the package is string_naming @ hex.pm.