Programming Linguistics #5 — Type Away

January 26, 2016 · Posted in Programming Linguistics 

In Programming Linguistics #4, we talked about classes and the type system. This post will expand on those topics a little further.

In the previous post, I spent some time talking about run-time type information. A few points did not get covered: the type-of operator, the special Any type, enumerations, and strings. Let’s dive into those subjects, shall we?

The ‘type-of’ operator

The language employs a built-in, non-overloadable operator that lets you access information about any data type. The operator itself is the percent sign (%), which is immediately followed by either the name of a type, or the name of a variable. Doing so yields an object of the built-in class TypeInfo (which is actually an entry in the table of type information described in the previous post), which can then be accessed to retrieve information about a data type.

MyClass { … }
int32 $variable;

%MyClass.size; // returns the number of bytes that a MyClass object occupies in memory
%variable.id; // returns the 32-bit type identifier

This operator replaces the sizeof() construct that exists in C (and is actually far more powerful because it allows access to much more information).
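To make the lookup a little more concrete, here is a rough Python model of what the operator does. It is only a sketch: the table contents, the identifier values, the 24-byte size of MyClass, and the helper name type_of are all made up for illustration, and a real compiler would resolve most of this at compile time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeInfo:
    id: int    # 32-bit type identifier
    name: str  # name of the type
    size: int  # number of bytes one object occupies


# Hypothetical table of type information (ids and sizes are invented).
TYPE_TABLE = {
    "int32": TypeInfo(id=1, name="int32", size=4),
    "MyClass": TypeInfo(id=2, name="MyClass", size=24),
}

# Declared types of variables, as the compiler would know them.
VARIABLE_TYPES = {"variable": "int32"}


def type_of(name: str) -> TypeInfo:
    """Model of the % operator: accepts a type name or a variable name."""
    type_name = VARIABLE_TYPES.get(name, name)
    return TYPE_TABLE[type_name]


my_class_size = type_of("MyClass").size  # counterpart of %MyClass.size
variable_id = type_of("variable").id     # counterpart of %variable.id
```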

The Any type

In some cases, it may be desirable to let a function accept an argument of an arbitrary type. Or, when a function accepts a variable number of arguments, it is not known in advance what the types of those arguments are going to be. To deal with both of these scenarios, the special built-in Any type exists.

Any is internally constructed from a 32-bit type identifier value, along with a pointer to an actual object in memory. Depending on the platform's pointer size (32 or 64 bits), this means that instances of the Any type will normally use either 8 or 12 bytes of memory. The special feature of Any is that it can be used as if it were an object of the referenced type.

Consider the following:

MyClass { … }
MyClass $foo;

void MyFunction( Any $bar )
{
    …
}

At this point, if MyFunction() were to be called with $foo as its argument, $bar could be used inside the function in exactly the same manner as $foo outside of it: properties can be accessed, methods can be called, and using the type-of operator gives access to the TypeInfo object for MyClass.

The Any type can be used for function arguments, or instances of it can be created directly, in which case it must be initialized with the variable that it is pointing to. If an attempt is made to access it before it has been initialized, an exception will be triggered.
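The behaviour described above can be modelled in a few lines of Python. This is a sketch under assumptions, not the actual design: the class name AnyRef, the TYPE_IDS table, and the exception name are hypothetical, and the lookup on Python's own run-time type merely stands in for knowledge the compiler would have statically.

```python
class UninitializedAnyError(Exception):
    """Raised when an Any is accessed before it refers to anything."""


class MyClass:
    def __init__(self):
        self.counter = 0

    def bump(self):
        self.counter += 1
        return self.counter


TYPE_IDS = {MyClass: 2}  # hypothetical type identifiers
_UNSET = object()        # marks an Any that has not been initialized


class AnyRef:
    """A (type identifier, reference) pair that forwards to its target."""

    def __init__(self, target=_UNSET):
        if isinstance(target, AnyRef):
            # Chained Any: copy the identifier and reference from the first one.
            self._type_id = target._type_id
            self._target = target._target
        elif target is _UNSET:
            self._type_id = None
            self._target = _UNSET
        else:
            self._type_id = TYPE_IDS[type(target)]
            self._target = target

    def _require_initialized(self):
        if self._target is _UNSET:
            raise UninitializedAnyError("Any accessed before initialization")

    @property
    def type_id(self):
        self._require_initialized()
        return self._type_id

    def __getattr__(self, name):
        # Only reached for names AnyRef itself lacks: forward to the target.
        self._require_initialized()
        return getattr(self._target, name)


def my_function(bar: AnyRef):
    return bar.bump()  # used exactly as the referenced object would be
```

Calling my_function(AnyRef(foo)) changes foo itself, because the wrapper holds a reference and not a copy.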

Any can potentially also be used in the same manner as C++11’s auto type (with the exception that the type it is going to become doesn’t need to be known at compile time).

The combination of the Any type and the information available through RTTI makes it very easy to write a printf-style function. Such a function no longer needs to be pre-programmed with how to handle every type of data, and it can get away with very few formatting options, because those few options can handle nearly any data type thrown at them.
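As an illustration of that idea, the following Python sketch formats arguments that arrive as (type identifier, value) pairs, which is roughly what a list of Any arguments would carry. The identifiers, the TO_STRING table, the function name format_any, and the single "{}" placeholder are all assumptions made for this example.

```python
# Hypothetical type identifiers.
INT32_ID, FLOAT_ID, STRING_ID = 1, 2, 3

# One converter per type identifier, standing in for information that the
# type table would provide.
TO_STRING = {
    INT32_ID: lambda value: str(value),
    FLOAT_ID: lambda value: f"{value:g}",
    STRING_ID: lambda value: value,
}


def format_any(template, *args):
    """Replace each "{}" with the next argument, converted by its type id."""
    pieces = template.split("{}")
    if len(pieces) - 1 != len(args):
        raise ValueError("placeholder count does not match argument count")
    out = [pieces[0]]
    for (type_id, value), tail in zip(args, pieces[1:]):
        out.append(TO_STRING[type_id](value))
        out.append(tail)
    return "".join(out)


message = format_any("{} scored {}", (STRING_ID, "Bing"), (INT32_ID, 7))
```

The function never inspects the values themselves; it only follows the identifier, so supporting a new type means adding one table entry.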

If you are wondering how all of this can be done within a strongly typed language, remember that when variables are initially defined they are of a fixed type. Whenever they are passed as function arguments, or assigned as an Any's referenced variable, the type is known at that point, and when Any references are chained, the type identifier is copied from the first one. All of this can happen with minimal overhead, and does not require every in-memory object to store a type identifier along with it (objects don’t need a property saying what type of object they are), making it a feasible technology even in embedded systems.

Enumerations

Enumerations are commonly used in many languages. However, I feel that they often lack some essential features that would make them significantly easier to use. Let’s look at an example of what I feel they should be.

enumeration SearchEngines
  is strict
  is convertible
  is using(uint8)
{
    Google,
    Bing,
    Yahoo
}

Items in an enumeration are always enumerated starting at 0, unless a different value is specified. Each following item is enumerated at the highest existing value plus one.
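The numbering rule is easy to express in code. The Python sketch below is only a model of the rule as stated; the function name assign_values and the use of None for "no explicit value" are my own conventions for the example.

```python
def assign_values(items):
    """Number enumeration items.

    items is a list of (label, explicit_value) pairs, where explicit_value
    is None when the source does not specify a value for that item.
    """
    values = {}
    for label, explicit in items:
        if explicit is not None:
            values[label] = explicit
        elif values:
            # Highest value assigned so far, plus one.
            values[label] = max(values.values()) + 1
        else:
            values[label] = 0  # the first item starts at 0
    return values


search_engines = assign_values([("Google", None), ("Bing", None), ("Yahoo", None)])
mixed = assign_values([("A", 5), ("B", 2), ("C", None)])
```

Note the difference from C in the second call: C would continue from the previous item and give C the value 3, whereas "highest existing value plus one" gives 6.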

Enumerations are implicitly compatible with the type of integer they use to store the value. You can always assign the enumeration by numeric value, or convert it to an integer, without needing to cast it.

Enumerations can have attributes similar to the previously discussed attributes that classes can have. However, the specific attributes that can be used are different:

  • strict enables automatic checks that ensure that whenever a value is written to an instance of the enumeration, it is a valid value. When only the values 0 through 5 represent valid items of the enumeration, the strict attribute prevents code from ever writing a 6 to it (an exception would be thrown if it was attempted).
  • convertible automatically generates code to allow converting objects to and from strings. In the above example, converting an object with value 2 to a string would yield “Yahoo”. Converting the string “Google” to the enumeration would yield 0. This option is disabled by default, because it can generate a sizable amount of extra code.
  • The using attribute can be used to explicitly define what underlying type is used to store the assigned value. It must always be one of the basic integer types. If not specified, the compiler will automatically choose the smallest integer that can accommodate all possible values for the enumeration (preferring unsigned over signed types when possible).
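Here is a small Python model of the three attributes. It is a sketch, not generated compiler output: the class name EnumType, its method names, and the choice of ValueError and TypeError are assumptions for illustration.

```python
# Candidate storage types, smallest first, unsigned before signed.
INTEGER_TYPES = [
    ("uint8", 0, 2**8 - 1), ("int8", -2**7, 2**7 - 1),
    ("uint16", 0, 2**16 - 1), ("int16", -2**15, 2**15 - 1),
    ("uint32", 0, 2**32 - 1), ("int32", -2**31, 2**31 - 1),
    ("uint64", 0, 2**64 - 1), ("int64", -2**63, 2**63 - 1),
]


def smallest_integer_type(values):
    """Default for the using attribute: the smallest type that fits."""
    values = list(values)
    low, high = min(values), max(values)
    for name, type_min, type_max in INTEGER_TYPES:
        if type_min <= low and high <= type_max:
            return name
    raise ValueError("no basic integer type can hold these values")


class EnumType:
    def __init__(self, values, strict=False, convertible=False, using=None):
        self.values = dict(values)  # label -> number
        self.strict = strict
        self.convertible = convertible
        self.storage = using or smallest_integer_type(self.values.values())

    def store(self, number):
        """Model of writing a value to an instance of the enumeration."""
        if self.strict and number not in self.values.values():
            raise ValueError(f"{number} is not a valid item")
        return number

    def to_string(self, number):
        if not self.convertible:
            raise TypeError("enumeration is not convertible")
        for label, value in self.values.items():
            if value == number:
                return label
        raise ValueError(f"{number} is not a valid item")

    def from_string(self, label):
        if not self.convertible:
            raise TypeError("enumeration is not convertible")
        return self.values[label]


engines = EnumType({"Google": 0, "Bing": 1, "Yahoo": 2},
                   strict=True, convertible=True)
```

With these settings, engines.to_string(2) gives "Yahoo", engines.from_string("Google") gives 0, and engines.store(6) raises an error.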

The labels of items within the enumeration are scoped inside the enumeration (they do not become globally accessible). If you wanted to access the value of the item Bing directly, you would have to write SearchEngines::Bing. This keeps the namespace as clean as possible. However, when an enumeration item is expected in the code (such as when assigning a value or comparing it), the item name may be written directly (e.g. $variable = Bing; or if( $variable == Yahoo )).

Strings

Except in some embedded environments, strings are a very important type of data. They represent a series of characters. My intention is to have a very complete string library available, which can handle any language out-of-the-box.

Key to this is the use of Unicode. Unicode standardizes the way characters are encoded, and brings an end to the many, many different encoding schemes that were in use before its introduction: it currently supports more than 120,000 unique characters, and can encode many hundreds of thousands more.

To do this, a single byte per character isn’t always enough. Unicode defines three main encoding schemes: UTF-8, UTF-16, and UTF-32. My intention is to let the default String implementation store strings using UTF-16 encoding (consuming 2 bytes per character in most cases), but to have individual characters accessed as if they were UTF-32 (the char type would be implemented as 32-bit).

The main reason for this split is that a single 16-bit UTF-16 code unit cannot represent every Unicode character: characters outside the Basic Multilingual Plane (or BMP for short) need two code units, known as a surrogate pair. UTF-32 stores every character in a single code unit, but consumes twice as much memory for typical text. Since the vast majority of characters used in modern languages reside on the BMP, most strings can be stored as UTF-16 without ever needing more than two bytes per character.
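For reference, this is how a 32-bit character value maps onto 16-bit code units under the standard UTF-16 rules. The Python function name utf16_code_units is just a label for this sketch.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units (one or two) for a Unicode code point."""
    if not (0 <= code_point <= 0x10FFFF) or (0xD800 <= code_point <= 0xDFFF):
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x10000:
        # On the BMP: the code point is its own single code unit.
        return [code_point]
    # Beyond the BMP: split the 20-bit offset over a surrogate pair.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


latin_a = utf16_code_units(ord("A"))  # [0x0041]
emoji = utf16_code_units(0x1F600)     # [0xD83D, 0xDE00]
```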

The String class internally uses a 32-bit unsigned integer to store the length of the string it holds (counted in 16-bit code units). This removes the restriction that exists in traditional C strings of not allowing null characters within a string, and greatly speeds up the many operations that need to know the actual length of the string, because it is no longer necessary to count the characters each time.

Despite using a 32-bit integer, strings will be limited to being 2147483647 code units long. That is because the highest bit is not used to encode the length of the string. Instead, it is a flag indicating whether or not the string contains any characters that use two 16-bit code units. The main reason for this is that, as long as that is not the case, all string functions can work under the assumption that each 16-bit code unit encodes one character, which enables them to work a lot faster: individual characters can be accessed by simply calculating their memory address. However, if the string contains any characters that need more than one 16-bit code unit, the entire string needs to be traversed each time a character at a particular location is accessed, making the process a lot slower. (Again, thanks to the fact that most text can be represented without ever needing more than one code unit, that will rarely be needed.)
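The length-plus-flag layout and the two access paths can be sketched in Python as follows. The class name Utf16String, the field name header, and the method char_at are hypothetical, and the sketch assumes well-formed input text.

```python
FLAG_HAS_PAIRS = 0x8000_0000  # highest bit: some character uses two code units
LENGTH_MASK = 0x7FFF_FFFF     # lower 31 bits: length in 16-bit code units


class Utf16String:
    def __init__(self, text):
        data = text.encode("utf-16-le")
        self.units = [int.from_bytes(data[i:i + 2], "little")
                      for i in range(0, len(data), 2)]
        if len(self.units) > LENGTH_MASK:
            raise OverflowError("string exceeds 2147483647 code units")
        has_pairs = any(0xD800 <= unit <= 0xDBFF for unit in self.units)
        self.header = len(self.units) | (FLAG_HAS_PAIRS if has_pairs else 0)

    @property
    def length(self):
        return self.header & LENGTH_MASK

    def char_at(self, index):
        """Return the character (as a 32-bit value) at a character index."""
        if not self.header & FLAG_HAS_PAIRS:
            return self.units[index]  # fast path: one unit per character
        position = 0                  # slow path: walk from the start
        i = 0
        while i < self.length:
            unit = self.units[i]
            if 0xD800 <= unit <= 0xDBFF:
                low = self.units[i + 1]
                value = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                step = 2
            else:
                value, step = unit, 1
            if position == index:
                return value
            position += 1
            i += step
        raise IndexError("character index out of range")


plain = Utf16String("héllo")
mixed = Utf16String("a\U0001F600b")
```

For plain, the flag stays clear and char_at is a direct lookup. For mixed, the flag is set, the length is 4 code units for 3 characters, and char_at(2) has to walk past the surrogate pair to find "b".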

My goal is for the compiler to natively support source files encoded in any of the Unicode formats (UTF-8, UTF-16, and UTF-32), and for the String class to have full Unicode support. This should go a very long way towards eliminating character encoding issues and supporting all of the world’s languages.

For embedded systems or other limited environments that require strings, but don’t need full Unicode support, a smaller, ASCII-only version of the String class will be available by setting the right compiler settings.

Converting To & From Strings

We already discussed how classes can implement operator overloading to specify what to do when any of the normal operators are used on them. One aspect that has not yet been covered is that there will also be a way to specify how to cast an object to a different type of object.

By simply using that mechanism to define how a class can be converted to and from a String, classes can instantly be made compatible with any existing string handling functions.

Internally, this will be supported on all numeric types and booleans out-of-the-box. Integers and floats can be cast to strings, and numeric strings can be interpreted as integers or floating-point numbers natively. Some other standard library classes might provide similar functionality, if it makes sense for that class to be able to be converted to a string.
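Python has a comparable mechanism, which makes for a convenient illustration. In the sketch below, __str__ stands in for the cast-to-String overload and the from_string class method for the cast in the other direction; the Version class itself is a made-up example.

```python
class Version:
    def __init__(self, major, minor):
        self.major = major
        self.minor = minor

    def __str__(self):
        # Counterpart of defining the conversion to a String.
        return f"{self.major}.{self.minor}"

    @classmethod
    def from_string(cls, text):
        # Counterpart of defining the conversion from a String.
        major, minor = text.split(".")
        return cls(int(major), int(minor))


# Existing string handling code works without knowing about Version.
listing = ", ".join(str(v) for v in [Version(1, 0), Version(1, 1)])
parsed = Version.from_string("2.10")
```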

Coming Up Next…

There are two important aspects of the type system that haven’t been discussed yet: unions, which allow the data in a given memory space to be interpreted in different ways, and the very important subject of pointers and references. These things are planned for the next installment of Programming Linguistics.
