Programming Linguistics #2 — Second Things… Second?

December 8, 2015 · Posted in Programming Linguistics 

In the first post of this series, I started outlining some of the basic underlying concepts of a programming language as I would design and build it. In part 2, we’ll be looking at the first little glimmers of actual code!

If you’re familiar with languages like C or C++, you’ll understand the concept of header files. In languages like these, header files contain definitions for variables, types, classes, and functions, which are then implemented in source files. Other source files that would like to make use of the things defined somewhere else include just that header file to get the definitions, leaving it up to the linker to actually piece things together.

When you look at how compilers for these languages work, it’s understandable why a mechanism like this is needed. If you were to simply include the full implementation of a class or function in each source file that needed it, it’d take a lot longer to compile the project, and the linker would be left with dozens of copies of the same code, not knowing which one to actually use. But especially if you’re not a seasoned programmer, understanding why header files are needed, how to write them, and what belongs in which file can be very confusing. And dealing with things that depend on each other can quickly become a nightmare even for experienced developers.

Does it really need to be that difficult?

I think not.

Source Files & Translation Units

In a traditional C or C++ compiler, source files are compiled individually, one by one, each including whichever header files they need so that they have the definitions of things that they depend on. These things together form a translation unit. After compiling, the resulting intermediate code is stored in an object file, and when all individual source files have been compiled, a linker pieces those object files together to form a functional program.

Since the definitions of classes and functions can always be derived from their implementations, writing header files is a task that I consider redundant. You always end up writing the same things twice, which is a waste of time and introduces potential for errors. It’s simply not necessary. The way I would handle things is slightly different.

First, no more header files: all we need is source files. Instead of #include directives, whenever things defined in another source file are needed, a simple directive is used, which can be placed anywhere in a source file (though it will usually be located near the top):

using "foo.bar";

This tells the compiler that, to build this source file, we need the code from this other file, so it can automatically switch to compiling the other file first. We no longer need to maintain a list of files that need to be compiled. Just point the compiler to your main file, and it can figure out the rest on its own. This alone takes over a large part of what an average build system needs to do.
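To make the idea a bit more concrete, here is a minimal sketch in Python of how a compiler could derive the build order from using directives alone. Everything in it is an assumption made for illustration: the in-memory sources dictionary, the ".src" file names, and the build_order helper are hypothetical, and a real compiler would read the files from disk.

```python
import re

# Matches directives such as: using "foo.bar";  (but not: using module "foo";)
USING_RE = re.compile(r'using\s+"([^"]+)"\s*;')

def build_order(sources, main_file):
    """Return the files reachable from main_file, dependencies first."""
    order = []
    visited = set()

    def visit(name):
        if name in visited:
            return
        visited.add(name)
        for dependency in USING_RE.findall(sources[name]):
            visit(dependency)
        order.append(name)  # appended only after its dependencies

    visit(main_file)
    return order

# Hypothetical project held in memory for the example.
sources = {
    "main.src": 'using "net.src"; using "util.src";',
    "net.src": 'using "util.src";',
    "util.src": "",
    "unused.src": "",
}

print(build_order(sources, "main.src"))
```

Note that unused.src never shows up in the result: only files reachable from the main file get compiled. The visited set keeps the walk from looping forever on circular dependencies, but it does not by itself decide how such a cycle should be compiled.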

Files are still handled in individual translation units; however, along with the intermediate code, the object file also stores the definitions for everything implemented in it. Because of that, header files become unnecessary, and we no longer have to write definitions and implementations separately.

In some cases, the compiler will need to deal with circular dependencies: situations where two source files rely on each other. Normally, this can be a tricky situation to handle. The obvious solution is to let the compiler first extract all definitions from each source file and make them known to the other file. This should be enough to resolve the majority of these cases. In more complicated cases, the compiler might have to interleave declarations from both files in order to resolve such circular dependencies, but resolving them by hand should not be necessary.
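The "extract all definitions first" approach can be sketched as two passes. This is only an illustration in Python under simplifying assumptions: each hypothetical source file is reduced to the set of names it defines and the set of names it uses, and the helper names are made up for the example.

```python
# Each hypothetical source file lists the names it defines and the names it uses.
files = {
    "a.src": {"defines": {"Parser"}, "uses": {"Lexer"}},
    "b.src": {"defines": {"Lexer"}, "uses": {"Parser"}},
}

def collect_definitions(files):
    """Pass 1: gather every definition before compiling any implementation."""
    table = {}
    for filename, info in files.items():
        for name in info["defines"]:
            table[name] = filename
    return table

def unresolved_names(files, table):
    """Pass 2: report names that no file in the project defines."""
    missing = {}
    for filename, info in files.items():
        unknown = sorted(info["uses"] - set(table))
        if unknown:
            missing[filename] = unknown
    return missing

table = collect_definitions(files)
print(unresolved_names(files, table))
```

Because the first pass never looks at what a file uses, the fact that a.src and b.src depend on each other does not matter: by the time the second pass runs, both names are already known.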

Modules

Within the scope of the language, a translation unit is defined as a source file plus whatever other input is required to compile it into an object file. Object files are compiled, intermediate versions of source files; not yet linked together into modules, but containing relocatable machine code for whatever target architecture it’s being compiled for.

Once all source files of a project have been turned into object files, they can be linked together into a module. Modules are one step down from final products (program executables or dynamically-linkable libraries), however they are the final form for statically linked libraries. A module, which will have its own, standardized file format, may contain the following things:

  • Program code. Program code is in the form of relocatable machine code that can be divided into one or more sections, much in the same way that existing executable files usually consist of several sections (.text, .rodata, and so on). The departure from ‘traditional’ formats lies in the fact that module files may contain program code for multiple platforms, so that a single module file can contain the code for x86 CPUs as well as their 64-bit counterparts.
  • Debug information. This is the data that debuggers need to make sense of the program code.
  • Linker information. Smaller module files may be used to accompany dynamically-linked libraries, containing only the information needed to link programs against the library, and possibly documentation for using that library.
  • Resources. Many programs need additional files, such as configuration, images, and other data. The standard library will include an easy-to-use, cross-platform method of handling these.
  • Documentation. There will be a standardized way of documenting code (everyone’s familiar with Doxygen and the like), and if enabled, that gets embedded into the module file. This is also helpful for IDEs with code completion and similar features.
  • Metadata. This includes the name and version of the module, its authors, its webpage, and copyright information.
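As a rough picture of what such a module could look like once loaded into memory, here is a small Python sketch. The Module class, its field names, and the platform labels are all hypothetical; nothing here describes an actual file format, only the kinds of content listed above.

```python
from dataclasses import dataclass, field

@dataclass
class Module:
    """Hypothetical in-memory view of a module file's contents."""
    name: str
    version: str
    # platform name -> section name -> relocatable machine code
    code: dict = field(default_factory=dict)
    debug_info: bytes = b""
    linker_info: dict = field(default_factory=dict)
    resources: dict = field(default_factory=dict)      # resource path -> bytes
    documentation: dict = field(default_factory=dict)  # symbol name -> doc text
    metadata: dict = field(default_factory=dict)       # authors, webpage, copyright

    def sections_for(self, platform):
        """Return the code sections for one platform, or raise if absent."""
        if platform not in self.code:
            raise KeyError(f"module {self.name!r} has no code for {platform!r}")
        return self.code[platform]

# Placeholder bytes stand in for real machine code.
example = Module(
    name="foo",
    version="1.0",
    code={
        "x86": {".text": b"\x90", ".rodata": b""},
        "x86-64": {".text": b"\x90", ".rodata": b""},
    },
)
print(sorted(example.code))
```

A module that only accompanies a dynamically linked library would, in this picture, leave code empty and fill in just linker_info and possibly documentation.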

Once the compiler has produced a module file out of your source code, it is linked with other modules (unless you’re running on a really bare-bones embedded environment), and a final executable is produced. Or, if you’re building a statically linked library, the module file is the final product.

In fact, those module files are also what the compiler draws from when linking your code against the standard libraries and anything else that you might need. It no longer needs tons of header files which half the time aren’t consistent with the binaries, and we no longer need to dig through code to find out how to use a function. The module file provides it all.

The previously mentioned using statement also powers inclusion of other modules as libraries (whether it be statically or dynamically linked). The syntax for that use case is using module "foo";.

Data Types

Before diving into some first elements of syntax, let’s go over the basics of data types.

The language will support 21 primitive data types, divided into four categories, upon which everything else is built.

  • The most primitive type of variable is the boolean, which takes up a single bit of space, and can only represent one of two values: true (1) or false (0). In some cases, the compiler will store multiple booleans in a single byte, reducing the amount of memory used. The basic boolean data type is named simply bool.
  • Next up are the integers. Integers exist in various widths (8, 16, 32, and 64 bits being universally supported, with a possible extension for 128-bit integers). Wider integers take up more memory, but can represent larger numbers. Integers are divided into signed and unsigned. Signed integers can represent negative as well as positive numbers, while unsigned integers can only represent non-negative numbers (0 and up). Up to 10 types of integers exist. A few examples:
    • uint32 is an unsigned 32-bit integer, that can have a value ranging from 0 to 4,294,967,295.
    • sint16 is a signed 16-bit integer, that can have a value ranging from −32,768 to 32,767.
  • The third category is floating-point numbers, which can represent fractional values, as well as some special values, such as NaN or infinity. They tend to be more computationally expensive, and some values cannot be represented exactly using them, so they’re best reserved for situations where integers really aren’t enough. The system of floating-point numbers is divided into binary and decimal representations. Similar to integers, they are named depending on the amount of space they use:
    • float16 through float128 are binary encoded floating-point numbers.
    • decimal32 through decimal128 are decimal encodings.
  • The fourth category is utility types. These include void (unspecified or non-existent), byte (a byte of information, essentially an 8-bit integer without interpreting it as either signed or unsigned), char (a textual character, of which strings are built; most likely implemented as a 16-bit integer), and pointers and references. Pointers and references are a special case because, as in C, the type they are pointing to must be specified, and additionally, they are the only types that are platform-dependent (pointers are 8 bits wide on 8-bit systems, 32 bits wide on 32-bit systems, and so on).
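The integer ranges quoted above follow directly from the width. Here is a short Python sketch that computes them, assuming signed integers use two's complement (which is what the sint16 range above implies); the integer_range helper is just a name made up for this example.

```python
def integer_range(bits, signed):
    """Return (minimum, maximum) for a two's-complement integer of the given width."""
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

# Print the range of every universally supported integer type.
for bits in (8, 16, 32, 64):
    for prefix, signed in (("uint", False), ("sint", True)):
        low, high = integer_range(bits, signed)
        print(f"{prefix}{bits}: {low:,} to {high:,}")
```

Each extra bit doubles the number of representable values, and a signed type spends one bit on the sign, which is why sint16 tops out at roughly half of what uint16 can hold.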

Other types of data (including strings) are generally a combination of one or more primitives, and these are known as composite types.

Even though these are considered primitive data types, from the programmer’s perspective they can be treated like classes. Properties like minimum and maximum values of integers and floats can be accessed as class constants (rather than C’s macros), and they all at least have methods to support converting the number to (or from) a string representation as well as casting to different types of integers or floats.
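To give a feel for what "primitives that behave like classes" might mean, here is a Python stand-in for uint32. The class, its constant names (MIN, MAX), and its method names are all hypothetical, as is the narrowing rule used for the cast; none of this is settled syntax for the language.

```python
class UInt32:
    """Hypothetical stand-in for the language's uint32 primitive."""
    MIN = 0
    MAX = 2 ** 32 - 1

    def __init__(self, value):
        if not UInt32.MIN <= value <= UInt32.MAX:
            raise ValueError(f"{value} does not fit in uint32")
        self.value = value

    @classmethod
    def from_string(cls, text):
        """Parse a decimal string into a uint32."""
        return cls(int(text))

    def to_string(self):
        return str(self.value)

    def to_uint8(self):
        # Assumed narrowing behaviour: keep only the low 8 bits,
        # as C does when converting to an unsigned type.
        return self.value & 0xFF

number = UInt32.from_string("300")
print(UInt32.MAX, number.to_string(), number.to_uint8())
```

The limits live on the type itself, so there is no need for a separate set of macros in the style of C's UINT_MAX.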

Some Syntax Basics

Variables must be declared before they are used, and are always of a specified type. Names of variables always start with a dollar sign ($), so that the compiler can easily distinguish variable names from reserved keywords. Names can include uppercase and lowercase letters, numbers, and underscores, but no other characters. Names are case sensitive (however, if two names are used that would be identical if they were not case sensitive, such as foo and Foo, the compiler should issue a warning, because this is usually considered bad practice).
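These naming rules are simple enough to check mechanically. The following Python sketch shows one way a compiler front end could do it; the regular expression is my reading of the rules above, and the helper names is_valid_name and case_collisions are made up for the example.

```python
import re

# A dollar sign followed by one or more letters, digits, or underscores.
NAME_RE = re.compile(r"\$[A-Za-z0-9_]+")

def is_valid_name(name):
    """Check a variable name against the rules described above."""
    return NAME_RE.fullmatch(name) is not None

def case_collisions(names):
    """Group distinct names that differ only by letter case."""
    groups = {}
    for name in names:
        groups.setdefault(name.lower(), set()).add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]

print(is_valid_name("$foo_1"), is_valid_name("foo"), is_valid_name("$foo-bar"))
print(case_collisions(["$foo", "$Foo", "$bar"]))
```

Any group returned by case_collisions is a candidate for the warning mentioned above, while the program itself stays valid.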

Variables can optionally include an initial value (which is always encouraged). Decimal, binary, and hexadecimal notations are allowed. Defining a variable goes like this:

uint32 $foo = 0;

byte $bar = 0x4A;

char $baz = 'F';

Like in C or C++, pointers are denoted with an asterisk, and references using an ampersand. The asterisk or ampersand is written next to the type of the variable, not its name, because it defines the type.

void* $foo;

sint8& $bar;

As in most other languages, the equals sign represents an assignment operation. Many other standard notations for arithmetic and logical operations are also allowed:

  • Assignment: regular assignments (=) and compound assignments (+=, -=, *=, /=, %=, |=, &=, ^=, <<=, >>=, <<<=, >>>=).
  • Arithmetic operators: addition (+), subtraction or negation (-), multiplication (*), division (/), modulo (%), pre- and post-increment (++$variable and $variable++), and pre- and post-decrement (--$variable and $variable--).
  • Comparison: equality (==), inequality (!=), greater than (>), greater than or equal to (>=), less than (<), less than or equal to (<=).
  • Logical operators: not (!), and (&&), or (||).
  • Bitwise operations: invert (~), and (&), or (|), exclusive or (^), left shift (<<), right shift (>>), left rotate (<<<), right rotate (>>>).
  • Members and pointers: array subscript ([…]), address of (&), indirection (*), member access (.).
  • Others: scope resolution (::), function call (()), comma (,), ternary conditional (… ? … : …), object creation (new, new[]), object destruction (delete, delete[]).

Note: this is not necessarily a complete or authoritative listing.
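The rotate operators (<<< and >>>) are the least familiar entries in the list, since C and C++ have no operator for them. Here is a small Python sketch of what they would compute on an unsigned value. The function names and the default width of 32 bits are assumptions for the example.

```python
def rotate_left(value, count, bits=32):
    """Rotate an unsigned integer left; bits shifted out re-enter on the right."""
    mask = (1 << bits) - 1
    value &= mask
    count %= bits
    return ((value << count) | (value >> (bits - count))) & mask

def rotate_right(value, count, bits=32):
    """Rotate right, expressed as a left rotation by the complementary amount."""
    return rotate_left(value, bits - (count % bits), bits)

print(hex(rotate_left(0x12345678, 8)))
print(hex(rotate_right(0x12345678, 8)))
```

Unlike a plain shift, a rotation never loses information: rotating left and then right by the same amount gives back the original value.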

Some examples:

uint32 $foo = 15 + 3;

uint32* $bar = &$foo;

byte $baz = 15 | 32;

bool $qux = *$bar == $foo;

In addition, parentheses can be used to prioritize and group parts of expressions:

uint32 $quux = ( 85 + 3 ) % 5;
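For reference, here is what the arithmetic in these examples works out to, checked in Python. This assumes the language follows the usual C-style precedence, where modulo binds tighter than addition; the variable names simply mirror the examples above without the dollar sign.

```python
foo = 15 + 3            # addition
baz = 15 | 32           # bitwise or: 0b001111 | 0b100000
grouped = (85 + 3) % 5  # parentheses first: 88 % 5
ungrouped = 85 + 3 % 5  # modulo binds tighter than addition: 85 + 3

print(foo, baz, grouped, ungrouped)
```

The last two lines show why the parentheses matter: with them the result is 3, without them it is 88.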

Similar to C++, most operators can be overloaded by classes to power advanced functions and syntactic sugar.

Coming Up Next…

Now that we have some basic variable types and expressions, we can start looking at namespaces and functions — which is exactly what I’m planning to do in the next post!
