Post

How Programming Languages Really Work

How Programming Languages Really Work

Have you ever wondered how programming languages really work? There are several phases that your code goes through. Some compilers have more phases, but I will only talk about the basic ones. These stages are included in my own compiler. We will talk about a compiler made fully from scratch without using LLVM, to go more into detail.

Tokenizer

The tokenizer phase is really simple. Basically, the tokenizer gets the file input and converts it into tokens (structs).

Example:

Input: "int variable_name;"

Output:

1
2
3
4
5
6
7
8
9
10
11
12
13
[
  {
    "token_type": "type",
    "value": "int"
  },
  {
    "token_type": "identifier",
    "value": "variable_name"
  },
  {
    "token_type": "semicolon"
  }
]

Example token type in C:

1
2
3
4
struct Token {
    enum TokenType token_type;
    union TokenValue value;
};

Parser

The parser converts tokens into instructions or nodes. Usually a semicolon ends the current node, so when you see a semicolon, you finish the current node and start the next one. For example, the tokens from above would turn into a node like this:

1
2
3
4
5
6
{
  "node_type": "variable_declaration",
  "variable_type": "int",
  "variable_name": "variable_name",
  "initial_value": "none"
}

Scope Analysis

This phase creates scopes. Every function or lambda is basically a scope. Each scope keeps track of how much stack memory to allocate, what variables exist, and the parsed nodes inside it. Scope analysis goes through all nodes, detects where new scopes start, tracks variables, and collects other important data.

Example:

1
2
3
4
5
6
// New scope
// Stores variables, how many bytes to allocate (4)
int function() {
    int variable;
    return 0;
}

Semantic Analysis

This stage works like an error checker. It goes through all the nodes and checks type matching and whether identifiers are valid.

Example:

1
2
3
4
unsigned int var = 10;
int var2 = var;
// Semantic analysis checks if variable "var" exists in current or higher scope and checks its type.
// If the variable is invalid or types don't match, it throws an error.

Code Generation

This is the final phase. We loop through every scope (function) and convert the nodes into assembly instructions (or translate to another language if you are compiling to something else).

Let’s say this is the input code:

1
2
3
void function() {
    int var = 10;
}

Based on scope analysis, we know the scope needs 4 bytes of stack space for one variable.

Here’s a simplified assembly output (modern assembly usually aligns the stack to 16 bytes, but we skip that for clarity):

1
2
3
4
5
6
function:
    sub sp, sp, #4     ; allocate 4 bytes
    mov w1, #10        ; move number 10 to a 32-bit register
    str w1, [sp, #0]   ; store at offset 0 (we calculate variable offsets before code gen)
    add sp, sp, #4     ; deallocate stack
    ret                ; return
This post is licensed under CC BY 4.0 by the author.