Skip to content

Commit

Permalink
Typeset N* as N$^\star$
Browse files Browse the repository at this point in the history
  • Loading branch information
Mesabloo committed Mar 8, 2021
1 parent 45ae129 commit 8913eea
Showing 1 changed file with 31 additions and 31 deletions.
62 changes: 31 additions & 31 deletions part2-nstar.tex
Original file line number Diff line number Diff line change
Expand Up @@ -14,12 +14,12 @@ \chapter{Introduction}\label{chap:nstar-abstract}

\vspace{\baselineskip}

N*'s goal is to assist users with a simple but powerful type system, yet still allowing for low-level data manipulation.
But before even being a usable programming language, N* aims at being a compiler backend (much like for example LLVM), and is used that way in the Zilch project. Differences with other compiler backends are mostly the type-system, allowing the compilation of Zilch source code into type-safe instructions.
N$^\star$'s goal is to assist users with a simple but powerful type system, yet still allowing for low-level data manipulation.
But before even being a usable programming language, N$^\star$ aims at being a compiler backend (much like for example LLVM), and is used that way in the Zilch project. Differences with other compiler backends are mostly the type-system, allowing the compilation of Zilch source code into type-safe instructions.

\vspace{\baselineskip}

Because N* supports compiling to multiple architectures, using different grammars, describing N* will at first be platform-agnostic, treating common aspects between all CPU architectures, and then will be divided into multiple categories, explaining in more details some features on a per-architecture basis\footnote{Note that the target executable format (ELF, PE, \ldots) is also considered as an architecture-specific thing, but should not influence N* much.}.
Because N$^\star$ supports compiling to multiple architectures, using different grammars, describing N$^\star$ will at first be platform-agnostic, treating common aspects between all CPU architectures, and then will be divided into multiple categories, explaining in more details some features on a per-architecture basis\footnote{Note that the target executable format (ELF, PE, \ldots) is also considered as an architecture-specific thing, but should not influence N$^\star$ much.}.

\section*{Small notes on the notation used in the next chapters}\label{sec:nstar-abstract-notation}

Expand Down Expand Up @@ -60,17 +60,17 @@ \chapter{Non platform-specific features}\label{chap:nstar-common}

\section{Types}\label{sec:nstar-common-ts}

One of the differences between classical assembly languages and N* is its type system.
Compared to other higher level programming languages like Java, C++, etc, N* has a very simple yet powerful and expressive enough type system.
One of the differences between classical assembly languages and N$^\star$ is its type system.
Compared to other higher level programming languages like Java, C++, etc, N$^\star$ has a very simple yet powerful and expressive enough type system.

In programming, types are used mostly to prove at compile-time that a given program should behave well if it type-checks. While this works for more elaborated programming languages like Haskell, Idris, etc, most type systems aren't expressive enough to absolutely guarantee that everything will work at run-time (in fact, there is no possible way of doing this, because for example a memory allocation may fail, and this cannot be checked at compile-time). However, we can try to guarantee as much as possible.

N* doesn't try to solve this issue, because it would be really hard to target a dependently typed assembly language from a non-dependently typed programming language. But where all used assembly languages do not even consider types (only numbers, in fact), N* embeds a powerful type system used to remove the possibility of bugs (like incorrect structure addresses passed as a parameter function, or incoherent types in some instructions).
N$^\star$ doesn't try to solve this issue, because it would be really hard to target a dependently typed assembly language from a non-dependently typed programming language. But where all used assembly languages do not even consider types (only numbers, in fact), N$^\star$ embeds a powerful type system used to remove the possibility of bugs (like incorrect structure addresses passed as a parameter function, or incoherent types in some instructions).

\subsection{Kinds}\label{subsec:nstar-common-ts-kinds}

Kinds (also known as types of types) mainly serve the purpose of indicating type sizes.
There are three type of kinds in N*:
There are three type of kinds in N$^\star$:
\begin{itemize}
\item Stack kind, denotating stack-like types, which can be safely offsetted
\item Kinds whose size is abstracted away, useful to ask for any sized type
Expand All @@ -91,7 +91,7 @@ \subsection{Kinds}\label{subsec:nstar-common-ts-kinds}
\subsection{Integer types}\label{subsec:nstar-common-ts-integer}

Numbers are the building block of any assembly language. Most of data manipulated is manipulated as numbers, e.g.\ addresses, characters, strings, enumerations, etc.
This is not the case in N*, where ``integer'' only really means ``number''.
This is not the case in N$^\star$, where ``integer'' only really means ``number''.
The syntax for the types of integers is given in Figure~\ref{fig:nstar-common-ts-integer-syntax}.

\begin{figure}[htb]
Expand All @@ -105,7 +105,7 @@ \subsection{Integer types}\label{subsec:nstar-common-ts-integer}

Integers have two varying parameters: their sign and sizes.
According to the sign (i.e.\ signed or unsigned), some operations may not perform the same (for example \texttt{mul} does not behave the same).
The size is nothing more than the number of bits occupied by the integer (in N*, those are restricted to powers of 2 between $8$ and $64$ included).
The size is nothing more than the number of bits occupied by the integer (in N$^\star$, those are restricted to powers of 2 between $8$ and $64$ included).
Most operations should perform the same no matter the integer size, however it is recommended to search in the target architecture manual for further reference.

Kinds of integers are written under the form of inference rules in Figure~\ref{fig:nstar-common-ts-integer-kindrules}.
Expand Down Expand Up @@ -168,7 +168,7 @@ \subsection{Other atomic types}\label{subsec:nstar-common-ts-otheratomic}
In all assembly languages, characters are merely syntactic sugar for their ASCII code. This is how their are put in the machine code anyway, so it is not a huge problem (it might even not be at all).

Pointers, on the other hand, are unabstracted memory addresses.
In N*, there are two types of pointers: data pointers and stack pointers.
In N$^\star$, there are two types of pointers: data pointers and stack pointers.
Stack pointers are covered in Subsection~\ref{subsec:nstar-common-ts-stack}~``\nameref{subsec:nstar-common-ts-stack}''.
Data pointers simply represent an address where we know (or not, see the Subsection~\ref{subsec:nstar-common-unsafe-derefliteraladdr}~``\nameref{subsec:nstar-common-unsafe-derefliteraladdr}'') that there is a value of the given pointed type.

Expand Down Expand Up @@ -279,7 +279,7 @@ \subsection{Context types}\label{subsec:nstar-common-ts-records}

\subsection{Stack types}\label{subsec:nstar-common-ts-stack}

There is one stack type in N*: the stack constructor \texttt{::}.
There is one stack type in N$^\star$: the stack constructor \texttt{::}.
Note that there is no ``empty stack'' type as would be the case with e.g.\ lists in Haskell.
The reason is that it forces the developer to abstract the stack to be able to have a ``stack tail'' (the part of the stack on the right of the stack constructor \texttt{::}) at some point.\footnote{It also serves the purpose to ensure that we cannot construct a stack from nothing, and that it should always be given to us, to e.g.\ the \texttt{main} function.}
An example is given in Listing~\ref{lst:nstar-common-ts-records-stackmask}.
Expand Down Expand Up @@ -379,7 +379,7 @@ \subsection{Union types}\label{subsec:nstar-common-ts-unions}

\section{Expressions}\label{sec:nstar-common-expressions}

There are four types of expressions in N*: immediate values, data labels, registers and pointer offsets.
There are four types of expressions in N$^\star$: immediate values, data labels, registers and pointer offsets.
All of these can be used as the source operand of the \texttt{mov} instruction.

\subsection{Immediate values}\label{subsec:nstar-common-expressions-immediate}
Expand All @@ -397,7 +397,7 @@ \subsection{Immediate values}\label{subsec:nstar-common-expressions-immediate}
\scalebox{.5}{
\includegraphics{nstar/exprs/imm-int-grammar}
}
\caption{Grammar for immediate values in N*.}
\caption{Grammar for immediate values in N$^\star$.}
\label{fig:nstar-common-expressions-immediate-grammar}
\end{figure}

Expand All @@ -421,7 +421,7 @@ \subsection{Immediate values}\label{subsec:nstar-common-expressions-immediate}
\infer2{\Xi;\Gamma;\chi;\sigma;\epsilon\vdash^T_{code}$ i : t$}
\end{prooftree}

\caption{Type inference rules for immediate values in N*.}
\caption{Type inference rules for immediate values in N$^\star$.}
\label{fig:nstar-common-expressions-immediate-typerules}
\end{figure}

Expand All @@ -430,7 +430,7 @@ \subsection{Labels}\label{subsec:nstar-common-expressions-labels}
\subsubsection{Data labels}\label{subsubsec:nstar-common-expressions-labels-data}

Data labels are pointers to some piece of data (may it even be an array).
They can be dereferenced and offset using the constructs given in N*.
They can be dereferenced and offset using the constructs given in N$^\star$.
Grammar and inference rules are given respectively in Figure~\ref{fig:nstar-common-expressions-labels-data-grammar} and Figure~\ref{fig:nstar-common-expressions-labels-data-typerules}.

\begin{figure}[H]
Expand All @@ -439,7 +439,7 @@ \subsubsection{Data labels}\label{subsubsec:nstar-common-expressions-labels-data
\scalebox{.5}{
\includegraphics{nstar/exprs/label-expr-grammar}
}
\caption{Grammar for data labels in N*.}
\caption{Grammar for data labels in N$^\star$.}
\label{fig:nstar-common-expressions-labels-data-grammar}
\end{figure}

Expand All @@ -451,7 +451,7 @@ \subsubsection{Data labels}\label{subsubsec:nstar-common-expressions-labels-data
\infer1{\Xi;\Gamma;\chi;\sigma;\epsilon\vdash^T_{code}$ l : *t$}
\end{prooftree}

\caption{Type inference rules for data labels in N*.}
\caption{Type inference rules for data labels in N$^\star$.}
\label{fig:nstar-common-expressions-labels-data-typerules}
\end{figure}

Expand All @@ -467,7 +467,7 @@ \subsubsection{Code labels}\label{subsubsec:nstar-common-expressions-labels-code
\scalebox{.5}{
\includegraphics{nstar/exprs/label-value-grammar}
}
\caption{Grammar for code labels in N*.}
\caption{Grammar for code labels in N$^\star$.}
\label{fig:nstar-common-expressions-labels-code-grammar}
\end{figure}

Expand All @@ -482,14 +482,14 @@ \subsubsection{Code labels}\label{subsubsec:nstar-common-expressions-labels-code
\infer1{\Xi;\Gamma;\chi;\sigma;\epsilon\vdash^T_{code}$ l<$\vec{s}$> : $\forall$().ctx$}
\end{prooftree}

\caption{Type inference rules for code labels in N*.}
\caption{Type inference rules for code labels in N$^\star$.}
\label{fig:nstar-common-expressions-labels-code-typerules}
\end{figure}

\subsection{Registers}\label{subsec:nstar-common-expressions-registers}

Registers in N* are abstracted away from the target architectures.
Instead of manipulating registers \texttt{\%eax} or \texttt{\%rsp} or \texttt{\$a2}, which all are platform-specific, there are 6 general purpose registers in N*, named \texttt{r0} to \texttt{r5}.
Registers in N$^\star$ are abstracted away from the target architectures.
Instead of manipulating registers \texttt{\%eax} or \texttt{\%rsp} or \texttt{\$a2}, which all are platform-specific, there are 6 general purpose registers in N$^\star$, named \texttt{r0} to \texttt{r5}.
They can be 4 to 8 bytes large depending on the target architecture mappings.

Grammar for all those registers is given in Figure~\ref{fig:nstar-common-expressions-registers-grammar} and inference rules are given in Figure~\ref{fig:nstar-common-expressions-registers-typerules}.
Expand All @@ -501,7 +501,7 @@ \subsection{Registers}\label{subsec:nstar-common-expressions-registers}
\includegraphics{nstar/exprs/register-grammar}
}

\caption{Grammar for registers in N*.}
\caption{Grammar for registers in N$^\star$.}
\label{fig:nstar-common-expressions-registers-grammar}
\end{figure}

Expand All @@ -512,7 +512,7 @@ \subsection{Registers}\label{subsec:nstar-common-expressions-registers}
\infer0{\Xi;\Gamma;\chi$, r : t$;\sigma;\epsilon\vdash^T_{code}$ r : t$}
\end{prooftree}

\caption{Type inference rules for registers in N*}
\caption{Type inference rules for registers in N$^\star$}
\label{fig:nstar-common-expressions-registers-typerules}
\end{figure}

Expand Down Expand Up @@ -568,12 +568,12 @@ \subsection{Pointer offsets}\label{fig:nstar-common-expressions-pointeroffsets}

\section{File sections}\label{sec:nstar-common-sections}

Sections in N* serve the exact same purpose as in other assembly languages. They divide a file into multiple parts depending on what the semantics of the current section is supposed to be (code, data, etc).
Sections in N$^\star$ serve the exact same purpose as in other assembly languages. They divide a file into multiple parts depending on what the semantics of the current section is supposed to be (code, data, etc).
Section names obviously differ from one target format to another. As an example, the ``\texttt{.rela.dyn}'' section from the ELF format may not exist in the PE format.

N* tries to unify target formats section names (simplifying targetting N* as a compiler backend) by having a fixed set of section names, all with different meanings. While you can put anything anywhere in classical assembly languages, this is not the case in N*.
N$^\star$ tries to unify target formats section names (simplifying targetting N$^\star$ as a compiler backend) by having a fixed set of section names, all with different meanings. While you can put anything anywhere in classical assembly languages, this is not the case in N$^\star$.

Sections in N* can be named ``\texttt{data}'', ``\texttt{code}'', ``\texttt{rodata}'', ``\texttt{udata}'' or ``\texttt{extern}''. Each of them has defined semantics as described below.
Sections in N$^\star$ can be named ``\texttt{data}'', ``\texttt{code}'', ``\texttt{rodata}'', ``\texttt{udata}'' or ``\texttt{extern}''. Each of them has defined semantics as described below.

\subsection{The \texttt{code} section}\label{subsec:nstar-common-sections-code}

Expand Down Expand Up @@ -734,12 +734,12 @@ \section{Constant values}\label{sec:nstar-common-constvalue}

\section{Restrictions applied to branching}\label{sec:nstar-common-bs-restrictions}

Branching ((un-)\ conditional jumps, calls, etc) is restricted in N* in order to prevent stack leaks, unknown caller address return or even missing return values.
It also is restricted not to integrate some sort of scoping, which would add some (mostly useless) complexity when using/generating N*.
Branching ((un-)\ conditional jumps, calls, etc) is restricted in N$^\star$ in order to prevent stack leaks, unknown caller address return or even missing return values.
It also is restricted not to integrate some sort of scoping, which would add some (mostly useless) complexity when using/generating N$^\star$.

\subsection{Returning to a known code-space address}\label{subsec:nstar-common-bs-restrictions-ret}

N* offers what we call ``continuations'', which are a mean of easily controlling the instruction control flow of the program.
N$^\star$ offers what we call ``continuations'', which are a mean of easily controlling the instruction control flow of the program.
Thanks to that, this is actually easy to check whether we would be returning to some valid piece of code: the continuation part of th context must be bound to something of the form $\forall().ctx$.
When anything else is in the continuation (in fact, it cannot be, because we do not allow overwriting the continuation --- which can be worked around by moving the continuation somewhere else before), it is strictly impossible to either \texttt{call} or \texttt{jmp} to a label, or \texttt{ret} from a piece of code, which would yield a very nasty undefined behavior at runtime in mainstream assembly languages.

Expand Down Expand Up @@ -789,7 +789,7 @@ \subsection{Pointer offsetting}\label{subsec:nstar-common-unsafe-ptroffset}
\texttt{Vec<T>} in Rust, \texttt{std::vector<T>} in C++, \texttt{ArrayList<T>} in Java, and many other vector types are also implemented in terms of a pointer to a chunk of memory, but with an added container size, allowing to safely access elements of the vector without going out of bounds.
But in Rust, we still need to have some \texttt{unsafe} blocks in your code, in order to use the container.

There is the same problem in N*.
There is the same problem in N$^\star$.
Because there is no built-in array type, we have to rely on pointers to be able to achieve such thing, therefore needing a way to offset a pointer to access the various elements in the array.
It would also be really hard to manipulate plain array types (because, for example, of the size to store with the type).
As offsetting a pointer can lead to an invalid address dereferencing (or at least dereferencing an address which doesn't belong to the application memory, i.e.\ allocated by another process), it is considered an unsafe operation and therefore needs to be wrapped in an \texttt{unsafe} block.
Expand Down Expand Up @@ -1210,7 +1210,7 @@ \subsection{Register mappings}\label{subsec:nstar-specific-x86amd64-registers}

\begin{tabularx}{\textwidth}{Y Y Y}
\toprule
N* & x86 & amd64 \\
N$^\star$ & x86 & amd64 \\
\midrule
r0 & eax & rax \\
r1 & ecx & rcx \\
Expand Down

0 comments on commit 8913eea

Please sign in to comment.