Commit c976793 (parent 0cc65a0): Proofreading
Solumin committed Oct 28, 2014
doc/report/project_report.tex: 13 additions and 13 deletions
\section{Parser}\label{parser}

\paragraph{}
The Python compiler turns script files into compiled files by storing the data and code of the script in a binary format. This process is referred to as marshalling the data and code. Each \pyc file, which contains the marshalled script, is composed of:

\begin{itemize}
\item Magic number: the first 8 bytes of the \pyc file. It indicates which Python version the bytecode was generated by;
\item Date: the next 8 bytes. It indicates the date of compilation;
\item Code object: the remainder of the file. More details are given in Section~\ref{code object}.
\end{itemize}
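\paragraph{}
The header layout above can be sketched in TypeScript, the language our interpreter is written in (the function name is illustrative, and the field widths simply follow the description above):

```typescript
// Illustrative sketch: split a .pyc buffer into the three parts listed
// above. Field widths follow the report's description of the format.
function splitPyc(data: Uint8Array) {
  return {
    magicNumber: data.slice(0, 8), // identifies the compiling Python version
    date: data.slice(8, 16),       // date of compilation
    codeObject: data.slice(16),    // the marshalled code object
  };
}
```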

We define the class Unmarshaller to handle the parsing phase. The class has the attributes index, input, magicNumber, date, internedStrs, and output.
\begin{itemize}
\item \texttt{index}: the current position in the file being parsed
\item \texttt{input}: the path for the \pyc file that is going to be processed
\item \texttt{magicNumber}: stores the magic number
\item \texttt{date}: stores the date of compilation
\item \texttt{internedStrs}: stores the strings that were preceded by `t', meaning that they should be saved for later use. These strings can be referred to by other marshalled items.
\item \texttt{output}: the code object generated after parsing the \pyc file
\end{itemize}
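\paragraph{}
A rough skeleton of the class might look as follows (the field types here are illustrative guesses, not the project's actual declarations):

```typescript
// Skeletal Unmarshaller mirroring the attribute list above.
// Types are our guesses for illustration; the real class may differ.
class Unmarshaller {
  index: number = 0;                    // current read position in the file
  magicNumber: Uint8Array | null = null; // first 8 bytes
  date: Uint8Array | null = null;        // next 8 bytes
  internedStrs: string[] = [];           // strings tagged with `t'
  output: any = null;                    // the resulting code object
  constructor(public input: string) {}   // path to the .pyc file
}
```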



\paragraph{}
The Unmarshaller, as we have named it, parses a \pyc file by recursing over the file and creating a code object. The marshal format encodes data by using a single character for the object's type (`\texttt{i}' for 32-bit integers, `\texttt{s}' for strings, and so on) followed by some number of bytes. For example, a long integer (Python's arbitrary-precision integer type) consists of the letter `\texttt{l}', a 32-bit integer indicating the length of the number, and a corresponding sequence of 15-bit digits. Data whose types are already known, such as the fields of code objects (discussed below), are stored as raw bytes.
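\paragraph{}
The tag-then-payload scheme can be illustrated with a small dispatcher. This is a simplified sketch, not the project's actual code; it assumes little-endian 32-bit integers and handles only two tags:

```typescript
// Simplified tag dispatch: one character names the type, and the
// following bytes are decoded accordingly. Only 'i' and 's' are shown.
function unmarshalObject(data: Uint8Array, pos: { i: number }): any {
  const tag = String.fromCharCode(data[pos.i++]);
  switch (tag) {
    case "i": { // 32-bit little-endian integer
      let n = 0;
      for (let b = 0; b < 4; b++) n |= data[pos.i++] << (8 * b);
      return n;
    }
    case "s": { // 32-bit length followed by that many character bytes
      let len = 0;
      for (let b = 0; b < 4; b++) len |= data[pos.i++] << (8 * b);
      let s = "";
      for (let b = 0; b < len; b++) s += String.fromCharCode(data[pos.i++]);
      return s;
    }
    default:
      throw new Error("unsupported tag: " + tag);
  }
}
```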

\paragraph{}
The advantage of this design is being able to define the marshal format algebraically. A tuple, a common data structure in \pyc files, is stored with the number of elements (a 32-bit integer) followed by the sequence of elements. The Unmarshaller decodes the tuples by reading the length, creating an empty array, unmarshalling each element in turn and pushing it into the array. Since we don't know anything about the contents of the tuple besides the number of elements, we can extract arbitrary objects with no problem.
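\paragraph{}
The tuple-decoding steps above can be sketched as follows. This is a simplified version that, for brevity, assumes every element is an `\texttt{i}'-tagged 32-bit integer rather than an arbitrary object:

```typescript
// Read a little-endian 32-bit integer and advance the position.
function readInt32(data: Uint8Array, pos: { i: number }): number {
  let n = 0;
  for (let b = 0; b < 4; b++) n |= data[pos.i++] << (8 * b);
  return n;
}

// Tuple decoding as described: read the length, create an empty array,
// then unmarshal each element in turn and push it into the array.
function unmarshalTuple(data: Uint8Array, pos: { i: number }): number[] {
  const len = readInt32(data, pos);
  const out: number[] = [];
  for (let k = 0; k < len; k++) {
    const tag = String.fromCharCode(data[pos.i++]);
    if (tag !== "i") throw new Error("this sketch only handles 'i' elements");
    out.push(readInt32(data, pos));
  }
  return out;
}
```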

\paragraph{}
Our Unmarshaller was inspired in part by Cody Brocious's work with \pyc files and his RMarshal gem\footnote{\url{https://github.com/daeken/RMarshal}}.
\subsection{Frame Objects}\label{py-frameobject}

\paragraph{}
In a typical execution model, each function call pushes a stack frame to a central stack, which contains the function's arguments and other information. In our Python interpreter, the situation is reversed: each frame has its own stack. Arguments to the function are contained in dictionaries on the stack. When a Python function is called, the interpreter creates a new stackframe with the variables in the current scope, a pointer to the previous stack frame, and the code to execute. The interpreter then starts executing this frame.

\paragraph{}
As with the code objects mentioned above, our implementation of frame objects was inspired by the official Python documentation. However, we originally used a single stack in the interpreter, since it was the simplest approach and the documentation did not mention the stack. We found that, while this implementation was simple, it made more complex code difficult to interpret. For example, the central stack design had no clear boundary between function calls, and a function would just leave its return value on the stack when it exited. Having a stack per frame gave the RETURN\_VALUE opcode significance and clearly marked where the interpreter should stop executing the current frame. It also made cleanup easier, since the leftover data could be discarded.
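\paragraph{}
The stack-per-frame design can be sketched as follows (the names are illustrative, not the actual Py\_FrameObject interface):

```typescript
// Illustrative stack-per-frame model: each frame owns its value stack,
// and RETURN_VALUE pops the result and hands it to the caller's stack.
class Frame {
  stack: any[] = [];
  constructor(public caller: Frame | null) {}

  // Models the RETURN_VALUE opcode: the result is moved to the caller,
  // and any leftover values on this frame's stack are simply discarded.
  returnValue(): void {
    const v = this.stack.pop();
    if (this.caller !== null) this.caller.stack.push(v);
  }
}
```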

\paragraph{}
The following are the attributes and some methods of the Py\_FrameObject class:
\subsection{Numeric Types} \label{numeric}

\paragraph{}
Besides functions, another significant feature of our interpreter is the numeric methods. Python provides four basic numeric types: Integers (32- and 64-bit), Long Integers (arbitrary precision), Floating-point numbers (IEEE-754) and Complex numbers. However, programmers do not need to concern themselves with which number type they are using, as Python will invisibly transition between types as needed. This is a key feature of the language, and we felt it was important to make sure our numbers met this standard.

\paragraph{}
Our original implementation was straightforward: 32-bit integers and floating-point numbers were represented using normal JavaScript numbers, 64-bit integers were stored in gLong\footnote{A 64-bit integer library developed at Google, later adapted to TypeScript as part of the Doppio project. \url{https://github.com/plasma-umass/doppio/blob/master/src/gLong.ts}} objects, longs in Decimal\footnote{An arbitrary-precision library created by Michael McLaughlin. \url{https://github.com/MikeMcl/decimal.js/}} objects, and complex numbers as a simple class with two JavaScript number fields, representing the real and imaginary parts of the number. The limitations of this approach became clear very quickly. The key problem was widening: Python defines a hierarchy of numeric types, where integers are narrower than longs, which are narrower than floats, which are narrower than complex numbers. An operation between a narrow and a wider number casts the narrow number up to the wider type. This was impossible when 32-bit integers were both distinct from 64-bit integers (Python does not make this distinction) and indistinguishable from floating-point numbers. Additionally, these operations are defined by functions (e.g. addition has an \texttt{add} function) and reverse functions (e.g. \texttt{radd}). Reverse functions are used in widening: if \texttt{a + b => a.add(b)} is undefined due to \texttt{b}'s type, \texttt{b.radd(a)} is called instead. These functions would have to be added to the class definitions, which was impossible for floats since they are primitive types.
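\paragraph{}
The reverse-function dispatch can be sketched as follows (a simplified model with made-up class names, not our actual numeric classes):

```typescript
// Simplified add/radd protocol: add() returns undefined when it cannot
// handle the other operand's type, and the interpreter then tries the
// reversed method on the other operand, which widens as needed.
class PyInt {
  constructor(public n: number) {}
  add(other: any): PyInt | undefined {
    return other instanceof PyInt ? new PyInt(this.n + other.n) : undefined;
  }
}

class PyFloat {
  constructor(public x: number) {}
  add(other: any): PyFloat | undefined {
    if (other instanceof PyFloat) return new PyFloat(this.x + other.x);
    if (other instanceof PyInt) return new PyFloat(this.x + other.n); // widen int
    return undefined;
  }
  radd(other: any): PyFloat | undefined {
    return this.add(other); // addition commutes, so reuse add()
  }
}

// Interpreter-side dispatch: try a.add(b), fall back to b.radd(a).
function binaryAdd(a: any, b: any): any {
  const r = a.add(b);
  return r !== undefined ? r : b.radd(a);
}
```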

\paragraph{}
We implemented one class for each type: Py\_Int, which combined 32- and 64-bit integers in one class with gLong objects as the backend; Py\_Long, which simply wrapped Decimal objects; Py\_Float, which used JavaScript numbers; and Py\_Complex, which used two Py\_Floats for its fields. In each class, we implemented the mathematical operations and their reversed counterparts. Since each operation needed to handle widening, we defined a general \texttt{mathOp} function, which widens arguments or defers execution as needed. Each operation passes an anonymous function to this function. We used a similar approach to define the comparison operators, since they also widen arguments as needed. The \texttt{mathOp} function for the Py\_Float class looks as follows:

\begin{verbatim}
private mathOp(other: any, op: (a: Py_Float, b: Py_Float) => any): any {
    ...
}
\end{verbatim}

\section{Using the Interpreter}\label{using interpreter}

\paragraph{}
We treated the interpreter as a Node.js application during development, since this made it easier to develop and debug. Each class was given its own file, which the other classes imported as needed. External libraries were included as separate TypeScript and JavaScript files. Obviously, this approach does not work in the browser. The first step to enabling our interpreter in the browser was replacing Node.js with John Vilk's BrowserFS\footnote{\url{https://github.com/jvilk/BrowserFS}}. We only used the file system and Buffer interfaces from Node.js, which BrowserFS replaced with browser-friendly versions. The second step was turning our many JavaScript files into one script for the browser. We used Browserify\footnote{\url{http://browserify.org/}}, which takes a single source JavaScript file and bundles all of its dependencies into one script. Thanks to these two resources, our Python bytecode interpreter is usable in the browser and on the command line.

\paragraph{}
The project's \textsc{Readme} describes how to use the interpreter in both of these situations. It should be noted that the Node.js command-line version is used for testing, while the browser version is used for interpreting any given \pyc file.

\subsection{Test Suite}

\paragraph{}
Given the size of this project, testing was a vital step for ensuring the correctness of our implementation. We first used very simple tests in a sort of ``test-driven development'' model. We would write a short Python script that tested a feature we wanted to implement, then throw it into the interpreter to see which opcodes we still needed to implement.

\paragraph{}
The current tests are targeted at our numeric methods, function implementation, and comparison methods. We developed the reference values by running each test in Python as we wrote the tests. The tests are not exhaustive, though more elaborate testing schemes (namely computer-generated tests) could be helpful for ensuring all corner cases and argument type combinations are covered.
\section{Results}\label{results}

\paragraph{}
Our interpreter currently supports the following features:
\begin{itemize}
\item Functions, including keyword and default arguments
\item Condition statements and boolean values, comparisons between numeric types
\item Four numeric types: Integers, Longs, Floats and Complex numbers, which can be used together with no programmer overhead
\item Most unary and binary opcodes, along with those necessary for loading and storing variables and executing functions
\end{itemize}

