"The pickle module implements binary protocols for serializing and de-serializing a Python object structure." - official Python documentation
Pickle is a serialization protocol native to Python that is used to export and import Python objects. It is mainly used to export complex classes, objects, and other structures into a single stream of bytes that can be saved to disk and later imported back into a Python session to recreate the original objects. The process of turning a Python object into a bytestream is called "pickling", and the process of turning the bytestream back into the original Python object is called "unpickling". Pickle is intended to be backwards-compatible, so updates to pickle shouldn't break previously-pickled bytestreams. As an example, new opcodes have been released in new versions, but old opcodes are never "retired".
Just as new Python versions are often released to improve upon existing ones, new versions of pickle are also released; the version numbers are called "protocols". As of this writing, the latest pickle protocol is 5, and the default protocol used when pickling an object is 4. The first version of pickle (protocol 0) was a "human-readable" protocol and only contained ASCII characters found on a keyboard. Versions since then have incorporated non-printable binary bytes, mostly because that allows more opcodes and more efficient serialization.
This documentation focuses specifically on the unpickling process and doesn't cover pickling. There is quite a bit of good official documentation covering the basics of pickle, including how it compares with marshal or JSON, the classes and functions available from this module, how to ensure custom Python classes are pickled properly, and how to make unpickling secure. This documentation fills the gaps, going into a lot more detail. Included below is documentation about the different pickle implementations across the official source code, details about the memory areas used during unpickling, and specifics on interesting functionality.
To see documentation on specific Pickle Virtual Machine (PVM) opcodes, see `Opcodes.md`.
Please note that this is not official documentation for the pickle protocol, nor am I officially affiliated with Python in any way.
Pickle is not natively Turing-complete as it does not support conditional branching, such as `if` statements or loops. Unpickling is meant to proceed through all opcodes one by one, not skipping any, and to stop when all instructions have been executed. However, unpickling may access Python globals and builtins, which greatly increases its capabilities. For this reason, custom unpickling constructions may be created that turn it into a Turing-complete language, or something that acts like one.
One example provided by splitline is rewriting the code

```python
ans = input("Yes/No: ")
if ans == 'Yes':
    print("Great!")
elif ans == 'No':
    exit()
```
as

```python
from functools import partial
condition = {'Yes': partial(print, 'Great!'), 'No': exit}
ans = input("Yes/No: ")
condition.get(ans, repr)()
```
Another implementation, from the SECCON CTF 2023 Quals challenge "Sickle", uses the `Unpickler` class to process a `BytesIO` instance and uses `f.seek()` to move the instruction pointer, thereby achieving conditional branching.
Since unpickling a bytestream can lead to arbitrary Python functions being called, a security warning is provided in the documentation stating:
> **Warning:** The pickle module is not secure. Only unpickle data you trust.
>
> It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.
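As a small sketch of why this warning exists: the documented `__reduce__` hook lets any object specify a callable to invoke at unpickling time. A harmless `print` call stands in here for what could be any importable function:

```python
import pickle

class Demo:
    def __reduce__(self):
        # The returned (callable, args) pair is invoked during unpickling;
        # print is harmless here, but it could be any importable callable.
        return (print, ("arbitrary code ran during unpickling",))

payload = pickle.dumps(Demo())
result = pickle.loads(payload)  # prints the message above
```

Note that the pickled bytes reference `builtins.print` by name, not the `Demo` class, so the receiving side doesn't even need the class defined.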
There are 3 main implementations of the unpickling process in the Python source code: a Python version, a C version, and pickletools. The C version of pickle is called `_pickle` (Python 2 called it `cPickle`), and pickletools is a module meant to imitate unpickling in order to analyze pickled data. The code for `pickle.py` and `pickletools.py` is found in the `Lib` folder of cpython, while `_pickle.c` is located in the `Modules` folder.
By default, importing the `pickle` module will attempt to use the C-optimized `_pickle` version and fall back to the Python `pickle.py` implementation if that fails, using the following code:
```python
# Use the faster _pickle if possible
try:
    from _pickle import (
        PickleError,
        PicklingError,
        UnpicklingError,
        Pickler,
        Unpickler,
        dump,
        dumps,
        load,
        loads
    )
except ImportError:
    Pickler, Unpickler = _Pickler, _Unpickler
    dump, dumps, load, loads = _dump, _dumps, _load, _loads
```
To specifically use the C-optimized version, you can use the following code:

```python
import _pickle
_pickle.loads(b'...')
```
To specifically use the Python version, you can use the following code:

```python
from pickle import _loads
_loads(b'...')
```
Pickletools is designed to iterate through the bytestream, identify opcodes and opcode arguments, and even keep track of how many elements are on the stack. However, it does not evaluate the opcodes beyond determining what their arguments are. For some opcodes this doesn't mean much. As an example, the bytes `b'K\x01'` use the `BININT1` opcode to push `1` to the stack; pickletools will match the opcode `K` to `BININT1` and decode the byte `\x01` as `1`. However, an opcode like `REDUCE` uses the byte `b'R'`; pickletools will report that the `REDUCE` opcode is being used, but since it doesn't keep track of the values of the items on the stack, it never actually runs the opcode.
The `pickletools` module determines the opcodes and opcode arguments using the `genops` function, which is a generator yielding `(opcode, arg, position)` tuples. The `dis` function prints out all of these tuples in a nice format as a disassembly for users.
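As a small sketch, using a protocol-2 pickle of the integer `1`, `genops` can be iterated over directly:

```python
import pickle
import pickletools

data = pickle.dumps(1, protocol=2)  # b'\x80\x02K\x01.'

# genops yields (opcode, arg, position) without evaluating anything
names = [opcode.name for opcode, arg, pos in pickletools.genops(data)]
print(names)  # ['PROTO', 'BININT1', 'STOP']

# dis prints the same stream as a human-readable disassembly
pickletools.dis(data)
```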
Because pickletools doesn't actually evaluate opcodes or keep track of the values of stack items, it is safe to run untrusted pickled data through it. It also means that some invalid pickled data may pass through it undetected.
Pickletools also keeps track of the memo (without the actual values) to ensure that memo keys exist when accessed through opcodes like `GET` or `BINGET`.
The Pickle Virtual Machine uses a Harvard architecture, meaning the program instructions and writable data memory are separated (preventing self-modifying code and memory corruption vulnerabilities). Program instructions are provided through a file or `BytesIO` object, and data is kept in one of three storage areas: the stack, the metastack, and the memo.
Just like in any other programming language, the stack is a LIFO data structure where data is stored and processed. A majority of the PVM opcodes are meant for pushing different Python data structures and objects onto the stack. Opcodes also exist for modifying the objects on the stack and popping objects from the stack. Any Python data structure can exist on the stack, and all opcodes that modify stack variables do so using a relative offset from the end of the stack. For example, the `APPEND` opcode takes the object off the top of the stack and appends it to the list second from the top (using code like `stack[-2].append(stack.pop())`).
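As a small hand-assembled sketch (opcode bytes taken from the protocol definitions), `EMPTY_LIST`, `BININT1`, and `APPEND` can be combined to build a one-element list:

```python
import pickle

# ']'     EMPTY_LIST  push []
# 'K\x01' BININT1     push 1
# 'a'     APPEND      stack[-2].append(stack.pop())
# '.'     STOP        return the top of the stack
assert pickle.loads(b']K\x01a.') == [1]
```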
The metastack is a storage area that only really exists in the Python implementation of pickle. Some opcodes operate on a variable number of objects and don't have the specific number embedded in the instruction. Instead, the `MARK` opcode places a special "marker" onto the stack, and some opcodes will operate on ALL objects from the top of the stack down to the first `MARK` object. In the C `_pickle` module, this concept is implemented as a separate array of `MARK` positions (each element of the array containing the index of the stack where a `MARK` object is) and a `fence` variable marking the position of the top `MARK` object, and therefore where the stack ends and the metastack begins. In the Python `pickle` module, this concept is implemented as a single Python list making up the stack; whenever a `MARK` object is placed, the current values on the stack are popped (as a list) onto the end of the metastack list. Therefore, the C implementation has a single array with `MARK` placements annotated, while the Python implementation has two lists, with the metastack implemented as a list of stack segments separated by `MARK` objects.
Here's a visualization of the stack and metastack in C:
Here's a visualization of the stack and metastack in Python:
For more details, see the `MARK` and `POP_MARK` opcode documentation.
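A hand-assembled sketch of `MARK` in action (opcode bytes from the protocol definitions): `APPENDS` pops every object back to the most recent marker and extends the list beneath it:

```python
import pickle

# ']'     EMPTY_LIST  push []
# '('     MARK        push a marker (Python impl: spill the stack to the metastack)
# 'K\x01' BININT1     push 1
# 'K\x02' BININT1     push 2
# 'e'     APPENDS     pop everything back to the MARK, extend the list with it
# '.'     STOP
assert pickle.loads(b'](K\x01K\x02e.') == [1, 2]
```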
The memo is a third storage area implemented as a dictionary. Python data structures that are used multiple times can be stored in the memo area (using an incrementing integer as the key) so they don't need to be stored in the pickle multiple times. For example, if you want to use the string "Hello World" twice, instead of having to use the `STRING` opcode twice to get two instances of it on the stack, you can:

- Use `STRING` to load "Hello World" onto the stack
- Use `PUT` to store "Hello World" in the memo at index 1 (keeping it on the stack)
- Do stuff with "Hello World"
- (do other stuff)
- Use `GET` to load "Hello World" from the memo at index 1 onto the stack
- Do stuff with "Hello World"
Note that `pickle.dumps()` will automatically start memo placement at index 0 and increment by 1, but custom pickles can choose any numeric index to store data in.
Version 5 of the Pickle Virtual Machine has 68 opcodes. Documentation, descriptions, examples, and Python 3.11 source code for each opcode are present in `Opcodes.md`.
In order to avoid transferring massive amounts of data in a pickle, out-of-band buffers became supported in Python 3.8. If both the provider and consumer support transferring out-of-band data, then pickle protocol 5 may be used. These buffers must be passed in as arguments to the `Unpickler` class, and the two relevant opcodes (`NEXT_BUFFER` and `READONLY_BUFFER`) only support pulling data from the buffer list or making `bytearray`s read-only.
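A sketch of the round trip, closely following the PEP 574 pattern (the `ZeroCopy` class name is made up for illustration): the pickler hands each `PickleBuffer` to `buffer_callback` instead of embedding it, and the unpickler pulls the buffers back out of the `buffers` argument:

```python
import pickle

class ZeroCopy(bytearray):
    """A bytearray subclass that opts in to out-of-band pickling."""
    def __reduce_ex__(self, protocol):
        if protocol >= 5:
            # Serialize as a PickleBuffer so the bytes can travel out-of-band
            return type(self)._reconstruct, (pickle.PickleBuffer(self),), None
        return super().__reduce_ex__(protocol)

    @classmethod
    def _reconstruct(cls, obj):
        with memoryview(obj) as m:
            return cls(m)

buffers = []
obj = ZeroCopy(b"big data")
# buffer_callback collects the raw buffers; the payload only holds references
payload = pickle.dumps(obj, protocol=5, buffer_callback=buffers.append)
restored = pickle.loads(payload, buffers=buffers)
assert restored == obj
```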
The extension registry is like a cross-over between the memo and out-of-band buffers. The extension registry is a dictionary with positive integers as keys and `("module_name", "name")` tuples as values, provided by the `copyreg` module. The extension registry must be defined outside of (and before) the unpickling process using code like `copyreg.add_extension("module_name", "attr_name", 1)`. Then, when an opcode like `EXT1` is used with the argument `1`, it will refer to `copyreg`'s `_inverted_registry` variable and pull out the `("module_name", "name")` tuple associated with code `1`. Finally, it will pass that tuple through `find_class()` to recover the actual object and place it on the stack.
Note that in `copyreg`, the `_extension_registry` variable has `("module_name", "name")` tuples as keys and `code` integers as values, while the `_inverted_registry` variable has `code` integers as keys and `("module_name", "name")` tuples as values.
There also exists an `_extension_cache` object that is keyed by `code` integers like `_inverted_registry`, but holds the resolved objects as values; its entries are only populated after a code is first accessed during the unpickling process.
Pickle actually has a `STOP` opcode that indicates the end of the data; normally, virtual machines that don't support conditional branching don't require this, as they can finish once there are no more instructions. Once the `STOP` opcode is encountered, all remaining bytes in the bytestream are ignored and the top of the stack is returned to the user. An error is thrown if no `STOP` opcode is encountered or if there are no values left on the stack to return.
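A quick sketch of that behavior: anything after `STOP` is never read:

```python
import pickle

# 'K\x2a' BININT1  push 42
# '.'     STOP     return the top of the stack; everything after is ignored
assert pickle.loads(b'K\x2a.these trailing bytes are never read') == 42
```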
Interestingly, the `pickletools` module enforces extra requirements for pickled data that are not actually enforced in the `pickle.py` and `_pickle.c` implementations. Pickletools requires that only one object be left on the stack and will error if multiple objects are present. However, `pickletools` treats the `MARK` opcode as a stack object, and will not throw an error if a "mark object" is the only object present on the stack. See more details in the description for the `STOP` opcode.
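A sketch of the difference (hand-assembled bytes pushing two integers before `STOP`): `pickle` happily returns the top value, while `pickletools.dis` raises:

```python
import pickle
import pickletools

data = b'K\x01K\x02.'  # BININT1 1, BININT1 2, STOP -- one item left over

assert pickle.loads(data) == 2  # pickle just returns the top of the stack

try:
    pickletools.dis(data)  # prints the disassembly, then checks the stack
except ValueError:
    print("pickletools rejected the leftover stack item")
```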
If you are interested in contributing to the documentation, feel free to create an issue or make a pull request.