
Unofficial documentation for the Python Pickle deserialization protocol


Legoclones/pickledoc

Latest commit

355fe97 · Jan 4, 2025


Python Pickle Documentation

"The pickle module implements binary protocols for serializing and de-serializing a Python object structure." - official Python documentation

Pickle is a serialization protocol native to Python that is used to export and import Python objects. It is mainly used to export complex classes, objects, and other structures into a single stream of bytes that can be saved to disk and later imported back into a Python session to recreate the original objects. The process of turning a Python object into a byte stream is called "pickling", and the process of turning the byte stream back into the original Python object is called "unpickling". Pickle is intended to be backwards-compatible, so updates to pickle shouldn't break previously pickled byte streams. For example, new opcodes have been added in newer versions, but old opcodes are never "retired".
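As a minimal sketch of the round trip described above (the sample object is only an illustration, any picklable object would do):

```python
import pickle

# Illustrative object; the contents are arbitrary.
original = {"name": "example", "values": [1, 2.5, None], "nested": ("a", b"b")}

data = pickle.dumps(original)   # pickling: object -> bytes
restored = pickle.loads(data)   # unpickling: bytes -> a new, equal object

print(type(data).__name__, restored == original, restored is original)
```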

Just as new Python versions are often released to improve upon existing ones, new versions of Pickle are also released; the version numbers are called "protocols". As of this writing, the latest pickle protocol is 5, and the default protocol used when pickling an object is 4. The first version of pickle (protocol 0) was a "human-readable" protocol and only contained ASCII characters found on a keyboard. Later protocols have incorporated non-printable binary bytes, mostly because that allows more opcodes and more efficient serialization.
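The difference between the text-based protocol 0 and the later binary protocols can be seen by pickling the same object twice. This is a small sketch; the values printed for the highest and default protocol depend on the Python version in use.

```python
import pickle

obj = ["text", 42]
ascii_pickle = pickle.dumps(obj, protocol=0)    # protocol 0: printable ASCII only
binary_pickle = pickle.dumps(obj, protocol=2)   # protocol 2+: starts with the PROTO opcode

print(pickle.HIGHEST_PROTOCOL, pickle.DEFAULT_PROTOCOL)
print(ascii_pickle)
print(binary_pickle[:2])   # PROTO opcode byte (0x80) followed by the protocol number
```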

This documentation focuses specifically on the unpickling process and doesn't cover pickling. There is quite a bit of good official documentation covering the basics of pickles, including how it compares with marshal or JSON, the classes and functions available from this module, how to ensure custom Python classes are pickled properly, and how to make unpickling secure. This documentation covers the gaps in that documentation, going into a lot more detail. Included below is documentation about the different pickle implementations across all official source code, details about the memory areas used during unpickling, and specifics on interesting functionality.

To see documentation on specific Pickle Virtual Machine (PVM) opcodes, see Opcodes.md.

Please note that this is not official documentation for the pickle protocol, nor am I officially affiliated with Python in any way.

Conditional Branching

Pickle is not natively Turing-complete as it does not support conditional branching, such as if statements or loops. Unpickling is meant to proceed through all opcodes one by one, not skipping any, and stopping when all instructions have been executed. However, unpickling may access Python globals and builtins which greatly increases its capabilities. For this reason, custom unpickling conditions may be created that turn it into a Turing-complete language, or act like one.

One example provided by splitline is rewriting the code

ans = input("Yes/No: ")
if ans == 'Yes':
    print("Great!")
elif ans == 'No':
    exit()

as

from functools import partial
condition = {'Yes': partial(print, 'Great!'), 'No': exit}
ans = input("Yes/No: ")
condition.get(ans, repr)()

Another implementation, from the SECCON CTF 2023 Quals challenge "Sickle", uses the Unpickler class to process a BytesIO instance and calls f.seek() to move the instruction pointer, thereby achieving conditional branching.
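The dictionary-dispatch idea can also be expressed directly in pickle opcodes. The following is a hand-assembled sketch, not taken from either source above: the helper name build_dispatch_pickle and the 'other' default are made up for illustration, and the answer is assumed to be a plain ASCII string. It selects a value by looking the answer up in a dict during unpickling.

```python
import pickle

def build_dispatch_pickle(answer: str) -> bytes:
    """Hand-assemble a pickle that evaluates table.get(answer, 'other')."""
    return (
        b"cbuiltins\ngetattr\n"      # GLOBAL: push builtins.getattr
        b"(}"                        # MARK, EMPTY_DICT
        b"(S'Yes'\nS'Great!'\n"      # MARK, then first key and value
        b"S'No'\nS'Bye'\n"           # second key and value
        b"u"                         # SETITEMS: fill the dict
        b"S'get'\n"                  # STRING 'get'
        b"tR"                        # TUPLE, REDUCE: getattr(table, 'get')
        b"(S" + repr(answer).encode("ascii") + b"\n"  # MARK, STRING answer
        b"S'other'\n"                # STRING default value
        b"tR."                       # TUPLE, REDUCE: table.get(answer, 'other'), STOP
    )

for answer in ("Yes", "No", "Maybe"):
    print(answer, "->", pickle.loads(build_dispatch_pickle(answer)))
```

The opcodes still run strictly one after another; the "branch" is just the dict lookup deciding which object ends up on the stack.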

Since unpickling a bytestream can lead to arbitrary Python functions being called, a security warning is provided in the documentation stating:

Warning: The pickle module is not secure. Only unpickle data you trust.

It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.
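A harmless sketch of why this warning exists: the hand-written pickle below uses GLOBAL to fetch a callable and REDUCE to call it. Here the callable is builtins.len, chosen only because it has no side effects; any importable callable could be named instead.

```python
import pickle

data = (
    b"cbuiltins\nlen\n"   # GLOBAL: push builtins.len
    b"(S'abc'\n"          # MARK, STRING 'abc'
    b"t"                  # TUPLE: build ('abc',)
    b"R"                  # REDUCE: call len('abc')
    b"."                  # STOP: return the top of the stack
)

result = pickle.loads(data)
print(result)  # 3
```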

Pickle Implementations

There are 3 main implementations of the unpickling process in the Python source code: a Python version, a C version, and pickletools. The C version of pickle is called _pickle (Python 2 called it cPickle), and pickletools is a module meant to imitate unpickling in order to analyze pickled data. The code for pickle.py and pickletools.py is found in the Lib folder of cpython, while _pickle.c is located in the Modules folder.

By default, using the pickle module will attempt to use the C-optimized _pickle.c version and fall back to the Python pickle.py implementation if that fails using the following code:

# Use the faster _pickle if possible
try:
    from _pickle import (
        PickleError,
        PicklingError,
        UnpicklingError,
        Pickler,
        Unpickler,
        dump,
        dumps,
        load,
        loads
    )
except ImportError:
    Pickler, Unpickler = _Pickler, _Unpickler
    dump, dumps, load, loads = _dump, _dumps, _load, _loads

To specifically use the C-optimized version, you can use the following code:

import _pickle
_pickle.loads(b'...')

To specifically use the Python version, you can use the following code:

from pickle import _loads
_loads(b'...')
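To check which implementation is active and confirm that both agree on ordinary data, something like the following can be used. It is a sketch that assumes CPython, where the _pickle accelerator is normally available.

```python
import pickle
import _pickle  # C accelerator; may be missing on non-CPython interpreters

data = pickle.dumps({"k": [1, 2, 3]})

using_c = pickle.loads is _pickle.loads   # True when the import in pickle.py succeeded
result_c = _pickle.loads(data)            # C implementation
result_py = pickle._loads(data)           # pure-Python implementation

print(using_c, result_c == result_py)
```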

Pickletools

Pickletools is designed to iterate through the byte stream, identify opcodes and their arguments, and even keep track of how many elements are on the stack. However, it does not evaluate the opcodes beyond determining what their arguments are. For some opcodes this makes little difference. As an example, the bytes b'K\x01' use the BININT1 opcode to push 1 to the stack; pickletools will match the opcode K to BININT1 and decode the byte \x01 as 1. However, an opcode like REDUCE uses the byte b'R'; pickletools will report that the REDUCE opcode is being used, but since it doesn't keep track of the values of the items on the stack, it never actually runs the opcode.

The pickletools module determines the opcodes and opcode arguments using the genops function, which returns an iterator of (opcode, arg, position) tuples. The dis function prints all of these tuples in a readable format as a disassembly for users.
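A short sketch of genops on a two-opcode pickle; each yielded opcode is an OpcodeInfo object, so its name attribute is used for display:

```python
import pickletools

# b"K\x01" pushes the integer 1, b"." is STOP.
ops = [(op.name, arg, pos) for op, arg, pos in pickletools.genops(b"K\x01.")]
print(ops)  # [('BININT1', 1, 0), ('STOP', None, 2)]
```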

Because pickletools doesn't actually evaluate opcodes nor keep track of the values of stack items, it is secure and untrusted pickled data can be run through it. It also means that there may be invalid pickled data that it does not pick up on.
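Both points can be illustrated with one pickle that references a module that does not exist. The module name below is made up for this sketch. pickletools.dis disassembles it without complaint because nothing is imported or called, while a real unpickle fails at the import.

```python
import io
import pickle
import pickletools

# GLOBAL on a made-up module, then MARK, TUPLE, REDUCE, STOP.
data = b"cno_such_module_for_demo\nfunc\n(tR."

out = io.StringIO()
pickletools.dis(data, out=out)   # only decodes opcodes and arguments
listing = out.getvalue()

try:
    pickle.loads(data)           # actually tries to import the module
    load_error = None
except ImportError as exc:       # ModuleNotFoundError is a subclass of ImportError
    load_error = exc

print(listing)
print(type(load_error).__name__)
```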

Pickletools also keeps track of the memo (without the actual values) to ensure that memo keys exist when accessed through opcodes like GET or BINGET.
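A sketch of that memo check: the first pickle stores a value with BINPUT before reading it back with BINGET, and the second reads a key that was never stored, which dis rejects with a ValueError.

```python
import io
import pickletools

# BININT1 1, BINPUT 0, POP, BINGET 0, STOP
valid = b"K\x01" b"q\x00" b"0" b"h\x00" b"."
pickletools.dis(valid, out=io.StringIO())   # passes: key 0 was stored before use

# BINGET 0, STOP: memo key 0 is read without ever being written
try:
    pickletools.dis(b"h\x00.", out=io.StringIO())
    memo_error = None
except ValueError as exc:
    memo_error = str(exc)

print(memo_error)
```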

Memory Areas

The Pickle Virtual Machine uses a Harvard architecture, meaning the program instructions and writable data memory are separated (preventing self-modifying code and memory corruption vulnerabilities). Program instructions are provided through a file or BytesIO object, and data is kept in one of three storage areas: the stack, the metastack, and the memo.

Stack and Metastack

Just like in any other programming language, the stack is a LIFO data structure where data is stored and processed. A majority of the PVM opcodes are meant for pushing different Python data structures and objects onto the stack. Opcodes also exist for modifying the objects on the stack and popping objects from the stack. Any Python data structure can exist on the stack, and all opcodes that modify stack variables do so using a relative offset from the end of the stack. For example, the APPEND opcode takes the object off the top of the stack and adds it to the list placed second-to-the-end of the stack (using code like stack[-2].append(stack.pop())).
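The APPEND behaviour can be checked with a hand-written pickle, alongside the same steps modeled on an ordinary Python list standing in for the stack (a simplified model, not the real implementation):

```python
import pickle

# EMPTY_LIST, BININT1 1, APPEND, BININT1 2, APPEND, STOP
data = b"]" b"K\x01" b"a" b"K\x02" b"a" b"."
result = pickle.loads(data)

# The first three opcodes, modeled with a plain list as the stack.
stack = []
stack.append([])                 # EMPTY_LIST
stack.append(1)                  # BININT1 1
stack[-2].append(stack.pop())    # APPEND

print(result, stack)  # [1, 2] [[1]]
```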

The metastack is a storage area that only really exists in the Python implementation of pickle. Some opcodes operate on a variable number of objects and don't have the specific number embedded in the instruction. Instead, the MARK opcode places a special "marker" onto the stack, and some opcodes will operate on ALL objects from the top of the stack down to the most recent MARK. In the C _pickle module, this concept is implemented as a separate array of marks (each element holding the stack index at which a MARK was placed) and a fence variable recording the position of the topmost MARK, and therefore where the stack ends and the metastack begins. In the Python pickle module, the stack is a Python list; whenever a MARK is placed, the entire current stack list is pushed onto the end of the metastack list and a new empty list becomes the stack. Therefore, the C implementation has a single array with MARK placement annotated, while the Python implementation has two lists, with the metastack holding the stack segments that were set aside at each MARK.

Figure: the stack and metastack in the C implementation.

Figure: the stack and metastack in the Python implementation.

For more details, see the MARK and POP_MARK opcode documentation.
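The two approaches to MARK handling can be compared with a pair of toy models. These are simplified sketches: the class names are invented here, and the C-style model leaves out the fence variable and all error handling.

```python
class PyStyleStack:
    """Two lists: MARK sets the current stack aside on the metastack."""
    def __init__(self):
        self.stack = []
        self.metastack = []

    def push(self, obj):
        self.stack.append(obj)

    def mark(self):
        self.metastack.append(self.stack)   # save everything pushed so far
        self.stack = []                     # start a fresh segment

    def pop_mark(self):
        items = self.stack                  # everything since the last MARK
        self.stack = self.metastack.pop()   # restore the saved segment
        return items


class CStyleStack:
    """One array of objects plus an array of MARK positions."""
    def __init__(self):
        self.data = []
        self.marks = []

    def push(self, obj):
        self.data.append(obj)

    def mark(self):
        self.marks.append(len(self.data))   # remember where the MARK sits

    def pop_mark(self):
        start = self.marks.pop()
        items = self.data[start:]
        del self.data[start:]
        return items


results = []
for vm in (PyStyleStack(), CStyleStack()):
    vm.push("a")
    vm.mark()
    vm.push(1)
    vm.push(2)
    results.append(vm.pop_mark())
print(results)  # [[1, 2], [1, 2]]
```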

Memo

The memo is a third storage area implemented as a dictionary. Python data structures that are used multiple times can be stored in the memo area (using an incrementing integer as the key) so they don't need to be stored in the pickle multiple times. For example, if you want to use the string "Hello World" twice, instead of having to use the STRING opcode twice to get two instances of it on the stack, you can:

  • Use STRING to load "Hello World" onto the stack
  • Use PUT to store "Hello World" in the memo at index 1 (keeping it on the stack)
  • Do stuff with "Hello World"
  • (do other stuff)
  • Use GET to load "Hello World" from the memo at index 1 onto the stack
  • Do stuff with "Hello World"

Note that pickle.dumps() will automatically start data placement at index 0 and increment by 1, but custom pickles can choose any numeric index to store data in.
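The steps above can be sketched as a hand-written protocol 0 pickle that stores the string at memo index 1 and reads it back, followed by a check of the indexes pickle.dumps() chooses on its own (protocol 2 is used there only so that the memo indexes appear as explicit BINPUT arguments):

```python
import pickle
import pickletools

# MARK, STRING, PUT 1, GET 1, TUPLE, STOP
data = b"(" b"S'Hello World'\n" b"p1\n" b"g1\n" b"t."
result = pickle.loads(data)
same_object = result[0] is result[1]   # both slots refer to the memoized string

# What the pickler itself emits when one object appears twice.
s = "Hello World"
generated = pickle.dumps([s, s], protocol=2)
put_indexes = [arg for op, arg, pos in pickletools.genops(generated)
               if op.name == "BINPUT"]
get_indexes = [arg for op, arg, pos in pickletools.genops(generated)
               if op.name == "BINGET"]

print(result, same_object, put_indexes, get_indexes)
```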

Opcodes

Version 5 of the Pickle Virtual Machine has 68 opcodes. Documentation, descriptions, examples, and Python 3.11 source code for each opcode are present in Opcodes.md.
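The opcode table can also be inspected at runtime through pickletools.opcodes. A small sketch that counts opcodes by the protocol that introduced them (the totals reflect whichever Python version runs it):

```python
import pickletools
from collections import Counter

per_protocol = Counter(op.proto for op in pickletools.opcodes)
total = len(pickletools.opcodes)

for proto in sorted(per_protocol):
    print(f"protocol {proto}: {per_protocol[proto]} opcodes")
print("total:", total)
```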

Out-of-Band Buffers

To avoid copying massive amounts of data into a pickle, support for out-of-band buffers was added in Python 3.8 as part of protocol 5. Out-of-band transfer only works if both the provider and the consumer support it. The buffers must be passed in as an argument to the Unpickler class (or to load() and loads()), and the two relevant opcodes are limited: NEXT_BUFFER pulls the next buffer from that argument onto the stack, and READONLY_BUFFER makes the buffer on top of the stack read-only.
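A minimal sketch of an out-of-band round trip, assuming Python 3.8 or later. The payload is arbitrary, and the buffers are simply collected in a list here, whereas a real consumer would receive them over some separate channel.

```python
import pickle
import pickletools

payload = bytearray(b"large binary payload")
buffers = []   # the pickler hands each out-of-band buffer to the callback

# list.append returns None, which tells the pickler to keep the data out-of-band.
data = pickle.dumps(pickle.PickleBuffer(payload), protocol=5,
                    buffer_callback=buffers.append)

opcode_names = [op.name for op, arg, pos in pickletools.genops(data)]

# The same buffers must be supplied again when unpickling.
restored = pickle.loads(data, buffers=buffers)
restored_bytes = bytes(memoryview(restored))

print(opcode_names, len(data), restored_bytes)
```

The pickle itself stays tiny because it only contains a NEXT_BUFFER opcode rather than the payload bytes.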

Extension Registry

The extension registry is like a cross-over between the memo and out-of-band buffers. The extension registry is a dictionary with keys as positive integers and values as ("module_name", "name") tuples provided by the copyreg module. The extension registry must be defined outside of (and before) the unpickling process using code like copyreg.add_extension("module_name", "attr_name", 1), then when an opcode like EXT1 is used with the argument 1, it will refer to copyreg's _inverted_registry variable and pull out the ("module_name", "name") tuple associated with code 1. Then, it will pass that tuple through find_class() to recover the actual object and place it on the stack.

Note that in copyreg, the _extension_registry variable has ("module_name", "name") tuples as the keys and code integers as values, but the _inverted_registry variable has code integers as keys and ("module_name", "name") tuples as values.

There also exists an _extension_cache dictionary that is keyed by code integer like _inverted_registry, but whose values are the resolved objects themselves. An entry is only populated once its code has actually been looked up in _inverted_registry and resolved during the unpickling process.
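A sketch of the registry in action. The code 240 and the choice of fractions.Fraction are arbitrary picks for this example, and the registration is removed again afterwards because the registry is global to the process.

```python
import copyreg
import pickle
from fractions import Fraction

CODE = 240   # arbitrary example code; EXT1 can address codes 1 through 255

copyreg.add_extension("fractions", "Fraction", CODE)
try:
    registry_entry = copyreg._inverted_registry[CODE]        # code -> (module, name)
    forward_entry = copyreg._extension_registry[("fractions", "Fraction")]
    loaded = pickle.loads(b"\x82" + bytes([CODE]) + b".")    # EXT1 CODE, STOP
    cached = copyreg._extension_cache.get(CODE)              # filled in by the lookup
finally:
    copyreg.remove_extension("fractions", "Fraction", CODE)

print(registry_entry, forward_entry, loaded, cached)
```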

Stopping

Pickle actually has a STOP opcode that indicates the end of the data; normally virtual machines that don't support conditional branching don't require this as they can finish once there are no more instructions. Once the STOP opcode is encountered, all remaining bytes in the bytestream are ignored and the top of the stack is returned to the user. An error is thrown if no STOP opcode is encountered or if there are no values left on the stack to return.

Interestingly, the pickletools module enforces extra requirements for pickled data that are not actually enforced in the pickle.py and _pickle.c implementations. Pickletools requires that only one object be left on the stack and will error if multiple objects are present. However, pickletools treats the MARK opcode as a stack object, and will not throw an error if a "mark object" is the only object present on the stack. See more details in the description for the STOP opcode.
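These rules can be observed with a few tiny hand-written pickles. This is a sketch; the exact exception raised for a missing STOP is caught broadly because it is not guaranteed to be identical across implementations.

```python
import io
import pickle
import pickletools

trailing = pickle.loads(b"K\x01." + b"ignored bytes")   # bytes after STOP are never read
two_left = pickle.loads(b"K\x01K\x02.")                 # top of stack is returned

# pickletools is stricter: a second object left on the stack is an error.
try:
    pickletools.dis(b"K\x01K\x02.", out=io.StringIO())
    dis_error = None
except ValueError as exc:
    dis_error = str(exc)

# No STOP opcode at all.
try:
    pickle.loads(b"K\x01")
    missing_stop = None
except (EOFError, pickle.UnpicklingError) as exc:
    missing_stop = exc

print(trailing, two_left, dis_error, type(missing_stop).__name__)
```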

Contributing

If you are interested in contributing to the documentation, feel free to create an issue or make a pull request.
