Skip to content

Implementation considerations

Marios Fragkoulis edited this page Apr 10, 2014 · 5 revisions

Table of Contents

Safety

Incorrect locking in virtual tables and kernel corruption can result in erroneous values in PiCO QL queries. To provide some protection against kernel bugs and incorrect locking, PiCO QL checks pointers with virt addr valid() kernel function before they are dereferenced to ensure they fall within some mapped range of page areas. Caught invalid pointers show up in the result set as INVALID_P. However, the kernel can still corrupt PiCO QL via e.g. mapped but incorrect pointers.

Consistency

Definition

A PiCO QL query extracts a view of a subset of the kernel’s state as reflected by the data structures referenced in the query. Synchronized access to the referenced data structures is required to obtain this view and the view’s quality depends on it. Consistency requires that a kernel state view is extracted while the referenced data structures are in the same consistent state as if by acquiring exclusive read access to them atomically, for the whole query evaluation period. This definition involves two important characteristics. First, it refers to a part of the kernel’s state that regards the state of particular data structures; not its global state. Second, it refers to consistency in the strong sense, that is, all data structures referenced in a query do not change state during query evaluation. In many cases non-protected kernel data pose a limitation to the consistency we can achieve. The individual fields of many kernel data structures are not necessarily protected by a lock. In fact, this also holds for data structures protected through non-blocking synchronization such as Read-Copy-Update (RCU). RCU guarantees that the protected pointers will be alive within a critical section, but the data they point to need not be consistent. The values of non-protected fields can change without notice. Consider the list of processes as an example. The list is protected but not its individual elements. The pid of each process has the same value for its whole lifecycle, but this is not true for, say, its virtual memory resident size. The latter can indeed change in the course of a PiCO QL query so that SUM(RSS) provides a different result in two consecutive traversals of the process list while the list itself is locked.

Kernel synchronization disciplines and PiCO QL

The kernel features non-blocking and blocking synchronization disciplines; PiCO QL supports both.

Non-blocking synchronization that favours read accesses is useful to our work in terms of performance and usability, but does not provide any consistency guarantees. The Linux kernel implements RCU locking, which provides wait-free low overhead read access.

Blocking synchronization PiCO QL supports blocking synchronization, but combining the different protocols requires caution. It is the responsibility of the person defining the virtual table specification to map the kernel data structure specifics correctly. Specifically, the writer of a data structure specification is responsible for three things: a) to specify the proper locking scheme/protocol required by a kernel data structure, b) to ensure that the specified data structure’s synchronization protocol is compatible with the synchronization protocols of the existing data structure specifications, and c) queries across related data structures entail a valid locking order. For queries across unrelated data structures protected through blocking synchronization, the ordering of the locking primitives is the responsibility of the query writer. This technique imitates how kernel code would access the data structures collectively. This is a generic solution, but it is not always easy to implement, since it requires verifying that it is safe to use a specific combination of data structures in a given query. Lock acquisition in PiCO QL works in two respects: a) all locks for globally accessible data structures are acquired prior to query execution and released at the end, and b) locks regarding data structures nested to the first are acquired during the query at the time the virtual table is instantiated and released once the query’s evaluation has progressed to the next instantiation. Currently we use locks to provide for correct operation and consistency. The Linux kernel does not have a unique rule for acquiring locks in order. PiCO QL has a deterministic lock ordering rule: it acquires locks according to the syntactic position of virtual tables in a query. Notably, from our experience, the majority of join operations in queries concern nested data structures; hence, in these cases syntactic means outer to inner and our traversal method corresponds to the appropriate locking order. An illustrated query combines the list of processes in the system protected through RCU, the list of a process’s open files protected through RCU, and the list of socket buffers, that is a socket’s receive buffer queue, protected through a spin lock. The specification for the latter data structure can be seen in pico_ql_dsl.sql where we reference a kernel function, which, apart from acquiring the spinlock, disables interrupts until unlock time and saves/restores the active interrupt flags to resume appropriately after a query. The flags variable can be defined at the boilerplate part of the DSL file. Except for unprotected data, PiCO QL guarantees consistency for queries across unrelated data structures protected though blocking synchronization if the correct locking order is followed; queries across related ones do not provide consistency due to incremental lock acquisition. Our implementation currently does not offer the same transaction isolation levels as a database system due to incremental lock acquisition and non-protected kernel data including data referenced by RCU-protected pointers. Notably, the treatment of synchronization in this work is implementation dependent up to an extent. It is possible to alter our implementation, or provide different configurations, so that all necessary locks for a given query are acquired in consecutive instructions prior to query evaluation and interrupts are disabled for the duration of the query.

When things can go wrong

The virtual tables we defined for accessing Linux kernel’s data structures are correct, use the right synchronization protocols, and can be safely combined together in queries. However, inappropriate virtual table definitions and ordering of virtual tables in queries might result in a kernel crash.