Reading Sqlite Schema Tables the Hard Way

2024-09-25 • 7 minutes to read

Parsing a Sqlite database file is a nice way to brush up on data structures, bit manipulation, and recursion. I know this because I recently implemented the read_schema_tables function below such that the following test passes:

import sqlite3

def test_read_table_names(db_file):
    con = sqlite3.connect(db_file)
    assert list(con.execute("SELECT * FROM sqlite_schema;")) == list(read_schema_tables(db_file))
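
The db_file fixture isn’t shown here. A minimal sketch of what it could look like, using pytest’s tmp_path and the sqlite3 package purely to create a throwaway database (the table it creates is just a stand-in, not my exact test setup):

import sqlite3

import pytest

@pytest.fixture
def db_file(tmp_path):
    # Build a tiny database with the real sqlite3 module so the hand-rolled
    # parser has something to read.
    path = tmp_path / "test.db"
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);")
    con.commit()
    con.close()
    return str(path)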

read_schema_tables doesn’t use the sqlite3 Python package. Leaning on it would make for a trivial test case and would be too easy. For comp sci fun, we have to do it the hard way.

Recursing through Sqlite btree pages

Sqlite stores data in pages. These pages are arranged in a generalization of a binary tree called a “btree,” where nodes can have more than two children. The docs call non-leaf pages “interior pages” (my code below calls them internal pages). All pages have cells, but the cells of interior pages merely point to child pages. The cells of leaf pages, on the other hand, actually contain data.
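
Concretely, every page begins with a small b-tree page header whose first byte encodes the page type. Here’s a rough sketch of reading it (the field offsets come from the file format docs; the function name and return shape are just for illustration). Note that page 1’s header starts at byte 100, right after the database header; every other page’s header starts at the page boundary.

import struct

def read_page_header(file, header_start):
    # B-tree page header layout (offsets relative to header_start):
    #   byte 0       page type: 0x05 = interior table page, 0x0D = leaf table page
    #   bytes 3-4    number of cells on the page (big-endian)
    #   bytes 5-6    start of the cell content area
    #   bytes 8-11   right-most child page number (interior pages only)
    # The cell pointer array (2 bytes per cell) follows the header.
    file.seek(header_start)
    header = file.read(12)
    page_type = header[0]
    num_cells, content_start = struct.unpack(">HH", header[3:7])
    right_most = struct.unpack(">I", header[8:12])[0] if page_type == 0x05 else None
    return page_type, num_cells, content_start, right_most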

Sometimes the tables from sqlite_schema fit on a single page. Sometimes they don’t. My calling code shouldn’t care:

class SqliteDatabase:
    @property
    def table_names(self):
        return Page.from_number(
            self.file, 1,  # <-- sqlite_schema's table btree is always rooted at page 1
        ).leaf_cells

An abstract class gives us a nice way of…well…abstracting this:

from abc import ABC, abstractmethod

class Page(ABC):
    @property
    @abstractmethod
    def leaf_cells(self):
        ...
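
I’ve elided the from_number factory. A minimal sketch, assuming a constructor that stashes the file, the page’s start offset, and its header offset (InternalPage and LeafPage are defined below), looks roughly like this:

class Page(ABC):
    # ... leaf_cells as above ...

    @classmethod
    def from_number(cls, file, page_num):
        # The page size is stored as a 2-byte big-endian integer at offset 16
        # of the 100-byte database header (ignoring the special value 1,
        # which means 65536).
        file.seek(16)
        page_size = int.from_bytes(file.read(2), "big")
        # Pages are numbered from 1. Page 1 contains the database header, so
        # its b-tree page header starts at byte 100 instead of byte 0.
        page_start = (page_num - 1) * page_size
        header_start = page_start + (100 if page_num == 1 else 0)
        file.seek(header_start)
        # 0x05 = interior table page, 0x0D = leaf table page
        page_type = file.read(1)[0]
        subclass = InternalPage if page_type == 0x05 else LeafPage
        return subclass(file, page_start, header_start)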

Internal pages merely get child pages and recursively call Page.leaf_cells on their children:

import itertools

class InternalPage(Page):
    @property
    def leaf_cells(self):
        return itertools.chain.from_iterable(
            [
                self.get_page(page_num).leaf_cells
                for page_num in self.page_nums
            ]
        )

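I’ve also elided get_page and page_nums. On an interior table page, each cell starts with a 4-byte big-endian child page number, and the page header holds one extra right-most child pointer, so a minimal sketch (using self.num_cells plus the attributes assumed in the from_number sketch above) could be:

class InternalPage(Page):
    # ... leaf_cells as above ...

    def get_page(self, page_num):
        return Page.from_number(self.file, page_num)

    @property
    def page_nums(self):
        nums = []
        for cell_idx in range(self.num_cells):
            # The cell pointer array (2-byte big-endian offsets, relative to
            # the start of the page) sits right after the 12-byte interior
            # page header.
            self.file.seek(self.header_start + 12 + 2 * cell_idx)
            cell_offset = int.from_bytes(self.file.read(2), "big")
            # Each interior table cell begins with a 4-byte big-endian child
            # page number; the key varint that follows can be ignored here.
            self.file.seek(self.page_start + cell_offset)
            nums.append(int.from_bytes(self.file.read(4), "big"))
        # The right-most child pointer lives at bytes 8-11 of the page header.
        self.file.seek(self.header_start + 8)
        nums.append(int.from_bytes(self.file.read(4), "big"))
        return nums
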
Leaf pages, on the other hand, actually have to read data:

class LeafPage(Page):
    @property
    def leaf_cells(self):
        result = []
        for cell_idx in range(self.num_cells):
            self._seek_to_cell_contents(cell_idx)
            # Skip the payload size for now. We aren't handling page overflow.
            _ = self._varintreader.read()
            # Skip the rowid for now.
            _ = self._varintreader.read()
            header_size = self._varintreader.read()
            record = []
            # Subtract 1 because header_size counts every byte of the record
            # header, including the (here single-byte) size varint itself.
            for col_type in self._read_col_types(header_size - 1):
                record.append(self._read_col(col_type))
            result.append(tuple(record))
        return result
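
I’ve also elided _read_col_types and _read_col, which map SQLite serial types to Python values. For the handful of types the schema table actually uses (NULL, small integers for rootpage, and text), a minimal sketch looks something like this (it assumes _varintreader reads from self.file and advances its position):

class LeafPage(Page):
    # ... leaf_cells as above ...

    def _read_col_types(self, header_bytes_remaining):
        # The record header is just a run of serial-type varints.
        types = []
        while header_bytes_remaining > 0:
            start = self.file.tell()
            types.append(self._varintreader.read())
            header_bytes_remaining -= self.file.tell() - start
        return types

    def _read_col(self, serial_type):
        # Serial types, per the record format docs:
        #   0            NULL
        #   1..6         big-endian twos-complement ints of 1, 2, 3, 4, 6, or 8 bytes
        #   8, 9         the constants 0 and 1 (no bytes stored)
        #   >= 13, odd   text of (serial_type - 13) / 2 bytes
        if serial_type == 0:
            return None
        if serial_type in (8, 9):
            return serial_type - 8
        if 1 <= serial_type <= 6:
            size = {1: 1, 2: 2, 3: 3, 4: 4, 5: 6, 6: 8}[serial_type]
            return int.from_bytes(self.file.read(size), "big", signed=True)
        if serial_type >= 13 and serial_type % 2 == 1:
            return self.file.read((serial_type - 13) // 2).decode("utf-8")
        raise NotImplementedError(f"serial type {serial_type}")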

Up until this point, reading data from the Sqlite file is relatively straightforward. There are some unexpected indirections and pointers to pointers, but it can all be handled by carefully reading the spec and seeking around the file. Reading table data, however, requires us to parse variable-length integers called “varints” (as indicated by the calls to self._varintreader.read() above).

Parsing Sqlite varints

Varints are a clever way of saving space in the database. Here are the docs on how they work:

A varint is between 1 and 9 bytes in length. The varint consists of either zero or more bytes which have the high-order bit set followed by a single byte with the high-order bit clear, or nine bytes, whichever is shorter. The lower seven bits of each of the first eight bytes and all 8 bits of the ninth byte are used to reconstruct the 64-bit twos-complement integer.

Here’s an example test case:

def test_read_varint_reads_128():
    assert read_varint(to_int_list(0b1000_0001_0000_0000.to_bytes(length=2))) == 128
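
to_int_list is a tiny test helper. Since iterating over a bytes object already yields ints, it can be as simple as:

def to_int_list(data: bytes) -> list[int]:
    # A bytes object iterates as ints, one per byte.
    return list(data)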

128 can’t fit in a single-byte varint because the first bit of each byte is reserved for indicating that another byte follows. So, to say “128” in varint, we have

  • a first byte with the continuation bit set and its lowest bit set; that bit ends up meaning 2^7 (even though it’s the 8th bit in the byte string) because it gets shifted left by 7
  • a second byte with the continuation bit clear followed by 7 zeros

Since we don’t want to assume the Sqlite database can fit in memory, we can create a generator function that reads through the file byte by byte:

from io import BufferedReader

def byte_by_byte(file: BufferedReader):
    byte = file.read(1)
    while byte:
        yield byte
        byte = file.read(1)

We can iterate over that generator until we reach a byte without a continuation bit set:

from typing import Generator, List

def get_varint_bytes(byte_stream: Generator) -> List[int]:
    result_bytes = []
    for byte in byte_stream:
        byte_as_int = int.from_bytes(byte, "big")
        result_bytes.append(byte_as_int)
        if not byte_as_int & 0b1000_0000:
            break
    return result_bytes

If we have a list of bytes (as ints) that make up our varint, then converting it to a number can be done like this:

def read_varint(varint_bytes: List[int]) -> int:
    # Note: the special 9-byte case (where the final byte contributes all
    # 8 bits) isn't handled; varints of up to 8 bytes are plenty here.
    num = 0
    for i, byte in enumerate(varint_bytes):
        # mask off the continuation bit on every byte except the last
        byte = byte & 0b0111_1111 if i < len(varint_bytes) - 1 else byte
        # shift the 7 payload bits into the correct position
        byte = byte << (7 * (len(varint_bytes) - i - 1))
        num = byte | num
    return num
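
Putting the pieces together, the _varintreader used in LeafPage above can be little more than a thin wrapper that feeds byte_by_byte into these two functions (the class name and shape here are a sketch rather than the exact implementation):

class VarintReader:
    def __init__(self, file):
        self.file = file

    def read(self) -> int:
        # Read one varint starting at the file's current position, leaving
        # the file positioned just past it.
        return read_varint(get_varint_bytes(byte_by_byte(self.file)))

Each call builds a fresh byte_by_byte generator, which is a little wasteful but keeps the sketch short.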

Conclusion

I’ve omitted a bunch of relatively uninteresting code here, but my working implementation was less than 500 LoC including tests. It’s a fun exercise. Try it. You don’t need anything fancy like Code Crafters to do this, and writing your own test harness can be a plus for getting deep on your test framework.

Appendix A: How does OpenAI’s o1 fare at this task?

o1 produced code that looked reasonable, but didn’t work initially. Here was the initial prompt:

give me a script that’ll return the result of executing “SELECT * FROM sqlite_master;” without using the sqlite3 package

Running the initial code gave me the following output:

mattdupree@matts-air sqlite_python % pipenv run python ai_main.py
Unsupported page type: 83
type | name | tbl_name | rootpage | sql

Then I asked it to fix the bug. Here’s its explanation for the bug:

I’m sorry about the error you’re encountering. The issue arises because the script doesn’t account for the 100-byte database header in SQLite database files. SQLite files begin with a 100-byte header, and the actual page data starts immediately after this header.

The new code gave the following output:

mattdupree@matts-air sqlite_python % pipenv run python ai_main.py
type | name | tbl_name | rootpage | sql
 | None |  |  |  | b'\x00' | b'' | b'' |  |  | b'' |  | b'' |  |  | b'' |  | b'' |  | b'' | b'' | b'' |  | b'' |  |  |  | b'' | b'' |  | None | b'' | b'' | None | None | None | None | b'' |  | None | b'' | b'' | b'' | b'' | b'' | b'' |  | b'' | b'' | b'' |  | b'' | b'' | b'' |  | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None

After an hour or so with the debugger, I found two errors; fixing them got the script working for files where the sqlite_schema table fits on a single leaf page:

         for ptr in cell_pointers:
             cell_offset = ptr
-            payload_size, offset = read_varint(page, cell_offset)
+            payload_size, offset = read_varint(page, cell_offset - 100)
             cell_offset = offset
             rowid, offset = read_varint(page, cell_offset)
             cell_offset = offset

Here the code doesn’t account for the fact that when you’re parsing a leaf page that happens to be the first page, the page’s cell pointers are relative to the start of the file, not the start of the page data. I made this same mistake when coding manually. Subtracting 100 happened to work here because the database header is 100 bytes long.

Here’s the next bug fix:

 
             header_start = cell_offset
             types = []
-            while cell_offset < header_start + header_size_varint:
+            while cell_offset < header_start + header_size_varint - 1:
                 serial_type, offset = read_varint(page, cell_offset)
                 types.append(serial_type)
                 cell_offset = offset

This is an off-by-one error. The code doesn’t account for the fact that the header size varint “is the size of the header in bytes including the size varint itself.”

Running the AI-generated script on a file where the sqlite_schema table is spread across multiple pages didn’t work:

mattdupree@matts-air sqlite_python % pipenv run python ai_main.py
Page number 987407 is out of range.
Page number 987407 is out of range.
Unsupported page type: 50
No entries found in sqlite_master.

I couldn’t bring myself to track down the bug in the yucky AI-generated code, so I told ChatGPT to rewrite the code to be more readable. It generated broken code again.

Appendix B: Hire Josh

While I have your attention: I’m winding down a startup I’ve been working on, which means that Josh, an awesome engineer, person, and friend, is on the market. I met Josh 8 years ago while working at another startup. It was immediately clear to me that he was a brilliant engineer. Software was a craft he cared deeply about, and he studied it with a fervor that I haven’t seen in anyone else in my decade-long career.

I wrote more about him here. If you have any questions about him, feel free to email me at [email protected]. If you’d like to contact him directly, here’s his LinkedIn.
