Fastest way to serialize an array in Python
To kick-start the discussion, let's first assume that we have the following 2-D numpy array:

```python
import numpy as np

matrix = np.random.random((10000, 384)).astype(np.float32)
```
In Python, the built-in `pickle` module is often used to serialize/deserialize data. Here's how it works:

```python
import pickle

pickled_dumped = pickle.dumps(matrix, protocol=pickle.HIGHEST_PROTOCOL)
pickle.loads(pickled_dumped)
```
MessagePack is known to be a fast and compact serialization format. But it can't serialize a `numpy` array directly, so let's first convert it to a nested Python list:

```python
matrix_as_list = matrix.tolist()
```

Here's how it works:

```python
import msgpack

# Note how we set `use_single_float` to make sure float32 is used
msgpack_dumped = msgpack.dumps(matrix_as_list, use_single_float=True)
msgpack.loads(msgpack_dumped)
```
`pyarrow` is the Python library for Apache Arrow, which enables fast serialization through zero-copy memory sharing.

```python
import pyarrow

pyarrow_dumped = pyarrow.serialize(matrix).to_buffer().to_pybytes()
pyarrow.deserialize(pyarrow_dumped)
```
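Note that `pyarrow.serialize`/`pyarrow.deserialize` are deprecated in recent pyarrow releases. On versions where they are unavailable, a similar zero-copy round trip can be sketched with the Tensor IPC API (my own substitution, not what the numbers below were measured with):

```python
import numpy as np
import pyarrow as pa

matrix = np.random.random((10000, 384)).astype(np.float32)

# Wrap the numpy array as an Arrow Tensor and write it in the IPC format
tensor = pa.Tensor.from_numpy(matrix)
sink = pa.BufferOutputStream()
pa.ipc.write_tensor(tensor, sink)
buf = sink.getvalue()

# Reading maps the tensor over the buffer instead of copying the payload
restored = pa.ipc.read_tensor(pa.BufferReader(buf)).to_numpy()
assert np.array_equal(restored, matrix)
```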
Non-general DIY Serialization
This is not a general serialization solution. It can only serialize/deserialize a 2-D numpy array of float32 values:

```python
import struct

def serialize(mat):
    n_rows = len(mat)
    n_columns = len(mat[0])
    # 8-byte header: the shape packed as two big-endian unsigned ints
    shape = struct.pack('>II', n_rows, n_columns)
    return shape + mat.tobytes()

def deserialize(data):
    n_rows, n_columns = struct.unpack('>II', data[:8])
    mat = np.frombuffer(data, dtype=np.float32, offset=8)
    mat.shape = (n_rows, n_columns)
    return mat

diy_dumped = serialize(matrix)
deserialize(diy_dumped)
```
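A quick sanity check (not part of the original comparison) confirms that the 8-byte shape header plus raw bytes reconstruct the matrix exactly:

```python
restored = deserialize(serialize(matrix))
assert restored.shape == matrix.shape
assert restored.dtype == np.float32
assert np.array_equal(restored, matrix)
```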
Now let’s compare these approaches in the following metrics:
- Size (of serialized data, in bytes)
- T_dump (Serializing time)
- T_load (Deserializing time)
| Method | Size (bytes) | T_dump | T_load |
|---|---|---|---|
| pickle | 15360156 | 15.7 ms | 6.01 ms |
| MessagePack | 19230003 | 97.6 ms | 179 ms |
| pyarrow | 15360704 | 17.3 ms | 32.8 µs |
| DIY | 15360008 | 12.5 ms | 2.32 µs |
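The exact figures depend on the machine, but numbers like these can be collected with a small `timeit` harness along the following lines (a sketch; the helper name and repeat count are my own choices, and it reuses the `matrix` defined at the top of the post):

```python
import pickle
import timeit

def benchmark(dumps, loads, repeat=10):
    """Return (serialized size in bytes, average dump time, average load time)."""
    dumped = dumps(matrix)
    t_dump = timeit.timeit(lambda: dumps(matrix), number=repeat) / repeat
    t_load = timeit.timeit(lambda: loads(dumped), number=repeat) / repeat
    return len(dumped), t_dump, t_load

# Example: measure the pickle approach
print(benchmark(
    lambda m: pickle.dumps(m, protocol=pickle.HIGHEST_PROTOCOL),
    pickle.loads,
))
```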
MessagePack produces the biggest output and takes the longest to serialize/deserialize. This may be related to the way MessagePack encodes floating-point numbers: each float32 takes 5 bytes, not 4.
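This is easy to verify by packing a single value: MessagePack's float32 format is a one-byte type marker (`0xca`) followed by the 4 payload bytes.

```python
import msgpack

packed = msgpack.dumps(1.5, use_single_float=True)
print(len(packed))   # 5 bytes: 0xca marker + 4-byte IEEE 754 single
print(packed.hex())  # 'ca3fc00000'
```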
The DIY solution adds only 8 bytes of size overhead (for the shape of the 2-D array) and is the fastest one. It's not a general solution, but it demonstrates that we can fall back to this approach when the size and speed of serialization are the bottlenecks.
pyarrow is the fastest general solution I have tried in this experiment, though its size overhead is a little larger than that of pickle or the DIY approach.
The good thing about pyarrow is that the time taken to deserialize doesn't grow linearly with the size of the array: it stays around 32 µs even when I make the array 5 times larger. This is because pyarrow uses zero-copy memory sharing to avoid moving the array around (it's also why I use `np.frombuffer` in the DIY approach).
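The effect is visible on the DIY deserializer too: the array returned by `np.frombuffer` is a read-only view over the serialized bytes rather than a fresh copy.

```python
data = serialize(matrix)
mat = deserialize(data)

print(mat.flags['OWNDATA'])    # False: the array borrows the buffer's memory
print(mat.flags['WRITEABLE'])  # False: frombuffer over bytes gives a read-only view
```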
But if what you have is not a small number of big arrays but a massive number of small arrays, pyarrow may not be the right choice: even for a 2-D array with only one row, deserialization still takes about 20 µs, so there is some fixed overhead per call.
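As a rough illustration of that fixed cost, the same round trip can be timed on a one-row array (a sketch using the same `pyarrow.serialize` API as above; exact numbers will differ by machine and pyarrow version):

```python
import timeit
import numpy as np
import pyarrow

small = np.random.random((1, 384)).astype(np.float32)
small_dumped = pyarrow.serialize(small).to_buffer().to_pybytes()

# Even for a tiny payload the per-call cost stays in the tens of microseconds,
# which points to a fixed overhead per deserialization call.
print(timeit.timeit(lambda: pyarrow.deserialize(small_dumped), number=1000) / 1000)
```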