Fastest way to serialize an array in Python
To kick-start the discussion, let’s first assume that we have the following 2-D numpy array:

```python
import numpy as np

matrix = np.random.random((10000, 384)).astype(np.float32)
```
Approaches tried
- The built-in `pickle`

  In Python, the built-in `pickle` module is often used to serialize/deserialize data. Here’s how it works:

  ```python
  import pickle

  pickled_dumped = pickle.dumps(matrix, protocol=pickle.HIGHEST_PROTOCOL)
  pickle.loads(pickled_dumped)
  ```
- MessagePack

  MessagePack is known to be a fast and compact serialization format. But it can’t serialize a `numpy` array directly, so let’s first assume that we’ll use a nested Python `list` instead:

  ```python
  matrix_as_list = matrix.tolist()
  ```

  Here’s how it works:

  ```python
  import msgpack

  # Note how we set `use_single_float` to make sure float32 is used
  msgpack_dumped = msgpack.dumps(matrix_as_list, use_single_float=True)
  msgpack.loads(msgpack_dumped)
  ```
- pyarrow

  `pyarrow` is the Python library for Apache Arrow, which enables fast serialization through zero-copy memory sharing.

  ```python
  import pyarrow

  # Note: serialize/deserialize are legacy APIs and have been deprecated
  # in newer pyarrow releases.
  pyarrow_dumped = pyarrow.serialize(matrix).to_buffer().to_pybytes()
  pyarrow.deserialize(pyarrow_dumped)
  ```
- Non-general DIY Serialization

  This is not a general serialization solution. It can only serialize/deserialize a 2-D `numpy` array:

  ```python
  import struct

  def serialize(mat):
      n_rows = len(mat)
      n_columns = len(mat[0])
      # Pack the shape as two big-endian unsigned 32-bit integers (8 bytes in total)
      shape = struct.pack('>II', n_rows, n_columns)
      return shape + mat.tobytes()

  def deserialize(data):
      n_rows, n_columns = struct.unpack('>II', data[:8])
      # Create a view over the raw bytes (no copy), skipping the 8-byte shape header
      mat = np.frombuffer(data, dtype=np.float32, offset=8)
      mat.shape = (n_rows, n_columns)
      return mat

  diy_dumped = serialize(matrix)
  deserialize(diy_dumped)
  ```
Comparison
Now let’s compare these approaches on the following metrics (a sketch of the measurement code follows the list):

- Size (of serialized data, in bytes)
- T_dump (serialization time)
- T_load (deserialization time)
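The original post doesn’t show its benchmarking harness, so here is a minimal sketch of how numbers like these could be reproduced, assuming the standard `timeit` module and the `matrix` defined above (the `benchmark` helper is hypothetical):

```python
import pickle
import timeit

def benchmark(dump, load, repeat=10):
    """Return (size in bytes, average dump time, average load time) in seconds."""
    dumped = dump()
    t_dump = timeit.timeit(dump, number=repeat) / repeat
    t_load = timeit.timeit(lambda: load(dumped), number=repeat) / repeat
    return len(dumped), t_dump, t_load

# Example: the pickle approach
print(benchmark(
    lambda: pickle.dumps(matrix, protocol=pickle.HIGHEST_PROTOCOL),
    pickle.loads,
))
```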
| Approach | Size (bytes) | T_dump | T_load |
|---|---|---|---|
| pickle | 15360156 | 15.7 ms | 6.01 ms |
| MessagePack | 19230003 | 97.6 ms | 179 ms |
| pyarrow | 15360704 | 17.3 ms | 32.8 µs |
| DIY | 15360008 | 12.5 ms | 2.32 µs |
Conclusion
MessagePack produces the biggest output and takes the longest to serialize/deserialize. This is likely related to the way MessagePack encodes floating-point numbers: each float32 takes 5 bytes (a 1-byte type marker plus 4 data bytes), not 4 bytes, so 10000 × 384 × 5 ≈ 19.2 MB, which matches the measured size.
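As a small check of the 5-byte claim, you can pack a single float with the same option used above:

```python
import msgpack

# One float32 is encoded as a 1-byte type marker followed by 4 data bytes
packed = msgpack.dumps(1.5, use_single_float=True)
print(len(packed))  # 5
```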
The DIY solution has only 8 bytes of size overhead for the shape of the 2-D array (15360008 = 10000 × 384 × 4 + 8) and is the fastest one. It’s not a general solution, but it demonstrates that we can fall back to this approach when the size and speed of serialization are the bottlenecks.
`pyarrow` is the fastest general solution I have tried in this experiment, though the overhead in size is a little larger than when using `pickle`.
The good thing about `pyarrow` is that the time taken to deserialize doesn’t grow linearly with the size of the array: it’s about 32 µs even when I make the array 5 times larger. This is because `pyarrow` uses zero-copy memory sharing to avoid moving the array around (it’s also why I use `np.frombuffer` in the DIY approach).
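As a small check of that zero-copy behaviour, the DIY helpers defined above can show that `np.frombuffer` returns a view over the serialized bytes rather than a copy (assuming `deserialize` and `diy_dumped` are still in scope):

```python
restored = deserialize(diy_dumped)

# The array is a view over the serialized bytes, not a copy of them
print(restored.flags['OWNDATA'])    # False: the memory is owned by the bytes object
print(restored.flags['WRITEABLE'])  # False: views over immutable bytes are read-only
```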
But if what you have is not a small number of big arrays but a massive number of small arrays, `pyarrow` may not be the right choice: even for a 2-D array with only 1 row, deserializing still takes about 20 µs, so there is some fixed per-call overhead.
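The original post doesn’t show that measurement; here is a minimal sketch of how the fixed per-call overhead could be checked, assuming the same legacy `pyarrow.serialize`/`deserialize` API used above:

```python
import timeit

import numpy as np
import pyarrow

# A 2-D array with a single row
small = np.random.random((1, 384)).astype(np.float32)
small_dumped = pyarrow.serialize(small).to_buffer().to_pybytes()

# Average deserialization time per call, in seconds
print(timeit.timeit(lambda: pyarrow.deserialize(small_dumped), number=1000) / 1000)
```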