Over the last few years, deep learning methods have been shown to outperform previous state-of-the-art machine learning techniques and other more traditional approaches in a wide range of fields, with computer vision being one of the most notable. A new challenge arose with the expansion of use cases for Deep Learning and its soaring performance: running deep learning on edge devices. Typically, Deep Learning algorithms run on a server, which processes requests from devices and answers them with the algorithm’s outputs. This device-to-server interaction adds significant latency, as it requires back-and-forth communication. In many use cases, such as autonomous vehicles, this delay is not acceptable. To ensure quick processing of the data, Deep Learning algorithms can instead be run directly on the devices.
This solution comes with its own issues, the most important being the size of the libraries and models needed to run complex Deep Learning algorithms. At HarfangLab, we tackled this issue with two principles: 1) using lean dependencies that contain only the code we need, and 2) reducing the size of the model while preserving performance. In our case, we had to build tflite-runtime (the lean version of tensorflow for prediction) for the desired platforms (Windows 32-bit and 64-bit), reduce the size of numpy by removing unnecessary code, and shrink our models by a factor of 10.
Here is how we completed those steps and now successfully run complex Deep Learning Neural Networks on Windows and Linux devices with python wheels under 5MB. The different custom wheels we built that are needed to run such algorithms will also be available in the following repository.
Running Deep Learning with small dependencies
Tensorflow and tensorflow-lite
Usage of tensorflow and tensorflow-lite
Tensorflow is one of the most prominent Deep Learning libraries. It provides a simple and efficient framework to implement and train elaborate models. Once a model is trained with this framework, it can simply be converted to a compact tensorflow-lite model. Inference can then be run using only this model and the tensorflow-lite interpreter (tflite-runtime), discarding all the unnecessary dependencies that would come with the complete tensorflow and tensorflow-lite packages. Hence, by using tflite-runtime we avoid pulling in the full tensorflow (1.2 GB) and tensorflow-lite (50 MB) and instead retrieve only the 1 MB interpreter needed to load the model and run inference.
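As an illustration of what inference with only the interpreter looks like, here is a minimal sketch. It assumes tflite-runtime is installed and that the model has a single input and a single output. The function names `load_interpreter` and `run_inference` are ours, chosen for this example.

```python
def load_interpreter(model_path):
    """Create a TFLite interpreter for the model stored at model_path."""
    # Imported lazily so this file can be loaded without tflite-runtime.
    from tflite_runtime.interpreter import Interpreter

    interpreter = Interpreter(model_path=model_path)
    interpreter.allocate_tensors()  # must be called before the first inference
    return interpreter


def run_inference(interpreter, input_array):
    """Run one forward pass and return the first output tensor.

    Assumption: the model has exactly one input and one output.
    """
    input_index = interpreter.get_input_details()[0]["index"]
    output_index = interpreter.get_output_details()[0]["index"]
    interpreter.set_tensor(input_index, input_array)
    interpreter.invoke()
    return interpreter.get_tensor(output_index)
```

With a real model, `input_array` should be a numpy array whose shape and dtype match the `"shape"` and `"dtype"` entries returned by `get_input_details()`. A call would look like `run_inference(load_interpreter("model.tflite"), batch)`, where `model.tflite` and `batch` are placeholders.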
Installing and using tflite-runtime is extremely easy, as one may install it from a wheel with pip install tflite-runtime. Unfortunately, no 64-bit Windows wheel exists for tflite-runtime after version 2.5.0 (and none at all for 32-bit Windows). The tensorflow project publishes a few guidelines on how to build them: https://www.tensorflow.org/lite/guide/build_cmake_pip and https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/pip_package
These wheels for tensorflow 2.7.0 are available in our GitHub repository. Here are some extra tips to build them.
Building the wheels
For Linux, the tflite-runtime wheels are available on PyPI.
To build tflite-runtime for Windows x86 and Windows x64, along with a lot of courage, you will need a Windows x64 virtual machine (or your own computer if it runs Windows x64). The official CMake build should be used for the task. To run the build, you need Visual Studio Build Tools 2019, Python 64-bit (python -V and python3 -V should not produce errors) and Python 32-bit (which must be the same version as the Python 64-bit one). You also need to install numpy, wheel and pybind11.
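Before launching the build, it can help to confirm which interpreter is 32-bit and which is 64-bit, and that the build packages are importable. The following is a small sketch using only the standard library. The function names are ours, and it should be run once with each Python installation.

```python
import importlib.util
import struct
import sys


def interpreter_bits():
    """Return 32 or 64 depending on the pointer size of this Python build."""
    return struct.calcsize("P") * 8


def missing_build_packages(required=("numpy", "wheel", "pybind11")):
    """Return the names in `required` that cannot be found by the import system."""
    return [name for name in required if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    version = f"{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}"
    print(f"Python {version} ({interpreter_bits()}-bit)")
    missing = missing_build_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

Both installations should report the same Python version, and neither should list a missing package.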
This is the command for x64 and x86: tensorflow/lite/tools/pip_package/build_pip_package_with_cmake.sh windows
We had many issues with this command. Here is how we modified the critical lines in the file tensorflow/lite/tools/pip_package/build_pip_package_with_cmake.sh:
# Locate the Python import libraries and the headers needed by the wrapper
PYTHON_LIB=$(${PYTHON} -c "import os, distutils.sysconfig as sysconfig; print(os.path.join(sysconfig.PREFIX, 'libs'))")
PYTHON_INCLUDE=$(${PYTHON} -c "from sysconfig import get_paths as gp; print(gp()['include'])")
PYBIND11_INCLUDE=$(${PYTHON} -c "import pybind11; print (pybind11.get_include())")
NUMPY_INCLUDE=$(${PYTHON} -c "import numpy; print (numpy.get_include())")
BUILD_FLAGS=${BUILD_FLAGS:-"-I${PYTHON_INCLUDE} -I${PYBIND11_INCLUDE} -I${NUMPY_INCLUDE} /EHsc"}
# Configure the build with XNNPACK disabled (-A Win32 is only for the x86 build)
cmake -A Win32 -DCMAKE_C_FLAGS="${BUILD_FLAGS}" -DCMAKE_CXX_FLAGS="${BUILD_FLAGS}" -DCMAKE_SHARED_LINKER_FLAGS="/libpath:$PYTHON_LIB" "${TENSORFLOW_LITE_DIR}" -DTFLITE_ENABLE_XNNPACK=OFF -DCMAKE_WINDOWS_EXPORT_ALL_SYMBOLS=True
# Build only the Python interpreter wrapper, in Release mode
cmake --build . --verbose -j ${BUILD_NUM_JOBS} -t _pywrap_tensorflow_interpreter_wrapper --config Release
cd "${BUILD_DIR}"
# Copy the built DLL into the package (source and destination quoted separately)
cp "${BUILD_DIR}/cmake_build/Release/_pywrap_tensorflow_interpreter_wrapper.dll" "${BUILD_DIR}/tflite_runtime/_pywrap_tensorflow_interpreter_wrapper${LIBRARY_EXTENSION}"
For x86, we added -A Win32 to specify that the build must target 32-bit Windows.
Numpy, size of wheels and dependencies
Now that the tflite-runtime wheels are built, we can easily install them on a Linux or Windows machine and run a model using this 1 MB wheel. However, the tflite-runtime wheel requires numpy as a dependency. Depending on your device, the numpy wheel weighs between 10 and 16 MB. It is possible to further reduce the size of the dependencies required to run the model by shrinking numpy. To do so, you can build numpy with precise requirements and manually trim the built wheel to keep only the core functionality needed. On Windows x86, Windows x64 and Linux, run the following steps:
CFLAGS="-g0 -Wl,--strip-all -I/usr/include:/usr/local/include -L/usr/lib:/usr/local/lib" pip install --cache-dir . --compile --global-option=build_ext --global-option="-j 4" numpy
- Get the built wheel and run wheel unpack name-of-wheel.whl
- Find and remove the file RECORD in the unpacked folder
- Randomly remove some files or directories in the unpacked folder that seem unnecessary (numpy comes with tests, docs, data, etc. that significantly increase its size)
- Pack the folder again with wheel pack name-of-unpacked-folder
- Install the wheel with pip
- Install tflite-runtime
- Check that your model runs an inference without crashing
- Repeat the deletion of as many random files as possible, as long as your model can still run an inference
Using this manual method, we were able to reduce the size of the numpy-1.22.3 wheel to less than 3 MB.
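Part of the trial-and-error loop above can be scripted. The sketch below is not the exact procedure we followed. It is a hypothetical helper that deletes sub-directories by name from an unpacked wheel and reports the size of the folder. The default names in `dir_names` are an assumption about what is safe to drop, so the inference check remains mandatory after each run.

```python
import shutil
from pathlib import Path


def folder_size_mb(folder):
    """Total size of all files under `folder`, in megabytes."""
    return sum(p.stat().st_size for p in Path(folder).rglob("*") if p.is_file()) / 1e6


def remove_candidates(unpacked_dir, dir_names=("tests", "doc")):
    """Delete every sub-directory of `unpacked_dir` whose name is in `dir_names`.

    Returns the list of matched paths, so that a failed inference check
    can be traced back to what was deleted.
    """
    # Collect first, then delete, so we never iterate over a changing tree.
    matched = [
        path
        for path in sorted(Path(unpacked_dir).rglob("*"))
        if path.is_dir() and path.name in dir_names
    ]
    for path in matched:
        if path.exists():  # may already be gone if nested in a removed parent
            shutil.rmtree(path)
    return matched
```

Calling `folder_size_mb` before and after `remove_candidates` shows how much each round of deletions saves before repacking the wheel.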
Reducing the size of Deep Learning models
After the previous steps, the dependencies required to run Deep Learning should now weigh only around 4 MB. The only object still missing to run an algorithm is the model itself. The size of the tflite file containing the model grows quickly with the number of parameters of the model. Here are the techniques we used to reduce the size needed to store the model.
Model Quantization
A common and simple way to reduce the size of a model is post-training quantization. This method consists in reducing the number of bits used to encode the model parameters and operations. By default, the model’s parameters are encoded as 32-bit floats in tensorflow, but they can be reduced to 16-bit floats with little performance loss. More drastically, 32-bit floats can also be converted to 8-bit integers, at a higher risk of performance loss. More variations of these techniques are available and straightforward to implement with tensorflow-lite, as described in the documentation: https://www.tensorflow.org/lite/performance/post_training_quantization#dynamic_range_quantization
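Here is a minimal sketch of post-training quantization with the tensorflow-lite converter, assuming a trained Keras model and a full tensorflow installation on the machine doing the conversion. The function names and the `mode` parameter are ours. `estimated_weights_mb` is only back-of-the-envelope arithmetic that ignores the file's metadata.

```python
def quantize_keras_model(model, mode="float16"):
    """Convert a trained Keras model to quantized TFLite bytes.

    mode: "float16" (16-bit float weights) or "dynamic" (dynamic range
    quantization, 8-bit weights).
    """
    if mode not in ("float16", "dynamic"):
        raise ValueError(f"unknown mode: {mode}")

    # Full tensorflow is only needed at conversion time, not on the device.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    if mode == "float16":
        converter.target_spec.supported_types = [tf.float16]
    return converter.convert()


def estimated_weights_mb(n_params, bits_per_param):
    """Rough size of the weights alone, in megabytes."""
    return n_params * bits_per_param / 8 / 1e6
```

The returned bytes can be written to a file (for instance with `open("model_fp16.tflite", "wb")`, a placeholder name) and loaded by the interpreter. As an order of magnitude, one million parameters take about 4 MB at 32 bits, 2 MB at 16 bits and 1 MB at 8 bits.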
Dense and GlobalAveraging layers
In many neural networks (and in convolutional neural networks in particular), the dense layers at the end of the network outputting the predictions represent more than 80% of the parameters of the model. These layers can be critical for discriminating between the task’s inputs, but the preceding convolutional layers are often sufficient on their own. Removing dense layers partially (or fully, by replacing them with a simple global average pooling layer) can considerably reduce the model size without substantially changing the model’s predictions.
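To make the saving concrete, the parameter counts of the two kinds of classification head can be compared with simple arithmetic. The sketch below uses hypothetical dimensions (a 7x7x64 feature map, a hidden dense layer of 256 units and 10 classes), not the dimensions of our models.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


def head_with_flatten(height, width, channels, hidden_units, n_classes):
    """Flatten -> Dense(hidden_units) -> Dense(n_classes)."""
    flat = height * width * channels
    return dense_params(flat, hidden_units) + dense_params(hidden_units, n_classes)


def head_with_global_pooling(channels, n_classes):
    """GlobalAveragePooling (no parameters) -> Dense(n_classes)."""
    return dense_params(channels, n_classes)


if __name__ == "__main__":
    print(head_with_flatten(7, 7, 64, 256, 10))  # flatten + two dense layers
    print(head_with_global_pooling(64, 10))      # pooling + one dense layer
```

With these example dimensions, the flatten-based head has 805,642 parameters while the pooling-based head has 650. Whether accuracy survives the change has to be checked on the actual task.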
Pruning
This last well-known method consists in removing the neurons and connections (synapses) that contribute least to the model’s outputs. It is quick to implement with tensorflow but requires more experimentation and fine-tuning than the previous two methods to optimize the size/performance trade-off.
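The most common pruning criterion is the magnitude of the weights: the smallest ones are set to zero. The following plain-Python sketch illustrates that criterion on a flat list of weights. It is an illustration only, not the tensorflow API. In practice, the TensorFlow Model Optimization toolkit offers `prune_low_magnitude` to apply this kind of pruning gradually during fine-tuning.

```python
def prune_by_magnitude(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction zeroed.

    sparsity: fraction of weights to set to zero, between 0 and 1.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_to_prune = int(len(weights) * sparsity)
    # Indices sorted from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_to_prune]:
        pruned[i] = 0.0
    return pruned
```

For example, `prune_by_magnitude([0.5, -0.1, 0.05, -2.0], 0.5)` returns `[0.5, 0.0, 0.0, -2.0]`. Zeroed weights reduce the stored size only once the file is compressed or stored in a sparse format, which is part of the experimentation this method requires.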
Running Deep Learning algorithms under 5MB on edge devices
By building the tensorflow-lite interpreter, stripping unneeded parts of numpy, customizing our models and reducing their size with quantization and pruning, we run Deep Learning algorithms on Windows and Linux edge devices from Python wheels under 5 MB!