ARTIFICIAL INTELLIGENCE

Run Deep Learning algorithms under 5MB on Windows or Linux

Over the last few years, deep learning methods have been shown to outperform previous state-of-the-art machine learning techniques and other more traditional approaches in a wide range of fields, with computer vision being one of the most notable domains. With the expansion of Deep Learning use cases and its soaring performance came a new challenge: running deep learning on edge devices. Typically, Deep Learning algorithms are run on a server, which processes requests from devices and answers them with the algorithm’s outputs. This device-to-server interaction introduces significant latency, as it requires back-and-forth communication. In many use cases, such as autonomous vehicles, this delay is not acceptable. To ensure quick processing of the data, Deep Learning algorithms can be run directly on the devices.

This solution also comes with some issues, the most important one being the size of the libraries and models needed to run complex Deep Learning algorithms. At HarfangLab, we tackled this issue thanks to two principles: 1) using lean dependencies that contain only the code we need, and 2) reducing the size of the model while preserving its performance. In our case, we had to build tflite-runtime (the lean, inference-only version of tensorflow) for the desired platforms (Windows 32 and 64 bits), reduce the size of numpy by removing unnecessary code, and shrink the size of our models by a factor of 10.

Here is how we completed those steps and now successfully run complex Deep Learning Neural Networks on Windows and Linux devices with python wheels under 5 MB. The custom wheels we built to run such algorithms will also be available in the following repository.

Running Deep Learning with small dependencies

Tensorflow and tensorflow-lite

Usage of tensorflow and tensorflow-lite

Tensorflow is one of the most prominent Deep Learning libraries, providing a simple and efficient framework to implement and train elaborate models. Once a model is trained with this framework, it has the strong advantage of being easy to convert to a tensorflow-lite model. Inference can then be run using only this model and the tensorflow-lite interpreter (tflite-runtime), discarding all the unnecessary dependencies that would come with the complete tensorflow and tensorflow-lite. Hence, by using tflite-runtime we avoid shipping the full tensorflow (1.2 GB) and tensorflow-lite (50 MB) and instead retrieve only the 1 MB interpreter to load and run the model in inference.
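
As an illustration, here is a minimal sketch of this workflow, assuming a trained Keras model named model; the conversion runs on the development machine with the full tensorflow, while the inference part only needs tflite-runtime.

import numpy as np
import tensorflow as tf

# On the development machine: convert the trained Keras model to a .tflite file.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
with open("model.tflite", "wb") as f:
    f.write(converter.convert())

# On the edge device: only the ~1 MB tflite-runtime wheel is needed from here on.
from tflite_runtime.interpreter import Interpreter

interpreter = Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input with the expected shape and dtype, just to exercise the interpreter.
x = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], x)
interpreter.invoke()
prediction = interpreter.get_tensor(output_details[0]["index"])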

Installing and using tflite-runtime is extremely easy, as one may install it through a wheel with pip install tflite-runtime. Unfortunately, no 64-bit Windows wheel exists for tflite-runtime after version 2.5.0 (and none for 32-bit Windows). There are a few guidelines on how to build them, published by tensorflow: https://www.tensorflow.org/lite/guide/build_cmake_pip and https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/pip_package

These wheels, built from tensorflow 2.7.0, are available in our github repository. Here are some extra tips to build them.

Building the wheels

For Linux, the tflite-runtime wheels are available on PyPI.

To build tflite-runtime for Windows x86 and Windows x64, you will need a lot of courage and a Windows x64 virtual machine (or your own computer, if it runs Windows x64). The official CMake build should be used for the task. To run the build, you imperatively need Visual Studio Build Tools 2019 and Python 64-bit (python -V and python3 -V should not produce errors), as well as Python 32-bit (which must be the same version as the Python 64-bit one). You also need to install numpy, wheel and pybind11.

This is the command for x64 and x86: tensorflow/lite/tools/pip_package/build_pip_package_with_cmake.sh windows

We had many issues with this command; here is how we modified the critical lines in the file tensorflow/lite/tools/pip_package/build_pip_package_with_cmake.sh:

# Locate the Python libs, include, pybind11 and numpy directories for the build.
PYTHON_LIB=$(${PYTHON} -c "import distutils.sysconfig as sysconfig; print(sysconfig.PREFIX + '/libs')")
PYTHON_INCLUDE=$(${PYTHON} -c "from sysconfig import get_paths as gp; print(gp()['include'])")
PYBIND11_INCLUDE=$(${PYTHON} -c "import pybind11; print(pybind11.get_include())")
NUMPY_INCLUDE=$(${PYTHON} -c "import numpy; print(numpy.get_include())")
BUILD_FLAGS=${BUILD_FLAGS:-"-I${PYTHON_INCLUDE} -I${PYBIND11_INCLUDE} -I${NUMPY_INCLUDE} /EHsc"}
# -A Win32 targets 32-bit Windows; XNNPACK is disabled and all symbols are exported.
cmake -A Win32 -DCMAKE_C_FLAGS="${BUILD_FLAGS}" -DCMAKE_CXX_FLAGS="${BUILD_FLAGS}" -DCMAKE_SHARED_LINKER_FLAGS="/libpath:$PYTHON_LIB" "${TENSORFLOW_LITE_DIR}" -DTFLITE_ENABLE_XNNPACK=OFF -DCMAKE_WINDOWS_EXPORT_ALL_SYMBOLS=True
cmake --build . --verbose -j ${BUILD_NUM_JOBS} -t _pywrap_tensorflow_interpreter_wrapper --config Release
cd "${BUILD_DIR}"
# Copy the built interpreter wrapper DLL into the tflite_runtime package directory.
cp "${BUILD_DIR}/cmake_build/Release/_pywrap_tensorflow_interpreter_wrapper.dll" "${BUILD_DIR}/tflite_runtime/_pywrap_tensorflow_interpreter_wrapper${LIBRARY_EXTENSION}"

For x86, we added -A Win32 to the cmake call to specify that the build must target Windows 32-bit.

Numpy, size of wheels and dependencies

Now that the tflite-runtime wheels are built, we can easily install them on a Linux or Windows machine and run a model using this 1 MB wheel. However, tflite-runtime requires numpy as a dependency. Depending on your platform, numpy weighs between 10 and 16 MB. It is possible to further optimize the size of the dependencies required to run the model by reducing the size of numpy. To do so, you can build numpy with precise requirements and manually trim the built wheel to keep only the core functionality needed. On Windows x86, Windows x64 and Linux, run the following steps (a small script sketching the trimming follows the list):

  • CFLAGS="-g0 -Wl,--strip-all -I/usr/include:/usr/local/include -L/usr/lib:/usr/local/lib" pip install --cache-dir . --compile --global-option=build_ext --global-option="-j 4" numpy
  • Get the built wheel and run wheel unpack name-of-wheel.whl
  • Find and remove the RECORD file in the unpacked folder
  • Remove files or directories in the unpacked folder that look unnecessary (numpy ships with tests, docs, data, etc. that significantly increase its size)
  • Pack the folder again with wheel pack name-of-unpacked-folder
  • Install the wheel with pip
  • Install tflite-runtime
  • Check that your model runs an inference without crashing
  • Repeat the deletions, removing as many files as possible, as long as your model can still run an inference
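
To make this trial-and-error process less tedious, here is a rough sketch of how the unpack/trim/repack loop can be automated; the wheel name and the list of removable patterns are assumptions to adapt to your own build, and the model must be re-tested after every round of deletions.

import shutil
import subprocess
from pathlib import Path

wheel_file = "numpy-1.22.3-cp39-cp39-win32.whl"  # hypothetical wheel name, adapt to your build
subprocess.run(["wheel", "unpack", wheel_file], check=True)

unpacked = Path("numpy-1.22.3")  # folder created by `wheel unpack`

# Remove the RECORD file so the trimmed contents do not fail hash checks once repacked.
for record in unpacked.glob("*.dist-info/RECORD"):
    record.unlink()

# Candidate directories that are usually not needed for pure inference workloads.
for pattern in ("**/tests", "**/doc", "**/docs"):
    for path in unpacked.glob(pattern):
        shutil.rmtree(path)

subprocess.run(["wheel", "pack", str(unpacked)], check=True)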

Using this manual method, we were able to reduce the size of the numpy-1.22.3 wheel to less than 3 MB.

Reducing the size of Deep Learning models

After the previous steps, the dependencies required to run Deep Learning should now weigh only around 4 MB. The only missing piece to run an algorithm is the model itself. The size of the .tflite file containing the model grows quickly with its number of parameters. Here are the techniques we used to reduce the size needed to save the model.

Model Quantization

A common and simple way to reduce the size of a model is post-training quantization. This method consists in reducing the number of bits used by the model’s parameters and operations. Commonly, the model’s parameters are encoded on 32 bits in tensorflow, but they can be reduced to 16-bit floats with little performance loss. More drastically, 32-bit floats can also be converted to 8-bit integers, with a higher risk of performance loss. More variations of such techniques are available and straightforward to implement with tensorflow-lite, as described in the documentation: https://www.tensorflow.org/lite/performance/post_training_quantization#dynamic_range_quantization.
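
As an illustration, here is a minimal post-training quantization sketch with the tensorflow-lite converter, assuming a trained Keras model named model; the choice between dynamic range and float16 quantization depends on the performance loss you can accept.

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)

# Dynamic range quantization: weights are stored as 8-bit integers.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Alternative: float16 quantization, which roughly halves the model size
# with little performance loss.
# converter.target_spec.supported_types = [tf.float16]

with open("model_quantized.tflite", "wb") as f:
    f.write(converter.convert())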

Dense and GlobalAveragePooling layers

In many neural networks (and in convolutional neural networks in particular), the dense layers at the end of the network that output the predictions represent more than 80% of the model’s parameters. These layers can be critical for discriminating between the task’s inputs, but the preceding convolutional layers are often sufficient on their own. Partially removing the dense layers (or fully replacing them with a basic global average pooling layer) can considerably reduce the model size without substantially changing the model’s predictions.
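
For instance, here is a sketch of a convolutional classifier whose head is reduced to a global average pooling layer followed by a single small dense output; the backbone and number of classes are illustrative assumptions, not our actual model.

import tensorflow as tf
from tensorflow.keras import layers

num_classes = 10  # hypothetical number of classes
backbone = tf.keras.applications.MobileNetV2(include_top=False, weights=None)  # illustrative backbone

model = tf.keras.Sequential([
    backbone,
    layers.GlobalAveragePooling2D(),  # parameter-free, unlike a stack of large Dense layers
    layers.Dense(num_classes, activation="softmax"),
])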

Pruning

This last, well-known method consists in removing the neurons and synapses that contribute the least to the model’s outputs. It is quick to implement with tensorflow, but requires more experimentation and fine-tuning than the two previous methods to optimize the size/performance trade-off.
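
Below is a minimal sketch of magnitude-based pruning with the TensorFlow Model Optimization Toolkit, assuming a trained Keras model named model and training data x_train, y_train; the target sparsity and schedule are illustrative and need tuning.

import tensorflow_model_optimization as tfmot

# Progressively prune the weights with the smallest magnitudes up to 50% sparsity.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=1000)
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)

# Fine-tune so the remaining weights compensate for the pruned ones.
pruned_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
pruned_model.fit(x_train, y_train, epochs=2,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Strip the pruning wrappers before conversion; the zeroed weights then compress well.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)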

Running Deep Learning algorithms under 5MB on edge devices

By building the tensorflow-lite interpreter, stripping out unneeded parts of numpy, customizing our models and reducing their size with model quantization and pruning, we run Deep Learning algorithms on Windows and Linux edge devices from Python wheels under 5 MB!