AI Without Python

An Intro to Machine Learning for C++ Programmers


Borislav Stanimirov / @stanimirovb

Hello, World



  #include <iostream>

  int main()
  {
      std::cout << "Hi, I'm Borislav!\n";
      std::cout << "These slides are here: https://is.gd/aicppintro\n";
      return 0;
  }
        

Borislav Stanimirov


  • I mostly write C++
  • Professionally since 2002
  • 2006—2018: game development
  • 2019—2023: medical software
  • 2023—now: machine learning
  • Open source: github.com/iboB

This talk


  • ⚠️ More inspirational than educational
  • ⚠️ Contains personal opinion on software
  • More technical than philosophical
  • The gist, rather than the detail
  • Mainly for programmers who are not in the ML field
  • ... and who have experience with or interest in low-level programming
  • Themes
    • Why should you consider this?
    • What can you do?

Background

Machine Learning in 2023

The Current Big Thing™ in software

Whisper, DALL·E, Craiyon 🖍, ChatGPT, GPT-J, LLaMa 🦙, LaMDA, Midjourney, Falcon LLM 🦅, Stable Diffusion, Unstable Diffusion 😉, GitHub Copilot, StarCoder, BERT 🐻, SAM, Chinchilla 🐭

A Cambrian explosion of AI tools

... and startups

... and software as a whole

There's something new every day.

(this talk will probably be outdated by tomorrow)

This software is no magic

Modern AI Software


  • In many regards software like any other
  • Written by teams (of humans)
  • ...with conventional software development tools
  • It has some unusual, but not unique, features
  • Many libraries and frameworks exist to help
  • It's most often done in Python

Python Stacks


  • The big fish: PyTorch and TensorFlow
  • Every ML framework has a Python front end
  • Why Python?

⚠️ Personal opinion time ⚠️

Borislav on Python


  • Is Python the best language for ML?
    • No.
  • Is it the worst language for ML?
    • No. But it's down there
  • Extreme care is needed for software in duck-typed languages
  • Python stacks are a mess

Python Stacks


  • Package managers: pip, pipenv, poetry, npm, conda
  • Env managers: conda, mamba, pyenv, containers
  • Notebooks and Scientific code

Modern ML software

Objective Truth

Python is slow

"No it's not slow!"

  • "This Python program is faster than its C++ equivalent!"
  • ".pyc should do it"
  • "Python is the most optimized interpreter there is!"
  • "Python JIT compilers work!"
  • "No matter. The low-level framework does the actual work."

Opaque Frameworks


  • Data flow suffers
  • Tweaks are hard to impossible
  • Many similarities with game engines
  • Bloat intensifies

Something Good About Python



  slice = a[5:10, :20:2] # slicing is pretty neat
        

* Similar syntax coming soon to C++
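
There is no direct equivalent of Python's slicing in C++ yet. As a minimal sketch of the related syntax that did land, assuming a C++23 toolchain with std::mdspan (strided slicing via std::submdspan is planned for a later standard):

  #include <cstdio>
  #include <mdspan>
  #include <vector>

  int main()
  {
      std::vector<float> data(6 * 4, 1.f);
      std::mdspan m(data.data(), 6, 4); // view the buffer as a 6x4 matrix
      m[2, 3] = 42.f;                   // C++23 multidimensional subscript
      std::printf("%f\n", m[2, 3]);
      return 0;
  }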

I'm not the only one with such problems

Alternatives


  • CUDA — so C++
  • OpenCL — so C++
  • Metal — so C++, but Objective, or Swift
  • Vulkan — so C++... OK! OK! Many more options
  • CPU/SIMD (like REAL men!) — Anything but Python
  • Mojo — ??? — Definitely not magic, though
  • ... — Alternatives pop up by the hour

A Crash Course in ML

I am not an ML engineer

Borislav Stanimirov


  • I mostly write C++
  • Professionally since 2002
  • 2006—2018: game development
  • 2019—2023: medical software
  • 2023—now: machine learning
  • Open source: github.com/iboB

Borislav Stanimirov


  • C++: yes
  • Low-level: yes
  • GPGPU: yes
  • Chasing microseconds: yes
  • Machine learning: well...

So, this is my perspective...

ML Techniques


  • Linear regression
  • Bayes classification
  • Support vector machine
  • Decision tree
  • Random forest

NOPE

Neural Networks

Neural networks

  • History

What is a neural network?

It's a function


    enum thing { ... };
    thing classifier(const image& input);
        

    enum thing { ... };
    struct result {
      thing t;
      float p;
    };
    std::vector<result> classifier(const image& input);
        

    std::string gpt(const std::string& input);
        

    using gpt_callback = std::function<void(const std::string&)>;
    void gpt(const std::string& input, gpt_callback cb);
        

What is a neural network?

It's a computation with parameters


    enum thing { ... };
    thing classifier(const image& input, const std::vector<float>& parameters);
        

Parameters

LLaMa-7B - the LLaMa model with 7 billion parameters

Training Neural Networks


  • Solve the function with respect to the parameters
  • Gradient descent and differentiability
  • Learning rate
  • Over/Underfitting
  • Stacking
  • Shearing
  • Transfer Learning
  • Fine tuning
  • ...

NOPE

Designing Neural Networks


  • It's magic
  • Mostly indistinguishable from fortune telling
  • Years of experience
  • Lots of untransferrable knowledge
  • Takes millions of hours
  • It seems that we do need Python here 😢

NOPE

Neural Network Applications


  • Design - not today
  • Training - not today
  • Inference - executing the computation - today
  • Inference on the edge - tomorrow

What is a neural network?

A network of neurons, duh

$y = g \left( \sum_{j=1}^{n} w_j x_j + b \right)$

Layers ("deep" means more than 2)

Wait! I know this

$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = g \left( \begin{pmatrix} w_{11} & w_{12} & w_{13} & w_{14} \\ w_{21} & w_{22} & w_{23} & w_{24} \\ w_{31} & w_{32} & w_{33} & w_{34} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} + \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix} \right)$

Yes. This is mostly everything
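
To make the matrix form concrete, here is a minimal sketch in plain C++ (no libraries) of one fully connected layer computing y = g(Wx + b), with W stored row-major:

  #include <cmath>
  #include <cstddef>
  #include <vector>

  // one dense layer: m outputs from n inputs
  std::vector<float> linear_layer(
      const std::vector<float>& x, // input   (n)
      const std::vector<float>& w, // weights (m * n, row-major)
      const std::vector<float>& b, // biases  (m)
      float (*g)(float))           // activation function
  {
      const size_t n = x.size();
      const size_t m = b.size();
      std::vector<float> y(m);
      for (size_t i = 0; i < m; ++i)
      {
          float sum = b[i];
          for (size_t j = 0; j < n; ++j)
          {
              sum += w[i * n + j] * x[j];
          }
          y[i] = g(sum);
      }
      return y;
  }

  // usage: auto y = linear_layer(x, w, b, [](float v) { return std::tanh(v); });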

Types of layers


  • This was the linear (fully connected, dense) layer
  • Almost all layer types can be represented as fully connected
  • It's a matter of efficiency
  • Convolution/Pooling layers
  • Normalization layers
  • Attention layers
  • "Layer" actually is pretty fuzzy
  • The Neural Network Zoo

Activation Functions


  • Without them every output would be a linear function of the input
  • Layer count wouldn't matter
  • Sigmoids
    • Logistic function
    • tanh
    • smht
  • ReLU
  • Leaky ReLU
  • GELU
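
As a sketch, each of these is a one-liner; the GELU below uses the widely used tanh approximation:

  #include <cmath>

  float logistic(float x)   { return 1.f / (1.f + std::exp(-x)); }
  float relu(float x)       { return x > 0 ? x : 0.f; }
  float leaky_relu(float x) { return x > 0 ? x : 0.01f * x; }
  float gelu(float x)
  {
      // tanh approximation; 0.7978845608f is sqrt(2/pi)
      return 0.5f * x * (1.f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
  }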

Convolution

Neurons don't depend on the entire input

Weights are shared

Feature maps
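
A deliberately naive sketch of a single-channel 2D convolution (stride 1, no padding) showing both points: each output value sees only a k-by-k window of the input, and every window reuses the same kernel weights:

  #include <cstddef>
  #include <vector>

  // input: h x w, row-major; kernel: k x k; output: (h-k+1) x (w-k+1)
  std::vector<float> conv2d(const std::vector<float>& in, size_t h, size_t w,
                            const std::vector<float>& kernel, size_t k)
  {
      const size_t oh = h - k + 1, ow = w - k + 1;
      std::vector<float> out(oh * ow, 0.f);
      for (size_t y = 0; y < oh; ++y)
          for (size_t x = 0; x < ow; ++x)
              for (size_t ky = 0; ky < k; ++ky)
                  for (size_t kx = 0; kx < k; ++kx)
                      out[y * ow + x] += in[(y + ky) * w + (x + kx)] * kernel[ky * k + kx];
      return out;
  }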

Pooling

(Subsampling)

Collecting "important" features
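
And a matching sketch of 2x2 average pooling (the variant the LeNet example below uses), which halves each spatial dimension:

  #include <cstddef>
  #include <vector>

  // input: h x w, row-major; output: (h/2) x (w/2)
  std::vector<float> avg_pool_2x2(const std::vector<float>& in, size_t h, size_t w)
  {
      const size_t oh = h / 2, ow = w / 2;
      std::vector<float> out(oh * ow);
      for (size_t y = 0; y < oh; ++y)
          for (size_t x = 0; x < ow; ++x)
              out[y * ow + x] = 0.25f * (in[(2 * y) * w + 2 * x] + in[(2 * y) * w + 2 * x + 1] +
                                         in[(2 * y + 1) * w + 2 * x] + in[(2 * y + 1) * w + 2 * x + 1]);
      return out;
  }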

What is a neural network?

A collection of layers which define a computation

Terminology time

Tensors

  • No, not physical ones.
  • Just N-d arrays
  • Think std::vector
  • Shape: [[1,2],[3,4],[5,6]] -> (3, 2)... or maybe (2, 3)
  • Broadcast (see the sketch after this list):
    • f([1,2,3]) = [f(1), f(2), f(3)]
    • [[1,2],[3,4]] + [10,20] = [[11,22],[13,24]]
  • Tensors for weight, bias, layer
    • l_5 = mul(w_5, l_4) + b_5
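
A minimal sketch of the second broadcast example, generalized to adding a tensor of shape (w) to every row of a tensor of shape (h, w):

  #include <cstddef>
  #include <vector>

  // a: h x w, row-major; bias: w; adds bias to every row of a
  void broadcast_add_rows(std::vector<float>& a, size_t h, size_t w,
                          const std::vector<float>& bias)
  {
      for (size_t y = 0; y < h; ++y)
          for (size_t x = 0; x < w; ++x)
              a[y * w + x] += bias[x];
  }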

Models


  • What is a model anyway?
  • Any of:
    • The layer/computation sequence
    • The parameter (weight) tensors
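
As a sketch only (the types below are made up for illustration and not taken from any framework), the two senses side by side:

  #include <cstdint>
  #include <map>
  #include <string>
  #include <vector>

  struct tensor {
      std::vector<int64_t> shape;
      std::vector<float> data;
  };

  struct model {
      // sense 1: the layer/computation sequence (here just op names)
      std::vector<std::string> ops;
      // sense 2: the parameter (weight) tensors, typically loaded from a file
      std::map<std::string, tensor> weights;
  };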

LeNet

AI like it's 1998

MNIST dataset

LeNet Model

Classify individual handwritten digits

  import torch
  from torch import nn

  class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()

        self.convs = nn.Sequential(
          nn.Conv2d(in_channels=1, out_channels=4, kernel_size=(5, 5)),
          nn.Tanh(),
          nn.AvgPool2d(2, 2),

          nn.Conv2d(in_channels=4, out_channels=12, kernel_size=(5, 5)),
          nn.Tanh(),
          nn.AvgPool2d(2, 2)
        )

        self.linear = nn.Sequential(
          nn.Linear(4*4*12,10)
        )

    def forward(self, x: torch.Tensor):
        x = self.convs(x)
        x = torch.flatten(x, 1)
        x = self.linear(x)
        return nn.functional.softmax(x, dim=1)
        
    // the same LeNet, built as a ggml compute graph in C++
    m_input = create_tensor("input", {28, 28, 1});
    ggml_tensor* next;
    // first conv block: 1 -> 4 channels, 5x5 kernel, tanh, 2x2 average pooling
    auto conv0_weight = create_weight_tensor("conv0_weight", {5, 5, 1, 4});
    auto conv0_bias = create_weight_tensor("conv0_bias", {1, 1, 4});
    next = ggml_conv_2d(m_ctx, conv0_weight, m_input, 1, 1, 0, 0, 1, 1);
    next = ggml_add(m_ctx, next, ggml_repeat(m_ctx, conv0_bias, next));
    next = ggml_tanh(m_ctx, next);
    next = ggml_pool_2d(m_ctx, next, GGML_OP_POOL_AVG, 2, 2, 2, 2, 0, 0);
    // second conv block: 4 -> 12 channels, 5x5 kernel, tanh, 2x2 average pooling
    auto conv1_weight = create_weight_tensor("conv1_weight", {5, 5, 4, 12});
    auto conv1_bias = create_weight_tensor("conv1_bias", {1, 1, 12});
    next = ggml_conv_2d(m_ctx, conv1_weight, next, 1, 1, 0, 0, 1, 1);
    next = ggml_add(m_ctx, next, ggml_repeat(m_ctx, conv1_bias, next));
    next = ggml_tanh(m_ctx, next);
    next = ggml_pool_2d(m_ctx, next, GGML_OP_POOL_AVG, 2, 2, 2, 2, 0, 0);
    // flatten, fully connected layer to 10 classes, softmax
    next = ggml_reshape_1d(m_ctx, next, 12 * 4 * 4);
    auto linear_weight = create_weight_tensor("linear_weight", {12 * 4 * 4, 10});
    auto linear_bias = create_weight_tensor("linear_bias", {10});
    next = ggml_mul_mat(m_ctx, linear_weight, next);
    next = ggml_add(m_ctx, next, linear_bias);
    m_output = ggml_soft_max(m_ctx, next);
        

Practical Challenges With Inference

Number crunching


  • GPGPU is the way to go
  • SIMD
  • gemm, BLAS and custom gemm
  • Cache-locality (see the gemm sketch after this list)
  • Memory bandwidth bottlenecks - M2 Ultra's time to shine
  • Quantizations. Yes, Q2 is a thing
  • Fusing kernels - hey, remember expression templates?
  • Streaming - finally a use for coroutines
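
To illustrate the cache-locality point, a naive sketch of the classic gemm loop reordering: the same C = A*B (all n x n, row-major, C zero-initialized), with the k loop hoisted so the innermost loop walks both B and C contiguously:

  #include <cstddef>
  #include <vector>

  void gemm_ikj(const std::vector<float>& a, const std::vector<float>& b,
                std::vector<float>& c, size_t n)
  {
      for (size_t i = 0; i < n; ++i)
          for (size_t k = 0; k < n; ++k)
          {
              const float aik = a[i * n + k];
              for (size_t j = 0; j < n; ++j)
                  c[i * n + j] += aik * b[k * n + j]; // contiguous reads and writes
          }
  }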

Tweaks


  • They come more often than you would think
  • Quantization (see the sketch after this list)
  • Reshapes
  • Custom kernels
  • Sampling and resampling
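
As a sketch of what a quantization tweak boils down to: symmetric 8-bit quantization of a block of weights, one float scale per block plus one int8 per value (loosely in the spirit of the quantized formats mentioned earlier):

  #include <algorithm>
  #include <cmath>
  #include <cstddef>
  #include <cstdint>
  #include <vector>

  struct q8_block {
      float scale;
      std::vector<int8_t> q;
  };

  q8_block quantize_q8(const std::vector<float>& w)
  {
      float amax = 0.f;
      for (float v : w) amax = std::max(amax, std::fabs(v));
      q8_block b;
      b.scale = amax / 127.f;
      const float inv = b.scale != 0.f ? 1.f / b.scale : 0.f;
      b.q.reserve(w.size());
      for (float v : w) b.q.push_back(int8_t(std::lround(v * inv)));
      return b;
  }

  float dequantize(const q8_block& b, size_t i) { return b.scale * float(b.q[i]); }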

How to Start?

First, forget about training!

Implement a simple model in the most naive way!

Yes, play with Python, too

Libs and Frameworks


Examples and Sources


Practical Steps


  1. Find a model (for example on Hugging Face)
  2. Look at the model description if available
  3. Look at the Python implementation
  4. Yes, there will be one
  5. Implement tensor ops
  6. Compare intermediate steps with the Python implementation (sketch below)
  7. ...
  8. Profit
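
For step 6, a minimal sketch of one way to do the comparison: dump an intermediate tensor from the Python run as raw float32 (numpy's tofile can do that) and check it element-wise against your own result. The file name and tolerance are placeholders:

  #include <cmath>
  #include <cstddef>
  #include <cstdio>
  #include <fstream>
  #include <vector>

  bool close_to_reference(const std::vector<float>& ours,
                          const char* reference_file, float eps = 1e-4f)
  {
      std::ifstream f(reference_file, std::ios::binary);
      std::vector<float> ref(ours.size());
      f.read(reinterpret_cast<char*>(ref.data()),
             std::streamsize(ref.size() * sizeof(float)));
      for (size_t i = 0; i < ours.size(); ++i)
      {
          if (std::fabs(ours[i] - ref[i]) > eps)
          {
              std::printf("mismatch at %zu: %f vs %f\n", i, ours[i], ref[i]);
              return false;
          }
      }
      return true;
  }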

How to continue?


  • Try being faster than the Python implementation
    • Really, not such a tall order
    • on CPU
    • on GPU
  • Do more models

The Real World


  • Plugins
  • Profiling can be a challenge
  • The periphery
    • Tokenizers, streaming, decoders, encoders, guidance
    • Horizontal scaling, MPI
  • It's software like any other

But Why?


  • If you like number crunching
  • If you like chasing microseconds
  • If you like doing magic
  • If you don't like "scientific" code

You are needed!

Let's ride the hype wave!

End

Questions?



Slides license: CC-BY 4.0