One of the nice features of OpenCL is that you can generate kernels on the fly from source code. While developing multiple operators, I noticed the following patterns:

- I need numpy style broadcast operations
- I need reductions

And apparently I need lots of them. All of the following can be implemented easily via broadcast/reduce patterns: loss functions, elementwise functions (add, div), activations, mean, sum, batch normalization, etc. Lots of the kernels I had written manually can actually be generated automatically… So I did it.
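To make the "generate kernels on the fly" idea concrete, here is a minimal sketch of how a calculation snippet could be wrapped into OpenCL C source at run time. The helper `make_pointwise_kernel` and the `px0`/`py0` parameter naming are hypothetical and for illustration only; this is not the library's actual generator, and it ignores broadcasting strides and reductions entirely.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of any library): wraps a user-supplied
// calculation snippet in an OpenCL C kernel that applies it elementwise
// to n_in inputs (x0, x1, ...) and n_out outputs (y0, y1, ...).
std::string make_pointwise_kernel(const std::string &name, int n_in, int n_out,
                                  const std::string &dtype,
                                  const std::string &calc)
{
    std::ostringstream src;
    src << "__kernel void " << name << "(ulong size";
    for (int i = 0; i < n_in; i++)
        src << ", __global const " << dtype << " *px" << i;
    for (int i = 0; i < n_out; i++)
        src << ", __global " << dtype << " *py" << i;
    src << ")\n{\n"
        << "    ulong index = get_global_id(0);\n"
        << "    if (index >= size) return;\n";
    // Load inputs into the names the snippet expects
    for (int i = 0; i < n_in; i++)
        src << "    " << dtype << " x" << i << " = px" << i << "[index];\n";
    // Declare outputs the snippet assigns to
    for (int i = 0; i < n_out; i++)
        src << "    " << dtype << " y" << i << ";\n";
    src << "    " << calc << "\n";
    // Store results
    for (int i = 0; i < n_out; i++)
        src << "    py" << i << "[index] = y" << i << ";\n";
    src << "}\n";
    return src.str();
}
```

The resulting string is what would be handed to the OpenCL runtime for compilation. Since the element type is just another string parameter, the same snippet can be reused for different data types.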

Below is an example of use for MSELoss:

Forward op:

Preparation:

```
auto fwd_ = core::PointwiseOperationBroadcastReduce::create(ctx_,
                in,out,      // input and output tensor specifications
                0,dtype_,    // extra scalar parameters count and their type
                "y0 = x0 - x1; y0 = y0*y0; ", // actual calculation
                "reduce_y0 = 0;",             // reduce init
                "reduce_y0 += y0;");          // actual reduce
workspace_size_ = fwd_->workspace();
```

Execution:

```
float scale = cfg_.reduce == cfg_.reduce_mean ? 1.0f/a.shape().total_size() : 1.0f;
fwd_->enqueue({a,b},{y},workspace,{},{scale},{0},q);
```

Backward op (for both gradients with accumulation of gradient):

```
core::pointwise_operation_broadcast({dy,a,b,da,db},{da,db},{scale,accum_0,accum_1},
    R"xxx(
    y0 = 2*(x1 - x2)*x0*w0;
    y1 = -y0;
    if(w1!=0)
        y0 += x3 * w1;
    if(w2!=0)
        y1 += x4 * w2;
    )xxx"
    ,e);
```
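Read as ordinary C++, the backward snippet does roughly the following. The sketch assumes that `x0..x4` follow the order of the input list (`dy, a, b, da, db`), that `w0..w2` follow the scalar list (`scale, accum_0, accum_1`), and that `dy` is a single value broadcast over all elements. The function name `mse_backward_reference` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// CPU reference for the backward snippet. dy is treated as a scalar that
// is broadcast over every element (an assumption of this sketch).
void mse_backward_reference(float dy,
                            const std::vector<float> &a,
                            const std::vector<float> &b,
                            std::vector<float> &da,
                            std::vector<float> &db,
                            float scale, float accum_0, float accum_1)
{
    for (std::size_t i = 0; i < a.size(); i++) {
        float y0 = 2 * (a[i] - b[i]) * dy * scale; // y0 = 2*(x1 - x2)*x0*w0
        float y1 = -y0;                            // y1 = -y0
        if (accum_0 != 0)
            y0 += da[i] * accum_0;                 // y0 += x3 * w1
        if (accum_1 != 0)
            y1 += db[i] * accum_1;                 // y1 += x4 * w2
        da[i] = y0;
        db[i] = y1;
    }
}
```

One plausible reason for the `!= 0` checks is that they skip the old gradient entirely when no accumulation is requested, so stale or uninitialized values (for instance NaN, which stays NaN even when multiplied by zero) cannot leak into the result.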

It makes it much simpler to implement lots of operators directly, including handling of multiple types like `float`, `float16`, `bfloat16` and various integer types.

Example use of broadcast:

Use of reduction (for mean/sum):