Wednesday, October 19, 2022

Machine Learning Equates Solve Table for Advanced ML

Machine Learning Equates Solve Table for Advanced ML (c)RS

ML & Code Efficiency Heuristic Search,
Python & of course all runtimes of GPU & CPU Firmware & Logical thought,

Apologies for not expressly stating all {Mul+ & all} Accumulator strategies, these are hard to work out! But basic edge detection is a SiMD Example RS


Core Motivations of ML

Machine Learning is a branch of artificial intelligence that focuses on using data and algorithms to imitate the way that humans learn, gradually improving the accuracy of the ML method.

Machine Learning can be applied to various domains, such as image processing, natural language processing, speech recognition & code optimization.

Machine Learning can use different techniques, such as supervised learning, unsupervised learning & reinforcement learning, depending on the type and availability of data.

Some of the common tasks that Machine Learning is applied to are:

Edge detection: a process of identifying the boundaries of objects in images or videos.

Accent recognition: a process of identifying the regional or social variation of speech.

Language processing: a process of analyzing and generating natural language texts.

Code optimization: a process of improving the performance or quality of code by using various methods, Such as compilers, libraries, or heuristics.

The Objective is to improve both ML & Minds.


I think that considering the stated philosophy, There is more room for education on social conduct.


As the whole of Machine Learning is based on maths & comparators, There are some factors!

Jung & Feng Shuai

We need to teach maths better!

We need to learn how to have meaning in maths! Now what do we mean? What is life? (comparator, but what to?),

Love & friendship? Is that only maths or does Jesus/Thor/Loki/Odin/God count? and if so.. how?

What kind of maths do we need? & how do we compare on this list?

Can mankind, Aliens, Life & machines.. do better ?

Rupert S



Precision of operations has to be precisely managed:

Most precisely we define a thought?
While we think precisely of naught,
some precision within thought!

AVX for example can go up to 512Bit; But we can use 8Bit or 16Bit multiple operations,

In the mind of the thinker we chose how we optimise our precision,

Coding allows a person to think about how precise the decisions they make are!

But precisely what we need in the way of precision is a remark on how difficult that choice is?

When at school Pi often stops at 4r; Now in classical work we define absolute precision..

But what are we capable of ?

As we dream; the thoughts are imprecise; sometimes sharp!

The true value of precision is quantified by the desired goal,

What we do is achieve a goal in the range of our precision.

Schooling is the same; precisely how precise we work for our goals & how we achieve them.



TOP's (Tera Operations Per Second) are not the only unit in Machine Learning; TOP's are the Objective Definition & Definition Inference of correctness.

Role of FPU/SiMD/Vector Unit in TOP's

The FPU float unit, Example Dual Pipe 128Bit Float unit PS5

While not theoretically TOP's The Maths involved can solve many issues:

Once TOP's have thought of:

The role of Inferencing could depend on samples; Maths helps define samples,

The FPU (Floating Point Unit) and SIMD (Single Instruction, Multiple Data) units are important components of machine learning accelerators, because they are responsible for performing the floating-point arithmetic that is required for many machine learning algorithms.

The vector unit is also important because it allows machine learning accelerators to perform multiple operations on multiple data points in parallel, Which can significantly improve performance.

The mathematics involved in machine learning can be used to solve a wide variety of problems, including:

Defining samples: Machine learning models are often trained on data that is represented as samples.

The mathematics of probability can be used to define the properties of these samples,
Such as their distribution and their size.

Bonding atoms: Machine learning models can be used to predict the properties of molecules,
Such as their bonding energy and their stability.
For bonding atoms, a Maths Solve can show the bonding.

The mathematics of quantum mechanics can be used to calculate these properties.

Drawing graphics: Machine learning models can be used to generate realistic images and videos.
We can thread load, Polygons, Textures as wave tables, Audio & sounds such as drum kits.
We can draw a Ball in 128Bit, Draw a complex polygon; For example Random Shape Flier.
We can emulate a 128Bit Audio output.

The mathematics of geometry and trigonometry can be used to represent these graphics.

Emulating audio: Machine learning models can be used to synthesize sound and music. The mathematics of wavelets and Fourier transforms can be used to represent these sounds.
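
As a minimal sketch of how a Fourier transform represents a sound, the block below builds a test tone and finds its strongest frequency bin with a naive discrete Fourier transform. The names `dft` and `dominant_bin` and the 64-sample tone are assumptions made for illustration; a real audio path would use an FFT library.

```python
import cmath
import math

def dft(samples):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def dominant_bin(samples):
    """Index of the strongest frequency bin in the lower half of the spectrum."""
    spectrum = dft(samples)
    half = len(samples) // 2
    return max(range(1, half), key=lambda k: abs(spectrum[k]))

# Hypothetical test tone: 5 full cycles across 64 samples.
N = 64
tone = [math.sin(2 * math.pi * 5 * t / N) for t in range(N)]
peak = dominant_bin(tone)   # the tone lands in bin 5
```

The same transform run in reverse turns a table of bins back into samples, which is the wave table idea in its simplest form.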



Int8:SiMD : Maths & Logic

This is about how you think about components such as INT8, INT4(Xbox) & SiMD, You have to classify by necessity & optimise the structure.

You can shape the game reality with specific control objects & statics!
Maths in SiMD & Int8 & Machine Learning in Int8 & SiMD; SiMD is hard maths, Int8 is soft edge inference...

Both are maths; But soft logic is not a PROOF Math but can be proof; Hard math is not 'Invention & Imagination 'Exactly''

But we have both to improve performance.


Solve Table of Statistically provable Machine Equates & Solves : Table of function competitors & Operators.

"I know this is depressing from my end with a FX8320E with AVX but if you multi tune the CPU Kernel for the RX / RTX that 512DL AVX would have meaning, If you are kind you will allow machine learning on the AVX FX8320E Level to work on SiMD Yes / No comparisons !"

#ML Learning: This explains why we teach kids art & reading first! But maths is quickly next, 
Because all else is pointless; That we do not learn with logic & Teach with logic.

Here is how to create a better mind #ML
Train your eyes with art on the concepts of edges, curves, Colours & Shading and love,
Educate your minds; Learn today & be quite aware how clever & sharp you will be.

Human Operations

Edge Detection
Such as teaching your child edge detect in art ;)

Smooth & Blend & Sharpen,
All interpretive

Accent Recognitions & Language

Interpret as follows


Heuristic Code optimise

When it comes to sorting methods, We Identify common techniques..
For example frequently used technologies such as:

Audio & Visual information

Primarily we identify common optimisations; Compilers have libraries of them!

Audio & Video Encoded data use Wavelet Images, We can ResNet Them & also Edge Detect & Gaussian Detect contrast, Colour, Shape

Language is an uncommon syntax, But we have audio commons & Accent identification is also potentially Audio Context.

Code context is Logic, Function, Utility, Design, Motive



M.A.P NPU Matrix Processor Dimensional construct (c)RS

Primary reason for expansion of function data sets: 2D, 3D,< nD

P.D.C is a worker thread parallel 2D or 3D Grid,
Utilising QQ & A, B,C Array maths allows us to collapse or expand dimensions in a flexible way,

The same principles as SVM (S.V.M SiMD Vector matrix) can be used to collapse or expand dimensions...

That way a M.A.P Processor can expand or collapse all mathematical constructs,
We can therefore use all mathematical & statistical arrays for machine Learning & Maths.




Adams is an example of dimensional flattening; But we can use a statistical anomaly called Hallo Far Reach & list dimensions of a series,

n layers By n layers : N² & Nn+

8Bit : 8 layers By 8 layers:
2bit, 4Bit, 8Bit & So on
{ 2², 4², 8², 16², 32², 64²<N² }

In reality we can use parallel layers in 4Bit to 128Bit relatively easily & advantage is Memory.. alignment,

But also in Aligned memory arrangements we can also quantify ideally from
{ 2², 4², 8², 16², 32², 64²<N² }

So we end up with all processor features used in a single stack; Examples!

var Layers 8² = { 1 : {
4², 4²
4², 4²
2 : {
2², 2², 2², 2²,
2², 2², 2², 2²,
2², 2², 2², 2²,
2², 2², 2², 2²,
3 : {
32² : {
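
One possible reading of the layer table above is aligned tiling: an 8² layer is covered exactly by four 4² blocks or by sixteen 2² blocks. The sketch below checks that reading; the function names `tile` and `covered_cells` are hypothetical and the interpretation itself is an assumption.

```python
def tile(size, block):
    """Split a size x size grid into block x block tiles; return the tile origins."""
    if size % block:
        raise ValueError("block must divide size for aligned tiling")
    return [(r, c) for r in range(0, size, block) for c in range(0, size, block)]

def covered_cells(size, block):
    """Number of grid cells covered by the aligned tiles."""
    return len(tile(size, block)) * block * block

layers = {
    1: tile(8, 4),   # four 4x4 blocks
    2: tile(8, 2),   # sixteen 2x2 blocks
}
```

Both tilings cover all 64 cells with no overlap, which is what makes the memory alignment work.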

Rupert S


Adam's Resnet-50 128bit / 8bit or 16bit

Resnet-50 is an example of a network ML with an aligned 128bit = 8bit/16bit * (4 * 32) grid, suggested parameters ..

Aligned making sense.


An idea of alignment, Example EdgeTPU & Intel 8Bit 8*8:

in an 8Bit restricted machine; 2 Blocks of 2² = 8, 2 Cube(3) = 8, 4² = 2*8, 4 Cube(3) = 2*8 in 4 segments,
8² = 8*8 so parallel and ideal for the 8 lane intel function...
at the level of 8Bit only operations; 8*8 intel.
8*8 and 32Bit SiMD operations; 8²*2, 8² * 4²

Inferencing 8Bit example : DOT : U32 8x4 : 32/4, U64 8x8 : 64/8,
Cache referencing: Block 4*U32, 2*U64, U128
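
A minimal sketch of the U32 8x4 DOT case: four unsigned 8Bit values share one 32Bit word and the dot product accumulates into a 32Bit result. The helper names `pack_u8x4` and `dot4_u8` are assumptions for illustration; hardware DOT4 instructions do the lane loop in a single operation.

```python
def pack_u8x4(values):
    """Pack four unsigned 8-bit values into one 32-bit word (lane 0 in the low byte)."""
    if len(values) != 4 or not all(0 <= v <= 0xFF for v in values):
        raise ValueError("need exactly four values in the range 0..255")
    word = 0
    for lane, v in enumerate(values):
        word |= v << (8 * lane)
    return word

def dot4_u8(a_word, b_word, acc=0):
    """Dot product of two packed u8x4 words, accumulated into a 32-bit value."""
    for lane in range(4):
        a = (a_word >> (8 * lane)) & 0xFF
        b = (b_word >> (8 * lane)) & 0xFF
        acc += a * b
    return acc & 0xFFFFFFFF

a = pack_u8x4([1, 2, 3, 4])
b = pack_u8x4([10, 20, 30, 40])
result = dot4_u8(a, b)   # 1*10 + 2*20 + 3*30 + 4*40 = 300
```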

So an 8Bit access and labeling ID Hash; All in 8Bit...

Has to group by preference into 8Bit groupings the resulting identifiers; We are going to assume U16 & U32 & U64 memory cells..

We are going to write those cells per 8Bit block in Sync/ASync Till Full..
We are going to process grouped CELLS in SiMD & of groupings 8, 16, 32, 64 < 512Bit AVX/SiMD,



SiMD Applications of basic maths operations in machine learning : RS

Applications of operators to machine learning is like a PHP Database...
What we need to do is convert database accesses into actionable results...

Google Bard & Bing/Cortana crawl the web; But too many results leave us inconclusive...

We will be using database analysis on basic queries & for that we need heuristic maths!

So what do we need ?

Input data collection : Text & speech processing

Sorting algorithms (Operators, Example Variable Sort : A*B =< C Sort)

Graph Maths table collation : 3D Matrix Math - A B C Matrix

Analysis of various results & statistical analysis of motivated search & conclusion testing..
With these we can test many math examples such as edge detect & sharpening or result maths...
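
As a small test case for edge detect & sharpening, the sketch below applies a 3x3 Laplacian kernel to a tiny image and adds the edge response back to sharpen it. The kernel choice, the helper names and the 5x5 test image are assumptions for illustration; each output pixel is a multiply & accumulate over 9 values, which is the shape of work SiMD units are built for.

```python
LAPLACIAN = [[0, -1, 0],
             [-1, 4, -1],
             [0, -1, 0]]

def convolve3x3(image, kernel):
    """Apply a 3x3 kernel to the interior pixels; border pixels are left at 0."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(
                kernel[j][i] * image[y + j - 1][x + i - 1]
                for j in range(3) for i in range(3)
            )
    return out

def sharpen(image, edges, amount=1):
    """Add the edge response back to the image and clamp to the 8-bit range."""
    return [
        [max(0, min(255, p + amount * e)) for p, e in zip(row, erow)]
        for row, erow in zip(image, edges)
    ]

# Hypothetical 5x5 test image: dark left side, bright right side.
image = [[0, 0, 100, 100, 100] for _ in range(5)]
edges = convolve3x3(image, LAPLACIAN)
sharp = sharpen(image, edges)
```

The edge response is non-zero only on the two columns either side of the brightness step, so the sharpened image gets a steeper transition there.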

With Operators >

FMA AVX Performance table: 2Flops per Cycle per FMA Unit
Architecture Fast Instructions for FMA

Reference Tables

Operators in C
● Arithmetic
a + b, a - b, a*b, a/b, a%b
● Bitwise
a | b, a & b, a ^ b, ~a
● Bit shift
a << b, a >> b (signed), a >> b (unsigned)
● Logical operators
a && b, a || b, !a
● Comparison operators
a == b, a != b, a < b, a <= b, a > b, a >= b
● Ternary operator
x = a ? b : c
● Special functions:
sqrt(x), abs(x), fma(a,b,c), ceil(x), floor(x)

For when {U, X, Y, Z} = N Expressions
For when {(A+B/2)} = C Expressions

Rupert S,

Reference operators



Number Complexity Reduction for operations

I suppose you can use for example a - b & automatically see if it is larger? So you could take 1 to 20 & sort them by remaining number; Before asking, Small number remainders are 8Bit 0-255, 16Bit is 0-65535...
So reducing the value of a group of numbers you sort to 16Bit or 8Bit considerably reduces sorting cost...

Achievable complexity reduction by abstracting a simple number to do the following:

You link the Data in 64Bit, 32Bit to & Vector Table,
List of lower complexity is faster

Comparator matrix

Colour composing,{
The result is blended,
The result is High/Low Vector gradient,
We need a reduced colour set for compression

Where we sort files or names but reduced information (example First 4 Letters)
Sorting phone numbers fast...

Comparing lower complexity lists that have been; divided or had a static number removed from them,
This method reduces search & sort complexity; Like so:

Phone Number N +1 444555777

Sort N [+n]
N - last 6 digits (Zero 6 Digits, AVX has this feature)
Sort [N1 to N200]
List first 4, Sort by 4 to groups of 10
N - First 6 Digits (Zero First 6)
Return N1 to N200

That may well be a lot quicker with very large lists.
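
The steps above can be sketched roughly as follows, assuming the split is made at a decimal boundary of 6 digits: group the numbers by their high digits, then sort only the small remainders inside each group. The function names and the sample numbers are hypothetical.

```python
from collections import defaultdict

def split_number(n, low_digits=6):
    """Split n into (high part, low part) at a decimal boundary."""
    return divmod(n, 10 ** low_digits)

def bucketed_sort(numbers, low_digits=6):
    """Group by the high digits, then sort the small remainders inside each group."""
    buckets = defaultdict(list)
    for n in numbers:
        high, low = split_number(n, low_digits)
        buckets[high].append(low)          # low needs far fewer bits than n
    result = []
    for high in sorted(buckets):
        for low in sorted(buckets[high]):
            result.append(high * 10 ** low_digits + low)
    return result

phones = [1444555777, 1444000123, 1443999999, 1444555001]
ordered = bucketed_sort(phones)
```

The remainders here fit in 20 bits, so the inner comparisons work on much narrower values than the full numbers; whether that is faster in practice depends on the list size and the hardware.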




Complex feeling based Machine Learning ML is known as AI..
To truly generate AI is not impossible; There is instability in the core; Fragmentations of motive...
Misdiagnosis; Error; Decay?

So we do need a foundation; In us Education; Metabolised Data..
Analysis & then..
Application to motive & goal.

We require to understand humour,
We require to understand {Art, Science, Feeling, Life}
We require a goal or two; A {Sophie reward}; B {action reward}; C {Pleasurable reward}
We Require, {Goals, Life, Feeling, Action, Motive, Interest} : Creative intellect



Operation precision reductions : Effects General : RS

Operation precision reductions affect & effect more than Machine Learning & yes we have known this for years!
But we can learn from ML; In that in machine learning like the mind; A lack of precision affects so many issues!

The mind is self evidently the first place;
We lack logic when we do not precisely learn; We do not learn all...
We however learn quickly on reduced precisions... We Learn Fast; But do we learn well?
In school we teach as high a quality precision(Quality Education); As we can; But like machine RAM; We lack either time or memory & in truth we can learn all our lives..

So our core issues in all methods of enactment of thought:


Quality of information

(Training)Requalification of information correctness
Thought process


Reality & Truth

Rupert S

+Useful operation precision reductions : RS

Useful operation precision reductions; I observe that reducing precision to 1Bit & 2Bit..

While enhancing the definition of a positive, Negative Dipole & thus enhancing speed..
Further reduces reasoning capacity; That in order to reduce Processor bandwidth for reasoning..

In the example of the XBox & PS5; DOT4 & INT4, INT8 & F16 & bF16; Apply considerable improvement to reductions probable error related to a lack of remainder or float value depth enhancement!

By reason of probability i assume a value of 4Bit & 2Bit to allow the smallest packing ability; Existing alongside the word reasoned!

To reduce to 1 & 0; I assume a definite statement that a Value Integer Solve in the form of a vector..
Is most probably the solution & that furthermore that in most cases; Projected in pure maths & code ASM,
Both SiMD; Float & Integer...

Reduction to multiple 2 Bit values in short Integer instructions; I will state however that no such value is further away than a statistics table or PHP Data-Set.

Rupert S 2023-06

"The application of CNNs to resource-constrained embedded platforms has been a challenge, leading to the emergence of CNNs with various lightweight techniques. BNNs [22] are representative lightweight CNNs obtained by compressing CNN activation and weights into 1 and −1
values instead of using single-precision floating-point data. We simplified the multiply–accumulate operation, which was previously complex and required multiple cycles in CLs, by replacing it with a simple bitwise operation using 1-bit XNOR and popcount operations [23]. While BN in neural networks using single-precision floating-point data involves complex operations, a BNN simplifies this process by adding an offset to the resulting value. BN has four fixed parameters for network inference operations. Because 𝜎 is always a positive value, it can be expressed by Equations (2) and (3), depending on 𝛾

Reference to Table 24 found in

BNNs compress weights and input data into single bits to significantly reduce memory usage and perform hardware-optimized parallel operations using bitwise operations such as XNOR and popcount. However, there are limitations to using BNNs for complex networks, such as multi-keyword detection, owing to the decrease in accuracy caused by lightweight techniques. To address this issue, we propose a TNN that maintains the input data as binary while ternarizing the weights. The TNN has higher accuracy than the BNN owing to its higher bit precision; however, it can still use the bitwise operation method, and both networks have similar operational processes.
2.3. Depthwise Separable Convolutional Neural Network
In a typical CNN, multiple three-dimensional kernels repeatedly multiply and accumulate input feature maps to generate multiple output feature maps, which is computationally intensive with large memory usage. To solve this problem, we applied a DS-CNN that is highly accurate compared with the same parameters while reducing memory usage. A DS-CNN performs the local and global feature extraction functions of a typical convolutional operation in separate layers. Depthwise (DW) convolution matches a single input channel to an output channel, excluding interchannel correlations and reflecting local features. Pointwise (PW) convolution is equivalent to 1 × 1 convolution, reflecting interchannel correlations (i.e., global features). Figure 1 shows CNN and DS-CNN. In this figure, the use of the same color (e.g., red, blue, yellow) represents input channels with the same index being used to generate corresponding output channels in DW convolution. Table 1 lists the number of parameters and computations in specific layers with a 3 × 3 kernel. In one example from the network used in this paper, a layer with 128 input channels and 64 output channels experienced an approximately eight-fold reduction in the number of parameters and computational complexity using the DS-CNN."
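
The XNOR and popcount replacement for multiply & accumulate described in the quote can be sketched as below. The encoding (+1 stored as bit 1, -1 as bit 0) and the helper names are assumptions for illustration, not the paper's implementation.

```python
def encode(signs):
    """Pack a list of +1/-1 values into an integer, one bit per value (+1 -> 1)."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word

def binary_dot(a_word, b_word, n):
    """Dot product of two n-element +1/-1 vectors using XNOR and popcount."""
    mask = (1 << n) - 1
    xnor = ~(a_word ^ b_word) & mask
    matches = bin(xnor).count("1")      # popcount
    return 2 * matches - n              # matches add +1, mismatches add -1

a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
fast = binary_dot(encode(a), encode(b), len(a))
slow = sum(x * y for x, y in zip(a, b))   # reference multiply-accumulate
```

Both paths give the same result; the bitwise path handles a whole word of weights per operation.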

Useful operation precision reductions
FPGA Implementation of Keyword Spotting System Using Depth-wise Separable Binarized and Ternarized Neural Networks
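
The roughly eight-fold reduction quoted for a 3 × 3 layer with 128 input channels and 64 output channels can be checked with a short parameter count. The sketch ignores bias terms, which is an assumption.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution layer (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k layer plus a 1 x 1 pointwise layer."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

std = standard_conv_params(3, 128, 64)          # 73728
ds = depthwise_separable_params(3, 128, 64)     # 1152 + 8192 = 9344
ratio = std / ds                                # about 7.9
```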


Precision Context of learning

Machine Learning : It is hard to say every function that we would use,

However we have years of experience of using computers to calculate precise maths..

So our objective from the past is to pick high precision maths to calculate graphs,
Now we can surmise the fact that high precision calculations have accuracy!

But in machine learning modeling we are heading for speed; On the other hand Maths Tools such as:

AVX & FPU : Very high precision; But we can use 16bit & 8Bit x many in AVX
BFloat F16b & F32b exist to allow us to explore precise results,
F4 F8, Int4 & Int8 exist to allow us to explore at speed & some times (at all :p),
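
To make the precision trade concrete, the sketch below rounds values to F16 and quantizes them to Int8 with one shared scale, then measures what comes back. The symmetric quantization scheme and the helper names are assumptions for illustration; F8 & F4 are left out because the Python standard library has no encoding for them.

```python
import struct

def to_f16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def quantize_int8(values):
    """Symmetric linear quantization to signed 8-bit with one shared scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map the 8-bit integers back to floats."""
    return [v * scale for v in q]

weights = [0.1, -0.5, 0.25, 1.0]
f16 = [to_f16(w) for w in weights]      # 0.5, 0.25, 1.0 survive exactly; 0.1 does not
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)         # each value is within half a step of the original
```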

We can surmise that most functions of a CPU are in fact available to machine learning ..

How so ?

Because we graph it!

Rupert S


RAM ADDER differential Inference (c)RS :

RAM Table Accumulated addition node network Accumulator with Accumulation comparison Inference


Inferencing 4Bit, lessons from the RS, 

Inference Tessellation Edge Enhancing : Detection <> Inference <> Interpolation Tessellation

Now in the case study we will be edge enhancing with an inferencer..

We do not assume we 4Bit inference; We assume any bit-width..

We however assume that we multibyte every inference so that we can fill the instruction with..

MPi multibyte parallel instructions.



& So on; for every instruction inference or edge, 4Bit, 8bit, ++Nbit

Now I have spoken to you before about edge detection in Python & observed that obviously this is a sharpening edge detection made to order!

So what do we do ?

4 Byte code: does ? A = B + C (edge interpolation, for training we assume the rule A + B = C)

We assume that if A + B = (C/2) , that they are the same C & then we...

A + C = (D/2) & B+C = (E/2),

And forever yep...
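
If the rule is read as averaging (C is the midpoint of A & B, then D the midpoint of A & C, E the midpoint of C & B, and so on), the repeated step is midpoint subdivision. That reading is an assumption, and the names below are hypothetical.

```python
def midpoint(p, q):
    """Average two points coordinate by coordinate."""
    return tuple((a + b) / 2 for a, b in zip(p, q))

def subdivide(points, levels=1):
    """Insert a midpoint between every neighbouring pair, `levels` times."""
    for _ in range(levels):
        refined = [points[0]]
        for p, q in zip(points, points[1:]):
            refined.append(midpoint(p, q))
            refined.append(q)
        points = refined
    return points

A, B = (0.0, 0.0), (4.0, 8.0)
once = subdivide([A, B])        # A, C, B
twice = subdivide([A, B], 2)    # A, D, C, E, B
```

An inferencer would replace `midpoint` with a learned estimate where the true shape is a curve; the straight average is only the training rule.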

So what do we do this for, We know A & B are a line or a curve?, So why not ask?

Is G/Z buffered Polygon { A , B, C, D & so on} & Then:

A + B = (C/2) & A + C = (D/2) & B+C = (E/2) But also Shape from Polygon:{ A , B, C, D & so on},

Now normally can & will!

But we do not "Inferencing what we already know!"; We inference what we do not!

For example exploding fragment polygons without a buffer (in a shader in the 64KB RAM Cache),

A mouse pointer that we do not cache! &or DMA Device pointer.

Rupert S


Code & Python & ONNX & TensorFlow : Edge TPU & Movidius Int8 offloaded U32 logic:

Transparent Tier cache logic for precision sorting (c)RS

The main point is Int8 needs to be transparent in its dynamic use for inferencing small precision batches...

My main argument is the application of High precision to low precision tiering..

Cache on load sort
Logical order grouping for main Precision RUN

var DTypes = {
Load types = { data types, { Table } : V1, V2, V3, Vn },

{ Cast Float F64, Cast Integer u64 },
{ Cast Float F32, Cast Integer u32 },
{ Cast Float F16, Cast Integer u16 },
{ Cast Float F8, Cast Integer u8 },
{ Cast Float F4, Cast Integer u4 }
};

var Sort = { 'by precision of value' };

sort values { dataset { layers = 1, 2, 3, n } };
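
A minimal runnable sketch of the 'sort by precision of value' idea: each value is assigned to the narrowest float tier that still reproduces it, and values are grouped by tier before the main run. The tier table, the function names and the exactness test are assumptions for illustration; F8, F4 & the integer tiers are omitted because the Python standard library cannot encode them directly.

```python
import struct

# Hypothetical tier table: name -> (struct format, bits).
TIERS = {"F64": ("<d", 64), "F32": ("<f", 32), "F16": ("<e", 16)}

def cast(value, tier):
    """Round-trip a float through the storage format of the given tier."""
    fmt, _ = TIERS[tier]
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

def narrowest_tier(value, tolerance=0.0):
    """Pick the narrowest tier that reproduces value within tolerance."""
    for tier in sorted(TIERS, key=lambda t: TIERS[t][1]):
        try:
            if abs(cast(value, tier) - value) <= tolerance:
                return tier
        except OverflowError:       # value is out of range for this tier
            continue
    return "F64"

def sort_by_precision(values, tolerance=0.0):
    """Group values by the tier they need."""
    groups = {}
    for v in values:
        groups.setdefault(narrowest_tier(v, tolerance), []).append(v)
    return groups

groups = sort_by_precision([0.5, 0.1, 70000.0])
```

With a zero tolerance 0.5 fits F16, 70000.0 overflows F16 but fits F32, and 0.1 needs F64; raising `tolerance` moves more values into the low precision tiers.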



The provided code relates to implementing a tiered caching system for inference tasks using different precision levels (Int8, F16, etc.),
On platforms like Edge TPUs and Movidius Myriad X..

Breakdown of the code and your argument:

Code Breakdown:

Data Types:

The code defines a dictionary called DTypes that maps load types (e.g., Float32, Integer u32) to cast operations for different precision levels (F16, u16).

A variable named Sort is defined, likely to indicate sorting based on precision.

Value Sorting:

The code suggests a function to sort values (sort values) based on dataset layers (1, 2, 3, etc.).

The Argument:

Proposing a system for using higher precision (e.g., F32) for initial loading and sorting,
Then transparently transitioning to lower precision formats (e.g., Int8) during inference for smaller batches..

Key Points:

Int8 Transparency: The goal is to make the use of Int8 transparent for inference; Meaning the system automatically switches to this lower precision format without affecting the overall functionality.

Tiered Caching: The code hints at a tiered caching system where data is loaded in a higher precision format and then potentially converted to a lower precision for inference on edge devices.

Precision Sorting: Sorting the data based on precision might be a strategy to optimize cache usage and inference speed.

Overall the approach focuses on using high precision for initial processing and then efficiently transitioning to lower precision for inference tasks on edge devices..

Most likely to improve performance and memory usage.

Further Discussion:

It would be beneficial to see the complete code implementation to understand how the caching and tier management work.

Optimizing the conversion between precision levels is crucial for minimizing performance overhead!

The effectiveness of this approach depends on the specific use case and the trade-off between precision loss and efficiency gains.



Multi-line Packed-Bit Int SiMD Maths : Relevance HDR, WCG, ML Machine Learning (Most advantaged ADDER Maths)

The rules of multiple Maths with lower Bit widths into SiMD 256Bit (example) 64Bit & 128Bit & 512Bit can be used

In all methods you use packed bits per save, so single line save or load, Parallel, No ram thrashing.

You cannot flow a 16Bit block into another segment (the next 16Bit block)

You can however use 9 bit as a separator & rolling an addition to the next bit means a more accurate result!
in 32Bit you do 3 * 8bit & 1 * 4Bit, in this example the 4Bit op has 5 Bit results & The 8Bit have 9Bit results..
This is preferable!
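
The 32Bit example above (3 * 8Bit with 9Bit results & 1 * 4Bit with a 5Bit result, 27 + 5 = 32) can be sketched with one guard bit above each value, so a single 32Bit add serves all four lanes. The field layout and names are assumptions for illustration.

```python
# Field layout, low bit first: (offset, value_bits). Each field keeps one
# spare guard bit above the value to catch the carry of a single addition.
FIELDS = [(0, 8), (9, 8), (18, 8), (27, 4)]

def pack(values):
    """Place each value in its lane, leaving the guard bits clear."""
    word = 0
    for (offset, bits), v in zip(FIELDS, values):
        if not 0 <= v < (1 << bits):
            raise ValueError("value does not fit its lane")
        word |= v << offset
    return word

def unpack(word):
    """Read each lane including its guard bit (9-bit and 5-bit results)."""
    return [(word >> offset) & ((1 << (bits + 1)) - 1) for offset, bits in FIELDS]

a = pack([200, 17, 255, 9])
b = pack([100, 3, 255, 15])
total = (a + b) & 0xFFFFFFFF      # one 32-bit add handles all four lanes
sums = unpack(total)              # [300, 20, 510, 24]
```

The guard bit only absorbs one addition; before a second add the carries have to be moved out or the lanes widened.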

2Bit, 3Bit, 4Bit Operation 1 , 8Bit Operations 3: Table

4 : 1, 8 : 3

4 : 2, 8 : 6
2 : 1, 7 : 8
3 : 1, 8 : 1, 16 : 3

Addition is the only place where 16Bit * 4 = 64Bit works easily, but when you ADD or - you can only roll to the lowest boundary of each 16Bit segment & not into the higher or lower segment.
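
Where no guard bits are available, a known bit trick keeps carries inside each 16Bit segment: add the low 15 bits of every lane, then fix up the top bit with XOR. The sketch below emulates 4 * 16Bit in a 64Bit word; the function names are hypothetical.

```python
LANE_HIGH = 0x8000800080008000   # top bit of each 16-bit lane
LANE_LOW = 0x7FFF7FFF7FFF7FFF    # remaining 15 bits of each lane

def add_u16x4(a, b):
    """Add four 16-bit lanes packed in 64-bit words; each lane wraps on its own."""
    low = (a & LANE_LOW) + (b & LANE_LOW)      # carries stop at bit 15 of each lane
    return (low ^ ((a ^ b) & LANE_HIGH)) & 0xFFFFFFFFFFFFFFFF

def pack_u16x4(values):
    """Pack four 16-bit values, lane 0 in the low bits."""
    return sum((v & 0xFFFF) << (16 * i) for i, v in enumerate(values))

def unpack_u16x4(word):
    """Split a 64-bit word back into four 16-bit lanes."""
    return [(word >> (16 * i)) & 0xFFFF for i in range(4)]

x = pack_u16x4([0xFFFF, 1, 40000, 123])
y = pack_u16x4([1, 2, 30000, 877])
lanes = unpack_u16x4(add_u16x4(x, y))   # [0, 3, 4464, 1000]
```

Lane 0 overflows and wraps to 0 without disturbing lane 1, which is the boundary rule described above.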

A: In order to multiply you need adaptable rules to division & multiply
B: you need a dividable Maths unit with And OR & Not gates to segment the registered Mul SiMD Unit..

In the case of + * you need to use single line rule addition (no over flow per pixel)..
& Either Many AND-OR / Not gate layer or Parallel 16Bit blocks..

You can however painful as it is Multi Load & Zero remainder registers & &or X or Not remainder 00000 on higher depth instructions & so remain pure!

8Bit blocks are a bit small and we use HDR & WCG, So mostly pointless!

We can however 8Bit Write a patch of palette & sub divide our colour palette & Light Shadow Curves in anything over 8Bit depth colour,

In the case of Intel 8Bit * 8 Inferencing unit : 16 Bit Colour in probably (WCG 8 * 8) + (HDR 8 * 8) Segments,

In any case Addition is fortunately what we need! so with ADD we can use SiMD & Integer Today.

Rupert S


Main Operation solves: Bit-Depth Conversions & Operations

Packed Bits, Multibyte Storage : u32, u64, u128

The storage of multiple bit operations with Sync Read & Write,
The purpose of this is to Read, Write & Store Operations on:

F16, F32, F64

In RAM of 32Bit, 64Bit, 128Bit

Values Storage Table

32Bit = [16bit:16Bit]
32Bit = [8bit:8Bit:8bit:8Bit]
32Bit = [4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit]

64Bit = [32bit:32Bit]
64Bit = [16bit:16Bit:16bit:16Bit]
64Bit = [8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit]
64Bit = [4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit]

128Bit = [64bit:64Bit]
128Bit = [32bit:32Bit:32bit:32Bit]
128Bit = [16bit:16Bit:16bit:16Bit:16bit:16Bit:16bit:16Bit]
128Bit = [8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit]
128Bit = [4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit]
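
The storage table can be exercised with a generic pack & unpack pair for equal-width fields. The function names are hypothetical and the field order (first value in the low bits) is an assumption.

```python
def pack_fields(values, field_bits, word_bits):
    """Pack equal-width unsigned fields into one word, first value in the low bits."""
    if len(values) * field_bits != word_bits:
        raise ValueError("fields must fill the word exactly")
    word = 0
    for i, v in enumerate(values):
        if not 0 <= v < (1 << field_bits):
            raise ValueError("value out of range for field width")
        word |= v << (i * field_bits)
    return word

def unpack_fields(word, field_bits, word_bits):
    """Split a word back into its equal-width fields."""
    mask = (1 << field_bits) - 1
    return [(word >> shift) & mask for shift in range(0, word_bits, field_bits)]

w32 = pack_fields([0x12, 0x34, 0x56, 0x78], 8, 32)    # 32Bit = [8bit x 4]
w64 = pack_fields(list(range(16)), 4, 64)             # 64Bit = [4bit x 16]
```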

Bear in mind that Integer 64Bit is 2 x 32Bit on AMD; So you can compute 2 operations at 32Bit per 64Bit operation,

Some 64Bit units are only 64Bit; So we need to know how many!

32Bit operations are fine! & Conversion of 16Bit value ranges into 32Bit Operations can still be within range of 16Bit Storage..
If we stick within the 16Bit value range on Multiply & ADD,
We can therefore simply post a 16Bit value range data set & expect to be able to Store 16Bit!

The simple method is to store 2 16Bit values in the same 32Bit table; like [16bit:16Bit] = 32Bit

With this we can Load, Store, Run & Save 8bit INT8 operations in 32Bit devices such as Alexa as 8bit x 4 = 32Bit, So we don't Waste RAM or resources!

But we still have access to 32Bit RAM Paging; But with values loaded in 4Bit, 8Bit, 16Bit, 32Bit & so on.

With NANO Android on F16 & F32 & MIPS the same & AMD, Intel, NVidia,
Learning F16 offers considerable value for performance with 16M Values!


Direct DMA 32Bit & 64Bit RAM : Multiple Sync 16Bit Texture:

A good example of where 8Bit & 16Bit Value load works well is in the case of the texture,
To load 4 x 16Bit into a single 64Bit Cache:

32Bit RAM = 16Bit, 16Bit
64Bit RAM = 16Bit, 16Bit, 16Bit, 16Bit
128Bit RAM = 16Bit, 16Bit, 16Bit, 16Bit, 16Bit, 16Bit, 16Bit, 16Bit

In the case of direct DMA, you would be aware that you have,
128Bit, 192Bit Bus on GPU
32Bit & 64Bit on CPU

So a direct 4 * 32Bit or 2 * 64Bit Cache load is a logically fast method to DMA directly from Cache to GPU!
In short you convert 8 x 16Bit into a 2x 64Bit DMA push; Which is very fast!

You can do the same with batches of vertices in many storage sizes.



On the subject of how deep a personality of 4Bit, 8Bit, 16Bit is reference:


Quantization modelling : RS : Physics III Slit Experiment

Expanding on potentials for precise machine learning has the same qualities as Maths & quantified research,

A fully qualified result is often required for deep thought & precise thought!,

But we do not always have the RAM or resources that we require; When we need to prioritise load & data sets to specific RAM & Processor availability or location, or by necessity..

Optimise our resource footprint & speed, while maintaining precision to our fully optimised values & data set needs & requirements.


Dynamic Scaling

Ideas of FP8-F8 & FP16-F16 Interpolation to 32Bit & 64Bit, Gamma Curves are usable(c)RS

Presenting the full precision neuron;

var Me = expand, {

preload = dataset; {Ds1, Ds2, Dsn }, { condition = Present };

var Present = { Datapoint set }; {

var CC = Compose Compressed {Brotli-G > ZSTD };

4Bit to N-Bit { Brotli-G(GPU Shader) Compressed Data Bit with Tri-Linear Interpolation & Extrapolation };

var Pf = Processor contains N Features { F16, F32, F64, FPU } * { N, N2, N3, Nn };
var Ex = Expand Points { Series Precise { F16:<FPU }, Series Median { Int8:<Int64 }, Series Low priority { Int2:<Int32 };

load Present;

run ML, {epoch1 < epochNN };

test results, {log : logNN};



"(SmoothQuant).The optimized model achieves >3X latency improvement with a custom dequantization kernel for FP16 inference. Although the work does not map to Int8 engine"

In view that inferencing is being activated in Int4 & Int8 & Int16 & Floats f16b F8 & F4,

Now my view is a vision of a Slit experiment in Physics; Now a slit experiment shows light photons in slices through a screen..


Ratio 1:2:4 on contained knowledge

Minimal Origin of mankind's knowledge : IIII < IIIIIIII < IIIIIIIIIIIIIIII Defined Summit of all power

My method is to compress the point node data with

So what we do is take advantage of patterns; Creating tables of 1111 1010 as examples; These compress well & can be short noted as patterns,

We can expand 4Bit into 8Bit inference & compress as patterns; The total data point is 4Bit if it is a pattern,
The subject is not predictable unless we pick the patterns!

We can however Quantize the memory footprint; The Double/Single precision operations may be faster! :L

We need the models to work in F16 & Int8 & Int4 after-all, But i see a reason to use Floats because sub-quantization does leave a remainder for us to compare..

That relevant 'F16' >=-


Study Subject Reduction :

Quantification Analytics with combined operations
Accurate and Efficient Collaborative Optimizations for Fast Generative AI on AMD GPU


Automatic 4bit Activation-Aware Quantization (AWQ), 

I am confident that 8Bit is still logical; 4Bit defines well..
We need DOT4 support from Chrome Dev; U32/8 & U32/16 with line by line Sign Arrays:

U32/8 U32/8 U32/8 U32/8 S1
U32/16 U32/16 U32/16 U32/16 S1

Reference "Shared exponent"
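
A shared exponent can be sketched as block scaling: every value in a block is stored as a small signed integer and the whole block carries one power-of-two exponent. The 8Bit mantissa width, the function names and the sample block are assumptions for illustration, not a specific format definition.

```python
import math

def quantize_block(values, mantissa_bits=8):
    """Quantize a block to signed integers that share one power-of-two exponent."""
    limit = (1 << (mantissa_bits - 1)) - 1            # 127 for 8-bit
    peak = max(abs(v) for v in values)
    if peak == 0:
        return [0] * len(values), 0
    # Smallest exponent e such that peak / 2**e fits within the integer limit.
    exponent = math.ceil(math.log2(peak / limit))
    scale = 2.0 ** exponent
    return [round(v / scale) for v in values], exponent

def dequantize_block(mantissas, exponent):
    """Rebuild floats from the integers and the shared exponent."""
    return [m * 2.0 ** exponent for m in mantissas]

block = [0.5, -1.25, 3.0, 0.0625]
mantissas, exponent = quantize_block(block)      # [16, -40, 96, 2], exponent -5
restored = dequantize_block(mantissas, exponent)
```

This block happens to round-trip exactly because every value is a multiple of 2^-5; in general the small values in a block lose precision to the largest one.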


Ideas of FP8-F8 & FP16-F16 Interpolation to 32Bit & 64Bit, Gamma Curves are usable(c)RS - 'ocp' F8 & FP8 or smaller with interpolation-microscaling-formats-mx-v1-0-spec-final

Self Trained Auto Sparsity ML

Evolution-ML CNN Self Trained Auto Sparsity - Hybrid multi-objective evolutionary model compression with convolutional neural networks

ML Batch Matrix MAP in FPGA

ML Compressed Dynamic16bit-8Bit - Hardware-friendly compression and hardware acceleration for ML Transformer

Matrix Processors - Memory & command - All-Digital Compute-In-Memory FPGA Architecture for Deep Learning Acceleration

Matrix Processors - Inline Ram & Command { CMD : RAM }:{NET}

TAC (Tiny Anomaly Compression)

Inference on any device with a C99 compiler

to run without activating C99; Installs under Python 3.10+
git clone

With EmLearn you can compile really tight models of tensors & random forest & Gaussian Matrix,
These are very good for:

A1: Anti-Aliasing ( Gaussian, Tensor error diffusion, forested Random spread )
A2: sharpening & Shaping ( Tensor Edge detect with enhance, Gaussian estimation & line fill, Random forest A to B to D: E to B to F X + )
A3: Line & Curve estimation fills & Tessellation ( forested Random spread (Dither fills) & A1 & A2 & Differentiation in 3D Space : 1:2:3{ A B C : E B F }
A4: HDR & WCG, Combinations of dithering in colour space & light/Shadow differentiation in 3D Space : 1:2:3{ A B C : E B F }

36Minutes UpscaleDL

Megatron Classifies Images in Web Tensors
11m 34m space then 11m;

Mr420Megatron Classifies Images in Web Tensors & you know he's good right, That is just what he feels! for real bro

Batch 256 1 layer 128 neurons RNN

Batch 512 2 layer 256 neurons RNN GRU

You can see that 2 layers takes longer to train; More Stable though!
It would also take more RAM & processor time.

Phishing detection systems train fast, maybe there is hope yet for AV

RX580 supports DirectML Feature level 2, What does that mean? Technical question!

Rupert S

Batch Size 240W>65W, 32GB{64, 16}, 15W>5W, 4gb{16, 1} : 16, 8, 4 seems optimal,
Time taken compatible:

Coral TPU Micro Edge Learning with performance arranged Intel FPGA Arria 10 SX SoC Kit & Google Coral, NVIDIA Jetson Nano and CPU ROCK 4C Plus

ML Document Caches - USB Acceleration & Small devices - Combining Machine Learning and Edge Computing Opportunities Frameworks & Devices

Python & JS Configurations


ML Tensor, ONNX: a machine learning model that uses direct compression & higher accuracy in preference to Bit Reduction; Because reducing Bit Depth on decisions can make results overflow your maximum ML Node Point Depth...

Because of point overflow at low bit depth (less than 4Bit in most cases), we plan to use compression to multiply the RAM available to the ML..

With Brotli-G the compressed data can be decompressed directly inside the GPU & therefore the results are much faster & more efficient for us..

We can further improve by selecting compression-compatible patterns such as 1111<1toN or 1010<10*N where N = Multiples of, for example, 1234 (repeating); R * N = RN,

So we can maximise compression in Processor & not need to pass uncompressed data points,
We Cache & Decompress & Recompress as required.
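A minimal sketch of the cache, decompress & recompress idea above. Brotli-G is not in the Python standard library, so `zlib` stands in for it here (an assumption, the ratios will differ); the names `pack_weights` & `unpack_weights` are hypothetical. It only shows that repeating patterns pack far smaller than random values of the same length.

```python
import random
import zlib
from array import array

def pack_weights(values):
    """Serialise 8-bit weights and compress them (zlib stands in for Brotli-G)."""
    raw = array("B", values).tobytes()
    return zlib.compress(raw, 9)

def unpack_weights(blob):
    """Decompress back to the original list of 8-bit weights."""
    return list(array("B", zlib.decompress(blob)))

# A repeating pattern (1, 2, 3, 4 repeated) versus random weights of equal length.
repeating = [1, 2, 3, 4] * 4096
rng = random.Random(0)
noisy = [rng.randrange(256) for _ in range(len(repeating))]

packed_repeating = pack_weights(repeating)
packed_noisy = pack_weights(noisy)

print("raw bytes:", len(repeating))
print("repeating pattern packed:", len(packed_repeating))
print("random values packed:", len(packed_noisy))
```

The round trip is lossless, so the model sees the same values after decompression; only the storage & transfer size changes.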


ML tensor + ONNX Learner libraries & files


Application of Data Compression to ML

Some examples of how Brotli-G compression can be used to improve the performance of machine learning models:

Compressing model parameters: Brotli-G can be used to compress the weights and biases of machine learning models.

This reduces the amount of memory required to store the model, which can be beneficial for deploying the model on devices with limited memory.. For example:

Brotli-G could compress a model with 100 million parameters from 100MB to 50MB (an illustrative figure; the real ratio depends on the data).

Compressing model inputs and outputs: Brotli-G can also be used to compress the inputs and outputs of machine learning models;

This can reduce the amount of data that needs to be transferred between the model and the data source or sink.. For example:

Brotli-G can be used to compress images from 1MB to 500KB.

Compressing model activations: Brotli-G can also be used to compress the activations of a machine learning model,

This reduces the amount of memory required to store the intermediate results of the model.. For example:

Brotli-G can be used to compress activations from 500MB to 250MB.

In addition to these specific examples, Brotli-G can also be used to compress other types of data that are used in machine learning, such as text & other data.

Brotli-G is a high-performance compression algorithm that can provide significant performance improvements for machine learning applications.

These examples demonstrate the potential of Brotli-G to improve the performance of machine learning models. As Brotli-G becomes more widely adopted, we can expect to see even more innovative uses of powerful compression algorithms.



Inferencing & Classification : Protocols

To clarify: inferencing units, such as those from Intel, AMD & ARM, are expressly created to run Edge detect & other machine learning comparators with a minimal instruction load..

The Inferencing instructions contain the logic of comparison.. & furthermore are created to facilitate the comparison of Inference tasks..

Most logically, a wise person could see scope for edge detecting expressly with edge sharpening & shaping in mind; But also Trilinear filtering & of course Tessellation ..

Now I believe you have Displays, Cameras & Audio Systems to optimise!

Now we know that we can also improve latency related issues such as frame tearing detection, jitter, QFT & VRR.

How ? Inference all of the latency issues of frame arrival time, torn frames & misaligned audio & Electric signal jitter in what is effectively an Ethernet protocol AKA Frame Transmission & Reception ...
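As a rough sketch of measuring the frame arrival problem before any inferencing is applied: the function below takes arrival timestamps in milliseconds & reports the mean interval, the jitter (standard deviation of the intervals) & which frames arrived late. The name `frame_jitter`, the `late_factor` of 1.5 & the sample timestamps are assumptions for illustration only.

```python
from statistics import mean, pstdev

def frame_jitter(arrival_ms, late_factor=1.5):
    """Return (mean interval, jitter, indices of late frames) from arrival times."""
    intervals = [b - a for a, b in zip(arrival_ms, arrival_ms[1:])]
    avg = mean(intervals)
    jitter = pstdev(intervals)  # spread of the intervals around the mean
    # A frame is flagged late when its gap exceeds late_factor times the mean gap.
    late = [i + 1 for i, gap in enumerate(intervals) if gap > late_factor * avg]
    return avg, jitter, late

# Hypothetical 60Hz stream where the fifth frame (index 4) arrives late.
arrivals = [0.0, 16.7, 33.3, 50.0, 91.0, 107.7]
avg, jitter, late = frame_jitter(arrivals)
print(avg, jitter, late)
```

The same three numbers can be fed to a small classifier as features; audio packets & signal edges can be timed the same way.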

More ? Why not :L

Rupert S


Int8:SiMD : Maths & Logic

This is about how you think about components such as INT8, INT4(Xbox) & SiMD, You have to classify by necessity & optimise the structure.

You can shape the game reality with specific control objects & statics!
Maths in SiMD & Int8 & Machine Learning in Int8 & SiMD; SiMD is hard maths, Int8 is soft edge inference...

Both are maths; But soft logic is not a PROOF Math but can be proof; Hard math is not 'Invention & Imagination 'Exactly''

But we have both to improve performance.



SiMD Performance : RS

Performance per WATT of MMX & MMX+ & SSE & AVX Machine Learning & Shader code; Is a matter of 8x8Bit & 16x16Bit Code on GPU

Our role is to reduce complex un-cache-able ML to Cache Enabled 64KB
Modelling of 1990's without Quality loss of 32Bit++ 64Bit+

8x8Bit sharpening MMX Becomes Dual Pipe (16x16bit)*2 in 32Bit Dual 16 Pipeline & Twice as sharp
Machine Learning method for MMX Is Fast & Cheap, MMX2 More Compatible,
Intrinsic improvements such as combined ops & DOT4 Further improve the performance of under 1MB Code..

Performance & Function per WATT, Is unbeaten; Let us prove it!

For example Quake has MMX Emulation & MMX Dithering code on 3D Textures,
In 8Bit 256 Colours dithering is noticeable; In 15Bit to 32Bit the small shade difference in dithering colour is subtle & flawless,
Improving light subtlety & Colour palette WCG & HDR 10Bit to 16Bit per channel.
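To illustrate why dithering hides a small colour palette, here is a minimal one-dimensional error diffusion sketch. It is not the Quake or MMX code, only an assumed simple variant: each pixel is rounded to the nearest available level & the rounding error is pushed onto the next pixel. The name `dither_row` is hypothetical.

```python
def dither_row(row, levels=2):
    """1D error diffusion: quantise values in 0..255, carry the error forward."""
    step = 255 / (levels - 1)          # distance between available output levels
    out, error = [], 0.0
    for value in row:
        wanted = value + error         # target brightness including carried error
        q = min(levels - 1, max(0, round(wanted / step)))
        shown = q * step               # brightness actually displayed
        out.append(int(shown))
        error = wanted - shown         # pass the difference to the next pixel
    return out

# Mid grey on a 1-bit display becomes alternating white & black.
print(dither_row([128, 128, 128, 128]))
```

With more levels per channel the carried error gets smaller, which is why the dithering becomes subtle at higher bit depths.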

SiMD & Int8 & dp4a & F16/F32/F64>:

SiMD runs repeating parallel batches of instructions & can still side load data,
Data is loaded into the 'calculation set' (single instruction, multiple data).

SiMD Consist of 8Bit to 64Bit Long & Floats,
SiMD are simple instructions; Or so they think; SiMD are relatively complex instructions..
For example 4/1 of a page full of arithmetic code; However our goal is to use Heuristics & logic to circumvent the Artifacts/Errors in self generated code,

In addition to using problem solving tables to choose instructions that advantage our analysis (Machine Learning),
We also can choose the most probably optimal code type.

Our outset objective is to decide if we want to use CPU Feature types:


Depending on the Mathematical Qualities of each ML Node & the questions they are asking,
For example:

A simple ResNet Image identification uses edge detect & for that we need for example SiMD Matrix Edge Detection
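A small sketch of matrix edge detection, assuming the Sobel kernels as the detector (the text does not name one). It is written as plain scalar Python for clarity; every output pixel is independent of the others, which is what makes the same loop a natural fit for SiMD. The name `sobel_edges` is hypothetical.

```python
def sobel_edges(image):
    """Return |gx| + |gy| for the interior pixels of a 2D list of brightness values."""
    kx = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))    # horizontal gradient kernel
    ky = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))    # vertical gradient kernel
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]            # border pixels stay 0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0
            for j in range(3):
                for i in range(3):
                    p = image[y + j - 1][x + i - 1]
                    gx += kx[j][i] * p
                    gy += ky[j][i] * p
            out[y][x] = abs(gx) + abs(gy)
    return out

# Test image: dark left half, bright right half, so one vertical edge.
image = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
edges = sobel_edges(image)
print(edges[2])
```

Only the two columns that straddle the brightness change respond; flat areas stay at zero.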

Speech requires identifying Words in a codec, So obviously we need a Decoder & Encoder,
Word identifiers & correctness checking; But firstly we need to identify accent to correctly choose words..

We also need to classify words by Idea grouping (DataBase, Open Database)

As you can see; We will be defining many of these function groups as SiMD & Float,
Effective use of Int8 differentiation, Comparators & Maths operations has many benefits; So does JIT Compile.



Solve Table of Statistically provable Machine Equates & Solves : Table of function competitors & Operators.

Runtime Library - Multiple Solve Table

I would like a Solve Table of Statistically provable Machine Equates & Solves that make the equivalent of Maths Compilers such as Rust & Fortran

For example basic ML code test function loops are basically compatible with X-OR Comparators on AVX! Other functions such as greater or less than; Are AVX Compatible.

Machine Learning : List of actions that are SiMD Baseline: Statistical Observance and Solve Tables

Yes or no comparator X-OR
Memory array Byte Swap
Greater or less than with swap or with X-OR Roll
Memory save & store
Edge comparisons
Compares (Colour, Math, Equate, Target, Solve if)

There are more! Statistical Observance and Solve Tables.
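A few of the baseline actions listed above can be sketched in scalar Python; on AVX the same operations would run across a whole register at once. The function names are hypothetical & this is only one possible reading of each list entry.

```python
def xor_equal(a, b):
    """Yes or no comparator: XOR is zero only when every bit matches."""
    return (a ^ b) == 0

def byte_swap32(value):
    """Memory array byte swap: reverse the byte order of a 32-bit word."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")

def compare_swap(a, b):
    """Greater or less than with swap, done with XOR instead of a branch."""
    mask = -(a > b)              # 0 when already ordered, -1 (all ones) when not
    diff = (a ^ b) & mask        # XOR difference, kept only if a swap is needed
    return a ^ diff, b ^ diff    # smaller value first

def xor_diff_map(row_a, row_b):
    """Edge comparison: non-zero entries mark where two pixel rows differ."""
    return [p ^ q for p, q in zip(row_a, row_b)]

print(xor_equal(0b1010, 0b1010))
print(hex(byte_swap32(0x11223344)))
print(compare_swap(9, 4))
print(xor_diff_map([1, 2, 3], [1, 7, 3]))
```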

Examples 2:

Shape compare is a matter of inner & outer Vector : Comparison & X-OR, Larger outside & X-OR The differentiation:
By Dot,
By Mass (non literal dot difference comparator by axis),
Actual Mass
Density : Lumina, Weight, Mole, Mass / Area

Edge Solve : X-OR ~= Colour, Lumina, Shade, Vibrancy, Distance, Matrix Solve 3D>=2D Flattened Comparator
If = X-OR=N<0.0001 Then Compare &= Mutex Solve / Average

Polygon Join/Merge Tessellation : If Model = Same (T1 + T2 If (T1 + T2)/2 = Difference Less Than 0.0001 | = Merge/Converge
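One possible reading of the merge rule above, as a sketch: two vertices are treated as the same point when every axis differs by less than 0.0001 & the merged vertex is their average. The names `within_tolerance` & `merge_vertices` are hypothetical, and the per-axis comparison is an assumption about what 'Difference' means here.

```python
TOLERANCE = 0.0001  # threshold quoted in the text

def within_tolerance(a, b, tol=TOLERANCE):
    """True when every axis of a and b differs by less than tol."""
    return all(abs(x - y) < tol for x, y in zip(a, b))

def merge_vertices(t1, t2, tol=TOLERANCE):
    """Return the averaged vertex when t1 and t2 converge, otherwise None."""
    if within_tolerance(t1, t2, tol):
        return tuple((x + y) / 2 for x, y in zip(t1, t2))
    return None

print(merge_vertices((1.0, 2.0, 3.0), (1.00005, 2.0, 3.0)))  # close: merged
print(merge_vertices((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)))      # far apart: None
```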


Audio, Video & High precision Float ML

tensors & full onnx configuration : Upscaling : While we are not sure how much ML we need & at what precision,

We can be sure that 32Bit (per channel) Value RGBA (Multiple layer) requires at least 8Bit to 16Bit per channel final precision; So here is a list:

Required Value of output, Neural Network precision guide table: RS

8Bit, 10Bit, 12Bit, 16Bit

Input network precision average bit retention (for RAM some error is allowed)
6Bit, 8Bit, 10Bit, 14Bit, 16Bit

Classifiers as we know can be,
Int 2Bit, 4Bit, 8Bit, 16Bit, 32Bit
2 Bit is unlikely & 32Bit is for Dream Smooth 16Bit+ Precision output

Output Float (Mostly FP & F16b)
16Bit = { 8Bit, 10Bit, 12Bit }
24Bit, 32Bit, 64Bit = { 16Bit, 32Bit, 48Bit }
We can upscale : Audio, Video, Content & Polygons, We classify Quality by expectations & Quantify by percent %

Rupert S


8Bit vs 16Bit vs 32Bit

Stitching wounds is an example we can use to compare inferencing bit depth:
An 8Bit reference photo constitutes approximately 1cm² Black & White / Grayscale 300ppi, maybe 1/2cm² Colour 8Bit 150PPI,

16Bit reference constitutes approximately 6cm² grey scale 600ppi, 3cm² Colour 15Bit 300ppi.

32Bit single precision still has more to examine.

Both 8Bit & 16Bit Inference offer a solution.

(c)Rupert S

Bit Depth and Colour Representation:

A bit is a fundamental unit of information in computing, representing either a 0 or a 1.

Bit depth refers to the number of bits used to represent a colour value for a single pixel in an image.

Higher bit depth translates to more possible colours or shades of grey.

An 8-bit image can represent 2 raised to the power of 8 (2⁸) which is 256 colour values,
This is often enough for basic images and applies well to grayscale images with high precision (300ppi in example).

A 16-bit image can represent 2¹⁶ (65,536) colour values, offering a significant increase in colour detail,
This can be beneficial for colour reference photos (like the 300ppi colour example).
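The arithmetic above can be checked with a few lines. This sketch counts the levels available at each bit depth & maps a brightness in the range 0.0 to 1.0 onto the nearest level; the names `levels` & `quantise` are hypothetical.

```python
def levels(bits):
    """Number of distinct values a single channel can hold at this bit depth."""
    return 2 ** bits

def quantise(value, bits):
    """Map a brightness in [0.0, 1.0] to the nearest integer level."""
    top = levels(bits) - 1
    return round(value * top)

for bits in (8, 10, 12, 16):
    # Step size is the smallest brightness change the depth can represent.
    print(bits, levels(bits), 1.0 / (levels(bits) - 1))

print(quantise(1.0, 8), quantise(1.0, 16))
```

The step size shrinks by a factor of about 256 between 8-bit & 16-bit, which is where the extra subtlety in shade comes from.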

Bit Depth and Image Quality in Stitching

In the context of stitching wounds together, accurate colour representation and detail are crucial.

An 8-bit grayscale image at 300ppi might provide enough detail for basic analysis,
But a 16-bit image (or even higher) would likely be preferable for capturing subtle variations in skin tone and tissue.

The provided information suggests that 16-bit color images might offer a good balance between detail and file size for this application (around 3cm² at 300ppi).

32-bit and Beyond

While 32-bit images offer an even greater range of colors, they might not be necessary for tasks like wound stitching, and would likely come with increased file size and processing demands..

Important Considerations

The suitability of a bit depth depends on the specific application.

File size also plays a role - higher bit depth images require more storage space.

Processing power required to manipulate the image can also be affected by bit depth.


TPU is discovering a new market in L1 Server class NPU Share; Minimal footprint NPU Class EdgeTPU has the edge you can't match..
Edge Server class mPCIe M.2 NPU learning.

An 8bit S Curve is 200 points curves over 15 Frames, now that is an Edge TPU

Remember that without Jenny, his dream of identifying cells wouldn't have come today, Jenny inspired the Resnet 50m photos to cure cancer story with her great energy; Resnet-50 Cell identification program

You may be wondering but as a doctor you may already know that they have a 3D XRay to destroy cancers,
But you might not know how Resnet50 could isolate & destroy cancer cell clusters in a 3D XRay/Image/MRI scan, Resnet-50 Service It takes 50m images identified to cut all polyp cancer cells from a victim,
Coral edge TPU and Movidius are the economy answers for cloudflare and

@cf/microsoft/resnet-50 : 50-layer deep image classification CNN trained on more than 1M images from ImageNet

Worthy configurations for consoles such as dentists' computers :
Cancer & searches for tissues such as FATS in the veins of the heart,

Fats in veins constitute an average width of healthy vein being a measurable statistical normal,
The fat amounts stuck to the vein constitutes an abnormal or statistical deviation..
Many of these measurements need official verification & should be signed as verified.

Heart pulse rate versus body size & arrhythmic or statistical variances beyond normal; That are not seen as healthy (if you verify knowledge).

There are many small tasks that the body does that are equivalent to vehicle verification & health checks..

The results of the statistical normal & small task..
The task that has many points of interest & thus takes hours for people to verify,
Computers do these tasks better & quicker.

ApplicationSensiMelia (c)RS

I estimate a tip of 15cents per client per hour would make the application work,

In the case of diabetics & other statistical anomalies like heart rate,
The App that works is a combination of Lamba LLM & statistics & average deviations in 8Bit inferencing,

Perfect for EdgeTPU.

(c)Rupert S

Cancer Research References

Skin cancer

Prostate Cancer

Breast Cancer

Deep learning radiomics based prediction of axillary lymph node metastasis in breast cancer

Improving image classification of gastrointestinal endoscopy using curriculum self-supervised learning

Cervical cancer
Prediction of lymph node metastasis in operable cervical cancer using clinical parameters and deep learning with MRI data: a multicentre study

DeepCPD: deep learning with vision transformer for colorectal polyp detection

A Combined Ensemble Model (CEM) for a Liver Cancer Detection System

MRI ML Enhancement
Deep-learning-based reconstruction of under-sampled MRI to reduce scan times, a multicentre retrospective cohort study

PulmoNet: a novel deep learning based pulmonary diseases detection model

An efficient image classification of lung nodule classification approach using CT and PET fused images

Salivary gland tumours
Deep learning based ultrasound analysis facilitates precise distinction between parotid pleomorphic adenoma and Warthin tumour

Rapid and Label-Free Histopathology of Oral Lesions Using Deep Learning Applied to Optical and Infrared Spectroscopic Imaging Data

Brain Cancer
A multi-class brain tumour grading system based on histopathological images using a hybrid YOLO and RESNET networks

Deep-learning quantified cell-type-specific nuclear morphology predicts genomic instability and prognosis in multiple cancer types

Multiple path trained cancer & diagnostics study with full networks and specifically tuned: Progress : :D

Medical Data for ML Processes:
The data that support the findings of this study are openly available on Kaggle:


Machine Learning Processed 3D Fully Masked Identified Groups : MLp-MaskedIG screen & cleanse:

Dealing with Resnet Identify Masking & Precise Ion control in 3D Fully Masked Identified Groups

Observing that directed energy reduces radio exposure & sickness; works with surgeries also & is human processed or robotic.


Incident observations , download entitlement

Extra layer reference :

Retraining the last layer or Repointing a network;
In terms of ourselves this constitutes Retraining our degree along specialisation tasks,
That is the method RS

We need to clip the last layer or re-profile the vision application if we wish to add networks like cancer or germs to a pre-trained model,

According to them we repoint nodes, or we clip the inferencing layer & retrain the network before inferencing..
Exact referencing is complex; But we need to retrain or repoint GANs and inferencing networks..

Pre trained networks that are not specific to our task cannot add nodes for tasks without re imprinting the network optimally or shaving off the last layer to further our identify tasks.
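A toy sketch of retraining only the last layer. The 'pre-trained' part is a fixed function that is never updated (a stand-in, not a real ResNet), & only a new final layer, here a simple perceptron, is trained for the new task. All names, the data & the task (label 1 when a + b is at least 2) are assumptions for illustration.

```python
def frozen_features(x):
    """Stand-in for the pre-trained layers: fixed, never updated in retraining."""
    a, b = x
    return (a + b, a - b, 1)     # the trailing 1 acts as a bias input

def score(weights, x):
    return sum(w * f for w, f in zip(weights, frozen_features(x)))

def predict(weights, x):
    return 1 if score(weights, x) > 0 else 0

def train_head(samples, labels, epochs=20):
    """Retrain only the final layer (a perceptron) on top of the frozen features."""
    weights = [0, 0, 0]
    for _ in range(epochs):
        for x, label in zip(samples, labels):
            error = label - predict(weights, x)      # -1, 0 or +1
            feats = frozen_features(x)
            weights = [w + error * f for w, f in zip(weights, feats)]
    return weights

# New task added to the network: label is 1 when a + b is at least 2.
samples = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (2, 1)]
labels = [0, 0, 0, 1, 1, 1]

head = train_head(samples, labels)
print([predict(head, x) for x in samples])
```

Because the earlier layers are untouched, the retraining is cheap; the cost is that the new task can only use features the frozen layers already produce.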

Rupert S

French reference:

PCIe Acceleration modelling for Medical grade 3rd World #FirstClass : Question is, are you MacGyver? I Am ;D #DoctorLove

Your standard medical console is most probably using standard Python acceleration (an older version),
Most likely a cancer screening could shave 30 seconds from your diagnostic timeline..

If you have one of the following available:

Hailo-8, PCIe, M.2, M.2 in a PCIe Card such as a compatible WiFi M.2 E-key or AE Key...

Question is, are you MacGyver? I Am ;D

Hailo3/5 with phiza & the like who 'donated it'

CORALS on sale 4TOPS, 8TOPS, Your choice, What you need for HPC 09:33 06/03/2024 : RS

ML tensor + ONNX Learner libraries & files
Model examples in models folder

The perfect Proposal RS


FPGA BitFile & Code Opt (c)RS 2021-01

In my view heuristics in compilers are a choice for those who do not wish to include direct ML compiled into their code,
This is understandable in terms of terminator & cylons & indeed flawed beings or even good ones with depression!

However the application of branch optimisation is a sample code optimisation that can 'Plug In' to branch caching on the CPU & GPU.

Heuristics are not just code in the compiler; They are also micro code selecting a probable branch; Although code that forces a branch can be flawed..

Both heuristics, Branch probability selection & ML can run in parts of the code to select probable path!

Yes fundamentally any code that modifies behaviour is a catch bullet frame for not sound 'Fortrans code is rock solid' Rust is also supposed to be solid.

Including soundly made heuristic code & branch probability code ML in your inline routines; 'Very much interpretive master jedi'; But it can be done!

Question is How big? & how fixed?

25KB per 3MB on average?

ML & Heuristics like my application FPGA BitFile & Code Opt (c)RS 2021-01

can be applied at runtime & remain only for selecting the fastest path or the best; In terms of which Processor function to run code for.
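A minimal sketch of runtime path selection: time each candidate implementation once on a sample input, then keep the quickest for the rest of the run. The two candidate functions & the name `pick_fastest` are hypothetical; a real selector would choose between processor-specific code paths rather than two Python loops.

```python
import time

def pick_fastest(candidates, sample, repeats=5):
    """Time each candidate on a sample input and return the quickest one."""
    best, best_time = None, float("inf")
    for fn in candidates:
        start = time.perf_counter()
        for _ in range(repeats):
            fn(sample)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = fn, elapsed
    return best

def sum_loop(values):
    """Plain loop path."""
    total = 0
    for v in values:
        total += v
    return total

def sum_builtin(values):
    """Built-in path, usually faster but the selector does not assume so."""
    return sum(values)

chosen = pick_fastest([sum_loop, sum_builtin], list(range(10000)))
print(chosen.__name__, chosen(range(10)))
```

The selector itself is tiny, which fits the small fixed code budget discussed above; which path wins depends on the machine it runs on.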

(c)Rupert S


TOPCloud Scaled Flexible WebASM & WebGPU & MathML!

Quite flexible for use on Monitors & TV's; Light processor load on simple tasks & offloadable such as TOPCloud!

You may be thinking Offloading is impracticable because that requires one of two things:

JIT Compiler Dongle..
USB device such as Firestick or GPU & CPU (With OpenCL Compat)

Server! so internet & service provision!
Impossible? No; WebAdvert supported TV's need both!
So why not HPC TOPCloud? could make a HOT TV a lot cooler & Eco friendly with Server repeating tasks:

Quality Service
Service availability

TOPCloud Offload Logic:

In terms of WebASM & WebGPU & MathML; TOPCloud provides sufficient advantages to be considered a core utility..

While Offloading repeating content such as Siteload core stack (Server) & Localising configuration such as Webpage size & DPI & Dynamic font arrangements that require thought.

In terms of Offloaded function & Efficient system load for large configurations..

Especially efficient configurations such as TPU, Coral, GPU work & Cloud CPU that have large optimised stacks & installed drivers.



#Sound Strategy game TOPCloud (c)RS

PCM & MP4 are 2D/3D Image so GPU Helps there also with 3D Audio mapping!
Games do not require cloud processing of images & a lot of local strategies are procedural Heuristic

You see RDP has GPU Connect (my innovation I might add) So Bluetooth & Wifi can connect RTP GPU; The port specifics are not particularly important; However a device such as a music streamer can have ML TOPS available locally & from the cloud,

Due to how the TOPCloud strategy works with localised ML TOPS; Not all data has to be sent or received.. For example all Audio 3D Profiles for HQ Room audio can be done within a few MB of data; With some hard work? 150Kb of data & so in reach of phones & mobile!

Gaming is an example here. I give TickTackToe as the example where all that a device like Alexa or Google smart device has to think is Which square? but..

No physical picture needs to be sent for the game to be played & if required a small TickTack Strategy ML is desired locally for a quicker response!
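A sketch of how small the local strategy can be: the board is nine characters & the only thing that has to be sent is a square index from 0 to 8. A simple rule-based strategy stands in for the small ML model here (an assumption); the names `WIN_LINES` & `choose_square` are hypothetical.

```python
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def choose_square(board, me="X", other="O"):
    """Return the square index to play: win if possible, else block, else best free."""
    for player in (me, other):           # first look for a win, then for a block
        for line in WIN_LINES:
            marks = [board[i] for i in line]
            if marks.count(player) == 2 and marks.count(" ") == 1:
                return line[marks.index(" ")]
    for preferred in (4, 0, 2, 6, 8, 1, 3, 5, 7):   # centre, corners, edges
        if board[preferred] == " ":
            return preferred
    return None                          # board is full

print(choose_square("XX OO    "))   # X completes the top row
print(choose_square("OO  X    "))   # X blocks the top row
print(choose_square(" " * 9))       # empty board: take the centre
```

The whole game state fits in nine bytes & the reply in one, so no picture has to travel in either direction.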

You see with a low latency GPU RTP & GPU RDP connection to cloud GPU; Most localised thinking TOPS can be carried out in Seconds if not milliseconds & PCM & MP4 are 2D/3D Image so GPU Helps there also with 3D Audio mapping!

Rupert S


Core features of TOPCloud:

RTP ML TOPS are a processors friend

3D audio mapping & spatialization for realistic sound effects
3D Vector Support for various audio formats such as PCM, MP4, OGG, and WAV

Low latency & high bandwidth connection to cloud GPU servers via RDP

Procedural & heuristic algorithms for generating game scenarios & strategies & 3D Audio & Visuals
Localized & cloud-based machine learning models for optimizing game performance & user experience

RTP GPU Connect technology that allows users to access GPU resources from any device with Bluetooth or WiFi

TOPCloud is a revolutionary 'TOPS' way to enjoy & create audio games using your own music & the power of the cloud. Try it today & discover a new dimension of gaming!


Scaling; We can classify by colour or creativity. (c)RS

If you use TOPCloud, you can share between different displays in the TOP's Sense..
but mostly you would need cloud presence,

Mostly this would be about making the most out of TOP heavy Business GPU & personal ones in your computer or consoles.

But sharing common tasks such as scaling movies by type or by identifying a single movie to upscale...

Now you might be asking what we would be doing there?
Well a single movie uses the same materials in our ML; We can analyse the class & optimise the scaling by class..

For those familiar with games & FSR; We familiarise our code with a single game!
By doing this we improve our product and can therefore classify by:

Type, FPS for example & RTS

We can classify by colour or creativity...

We do not simply have to roll the dice on General Scaling, We can use classifiers:

Frame Rate
Colour & Composure

Rupert S

PoCL Source & Code


We all think our own way; Potential is always there on a Runtime Library - Multiple Solve Table

Machine learning | Equate ~= Multi Layer Wavelet Abstraction

(c)Rupert S 2022-10

This one will suit Dedicated ARM Machine in body armour 'mental state' ARM Router & TV
(ARM Learning 4K ROM; Safe Larger USB ROM)

Android & Linux ARM Processor configurations; routers & TV's upgrade files, Update & improve


Python Deep Learning: configurations

AndroLinuxML :

Linux :

Windows :

Genuinely good JS + Python & configuration work, Windows, Linux, ARM



Machine Learning SDK's,
You may not have a Machine Learning SDK to accelerate your GPU/CPU/Device

3 main ones, but Python does not guarantee an accelerator!
Obviously Python Builds with Accelerators work!

HW Build Source : Upscale DL



Neural Engine

ML List & Services

Tokma ML



Training Networks

Coral TPU Micro Edge Learning with performance arranged Intel FPGA Arria 10 SX SoC Kit & Google Coral, NVIDIA Jetson Nano and CPU ROCK 4C Plus

With both USB Devices being 8Bit INT, I would imagine all of the models would run on both in 8Bit INT

39$ 2x Edge TPU, Prefer 6x or 8x M.2 & PCI 16x & 32x with 4GB+ RAM

You may be aware I promote this product the Coral Edge TPU & the Movidius USB by Intel,
You may not be quite aware of how well they accelerate! :D


EdgeTPU Coral.AI Video GIMP Photoshop

Gimp Speed Figures,

OpenCL per Selective Gaussian Blend
24GB RAM 8Core

CPU 1.3m
RX200 60s
Movidius 10s
+Coral Offloads Int32 & processes; Processing INT8,
In that way the main CPU is the handler of most complex non inference tasks..
In many networks F32 & Int32 would be used to represent computation tasks & can be sieved optimally.

With 8MB of Essentially RAM Writeback Cache; Input-Output through the USB or M.2/PCIe,
Loading tasks through the IO buffer; Into & out of the work buffer; The average flow cache would be around 256KB..
The machine Learning ML itself is between 32KB & Around 7MB.

The Processor itself has multiple threads & IO/DMA Processes to directly inference or solve programming.

Ideally Compressed RAM, with Rsrt ADD+ & MUL* & PACK & Min Mean & Max,
We can perform flexible basic maths,

Flexible compression by copy & replication
Compression consisting of MUL expansions or fractions, MUL ADD & roll Example n+n*y = , n+((n+10)*y),
Compression formula consisting of algebra operations to unroll or roll & gradients Min=m to Max=y

Examples of formula expansion compression

replication n+((n+10)*y)

Formula expansion n+(y*z) = , (n+y)*z =

gradients Min=m to Max=y , A++B till (N*t)=C then Min A to Max C | Median = D | D++t
(the above for example: ((n+y)*z = )Rsrt = )

Compare values A B | C , Compare A B = L(Larger); (L - {A, B})=C; C=Difference

Load Image {A, B}; Shape {A, B}; Compare A B = C; transpose C to S = Surface
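One way to read the gradient & compare formulas above, as a sketch: an evenly stepped run (Min to Max) is stored as three numbers & expanded again with MUL & ADD, & the compare returns the larger value with the difference. The names `compress_ramp`, `expand_ramp` & `compare` are hypothetical, and this covers only the simplest linear case.

```python
def compress_ramp(values):
    """Store an evenly stepped run as (start, step, count); None if not a ramp."""
    if len(values) < 2:
        return None
    step = values[1] - values[0]
    if any(b - a != step for a, b in zip(values, values[1:])):
        return None
    return values[0], step, len(values)

def expand_ramp(start, step, count):
    """Unroll the ramp again: one MUL & one ADD per value."""
    return [start + step * i for i in range(count)]

def compare(a, b):
    """Compare A B = L (larger); L minus the smaller value = C (difference)."""
    larger = max(a, b)
    return larger, larger - min(a, b)

packed = compress_ramp([10, 20, 30, 40])
print(packed, expand_ramp(*packed))
print(compress_ramp([1, 2, 4]))   # not evenly stepped, so it is left uncompressed
print(compare(3, 7))
```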

8MB Work buffer
256KB IO Buffer (Fast Frame Buffer); The effective memory used by images or audio may reach 2MB.

The perfect Proposal RS

3D DR_LC : 3D Layers to Direct Render Layer Composing : OS, DSC, Codecs, DirectX & Vulkan

Here are the Operation Processor Extensions available to EdgeTPU:

The sample examples show what a powerful specialised instruction set can do!
To explain more; The EdgeTPU is a Matrix multiplier & Adder..

The instructions such as transpose allow for example mapping one image on another for difference detection...
Flexible uses for each function can literally be based on the basic concept of the instruction,

Basic assumptions lead to convoluted & complex examples..

Examples that are required to do such things as check one bitmap for identical content (in effect XOR)

Quantize : image pixels..

Max Min Mean : Dithering or gaussian blends (complex XOR & layering or edge feathering) & more!

StridedSlice & Slice : partition a frame into parts to render in a grid; slice CSS isolated content in rendering.

SpaceToDepth : Dynamically allocate depth layers to single frame content such as text boxes or photos..
So why ? so we can Dither edges & Fonts & minimise ram usage to dynamic content.

We AveragePool2d to gaussian blend the layers together; In principle we average weight the layers to a final,
Single layer; DSC VESA

Alternative is to Paint major content on a single layer involving the CSS backdrop..
Moving content on top of it on a secondary layer; Makes sense to me! speed wise,

We could use an average pool with Weights (+10/30/100 to -10/30/100) & Feather & Gaussian blend down if we like!
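A plain Python sketch of the two averaging ideas above. `average_pool2d` averages blocks inside one frame, which is what the pooling operation does; `blend_layers` is the weighted average across layers described in the text. It runs on the CPU & does not call the EdgeTPU; both function names here are hypothetical.

```python
def average_pool2d(frame, size=2):
    """Average each size x size block of a 2D list (sides must divide evenly)."""
    h, w = len(frame), len(frame[0])
    return [
        [
            sum(frame[y + j][x + i] for j in range(size) for i in range(size))
            / (size * size)
            for x in range(0, w, size)
        ]
        for y in range(0, h, size)
    ]

def blend_layers(layers, weights):
    """Weighted average of equally sized layers into one flattened layer."""
    total = sum(weights)
    h, w = len(layers[0]), len(layers[0][0])
    return [
        [
            sum(wt * layer[y][x] for layer, wt in zip(layers, weights)) / total
            for x in range(w)
        ]
        for y in range(h)
    ]

print(average_pool2d([[1, 3, 5, 7], [1, 3, 5, 7]]))
# Backdrop layer weighted 3, pointer layer weighted 1.
print(blend_layers([[[0, 0]], [[100, 200]]], [3, 1]))
```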

MaxPool2d : defines the layer amount.

ResizeBilinear, ResizeNearestNeighbor : Resize textures for appropriate size of mouse pointers & cursors & content pictures or video..

DepthwiseConv2d : we can down convert layers to 2D, in principle in chrome we convert layered textures during final frame generation to a single flattened layer..

Transpose : layers folded into a single frame render fast! bear in mind that we have to HOLD THE LAYERS in a single fetch!
Buffer optimization to ram size required.. Memory optimization is crucial to hold all layers in a single fetch.

Alternatively combine layers with Transpose & DepthwiseConv2d combined.

In a genuine way layering mouse pointers on top of DSC frames makes a lot of sense in terms of response & compression..

you have to think in terms of knowing what is under that mouse pointer; there are two frames of reference to this:

deliberated previous frame forward predict with icon buffer to load over it (A small texture)

Layered responses; in layered responses the Processor processes a JavaScript CSS layer in the form of the operating system & vectors..

The motion pointer or animation travels over the top in a secondary layered response,
The formation of layers lowers processing costs & speeds up UI response timers; lowering overall compression costs because the first layer is fully converted into an almost static prediction response &or reactionary differentiator system.


TOPS Conversion Table:

8000G 16TOPS NPU + SiMD 13TOPS total 39TOPS,
Standard FX 8TOPS to 13TOPS All SiMD used!
EdgeTPU*2 8TOPS + CPU SiMD &or NPU..

SiMD F32, F16, Int32, Int16, Combined 8Bit parallel ops..

NPU F16, Int16, Int8, Int4,
TPU U8 & Int8,


Conversion recommendation work for NPU & SiMD:

Int32/64 CPU + 8Bit Inference TPU
F32 conversion by removal of remainder Xor XMM & YMM XXX to Integer Inference; TPU 8Bit or NPU..
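A sketch of the float to 8Bit conversion step, assuming simple symmetric quantisation with one scale factor (the text does not fix a scheme; real toolchains also use zero points & per-channel scales). The names `quantise_int8` & `dequantise` are hypothetical.

```python
def quantise_int8(values):
    """Symmetric quantisation: floats to integers in -127..127 plus a scale."""
    peak = max(abs(v) for v in values) or 1.0   # avoid dividing by zero
    scale = peak / 127
    quantised = [max(-127, min(127, round(v / scale))) for v in values]
    return quantised, scale

def dequantise(quantised, scale):
    """Recover approximate floats; the remainder lost in rounding is gone."""
    return [q * scale for q in quantised]

weights = [-1.0, 0.0, 0.25, 1.0]
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
print(q)
print(restored)
```

The rounding error per value is at most half a step (scale / 2), which is the 'removal of remainder' that the integer inference path has to tolerate.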

Rupert S
Hailo M.2 & mPCIe are both available at 28TOPS for 199$ & the (MultiProcessor) PCI Cards for around 300$ to 800$,
Unfortunately not sourced on Amazon.

8TOPS M.2 38$; Worth a thought.

B Key / M Key M.2, M.2 NGFF B Key / M Key M.2 NGF, M 2 Specifications support : 2280/2260/2242/2230

GLOTRENDS M.2 M Key to E Key WiFi Adapter for M.2 WiFi Module



To my knowledge M.2 E is basically PCIe but smaller, so an adapter is fairly simple.


The M.2 "E" key sockets are used for Wireless LAN/Bluetooth cards.
These sockets are common with laptop motherboards.
They are also found on some desktop motherboards (mITX, mATX, ATX).
Gigabyte offers mITX boards with this support.


Analogue ML - Including Additive-Capacitor-'Battery' - Using the IBM analogue in-memory hardware acceleration kit for neural network training and inference - APL Machine Learning

RAM ADDER differential Inference (c)RS :
RAM Table Accumulated addition node network Accumulator with Accumulation comparison Inference

IBM Analog Hardware Acceleration Kit

Matrix Processors - Multi Node SpiNNaker2 A Large-Scale Neuromorphic System

Isaac Gym - Preview Release

CALM: Conditional Adversarial Latent Models for Directable Virtual Characters

ML Strategic Workflow Training & Models - Machine Learning model guide Tensor to ONNX - Fraud Prevention & Statistics - Turning Data into Insight with IBM zOS16

AA-DLADMM - GD Gradient Descent - An Accelerated ADMM-based Framework for Training Deep Neural Networks


Personality UI : Have a friend

Alpaca Character Generation model
4Bit for speed, But not precise
Trained 3 epochs; Higher Precision

Base model

Python WebUI
Mac; Mostly MAC but fast

how to use & personality sets

On the subject of how deep a personality of 4Bit, 8Bit, 16Bit is reference:


(documents) JIT & OpenCL & Codec :

Include vector today *important* RS

Eclectic & for the codecs of the world! OVCCANS (install and maintain as provided HPC Pack)


Transversal processing availability : Transparent Task Sharing Protocols

Machine Learning

Innate Compression, Decompression

Best NPM site in the world

(Simple Install) Website Cache JS Updated 2021-11 (c)RS
(Simple Install) Science & Research Node High Performance Computing
Linux & Android

Presenting JIT for hardware interoperability & function :

(Simple Install) Website Server Cache JS Updated 2021-11 (c)RS
(Simple Install) Website Server Cache JS Work Files Zip Updated 2021-11 (c)RS


Direct ONNX Hardware Accelerated: F16

Ideal for 4Bit Int4 XBox & Int8 GPU
PULP-NN: accelerating quantized neural networks on parallel ultra-low-power RISC-V processors - Bus-width 8-bit, 4-bit, 2-bit and 1-bit

ML Proof case SVM (Multi-Dimensional-Elliptic,98%) aDaBoost M1(Mac,91%) - COVID-19 Prediction Using Supervised Machine Learning - Irfan_Ali_MEng_2023

Useful operation precision reductions
FPGA Implementation of Keyword Spotting System Using Depthwise Separable Binarized and Ternarized Neural Networks

Useful operation precision reductions; I observe that reducing precision to 1Bit & 2Bit..

While enhancing the definition of a positive, Negative Dipole & thus enhancing speed..
Further reduces reasoning capacity; That in order to reduce Processor bandwidth for reasoning..

In the example of the XBox & PS5; DOT4 & INT4, INT8 & F16 & bF16; Apply considerable improvement to reductions probable error related to a lack of remainder or float value depth enhancement!

By reason of probability I assume a value of 4Bit & 2Bit to allow the smallest packing ability; Existing alongside the word reasoned!

To reduce to 1 & 0; I assume a definite statement that a Value Integer Solve in the form of a vector..
Is most probably the solution & that furthermore that in most cases; Projected in pure maths & code ASM,
Both SiMD; Float & Integer...

Reduction to multiple 2 Bit values in short Integer instructions; I will state however that no such value is further away than a statistics table or PHP Data-Set.
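A sketch of packing multiple 2 Bit values into short integers: four values in the range 0 to 3 share one byte. The bit order (first value in the lowest bits) & the names `pack_2bit` & `unpack_2bit` are assumptions for illustration.

```python
def pack_2bit(values):
    """Pack values in 0..3 four to a byte, first value in the lowest two bits."""
    out = bytearray()
    for i in range(0, len(values), 4):
        byte = 0
        for slot, v in enumerate(values[i:i + 4]):
            if not 0 <= v <= 3:
                raise ValueError("value does not fit in 2 bits")
            byte |= v << (2 * slot)
        out.append(byte)
    return bytes(out)

def unpack_2bit(blob, count):
    """Read back the first count 2-bit values from the packed bytes."""
    return [(blob[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(count)]

values = [3, 0, 1, 2, 2]
packed = pack_2bit(values)
print(list(packed), unpack_2bit(packed, len(values)))
```

Five values take two bytes instead of five, a four to one saving on full bytes; the count has to be stored alongside because the last byte may be only partly used.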

Rupert S 2023-06



SiMD Gaussian Blending & Dithering - Better_Fixed_Point_Filtering_with_Averaging_Trees

Vectorization of Kernel and Image Subsampling in FIR Image Filtering

Implementation of a High-Quality Dolby Digital Decoder Using SiMD MMX™ Technology


Common techniques used in ML Learning are edge detection, accent recognition, language processing, and code optimization.

Basic ML Feature list; Also for learning

Edge detection is a process of identifying the boundaries of objects in images or videos.

Accent recognition is a process of identifying the regional or social variation of speech.

Language processing is a process of analyzing and generating natural language texts.

Code optimization is a process of improving the performance or quality of code.


Dynamic ML IRS-RIS 4G,5G Wave Shaping Edge detection with reflection angle calculation - strong wave localising edge (shaping)sharpening(c)RS

By quantifying how waves bounce from reflective surfaces it is possible to shape waves that bounce in a different direction from a mechanical reshaping surface called a RIS..

Reconfigurable intelligent surfaces & Intelligent reflecting surfaces bounce radio waves for wireless networks..

Presenting the example:

Coral TPU Micro Edge Learning with performance arranged Intel FPGA Arria 10 SX SoC Kit & Google Coral, NVIDIA Jetson Nano and CPU ROCK 4C Plus

