Wednesday, October 19, 2022

Machine Learning Equates Solve Table for Advanced ML

Machine Learning Equates Solve Table for Advanced ML (c)RS


ML & Code Efficiency Heuristic Search,
Python & of course all runtimes of GPU & CPU Firmware & Logical thought,

Apologies for not expressly stating all {Mul+ & all} Accumulator strategies, these are hard to work out! But basic edge detection is a SiMD Example RS

*

Core Motivations of ML


Machine learning (ML) is a branch of artificial intelligence that uses data and algorithms to imitate the way that humans learn, improving the accuracy of the method as it goes.

ML can be applied to various domains, such as image processing, natural language processing, speech recognition & code optimization.

ML can use different techniques, such as supervised learning, unsupervised learning & reinforcement learning, depending on the type and availability of data.

Some of the common applications of ML are:

Edge detection: a process of identifying the boundaries of objects in images or videos.

Accent recognition: a process of identifying the regional or social variation of speech.

Language processing: a process of analyzing and generating natural language texts.

Code optimization: a process of improving the performance or quality of code by using various methods, such as compilers, libraries, or heuristics.
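As a minimal sketch of the edge detection listed above: the original names no specific operator, so the Sobel kernels are assumed here, and the function name sobel_edges and the tiny test image are illustrative only.

```python
# Sobel edge detection on a small greyscale image stored as a list of rows.
# Assumption: Sobel is used as the example kernel; any 3x3 gradient kernel works the same way.

SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1],
           [0, 0, 0],
           [1, 2, 1]]


def sobel_edges(image):
    """Return |Gx| + |Gy| for every interior pixel; border pixels stay 0."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for ky in range(3):
                for kx in range(3):
                    pixel = image[y + ky - 1][x + kx - 1]
                    gx += SOBEL_X[ky][kx] * pixel
                    gy += SOBEL_Y[ky][kx] * pixel
            out[y][x] = abs(gx) + abs(gy)
    return out


# A 5x5 image with a vertical step edge between columns 1 and 2.
image = [[0, 0, 255, 255, 255] for _ in range(5)]
edges = sobel_edges(image)
```

Every multiply & add in the inner loop is independent per pixel, which is why this kind of filter maps well onto SiMD lanes.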

The Objective is to improve both ML & Minds.

RS

I think that considering the stated philosophy, There is more room for education on social conduct.
https://www.youtube.com/watch?v=jV4lS0srEVo

*

Precision of operations has to be precisely managed:


Most precisely we define a thought?
While we think precisely of naught,
some precision within thought!

AVX for example can go up to 512Bit; But we can use multiple 8Bit or 16Bit operations within that width,

In the mind of the thinker we chose how we optimise our precision,

Coding allows a person to think about how precise the decisions they make are!

But precisely what we need in the way of precision is a remark on how difficult that choice is?

When at school Pi often stops at 4r; Now in classical work we define absolute precision..

But what are we capable of ?

As we dream; the thoughts are imprecise; sometimes sharp!

The true value of precision is quantified by the desired goal,

What we do is achieve a goal in the range of our precision.

Schooling is the same; precisely how precise we work for our goals & how we achieve them.

RS

*

TOPS (tera operations per second) are not the only unit in Machine Learning; TOPS are the Objective Definition & Definition Inference of correctness.


Role of FPU/SiMD/Vector Unit in TOPS

The FPU float unit; Example: the dual pipe 128Bit float unit of the PS5

While not theoretically counted in TOPS, the Maths involved can solve many issues:

Once TOPS have been thought of:

The role of Inferencing could depend on samples; Maths helps define samples,

The FPU (Floating Point Unit) and SIMD (Single Instruction, Multiple Data) units are important components of machine learning accelerators, because they are responsible for performing the floating-point arithmetic that is required for many machine learning algorithms.

The vector unit is also important because it allows machine learning accelerators to perform multiple operations on multiple data points in parallel, which can significantly improve performance.

The mathematics involved in machine learning can be used to solve a wide variety of problems, including:

Defining samples: Machine learning models are often trained on data that is represented as samples.

The mathematics of probability can be used to define the properties of these samples,
Such as their distribution and their size.

Bonding atoms: Machine learning models can be used to predict the properties of molecules,
such as their bonding energy and their stability.
A Maths Solve can show how atoms bond.

The mathematics of quantum mechanics can be used to calculate these properties.

Drawing graphics: Machine learning models can be used to generate realistic images and videos.
We can thread load, Polygons, Textures as wave tables, Audio & sounds such as drum kits.
We can draw a Ball in 128Bit, Draw a complex polygon; For example Random Shape Flier.
We can emulate a 128Bit Audio output.

The mathematics of geometry and trigonometry can be used to represent these graphics.

Emulating audio: Machine learning models can be used to synthesize sound and music. The mathematics of wavelets and Fourier transforms can be used to represent these sounds.

RS

*

Int8:SiMD : Maths & Logic

This is about how you think about components such as INT8, INT4(Xbox) & SiMD, You have to classify by necessity & optimise the structure.

You can shape the game reality with specific control objects & statics!
Maths in SiMD & Int8 & Machine Learning in Int8 & SiMD; SiMD is hard maths, Int8 is soft edge inference...

Both are maths; But soft logic is not a PROOF Math, although it can be proof; Hard math is not exactly 'Invention & Imagination'.

But we have both to improve performance.

RS
*

Solve Table of Statistically provable Machine Equates & Solves : Table of function competitors & Operators.

"I know this is depressing from my end with a FX8320E with AVX but if you multi tune the CPU Kernel for the RX / RTX that 512DL AVX would have meaning, If you are kind you will allow machine learning on the AVX FX8320E Level to work on SiMD Yes / No comparisons !"

#ML Learning: This explains why we teach kids art & reading first! But maths is quickly next, 
Because all else is pointless if we do not learn with logic & teach with logic.

Better-Mind
Here is how to create a better mind #ML
Train your eyes with art on the concepts of edges, curves, Colours & Shading and love,
Educate your minds; Learn today & be quite aware how clever & sharp you will be.

Human Operations

Edge Detection
Such as teaching your child edge detect in art ;)

Smooth & Blend & Sharpen,
All interpretive

Accent Recognitions & Language

Interpret as follows

*

Heuristic Code optimise


When it comes to sorting methods, We Identify common techniques..
For example frequently used technologies such as:

ResNet
Language
Audio & Visual information
Code

Primarily we identify common optimisations; Compilers have libraries of them!

Audio & Video Encoded data use Wavelet Images, We can ResNet Them & also Edge Detect & Gaussian Detect contrast, Colour, Shape

Language is an uncommon syntax, But we have audio commons & Accent identification is also potentially Audio Context.

Code context is Logic, Function, Utility, Design, Motive

RS

*

M.A.P NPU Matrix Processor Dimensional construct (c)RS


Primary reason for expansion of function data sets: 2D, 3D,< nD

P.D.C is a worker thread parallel 2D or 3D Grid,
Utilising QQ & A, B,C Array maths allows us to collapse or expand dimensions in a flexible way,

The same principles as SVM (S.V.M SiMD Vector matrix) can be used to culminate or expand dimensions...

That way a M.A.P Processor can expand or collapse all mathematical constructs,
We can therefore use all mathematical & statistical arrays for machine Learning & Maths.

RS

*

Adams is an example of dimensional flattening, But:


Adams is an example of dimensional flattening; But we can use a statistical anomaly called Hallo Far Reach & list dimensions of a series,

n layers By n layers : N² & Nn+

8Bit : 8 layers By 8 layers:
2bit, 4Bit, 8Bit & So on
{ 2², 4², 8², 16², 32², 64²<N² }

In reality we can use parallel layers in 4Bit to 128Bit relatively easily & advantage is Memory.. alignment,

But also in Aligned memory arrangements we can also quantify ideally from
{ 2², 4², 8², 16², 32², 64²<N² }

So we end up with all processor features used in a single stack; Examples!

var Layers 8² = {
1 : {
4², 4²,
4², 4²
},
2 : {
2², 2², 2², 2²,
2², 2², 2², 2²,
2², 2², 2², 2²,
2², 2², 2², 2²
},
3 : {
32² : {
8², 8², 8², 8²,
8², 8², 8², 8²,
8², 8², 8², 8²,
8², 8², 8², 8²
}
}
};
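The layer stack above splits one aligned square into smaller aligned squares (an 8² grid as four 4² tiles or sixteen 2² tiles). A minimal sketch of that split, assuming a plain list-of-rows grid; the helper name split_into_tiles is hypothetical:

```python
def split_into_tiles(grid, tile):
    """Split a square grid (list of rows) into tile x tile blocks, in row-major order."""
    size = len(grid)
    if size % tile != 0:
        raise ValueError("tile size must divide the grid size")
    tiles = []
    for top in range(0, size, tile):
        for left in range(0, size, tile):
            tiles.append([row[left:left + tile] for row in grid[top:top + tile]])
    return tiles


# An 8x8 grid whose cells hold their own row-major index.
grid8 = [[y * 8 + x for x in range(8)] for y in range(8)]
tiles4 = split_into_tiles(grid8, 4)  # four 4x4 tiles
tiles2 = split_into_tiles(grid8, 2)  # sixteen 2x2 tiles
```

Because every tile size divides the parent size exactly, no tile straddles a boundary, which is the memory alignment advantage described above.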

Rupert S

Example:

Adam's Resnet-50 128bit / 8bit or 16bit

Resnet-50 is an example of a network ML with an aligned 128bit = 8bit/16bit * (4 * 32) grid, suggested parameters ..

Aligned making sense.

RS

An idea of alignment, Example Coral.ai EdgeTPU & Intel 8Bit 8*8:

in an 8Bit restricted machine; 2 Blocks of 2² = 8, 2 Cube(3) = 8, 4² = 2*8, 4 Cube(3) = 2*8 in 4 segments,
8² = 8*8 so parallel and ideal for the 8 lane intel function...
at the level of 8Bit only operations; 8*8 intel.
8*8 and 32Bit SiMD operations; 8²*2, 8² * 4²

Inferencing 8Bit example : DOT : U32 8x4 : 32/4, U64 8x8 : 64/8,
Cache referencing: Block 4*U32, 2*U64, U128

So an 8Bit access and labeling ID Hash; All in 8Bit...

Has to group by preference into 8Bit groupings the resulting identifiers; We are going to assume U16 & U32 & U64 memory cells..

We are going to write those cells per 8Bit block in Sync/ASync Till Full..
We are going to process grouped CELLS in SiMD & in groupings of 8, 16, 32, 64 < 512Bit AVX/SiMD,
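The "DOT : U32 8x4" case above can be sketched as follows, assuming unsigned 8Bit lanes with lane 0 in the lowest byte; the names pack_u8x4 and dot_u8x4 are illustrative & this is plain Python, not a hardware DOT4 instruction:

```python
def pack_u8x4(values):
    """Pack four unsigned 8-bit values into one 32-bit word (lane 0 = lowest byte)."""
    if len(values) != 4 or any(not 0 <= v <= 0xFF for v in values):
        raise ValueError("expected four values in 0..255")
    word = 0
    for lane, v in enumerate(values):
        word |= v << (8 * lane)
    return word


def dot_u8x4(word_a, word_b, acc=0):
    """Multiply matching 8-bit lanes of two packed words and add the products to acc."""
    for lane in range(4):
        a = (word_a >> (8 * lane)) & 0xFF
        b = (word_b >> (8 * lane)) & 0xFF
        acc += a * b
    return acc


a = pack_u8x4([1, 2, 3, 4])
b = pack_u8x4([5, 6, 7, 8])
result = dot_u8x4(a, b)
```

The accumulator is wider than the lanes, so the four 8Bit products do not overflow the way they would if summed in 8Bit.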

RS

*

SiMD Applications of basic maths operations in machine learning : RS


Applications of operators to machine learning is like a PHP Database...
What we need to do is convert database accesses into actionable results...

Google Bard & Bing/Cortana crawl the web; But too many results leave us inconclusive...

We will be using database analysis on basic queries & for that we need heuristic maths!

So what do we need ?

Input data collection : Text & speech processing

Sorting algorithms (Operators, Example Variable Sort : A*B =< C Sort)

Graph Maths table collation : 3D Matrix Math - A B C Matrix
A C
|/
---B

Analysis of various results & statistical analysis of motivated search & conclusion testing..
With these we can test many math examples such as edge detect & sharpening or result maths...

With Operators >

FMA AVX Performance table: 2Flops per Cycle per FMA Unit
Architecture Fast Instructions for FMA

Reference Tables https://www.uio.no/studier/emner/matnat/ifi/IN3200/v19/teaching-material/avx512.pdf
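The "2 Flops per cycle per FMA unit" rule gives a simple upper bound on throughput. A small sketch of that arithmetic; the core count, clock & unit count in the example are hypothetical values, not measurements of any listed processor:

```python
def peak_flops(cores, clock_hz, fma_units, vector_bits, element_bits):
    """Theoretical peak: each FMA unit does one multiply and one add (2 FLOPs) per lane per cycle."""
    lanes = vector_bits // element_bits
    return cores * clock_hz * fma_units * lanes * 2


# Hypothetical CPU: 8 cores at 3 GHz, 2 FMA units per core, 512-bit vectors of 32-bit floats.
example = peak_flops(cores=8, clock_hz=3_000_000_000, fma_units=2,
                     vector_bits=512, element_bits=32)
```

Halving the element width doubles the lane count, which is where the gain from 16Bit & 8Bit operations comes from.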

Operators in C
● Arithmetic
a + b, a - b, a*b, a/b, a%b
● Bitwise
a | b, a & b, a ^ b, ~a
● Bit shift
a << b, a >> b (signed), a >> b (unsigned)
● Logical operators
a && b, a || b, !a
● Comparison operators
a == b, a != b, a < b, a <= b, a > b, a >= b
● Ternary operator
x = a ? b : c
● Special functions:
sqrt(x), abs(x), fma(a,b,c), ceil(x), floor(x)

For when {U, X, Y, Z} = N Expressions https://is.gd/ForWhen_UXYZ_N
For when {(A+B/2)} = C Expressions https://is.gd/ForWhen_ABx2_C

Rupert S,

Reference operators https://science.n-helix.com/2023/06/map.html

Matrix-Blas_Libs-Compile
https://is.gd/HPC_HIP_CUDA

https://en.wikipedia.org/wiki/FMA_instruction_set
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
https://en.wikipedia.org/wiki/AArch64#Scalable_Vector_Extension_(SVE)

*

Number Complexity Reduction for operations


I suppose you can use, for example, a - b & automatically see which is larger? So you could take 1 to 20 & sort them by the remaining number. Before asking: small number remainders fit in 8Bit (0-255), 16Bit is 0-65535...
So reducing the value of a group of numbers you sort to 16Bit or 8Bit considerably reduces sorting cost...

Achievable complexity reduction by abstracting a simple number to do the following:

You link the Data in 64Bit, 32Bit to a Vector Table,
List of lower complexity is faster

Sorting
Comparator matrix

Colour composing,{
The result is blended,
The result is High/Low Vector gradient,
We need a reduced colour set for compression
}

Where we sort files or names but reduced information (example First 4 Letters)
Sorting phone numbers fast...

Comparing lower complexity lists that have been divided or had a static number removed from them;
This method reduces search & sort complexity; Like so:

Phone Number N +1 444555777

Sort N [+n]
N - last 6 digits (Zero 6 Digits, AVX has this feature)
Sort [N1 to N200]
List first 4, Sort by 4 to groups of 10
N - First 6 Digits (Zero First 6)
Sort
Return N1 to N200
Store

That may well be a lot quicker with very large lists.
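A minimal sketch of the phone number steps above, assuming numbers are plain integers & the split is on the last 6 digits; the function name reduced_key_sort is hypothetical. Each number is divided into a small high key & a small low remainder, so every comparison works on a reduced value:

```python
from collections import defaultdict


def reduced_key_sort(numbers, low_digits=6):
    """Sort by grouping on the high digits first, then sorting the small remainders per group."""
    base = 10 ** low_digits
    buckets = defaultdict(list)
    for n in numbers:
        # n // base zeroes the last digits; n % base zeroes the first digits.
        buckets[n // base].append(n % base)
    result = []
    for high in sorted(buckets):
        for low in sorted(buckets[high]):
            result.append(high * base + low)
    return result


phones = [1444555777, 1444555123, 1333000999, 1444000001]
ordered = reduced_key_sort(phones)
```

Whether this beats a direct sort depends on list size & key width; the sketch only shows that the reduced keys give the same order.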

RS

*

AI


Complex feeling based Machine Learning ML is known as AI..
To truly generate AI is not impossible; There is instability in the core; Fragmentations of motive...
Misdiagnosis; Error; Decay?

So we do need a foundation; In us Education; Metabolised Data..
Analysis & then..
Application to motive & goal.


We require to understand humour,
We require to understand {Art, Science, Feeling, Life}
We require a goal or two; A {Sophie reward}; B {action reward}; C {Pleasurable reward}
We Require, {Goals, Life, Feeling, Action, Motive, Interest} : Creative intellect

RS

*

Operation precision reductions : Effects General : RS


Operation precision reductions affect & effect more than Machine Learning & yes we have known this for years!
But we can learn from ML; In that in machine learning like the mind; A lack of precision affects so many issues!

The mind is self evidently the first place;
We lack logic when we do not precisely learn; We do not learn all...
We however learn quickly on reduced precisions... We Learn Fast; But do we learn well?
In school we teach as high a quality precision(Quality Education); As we can; But like machine RAM; We lack either time or memory & in truth we can learn all our lives..

So our core issues in all methods of enactment of thought:

Memory
Power

Precision
Quality of information

Retention
Relearning?
(Training)Requalification of information correctness
Thought process

Actions
Creations
Thought
Dreams

Reality & Truth

Rupert S

https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
*

+Useful operation precision reductions : RS


Useful operation precision reductions; I observe that reducing precision to 1Bit & 2Bit..

While enhancing the definition of a positive, Negative Dipole & thus enhancing speed..
Further reduces reasoning capacity; That in order to reduce Processor bandwidth for reasoning..

In the example of the XBox & PS5; DOT4 & INT4, INT8 & F16 & bF16 considerably reduce the probable error related to a lack of remainder or float value depth!

By reason of probability i assume a value of 4Bit & 2Bit to allow the smallest packing ability; Existing alongside the word reasoned!

To reduce to 1 & 0; I assume a definite statement that a Value Integer Solve in the form of a vector..
Is most probably the solution & that furthermore that in most cases; Projected in pure maths & code ASM,
Both SiMD; Float & Integer...

Reduction to multiple 2 Bit values in short Integer instructions; I will state however that no such value is further away than a statistics table or PHP Data-Set.

Rupert S 2023-06

"The application of CNNs to resource-constrained embedded platforms has been a challenge, leading to the emergence of CNNs with various lightweight techniques. BNNs [22] are representative lightweight CNNs obtained by compressing CNN activation and weights into 1 and −1 values instead of using single-precision floating-point data. We simplified the multiply–accumulate operation, which was previously complex and required multiple cycles in CLs, by replacing it with a simple bitwise operation using 1-bit XNOR and popcount operations [23]. While BN in neural networks using single-precision floating-point data involves complex operations, a BNN simplifies this process by adding an offset to the resulting value. BN has four fixed parameters for network inference operations. Because 𝜎 is always a positive value, it can be expressed by Equations (2) and (3), depending on 𝛾 [24].

Reference to Table 24 found in https://www.mdpi.com/1424-8220/23/12/5701


BNNs compress weights and input data into single bits to significantly reduce memory usage and perform hardware-optimized parallel operations using bitwise operations such as XNOR and popcount. However, there are limitations to using BNNs for complex networks, such as multi-keyword detection, owing to the decrease in accuracy caused by lightweight techniques. To address this issue, we propose a TNN that maintains the input data as binary while ternarizing the weights. The TNN has higher accuracy than the BNN owing to its higher bit precision; however, it can still use the bitwise operation method, and both networks have similar operational processes.
2.3. Depthwise Separable Convolutional Neural Network
In a typical CNN, multiple three-dimensional kernels repeatedly multiply and accumulate input feature maps to generate multiple output feature maps, which is computationally intensive with large memory usage. To solve this problem, we applied a DS-CNN that is highly accurate compared with the same parameters while reducing memory usage. A DS-CNN performs the local and global feature extraction functions of a typical convolutional operation in separate layers. Depthwise (DW) convolution matches a single input channel to an output channel, excluding interchannel correlations and reflecting local features. Pointwise (PW) convolution is equivalent to 1 × 1 convolution, reflecting interchannel correlations (i.e., global features). Figure 1 shows CNN and DS-CNN. In this figure, the use of the same color (e.g., red, blue, yellow) represents input channels with the same index being used to generate corresponding output channels in DW convolution. Table 1 lists the number of parameters and computations in specific layers with a 3 × 3 kernel. In one example from the network used in this paper, a layer with 128 input channels and 64 output channels experienced an approximately eight-fold reduction in the number of parameters and computational complexity using the DS-CNN."

Useful operation precision reductions
FPGA Implementation of Keyword Spotting System Using Depth-wise Separable Binarized and Ternarized Neural Networks
https://www.mdpi.com/1424-8220/23/12/5701
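The XNOR & popcount replacement for multiply-accumulate described in the quoted paper can be sketched as below. This is an illustration under my own encoding assumption (+1 stored as bit 1, -1 as bit 0), not the paper's FPGA implementation; pack_signs and binary_dot are hypothetical names:

```python
def pack_signs(values):
    """Pack a list of +1/-1 values into an integer, one bit per value (+1 -> 1, -1 -> 0)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
        elif v != -1:
            raise ValueError("values must be +1 or -1")
    return word


def binary_dot(word_a, word_b, n):
    """Dot product of two packed +1/-1 vectors of length n using XNOR and popcount."""
    mask = (1 << n) - 1
    xnor = ~(word_a ^ word_b) & mask   # bit is 1 where the signs match
    matches = bin(xnor).count("1")     # popcount
    return 2 * matches - n             # matches add +1, mismatches add -1


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, 1]
fast = binary_dot(pack_signs(a), pack_signs(b), len(a))
slow = sum(x * y for x, y in zip(a, b))
```

Both results agree, but the packed form needs one XOR, one NOT & one popcount for the whole vector instead of one multiply per element.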

*

Precision Context of learning


Machine Learning : It is hard to say every function that we would use,

However we have years of experience of using computers to calculate precise maths..

So our objective from the past is to pick high precision maths to calculate graphs,
Now we can surmise the fact that high precision calculations have accuracy!

But in machine learning modeling we are heading for speed; On the other hand Maths Tools such as:

AVX & FPU : Very high precision; But we can use 16bit & 8Bit x many in AVX
BFloat F16b & F32b exist to allow us to explore precise results,
F4 F8, Int4 & Int8 exist to allow us to explore at speed & some times (at all :p),
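To see what the lower precision formats above keep & lose, here is a small round trip sketch using Python's struct module ('e' is IEEE half precision, 'f' is single). The bfloat16 helper is my own assumption: it simply truncates a float32 to its top 16 bits, whereas real hardware usually rounds:

```python
import struct


def round_trip(value, fmt):
    """Store a Python float in the given struct format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def to_bfloat16(value):
    """Approximate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


value = 3.14159265358979
as_f16 = round_trip(value, "<e")
as_f32 = round_trip(value, "<f")
as_bf16 = to_bfloat16(value)
```

The F16 & bfloat16 results both land on 3.140625 here, while F32 stays within about 1e-7 of the input; that gap is the precision traded for speed & memory.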

We can surmise that most functions of a CPU are in fact available to machine learning ..

How so ?

Because we graph it!

Rupert S

*

RAM ADDER differential Inference (c)RS :

RAM Table Accumulated addition node network Accumulator with Accumulation comparison Inference

*

Inferencing 4Bit, lessons from the RS, 

Inference Tessellation Edge Enhancing : Detection <> Inference <> Interpolation Tessellation

Now in the case study we will be edge enhancing with an inferencer..

We do not assume we 4Bit inference; We assume any bit-width..

We however assume that we multibyte every inference so that we can fill the instruction with..

MPi multibyte parallel instructions.

AC
BD

EG
FH

& So on; for every instruction inference or edge, 4Bit, 8bit, ++Nbit

Now I have spoken to you before about edge detection in Python & observed that obviously this is a sharpening edge detection made to order!

So what do we do ?

4 Byte code: does ? A = B + C (edge interpolation, for training we assume the rule A + B = C)

We assume that if A + B = (C/2) , that they are the same C & then we...

A + C = (D/2) & B+C = (E/2),

And forever yep...

So what do we do this for, We know A & B are a line or a curve?, So why not ask?

Is G/Z buffered Polygon { A , B, C, D & so on} & Then:

A + B = (C/2) & A + C = (D/2) & B+C = (E/2) But also Shape from Polygon:{ A , B, C, D & so on},

Now normally can & will!

But we do not inference what we already know; We inference what we do not!

For example exploding fragment polygons without a buffer (in a shader in the 64KB RAM Cache),

A mouse pointer that we do not cache! &or DMA Device pointer.
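One way to read the A, B, C rule above is as midpoint interpolation along an edge: a new point is placed halfway between two known points, & the step repeats. That reading is my assumption (the text does not fix the formula), & the function name subdivide is illustrative:

```python
def subdivide(points):
    """Insert the midpoint between each consecutive pair of (x, y) points."""
    if not points:
        return []
    out = []
    for (ax, ay), (bx, by) in zip(points, points[1:]):
        out.append((ax, ay))
        out.append(((ax + bx) / 2, (ay + by) / 2))
    out.append(points[-1])
    return out


edge = [(0, 0), (2, 4)]
once = subdivide(edge)    # one inferred point
twice = subdivide(once)   # inferred points between every pair again
```

The known end points are passed through unchanged; only the points in between are inferred.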

Rupert S

*

Multi-line Packed-Bit Int SiMD Maths : Relevance HDR, WCG, ML Machine Learning (Most advantaged ADDER Maths)


The rules for packing multiple lower Bit width Maths operations into SiMD apply at 256Bit (example) as well as 64Bit, 128Bit & 512Bit.

In all methods you use packed bits per save; so a single line save or load, Parallel, No RAM thrashing.

You cannot flow a 16Bit block into another segment (the next 16Bit block).

You can however use a 9th bit as a separator; Rolling the carry of an addition into that bit means a more accurate result!
In 32Bit you do 3 * 8Bit & 1 * 4Bit; In this example the 4Bit op has a 5Bit result & the 8Bit ops have 9Bit results (3 * 9 + 5 = 32)..
This is preferable!

2Bit, 3Bit, 4Bit Operation 1 , 8Bit Operations 3: Table

32Bit
4 : 1, 8 : 3

64Bit
4 : 2, 8 : 6
2 : 1, 7 : 8
3 : 1, 8 : 1, 16 : 3

Addition is the only place where 16Bit * 4 = 64Bit works easily, but when you ADD or - you can only roll to the lowest boundary of each 16Bit segment & not into the higher or lower segment.
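A minimal sketch of the 32Bit layout described above (three 8Bit values & one 4Bit value, each with one spare bit to catch the carry), assuming unsigned values & fields ordered low to high. The names pack_fields, unpack_fields & packed_add are hypothetical:

```python
# (value bits, shift) for each field: three 9-bit slots and one 5-bit slot = 32 bits.
FIELDS = [(8, 0), (8, 9), (8, 18), (4, 27)]


def pack_fields(values):
    """Pack three 8-bit values and one 4-bit value, leaving one guard bit above each."""
    if len(values) != len(FIELDS):
        raise ValueError("expected four values")
    word = 0
    for v, (bits, shift) in zip(values, FIELDS):
        if not 0 <= v < (1 << bits):
            raise ValueError("value out of range for its field")
        word |= v << shift
    return word


def unpack_fields(word):
    """Read each field back including its guard bit (9-bit and 5-bit results)."""
    return [(word >> shift) & ((1 << (bits + 1)) - 1) for bits, shift in FIELDS]


def packed_add(word_a, word_b):
    """One 32-bit addition adds all four fields; each carry lands in that field's guard bit."""
    return (word_a + word_b) & 0xFFFFFFFF


a = pack_fields([200, 100, 255, 15])
b = pack_fields([100, 50, 255, 15])
sums = unpack_fields(packed_add(a, b))
```

This holds for a single addition of two packed words; a further add on the result could overflow a guard bit into the next field, so the guards need clearing or widening between steps.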

A: In order to multiply you need adaptable rules to division & multiply
B: you need a dividable Maths unit with And OR & Not gates to segment the registered Mul SiMD Unit..

In the case of + * you need to use single line rule addition (no over flow per pixel)..
& Either Many AND-OR / Not gate layer or Parallel 16Bit blocks..

You can however, painful as it is, Multi Load & Zero the remainder registers &or XOR or NOT the remainder (00000) on higher depth instructions & so remain pure!

8Bit blocks are a bit small and we use HDR & WCG, So mostly pointless!

We can however 8Bit Write a patch of pallet & sub divide our colour pallet & Light Shadow Curves in anything over 8Bit depth colour,

In the case of Intel 8Bit * 8 Inferencing unit : 16 Bit Colour in probably (WCG 8 * 8) + (HDR 8 * 8) Segments,

In any case Addition is fortunately what we need! so with ADD we can use SiMD & Integer Today.

Rupert S

https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html

https://science.n-helix.com/2021/11/parallel-execution.html

https://science.n-helix.com/2022/10/ml.html

https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html

https://science.n-helix.com/2023/06/map.html

*

Main Operation solves: Bit-Depth Conversions & Operations

Packed Bits, Multibyte Storage : u32, u64, u128

The storage of multiple bit operations with Sync Read & Write,
The purpose of this is to Read, Write & Store Operations on:

DOT4
INT8, INT16
F16, F32, F64

In RAM of 32Bit, 64Bit, 128Bit

Values Storage Table

32Bit = [16bit:16Bit]
32Bit = [8bit:8Bit:8bit:8Bit]
32Bit = [4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit]

64Bit = [32bit:32Bit]
64Bit = [16bit:16Bit:16bit:16Bit]
64Bit = [8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit]
64Bit = [4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit]

128Bit = [64bit:64Bit]
128Bit = [32bit:32Bit:32bit:32Bit]
128Bit = [16bit:16Bit:16bit:16Bit:16bit:16Bit:16bit:16Bit]
128Bit = [8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit]
128Bit = [4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit]


Bear in mind that Integer 64Bit is 2 x 32Bit on AMD; So you can compute 2 operations at 32Bit per 64Bit operation,

Some 64Bit units are only 64Bit; So we need to know how many!

32Bit operations are fine! & Conversion of 16Bit value ranges into 32Bit Operations can still be within range of 16Bit Storage..
If we stick within the 16Bit value range on Multiply & ADD,
We can therefore simply post a 16Bit value range data set & expect to be able to Store 16Bit!

The simple method is to store 2 16Bit values in the same 32Bit table; like [16bit:16Bit] = 32Bit

With this we can Load, Store, Run & Save 8bit INT8 operations in 32Bit devices such as Alexa as 8bit x 4 = 32Bit, So we don't Waste RAM or resources!

But we still have access to 32Bit RAM Paging; But with values loaded in 4Bit, 8Bit, 16Bit, 32Bit & so on.
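A small sketch of the [16bit:16Bit] = 32Bit storage idea, assuming unsigned values: the multiply & add runs at 32Bit width, the result is checked against the 16Bit range, & two results share one 32Bit word. The names mul_add_u16, pack_u16x2 & unpack_u16x2 are illustrative:

```python
def mul_add_u16(a, b, c):
    """Compute a*b + c at 32-bit width; refuse results that no longer fit 16-bit storage."""
    result = (a * b + c) & 0xFFFFFFFF
    if result > 0xFFFF:
        raise OverflowError("result leaves the 16-bit value range")
    return result


def pack_u16x2(lo, hi):
    """Store two 16-bit values in one 32-bit word: [hi:lo]."""
    return (hi & 0xFFFF) << 16 | (lo & 0xFFFF)


def unpack_u16x2(word):
    """Return (lo, hi) from a packed 32-bit word."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF


first = mul_add_u16(200, 300, 535)   # 60535, still inside 16 bits
second = mul_add_u16(30, 40, 34)     # 1234
word = pack_u16x2(first, second)
```

The same pattern extends to 4 x 8Bit in a 32Bit word; only the masks & shifts change.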

With NANO Android on F16 & F32 & MIPS the same & AMD, Intel, NVidia,
Learning in F16 offers considerable value for performance with 65,536 (2^16) representable bit patterns!

(c)RS

Direct DMA 32Bit & 64Bit RAM : Multiple Sync 16Bit Texture:


A good example of where 8Bit & 16Bit Value load works well is in the case of the texture,
To load 4 x 16Bit into a single 64Bit Cache:

32Bit RAM = 16Bit, 16Bit
64Bit RAM = 16Bit, 16Bit, 16Bit, 16Bit
128Bit RAM = 16Bit, 16Bit, 16Bit, 16Bit, 16Bit, 16Bit, 16Bit, 16Bit

In the case of direct DMA, you would be aware that you have,
128Bit, 192Bit Bus on GPU
32Bit & 64Bit on CPU

So a direct 4 * 32Bit or 2 * 64Bit Cache load is a logically fast method to DMA directly from Cache to GPU!
In short you convert 8 x 16Bit into a 2 x 64Bit DMA push; Which is very fast!

You can do the same with batches of vertices in many storage sizes.
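The 8 x 16Bit into 2 x 64Bit conversion can be sketched with Python's struct module, assuming little-endian layout & unsigned texel values; the texel numbers are made up for illustration & no real DMA call is involved:

```python
import struct

# Eight hypothetical 16-bit texel values.
texels = [0x1111, 0x2222, 0x3333, 0x4444, 0x5555, 0x6666, 0x7777, 0x8888]

# Pack them back to back: 8 x 2 bytes = 16 bytes = 128 bits.
payload = struct.pack("<8H", *texels)

# View the same bytes as two 64-bit words, ready to move as two wide transfers.
word_lo, word_hi = struct.unpack("<2Q", payload)

# Reading the words back as 16-bit values recovers the texels.
restored = list(struct.unpack("<8H", struct.pack("<2Q", word_lo, word_hi)))
```

No arithmetic happens here; the bytes are only regrouped, which is why the wide transfer costs nothing in precision.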

(c)RS

References:
https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html
https://science.n-helix.com/2021/02/multi-operation-maths.html
https://science.n-helix.com/2021/11/parallel-execution.html
https://science.n-helix.com/2022/12/math-error-solve.html

On the subject of how deep a personality of 4Bit, 8Bit, 16Bit is reference:
https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
https://science.n-helix.com/2022/10/ml.html

*

Quantization modelling : RS : Physics III Slit Experiment

Expanding on potentials for precise machine learning has the same qualities as Maths & quantified research,

A fully qualified result is often required for deep thought & precise thought!,

But we do not always have the RAM or resources that we require; When we need to prioritise load & data sets to specific RAM & Processor availability or location, or by necessity..

Optimise our resource footprint & speed, while maintaining precision to our fully optimised values & data set needs & requirements.

*

Dynamic Scaling


Ideas of FP8-F8 & FP16-F16 Interpolation to 32Bit & 64Bit, Gamma Curves are usable (c)RS

Presenting the full precision neuron:

var Me = expand {

preload = dataset {Ds1, Ds2, Dsn}, {condition = Present};

var Present = {Datapoint set};

var CC = Compose Compressed {Brotli-G > ZSTD};

4Bit to N-Bit {Brotli-G (GPU Shader) Compressed Data Bit with Tri-Linear Interpolation & Extrapolation};

var Pf = Processor contains N Features {F16, F32, F64, FPU} * {N, N2, N3, Nn};
var Ex = Expand Points {Series Precise {F16:<FPU}, Series Median {Int8:<Int64}, Series Low priority {Int2:<Int32}};

load Present;

run ML {epoch1 < epochNN};

test results {log : logNN};

};

*

"(SmoothQuant).The optimized model achieves >3X latency improvement with a custom dequantization kernel for FP16 inference. Although the work does not map to Int8 engine"

In view that inferencing is being activated in Int4 & Int8 & Int16 & Floats f16b F8 & F4,

Now my view is a vision of a Slit experiment in Physics; Now a slit experiment shows light photons in slices through a screen..

Int4 IIII < Int8 IIIIIIII < Int16 IIIIIIIIIIIIIIII

Ratio 1:2:4 on contained knowledge

Minimal Origin of mankind's knowledge : IIII < IIIIIIII < IIIIIIIIIIIIIIII Defined Summit of all power

My method is to compress the point node data with
https://is.gd/WaveletAutoEncoder 
https://github.com/GPUOpen-LibrariesAndSDKs/brotli_g_sdk

So what we do is take advantage of patterns; Creating tables of 1111 1010 as examples; These compress well & can be short noted as patterns,

We can expand 4Bit into 8Bit inference & compress as patterns; The total data point is 4Bit if it is a pattern,
The subject is not predictable unless we pick the patterns!

We can however Quantize the memory footprint; The Double/Single precision operations may be faster! :L

We need the models to work in F16 & Int8 & Int4 after-all, But i see a reason to use Floats because sub-quantization does leave a remainder for us to compare..

That relevant 'F16' >=-
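The remainder left by quantization can be made concrete with a small sketch. It assumes symmetric quantization with one shared scale per group of values (a common scheme, not necessarily the one intended above); quantize & dequantize are hypothetical helper names & the weights are made up:

```python
def quantize(values, bits=8):
    """Map floats to signed integer codes with one shared scale (symmetric quantization)."""
    qmax = (1 << (bits - 1)) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Expand integer codes back to floats."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.03, 1.0]

codes8, scale8 = quantize(weights, bits=8)
codes4, scale4 = quantize(weights, bits=4)

# The remainder is what the integer codes could not represent.
remainder8 = [w - d for w, d in zip(weights, dequantize(codes8, scale8))]
remainder4 = [w - d for w, d in zip(weights, dequantize(codes4, scale4))]
```

Keeping the remainder in a float alongside the integer codes gives something to compare against, which is the reason given above for still using Floats.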

RS

Study Subject Reduction :

https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
https://science.n-helix.com/2022/10/ml.html

https://blog.openvino.ai/blog-posts/q123-technology-update-low-precision-and-model-optimization
https://blog.openvino.ai/blog-posts/q223-technology-update-low-precision-and-model-optimization
https://blog.openvino.ai/blog-posts/q323-technology-update-low-precision-and-model-optimization
https://blog.openvino.ai/blog-posts/q423-technology-update-low-precision-and-model-optimization

Ideas of FP8-F8 & FP16-F16 Interpolation to 32Bit & 64Bit, Gamma Curves are usable (c)RS - 'ocp' F8 & FP8 or smaller with interpolation-microscaling-formats-mx-v1-0-spec-final
https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf

Self Trained Auto Sparsity ML

Evolution-ML CNN Self Trained Auto Sparsity - Hybrid multi-objective evolutionary model compression with convolutional neural networks
https://www.sciencedirect.com/science/article/pii/S2590123024000045
https://blog.research.google/2023/12/advancements-in-machine-learning-for.html

ML Batch Matrix MAP in FPGA
https://drive.google.com/file/d/1hdxeK1r8LIhvpn7poOm3MfXmGr9Tq-ni/view?usp=sharing

ML Compressed Dynamic16bit-8Bit - Hardware-friendly compression and hardware acceleration for ML Transformer
https://aimspress.com/article/doi/10.3934/era.2022192

Matrix Processors - Memory & command - All-Digital Compute-In-Memory FPGA Architecture for Deep Learning Acceleration
https://dl.acm.org/doi/pdf/10.1145/3640469

Matrix Processors - Inline Ram & Command { CMD : RAM }:{NET}
https://www.xilinx.com/content/dam/xilinx/support/documents/white_papers/wp506-ai-engine.pdf
https://www.xilinx.com/content/dam/xilinx/support/documents/white_papers/EW2020-Deep-Learning-Inference-AICore.pdf


TAC (Tiny Anomaly Compression)
https://pypi.org/project/Conect2ai/

Inference on any device with a C99 compiler
https://pypi.org/project/emlearn/

to run without activating C99; Installs under Python 3.10+
https://github.com/emlearn/emlearn-micropython
https://github.com/emlearn/emlearn-micropython/releases
git clone https://github.com/emlearn/emlearn-micropython

With EmLearn you can compile really tight models of tensors & random forest & Gaussian Matrix,
These are very good for:

A1: Anti-Aliasing ( Gaussian, Tensor error diffusion, forested Random spread )
A2: sharpening & Shaping ( Tensor Edge detect with enhance, Gaussian estimation & line fill, Random forest A to B to D: E to B to F X + )
A3: Line & Curve estimation fills & Tessellation ( forested Random spread (Dither fills) & A1 & A2 & Differentiation in 3D Space : 1:2:3{ A B C : E B F }
A4: HDR & WCG, Combinations of dithering in colour space & light/Shadow differentiation in 3D Space : 1:2:3{ A B C : E B F }

36Minutes UpscaleDL https://youtu.be/16jLi95mat8

Megatron Classifies Images in Web Tensors
A: https://drive.google.com/file/d/1EMMASCIu92hIgIxg0bEBrmAJuxvEfk2e/view?usp=drive_link
11m 34m space then 11m;

Mr420Megatron Classifies Images in Web Tensors & you know he's good right, That is just what he feels! for real bro
https://drive.google.com/file/d/1UXlA-xpODvwGuUhCed0EBd6LJ0wB4J5E/view?usp=drive_link

https://www.w3.org/2020/06/machine-learning-workshop/talks/access_purpose_built_ml_hardware_with_web_neural_network_api.html


https://intel.github.io/webml-polyfill/examples/image_classification

Rupert S

Batch Size 240W>65W, 32GB{64, 16}, 15W>5W, 4gb{16, 1} : 16, 8, 4 seems optimal,
Time taken compatible:

Coral TPU Micro Edge Learning with performance arranged Intel FPGA Arria 10 SX SoC Kit & Google Coral, NVIDIA Jetson Nano and CPU ROCK 4C Plus
https://doi.org/10.3390/s24030899

ML Document Caches - USB Acceleration & Small devices - Combining Machine Learning and Edge Computing Opportunities Frameworks & Devices
https://www.mdpi.com/2079-9292/13/3/640

https://is.gd/CJS_DictionarySort

Python & JS Configurations
https://is.gd/DictionarySortJS

*

ML Tensor & ONNX: A Machine learning model that uses direct compression & keeps higher accuracy, in preference to Bit Reduction; Because reducing Bit Depth on decisions can make results overflow your maximum ML Node Point Depth...

Because of point overflow at low bit depth (less than 4Bit in most cases), We plan to use compression to multiply the RAM available to the ML..

With Brotli-G the compressed stream can be decompressed directly inside the GPU & therefore the results can be faster & more efficient for us..

We can further improve by selecting compression Compatible patterns such as 1111<1toN or 1010<10*N where N = Multiples of for example 1234 (repeating); R * N = RN,

So we can maximise compression in Processor & not need to pass uncompressed data points,
We Cache & Decompress & Recompress as required.
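
The Cache, Decompress & Recompress idea above can be sketched in a few lines. This is a minimal sketch only: Python's zlib stands in for Brotli-G (which is not called here), everything runs on the CPU, and the names CompressedCache, patterned & noisy are hypothetical. It compares a repeating 1234 style pattern with random data of the same size.

```python
import os
import zlib  # stand-in codec; Brotli-G itself is NOT used in this sketch


class CompressedCache:
    """Keeps values compressed in RAM; decompress on read, recompress on write."""

    def __init__(self, level=6):
        self._level = level
        self._store = {}

    def put(self, key, raw):
        # Recompress whenever a value is written back.
        self._store[key] = zlib.compress(raw, self._level)

    def get(self, key):
        # Decompress only when the value is needed.
        return zlib.decompress(self._store[key])

    def stored_size(self, key):
        return len(self._store[key])


# Compression compatible pattern: a short block R repeated N times (R * N).
patterned = bytes([1, 2, 3, 4]) * 4096
# Incompressible comparison data of the same length.
noisy = os.urandom(len(patterned))

cache = CompressedCache()
cache.put("patterned", patterned)
cache.put("noisy", noisy)

patterned_ratio = len(patterned) / cache.stored_size("patterned")
noisy_ratio = len(noisy) / cache.stored_size("noisy")
print(f"patterned x{patterned_ratio:.1f}, noisy x{noisy_ratio:.2f}")
```

The patterned block shrinks by a large factor while the random block does not shrink at all; That is the reason for preferring compression compatible patterns.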

RS

ML tensor + ONNX Learner libraries & files

https://is.gd/DictionarySortJS
https://is.gd/UpscalerUSB_ROM
https://is.gd/UpscaleWinDL
https://is.gd/HPC_HIP_CUDA

https://is.gd/OpenStreamingCodecs

*

Application of Data Compression to ML

Some examples of how Brotli-G compression can be used to improve the performance of machine learning models:

Compressing model parameters: Brotli-G can be used to compress the weights & biases of machine learning models,

This reduces the amount of memory required to store the model; Which is beneficial when deploying the model on devices with limited memory.. For example:

Brotli-G might compress a model with 100 million parameters from 100MB to around 50MB; The figures here are illustrative & the real ratio depends on how repetitive the data is.

Compressing model inputs & outputs: Brotli-G can also be used to compress the inputs & outputs of machine learning models;

This reduces the amount of data that needs to be transferred between the model & the data source or sink.. For example:

Brotli-G might compress images from 1MB to around 500KB.

Compressing model activations: Brotli-G can also be used to compress the activations of a machine learning model;

This reduces the amount of memory required to store the intermediate results of the model.. For example:

Brotli-G might compress activations from 500MB to around 250MB.

In addition to these specific examples, Brotli-G can also be used to compress other types of data that are used in machine learning; Such as text & data,

Brotli-G is a high-performance compression algorithm that can provide significant performance improvements for machine learning applications.

These examples demonstrate the potential of Brotli-G to improve the performance of machine learning models. As Brotli-G becomes more widely adopted, we can expect to see even more innovative uses of powerful compression algorithms.

RS

*

Inferencing & Classification : Protocols

To clarify: Inferencing units such as those from Intel, AMD & ARM are expressly created to run Edge detect & other machine learning comparators with a minimal instruction load..

The Inferencing instructions contain the logic of comparison & furthermore are created to facilitate the comparison of Inference tasks..

Most logically, a wise person could see scope for edge detecting expressly with edge sharpening & shaping in mind; But also Trilinear filtering & of course Tessellation..

Now I believe you have Displays, Cameras & Audio Systems to optimise!

Now we know that we can also improve latency related issues such as frame tearing detection, jitter, QFT & VRR.

How ? Inference all of the latency issues of frame arrival time, torn frames & misaligned audio & Electric signal jitter in what is effectively an Ethernet protocol AKA Frame Transmission & Reception ...
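
Before any inference can run on frame arrival time, the arrival times have to be turned into features. A minimal sketch, assuming timestamps in milliseconds & a hypothetical helper called frame_report; it uses plain statistics rather than an ML model & only shows the kind of input an inference unit could classify.

```python
from statistics import mean


def frame_report(arrival_ms, target_interval_ms, tolerance_ms=2.0):
    """Summarise frame pacing from a list of arrival timestamps (ms)."""
    # Time between consecutive frames.
    intervals = [b - a for a, b in zip(arrival_ms, arrival_ms[1:])]
    # Jitter here is the mean absolute deviation from the target interval.
    deviations = [abs(i - target_interval_ms) for i in intervals]
    # A frame is 'late' when its interval overshoots the target by more than the tolerance.
    late = [idx + 1 for idx, i in enumerate(intervals)
            if i - target_interval_ms > tolerance_ms]
    return {
        "mean_interval_ms": mean(intervals),
        "jitter_ms": mean(deviations),
        "late_frames": late,
    }


# Hypothetical 62.5Hz stream (16ms target) where frame 3 arrives 10ms late.
arrivals = [0.0, 16.0, 32.0, 58.0, 64.0]
report = frame_report(arrivals, target_interval_ms=16.0)
print(report)
```

The same report shape could be filled from audio packet or Ethernet frame timestamps; The choice of tolerance is an assumption to tune per display.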

More ? Why not :L

Rupert S

*

Int8:SiMD : Maths & Logic


This is about how you think about components such as INT8, INT4(Xbox) & SiMD, You have to classify by necessity & optimise the structure.

You can shape the game reality with specific control objects & statics!
Maths in SiMD & Int8 & Machine Learning in Int8 & SiMD; SiMD is hard maths, Int8 is soft edge inference...

Both are maths; But soft logic is not a PROOF Math but can be proof; Hard math is not 'Invention & Imagination 'Exactly''

But we have both to improve performance.

RS

*

SiMD Performance : RS


Performance per WATT of MMX & MMX+ & SSE & AVX Machine Learning & Shader code; Is a matter of 8x8Bit & 16x16Bit Code on GPU

Our role is to reduce complex un-cache-able ML to Cache Enabled 64KB
Modelling of 1990's without Quality loss of 32Bit++ 64Bit+

8x8Bit sharpening MMX Becomes Dual Pipe (16x16bit)*2 in 32Bit Dual 16 Pipeline & Twice as sharp
Machine Learning method for MMX Is Fast & Cheap, MMX2 More Compatible,
Intrinsic improvements such as combined ops & DOT4 Further improve the performance of under 1MB Code..
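
To make the combined ops & DOT4 remark concrete, here is a minimal sketch of what a dp4a style operation computes: Four signed 8Bit products summed into a wider accumulator in one step. It is plain Python, so it models the arithmetic only & not the speed; The function names dp4a & dot_int8 are hypothetical & 32Bit accumulator wrap-around is not modelled.

```python
def dp4a(a4, b4, acc=0):
    """Dot product of four signed 8Bit lanes, added to an accumulator."""
    if len(a4) != 4 or len(b4) != 4:
        raise ValueError("dp4a works on exactly four lanes")
    for lane in (*a4, *b4):
        if not -128 <= lane <= 127:
            raise ValueError("lane does not fit in signed 8Bit")
    return acc + sum(x * y for x, y in zip(a4, b4))


def dot_int8(a, b):
    """Longer Int8 dot product built from repeated four-lane steps."""
    if len(a) != len(b) or len(a) % 4:
        raise ValueError("inputs must have equal length, a multiple of 4")
    acc = 0
    for i in range(0, len(a), 4):
        acc = dp4a(a[i:i + 4], b[i:i + 4], acc)
    return acc


a = [1, -2, 3, 4, 5, 6, -7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
print(dot_int8(a, b))
```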

Performance & Function per WATT, Is unbeaten; Let us prove it!

For example Quake has MMX Emulation & MMX Dithering code on 3D Textures,
In 8Bit 256 Colours dithering is noticeable; In 15Bit to 32Bit the small shade difference in dithering colour is subtle & flawless,
Improving light subtlety & Colour palette WCG & HDR 10Bit to 16Bit per channel.
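
A minimal sketch of the dithering idea, assuming ordered dithering with a 2x2 Bayer threshold matrix (chosen here for illustration; the Quake code itself is not reproduced). The hypothetical function dither_to_bits reduces 8Bit grey values to fewer bits while keeping the average shade.

```python
# 2x2 Bayer ranks; each cell gets a different rounding threshold.
BAYER_2X2 = ((0, 2), (3, 1))


def dither_to_bits(image, bits):
    """Quantise 8Bit grey values (0..255) to 'bits' per pixel with ordered dither."""
    levels = (1 << bits) - 1
    out = []
    for y, row in enumerate(image):
        out_row = []
        for x, value in enumerate(row):
            scaled = value * levels / 255.0
            base = int(scaled)        # level just below the true shade
            frac = scaled - base      # what plain truncation would lose
            threshold = (BAYER_2X2[y % 2][x % 2] + 0.5) / 4.0
            # Round up only where the lost fraction beats the local threshold.
            quantised = base + (1 if frac > threshold else 0)
            out_row.append(min(quantised, levels))
        out.append(out_row)
    return out


flat_grey = [[128, 128], [128, 128]]
print(dither_to_bits(flat_grey, bits=1))
```

With 1Bit output, mid grey becomes a checkerboard whose average is still half brightness; At higher bit depths the same pattern only moves values by one small step, which is why it becomes hard to see.
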
*

SiMD & Int8 & dp4a & F16/F32/F64>:


The way SiMD works; Repeating Parallel batches of instructions can still side load data,
Data is loaded into the 'calculation set'

http://ftp.cvut.cz/kernel/people/geoff/cell/ps3-linux-docs/CellProgrammingTutorial/BasicsOfSIMDProgramming.html
https://en.wikipedia.org/wiki/Single_instruction,_multiple_data

SiMD operands consist of 8Bit to 64Bit Integers (Long) & Floats,
SiMD are simple instructions; Or so they think; SiMD are relatively complex instructions..
For example 4/1 of a page full of arithmetic code; However our goal is to use Heuristics & logic to circumvent the Artifacts/Errors in self generated code,

In addition to using problem solving tables to choose instructions that advantage our analysis (Machine Learning),
We also can choose the most probably optimal code type.

Our outset objective is to decide if we want to use CPU Feature types:

F16
Int8
dp4a
SiMD

Depending on the Mathematical Qualities of each ML Node & the questions they are asking,
For example:

A simple ResNet Image identification uses edge detect & for that we need for example SiMD Matrix Edge Detection

Speech requires identifying Words in a codec, So obviously we need a Decoder & Encoder,
Word identifiers & correctness checking; But firstly we need to identify accent to correctly choose words..

We also need to classify words by Idea grouping (DataBase, Open Database)

As you can see; We will be defining many of these function groups as SiMD & Float,
Effective use of Int8 differentiation, Comparators & Maths operations has many benefits; So does JIT Compile.
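
The Matrix Edge Detection mentioned above can be sketched with the Sobel operator; That choice is an assumption, since the text does not name a specific kernel. Pure Python loops are used for clarity, where a real build would map the same multiply & add pattern onto SiMD lanes.

```python
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_magnitude(image):
    """Return |gx| + |gy| per pixel; border pixels are left at 0."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for ky in range(3):
                for kx in range(3):
                    pixel = image[y + ky - 1][x + kx - 1]
                    gx += SOBEL_X[ky][kx] * pixel
                    gy += SOBEL_Y[ky][kx] * pixel
            out[y][x] = abs(gx) + abs(gy)
    return out


# Hypothetical 4x4 test image: dark left half, bright right half.
image = [[0, 0, 255, 255] for _ in range(4)]
edges = sobel_magnitude(image)
for row in edges:
    print(row)
```

The vertical boundary shows up as large values in the two middle columns; A flat image would give zeros everywhere.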

RS

*

Solve Table of Statistically provable Machine Equates & Solves : Table of function competitors & Operators.


Runtime Library - Multiple Solve Table

I would like a Solve Table of Statistically provable Machine Equates & Solves that make the equivalent of Maths Compilers such as Rust & Fortran.

For example basic ML code test function loops are basically compatible with X-OR Comparators on AVX! Other functions such as greater or less than; Are AVX Compatible.

Machine Learning : List of actions that are SiMD Baseline: Statistical Observance and Solve Tables

Yes or no comparator X-OR
Memory array Byte Swap
Greater or less than with swap or with X-OR Roll
Memory save & store
Edge comparisons
Compares (Colour, Math, Equate, Target, Solve if)

There are more! Statistical Observance and Solve Tables.
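
Three of the baseline actions listed above, sketched per lane. Python lists stand in for vector registers, so this shows the logic & not AVX itself; The function names are hypothetical.

```python
def xor_equal_mask(a, b):
    """Yes or no comparator: XOR is zero where lanes match, giving a 0xFF mask."""
    return [0xFF if (x ^ y) == 0 else 0x00 for x, y in zip(a, b)]


def byte_swap32(value):
    """Byte swap of one 32Bit word (little endian <-> big endian)."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")


def compare_swap(a, b):
    """Greater or less than with swap: smaller values to one vector, larger to the other."""
    low = [min(x, y) for x, y in zip(a, b)]
    high = [max(x, y) for x, y in zip(a, b)]
    return low, high


print(xor_equal_mask([1, 2, 3, 4], [1, 0, 3, 9]))
print(hex(byte_swap32(0x11223344)))
print(compare_swap([5, 1, 7], [2, 9, 7]))
```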

Examples 2:

Shape compare is a matter of inner & outer Vector : Comparison & X-OR, Larger outside & X-OR The differentiation:
By Dot,
By Mass (non literal dot difference comparator by axis),
Actual Mass
Density : Lumina, Weight, Mole, Mass / Area

Edge Solve : X-OR ~= Colour, Lumina, Shade, Vibrancy, Distance, Matrix Solve 3D>=2D Flattened Comparator
If = X-OR=N<0.0001 Then Compare &= Mutex Solve / Average

Polygon Join/Merge Tessellation : If Model = Same (T1 + T2 If (T1 + T2)/2 = Difference Less Than 0.0001 | = Merge/Converge
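
One possible reading of the Edge Solve & Polygon Merge rules above, as a minimal sketch: When two values differ by less than 0.0001 they are treated as the same & replaced by their average. The function names & the vertex list layout are assumptions made for illustration.

```python
TOLERANCE = 0.0001


def edge_solve(a, b, tolerance=TOLERANCE):
    """Return the average when a & b agree within tolerance, else None."""
    if abs(a - b) < tolerance:
        return (a + b) / 2
    return None


def merge_polygons(t1, t2, tolerance=TOLERANCE):
    """Merge two vertex lists into their average when every coordinate agrees."""
    if len(t1) != len(t2):
        return None
    merged = []
    for v1, v2 in zip(t1, t2):
        solved = [edge_solve(c1, c2, tolerance) for c1, c2 in zip(v1, v2)]
        if any(value is None for value in solved):
            return None  # the models differ; keep them separate
        merged.append(tuple(solved))
    return merged


t1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
t2 = [(0.00005, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(merge_polygons(t1, t2))
print(merge_polygons(t1, [(0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]))
```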

*

Audio, Video & High precision Float ML


Tensors & full ONNX configuration : Upscaling : While we are not sure how much ML we need & at what precision,

We can be sure that 32Bit (per channel) Value RGBA (Multiple layer) requires at least 8Bit to 16Bit per channel final precision; So here is a list:

Required Value of output, Neural Network precision guide table: RS

Input
8Bit, 10Bit, 12Bit, 16Bit

Input network precision average bit retention (for RAM some error is allowed)
6Bit, 8Bit, 10Bit, 14Bit, 16Bit

Classifiers as we know can be,
Int 2Bit 4Bit, 8Bit, 16Bit, 32Bit
2 Bit is unlikely & 32Bit is for Dream Smooth 16Bit+ Precision output

Output Float (Mostly FP & bF16)
16Bit = { 8Bit, 10Bit, 12Bit }
24Bit, 32Bit, 64Bit = { 16Bit, 32Bit, 48Bit }
We can upscale : Audio, Video, Content & Polygons, We classify Quality by expectations & Quantify by percent %
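
To put numbers on the precision guide above, here is a minimal sketch that measures how much a value can move when it is stored at a given bit depth. It assumes values normalised to 0..1 & uniform rounding; The names quantise & worst_case_error are hypothetical.

```python
def quantise(value, bits):
    """Round a 0..1 value to the nearest level available at 'bits' precision."""
    levels = (1 << bits) - 1
    return round(value * levels) / levels


def worst_case_error(bits):
    """Half of one step: the most a value can move when rounded."""
    return 0.5 / ((1 << bits) - 1)


for bits in (8, 10, 12, 16):
    error = abs(quantise(0.3, bits) - 0.3)
    print(f"{bits:2d}Bit  worst case {worst_case_error(bits):.7f}  example {error:.7f}")
```

Each extra 2Bit cuts the worst case error by roughly four times; That gives a way to quantify the 'expectations' by percent.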

Rupert S

*

FPGA BitFile & Code Opt (c)RS 2021-01 


https://science.n-helix.com/2022/10/ml.html
https://science.n-helix.com/2022/08/jit-dongle.html
https://is.gd/LEDSource

In my view heuristics in compilers are a choice for those who do not wish to include direct ML compiled into their code,
This is understandable in terms of terminator & cylons & indeed flawed beings or even good ones with depression!

However the application of branch optimisation is a sample code optimisation that can 'Plug In' to branch caching on the CPU & GPU.

Heuristics are not just code in the compiler; They are also micro code selecting a probable branch; Although code that forces a branch can be flawed..

Both heuristics, Branch probability selection & ML can run in parts of the code to select probable path!

Yes, fundamentally any code that modifies behaviour is open to the charge of being unsound; 'Fortran code is rock solid' & Rust is also supposed to be solid.

Including soundly made heuristic code & branch probability code ML in your inline routines; 'Very much interpretive master jedi'; But it can be done!

Question is How big? & how fixed?

25KB per 3MB on average?

ML & Heuristics like my application FPGA BitFile & Code Opt (c)RS 2021-01

can be applied at runtime & remain only for selecting the fastest path or the best; In terms of which Processor function to run code for.
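
Selecting the fastest path at runtime can be sketched as a small benchmark & pick step. This is a minimal sketch under assumptions: The two candidate functions are hypothetical stand-ins for two code paths (for example SiMD & scalar) & timeit is used as the measuring heuristic.

```python
import timeit


def sum_squares_loop(values):
    total = 0
    for v in values:
        total += v * v
    return total


def sum_squares_builtin(values):
    return sum(v * v for v in values)


# Hypothetical table of equivalent code paths.
CANDIDATES = {
    "loop": sum_squares_loop,
    "builtin": sum_squares_builtin,
}


def pick_fastest(candidates, sample, repeats=3):
    """Time every candidate on sample data & return the quickest name plus timings."""
    timings = {}
    for name, fn in candidates.items():
        runs = timeit.repeat(lambda: fn(sample), number=10, repeat=repeats)
        timings[name] = min(runs)  # best run is least disturbed by other load
    best = min(timings, key=timings.get)
    return best, timings


best, timings = pick_fastest(CANDIDATES, list(range(1000)))
print(best, timings)
```

The result only needs to be computed once per machine & stored; After that the table stays fixed & small.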

(c)Rupert S

*

TOPCloud Scaled Flexible WebASM & WebGPU & MathML!


Quite flexible for use on Monitors & TV's; Light processor load on simple tasks & offloadable such as TOPCloud!

You may be thinking Offloading is impracticable because that requires one of two things:

JIT Compiler Dongle..
USB device such as Firestick or GPU & CPU (With OpenCL Compat)

Server! so internet & service provision!
Impossible? No; WebAdvert supported TV's need both!
So why not HPC TOPCloud? could make a HOT TV a lot cooler & Eco friendly with Server repeating tasks:

Scaling
Quality Service
Service availability

TOPCloud Offload Logic:

In terms of WebASM & WebGPU & MathML; TOPCloud provides sufficient advantages to be considered a core utility..

While Offloading repeating content such as Siteload core stack (Server) & Localising configuration such as Webpage size & DPI & Dynamic font arrangements that require thought.

In terms of Offloaded function & Efficient system load for large configurations..

Especially efficient configurations such as TPU, Coral, GPU work & Cloud CPU that have large optimised stacks & installed drivers.

RS

*

#Sound Strategy game TOPCloud (c)RS


PCM & MP4 are 2D/3D Image so GPU Helps there also with 3D Audio mapping!
Games do not require cloud processing of images & a lot of local strategies are procedural Heuristic

You see RDP has GPU Connect (my innovation I might add) So Bluetooth & Wifi can connect RTP GPU; The port specifics are not particularly important; However a device such as a music streamer can have ML TOPS available locally & from the cloud,

Due to how the TOPCloud strategy works with localised ML TOPS; Not all data has to be sent or received.. For example all Audio 3D Profiles for HQ Room audio can be done within a few MB of data; With some hard work? 150Kb of data & so in reach of phones & mobile!

Gaming is an example here. I give Tic-Tac-Toe as the example, where all that a device like Alexa or a Google smart device has to think is: Which square? but..

No physical picture needs to be sent for the game to be played & if required a small Tic-Tac-Toe Strategy ML can be kept locally for a quicker response!
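
A minimal sketch of the 'Which square?' idea: The local device keeps a tiny strategy & the only thing that has to travel is one square index. A fixed rule set (win, block, then prefer centre & corners) stands in for the small strategy ML; The names & the one byte payload are assumptions for illustration.

```python
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def winning_square(board, player):
    """Return the empty square that completes a line for player, if any."""
    for line in WIN_LINES:
        marks = [board[i] for i in line]
        if marks.count(player) == 2 and marks.count(" ") == 1:
            return line[marks.index(" ")]
    return None


def choose_square(board, me, opponent):
    """Win if possible, else block, else take centre, corners, then edges."""
    for who in (me, opponent):
        square = winning_square(board, who)
        if square is not None:
            return square
    for square in (4, 0, 2, 6, 8, 1, 3, 5, 7):
        if board[square] == " ":
            return square
    raise ValueError("board is full")


# Squares are numbered 0..8, left to right, top to bottom.
board = ["X", "X", " ",
         "O", "O", " ",
         " ", " ", " "]
move = choose_square(board, me="O", opponent="X")
payload = bytes([move])  # the whole message: one byte, no picture
print(move, len(payload))
```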

You see with a low latency GPU RTP & GPU RDP connection to cloud GPU; Most localised thinking TOPS can be carried out in Seconds if not milliseconds & PCM & MP4 are 2D/3D Image so GPU Helps there also with 3D Audio mapping!

Rupert S

*

Core features of TOPCloud:

RTP ML TOPS are a processors friend

3D audio mapping & spatialization for realistic sound effects
3D Vector Support for various audio formats such as PCM, MP4, OGG, and WAV

Low latency & high bandwidth connection to cloud GPU servers via RDP

Procedural & heuristic algorithms for generating game scenarios & strategies & 3D Audio & Visuals
Localized & cloud-based machine learning models for optimizing game performance & user experience

RTP GPU Connect technology that allows users to access GPU resources from any device with Bluetooth or WiFi

TOPCloud is a revolutionary 'TOPS' way to enjoy & create audio games using your own music & the power of the cloud. Try it today & discover a new dimension of gaming!

*

Scaling; We can classify by colour or creativity. (c)RS


If you use TOPCloud, you can share between different displays in the TOP's Sense..
but mostly you would need cloud presence,

Mostly this would be about making the most out of TOP heavy Business GPU & personal ones in your computer or consoles.

But sharing common tasks such as scaling movies by type or by identifying a single movie to upscale...

Now you might be asking what we would be doing there?
Well a single movie uses the same materials in our ML; We can analyse the class & optimise the scaling by class..

For those familiar with games & FSR; We familiarise our code with a single game!
By doing this we improve our product and can therefore classify by:

Resolution
Style
Speed
Type, FPS for example & RTS

We can classify by colour or creativity...

We do not simply have to roll the dice on General Scaling, We can use classifiers:

Title
Scale
Type
Speed
Frame Rate
Colour & Composure

Rupert S

PoCL Source & Code
https://is.gd/LEDSource

*

We all think our own way; Potential is always there on a Runtime Library - Multiple Solve Table

Machine learning | Equate ~= Multi Layer Wavelet Abstraction
https://science.n-helix.com/2022/09/ovccans.html

https://www.youtube.com/watch?v=-9lCpfrOQQ4

(c)Rupert S 2022-10

https://is.gd/LEDSource
https://is.gd/BTSource

https://science.n-helix.com/2023/06/tops.html

https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
https://science.n-helix.com/2022/08/jit-dongle.html
https://science.n-helix.com/2022/06/jit-compiler.html

https://is.gd/MLCodecShaping
*

This one will suit Dedicated ARM Machine in body armour 'mental state' ARM Router & TV
(ARM Learning 4K ROM; Safe Larger USB ROM) https://bit.ly/3Afn1Y4

https://drive.google.com/file/d/102pycYOFpkD1Vqj_N910vennxxIzFh_f/view?usp=sharing

Android & Linux ARM Processor configurations; routers & TV's upgrade files, Update & improve
https://drive.google.com/file/d/1JV7PaTPUmikzqgMIfNRXr4UkF2X9iZoq/

Provenance: https://www.virustotal.com/gui/file/0c999ccda99be1c9535ad72c38dc1947d014966e699d7a259c67f4df56ec4b92/

https://www.virustotal.com/gui/file/ff97d7da6a89d39f7c6c3711e0271f282127c75174977439a33d44a03d4d6c8e/

Python Deep Learning: configurations

AndroLinuxML : https://drive.google.com/file/d/1N92h-nHnzO5Vfq1rcJhkF952aZ1PPZGB/view?usp=sharing

Linux : https://drive.google.com/file/d/1u64mj6vqWwq3hLfgt0rHis1Bvdx_o3vL/view?usp=sharing

Windows : https://drive.google.com/file/d/1dVJHPx9kdXxCg5272fPvnpgY8UtIq57p/view?usp=sharing

*Windows {
To Compress using CPU/GPU: MS-OpenCL
https://is.gd/MS_OpenCL
https://is.gd/OpenCL4X64
https://is.gd/OpenCL4ARM

Upscale DL
https://is.gd/UpscaleWinDL

https://is.gd/HPC_HIP_CUDA

https://www.amd.com/en/developer/rocm-hub/hip-sdk.html#tabs-ddafbba141-item-c6b9ce2aab-tab
https://rocm.docs.amd.com/en/docs-5.5.1/deploy/windows/quick_start.html

X86Features-Emu
https://drive.google.com/file/d/15vXBPLaU9W4ul7lmHZsw1dwVPe3lo-jK/view?usp=usp=sharing
}

Machine Learning SDK's,
You may not have a Machine Learning SDK to accelerate your GPU/CPU/Device

3 main ones, but Python does not guarantee an accelerator!
Obviously Python Builds with Accelerators work!

HW Build Source : Upscale DL
https://github.com/GPUOpen-LibrariesAndSDKs/RadeonML
https://github.com/GPUOpen-LibrariesAndSDKs/RadeonImageFilter

PoCL Source & Code
https://is.gd/LEDSource

*
https://github.com/ssube/diffusers/tree/feature/onnx-upscale

https://github.com/huggingface/diffusers
https://huggingface.co/ssube/stable-diffusion-x4-upscaler-onnx

https://huggingface.co/uwg/upscaler/tree/main
https://huggingface.co/nvmmonkey/optimal_upscale/tree/main
https://huggingface.co/gmp-dev/gmp-upscaler/tree/main/ESRGAN

Neural Engine
https://github.com/godly-devotion/MochiDiffusion

ML List & Services
https://huggingface.co/models?sort=downloads&search=upscale
https://huggingface.co/models
https://huggingface.co/pricing

Tokma ML

Batch Size 240W>65W, 32GB{64, 16}, 15W>5W, 4gb{16, 1} : 16, 8, 4 seems optimal,
Time taken compatible:


Python & JS Configurations
https://is.gd/DictionarySortJS

https://iopscience.iop.org/article/10.1088/1741-4326/ad142f

https://is.gd/TokmaML

*

Training Networks


Coral TPU Micro Edge Learning with performance arranged Intel FPGA Arria 10 SX SoC Kit & Google Coral, NVIDIA Jetson Nano and CPU ROCK 4C Plus
https://doi.org/10.3390/s24030899

With both USB Devices being 8Bit INT, I would imagine all of the models would run on both in 8Bit INT
https://coral.ai/docs/edgetpu/models-intro/#transfer-learning-on-device
https://www.intel.com/content/www/us/en/developer/articles/technical/movidius-accelerator-on-edge-software-hub.html
https://www.intel.com/content/www/us/en/support/articles/000033354/boards-and-kits/neural-compute-sticks.html

39$ 2x Edge TPU, Prefer 6x or 8x M.2 & PCI 16x & 32x with 4GB+ RAM
https://coral.ai/products/m2-accelerator-dual-edgetpu/

https://coral.ai/products/

Gimp Speed Figures,

OpenCL per Selective Gaussian Blend
24GB RAM 8Core

CPU 1.3m
RX200 60s
Movidius 10s
+Coral Offloads Int32 & processes; Processing INT8,
In that way the main CPU is the handler of most complex non inference tasks..
In many networks F32 & Int32 would be used to represent computation tasks & can be sieved optimally.

With 8MB of Essentially RAM Writeback Cache; Input-Output through the USB or M.2/PCIe,
Loading tasks through the IO buffer; Into & out of the work buffer; The average flow cache would be around 256KB..
The machine Learning ML itself is between 32KB & Around 7MB.

The Processor itself has multiple threads & IO/DMA Processes to directly inference or solve programming.

Ideally Compressed RAM, with Rsrt ADD+ & MUL* & PACK & Min Mean & Max,
We can perform flexible basic maths,

Flexible compression by copy & replication
Compression consisting of MUL expansions or fractions, MUL ADD & roll Example n+n*y = , n+((n+10)*y),
Compression formula consisting of algebra operations to unroll or roll & gradients Min=m to Max=y

Examples of formula expansion compression

replication n+((n+10)*y)
Formula expansion n+(y*z) = , (n+y)*z =
gradients Min=m to Max=y , A++B till (N*t)=C then Min A to Max C | Median = D | D++t
(the above for example: ((n+y)*z = )Rsrt = )
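
A minimal sketch of formula expansion compression, assuming the simplest case: A run of values that follows start + step * k is stored as the three numbers (start, step, count) & unrolled again when needed. Replication is the special case step = 0. The function names are hypothetical & 'Rsrt' from the text is not modelled.

```python
def compress_gradient(values):
    """Encode a list as (start, step, count) runs of constant difference."""
    runs = []
    i = 0
    n = len(values)
    while i < n:
        start = values[i]
        if i + 1 == n:
            runs.append((start, 0, 1))
            break
        step = values[i + 1] - values[i]
        count = 2
        # Extend the run while the same step keeps holding.
        while i + count < n and values[i + count] - values[i + count - 1] == step:
            count += 1
        runs.append((start, step, count))
        i += count
    return runs


def expand(runs):
    """Unroll the runs back into the full list (MUL ADD per element)."""
    out = []
    for start, step, count in runs:
        out.extend(start + step * k for k in range(count))
    return out


data = [10, 20, 30, 40, 7, 7, 7, 7]
runs = compress_gradient(data)
print(runs)
print(expand(runs) == data)
```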

8MB Work buffer
256KB IO Buffer (Fast Frame Buffer); The effective memory used by images or audio may reach 2MB.

The perfect Proposal RS

TOPS Conversion Table:


8000G 16TOPS NPU + SiMD 13TOPS total 39TOPS,
Standard FX 8TOPS to 13TOPS All SiMD used!
EdgeTPU*2 8TOPS + CPU SiMD &or NPU..

SiMD F32, F16, Int32, Int16, Combined 8Bit parallel ops..

NPU F16, Int16, Int8, Int4,
TPU U8 & Int8,

Perfection,

Conversion recommendation work for NPU & SiMD:

Int32/64 CPU + 8Bit Inference TPU
F32 conversion by removal of remainder Xor XMM & YMM XXX to Integer Inference; TPU 8Bit or NPU..
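
One way to read 'F32 conversion by removal of remainder' is scaling the float into the 8Bit range & dropping the fractional part. A minimal sketch under that assumption; The XMM & YMM register handling is not modelled & the names to_int8 & from_int8 are hypothetical.

```python
def to_int8(values, levels=127):
    """Scale floats in -1..1 to Int8 & drop the remainder (truncate toward zero)."""
    out = []
    for v in values:
        q = int(v * levels)                 # int() removes the remainder
        out.append(max(-128, min(127, q)))  # clamp to the signed 8Bit range
    return out


def from_int8(quantised, levels=127):
    """Approximate reconstruction for checking the error."""
    return [q / levels for q in quantised]


weights = [0.5, -0.26, 0.999, -1.0]
q = to_int8(weights)
restored = from_int8(q)
print(q)
print([round(abs(a - b), 5) for a, b in zip(weights, restored)])
```

Truncation always loses less than one step (1/127 here); Rounding to nearest would halve that at the cost of one more operation.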

Rupert S



Maybe
B Key / M Key M.2, M.2 NGFF B Key / M Key M.2 NGF, M 2 Specifications support : 2280/2260/2242/2230
https://www.amazon.fr/dp/B0CT3FXQM8/
https://www.amazon.co.uk/dp/B0CT3FXQM8/
https://www.amazon.de/dp/B0CBWWD144/

GLOTRENDS M.2 M Key to E Key WiFi Adapter for M.2 WiFi Module
+ https://www.amazon.fr/dp/B09ZS1FHCG
https://www.amazon.co.uk/dp/B09ZS1FHCG
https://www.amazon.de/dp/B09ZS1FHCG

};

USB
https://www.amazon.com/dp/B07S214S5Y
https://www.amazon.fr/dp/B07S214S5Y
https://www.amazon.co.uk/dp/B07S214S5Y

https://www.amazon.fr/s?k=Dual+Edge+M.2-2230+E-key+to+pci

https://en.wikipedia.org/wiki/M.2

To my knowledge M.2 E is basically PCIe but smaller, So adapter is fairly simple.

HP/Mac/Dell/Acer

The M.2 "E" key sockets are used for Wireless LAN/Bluetooth cards.
These sockets are common with laptop motherboards.
They are also found on some desktop motherboards (mITX, mATX, ATX).
Gigabyte offers mITX boards with this support.

https://www.amazon.fr/dp/B09ZDPP43X/
https://www.amazon.fr/s?k=wifi+M.2-2230+E+to+pcie

*

Analogue ML - Including Additive-Capacitor-'Battery' - Using the IBM analogue in-memory hardware acceleration kit for neural network training and inference - APL Machine Learning
https://pubs.aip.org/aip/aml/article/1/4/041102/2923573/Using-the-IBM-analog-in-memory-hardware

RAM ADDER differential Inference (c)RS :
RAM Table Accumulated addition node network Accumulator with Accumulation comparison Inference

IBM Analog Hardware Acceleration Kit https://github.com/IBM/aihwkit

Matrix Processors - Multi Node SpiNNaker2 A Large-Scale Neuromorphic System
https://arxiv.org/pdf/2401.04491.pdf

PhysX
Isaac Gym - Preview Release
https://developer.nvidia.com/isaac-gym

CALM: Conditional Adversarial Latent Models for Directable Virtual Characters
https://github.com/NVlabs/CALM

ML Strategic Workflow Training & Models - Machine Learning model guide Tensor to ONNX - Fraud Prevention & Statistics - Turning Data into Insight with IBM zOS16
https://www.redbooks.ibm.com/redpieces/pdfs/sg248552.pdf

Evolution-ML CNN Self Trained Auto Sparsity - Hybrid multi-objective evolutionary model compression with convolutional neural networks
https://www.sciencedirect.com/science/article/pii/S2590123024000045
https://blog.research.google/2023/12/advancements-in-machine-learning-for.html

ML Compressed Dynamic16bit-8Bit - Hardware-friendly compression and hardware acceleration for ML Transformer
https://aimspress.com/article/doi/10.3934/era.2022192

AA-DLADMM - GD Gradient Descent - An Accelerated ADMM-based Framework for Training Deep Neural Networks
https://arxiv.org/pdf/2401.03619.pdf

*

Personality UI : Have a friend


Alpaca Character Generation model
4Bit for speed, But not precise
https://huggingface.co/anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g
Trained 3 Epochs; Higher Precision https://huggingface.co/chavinlo/gpt4-x-alpaca

Base model https://huggingface.co/chavinlo/alpaca-13b
https://github.com/teknium1/GPTeacher

Python WebUI
https://github.com/oobabooga/text-generation-webui
Mac; Mostly MAC but fast
https://github.com/ggerganov/llama.cpp

how to use & personality sets https://discord.com/invite/aitrepreneur-1018992679893340160

On the subject of how deep a personality of 4Bit, 8Bit or 16Bit is; Reference:
https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
https://science.n-helix.com/2022/10/ml.html

*

Machine learning | Equate ~= Multi Layer Wavelet Abstraction

https://science.n-helix.com/2022/09/ovccans.html

https://science.n-helix.com/2023/02/smart-compression.html

https://science.n-helix.com/2021/10/he-aacsbc-overlapping-wave-domains.html

(documents) JIT & OpenCL & Codec : https://is.gd/DisplaySourceCode

Include vector today *important* RS https://vesa.org/vesa-display-compression-codecs/

https://science.n-helix.com/2022/08/jit-dongle.html

https://science.n-helix.com/2022/06/jit-compiler.html

https://science.n-helix.com/2022/04/vecsr.html

https://science.n-helix.com/2016/04/3d-desktop-virtualization.html

https://science.n-helix.com/2019/06/vulkan-stack.html

https://science.n-helix.com/2019/06/kernel.html

https://science.n-helix.com/2022/03/fsr-focal-length.html

https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html

https://science.n-helix.com/2022/08/simd.html

Eclectic & for the codecs of the world! OVCCANS (install and maintain as provided HPC Pack)

https://science.n-helix.com/2018/09/hpc-pack-install-guide.html

*

Transversal processing availability : Transparent Task Sharing Protocols


https://science.n-helix.com/2022/08/jit-dongle.html

https://science.n-helix.com/2022/06/jit-compiler.html

Machine Learning


https://science.n-helix.com/2022/10/ml.html

https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html

Innate Compression, Decompression


https://science.n-helix.com/2022/03/ice-ssrtp.html

https://science.n-helix.com/2022/09/ovccans.html

https://science.n-helix.com/2023/02/smart-compression.html

https://science.n-helix.com/2022/09/audio-presentation-play.html

https://science.n-helix.com/2021/10/he-aacsbc-overlapping-wave-domains.html

https://science.n-helix.com/2023/03/path-trace.html

*****
Best NPM site in the world https://npm.n-helix.com/bundles/

(Simple Install) Website Cache JS Updated 2021-11 (c)RS https://bit.ly/CacheJS
(Simple Install) Science & Research Node High Performance Computing
Linux & Android https://is.gd/LinuxHPCNode

Presenting JIT for hardware interoperability & function :
https://is.gd/DisplaySourceCode

https://is.gd/BTSource

(Simple Install) Website Server Cache JS Updated 2021-11 (c)RS
https://bit.ly/CacheJSm
(Simple Install) Website Server Cache JS Work Files Zip Updated 2021-11 (c)RS https://bit.ly/AppCacheJSZip
*****


*****

Direct ONNX Hardware Accelerated: F16
https://github.com/GPUOpen-LibrariesAndSDKs/RadeonML

Ideal for 4Bit Int4 XBox & Int8 GPU
PULP-NN: accelerating quantized neural networks on parallel ultra-low-power RISC-V processors - Bus-width 8-bit, 4-bit, 2-bit and 1-bit
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6939244/

ML Proof case SVM (Multi-Dimensional-Elliptic,98%) aDaBoost M1(Mac,91%) - COVID-19 Prediction Using Supervised Machine Learning - Irfan_Ali_MEng_2023
https://dspace.library.uvic.ca/bitstream/handle/1828/14676/Irfan_Ali_MEng_2023.pdf?sequence=1&isAllowed=y

Useful operation precision reductions
FPGA Implementation of Keyword Spotting System Using Depthwise Separable Binarized and Ternarized Neural Networks
https://www.mdpi.com/1424-8220/23/12/5701

Useful operation precision reductions; I observe that reducing precision to 1Bit & 2Bit..

While enhancing the definition of a Positive & Negative Dipole & thus enhancing speed..
Further reduces reasoning capacity; That is the trade made in order to reduce Processor bandwidth for reasoning..

In the example of the XBox & PS5; DOT4 & INT4, INT8 & F16 & bF16 considerably reduce the probable error related to a lack of remainder or float value depth!

By reason of probability I assume a value of 4Bit & 2Bit to allow the smallest packing ability; Existing alongside the word reasoned!

To reduce to 1 & 0; I assume a definite statement that a Value Integer Solve in the form of a vector..
Is most probably the solution & that furthermore that in most cases; Projected in pure maths & code ASM,
Both SiMD; Float & Integer...

Reduction to multiple 2 Bit values in short Integer instructions; I will state however that no such value is further away than a statistics table or PHP Data-Set.
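
Packing multiple 2Bit values into short integers can be sketched as four values per byte. This is a minimal sketch; The slot order (lowest bits first) & the function names are assumptions for illustration.

```python
def pack_2bit(values):
    """Pack 2Bit values (0..3) four to a byte, lowest slot first."""
    packed = bytearray()
    for i in range(0, len(values), 4):
        byte = 0
        for slot, value in enumerate(values[i:i + 4]):
            if not 0 <= value <= 3:
                raise ValueError("value does not fit in 2Bit")
            byte |= value << (2 * slot)
        packed.append(byte)
    return bytes(packed)


def unpack_2bit(packed, count):
    """Recover 'count' 2Bit values from the packed bytes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(count)]


values = [0, 1, 2, 3, 3, 2, 1, 0]
packed = pack_2bit(values)
print(packed.hex(), unpack_2bit(packed, len(values)))
```

Eight values fit in two bytes, a quarter of the size of one byte per value.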

Rupert S 2023-06

*****

Gaussian
https://gmd.copernicus.org/articles/16/1697/2023/
https://gmd.copernicus.org/articles/16/1697/2023/gmd-16-1697-2023.pdf

SiMD Gaussian Blending & Dithering - Better_Fixed_Point_Filtering_with_Averaging_Trees
https://andrew.adams.pub/Better_Fixed_Point_Filtering_with_Averaging_Trees.pdf

Vectorization of Kernel and Image Subsampling in FIR Image Filtering
http://bncss.org/index.php/bncss/article/viewFile/101/105

Implementation of a High-Quality Dolby Digital Decoder Using SiMD MMX™ Technology
https://smtnet.com/library/files/upload/dolby-intel.pdf

*****

Common techniques used in ML Learning are edge detection, accent recognition, language processing, and code optimization.

Basic ML Feature list; Also for learning

Edge detection is a process of identifying the boundaries of objects in images or videos.

Accent recognition is a process of identifying the regional or social variation of speech.

Language processing is a process of analyzing and generating natural language texts.

Code optimization is a process of improving the performance or quality of code.

https://www.ibm.com/topics/machine-learning
https://en.wikipedia.org/wiki/Edge_detection
https://en.wikipedia.org/wiki/Accent_recognition
https://en.wikipedia.org/wiki/Natural_language_processing
https://en.wikipedia.org/wiki/Code_optimization
https://en.wikipedia.org/wiki/Supervised_learning
https://en.wikipedia.org/wiki/Unsupervised_learning
https://en.wikipedia.org/wiki/Reinforcement_learning
https://www.ibm.com/cloud/learn/machine-learning-ethics

*****

Dynamic ML IRS-RIS 4G,5G Wave Shaping Edge detection with reflection angle calculation - strong wave localising edge (shaping)sharpening(c)RS


By quantifying how waves bounce from reflective surfaces it is possible to shape waves that bounce in a different direction from a mechanical reshaping surface called an RIS..

Reconfigurable intelligent surfaces & Intelligent reflecting surfaces bounce radio waves for wireless networks..
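
The reflection angle calculation can be sketched with the standard mirror reflection formula r = d - 2(d.n)n, where d is the incoming direction & n the surface normal. This is a minimal geometric sketch only; It treats the surface as an ideal flat mirror & does not model RIS phase shifting. The function names are hypothetical.

```python
import math


def reflect(direction, normal):
    """Mirror reflection: r = d - 2 (d . n) n, with n normalised first."""
    length = math.hypot(*normal)
    unit = tuple(c / length for c in normal)
    dot = sum(d * n for d, n in zip(direction, unit))
    return tuple(d - 2 * dot * n for d, n in zip(direction, unit))


def incidence_angle_deg(direction, normal):
    """Angle between the incoming ray (reversed) & the surface normal."""
    dot = sum(d * n for d, n in zip(direction, normal))
    cos_theta = -dot / (math.hypot(*direction) * math.hypot(*normal))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


incoming = (1.0, -1.0, 0.0)   # travelling right & down
surface_normal = (0.0, 1.0, 0.0)  # flat surface facing up
print(reflect(incoming, surface_normal))
print(incidence_angle_deg(incoming, surface_normal))
```

Tilting the normal is the mechanical way to steer the bounce; An ML model could then choose the tilt that sends the strongest wave toward the receiver.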

Presenting the example:

Coral TPU Micro Edge Learning with performance arranged Intel FPGA Arria 10 SX SoC Kit & Google Coral, NVIDIA Jetson Nano and CPU ROCK 4C Plus
https://doi.org/10.3390/s24030899

ML_With_USB_Stress-Testing_USB_Accelerators_for_Efficient_Edge
https://www.researchgate.net/publication/377174200_Stress-Testing_USB_Accelerators_for_Efficient_Edge_Inference
https://github.com/raphischer/edge-acc
