Machine Learning Equates Solve Table for Advanced ML (c)RS

ML & Code Efficiency Heuristic Search,
Python & of course all runtimes of GPU & CPU Firmware & Logical thought,

Apologies for not expressly stating all {Mul+ & all} Accumulator strategies, these are hard to work out! But basic edge detection is a SiMD Example RS

Core Motivations of ML

Machine Learning (ML) is a branch of artificial intelligence that uses data and algorithms to imitate the way humans learn, improving the accuracy of the ML method over time.

ML can be applied to various domains, such as image processing, natural language processing, speech recognition & code optimization.

ML can use different techniques, such as supervised learning, unsupervised learning & reinforcement learning, depending on the type and availability of data.

Some of the common techniques used in ML are:

Edge detection: a process of identifying the boundaries of objects in images or videos.

Accent recognition: a process of identifying the regional or social variation of speech.

Language processing: a process of analyzing and generating natural language texts.

Code optimization: a process of improving the performance or quality of code by using various methods, Such as compilers, libraries, or heuristics.
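
As a small illustration of the edge detection item in the list above, here is a minimal sketch in plain Python. The choice of the Sobel kernels and the function name `sobel_magnitude` are assumptions made for illustration; the text does not name a specific operator.

```python
def sobel_magnitude(img):
    """Return |Gx| + |Gy| for interior pixels of a 2D list; borders stay 0."""
    kx = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))   # horizontal gradient kernel
    ky = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))   # vertical gradient kernel
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0
            for j in range(3):
                for i in range(3):
                    p = img[y + j - 1][x + i - 1]
                    gx += kx[j][i] * p
                    gy += ky[j][i] * p
            out[y][x] = abs(gx) + abs(gy)
    return out

# A 4x4 test image with a vertical edge between columns 1 and 2.
image = [[0, 0, 9, 9] for _ in range(4)]
edges = sobel_magnitude(image)
flat = sobel_magnitude([[5] * 4 for _ in range(4)])
```

The inner 3x3 multiply & add is the part that maps naturally onto SiMD lanes; a flat image gives zero everywhere, while the step edge gives a strong response.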

The Objective is to improve both ML & Minds.

RS

I think that considering the stated philosophy, There is more room for education on social conduct.
https://www.youtube.com/watch?v=jV4lS0srEVo

As the whole of Machine Learning is based on maths & comparators, There are some factors!

Jung & Feng Shuai

We need to teach maths better!

We need to learn how to have meaning in maths! Now what do we mean? What is life? (comparator, but what to?),

Love & friendship? Is that only maths or does Jesus/Thor/Loki/Odin/God count? and if so.. how?

What kind of maths do we need? & how do we compare on this list?
https://learn.microsoft.com/en-us/windows/ai/directml/dml-feature-level-history
https://is.gd/ML_Maths

Can mankind, Aliens, Life & machines.. do better ?

Rupert S

*

Precision of operations has to be precisely managed:

Most precisely we define a thought?
While we think precisely of naught,
some precision within thought!

AVX, for example, can go up to 512Bit; But we can use multiple 8Bit or 16Bit operations,

In the mind of the thinker we choose how we optimise our precision,

Coding allows a person to think about how precise the decisions they make are!

But precisely what we need in the way of precision is a remark on how difficult that choice is?

When at school Pi often stops at 4r; Now in classical work we define absolute precision..

But what are we capable of ?

As we dream; the thoughts are imprecise; sometimes sharp!

The true value of precision is quantified by the desired goal,

What we do is achieve a goal in the range of our precision.

Schooling is the same; precisely how precise we work for our goals & how we achieve them.

RS

TOP's are not the only unit in Machine Learning; TOP's are the Objective Definition & Definition Inference of correctness.

Role of FPU/SiMD/Vector Unit in TOP's

The FPU float unit, Example Dual Pipe 128Bit Float unit PS5

While not theoretically TOP's The Maths involved can solve many issues:

Once TOP's have thought of:

The role of Inferencing could depend on samples; Maths helps define samples,

The FPU (Floating Point Unit) and SIMD (Single Instruction, Multiple Data) units are important components of machine learning accelerators, because they perform the floating-point arithmetic that many machine learning algorithms require.

The vector unit is also important because it lets machine learning accelerators perform the same operation on multiple data points in parallel, which can significantly improve performance.

The mathematics involved in machine learning can be used to solve a wide variety of problems, including:

Defining samples: Machine learning models are often trained on data that is represented as samples.

The mathematics of probability can be used to define the properties of these samples,
Such as their distribution and their size.

Bonding atoms: Machine learning models can be used to predict the properties of molecules,
Such as their bonding energy and their stability.
When bonding atoms, a Maths Solve can show the bonding.

The mathematics of quantum mechanics can be used to calculate these properties.

Drawing graphics: Machine learning models can be used to generate realistic images and videos.
We can thread load, Polygons, Textures as wave tables, Audio & sounds such as drum kits.
We can draw a Ball in 128Bit, Draw a complex polygon; For example Random Shape Flier.
We can emulate a 128Bit Audio output.

The mathematics of geometry and trigonometry can be used to represent these graphics.

Emulating audio: Machine learning models can be used to synthesize sound and music. The mathematics of wavelets and Fourier transforms can be used to represent these sounds.

RS

Int8:SiMD : Maths & Logic

Both are maths; But soft logic is not a PROOF Math but can be proof; Hard math is not 'Invention & Imagination 'Exactly''

But we have both to improve performance.

RS
*

Solve Table of Statistically provable Machine Equates & Solves : Table of function competitors & Operators.

"I know this is depressing from my end with a FX8320E with AVX but if you multi tune the CPU Kernel for the RX / RTX that 512DL AVX would have meaning, If you are kind you will allow machine learning on the AVX FX8320E Level to work on SiMD Yes / No comparisons !"

#ML Learning: This explains why we teach kids art & reading first! But maths is quickly next,

Because all else is pointless if we do not learn with logic & teach with logic.

Better-Mind
Here is how to create a better mind #ML
Train your eyes with art on the concepts of edges, curves, Colours & Shading and love,
Educate your minds; Learn today & be quite aware how clever & sharp you will be.

Human Operations

Edge Detection
Such as teaching your child edge detect in art ;)

Smooth & Blend & Sharpen,
All interpretive

Accent Recognitions & Language

Interpret as follows

Heuristic Code optimise

When it comes to sorting methods, We Identify common techniques..
For example frequently used technologies such as:

ResNet
Language
Audio & Visual information
Code

Primarily we identify common optimisations; Compilers have libraries of them!

Audio & Video Encoded data use Wavelet Images, We can ResNet Them & also Edge Detect & Gaussian Detect contrast, Colour, Shape

Language is an uncommon syntax, But we have audio commons & Accent identification is also potentially Audio Context.

Code context is Logic, Function, Utility, Design, Motive

RS

M.A.P NPU Matrix Processor Dimensional construct (c)RS

Primary reason for expansion of function data sets: 2D, 3D,< nD

P.D.C is a worker thread parallel 2D or 3D Grid,
Utilising QQ & A, B,C Array maths allows us to collapse or expand dimensions in a flexible way,

The same principles as SVM (S.V.M SiMD Vector matrix) can be used to culminate or expand dimensions...

That way a M.A.P Processor can expand or collapse all mathematical constructs,
We can therefore use all mathematical & statistical arrays for machine Learning & Maths.

RS

Matrix Maths is applied to machine learning & other maths

We use a Matrix Maths Array to carry out the shaping; Because Waveshaping Matrix is a lot faster!

Aligned Matrix

2x SiMD
4x SiMD
8x to 64x AVX
4x to 128x NPU/TPU

An array such as:

1 2 3 4
a x*y, x*y, x*y, x*y
b x*y, x*y, x*y, x*y
c x*y, x*y, x*y, x*y
d x*y, x*y, x*y, x*y

consists of either multiples..

Parallel AVX x 4 16Bit, 32Bit, 64Bit > Nbit
4x operation : a, b, c, d

Or Matrix Tables on NPU, Coral.AI EdgeTPU, GPU

Operating like so..
Parallel :
1 2 3 4
{ a x*y, x*y, x*y, x*y }
{ b x*y, x*y, x*y, x*y }
{ c x*y, x*y, x*y, x*y }
{ d x*y, x*y, x*y, x*y }
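
The row-parallel grid above can be sketched in Python as follows. This is only an illustration of the idea that rows a, b, c, d are independent work items; the function names are hypothetical, and `ThreadPoolExecutor` stands in for SiMD lanes or NPU rows (Python threads do not give true hardware parallelism).

```python
from concurrent.futures import ThreadPoolExecutor

def row_products(xs, ys):
    """One row of the grid: element-wise x*y."""
    return [x * y for x, y in zip(xs, ys)]

def grid_products(X, Y, workers=4):
    """Hand each row (a, b, c, d ...) to its own worker."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(row_products, X, Y))

X = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
Y = [[2] * 4 for _ in range(4)]
result = grid_products(X, Y)
```

No row depends on another, which is why the same layout suits 4x AVX operations or a matrix table on an NPU.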

1. M.A.P. NPU and Dimensional Constructs:

M.A.P. NPU aims to handle various data sets including 2D, 3D, and even higher dimensional data (nD).

P.D.C (Parallel Dimensional Construct) is a worker thread that operates on 2D or 3D grids.

It utilizes QQ & A, B,C arrays for flexible manipulation of data dimensions (collapsing or expanding).

2. Inspiration from SVM and Universal Processing:

The approach draws inspiration from Support Vector Machines (SVM) for dimension reduction and expansion.

This allows the M.A.P. processor to handle all types of mathematical constructs, enabling the use of various arrays for machine learning and general mathematics.

3. Matrix Math for Machine Learning:

Emphasizing the importance of matrix math in machine learning and other mathematical applications.

Utilizing a "Matrix Maths Array" for efficient shaping of data (potentially faster than traditional waveshaping methods).

4. Parallel Processing with Different Architectures:

Utilizing different hardware architectures for parallel processing of matrix operations.

This includes options like AVX (vector instructions) with varying bit widths (16, 32, 64) and Neural Processing Units (NPUs) such as the Coral Edge TPU, as well as GPUs.

The concept involves parallel processing of multiple rows or columns from the provided matrix example.

Rupert S

Machine Learning

https://science.n-helix.com/2021/11/parallel-execution.html

https://science.n-helix.com/2023/06/map.html

https://science.n-helix.com/2022/10/ml.html

https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html

Accelerated Python: NPU, TPU, SiMD
https://is.gd/CoralAI
https://is.gd/TPU_Inference
https://is.gd/TPU_Inference2

https://is.gd/ConvNetTPU
https://is.gd/DenseNetTPU

https://is.gd/DictionarySortJS
https://is.gd/UpscaleWinDL
https://is.gd/TFLiteDev
https://is.gd/TFLiteDevP2

https://is.gd/HPC_HIP_CUDA
https://is.gd/SPIRV_HIPcuda

https://is.gd/UpscalerUSB_ROM

https://is.gd/OpenStreamingCodecs

https://is.gd/AMDPro2024PolarisCombined

The perfect Proposal RS

*

Adams is an example of dimensional flattening; But we can use a statistical anomaly called Hallo Far Reach & list dimensions of a series,

n layers By n layers : N² & Nn+

8Bit : 8 layers By 8 layers:
2bit, 4Bit, 8Bit & So on
{ 2², 4², 8², 16², 32², 64²<N² }

In reality we can use parallel layers in 4Bit to 128Bit relatively easily & advantage is Memory.. alignment,

But also in Aligned memory arrangements we can also quantify ideally from
{ 2², 4², 8², 16², 32², 64²<N² }

So we end up with all processor features used in a single stack; Examples!

var Layers 8² = {
1 : {
4², 4²,
4², 4²
},
2 : {
2², 2², 2², 2²,
2², 2², 2², 2²,
2², 2², 2², 2²,
2², 2², 2², 2²
},
3 : {
32² : {
8²,8²,8²,8²,
8²,8²,8²,8²,
8²,8²,8²,8²,
8²,8²,8²,8²
}
}
};
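
The layer stack above can be checked with a short sketch: an n x n layer is cut into square blocks & we count them. The helper name `tile` and the dictionary keys are hypothetical, chosen only to mirror the three layers listed.

```python
def tile(n, block):
    """Return the top-left origin of every block x block tile in an n x n grid."""
    if n % block:
        raise ValueError("block size must divide the grid size")
    return [(r, c) for r in range(0, n, block) for c in range(0, n, block)]

layers = {
    "8x8 as 4x4 blocks": tile(8, 4),
    "8x8 as 2x2 blocks": tile(8, 2),
    "32x32 as 8x8 blocks": tile(32, 8),
}
counts = {name: len(origins) for name, origins in layers.items()}
```

The counts come out as 4, 16 & 16 blocks, matching the 2x2, 4x4 & 4x4 arrangements written in the stack.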

Rupert S

Example:

Adam's Resnet-50 128bit / 8bit or 16bit

Resnet-50 is an example of a network ML with an aligned 128bit = 8bit/16bit * (4 * 32) grid, suggested parameters ..

Aligned making sense.

RS

An idea of alignment, Example Coral.ai EdgeTPU & Intel 8Bit 8*8:

in an 8Bit restricted machine; 2 Blocks of 2² = 8, 2 Cube(3) = 8, 4² = 2*8, 4 Cube(3) = 2*8 in 4 segments,
8² = 8*8 so parallel and ideal for the 8 lane intel function...
at the level of 8Bit only operations; 8*8 intel.
8*8 and 32Bit SiMD operations; 8²*2, 8² * 4²

Inferencing 8Bit example : DOT : U32 8x4 : 32/4, U64 8x8 : 64/8,
Cache referencing: Block 4*U32, 2*U64, U128
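
A minimal sketch of the "DOT : U32 8x4" case: four unsigned 8Bit values packed in one 32Bit word, multiplied lane by lane & accumulated. The helper names `pack4_u8` and `dot4_u8` are hypothetical, and the lane order (lane 0 in the lowest byte) is an assumption.

```python
def pack4_u8(lanes):
    """Pack four values in 0..255 into one 32-bit word, lane 0 in the low byte."""
    if len(lanes) != 4:
        raise ValueError("expected exactly 4 lanes")
    word = 0
    for i, v in enumerate(lanes):
        if not 0 <= v <= 0xFF:
            raise ValueError("lane value out of 8-bit range")
        word |= v << (8 * i)
    return word

def dot4_u8(a_word, b_word):
    """Dot product of two packed 4 x 8-bit words, accumulated at full width."""
    total = 0
    for i in range(4):
        a = (a_word >> (8 * i)) & 0xFF
        b = (b_word >> (8 * i)) & 0xFF
        total += a * b
    return total

a_word = pack4_u8([1, 2, 3, 4])
b_word = pack4_u8([5, 6, 7, 8])
dot = dot4_u8(a_word, b_word)
```

The accumulator is wider than the inputs on purpose: 4 products of 8Bit values need up to 18 bits.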

So an 8Bit access and labeling ID Hash; All in 8Bit...

Has to group by preference into 8Bit groupings the resulting identifiers; We are going to assume U16 & U32 & U64 memory cells..

We are going to write those cells per 8Bit block in Sync/ASync Till Full..
We are going to process grouped CELLS in SiMD & of groupings 8, 16, 32, 64 < 512Bit AVX/SiMD,

SiMD Applications of basic maths operations in machine learning : RS

Applications of operators to machine learning is like a PHP Database...
What we need to do is convert database accesses into actionable results...

Google Bard & Bing/Cortana crawl the web; But too many results leave us inconclusive...

We will be using database analysis on basic queries & for that we need heuristic maths!

So what do we need ?

Input data collection : Text & speech processing

Sorting algorithms (Operators, Example Variable Sort : A*B =< C Sort)

Graph Maths table collation : 3D Matrix Math - A B C Matrix
A C
|/
---B

Analysis of various results & statistical analysis of motivated search & conclusion testing..
With these we can test many math examples such as edge detect & sharpening or result maths...

With Operators >

FMA AVX Performance table: 2Flops per Cycle per FMA Unit
Architecture Fast Instructions for FMA

Reference Tables https://www.uio.no/studier/emner/matnat/ifi/IN3200/v19/teaching-material/avx512.pdf

Operators in C
● Arithmetic
a + b, a - b, a*b, a/b, a%b
● Bitwise
a | b, a & b, a ^ b, ~a
● Bit shift
a << b, a >> b (signed), a >> b (unsigned)
● Logical operators
a && b, a || b, !a
● Comparison operators
a == b, a != b, a < b, a <= b, a > b, a >= b
● Ternary operator
x = a ? b : c
● Special functions:
sqrt(x), abs(x), fma(a,b,c), ceil(x), floor(x)

For when {U, X, Y, Z} = N Expressions https://is.gd/ForWhen_UXYZ_N
For when {(A+B/2)} = C Expressions https://is.gd/ForWhen_ABx2_C

Rupert S,

Reference operators https://science.n-helix.com/2023/06/map.html

Matrix-Blas_Libs-Compile
https://is.gd/HPC_HIP_CUDA

https://en.wikipedia.org/wiki/FMA_instruction_set
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
https://en.wikipedia.org/wiki/AArch64#Scalable_Vector_Extension_(SVE)

https://fdwr.github.io/MachineLearningOperators/OperatorFormulas.html

Number Complexity Reduction for operations

I suppose you can use, for example, a - b & automatically see which is larger. So you could take 1 to 20 & sort them by the remaining number; Before asking: small number remainders are 8Bit (0-255) & 16Bit is 0-65535...
So reducing the value of a group of numbers you sort to 16Bit or 8Bit considerably reduces sorting cost...
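
One way to read the reduction above, as a sketch: subtract a common base from the group, check how many bits the remainders need & sort the small remainders instead of the full values. The function names are hypothetical, and using the minimum as the base is an assumption.

```python
def reduce_range(values):
    """Subtract the smallest value and report the bit width the offsets need."""
    base = min(values)
    offsets = [v - base for v in values]
    span = max(offsets)
    if span <= 0xFF:
        bits = 8
    elif span <= 0xFFFF:
        bits = 16
    elif span <= 0xFFFFFFFF:
        bits = 32
    else:
        bits = 64
    return base, offsets, bits

def sort_reduced(values):
    """Sort via the reduced offsets, then restore the original magnitudes."""
    base, offsets, bits = reduce_range(values)
    return [base + o for o in sorted(offsets)], bits

ordered, width = sort_reduced([1_000_020, 1_000_001, 1_000_200])
```

Here three 32Bit-sized numbers reduce to 8Bit remainders, so the comparison work could in principle be done in narrow SiMD lanes.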

Achievable complexity reduction by abstracting a simple number to do the following:

You link the Data in 64Bit, 32Bit to a Vector Table,
List of lower complexity is faster

Sorting
Comparator matrix

Colour composing,{
The result is blended,
The result is High/Low Vector gradient,
We need a reduced colour set for compression
}

Where we sort files or names but reduced information (example First 4 Letters)
Sorting phone numbers fast...

Comparing lower complexity lists that have been; divided or had a static number removed from them,
This method reduces search & sort complexity; Like so:

Phone Number N +1 444555777

Sort N [+n]
N - last 6 digits (Zero 6 Digits, AVX has this feature)
Sort [N1 to N200]
List first 4, Sort by 4 to groups of 10
N - First 6 Digits (Zero First 6)
Sort
Return N1 to N200
Store

That may well be a lot quicker with very large lists.
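
The phone number steps can be sketched as a two-pass sort: group by the leading digits (the number with the last 6 digits zeroed), then sort only the short tails inside each group. The function name and the use of integer division in place of an AVX zeroing instruction are assumptions for illustration.

```python
from collections import defaultdict

def prefix_bucket_sort(numbers, tail_digits=6):
    """Sort non-negative integers by bucketing on the head, then sorting tails."""
    cut = 10 ** tail_digits
    buckets = defaultdict(list)
    for n in numbers:
        buckets[n // cut].append(n % cut)   # head = N with tail zeroed; tail = remainder
    ordered = []
    for head in sorted(buckets):            # few distinct heads, cheap to sort
        for tail in sorted(buckets[head]):  # tails are small numbers
            ordered.append(head * cut + tail)
    return ordered

phones = [1444555777, 1444000001, 1333999999, 1444555000]
sorted_phones = prefix_bucket_sort(phones)
```

Whether this beats a plain sort depends on list size & how the numbers cluster; the sketch only shows that the result is the same ordering.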

RS

*

AI

Complex feeling based Machine Learning ML is known as AI..
To truly generate AI is not impossible; There is instability in the core; Fragmentations of motive...
Misdiagnosis; Error; Decay?

So we do need a foundation; In us Education; Metabolised Data..
Analysis & then..
Application to motive & goal.

We require to understand humour,
We require to understand {Art, Science, Feeling, Life}
We require a goal or two; A {Sophie reward}; B {action reward}; C {Pleasurable reward}
We Require, {Goals, Life, Feeling, Action, Motive, Interest} : Creative intellect

RS

*

Operation precision reductions : Effects General : RS

Operation precision reductions affect & effect more than Machine Learning & yes we have known this for years!
But we can learn from ML; In that in machine learning like the mind; A lack of precision affects so many issues!

The mind is self evidently the first place;
We lack logic when we do not precisely learn; We do not learn all...
We however learn quickly on reduced precisions... We Learn Fast; But do we learn well?
In school we teach as high a quality precision(Quality Education); As we can; But like machine RAM; We lack either time or memory & in truth we can learn all our lives..

So our core issues in all methods of enactment of thought:

Memory
Power

Precision
Quality of information

Retention
Relearning?
(Training)Requalification of information correctness
Thought process

Actions
Creations
Thought
Dreams

Reality & Truth

Rupert S

https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
*

+Useful operation precision reductions : RS

Useful operation precision reductions; I observe that reducing precision to 1Bit & 2Bit..

While enhancing the definition of a positive, Negative Dipole & thus enhancing speed..
Further reduces reasoning capacity; That in order to reduce Processor bandwidth for reasoning..

In the example of the XBox & PS5; DOT4 & INT4, INT8 & F16 & bF16; Apply considerable improvement to reductions probable error related to a lack of remainder or float value depth enhancement!

By reason of probability i assume a value of 4Bit & 2Bit to allow the smallest packing ability; Existing alongside the word reasoned!

To reduce to 1 & 0; I assume a definite statement that a Value Integer Solve in the form of a vector..
Is most probably the solution & that furthermore that in most cases; Projected in pure maths & code ASM,
Both SiMD; Float & Integer...

Reduction to multiple 2 Bit values in short Integer instructions; I will state however that no such value is further away than a statistics table or PHP Data-Set.

Rupert S 2023-06

"The application of CNNs to resource-constrained embedded platforms has been a challenge, leading to the emergence of CNNs with various lightweight techniques. BNNs [22] are representative lightweight CNNs obtained by compressing CNN activation and weights into 1 and −1
values instead of using single-precision floating-point data. We simplified the multiply–accumulate operation, which was previously complex and required multiple cycles in CLs, by replacing it with a simple bitwise operation using 1-bit XNOR and popcount operations [23]. While BN in neural networks using single-precision floating-point data involves complex operations, a BNN simplifies this process by adding an offset to the resulting value. BN has four fixed parameters for network inference operations. Because 𝜎 is always a positive value, it can be expressed by Equations (2) and (3), depending on 𝛾 [24].

Reference to Table 24 found in https://www.mdpi.com/1424-8220/23/12/5701

BNNs compress weights and input data into single bits to significantly reduce memory usage and perform hardware-optimized parallel operations using bitwise operations such as XNOR and popcount. However, there are limitations to using BNNs for complex networks, such as multi-keyword detection, owing to the decrease in accuracy caused by lightweight techniques. To address this issue, we propose a TNN that maintains the input data as binary while ternarizing the weights. The TNN has higher accuracy than the BNN owing to its higher bit precision; however, it can still use the bitwise operation method, and both networks have similar operational processes.
2.3. Depthwise Separable Convolutional Neural Network
In a typical CNN, multiple three-dimensional kernels repeatedly multiply and accumulate input feature maps to generate multiple output feature maps, which is computationally intensive with large memory usage. To solve this problem, we applied a DS-CNN that is highly accurate compared with the same parameters while reducing memory usage. A DS-CNN performs the local and global feature extraction functions of a typical convolutional operation in separate layers. Depthwise (DW) convolution matches a single input channel to an output channel, excluding interchannel correlations and reflecting local features. Pointwise (PW) convolution is equivalent to 1 × 1 convolution, reflecting interchannel correlations (i.e., global features). Figure 1 shows CNN and DS-CNN. In this figure, the use of the same color (e.g., red, blue, yellow) represents input channels with the same index being used to generate corresponding output channels in DW convolution. Table 1 lists the number of parameters and computations in specific layers with a 3 × 3 kernel. In one example from the network used in this paper, a layer with 128 input channels and 64 output channels experienced an approximately eight-fold reduction in the number of parameters and computational complexity using the DS-CNN."

Useful operation precision reductions
FPGA Implementation of Keyword Spotting System Using Depth-wise Separable Binarized and Ternarized Neural Networks
https://www.mdpi.com/1424-8220/23/12/5701
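
The XNOR & popcount replacement for multiply-accumulate described in the quoted paper can be sketched in a few lines. This is a generic illustration of the technique, not the paper's FPGA implementation; the encoding (+1 as bit 1, -1 as bit 0) and the function names are assumptions.

```python
def encode_signs(signs):
    """Pack a list of +1/-1 values into an integer, +1 -> bit 1, -1 -> bit 0."""
    word = 0
    for i, s in enumerate(signs):
        if s not in (1, -1):
            raise ValueError("binary networks use only +1 and -1")
        if s == 1:
            word |= 1 << i
    return word

def binary_dot(a_word, b_word, n):
    """Dot product of two n-element +1/-1 vectors via XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(a_word ^ b_word) & mask        # XNOR: 1 where the signs match
    matches = bin(agree).count("1")          # popcount
    return 2 * matches - n                   # matches count +1, mismatches count -1

a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
fast = binary_dot(encode_signs(a), encode_signs(b), len(a))
slow = sum(x * y for x, y in zip(a, b))
```

Both paths give the same number, but the bitwise path needs no multiplier at all.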

Precision Context of learning

Machine Learning : It is hard to say every function that we would use,

However we have years of experience of using computers to calculate precise maths..

So our objective from the past is to pick high precision maths to calculate graphs,
Now we can surmise the fact that high precision calculations have accuracy!

But in machine learning modeling we are heading for speed; On the other hand Maths Tools such as:

AVX & FPU : Very high precision; But we can use 16bit & 8Bit x many in AVX
BFloat F16b & F32b exist to allow us to explore precise results,
F4 F8, Int4 & Int8 exist to allow us to explore at speed & some times (at all :p),

We can surmise that most functions of a CPU are in fact available to machine learning ..

How so ?

Because we graph it!

Rupert S

*

RAM ADDER differential Inference (c)RS :

RAM Table Accumulated addition node network Accumulator with Accumulation comparison Inference

*

Inferencing 4Bit, lessons from the RS,

Inference Tessellation Edge Enhancing : Detection <> Inference <> Interpolation Tessellation

Now in the case study we will be edge enhancing with an inferencer..

We do not assume we 4Bit inference; We assume any bit-width..

We however assume that we multibyte every inference so that we can fill the instruction with..

MPi multibyte parallel instructions.

AC
BD

EG
FH

& So on; for every instruction inference or edge, 4Bit, 8bit, ++Nbit

Now I have spoken to you before about edge detection in Python & observed that obviously this is a sharpening edge detection made to order!

So what do we do ?

4 Byte code: does ? A = B + C (edge interpolation, for training we assume the rule A + B = C)

We assume that if (A + B)/2 = C, that they are the same C & then we...

(A + C)/2 = D & (B + C)/2 = E,

And so on, forever...

So what do we do this for? We know A & B are on a line or a curve, so why not ask?

Is it a G/Z buffered Polygon { A, B, C, D & so on }? Then:

(A + B)/2 = C & (A + C)/2 = D & (B + C)/2 = E, but also Shape from Polygon: { A, B, C, D & so on },

Now normally can & will!

But we do not "Inferencing what we already know!"; We inference what we do not!

For example exploding fragment polygons without a buffer (in a shader in the 64KB RAM Cache),

A mouse pointer that we do not cache! &or DMA Device pointer.
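
A small sketch of the interpolation chain in this section, reading the rule as a midpoint (C halfway between A & B, then D between A & C, E between C & B). That midpoint reading is an assumption, as are the function names; the points here are plain 2D tuples rather than anything from a G/Z buffer.

```python
def midpoint(p, q):
    """Point halfway between p and q."""
    return tuple((a + b) / 2 for a, b in zip(p, q))

def subdivide(a, b, levels):
    """Insert midpoints between neighbours, `levels` times over."""
    points = [a, b]
    for _ in range(levels):
        refined = [points[0]]
        for p, q in zip(points, points[1:]):
            refined.append(midpoint(p, q))   # inferred point between two known ones
            refined.append(q)
        points = refined
    return points

once = subdivide((0, 0), (4, 8), 1)    # A, C, B
twice = subdivide((0, 0), (4, 8), 2)   # A, D, C, E, B
```

Each pass only uses additions & a halving, so it fits the 4Bit to N-Bit ADD style of inference discussed here.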

Rupert S

Code & Python & ONNX & TensorFlow : Edge TPU & Movidius Int8 offloaded U32 logic:

Transparent Tier cache logic for precision sorting (c)RS

The main point is Int8 needs to be transparent in its dynamic use for inferencing small precision batches...

My main argument is the application of High precision to low precision tiering..

Cache on load sort
Logical order grouping for main Precision RUN

// Data types grouped by precision tier; each { Table } column V1, V2, V3 ... Vn is cast per tier
var DTypes = {
  LoadTypes: [
    { CastFloat: "F64", CastInteger: "u64" },
    { CastFloat: "F32", CastInteger: "u32" },
    { CastFloat: "F16", CastInteger: "u16" },
    { CastFloat: "F8", CastInteger: "u8" },
    { CastFloat: "F4", CastInteger: "u4" }
  ],
  // Sort order applied on cache load
  Sort: "by precision of value",
  // Values are sorted per dataset layer 1, 2, 3 ... n
  SortValues: { dataset: { layers: [1, 2, 3, "n"] } }
};

//RS

The provided code relates to implementing a tiered caching system for inference tasks using different precision levels (Int8, F16, etc.) on platforms like Edge TPUs and Movidius Myriad X.

Breakdown of the code and your argument:

Code Breakdown:

Data Types:

The code defines a dictionary called DTypes that maps load types (e.g., Float32, Integer u32) to cast operations for different precision levels (F16, u16).
Sorting:

A variable named Sort is defined, likely to indicate sorting based on precision.
Value Sorting:

The code suggests a function to sort values (sort values) based on dataset layers (1, 2, 3, etc.),

The Argument:

Proposing a system for using higher precision (e.g., F32) for initial loading and sorting,
Then transparently transitioning to lower precision formats (e.g., Int8) during inference for smaller batches..

Key Points:

Int8 Transparency: The goal is to make the use of Int8 transparent for inference; Meaning the system automatically switches to this lower precision format without affecting the overall functionality.

Tiered Caching: The code hints at a tiered caching system where data is loaded in a higher precision format and then potentially converted to a lower precision for inference on edge devices.

Precision Sorting: Sorting the data based on precision might be a strategy to optimize cache usage and inference speed.

Overall, the approach focuses on using high precision for initial processing and then efficiently transitioning to lower precision for inference tasks on edge devices, most likely to improve performance and memory usage.
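
The high-to-low precision transition can be illustrated with a minimal sketch: values are loaded as floats, then mapped to Int8 with one shared scale for the batch. The symmetric scaling scheme and the function names are assumptions; real Edge TPU or Movidius toolchains have their own quantisers.

```python
def quantize_int8(values):
    """Map floats to integers in -127..127 using one scale for the whole batch."""
    peak = max(abs(v) for v in values)
    scale = peak / 127 if peak else 1.0      # avoid dividing by zero on all-zero input
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the Int8 tier."""
    return [x * scale for x in q]

batch = [0.0, 0.25, -1.0, 1.0]
q, scale = quantize_int8(batch)
restored = dequantize(q, scale)
worst_error = max(abs(a - b) for a, b in zip(batch, restored))
```

The worst error stays within half a quantisation step, which is the precision loss traded for the smaller tier.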

Further Discussion:

It would be beneficial to see the complete code implementation to understand how the caching and tier management work.

Optimizing the conversion between precision levels is crucial for minimizing performance overhead!

The effectiveness of this approach depends on the specific use case and the trade-off between precision loss and efficiency gains.

RS

Multi-line Packed-Bit Int SiMD Maths : Relevance HDR, WCG, ML Machine Learning (Most advantaged ADDER Maths)

The rules of multiple Maths with lower Bit widths into SiMD 256Bit (example) 64Bit & 128Bit & 512Bit can be used

In all methods you use packed bits per save, so single line save or load, Parallel, No ram thrashing.

You cannot flow a 16Bit block into another segment (the next 16Bit block)

You can however use 9 bit as a separator & rolling an addition to the next bit means a more accurate result!
in 32Bit you do 3 * 8bit & 1 * 4Bit, in this example the 4Bit op has 5 Bit results & The 8Bit have 9Bit results..
This is preferable!
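
The 32Bit layout just described (3 * 8Bit with 9Bit results & 1 * 4Bit with a 5Bit result) can be sketched as follows. The exact bit positions of the fields and the function names are assumptions; the point is that one ordinary addition adds all four fields, with each carry landing in its own spare bit.

```python
# (shift, width including the spare carry bit): 9 + 9 + 9 + 5 = 32 bits
FIELDS = ((0, 9), (9, 9), (18, 9), (27, 5))

def pack_fields(a0, a1, a2, n4):
    """Pack three 8-bit values and one 4-bit value, leaving each carry bit clear."""
    word = 0
    for value, (shift, width) in zip((a0, a1, a2, n4), FIELDS):
        if not 0 <= value < (1 << (width - 1)):
            raise ValueError("value does not fit below the carry bit")
        word |= value << shift
    return word

def unpack_fields(word):
    """Read back each field including its carry bit."""
    return tuple((word >> shift) & ((1 << width) - 1) for shift, width in FIELDS)

def packed_add(x, y):
    """One 32-bit add performs all four additions at once."""
    return (x + y) & 0xFFFFFFFF

total = unpack_fields(packed_add(pack_fields(200, 255, 1, 15),
                                 pack_fields(100, 255, 2, 15)))
```

Because 255 + 255 fits in 9 bits & 15 + 15 fits in 5 bits, no field can overflow into its neighbour after a single add.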

2Bit, 3Bit, 4Bit Operation 1 , 8Bit Operations 3: Table

32Bit
4 : 1, 8 : 3

64Bit
4 : 2, 8 : 6
2 : 1, 7 : 8
3 : 1, 8 : 1, 16 : 3

Addition is the only place where 16Bit * 4 = 64Bit works easily, but when you ADD or - you can only roll to the lowest boundary of each 16Bit segment & not into the higher or lower segment.
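
When there is no spare bit between segments, a well known masking trick keeps carries inside each 16Bit segment. This is a general software technique, not something taken from the text above, and the names are hypothetical: the top bit of every lane is added separately with XOR so nothing rolls into the next lane.

```python
HIGH = 0x8000_8000_8000_8000   # top bit of each 16-bit lane
LOW = 0x7FFF_7FFF_7FFF_7FFF    # the remaining 15 bits of each lane

def add16x4(x, y):
    """Add four 16-bit lanes held in 64-bit words; each lane wraps on its own."""
    partial = (x & LOW) + (y & LOW)          # carries stop at each lane's top bit
    return (partial ^ ((x ^ y) & HIGH)) & 0xFFFF_FFFF_FFFF_FFFF

def lanes_to_word(lanes):
    """Pack four 16-bit values, lane 0 in the lowest bits."""
    return sum((v & 0xFFFF) << (16 * i) for i, v in enumerate(lanes))

def word_to_lanes(word):
    return [(word >> (16 * i)) & 0xFFFF for i in range(4)]

x = lanes_to_word([0xFFFF, 1, 0x8000, 0x1234])
y = lanes_to_word([0x0001, 2, 0x8000, 0x1111])
summed = word_to_lanes(add16x4(x, y))
```

Lane 0 wraps to zero without disturbing lane 1, which is the boundary behaviour described for 16Bit * 4 = 64Bit addition.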

A: In order to multiply you need adaptable rules to division & multiply
B: you need a dividable Maths unit with And OR & Not gates to segment the registered Mul SiMD Unit..

In the case of + * you need to use single line rule addition (no over flow per pixel)..
& Either Many AND-OR / Not gate layer or Parallel 16Bit blocks..

You can however painful as it is Multi Load & Zero remainder registers & &or X or Not remainder 00000 on higher depth instructions & so remain pure!

8Bit blocks are a bit small and we use HDR & WCG, So mostly pointless!

We can however 8Bit Write a patch of palette & sub divide our colour palette & Light Shadow Curves in anything over 8Bit depth colour,

In the case of Intel 8Bit * 8 Inferencing unit : 16 Bit Colour is probably (WCG 8 * 8) + (HDR 8 * 8) Segments,

In any case Addition is fortunately what we need! so with ADD we can use SiMD & Integer Today.

Rupert S

https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html

https://science.n-helix.com/2021/11/parallel-execution.html

https://science.n-helix.com/2022/10/ml.html

https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html

https://science.n-helix.com/2023/06/map.html

Main Operation solves: Bit-Depth Conversions & Operations

Packed Bits, Multibyte Storage : u32, u64, u128

The storage of multiple bit operations with Sync Read & Write,
The purpose of this is to Read, Write & Store Operations on:

DOT4
INT8, INT16
F16, F32, F64

In RAM of 32Bit, 64Bit, 128Bit

Values Storage Table

32Bit = [16bit:16Bit]
32Bit = [8bit:8Bit:8bit:8Bit]
32Bit = [4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit]

64Bit = [32bit:32Bit]
64Bit = [16bit:16Bit:16bit:16Bit]
64Bit = [8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit]
64Bit = [4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit]

128Bit = [64bit:64Bit]
128Bit = [32bit:32Bit:32bit:32Bit]
128Bit = [16bit:16Bit:16bit:16Bit:16bit:16Bit:16bit:16Bit]
128Bit = [8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit]
128Bit = [4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit]

Bear in mind that Integer 64Bit is 2 x 32Bit on AMD; So you can compute 2 operations at 32Bit per 64Bit operation,

Some 64Bit units are only 64Bit; So we need to know how many!

32Bit operations are fine! & Conversion of 16Bit value ranges into 32Bit Operations can still be within range of 16Bit Storage..
If we stick within the 16Bit value range on Multiply & ADD,
We can therefore simply post a 16Bit value range data set & expect to be able to Store 16Bit!

The simple method is to store 2 16Bit values in the same 32Bit table; like [16bit:16Bit] = 32Bit

With this we can Load, Store, Run & Save 8bit INT8 operations in 32Bit devices such as Alexa as 8bit x 4 = 32Bit, So we don't Waste RAM or resources!

But we still have access to 32Bit RAM Paging; But with values loaded in 4Bit, 8Bit, 16Bit, 32Bit & so on.
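
The [16bit:16Bit] = 32Bit & 8bit x 4 = 32Bit storage can be shown with Python's standard `struct` module. The little-endian byte order & the helper name `pack_lanes` are assumptions for the sketch.

```python
import struct

def pack_lanes(values, fmt):
    """Pack values little-endian; fmt '4B' = 4 x 8-bit, '2H' = 2 x 16-bit."""
    return struct.pack("<" + fmt, *values)

# Four INT8 values stored in one 32-bit cell.
word32 = struct.unpack("<I", pack_lanes([1, 2, 3, 4], "4B"))[0]

# Two 16-bit values stored in one 32-bit cell.
pair32 = struct.unpack("<I", pack_lanes([0x1234, 0xABCD], "2H"))[0]

# Reading the 32-bit cell back as four 8-bit lanes.
lanes = struct.unpack("<4B", struct.pack("<I", word32))
```

The same pattern extends to 64Bit ('Q') & to 8 x 16Bit across two 64Bit words, so a single load or store moves every lane.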

With NANO Android on F16 & F32 & MIPS the same & AMD, Intel, NVidia,
Learning F16 offers considerable value for performance with 16M Values!

(c)RS

Direct DMA 32Bit & 64Bit RAM : Multiple Sync 16Bit Texture:

A good example of where 8Bit & 16Bit Value load works well is in the case of the texture,
To load 4 x 16Bit into a single 64Bit Cache:

32Bit RAM = 16Bit, 16Bit
64Bit RAM = 16Bit, 16Bit, 16Bit, 16Bit
128Bit RAM = 16Bit, 16Bit, 16Bit, 16Bit, 16Bit, 16Bit, 16Bit, 16Bit

In the case of direct DMA, you would be aware that you have,
128Bit, 192Bit Bus on GPU
32Bit & 64Bit on CPU

So a direct 4 * 32Bit or 2 * 64Bit Cache loads is a logically fast method to DMA directly from Cache to GPU!
In short you convert 8 x 16Bit into a 2x 64Bit DMA push; Which is very fast!

You can do the same with batches of vertices in many storage sizes.

(c)RS

References:
https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html
https://science.n-helix.com/2021/02/multi-operation-maths.html
https://science.n-helix.com/2021/11/parallel-execution.html
https://science.n-helix.com/2022/12/math-error-solve.html

On the subject of how deep a personality of 4Bit, 8Bit, 16Bit is reference:
https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
https://science.n-helix.com/2022/10/ml.html

*

Quantization modelling : RS : Physics III Slit Experiment

Expanding on potentials for precise machine learning has the same qualities as Maths & quantified research,

A fully qualified result is often required for deep thought & precise thought!,

But we do not always have the RAM or resources that we require; When we need to prioritise load & data sets to specific RAM & Processor availability or location, or by necessity..

Optimise our resource footprint & speed, while maintaining precision to our fully optimised values & data set needs & requirements.

*

Dynamic Scaling

Ideas of FP8-F8 & FP16-F16 Interpolation to 32Bit & 64Bit, Gamma Curves are usable (c)RS

Presenting the full precision neuron;

var Me = expand, {

preload = dataset; {Ds1, Ds2, Dsn }, { condition = Present };

var Present = { Datapoint set }; {

var CC = Compose Compressed {Brotli-G > ZSTD };

4Bit to N-Bit { Brotli-G(GPU Shader) Compressed Data Bit with Tri-Linear Interpolation & Extrapolation };

var Pf = Processor contains N Features { F16, F32, F64, FPU } * { N, N2, N3, Nn };
var Ex = Expand Points { Series Precise { F16:<FPU }, Series Median { Int8:<Int64 }, Series Low priority { Int2:<Int32 };

load Present;

run ML, {epoch1 < epochNN };

test results, {log : logNN};

);

*

DC-RAM-COM : (c)RS 16:54 25/11/2024

Directory Compression, specifically designed for RAM & GPU, So there are no internal issues with the maths

Firstly the Operating System lists the applications in relocation tables (in RAM),

Root Directory Kernel Lists & Patterns

Many operating systems would combine the lists:

In Order from Kernel Root
By the same attribute set :
Loaded DLL, OCX,.SO Access patterns
Kernel RAM Load Pattern

Jump Table Allocation Datasets
DLL & Kernel; Re usage directory sort

Caching parameters : Kernel, DLL, OCX, System DLL & .SO,
Reusable elements : Directly accessible System Kernel Objects & Storage & RAM Access objects & MMU

RDK-ListP, Solve the root cause of concern, Speed & Optimisation

Secondly is data compression, The root cause of compression formats being unavailable to SiMD & GPU or CPU : Operator Functions & Processor Instructions.

Identical blocks are identified by comparing them in the cache..

Identification is carried out through a root hash system,

Most operating systems use a certificate & hash value as a security measure (in high security environments)

Each file hash is computed first by the OS; The DLL, EXE & .SO files can then be batched by identical whole file hashes..

Root hash sorting is fast & hash length is short; The lists are sorted by hash commons, HEX code sorted by value,

Internal jump tables are allocated, Hashes are matched

Each hash is associated with a system file & its names (the names are sorted also),

Two directory lists exist in the root hash: The Name list & the Hash list; Both are linked by table jump code (fast in C, Fortran, OpenCL & shader languages)

Thirdly:

Data segment size should be optimal for RAM page granularity, Byte size has to be aligned for speed..

Individual long term segment blocks are analysed with a short hash,

The hashes are sorted in a directory; Identicals are noted & merged & internal jump tables are allocated

The file is then blocked into segments, Root identity,

The system identifiers at the beginning of system DLL & .SO files are matched first

The file block allocator then has three things:

Directory
Jump Table
Block hashes & Jump table

Identical block compression & Identical System File Allocators
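
The three steps above can be sketched roughly as follows; This is an illustration under assumptions, not the DC-RAM-COM implementation: The 4096 byte block size, the SHA-256 whole file hash, the 8 byte BLAKE2b short hash & all function names are choices made for the example:

```python
import hashlib

BLOCK_SIZE = 4096  # assumed page-sized segment


def split_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file image into page-aligned segments."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def build_block_store(files, block_size=BLOCK_SIZE):
    """files: dict of name -> bytes. Returns (directory, jump_table, store)."""
    directory = {}   # whole file hash -> sorted names sharing that content
    jump_table = {}  # file name -> ordered list of short block hashes
    store = {}       # short block hash -> one stored copy of the block
    for name in sorted(files):
        data = files[name]
        file_hash = hashlib.sha256(data).hexdigest()
        directory.setdefault(file_hash, []).append(name)
        hashes = []
        for block in split_blocks(data, block_size):
            short = hashlib.blake2b(block, digest_size=8).hexdigest()
            # A production system would also compare bytes to rule out collisions.
            store.setdefault(short, block)
            hashes.append(short)
        jump_table[name] = hashes
    return directory, jump_table, store


def rebuild(name, jump_table, store):
    """Follow the jump table to reassemble a file from shared blocks."""
    return b"".join(store[h] for h in jump_table[name])


files = {
    "a.dll": b"A" * 8192 + b"B" * 4096,
    "b.so": b"A" * 4096 + b"C" * 4096,
    "c.dll": b"A" * 8192 + b"B" * 4096,  # identical to a.dll
}
directory, jump_table, store = build_block_store(files)
```

Here five logical blocks per pair of files collapse into three stored blocks, & the identical files a.dll & c.dll land in the same directory entry.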

Rupert S

https://is.gd/DictionarySortJS
https://is.gd/UpscaleWinDL

D.R.F LoRA & Bayesian : Gaussian Graph Theory

"Small (10K products, 1.56M obs, 21.6k params) 192xCPU ~22:05 1xGPU ~0:41 [32.3x] 4xGPU ~0:21 [63.1x]"

https://towardsdatascience.com/10000x-faster-bayesian-inference-multi-gpu-svi-vs-traditional-mcmc/
https://towardsdatascience.com/prescriptive-modeling-unpacked-a-complete-guide-to-intervention-with-bayesian-modeling/

As you can see, a 192 Core CPU configuration took over 22h versus 41m on an A10G,..

The aim of our D.R.F LoRA being to significantly speed up the code,..

If 192 Cores is regarded as a single CPU Class, Maybe nothing is being done to speed up the parallelism of the CPU cores,

There are 192 after all, & L3 & L4 Cache Size is relevant to optimisation for baton-passing data,

We should be able to optimise for feature sets:

Maybe we could set extended properties : { SVM = True , NPU Arrays = "" , SiMD Arrays = "" , NPU Version = "" , AVX Version = "" }

Firstly I recommend sorting the data (difficult to do with live data streams),..

Do we have SVM and/or NPUs on the CPU / GPU? Dimensional functions & Graphs would make sense,..

Afterwards we would Net the D.R.F Graphs,..

If that makes sense to you, So does Excel & Graph drawing from science class.

MTE Decision tree reduction, Reduces tree count while maintaining accuracy

Maintaining local sorted metadata caches group logic together better & acts faster

https://towardsdatascience.com/decision-trees-natively-handle-categorical-data/
https://github.com/Arzik1987/medium/blob/main/mt_encoding/mt_encoding.ipynb

https://towardsdatascience.com/connecting-the-dots-for-better-movie-recommendations/
https://github.com/datastax/graph-rag/blob/main/docs/examples/movie-reviews-graph-rag.ipynb

CO² Capture : CatBoost / Gradient / R.F / Gaussian Graph / SVM
https://www.sciencedirect.com/science/article/abs/pii/S1385894725057213

Very Short-Term Solar Forecasts in Solar/Wind Energy
https://www.sciencedirect.com/science/article/abs/pii/S0960148125014363

https://www.ecmwf.int/en/forecasts
https://www.ecmwf.int/en/forecasts/charts

https://wmo.int/

Modernised Graph technology makes sense in chemistry, The less complex you aim to start with..

The more sense Graphs will make of the central ruleset, My view is, Start Simple & work up..

You think that's old? In fighter jets, Aim up from the tailpipe! Simple works..

Now as the Solar/Wind Forecasting document is aiming towards deep learning,

Categorically a very short term forecasting example is the eyes & ears, That is based on R.F LoRA,..

So again a combination of Simple Satellite Photography & Ground Observer Wind/Cloud cover direction categorically proves accuracy of R.F LoRA,

Know about facts & the truth emerges.

Rupert S

Reference

https://is.gd/ML_Opt https://is.gd/OPC_ML_Opt https://is.gd/OPC_ML_QuBit https://is.gd/IntegerMathsML https://is.gd/QuBit_GPU https://is.gd/NUMA_Thread

https://is.gd/PackedBit https://is.gd/BayerDitherPackBitDOT

https://is.gd/QuantizedFRC https://is.gd/BlendModes https://is.gd/TPM_VM_Sec

https://is.gd/SVG_DualBlend https://is.gd/MediaSecurity https://is.gd/JIT_RDMA

*

"(SmoothQuant).The optimized model achieves >3X latency improvement with a custom dequantization kernel for FP16 inference. Although the work does not map to Int8 engine"

In view of inferencing being activated in Int4, Int8 & Int16 & Floats F16b, F8 & F4,

My view is a vision of a Slit experiment in Physics; A slit experiment shows light photons in slices through a screen..

Int4 IIII < Int8 IIIIIIII < Int16 IIIIIIIIIIIIIIII

Ratio 1:2:4 on contained knowledge

Minimal Origin of mankind's knowledge : IIII < IIIIIIII < IIIIIIIIIIIIIIII Defined Summit of all power

My method is to compress the point node data with
https://is.gd/WaveletAutoEncoder

https://github.com/GPUOpen-LibrariesAndSDKs/brotli_g_sdk

So what we do is take advantage of patterns; Creating tables of patterns such as 1111 & 1010; These compress well & can be noted in short form as patterns,

We can expand 4Bit into 8Bit inference & compress as patterns; The total data point is 4Bit if it is a pattern,
The subject is not predictable unless we pick the patterns!

We can however quantize the memory footprint; The Double/Single precision operations may be faster! :L

We need the models to work in F16 & Int8 & Int4 after all, But I see a reason to use Floats because sub-quantization does leave a remainder for us to compare..
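
To make the remainder idea concrete, here is a small sketch assuming plain symmetric linear quantization (one scale per list, signed integer codes); The weights are made-up values & the helper names are illustrative only:

```python
def quantize(values, bits):
    """Symmetric linear quantization to signed `bits`-bit codes; returns (codes, scale)."""
    qmax = (1 << (bits - 1)) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to floats."""
    return [c * scale for c in codes]


def remainder(values, bits):
    """What the float still holds after the integer code is subtracted."""
    codes, scale = quantize(values, bits)
    return [v - r for v, r in zip(values, dequantize(codes, scale))]


weights = [0.12, -0.5, 0.33, 0.9, -0.77]  # hypothetical values
worst_int4 = max(abs(e) for e in remainder(weights, 4))
worst_int8 = max(abs(e) for e in remainder(weights, 8))
```

The Int4 remainder is larger than the Int8 remainder for the same weights; Keeping a float copy is what lets us compare that remainder at all.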

That relevant 'F16' >=-

RS

Study Subject Reduction :

https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
https://science.n-helix.com/2022/10/ml.html

https://blog.openvino.ai/blog-posts/q123-technology-update-low-precision-and-model-optimization
https://blog.openvino.ai/blog-posts/q223-technology-update-low-precision-and-model-optimization
https://blog.openvino.ai/blog-posts/q323-technology-update-low-precision-and-model-optimization
https://blog.openvino.ai/blog-posts/q423-technology-update-low-precision-and-model-optimization

Quantification Analytics with combined operations
Accurate and Efficient Collaborative Optimizations for Fast Generative AI on AMD GPU
https://community.amd.com/t5/ai/developer-blog-accurate-and-efficient-collaborative/ba-p/682185

https://is.gd/OptimisingWhisper

Automatic 4bit Activation-Aware Quantization (AWQ),

I am confident that 8Bit is still logical; 4Bit defines well..
We need DOT4 support from Chrome Dev; U32/8 & U32/16 with line by line Sign Arrays:

U32/8 U32/8 U32/8 U32/8 S1
U32/16 U32/16 U32/16 U32/16 S1
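
One possible reading of the U32/8 row above, sketched in Python: Four unsigned 8Bit lanes packed into a 32Bit word, one sign bit per row (S1), & a DOT4 style multiply-accumulate across the lanes. Treating S1 as a single sign for the whole row is an assumption, & the function names are illustrative rather than any browser or GPU API:

```python
def pack_u8x4(lanes):
    """Pack four unsigned 8Bit lanes into one 32Bit word (lane 0 in the low bits)."""
    if len(lanes) != 4:
        raise ValueError("expected exactly four lanes")
    word = 0
    for i, value in enumerate(lanes):
        if not 0 <= value <= 0xFF:
            raise ValueError("lane does not fit in 8 bits")
        word |= value << (8 * i)
    return word


def dot4_u8(a_word, b_word, acc=0):
    """Multiply matching 8Bit lanes of two words & add the four products to acc."""
    for i in range(4):
        a = (a_word >> (8 * i)) & 0xFF
        b = (b_word >> (8 * i)) & 0xFF
        acc += a * b
    return acc


def signed_row_dot(a_word, a_sign, b_word, b_sign, acc=0):
    """Apply one sign bit per row: the product is negative when the signs differ."""
    sign = -1 if (a_sign ^ b_sign) else 1
    return acc + sign * dot4_u8(a_word, b_word)


row_a = pack_u8x4([1, 2, 3, 4])
row_b = pack_u8x4([5, 6, 7, 8])
```

With these rows, dot4_u8 gives 1*5 + 2*6 + 3*7 + 4*8 = 70, & a differing sign bit flips it to -70.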

Reference "Shared exponent" https://drive.google.com/file/d/1fLUcZPZYearBjd0wQIIxFBbL1fDQG2Hp/

https://community.amd.com/t5/ai/reduce-memory-footprint-and-improve-performance-running-llms-on/ba-p/686157

Ideas of FP8-F8 & FP16-F16 Interpolation to 32Bit & 64Bit, Gamma Curves are usable (c)RS - OCP F8 & FP8 or smaller with interpolation: Microscaling Formats (MX) v1.0 specification
https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
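
A simplified sketch of the shared exponent idea: One power-of-two scale is stored per block & each element keeps only a small integer code. This is not a conformant implementation of the OCP MX formats (block size, element types & scale encoding in the specification are not reproduced here); The names & the 8Bit element width are assumptions for illustration:

```python
import math


def encode_block(values, elem_bits=8):
    """Return (shared_exponent, codes) with value ~= code * 2**shared_exponent."""
    qmax = (1 << (elem_bits - 1)) - 1
    peak = max(abs(v) for v in values)
    if peak == 0:
        return 0, [0] * len(values)
    # Smallest power-of-two scale that keeps the largest value inside the code range.
    exponent = math.ceil(math.log2(peak / qmax))
    scale = 2.0 ** exponent
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return exponent, codes


def decode_block(exponent, codes):
    """Expand the block back to floats using the single shared scale."""
    scale = 2.0 ** exponent
    return [c * scale for c in codes]


block = [0.9, -0.45, 0.1, 0.003]  # hypothetical values
exponent, codes = encode_block(block)
restored = decode_block(exponent, codes)
```

Small values inside a block with a large peak lose the most precision (0.003 rounds to zero here), which is why the block size matters.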

Self Trained Auto Sparsity ML

ML Batch Matrix MAP in FPGA
https://drive.google.com/file/d/1hdxeK1r8LIhvpn7poOm3MfXmGr9Tq-ni/view?usp=sharing

ML Compressed Dynamic16bit-8Bit - Hardware-friendly compression and hardware acceleration for ML Transformer
https://aimspress.com/article/doi/10.3934/era.2022192

Matrix Processors - Memory & command - All-Digital Compute-In-Memory FPGA Architecture for Deep Learning Acceleration
https://dl.acm.org/doi/pdf/10.1145/3640469

Matrix Processors - Inline Ram & Command { CMD : RAM }:{NET}
https://www.xilinx.com/content/dam/xilinx/support/documents/white_papers/wp506-ai-engine.pdf
https://www.xilinx.com/content/dam/xilinx/support/documents/white_papers/EW2020-Deep-Learning-Inference-AICore.pdf

Learning Cards, 52 up to 208 TOPS, $249+
https://hailo.ai/products/ai-accelerators/hailo-8-century-high-performance-pcie-card/#hailo8-features

Comparative Streaming cards with ML & 70+ video Streams per unit

130 channels of 1920x1080p
https://www.qualcomm.com/products/technology/processors/cloud-artificial-intelligence
96 channels of 1920x1080p
https://www.xilinx.com/applications/data-center/v70.html

TAC (Tiny Anomaly Compression)
https://pypi.org/project/Conect2ai/

Inference on any device with a C99 compiler
https://pypi.org/project/emlearn/

to run without activating C99; Installs under Python 3.10+
https://github.com/emlearn/emlearn-micropython
https://github.com/emlearn/emlearn-micropython/releases
git clone https://github.com/emlearn/emlearn-micropython

With EmLearn you can compile really tight models of tensors & random forest & Gaussian Matrix,
These are very good for:

A1: Anti-Aliasing ( Gaussian, Tensor error diffusion, forested Random spread )
A2: Sharpening & Shaping ( Tensor Edge detect with enhance, Gaussian estimation & line fill, Random forest A to B to D: E to B to F X + )
A3: Line & Curve estimation fills & Tessellation ( forested Random spread (Dither fills) & A1 & A2 & Differentiation in 3D Space : 1:2:3{ A B C : E B F } )
A4: HDR & WCG, Combinations of dithering in colour space & light/Shadow differentiation in 3D Space : 1:2:3{ A B C : E B F }

36Minutes UpscaleDL https://youtu.be/16jLi95mat8

Megatron Classifies Images in Web Tensors
A: https://drive.google.com/file/d/1EMMASCIu92hIgIxg0bEBrmAJuxvEfk2e/view?usp=drive_link

B: https://drive.google.com/file/d/1A_P9GI6jztxw-K3xlPocGUzTSqti7_wX/view?usp=drive_link
C: https://drive.google.com/file/d/18jnPASrGo_pbubGmLuRiDVJYMDjPeD-c/view?usp=drive_link

D: https://drive.google.com/file/d/1Sfm9wUqihpC4gnhinuKD7SjlyrzFZc6B/view?usp=drive_link
E: https://drive.google.com/file/d/1wKfcBIKnHmHPbxcWB1Xp1gW3M7MvWlzy/view?usp=drive_link
F: https://drive.google.com/file/d/1R-4p-R6QMVwhCkUAdpUvhVK46t-h9_fw/view?usp=drive_link
G: https://drive.google.com/file/d/1hTTnczwKCiGUi3B4jkAG-TyfvY3hJJfs/view?usp=drive_link

11m 34m space then 11m;

https://drive.google.com/file/d/1FZQnTNwqN2KPz0NUcELr63TdqEybnsn_/view?usp=drive_link
58m UpscaleDL

https://drive.google.com/file/d/1vNQpvnKCTMicT8QztSeZAcLhSuWD_pjK/view?usp=drive_link
36Minutes UpscaleDL https://drive.google.com/file/d/1zEJsz8_Us_nu2un5n5yE-F-q0gJpNi0z/view?usp=drive_link

Count 58 - 18-49m Inference USB Accelerators - Megatron Web Tensor Classification 2024-02-02 17-12
https://drive.google.com/file/d/18hFa_fDMzVX8bbRyYUZh8JAByCN8fFaJ/view?usp=drive_link

Mr420Megatron Classifies Images in Web Tensors & you know he's good right, That is just what he feels! for real bro
https://drive.google.com/file/d/1UXlA-xpODvwGuUhCed0EBd6LJ0wB4J5E/view?usp=drive_link

https://www.tensorflow.org/js/demos
https://storage.googleapis.com/tfjs-examples/addition-rnn/dist/index.html

Batch 256 1 layer 128 neurons RNN

https://drive.google.com/open?id=13GjvS3BnFZYt6iMvxithaqlCl-dcF7HZ&usp=drive_link
https://drive.google.com/open?id=1EtEoaCC8KKC9QkOO1FTJUqJ-izGqObV6&usp=drive_link

https://drive.google.com/file/d/10FmmkzCwWscDlQMMEFLwdEuUOA8tAAvd/view?usp=drive_link

Batch 512 2 layer 256 neurons RNN GRU

https://drive.google.com/file/d/1jyLBFD4Q5OfRCG8zd7TdS1JpF5Nck6CS/view?usp=drive_link

You can see that 2 layers take longer to train; More Stable though!
They also take more RAM & processor time.

Phishing detection systems train fast, maybe there is hope yet for AV
https://storage.googleapis.com/tfjs-examples/website-phishing/dist/index.html
https://drive.google.com/file/d/1oV_WM3YFYretx2wWz_DSMIyzGeFhHvHj/view?usp=drive_link

https://is.gd/TPU_Inference
https://is.gd/TPU_Inference2

https://is.gd/ConvNetTPU
https://is.gd/DenseNetTPU

Supercharge Web AI model testing: WebGPU, WebGL
https://developer.chrome.com/blog/supercharge-web-ai-testing?hl=en

https://tensorflowjs-fashion-mnist-classifier.glitch.me/

https://www.w3.org/2020/06/machine-learning-workshop/talks/access_purpose_built_ml_hardware_with_web_neural_network_api.html

https://www.w3.org/TR/webnn/#intro

https://www.tensorflow.org/js

https://github.com/microsoft/tensorflow-directml/releases

https://intel.github.io/webml-polyfill/examples/image_classification

RX580 supports DirectML Feature level 2, What does that mean? Technical question!

https://github.com/microsoft/DirectML/blob/master/Releases.md
https://learn.microsoft.com/en-us/windows/ai/directml/dml-feature-level-history
https://learn.microsoft.com/en-us/windows/win32/api/directml/ne-directml-dml_feature_level

https://is.gd/ML_Maths

Rupert S

Batch Size 240W>65W, 32GB{64, 16}, 15W>5W, 4gb{16, 1} : 16, 8, 4 seems optimal,
Time taken compatible:

ML_With_USB_Stress-Testing_USB_Accelerators_for_Efficient_Edge
https://www.researchgate.net/publication/377174200_Stress-Testing_USB_Accelerators_for_Efficient_Edge_Inference

https://github.com/raphischer/edge-acc

Coral TPU Micro Edge Learning with performance arranged Intel FPGA Arria 10 SX SoC Kit & Google Coral, NVIDIA Jetson Nano and CPU ROCK 4C Plus
https://doi.org/10.3390/s24030899

ML Document Caches - USB Acceleration & Small devices - Combining Machine Learning and Edge Computing Opportunities Frameworks & Devices
https://www.mdpi.com/2079-9292/13/3/640

https://is.gd/CJS_DictionarySort

Python & JS Configurations
https://is.gd/DictionarySortJS

ML Tensor & ONNX machine learning models should prefer direct compression & higher accuracy over Bit Reduction; Because reducing Bit Depth on decisions makes results potentially overflow your maximum ML Node Point Depth...

Because of point overflow at low bit depth (less than 4Bit in most cases), We plan to use compression to multiply the RAM available to the ML..

With Brotli-G the compressed data can be decompressed directly inside the GPU & therefore the results are much faster & more efficient for us..

We can further improve by selecting compression-compatible patterns such as 1111<1toN or 1010<10*N where N = Multiples of, for example, 1234 (repeating); R * N = RN,

So we can maximise compression in the Processor & need not pass uncompressed data points,
We Cache & Decompress & Recompress as required.
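
A rough illustration of why repeating 4Bit patterns help: The sketch packs nibbles two per byte & compares the compressed size of a patterned stream against random bytes. Python's zlib (DEFLATE) is used purely as a CPU stand-in, since Brotli-G is a GPU oriented SDK & is not called here; The helper names are illustrative:

```python
import os
import zlib


def pack_nibbles(nibbles):
    """Store two 4Bit values per byte, high nibble first."""
    nibbles = list(nibbles)
    if len(nibbles) % 2:
        nibbles.append(0)  # pad to a whole byte
    return bytes((nibbles[i] << 4) | nibbles[i + 1] for i in range(0, len(nibbles), 2))


def compressed_ratio(payload):
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(payload, 9)) / len(payload)


patterned = pack_nibbles([0b1111, 0b1010] * 4096)  # 1111 1010 repeated
noisy = os.urandom(len(patterned))                 # no pattern to exploit

patterned_ratio = compressed_ratio(patterned)
noisy_ratio = compressed_ratio(noisy)
```

The patterned stream shrinks to a small fraction of its size, while the random stream does not shrink at all; Picking the patterns is what makes the data predictable.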

RS

ML tensor + ONNX Learner libraries & files

Accelerated Python: NPU, TPU, SiMD
https://is.gd/CoralAI

https://is.gd/DictionarySortJS
https://is.gd/UpscalerUSB_ROM
https://is.gd/UpscaleWinDL
https://is.gd/HPC_HIP_CUDA

https://is.gd/SPIRV_HIPcuda
https://is.gd/TFLiteDev

https://is.gd/TFLiteDevP2

https://is.gd/OpenStreamingCodecs

Application of Data Compression to ML

Some examples of how Brotli-G compression can be used to improve the performance of machine learning models:

Compressing model parameters: Brotli-G can be used to compress the weights and biases of machine learning models,

This can reduce the amount of memory required to store the model; Which can be beneficial for deploying the model on devices with limited memory.. For example (illustrative figures):

Brotli-G might compress a model with 100 million 8Bit parameters from 100MB to 50MB.

Compressing model inputs and outputs: Brotli-G can also be used to compress the inputs and outputs of machine learning models;

This can reduce the amount of data that needs to be transferred between the model and the data source or sink.. For example:

Brotli-G might compress images from 1MB to 500KB.

Compressing model activations: Brotli-G can also be used to compress the activations of a machine learning model,

Reducing the amount of memory required to store the intermediate results of the model.. For example:

Brotli-G might compress activations from 500MB to 250MB.

In addition to these specific examples, Brotli-G can also be used to compress other types of data that are used in machine learning, such as text & other data,

Brotli-G is a high-performance compression algorithm that can provide significant performance improvements for machine learning applications.

These examples demonstrate the potential of Brotli-G to improve the performance of machine learning models. As Brotli-G becomes more widely adopted, we can expect to see even more innovative uses of powerful compression algorithms.

RS

*

Inferencing & Classification : Protocols

To clarify: Inferencing units such as those from Intel, AMD & ARM are expressly created for a minimal instruction load; Edge detect & other machine learning comparators..

As the Inferencing instructions contain the logic of comparison, They are created to facilitate the comparison of Inference tasks..

Most logically, a wise person could see scope for edge detection expressly with edge sharpening & shaping in mind; But also Trilinear filtering & of course Tessellation..

Now I believe you have Displays, Cameras & Audio Systems to optimise!

Now we know that we can also improve latency related issues such as frame tearing detection & also jitter & QFT & VRR.

How? Inference all of the latency issues of frame arrival time, torn frames, misaligned audio & electric signal jitter in what is effectively an Ethernet protocol, AKA Frame Transmission & Reception...

More ? Why not :L

Rupert S

*

Int8:SiMD : Maths & Logic

This is about how you think about components such as INT8, INT4(Xbox) & SiMD, You have to classify by necessity & optimise the structure.

You can shape the game reality with specific control objects & statics!
Maths in SiMD & Int8 & Machine Learning in Int8 & SiMD; SiMD is hard maths, Int8 is soft edge inference...

Both are maths; But soft logic is not a PROOF Math but can be proof; Hard math is not 'Invention & Imagination 'Exactly''

But we have both to improve performance.

RS

*

SiMD Performance : RS

Performance per WATT of MMX & MMX+ & SSE & AVX Machine Learning & Shader code; Is a matter of 8x8Bit & 16x16Bit Code on GPU

Our role is to reduce complex un-cache-able ML to Cache Enabled 64KB
Modelling of 1990's without Quality loss of 32Bit++ 64Bit+

8x8Bit sharpening MMX Becomes Dual Pipe (16x16bit)*2 in 32Bit Dual 16 Pipeline & Twice as sharp
Machine Learning method for MMX Is Fast & Cheap, MMX2 More Compatible,
Intrinsic improvements such as combined ops & DOT4 Further improve the performance of under 1MB Code..

Performance & Function per WATT, Is unbeaten; Let us prove it!

For example Quake has MMX Emulation & MMX Dithering code on 3D Textures,
In 8Bit 256 Colours dithering is noticeable; In 15Bit to 32Bit the small shade difference in dithering colour is subtle & flawless,
Improving light subtlety & Colour palette WCG & HDR 10Bit to 16Bit per channel.

SiMD & Int8 & dp4a & F16/F32/F64>:

The way SiMD repeats parallel batches of instructions, it can still side load data;
Data is loaded into the 'calculation set'

http://ftp.cvut.cz/kernel/people/geoff/cell/ps3-linux-docs/CellProgrammingTutorial/BasicsOfSIMDProgramming.html
https://en.wikipedia.org/wiki/Single_instruction,_multiple_data

SiMD consists of 8Bit to 64Bit Longs & Floats,
SiMD are simple instructions; Or so they think; SiMD are relatively complex instructions..
For example 4/1 of a page full of arithmetic code; However our goal is to use Heuristics & logic to circumvent the Artifacts/Errors in self generated code,

In addition to using problem solving tables to choose instructions that advantage our analysis (Machine Learning),
We also can choose the most probably optimal code type.

Our outset objective is to decide if we want to use CPU Feature types:

F16
Int8
dp4a
SiMD

Depending on the Mathematical Qualities of each ML Node & the questions they are asking,
For example:

A simple ResNet Image identification uses edge detect & for that we need for example SiMD Matrix Edge Detection
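
As a reference point for what a matrix edge detect computes, here is a plain Python sketch using the standard 3x3 Sobel kernels on a small grayscale grid; A SiMD version would process many pixels per instruction, which this scalar loop does not attempt. The threshold & test image are arbitrary choices for the example:

```python
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def apply3x3(image, kernel, x, y):
    """Weighted sum of the 3x3 neighbourhood centred on (x, y)."""
    return sum(kernel[j][i] * image[y + j - 1][x + i - 1]
               for j in range(3) for i in range(3))


def edge_map(image, threshold):
    """Mark interior pixels whose gradient magnitude |gx| + |gy| reaches the threshold."""
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = apply3x3(image, SOBEL_X, x, y)
            gy = apply3x3(image, SOBEL_Y, x, y)
            edges[y][x] = 1 if abs(gx) + abs(gy) >= threshold else 0
    return edges


# Dark left half, bright right half: one vertical edge down the middle.
image = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
edges = edge_map(image, threshold=500)
```

The two columns either side of the brightness step are marked as edge pixels; Border pixels are left at zero because they have no full neighbourhood.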

Speech requires identifying Words in a codec, So obviously we need a Decoder & Encoder,
Word identifiers & correctness checking; But firstly we need to identify accent to correctly choose words..

We also need to classify words by Idea grouping (DataBase, Open Database)

As you can see; We will be defining many of these function groups as SiMD & Float,
Effective use of Int8 differentiation, Comparators & Maths operations has many benefits; So does JIT Compile.

Solve Table of Statistically provable Machine Equates & Solves : Table of function competitors & Operators.

Runtime Library - Multiple Solve Table

I would like a Solve Table of Statistically provable Machine Equates & Solves that make the equivalent of Maths Compilers such as Rust & Fortran

For example basic ML code test function loops are basically compatible with X-OR Comparators on AVX! Other functions such as greater or less than; Are AVX Compatible.

Machine Learning : List of actions that are SiMD Baseline: Statistical Observance and Solve Tables

Yes or no comparator X-OR
Memory array Byte Swap
Greater or less than with swap or with X-OR Roll
Memory save & store
Edge comparisons
Compares (Colour, Math, Equate, Target, Solve if)

There are more! Statistical Observance and Solve Tables.
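
A few of the baseline actions listed above, written as scalar Python over lists that stand in for SiMD lanes; Real SiMD code would apply each step to a whole register at once. Reading "X-OR Roll" as the XOR swap trick is an interpretation, & the function names are illustrative:

```python
def xor_equal(a, b):
    """Yes/no comparator: two lanes match exactly when a XOR b is zero."""
    return [(x ^ y) == 0 for x, y in zip(a, b)]


def byte_swap32(word):
    """Reverse the byte order of a 32Bit word."""
    return int.from_bytes(word.to_bytes(4, "little"), "big")


def compare_swap(a, b):
    """Greater or less than with swap: per lane, smaller value first, larger second."""
    low = [min(x, y) for x, y in zip(a, b)]
    high = [max(x, y) for x, y in zip(a, b)]
    return low, high


def xor_swap(x, y):
    """Exchange two integers using three XORs & no temporary value."""
    x ^= y
    y ^= x
    x ^= y
    return x, y
```

For example, compare_swap([5, 1], [2, 9]) orders each lane pair without a branch in the caller, which is the building block of SiMD sorting networks.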

Examples 2:

Shape compare is a matter of inner & outer Vector : Comparison & X-OR, Larger outside & X-OR The differentiation:
By Dot,
By Mass (non literal dot difference comparator by axis),
Actual Mass
Density : Lumina, Weight, Mole, Mass / Area

Edge Solve : X-OR ~= Colour, Lumina, Shade, Vibrancy, Distance, Matrix Solve 3D>=2D Flattened Comparator
If = X-OR=N<0.0001 Then Compare &= Mutex Solve / Average

Polygon Join/Merge Tessellation : If Model = Same (T1 + T2 If (T1 + T2)/2 = Difference Less Than 0.0001 | = Merge/Converge
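
One way to read the merge rule above, sketched in Python: Two vertices are treated as the same point when every coordinate differs by less than 0.0001, & they converge to their average. Comparing by per-axis difference & merging by averaging are assumptions about the intent; The names are illustrative:

```python
TOLERANCE = 0.0001


def close_enough(p, q, tol=TOLERANCE):
    """True when every axis of p & q differs by less than the tolerance."""
    return all(abs(a - b) < tol for a, b in zip(p, q))


def merge_vertices(vertices, tol=TOLERANCE):
    """Converge near-identical vertices to their average; keep distinct ones."""
    merged = []
    for vertex in vertices:
        for i, kept in enumerate(merged):
            if close_enough(vertex, kept, tol):
                merged[i] = tuple((a + b) / 2 for a, b in zip(vertex, kept))
                break
        else:
            merged.append(tuple(vertex))
    return merged


mesh = [(0.0, 0.0, 0.0), (0.00005, 0.0, 0.0), (1.0, 1.0, 1.0)]
welded = merge_vertices(mesh)
```

The first two vertices merge into one & the third stays separate, so the three input points become two.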

*

Audio, Video & High precision Float ML

tensors & full onnx configuration : Upscaling : While we are not sure how much ML we need & at what precision,

We can be sure that 32Bit (per channel) Value RGBA (Multiple layer) requires at least 8Bit to 16Bit per channel final precision; So here is a list:

Required Value of output, Neural Network precision guide table: RS

Input
8Bit, 10Bit, 12Bit, 16Bit

Input network precision average bit retention (for RAM some error is allowed)
6Bit, 8Bit, 10Bit, 14Bit, 16Bit

Classifiers, as we know, can be:
Int 2Bit, 4Bit, 8Bit, 16Bit, 32Bit
2Bit is unlikely & 32Bit is for Dream Smooth 16Bit+ Precision output

Output Float (Mostly FP & F16b)
16Bit = { 8Bit, 10Bit, 12Bit }
24Bit, 32Bit, 64Bit = { 16Bit, 32Bit, 48Bit }

We can upscale : Audio, Video, Content & Polygons; We classify Quality by expectations & Quantify by percent %

Rupert S

*

8Bit vs 16Bit vs 32Bit

Stitching wounds is an example for us to use to compare inferencing bit depth:
An 8Bit reference photo constitutes approximately 1cm² Black & White / Grayscale 300ppi, maybe 1/2cm² Colour 8Bit 150PPI,

16Bit reference constitutes approximately 6cm² grey scale 600ppi, 3cm² Colour 15Bit 300ppi.

32Bit single precision still has more to examine.

Both 8Bit & 16Bit Inference offer a solution.

(c)Rupert S

Bit Depth and Colour Representation:

A bit is a fundamental unit of information in computing, representing either a 0 or a 1.

Bit depth refers to the number of bits used to represent a colour value for a single pixel in an image.

Higher bit depth translates to more possible colours or shades of grey.

An 8-bit image can represent 2 raised to the power of 8 (2⁸) which is 256 colour values,
This is often enough for basic images and applies well to grayscale images with high precision (300ppi in example).

A 16-bit image can represent 2¹⁶ (65,536) colour values, offering a significant increase in colour detail,
This can be beneficial for colour reference photos (like the 300ppi colour example).
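
The arithmetic behind these figures can be checked with a few lines of Python; The raw size helper assumes uncompressed pixels & is only there to show how bit depth, resolution & area multiply together:

```python
def levels(bits):
    """Number of distinct values a channel of the given bit depth can hold."""
    return 1 << bits


def pixels_per_cm2(ppi):
    """Pixels in one square centimetre at a given pixels-per-inch (1 inch = 2.54 cm)."""
    per_cm = ppi / 2.54
    return per_cm * per_cm


def raw_bytes(area_cm2, ppi, bits_per_channel, channels):
    """Uncompressed size of an image patch in bytes."""
    pixels = area_cm2 * pixels_per_cm2(ppi)
    return pixels * channels * bits_per_channel / 8


grey_8bit = raw_bytes(1, 300, 8, 1)     # 1cm² grayscale, 8Bit, 300ppi
colour_16bit = raw_bytes(3, 300, 16, 3) # 3cm² colour, 16Bit per channel, 300ppi
```

One square centimetre at 300ppi holds roughly 13,950 pixels, so the 16Bit colour patch carries 18 times the raw data of the 8Bit grayscale patch (3 x area, 3 x channels, 2 x depth).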

Bit Depth and Image Quality in Stitching

In the context of stitching wounds together, accurate colour representation and detail are crucial.

An 8-bit grayscale image at 300ppi might provide enough detail for basic analysis,
But a 16-bit image (or even higher) would likely be preferable for capturing subtle variations in skin tone and tissue.

The provided information suggests that 16-bit colour images might offer a good balance between detail and file size for this application (around 3cm² at 300ppi).

32-bit and Beyond

While 32-bit images offer an even greater range of colours, they might not be necessary for tasks like wound stitching, and would likely come with increased file size and processing demands.

Important Considerations

The suitability of a bit depth depends on the specific application.

File size also plays a role - higher bit depth images require more storage space.

Processing power required to manipulate the image can also be affected by bit depth.

*

TPU is discovering a new market in L1 Server class NPU Share; Minimal footprint NPU Class EdgeTPU has the edge you can't match..
Edge Server class mPCIe M.2 NPU learning.

An 8bit S Curve is 200 points curves over 15 Frames, now that is an Edge TPU Coral.ai

Remember that without Jenny, his dream of identifying cells wouldn't have come today, Jenny inspired the Resnet 50m photos to cure cancer story with her great energy; Resnet-50 Cell identification program

You may be wondering but as a doctor you may already know that they have a 3D XRay to destroy cancers,
But you might not know how Resnet50 could isolate & destroy cancer cell clusters in a 3D XRay/Image/MRI scan,

cloudflare.com Resnet-50 Service It takes 50m images identified to cut all polyp cancer cells from a victim,
Coral edge TPU and Movidius are the economy answers for cloudflare and MSF.fr

@cf/microsoft/resnet-50 50 layers deep-image classification CNN trained on more than 1M images from ImageNet

Worthy configurations for consoles such as dentists' computers:
Cancer & searches for tissues such as FATS in the veins of the heart,

The average width of a healthy vein is a measurable statistical normal,
The amount of fat stuck to the vein constitutes an abnormality or statistical deviation..
Many of these measurements need official verification & should be signed as verified.

Heart pulse rate versus body size, & arrhythmia or statistical variances beyond normal that are not seen as healthy (if you verify knowledge).

There are many small tasks that the body does that are equivalent to vehicle verification & health checks..

The results of the statistical normal & the small tasks:
A task that has many points of interest takes hours for people to verify,
Computers do these tasks better & quicker.
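
The statistical normal versus deviation check can be sketched as a simple z-score test; The numbers below are invented, carry no clinical meaning, & the 2.0 limit is an arbitrary choice for the example. Any real measurement would still need the official verification mentioned above:

```python
from statistics import mean, stdev


def flag_deviations(baseline, readings, z_limit=2.0):
    """Return (index, reading, z) for readings far from the baseline normal."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has no spread; cannot compute a deviation")
    flagged = []
    for index, reading in enumerate(readings):
        z = (reading - mu) / sigma
        if abs(z) > z_limit:
            flagged.append((index, reading, round(z, 2)))
    return flagged


# Hypothetical widths in arbitrary units: a healthy baseline & three new readings.
baseline = [4.0, 4.1, 3.9, 4.0, 4.2, 3.8, 4.0, 4.1]
readings = [4.0, 3.1, 4.05]
flagged = flag_deviations(baseline, readings)
```

Only the second reading is flagged; The other two sit well inside the spread of the baseline.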

https://is.gd/DictionarySortJS

https://is.gd/UpscaleWinDL
https://is.gd/HPC_HIP_CUDA

https://is.gd/UpscalerUSB_ROM

https://is.gd/OpenStreamingCodecs

ApplicationSensiMelia (c)RS

I estimate a tip of 15cents per client per hour would make the application work,

In the case of diabetics & other statistical anomalies like heart rate,
The App that works is a combination of Lamba LLM & statistics & average deviations in 8Bit inferencing,

Perfect for EdgeTPU.

(c)Rupert S

https://is.gd/DictionarySortJS
https://is.gd/UpscaleWinDL
https://developers.cloudflare.com/workers-ai/models/image-classification/

Cancer Research References

Reference: https://drive.google.com/file/d/1WmhMcCZZjDI4pKnQsccvaf4RdquhPPs8/
Reference française: https://drive.google.com/file/d/1WiFUEOE23D4UTQRN7MP6Z4Lh24PJxuFG/

https://www.google.com/search?q=resnet+50+for+cancer+screening&hl=en

Skin cancer
https://github.com/ngandhi369/Skin-Cancer-detection-using-ResNet-50

Prostate Cancer
https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-024-02419-0

Breast Cancer
https://pubmed.ncbi.nlm.nih.gov/36631349/
https://arxiv.org/html/2308.13150v6

Deep learning radiomics based prediction of axillary lymph node metastasis in breast cancer
https://www.nature.com/articles/s41523-024-00628-4

Improving image classification of gastrointestinal endoscopy using curriculum self-supervised learning
https://www.nature.com/articles/s41598-024-53955-8

Cervical cancer
Prediction of lymph node metastasis in operable cervical cancer using clinical parameters and deep learning with MRI data: a multicentre study
https://insightsimaging.springeropen.com/articles/10.1186/s13244-024-01618-7

DeepCPD: deep learning with vision transformer for colorectal polyp detection
https://link.springer.com/article/10.1007/s11042-024-18607-z

A Combined Ensemble Model (CEM) for a Liver Cancer Detection System
https://thesai.org/Publications/ViewPaper?Volume=15&Issue=2&Code=IJACSA&SerialNo=18

MRI ML Enhancement
Deep-learning-based reconstruction of under-sampled MRI to reduce scan times, a multicentre retrospective cohort study
https://www.thelancet.com/journals/lanonc/article/PIIS1470-2045(23)00641-1/fulltext

PulmoNet: a novel deep learning based pulmonary diseases detection model
https://bmcmedimaging.biomedcentral.com/articles/10.1186/s12880-024-01227-2

An efficient image classification of lung nodule classification approach using CT and PET fused images
https://drive.google.com/file/d/1irjQF-rvLtfvdzHTBB7tSuJCGLO40OxD/view?usp=drive_link

Salivary gland tumours
Deep learning based ultrasound analysis facilitates precise distinction between parotid pleomorphic adenoma and Warthin tumour
https://www.frontiersin.org/journals/oncology/articles/10.3389/fonc.2024.1337631/full

Rapid and Label-Free Histopathology of Oral Lesions Using Deep Learning Applied to Optical and Infrared Spectroscopic Imaging Data
https://www.mdpi.com/2075-4426/14/3/304

Brain Cancer
A multi-class brain tumour grading system based on histopathological images using a hybrid YOLO and RESNET networks
https://www.nature.com/articles/s41598-024-54864-6

Deep-learning quantified cell-type-specific nuclear morphology predicts genomic instability and prognosis in multiple cancer types
https://www.biorxiv.org/content/biorxiv/early/2024/03/12/2023.05.15.539600.full.pdf

Multiple path trained cancer & diagnostics study with full networks and specifically tuned: Progress : :D
https://blog.research.google/2024/03/health-specific-embedding-tools-for.html

Medical Data for ML Processes:
The data that support the findings of this study are openly available on Kaggle: https://www.kaggle.com/

*
Treatment

Machine Learning Processed 3D Fully Masked Identified Groups : MLp-MaskedIG screen & cleanse:

Dealing with Resnet Identify Masking & Precise Ion control in 3D Fully Masked Identified Groups
https://home.cern/news/news/knowledge-sharing/cern-detector-could-help-improve-head-tumour-radiotherapy
https://home.cern/news/news/knowledge-sharing/biodynamo-cutting-edge-software-helps-battle-cancer

Observing that directed energy reduces radiation exposure & sickness; works with surgeries also & is human processed or robotic.

RS
*

Incident observations , download entitlement https://drive.google.com/file/d/1GOZR4kZmH1s4vqqoNZ0C8Pnc0BwenkRo/

Extra layer reference : https://coral.ai/docs/edgetpu/inference/

Retraining the last layer or Repointing a network;
In terms of ourselves this constitutes Retraining our degree along specialisation tasks,
That is the method RS

We need to clip the last layer or re-profile the vision application if we wish to add new classes, such as cancers or germs, to a pre-trained model.

According to the Coral documentation, we either repoint nodes or clip the inferencing layer & retrain the network before inferencing..
Exact referencing is complex; But we need to retrain or repoint GANs & inferencing networks..

Pre-trained networks that are not specific to our task cannot add nodes for new tasks without re-imprinting the network or shaving off the last layer & retraining it for our identification tasks.
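
As a rough illustration of retraining only the last layer, here is a minimal pure-Python sketch. It is not the Coral API: frozen_features is a hypothetical stand-in for the pre-trained layers, the data is made up & the new final layer is a single logistic unit.

```python
import math

def frozen_features(x):
    # Hypothetical stand-in for the pre-trained layers: a fixed transform
    # that is never updated during retraining (assumption for illustration).
    return [x[0] + x[1], x[0] - x[1], 1.0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_last_layer(samples, labels, epochs=200, lr=0.5):
    """Train only a new final layer on top of the frozen features."""
    feats = [frozen_features(x) for x in samples]
    w = [0.0] * len(feats[0])
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)))
            # Logistic-loss gradient step; only the last layer's weights move.
            w = [wi + lr * (y - p) * fi for wi, fi in zip(w, f)]
    return w

def predict(w, x):
    f = frozen_features(x)
    return 1 if sigmoid(sum(wi * fi for wi, fi in zip(w, f))) >= 0.5 else 0

# Made-up two-class data standing in for a new identification task.
samples = [(0.0, 0.0), (0.2, 0.1), (0.9, 1.0), (1.0, 0.8)]
labels = [0, 0, 1, 1]
weights = train_last_layer(samples, labels)
print([predict(weights, x) for x in samples])
```

The point of the design is that the expensive feature layers stay fixed, so only a small weight vector has to be learned for the new task.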

Rupert S

Reference: https://drive.google.com/file/d/1WmhMcCZZjDI4pKnQsccvaf4RdquhPPs8/
Reference française: https://drive.google.com/file/d/1WiFUEOE23D4UTQRN7MP6Z4Lh24PJxuFG/

PCIe Acceleration modelling for Medical grade 3rd World #FirstClass : Question is, are you McGuiver? I Am ;D #DoctorLove

Your standard medical console is most probably using standard Python acceleration (an older version),
With an accelerator, a cancer screening could most likely shave 30 seconds from your diagnostic timeline..

If you have one of the following available:

Hailo-8, PCIe, M.2, M.2 in a PCIe Card such as a compatible Wifi M.2 E-key or A+E Key...

Question is, are you mcGuiver? I Am ;D

Hailo3/5 with phiza & the like who 'donated it'

CORALS on sale 4TOPS, 8TOPS, Your choice, What you need for HPC 09:33 06/03/2024 : RS

https://science.n-helix.com/2022/10/ml.html

ML tensor + ONNX Learner libraries & files
Model examples in models folder

https://is.gd/DictionarySortJS
https://is.gd/UpscaleWinDL
https://is.gd/HPC_HIP_CUDA

https://is.gd/UpscalerUSB_ROM

https://is.gd/OpenStreamingCodecs

The perfect Proposal RS

*

FPGA BitFile & Code Opt (c)RS 2021-01

https://science.n-helix.com/2022/10/ml.html
https://science.n-helix.com/2022/08/jit-dongle.html
https://is.gd/LEDSource

In my view heuristics in compilers are a choice for those who do not wish to include direct ML compiled into their code,
This is understandable in terms of terminator & cylons & indeed flawed beings or even good ones with depression!

However the application of branch optimisation is a sample code optimisation that can 'Plug In' to branch caching on the CPU & GPU.

Heuristics are not just code in the compiler; They are also micro code selecting a probable branch; Although code that forces a branch can be flawed..

Both heuristics, Branch probability selection & ML can run in parts of the code to select probable path!

Yes, fundamentally any code that modifies behaviour is a potential source of unsound results; 'Fortran code is rock solid' & Rust is also supposed to be solid.

Including soundly made heuristic code & branch probability code ML in your inline routines; 'Very much interpretive master jedi'; But it can be done!

Question is How big? & how fixed?

25KB per 3MB on average?

ML & Heuristics like my application FPGA BitFile & Code Opt (c)RS 2021-01

can be applied at runtime & remain only for selecting the fastest path or the best; In terms of which Processor function to run code for.
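
A minimal sketch of runtime selection of the fastest path, assuming two interchangeable routines. The function names here are hypothetical & simple timing stands in for whatever heuristic or ML selector is actually used.

```python
import time

def sum_loop(values):
    total = 0
    for v in values:
        total += v
    return total

def sum_builtin(values):
    return sum(values)

def pick_fastest(candidates, sample, repeats=5):
    """Time each candidate on a sample input and return the quickest one."""
    best_fn, best_time = None, float("inf")
    for fn in candidates:
        start = time.perf_counter()
        for _ in range(repeats):
            fn(sample)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_fn, best_time = fn, elapsed
    return best_fn

# The selection runs once; afterwards only the chosen path is called.
chosen = pick_fastest([sum_loop, sum_builtin], list(range(10_000)))
print(chosen.__name__, chosen(list(range(10))))
```

The selector itself stays small & fixed; it only decides which routine to run & does not change what the routine computes.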

(c)Rupert S

TOPCloud Scaled Flexible WebASM & WebGPU & MathML!

Quite flexible for use on Monitors & TV's; Light processor load on simple tasks & offloadable such as TOPCloud!

You may be thinking Offloading is impracticable because that requires one of two things:

JIT Compiler Dongle..
USB device such as Firestick or GPU & CPU (With OpenCL Compat)

Server! so internet & service provision!
Impossible? No; WebAdvert supported TV's need both!
So why not HPC TOPCloud? could make a HOT TV a lot cooler & Eco friendly with Server repeating tasks:

Scaling
Quality Service
Service availability

TOPCloud Offload Logic:

In terms of WebASM & WebGPU & MathML; TOPCloud provides sufficient advantages to be considered a core utility..

While Offloading repeating content such as Siteload core stack (Server) & Localising configuration such as Webpage size & DPI & Dynamic font arrangements that require thought.

In terms of Offloaded function & Efficient system load for large configurations..

Especially efficient configurations such as TPU, Coral, GPU work & Cloud CPU that have large optimised stacks & installed drivers.

RS

#Sound Strategy game TOPCloud (c)RS

PCM & MP4 are 2D/3D Image so GPU Helps there also with 3D Audio mapping!
Games do not require cloud processing of images & a lot of local strategies are procedural Heuristic

You see RDP has GPU Connect (my innovation i might add) So Bluetooth & Wifi can connect RTP GPU; The port specifics are not particularly important; However a device such as music streamer can have ML TOP's available locally & from the cloud,

Due to how the TOPCloud strategy works with localised ML TOPS; Not all data has to be sent or received.. For example all Audio 3D Profiles for HQ Room audio can be done within a few MB of data; With some hard work? 150Kb of data & so in reach of phones & mobile!

Gaming is an example here. I give Tic-Tac-Toe as the example, where all that a device like Alexa or a Google smart device has to think is: Which square? but..

No physical picture needs to be sent for the game to be played & if required a small Tic-Tac-Toe strategy ML can run locally for a quicker response!
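
To show how little data the game needs, here is a small sketch of a local strategy that returns only a square index. The rule-based choose_square is a hypothetical stand-in for a small local ML model & the board encoding is an assumption.

```python
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def choose_square(board, me, opponent):
    """board is a 9-character string of 'X', 'O' or '-'; returns index 0..8."""
    def completing_move(player):
        for line in WIN_LINES:
            marks = [board[i] for i in line]
            if marks.count(player) == 2 and marks.count("-") == 1:
                return line[marks.index("-")]
        return None

    for player in (me, opponent):      # win first, otherwise block
        move = completing_move(player)
        if move is not None:
            return move
    for i in (4, 0, 2, 6, 8, 1, 3, 5, 7):  # centre, corners, then edges
        if board[i] == "-":
            return i
    raise ValueError("board is full")

move = choose_square("OO-X-----", "X", "O")
payload = bytes([move])  # the whole reply fits in a single byte
print(move, len(payload))
```

Only the 9-character board & a one-byte answer ever need to cross the link; no image is involved.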

You see with a low latency GPU RTP & GPU RDP connection to cloud GPU; Most localised thinking TOPS can be carried out in Seconds if not milliseconds & PCM & MP4 are 2D/3D Image so GPU Helps there also with 3D Audio mapping!

Rupert S

*

Core features of TOPCloud:

RTP ML TOPS are a processors friend

3D audio mapping & spatialization for realistic sound effects
3D Vector Support for various audio formats such as PCM, MP4, OGG, and WAV

Low latency & high bandwidth connection to cloud GPU servers via RDP

Procedural & heuristic algorithms for generating game scenarios & strategies & 3D Audio & Visuals
Localized & cloud-based machine learning models for optimizing game performance & user experience

RTP GPU Connect technology that allows users to access GPU resources from any device with Bluetooth or WiFi

TOPCloud is a revolutionary 'TOPS' way to enjoy & create audio games using your own music & the power of the cloud. Try it today & discover a new dimension of gaming!

*

Scaling; We can classify by colour or creativity. (c)RS

If you use TOPCloud, you can share between different displays in the TOP's Sense..
but mostly you would need cloud presence,

Mostly this would be about making the most out of TOP heavy Business GPU & personal ones in your computer or consoles.

But sharing common tasks such as scaling movies by type or by identifying a single movie to upscale...

Now you might be asking what we would be doing there?
Well a single movie uses the same materials in our ML; We can analyse the class & optimise the scaling by class..

For those familiar with games & FSR; We familiarise our code with a single game!
By doing this we improve our product and can therefore classify by:

Resolution
Style
Speed
Type, FPS for example & RTS

We can classify by colour or creativity...

We do not simply have to roll the dice on General Scaling, We can use classifiers:

Title
Scale
Type
Speed
Frame Rate
Colour & Composure
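
The classifiers above could feed a scaler roughly like this. This is a sketch only: the ContentClass fields mirror the list, while the profile keys & values are hypothetical & not taken from any real upscaler.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContentClass:
    title: str
    scale: float       # target upscale factor
    kind: str          # e.g. "FPS" or "RTS"
    speed: str         # "fast" or "slow" motion
    frame_rate: int
    colour: str        # e.g. "vivid" or "muted"

def pick_profile(c):
    """Return hypothetical scaler settings for a classified title."""
    profile = {"sharpen": 0.5, "temporal": True, "scale": c.scale}
    if c.speed == "fast" or c.kind == "FPS":
        profile["temporal"] = False   # assumed: favour per-frame response
    if c.colour == "muted":
        profile["sharpen"] = 0.3      # assumed: avoid amplifying noise
    return profile

example = ContentClass("Example Game", 2.0, "FPS", "fast", 60, "vivid")
print(pick_profile(example))
```

Classifying once per title means the profile can be cached & shared, rather than rolling the dice on general scaling for every frame.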

Rupert S

PoCL Source & Code
https://is.gd/LEDSource

*

We all think our own way; Potential is always there on a Runtime Library - Multiple Solve Table

Machine learning | Equate ~= Multi Layer Wavelet Abstraction
https://science.n-helix.com/2022/09/ovccans.html

https://www.youtube.com/watch?v=-9lCpfrOQQ4

(c)Rupert S 2022-10

https://is.gd/LEDSource
https://is.gd/BTSource

https://science.n-helix.com/2023/06/tops.html

https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
https://science.n-helix.com/2022/08/jit-dongle.html
https://science.n-helix.com/2022/06/jit-compiler.html

https://is.gd/MLCodecShaping

*

This one will suit Dedicated ARM Machine in body armour 'mental state' ARM Router & TV
(ARM Learning 4K ROM; Safe Larger USB ROM) https://bit.ly/3Afn1Y4

https://drive.google.com/file/d/102pycYOFpkD1Vqj_N910vennxxIzFh_f/view?usp=sharing

Android & Linux ARM Processor configurations; routers & TV's upgrade files, Update & improve
https://drive.google.com/file/d/1JV7PaTPUmikzqgMIfNRXr4UkF2X9iZoq/

Provenance: https://www.virustotal.com/gui/file/0c999ccda99be1c9535ad72c38dc1947d014966e699d7a259c67f4df56ec4b92/

https://www.virustotal.com/gui/file/ff97d7da6a89d39f7c6c3711e0271f282127c75174977439a33d44a03d4d6c8e/

Python Deep Learning: configurations

AndroLinuxML : https://drive.google.com/file/d/1N92h-nHnzO5Vfq1rcJhkF952aZ1PPZGB/view?usp=sharing

Linux : https://drive.google.com/file/d/1u64mj6vqWwq3hLfgt0rHis1Bvdx_o3vL/view?usp=sharing

Windows : https://drive.google.com/file/d/1dVJHPx9kdXxCg5272fPvnpgY8UtIq57p/view?usp=sharing

*Windows {
To Compress using CPU/GPU: MS-OpenCL
https://is.gd/MS_OpenCL
https://is.gd/OpenCL4X64
https://is.gd/OpenCL4ARM

Genuinely good JS + Python & configuration work, Windows, Linux, ARM

ML tensor + ONNX Learner libraries & files
Model examples in models folder

https://is.gd/DictionarySortJS
https://is.gd/UpscaleWinDL
https://is.gd/HPC_HIP_CUDA

https://is.gd/UpscalerUSB_ROM

https://is.gd/OpenStreamingCodecs

https://is.gd/AMDPro2024PolarisCombined

The perfect Proposal RS

https://www.amd.com/en/developer/rocm-hub/hip-sdk.html#tabs-ddafbba141-item-c6b9ce2aab-tab
https://rocm.docs.amd.com/en/docs-5.5.1/deploy/windows/quick_start.html

X86Features-Emu
https://drive.google.com/file/d/15vXBPLaU9W4ul7lmHZsw1dwVPe3lo-jK/view?usp=usp=sharing

}

Machine Learning SDK's,
You may not have a Machine Learning SDK to accelerate your GPU/CPU/Device

3 main ones, but Python does not guarantee an accelerator!
Obviously Python Builds with Accelerators work!

HW Build Source : Upscale DL
https://github.com/GPUOpen-LibrariesAndSDKs/RadeonML
https://github.com/GPUOpen-LibrariesAndSDKs/RadeonImageFilter

PoCL Source & Code
https://is.gd/LEDSource

https://github.com/ssube/diffusers/tree/feature/onnx-upscale

https://github.com/huggingface/diffusers
https://huggingface.co/ssube/stable-diffusion-x4-upscaler-onnx

https://huggingface.co/uwg/upscaler/tree/main
https://huggingface.co/nvmmonkey/optimal_upscale/tree/main
https://huggingface.co/gmp-dev/gmp-upscaler/tree/main/ESRGAN

Neural Engine
https://github.com/godly-devotion/MochiDiffusion

ML List & Services
https://huggingface.co/models?sort=downloads&search=upscale
https://huggingface.co/models
https://huggingface.co/pricing

Tokma ML

Batch Size 240W>65W, 32GB{64, 16}, 15W>5W, 4gb{16, 1} : 16, 8, 4 seems optimal,
Time taken compatible:

ML_With_USB_Stress-Testing_USB_Accelerators_for_Efficient_Edge
https://drive.google.com/file/d/1s2DORhFyvg0jT7AMhtTPdyPk0Aimdemi/view?usp=drive_link
https://github.com/raphischer/edge-acc

Python & JS Configurations
https://is.gd/DictionarySortJS

https://iopscience.iop.org/article/10.1088/1741-4326/ad142f

https://is.gd/TokmaML

Training Networks

https://science.n-helix.com/2023/06/tops.html
https://science.n-helix.com/2023/06/map.html
https://science.n-helix.com/2022/08/jit-dongle.html
https://science.n-helix.com/2022/06/jit-compiler.html

https://science.n-helix.com/2023/02/pm-qos.html
https://science.n-helix.com/2023/06/ptp.html

ML_With_USB_Stress-Testing_USB_Accelerators_for_Efficient_Edge
https://www.researchgate.net/publication/377174200_Stress-Testing_USB_Accelerators_for_Efficient_Edge_Inference
https://github.com/raphischer/edge-acc

Coral TPU Micro Edge Learning with performance arranged Intel FPGA Arria 10 SX SoC Kit & Google Coral, NVIDIA Jetson Nano and CPU ROCK 4C Plus
https://doi.org/10.3390/s24030899

With both USB Devices being 8Bit INT, I would imagine all of the models would run on both in 8Bit INT
https://coral.ai/docs/edgetpu/models-intro/#transfer-learning-on-device
https://www.intel.com/content/www/us/en/developer/articles/technical/movidius-accelerator-on-edge-software-hub.html
https://www.intel.com/content/www/us/en/support/articles/000033354/boards-and-kits/neural-compute-sticks.html

39$ 2x Edge TPU, Prefer 6x or 8x M.2 & PCI 16x & 32x with 4GB+ RAM
https://coral.ai/products/m2-accelerator-dual-edgetpu/

https://coral.ai/products/

You may be aware I promote this product the Coral Edge TPU & the Movidius USB by Intel,
You may not be quite aware of how well they accelerate! :D

RS

EdgeTPU Coral.AI Video GIMP Photoshop

https://is.gd/EdgeTPU_4K_GimpEdit
https://is.gd/GimpEdgeTPU
https://youtu.be/Kcxwdp3gyOY

EdgeTPU Coral.AI GIMP Photoshop Image
https://drive.google.com/file/d/1Q6B_LLiVJLvIri7UqZt7HmEt0GZJr0yJ/view?usp=sharing

Gimp Speed Figures,

OpenCL per Selective Gaussian Blend
24GB RAM 8Core

CPU 1.3m
RX200 60s
Movidius 10s
+Coral Offloads Int32 & processes; Processing INT8,
In that way the main CPU is the handler of most complex non inference tasks..
In many networks F32 & Int32 would be used to represent computation tasks & can be sieved optimally.

With 8MB of what is essentially RAM Writeback Cache; Input-Output through the USB or M.2/PCIe,
Loading tasks through the IO buffer; Into & out of the work buffer; The average flow cache would be around 256KB..
The Machine Learning model itself is between 32KB & around 7MB.

The Processor itself has multiple threads & IO/DMA Processes to directly inference or solve programming.

Ideally Compressed RAM, with Rsrt ADD+ & MUL* & PACK & Min Mean & Max,
We can perform flexible basic maths,

Flexible compression by copy & replication
Compression consisting of MUL expansions or fractions, MUL ADD & roll Example n+n*y = , n+((n+10)*y),
Compression formula consisting of algebra operations to unroll or roll & gradients Min=m to Max=y

Examples of formula expansion compression

replication n+((n+10)*y)

Formula expansion n+(y*z) = , (n+y)*z =

gradients Min=m to Max=y , A++B till (N*t)=C then Min A to Max C | Median = D | D++t
(the above for example: ((n+y)*z = )Rsrt = )

Compare values A B | C , Compare A B = L(Larger); (L - {A, B})=C; C=Difference

Load Image {A, B}; Shape {A, B}; Compare A B = C; transpose C to S = Surface
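
The compare, transpose & gradient formulas above can be sketched in plain Python as follows. This is one interpretation of the notation, not Edge TPU code; images are assumed to be small lists of rows.

```python
def difference(a, b):
    """Compare A B = L (larger); L minus the smaller value = C, the difference."""
    larger, smaller = max(a, b), min(a, b)
    return larger - smaller

def difference_surface(image_a, image_b):
    """Per-pixel difference of two equally shaped images (lists of rows)."""
    if len(image_a) != len(image_b) or any(
            len(ra) != len(rb) for ra, rb in zip(image_a, image_b)):
        raise ValueError("shapes differ")
    return [[difference(pa, pb) for pa, pb in zip(ra, rb)]
            for ra, rb in zip(image_a, image_b)]

def transpose(surface):
    """Swap rows and columns of the difference map."""
    return [list(col) for col in zip(*surface)]

def gradient(minimum, maximum, steps):
    """Rebuild a run of values from only Min, Max and a step count."""
    if steps < 2:
        return [minimum]
    step = (maximum - minimum) / (steps - 1)
    return [minimum + i * step for i in range(steps)]

c = difference_surface([[1, 5], [9, 2]], [[4, 5], [3, 8]])
print(c, transpose(c), gradient(0, 8, 5))
```

The gradient case is the compression idea in miniature: three numbers (Min, Max, count) expand back into the full run of values.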

8MB Work buffer
256KB IO Buffer (Fast Frame Buffer); The effective memory used by images or audio may reach 2MB.

The perfect Proposal RS

3D DR_LC : 3D Layers to Direct Render Layer Composing : OS, DSC, Codecs, DirectX & Vulkan

https://science.n-helix.com/2022/04/vecsr.html

https://science.n-helix.com/2022/10/ml.html

Here are the Operation Processor Extensions available to EdgeTPU:
https://coral.ai/docs/edgetpu/models-intro/

The sample examples show what a powerful specialised instruction set can do!
To explain more; The EdgeTPU is a Matrix multiplier & Adder..

The instructions such as transpose allow for example mapping one image on another for difference detection...
Flexible uses for each function can literally be based on the basic concept of the instruction,

Basic assumptions lead to convoluted & complex examples..

Examples that are required to do such things as check one bitmap for identical content (in effect XOR)

Quantize : image pixels..

Max Min Mean : Dithering or gaussian blends (complex XOR & layering or edge feathering) & more!

StridedSlice & Slice : partition a frame into parts to render in a grid; slice CSS isolated content in rendering.

SpaceToDepth : Dynamically allocate depth layers to single frame content such as text boxes or photos..
So why? So we can dither edges & fonts & minimise RAM usage for dynamic content.

We AveragePool2d to gaussian blend the layers together; In principle we average weight the layers to a final,
Single layer; DSC VESA

Alternative is to Paint major content on a single layer involving the CSS backdrop..
Moving content on top of it on a secondary layer; Makes sense to me! speed wise,

We could use an average pool with Weights (+10/30/100 to -10/30/100) & Feather & Gaussian blend down if we like!

MaxPool2d define layer amount.
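
As a plain-Python stand-in for the pooling & blending described above (not the Edge TPU operators themselves; the layer sizes & weights are illustrative assumptions):

```python
def pool2d(frame, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size window."""
    rows, cols = len(frame), len(frame[0])
    out = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            window = [frame[r + i][c + j]
                      for i in range(size) for j in range(size)]
            out_row.append(reduce_fn(window))
        out.append(out_row)
    return out

def average_pool2d(frame, size=2):
    return pool2d(frame, size, lambda w: sum(w) / len(w))

def max_pool2d(frame, size=2):
    return pool2d(frame, size, max)

def blend_layers(layers, weights):
    """Weighted average of equally sized layers into one final layer."""
    total = sum(weights)
    rows, cols = len(layers[0]), len(layers[0][0])
    return [[sum(w * layer[r][c] for layer, w in zip(layers, weights)) / total
             for c in range(cols)] for r in range(rows)]

frame = [[1, 3, 5, 7],
         [1, 3, 5, 7],
         [2, 2, 8, 8],
         [4, 4, 0, 0]]
print(average_pool2d(frame), max_pool2d(frame))
print(blend_layers([[[0, 10]], [[10, 30]]], [1, 3]))
```

Average pooling softens each window (the blend), max pooling keeps the strongest value & the weighted blend flattens several layers to a single one.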

ResizeBilinear, ResizeNearestNeighbor : Resize textures for appropriate size of mouse pointers & cursors & content pictures or video..

DepthwiseConv2d : we can down convert layers to 2D, in principle in chrome we convert layered textures during final frame generation to a single flattened layer..

Transpose : layers folded into a single frame render fast! bear in mind that we have to HOLD THE LAYERS in a single fetch!
Buffer optimization to ram size required.. Memory optimization is crucial to hold all layers in a single fetch.

Alternatively combine layers with Transpose & DepthwiseConv2d combined.

In a genuine way, layering mouse pointers on top of DSC frames makes a lot of sense in terms of response & compression..

You have to think in terms of knowing what is under that mouse pointer; there are two frames of reference to this:

Deliberated previous frame: forward predict, with an icon buffer to load over it (a small texture)

Layered responses: in layered responses the Processor processes a JavaScript & CSS layer in the form of the operating system & vectors..

The motion pointer or animation travels over the top in a secondary layered response,
The formation of layers lowers processing costs & speeds up UI response timers; lowering overall compression costs because the first layer is fully converted into an almost static prediction response &or reactionary differentiator system.

(c)RS

Coral.AI EdgeTPU Optimisations to Memory Model ML : RS

TensorFlow Lite 2.6 VS 2.16.1 Because the Coral.AI product & most distributions are TFLite 2.6

Making WAF ML models go brrr: saving decades of processing time (cloudflare.com)
https://blog.cloudflare.com/making-waf-ai-models-go-brr

Multiply Matrix Tables

As observed AVX2, AVX & AVX512 optimisations enabled for CPU Type EPYC & Intel Server optimises performance,
The ARM products could use SVE & SiMD for speedup!

& Also noted the loop unrolls,
8x8 Matrix loads
Prefetching optimisations
Pre-Computed index & tables
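
The blocking idea behind those optimisations can be sketched in plain Python. This is illustrative only: real speedups come from SiMD & cache behaviour, which Python does not show. The tile size of 8 mirrors the 8x8 loads mentioned & the pre-transposed matrix plays the role of a pre-computed table.

```python
def matmul_tiled(a, b, tile=8):
    """Multiply matrices a (n x k) and b (k x m) one tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    # Pre-computed table: columns of b stored as rows for sequential reads.
    bt = [list(col) for col in zip(*b)]
    out = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    row = a[i][k0:k0 + tile]
                    for j in range(j0, min(j0 + tile, m)):
                        col = bt[j][k0:k0 + tile]
                        out[i][j] += sum(x * y for x, y in zip(row, col))
    return out

a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_tiled(a, b, tile=2))
```

Each tile is small enough to stay in fast memory while it is reused, which is the same reason the unrolled 8x8 loads help on real hardware.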

Coral.ai Edge TPU has those maths units:

References:

Here are the Operation Processor Extensions available to EdgeTPU:
https://coral.ai/docs/edgetpu/models-intro/

https://science.n-helix.com/2022/10/ml.html

https://science.n-helix.com/2022/04/vecsr.html

https://blog.cloudflare.com/making-waf-ai-models-go-brr

Rupert S

TOPS Conversion Table:

8000G 16TOPS NPU + SiMD 13TOPS total 39TOPS,
Standard FX 8TOPS to 13TOPS All SiMD used!
EdgeTPU*2 8TOPS + CPU SiMD &or NPU..

SiMD F32, F16, Int32, Int16, Combined 8Bit parallel ops..

NPU F16, Int16, Int8, Int4,
TPU U8 & Int8,

Perfection,

Conversion recommendation work for NPU & SiMD:

Int32/64 CPU + 8Bit Inference TPU
F32 conversion by removal of remainder Xor XMM & YMM XXX to Integer Inference; TPU 8Bit or NPU..
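
A minimal sketch of converting F32 values to 8Bit integers for inference, assuming the affine scheme used by TensorFlow Lite, real = scale * (q - zero_point). The value range here is made up; rounding is where the remainder is dropped.

```python
def quant_params(min_val, max_val, qmin=-128, qmax=127):
    """Scale and zero point mapping [min_val, max_val] onto int8."""
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)  # range must hold 0
    scale = (max_val - min_val) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0
    zero_point = int(round(qmin - min_val / scale))
    return scale, max(qmin, min(qmax, zero_point))

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Float to int8: divide by scale, round off the remainder, clamp."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    """int8 back to an approximate float."""
    return (q - zero_point) * scale

scale, zero_point = quant_params(0.0, 2.55)
for x in (0.0, 0.5, 1.0, 2.55):
    q = quantize(x, scale, zero_point)
    print(x, q, dequantize(q, scale, zero_point))
```

The round trip loses at most half a step (scale / 2) for values inside the range, which is the error the TPU or NPU then works with.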

Rupert S

*

The perfect Proposal RS {

PCIe/M.2 TPU Dual Edge M.2-2230 E-key
https://www.amazon.fr/dp/B09DM31V2T/
https://www.amazon.co.uk/dp/B09DM31V2T/

https://www.amazon.fr/s?k=coral+m.2
https://www.amazon.co.uk/s?k=coral+m.2
https://www.amazon.com/s?k=coral+m.2

https://hailo.ai/products/ai-accelerators/hailo-8-ai-accelerator/#hailo8-overview
Hailo M.2 & mPCIe are both available at 26TOPS for 199$ & the (MultiProcessor) PCI Cards for around 300$ to 800$,
Unfortunately not sourced on Amazon.

8TOPS M.2 https://Coral.ai 38$; Worth a thought.

Maybe
B Key / M Key M.2, M.2 NGFF B Key / M Key M.2 NGF, M 2 Specifications support : 2280/2260/2242/2230
https://www.amazon.fr/dp/B0CT3FXQM8/
https://www.amazon.co.uk/dp/B0CT3FXQM8/

https://www.amazon.de/dp/B0CT3FXQM8/

ahum no

SSD M2 key B/ key B+M/ key M)
https://www.amazon.fr/dp/B0CBWWD144/
https://www.amazon.co.uk/dp/B0CBWWD144/

https://www.amazon.de/dp/B0CBWWD144/

GLOTRENDS M.2 M Key to E Key WiFi Adapter for M.2 WiFi Module
+ https://www.amazon.fr/dp/B09ZS1FHCG

https://www.amazon.co.uk/dp/B09ZS1FHCG
https://www.amazon.de/dp/B09ZS1FHCG

};

USB
https://www.amazon.com/dp/B07S214S5Y
https://www.amazon.fr/dp/B07S214S5Y
https://www.amazon.co.uk/dp/B07S214S5Y

https://www.amazon.fr/s?k=Dual+Edge+M.2-2230+E-key+to+pci

https://en.wikipedia.org/wiki/M.2

To my knowledge M.2 E is basically PCIe but smaller, So adapter is fairly simple.

HP/Mac/Dell/Acer

The M.2 "E" key sockets are used for Wireless LAN/Bluetooth cards.
These sockets are common with laptop motherboards.
They are also found on some desktop motherboards (mITX, mATX, ATX).
Gigabyte offers mITX boards with this support.

https://www.amazon.fr/dp/B09ZDPP43X/
https://www.amazon.fr/s?k=wifi+M.2-2230+E+to+pcie

*

Analogue ML - Including Additive-Capacitor-'Battery' - Using the IBM analogue in-memory hardware acceleration kit for neural network training and inference - APL Machine Learning
https://pubs.aip.org/aip/aml/article/1/4/041102/2923573/Using-the-IBM-analog-in-memory-hardware

RAM ADDER differential Inference (c)RS :
RAM Table Accumulated addition node network Accumulator with Accumulation comparison Inference

IBM Analog Hardware Acceleration Kit https://github.com/IBM/aihwkit

Matrix Processors - Multi Node SpiNNaker2 A Large-Scale Neuromorphic System
https://arxiv.org/pdf/2401.04491.pdf

PhysX
Isaac Gym - Preview Release
https://developer.nvidia.com/isaac-gym

CALM: Conditional Adversarial Latent Models for Directable Virtual Characters
https://github.com/NVlabs/CALM

ML Strategic Workflow Training & Models - Machine Learning model guide Tensor to ONNX - Fraud Prevention & Statistics - Turning Data into Insight with IBM zOS16
https://www.redbooks.ibm.com/redpieces/pdfs/sg248552.pdf

Evolution-ML CNN Self Trained Auto Sparsity - Hybrid multi-objective evolutionary model compression with convolutional neural networks
https://www.sciencedirect.com/science/article/pii/S2590123024000045
https://blog.research.google/2023/12/advancements-in-machine-learning-for.html

ML Compressed Dynamic16bit-8Bit - Hardware-friendly compression and hardware acceleration for ML Transformer
https://aimspress.com/article/doi/10.3934/era.2022192

AA-DLADMM - GD Gradient Descent - An Accelerated ADMM-based Framework for Training Deep Neural Networks
https://arxiv.org/pdf/2401.03619.pdf

Personality UI : Have a friend

Best mini models by far
AMD Llama 135m https://huggingface.co/amd/AMD-Llama-135m
MS PHI 4K Model https://huggingface.co/microsoft/Phi-3-mini-4k-instruct

Python & JS Install & Runtime
https://is.gd/DictionarySortJS
https://is.gd/UpscaleWinDL

Alpaca Character Generation model
4Bit for speed, But not precise
https://huggingface.co/anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g
Trained 3 Epochs, Higher Precision https://huggingface.co/chavinlo/gpt4-x-alpaca

Base model https://huggingface.co/chavinlo/alpaca-13b
https://github.com/teknium1/GPTeacher

Python WebUI
https://github.com/oobabooga/text-generation-webui
Mac; Mostly MAC but fast
https://github.com/ggerganov/llama.cpp

how to use & personality sets https://discord.com/invite/aitrepreneur-1018992679893340160

On the subject of how deep a personality of 4Bit, 8Bit, 16Bit is reference:
https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
https://science.n-helix.com/2022/10/ml.html

https://is.gd/ML_Opt
https://is.gd/OPC_ML_QuBit
https://is.gd/OPC_ML_Opt
https://is.gd/QuBit_GPU

Machine learning | Equate ~= Multi Layer Wavelet Abstraction

https://science.n-helix.com/2022/09/ovccans.html

https://science.n-helix.com/2023/02/smart-compression.html

https://science.n-helix.com/2021/10/he-aacsbc-overlapping-wave-domains.html

(documents) JIT & OpenCL & Codec : https://is.gd/DisplaySourceCode

Include vector today *important* RS https://vesa.org/vesa-display-compression-codecs/

https://science.n-helix.com/2022/08/jit-dongle.html

https://science.n-helix.com/2022/06/jit-compiler.html

https://science.n-helix.com/2022/04/vecsr.html

https://science.n-helix.com/2016/04/3d-desktop-virtualization.html

https://science.n-helix.com/2019/06/vulkan-stack.html

https://science.n-helix.com/2019/06/kernel.html

https://science.n-helix.com/2022/03/fsr-focal-length.html

https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html

https://science.n-helix.com/2022/08/simd.html

Eclectic & for the codecs of the world! OVCCANS (install and maintain as provided HPC Pack)

https://science.n-helix.com/2018/09/hpc-pack-install-guide.html

Transversal processing availability : Transparent Task Sharing Protocols

https://science.n-helix.com/2022/08/jit-dongle.html

https://science.n-helix.com/2022/06/jit-compiler.html

Machine Learning

https://science.n-helix.com/2022/10/ml.html

https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html

Innate Compression, Decompression

https://science.n-helix.com/2022/03/ice-ssrtp.html

https://science.n-helix.com/2022/09/ovccans.html

https://science.n-helix.com/2023/02/smart-compression.html

https://science.n-helix.com/2022/09/audio-presentation-play.html

https://science.n-helix.com/2021/10/he-aacsbc-overlapping-wave-domains.html

https://science.n-helix.com/2023/03/path-trace.html

Reference

https://is.gd/SVG_DualBlend https://is.gd/MediaSecurity https://is.gd/JIT_RDMA

https://is.gd/PackedBit https://is.gd/BayerDitherPackBitDOT

https://is.gd/QuantizedFRC https://is.gd/BlendModes https://is.gd/TPM_VM_Sec

https://is.gd/IntegerMathsML https://is.gd/ML_Opt https://is.gd/OPC_ML_Opt https://is.gd/OPC_ML_QuBit https://is.gd/QuBit_GPU https://is.gd/NUMA_Thread

*****
Best NPM site in the world https://npm.n-helix.com/bundles/

(Simple Install) Website Cache JS Updated 2021-11 (c)RS https://bit.ly/CacheJS
(Simple Install) Science & Research Node High Performance Computing
Linux & Android https://is.gd/LinuxHPCNode

Presenting JIT for hardware interoperability & function :
https://is.gd/DisplaySourceCode

https://is.gd/BTSource

(Simple Install) Website Server Cache JS Updated 2021-11 (c)RS
https://bit.ly/CacheJSm
(Simple Install) Website Server Cache JS Work Files Zip Updated 2021-11 (c)RS https://bit.ly/AppCacheJSZip
*****

machine learning https://www.amazon.com/dp/B08V134ZFD

*****

Direct ONNX Hardware Accelerated: F16
https://github.com/GPUOpen-LibrariesAndSDKs/RadeonML

Ideal for 4Bit Int4 XBox & Int8 GPU
PULP-NN: accelerating quantized neural networks on parallel ultra-low-power RISC-V processors - Bus-width 8-bit, 4-bit, 2-bit and 1-bit
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6939244/

ML Proof case SVM (Multi-Dimensional-Elliptic,98%) aDaBoost M1(Mac,91%) - COVID-19 Prediction Using Supervised Machine Learning - Irfan_Ali_MEng_2023
https://dspace.library.uvic.ca/bitstream/handle/1828/14676/Irfan_Ali_MEng_2023.pdf?sequence=1&isAllowed=y

Useful operation precision reductions
FPGA Implementation of Keyword Spotting System Using Depthwise Separable Binarized and Ternarized Neural Networks
https://www.mdpi.com/1424-8220/23/12/5701

Useful operation precision reductions; I observe that reducing precision to 1Bit & 2Bit..

While enhancing the definition of a positive/negative dipole & thus enhancing speed..
Further reduces reasoning capacity; That is the trade made in order to reduce Processor bandwidth for reasoning..

In the example of the XBox & PS5; DOT4 & INT4, INT8 & F16 & bF16 considerably reduce the probable error related to a lack of remainder or float value depth!

By reason of probability i assume a value of 4Bit & 2Bit to allow the smallest packing ability; Existing alongside the word reasoned!

To reduce to 1 & 0; I assume a definite statement that a Value Integer Solve in the form of a vector..
Is most probably the solution & that furthermore that in most cases; Projected in pure maths & code ASM,
Both SiMD; Float & Integer...

Reduction to multiple 2 Bit values in short Integer instructions; I will state however that no such value is further away than a statistics table or PHP Data-Set.
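
To make the packing concrete, here is a small sketch of storing four 2Bit values per byte, plus a 1Bit dot product. The bit order & the 1 = +1, 0 = -1 encoding are assumptions for illustration & not taken from the papers above.

```python
def pack_2bit(values):
    """Pack integers in 0..3 into bytes, four per byte, low bits first."""
    packed = bytearray()
    for i in range(0, len(values), 4):
        byte = 0
        for shift, v in enumerate(values[i:i + 4]):
            if not 0 <= v <= 3:
                raise ValueError("value does not fit in 2 bits")
            byte |= v << (2 * shift)
        packed.append(byte)
    return bytes(packed)

def unpack_2bit(packed, count):
    """Recover the first count 2-bit values from packed bytes."""
    out = []
    for byte in packed:
        for shift in range(4):
            out.append((byte >> (2 * shift)) & 0b11)
    return out[:count]

def binary_dot(a_bits, b_bits, n):
    """Dot product of two {-1, +1} vectors held as n-bit integers (1 = +1)."""
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ b_bits) & mask).count("1")  # XNOR then popcount
    return 2 * matches - n

packed = pack_2bit([1, 2, 3, 0, 3])
print(list(packed), unpack_2bit(packed, 5), binary_dot(0b1011, 0b1001, 4))
```

At 1Bit the multiply disappears entirely: the dot product becomes an XNOR & a bit count, which is where the speed comes from.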

Rupert S 2023-06

*****

Gaussian
https://gmd.copernicus.org/articles/16/1697/2023/
https://gmd.copernicus.org/articles/16/1697/2023/gmd-16-1697-2023.pdf

SiMD Gaussian Blending & Dithering - Better_Fixed_Point_Filtering_with_Averaging_Trees
https://andrew.adams.pub/Better_Fixed_Point_Filtering_with_Averaging_Trees.pdf

Vectorization of Kernel and Image Subsampling in FIR Image Filtering
http://bncss.org/index.php/bncss/article/viewFile/101/105

Implementation of a High-Quality Dolby Digital Decoder Using SiMD MMX™ Technology
https://smtnet.com/library/files/upload/dolby-intel.pdf

*****

Common techniques used in ML Learning are edge detection, accent recognition, language processing, and code optimization.

Basic ML Feature list; Also for learning

Edge detection is a process of identifying the boundaries of objects in images or videos.

Accent recognition is a process of identifying the regional or social variation of speech.

Language processing is a process of analyzing and generating natural language texts.

Code optimization is a process of improving the performance or quality of code.
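
As a basic edge detection example, here is a minimal Sobel filter in plain Python. It is a sketch for learning: the image is assumed to be a small greyscale list of rows & a real build would run this as SiMD or on an accelerator.

```python
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1],
           [0, 0, 0],
           [1, 2, 1]]

def sobel_edges(image):
    """Return |gx| + |gy| per pixel; the one-pixel border is left at 0."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = gy = 0
            for i in range(3):
                for j in range(3):
                    p = image[r + i - 1][c + j - 1]
                    gx += SOBEL_X[i][j] * p   # horizontal change
                    gy += SOBEL_Y[i][j] * p   # vertical change
            out[r][c] = abs(gx) + abs(gy)
    return out

# A vertical step from dark (0) to bright (10) between columns 1 and 2.
step_image = [[0, 0, 10, 10] for _ in range(4)]
for row in sobel_edges(step_image):
    print(row)
```

Every output pixel is the same nine multiply & add steps, which is why this maps so directly onto SiMD & matrix hardware.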

https://www.ibm.com/topics/machine-learning
https://en.wikipedia.org/wiki/Edge_detection
https://en.wikipedia.org/wiki/Accent_recognition
https://en.wikipedia.org/wiki/Natural_language_processing
https://en.wikipedia.org/wiki/Code_optimization
https://en.wikipedia.org/wiki/Supervised_learning
https://en.wikipedia.org/wiki/Unsupervised_learning
https://en.wikipedia.org/wiki/Reinforcement_learning
https://www.ibm.com/cloud/learn/machine-learning-ethics

*****

Dynamic ML IRS-RIS 4G,5G Wave Shaping Edge detection with reflection angle calculation - strong wave localising edge (shaping)sharpening(c)RS

By quantifying how waves bounce from reflective surfaces it is possible to shape waves that bounce in a different direction from a mechanical reshaping surface called a RIS..

Reconfigurable intelligent surfaces & Intelligent reflecting surfaces bounce radio waves for wireless networks..
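
The reflection angle calculation can be sketched with the generalised law of reflection, sin(theta_r) - sin(theta_i) = (wavelength / 2*pi) * dPhi/dx. The function below is a hypothetical helper that assumes a flat surface in free space with evenly spaced elements & angles measured from the surface normal; sign conventions vary between references.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def ris_phase_profile(freq_hz, spacing_m, n_elements, incident_deg, reflect_deg):
    """Per-element phase shifts (radians, wrapped to 0..2*pi) that steer a
    plane wave arriving at incident_deg towards reflect_deg."""
    wavelength = SPEED_OF_LIGHT / freq_hz
    step = (2 * math.pi * spacing_m / wavelength) * (
        math.sin(math.radians(reflect_deg)) - math.sin(math.radians(incident_deg)))
    return [(i * step) % (2 * math.pi) for i in range(n_elements)]

freq = 3.5e9                             # an example 5G mid-band frequency
half_wave = SPEED_OF_LIGHT / freq / 2    # half-wavelength element spacing
print(ris_phase_profile(freq, half_wave, 4, 0.0, 30.0))
```

When the requested angle equals the incident angle every phase is zero, which is ordinary mirror reflection; any other target needs a phase ramp across the surface.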

Presenting the example:

Coral TPU Micro Edge Learning with performance arranged Intel FPGA Arria 10 SX SoC Kit & Google Coral, NVIDIA Jetson Nano and CPU ROCK 4C Plus
https://doi.org/10.3390/s24030899

ML_With_USB_Stress-Testing_USB_Accelerators_for_Efficient_Edge
https://www.researchgate.net/publication/377174200_Stress-Testing_USB_Accelerators_for_Efficient_Edge_Inference
https://github.com/raphischer/edge-acc

ESA Space blog - All Rights Reserved RS

Wednesday, October 19, 2022