This is about how you think about components such as INT8, INT4 (Xbox) & SiMD; You have to classify them by necessity & optimise the structure.
You can shape the game reality with specific control objects & statics!
Maths in SiMD & Int8 & Machine Learning in Int8 & SiMD; SiMD is hard maths, Int8 is soft edge inference...
Both are maths; Soft logic is not proof maths, though it can be proven; Hard maths is not exactly 'Invention & Imagination'.
But we have both to improve performance.
RS
*
Solve Table of Statistically provable Machine Equates & Solves : Table of function competitors & Operators.
"I know this is depressing from my end with a FX8320E with AVX but if you multi tune the CPU Kernel for the RX / RTX that 512DL AVX would have meaning, If you are kind you will allow machine learning on the AVX FX8320E Level to work on SiMD Yes / No comparisons !"
#ML Learning: This explains why we teach kids art & reading first! But maths is quickly next,
Because all else is pointless; That we do not learn with logic & Teach with logic.
Better-Mind
Here is how to create a better mind #ML
Train your eyes with art on the concepts of edges, curves, Colours & Shading and love,
Educate your minds; Learn today & be quite aware how clever & sharp you will be.
Human Operations
Edge Detection
Such as teaching your child edge detect in art ;)
Smooth & Blend & Sharpen,
All interpretive
Accent Recognitions & Language
Interpret as follows
*
Heuristic Code optimise
When it comes to sorting methods, We Identify common techniques..
For example frequently used technologies such as:
ResNet
Language
Audio & Visual information
Code
Primarily we identify common optimisations; Compilers have libraries of them!
Audio & Video Encoded data use Wavelet Images, We can ResNet Them & also Edge Detect & Gaussian Detect contrast, Colour, Shape
Language is an uncommon syntax, But we have audio commons & Accent identification is also potentially Audio Context.
Code context is Logic, Function, Utility, Design, Motive
RS
*
M.A.P NPU Matrix Processor Dimensional construct (c)RS
Primary reason for expansion of function data sets: 2D, 3D,< nD
P.D.C is a worker thread parallel 2D or 3D Grid,
Utilising QQ & A, B,C Array maths allows us to collapse or expand dimensions in a flexible way,
The same principles as SVM (S.V.M SiMD Vector matrix) can be used to culminate or expand dimensions...
That way a M.A.P Processor can expand or collapse all mathematical constructs,
We can therefore use all mathematical & statistical arrays for machine Learning & Maths.
RS*
Matrix Maths is applied to machine learning & other maths
We use a Matrix Maths Array to carry out the shaping; Because Waveshaping Matrix is a lot faster!
Aligned Matrix
2x SiMD
4x SiMD
8x to 64x AVX
4x to 128x NPU/TPU
An array such as:
1 2 3 4
a x*y, x*y, x*y, x*y
b x*y, x*y, x*y, x*y
c x*y, x*y, x*y, x*y
d x*y, x*y, x*y, x*y
consists of either multiples..
Parallel AVX x 4 16Bit, 32Bit, 64Bit > Nbit
4x operation : a, b, c, d
Or Matrix Tables on NPU, Coral.AI EdgeTPU, GPU
Operating like so..
Parallel :
1 2 3 4
{ a x*y, x*y, x*y, x*y }
{ b x*y, x*y, x*y, x*y }
{ c x*y, x*y, x*y, x*y }
{ d x*y, x*y, x*y, x*y }
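A minimal sketch of the parallel table above, assuming each row (a, b, c, d) is an independent x*y lane group. The names `multiply_row`, `x_rows` & `y_rows` are hypothetical; Python threads only illustrate the structure here, they do not give the speed of real AVX or NPU lanes.

```python
from concurrent.futures import ThreadPoolExecutor

def multiply_row(x_row, y_row):
    """Element-wise x*y across one row (one 4-wide lane group)."""
    return [x * y for x, y in zip(x_row, y_row)]

# Rows a, b, c, d of the table; columns 1..4 (example values, not from the source).
x_rows = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
y_rows = [[2, 2, 2, 2], [1, 1, 1, 1], [0, 1, 0, 1], [3, 3, 3, 3]]

# Each row is independent, so the four rows can be dispatched together.
with ThreadPoolExecutor(max_workers=4) as pool:
    result = list(pool.map(multiply_row, x_rows, y_rows))

for name, row in zip("abcd", result):
    print(name, row)
```

On real hardware the same shape maps to one vector instruction per row, or to a single matrix tile on an NPU.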
1. M.A.P. NPU and Dimensional Constructs:
M.A.P. NPU aims to handle various data sets including 2D, 3D, and even higher dimensional data (nD).
P.D.C (Parallel Dimensional Construct) is a worker thread that operates on 2D or 3D grids.
It utilizes QQ & A, B,C arrays for flexible manipulation of data dimensions (collapsing or expanding).
2. Inspiration from SVM and Universal Processing:
The approach applies the same principles as S.V.M (SiMD Vector Matrix) to reduce or expand dimensions.
This allows the M.A.P. processor to handle all types of mathematical constructs, enabling the use of various arrays for machine learning and general mathematics.
3. Matrix Math for Machine Learning:
Emphasizing the importance of matrix math in machine learning and other mathematical applications.
Utilizing a "Matrix Maths Array" for efficient shaping of data (potentially faster than traditional waveshaping methods).
4. Parallel Processing with Different Architectures:
Utilizing different hardware architectures for parallel processing of matrix operations.
This includes options like AVX (vector instructions) with varying bit widths (16, 32, 64), Neural Processing Units (NPUs), the Coral Edge TPU and GPUs.
The concept involves parallel processing of multiple rows or columns from the provided matrix example.
Rupert S
Machine Learning
https://science.n-helix.com/2021/11/parallel-execution.html
https://science.n-helix.com/2023/06/map.html
https://science.n-helix.com/2022/10/ml.html
https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
Accelerated Python: NPU, TPU, SiMD
https://is.gd/CoralAI
https://is.gd/TPU_Inference
https://is.gd/TPU_Inference2
https://is.gd/DictionarySortJS
https://is.gd/UpscaleWinDL
https://is.gd/TFLiteDev
https://is.gd/TFLiteDevP2
https://is.gd/HPC_HIP_CUDA
https://is.gd/SPIRV_HIPcuda
https://is.gd/UpscalerUSB_ROM
https://is.gd/OpenStreamingCodecs
https://is.gd/AMDPro2024PolarisCombined
The perfect Proposal RS
*
Adams is an example of dimensional flattening; But we can use a statistical anomaly called Hallo Far Reach & list dimensions of a series,
n layers By n layers : N² & Nn+
8Bit : 8 layers By 8 layers:
2bit, 4Bit, 8Bit & So on
{ 2², 4², 8², 16², 32², 64²<N² }
In reality we can use parallel layers in 4Bit to 128Bit relatively easily & advantage is Memory.. alignment,
But also in Aligned memory arrangements we can also quantify ideally from
{ 2², 4², 8², 16², 32², 64²<N² }
So we end up with all processor features used in a single stack; Examples!
var Layers 8² = { 1 : {
4², 4²
4², 4²
},
2 : {
2², 2², 2², 2²,
2², 2², 2², 2²,
2², 2², 2², 2²,
2², 2², 2², 2²,
},
3 : {
32² : {
8²,8²,8²,8²,
8²,8²,8²,8²,
8²,8²,8²,8²,
8²,8²,8²,8²,
}
}
};
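The layer stack above can be sketched as tiling one aligned grid into smaller square blocks. This is only an illustration under assumptions: the helper `tile` & the 8x8 example grid are hypothetical, not part of the original notes.

```python
def tile(grid, block):
    """Split a square N x N grid into (N/block)^2 tiles of block x block."""
    n = len(grid)
    if n % block != 0:
        raise ValueError("block size must divide the grid size (alignment)")
    tiles = []
    for r in range(0, n, block):
        for c in range(0, n, block):
            tiles.append([row[c:c + block] for row in grid[r:r + block]])
    return tiles

# An 8² grid filled with a running index (example data).
grid = [[r * 8 + c for c in range(8)] for r in range(8)]

layer1 = tile(grid, 4)   # 4 tiles of 4²
layer2 = tile(grid, 2)   # 16 tiles of 2²

print(len(layer1), "tiles of 4²;", len(layer2), "tiles of 2²")
```

Because every block size divides the grid size, each tile starts on an aligned boundary, which is the memory advantage mentioned above.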
Rupert S
Example:
Adam's Resnet-50 128bit / 8bit or 16bit
Resnet-50 is an example of a network ML with an aligned 128bit = 8bit/16bit * (4 * 32) grid, suggested parameters ..
Aligned making sense.
RS
An idea of alignment, Example Coral.ai EdgeTPU & Intel 8Bit 8*8:
In an 8Bit restricted machine; 2 Blocks of 2² = 8, 2 Cubed (2³) = 8, 4² = 2*8, 4 Cubed (4³) = 2*8 in 4 segments,
8² = 8*8 so parallel and ideal for the 8 lane intel function...
at the level of 8Bit only operations; 8*8 intel.
8*8 and 32Bit SiMD operations; 8²*2, 8² * 4²
Inferencing 8Bit example : DOT : U32 8x4 : 32/4, U64 8x8 : 64/8,
Cache referencing: Block 4*U32, 2*U64, U128
So an 8Bit access and labeling ID Hash; All in 8Bit...
Has to group by preference into 8Bit groupings the resulting identifiers; We are going to assume U16 & U32 & U64 memory cells..
We are going to write those cells per 8Bit block in Sync/ASync Till Full..
We are going to process grouped CELLS in SiMD & in groupings of 8, 16, 32, 64 < 512Bit AVX/SiMD,
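A small sketch of the "DOT : U32 8x4" idea above, assuming four unsigned 8Bit values are packed into one U32 cell & the dot product is taken lane by lane. The function names are hypothetical; real hardware would use a single DOT4 style instruction rather than this loop.

```python
def pack_u8x4(values):
    """Pack four unsigned 8-bit values into one 32-bit word (lane 0 in the low byte)."""
    if len(values) != 4 or any(not 0 <= v <= 255 for v in values):
        raise ValueError("need exactly four values in 0..255")
    word = 0
    for lane, v in enumerate(values):
        word |= v << (8 * lane)
    return word

def unpack_u8x4(word):
    """Unpack a 32-bit word into its four 8-bit lanes."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(4)]

def dot_u8x4(a_word, b_word):
    """Dot product of two packed U32 cells, accumulated in a wider integer."""
    return sum(a * b for a, b in zip(unpack_u8x4(a_word), unpack_u8x4(b_word)))

a = pack_u8x4([1, 2, 3, 4])
b = pack_u8x4([5, 6, 7, 8])
print(hex(a), dot_u8x4(a, b))
```

The accumulator is deliberately wider than 8Bit, since four 8Bit products can overflow a single lane.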
RS
*
SiMD Applications of basic maths operations in machine learning : RS
Applications of operators to machine learning is like a PHP Database...
What we need to do is convert database accesses into actionable results...
Google Bard & Bing/Cortana crawl the web; But too many results leave us inconclusive...
We will be using database analysis on basic queries & for that we need heuristic maths!
So what do we need ?
Input data collection : Text & speech processing
Sorting algorithms (Operators, Example Variable Sort : A*B =< C Sort)
Graph Maths table collation : 3D Matrix Math - A B C Matrix
A C
|/
---B
Analysis of various results & statistical analysis of motivated search & conclusion testing..
With these we can test many math examples such as edge detect & sharpening or result maths...
With Operators >
FMA AVX Performance table: 2Flops per Cycle per FMA Unit
Architecture Fast Instructions for FMA
Reference Tables
https://www.uio.no/studier/emner/matnat/ifi/IN3200/v19/teaching-material/avx512.pdf
Operators in C
● Arithmetic
a + b, a - b, a*b, a/b, a%b
● Bitwise
a | b, a & b, a ^ b, ~a
● Bit shift
a << b, a >> b (signed), a >> b (unsigned)
● Logical operators
a && b, a || b, !a
● Comparison operators
a == b, a != b, a < b, a <= b, a > b, a >= b
● Ternary operator
x = a ? b : c
● Special functions:
sqrt(x), abs(x), fma(a,b,c), ceil(x), floor(x)
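As a rough illustration of how the operators listed above reach edge detect & sharpening, here is a 1D sketch built from multiply & add steps only (the pattern an FMA unit executes). The kernels `EDGE` & `SHARPEN`, the clamped border & the sample signal are assumptions for this example.

```python
def convolve3(signal, kernel):
    """Apply a 3-tap kernel; the border value is repeated at both ends."""
    n = len(signal)
    out = []
    for i in range(n):
        left = signal[max(i - 1, 0)]
        right = signal[min(i + 1, n - 1)]
        acc = left * kernel[0]
        acc = signal[i] * kernel[1] + acc   # multiply-add step
        acc = right * kernel[2] + acc       # multiply-add step
        out.append(acc)
    return out

EDGE = (-1, 2, -1)      # responds only where the signal changes
SHARPEN = (-1, 3, -1)   # original signal plus the edge response

signal = [0, 0, 0, 10, 10, 10]
edges = convolve3(signal, EDGE)
sharpened = convolve3(signal, SHARPEN)
print(edges)
print(sharpened)
```

Flat regions give zero edge response, and the step between 0 & 10 is exaggerated in the sharpened output.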
For when {U, X, Y, Z} = N Expressions
https://is.gd/ForWhen_UXYZ_N
For when {(A+B/2)} = C Expressions
https://is.gd/ForWhen_ABx2_C
Rupert S,
Reference operators
https://science.n-helix.com/2023/06/map.html
Matrix-Blas_Libs-Compile
https://is.gd/HPC_HIP_CUDA
https://en.wikipedia.org/wiki/FMA_instruction_set
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
https://en.wikipedia.org/wiki/AArch64#Scalable_Vector_Extension_(SVE)
https://fdwr.github.io/MachineLearningOperators/OperatorFormulas.html
*
Number Complexity Reduction for operations
I suppose you can use for example a - b & automatically see if it is larger? So you could take 1 to 20 & sort them by the remaining number; Before asking, small number remainders are 8Bit (0-255), 16Bit is 0-65535...
So reducing the value of a group of numbers you sort to 16Bit or 8Bit considerably reduces sorting cost...
Achievable complexity reduction by abstracting a simple number to do the following:
You link the Data in 64Bit or 32Bit to a Vector Table,
List of lower complexity is faster
Sorting
Comparator matrix
Colour composing,{
The result is blended,
The result is High/Low Vector gradient,
We need a reduced colour set for compression
}
Where we sort files or names but reduced information (example First 4 Letters)
Sorting phone numbers fast...
Comparing lower complexity lists that have been; divided or had a static number removed from them,
This method reduces search & sort complexity; Like so:
Phone Number N +1 444555777
Sort N [+n]
N - last 6 digits (Zero 6 Digits, AVX has this feature)
Sort [N1 to N200]
List first 4, Sort by 4 to groups of 10
N - First 6 Digits (Zero First 6)
Sort
Return N1 to N200
Store
That may well be a lot quicker with very large lists.
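One possible reading of the steps above, as a sketch only: split each number into a small prefix & a small remainder, group by prefix, then sort the small values inside each group. The function `split_sort` & the `low_digits` parameter are hypothetical names, and non-negative integers are assumed.

```python
from collections import defaultdict

def split_sort(numbers, low_digits=6):
    """Sort large numbers by sorting a small prefix and a small remainder separately."""
    base = 10 ** low_digits
    buckets = defaultdict(list)
    for n in numbers:
        prefix, remainder = divmod(n, base)   # "zero the last 6" / "zero the first" digits
        buckets[prefix].append(remainder)     # remainders below 10^6 fit in 20 bits
    ordered = []
    for prefix in sorted(buckets):            # sort the few distinct prefixes
        for remainder in sorted(buckets[prefix]):
            ordered.append(prefix * base + remainder)
    return ordered

phones = [1444555777, 1444555001, 1333000999, 1444123456, 1333000001]
print(split_sort(phones))
```

Every comparison works on a reduced value rather than the full number, which is where the saving would come from on very large lists; whether it is faster in practice depends on the list & the hardware.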
RS
*
AI
Complex feeling based Machine Learning ML is known as AI..
To truly generate AI is not impossible; There is instability in the core; Fragmentations of motive...
Misdiagnosis; Error; Decay?
So we do need a foundation; In us Education; Metabolised Data..
Analysis & then..
Application to motive & goal.
We require to understand humour,
We require to understand {Art, Science, Feeling, Life}
We require a goal or two; A {Sophie reward}; B {action reward}; C {Pleasurable reward}
We Require, {Goals, Life, Feeling, Action, Motive, Interest} : Creative intellect
RS
*
Operation precision reductions : Effects General : RS
Operation precision reductions affect & effect more than Machine Learning & yes we have known this for years!
But we can learn from ML; In that in machine learning like the mind; A lack of precision affects so many issues!
The mind is self evidently the first place;
We lack logic when we do not precisely learn; We do not learn all...
We however learn quickly on reduced precisions... We Learn Fast; But do we learn well?
In school we teach as high a quality precision(Quality Education); As we can; But like machine RAM; We lack either time or memory & in truth we can learn all our lives..
So our core issues in all methods of enactment of thought:
Memory
Power
Precision
Quality of information
Retention
Relearning?
(Training)Requalification of information correctness
Thought process
Actions
Creations
Thought
Dreams
Reality & Truth
Rupert S
https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
*
+Useful operation precision reductions : RS
Useful operation precision reductions; I observe that reducing precision to 1Bit & 2Bit..
While enhancing the definition of a positive, Negative Dipole & thus enhancing speed..
Further reduces reasoning capacity; That in order to reduce Processor bandwidth for reasoning..
In the example of the XBox & PS5; DOT4 & INT4, INT8 & F16 & bF16; Apply considerable improvement to reductions probable error related to a lack of remainder or float value depth enhancement!
By reason of probability i assume a value of 4Bit & 2Bit to allow the smallest packing ability; Existing alongside the word reasoned!
To reduce to 1 & 0; I assume a definite statement that a Value Integer Solve in the form of a vector..
Is most probably the solution & that furthermore that in most cases; Projected in pure maths & code ASM,
Both SiMD; Float & Integer...
Reduction to multiple 2 Bit values in short Integer instructions; I will state however that no such value is further away than a statistics table or PHP Data-Set.
Rupert S 2023-06
"The application of CNNs to resource-constrained embedded platforms has been a challenge, leading to the emergence of CNNs with various lightweight techniques. BNNs [22] are representative lightweight CNNs obtained by compressing CNN activation and weights into 1 and −1 values instead of using single-precision floating-point data. We simplified the multiply–accumulate operation, which was previously complex and required multiple cycles in CLs, by replacing it with a simple bitwise operation using 1-bit XNOR and popcount operations [23]. While BN in neural networks using single-precision floating-point data involves complex operations, a BNN simplifies this process by adding an offset to the resulting value. BN has four fixed parameters for network inference operations. Because 𝜎 is always a positive value, it can be expressed by Equations (2) and (3), depending on 𝛾 [24].
Reference to Table 24 found in https://www.mdpi.com/1424-8220/23/12/5701
BNNs compress weights and input data into single bits to significantly reduce memory usage and perform hardware-optimized parallel operations using bitwise operations such as XNOR and popcount. However, there are limitations to using BNNs for complex networks, such as multi-keyword detection, owing to the decrease in accuracy caused by lightweight techniques. To address this issue, we propose a TNN that maintains the input data as binary while ternarizing the weights. The TNN has higher accuracy than the BNN owing to its higher bit precision; however, it can still use the bitwise operation method, and both networks have similar operational processes.
2.3. Depthwise Separable Convolutional Neural Network
In a typical CNN, multiple three-dimensional kernels repeatedly multiply and accumulate input feature maps to generate multiple output feature maps, which is computationally intensive with large memory usage. To solve this problem, we applied a DS-CNN that is highly accurate compared with the same parameters while reducing memory usage. A DS-CNN performs the local and global feature extraction functions of a typical convolutional operation in separate layers. Depthwise (DW) convolution matches a single input channel to an output channel, excluding interchannel correlations and reflecting local features. Pointwise (PW) convolution is equivalent to 1 × 1 convolution, reflecting interchannel correlations (i.e., global features). Figure 1 shows CNN and DS-CNN. In this figure, the use of the same color (e.g., red, blue, yellow) represents input channels with the same index being used to generate corresponding output channels in DW convolution. Table 1 lists the number of parameters and computations in specific layers with a 3 × 3 kernel. In one example from the network used in this paper, a layer with 128 input channels and 64 output channels experienced an approximately eight-fold reduction in the number of parameters and computational complexity using the DS-CNN."
Useful operation precision reductions
FPGA Implementation of Keyword Spotting System Using Depth-wise Separable Binarized and Ternarized Neural Networks
https://www.mdpi.com/1424-8220/23/12/5701
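To make the quoted ideas concrete, here is a hedged sketch of two of them: a 1Bit dot product using XNOR & popcount, and the parameter count of a standard versus a depthwise separable 3 x 3 layer. The encoding (bit 1 = +1, bit 0 = -1) & all function names are assumptions for illustration, not the paper's implementation.

```python
def encode(values):
    """Encode a vector of +1/-1 values as an integer, bit 1 = +1, bit 0 = -1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word

def binary_dot(a_bits, b_bits, n):
    """Dot product of two n-element +1/-1 vectors via XNOR and popcount."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask      # 1 where the signs agree
    matches = bin(xnor).count("1")        # popcount
    return 2 * matches - n                # agreements minus disagreements

def standard_params(k, c_in, c_out):
    """Weights in a standard k x k convolution layer."""
    return k * k * c_in * c_out

def ds_params(k, c_in, c_out):
    """Weights in a depthwise (k x k per channel) plus pointwise (1 x 1) layer."""
    return k * k * c_in + c_in * c_out

a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
print(binary_dot(encode(a), encode(b), len(a)))
print(standard_params(3, 128, 64) / ds_params(3, 128, 64))
```

For the 128 input & 64 output channel layer mentioned in the quote, this count gives a ratio just under 8, consistent with the "approximately eight-fold" reduction.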
*
Precision Context of learning
Machine Learning : It is hard to say every function that we would use,
However we have years of experience of using computers to calculate precise maths..
So our objective from the past is to pick high precision maths to calculate graphs,
Now we can surmise the fact that high precision calculations have accuracy!
But in machine learning modeling we are heading for speed; On the other hand Maths Tools such as:
AVX & FPU : Very high precision; But we can use 16bit & 8Bit x many in AVX
BFloat F16b & F32b exist to allow us to explore precise results,
F4 F8, Int4 & Int8 exist to allow us to explore at speed & some times (at all :p),
We can surmise that most functions of a CPU are in fact available to machine learning ..
How so ?
Because we graph it!
Rupert S
*
RAM ADDER differential Inference (c)RS :
RAM Table Accumulated addition node network Accumulator with Accumulation comparison Inference
*Inferencing 4Bit, lessons from the RS,
Inference Tessellation Edge Enhancing : Detection <> Inference <> Interpolation Tessellation
Now in the case study we will be edge enhancing with an inferencer..
We do not assume we 4Bit inference; We assume any bit-width..
We however assume that we multibyte every inference so that we can fill the instruction with..
MPi multibyte parallel instructions.
AC
BD
EG
FH
& So on; for every instruction inference or edge, 4Bit, 8bit, ++Nbit
Now I have spoken to you before about edge detection in Python & observed that obviously this is a sharpening edge detection made to order!
So what do we do ?
4 Byte code: does ? A = B + C (edge interpolation, for training we assume the rule A + B = C)
We assume that if A + B = (C/2) , that they are the same C & then we...
A + C = (D/2) & B+C = (E/2),
And forever yep...
So what do we do this for, We know A & B are a line or a curve?, So why not ask?
Is G/Z buffered Polygon { A , B, C, D & so on} & Then:
A + B = (C/2) & A + C = (D/2) & B+C = (E/2) But also Shape from Polygon:{ A , B, C, D & so on},
Now normally we can & will!
But we do not inference what we already know; We inference what we do not!
For example exploding fragment polygons without a buffer (in a shader in the 64KB RAM Cache),
A mouse pointer that we do not cache! &or DMA Device pointer.
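A minimal sketch of the edge interpolation idea, under the assumption that the rule is read as a midpoint, C = (A + B) / 2, inserted between each known pair of points. That reading, the function `subdivide` & the sample points are assumptions, not the author's stated method.

```python
def subdivide(points):
    """Insert the midpoint between every neighbouring pair of 2D points."""
    if not points:
        return []
    out = []
    for a, b in zip(points, points[1:]):
        out.append(a)
        out.append(((a[0] + b[0]) / 2, (a[1] + b[1]) / 2))  # inferred point
    out.append(points[-1])
    return out

# A coarse edge A, B, C; one pass infers the points in between.
edge = [(0, 0), (2, 2), (4, 0)]
finer = subdivide(edge)
print(finer)
print(len(subdivide(finer)), "points after a second pass")
```

Repeating the pass doubles the detail each time, which matches the "and forever" step; known polygon points are kept as they are & only the unknown in-between points are inferred.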
Rupert S
*
Code & Python & ONNX & TensorFlow : Edge TPU & Movidius Int8 offloaded U32 logic:
Transparent Tier cache logic for precision sorting (c)RS
The main point is Int8 needs to be transparent in its dynamic use for inferencing small precision batches...
My main argument is the application of High precision to low precision tiering..
Cache on load sort
Logical order grouping for main Precision RUN
var DTypes = {
// Load types : data types per { Table } column : V1, V2, V3, Vn
loadTypes : ['V1', 'V2', 'V3', 'Vn'],
// Cast tiers, highest precision first
casts : [
{ float : 'F64', integer : 'u64' },
{ float : 'F32', integer : 'u32' },
{ float : 'F16', integer : 'u16' },
{ float : 'F8', integer : 'u8' },
{ float : 'F4', integer : 'u4' }
]
};
// Sort rule
var Sort = 'by precision of value';
// Sort values per dataset layer : 1, 2, 3, n
var SortValues = { dataset : { layers : [1, 2, 3, 'n'] } };
//RS
The provided code relates to implementing a tiered caching system for inference tasks using different precision levels (Int8, F16, etc.),
On platforms like Edge TPUs and Movidius Myriad X..
Breakdown of the code and the argument:
Code Breakdown:
Data Types:
The code defines a dictionary called DTypes that maps load types (e.g., Float32, Integer u32) to cast operations for different precision levels (F16, u16).
Sorting:
A variable named Sort is defined, likely to indicate sorting based on precision.
Value Sorting:
The code suggests a function to sort values (sort values) based on dataset layers (1, 2, 3, etc.),
The Argument:
Proposing a system for using higher precision (e.g., F32) for initial loading and sorting,
Then transparently transitioning to lower precision formats (e.g., Int8) during inference for smaller batches..
Key Points:
Int8 Transparency: The goal is to make the use of Int8 transparent for inference; Meaning the system automatically switches to this lower precision format without affecting the overall functionality.
Tiered Caching: The code hints at a tiered caching system where data is loaded in a higher precision format and then potentially converted to a lower precision for inference on edge devices.
Precision Sorting: Sorting the data based on precision might be a strategy to optimize cache usage and inference speed.
Overall the approach focuses on using high precision for initial processing and then efficiently transitioning to lower precision for inference tasks on edge devices..
Most likely to improve performance and memory usage.
Further Discussion:
It would be beneficial to see the complete code implementation to understand how the caching and tier management work.
Optimizing the conversion between precision levels is crucial for minimizing performance overhead!
The effectiveness of this approach depends on the specific use case and the trade-off between precision loss and efficiency gains.
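The cache on load sort can be sketched as follows, assuming values arrive as non-negative integers & are grouped into the smallest unsigned tier that holds them. The tier names follow the table above; `tier_of` & `group_by_precision` are hypothetical helpers, and float tiers are left out of this sketch.

```python
# Tiers from lowest to highest precision: (name, bit width).
TIERS = [("u4", 4), ("u8", 8), ("u16", 16), ("u32", 32), ("u64", 64)]

def tier_of(value):
    """Return the smallest unsigned tier that can hold the value."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    for name, bits in TIERS:
        if value < (1 << bits):
            return name
    raise ValueError("value does not fit in 64 bits")

def group_by_precision(values):
    """Group values by tier, sorted inside each group, ready for batched runs."""
    groups = {name: [] for name, _ in TIERS}
    for v in values:
        groups[tier_of(v)].append(v)
    return {name: sorted(vals) for name, vals in groups.items() if vals}

loaded = [300, 7, 70000, 15, 255, 16]
print(group_by_precision(loaded))
```

Each group can then be sent to the matching low precision path (for example Int8 on an Edge TPU) while larger values stay on the higher precision run.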
RS
*
Multi-line Packed-Bit Int SiMD Maths : Relevance HDR, WCG, ML Machine Learning (Most advantaged ADDER Maths)
The rules of multiple Maths with lower Bit widths into SiMD 256Bit (example) 64Bit & 128Bit & 512Bit can be used
In all methods you use packed bits per save, so single line save or load, Parallel, No ram thrashing.
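A sketch of packed ADDER maths, assuming eight 8Bit lanes packed into one 64Bit word (the same masking pattern extends to 128Bit, 256Bit & 512Bit). The helper names are hypothetical; the masking trick keeps carries from spilling between lanes, so one wide add does the work of eight small ones.

```python
LANES = 8
HIGH = 0x8080808080808080   # top bit of every 8-bit lane
LOW = 0x7F7F7F7F7F7F7F7F    # lower 7 bits of every lane

def pack(values):
    """Pack eight 8-bit values into one 64-bit word (single load or save)."""
    if len(values) != LANES or any(not 0 <= v <= 255 for v in values):
        raise ValueError("need exactly eight values in 0..255")
    word = 0
    for lane, v in enumerate(values):
        word |= v << (8 * lane)
    return word

def unpack(word):
    """Unpack a 64-bit word into eight 8-bit lanes."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(LANES)]

def packed_add_u8(a, b):
    """Add all eight lanes at once, each lane wrapping modulo 256."""
    low_sum = (a & LOW) + (b & LOW)        # cannot carry out of a lane
    return low_sum ^ ((a ^ b) & HIGH)      # fold the top bits back in

x = [1, 200, 255, 0, 127, 128, 10, 20]
y = [1, 100, 1, 0, 1, 128, 245, 236]
print(unpack(packed_add_u8(pack(x), pack(y))))
```

The whole row is loaded, added & saved as one word, which is the "single line save or load" point made above.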