I think that, considering the stated philosophy, there is more room for education on social conduct.
This is about how you think about components such as INT8, INT4(Xbox) & SiMD, You have to classify by necessity & optimise the structure.
You can shape the game reality with specific control objects & statics!
Maths in SiMD & Int8 & Machine Learning in Int8 & SiMD; SiMD is hard maths, Int8 is soft edge inference...
Both are maths; But soft logic is not PROOF maths, though it can be proof; Hard maths is not exactly 'Invention & Imagination'
But we have both to improve performance.
RS
*
Solve Table of Statistically provable Machine Equates & Solves : Table of function competitors & Operators.
"I know this is depressing from my end with a FX8320E with AVX but if you multi tune the CPU Kernel for the RX / RTX that 512DL AVX would have meaning, If you are kind you will allow machine learning on the AVX FX8320E Level to work on SiMD Yes / No comparisons !"
#ML Learning: This explains why we teach kids art & reading first! But maths is quickly next,
Because all else is pointless if we do not learn with logic & teach with logic.
Better-Mind
Here is how to create a better mind #ML
Train your eyes with art on the concepts of edges, curves, Colours & Shading and love,
Educate your minds; Learn today & be quite aware how clever & sharp you will be.
Human Operations
Edge Detection
Such as teaching your child to edge-detect in art ;)
Smooth & Blend & Sharpen,
All interpretive
Accent Recognition & Language
Interpret as follows
*
Heuristic Code optimise
When it comes to sorting methods, We identify common techniques,
For example frequently used technologies such as:
ResNet
Language
Audio & Visual information
Code
Primarily we identify common optimisations; Compilers have libraries of them!
Audio & Video encoded data use Wavelet Images; We can ResNet them & also Edge Detect & Gaussian-detect contrast, Colour & Shape.
Language is an uncommon syntax, But we have audio commons & Accent identification is also potentially Audio Context.
Code context is Logic, Function, Utility, Design, Motive
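As a concrete sketch of the edge-detect commons above: a minimal 3x3 Sobel filter in plain C. The 8-bit greyscale, row-major image layout is an assumption for illustration, & the scalar loop is the form a compiler can auto-vectorise to SiMD.

#include <stdint.h>
#include <stdlib.h>

/* Minimal 3x3 Sobel edge detect; src & dst are w x h greyscale images. */
void sobel_edges(const uint8_t *src, uint8_t *dst, int w, int h)
{
    for (int y = 1; y < h - 1; y++) {
        for (int x = 1; x < w - 1; x++) {
            /* Horizontal & vertical gradients from the 3x3 neighbourhood */
            int gx = -src[(y-1)*w + (x-1)] + src[(y-1)*w + (x+1)]
                     - 2*src[y*w + (x-1)]  + 2*src[y*w + (x+1)]
                     - src[(y+1)*w + (x-1)] + src[(y+1)*w + (x+1)];
            int gy = -src[(y-1)*w + (x-1)] - 2*src[(y-1)*w + x] - src[(y-1)*w + (x+1)]
                     + src[(y+1)*w + (x-1)] + 2*src[(y+1)*w + x] + src[(y+1)*w + (x+1)];
            int mag = abs(gx) + abs(gy);              /* cheap |gradient| approximation */
            dst[y*w + x] = (uint8_t)(mag > 255 ? 255 : mag);
        }
    }
}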
RS
*
SiMD Applications of basic maths operations in machine learning : RS
Applications of operators to machine learning are like a PHP Database...
What we need to do is convert database accesses into actionable results...
Google Bard & Bing/Cortana crawl the web; But too many results leave us inconclusive...
We will be using database analysis on basic queries & for that we need heuristic maths!
So what do we need ?
Input data collection : Text & speech processing
Sorting algorithms (Operators, Example Variable Sort : A*B <= C Sort; see the sketch below)
Graph Maths table collation : 3D Matrix Math - A B C Matrix
A C
|/
---B
Analysis of various results & statistical analysis of motivated search & conclusion testing..
With these we can test many math examples such as edge detect & sharpening or result maths...
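A minimal sketch of one reading of the 'A*B <= C' variable sort, in C; the record layout & threshold value are hypothetical. The point is that sorting by the product A*B makes the C threshold a single split point in the sorted list.

#include <stdio.h>
#include <stdlib.h>

typedef struct { double a, b; } Rec;   /* hypothetical record */

static int cmp_ab(const void *p, const void *q)
{
    double x = ((const Rec *)p)->a * ((const Rec *)p)->b;
    double y = ((const Rec *)q)->a * ((const Rec *)q)->b;
    return (x > y) - (x < y);          /* branch-light three-way compare */
}

int main(void)
{
    Rec r[4] = { {3, 2}, {1, 9}, {2, 2}, {5, 5} };
    double c = 8.0;                    /* the C threshold */
    qsort(r, 4, sizeof r[0], cmp_ab);
    for (int i = 0; i < 4; i++)        /* all records with A*B <= C now come first */
        printf("%g*%g %s\n", r[i].a, r[i].b, r[i].a * r[i].b <= c ? "<= C" : "> C");
    return 0;
}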
With Operators >
FMA AVX Performance table: 2Flops per Cycle per FMA Unit
Architecture Fast Instructions for FMA
Reference Tables
https://www.uio.no/studier/emner/matnat/ifi/IN3200/v19/teaching-material/avx512.pdf

Operators in C
● Arithmetic
a + b, a – b, a*b, a/b, a%b
● Bitwise
a | b, a & b, a ^ b, ~a
● Bit shift
a << b, a >> b (signed), a >> b (unsigned)
● Logical operators
a && b, a || b, !a
● Comparison operators
a == b, a != b, a < b, a <= b, a > b, a >= b
● Ternary operator
x = a ? b : c
● Special functions:
sqrt(x), abs(x), fma(a,b,c), ceil(x), floor(x)
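A small example of the fma(a,b,c) special function listed above: a fused multiply-add evaluates a*b + c with a single rounding, which FMA-capable AVX hardware executes as one instruction (the 2 Flops per Cycle per FMA Unit in the table). Compile with e.g. gcc -O2 -mfma demo.c -lm.

#include <stdio.h>
#include <math.h>

int main(void)
{
    double a = 1.0 + ldexp(1.0, -52);  /* 1 + one ulp */
    double c = -1.0;
    /* a*a = 1 + 2^-51 + 2^-104; the separate multiply rounds the 2^-104
       term away, the fused multiply-add keeps it. */
    printf("a*a + c      = %.17g\n", a * a + c);
    printf("fma(a, a, c) = %.17g\n", fma(a, a, c));
    return 0;
}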
For when {U, X, Y, Z} = N Expressions: https://is.gd/ForWhen_UXYZ_N
For when {(A+B/2)} = C Expressions: https://is.gd/ForWhen_ABx2_C

Rupert S,
Reference operators
https://science.n-helix.com/2023/06/map.html

Matrix-Blas_Libs-Compile
https://is.gd/HPC_HIP_CUDA
https://en.wikipedia.org/wiki/FMA_instruction_set
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
https://en.wikipedia.org/wiki/AArch64#Scalable_Vector_Extension_(SVE)
*
Number Complexity Reduction for operations
I suppose you can use, for example, a - b & automatically see if it is larger? So you could take 1 to 20 & sort them by remaining number; Before asking: small number remainders are 8Bit 0-255, 16Bit is 0-65535...
So reducing the value of a group of numbers you sort to 16Bit or 8Bit considerably reduces sorting cost...
Achievable complexity reduction by abstracting a simple number to do the following:
You link the Data in 64Bit, 32Bit to a Vector Table,
List of lower complexity is faster
Sorting
Comparator matrix
Colour composing,{
The result is blended,
The result is High/Low Vector gradient,
We need a reduced colour set for compression
}
Where we sort files or names with reduced information (example: first 4 letters),
Sorting phone numbers fast...
Comparing lower-complexity lists that have been divided or had a static number removed from them,
This method reduces search & sort complexity; Like so:
Phone Number N +1 444555777
Sort N [+n]
N - last 6 digits (Zero 6 Digits, AVX has this feature)
Sort [N1 to N200]
List first 4, Sort by 4 to groups of 10
N - First 6 Digits (Zero First 6)
Sort
Return N1 to N200
Store
That may well be a lot quicker with very large lists.
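A minimal C sketch of the reduced-complexity sort above, assuming the split at the last 6 digits: each number is divided into a 'first digits' prefix & a 'last 6 digits' remainder, so every comparison touches small 32Bit halves instead of the full-width number.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct { uint32_t hi, lo; } Phone;   /* prefix : last 6 digits */

static int cmp_phone(const void *p, const void *q)
{
    const Phone *a = p, *b = q;
    if (a->hi != b->hi)                       /* compare reduced prefixes first */
        return (a->hi > b->hi) - (a->hi < b->hi);
    return (a->lo > b->lo) - (a->lo < b->lo); /* then the 6-digit remainder */
}

int main(void)
{
    uint64_t nums[5] = { 1444555777ULL, 1444555123ULL, 1443000001ULL,
                         1444000009ULL, 1443999999ULL };
    Phone p[5];
    for (int i = 0; i < 5; i++) {
        p[i].hi = (uint32_t)(nums[i] / 1000000);  /* number with last 6 digits zeroed */
        p[i].lo = (uint32_t)(nums[i] % 1000000);  /* the last 6 digits */
    }
    qsort(p, 5, sizeof p[0], cmp_phone);
    for (int i = 0; i < 5; i++)
        printf("+%u%06u\n", p[i].hi, p[i].lo);
    return 0;
}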
RS
*
AI
Complex feeling-based Machine Learning (ML) is known as AI..
To truly generate AI is not impossible; There is instability in the core; Fragmentations of motive...
Misdiagnosis; Error; Decay?
So we do need a foundation; In us: Education, Metabolised Data..
Analysis & then..
Application to motive & goal.
We require to understand humour,
We require to understand {Art, Science, Feeling, Life}
We require a goal or two; A {Sophie reward}; B {action reward}; C {Pleasurable reward}
We Require, {Goals, Life, Feeling, Action, Motive, Interest} : Creative intellect
RS
*
Operation precision reductions : Effects General : RS
Operation precision reductions affect & effect more than Machine Learning & yes we have known this for years!
But we can learn from ML; In that in machine learning like the mind; A lack of precision affects so many issues!
The mind is self evidently the first place;
We lack logic when we do not precisely learn; We do not learn all...
We however learn quickly on reduced precisions... We Learn Fast; But do we learn well?
In school we teach as high a quality precision (Quality Education) as we can; But like machine RAM, We lack either time or memory & in truth we can learn all our lives..
So our core issues in all methods of enactment of thought:
Memory
Power
Precision
Quality of information
Retention
Relearning?
(Training)Requalification of information correctness
Thought process
Actions
Creations
Thought
Dreams
Reality & Truth
Rupert S
https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html

*
+Useful operation precision reductions : RS
Useful operation precision reductions; I observe that reducing precision to 1Bit & 2Bit..
While enhancing the definition of a positive/negative dipole & thus enhancing speed..
Further reduces reasoning capacity; So in order to reduce Processor bandwidth for reasoning..
In the example of the XBox & PS5: DOT4 & INT4, INT8 & F16 & bF16 apply considerable improvement to a reduction's probable error, related to a lack of remainder or float value depth enhancement!
By reason of probability I assume a value of 4Bit & 2Bit to allow the smallest packing ability; Existing alongside the word reasoned!
To reduce to 1 & 0; I assume a definite statement that a Value Integer Solve in the form of a vector..
Is most probably the solution; Furthermore, in most cases, projected in pure maths & code ASM,
Both SiMD; Float & Integer...
Reduce to multiple 2Bit values in short Integer instructions; I will state however that no such value is further away than a statistics table or PHP Data-Set.
Rupert S 2023-06
"The application of CNNs to resource-constrained embedded platforms has been a challenge, leading to the emergence of CNNs with various lightweight techniques. BNNs [22] are representative lightweight CNNs obtained by compressing CNN activation and weights into 1 and −1
values instead of using single-precision floating-point data. We simplified the multiply–accumulate operation, which was previously complex and required multiple cycles in CLs, by replacing it with a simple bitwise operation using 1-bit XNOR and popcount operations [23]. While BN in neural networks using single-precision floating-point data involves complex operations, a BNN simplifies this process by adding an offset to the resulting value. BN has four fixed parameters for network inference operations. Because σ is always a positive value, it can be expressed by Equations (2) and (3), depending on γ [24].
Reference to Table 24 found in https://www.mdpi.com/1424-8220/23/12/5701
BNNs compress weights and input data into single bits to significantly reduce memory usage and perform hardware-optimized parallel operations using bitwise operations such as XNOR and popcount. However, there are limitations to using BNNs for complex networks, such as multi-keyword detection, owing to the decrease in accuracy caused by lightweight techniques. To address this issue, we propose a TNN that maintains the input data as binary while ternarizing the weights. The TNN has higher accuracy than the BNN owing to its higher bit precision; however, it can still use the bitwise operation method, and both networks have similar operational processes.
2.3. Depthwise Separable Convolutional Neural Network
In a typical CNN, multiple three-dimensional kernels repeatedly multiply and accumulate input feature maps to generate multiple output feature maps, which is computationally intensive with large memory usage. To solve this problem, we applied a DS-CNN that is highly accurate compared with the same parameters while reducing memory usage. A DS-CNN performs the local and global feature extraction functions of a typical convolutional operation in separate layers. Depthwise (DW) convolution matches a single input channel to an output channel, excluding interchannel correlations and reflecting local features. Pointwise (PW) convolution is equivalent to 1 × 1 convolution, reflecting interchannel correlations (i.e., global features). Figure 1 shows CNN and DS-CNN. In this figure, the use of the same color (e.g., red, blue, yellow) represents input channels with the same index being used to generate corresponding output channels in DW convolution. Table 1 lists the number of parameters and computations in specific layers with a 3 × 3 kernel. In one example from the network used in this paper, a layer with 128 input channels and 64 output channels experienced an approximately eight-fold reduction in the number of parameters and computational complexity using the DS-CNN."
Useful operation precision reductions
FPGA Implementation of Keyword Spotting System Using Depth-wise Separable Binarized and Ternarized Neural Networks
https://www.mdpi.com/1424-8220/23/12/5701
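As the quoted paper describes, a binarised (BNN) dot product collapses to XNOR plus popcount. A minimal C sketch, assuming 64-element {-1,+1} vectors packed one bit per element; __builtin_popcountll is the GCC/Clang popcount intrinsic.

#include <stdint.h>
#include <stdio.h>

/* Dot product of two 64-element {-1,+1} vectors packed as bit masks
   (bit = 1 means +1, bit = 0 means -1). */
int bnn_dot64(uint64_t a, uint64_t w)
{
    int matches = __builtin_popcountll(~(a ^ w));  /* XNOR counts agreements */
    return 2 * matches - 64;                       /* agreements - disagreements */
}

int main(void)
{
    uint64_t act = 0xF0F0F0F0F0F0F0F0ULL;
    printf("%d\n", bnn_dot64(act, act));   /* identical vectors -> +64 */
    printf("%d\n", bnn_dot64(act, ~act));  /* opposite vectors  -> -64 */
    return 0;
}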
*
Main Operation solves: Bit-Depth Conversions & Operations
The storage of multiple bit operations with Sync Read & Write,
The purpose of this is to Read, Write & Store Operations on:
DOT4
INT8, INT16
F16, F32, F64
In RAM of 32Bit, 64Bit, 128Bit
Values Storage Table
32Bit = [16Bit:16Bit]
32Bit = [8Bit:8Bit:8Bit:8Bit]
32Bit = [4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit]
64Bit = [32Bit:32Bit]
64Bit = [16Bit:16Bit:16Bit:16Bit]
64Bit = [8Bit:8Bit:8Bit:8Bit:8Bit:8Bit:8Bit:8Bit]
64Bit = [4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit]
128Bit = [64Bit:64Bit]
128Bit = [32Bit:32Bit:32Bit:32Bit]
128Bit = [16Bit:16Bit:16Bit:16Bit:16Bit:16Bit:16Bit:16Bit]
128Bit = [8Bit:8Bit:8Bit:8Bit:8Bit:8Bit:8Bit:8Bit:8Bit:8Bit:8Bit:8Bit:8Bit:8Bit:8Bit:8Bit]
128Bit = [4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit:4Bit]
Bear in mind that Integer 64Bit is 2 x 32Bit on AMD; So you can compute 2 operations at 32Bit per 64Bit operation,
Some 64Bit units are only 64Bit; So we need to know how many!
32Bit operations are fine! & Conversion of 16Bit value ranges into 32Bit Operations can still be within range of 16Bit Storage..
If we stick within the 16Bit value range on Multiply & ADD,
We can therefore simply post a 16Bit value range data set & expect to be able to Store 16Bit!
The simple method is to store 2 16Bit values in the same 32Bit table; like [16bit:16Bit] = 32Bit
With this we can Load, Store, Run & Save 8bit INT8 operations in 32Bit devices such as Alexa as 8bit x 4 = 32Bit, So we don't Waste RAM or resources!
But we still have access to 32Bit RAM Paging; But with values loaded in 4Bit, 8Bit, 16Bit, 32Bit & so on.
With NANO Android on F16 & F32 & MIPS the same & AMD, Intel, NVidia,
Learning F16 offers considerable value for performance with 16M Values!
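A minimal C sketch of the [16Bit:16Bit] = 32Bit storage rule above; the pack & unpack helper names are illustrative. Two 16Bit values share one 32Bit word, so loads, stores & RAM paging stay on the machine's native width.

#include <stdint.h>
#include <stdio.h>

static uint32_t pack16x2(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;       /* [16Bit:16Bit] = 32Bit */
}

static void unpack16x2(uint32_t w, uint16_t *hi, uint16_t *lo)
{
    *hi = (uint16_t)(w >> 16);
    *lo = (uint16_t)(w & 0xFFFF);
}

int main(void)
{
    uint32_t w = pack16x2(0x1234, 0xABCD);
    uint16_t hi, lo;
    unpack16x2(w, &hi, &lo);
    printf("%08X -> %04X %04X\n", w, hi, lo);   /* 1234ABCD -> 1234 ABCD */
    return 0;
}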
(c)RS
Direct DMA 32Bit & 64Bit RAM : Multiple Sync 16Bit Texture:
A good example of where 8Bit & 16Bit Value load works well is in the case of the texture,
To load 4 x 16Bit into a single 64Bit Cache:
32Bit RAM = 16Bit, 16Bit
64Bit RAM = 16Bit, 16Bit, 16Bit, 16Bit
128Bit RAM = 16Bit, 16Bit, 16Bit, 16Bit, 16Bit, 16Bit, 16Bit, 16Bit
In the case of direct DMA, you would be aware that you have,
128Bit, 192Bit Bus on GPU
32Bit & 64Bit on CPU
So a direct 4 * 32Bit or 2 * 64Bit Cache load is a logically fast method to DMA directly from Cache to GPU!
In short you convert 8 x 16Bit into a 2x 64Bit DMA push; Which is very fast!
You can do the same with batches of vertices in many storage sizes.
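A plain C sketch of the '8 x 16Bit as 2 x 64Bit push' idea, assuming an ordinary cacheable copy; memcpy keeps the 64Bit word access legal C, & a real DMA path is driver-specific.

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Copy 16Bit texels through 64Bit words: 4 x 16Bit per load/store. */
void copy_texels_64(const uint16_t *src, uint16_t *dst, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint64_t w;
        memcpy(&w, src + i, sizeof w);      /* one 64Bit load */
        memcpy(dst + i, &w, sizeof w);      /* one 64Bit store */
    }
    for (; i < n; i++) dst[i] = src[i];     /* tail elements */
}

int main(void)
{
    uint16_t src[8] = {1,2,3,4,5,6,7,8}, dst[8] = {0};
    copy_texels_64(src, dst, 8);            /* 8 x 16Bit = 2 x 64Bit moves */
    printf("%u %u\n", dst[0], dst[7]);
    return 0;
}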
(c)RS
References:
https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html
https://science.n-helix.com/2021/02/multi-operation-maths.html
https://science.n-helix.com/2021/11/parallel-execution.html
https://science.n-helix.com/2022/12/math-error-solve.html
On the subject of how deep a personality of 4Bit, 8Bit, 16Bit is reference:
https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
https://science.n-helix.com/2022/10/ml.html
*
*SiMD Performance : RS
Performance per WATT of MMX & MMX+ & SSE & AVX Machine Learning & Shader code; Is a matter of 8x8Bit & 16x16Bit Code on GPU
Our role is to reduce complex un-cache-able ML to Cache-Enabled 64KB modelling, 1990's-style, without the Quality loss of 32Bit++ & 64Bit+
8x8Bit sharpening MMX Becomes Dual Pipe (16x16bit)*2 in 32Bit Dual 16 Pipeline & Twice as sharp
Machine Learning method for MMX Is Fast & Cheap, MMX2 More Compatible,
Intrinsic improvements such as combined ops & DOT4 Further improve the performance of under 1MB Code..
Performance & Function per WATT, Is unbeaten; Let us prove it!
For example Quake has MMX Emulation & MMX Dithering code on 3D Textures,
In 8Bit 256 Colours dithering is noticeable; In 15Bit to 32Bit the small shade difference in dithering colour is subtle & flawless,
Improving light subtlety & Colour palette WCG & HDR 10Bit to 16Bit per channel.
*
SiMD & Int8 & dp4a & F16/F32/F64>:
The way SiMD Repeating Parallel batches of instruction can still side load data,
Data is loaded into the 'calculation set'
http://ftp.cvut.cz/kernel/people/geoff/cell/ps3-linux-docs/CellProgrammingTutorial/BasicsOfSIMDProgramming.html
https://en.wikipedia.org/wiki/Single_instruction,_multiple_data
SiMD Consist of 8Bit to 64Bit Long & Floats,
SiMD are simple instructions; Or so they think; SiMD are relatively complex instructions..
For example 1/4 of a page full of arithmetic code; However our goal is to use Heuristics & logic to circumvent the Artifacts/Errors in self-generated code,
In addition to using problem solving tables to choose instructions that advantage our analysis (Machine Learning),
We also can choose the most probably optimal code type.
Our outset objective is to decide if we want to use CPU Feature types:
F16
Int8
dp4a
SiMD
Depending on the Mathematical Qualities of each ML Node & the questions they are asking,
For examples:
A simple ResNet Image identification uses edge detect & for that we need for example SiMD Matrix Edge Detection
Speech requires identifying Words in a codec, So obviously we need a Decoder & Encoder,
Word identifiers & correctness checking; But firstly we need to identify accent to correctly choose words..
We also need to classify words by Idea grouping (DataBase, Open Database)
As you can see; We will be defining many of these function groups as SiMD & Float,
Effective use of Int8 differentiation, Comparators & Maths operations has many benefits; So does JIT Compile.
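A scalar C sketch of what a dp4a step computes: four Int8 products accumulated into one 32Bit sum. Hardware dp4a does this in a single instruction; the loop below is only the reference behaviour.

#include <stdint.h>
#include <stdio.h>

/* One dp4a step: dot product of 4 signed 8Bit lanes, added to acc. */
int32_t dp4a(int32_t acc, const int8_t a[4], const int8_t b[4])
{
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

int main(void)
{
    int8_t a[4] = { 1, -2,  3, -4 };
    int8_t b[4] = { 5,  6, -7,  8 };
    printf("%d\n", dp4a(0, a, b));   /* 5 - 12 - 21 - 32 = -60 */
    return 0;
}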
RS
*
Solve Table of Statistically provable Machine Equates & Solves : Table of function competitors & Operators.
Runtime Library - Multiple Solve Table
I would like a Solve Table of Statistically provable Machine Equates & Solves that make the equivalent of Maths Compilers such as Rust & Fortran
For example basic ML code test function loops are basically compatible with X-OR Comparators on AVX! Other functions such as greater or less than; Are AVX Compatible.
Machine Learning : List of actions that are SiMD Baseline: Statistical Observance and Solve Tables
Yes or no comparator X-OR
Memory array Byte Swap
Greater or less than with swap or with X-OR Roll
Memory save & store
Edge comparisons
Compares (Colour, Math, Equate, Target, Solve if)
There are more! Statistical Observance and Solve Tables.
Examples 2:
Shape compare is a matter of inner & outer Vector : Comparison & X-OR, Larger outside & X-OR The differentiation:
By Dot,
By Mass (non literal dot difference comparator by axis),
Actual Mass
Density : Lumina, Weight, Mole, Mass / Area
Edge Solve : X-OR ~= Colour, Lumina, Shade, Vibrancy, Distance, Matrix Solve 3D>=2D Flattened Comparator
If X-OR difference N < 0.0001, Then Compare &= Mutex Solve / Average
Polygon Join/Merge Tessellation : If Model is the Same & (T1 + T2)/2 difference is less than 0.0001, Then Merge/Converge T1 & T2
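A minimal C sketch of the merge/converge rule above, assuming Euclidean distance & the 0.0001 epsilon; the Vec3 type & function name are illustrative.

#include <math.h>
#include <stdio.h>

typedef struct { double x, y, z; } Vec3;

/* If |T1 - T2| < eps, replace both vertices with (T1 + T2) / 2. */
int merge_if_close(Vec3 *t1, Vec3 *t2, double eps)
{
    double dx = t1->x - t2->x, dy = t1->y - t2->y, dz = t1->z - t2->z;
    if (sqrt(dx*dx + dy*dy + dz*dz) < eps) {
        Vec3 avg = { (t1->x + t2->x) / 2, (t1->y + t2->y) / 2,
                     (t1->z + t2->z) / 2 };
        *t1 = *t2 = avg;
        return 1;                            /* merged/converged */
    }
    return 0;
}

int main(void)
{
    Vec3 a = { 1.0, 2.0, 3.0 }, b = { 1.00005, 2.0, 3.0 };
    printf("merged: %d, x = %.6f\n", merge_if_close(&a, &b, 0.0001), a.x);
    return 0;
}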
*Audio, Video & High precision Float ML
Tensors & full ONNX configuration : Upscaling : While we are not sure how much ML we need & at what precision,
We can be sure that 32Bit (per channel) Value RGBA (Multiple layer) requires at least 8Bit to 16Bit per channel final precision; So here is a list:
Required Value of output, Neural Network precision guide table: RS
Input
8Bit, 10Bit, 12Bit, 16Bit
Input network precision average bit retention (for RAM some error is allowed)
6Bit, 8Bit, 10Bit, 14Bit, 16Bit
Classifiers as we know can be,
Int 2Bit, 4Bit, 8Bit, 16Bit, 32Bit
2Bit is unlikely & 32Bit is for Dream Smooth 16Bit+ Precision output
Output Float (Mostly FP & F16b)
16Bit = { 8Bit, 10Bit, 12Bit }
24Bit, 32Bit, 64Bit = { 16Bit, 32Bit, 48Bit }
We can upscale : Audio, Video, Content & Polygons, We classify Quality by expectations & Quantify by percent %
Rupert S
*
FPGA BitFile & Code Opt (c)RS 2021-01
https://science.n-helix.com/2022/10/ml.html
https://science.n-helix.com/2022/08/jit-dongle.html
https://is.gd/LEDSource
In my view heuristics in compilers are a choice for those who do not wish to include direct ML compiled into their code,
This is understandable in terms of terminator & cylons & indeed flawed beings or even good ones with depression!
However the application of branch optimisation is a sample code optimisation that can 'Plug In' to branch caching on the CPU & GPU.
Heuristics are not just code in the compiler; They are also micro code selecting a probable branch; Although code that forces a branch can be flawed..
Both heuristics, Branch probability selection & ML can run in parts of the code to select probable path!
Yes, fundamentally any code that modifies behaviour is a bullet-catching frame for code that is not sound; 'Fortran code is rock solid' & Rust is also supposed to be solid.
Including soundly made heuristic code & branch probability ML in your inline routines is 'Very much interpretive, master Jedi'; But it can be done!
Question is How big? & how fixed?
25KB per 3MB on average?
ML & Heuristics like my application FPGA BitFile & Code Opt (c)RS 2021-01
can be applied at runtime & remain only for selecting the fastest path or the best; In terms of which Processor function to run code for.
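A small C sketch of branch-probability selection as it can 'plug in' today: __builtin_expect (GCC/Clang) feeds a measured or ML-predicted branch bias to the compiler, which then lays the hot path out as the fall-through.

#include <stdio.h>

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

int process(int v)
{
    if (UNLIKELY(v < 0))   /* profiled or ML-selected as the rare path */
        return -1;         /* error path laid out out-of-line */
    return v * 2;          /* hot path falls through */
}

int main(void)
{
    printf("%d %d\n", process(21), process(-1));
    return 0;
}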
(c)Rupert S
*
TOPCloud Scaled Flexible WebASM & WebGPU & MathML!
Quite flexible for use on Monitors & TV's; Light processor load on simple tasks & offloadable such as TOPCloud!
You may be thinking Offloading is impracticable because that requires one of the following:
JIT Compiler Dongle..
USB device such as Firestick or GPU & CPU (With OpenCL Compat)
Server! so internet & service provision!
Impossible? No; WebAdvert supported TV's need both!
So why not HPC TOPCloud? It could make a HOT TV a lot cooler & Eco friendly, with the Server handling repeating tasks:
Scaling
Quality Service
Service availability
TOPCloud Offload Logic:
In terms of WebASM & WebGPU & MathML; TOPCloud provides sufficient advantages to be considered a core utility..
While Offloading repeating content such as Siteload core stack (Server) & Localising configuration such as Webpage size & DPI & Dynamic font arrangements that require thought.
In terms of Offloaded function & Efficient system load for large configurations..
Especially efficient configurations such as TPU, Coral, GPU work & Cloud CPU that have large optimised stacks & installed drivers.
RS
Games do not require cloud processing of images & a lot of local strategies are procedural Heuristic
You see RDP has GPU Connect (my innovation I might add); So Bluetooth & Wifi can connect RTP GPU; The port specifics are not particularly important; However a device such as a music streamer can have ML TOP's available locally & from the cloud,
Due to how the TOPCloud strategy works with localised ML TOPS, not all data has to be sent or received.. For example all Audio 3D Profiles for HQ Room audio can be done within a few MB of data; With some hard work? 150Kb of data & so in reach of phones & mobile!
Gaming is an example here; I give TickTackToe as the example where all a device like Alexa or a Google smart device has to think is: Which square? But..
No physical picture needs to be sent for the game to be played & if required a small TickTack Strategy ML is desired locally for a quicker response!
You see with a low latency GPU RTP & GPU RDP connection to cloud GPU; Most localised thinking TOPS can be carried out in seconds if not milliseconds, & PCM & MP4 are 2D/3D Image so GPU helps there also with 3D Audio mapping!