Friday, June 23, 2023

[M.A.P] [=====] [H.P.C] - Matrix Array Processor Unit (c)RS

*
The M.A.P Processor is ideal as a Tensor Unit, For Small Array Solving; Such as MP3, MP4 & AC4 3D Audio,
The Base Map is simply to Fit a large static conversion M.A.P into the device,
For example a 32Bit Audio Sample plus a 3D Layer for Bluetooth would simply be around 64Bits for Stereo 32Bit Audio MP4; Plus 32Bits for the 3D Map,
The M.A.P Process is not static; But you stick to the maths you wanted.

In parallel instructions, one calls interrupts if bad; IRQ & DMA Notes if you want to have better performance,
But in a processor Internals you have to call the main loops in your App; & OS Task Instruction cache..

Instruct The loop; Don't Interrupt; Stop, Look, Listen! Look, Slowdown, Showtime!

Integer instructions multiple parallel example of The principle of,
M.A.P is based on wide multiple instructions, This suits AVX & SiMD,
Particularly in 16Bit Multi Parallel Instruction Mode

Rupert S

Soft Interrupt IRQ: Faster CPU Cycles: RS

A Soft Interrupt is where you direct the interrupt register to a compiled Code Block..
The code block handles the Wait Queue in a gentle way that allows processing to continue & Ram to be accessed..

While the HDD directly writes the IRQ messages to the Code Block; The Code block is below the size of Cache on the Processor..

In advanced scenarios the Soft Int Caches Read/Write in RAM while Directing DMA & R/W Cached Cycles; Good BIOSes & Software do this.

But in a processor Internals you have to call the Main Micro loops (Soft Int) in your App; & OS Task Instruction cache.

RS

Interrupts particularly affect Processor functions such as..
Machine Learning Load & Store of Frames, Also the internet..
In devices such as Network cards, offloading is often required to handle interrupts..

*

VPDM-ST-LRS : Verified Processor Direct Memory Space Transactions Load, Register & Save (c)RS


In Concurrence with DM-TCP & DM-UDP & DM-Quicc Soft Interrupt IRQ

https://www.phoronix.com/news/Linux-Device-Memory-TCP

For SI-IRQ to safely directly write RAM for a SiMD & CPU/TPU; The following protocol is observed:

1 DMA Memory Management Processor, Device Bios/PCI Bus & Network Chipset/Network card..
Shall directly code check incoming traffic; But shall not void ECC Mode error check...

Bear in mind that AES, Common TLS & Packet Compression are in effect!
So you shall be using Networking features directly through the Transparent H.D.L Hardware Device Layer...

In effect the MMU & Network adapter transparently offload directly to Device Topography RAM & Cache!

2 The network card Certifies transactions & offloads security to internal features; Main Certification is still TPM & HMS.

3 You can handle data directly to the Processor if the memory space matches the internet Bit-depth; However this is usually 32Bit as with IPv4 addresses, whereas IPv6 addresses are 128Bit (a 64Bit prefix plus a 64Bit interface identifier)..

4 So the MMU & Network chipset work in sync; ECC, Security, TLS, M.S.T: Memory Space Translation...

5 VPDM-ST-LRS : Verified Processor Direct Memory Space Transactions Load, Register & Save (c)RS

So to be clear Automated Load, Register & Save Networking; Yes,
Device Low Level Firmware Translation Transactions; Yes
Processor Direct Memory Space Transactions; No, With Verification? Yes

To stop per Frame IO being a high cost transport processing; We process the entire frame per In/Out,
The same with TCP/UDP/QUIC; We process per whole block; For example 192Bits (SSL,AES),
Packet containment & control protocols; Mainly because Half packets caused inefficiency!

Rupert S

The IDFlow Work Networking : DMA to DMA Buffer write throughs with caching : (c)RS


DMA Offloading such devices as Network Cards, Audio & GPU to GPU connections : DMA, Direct Memory Access with direct device to device write-through caching

What is different about this approach; The TCP UDP non specific protocols allow motion through a computer system,

Routed through chiplets & internal networks; No latency issues & very little protocol overhead.

Applies to Ethernet, Wifi, Network, Internal Bus, Audio, Video, CPU or processor & is the internal data flow system : The IDFlow Work Networking

With GPU to GPU & Hard Drive to Hard drive transfers direct equivalence is the primary necessity!

To have a coherent transfer between two of the equivalent systems we need a Cache for input & output..

However we can create a load on arrival eta on transfer that automates correct RAM location that is optimally sorted!


The difference with IDFlow DMA :


Negotiated security profile..

The main thing about Mapped DMA is an ideal route

The routing table, To handle complexities in machinery & ethernet & wifi/BT

Negotiated Data Types, Traditional DMA is memory, IDFlow can use data types, For example textures or OpenCL Kernels & Data

Privacy, traditional DMA is quite private because information is not provided on route by intermediaries..

However you think about DMA System IDFlow,
It may be a tiny bit slower negotiating on boot,

However in Ethernet Negotiation only takes a second,
Once negotiation is accomplished... The system acts like a traditional DMA..

(c)RS

Certificate exchange IP Packets & then Device classifiers : RAM, Processing power & features, Priority, Availability, workload levels, common statistics, Routing table array.

Routing table array { passthrough hardware such as Motherboard chipset & special devices such as DMA, Buses & routing table storage & access }

IP Packet formula, Metadata {

Workload timer for OpenCL & DirectCompute workloads

Send cookie code packet on request reception or query

What we need to do first is send a quick burst of metadata; The metadata contains the application & use principle; We define preferred use & reception application!

Identification of the type of data being sent allows Direct RAM Allocation in the correct formula,
For example Textures or OpenCL & Direct Compute runtimes or Hard drive or Ram Data Blocks..

DMA Cache can then be directly allocated based on size & composition of data & that memory can be directly moved to the application memory allocation, Avoiding the cache being moved internally inside the Processor or GPU IPU, NPU etcetera..

We can pre formulate the data packet from a source such as QAM that sends Encryption offloaded packets for storage or use; This allows Prior work in the flow of data,

Where we need to directly allocate RAM Blocks to write but the end device needs to arrange the RAM block for write; Effectively a dynamic frame/Data block.

Example application where prior work from source device to end device is applicable:

QAM & Chipset to HDD & SSD & Drive direct transfer to QAM & Chipset for Processor use..

Maybe directly to Encrypted RAM as commanded by the Processor.

Direct Storage to GPU or decompression chipset to GPU

In terms of FPU & NPU to CPU task sharing, Dynamic metadata allows task optimisation & ram allocations..

Improving on that Dynamic Storage & Retrieval with optimal computation block.. Reduces overhead & repeated task processing.

In terms of HDMI & DisplayPort direct frame to frame DMA would speed up Ethernet transport protocols from the GPU to the display & back for when you frame copy..

In terms of Audio the per frame or tick cycle translation data to output would reduce overhead..

The principle of IDFlow is planned & secure DMA,

In principle the key point is the same as modern GPU direct RAM Access,

However because DMA is private & secured by being per application,

Direct DMA is a means of keeping secrets in the same way as PreFetch on the CPU,

Now you know that prefetch is bugged, DMA holds discrete secrets & privacy.

};
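
The metadata burst described above can be sketched as follows; This is only an illustration under stated assumptions, not a defined IDFlow format. The field names, the data type labels & the alignment values are all hypothetical & were chosen only to show how knowing the data type up front lets the receiver size & align the DMA buffer before the payload arrives.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FlowMetadata:
    """Hypothetical metadata burst sent ahead of the payload."""
    data_type: str     # e.g. "texture", "opencl_kernel", "ram_block" (assumed labels)
    size_bytes: int    # payload size announced by the sender
    application: str   # preferred receiving application

# Assumed alignment per data type; real hardware would negotiate these values.
ALIGNMENT = {"texture": 4096, "opencl_kernel": 256, "ram_block": 64}

def plan_allocation(meta):
    """Return the buffer the receiver would reserve for this transfer."""
    align = ALIGNMENT.get(meta.data_type)
    if align is None:
        raise ValueError(f"unknown data type: {meta.data_type}")
    # Round the announced size up to the next multiple of the alignment.
    padded = -(-meta.size_bytes // align) * align
    return {"application": meta.application, "align": align, "bytes": padded}

plan = plan_allocation(FlowMetadata("texture", 5000, "viewer"))
```

Here a 5000 byte texture is given two 4096 byte pages, so the block can be handed to the application without an extra internal copy.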

Rupert S

https://science.n-helix.com/2023/02/pm-qos.html

https://lore.kernel.org/dri-devel/20230710223304.1174642-1-almasrymina@google.com/

https://is.gd/HPC_PTP_Low_Latency_Network

https://www.linuxfoundation.org/press/announcing-ultra-ethernet-consortium-uec

https://ultraethernet.org/

https://jointdevelopment.org/

*

DMA & IO Device mapping


Dynamic Mapped Data flow with device compression
DMA & PIO needs to pass logically from device to device..
Memory allocation for buffers & cache; Input & direct load

https://science.n-helix.com/2023/06/ptp.html
https://science.n-helix.com/2023/02/pm-qos.html
https://science.n-helix.com/2023/06/map.html

RS

*

Embedded Hardened Pointer Table Cache for 3D Chips : RS


Based on PCI Edge RAM, Internal Loop Dynamic RAM; With internalised DMA Memory transfers..

In the process the feature has the ability to set a page table; 1MB, 2MB, 4MB, 16MB up to 1TB; The RAM can be internally written to without invoking ALU or OS,

Pages are allocated; The GPU is an example; Physical pages are allocated in RAM that is directly Set by OS & Firmware/ROM Parameters...

Internal access to the RAM is set within the page allocation set, But all internal mapping & paging is done directly & through ALU & Memory Management Unit MMU.

With 1MB Cache set aside per feature; Not entirely unreasonable these days...

Most of a process such as SiMD can be carried out on internal loops..

Depending on Cache/RAM Space; Based on PCI Edge RAM

Internal DataSet Size based on Dynamic RAM Variable; That is set per USE &Or Per Settings or application,

That being said; RAM Allocations best be per session & directly after Setting is changed on reboot or refresh, Load & unload cycling.

Rupert S

*

Gather/Scatter Microcode no-overload ALU or Data/Code Cache, Just L3/RAM


When we look at the Instructions of the SiMD; We could see potential in them to further improve the Gather/Scatter Instructions; Although it has to be said that the instructions are well optimised!
Like much pre-Fetching Assembly code from earlier years, they are well created & quick!

But we can do several things with them; So what ?

We can directly fetch the Cache in the code & Link to cache locations using linking (if we have enough & we do at L3/L2)

We can make a Hardlink table in cache(L3) for load and save processing (64Kb, Including header)

We can directly invoke pre-fetch with a system call (With SoftLink Pointer Tables)

We can incache modify (if a directive is singular in a chain of a, b, c, d)
We can individually SysCall a direct load of a single {a, b, c, d} statement & not reload it all...

For this we need a matrix table in L3 RAM; We can do this if we keep the table under 512KB,
But we do not intend to be selfish & RAM is fast these days! So we can directly load a single matrix Element {a, b, c, d} & not refresh the loading cycle for the code...

Thus we do not have to overload ALU or Data/Code Cache, Just L3/RAM

Rupert S


*

Temporary HardLinking in Prefetching Matrix instructions,

Gather/Scatter operations of localised random scattering of information to ram & retrieval

Gather
for (i = 0; i < N; ++i)
x[i] = y[idx[i]];

Scatter
for (i = 0; i < N; ++i)
y[idx[i]] = x[i];
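
The two loops above can be tried out with a minimal Python sketch; The function names & the fill value are illustrative assumptions, The indexing is exactly the x[i] = y[idx[i]] & y[idx[i]] = x[i] pattern shown.

```python
def gather(y, idx):
    """Gather: x[i] = y[idx[i]] for every i."""
    return [y[j] for j in idx]

def scatter(x, idx, size, fill=0):
    """Scatter: y[idx[i]] = x[i]; slots not named in idx keep `fill`."""
    y = [fill] * size
    for value, j in zip(x, idx):
        y[j] = value
    return y

y = [10, 20, 30, 40, 50]
idx = [4, 0, 2]
x = gather(y, idx)                # x == [50, 10, 30]
back = scatter(x, idx, len(y))    # back == [10, 0, 30, 0, 50]
```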

Firstly I read statistical gathering & Seeding; Pre-Fetching is a method of anticipating & preloading data,
So what do I want to do ? In Vector Matrix Prefetch Logical Gather

Potentially I would like to use:

Softlink (ram retrieval & multiple value)
HardLink (maths)
Prefetching logic {such as,

Run length prefetching,
Follow & Forward loading Cache,
Entire instruction load & Timing Pre-fetch & Statistic for Loop time & load frequency
}

So on any potential layout for SiMD Matrix a most likely configuration is:

A B C : FMA
A B = C : Mul or ADD

So a logical statement is, A, B Gather/Seed C; Directly logical AKA Prefetch
A B C D; Logical fields of prefetch are localised to parameter...

Only likely to draw data from a specific subset of points,
Byte Swapping is obviously A1 B1,2,3

Most specifically if the command is a hardlink With A B C; Then most likely Storage is directly linked; Like a HardLink on a HDD in NT,

The hard link is direct value fetching from a specific Var table & most likely a sorted list!
If the list is not sorted; We are probably sorting the list..

If we do not HardLink data in a matrix (Example):

Var = V+n, Table
a b c d
1[V1][V1][V1][V1]
2[V2][V2][V2][V2]
3[V3][V3][V3][V3]
4[V4][V4][V4][V4]

A Matrix HardLink is a temporary Table specific logical reading of instructions & direct memory load and save,
Registers {A,B,C,D}=v{1,2,3,4}..

Directly read direct memory table logic & optimise resulting likely storage or retrieval locations & Soft Link (pointer table)

Solutions include multiple Gather/Scatter & 'Gather/Scatter Stride' Cube Block multi load/save..
Logical Cache Storage History Pointer Table, Group Sorted RAM Save/Load by classification {A,B,C,D}=v{1,2,3,4}
When X + Xa + Xb + Xc, When Y + a b c, When Y or X Prefetch Pointer Table + Data { a, b, c }

Example Gather/Scatter logical multiple

var pointer [p1] {a ,b, c, d}
var pointer [p2] {1 ,2, 3, 4}

Gather
for (i = 0; i < N; ++i)
x[i] = y[idx[i]];
fetch y {p1, p2}; {a, b, c, d}:{1 ,2, 3, 4}

Scatter
for (i = 0; i < N; ++i)
y[idx[i]] = x[i];
send x {p1, p2}; {a, b, c, d}:{1 ,2, 3, 4}
 
Rupert S : Reference https://en.wikipedia.org/wiki/Gather/scatter_(vector_addressing)

*

FMA is a Matrix SiMD feature & is common to ARM & AMD, CPU & GPU

Phone SIM cards can use FMA for GSM network acceleration,

We can use FMA fused MUL ADD for elliptic curve encryption to multiply Time * curve & ADD AES encryption in the form of time model & 3D dimensions,

Therefore we can use FMA to calculate the room area & add audio reverberation matrix as volume levels over time..

FMA as a basic GPU..

We can convert adder & fused MUL ADD ML,

Use all 3 types on integer function of CPU & internal GPU on echo dot type devices with internal GPU and CPU.. FPGA design.

Rupert S

*

Pre-Fetching; Statistically Ordered Gather/Scatter & The Scatter/Gather Commands


(SiMD) The gather/scatter commands may seem particularly random?
But we can use this in machine learning:

Gather
The equivalent of Gathering a group of factors or memories into a group & thinking about them in the context of our code! (our thought rules),

Scatter
Now if we think about scatter; we have to limit the radius of our thought to a small area of brain matter (or ram)... Or the process will leave us "Scatter-Brained"

Statistical Pre-Fetching:

Ordered Scatter
When you know approximately where to scatter

Ordered Gather
Where you know approximately where to gather
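
One possible reading of an Ordered Gather, sketched in Python: the indexes are visited in ascending address order so that neighbouring reads stay close together, & the results are then put back in the order the caller asked for. The helper names are assumptions for illustration; max_stride is only a rough stand-in for how far each access jumps in RAM.

```python
def ordered_gather(y, idx):
    """Gather y[idx[i]] but touch the source in ascending index order."""
    order = sorted(range(len(idx)), key=lambda i: idx[i])
    x = [None] * len(idx)
    for i in order:          # walk the source from low to high address
        x[i] = y[idx[i]]     # store in the caller's original order
    return x

def max_stride(idx):
    """Largest jump between consecutive accesses."""
    return max((abs(b - a) for a, b in zip(idx, idx[1:])), default=0)

y = list(range(100, 200))
idx = [90, 3, 47, 4, 91]
x = ordered_gather(y, idx)   # same values as a plain gather
```

For this index list the largest jump drops from 87 to 43 once the accesses are sorted, while the gathered values are unchanged.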

Free Thought
So now we can associate scatter & gather as a form of free thought? Yes but chaotic...
So we add order to that chaos! We limit the scattering to a single field.

Stride
Stride is the equivalent of following a line in the field; Do we also gather &Or Scatter while we stride ?
Do we simply stride a field?

Now to answer this question we simply have to denote motive!
In seeding we can scatter; Will we do better with an Ordered Scatter ? Yes we could!

Statistically Ordered Gather/Scatter & The Scatter/Gather Commands
Pre-Fetched

Rupert S

*

Multi-line Packed-Bit Int SiMD Maths : Relevance HDR, WCG, ML Machine Learning (Most advantaged ADDER Maths)


The rules of multiple Maths with lower Bit widths into SiMD 256Bit (example) 64Bit & 128Bit & 512Bit can be used

In all methods you use packed bits per save, so single line save or load, Parallel, No ram thrashing.

You cannot flow a 16Bit block into another segment (the next 16Bit block)

You can however use 9 bit as a separator & rolling an addition to the next bit means a more accurate result!
in 32Bit you do 3 * 8bit & 1 * 4Bit, in this example the 4Bit op has 5 Bit results & The 8Bit have 9Bit results..
This is preferable!

2Bit, 3Bit, 4Bit Operation 1 , 8Bit Operations 3: Table

32Bit
4 : 1, 8 : 3

64Bit
4 : 2, 8 : 6
2 : 1, 7 : 8
3 : 1, 8 : 1, 16 : 3

Addition is the only place where 16Bit * 4 = 64Bit works easily, but when you ADD or - you can only roll to the lowest boundary of each 16Bit segment & not into the higher or lower segment.
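
The separator bit idea can be sketched in Python, assuming the 32Bit word is laid out as three 9Bit fields (8Bit operands) & one 5Bit field (4Bit operand) as in the 32Bit row of the table. The layout constant & the function names are illustrative assumptions. Because every field has one spare bit, a single integer ADD sums all four values at once & no carry rolls into the neighbouring segment.

```python
# Result widths per field; each operand is one bit narrower than its field.
FIELDS = (9, 9, 9, 5)   # 9 + 9 + 9 + 5 = 32 bits

def pack(values):
    """Pack four small integers into one 32-bit word."""
    word, shift = 0, 0
    for value, width in zip(values, FIELDS):
        if not 0 <= value < (1 << (width - 1)):
            raise ValueError("operand does not fit its field")
        word |= value << shift
        shift += width
    return word

def unpack(word):
    """Split a 32-bit word back into its four field values."""
    out = []
    for width in FIELDS:
        out.append(word & ((1 << width) - 1))
        word >>= width
    return out

def packed_add(a, b):
    """One 32-bit addition performs four independent additions."""
    return (pack(a) + pack(b)) & 0xFFFFFFFF

sums = unpack(packed_add([255, 255, 255, 15], [255, 1, 0, 15]))
# sums == [510, 256, 255, 30]
```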

A: In order to multiply you need adaptable rules to division & multiply
B: you need a dividable Maths unit with And OR & Not gates to segment the registered Mul SiMD Unit..

In the case of + * you need to use single line rule addition (no over flow per pixel)..
& Either Many AND-OR / Not gate layer or Parallel 16Bit blocks..

You can however painful as it is Multi Load & Zero remainder registers & &or X or Not remainder 00000 on higher depth instructions & so remain pure!

8Bit blocks are a bit small and we use HDR & WCG, So mostly pointless!

We can however 8Bit Write a patch of palette & sub divide our colour palette & Light Shadow Curves in anything over 8Bit depth colour,

In the case of Intel 8Bit * 8 Inferencing unit : 16 Bit Colour is probably (WCG 8 * 8) + (HDR 8 * 8) Segments,

In any case Addition is fortunately what we need! so with ADD we can use SiMD & Integer Today.

Rupert S

https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html

https://science.n-helix.com/2021/11/parallel-execution.html

https://science.n-helix.com/2022/10/ml.html

https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html

https://science.n-helix.com/2023/06/map.html

*

M.A.P NPU Matrix Processor Dimensional construct (c)RS


Primary reason for expansion of function data sets: 2D, 3D,< nD

P.D.C is a worker thread parallel 2D or 3D Grid,
Utilising QQ & A, B,C Array maths allows us to collapse or expand dimensions in a flexible way,

The same principles as SVM (S.V.M SiMD Vector matrix) can be used to culminate or expand dimensions...

That way a M.A.P Processor can expand or collapse all mathematical constructs,
We can therefore use all mathematical & statistical arrays for machine Learning & Maths.

RS

*

The Subject of 4x4 tables,

We are obviously looking for more like 16x16 for Physics maths!
The matrix processor is a large data set; Divisible into 4x2 & 4x4 & 8x8 groups for execution speedups,
Aligned Parallel processing....

Aligned Matrix tables need to be larger than 4x4 for Physics & Chemistry; So a matrix processor ideally can at a minimum:

Matrix Table

x1
16x16

16/2
x2
8x8,8x8
8x8,8x8

8/4
x4
4x4,4x4
4x4,4x4
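
The table above can be sketched in Python as a plain block split; split_blocks is a hypothetical helper name & the matrix contents are only test data. A 16x16 table is cut into four 8x8 groups, & each 8x8 group can be cut again into four 4x4 groups.

```python
def split_blocks(matrix, block):
    """Cut a square matrix into a grid of block x block tiles."""
    n = len(matrix)
    if n % block:
        raise ValueError("block size must divide the matrix size")
    return [
        [[row[c:c + block] for row in matrix[r:r + block]]
         for c in range(0, n, block)]
        for r in range(0, n, block)
    ]

m16 = [[r * 16 + c for c in range(16)] for r in range(16)]
b8 = split_blocks(m16, 8)        # 2 x 2 grid of 8x8 tiles
b4 = split_blocks(b8[0][0], 4)   # first 8x8 tile as a 2 x 2 grid of 4x4 tiles
```

Each tile keeps its alignment inside the parent table, so the tiles can be handed to separate SiMD groups independently.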

RS

*

Matrix Method (c)RS


Any GPU & CPU SiMD can do a form of Matrix maths in an Array Parallel Load & Run as consecutive tasks..

Like So

Matrix Formulas : (c)RS

SiMD Array A to X, Usually 8, 16, 32, 64 Parallel Groups

Grouped Parallel Runs
A 1, 2, 3, N
B 1, 2, 3, N
to
Y 1, 2, 3, N
X 1, 2, 3, N
Run 1 {A1, B1 to X1, Y1} Run 2+ {A2, B2 to X2, Y2}++ {An, Bn to Xn, Yn}

Matrix Processor Method Synchronous Cube Map Usually 8x8, 16x16, 32x32, 64x64 Parallel Quad++ Groups

2D:3D Cube

A 1, 2, 3, N
B 1, 2, 3, N
C 1, 2, 3, N
D 1, 2, 3, N

Run 1 2D:3D Cube {
A 1, 2, 3, N
B 1, 2, 3, N
C 1, 2, 3, N
D 1, 2, 3, N
};

Run N 2D:3D Cube {
A 1, 2, 3, N
B 1, 2, 3, N
C 1, 2, 3, N
D 1, 2, 3, N
}

Rupert Summerskill

*

SiMD Matrix maths begins with a 3D graph,


a
|___c
 \
   b

The graph's principle of 3 dimensions; We can use more dimensions but on paper we need to represent dimensions in colours so that all 3 dimensions that we can draw; are represented.

In algebra we represent 3+ dimensions with small glyphs next to each letter that represents our maths operation theoretical number.

During operation of computation we maintain in memory the specific dimensions interactions and interplay of complex matrix maths.

Rupert S

Numbers example 4D matrix

I love you 2, I love you 3, I love you 4 the ends of time... To be continued...

JN

*

The formula for the NPU (c)RS


Codecs & drivers with Matrix Mathematics Formula for AVX, NPU, TPU, Coral.ai Edge TPU & GPU,
Can potentially optimise 1000's of Web pages per second

Matrix Math Formula Get more Upscaling & performance per WATT, By Block Load/Run Parallel Processing:
SiMD, AVX, XMM, NPU, FPU, GPU, Processor

The formula for the NPU is +++ *** +*+*+* Adder, Multiplier, ADD MUL, This original formula was thought of by me in the 1980's as a child...

My basic reasoning (fighting for credit with a developer) Was for the Atari game cartridge system!

Now Why ? Because of several factors:

Parallel Adder Tables are FAST, Like so fast!
Parallel Adder Tables are cheap

Parallel is the new in of RISC, Extended parallel instruction sets, Vectors & MMX & 3DNow!

The philosophy behind Parallel instruction use is to be understood to be based on the Console requirements...

Speed, Performance, Price & the difference it makes!

Formula tables may seem complex! How does a child learn of this?

2 factors: I like game arcades & consoles.. & Basically Maths education at school!

Factor tables are a basic of Excel Spreadsheets & formulas for the process of examining the forex, foreign trade & equities & share markets of the world & New York makes one dream! Dream Big!

It is to be understood that APPLE understand the functional potential of ADDER MUL & +* Tables with memory...

They do I.T!

It is to be understood that AMD & Intel & NVidia Understand machine learning in a point of view.

Due to the complexity of Formula Tables as a basis of maths & science?

WE UNDERSTOOD IT.

Do you the client and the producers : APPLE, ARM, AMD, INTEL, NVidia, Motorola (6833++)

Truly understand the TRUE Potential of Formula Tables?

Basic formula table examples:

General Matrix Table maths with parallel arrays can optimise most table maths compatible Vector Units,

Code: SysCL, OpenCL, Assembler, Tight maths code is optimised & expressed in cube blocks of data in 256kb,128kb,64kb,32kb,8kb,4kb chunks as defined by grids..

AVX, SiMD & Coral.ai EdgeTPU & NPU Acceleration

Matrix Array Maths, Arrays & vector tables
Instanced_Arrays , Cubes & Polygons & curves..
AES, ECC, RSA, DSA, Array Maths

Anti-aliasing
Edge detection
Sharpening
Excel spreadsheets
Mathematical reduction
Statistics
Synergy
Artificial intelligence (AI)
Deep learning (DL)
Machine learning (ML)
Mathematics

Understand the requirements of Maths & Know the truth!

Basic assumptions of parallel processing require a full netmasked NPU Grid Matrix..
Now obviously an NPU does NOT fully map the entire Grid array in a single pass!

Sub cubes are to be mapped in allocation blocks with strict alignment..
That being said, if you cannot use a fully packed Byte..
You probably need to map better!

MAP(c)RS include the following parameters : ++++ **** +*+*+* In Matrix, 2D & 3D & varieties there-of


Matrix Formula Parallel Processing (c)RS

++++ **** +*+*+*+*
++++ **** +*+*+*+*
++++ **** +*+*+*+*
++++ **** +*+*+*+*

By multiplying by N*1 = +* = + & forms of cross line multiply with 0+* : N*Y
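
The three row types in the grid above (+ Adder, * Multiplier, +* ADD MUL) can be sketched in Python as element-wise grid operations; grid_map is a hypothetical helper & the 4x4 values are only sample data. On real hardware each cell would be one lane of the parallel unit; here a nested list stands in for the grid.

```python
def grid_map(op, *grids):
    """Apply op cell by cell across grids of the same shape."""
    return [[op(*cells) for cells in zip(*rows)] for rows in zip(*grids)]

a = [[r * 4 + c for c in range(4)] for r in range(4)]   # 0..15
b = [[2] * 4 for _ in range(4)]
c = [[1] * 4 for _ in range(4)]

added = grid_map(lambda x, y: x + y, a, b)              # ++++
multiplied = grid_map(lambda x, y: x * y, a, b)         # ****
fused = grid_map(lambda x, y, z: x * y + z, a, b, c)    # +*+* (multiply then add)
```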

Honestly 2D & 3D Matrix SiMD qualify as significantly qualified...

The question of AVX, SiMD & Basic instructional formulas..
Is potentially only about latency & complexity!

Instruction hierarchies with fully qualified SiMD & Vector instruction lists
Wiring on die is significant with complex instructions; So Bus Width is a significant challenge.

2x & 3x instruction Load/Store per operation Cycle reduces required Bus width..

Bus = I, SiMD = S, Memory Cache = M

SI=M=IS S=IM=S S=IM=S=IM
SI=M=IS S=IM=S S=IM=S=IM
SI=M=IS S=IM=S S=IM=S=IM
SI=M=IS S=IM=S S=IM=S=IM

The HerringBone Attribute allows Store & Run with faster Cache RAM & Dynamic Allocations?
Instruction out through bus, Parallel pipe.

Instructions can run, Left, Right, Up, Down...

Logic dictates Output direction is next operation or system ram & use.

RS

Map grid examples:

16KB Cubes [ ]

Grids are defined like so..

[ ]1a, [ ]2a, [ ]3a, [ ]4a
[ ]1b, [ ]2b, [ ]3b, [ ]4b
[ ]1c, [ ]2c, [ ]3c, [ ]4c
[ ]1d, [ ]2d, [ ]3d, [ ]4d

Example, We allocate an Address segment, 1, 2, 3, 4 or a1, a2, b1, b2 or 1a, 1b, 1c, 1d or a1:a4, b1:b4, c1:c4, d1:d4,
Independent parallel masking.
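
Address segments such as a1:a4 or 1a:1d can be turned into independent masks; The following Python sketch assumes a 4x4 grid with rows a to d & columns 1 to 4, & the function names are illustrative. Each segment becomes its own 0/1 mask, so segments that do not overlap can be processed in parallel.

```python
ROWS = "abcd"
COLS = "1234"

def parse_cell(label):
    """Accept either 'a1' or '1a' and return (row, column) indexes."""
    row = next(ch for ch in label if ch in ROWS)
    col = next(ch for ch in label if ch in COLS)
    return ROWS.index(row), COLS.index(col)

def segment_mask(segment):
    """Turn 'a1:a4' (or a single cell such as 'b2') into a 4x4 0/1 mask."""
    first, _, last = segment.partition(":")
    r0, c0 = parse_cell(first)
    r1, c1 = parse_cell(last or first)
    return [[int(r0 <= r <= r1 and c0 <= c <= c1) for c in range(4)]
            for r in range(4)]

row_a = segment_mask("a1:a4")    # whole first row
col_1 = segment_mask("1a:1d")    # whole first column
```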

We can map multiple arrays in a bus & in a single pass..
With command load, run, save

(c)Rupert Summerskill 2024 'The Years to EXCELL'

*

TPM Verified Loop Code : Production Verified & Signed : Qualified Encryption & Compression Privacy (c)RS


Private loops : Security Level Verified Code & Byte Code

For security reasons the Block set of Lattice maths is loaded & fetched on secondary execution string,

Code dislocation involves no trace loop; For efficiency reasons the code optimal loop & fetch cycles are analysed in closed loop lab..

Data & code analytics are non disclosed for debugging clean stack code.

(c)RS

*

Matrix Formula block loading for SiMD Shaders makes sense, Most tasks can fit 4 commands in a row (in 64KB RAM)


Depending on the task; You can fill a grid { a1, a2,/ b1, b2 } ,

Or more depending on command length & data content..

SiMD Unit 2x 16Bit per row; 4 Rows per unit : grid { a1, a2,/ b1, b2 },

NPU & AVX 512 & 256 & 128 bit; have a much larger grid if supporting 16Bit values.

Rupert S

*

Standard deviation & derivatives (c)RS


There are many tasks suitable for standard, average, gaussian & mean deviation..

By perfect example; The Average, Mean, High & Low sample data set & Machine Learning & Reason..

Cherished by the late Greeks, averaging data sets & poll data & metrics.

Standard deviation used for Dithering & Smoothing & Edge shaping & Sharpening with a smooth look,

In Codecs & Texture formats & can significantly improve look..

Used in statistical analysis, Image processing : Averaging, Error Diffusion Dithering, Averaging Dither, Sharpening & shaping,

By average mean & standard deviation : Tessellation, Vertex culling, Shape & colour composure, Colour matching & Identification tasks.

Rupert S

Understanding Standard Deviation and Derivatives


Standard Deviation:
A measure of how spread out data is from the mean.
A high standard deviation indicates a wide range of values..
A low standard deviation means data points are clustered closely around the mean.

Gaussian Distribution:
The normal distribution (or Gaussian distribution) is often used in statistical analysis and machine learning due to its symmetrical shape and known properties. Standard deviation is a key parameter of the Gaussian distribution.

Mean Deviation:
While less commonly used than standard deviation,
Mean deviation measures the average absolute distance from the mean.
It can be more robust to outliers than standard deviation.
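
A small Python sketch of the two spread measures, using the standard library statistics module; mean_absolute_deviation is a helper written for this example & the sample data is arbitrary. It uses the population form of the standard deviation.

```python
import statistics

def mean_absolute_deviation(data):
    """Average absolute distance of each value from the mean."""
    values = list(data)
    mean = statistics.fmean(values)
    return statistics.fmean([abs(v - mean) for v in values])

data = [2, 4, 4, 4, 5, 5, 7, 9]
spread = statistics.pstdev(data)        # population standard deviation
mad = mean_absolute_deviation(data)     # mean deviation
```

For this data set the mean is 5, the standard deviation is 2 & the mean deviation is 1.5; The squared distances in the standard deviation weight the far values (2 & 9) more heavily.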

Derivatives: A mathematical tool that measures the rate of change of a function.
In image processing, derivatives can help detect edges and features.

Applications

Image Processing:

Edge Detection:
Derivatives can highlight areas of rapid change in intensity, indicating edges.

Noise Reduction: Standard deviation can be used to identify and filter out outliers (noise) in images.

Gaussian Blur: A convolution with a Gaussian kernel (which is defined by its mean and standard deviation) is used to smooth images and reduce noise.

Dithering:
Standard deviation can help determine the optimal dithering pattern for reducing color banding.

Derivatives in Higher Dimensions:
For images, which are 2D & 3D signals..
We can use partial derivatives to measure changes in the x and y directions.

Edge Detection:

Convolution with Sobel or Laplacian kernels: These kernels are essentially derivatives.
The magnitude of the convolution output indicates the edge strength.
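
A minimal pure Python sketch of Sobel edge strength on a small greyscale grid; The image values are made up for the example & border pixels are simply left at zero. The kernels are applied without flipping (strictly a correlation), which does not change the magnitude for Sobel.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # change along x
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # change along y

def apply_kernel(image, kernel, r, c):
    """Weighted sum of the 3x3 neighbourhood centred on (r, c)."""
    return sum(kernel[i][j] * image[r + i - 1][c + j - 1]
               for i in range(3) for j in range(3))

def sobel_magnitude(image):
    """Edge strength per pixel; the one pixel border stays 0."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = apply_kernel(image, SOBEL_X, r, c)
            gy = apply_kernel(image, SOBEL_Y, r, c)
            out[r][c] = math.hypot(gx, gy)
    return out

step = [[0, 0, 10, 10] for _ in range(4)]   # vertical edge in the middle
edges = sobel_magnitude(step)               # edges[1][1] == 40.0
```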

Canny Edge Detector: Smooths the image with a Gaussian filter (set by its standard deviation) & then applies two hysteresis thresholds to the gradient magnitude.

Median Filter: A non-linear filter that replaces each pixel with the median value of its neighborhood.
While not directly related to standard deviation, it's often used for noise reduction.

Statistics:

Hypothesis Testing:
Standard deviation is crucial for calculating test statistics and determining significance levels.

Confidence Intervals:
It helps construct confidence intervals around sample means.

T-test:
Compares the means of two groups.
The standard deviation is used to calculate the t-statistic.

ANOVA:
Compares the means of multiple groups.
Standard deviation is used to calculate the F-statistic.

Clustering: It can help identify natural groupings in data.

Histogram of Oriented Gradients (HOG):
Uses derivatives to compute gradient magnitudes and orientations, which are then used to create feature descriptors.

Machine Learning & statistical analysis pt2:

Standard deviation can be used to normalize features & enhance average improving model performance & accuracy..
Normalization: Standard deviation is often used to normalize features, ensuring they have a similar scale.
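
Normalization by standard deviation (often called a z-score) can be sketched with the standard library; standardize is a helper name chosen for this example & the constant-input guard is an assumption about how to treat a zero spread.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and population standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]   # constant feature: nothing to scale
    return [(v - mean) / sd for v in values]

z = standardize([2, 4, 4, 4, 5, 5, 7, 9])
```

After this step features measured in very different units share a similar scale, which is the property the normalization point above relies on.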

Clustering: Algorithms like k-means use distance measures (which can involve standard deviation) to group data points.

Least Squares Regression:

The standard deviation of the residuals (the difference between the predicted and actual values) is used to assess the model's fit & fitness & accuracy.

Computational Efficiency:

While standard deviation and derivatives are powerful tools..
they can be computationally expensive for large datasets.
Efficient algorithms and hardware implementations are essential.

RS

*

ML, TFLite/ONNX : Wavelet & Array content such as HTML, JS, DNS & NTP protocols : RS


Example TFLite/ONNX can interpret Wavelet restoration as a final layer to sharpen encoding or decoding in Codecs & Texture formats or image compression libraries & DLL/.so H265 H264 H266 & DSC,
Such a compression advantage is due to the random bits in MPG & AAC & Huffman coding.

We use a Matrix Maths Array to carry out the shaping; Because Waveshaping Matrix is a lot faster!

Aligned Matrix

2x SiMD
4x SiMD
8x to 64x AVX
4x to 128x NPU/TPU

Array formatted content such as DNS information can be ordered & sorted by logic & corrected for deviations from standard W3.org Mark-up language

We need TFLite/ONNX for games that have a light ML Payload for gaming AI & for antivirus or system flow such as servers route selection..

Don't kid yourself TFLite is a light load on a system!

ONNX is good but TFLite kernels come in under 50Kb

https://www.w3.org/TR/
https://www.w3.org/TR/webnn/#api
https://www.w3.org/TR/webgpu/#packed-formats

RS

*

Perfect sample for Matrix Tables : https://gpuopen.com/learn/sampling-normal-gaussian-distribution-gpus/


" // Method 2: Box-Muller Transform

float2 sampleGaussBoxMuller(float2 u, float mean, float standardDeviation)
{
const float a = standardDeviation * sqrt(-2.0f * log(1.0f - u.x));
const float b = TWO_PI * u.y;

return float2(cos(b), sin(b)) * a + mean;
}

"

We can either repeat loop solves : (cos(b), sin(b)) * a + mean,
Or we can form a table matrix

(cos(b), sin(b)) = x , * a + mean = y

1 2 3 4
a x*y, x*y, x*y, x*y
b x*y, x*y, x*y, x*y
c x*y, x*y, x*y, x*y
d x*y, x*y, x*y, x*y
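
The table idea can be sketched in Python as a direct port of the quoted Box-Muller function, filled into a grid once instead of being solved in a repeated loop. The grid of input pairs (cell centres of an n x n square) is an assumption made for this example; any set of uniform inputs could be used.

```python
import math

def sample_gauss_box_muller(u, mean, standard_deviation):
    """Port of the quoted shader function; u is a pair of values in [0, 1)."""
    ux, uy = u
    a = standard_deviation * math.sqrt(-2.0 * math.log(1.0 - ux))
    b = 2.0 * math.pi * uy
    return (math.cos(b) * a + mean, math.sin(b) * a + mean)

def gauss_table(n, mean, standard_deviation):
    """n x n table; cell (r, c) uses u = ((r + 0.5) / n, (c + 0.5) / n)."""
    return [[sample_gauss_box_muller(((r + 0.5) / n, (c + 0.5) / n),
                                     mean, standard_deviation)
             for c in range(n)]
            for r in range(n)]

table = gauss_table(4, 0.0, 1.0)   # 4 x 4 grid of sample pairs
```

Every cell is independent of the others, so the whole table maps onto parallel lanes in one pass.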

Rupert S

*

Lattice Squares Kyber, Falcon, AES, DES, RSA, ECC:

The use of Lattice Squares, Otherwise known as Matrix Maths Formula..
In AVX, SiMD, NPU require efficient code modelling:

Lattice Grids are defined like so..

[ ]1a, [ ]2a, [ ]3a, [ ]4a
[ ]1b, [ ]2b, [ ]3b, [ ]4b
[ ]1c, [ ]2c, [ ]3c, [ ]4c
[ ]1d, [ ]2d, [ ]3d, [ ]4d

Multi-Threaded in parallel

Security for top quality Mobile Phone 3G, 4G, 5G, LTE & 2.4G & WiFi & Bluetooth : ICE-SSRTP GEA Replacement 2022 https://science.n-helix.com/2022/03/ice-ssrtp.html

A SiMD Variant Matrix Maths Formula, All kinds of work can be carried out :
Anti-Aliasing,
Sharpening Masks,
Code that requires a 4x4, 8x8, 16x16 Grid,

Bear in mind that SiMD is 2 lane 32Bit & 4 Lane 16Bit..
So a 4x4 matrix is ideal per SiMD Core Group @ 16Bit

4x4 = Single double lane SiMD Unit
8x8 = 2 Double Lanes
16x16 = 4 Double Lanes
32x32 = 8 Double Lanes

Double Fetch or Quad Fetch

RS

*

DML


Suitable for processing: Lattice, Kyber, Falcon, AES, ECC & drawing Vectors, JS & WebASM, PHP & HTML5 Web formats..

DML Level is important!

Matrix Maths Formula

Matrix Operation examples :

DML_FEATURE_LEVEL_2_1 : https://is.gd/DictionarySortJS

DML in relation to Instanced_Arrays & DirectX & Vulkan/OpenGL & CL

DML_OPERATOR_ELEMENT_WISE_
MAX, MEAN, MIN, MULTIPLY, SUBTRACT

RS

*
Reference the 'Parallel multiplication Grid NPU Simulation' Doc in https://is.gd/DictionarySortJS


https://science.n-helix.com/2022/04/vecsr.html

https://science.n-helix.com/2019/06/vulkan-stack.html


https://is.gd/UpscalerUSB_ROM

*

Directed Matrix Principle : RS


Matrix Principle directed at traditional parallel Integer & SiMD Instruction groups

The main problem with 32KB L1 tables is cache filling & domination of CPU/GPU by single program instruction groups..

Instruction cache is the primary challenge; Because Instruction cache L1 is commonly 32KB; Data cache 64KB,
L2 is 512KB to 4MB; L3 4MB to 16MB (can be more on Epyc)..

Optimised instruction groups by instruction, SiMD multiprocessing thread count:

Firstly requirements: (32KB instruction Cache L1, 512KB L2, 8MB L3)

L1 Instruction Group 32KB
L2 running group 512KB
L3 RAM & storage direct fetching 8MB

8KB core table for group threading,
24KB of grouped & Synchronised instructions

Data work Groups 512KB L2 / 64 Instruction Group sets (L1 32KB Table),
So Main instruction groups from L1 with larger data sets.

L3 4MB to 8MB of data & instruction caching load (directed from L1 & funneled into L2)
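
The budget above can be checked with simple arithmetic. A sketch, assuming the sizes stated in the requirements (32KB L1 instruction cache, 512KB L2, 8KB core table, 64 group sets); cache_budget is a hypothetical helper.

```python
KB = 1024

def cache_budget(l1_instr=32 * KB, l2=512 * KB, core_table=8 * KB, group_sets=64):
    """Return (bytes left in L1 for grouped instructions, L2 bytes per group set)."""
    grouped_instructions = l1_instr - core_table
    l2_per_group = l2 // group_sets
    return grouped_instructions, l2_per_group

grouped, per_group = cache_budget()
```

With the stated sizes that leaves 24KB of grouped instructions in L1 & 8KB of L2 data per instruction group set.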

Instructions are cross threaded directly through L3 & L2 synchronised Load, Run & Save,

Optimised instruction groups by instruction, SiMD multiprocessing thread count.

Rupert S

*

Parallel Arrays : Matrix forms : RS


Matrix processor is a feature that will be more common & is relatively similar to an Abacus with a multiple array of + & * Operators..

Now a Matrix Array is X1 > Xn & Y1 > Yn

Commonly an array of 16 x 16 but can be 8 x 8 or 4 x 4,

Now we can perform such operations as Relativity & String theory on a lattice & that is very fast!

We can also perform these functions on SiMD, AVX in parallel; Such that 256Bit SiMD is 32Bit x 8 Parallel & so forth

Parallel
a : 64Bit
b : 64Bit
c : 64Bit
d : 64Bit

Matrix
a1a2a3a4
b1b2b3b4
c1c2c3c4
d1d2d3d4

Now we can see that we can perform a matrix operation such as lattice with both SiMD & SiMD-Matrix,

We can also see that a Matrix shall & can present our solution & that SiMD can also!
But we need Long operation SiMD or many passes to complete our operations; If Larger than our size..
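
A rough sketch of the pass count: how many register loads a matrix needs when it is larger than one SiMD register. It assumes every element has the same bit width & that the matrix is simply streamed through the register; simd_passes is a hypothetical name.

```python
import math

def simd_passes(rows, cols, element_bits, register_bits):
    """Return (lanes per register, passes needed to cover rows x cols elements)."""
    lanes = register_bits // element_bits
    passes = math.ceil(rows * cols / lanes)
    return lanes, passes

small = simd_passes(4, 4, 32, 256)    # 4x4 of 32Bit on 256Bit SiMD
large = simd_passes(16, 16, 32, 256)  # 16x16 of 32Bit on 256Bit SiMD
```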

We can also therefore most likely..

Use AES-NI S Letter Box & SVE & Matrix & SiMD to our advantage for many Lattice operations.

Multiplier Matrix Accelerated Encryption, Like I said A Parallel SiMD array may do the same; If all memory arrays are connected by a single RAM/Cache ALU Node,

As stated Parallel Arrays & Parallel Matrix Arrays.

Rupert Summerskill

https://science.n-helix.com/2023/06/map.html

https://science.n-helix.com/2022/03/ice-ssrtp.html

Bluetooth LE Protocol
https://drive.google.com/file/d/17csRnAfdceZiTSnQZvhaLqLSwL__zsIG/view?usp=sharing

*

Examples of Parallel execution pipeline : Parallel arrays:


Crypto lattice, Kyber/ML-KEM, AES : Parallelised Lattices, 8x & 16x Parallel SiMD F16/32/64/128/192/256Bit

parameterisation of groups of 4x Parallel SiMD F16 & 8x Parallel SiMD F16

Parallelised motion & Video/Audio Deblocking/Blocking

8x8 16x16 quantification of video is common in VVC & H265 & H264 & JPEG & MP3, MP4a & AAC,
Suggested parameterisation of 4x Parallel SiMD F16

8x8 16x16 quantification of video is common in HDR VVC & H265 & H264 & JPEG & MP3, MP4a & AAC & AC3 & AC4,
Suggested parameterisation of 4x Parallel SiMD F32

Shapes in motion 2D : 4x per Cube in motion,
Shapes in motion 2D : 6x per Texture Shaded Cube in motion,

Shapes in motion 3D : 6x per Cube in motion,
Shapes in motion 3D : 8x per Texture Shaded Cube in motion,

RS

*

Number relativity, Bit precision: RS


In gaming a player has access to palette of 16bit FFFFFFFFFFFFFFFFFFFF.FFFFFFFF BF16 F=16 HEX; In 32bit memory storage.

Average gamers recognise maybe 32000 colours directly,

Colour rich artist colourists recognise almost 6000000 colours TOPCloud.

Variety is king & queen of experience,
Artists specialist recognises more colours than a basic gamer or graphics artist in vectors..

Matrix maths operations precision is relative to hardware,
XBox 4bit FFFF, PLAYSTATION 8Bit FFFFFFFF

RollINT precision 1 to 4 bit + integer -1 to 4 bit F, FFFF, FFF+.F Xbox Or FFFFFFF+.F Ps

Bit precision is relative to your experience!

Rupert S

*

RollINT - Machine Learning for Console & Computer : RS

With True Value memory/Operation cache...

Application of RollINT to machine learning with definition,
A Playstation APU has 8Bit Integers for inference; XBox 4Bit..

In order to describe 4Bit as float; You would need to define 3Bit & 1Bit R remainder,
So how does this work?

In loading value the first 3Bit is the value & the 4th bit is remainder & when you load the value stored..

You fetch 3Bit as the value & 1 Bit as the remainder; Example:

FFFe > Value FFF &R e, So the value is FFF.e not FFFe
you can do multiple data type operations in this method; For example:

FFde = FF & de or FF.de or you could do Ffde & mean F.fde; Useful for definitions of Pi,
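
A sketch of the load step, assuming each F in the examples above stands for one hex digit (4 bits) & that the remainder digits are read as a hex fraction. split_roll & to_float are hypothetical helper names.

```python
def split_roll(word, remainder_digits):
    """Split an integer into (value, remainder) at a hex-digit boundary."""
    shift = 4 * remainder_digits
    value = word >> shift
    remainder = word & ((1 << shift) - 1)
    return value, remainder

def to_float(word, remainder_digits):
    """Read the same word as value.remainder, the remainder being a hex fraction."""
    value, remainder = split_roll(word, remainder_digits)
    return value + remainder / 16 ** remainder_digits

fffe = split_roll(0xFFFE, 1)   # FFF & e
ffde = split_roll(0xFFDE, 2)   # FF & de
f_fde = split_roll(0xFFDE, 3)  # F & fde
```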

For example Pi in 4Bit (8Bits Prefered); Commonly used by kids at school!,

However you convert the stored 4Bit Pi to a fully accurate value on FPU & SiMD execution by loading pre-stored true value.

RollINT

We are using roll to roll a zero on or off an integer,

Therefore we are able to divide and multiply and add so that..

101-0 > 10.1+0 ; The number of rolled zeros can practically range from 0 to 00000000.

So 10023-000 > 10.023+000

We can then store floating point numbers in integers.
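
One way to read the roll is as a scaled decimal integer: the digits are kept as an Integer & the roll count says how many places the point moves left. The sketch below assumes that reading; roll, roll_add & roll_mul are hypothetical names, & Fraction is used only to show the exact value.

```python
from fractions import Fraction

def roll(digits, places):
    """Value of integer `digits` with the point rolled `places` to the left."""
    return Fraction(digits, 10 ** places)

def roll_add(a, a_places, b, b_places):
    """Add two rolled integers; align both to the larger roll count first."""
    places = max(a_places, b_places)
    total = a * 10 ** (places - a_places) + b * 10 ** (places - b_places)
    return total, places

def roll_mul(a, a_places, b, b_places):
    """Multiply two rolled integers; the roll counts add."""
    return a * b, a_places + b_places

ten_point_one = roll(101, 1)   # 101-0 > 10.1
ten_023 = roll(10023, 3)       # 10023-000 > 10.023
```

All the maths stays in Integer units; only the final read converts to a fraction or float.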

(C) Rupert S,

Reference Int & FP Value Sizes; A reminder that Floats are 50% of highest Integer Value,
ROLLInt floats still have an amazing additional value!

https://learn.microsoft.com/en-us/dotnet/standard/numerics

*

RollINT : The Float Perfectionist


Playstation & XBox are primary examples where the Int8 unit could do a RollINT Floating point operation for machine learning that is specific to float FPU Solves,

Edge detection, Sharpening & Adaptive Contrast & Colour HDR..

Depending if you directly roll on SiMD & FPU then you can still sharpen with the bF16 & half precision FPU/SiMD Maths operations on the final run!

Imagine Luke SkyWalkers final Torpedo Salvo as FPU/SiMD Vectors DT

RS

*

Scaler is an argument for the role of RollINT & also a pointer to method


RollINT : A Float view of machine learning,
Essentially the core issue is the role float may play in a result...

Does the human mind not use a common integer format with a small float remainder?
Potential for this configuration is mainly because Integer values are in the main Substantive information..

Float value (the sub decimal place below 0.); Is in essence a precise small value of high importance to skills such as jumping, Running, Motions & skill actions like shooting..

Integer is the majority of action related to large steps; Particularly because people have the capacity to change from Meter to Centimetre to Millimetre,

Justifications for Float values diminish if you have scalar units such as the meter, the Yard, foot, Inch & 16th!

However; As may be pointed out, Roll Scalar? Is a form of floating unit expression; If Scalar measurements are regarded in terms of statics; Then Yes Integer:{Meter}; FPU:{cm, mm} is a float value!

Nonetheless Scaler is an argument for the role of RollINT & also a pointer to method..

Scaling you see; is everything to detail; If you want to see this? Magnify or Zoom & Wide angle!
We further scale; By hitboxing our ML; In other words by training the AI on Centric value rewards..

AI Content:

{Content value reward targets};
{Centric Core values};

Return = Value;
end = infinite
Test Loop {AI C, End}; Begin

Epochs = {Satisfied End}

Rupert S

*

Float & Integer : RollINT : In Depth Analytics

RollINT List

Floats with small precision values : RollINT

Dreams have 'Small Randoms', Minor details make a true reality

(OS & Chrome Example)
The size of frames & text alignment
Main colour groups for desktop & browser colours : FFFFFF.FF
Frames forward & backward with submenus are worthy of low precision floats : FFF.F 300 Frames 16 sub allocated positions inside frame:{SubFrame}

Both low & high precision

High Efficiency ZLib, GZip Ram compression
Localised Error correction

Colour depth & contrast HDR, Low error rate/Higher

RS

*

RollINT Versus Metric principle of float reduction : RS

Scale correctly & avoid that FPU being needed

Scale correctly first; Example mouse is Millimetre & Micrometre & Large scale Centimetre,
Photon Microscope is Picometre, Millimetre, Centimetre,
Telescope is Kilometre, Metre, Millimetre..
Screens UpScale & Zoom, Do we need to rescale our measurement ?

https://learn.microsoft.com/en-us/dotnet/standard/numerics

X+- , Y+- 2D+- central point measurements
Type | Bytes | Minimum | Maximum
Int16 | 2 | -32,768 | 32,767
Int32 | 4 | -2,147,483,648 | 2,147,483,647
Int64 | 8 | -9,223,372,036,854,775,808 | 9,223,372,036,854,775,807 (might want to use floats; A lot quicker)

Precision Floats
Type | Bytes | Range
16Bit Half | 2 | ±65504
32Bit Single | 4 | ±3.4 × 10^38
64Bit Double | 8 | ±1.7 × 10^308

The main attack Vector being mice & touchscreens & utility scopes & measuring devices...
We wanted DPI without stress!

A range of options exist when using RollINT; The idea is to Roll a float on operation; To be fair hardware like the Amiga has the concept of Integer operation with a float as the final result..

However that option Is "the Final result" & does not mean that you could use RollINT to make a repeated Float maths for applications..

However RollINT could be used 2 Significant ways:

You could use FPU on the result (Previous integer operations save FPU for other tasks)
You could receive an Integer result from the float operation (Final float value on multiple operations not important to you?)

Perform Metrification & therefore avoid float value use; for example expand the data into a higher precision mode,

The principle of the Metric system is to use sub parts to reduce the necessity of floats : Meter, Centimetre, Millimetre, KG, Gram, Ounce..

So avoiding a floating unit..

The method is multiple operations, Large, Small, Smaller & can in reality be repeated down to picometer or tiny weights...

This method is multiple operation rounds,
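
A sketch of the Large, Small, Smaller rounds, assuming the measurement is held as a whole number of millimetres so that no float is ever needed. metrify is a hypothetical name.

```python
def metrify(total_mm):
    """Split integer millimetres into (metres, centimetres, millimetres)."""
    metres, rest = divmod(total_mm, 1000)       # Large
    centimetres, millimetres = divmod(rest, 10)  # Small, Smaller
    return metres, centimetres, millimetres

length = metrify(1234)        # 1.234 m held as an integer
summed = metrify(1234 + 766)  # addition stays in integer units
```

The same split can be repeated with smaller base units (micrometre, picometre) at the cost of extra rounds.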

RollINT & FPU Avoid rounds of CPU Cycles; But options exist.

RS

*

As you know the Matrix Array Processor is now frequent with Intel, Mac M1 & M2, AMD & NVidia Versions..

Quantum computers rely on Multi-Directional & Multi-Dimensional Arrays per Qbit!

Well this is a design structure for a Multi-Array Multi-Connection Matrix Array Processor..

The principle is basically quite logical!

Multi-Array Multi-Connection Matrix Array Co-Processor - Quanta Light Compute 2023-06-23

Percentage based 3D Processing to handle all 3D Array processing,

Central [H.P.C] Tasks map to probability over Networks [=====] & [M.A.P] Units in arrays

Table define

{

[M.A.P] = M.A.P , M.A.P 8 Way interconnect,
[H.P.C] = M.A.P High Precision Central Core,
[=====] = Bus Connections & networking

}

Top View

[M.A.P][M.A.P][M.A.P]
[M.A.P][H.P.C][M.A.P]
[M.A.P][M.A.P][M.A.P]

Side View 3D

[M.A.P][H.P.C][M.A.P]
[M.A.P][=====][M.A.P]
[=====][H.P.C][=====]
[M.A.P][=====][M.A.P]
[=====][H.P.C][=====]
[M.A.P][=====][M.A.P]
[=====][H.P.C][=====]

Each [H.P.C] Central Contains RAM & connections to the 8 [M.A.P] & Optionally to layers above & below in 3D Matrix,
Bottom of wafer contains high resolution bus to onboard controllers & networks & DPU/GPU/CPU's

Array = Matrix Array Processor Unit (c)RS

ffffffff ffffffff ffffffff
........+ ........*+ ........*
........+ ........*+ ........*
........+ ........*+ ........*

f=fp,unit
*=mul
+=add
.=Cache/Ram

Simple absolver table for MUL:ADD : MUL* Only = +0, +- Only = N*1 then +-
% = / 100 + ADD Table {N1 <> N...} : Result!
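
A sketch of the absolver table as code, under one reading of the notes above: a single MUL+ADD unit covers MUL only (add 0), ADD only (multiply by 1) & percent (multiply by pct/100, then add). The fma helper here is plain a * b + c & rounds twice, unlike a hardware fused unit; all names are hypothetical.

```python
def fma(a, b, c):
    """Stand-in for a fused multiply-add: a * b + c (not fused in this sketch)."""
    return a * b + c

def mul_only(a, b):
    return fma(a, b, 0.0)          # MUL* Only = +0

def add_only(n, c):
    return fma(n, 1.0, c)          # +- Only = N*1 then +-

def percent_add(n, pct, c):
    return fma(n, pct / 100.0, c)  # % = / 100, then ADD
```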

(c)Rupert S

SiMD:CMA (c)RS


Standard SiMD Features, Byte Swap, ADD,MUL[SSimd]
8 x Cache,Mul,ADD: [8xCMA]

[SSimd]
[8xCMA][8xCMA][8xCMA][8xCMA]

[SSimd] is additional features accessed by register poke, Standard Operation is CMA & RAM
[8xCMA] is used as RAM in most SiMD Operations & MUL+ADD, ADD, MUL

In SiMD Ops
On RAM up to 3x F16 can be stored (3xF16, F32 + F16, F48, F24x2)

MUL or ADD Operations can be {F16:F16:F16, F32 *+- F16, F24 *+- F24}
Operations are saved to Master Cache & sent to RAM or other functions & can be {F16, F24, F32, F48},
Because master cache is a full buffer; you have to save it first! before reuse!

Design uses the M.A.P basic MUL+ADD & RAM

(c)Rupert S

References: DOT4, INT8, INT16, F16, F32, F64 (c)Rupert S
https://science.n-helix.com/2023/02/pm-qos.html

https://science.n-helix.com/2023/07/3dchiplet.html

Nx-DeepMatrix Engines
https://www.nextplatform.com/2023/08/02/unleashing-an-open-source-torrent-on-cpus-and-ai-engines/
https://idstch.com/geopolitics/next-generation-neuromorphic-chips-bringing-deep-learning-from-cloud-to-iot-edge-devices-and-mobiles/
https://www.backblaze.com/blog/ai-101-gpu-vs-tpu-vs-npu/

Experimental CPU Proof : A proposal for an Open RISC V Processor, Statistical diagrams of function & graphs with function use under load...
https://www.researchgate.net/publication/373403576_Design_of_a_High_Performance_Vector_Processor_Based_on_RISIC-V_Architecture

ML Batch Matrix MAP in FPGA
https://drive.google.com/file/d/1hdxeK1r8LIhvpn7poOm3MfXmGr9Tq-ni/view?usp=sharing

ML Compressed Dynamic16bit-8Bit - Hardware-friendly compression and hardware acceleration for ML Transformer
https://aimspress.com/article/doi/10.3934/era.2022192

Matrix Processors - Memory & command - All-Digital Compute-In-Memory FPGA Architecture for Deep Learning Acceleration
https://dl.acm.org/doi/pdf/10.1145/3640469

Matrix Processors - Inline Ram & Command { CMD : RAM }:{NET}
https://www.xilinx.com/content/dam/xilinx/support/documents/white_papers/wp506-ai-engine.pdf
https://www.xilinx.com/content/dam/xilinx/support/documents/white_papers/EW2020-Deep-Learning-Inference-AICore.pdf

***

Cooperative Matrix Math : RS


Cooperative Matrix is a Math type where you formulate a Grid of numbers & math notations & solve them in sync,

The consequence for you is that the maths is both Faster; More Complex But also easier to correct for errors...

Usually Matrix Maths is used for Algebra, Image & 3D Mapping ML; Such as to see, Maps & Dungeons, Water tables, Technology Development.

Matrix

Var = V+n, Table
     a      b      c     d
1[V1][V1][V1][V1]
2[V2][V2][V2][V2]
3[V3][V3][V3][V3]
4[V4][V4][V4][V4]

There are 3 main ways for matrix maths:

V1a {/,*,+,-},Value, %, Fraction V1b, V2a, V2b : In effect a dither map or calculation; So connected.
Vector groups {V1a<>z} Maths to {V2a<>z} to {V3a<>z} to {V4a<>z} & more ..

Sorted by Type of operation example
M = Multi Complex Operations In Groups
    a         b        c        d
1[V1]+[V1]+[V1]+[V1]
2[V2]*[V2]*[V2]*[V2]
3[V3] / [V3] / [V3]/[V3]
4[V4]M[V4]M[V4]M[V4]

Refer to : Var = V+n, Table
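
A sketch of the table sorted by type of operation: each row is folded left to right with its own operator. The M row is left as a caller supplied function; the running average used for it here is only a placeholder assumption, as are the names solve_rows & sample.

```python
from functools import reduce
import operator

def solve_rows(table, ops):
    """Fold every row left to right with the operator assigned to that row."""
    return [reduce(op, row) for row, op in zip(table, ops)]

sample = [
    [1.0, 2.0, 3.0, 4.0],    # row 1 : +
    [1.0, 2.0, 3.0, 4.0],    # row 2 : *
    [64.0, 2.0, 4.0, 2.0],   # row 3 : /
    [2.0, 4.0, 6.0, 8.0],    # row 4 : M (placeholder operation)
]
row_ops = [operator.add, operator.mul, operator.truediv,
           lambda acc, v: (acc + v) / 2.0]

results = solve_rows(sample, row_ops)
```

The rows are independent of each other, which is what makes them candidates for solving in sync.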


Matrix Accumulator Header Matrix : {MAHM}
SiMD Wave : 32, 64 Group with finalised result + ALU : Work Group Wave Matrix : {WGWM}
Wave Matrix Accumulator Cube : {WMAC}

{MAHM}
{WMAC},{WMAC}
{WMAC},{WMAC}

{MAHM}
{WGWM},{WGWM}
{WGWM},{WGWM}

{MAHM}
{WGWM},{WGWM}
{WMAC},{WMAC}

CTP-HTM : CPU, TPU, Processor Hypervisor Thread Management : RS

Parallel Group Threads:

Work groups aligned by:

Work Group Size (aligned by Bit):

Memory Range {Half Float, b16Bit,b32Bit, 16Bit,32Bit , Double Float}
Aligned Cluster Size,
Bit-depth & Length of code

The logic is that Parallel Group Threads with the same Code complexity & Size should finish around the same time,
They also typically require the same processor priority so that system tasks have Runtime Availability.

RS

Guide to Cooperative Matrix Math : RS

Base principle of the Matrix & Graph goes beyond Accumulation of numbers..
I am reminded by Microsoft's dev post of Excel & Spreadsheet applications..

Yes they Graph/Matrix; But math solves require it! For example the Acidity/Alkaline matrix with Protons & Electrons,

However a more sophisticated form is algebra; But you have to simplify the Algebra & put that in a table..
Einstein, Schrödinger, Physics, Chemistry & DNA By connection...

Algebra is the main reason we would use Float : {bF16 <> bF32} {Single Precision <> Double Precision} SiMD,
The chief objective is the solve; Complex SiMD offer the answer of flexibility..
MUL:DIV ADD

Simple absolver table for MUL:ADD : MUL* Only = +0, +- Only = N*1 then +-
% = / 100 + ADD Table {N1 <> N...} : Result!

(c)Rupert S

Graph Accumulator Multiply ADD - Cooperative Matrix


SDK Sample : https://github.com/ROCmSoftwarePlatform/rocWMMA

https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix-cores-readme/

https://paperswithcode.com/paper/a-survey-on-deep-learning-hardware/review/

AMD 23.Q3Pro_HIP #HPC #DirectML MatrixMathOps 'Release unto me the great! Chobokniki' Thine Prayers Answered https://is.gd/AMD23Q3PRO_HIP
Run the .reg after install; Before reboot https://is.gd/AMDRebarReg

*

Inference & FMA De-Block Styles


For upscaling matrix: MMX+ & SiMD
16x16 Block as used just about in HD,
8x8 Blocks Certainly NTSC, PAL, JP_NTSC!,
Very usable for deblocking JPG,
16x16 & 8x8 is very good for Inferencing active on Scaling & Deblocking..

4x4 for main Inference XBox & 8x8 for PS5..
XBox can use (4x4)x4 for 8x8 & (4x4)x16 for 16x16; Very powerful!
PS5 can use (8x8)x1 or x2 for 8x8 & (8x8)x4 (x8 for additional processing) for 16x16; Very powerful!

The table solves common issues with 4Bit & 8Bit direct loading of colour tables of the F16 Types..
16Bit is a bit more common in older hardware & luckily quite a lot more flexible!
But 8Bit & 4Bit inferencing have a number of uses...

Indirect load through F16 Register can work by sideloading the operation; With Inferencing Sub routine coding & Returns,
Processing the actual inference but losing data store & returns just information..

Sub Routine INT8 & INT4 can:
Directly manipulate a small palette; Scoped Palette,
Single channel colour or multiple operations..
Load, Store & Save

Inference & FMA De-Block Styles List

(4x4)x4
(4x4)x8
(4x4)x16 + processing
(4x4)x32 +++ processing

(8x8)x4
(8x8)x8 + processing
(8x8)x16 + processing

(16x16)x1 + processing
(16x16)x2 ++ processing
(16x16)x4 +++ processing

8:4Bit Concepts: 65535/255=8Bit 65535/16=4Bit

16bit/4bit : 4Bit colour palette, But we can fraction 16Bit/4bit in essence 16/4! 65535/16; Compression Shapes & Gradients.
Polygon, Shadow, Contact
Alpha Channel 2Bit, 4Bit
Grayscale edge define sharpening
Single Colour Edge detect
Shape Fill in Alpha 10,10,10,2
Xor, Pattern, Shading, Shader, Cull, Shape & Depth Compare after define

For when {U, X, Y, Z} = N Expressions https://is.gd/ForWhen_UXYZ_N
For when {(A+B/2)} = C Expressions https://is.gd/ForWhen_ABx2_C

(c)RS

*

An example use of FMA Cooperative Matrix


In the example we use a formula like (U/X²)+(U/Y²)+(U/Z²)
Firstly the x²,y²,z² are MUL, So we need a * table or maybe with FMA we can use a (MUL)+0 ?
My primary observation is that we can use 2 methods:

MUL (U/X²), (U/Y²), (U/Z²) in tables, I suggest 3 * or FMA (MUL)+0
Or we can perform tables in order but complete all the MUL operations in Sync & then ADD with FMA,
Sync : (U/X²)+(U/Y²)+(U/Z²) to (Un/X²)+(Un/Y²)+(Un/Z²)

F1 = First Operation F2 = Second operation R = Result {R1:R3 = R4}

F1
R1=(U/X²) R2=(U/Y²) R3=(U/Z²)
F2
R1 + R2 + R3 = R4

So we have an example where MUL & then ADD is usable; But we could use Synced FMA
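
Both methods can be sketched side by side. The fma helper is plain a * b + c (rounds twice, unlike a fused hardware unit), & the function names are hypothetical; the formula is the (U/X²)+(U/Y²)+(U/Z²) example above.

```python
def fma(a, b, c):
    """Stand-in for a fused multiply-add: a * b + c (not fused in this sketch)."""
    return a * b + c

def mul_then_add(u, x, y, z):
    # F1: the three MUL class results, squares formed as (MUL)+0
    r1 = u / fma(x, x, 0.0)
    r2 = u / fma(y, y, 0.0)
    r3 = u / fma(z, z, 0.0)
    # F2: ADD the results, R1 + R2 + R3 = R4
    return r1 + r2 + r3

def synced_fma(u, x, y, z):
    # Each term is folded into the accumulator: acc = u * (1/v²) + acc
    acc = 0.0
    for v in (x, y, z):
        acc = fma(u, 1.0 / (v * v), acc)
    return acc

r4_a = mul_then_add(8.0, 1.0, 2.0, 4.0)
r4_b = synced_fma(8.0, 1.0, 2.0, 4.0)
```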

For when {U, X, Y, Z} = N Expressions https://is.gd/ForWhen_UXYZ_N

RS 

Brilliant examples of matrix maths
https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-finite-difference-docs-laplacian_part1/

VXEdDSA & XEdDSA & X25519 & X448
https://signal.org/docs/specifications/xeddsa/

SiMD-Matrix Maths example - Wave retrieval from quad-polarized Chinese Gaofen-3 SAR image using an improved tilt modulation transfer function
https://www.tandfonline.com/doi/full/10.1080/10095020.2023.2239849?src=
https://drive.google.com/file/d/1uN047PvBJhFkcdNJKqx6cBZ9vnAxcjPj/view?usp=drive_link

SiMD-Matrix Maths example D-Waves
https://drive.google.com/file/d/15iPy-Z24GsbcUdEycOfS1819Fdf0sWoE/view?usp=drive_link

*****

High speed Per operation Cycle operations of D R² Pi


An A[diameter]*B²[Pi] : D * R² operation is 2 Cycles; This specialised Arc, Sin, Tan operation can be accomplished a couple of ways in a single cycle,

Options table : D R² Pi

Firstly by sideways memory load in lower Single Precision to double precision output in a SiMD

You need to pre cache R²; You can use the same value for R or for D &or both
You can pre cache all static D &or R, So you can vary either D or R & single cycle
You need to perform 2 operations , Diameter & R² & obviously they are relational!

For examples:

R = Atom Zinc (standard size!) Cache D R
You move a compass but the needle is the same size! Cache D
You draw faces but the width is the same, Cache D
You draw faces but the Shape is the same but size is not! Cache R
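
A sketch of the pre cache option for the case where R is static & D varies: R² is solved once, so each further result costs a single multiply. RSquaredCache is a hypothetical helper; real hardware would hold the value in a register or cache line rather than a dictionary.

```python
class RSquaredCache:
    """Keep R² for every R seen so far, so D * R² needs one multiply per call."""

    def __init__(self):
        self._squares = {}

    def d_r2(self, d, r):
        r2 = self._squares.get(r)
        if r2 is None:
            r2 = r * r              # solved once per distinct R
            self._squares[r] = r2
        return d * r2               # the single remaining operation

    def cached_count(self):
        return len(self._squares)

cache = RSquaredCache()
values = [cache.d_r2(d, 3.0) for d in (1.0, 2.0, 3.0)]
```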

Rupert S

**********

How you use FMA, Basic MUL+ADD examples first & then Mul & ADD


Firstly in video,
MUL a float set A * B + C
Video Upscaling basic A:Pixel * B:PixelDiffRightPixel + C:RightPixel,
Do that 16 Times per pixel pair and you have 16*Interpolate, So a 16* Data set Wave!
You could obviously use a 32* Wave SiMD & do 4x8; So 4 Pixel groups per Wave.

So for example you can ADD Log Gamma or other simple values, In A * B + C,
Pixel Values or whatever, You can use Point float 0.001 for example to do division on floats.
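
A sketch of the A * B + C upscale, assuming A is the step fraction, B is the difference to the right pixel & C is the starting pixel; that operand assignment is an assumption, as are the names. The fma helper is plain a * b + c, not a fused unit.

```python
def fma(a, b, c):
    """Stand-in for a fused multiply-add: a * b + c (not fused in this sketch)."""
    return a * b + c

def interpolate_pair(left, right, steps=16):
    """Return `steps` values starting at `left` and moving towards `right`."""
    diff = right - left
    return [fma(k / steps, diff, left) for k in range(steps)]

def divide_by_1000(x):
    """Division done as a multiply by the point float 0.001."""
    return fma(x, 0.001, 0.0)

wave = interpolate_pair(0.0, 16.0)
```

The 16 results are independent of each other, so they map onto one 16 wide Wave.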

For all personal maths that you imagine:
Simple absolver table for MUL:ADD : MUL* Only = +0, +- Only = N*1 then +-
% = / 100 + ADD Table {N1 <> N...} : Result!

Interpolation & smoothing :

The method I am thinking of is ADD Mul/Div : Edge Left A+B Edge Right = C Center, (A to C)<>(C to A)

(A+B)/2 = C

Factor A_to_C
16 Steps

Factor C_to_B
16 Steps

*alternatives*

((A-C)/16)=F | (F* A over C)=F Step * 16 over Time or distance

(Call slope)
find 16 Fractions of A To C
find 16 Fractions of C to B

For when {(A+B/2)} = C Expressions https://is.gd/ForWhen_ABx2_C

RS

Pixel A to B, Interpolation upscaling


from A1 to B16 ADD Difference of A - B

Red A1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10 : 11 : 12: 13 : 14 : 15 : 16B
Green A1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10 : 11 : 12: 13 : 14 : 15 : 16B
Blue A1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10 : 11 : 12: 13 : 14 : 15 : 16B

Tables can be 16 Wide & 16 Long to advantage ourselves of Byte aligned F16

Pixel A to B, Interpolation upscaling

AAA
ABA
AAA

Example

R,G,B Value of A
R,G,B Value of B
RCv = Value per pixel of 16

Which is higher RA or RB
if RA
RA - RB = RC
If RB
RB - RA = RC

RB{1 to 16} repeat +- RCv
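
The steps above can be sketched in Integer maths, one table row per colour channel. This version assumes 16 samples that start exactly on A & end exactly on B with rounding to the nearest integer; channel_steps & interpolate_rgb are hypothetical names.

```python
def channel_steps(a, b, steps=16):
    """Integer values from a to b (both ends included) over `steps` samples."""
    diff = abs(a - b)               # RC : the higher value minus the lower
    sign = 1 if b >= a else -1
    last = steps - 1
    # step k moves k/last of the difference, rounded to nearest
    return [a + sign * ((diff * k + last // 2) // last) for k in range(steps)]

def interpolate_rgb(pixel_a, pixel_b, steps=16):
    """Return `steps` (R, G, B) tuples from pixel_a to pixel_b."""
    channels = [channel_steps(a, b, steps) for a, b in zip(pixel_a, pixel_b)]
    return list(zip(*channels))

ramp = interpolate_rgb((0, 30, 255), (15, 0, 255))
```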

Sorry about the coding RS

Rupert S

*

FMA AVX Performance table: 2Flops per Cycle per FMA Unit
Architecture Fast Instructions for FMA


Reference Tables https://www.uio.no/studier/emner/matnat/ifi/IN3200/v19/teaching-material/avx512.pdf

Operators in C
● Arithmetic
a + b, a - b, a*b, a/b, a%b
● Bitwise
a | b, a & b, a ^ b, ~a
● Bit shift
a << b, a >> b (signed), a >> b (unsigned)
● Logical operators
a && b, a || b, !a
● Comparison operators
a == b, a != b, a < b, a <= b, a > b, a >= b
● Ternary operator
x = a ? b : c
● Special functions:
sqrt(x), abs(x), fma(a,b,c), ceil(x), floor(x)

Fast division for constant divisors

Calculate r = a/b where b is a constant
With floating point we precompute (at compile time
or outside of the main loop) the inverse ib = 1.0/b.
r = ib*a
Floating point division with constant divisors
becomes multiplication
With integers the inverse is more complicated
ib,n = get_magic_numbers(b);
r = ib*a >> n

Integer division with constant divisors becomes
multiplication and a bit-shift

Fast Division Examples
● x/3 = x*1431655766/2^32
27*1431655766/2^32 = 9
● x/1000 = x*274877907/2^38
10000*274877907/2^38 = 10
● x/314159 = x*895963435/2^48
7*314159*895963435/2^48 = 7

Dividing integers by a power of two can be done with a bit shift which is very fast.
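
A sketch that rebuilds the magic numbers from the examples above & checks them. It assumes non negative inputs in the 32Bit range; magic_number & fast_div are hypothetical names. Python integers do not overflow, whereas in C the product needs a 64Bit intermediate.

```python
def magic_number(divisor, shift):
    """Ceiling of 2**shift / divisor: the multiplier that replaces the division."""
    return -(-(1 << shift) // divisor)

def fast_div(x, multiplier, shift):
    """Integer division by a constant as a multiply and a bit shift."""
    return (x * multiplier) >> shift

M3 = magic_number(3, 32)
M1000 = magic_number(1000, 38)
M314159 = magic_number(314159, 48)

nine = fast_div(27, M3, 32)
ten = fast_div(10000, M1000, 38)
seven = fast_div(7 * 314159, M314159, 48)
```

The shift has to be large enough for the input range; a multiplier that is exact for 32Bit inputs can give wrong answers for larger ones.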

RS


High-Performance Elliptic Curve Cryptography: A SIMD Approach to Modern Curves
https://www.lasca.ic.unicamp.br/media/publications/FazHernandez_Armando_D.pdf
https://science.n-helix.com/2023/06/map.html
https://science.n-helix.com/2022/04/vecsr.html

https://gpuopen.com/learn/matrix-compendium/matrix-compendium-intro/

*

Triangle 3D Matrix graphs


C
|
|
_____b
\
  \
    A

Vector table for audio & video or graphics..

We will use integers for the 3D audio presentation & SiMD fpu for MP4 & AC4 & Alac decompression..

RS

So we will be using a form of float unit called..

RollINT


*

ECC elliptic curves & Gradients : RS


Leveraging FMA fused MUL ADD on Internet & Software ...

For examples:

Gradients vector compression..

Colour A to colour B

Compare dif {A:B}
Transform A over steps B

Same colour ranges {R,G,B}

(A - B) = Dif
Shift B over steps = A

Store Vec VTable = steps

VTable:

Steps S1 to Sn

Colour B1 to Bn + S1 to Sn

S1,Sn
B1,Bn
B1,Bn
B1,Bn

Same with time & dimensions in the ECC elliptic curve..

S=T*D
Vector= {B1,Bn}

(T*D)+Bn

VTable:

Steps S1 to Sn

Colour B1 to Bn + S1 to Sn

S1,Sn
B1,Bn
B1,Bn
B1,Bn

Rupert S

*

Einstein : Quad:20x30 Matrix table


With Einstein Formula being around 20 operations wide, 30 Lines long..
Single Operation Formula Matrix Tables could be popular,

Consequently matrix math : MTU/MAP processor features should be popular...

I take the view that 8 x 30 is about manageable on the Epyc & M2..
Bearing in mind that a 32 Wide x 32 Long Operations SiMD is achievable...

An AVX512 SiMD could run Quad operations (128Bit AVX) x 4,
So 20/4 = 5x; So 6x AVX512(128Bit Operation); Now there is; I believe; 1 AVX core per 2 Core Groups!

So 24 Core has 8x or 4x or 2x (8 or 4 Cores per die unit)!
So 84 Core units should have enough AVX512?

But one Mac M2... :D

In our case Einstein, the table is 20 Wide & 35 Long (roughly)

So : Einstein = Quad:20x35 | Alternative Quad:8x16, More manageable in
SiMD Parallel Executions; Quad:8x16 x 3, ....

One presumes strict aligned multiple multiplication

4x4 Tables still have utility for Science maths; But we need
to get the point across what we need for Einstein!

The Subject of 4x4 tables,

We are obviously looking for more like 16x16 for Physics maths!
The matrix processor is a large data set; Divisible into 4x2 & 4x4 &
8x8 groups for execution speedups,
Aligned Parallel processing....

Aligned Matrix tables need to be larger than 4x4 for Physics &
Chemistry; So a matrix processor ideally can at a minimum:

Matrix Table

x1
16x16

16/2
x2
8x8,8x8
8x8,8x8

8/4
x4
4x4,4x4
4x4,4x4
https://gpuopen.com/learn/matrix-compendium/matrix-compendium-intro/
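
A sketch of the division shown in the Matrix Table: one 16x16 table cut into a 2x2 grid of 8x8 tiles, & one 8x8 tile cut again into 4x4 tiles. It assumes a square matrix whose size is a multiple of the tile size; split_tiles is a hypothetical name.

```python
def split_tiles(matrix, tile):
    """Cut a square matrix into a grid of tile x tile sub-matrices."""
    n = len(matrix)
    return [[[row[c:c + tile] for row in matrix[r:r + tile]]
             for c in range(0, n, tile)]
            for r in range(0, n, tile)]

# 16x16 test table where each cell holds row * 16 + column
m16 = [[r * 16 + c for c in range(16)] for r in range(16)]

tiles8 = split_tiles(m16, 8)            # 8x8,8x8 / 8x8,8x8
tiles4 = split_tiles(tiles8[0][0], 4)   # 4x4,4x4 / 4x4,4x4
```

Each tile can then be handed to its own aligned SiMD group.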

https://marctenbosch.com/quaternions/
https://arxiv.org/abs/1101.4542

Quaternions > PGA Geometric : a+b+c : Rotational algebra : ax+by+c=0 | e1, e2, e3
https://www.youtube.com/watch?v=0i3ocLhbxJ4
https://www.youtube.com/watch?v=Idlv83CxP-8

Improving Structured Grid-Based Sparse Matrix-Vector Multiplication and Gauss–Seidel Iteration on GPDSP
https://www.mdpi.com/2076-3417/13/15/8952

SiMD Matrix Maths - Performance Portable SIMD Approach - Implementing Block Line Solver For Coupled PDEs
https://www.osti.gov/servlets/purl/1602621

SiMD Matrix Maths - Operations Details HIP AMD
https://rocm.docs.amd.com/_/downloads/en/latest/pdf/

SiMD double tables, M1 Matrix
https://developer.apple.com/documentation/accelerate/working_with_matrices


FMA AVX Performance table: 2Flops per Cycle per FMA Unit
Architecture Fast Instructions for FMA
https://www.uio.no/studier/emner/matnat/ifi/IN3200/v19/teaching-material/avx512.pdf

#RIP (Intro interesting!) Optimizing massively parallel sparse matrix computing on ARM many-core processor
https://www.sciencedirect.com/science/article/abs/pii/S0167819123000418

https://www.gamedeveloper.com/programming/implementing-a-3d-simd-geometry-and-lighting-pipeline
https://developer.apple.com/documentation/accelerate/working_with_matrices

CGAL is a Geometry & Matrix Math library for C++; Luckily OpenBLAS is a compatible library & AMD Makes a version in HIP
https://cpp.libhunt.com/cgal-alternatives

Matrix Libs : L1 means compatible with CGAL, A+ means i rate them highly on science community use : RS

CGAL (L1)
GLM (L1)
QuantLib (L1)
Ceres-Solver (L1)

OpenBLAS (A+)
Eigen (A+)
MiraCL (A+)

C++ Matrix Maths

MRPT is Camera & FFMPeg; complex install
https://docs.mrpt.org/reference/latest/compiling.html

C++ Matrix Maths : Simple
https://sourceforge.net/projects/arma/

C++ conversions between Numpy arrays and Armadillo matrices; Converts Into Numpy Py not out (needs work)
https://github.com/RUrlus/carma

https://sourceforge.net/software/product/NumPy/
https://sourceforge.net/software/product/NumPy/integrations/

Motivated applications of 3D Matrix Database ML

RS

Just shows how fast Blas & these NumPy & Arma & Mave is! 1998-man SigRS
Parallel matrix multiplication & diagonalization
https://www-users.york.ac.uk/~mijp1/teaching/grad_HPC_for_MatSci/Lecture4.pdf

Wasm Inefficiency
https://news.ycombinator.com/item?id=37387629

*

3D Matrix Web Codecs


Are presented as being JIT Compiler re-encoded when required; Frequently WebASM, WebGPU Code, JS...
Audio, Video, Sensation, Code Runtimes.

Web Codecs for devices are a modern concept & are available for common websites such as news & music,
devices such as Alexa Echo & Google Dot & Bluetooth Devices?

Media players & BT devices particularly suffer from small Storage potential!
So Web Codecs downloaded to the device from a source; Such as a smart phone or computer..
Are a clear-minded solution!

JIT Compiler

3D Matrix Tables in FMA, Mul & ADD code to be automatically recompiled locally when required!
Directed to a common API, Direct Compute, WebGPU, WebASM, Jit Compiler OpenCL

Many Operations can be done from unique device specific optimisation; Examples:

API, DirectX & OpenCL & Vulkan & WebGPU & WebASM
Texture & Audio Shaders.
Digital Streaming

Bluetooth NANO SiMD & API
Digital TV in H266, VP9 & AV1,

Locally compiled accelerators should be respected first; Such as the output & input 3D Matrix & CPU & GPU Acceleration engine..

Code can include Matrix converters into common output format such as WebP & Textures & BC, DXT Compression presentation; Vulkan, OpenCL & DirectX & Texture & Audio Shaders.

Java, JS & WebASM are examples with operator mechanisms & JIT Compiler optimisation..
Minimising storage requirements for good compatibility while maximising performance.

RS

Requirements:

https://science.n-helix.com/2022/08/jit-dongle.html
https://science.n-helix.com/2022/06/jit-compiler.html

https://science.n-helix.com/2023/02/smart-compression.html
https://science.n-helix.com/2022/10/ml.html
https://science.n-helix.com/2023/06/map.html

*

TPU & SiMD Parallel wavetables Pre-Calculation Meta-Data : RS

{ For data expansion & Precomputed Upscaling through meta data per frame sequence }
#MetaDATA #PreProcessing Parallel Text loading and machine learning processing : RS 23/07/2023

Pre-calculation table; For Example the Amiga uses tables for maths!
Pi, Common conversion maths & float results in higher precision...

Parallel Text loading and machine learning processing is one of the wonders of TPU & SiMD Parallel wavetables,

Pre Calculate Tables that reduce a workload to a simple process, and use them...

For example if you Upscale a movie & use dynamic settings, Such as:
Localised Sharpening & Selective Gaussian filtering; Such as GIMP's Edge detection Gaussian?
We compress information on the maths of selection..

The edges we selected, The methods we used & if those methods are dynamic then our selections...
Such a method is called a ..

Pre-calculation table; For Example the Amiga uses tables for maths!
Pi, Common conversion maths & float results in higher precision...

Common ones are learned at school
the log tables
Multiplication Tables
Common values such as gravity & Pi

Pre Computation
Upscaling
3D Audio basic resonance profile
Pre Computed values for a realistic world...
Experience & Learning to pre compute values...
This saves effort later in the process
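
As an illustration of the pre-calculation idea, here is a small sketch of a sine table with linear interpolation. The table size of 256 & the name `fast_sin` are assumptions chosen for the example, not values from the text.

```python
import math

TABLE_SIZE = 256  # assumed size; larger tables trade memory for accuracy
SINE_TABLE = [math.sin(2 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE + 1)]

def fast_sin(angle):
    """Approximate sin(angle) by linear interpolation in the precomputed table."""
    pos = (angle / (2 * math.pi)) % 1.0 * TABLE_SIZE
    i = min(int(pos), TABLE_SIZE - 1)   # clamp so i + 1 always stays inside the table
    frac = pos - i
    return SINE_TABLE[i] + (SINE_TABLE[i + 1] - SINE_TABLE[i]) * frac

# Compare against math.sin over one full period.
worst = max(abs(fast_sin(a / 100) - math.sin(a / 100)) for a in range(0, 629))
print(f"worst error over one period: {worst:.2e}")
```

The expensive call happens once when the table is built; Every later lookup is two reads, one multiply & one add.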

This is available to providers & game developers for:

TV Upscaling through Compressed Numeric Add table downloading
All streaming services processing such as Netflix, YouTube & Amazon Prime!
Partial pre-computed upscaling for game, application & processing..

Through TopCloud & HPC Pack

Data Stored as meta-data and saves on repeat processing time!
By creatively Pre Computing processes such as 3D Audio, VR Audio, Haptic 3D Maths..
Work such as Decompression & Compiling

Affects the efficiency of any process that will Pre Calculate Tables that reduce a workload to a group of simple processes.
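
A sketch of how a filter setting can be computed once & reused across frames, assuming a 1-D Gaussian blur as the example filter. `gaussian_kernel` & `blur_row` are hypothetical names & the frame data is made up for the demonstration.

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def gaussian_kernel(radius, sigma):
    """Build normalised 1-D Gaussian weights once per (radius, sigma) setting."""
    raw = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(raw)
    return tuple(w / total for w in raw)

def blur_row(row, radius, sigma):
    """Filter one row of pixel values, clamping reads at the row edges."""
    k = gaussian_kernel(radius, sigma)  # served from the cache after the first frame
    last = len(row) - 1
    return [
        sum(k[j + radius] * row[min(max(x + j, 0), last)] for j in range(-radius, radius + 1))
        for x in range(len(row))
    ]

frames = [[10.0] * 8, [0.0, 0.0, 0.0, 90.0, 0.0, 0.0, 0.0, 0.0]]
blurred = [blur_row(f, 2, 1.0) for f in frames]
print(gaussian_kernel.cache_info())
```

The same weights could equally be shipped as meta-data with the frame sequence, so the receiving device never computes them at all.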

We can majorly improve quality of both visuals & Audio; Any Pre-Calculatable element

The logic is that Upscaling, Colour enhancements & sharpening have pre-calculatable logic,
We can save many seconds of processing per frame,
We can reduce energy footprint
We can improve latency & frame rate
Works for games also,
Education media or Theaters & mass media content such as News & commonly watched content or movies or visited websites or fonts & media

We can improve, at a very minimum, Cutscenes & non-moving backdrops & tangible repeating Animation assets & Effects...

(c)Rupert S

FMA : Fused Multiply ADD : MUL+ADD & Precision functions


You may be assuming that only modern GPUs such as RTX 2080+ & RX 5700+ have this?

FMA is a feature of the business editions & FX Series on AMD & exists in granite ridge & other Intel,
So FMA F16 is possible with the F32 : F16 conversion features (F16C) present in, for example, the FX8320E...

So what does this mean? In terms of:

Chrome, which emulates a lot of its GPU functions in CPU..
In terms of Python ML that F16 feature combined with FMA is very helpful in learning & efficiency!
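
To make the FMA & F16 points concrete, here is a small sketch in plain Python. It emulates a fused multiply-add with exact fractions (the real instruction runs in hardware) & uses the `struct` module's half-precision format to show F32/F64 to F16 rounding. The names `to_f16` & `fma_emulated` are illustrative assumptions.

```python
import struct
from fractions import Fraction

def to_f16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fma_emulated(a, b, c):
    """Compute a*b + c exactly, then round once (the behaviour of a fused unit)."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a, b, c = 1.0 + 2.0**-30, 1.0 - 2.0**-30, -1.0
print("separate mul then add:", a * b + c)   # the product is rounded before the add
print("fused multiply-add:   ", fma_emulated(a, b, c))
print("1 + 2^-12 stored as F16:", to_f16(1.0 + 2.0**-12))
```

The separate multiply loses the small term to rounding, while the fused form keeps it; That single rounding is the precision benefit of FMA, on top of doing two operations in one instruction.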

In terms of CPU; mostly using 32Bit, F32, 64Bit, F64 is very helpful; in terms of SiMD,
F16 exists though; Even on the old FX8320E!

So we can use potentially: Int8, Int32, Int64, F64, F32, F16 & Float 182Bit as with FPU!
Best to do DEEP work with the CPU FPU & SiMD...

We do have these functions though!, But Deep work FPU 182Bit? CPU! Some GPU have double precision also!

What do we use this variety for? Many things!

Defined by our precision requirements; not all things are INT64 & FPU But not every issue is covered by..
The MP4v, MP4a F16! AC3 & AC4 for example F32; A glass? FPU 182... or many F32 or even more F16 work units.

Rupert S

Exponent factorisation : RS

8Bit, 16Bit, 32Bit, 64Bit Exponent theory.
Available to you-(EF)

A value in 8Bit is no use in a 16 Bit operation... or is it?

Firstly 8 Bit values can be loaded with Zeros into higher math precisions,
In normal maths we use a remainder; So we can load 8Bit values into 32Bit Int & that works...

2 F16 blocks would be 32Bit; As 2 16Bit Blocks? So what use is this ?
in a 64Bit & 32Bit processor storage of FPU-182Bit values is possible ...
32Bit Blocks * 6 with XOR 00
64Bit Blocks * 3 with XOR 00
2 * Largest value...

But parallelising F64 on groups for 182Bit? with multiplications roll left <> Right .. & Additions +- ...
Possible.
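
A sketch of storing one wide value in several 64-bit blocks & adding with carry, assuming three blocks (192 bits of storage, with the unused top bits left at zero). `to_limbs`, `from_limbs` & `add_limbs` are hypothetical helper names.

```python
MASK64 = (1 << 64) - 1

def to_limbs(value, count=3):
    """Split a non-negative integer into `count` 64-bit blocks, lowest block first."""
    return [(value >> (64 * i)) & MASK64 for i in range(count)]

def from_limbs(limbs):
    """Reassemble the wide integer from its 64-bit blocks."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))

def add_limbs(x, y):
    """Add two block lists with carry propagation, as a 64-bit ALU would."""
    out, carry = [], 0
    for a, b in zip(x, y):
        total = a + b + carry
        out.append(total & MASK64)
        carry = total >> 64
    return out, carry

wide = (1 << 181) + 12345          # a value that needs 182 bits
total, carry = add_limbs(to_limbs(wide), to_limbs(MASK64))
print(from_limbs(total) == wide + MASK64, carry)
```

Each block addition is independent apart from the carry, which is the part that has to be chained when the work is spread over parallel units.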

But if the resultant is beyond 8Bit ? & we wanted to save as 8Bit?

Factorisation of a 32Bit value into 8Bit is possible; But we need to factor it!
Well:

32Bit to 8Bit is 6:1, So we have to random roll 6 Bits for every 1
We can factor in HighLow with 1 bit or use 8Bit factor 256 & 8Bit Number...

We can Multiply, Add, Subtract or divide or fraction:

256(*/-)1>256, leaving us with a 32Bit value? Well what can we use this for ?

Example complex : N/(240*50); See the maths can roll into 16Bit values..
We can use them, Or load a particular object, Classifier, HASH, AES, ECC...
We can quickly classify as 16Bit resultant & still save as a particular 8Bit value!
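
One possible reading of the factor 256 idea, sketched with integer division. The helper names are hypothetical, & keeping the high byte as the saved 8Bit value is an assumption made for the example, not a rule stated in the text.

```python
def factor_256(n):
    """Split a 16-bit value into an 8-bit factor and an 8-bit number."""
    if not 0 <= n <= 0xFFFF:
        raise ValueError("expected a 16-bit value")
    return divmod(n, 256)  # n == factor * 256 + number

def classify_to_8bit(n32, divisor=240 * 50):
    """Reduce a 32-bit value to a 16-bit intermediate, then to an 8-bit class."""
    mid16 = (n32 // divisor) & 0xFFFF   # the N/(240*50) step from the text
    return mid16, mid16 >> 8            # assumed rule: keep the high byte as the class

factor, number = factor_256(51234)
print(factor, number, factor * 256 + number)
print(classify_to_8bit(600_000_000))
```

Both halves fit in 8Bit storage, & the original 16Bit value is recovered exactly with one multiply & one add.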

Images
Gains
Memories
Load file
load value
Random
Table Value
Compression!

(c)Rupert S,

Reference Int & FP Value Sizes; A reminder that Floats are 50% of highest Integer Value,
ROLLInt floats still have an amazing additional value!


https://science.n-helix.com/2023/02/smart-compression.html

F16b Adaptive Float value : Texture Color Palette Example : RS



Basic Example of F16b float in action on a colour palette: {F16b,F32b, F64b}

F16b is short remainder F16 & it has 8 Bits of 0.01 point value rather than 16,
So what do we mean ? What is significant about this?

F16b Has 24Bit precision integer with an 8 bit remainder!
So? So 16Bit + 8Bit = 24Bit! & 8bit point value...

In colour representation point values contribute to subtle blending;
So a full 24Bit contributes to 90% of the Color Palettes

So the 24Bit colour palette is 32Bit Colour Minus Alpha;
We can use F16b in HDMI & DisplayPort & inside the GPU & Also for textures & JPG'S..
Thereby I present F16b & F24Bit colours in F16b

This saves all data in single 32bit Spaces & therefore is both faster & higher resolution than comparable float value presentations.
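
A sketch of one way to read the layout described above: a 24Bit integer part plus an 8Bit point value stored in a single 32Bit space (a 24.8 fixed-point format). This interpretation, & the names `encode_24_8`, `decode_24_8` & `blend`, are assumptions for illustration only.

```python
FRACTION_BITS = 8                 # the 8-bit "point value" described above
SCALE = 1 << FRACTION_BITS
MAX_RAW = (1 << 32) - 1           # 24 integer bits + 8 fraction bits

def encode_24_8(value):
    """Store a non-negative number as a 24-bit integer plus an 8-bit fraction."""
    raw = round(value * SCALE)
    if not 0 <= raw <= MAX_RAW:
        raise ValueError("value does not fit in 24.8 format")
    return raw

def decode_24_8(raw):
    """Convert a stored 24.8 value back to a float."""
    return raw / SCALE

def blend(raw_a, raw_b, weight_256):
    """Blend two encoded values; weight_256 runs 0..256 (256 = all of raw_b)."""
    return (raw_a * (256 - weight_256) + raw_b * weight_256) >> 8

rgb = 0xFF8040                                  # a 24-bit colour in the integer part
stored = encode_24_8(rgb + 0.5)
print(hex(stored), decode_24_8(stored))
print(decode_24_8(blend(encode_24_8(10), encode_24_8(11), 64)))
```

The blend of 10 & 11 lands between the two integer levels, which is the subtle blending the point value is meant to carry.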

Bound to make a big difference to Blu-ray, but particularly DVD & AC3 & AC4;
F16b Adaptive Float value : Texture Color Palettes Example;

(you can use F16b * R,G,B,A) in HDMI & DisplayPort, Massive colour improvements; Lower RAM Costs

Rupert S

AnPa_Wave - Analogue Pattern Wave Vector SiMD Unit : (c)RS


The base symphony is harmony, In other words waveforms; There are a couple of Simple methods that really work:

High performance Float values F16, F32, F64, FPU

Q-Bit Quantum; All forms of Quantum wave work
Radio waves;
Light patterns
Photon wave patterns; single & multiple
Sound hardware; 1 to 3 Bit DAC; Audio conversions; Sample range
Analogue chips that work on harmony & frequency
SVM Elliptic curve maths
Sin, Arc, Tan, Time, Vector

In essence Harmony & frequency is the equivalent of Complex Elliptic curve maths

A Music note score suffices to specify harmony basics:

Waveform shape in 3D
Harmony / Disharmony
Vibration High / Vibration Low
Power High / Power Low
Volts High / Volts Low
Watts High / Watts Low
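
A short sketch of how a note score can specify a waveform, assuming equal temperament with A4 = 440 Hz & MIDI note numbers as the score format. `note_to_freq` & `render` are hypothetical names; `power` stands in for the Power High / Power Low entry.

```python
import math

def note_to_freq(midi_note, a4=440.0):
    """Equal-temperament frequency for a MIDI note number (69 = A4)."""
    return a4 * 2.0 ** ((midi_note - 69) / 12)

def render(midi_note, power=1.0, samples=480, rate=48_000):
    """Sample a sine wave for one note; `power` scales the amplitude."""
    freq = note_to_freq(midi_note)
    return [power * math.sin(2 * math.pi * freq * n / rate) for n in range(samples)]

# Harmony: sum three notes of an A major triad and average them.
chord = [sum(s) / 3 for s in zip(render(69), render(73), render(76))]
print(note_to_freq(69), note_to_freq(81), len(chord))
```

Each note is an independent wave, so a SiMD unit can render the notes in parallel lanes & sum them at the end.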

(c)Rupert S

https://science.n-helix.com/2023/07/3dchiplet.html

Wonderful Wave-Pattern Analogue waveforms in meta materials - Pattern recognition in reciprocal space with a magnon-scattering reservoir
https://www.nature.com/articles/s41467-023-39452-y.pdf

*

Vectors & maths
https://science.n-helix.com/2022/08/simd.html
https://science.n-helix.com/2022/04/vecsr.html
https://science.n-helix.com/2016/04/3d-desktop-virtualization.html
https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html
https://science.n-helix.com/2023/02/smart-compression.html

Networking & Management
https://science.n-helix.com/2023/06/tops.html
https://science.n-helix.com/2023/06/ptp.html
https://science.n-helix.com/2023/06/map.html
https://science.n-helix.com/2023/02/pm-qos.html
https://science.n-helix.com/2022/08/jit-dongle.html
https://science.n-helix.com/2022/06/jit-compiler.html
https://science.n-helix.com/2022/03/ice-ssrtp.html
https://science.n-helix.com/2022/01/ntp.html

Faster Maths & ML
https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html
https://science.n-helix.com/2021/02/multi-operation-maths.html
https://science.n-helix.com/2021/11/parallel-execution.html
https://science.n-helix.com/2022/12/math-error-solve.html
https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
https://science.n-helix.com/2022/10/ml.html

Focus on Quality
https://science.n-helix.com/2022/09/ovccans.html
https://science.n-helix.com/2022/11/frame-expand-gen-3.html
https://science.n-helix.com/2022/03/fsr-focal-length.html

For when {U, X, Y, Z} = N Expressions https://is.gd/ForWhen_UXYZ_N
For when {(A+B/2)} = C Expressions https://is.gd/ForWhen_ABx2_C

Hallelujah RS Light-Wave SiMD https://www.allaboutcircuits.com/news/lightelligence-reports-worlds-first-optical-network-on-chip-processor/

RS Spectre Mitigations https://science.n-helix.com/2018/01/microprocessor-bug-meltdown.html
ZenBleed Parallel Solvent RS 2023 https://science.n-helix.com/2023/07/zenbleed.html

Core/CPU/GPU security core SSL/TLS BugFix
https://science.n-helix.com/2020/06/cryptoseed.html
https://science.n-helix.com/2019/05/zombie-load.html

Secure Configuration:
https://is.gd/SSL_NetSecurity_NTP_PTP
https://is.gd/EthernetTunnelOpt
https://is.gd/SSL_Optimise

PTP & NTP Improve security WW https://is.gd/PTP_TimeStream

*****

Running Code

https://is.gd/UpscaleWinDL

https://is.gd/HPC_HIP_CUDA

PoCL Source & Code
https://is.gd/LEDSource

PoCL-Direct
https://is.gd/PoCL_Source

X86Features-Emu
https://drive.google.com/file/d/15vXBPLaU9W4ul7lmHZsw1dwVPe3lo-jK/view?usp=usp=sharing

https://www.amd.com/en/developer/rocm-hub/hip-sdk.html#tabs-ddafbba141-item-c6b9ce2aab-tab
https://rocm.docs.amd.com/en/docs-5.5.1/deploy/windows/quick_start.html

AMD 23.Q3Pro_HIP #HPC #DirectML MatrixMathOps 'Release unto me the great! Chobokniki' Thine Prayers Answered https://is.gd/AMD23Q3PRO_HIP
Run the .reg after install; Before reboot https://is.gd/AMDRebarReg

**********
https://en.wikipedia.org/wiki/Cell_(processor)

https://www.khronos.org/news/permalink/ibm-releases-opencl-drivers-for-power6-and-cell-b.e/

Not Accessible
https://www.alphaworks.ibm.com/tech/opencl
**********

AI: Artificial Intelligence
ML: Machine Learning
PULP: Parallel Ultra Low Power

ML Network Types


DNN: Deep Neural Network
CNN: Convolutional Neural Network
QML: Quantum Machine Learning
QPU: Quantum Processing Unit

RNN: Recurrent Neural Network
SNN: Spiking Neural Network
MLP: Multi-Layer Perceptron

NN: Neural Network
TNN: Ternary Neural Network
QNN: Quantized Neural Network

HDL: Hardware Description Language
HLS: High Level Synthesis

Maths Operations


FMA: Fused Multiply-Add
GEMM: General Matrix Multiply
SIMD: Single Instruction Multiple Data
SIMT: Single Instruction Multiple Thread

SP: Single Precision
DP: Double Precision
FLOPS: Floating Point Operations per Second

Processor Types & RAM

ASIC: Application Specific Integrated Circuit

SoC: System on Chip
PCU: Programmable Computing Unit
NoC: Network on Chip

CPU: Central Processing Unit
VPU: Vector Processing Unit
NPU: Neural Processing Unit
TPU: Tensor Processing Unit
FPGA: Field-Programmable Gate Array

RISC: Reduced Instruction Set Computer
CISC: Complex Instruction Set Computer

NDP: Near Data Processing

PIM: Processing In-Memory
IMC: In-Memory Computing

SRAM: Static Random Access Memory
VRAM: Video Random Access Memory
DRAM: Dynamic Random Access Memory
PCM: Phase Change Memory
BRAM: Block Random Access Memory
RAM: Random Access Memory
RRAM: Resistive RAM

*****

Matrix Array Processor Unit (c)RS


[M.A.P] [=====] [H.P.C] - Matrix Array Processor Unit (c)RS

This document describes the design and implementation of a novel computing device called the Matrix Array Processor Unit (M.A.P.U).

The M.A.P.U is a novel co-processor that performs high-speed parallel operations on multi-dimensional arrays of data, such as those used in quantum computing, machine learning and computer graphics, using quantum-inspired principles.

The Matrix Array Processor is a type of processor that is designed to handle multi-directional and multi-dimensional arrays per Qbit.

It is used in quantum computers and relies on percentage-based 3D processing to handle all 3D array processing.

The central tasks map to probability over networks and MAP units in arrays.

The M.A.P is composed of multiple interconnected units that can process multi-dimensional arrays in parallel, using a percentage-based 3D processing scheme.

The M.A.P can be integrated with existing CPU, GPU and DPU architectures, as well as with other M.A.P units, to form a scalable and flexible computing platform.

The difference between the Matrix Array Processor and other processor classes such as:

SIMD (Single Instruction Multiple Data),
SISD (Single Instruction Single Data),
MISD (Multiple Instruction Single Data),
MIMD (Multiple Instruction Multiple Data),
Vector processors,
Systolic Arrays,

is that the Matrix Array Processor is designed to handle multi-directional and multi-dimensional arrays per Qbit,

while vector-style processors are designed to operate efficiently on large one-dimensional arrays of data called vectors.

The M.A.P.U consists of three main components:

The Matrix Array Processor (M.A.P),
The High Precision Central Core (H.P.C),
The Bus Connections and Networking (=====).

Core Definitions 3D M.A.P:

[H.P.C]:

A high-precision central core that can handle complex tasks such as probability mapping, network routing and memory management.

The H.P.C is the central controller of the M.A.P.U.

It coordinates the execution of tasks across the M.A.P units, assigns probabilities to different outcomes, and handles complex calculations that require high precision or accuracy.

Each [H.P.C] unit can connect to 8 [M.A.P] units and optionally to other [H.P.C] units in different layers of the 3D matrix.

The [H.P.C] can also communicate with external devices such as CPUs, GPUs, DPUs, or networks via the bottom layer of the wafer.

[M.A.P]:

The M.A.P is a specialized processing unit that can execute multiple arithmetic and logical operations on a single array element in one clock cycle.

It performs arithmetic on multi-dimensional arrays using a dot product-like algorithm.

Each M.A.P has 8-way interconnects to communicate with neighbouring M.A.P units in the same layer or adjacent layers and with a central [H.P.C] unit.

The M.A.P can also access local cache or RAM for storing intermediate results or constants.
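
The dot product-like algorithm can be sketched in a few lines. This is a plain software illustration, not the hardware design; `dot` & `matmul` are illustrative names, & each output cell stands for work one M.A.P unit could take.

```python
def dot(row, col):
    """Multiply-accumulate: one MUL and one ADD per element pair."""
    acc = 0.0
    for a, b in zip(row, col):
        acc = a * b + acc
    return acc

def matmul(a, b):
    """Each output cell is an independent dot product, so cells can run in parallel."""
    cols = list(zip(*b))  # transpose b so columns can be iterated as rows
    return [[dot(row, col) for col in cols] for row in a]

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
print(matmul(a, b))
```

The accumulator `acc` is the kind of intermediate result the local cache or RAM would hold between steps.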

[=====]:

A bus connection that enables data transfer and networking among the M.A.P units and the [H.P.C] units.

The bottom layer of the wafer contains a high-resolution bus that connects to the onboard controllers and networks and the external CPU, GPU and DPU devices.

The ===== supports different communication protocols and topologies, such as mesh, torus, or hypercube.

The ===== also provides fault tolerance and load balancing mechanisms to ensure reliable and efficient performance.
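
For reference, the three topologies named above differ only in how a unit finds its neighbours. A minimal sketch, assuming units addressed by (x, y) grid coordinates for mesh & torus and by an integer address for the hypercube; the function names are hypothetical.

```python
def mesh_neighbours(x, y, width, height):
    """4-connected mesh: edge nodes simply have fewer links."""
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    return [
        (x + dx, y + dy)
        for dx, dy in steps
        if 0 <= x + dx < width and 0 <= y + dy < height
    ]

def torus_neighbours(x, y, width, height):
    """Mesh with wrap-around links, so every node has four neighbours."""
    return [
        ((x - 1) % width, y),
        ((x + 1) % width, y),
        (x, (y - 1) % height),
        (x, (y + 1) % height),
    ]

def hypercube_neighbours(node, dimensions):
    """Flip one address bit per link."""
    return [node ^ (1 << bit) for bit in range(dimensions)]

print(mesh_neighbours(0, 0, 4, 4))
print(torus_neighbours(0, 0, 4, 4))
print(hypercube_neighbours(0b000, 3))
```

A corner unit has two links in the mesh but four in the torus, which is one reason wrap-around links help with load balancing.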

The M.A.P.U is designed to be scalable and modular.

It can be stacked in three dimensions to form a larger array of processors that can handle more complex and diverse tasks.

The M.A.P.U can also be customized for different applications by changing the size, shape, or configuration of the M.A.P units, the H.P.C cores, or the ===== network.

The following diagrams illustrate the structure and functionality of the M.A.P.U.

Top View

[M.A.P][M.A.P][M.A.P]
[M.A.P][H.P.C][M.A.P]
[M.A.P][M.A.P][M.A.P]
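
The Top View can be modelled as a 3x3 tile with the [H.P.C] in the centre. The sketch below is only a data-structure illustration; `build_tile` & `attached_units` are hypothetical names.

```python
def build_tile():
    """3x3 tile from the Top View: H.P.C in the centre, M.A.P around it."""
    return [["H.P.C" if (r, c) == (1, 1) else "M.A.P" for c in range(3)] for r in range(3)]

def attached_units(tile, row=1, col=1):
    """Coordinates of the units surrounding the given cell (8 for the centre)."""
    return [
        (r, c)
        for r in range(row - 1, row + 2)
        for c in range(col - 1, col + 2)
        if (r, c) != (row, col) and 0 <= r < len(tile) and 0 <= c < len(tile[0])
    ]

tile = build_tile()
links = attached_units(tile)
print(len(links), all(tile[r][c] == "M.A.P" for r, c in links))
```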


Side View 3D


[M.A.P][H.P.C][M.A.P]
[M.A.P][=====][M.A.P]
[=====][H.P.C][=====]
[M.A.P][=====][M.A.P]
[=====][H.P.C][=====]
[M.A.P][=====][M.A.P]
[=====][H.P.C][=====]

Each [H.P.C] Central contains RAM & connections to the 8 [M.A.P] units & optionally to the layers above & below in the 3D Matrix,
The bottom of the wafer contains a high resolution bus to onboard controllers & networks & DPU/GPU/CPUs

Array = Matrix Array Processor Unit (c)RS

ffffffff ffffffff ffffffff
........+ ........*+ ........*
........+ ........*+ ........*
........+ ........*+ ........*

f=fp,unit
*=mul
+=add
.=Cache/Ram

Simple absolver table for MUL:ADD : MUL* Only = +0, +- Only = N*1 then +-
% = / 100 + ADD Table {N1 <> N...} : Result!
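
One way to read the absolver table is that a single multiply-add lane covers every case: multiply alone adds zero, add or subtract alone multiplies by one, & a percentage is a multiply by p/100 followed by entries from an ADD table. The sketch below follows that reading; the function names & the ADD table handling are assumptions.

```python
def mul_add(n, mul=1, add=0):
    """One lane of the array: result = n * mul + add."""
    return n * mul + add

def multiply_only(n, m):
    return mul_add(n, mul=m, add=0)      # MUL only: add +0

def add_only(n, a):
    return mul_add(n, mul=1, add=a)      # +- only: N*1 then +-

def percent(n, p, add_table=()):
    """% = /100, followed by entries from an ADD table."""
    result = mul_add(n, mul=p / 100)
    for entry in add_table:
        result = add_only(result, entry)
    return result

print(multiply_only(6, 7), add_only(6, -2), percent(200, 15, add_table=(1, 2)))
```

Because every operation is the same instruction with different constants, the whole array can run one instruction stream, which suits SiMD.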

The M.A.P unit can perform operations on multi-dimensional arrays using a combination of:

Floating-point units (f), Multiplication units (*), Addition units (+) and cache/ram units (.).

The M.A.P unit can support different data types such as DOT4, INT8, INT16, F16, F32 and F64.

The M.A.P co-processor is a cutting-edge technology that can enable new applications in fields such as artificial intelligence, machine learning, scientific computing and more.

(c)Rupert S

References: DOT4, INT8, INT16, F16, F32, F64 (c)Rupert S

https://is.gd/LEDSource

https://science.n-helix.com/2023/06/map.html

https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html
https://science.n-helix.com/2021/02/multi-operation-maths.html
https://science.n-helix.com/2021/11/parallel-execution.html
https://science.n-helix.com/2022/12/math-error-solve.html
https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
https://science.n-helix.com/2022/10/ml.html

Sparse matrix multiplication in SRM array
https://www.science.org/doi/10.1126/sciadv.adf7474

Error Correction Options & Mitigation
https://futurism.com/ibm-breakthrough-quantum-computing

**********


Light Processors (c)Rupert S https://science.n-helix.com


Light processors : Access to advanced : Storage Cache, Random Access RAM Cache & Processor architecture: Starting with SiMD Simple Vector Instruction Set

Complex forms are a goal, Start simple : The world will thank you!
Simple as SiMD appears there are many uses,
Considering that higher instruction sets are delayed by SiMD space & speed priorities..

Array = Matrix Array Processor Unit (c)RS

ffffffff ffffffff ffffffff
........+ ........*+ ........*
........+ ........*+ ........*
........+ ........*+ ........*

f=fp,unit
*=mul
+=add
.=Cache/Ram

Simple absolver table for MUL:ADD : MUL* Only = +0, +- Only = N*1 then +-
% = / 100 + ADD Table {N1 <> N...} : Result!

Cache is also a priority with manyfold application of simple data transfer & buffering to solid storage,
Power outage is our main concern so that we save all our work.

SSD is an obvious solution to backing up speedily,
However we do use RAM Cache for this goal..

The goal of speeding storage access up,
Light does all the work types we need:

List:
Data transit
Cache
Processing via dimensions & signal variance
RAM (Cyclic light transfer) Same principle as fibre optic cable over large distances.
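
The cyclic light RAM idea can be sized with simple arithmetic: the bits held in a loop equal the loop delay multiplied by the bit rate. A rough sketch, assuming a fibre refractive index of 1.5 & a 10 Gbit/s signal; Both figures are illustrative assumptions, not values from the text.

```python
C_VACUUM = 299_792_458.0  # speed of light in vacuum, metres per second

def loop_delay_seconds(length_m, refractive_index=1.5):
    """Time for light to travel once around a fibre loop."""
    return length_m * refractive_index / C_VACUUM

def bits_in_flight(length_m, bit_rate, refractive_index=1.5):
    """How many bits are circulating in the loop at any instant."""
    return int(loop_delay_seconds(length_m, refractive_index) * bit_rate)

print(loop_delay_seconds(1_000.0))        # about 5 microseconds per kilometre
print(bits_in_flight(1_000.0, 10e9))      # roughly 50 thousand bits at 10 Gbit/s
```

So a kilometre of fibre holds only a few kilobytes at that rate; Capacity grows with loop length, bit rate & the number of wavelengths sharing the loop.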

(c)Rupert S https://science.n-helix.com

Quantum ! Light Compute : Reference material : RS

Yes we can solve classic problems with light computers, Light computers perform geometry & quantitative sampling (Comment by inventor) Rupert S

Light Compute : Reference material : RS
https://science.n-helix.com/2012/09/geometric-calculating-machines.html

https://science.n-helix.com/2020/03/single-photon.html

https://science.n-helix.com/2014/07/the-formula-of-geometric-volumes.html

https://science.n-helix.com/2018/07/universeal-algebra-paper.html

https://science.n-helix.com/2018/06/compression-libraries-index-prime.html

https://science.n-helix.com/2013/08/light-theory-on-creation-of-3d-image.html

https://science.n-helix.com/2018/06/uses-for-micro-laser-light-emitting.html

https://science.n-helix.com/2020/04/render.html

https://science.n-helix.com/2019/06/vulkan-stack.html

https://science.n-helix.com/2019/06/kernel.html

https://science.n-helix.com/2019/05/compiler-optimisation.html

https://science.n-helix.com/2018/09/hpc-pack-install-guide.html

https://science.n-helix.com/2020/04/cern.html

"Let's Play" Station NitroMagika_LightCaster

Let's face it, Realtek could well resource the "Original QFFT Audio device & CPU/GPU"

The mic works by calculating angle on a drum...
Light.. and timing & dispersion...
The audio works by QFFT replication of audio function..
The DAC works by quantifying as Analog digital or Metric Matrix..
The CPU/GPU by interpreting the data of logic, Space & timing...

We need to calculate; Quantum is not the necessary feature;

But it is the highlight of our:

Data storage cache.
Our Temporary RAM
Our Data transport..
Of our fusion future.

(c)Rupert S https://science.n-helix.com

"Weedbrook points out that as yet, and in contrast to Google’s Sycamore, the Chinese team’s photonic circuit is not programmable, so at this point “it cannot be used for solving practical problems”."
https://www.nature.com/articles/d41586-020-03434-7

https://scitechdaily.com/ai-boosted-by-parallel-convolutional-light-based-processors/

https://interestingengineering.com/worlds-fastest-most-powerful-neuromorphic-processor-for-ai-unveiled

Physicists in China challenge Google’s ‘quantum advantage’
Photon-based quantum computer does a calculation that ordinary computers might never be able to do.
Philip Ball

This photonic computer performed in 200 seconds a calculation that on an ordinary supercomputer would take 2.5 billion years to complete. Credit: Hansen Zhong

A team in China claims to have made the first definitive demonstration of ‘quantum advantage’ — exploiting the counter-intuitive workings of quantum mechanics to perform computations that would be prohibitively slow on classical computers.

They have used beams of laser light to perform a computation which had been mathematically proven to be practically impossible on normal computers. The team achieved within a few minutes what would take half the age of Earth on the best existing supercomputers. Contrary to Google’s first demonstration of a quantum advantage, performed last year, their version is virtually unassailable by any classical computer. The results appeared in Science on 3 December.

“We have shown that we can use photons, the fundamental unit of light, to demonstrate quantum computational power well beyond the classical counterpart,” says Jian-Wei Pan at the University of Science and Technology of China in Hefei. He adds that the calculation that they carried out — called the boson-sampling problem — is not just a convenient vehicle for demonstrating quantum advantage, but has potential practical applications in graph theory, quantum chemistry and machine learning.

“This is certainly a tour de force experiment, and an important milestone,” says physicist Ian Walmsley at Imperial College London.

Quantum advantage challenged

Teams at both academic and corporate laboratories have been vying to demonstrate quantum advantage (a term that has now largely replaced the earlier ‘quantum supremacy’).

Last year, researchers at Google’s quantum-computing laboratory in Santa Barbara, California, announced the first-ever demonstration of quantum advantage. They used their state-of-the-art Sycamore device, which has 53 quantum bits (qubits) made from superconducting circuits that are kept at ultracold temperatures.

But some quantum researchers contested the claim, on the grounds that a better classical algorithm that would outperform the quantum one could exist. And researchers at IBM claimed that its classical supercomputers could in principle already run existing algorithms to do the same calculations in 2.5 days.

To convincingly demonstrate quantum advantage, it should be unlikely that a significantly faster classical method could ever be found for the task being tested.

The Hefei team, led by Pan and Chao-Yang Lu, chose a different problem for its demonstration, called boson sampling. It was devised in 2011 by two computer scientists, Scott Aaronson and Alex Arkhipov, then at the Massachusetts Institute of Technology in Cambridge. It entails calculating the probability distribution of many bosons — a category of fundamental particle that includes photons — whose quantum waves interfere with one another in a way that essentially randomizes the position of the particles. The probability of detecting a boson at a given position can be calculated from an equation in many unknowns.

200 seconds

But the calculation in this case is a ‘#P-hard problem’, which is even harder than notoriously tricky NP-hard problems, for which the number of solutions increases exponentially with the number of variables. For many tens of bosons, Aaronson and Arkhipov showed that there’s no classical shortcut for the impossibly long calculation.

A quantum computer, however, can sidestep the brute-force calculation by simulating the quantum process directly — allowing bosons to interfere and sampling the resulting distribution. To do this, Pan and colleagues chose to use photons as their qubits. They carried out the task on a photonic quantum computer working at room temperature.

Starting from laser pulses, the researchers encoded the information in the spatial position and the polarization of particular photon states — the orientation of the photons’ electromagnetic fields. These states were then brought together to interfere with one another and generate the photon distribution that represents the output. The team used photodetectors capable of registering single photons to measure that distribution, which in effect encodes the calculations that are so hard to perform classically.

In this way, Pan and colleagues could find solutions to the boson-sampling problem in 200 seconds. They estimate these would take 2.5 billion years to calculate on China’s TaihuLight supercomputer — a quantum advantage of around 10^14.

Practical problems

“This is the first time that quantum advantage has been demonstrated using light or photonics,” says Christian Weedbrook, chief executive of quantum-computing startup Xanadu in Toronto, Canada, which is seeking to build practical quantum computers based on photonics.

Walmsley says this claim of quantum advantage is convincing. “Because [the experiment] hews very closely to the original Aaronson–Arkhipov scheme, it is unlikely that a better classical algorithm can be found,” he says.

However, Weedbrook points out that as yet, and in contrast to Google’s Sycamore, the Chinese team’s photonic circuit is not programmable, so at this point “it cannot be used for solving practical problems”.

But he adds that if the team is able to build an efficient enough programmable chip, several important computational problems could be solved. Among those are predicting how proteins dock to one another and how molecules vibrate, says Lu.

Weedbrook notes that photonic quantum computing started later than the other approaches, but it could now “potentially leap-frog the rest”. At any rate, he adds, “It is only a matter of time before quantum computers will leave classical computers in the dust.”

https://scitechdaily.com/ai-boosted-by-parallel-convolutional-light-based-processors/

"AI Boosted by Parallel Convolutional Light-Based Processors

By EPFL JANUARY 7, 2021

Schematic representation of a processor for matrix multiplications which runs on light. Credit: University of Oxford

The exponential growth of data traffic in our digital age poses some real challenges on processing power. And with the advent of machine learning and AI in, for example, self-driving vehicles and speech recognition, the upward trend is set to continue. All this places a heavy burden on the ability of current computer processors to keep up with demand.

Now, an international team of scientists has turned to light to tackle the problem. The researchers developed a new approach and architecture that combines processing and data storage onto a single chip by using light-based, or “photonic” processors, which are shown to surpass conventional electronic chips by processing information much more rapidly and in parallel.

The scientists developed a hardware accelerator for so-called matrix-vector multiplications, which are the backbone of neural networks (algorithms that simulate the human brain), which themselves are used for machine-learning algorithms. Since different light wavelengths (colors) don’t interfere with each other, the researchers could use multiple wavelengths of light for parallel calculations. But to do this, they used another innovative technology, developed at EPFL, a chip-based “frequency comb,” as a light source.

“Our study is the first to apply frequency combs in the field of artificial neural networks,” says Professor Tobias Kippenberg at EPFL, one of the study’s leads. Professor Kippenberg’s research has pioneered the development of frequency combs. “The frequency comb provides a variety of optical wavelengths that are processed independently of one another in the same photonic chip.”

“Light-based processors for speeding up tasks in the field of machine learning enable complex mathematical tasks to be processed at high speeds and throughputs,” says senior co-author Wolfram Pernice at Münster University, one of the professors who led the research. “This is much faster than conventional chips which rely on electronic data transfer, such as graphic cards or specialized hardware like TPU’s (Tensor Processing Unit).”

After designing and fabricating the photonic chips, the researchers tested them on a neural network that recognizes hand-written numbers. Inspired by biology, these networks are a concept in the field of machine learning and are used primarily in the processing of image or audio data. “The convolution operation between input data and one or more filters — which can identify edges in an image, for example, are well suited to our matrix architecture,” says Johannes Feldmann, now based at the University of Oxford Department of Materials. Nathan Youngblood (Oxford University) adds: “Exploiting wavelength multiplexing permits higher data rates and computing densities, i.e. operations per area of processor, not previously attained.”

“This work is a real showcase of European collaborative research,” says David Wright at the University of Exeter, who leads the EU project FunComp, which funded the work. “Whilst every research group involved is world-leading in their own way, it was bringing all these parts together that made this work truly possible.”

The study is published in Nature this week, and has far-reaching applications: higher simultaneous (and energy-saving) processing of data in artificial intelligence, larger neural networks for more accurate forecasts and more precise data analysis, large amounts of clinical data for diagnoses, enhancing rapid evaluation of sensor data in self-driving vehicles, and expanding cloud computing infrastructures with more storage space, computing power, and applications software.

Reference: “Parallel convolutional processing using an integrated photonic tensor core” by J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stappers, M. Le Gallo, X. Fu, A. Lukashchuk, A. S. Raja, J. Liu, C. D. Wright, A. Sebastian, T. J. Kippenberg, W. H. P. Pernice and H. Bhaskaran, 6 January 2021, Nature."

https://interestingengineering.com/worlds-fastest-most-powerful-neuromorphic-processor-for-ai-unveiled

"A new optical neuromorphic processor developed by Swinburne University of Technology can operate more than 1000 times faster than any previous processor. The processor for artificial intelligence (AI) functions faster than 10 trillion operations per second (TeraOPs/s).

Optical micro-combs

The invention could revolutionize neural networks and neuromorphic processing in general. “This breakthrough was achieved with ‘optical micro-combs', as was our world-record internet data speed reported in May 2020,” said in a statement Swinburne’s Professor David Moss.

Micro-combs are new devices made up of hundreds of infrared lasers all held on a single chip. Compared to other optical sources, they are much smaller, lighter, faster, and cheaper.

The new innovation demonstrated by the Swinburne team uses a single processor while simultaneously interleaving the data in time, wavelength, and spatial dimensions through a single micro-comb chip.

“In the 10 years since I co-invented them, integrated micro-comb chips have become enormously important and it is truly exciting to see them enabling these huge advances in information communication and processing. Micro-combs offer enormous promise for us to meet the world’s insatiable need for information," added Moss.

Co-lead author of the study Dr. Xingyuan (Mike) Xu explained how this innovative use of micro-combs is giving the researchers a glimpse into the processors of the future.

Cost and energy reductions

Distinguished Professor Arnan Mitchell from RMIT University added that the "technology is applicable to all forms of processing and communications" and will result in significant future cost and energy consumption reductions.

“Convolutional neural networks have been central to the artificial intelligence revolution, but existing silicon technology increasingly presents a bottleneck in processing speed and energy efficiency,” said key supporter of the research team, Professor Damien Hicks from Swinburne and the Walter and Eliza Hall Institute.

“This breakthrough shows how a new optical technology makes such networks faster and more efficient and is a profound demonstration of the benefits of cross-disciplinary thinking, in having the inspiration and courage to take an idea from one field and using it to solve a fundamental problem in another.”"
