Friday, June 23, 2023

[M.A.P] [=====] [H.P.C] - Matrix Array Processor Unit (c)RS

Matrix Array Processor Unit (c)RS

[M.A.P] [=====] [H.P.C] - Matrix Array Processor Unit (c)RS

The M.A.P Processor is ideal as a Tensor Unit, For Small Array Solving; Such as MP3, MP4 & AC4 3D Audio,
The Base Map is simply to Fit a large static conversion M.A.P into the device,
For example a 32Bit Audio Sample Pluse 3D Layer for Bluetooth would simply be around 64Bits for Stereo 32Bit Audio MP4; Plus 32Bits for the 3D Map,
The M.A.P Process is not static; But you stick to the maths you wanted.

In parallel instructions, one calls interrupts if bad; IRQ & DMA Notes if you want to have better performance,
But in a processor Internals you have to call the main loops in your App; & OS Task Instruction cache..

Instruct The loop; Don't Interrupt; Stop, Look, Listen! Look, Slowdown, Showtime!

Integer instructions multiple parallel example of The principle of,
M.A.P is based on wide multiple instructions, This suites AVX & SiMD,
Particularly in 16Bit Multi Parallel Instruction Mode

Rupert S

Soft Interrupt IRQ: Faster CPU Cycles: RS

A Soft Interrupt is where you direct the interrupt register to a compiled Code Block..
The code block handles the Wait Queue in a gentle way that allows processing to continue & Ram to be accessed..

While the HDD directly writes the IRQ messages to the Code Block; The Code block is below the size of Cache on the Processor..

In advanced scenarios the Soft Int Caches Read/Write in RAM while Directing DMA & R/W Cached Cycles; Good Bioses & Software do this.

But in a processor Internals you have to call the Main Micro loops (Soft Int) in your App; & OS Task Instruction cache.


Interrupts particularly effect the Processor functions such as..
Machine Learning Load & Store of Frames, Also the internet..
In such as Network cards offloading is often required to handle interrupts..


VPDM-ST-LRS : Verified Processor Direct Memory Space Transactions Load, Register & Save (c)RS

In Concurrence with DM-TCP & DM-UDP & DM-Quicc Soft Interrupt IRQ

For SI-IRQ to safely directly write RAM for a SiMD & CPU/TPU; The following protocol is observed:

1 DMA Memory Management Processor, Device Bios/PCI Bus & Network Chipset/Network card..
Shall directly code check incoming traffic; But shall not void EEC Mode error check...

Bear in mind that AES, Common TLS & Packet Compression are in effect!
So you shall be using Networking features directly through the Transparent H.D.L Hardware Device Layer...

In effect the MMU & Network adapter transparently offload directly to Device Topography RAM & Cache!

2 The network card Certifies transactions & offloads security to internal features; Main Certification is still TPM & HMS.

3 You can handle directly to Processor of memory space matches internet Bit-depth; However this is usually 32Bit as with IP4 & 64Bit with IP6..

4 So the MMU & Network chipset work in sync; EEC, Security, TLS, M.S.T: Memory Space Translation...

5 VPDM-ST-LRS : Verified Processor Direct Memory Space Transactions Load, Register & Save (c)RS

So to be clear Automated Load, Register & Save Networking; Yes,
Device Low Level Firmware Translation Transactions; Yes
Processor Direct Memory Space Transactions; No, With Verification? Yes

To stop per Frame IO being a high cost transport processing; We process the entire frame per In/Out,
The same with TCP/UDP/Quicc; We process per whole Bit; For example 192Bits (SSL,AES),
Packet containment & control protocols; Mainly because Half packets caused inefficiency!

Rupert S


Embedded Hardened Pointer Table Cache for 3D Chips : RS

Based on PCI Edge RAM, Internal Loop Dynamic RAM; With internalised DMA Memory transfers..

In the process the feature has the ability to set a page table; 1MB, 2MB, 4MB, 16MB > 1TB,The Ram can be internally written to without invoking ALU or OS,

Pages are allocated; The GPU is an example; Physical pages are allocated in RAM that is directly Set by OS & Firmware/ROM Parameters...

Internal access to the RAM is set within the page allocation set, But all internal mapping & paging is done directly & though ALU & Memory Management Unit MMU.

With 1MB Cache set aside per feature; Not entirely unreasonable these days...

Most if a process such as SiMD can be carried out on internal loops..

Depending on Cache/RAM Space; Based on PCI Edge RAM

Internal DataSet Size based on Dynamic RAM Variable; That is set per USE &Or Per Settings or application,

That being said; RAM Allocations best be per session & directly after Setting is changed on reboot or refresh, Load & unload cycling.

Rupert S


Gather/Scatter Microcode no-overload ALU or Data/Code Cache, Just L3/RAM

When we look at the Instructions of the SiMD; We could see potential in them to further improve the Gather/Scatter Instructions; Although it has to be said that the instructions are well optimised!
Like many pre-Fetching Assembly code for earlier years they are well created & quick!

But we can do several things with them; So what ?

We can directly fetch the Cache in the code & Link to cache locations using linking (if we have enough & we do at L3/L2)

We can make a Hardlink table in cache(L3) for load and save processing (64Kb, Including header)

We can directly invoke pre-fetch with a system call (With SoftLink Pointer Tables)

We can incache modify (if a directive is singular in a chain of a, b, c, d)
We can individually SysCall a direct load of a single {a, b, c, d) statement & not reload it all...

For this we need a matrix table in L3 RAM; We can do this if we keep the table under 512KB,
But we do not intend to be selfish & RAM is fast these days! So we can directly load a single matrix Element {a, b, c, d} & not refresh the loading cycle for the code...

Thus we do not have to overload ALU or Data/Code Cache, Just L3/RAM

Rupert S


Temporary HardLinking in Prefetching Matrix instructions,

Gather/Scatter operations of localised random scattering of information to ram & retrieval

for (i = 0; i < N; ++i)
x[i] = y[idx[i]];

for (i = 0; i < N; ++i)
y[idx[i]] = x[i];

Firstly i read statistical gathing & Seeding; Pre-Fetching is a method of anticipating & preloading data,
So what do i want to do ? In Vector Matrix Prefetch Logical Gather

Potentially i would like to use:

Softlink (ram retrieval & multiple value)
HardLink (maths)
Prefetching logic {such as,

Run length prefetching,
Follow & Forward loading Cache,
Entire instruction load & Timing Pre-fetch & Statistic for Loop time & load frequency

So on any potential layout for SiMD Matrix a most likely configuration is:

A B = C : Mul or ADD

So a logical statement is, A, B Gather/Seed C; Directly logical AKA Prefetch
A B C D; Logical fields of prefetch are localised to parameter...

Only likely to draw data from a specific subset of points,
Byte Swapping is obviously A1 B1,2,3

Most specifically if the command is a hardlink With A B C; Then most likely Storage is directly linked; Like a HardLink on a HDD in NT,

The hard link is direct value fetching from a specific Var table & most likely a sorted list!
If the list is not sorted; We are probably sorting the list..

If we do not HardLink data in a matrix (Example):

Var = V+n, Table
a b c d

A Matrix HardLink is a temporary Table specific logical reading of instructions & direct memory load and save,
Registers {A,B,C,D}=v{1,2,3,4}..

Directly read direct memory table logic & optimise resulting likely storage or retrieval locations & Soft Link (pointer table)

Solutions include multiple Gather/Scatter & 'Gather/Scatter Stride' Cube Block multi load/save..
Logical Cache Storage History Pointer Table, Group Sorted RAM Save/Load by classification {A,B,C,D}=v{1,2,3,4}
When X + Xa + Xb + Xc, When Y + a b c, When Y or X Prefetch Pointer Table + Data { a, b, c }

Example Gather/Scatter logical multiple

var pointer [p1] {a ,b, c, d}
var pointer [p2] {1 ,2, 3, 4}

for (i = 0; i < N; ++i)
x[i] = y[idx[i]];
fetch y {p1, p2}; {a, b, c, d}:{1 ,2, 3, 4}

for (i = 0; i < N; ++i)
y[idx[i]] = x[i];
send x {p1, p2}; {a, b, c, d}:{1 ,2, 3, 4}
Rupert S : Reference


FMA is a Matrix SiMD feature & is common to ARM & AMD, CPU & GPU

Phone SIM cards can use FMA for GSM network acceleration,

We can use FMA fused MUL ADD for elliptic curve encryption to multiple Time * curve & ADD AES encryption in the form of time model & 3D dimensions,

Therefore we can use FMA to calculate the room area & add audio reverberation matrix as volume levels over time..

FMA as a basic GPU..

We can convert adder & fused MUL ADD ML,

Use all 3 types on integer function of CPU & internal GPU on echo dot type device's with internal GPU and CPU.. FPGA design.

Rupert S


Pre-Fetching; Statistically Ordered Gather/Scatter & The Scatter/Gather Commands

(SiMD) The gather/scatter commands may seem particularly random?
But we can use this in machine learning:

The equivalent of Gathering a group of factors or memories into a group & thinking about them in the context of our code! (our thought rules),

Now if we think about scatter; we have to limit the radius of our through to a small area of brain matter (or ram)... Or the process will leave us "Scatter-Brained"

Statistical Pre-Fetching:

Ordered Scatter
When you know approximately where to scatter

Ordered Gather
Where you know approximately where to gather

Free Thought
So now we can associate scatter & gather as a form of free thought? Yes but chaotic...
So we add order to that chaos! We limit the scattering to a single field.

Stride is the equivalent of following a line in the field; Do we also gather &Or Scatter while we stride ?
Do we simply stride a field?

Now to answer this question we simply have to denote motive!
In seeding we can scatter; Will we do better with an Ordered Scatter ? Yes we could!

Statistically Ordered Gather/Scatter & The Scatter/Gather Commands

Rupert S


Multi-line Packed-Bit Int SiMD Maths : Relevance HDR, WCG, ML Machine Learning (Most advantaged ADDER Maths)

The rules of multiple Maths with lower Bit widths into SiMD 256Bit (example) 64Bit & 128Bit & 512Bit can be used

In all methods you use packed bits per save, so single line save or load, Parallel, No ram thrashing.

You cannot flow a 16Bit block into another segment (the next 16Bit block)

You can however use 9 bit as a separator & rolling an addition to the next bit means a more accurate result!
in 32Bit you do 3 * 8bit & 1 * 4Bit, in this example the 4Bit op has 5 Bit results & The 8Bit have 9Bit results..
This is preferable!

2Bit, 3Bit, 4Bit Operation 1 , 8Bit Operations 3: Table

4 : 1, 8 : 3

4 : 2, 8 : 6
2 : 1, 7 : 8
3 : 1, 8 : 1, 16 : 3

Addition is the only place where 16Bit * 4 = 64Bit works easily, but when you ADD or - you can only roll to the lowest boundary of each 16Bit segment & not into the higher or lower segment.

A: In order to multiply you need adaptable rules to division & multiply
B: you need a dividable Maths unit with And OR & Not gates to segment the registered Mul SiMD Unit..

In the case of + * you need to use single line rule addition (no over flow per pixel)..
& Either Many AND-OR / Not gate layer or Parallel 16Bit blocks..

You can however painful as it is Multi Load & Zero remainder registers & &or X or Not remainder 00000 on higher depth instructions & so remain pure!

8Bit blocks are a bit small and we use HDR & WCG, So mostly pointless!

We can however 8Bit Write a patch of pallet & sub divide our colour pallet & Light Shadow Curves in anything over 8Bit depth colour,

In the case of Intel 8Bit * 8 Inferencing unit : 16 Bit Colour in probably (WCG 8 * 8) + (HDR 8 * 8) Segments,

In any case Addition is fortunately what we need! so with ADD we can use SiMD & Integer Today.

Rupert S


M.A.P NPU Matrix Processor Dimensional construct (c)RS

Primary reason for expansion of function data sets: 2D, 3D,< nD

P.D.C is a worker thread parallel 2D or 3D Grid,
Utilising QQ & A, B,C Array maths allows us to collapse or expand dimensions in a flexible way,

The same principles as SVM (S.V.M SiMD Vector matrix) can be used to culminate or expand dimensions...

That way a M.A.P Processor can expand or collapse all mathematical constructs,
We can therefore use all mathematical & statistical arrays for machine Learning & Maths.



The Subject of 4x4 tables,

We are obviously looking for more like 16x16 for Physics maths!
The matrix processor is a large data set; Divisible into 4x2 & 4x4 & 8x8 groups for execution speedups,
Aligned Parallel processing....

Aligned Matrix tables need to be larger than 4x4 for Physics & Chemistry; So a matrix processor ideally can at a minimum:

Matrix Table






Matrix Method (c)RS

Any GPU & CPU SiMD can do a form of Matrix maths in an Array Parallel Load & Run as consecutive tasks..

Like So

Matrix Formulas : (c)RS

SiMD Array A to X, Usually 8, 16, 32, 64 Parallel Groups

Grouped Parallel Runs
A 1, 2, 3, N
B 1, 2, 3, N
Y 1, 2, 3, N
X 1, 2, 3, N
Run 1 {A1, B1 to X1, Y1} Run 2+ {A2, B2 to X2, Y2}++ {An, Bn to Xn, Yn}

Matrix Processor Method Synchronous Cube Map Usually 8x8, 16x16, 32x32, 64x64 Parallel Quad++ Groups

2D:3D Cube

A 1, 2, 3, N
B1, 2, 3, N
C1, 2, 3, N
D1, 2, 3, N

Run 1 2D:3D Cube {
A 1, 2, 3, N
B1, 2, 3, N
C1, 2, 3, N
D1, 2, 3, N

Run N 2D:3D Cube {
A 1, 2, 3, N
B1, 2, 3, N
C1, 2, 3, N
D1, 2, 3, N

Rupert Summerskill


SiMD Matrix maths begins with a 3D graph,


The graphs principal of 3 dimensions; We can use more dimensions but on paper we need to represent dimensions in colours so that all 3 dimensions that we can draw; are represented.

In algebra we represent 3+ dimensions with small glyphs next to each letter that represents our maths operation theoretical number.

During operation of computation we maintain in memory the specific dimensions interactions and interplay of complex matrix maths.

Rupert S

Numbers example 4D matrix

I love you 2, I love you 3, I love you 4 the ends of time... To be continued...



Directed Matrix Principle : RS

Matrix Principle directed at traditional parallel Integer & SiMD Instruction groups

The main problem with 32KB L1 tables is cache filling & domination of CPU/GPU by single program instruction groups..

Instruction cache is the primary challenge; Because Instruction cache L1 is commonly 32KB; Data cache 64KB,
L2 is 512KB to 4MB; L3 4MB to 16MB (can be more on Epyc)..

Optimised instruction groups by instruction, SiMD multiprocessing thread count:

Firstly requirements: (32KB instruction Cache L1, 512KB L2, 8MB L3)

L1 Instruction Group 32KB
L2 running group 512KB
L3 RAM & storage direct fetching 8MB

8KB core table for group threading,
24KB of grouped & Synchronised instructions

Data work Groups 512KB L2 / 64 Instruction Group sets (L1 32KB Table),
So Main instruction groups from L1 with larger data sets.

L3 4MB to 8MB of data & instruction caching load (directed from L1 & funneled into L2)

Instructions are cross threaded directly though L3 & L2 synchronised Load, Run & Save,

Optimised instruction groups by instruction, SiMD multiprocessing thread count.

Rupert S


Parallel Arrays : Matrix forms : RS

Matrix processor is a feature that will be more common & is relatively similar to an Abacus with a multiple array of + & * Operators..

Now a Matrix Array is X1 > Xn & Y1 > Yn

Commonly an array of 16 x 16 but can be 8 x 8 or 4 x 4,

Now we can perform such operations as Relativity & String theory on a lattice & that is very fast!

We can also perform these functions on SiMD, AVX in parallel; Such that 256Bit SiMD is 32Bit x 8 Parallel & so forth

a : 64Bit
b : 64Bit
c : 64Bit
d : 64Bit


Now we can see that we can perform a matrix operation such as lattice with both SiMD & SiMD-Matrix,

We can also see that a Matrix shall & can present our solution & that SiMD can also!
But we need Long operation SiMD or many passes to complete our operations; If Larger than our size..

We can also therefore most likely..

Use AES-NI S Letter Box & SVE & Matrix & SiMD to our advantage for many Lattice operations.

Multiplier Matrix Accelerated Encryption, Like i said A Parallel SiMD array may do the same; If all memory arrays are connected by a single RAM/Cache ALU Node,

As stated Parallel Arrays & Parallel Matrix Arrays.

Rupert Summerskill

Bluetooth LE Protocol


Examples of Parallel execution pipeline : Parallel arrays:

Crypto lattice, Kyber/ML-KEM, AES : Parallelised Lattices, 8x & 16x Parallel SiMD F16/32/64/128/192/256Bit

parameterisation of groups of 4x Parallel SiMD F16 & 8x Parallel SiMD F16

Parallelised motion & Video/Audio Deblocking/Blocking

8x8 16x16 quantification of video is common in VVC & H265 & H264 & JPEG & MP3, MP4a & AAC,
Suggested parameterisation of 4x Parallel SiMD F16

8x8 16x16 quantification of video is common in HDR VVC & H265 & H264 & JPEG & MP3, MP4a & AAC & AC3 & AC4,
Suggested parameterisation of 4x Parallel SiMD F32

Shapes in motion 2D : 4x per Cube in motion,
Shapes in motion 2D : 6x per Texture Shaded Cube in motion,

Shapes in motion 3D : 6x per Cube in motion,
Shapes in motion 3D : 8x per Texture Shaded Cube in motion,



Number relativity, Bit precision: RS

In gaming a player has access to palette of 16bit FFFFFFFFFFFFFFFFFFFF.FFFFFFFF BF16 F=16 HEX; In 32bit memory storage.

Average gamers recognise maybe 32000 colours directly,

Colour rich artist colourist's recognise almost 6000000 colours  TOPCloud.

Variety is king & queen of experience,
Artists specialist recognises more colours than a basic gamer or graphics artist in vectors..

Matrix maths operations precision is relative to hardware,

RollINT precision 1 to 4 bit + integer -1 to 4 bit F, FFFF, FFF+.F Xbox Or FFFFFFF+.F Ps

Bit precision is relative to your experience!

Rupert S


RollINT - Machine Learning for Console & Computer : RS

With True Value memory/Operation cache...

Application of RollINT to machine learning with definition,
A Playstation APU has 8Bit Integers for inference; XBox 4Bit..

In order to describe 4Bit as float; You would need to define 3Bit & 1Bit R remainder,
So how does this work?

In loading value the first 3Bit is the value & the 4th bit is remainder & when you load the value stored..

You fetch 3Bit as the value & 1 Bit as the remainder; Example:

FFFe > Value FFF &R e, So the value is FFF.e not FFFe
you can do multiple data type operations in this method; For example:

FFde = FF & de or or you could do Ffde & mean F.fde; Useful for definitions of Pi,

For example Pi in 4Bit (8Bits Prefered); Commonly used by kids at school!,

However you convert the stored 4Bit Pi to a fully accurate value on FPU & SiMD execution by loading pre-stored true value.


We are using roll to roll a zero on or off an integer,

Therefore we are able to divide and multiply and add so that..

101-0 > 10.1+0 No can range practically from 0 to 00000000 practically.

So 10023-000 > 10.023+000

We can then store floating point numbers in integers.

(C) Rupert S,

Reference Int & FP Value Sizes; A reminder that Floats are 50% of highest Integer Value,
ROLLInt floats still have an amazing additional value!


RollINT : The Float Perfectionist

Playstation & XBox are primary examples where the Int8 unit could do a RollINT Floating point operation for machine learning that is specific to float FPU Solves,

Edge detection, Sharpening & Adaptive Contrast & Colour HDR..

Depending if you directly roll on SiMD & FPU then you can still sharpen with the bF16 & half precision FPU/SiMD Maths operations on the final run!

Imagine Luke SkyWalkers final Torpedo Salvo as FPU/SiMD Vectors DT



Scaler is an argument for the role of RollINT & also a pointer to method

RollINT : A Float view of machine learning,
Essentially the core issue is the role float may play in a result...

Not the human mind does use a common integer format with a small float remainder?
Potential for this configuration is mainly because Integer values are in the main Substantive information..

Float value (the sub decimal place below 0.); Is in essence a precise small value of high importance to skills such as jumping, Running, Motions & skill actions like shooting..

Integer is the majority of action related to large steps; Particularly because people have the capacity to change from Meter to Centimetre to Millimetre,

Justifications for Float values diminish if you have scalar units such as the meter, the Yard, foot, Inch & 16th!

However; As may be pointed out, Roll Scalar? Is a form of floating unit expression; If Scalar measurements are regarded in terms of static's; Then Yes Integer:{Meter; FPU:{cm, mm} is a float value!

Nonetheless Scaler is an argument for the role of RollINT & also a pointer to method..

Scaling you see; is everything to detail; If you want to see this? Magnify or Zoom & Wide angle!
We further scale; By hitboxing our ML; In other words by training the AI on Centric value rewards..

AI Content:

{Content value reward targets};
{Centric Core values};

Return = Value;
end = infinite
Test Loop {AI C, End}; Begine

Epoches = {Satisfied End}

Rupert S


Float & Integer : RollINT : In Depth Analytics

RollINT List

Floats with small precision values : RollINT

Dreams have 'Small Randoms', Minor details make a true reality

(OS & Chrome Example)
The size of frames & text alignment
Main colour groups for desktop & browser colours : FFFFFF.FF
Frames forward & backward with submenus are worthy of low precision floats : FFF.F 300 Frames 16 sub allocated positions inside frame:{SubFrame}

Both low & high precision

High Efficiency ZLib, GZip Ram compression
Localised Error correction

Colour depth & contrast HDR, Low error rate/Higher



RollINT Versus Metric principle of float reduction : RS

Scale correctly & avoid that FPU being needed

Scale correctly first; Example mouse is Millimetre & Micrometre & Large scale Centimetre,
Photon Microscope is Picometre, Milimetre, Centimetre,
Telescope is Kilometer, Metre, Milimetre..
Screens UpScale & Zoom, Do we need to rescale our measurement ?

X+- , Y+- 2D+- central point measurements
Int16 2 -32,768 32,767
Int32 4 -2,147,483,648 2,147,483,647
Int64 8 -9,223,372,036,854,775,808 9,223,372,036,854,775,807 (might want to use floats; A lot quicker)

Precision Floats
16Bit Half 2 ±65504
32Bit Single 4 ±3.4 x 1038
64Bit Double 8 ±1.7 × 10308

The main attack Vector being mice & touchscreens & utility scopes & measuring devices...
We wanted DPI without stress!

A range of options exist when using RollINT; The idea is to Roll a float on operation; To be fair hardware like the Amiga has the concept of Integer operation with a float as the final result..

However that option Is "the Final result" & does not mean that you could use RollINT to make a repeated Float maths for applications..

However RollINT could be used 2 Significant ways:

You could use FPU on the result (Previous integer operations save FPU for other tasks)
You could receive an Integer result from the float operation (Final float value on multiple operations not important to you?)

Perform Metrification & therefor avoid float value use; for example expand the data into a higher precision mode,

The principle of the Metric system is to use sub parts to reduce the necessity of floats : Meter, Centimetre, Milimetre, KG, Gram, Ounce..

So avoiding a floating unit..

The method is multiple operations, Large, Small, Smaller & can in reality be repeated down to picometer or tiny weights...

This method is multiple operation rounds,

RollINT & FPU Avoid rounds of CPU Cycles; But options exist.



As you know the Matrix Array Processor is now frequent with Intel, Mac M1 & M2, AMD & NVidia Versions..

Quantum computers rely on Multi-Directional & Multi-Dimensional Arrays per Qbit!

Well this is a design structure for a Multi-Array Multi-Connection Matrix Array Processor..

The principle is basically quite logical!

Multi-Array Multi-Connection Matrix Array Co-Processor - Quanta Light Compute 2023-06-23

Percentage based 3D Processing to handle all 3D Array processing,

Central [H.P.C] Tasks map to probability over Networks [=====] & [M.A.P] Units in arrays

Table define


[M.A.P] = M.A.P , M.A.P 8 Way interconnect,
[H.P.C] = M.A.P High Precision Central Core,
[=====] = Buss Connections & networking


Top View


Side View 3D


Each [H.P.C] Central Contains RAM & connections to the 8 [M.A.P] & Optionally to layers above & bellow in 3D Matrix,
Bottom of wafer contains high resolution buss to onboard controllers & networks & DPU/GPU/CPU's

Array = Matrix Array Processor Unit (c)RS

ffffffff ffffffff ffffffff
........+ ........*+ ........*
........+ ........*+ ........*
........+ ........*+ ........*


Simple absolver table for MUL:ADD : MUL* Only = +0, +- Only = N*1 then +-
% = / 100 + ADD Table {N1 <> N...} : Result!

(c)Rupert S


Standard SiMD Features, Byte Swap, ADD,MUL[SSimd]
8 x Cache,Mul,ADD: [8xCMA]


[SSimd] is additional features accessed by register poke, Standard Operation is CMA & RAM
[8xCMA] is used as RAM in most SiMD Operations & MUL+ADD, ADD, MUL

In SiMD Ops
On RAM upto 3x F16 can be stored (3xF16, F32 + F16, F48, F24x2)

MUL or ADD Operations can be {F16:F16:F16, F32 *+- F16, F24 *+- F24}
Operations are saved to Master Cache & sent to RAM or other functions & can be {F16, F24, F32, F48},
Because master cache is a full buffer; you have to save it first! before reuse!

Design uses the M.A.P basic MUL+ADD & RAM

(c)Rupert S

References: DOT4, INT8, INT16, F16, F32, F64 (c)Rupert S

Nx-DeepMatrix Engines

Experimental CPU Proof : A proposal for an Open RISC V Processor, Statistical diagrams of function & graphs with function use under load...

ML Batch Matrix MAP in FPGA

ML Compressed Dynamic16bit-8Bit - Hardware-friendly compression and hardware acceleration for ML Transformer

Matrix Processors - Memory & command - All-Digital Compute-In-Memory FPGA Architecture for Deep Learning Acceleration

Matrix Processors - Inline Ram & Command { CMD : RAM }:{NET}


Cooperative Matrix Math : RS

Cooperative Matrix is a Math type where you formulate a Grid of number & math notations & solve them in sync,

The consequence for you is that the maths is both Faster; More Complex But also easier to correct for errors...

Usually Matrix Maths is used for Algebra, Image & 3D Mapping ML; Such as to see, Maps & Dungeons, Water tables, Technology Development.


Var = V+n, Table
     a      b      c     d

There are 3 main ways for matrix maths:

V1a {/,*,+,-},Value, %, Fraction V1b, V2a, V2b : In effect a dither map or calulation; So connected.
Vector groups {V1a<>z} Maths to {V2a<>z} to {V3a<>z} to {V4a<>z} & more ..

Sorted by Type of operation example
M = Multi Complex Operations In Groups
    a         b        c        d
3[V3] / [V3] / [V3]/[V3]

Refer to : Var = V+n, Table

Matrix Accumulator Header Matrix : {MAHM}
SiMD Wave : 32, 64 Group with finalised result + ALU : Work Group Wave Matrix : {WGWM}
Wave Matrix Accumulator Cube : {WMAC}




CTP-HTM : CPU, TPU, Processor Hypervisor Thread Management : RS

Parallel Group Threads:

Work groups by Aligned by:

Work Group Size (aligned by Bit):

Memory Range {Half Float, b16Bit,b32Bit, 16Bit,32Bit , Double Float}
Aligned Cluster Size,
Bit-depth & Length of code

The logic is that Parallel Group Threads with the same Code complexity & Size should finish around the same time,
They also typically require the same processor priority so that system tasks have Runtime Availability.


Guide to Cooperative Matrix Math : RS

Base principle of the Matrix & Graph goes beyond Accumulation of numbers..
I am reminded by microsofts dev post of Excel & Spreadsheet applications..

Yes they Graph/Matrix; But math solves require it! For example the Acidity/Alkaline matrix with Protons & Electrons,

However a more sophisticated form is algebra; But you have to simply the Algebra & put that in a table..
Einstein, Shrodinger, Physics, Chemestry & DNA By connection...

Algebra is the main reason we would use Float : {bF16 <> bF32} {Single Precision <> Double Precision} SiMD,
The chief objective is the solve; Complex SiMD offer the answer of flexibility..

Simple absolver table for MUL:ADD : MUL* Only = +0, +- Only = N*1 then +-
% = / 100 + ADD Table {N1 <> N...} : Result!

(c)Rupert S

Graph Accumulator Multiply ADD - Cooperative Matrix

SDK Sample :

AMD 23.Q3Pro_HIP #HPC #DirectML MatrixMathOps 'Release unto me the great! Chobokniki' Thine Prayers Answered
Run the .reg after install; Before reboot


Inference & FMA De-Block Styles

For upscaling matrix: MMX+ & SiMD
16x16 Block as used just about in HD,
8x8 Blocks Certainly NTSC, PAL, JP_NTSC!,
Very usable for deblocking JPG,
16x16 & 8x8 is very good for Inferencing active on Scaling & Deblocking..

4x4 for main Inference XBox & 8x8 for PS5..
XBox can use (4x4)x4 for 8x8 & (4x4)x16 for 16x16; Very powerful!
PS5 can use (8x8)x1 or x2 for 8x8 & (8x8)x4 (x8 for additional processing) for 16x16; Very powerful!

​The table solves common issues with 4Bit & 8Bit direct loading of colour tables of the F16 Types..
16Bit is a bit more common in older hardware & luckily quite a lot more flexible!
But 8Bit & 4Bit inferencing have a number of uses...

Indirect load though F16 Register can work by sideloading the operation; With Inferencing Sub routine coding & Returns,
Processing the actual inference but losing data store & returns just information..

Sub Routine INT8 & INT4 can:
Directly manipulate a small palette; Scoped Palette,
Single channel colour or multiple operations..
Load, Store & Save

Inference & FMA De-Block Styles List

(4x4)x16 + processing
(4x4)x32 +++ processing

(8x8)x8 + processing
(8x8)x16 + processing

(16x16)x1 + processing
(16x16)x2 ++ processing
(16x16)x4 +++ processing

8:4Bit Concepts: 65535/255=8Bit 65535/16=4Bit

16bit/4bit : 4Bit colour pallet, But we can fraction 16Bit/4bit in essence 16/4! 65535/16; Compression Shapes & Gradients.
Polygon, Shadow, Contact
Alpha Channel 2Bit, 4Bit
Grayscale edge define sharpening
Single Colour Edge detect
Shape Fill in Alpha 10,10,10,2
Xor, Pattern, Shading, Shader, Cull, Shape & Depth Compare after define

For when {U, X, Y, Z} = N Expressions
For when {(A+B/2)} = C Expressions



An example use of FMA Cooperative Matrix

In the example we use a formula like (U/X²)+(U/Y²)+(U/Z²)
Firstly the x²,y²,z² are MUL, So we need a * table or maybe with FMA we can use a (MUL)+0 ?
My primary observation is that we can use 2 methods:

MUL (U/X²), (U/Y²), (U/Z²) in tables, I suggest 3 * or FMA (MUL)+0
Or we can perform tables in order but complete all the MUL operations in Sync & then ADD with FMA,
Sync : (U/X²)+(U/Y²)+(U/Z²) to (Un/X²)+(Un/Y²)+(Un/Z²)

F1 = First Operation F2 = Second operation R = Result {R1:R3 = R4}

R1=(U/X²) R2=(U/Y²) R3=(U/Z²)
R1=+ R2=+ R3 = R4

So we have an example where MUL & then ADD is usable; But we could use Synced FMA

For when {U, X, Y, Z} = N Expressions


Brilliant examples of matrix maths

VXEdDSA & XEdDSA & X25519 & X448

SiMD-Matrix Maths example - Wave retrieval from quad-polarized Chinese Gaofen-3 SAR image using an improved tilt modulation transfer function

SiMD-Matrix Maths example D-Waves


High speed Per operation Cycle operations of D R² Pi

An (A[diameter]*B²[Pi] : D * R² operation is 2 Cycles, this specialised Arc, Sin, Tan operation can be accomplished a couple of ways in a single cycle,

Options table : D R² Pi

Firstly by sideways memory load in lower Single Precision to double precision output in a SiMD

You need to pre cache R²You can use the same value for R or for D &or both
You can pre cache all static D &or R, So you can vary either D or R & single cycle
You need to perform 2 operations , Diameter & R² & obviously they are relational!

For examples:

R = Atom Zink (standard size!) Cache D R
You move a compass but the needle is the same size! Cache D
You draw faces but the width is the same, Cache D
You draw faces but the Shape is the same but size is not! Cache R

Rupert S


How you use FMA, Basic MUL+ADD examples first & then Mul & ADD

Firstly in video,
MUL a float set A * B + C
Video Upscaling basic A:Pixel * B:PixelDiffRightPixel + C:RightPixel,
Do that 16 Times per pixel pair and you have 16*Interpolate, So a 16* Data set Wave!
You could obviously use a 32* Wave SiMD & do 4x8; So 4 Pixel groups per Wave.

So for example you can ADD Log Gama or other simple values, In A * B + C,
Pixel Values or whatever, You can use Point float 0.001 for example to do division on floats.

For all personal maths that you imagine:
Simple absolver table for MUL:ADD : MUL* Only = +0, +- Only = N*1 then +-
% = / 100 + ADD Table {N1 <> N...} : Result!

Interpolation & smoothing :

The method i am thinking of is ADD Mul/Div : Edge Left A+B Edge Right = C Center, (A to C)<>(C to A)

(A+B)/2 = C

Factor A_to_C
16 Steps

Factor C_to_B
16 Steps


((A-C)/16)=F | (F* A over C)=F Step * 16 over Time or distance

(Call slope)
find 16 Fractions of A To C
find 16 Fractions of C to B

For when {(A+B/2)} = C Expressions


Pixel A to B, Interpolation upscaling

from A1 to B16 ADD Difference of A - B

Red A1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10 : 11 : 12: 13 : 14 : 15 : 16B
Green A1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10 : 11 : 12: 13 : 14 : 15 : 16B
Blue A1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10 : 11 : 12: 13 : 14 : 15 : 16B

Tables can be 16 Wide & 16 Long to advantage ourselves of Byte aligned F16

Pixel A to B, Interpolation upscaling



R,G,B Value of A
R,G,B Value of B
RCv = Value per pixel of 16

Which is higher RA or RB
if RA
RA - RB = RC
RB - RA = RC

RB{1 to 16} repeat +- RCv

Sorry about the coding RS

Rupert S


FMA AVX Performance table: 2Flops per Cycle per FMA Unit
Architecture Fast Instructions for FMA

Reference Tables

Operators in C
● Arithmetic
a + b, a – b, a*b, a/b, a%b
● Bitwise
a | b, a & b, a ^ b, ~a
● Bit shift
a << b, a >> b (signed), a >> b (unsigned)
● Logical operators
a && b, a || b, !a
● Comparison operators
a == b, a != b, a < b, a <= b, a > b, a >= b
● Tertiary operator
x = a ? b : c
● Special functions:
sqrt(x), abs(x), fma(a,b,c), ceil(x), floor(x)

Fast division for constant divisors

Calculate r = a/b where b is a constant
With floating point we precompute (at compile time
or outside of the main loop) the inverse ib = 1.0/b.
r = ib*a
Floating point division with constant divisors
becomes multiplication
With integers the inverse is more complicated
ib,n = get_magic_numbers(b);
r = ib*a >> n

Integer division with constant divisors becomes
multiplication and a bit-shift

Fast Division Examples
● x/3 = x*1431655766/2^32
27*1431655766/2^32 = 3
● x/1000 = x*274877907/2^38
10000*274877907/2^32 = 10
● x/314159 = x*895963435/2
7*314159*895963435/2^48 = 7

Dividing integers by a power of two can be done with a bit shift which is very fast.


High-Performance Elliptic Curve Cryptography: A SIMD Approach to Modern Curves


Triangle 3D Matrix graphs


Vector table for audio & video or graphics..

We will use integers for the 3D audio presentation & SiMD fpu for MP4 & AC4 & Alac decompression..


So we will be using a form of float unit called..


We are using roll to roll a zero on or off an integer,

Therefore we are able to divide and multiply and add so that..

101-0 > 10.1+0 No can range practically from 0 to 00000000 practically.

So 10023-000 > 10.023+000

We can then store floating point numbers in integer.

(C) Rupert S,

Reference Int & FP Value Sizes; A reminder that Floats are 50% of highest Integer Value,
ROLLInt floats still have an amazing additional value!


ECC elliptic curves & Gradients : RS

Leveraging FMA fused MUL ADD on Internet & Software ...

For examples:

Gradients vector compression..

Colour A to colour B

Compare dif {A:B}
Transform A over steps B

Same colour ranges {R,G,B}

(A - B) = Dif
Shift B over steps = A

Store Vec VTable = steps


Steps S1 to Sn

Colour B1 to Bn + S1 to Sn


Same with time & dimensions in the ECC elliptic curve..

Vector= {B1,Bn}



Steps S1 to Sn

Colour B1 to Bn + S1 to Sn


Rupert S


Einstein : Quad:20x30 Matrix table

With Einstein Formula being around 20 operations wide, 30 Lines long..
Single Operation Formula Matrix Tables could be popular,

Consequently matrix math : MTU/MAP processor features should be popular...

I take the view that 8 x 30 is about manageable on the Epyc & M2..
Bearing mind that a 32 Wide x 32 Long Operations SiMD is achievable...

An AVX512 SiMD could run Quad operations (128Bit AVX) x 4,
So 20/4 = 5x; So 6x AVX512(128Bit Operation); Now there is; I believe; 1 AVX core per 2 Core Groups!

So 24 Core has 8x or 4x or 2x (8 or 4 Cores per die unit)!
So 84 Core units should have enough AVX512?

But one Mac M2... :D

Einstein : Quad:20x30 Matrix table

With Einstein Formula being around 20 operations wide, 30 Lines long..
Single Operation Formula Matrix Tables could be popular,

Consequently matrix math : MTU/MAP processor features should be popular...

I take the view that 8 x 30 is about manageable on the Epyc & M2..
Bearing in mind that a 32 Wide x 32 Long Operations SiMD is achievable...

An AVX512 SiMD could run Quad operations (128Bit AVX) x 4,
So 20/4 = 5x; So 6x AVX512(128Bit Operation); Now there is; I believe; 1 AVX core per 2 Core Groups!

So 24 Core has 8x or 4x or 2x (8 or 4 Cores per die unit)!
So 84 Core units should have enough AVX512?

But one Mac M2... :D

In our case Einstein, the table is 20 Wide & 35 Long (roughly)

So : Einstein = Quad:20x35 | Alternative Quad:8x16, More manageable in
SiMD Parallel Executions; Quad:8x16 x 3, ....

One presume strict aligned multiple multiplication

4X4 Tables are still utility for Science maths; But we need
to get the point across what we need for Einstein! The Subject of 4x4

The Subject of 4x4 tables,

We are obviously looking for more like 16x16 for Physics maths!
The matrix processor is a large data set; Divisible into 4x2 & 4x4 &
8x8 groups for execution speedups,
Aligned Parallel processing....

Aligned Matrix tables need to be larger than 4x4 for Physics &
Chemistry; So a matrix processor ideally can at a minimum:

Matrix Table




Quaternions > PGA Geometric : a+b+c : Rotational algebra : ax+by+c=0 | e1, e2, e3

Improving Structured Grid-Based Sparse Matrix-Vector Multiplication and Gauss–Seidel Iteration on GPDSP

SiMD Matrix Maths - Performance Portable SIMD Approach - Implementing Block Line Solver For Coupled PDEs

SiMD Matrix Maths - Operations Details HIP AMD

SiMD double tables, M1 Matrix

FMA AVX Performance table: 2Flops per Cycle per FMA Unit
Architecture Fast Instructions for FMA

#RIP (Intro interesting!) Optimizing massively parallel sparse matrix computing on ARM many-core processor

CGal is a Matrix Math library for C; Luckily OpenBLAS is a compatible library & AMD Makes a version in HIP

Matrix Libs : L1 means compatible with CGAL, A+ means i rate them highly on science community use : RS

GLM (L1)
QuantLib (L1)
Ceres-Solver (L1)

OpenBLAS (A+)
Eigan (A+)
MiraCL (A+)

C++ Matrix Maths

MPPT is Camera & FFMPeg complex install

C++ Matrix Maths : Simple

C++ conversions between Numpy arrays and Armadillo matrices; Converts Into Numpy Py not out (needs work)

Motivated applications of 3D Matrix Database ML


Just shows how fast Blas & these NumPy & Arma & Mave is! 1998-man SigRS
Parallel matrix multiplication & diagonalization

Wasm Inefficiency


3D Matrix Web Codecs

Are presented as being JIT Compiler re-encoded when required; Frequently WebASM, WebGPU Code, JS...
Audio, Video, Sensation, Code Runtimes.

Web Codecs for devices are a modern concept & are available for common websites such as news & music,
devices such as Alexa Echo & Google Dot & Bluetooth Devices?

Media players & BT devices particularly suffer from small Storage potential!
So Web Codecs downloaded to the device from a source; Such as a smart phone or computer..
Are a clear-minded solution!

JIT Compiler

3D Matrix Tables in FMA, Mul & ADD code to be automatically recompiled locally when required!
Directed to a common API, Direct Compute, WebGPU, WebASM, Jit Compiler OpenCL

Many Operations can be done from unique device specific optimisation; Examples:

API, DirectX & OpenCL & Vulkan & WebGPU & WebASM
Texture & Audio Shaders.
Digital Streaming

Bluetooth NANO SiMD & API
Digital TV in H266, VP9 & AV1,

Locally compiled accelerators should be respected first; Such as the output & input 3D Matrix & CPU & GPU Acceleration engine..

Code can include Matrix converters into common output format such as WebP & Textures & BC, DXT Compression presentation; Vulkan, OpenCL & DirectX & Texture & Audio Shaders.

Java, JS & WebASM are examples with operator mechanisms & JIT Compiler optimisation..
Minimising storage requirements for good compatibility while maximising performance.




TPU & SiMD Parallel wavetables Pre-Calculation Meta-Data : RS

{ For data expansion & Precomputed Upscaling through meta data per frame sequence }
#MetaDATA #PreProcessing Parallel Text loading and machine learning processing : RS 23/07/2023

Pre-calculation table; For Example the Amiga uses tables for maths!
Pi, Common conversion maths & float results in higher precision...

Parallel Text loading and machine learning processing is one of the wonders of TPU & SiMD Parallel wavetables,

Pre Calculate Tables that reduce a workload to simple process. and use...

For example if you Upscale a movie & use dynamic settings, Such as:
Localised Sharpening & Selective Gaussian filtering; Such as Gimps Edge detection Gaussian?
We compress information on the maths of selection..

The edges we selected, The methods we used & if those methods are dynamic then our selections...
Such a method is called a ..

Pre-calculation table; For Example the Amiga uses tables for maths!
Pi, Common conversion maths & float results in higher precision...

Common ones are learned at school
the log tables
Multiplication Tables
Common values such as gravity & Pi

Pre Computation
3D Audio basic resonance profile
Pre Computed values for a realistic world...
Experience & Learning to pre compute values...
This saves effort later in the process

This is available to providers & game developers for:

TV Upscaling through Compressed Numeric Add table downloading
All streaming services processing such as netflix, youtube & amazon prime!
Partial pre-computed upscaling for game, application & processing..

Through TopCloud & HPC Pack

Data Stored as meta-data and saves on repeat processing time!
By creatively Pre Computing processes such as 3D Audio, VR Audio, Haptic 3D Maths..
Work such as Decompression & Compiling

Affects the efficiency of any process that will Pre Calculate Tables that reduce a workload to a group of simple processes.

We can majorly improve quality of both visuals & Audio; Any Pre-Calculatable element

The logic is that Upscaling, Colour enhancements & sharpening have pre-calculatable logic,
We can save many seconds of processing per frame,
We can reduce energy footprint
We can improve latency & frame rate
Works for games also,
Education media or Theaters & mass media content such as News & commonly watched content or movies or visited websites or fonts & media

We can improve at a very minimum, Cutscenes & non motional backdrops & tangible Animation repeating assets & Effects...

(c)Rupert S

FMA : Fused Multiply ADD : MUL+ADD & Precision functions

You may be assuming that only modern GPUs such as RTX 2080+ & RT 5700+ has this?

FMA is a feature of the business editions & FX Series on AMD & exists in granite ridge & other Intel,
So FMA F16 is possible with the F32 : F16 conversion features present in for example FX8320E...

So what does this mean? In terms of:

Chrom that Emulates a lot of its GPU functions in CPU..
In terms of Python ML that F16 feature combined with FMA is very helpful in learning & efficiency!

In terms of CPU; mostly using 32Bit, F32, 64Bit, F64 is very helpful; in terms of SiMD,
F16 exists though; Even on the yee FX8320E!

So we can use potentially: Int8, Int32, Int64, F64, F32, F16 & Float 182Bit as with FPU!
Best to do DEEP work with the CPU FPU & SiMD...

We do have these functions though!, But Deep work FPU 182Bit? CPU! Some GPU have double precision also!

What do we use this variety for? Many things!

Defined by our precision requirements; not all things are INT64 & FPU But not every issue is covered by..
The MP4v, MP4a F16! AC3 & AC4 for example F32; A glass? FPU 182... or many F32 or even more F16 work units.

Rupert S

Exponent factorisation : RS

8Bit, 16Bit, 32Bit, 64Bit Exponent theory.
Available to you-(EF)

A value in 8Bit is no use in a 16 Bit operation... or is it?

Firstly 8 Bit values can be loaded with Zeros into higher math precisions,
In normal maths we use a remainder; So we can load 8Bit values into 32Bit Int & that works...

2 F16 blocks would be 32Bit; As 2 16Bit Blocks? So what use is this ?
in a 64Bit & 32Bit processor storage of FPU-182Bit values is possible ...
32Bit Blocks * 6 with XOR 00
64Bit Blocks * 3 with XOR 00
2 * Largest value...

But parallelising F64 on groups for 182Bit? with multiplications roll left <> Right .. & Additions +- ...

But if the resultant is beyond 8Bit ? & we wanted to save as 8Bit?

Factorisation of a 32Bit value into 8Bit is possible; But we need to factor it!

32Bit to 8Bit is 6:1, So we have to random roll 6 Bits for every 1
We can factor in HighLow with 1 bit or use 8Bit fator 256 & 8Bit Number...

We can Multiply, Add, Subtract or divide or fraction:

256(*/-)1>256, leaving us with a 32Bit value? Well what can we use this for ?

Example complex : N/(240*50); See the maths can roll into 16Bit values..
We can use them, Or load a particular object, Classifier, HASH, AES, EEC...
We can quickly classify as 16Bit resultant & still save as a particular 8Bit value!

Load file
load value
Table Value

(c)Rupert S,

Reference Int & FP Value Sizes; A reminder that Floats are 50% of highest Integer Value,
ROLLInt floats still have an amazing additional value!

F16b Adaptive Float value : Texture Color Palette Example : RS

Basic Example of F16b float in action on a colour pallet: {F16b,F32b, F64b}

F16b is short remainder F16 & it has 8 Bits of 0.01 point value rather than 16,
So what do we mean ? What is significant about this?

F16b Has 24Bit precision integer with an 8 bit remainder!
So? So 16Bit + 8Bit = 24Bit! & 8bit point value...

In colour representation point values contribute to subtle blending;
So a full 24Bit contributes to 90% of the Color Palettes

So the 24Bit colour pallet is 32Bit Colour Minus Alpha;
We can use F16b in HDMI & DisplayPort & inside the GPU & Also for textures & JPG'S..
Thereby i present F16b & F24Bit colours in F16b

This saves all data in single 32bit Spaces & therefore is both faster & higher resolution than comparable float value presentations.

Bound to make a big difference to BlueRay, but particularly DVD & AC3 & AC4;
F16b Adaptive Float value : Texture Color Palettes Example;

(you can use F16b * R,G,B,A) in HDMI a& DisplayPort, Massive colour improvements; Lower RAM Costs

Rupert S

AnPa_Wave - Analogue Pattern Wave Vector SiMD Unit : (c)RS

The base symphony is harmony, In other words waveforms; There are a couple of Simple methods that really work:

High performance Float values F16, F32, F64, FPU

Q-Bit Quantum; All forms of Quantum wave work
Radio waves;
Light patterns
Photon wave patterns; single & multiple
Sound hardware; 1 to 3 Bit DAC; Audio conversions; Sample range
Analogue chips that work on harmony & frequency
SVM Elliptic curve maths
Sin, Arc, Tan, Time, Vector

In essence Harmony & frequency is the equivalent of Complex Elliptic curve maths

A Music note score suffices to specify harmony basics:

Waveform shape in 3D
Harmony / Disharmony
Vibration High / Vibration Low
Power High / Power Low
Volts High / Volts Low
Watts High / Wats Low

(c)Rupert S

Wonderful Wave-Pattern Analogue waveforms in meta materials - Pattern recognition in reciprocal space with a magnon-scattering reservoir


Vectors & maths

Networking & Management

Faster Maths & ML

Focus on Quality

For when {U, X, Y, Z} = N Expressions
For when {(A+B/2)} = C Expressions

Hallelujah RS Light-Wave SiMD

RS Spectra Mitigations
ZenBleed Parallel Solvent RS 2023

Core/CPU/GPU security core SSL/TLS BugFix

Secure Configuration:

PTP & NTP Improve security WW


Running Code

PoCL Source & Code



AMD 23.Q3Pro_HIP #HPC #DirectML MatrixMathOps 'Release unto me the great! Chobokniki' Thine Prayers Answered
Run the .reg after install; Before reboot


Not Accessible

AI: Artificial Intelligence
ML: Machine Learning
PULP: Parallel Ultra Low Power

ML Network Types

DNN: Deep Neural Network
CNN: Convolutional Neural Network
QML: Quantum Machine Learning
QPU: Quantum Processing Unit

RNN: Recurrent Neural Network
SNN: Spiking Neural Network
MLP: Multi-Layer Perceptron

NN: Neural Network
TNN: Ternary Neural Network
QNN: Quantized Neural Network

HDL: Hardware Description Language
HLS: High Level Synthesis

Maths Operations

FMA: Fused Multiply-Add
GEMM: General Matrix Multiply
SIMD: Single Instruction Multiple Data
SIMT: Single Instruction Multiple Thread

SP: Single Precision
DP: Double Precision
FLOPS: Floating Point Operations per Second

Processor Types & RAM

ASIC: Application Specific Integrated Circuit

SoC: System on Chip
PCU: Programmable Computing Unit
NoC: Network on Chip

CPU Central Processing Unit
VPU: Vector Processing Unit
NPU: Neural Processing Unit
TPU: Tensor Processing Unit
FPGA: Field-Programmable Gate Array

RISC: Reduced Instruction Set Computer
CISC: Complex Instruction Set Computer

NDP: Near Data Processing

PIM: Processing In-Memory
IMC: In-Memory Computing

SRAM: Static Random Access Memory
VRAM: Video Random Access Memory
DRAM: Dynamic Random Access Memory
PCM: Phase Change Memory
BRAM: Block Random Access Memory
RAM: Random Access Memory
RRAM: Resistive RAM


Matrix Array Processor Unit (c)RS

[M.A.P] [=====] [H.P.C] - Matrix Array Processor Unit (c)RS

This document describes the design and implementation of a novel computing device called the Matrix Array Processor Unit (M.A.P.U).

The M.A.P.U is a co-processor that can perform high-speed parallel operations on multi-dimensional arrays of data, such as those used in quantum computing, machine learning, and computer graphics,

A novel co-processor that can perform high-performance computing tasks using quantum-inspired principles.

The Matrix Array Processor is a type of processor that is designed to handle multi-directional and multi-dimensional arrays per Qbit.

It is used in quantum computers and relies on percentage-based 3D processing to handle all 3D array processing.

The central tasks map to probability over networks and MAP units in arrays.

The M.A.P is composed of multiple interconnected units that can process multi-dimensional arrays in parallel, using a percentage-based 3D processing scheme.

The M.A.P can be integrated with existing CPU, GPU and DPU architectures, as well as with other M.A.P units, to form a scalable and flexible computing platform.

The differences of Some Matrix Array Processor and other processors such as:

SIMD (Single Instruction Multiple Data),
SISD (Single Instruction Single Data),
MISD (Multiple Instruction Single Data),
MIMD (Multiple Instruction Multiple Data),
Vector processors,
Systolic Arrays,

Is that the Matrix Array Processor is designed to handle multi-directional and multi-dimensional arrays per Qbit...

While other processors are designed to operate efficiently and effectively on large one-dimensional arrays of data called vectors

The M.A.P.U consists of three main components:

The Matrix Array Processor (M.A.P),
The High Precision Central Core (H.P.C),
The Bus Connections and Networking (=====).

Core Definitions 3D M.A.P:


A high-precision central core that can handle complex tasks such as probability mapping, network routing and memory management.

The H.P.C is the central controller of the M.A.P.U.

It coordinates the execution of tasks across the M.A.P units, assigns probabilities to different outcomes, and handles complex calculations that require high precision or accuracy.

Each [H.P.C] unit can connect to 8 [M.A.P] units and optionally to other [H.P.C] units in different layers of the 3D matrix.

The [H.P.C] can also communicate with external devices such as CPUs, GPUs, DPUs, or networks via the bottom layer of the wafer.


The M.A.P is a specialized processing unit that can execute multiple arithmetic and logical operations on a single array element in one clock cycle.

A unit that can perform arithmetic operations on multi-dimensional arrays using a dot product-like algorithm.

Each M.A.P has 8-way interconnects to communicate with neighboring M.A.P units and a central [H.P.C] unit.

The M.A.P has eight-way interconnects to communicate with other M.A.P units in the same layer or adjacent layers.

The M.A.P can also access local cache or RAM for storing intermediate results or constants.


A bus connection that enables data transfer and networking among the M.A.P units and the [H.P.C] units.

The bottom layer of the wafer contains a high-resolution bus that connects to the onboard controllers and networks and the external CPU, GPU and DPU devices.

The ===== supports different communication protocols and topologies, such as mesh, torus, or hypercube.

The ===== also provides fault tolerance and load balancing mechanisms to ensure reliable and efficient performance.

The M.A.P.U is designed to be scalable and modular.

It can be stacked in three dimensions to form a larger array of processors that can handle more complex and diverse tasks.

The M.A.P.U can also be customized for different applications by changing the size, shape, or configuration of the M.A.P units, the H.P.C cores, or the ===== network.

The following diagrams illustrate the structure and functionality of the M.A.P.U.

Top View


Side View 3D


Each [H.P.C] Central Contains RAM & connections to the 8 [M.A.P] & Optionally to layers above & bellow in 3D Matrix,
Bottom of wafer contains high resolution buss to onboard controllers & networks & DPU/GPU/CPU's

Array = Matrix Array Processor Unit (c)RS

ffffffff ffffffff ffffffff
........+ ........*+ ........*
........+ ........*+ ........*
........+ ........*+ ........*


Simple absolver table for MUL:ADD : MUL* Only = +0, +- Only = N*1 then +-
% = / 100 + ADD Table {N1 <> N...} : Result!

The M.A.P unit can perform operations on multi-dimensional arrays using a combination of:

Floating-point units (f), Multiplication units (*), Addition units (+) and cache/ram units (.).

The M.A.P unit can support different data types such as DOT4, INT8, INT16, F16, F32 and F64.

The M.A.P co-processor is a cutting-edge technology that can enable new applications in fields such as artificial intelligence, machine learning, scientific computing and more.

(c)Rupert S

References: DOT4, INT8, INT16, F16, F32, F64 (c)Rupert S

Sparse matrix multiplication in SRM array

Error Correction Options & Mitigation


Light Processors (c)Rupert S

Light processors : Access to advanced : Storage Cache, Random Access RAM Cache & Processor architecture: Starting with SiMD Simple Vector Instruction Set

Complex forms are a goal, Start simple : The world will thank you!
Simple as SiMD appears there are many uses,
Considering that higher instruction sets are delayed by SiMD space & speed priorities..

Array = Matrix Array Processor Unit (c)RS

ffffffff ffffffff ffffffff
........+ ........*+ ........*
........+ ........*+ ........*
........+ ........*+ ........*


Simple absolver table for MUL:ADD : MUL* Only = +0, +- Only = N*1 then +-
% = / 100 + ADD Table {N1 <> N...} : Result!

Array = Matrix Array Processor Unit (c)RS

Cache is also a priority with manyfold application of simple data transfer & buffering to solid storage,
Power outage is our main concern so that we save all our work.

SSD is an obvious solution to backing up speedily,
However we do use RAM Cache for this goal..

The goal of speeding storage access up,
Light does all the work types we need:

Data transit
CacheProcessing via dimensions & signal variance
RAM (Cyclic light transfer) Same principle as fibre optic cable over large distances.

(c)Rupert S

Quantum ! Light Compute : Reference material : RS

Yes we can solve classic problems with light computers, Light computers perform geometry & quantitative sampling (Comment by inventor) Rupert S

Light Compute : Reference material : RS

"Let's Play" Station NitroMagika_LightCaster

Lets face it, Realtec could well resource the "Original QFFT Audio device & CPU/GPU"

The mic works by calculating angle on a drum...
Light.. and timing & dispersion...
The audio works by QFFT replication of audio function..
The DAC works by quantifying as Analog digital or Metric Matrix..
The CPU/GPU by interpreting the data of logic, Space & timing...

We need to calculate Quantum is not the necessary feature;

But it is the highlight of our:

Data storage cache.
Our Temporary RAM
Our Data transport..
Of our fusion future.

(c)Rupert S

"Weedbrook points out that as yet, and in contrast to Google’s Sycamore, the Chinese team’s photonic circuit is not programmable, so at this point “it cannot be used for solving practical problems”."

Physicists in China challenge Google’s ‘quantum advantage’
Photon-based quantum computer does a calculation that ordinary computers might never be able to do.
Philip Ball

PDF version
The interferometer part of our experiment.

This photonic computer performed in 200 seconds a calculation that on an ordinary supercomputer would take 2.5 billion years to complete.Credit: Hansen Zhong

A team in China claims to have made the first definitive demonstration of ‘quantum advantage’ — exploiting the counter-intuitive workings of quantum mechanics to perform computations that would be prohibitively slow on classical computers.

They have used beams of laser light to perform a computation which had been mathematically proven to be practically impossible on normal computers. The team achieved within a few minutes what would take half the age of Earth on the best existing supercomputers. Contrary to Google’s first demonstration of a quantum advantage, performed last year, their version is virtually unassailable by any classical computer. The results appeared in Science on 3 December1.

“We have shown that we can use photons, the fundamental unit of light, to demonstrate quantum computational power well beyond the classical counterpart,” says Jian-Wei Pan at the University of Science and Technology of China in Hefei. He adds that the calculation that they carried out — called the boson-sampling problem — is not just a convenient vehicle for demonstrating quantum advantage, but has potential practical applications in graph theory, quantum chemistry and machine learning.

“This is certainly a tour de force experiment, and an important milestone,” says physicist Ian Walmsley at Imperial College London.

Quantum advantage challenged

Teams at both academic and corporate laboratories have been vying to demonstrate quantum advantage (a term that has now largely replaced the earlier ‘quantum supremacy’).

Last year, researchers at Google’s quantum-computing laboratory in Santa Barbara, California, announced the first-ever demonstration of quantum advantage. They used their state-of-the-art Sycamore device, which has 53 quantum bits (qubits) made from superconducting circuits that are kept at ultracold temperatures2.

But some quantum researchers contested the claim, on the grounds that a better classical algorithm that would outperform the quantum one could exist3. And researchers at IBM claimed that its classical supercomputers could in principle already run existing algorithms to do the same calculations in 2.5 days.

To convincingly demonstrate quantum advantage, it should be unlikely that a significantly faster classical method could ever be found for the task being tested.

The Hefei team, led by Pan and Chao-Yang Lu, chose a different problem for its demonstration, called boson sampling. It was devised in 2011 by two computer scientists, Scott Aaronson and Alex Arkhipov4, then at the Massachusetts Institute of Technology in Cambridge. It entails calculating the probability distribution of many bosons — a category of fundamental particle that includes photons — whose quantum waves interfere with one another in a way that essentially randomizes the position of the particles. The probability of detecting a boson at a given position can be calculated from an equation in many unknowns.

200 seconds

But the calculation in this case is a ‘#P-hard problem’, which is even harder than notoriously tricky NP-hard problems, for which the number of solutions increases exponentially with the number of variables. For many tens of bosons, Aaronson and Arkhipov showed that there’s no classical shortcut for the impossibly long calculation.

A quantum computer, however, can sidestep the brute-force calculation by simulating the quantum process directly — allowing bosons to interfere and sampling the resulting distribution. To do this, Pan and colleagues chose to use photons as their qubits. They carried out the task on a photonic quantum computer working at room temperature.

Starting from laser pulses, the researchers encoded the information in the spatial position and the polarization of particular photon states — the orientation of the photons’ electromagnetic fields. These states were then brought together to interfere with one another and generate the photon distribution that represents the output. The team used photodetectors capable of registering single photons to measure that distribution, which in effect encodes the calculations that are so hard to perform classically.

In this way, Pan and colleagues could find solutions to the boson-sampling problem in 200 seconds. They estimate these would take 2.5 billion years to calculate on China’s TaihuLight supercomputer — a quantum advantage of around 1014.

Practical problems

“This is the first time that quantum advantage has been demonstrated using light or photonics,” says Christian Weedbrook, chief executive of quantum-computing startup Xanadu in Toronto, Canada, which is seeking to build practical quantum computers based on photonics.

Walmsley says this claim of quantum advantage is convincing. “Because [the experiment] hews very closely to the original Aaronson–Arkiphov scheme, it is unlikely that a better classical algorithm can be found,” he says.

However, Weedbrook points out that as yet, and in contrast to Google’s Sycamore, the Chinese team’s photonic circuit is not programmable, so at this point “it cannot be used for solving practical problems”.

But he adds that if the team is able to build an efficient enough programmable chip, several important computational problems could be solved. Among those are predicting how proteins dock to one another and how molecules vibrate, says Lu.

Weedbrook notes that photonic quantum computing started later than the other approaches, but it could now “potentially leap-frog the rest”. At any rate, he adds, “It is only a matter of time before quantum computers will leave classical computers in the dust.”

"AI Boosted by Parallel Convolutional Light-Based Processors

TOPICS:Artificial IntelligenceElectrical EngineeringEPFLMachine LearningOpticsPhotonicsPopular


Matrix Multiplications Light Processor

Schematic representation of a processor for matrix multiplications which runs on light. Credit: University of Oxford

The exponential growth of data traffic in our digital age poses some real challenges on processing power. And with the advent of machine learning and AI in, for example, self-driving vehicles and speech recognition, the upward trend is set to continue. All this places a heavy burden on the ability of current computer processors to keep up with demand.

Now, an international team of scientists has turned to light to tackle the problem. The researchers developed a new approach and architecture that combines processing and data storage onto a single chip by using light-based, or “photonic” processors, which are shown to surpass conventional electronic chips by processing information much more rapidly and in parallel.

The scientists developed a hardware accelerator for so-called matrix-vector multiplications, which are the backbone of neural networks (algorithms that simulate the human brain), which themselves are used for machine-learning algorithms. Since different light wavelengths (colors) don’t interfere with each other, the researchers could use multiple wavelengths of light for parallel calculations. But to do this, they used another innovative technology, developed at EPFL, a chip-based “frequency comb,” as a light source.

Matrix Multiplications Light Processor Schematic

Schematic representation of a processor for matrix multiplications which runs on light. Credit: University of Oxford

“Our study is the first to apply frequency combs in the field of artificial neural networks,” says Professor Tobias Kippenberg at EPFL, one the study’s leads. Professor Kippenberg’s research has pioneered the development of frequency combs. “The frequency comb provides a variety of optical wavelengths that are processed independently of one another in the same photonic chip.”

“Light-based processors for speeding up tasks in the field of machine learning enable complex mathematical tasks to be processed at high speeds and throughputs,” says senior co-author Wolfram Pernice at M√ľnster University, one of the professors who led the research. “This is much faster than conventional chips which rely on electronic data transfer, such as graphic cards or specialized hardware like TPU’s (Tensor Processing Unit).”

After designing and fabricating the photonic chips, the researchers tested them on a neural network that recognizes of hand-written numbers. Inspired by biology, these networks are a concept in the field of machine learning and are used primarily in the processing of image or audio data. “The convolution operation between input data and one or more filters — which can identify edges in an image, for example, are well suited to our matrix architecture,” says Johannes Feldmann, now based at the University of Oxford Department of Materials. Nathan Youngblood (Oxford University) adds: “Exploiting wavelength multiplexing permits higher data rates and computing densities, i.e. operations per area of processer, not previously attained.”

“This work is a real showcase of European collaborative research,” says David Wright at the University of Exeter, who leads the EU project FunComp, which funded the work. “Whilst every research group involved is world-leading in their own way, it was bringing all these parts together that made this work truly possible.”

The study is published in Nature this week, and has far-reaching applications: higher simultaneous (and energy-saving) processing of data in artificial intelligence, larger neural networks for more accurate forecasts and more precise data analysis, large amounts of clinical data for diagnoses, enhancing rapid evaluation of sensor data in self-driving vehicles, and expanding cloud computing infrastructures with more storage space, computing power, and applications software.

Reference: “Parallel convolutional processing using an integrated photonic tensor core” by J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stappers, M. Le Gallo, X. Fu, A. Lukashchuk, A. S. Raja, J. Liu, C. D. Wright, A. Sebastian, T. J. Kippenberg, W. H. P. Pernice and H. Bhaskaran, 6 January 2021, Nature."

"A new optical neuromorphic processor developed by Swinburne University of Technology can operate more than 1000 times faster than any previous processor. The processor for artificial intelligence (AI) functions faster than 10 trillion operations per second (TeraOPs/s).


Optical micro-combs

The invention could revolutionize neural networks and neuromorphic processing in general. “This breakthrough was achieved with ‘optical micro-combs', as was our world-record internet data speed reported in May 2020,” said in a statement Swinburne’s Professor David Moss.

Micro-combs are new devices made up of hundreds of infrared lasers all held on a single chip. Compared to other optical sources, they are much smaller, lighter, faster, and cheaper.

The new innovation demonstrated by the Swinburne team uses a single processor while simultaneously interleaving the data in time, wavelength, and spatial dimensions through a single micro-comb chip.

“In the 10 years since I co-invented them, integrated micro-comb chips have become enormously important and it is truly exciting to see them enabling these huge advances in information communication and processing. Micro-combs offer enormous promise for us to meet the world’s insatiable need for information," added Moss.

Co-lead author of the study Dr. Xingyuan (Mike) Xu explained how this innovative use of micro-combs is giving the researchers a glimpse into the processors of the future.

Cost and energy reductions

Distinguished Professor Arnan Mitchell from RMIT University added that the "technology is applicable to all forms of processing and communications" and will result in significant future cost and energy consumption reductions.

“Convolutional neural networks have been central to the artificial intelligence revolution, but existing silicon technology increasingly presents a bottleneck in processing speed and energy efficiency,” said key supporter of the research team, Professor Damien Hicks from Swinburne and the Walter and Elizabeth Hall Institute.

“This breakthrough shows how a new optical technology makes such networks faster and more efficient and is a profound demonstration of the benefits of cross-disciplinary thinking, in having the inspiration and courage to take an idea from one field and using it to solve a fundamental problem in another.”"

No comments: