ESA Space blog - All Rights Reserved RS

Friday, June 23, 2023

[M.A.P] [=====] [H.P.C] - Matrix Array Processor Unit (c)RS

*
The M.A.P Processor is ideal as a Tensor Unit, For Small Array Solving; Such as MP3, MP4 & AC4 3D Audio,
The Base Map is simply to Fit a large static conversion M.A.P into the device,
For example a 32Bit Audio Sample Plus 3D Layer for Bluetooth would simply be around 64Bits for Stereo 32Bit Audio MP4; Plus 32Bits for the 3D Map,
The M.A.P Process is not static; But you stick to the maths you wanted.

In parallel instructions, one calls interrupts if bad; IRQ & DMA Notes if you want to have better performance,
But in a processor Internals you have to call the main loops in your App; & OS Task Instruction cache..

Instruct The loop; Don't Interrupt; Stop, Look, Listen! Look, Slowdown, Showtime!

Integer instructions multiple parallel example of The principle of,
M.A.P is based on wide multiple instructions, This suits AVX & SiMD,
Particularly in 16Bit Multi Parallel Instruction Mode

Rupert S

Soft Interrupt IRQ: Faster CPU Cycles: RS

A Soft Interrupt is where you direct the interrupt register to a compiled Code Block..
The code block handles the Wait Queue in a gentle way that allows processing to continue & Ram to be accessed..

While the HDD directly writes the IRQ messages to the Code Block; The Code block is below the size of Cache on the Processor..

In advanced scenarios the Soft Int Caches Read/Write in RAM while Directing DMA & R/W Cached Cycles; Good Bioses & Software do this.

But in a processor Internals you have to call the Main Micro loops (Soft Int) in your App; & OS Task Instruction cache.
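
A minimal sketch of the Soft Interrupt idea above, assuming a small bounded queue stands in for the compiled Code Block; The names SoftIrqQueue, raise_irq & drain are hypothetical & not taken from any real kernel or BIOS API. The 'interrupt' only records the event; The main loop drains a limited number of events when it chooses to:

```python
from collections import deque

class SoftIrqQueue:
    """Hypothetical soft-interrupt block: events are queued, not handled inline."""

    def __init__(self, capacity=8):
        # A small bounded queue, standing in for a code block that fits in cache.
        self.pending = deque()
        self.capacity = capacity
        self.dropped = 0

    def raise_irq(self, source, payload):
        """Called by the 'device': record the event and return immediately."""
        if len(self.pending) >= self.capacity:
            self.dropped += 1  # queue full: count the overflow instead of blocking
            return False
        self.pending.append((source, payload))
        return True

    def drain(self, handler, budget=4):
        """Called from the main loop: handle at most `budget` events per pass."""
        handled = 0
        while self.pending and handled < budget:
            source, payload = self.pending.popleft()
            handler(source, payload)
            handled += 1
        return handled


queue = SoftIrqQueue(capacity=8)
log = []

# Devices raise events while the main loop is busy.
for block in range(6):
    queue.raise_irq("hdd", block)

# The main loop does its own work, then drains a bounded number of events.
first_pass = queue.drain(lambda src, data: log.append((src, data)), budget=4)
second_pass = queue.drain(lambda src, data: log.append((src, data)), budget=4)
```

The budget per pass is the design choice here; It keeps the loop in control of how much time the queue may take.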

RS

Interrupts particularly affect the Processor functions such as..
Machine Learning Load & Store of Frames, Also the internet..
In devices such as Network cards, offloading is often required to handle interrupts..

VPDM-ST-LRS : Verified Processor Direct Memory Space Transactions Load, Register & Save (c)RS

In Concurrence with DM-TCP & DM-UDP & DM-QUIC Soft Interrupt IRQ

https://www.phoronix.com/news/Linux-Device-Memory-TCP

For SI-IRQ to safely directly write RAM for a SiMD & CPU/TPU; The following protocol is observed:

1 DMA Memory Management Processor, Device Bios/PCI Bus & Network Chipset/Network card..
Shall directly code check incoming traffic; But shall not void ECC Mode error check...

Bear in mind that AES, Common TLS & Packet Compression are in effect!
So you shall be using Networking features directly through the Transparent H.D.L Hardware Device Layer...

In effect the MMU & Network adapter transparently offload directly to Device Topography RAM & Cache!

2 The network card Certifies transactions & offloads security to internal features; Main Certification is still TPM & HSM.

3 You can hand data directly to the Processor if the memory space matches the internet Bit-depth; However this is usually 32Bit as with IPv4 & 128Bit with IPv6 (a 64Bit network prefix & a 64Bit interface identifier)..

4 So the MMU & Network chipset work in sync; ECC, Security, TLS, M.S.T: Memory Space Translation...

5 VPDM-ST-LRS : Verified Processor Direct Memory Space Transactions Load, Register & Save (c)RS

So to be clear Automated Load, Register & Save Networking; Yes,
Device Low Level Firmware Translation Transactions; Yes
Processor Direct Memory Space Transactions; No, With Verification? Yes

To stop per Frame IO being a high cost transport processing; We process the entire frame per In/Out,
The same with TCP/UDP/QUIC; We process per whole block of Bits; For example 192Bits (SSL,AES),
Packet containment & control protocols; Mainly because Half packets caused inefficiency!

Rupert S

The IDFlow Work Networking : DMA to DMA Buffer write throughs with caching : (c)RS

DMA Offloading such devices as Network Cards, Audio & GPU to GPU connections : DMA, Direct Memory Access with direct device to device write-through caching

What is different about this approach; The TCP UDP non specific protocols allow motion through a computer system,

Routed through chiplets & internal networks; No latency issues & very little protocol overhead.

Applies to Ethernet, Wifi, Network, Internal Bus, Audio, Video, CPU or processor & is the internal data flow system : The IDFlow Work Networking

With GPU to GPU & Hard Drive to Hard drive transfers direct equivalence is the primary necessity!

To have a coherent transfer between two of the equivalent systems we need a Cache for input & output..

However we can create a load on arrival eta on transfer that automates correct RAM location that is optimally sorted!

The difference with IDFlow DMA :

Negotiated security profile..

The main thing about Mapped DMA is an ideal route

The routing table, To handle complexities in machinery & ethernet & wifi/BT

Negotiated Data Types, Traditional DMA is memory, IDFlow can use data types, For example textures or OpenCL Kernels & Data

Privacy, traditional DMA is quite private because information is not provided on route by intermediaries..

However you think about DMA System IDFlow,
It may be a tiny bit slower negotiating on boot,

However in Ethernet Negotiation only takes a second,
Once negotiation is accomplished... The system acts like a traditional DMA..

(c)RS

Certificate exchange IP Packets & then Device classifiers : RAM, Processing power & features, Priority, Availability, workload levels, common statistics, Routing table array.

Routing table array { passthrough hardware such as Motherboard chipset & special devices such as DMA, Busses & routing table storage & access }

IP Packet formula, Metadata {

Workload timer for OpenCL & DirectCompute workloads

Send cookie code packet on request reception or query

What we need to do first is send a quick burst of metadata; The metadata contains the application & use principle; We define preferred use & reception application!

Identification of the type of data being sent allows Direct RAM Allocation in the correct formula,
For example Textures or OpenCL & Direct Compute runtimes or Hard drive or Ram Data Blocks..

DMA Cache can then be directly allocated based on size & composition of data & that memory can be directly moved to the application memory allocation, Avoiding the cache being moved internally inside the Processor or GPU IPU, NPU etcetera..

We can pre formulate the data packet from a source such as QAM that sends Encryption offloaded packets for storage or use; This allows Prior work in the flow of data,

Where we need to directly allocate RAM Blocks to write but the end device needs to arrange the RAM block for write; Effectively a dynamic frame/Data block.

Example application where prior work from source device to end device is applicable:

QAM & Chipset to HDD & SSD & Drive direct transfer to QAM & Chipset for Processor use..

Maybe directly to Encrypted RAM as commanded by the Processor.

Direct Storage to GPU or decompression chipset to GPU

In terms of FPU & NPU to CPU task sharing, Dynamic metadata allows task optimisation & ram allocations..

Improving on that Dynamic Storage & Retrieval with optimal computation block.. Reduces overhead & repeated task processing.

In terms of HDMI & DisplayPort direct frame to frame DMA would speed up Ethernet transport protocols from the GPU to the display & back for when you frame copy..

In terms of Audio the per frame or tick cycle translation data to output would reduce overhead..

The principle of IDFlow is planned & secure DMA,

In principle the key point is the same as modern GPU direct RAM Access,

However because DMA is private & secured by being per application,

Direct DMA is a means of keeping secrets in the same way as PreFetch on the CPU,

Now you know that prefetch is bugged, DMA holds discrete secrets & privacy.

};
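
The metadata burst above can be sketched as follows, assuming a small header that names the receiving application, the data type & the payload size; The field names, the alignment values & plan_allocation are all hypothetical & only illustrate how a receiver could size a buffer before the payload arrives:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FlowMetadata:
    """Hypothetical metadata burst sent before the payload."""
    application: str   # preferred receiving application
    data_type: str     # e.g. "texture", "opencl_kernel", "ram_block"
    size_bytes: int    # payload size announced by the sender
    priority: int = 0

# Assumed alignment per data type (illustrative values only).
ALIGNMENT = {"texture": 4096, "opencl_kernel": 256, "ram_block": 64}

def plan_allocation(meta: FlowMetadata) -> dict:
    """Pick an aligned buffer size for the announced payload."""
    align = ALIGNMENT.get(meta.data_type, 64)
    # Round the size up to the next multiple of the alignment.
    padded = -(-meta.size_bytes // align) * align
    return {"owner": meta.application, "align": align, "bytes": padded}

meta = FlowMetadata("renderer", "texture", 10_000, priority=1)
plan = plan_allocation(meta)
```

Because the type & size arrive first, The buffer can be placed directly in the application's allocation rather than copied there later.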

Rupert S

https://science.n-helix.com/2023/02/pm-qos.html

https://lore.kernel.org/dri-devel/20230710223304.1174642-1-almasrymina@google.com/

https://is.gd/HPC_PTP_Low_Latency_Network

https://www.linuxfoundation.org/press/announcing-ultra-ethernet-consortium-uec

https://ultraethernet.org/

https://jointdevelopment.org/

DMA & IO Device mapping

Dynamic Mapped Data flow with device compression
DMA & PIO needs to pass logically from device to device..
Memory allocation for buffers & cache; Input & direct load

https://science.n-helix.com/2023/06/ptp.html
https://science.n-helix.com/2023/02/pm-qos.html
https://science.n-helix.com/2023/06/map.html

RS

*

Embedded Hardened Pointer Table Cache for 3D Chips : RS

Based on PCI Edge RAM, Internal Loop Dynamic RAM; With internalised DMA Memory transfers..

In the process the feature has the ability to set a page table; 1MB, 2MB, 4MB, 16MB > 1TB; The RAM can be internally written to without invoking ALU or OS,

Pages are allocated; The GPU is an example; Physical pages are allocated in RAM that is directly Set by OS & Firmware/ROM Parameters...

Internal access to the RAM is set within the page allocation set, But all internal mapping & paging is done directly & through ALU & Memory Management Unit MMU.

With 1MB Cache set aside per feature; Not entirely unreasonable these days...

Most of a process such as SiMD can be carried out on internal loops..

Depending on Cache/RAM Space; Based on PCI Edge RAM

Internal DataSet Size based on Dynamic RAM Variable; That is set per USE &Or Per Settings or application,

That being said; RAM Allocations best be per session & directly after Setting is changed on reboot or refresh, Load & unload cycling.

Rupert S

*

Gather/Scatter Microcode no-overload ALU or Data/Code Cache, Just L3/RAM

When we look at the Instructions of the SiMD; We could see potential in them to further improve the Gather/Scatter Instructions; Although it has to be said that the instructions are well optimised!
Like much pre-Fetching Assembly code from earlier years, They are well created & quick!

But we can do several things with them; So what ?

We can directly fetch the Cache in the code & Link to cache locations using linking (if we have enough & we do at L3/L2)

We can make a Hardlink table in cache(L3) for load and save processing (64Kb, Including header)

We can directly invoke pre-fetch with a system call (With SoftLink Pointer Tables)

We can incache modify (if a directive is singular in a chain of a, b, c, d)
We can individually SysCall a direct load of a single {a, b, c, d} statement & not reload it all...

For this we need a matrix table in L3 RAM; We can do this if we keep the table under 512KB,
But we do not intend to be selfish & RAM is fast these days! So we can directly load a single matrix Element {a, b, c, d} & not refresh the loading cycle for the code...

Thus we do not have to overload ALU or Data/Code Cache, Just L3/RAM
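
A small sketch of the single element reload described above, assuming a Python list stands in for RAM & a dict stands in for the link table held in L3; The names link_table, cached & reload_one are hypothetical. Only the element that changed is fetched again; The other three stay as loaded:

```python
# Backing store standing in for RAM (addresses are plain list indexes here).
ram = [10, 20, 30, 40, 50, 60]

# Hypothetical link table kept in cache: element name -> RAM address.
link_table = {"a": 0, "b": 1, "c": 2, "d": 3}

# Cached copies of the four elements, loaded once.
cached = {name: ram[addr] for name, addr in link_table.items()}
stats = {"loads": len(cached)}  # count of RAM reads so far

def reload_one(name):
    """Refresh a single element through the link table; the others stay cached."""
    cached[name] = ram[link_table[name]]
    stats["loads"] += 1
    return cached[name]

# RAM changes behind element c only.
ram[2] = 99
reload_one("c")
```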

Rupert S

*

Temporary HardLinking in Prefetching Matrix instructions,

Gather/Scatter operations of localised random scattering of information to ram & retrieval

Gather
for (i = 0; i < N; ++i)
    x[i] = y[idx[i]];

Scatter
for (i = 0; i < N; ++i)
    y[idx[i]] = x[i];
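
The two loops above can be run as a short Python sketch; The function names gather & scatter & the sample data are illustrative only. Scatter here writes into a fresh array so that untouched slots are easy to see:

```python
def gather(y, idx):
    """x[i] = y[idx[i]] for every i."""
    return [y[j] for j in idx]

def scatter(x, idx, size):
    """y[idx[i]] = x[i]; untouched slots stay None."""
    y = [None] * size
    for value, j in zip(x, idx):
        y[j] = value
    return y

y = [100, 101, 102, 103, 104, 105]
idx = [4, 0, 3]

x = gather(y, idx)                   # picks y[4], y[0], y[3]
restored = scatter(x, idx, len(y))   # puts them back at 4, 0, 3
```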

Firstly I read statistical gathering & Seeding; Pre-Fetching is a method of anticipating & preloading data,
So what do I want to do ? In Vector Matrix Prefetch Logical Gather

Potentially I would like to use:

Softlink (ram retrieval & multiple value)
HardLink (maths)
Prefetching logic {such as,

Run length prefetching,
Follow & Forward loading Cache,
Entire instruction load & Timing Pre-fetch & Statistic for Loop time & load frequency
}

So on any potential layout for SiMD Matrix a most likely configuration is:

A B C : FMA
A B = C : Mul or ADD

So a logical statement is, A, B Gather/Seed C; Directly logical AKA Prefetch
A B C D; Logical fields of prefetch are localised to parameter...

Only likely to draw data from a specific subset of points,
Byte Swapping is obviously A1 B1,2,3

Most specifically if the command is a hardlink With A B C; Then most likely Storage is directly linked; Like a HardLink on a HDD in NT,

The hard link is direct value fetching from a specific Var table & most likely a sorted list!
If the list is not sorted; We are probably sorting the list..

If we do not HardLink data in a matrix (Example):

Var = V+n, Table
a b c d
1[V1][V1][V1][V1]
2[V2][V2][V2][V2]
3[V3][V3][V3][V3]
4[V4][V4][V4][V4]

A Matrix HardLink is a temporary Table specific logical reading of instructions & direct memory load and save,
Registers {A,B,C,D}=v{1,2,3,4}..

Directly read direct memory table logic & optimise resulting likely storage or retrieval locations & Soft Link (pointer table)

Solutions include multiple Gather/Scatter & 'Gather/Scatter Stride' Cube Block multi load/save..
Logical Cache Storage History Pointer Table, Group Sorted RAM Save/Load by classification {A,B,C,D}=v{1,2,3,4}
When X + Xa + Xb + Xc, When Y + a b c, When Y or X Prefetch Pointer Table + Data { a, b, c }

Example Gather/Scatter logical multiple

var pointer [p1] {a, b, c, d}
var pointer [p2] {1, 2, 3, 4}

Gather
for (i = 0; i < N; ++i)
    x[i] = y[idx[i]];
fetch y {p1, p2}; {a, b, c, d}:{1, 2, 3, 4}

Scatter
for (i = 0; i < N; ++i)
    y[idx[i]] = x[i];
send x {p1, p2}; {a, b, c, d}:{1, 2, 3, 4}

Rupert S : Reference https://en.wikipedia.org/wiki/Gather/scatter_(vector_addressing)

*

FMA is a Matrix SiMD feature & is common to ARM & AMD, CPU & GPU

Phone SIM cards can use FMA for GSM network acceleration,

We can use FMA fused MUL ADD for elliptic curve encryption to multiply Time * curve & ADD AES encryption in the form of time model & 3D dimensions,

Therefore we can use FMA to calculate the room area & add audio reverberation matrix as volume levels over time..

FMA as a basic GPU..

We can convert adder & fused MUL ADD ML,

Use all 3 types on integer function of CPU & internal GPU on echo dot type devices with internal GPU and CPU.. FPGA design.

Rupert S

Pre-Fetching; Statistically Ordered Gather/Scatter & The Scatter/Gather Commands

(SiMD) The gather/scatter commands may seem particularly random?
But we can use this in machine learning:

Gather
The equivalent of Gathering a group of factors or memories into a group & thinking about them in the context of our code! (our thought rules),

Scatter
Now if we think about scatter; we have to limit the radius of our thought to a small area of brain matter (or ram)... Or the process will leave us "Scatter-Brained"

Statistical Pre-Fetching:

Ordered Scatter
When you know approximately where to scatter

Ordered Gather
Where you know approximately where to gather

Free Thought
So now we can associate scatter & gather as a form of free thought? Yes but chaotic...
So we add order to that chaos! We limit the scattering to a single field.

Stride
Stride is the equivalent of following a line in the field; Do we also gather &Or Scatter while we stride ?
Do we simply stride a field?

Now to answer this question we simply have to denote motive!
In seeding we can scatter; Will we do better with an Ordered Scatter ? Yes we could!
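
One possible reading of Ordered Gather & Stride, As a sketch: The indexes are sorted so that memory is visited in ascending address order, While each result still lands in its original slot; Stride simply follows one line through the field. The function names & data are assumptions for illustration:

```python
def ordered_gather(y, idx):
    """Gather y[idx[i]] but visit memory in ascending address order."""
    # Sort positions by the address they point at.
    order = sorted(range(len(idx)), key=lambda i: idx[i])
    visited = []
    out = [None] * len(idx)
    for i in order:
        visited.append(idx[i])   # the access pattern actually issued
        out[i] = y[idx[i]]       # result lands in its original slot
    return out, visited

def strided_gather(y, start, stride, count):
    """Follow one line in the field: start, start+stride, ..."""
    return [y[start + k * stride] for k in range(count)]

y = list(range(100, 116))       # 16 values: 100..115
idx = [9, 2, 14, 3]

out, visited = ordered_gather(y, idx)
column = strided_gather(y, start=1, stride=4, count=4)
```

With a 4 wide row layout, A stride of 4 reads one column of the field.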

Statistically Ordered Gather/Scatter & The Scatter/Gather Commands
Pre-Fetched

Rupert S

Multi-line Packed-Bit Int SiMD Maths : Relevance HDR, WCG, ML Machine Learning (Most advantaged ADDER Maths)

The rules of multiple Maths with lower Bit widths into SiMD 256Bit (example) 64Bit & 128Bit & 512Bit can be used

In all methods you use packed bits per save, so single line save or load, Parallel, No ram thrashing.

You cannot flow a 16Bit block into another segment (the next 16Bit block)

You can however use a 9th bit as a separator; Rolling the carry of an addition into that bit means a more accurate result!
In 32Bit you do 3 * 8Bit & 1 * 4Bit; In this example the 4Bit op has 5Bit results & the 8Bit ops have 9Bit results (3 * 9 + 5 = 32)..
This is preferable!
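
A minimal sketch of the 32Bit layout above, assuming the 4Bit lane sits in the lowest 5 bits & the three 8Bit lanes follow in 9 bit slots; The slot order & the names pack & unpack are assumptions. One ordinary integer ADD then performs all four lane additions, With each carry caught by its own separator bit:

```python
# Slot layout for one 32-bit word: (shift, value_bits). Each slot is one bit
# wider than its value, so a carry lands in the guard bit, not the next lane.
SLOTS = [(0, 4), (5, 8), (14, 8), (23, 8)]   # 5 + 9 + 9 + 9 = 32 bits

def pack(values):
    word = 0
    for (shift, bits), v in zip(SLOTS, values):
        assert 0 <= v < (1 << bits), "value does not fit its lane"
        word |= v << shift
    return word

def unpack(word):
    # Read each slot including its guard bit (bits + 1 wide).
    return [(word >> shift) & ((1 << (bits + 1)) - 1) for shift, bits in SLOTS]

a = pack([15, 200, 255, 1])
b = pack([15, 100, 255, 2])

# One ordinary integer add performs all four lane additions at once.
total = (a + b) & 0xFFFFFFFF
sums = unpack(total)
```

The separator bit absorbs one carry only; After one ADD the results need to be read out or the separator bits cleared before the next ADD.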

2Bit, 3Bit, 4Bit Operation 1 , 8Bit Operations 3: Table

32Bit
4 : 1, 8 : 3

64Bit
4 : 2, 8 : 6
2 : 1, 7 : 8
3 : 1, 8 : 1, 16 : 3

Addition is the only place where 16Bit * 4 = 64Bit works easily, but when you ADD or - you can only roll to the lowest boundary of each 16Bit segment & not into the higher or lower segment.

A: In order to multiply you need adaptable rules to division & multiply
B: you need a dividable Maths unit with And OR & Not gates to segment the registered Mul SiMD Unit..

In the case of + * you need to use single line rule addition (no over flow per pixel)..
& Either Many AND-OR / Not gate layer or Parallel 16Bit blocks..

You can however, painful as it is, Multi Load & Zero the remainder registers; &Or XOR or NOT the remainder to 00000 on higher depth instructions & so remain pure!

8Bit blocks are a bit small and we use HDR & WCG, So mostly pointless!

We can however 8Bit Write a patch of palette & sub divide our colour palette & Light Shadow Curves in anything over 8Bit depth colour,

In the case of Intel 8Bit * 8 Inferencing unit : 16 Bit Colour is probably (WCG 8 * 8) + (HDR 8 * 8) Segments,

In any case Addition is fortunately what we need! so with ADD we can use SiMD & Integer Today.

Rupert S

https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html

https://science.n-helix.com/2021/11/parallel-execution.html

https://science.n-helix.com/2022/10/ml.html

https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html

https://science.n-helix.com/2023/06/map.html

M.A.P NPU Matrix Processor Dimensional construct (c)RS

Primary reason for expansion of function data sets: 2D, 3D,< nD

P.D.C is a worker thread parallel 2D or 3D Grid,
Utilising QQ & A, B,C Array maths allows us to collapse or expand dimensions in a flexible way,

The same principles as SVM (S.V.M SiMD Vector matrix) can be used to culminate or expand dimensions...

That way a M.A.P Processor can expand or collapse all mathematical constructs,
We can therefore use all mathematical & statistical arrays for machine Learning & Maths.

RS

*

The Subject of 4x4 tables,

We are obviously looking for more like 16x16 for Physics maths!
The matrix processor is a large data set; Divisible into 4x2 & 4x4 & 8x8 groups for execution speedups,
Aligned Parallel processing....

Aligned Matrix tables need to be larger than 4x4 for Physics & Chemistry; So a matrix processor ideally can at a minimum:

Matrix Table

x1
16x16

16/2
x2
8x8,8x8
8x8,8x8

8/4
x4
4x4,4x4
4x4,4x4
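
The Matrix Table split above can be sketched as plain tiling, assuming row-major order; split_blocks is a hypothetical helper. A 16x16 table divides into four 8x8 tiles & each 8x8 tile divides into four 4x4 tiles:

```python
def split_blocks(matrix, block):
    """Split a square matrix into block x block tiles, row-major."""
    n = len(matrix)
    assert n % block == 0, "matrix size must be a multiple of the block size"
    tiles = []
    for r in range(0, n, block):
        for c in range(0, n, block):
            tiles.append([row[c:c + block] for row in matrix[r:r + block]])
    return tiles

# 16x16 table filled with a running counter.
table = [[r * 16 + c for c in range(16)] for r in range(16)]

tiles_8 = split_blocks(table, 8)        # 16/2 : four 8x8 tiles
tiles_4 = split_blocks(tiles_8[0], 4)   # 8/4  : four 4x4 tiles from the first 8x8
```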

RS

*

Matrix Method (c)RS

Any GPU & CPU SiMD can do a form of Matrix maths in an Array Parallel Load & Run as consecutive tasks..

Like So

Matrix Formulas : (c)RS

SiMD Array A to X, Usually 8, 16, 32, 64 Parallel Groups

Grouped Parallel Runs
A 1, 2, 3, N
B 1, 2, 3, N
to
Y 1, 2, 3, N
X 1, 2, 3, N
Run 1 {A1, B1 to X1, Y1} Run 2+ {A2, B2 to X2, Y2}++ {An, Bn to Xn, Yn}

Matrix Processor Method Synchronous Cube Map Usually 8x8, 16x16, 32x32, 64x64 Parallel Quad++ Groups

2D:3D Cube

A 1, 2, 3, N
B 1, 2, 3, N
C 1, 2, 3, N
D 1, 2, 3, N

Run 1 2D:3D Cube {
A 1, 2, 3, N
B 1, 2, 3, N
C 1, 2, 3, N
D 1, 2, 3, N
};

Run N 2D:3D Cube {
A 1, 2, 3, N
B 1, 2, 3, N
C 1, 2, 3, N
D 1, 2, 3, N
};

Rupert Summerskill

SiMD Matrix maths begins with a 3D graph,

|___c

The graph's principle is 3 dimensions; We can use more dimensions, But on paper we need to represent the extra dimensions in colours so that every dimension is represented alongside the 3 that we can draw.

In algebra we represent 3+ dimensions with small glyphs next to each letter that represents our maths operation theoretical number.

During operation of computation we maintain in memory the specific dimensions interactions and interplay of complex matrix maths.

Rupert S

Numbers example 4D matrix

I love you 2, I love you 3, I love you 4 the ends of time... To be continued...

The formula for the NPU (c)RS

Codecs & drivers with Matrix Mathematics Formula for AVX, NPU, TPU, Coral.ai Edge TPU & GPU,
Can potentially optimise 1000's of Web pages per second

Matrix Math Formula Get more Upscaling & performance per WATT, By Block Load/Run Parallel Processing:
SiMD, AVX, XMM, NPU, FPU, GPU, Processor

The formula for the NPU is +++ *** +*+*+* Adder, Multiplier, ADD MUL, This original formula was thought of by me in the 1980's as a child...

My basic reasoning (fighting for credit with a developer) Was for the Atari game cartridge system!

Now Why ? Because of several factors:

Parallel Adder Tables are FAST, Like so fast!
Parallel Adder Tables are cheap

Parallel is the new in of RISC, Extended parallel instruction sets, Vectors & MMX & 3DNow!

The philosophy behind Parallel instruction use is to be understood to be based on the Console requirements...

Speed, Performance, Price & the difference it makes!

Formula tables may seem complex! How does a child learn of this?

2 factors: I like game arcades & consoles.. & Basically Maths education at school!

Factor tables are a basic of Excel Spreadsheets & formulas for the process of examining the forex, foreign trade & equities & share markets of the world & New York makes one dream! Dream Big!

It is to be understood that APPLE understand the functional potential of ADDER MUL & +* Tables with memory...

They do I.T!

It is to be understood that AMD & Intel & NVidia Understand machine learning in a point of view.

Due to the complexity of Formula Tables as a basis of maths & science?

WE UNDERSTOOD IT.

Do you the client and the producers : APPLE, ARM, AMD, INTEL, NVidia, Motorola (6833++)

Truly understand the TRUE Potential of Formula Tables?

Basic formula table examples:

Codecs & drivers with Matrix Mathematics Formula for AVX, NPU, TPU, Coral.ai Edge TPU & GPU,
Can potentially optimise 1000's of Web pages per second

General Matrix Table maths with parallel arrays can optimise most table maths compatible Vector Units,

Code: SysCL, OpenCL, Assembler, Tight maths code is optimised & expressed in cube blocks of data in 256kb,128kb,64kb,32kb,8kb,4kb chunks as defined by grids..

AVX, SiMD & Coral.ai EdgeTPU & NPU Acceleration

Matrix Array Maths, Arrays & vector tables
Instanced_Arrays , Cubes & Polygons & curves..
AES, ECC, RSA, DSA, Array Maths

Anti-aliasing
Edge detection
Sharpening
Excel spreadsheets
Mathematical reduction
Statistics
Synergy
Artificial intelligence (AI)
Deep learning (DL)
Machine learning (ML)
Mathematics

Understand the requirements of Maths & Know the truth!

Basic assumptions of parallel processing require a full netmasked NPU Grid Matrix..
Now obviously an NPU does NOT fully map the entire Grid array in a single pass!

Sub cubes are to be mapped in allocation blocks with strict alignment..
That being said, if you cannot use a fully packed Byte..
You probably need to map better!

MAP(c)RS include the following parameters : ++++ **** +++* In Matrix, 2D & 3D & varieties there-of

Matrix Formula Parallel Processing (c)RS

++++ **** +*+*+*+*
++++ **** +*+*+*+*
++++ **** +*+*+*+*
++++ **** +*+*+*+*

By multiplying by N*1 = +* = + & forms of cross line multiply with 0+* : N*Y
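
One reading of the three row types, As a sketch: ++++ as element-wise ADD, **** as element-wise MUL & +*+* as multiply then add per element, Each applied across a 4x4 grid. That reading & the function names are assumptions, not a definition of the formula:

```python
def add_grid(a, b):
    """++++ : element-wise add."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mul_grid(a, b):
    """**** : element-wise multiply."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def muladd_grid(a, b, c):
    """+*+* : multiply then add, a * b + c per element."""
    return [[x * y + z for x, y, z in zip(ra, rb, rc)]
            for ra, rb, rc in zip(a, b, c)]

A = [[r * 4 + c for c in range(4)] for r in range(4)]   # 0..15
B = [[2] * 4 for _ in range(4)]
C = [[1] * 4 for _ in range(4)]

added = add_grid(A, B)
scaled = mul_grid(A, B)
fused = muladd_grid(A, B, C)
```

Every cell is independent of its neighbours, Which is what lets the whole grid run in parallel.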

Honestly 2D & 3D Matrix SiMD qualify as significantly qualified...

The question of AVX, SiMD & Basic instructional formulas..
Is potentially only about latency & complexity!

Instruction hierarchies with fully qualified SiMD & Vector instruction lists
Wiring on die is significant with complex instructions; So Bus Width is a significant challenge.

2x & 3x instruction Load/Store per operation Cycle reduces required Bus width..

Bus = I, SiMD = S, Memory Cache = M

SI=M=IS S=IM=S S=IM=S=IM
SI=M=IS S=IM=S S=IM=S=IM
SI=M=IS S=IM=S S=IM=S=IM
SI=M=IS S=IM=S S=IM=S=IM

The HerringBone Attribute allows Store & Run with faster Cache RAM & Dynamic Allocations?
Instruction out through buss, Parallel pipe.

Instructions can run, Left, Right, Up, Down...

Logic dictates Output direction is next operation or system ram & use.

RS

Map grid examples:

16KB Cubes [ ]

Grids are defined like so..

[ ]1a, [ ]2a, [ ]3a, [ ]4a
[ ]1b, [ ]2b, [ ]3b, [ ]4b
[ ]1c, [ ]2c, [ ]3c, [ ]4c
[ ]1d, [ ]2d, [ ]3d, [ ]4d

Example, We allocate an Address segment, 1, 2, 3, 4 or a1, a2, b1, b2 or 1a, 1b, 1c, 1d or a1:a4, b1:b4, c1:c4, d1:d4,
Independent parallel masking.

We can map multiple arrays in a bus & in a single pass..
With command load, run, save
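
The address segments above can be sketched as bit masks, assuming one bit per cell of the 4x4 grid with row a in the lowest 4 bits; The bit order & the names cell_bit & mask_of are assumptions. Rows, columns & 2x2 blocks then become independent masks that combine with plain bit operations:

```python
ROWS = "abcd"

def cell_bit(label):
    """Map a label such as '3b' (column 3, row b) to a bit in a 16-bit mask."""
    col = int(label[0]) - 1
    row = ROWS.index(label[1])
    return 1 << (row * 4 + col)

def mask_of(labels):
    mask = 0
    for label in labels:
        mask |= cell_bit(label)
    return mask

row_a = mask_of(["1a", "2a", "3a", "4a"])       # one full row
col_1 = mask_of(["1a", "1b", "1c", "1d"])       # one full column
quad = mask_of(["1a", "2a", "1b", "2b"])        # top-left 2x2 block

# Independent masks can be combined or tested for overlap with plain bit ops.
overlap = row_a & col_1
```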

(c)Rupert Summerskill 2024 'The Years to EXCELL'

TPM Verified Loop Code : Production Verified & Signed : Qualified Encryption & Compression Privacy (c)RS

Private loops : Security Level Verified Code & Byte Code

For security reasons the Block set of Lattice maths is loaded & fetched on secondary execution string,

Code dislocation involves no trace loop; For efficiency reasons the code optimal loop & fetch cycles are analysed in closed loop lab..

Data & code analytics are non disclosed for debugging clean stack code.

(c)RS

*

Matrix Formula block loading for SiMD Shaders makes sense, Most tasks can fit 4 commands in a row (in 64KB RAM)

Depending on the task; You can fill a grid { a1, a2,/ b1, b2 } ,

Or more depending on command length & data content..

SiMD Unit 2x 16Bit per row; 4 Rows per unit : grid { a1, a2,/ b1, b2 },

NPU & AVX 512 & 256 & 128 bit; have a much larger grid if supporting 16Bit values.

Rupert S

*

Standard deviation & derivatives (c)RS

There are many tasks suitable for standard, average, gaussian & mean deviation..

By perfect example; The Average, Mean, High & Low sample data set & Machine Learning &
Reason..

Cherished by the late Greeks, averaging data sets & pole data & metrics.

Standard deviation used for Dithering & Smoothing & Edge shaping & Sharpening with a smooth look,

In Codecs & Texture formats & can significantly improve look..

Used in statistical analysis, Image processing : Averaging, Error Diffusion Dithering, Averaging Dither, Sharpening & shaping,

By average mean & standard deviation : Tessellation, Vertex culling, Shape & colour composure, Colour matching & Identification tasks.

Rupert S

Understanding Standard Deviation and Derivatives

Standard Deviation:
A measure of how spread out data is from the mean.
A high standard deviation indicates a wide range of values..
A low standard deviation means data points are clustered closely around the mean.

Gaussian Distribution:
The normal distribution (or Gaussian distribution) is often used in statistical analysis and machine learning due to its symmetrical shape and known properties. Standard deviation is a key parameter of the Gaussian distribution.

Mean Deviation:
While less commonly used than standard deviation,
Mean deviation measures the average absolute distance from the mean.
It can be more robust to outliers than standard deviation.
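
As a worked example of the definitions above, Using Python's standard statistics module on a small made-up sample; The mean deviation is computed by hand as the average absolute distance from the mean:

```python
import statistics

samples = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.fmean(samples)
std_dev = statistics.pstdev(samples)          # population standard deviation
mean_dev = sum(abs(x - mean) for x in samples) / len(samples)
```

For this sample the mean is 5, The population standard deviation is 2 & the mean deviation is 1.5.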

Derivatives: A mathematical tool that measures the rate of change of a function.
In image processing, derivatives can help detect edges and features.

Applications

Image Processing:

Edge Detection:
Derivatives can highlight areas of rapid change in intensity, indicating edges.

Noise Reduction: Standard deviation can be used to identify and filter out outliers (noise) in images.

Gaussian Blur: A convolution with a Gaussian kernel (which is defined by its mean and standard deviation) is used to smooth images and reduce noise.

Dithering:
Standard deviation can help determine the optimal dithering pattern for reducing color banding.

Derivatives in Higher Dimensions:
For images, which are 2D & 3D signals..
We can use partial derivatives to measure changes in the x and y directions.

Edge Detection:

Convolution with Sobel or Laplacian kernels: These kernels are discrete approximations of first (Sobel) & second (Laplacian) derivatives.
The magnitude of the convolution output indicates the edge strength.
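
A short sketch of Sobel edge strength on a tiny made-up image, In plain Python with no image library; The helper names are illustrative. The kernels are applied in cross-correlation form, Which for Sobel only flips the sign of the gradient & leaves the magnitude unchanged:

```python
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1],
           [ 0,  0,  0],
           [ 1,  2,  1]]

def apply3x3(image, kernel, r, c):
    """Weighted sum of the 3x3 neighbourhood centred on (r, c)."""
    return sum(kernel[i][j] * image[r - 1 + i][c - 1 + j]
               for i in range(3) for j in range(3))

def edge_strength(image):
    """Gradient magnitude for interior pixels; the border is left at 0."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = apply3x3(image, SOBEL_X, r, c)
            gy = apply3x3(image, SOBEL_Y, r, c)
            out[r][c] = (gx * gx + gy * gy) ** 0.5
    return out

# 5x5 image with a vertical step edge between columns 1 and 2.
image = [[0, 0, 10, 10, 10] for _ in range(5)]
strength = edge_strength(image)
```

The strength is highest on the two columns either side of the step & zero in the flat area.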

Canny Edge Detector: Smooths the image with a Gaussian filter (set by its standard deviation), Then applies two hysteresis thresholds to the gradient magnitude for edge detection.

Median Filter: A non-linear filter that replaces each pixel with the median value of its neighborhood.
While not directly related to standard deviation, it's often used for noise reduction.

Statistics:

Hypothesis Testing:
Standard deviation is crucial for calculating test statistics and determining significance levels.

Confidence Intervals:
It helps construct confidence intervals around sample means.

T-test:
Compares the means of two groups.
The standard deviation is used to calculate the t-statistic.

ANOVA:
Compares the means of multiple groups.
Standard deviation is used to calculate the F-statistic.

Clustering: It can help identify natural groupings in data.

Histogram of Oriented Gradients (HOG):
Uses derivatives to compute gradient magnitudes and orientations, which are then used to create feature descriptors.

Machine Learning & statistical analysis pt2:

Standard deviation can be used to normalize features & enhance average improving model performance & accuracy..
Normalization: Standard deviation is often used to normalize features, ensuring they have a similar scale.

Clustering: Algorithms like k-means use distance measures (which can involve standard deviation) to group data points.

Least Squares Regression:

The standard deviation of the residuals (the difference between the predicted and actual values) is used to assess the model's fit & fitness & accuracy.

Computational Efficiency:

While standard deviation and derivatives are powerful tools..
they can be computationally expensive for large datasets.
Efficient algorithms and hardware implementations are essential.

RS

ML, TFLite/ONNX : Wavelet & Array content such as HTML, JS, DNS & NTP protocols : RS

Example TFLite/ONNX can interpret Wavelet restoration as a final layer to sharpen encoding or decoding in Codecs & Texture formats or image compression libraries & DLL/.so H265 H264 H266 & DSC,
Such a compression advantage is due to the random bits in MPG & AAC & Huffman coding.

We use a Matrix Maths Array to carry out the shaping; Because Waveshaping Matrix is a lot faster!

Aligned Matrix

2x SiMD
4x SiMD
8x to 64x AVX
4x to 128x NPU/TPU

Array formatted content such as DNS information can be ordered & sorted by logic & corrected for deviations from standard W3.org Mark-up language

We need TFLite/ONNX for games that have a light ML Payload for gaming AI & for antivirus or system flow such as servers route selection..

Don't kid yourself; TFLite is a light load on a system!

ONNX is good but TFLite kernels come in under 50Kb

https://www.w3.org/TR/
https://www.w3.org/TR/webnn/#api
https://www.w3.org/TR/webgpu/#packed-formats

RS

Perfect sample for Matrix Tables : https://gpuopen.com/learn/sampling-normal-gaussian-distribution-gpus/

" // Method 2: Box-Muller Transform

float2 sampleGaussBoxMuller(float2 u, float mean, float standardDeviation)
{
const float a = standardDeviation * sqrt(-2.0f * log(1.0f - u.x));
const float b = TWO_PI * u.y;

return float2(cos(b), sin(b)) * a + mean;
}

"

We can either repeat loop solves : (cos(b), sin(b)) * a + mean,
Or we can form a table matrix

(cos(b), sin(b)) = x , * a + mean = y

1 2 3 4
a x*y, x*y, x*y, x*y
b x*y, x*y, x*y, x*y
c x*y, x*y, x*y, x*y
d x*y, x*y, x*y, x*y
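
A sketch of the table idea in Python, assuming rows a..d carry the u.x values & columns 1..4 carry the u.y values; That split & the name gauss_table are assumptions. The radius term is computed once per row & the (cos, sin) pair once per column, So each cell costs one multiply-add per component:

```python
import math

def sample_gauss_box_muller(u, mean, standard_deviation):
    """Python port of the quoted shader function; u is a pair in [0, 1)."""
    a = standard_deviation * math.sqrt(-2.0 * math.log(1.0 - u[0]))
    b = 2.0 * math.pi * u[1]
    return (math.cos(b) * a + mean, math.sin(b) * a + mean)

def gauss_table(u_rows, u_cols, mean, standard_deviation):
    """Precompute the radius per row & the (cos, sin) pair per column once,
    then fill the table with one multiply-add per cell component."""
    radius = [standard_deviation * math.sqrt(-2.0 * math.log(1.0 - ux))
              for ux in u_rows]
    unit = [(math.cos(2.0 * math.pi * uy), math.sin(2.0 * math.pi * uy))
            for uy in u_cols]
    return [[(cx * a + mean, sy * a + mean) for (cx, sy) in unit]
            for a in radius]

u_rows = [0.1, 0.35, 0.6, 0.85]     # u.x values, rows a..d
u_cols = [0.0, 0.25, 0.5, 0.75]     # u.y values, columns 1..4
table = gauss_table(u_rows, u_cols, mean=0.0, standard_deviation=1.0)
```

For a 4x4 table this replaces 16 sqrt/log & 16 cos/sin evaluations with 4 of each.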

Rupert S

*

Lattice Squares Kyber, Falcon, AES, DES, RSA, ECC:

The use of Lattice Squares, Otherwise known as Matrix Maths Formula..
In AVX, SiMD, NPU require efficient code modelling:

Lattice Grids are defined like so..

[ ]1a, [ ]2a, [ ]3a, [ ]4a
[ ]1b, [ ]2b, [ ]3b, [ ]4b
[ ]1c, [ ]2c, [ ]3c, [ ]4c
[ ]1d, [ ]2d, [ ]3d, [ ]4d

Multi-Threaded in parallel

Security for top quality Mobile Phone 3G, 4G, 5G, LTE & 2.4G & WiFi & Bluetooth : ICE-SSRTP GEA Replacement 2022 https://science.n-helix.com/2022/03/ice-ssrtp.html

A SiMD Variant Matrix Maths Formula, All kinds of work can be carried out :
Anti-Aliasing,
Sharpening Masks,
Code that requires a 4x4, 8x8, 16x16 Grid,

Bear in mind that SiMD is 2 lane 32Bit & 4 Lane 16Bit..
So a 4x4 matrix is ideal per SiMD Core Group @ 16Bit

4x4 = Single double lane SiMD Unit
8x8 = 2 Double Lanes
16x16 = 4 Double Lanes
32x32 = 8 Double Lanes

Double Fetch or Quad Fetch

RS

DML

Suitable for processing: Lattice, Kyber, Falcon, AES, ECC & drawing Vectors, JS & WebASM, PHP & HTML5 Web formats..

DML Level is important!

Matrix Maths Formula

Matrix Operation examples :

DML_FEATURE_LEVEL_2_1 : https://is.gd/DictionarySortJS

DML in relation to Instanced_Arrays & DirectX & Vulkan/OpenGL & CL

DML_OPERATOR_ELEMENT_WISE_
MAX, MEAN, MIN, MULTIPLY, SUBTRACT

RS

Reference the 'Parallel multiplication Grid NPU Simulation' Doc in https://is.gd/DictionarySortJS

https://science.n-helix.com/2016/04/3d-desktop-virtualization.html

https://science.n-helix.com/2022/04/vecsr.html

https://science.n-helix.com/2019/06/vulkan-stack.html

Accelerated Python: NPU, TPU, SiMD
https://is.gd/CoralAI

https://is.gd/DictionarySortJS
https://is.gd/UpscaleWinDL
https://is.gd/HPC_HIP_CUDA

https://is.gd/SPIRV_HIPcuda
https://is.gd/TFLiteDev

https://is.gd/UpscalerUSB_ROM

Directed Matrix Principle : RS

Matrix Principle directed at traditional parallel Integer & SiMD Instruction groups

The main problem with 32KB L1 tables is cache filling & domination of CPU/GPU by single program instruction groups..

Instruction cache is the primary challenge; Because Instruction cache L1 is commonly 32KB; Data cache 64KB,
L2 is 512KB to 4MB; L3 4MB to 16MB (can be more on Epyc)..

Optimised instruction groups by instruction, SiMD multiprocessing thread count:

Firstly requirements: (32KB instruction Cache L1, 512KB L2, 8MB L3)

L1 Instruction Group 32KB
L2 running group 512KB
L3 RAM & storage direct fetching 8MB

8KB core table for group threading,
24KB of grouped & Synchronised instructions

Data work Groups 512KB L2 / 64 Instruction Group sets (L1 32KB Table),
So Main instruction groups from L1 with larger data sets.

L3 4MB to 8MB of data & instruction caching load (directed from L1 & funneled into L2)

Instructions are cross threaded directly through L3 & L2 synchronised Load, Run & Save,

Optimised instruction groups by instruction, SiMD multiprocessing thread count.

Rupert S

Parallel Arrays : Matrix forms : RS

Matrix processor is a feature that will be more common & is relatively similar to an Abacus with a multiple array of + & * Operators..

Now a Matrix Array is X1 > Xn & Y1 > Yn

Commonly an array of 16 x 16 but can be 8 x 8 or 4 x 4,

Now we can perform such operations as Relativity & String theory on a lattice & that is very fast!

We can also perform these functions on SiMD, AVX in parallel; Such that 256Bit SiMD is 32Bit x 8 Parallel & so forth

Parallel
a : 64Bit
b : 64Bit
c : 64Bit
d : 64Bit

Matrix
a1a2a3a4
b1b2b3b4
c1c2c3c4
d1d2d3d4

Now we can see that we can perform a matrix operation such as lattice with both SiMD & SiMD-Matrix,

We can also see that a Matrix shall & can present our solution & that SiMD can also!
But we need Long operation SiMD or many passes to complete our operations; If Larger than our size..
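The lane & pass counting above can be sketched with simple arithmetic; This assumes one matrix element per lane & ignores load/store cost. The function names lanes & passes are hypothetical.

```python
import math

def lanes(register_bits, element_bits):
    """Number of elements one SiMD register holds side by side."""
    return register_bits // element_bits

def passes(rows, cols, register_bits, element_bits):
    """Passes needed to touch every matrix element once with one register."""
    return math.ceil(rows * cols / lanes(register_bits, element_bits))
```

For example lanes(256, 32) gives the 8 Parallel mentioned above; A 16 x 16 matrix of 32Bit values then needs 32 passes of a single 256Bit register.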

We can also therefore most likely..

Use AES-NI S Letter Box & SVE & Matrix & SiMD to our advantage for many Lattice operations.

Multiplier Matrix Accelerated Encryption, Like I said A Parallel SiMD array may do the same; If all memory arrays are connected by a single RAM/Cache ALU Node,

As stated Parallel Arrays & Parallel Matrix Arrays.

Rupert Summerskill

https://science.n-helix.com/2023/06/map.html

https://science.n-helix.com/2022/03/ice-ssrtp.html

Bluetooth LE Protocol
https://drive.google.com/file/d/17csRnAfdceZiTSnQZvhaLqLSwL__zsIG/view?usp=sharing

*

Examples of Parallel execution pipeline : Parallel arrays:

Crypto lattice, Kyber/ML-KEM, AES : Parallelised Lattices, 8x & 16x Parallel SiMD F16/32/64/128/192/256Bit

parameterisation of groups of 4x Parallel SiMD F16 & 8x Parallel SiMD F16

Parallelised motion & Video/Audio Deblocking/Blocking

8x8 16x16 quantisation of video is common in VVC & H265 & H264 & JPEG & MP3, MP4a & AAC,
Suggested parameterisation of 4x Parallel SiMD F16

8x8 16x16 quantisation of video is common in HDR VVC & H265 & H264 & JPEG & MP3, MP4a & AAC & AC3 & AC4,
Suggested parameterisation of 4x Parallel SiMD F32

Shapes in motion 2D : 4x per Cube in motion,
Shapes in motion 2D : 6x per Texture Shaded Cube in motion,

Shapes in motion 3D : 6x per Cube in motion,
Shapes in motion 3D : 8x per Texture Shaded Cube in motion,

RS

*

Number relativity, Bit precision: RS

In gaming a player has access to palette of 16bit FFFFFFFFFFFFFFFFFFFF.FFFFFFFF BF16 F=16 HEX; In 32bit memory storage.

Average gamers recognise maybe 32000 colours directly,

Colour rich artist colourists recognise almost 6000000 colours TOPCloud.

Variety is king & queen of experience,

A specialist artist recognises more colours than a basic gamer or graphics artist in vectors..

Matrix maths operations precision is relative to hardware,

XBox 4bit FFFF, PLAYSTATION 8Bit FFFFFFFF

RollINT precision 1 to 4 bit + integer -1 to 4 bit F, FFFF, FFF+.F Xbox Or FFFFFFF+.F Ps

Bit precision is relative to your experience!

Rupert S

RollINT - Machine Learning for Console & Computer : RS

With True Value memory/Operation cache...

Application of RollINT to machine learning with definition,
A Playstation APU has 8Bit Integers for inference; XBox 4Bit..

In order to describe 4Bit as float; You would need to define 3Bit & 1Bit R remainder,
So how does this work?

In loading value the first 3Bit is the value & the 4th bit is remainder & when you load the value stored..

You fetch 3Bit as the value & 1 Bit as the remainder; Example:

FFFe > Value FFF &R e, So the value is FFF.e not FFFe
you can do multiple data type operations in this method; For example:

FFde = FF & de or FF.de or you could do Ffde & mean F.fde; Useful for definitions of Pi,

For example Pi in 4Bit (8Bits Preferred); Commonly used by kids at school!,

However you convert the stored 4Bit Pi to a fully accurate value on FPU & SiMD execution by loading pre-stored true value.
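One reading of the value & remainder split above, as a minimal sketch; It treats each hex digit as a position, so FFFe splits into value FFF & remainder e. The names split_word & as_real are hypothetical, & the choice to weight the remainder as a base 16 fraction is an assumption.

```python
def split_word(word, remainder_digits=1):
    """Split a hex word into (value, remainder) at a chosen hex-digit boundary."""
    shift = 4 * remainder_digits            # 4 bits per hex digit
    value = word >> shift                   # upper digits: the integer value
    remainder = word & ((1 << shift) - 1)   # lower digits: the remainder R
    return value, remainder

def as_real(word, remainder_digits=1):
    """Read the word as value.remainder, remainder weighted as a base 16 fraction."""
    value, remainder = split_word(word, remainder_digits)
    return value + remainder / (16 ** remainder_digits)
```

So split_word(0xFFDE, 2) reads FFde as FF & de; The same stored word serves several data types depending on where the boundary is placed.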

RollINT

We are using roll to roll a zero on or off an integer,

Therefore we are able to divide and multiply and add so that..

101-0 > 10.1+0 ; The number of rolled zeros can range practically from 0 to 00000000.

So 10023-000 > 10.023+000

We can then store floating point numbers in integers.
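A minimal sketch of the roll idea, read here as decimal fixed point; The integer holds the digits & a separate roll count says how many places the point is rolled left, so 10023 with a roll of 3 means 10.023. The pair layout & the names roll_to_value, roll_mul & roll_add are assumptions for illustration.

```python
from fractions import Fraction

def roll_to_value(digits, roll):
    """Exact value of a rolled integer: digits / 10**roll."""
    return Fraction(digits, 10 ** roll)

def roll_mul(a, b):
    """Multiply two (digits, roll) pairs using integer maths only."""
    return (a[0] * b[0], a[1] + b[1])

def roll_add(a, b):
    """Add two (digits, roll) pairs after rolling both to the same place."""
    roll = max(a[1], b[1])
    return (a[0] * 10 ** (roll - a[1]) + b[0] * 10 ** (roll - b[1]), roll)
```

All the maths stays on the integer unit; Conversion to a float is only needed on the final result.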

(C) Rupert S,

Reference Int & FP Value Sizes; A reminder that Floats are 50% of highest Integer Value,
ROLLInt floats still have an amazing additional value!

https://learn.microsoft.com/en-us/dotnet/standard/numerics

RollINT : The Float Perfectionist

Playstation & XBox are primary examples where the Int8 unit could do a RollINT Floating point operation for machine learning that is specific to float FPU Solves,

Edge detection, Sharpening & Adaptive Contrast & Colour HDR..

Depending if you directly roll on SiMD & FPU then you can still sharpen with the bF16 & half precision FPU/SiMD Maths operations on the final run!

Imagine Luke SkyWalkers final Torpedo Salvo as FPU/SiMD Vectors DT

RS

Scaler is an argument for the role of RollINT & also a pointer to method

RollINT : A Float view of machine learning,
Essentially the core issue is the role float may play in a result...

Does the human mind not use a common integer format with a small float remainder?
Potential for this configuration is mainly because Integer values are in the main Substantive information..

Float value (the sub decimal place below 0.); Is in essence a precise small value of high importance to skills such as jumping, Running, Motions & skill actions like shooting..

Integer is the majority of action related to large steps; Particularly because people have the capacity to change from Meter to Centimetre to Millimetre,

Justifications for Float values diminish if you have scalar units such as the meter, the Yard, foot, Inch & 16th!

However; As may be pointed out, Roll Scalar? Is a form of floating unit expression; If Scalar measurements are regarded in terms of statics; Then Yes Integer:{Meter}; FPU:{cm, mm} is a float value!

Nonetheless Scaler is an argument for the role of RollINT & also a pointer to method..

Scaling you see; is everything to detail; If you want to see this? Magnify or Zoom & Wide angle!
We further scale; By hitboxing our ML; In other words by training the AI on Centric value rewards..

AI Content:

{Content value reward targets};
{Centric Core values};

Return = Value;
end = infinite
Test Loop {AI C, End}; Begin

Epochs = {Satisfied End}

Rupert S

Float & Integer : RollINT : In Depth Analytics

RollINT List

Floats with small precision values : RollINT

Dreams have 'Small Randoms', Minor details make a true reality

(OS & Chrome Example)
The size of frames & text alignment
Main colour groups for desktop & browser colours : FFFFFF.FF
Frames forward & backward with submenus are worthy of low precision floats : FFF.F 300 Frames 16 sub allocated positions inside frame:{SubFrame}

Both low & high precision

High Efficiency ZLib, GZip Ram compression
Localised Error correction

Colour depth & contrast HDR, Low error rate/Higher

RS

RollINT Versus Metric principle of float reduction : RS

Scale correctly & avoid that FPU being needed

Scale correctly first; Example mouse is Millimetre & Micrometre & Large scale Centimetre,
Photon Microscope is Picometre, Millimetre, Centimetre,
Telescope is Kilometre, Metre, Millimetre..
Screens UpScale & Zoom, Do we need to rescale our measurement ?

https://learn.microsoft.com/en-us/dotnet/standard/numerics

X+- , Y+- 2D+- central point measurements
Int16 2 -32,768 32,767
Int32 4 -2,147,483,648 2,147,483,647
Int64 8 -9,223,372,036,854,775,808 9,223,372,036,854,775,807 (might want to use floats; A lot quicker)

Precision Floats
16Bit Half 2 ±65504
32Bit Single 4 ±3.4 × 10^38
64Bit Double 8 ±1.7 × 10^308

The main attack Vector being mice & touchscreens & utility scopes & measuring devices...
We wanted DPI without stress!

A range of options exist when using RollINT; The idea is to Roll a float on operation; To be fair hardware like the Amiga has the concept of Integer operation with a float as the final result..

However that option Is "the Final result" & does not mean that you could use RollINT to make a repeated Float maths for applications..

However RollINT could be used 2 Significant ways:

You could use FPU on the result (Previous integer operations save FPU for other tasks)
You could receive an Integer result from the float operation (Final float value on multiple operations not important to you?)

Perform Metrification & therefore avoid float value use; for example expand the data into a higher precision mode,

The principle of the Metric system is to use sub parts to reduce the necessity of floats : Meter, Centimetre, Millimetre, KG, Gram, Ounce..

So avoiding a floating unit..

The method is multiple operations, Large, Small, Smaller & can in reality be repeated down to picometer or tiny weights...

This method is multiple operation rounds,

RollINT & FPU Avoid rounds of CPU Cycles; But options exist.
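The Large, Small, Smaller rounds described above can be sketched with integer division only; This sketch assumes a micrometre base unit & non-negative lengths. The UNITS table & the function names are hypothetical.

```python
# Sub-unit sizes in micrometres, largest first (an assumed choice of base unit).
UNITS = (("m", 1_000_000), ("cm", 10_000), ("mm", 1_000), ("um", 1))

def to_metric_parts(total_um):
    """Break an integer length into unit parts; one divmod round per unit."""
    parts = {}
    rest = total_um
    for name, size in UNITS:
        parts[name], rest = divmod(rest, size)
    return parts

def from_metric_parts(parts):
    """Rebuild the integer length; no float is used at any step."""
    return sum(parts[name] * size for name, size in UNITS)
```

Each unit costs one extra integer round; That is the trade against a single float operation.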

RS

As you know the Matrix Array Processor is now frequent with Intel, Mac M1 & M2, AMD & NVidia Versions..

Quantum computers rely on Multi-Directional & Multi-Dimensional Arrays per Qbit!

Well this is a design structure for a Multi-Array Multi-Connection Matrix Array Processor..

The principle is basically quite logical!

Multi-Array Multi-Connection Matrix Array Co-Processor - Quanta Light Compute 2023-06-23

Percentage based 3D Processing to handle all 3D Array processing,

Central [H.P.C] Tasks map to probability over Networks [=====] & [M.A.P] Units in arrays

Table define

{

[M.A.P] = M.A.P , M.A.P 8 Way interconnect,
[H.P.C] = M.A.P High Precision Central Core,
[=====] = Buss Connections & networking

}

Top View

[M.A.P][M.A.P][M.A.P]
[M.A.P][H.P.C][M.A.P]
[M.A.P][M.A.P][M.A.P]

Side View 3D

[M.A.P][H.P.C][M.A.P]
[M.A.P][=====][M.A.P]
[=====][H.P.C][=====]
[M.A.P][=====][M.A.P]
[=====][H.P.C][=====]
[M.A.P][=====][M.A.P]
[=====][H.P.C][=====]

Each [H.P.C] Central Contains RAM & connections to the 8 [M.A.P] & Optionally to layers above & below in 3D Matrix,
Bottom of wafer contains high resolution buss to onboard controllers & networks & DPU/GPU/CPU's

Array = Matrix Array Processor Unit (c)RS

ffffffff ffffffff ffffffff
........+ ........*+ ........*
........+ ........*+ ........*
........+ ........*+ ........*

f=fp,unit
*=mul
+=add
.=Cache/Ram

Simple absolver table for MUL:ADD : MUL* Only = +0, +- Only = N*1 then +-
% = / 100 + ADD Table {N1 <> N...} : Result!
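The absolver table above can be sketched as thin wrappers over one multiply-add; This is an illustration, not a hardware FMA (plain Python rounds after the multiply & again after the add). The function names are hypothetical, & reading % as a multiply by p/100 is an assumption.

```python
def fma(a, b, c):
    """a * b + c; the single primitive every row of the table maps onto."""
    return a * b + c

def mul_only(a, b):
    """MUL only: add zero."""
    return fma(a, b, 0)

def add_only(n, c):
    """ADD only: multiply by one, then add."""
    return fma(n, 1, c)

def sub_only(n, c):
    """SUBTRACT only: multiply by one, then add the negated value."""
    return fma(n, 1, -c)

def percent_of(n, p):
    """Percentage: scale by p / 100, add zero."""
    return fma(n, p / 100, 0)
```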

(c)Rupert S

SiMD:CMA (c)RS

Standard SiMD Features, Byte Swap, ADD,MUL[SSimd]
8 x Cache,Mul,ADD: [8xCMA]

[SSimd]
[8xCMA][8xCMA][8xCMA][8xCMA]

[SSimd] is additional features accessed by register poke, Standard Operation is CMA & RAM
[8xCMA] is used as RAM in most SiMD Operations & MUL+ADD, ADD, MUL

In SiMD Ops
On RAM up to 3x F16 can be stored (3xF16, F32 + F16, F48, F24x2)

MUL or ADD Operations can be {F16:F16:F16, F32 *+- F16, F24 *+- F24}
Operations are saved to Master Cache & sent to RAM or other functions & can be {F16, F24, F32, F48},
Because master cache is a full buffer; you have to save it first! before reuse!
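The 3xF16 storage case can be sketched with the standard struct module, which supports IEEE half precision through the "e" format; The 48Bit word layout (little-endian, first value in the low bits) & the function names are assumptions.

```python
import struct

def pack_3xf16(a, b, c):
    """Pack three half precision floats into one 48Bit integer word."""
    raw = struct.pack("<3e", a, b, c)      # 3 x 2 bytes = 6 bytes
    return int.from_bytes(raw, "little")

def unpack_3xf16(word):
    """Recover the three half precision values from a 48Bit word."""
    return struct.unpack("<3e", word.to_bytes(6, "little"))
```

Values that F16 cannot hold exactly are rounded on packing; The round trip is exact only for representable values such as 1.0, -2.5 & 0.5.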

Design uses the M.A.P basic MUL+ADD & RAM

(c)Rupert S

References: DOT4, INT8, INT16, F16, F32, F64 (c)Rupert S

https://is.gd/LEDSource

https://science.n-helix.com/2023/06/ptp.html
https://science.n-helix.com/2023/06/map.html
https://science.n-helix.com/2023/06/tops.html
https://science.n-helix.com/2022/01/ntp.html

https://science.n-helix.com/2023/02/pm-qos.html

https://science.n-helix.com/2023/07/3dchiplet.html

https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html
https://science.n-helix.com/2021/02/multi-operation-maths.html
https://science.n-helix.com/2021/11/parallel-execution.html
https://science.n-helix.com/2022/12/math-error-solve.html
https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
https://science.n-helix.com/2022/10/ml.html

Sparse matrix multiplication in SRM array
https://www.science.org/doi/10.1126/sciadv.adf7474

Error Correction Options & Mitigation
https://futurism.com/ibm-breakthrough-quantum-computing

Nx-DeepMatrix Engines
https://www.nextplatform.com/2023/08/02/unleashing-an-open-source-torrent-on-cpus-and-ai-engines/
https://idstch.com/geopolitics/next-generation-neuromorphic-chips-bringing-deep-learning-from-cloud-to-iot-edge-devices-and-mobiles/
https://www.backblaze.com/blog/ai-101-gpu-vs-tpu-vs-npu/

Experimental CPU Proof : A proposal for an Open RISC V Processor, Statistical diagrams of function & graphs with function use under load...
https://www.researchgate.net/publication/373403576_Design_of_a_High_Performance_Vector_Processor_Based_on_RISIC-V_Architecture

ML Batch Matrix MAP in FPGA
https://drive.google.com/file/d/1hdxeK1r8LIhvpn7poOm3MfXmGr9Tq-ni/view?usp=sharing

ML Compressed Dynamic16bit-8Bit - Hardware-friendly compression and hardware acceleration for ML Transformer
https://aimspress.com/article/doi/10.3934/era.2022192

Matrix Processors - Memory & command - All-Digital Compute-In-Memory FPGA Architecture for Deep Learning Acceleration
https://dl.acm.org/doi/pdf/10.1145/3640469

Matrix Processors - Inline Ram & Command { CMD : RAM }:{NET}
https://www.xilinx.com/content/dam/xilinx/support/documents/white_papers/wp506-ai-engine.pdf
https://www.xilinx.com/content/dam/xilinx/support/documents/white_papers/EW2020-Deep-Learning-Inference-AICore.pdf

***

Cooperative Matrix Math : RS

Cooperative Matrix is a Math type where you formulate a Grid of number & math notations & solve them in sync,

The consequence for you is that the maths is both Faster; More Complex But also easier to correct for errors...

Usually Matrix Maths is used for Algebra, Image & 3D Mapping ML; Such as to see, Maps & Dungeons, Water tables, Technology Development.

Matrix

Var = V+n, Table

a b c d

1[V1][V1][V1][V1]
2[V2][V2][V2][V2]

3[V3][V3][V3][V3]

4[V4][V4][V4][V4]

There are 3 main ways for matrix maths:

V1a {/,*,+,-},Value, %, Fraction V1b, V2a, V2b : In effect a dither map or calculation; So connected.
Vector groups {V1a<>z} Maths to {V2a<>z} to {V3a<>z} to {V4a<>z} & more ..

Sorted by Type of operation example
M = Multi Complex Operations In Groups
a b c d
1[V1]+[V1]+[V1]+[V1]
2[V2]*[V2]*[V2]*[V2]
3[V3] / [V3] / [V3]/[V3]
4[V4]M[V4]M[V4]M[V4]
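The table sorted by type of operation can be sketched as one reduction per row; Rows 1 to 3 use +, * & /, while row 4 (M) takes a caller supplied operation. The names ROW_OPS & solve_row are hypothetical.

```python
from functools import reduce
import operator

# Row number -> operator, following the table above.
ROW_OPS = {1: operator.add, 2: operator.mul, 3: operator.truediv}

def solve_row(row_number, values, custom=None):
    """Fold one table row left to right with that row's operator."""
    op = ROW_OPS.get(row_number, custom)
    if op is None:
        raise ValueError("row needs a custom operation")
    return reduce(op, values)
```

Each row is independent; So the four rows can be solved in sync on separate lanes or threads.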

Refer to : Var = V+n, Table

Matrix Accumulator Header Matrix : {MAHM}
SiMD Wave : 32, 64 Group with finalised result + ALU : Work Group Wave Matrix : {WGWM}
Wave Matrix Accumulator Cube : {WMAC}

{MAHM}
{WMAC},{WMAC}
{WMAC},{WMAC}

{MAHM}
{WGWM},{WGWM}
{WGWM},{WGWM}

{MAHM}
{WGWM},{WGWM}
{WMAC},{WMAC}

CTP-HTM : CPU, TPU, Processor Hypervisor Thread Management : RS

Parallel Group Threads:

Work groups are Aligned by:

Work Group Size (aligned by Bit):

Memory Range {Half Float, b16Bit,b32Bit, 16Bit,32Bit , Double Float}
Aligned Cluster Size,
Bit-depth & Length of code

The logic is that Parallel Group Threads with the same Code complexity & Size should finish around the same time,
They also typically require the same processor priority so that system tasks have Runtime Availability.
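A minimal sketch of the grouping logic; Tasks with the same bit depth & the same aligned code length land in one group, on the assumption that they finish around the same time. The task tuple layout, the 64 unit cluster size & the name group_threads are all hypothetical.

```python
from collections import defaultdict

def group_threads(tasks, cluster_size=64):
    """Group (name, bit_depth, code_length) tasks by depth & aligned length."""
    groups = defaultdict(list)
    for name, bits, code_len in tasks:
        # Round the code length up to the next multiple of the cluster size.
        aligned_len = -(-code_len // cluster_size) * cluster_size
        groups[(bits, aligned_len)].append(name)
    return dict(groups)
```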

RS

Guide to Cooperative Matrix Math : RS

Base principle of the Matrix & Graph goes beyond Accumulation of numbers..
I am reminded by Microsoft's dev post of Excel & Spreadsheet applications..

Yes they Graph/Matrix; But math solves require it! For example the Acidity/Alkaline matrix with Protons & Electrons,

However a more sophisticated form is algebra; But you have to simplify the Algebra & put that in a table..
Einstein, Schrödinger, Physics, Chemistry & DNA By connection...

Algebra is the main reason we would use Float : {bF16 <> bF32} {Single Precision <> Double Precision} SiMD,
The chief objective is the solve; Complex SiMD offer the answer of flexibility..
MUL:DIV ADD

Simple absolver table for MUL:ADD : MUL* Only = +0, +- Only = N*1 then +-
% = / 100 + ADD Table {N1 <> N...} : Result!

(c)Rupert S

Graph Accumulator Multiply ADD - Cooperative Matrix

SDK Sample : https://github.com/ROCmSoftwarePlatform/rocWMMA

VK_KHR_cooperative_matrix https://www.amd.com/en/support/kb/release-notes/rn-rad-win-vulkan

https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_KHR_cooperative_matrix.html

https://devblogs.microsoft.com/directx/d3d12-work-graphs-preview/#Prerequisites
https://devblogs.microsoft.com/directx/agility-sdk-1-711/

https://gpuopen.com/wmma_benefits_ml_compute/

https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix-cores-readme/

https://paperswithcode.com/paper/a-survey-on-deep-learning-hardware/review/

AMD 23.Q3Pro_HIP #HPC #DirectML MatrixMathOps 'Release unto me the great! Chobokniki' Thine Prayers Answered https://is.gd/AMD23Q3PRO_HIP
Run the .reg after install; Before reboot https://is.gd/AMDRebarReg

Inference & FMA De-Block Styles

For upscaling matrix: MMX+ & SiMD
16x16 Block as used just about in HD,
8x8 Blocks Certainly NTSC, PAL, JP_NTSC!,
Very usable for deblocking JPG,
16x16 & 8x8 is very good for Inferencing active on Scaling & Deblocking..

4x4 for main Inference XBox & 8x8 for PS5..
XBox can use (4x4)x4 for 8x8 & (4x4)x16 for 16x16; Very powerful!
PS5 can use (8x8)x1 or x2 for 8x8 & (8x8)x4 (x8 for additional processing) for 16x16; Very powerful!
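The (4x4)x4 & (4x4)x16 counts above can be checked with a small tile splitter; This sketch assumes square blocks whose side is a multiple of the tile side. The name split_into_tiles is hypothetical.

```python
def split_into_tiles(block, tile):
    """Cut a square block into tile x tile sub-blocks, row-major order."""
    n = len(block)
    return [[row[c:c + tile] for row in block[r:r + tile]]
            for r in range(0, n, tile)
            for c in range(0, n, tile)]
```

An 8x8 block yields 4 tiles of 4x4 & a 16x16 block yields 16; Each tile can then go to its own inference unit.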

The table solves common issues with 4Bit & 8Bit direct loading of colour tables of the F16 Types..
16Bit is a bit more common in older hardware & luckily quite a lot more flexible!
But 8Bit & 4Bit inferencing have a number of uses...

Indirect load through F16 Register can work by sideloading the operation; With Inferencing Sub routine coding & Returns,
Processing the actual inference but losing data store & returns just information..

Sub Routine INT8 & INT4 can:
Directly manipulate a small palette; Scoped Palette,
Single channel colour or multiple operations..
Load, Store & Save

Inference & FMA De-Block Styles List

(4x4)x4
(4x4)x8
(4x4)x16 + processing
(4x4)x32 +++ processing

(8x8)x4
(8x8)x8 + processing
(8x8)x16 + processing

(16x16)x1 + processing
(16x16)x2 ++ processing
(16x16)x4 +++ processing

8:4Bit Concepts: 65535/255=8Bit 65535/16=4Bit

16bit/4bit : 4Bit colour palette, But we can fraction 16Bit/4bit in essence 16/4! 65535/16; Compression Shapes & Gradients.
Polygon, Shadow, Contact
Alpha Channel 2Bit, 4Bit
Grayscale edge define sharpening
Single Colour Edge detect
Shape Fill in Alpha 10,10,10,2
Xor, Pattern, Shading, Shader, Cull, Shape & Depth Compare after define

For when {U, X, Y, Z} = N Expressions https://is.gd/ForWhen_UXYZ_N
For when {(A+B/2)} = C Expressions https://is.gd/ForWhen_ABx2_C

(c)RS

An example use of FMA Cooperative Matrix

In the example we use a formula like (U/X²)+(U/Y²)+(U/Z²)
Firstly the x²,y²,z² are MUL, So we need a * table or maybe with FMA we can use a (MUL)+0 ?
My primary observation is that we can use 2 methods:

MUL (U/X²), (U/Y²), (U/Z²) in tables, I suggest 3 * or FMA (MUL)+0
Or we can perform tables in order but complete all the MUL operations in Sync & then ADD with FMA,
Sync : (U/X²)+(U/Y²)+(U/Z²) to (Un/X²)+(Un/Y²)+(Un/Z²)

F1 = First Operation F2 = Second operation R = Result {R1:R3 = R4}

F1
R1=(U/X²) R2=(U/Y²) R3=(U/Z²)
F2
R1=+ R2=+ R3 = R4

So we have an example where MUL & then ADD is usable; But we could use Synced FMA
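The two methods above can be sketched side by side; f1_terms & f2_sum follow the F1 then F2 order, while solve_batch precomputes the reciprocals so that every step has the multiply-add shape. Plain Python does not fuse the multiply & add, & the function names are hypothetical.

```python
def f1_terms(u, x, y, z):
    """F1: the three quotient terms R1, R2, R3."""
    return (u / (x * x), u / (y * y), u / (z * z))

def f2_sum(r1, r2, r3):
    """F2: add the three results to give R4."""
    return r1 + r2 + r3

def solve_batch(us, x, y, z):
    """Synced form: reciprocals cached once, then u * k + r for each term."""
    inv = (1.0 / (x * x), 1.0 / (y * y), 1.0 / (z * z))
    out = []
    for u in us:
        r = 0.0
        for k in inv:
            r = u * k + r      # multiply-add shape; one FMA per term on hardware
        out.append(r)
    return out
```

Caching 1/X², 1/Y² & 1/Z² removes the divisions from the inner loop; That only pays off when X, Y & Z stay fixed across many U values.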

For when {U, X, Y, Z} = N Expressions https://is.gd/ForWhen_UXYZ_N

RS

Brilliant examples of matrix maths
https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-finite-difference-docs-laplacian_part1/

VXEdDSA & XEdDSA & X25519 & X448
https://signal.org/docs/specifications/xeddsa/

SiMD-Matrix Maths example - Wave retrieval from quad-polarized Chinese Gaofen-3 SAR image using an improved tilt modulation transfer function
https://www.tandfonline.com/doi/full/10.1080/10095020.2023.2239849?src=
https://drive.google.com/file/d/1uN047PvBJhFkcdNJKqx6cBZ9vnAxcjPj/view?usp=drive_link

SiMD-Matrix Maths example D-Waves

https://drive.google.com/file/d/15iPy-Z24GsbcUdEycOfS1819Fdf0sWoE/view?usp=drive_link

*****

High speed Per operation Cycle operations of D R² Pi

An (A[diameter] * B²[Pi]) : D * R² operation is 2 Cycles; This specialised Arc, Sin, Tan operation can be accomplished a couple of ways in a single cycle,

Options table : D R² Pi

Firstly by sideways memory load in lower Single Precision to double precision output in a SiMD

You need to pre cache R²
You can use the same value for R or for D &or both
You can pre cache all static D &or R, So you can vary either D or R & single cycle
You need to perform 2 operations , Diameter & R² & obviously they are relational!

For examples:

R = Atom Zinc (standard size!) Cache D R
You move a compass but the needle is the same size! Cache D
You draw faces but the width is the same, Cache D
You draw faces but the Shape is the same but size is not! Cache R

Rupert S

**********

How you use FMA, Basic MUL+ADD examples first & then Mul & ADD

Firstly in video,
MUL a float set A * B + C
Video Upscaling basic A:Pixel * B:PixelDiffRightPixel + C:RightPixel,
Do that 16 Times per pixel pair and you have 16*Interpolate, So a 16* Data set Wave!
You could obviously use a 32* Wave SiMD & do 4x8; So 4 Pixel groups per Wave.
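One reading of the A * B + C upscaling step, as a minimal sketch; Here A is the step fraction, B is the difference to the right pixel & C is the left pixel, which is an assumption about the intended roles. The names fma & interpolate_wave are hypothetical.

```python
def fma(a, b, c):
    """a * b + c; stands in for one hardware FMA lane."""
    return a * b + c

def interpolate_wave(left, right, steps=16):
    """One wave: 'steps' interpolated values from left towards right."""
    diff = right - left
    return [fma(k / steps, diff, left) for k in range(steps)]
```

A 32 wide wave could run interpolate_wave with steps=8 on 4 pixel pairs at once; Each list entry corresponds to one lane.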

So for example you can ADD Log Gamma or other simple values, In A * B + C,
Pixel Values or whatever, You can use Point float 0.001 for example to do division on floats.

For all personal maths that you imagine:
Simple absolver table for MUL:ADD : MUL* Only = +0, +- Only = N*1 then +-
% = / 100 + ADD Table {N1 <> N...} : Result!

Interpolation & smoothing :

The method I am thinking of is ADD Mul/Div : Edge Left A+B Edge Right = C Center, (A to C)<>(C to A)

(A+B)/2 = C

Factor A_to_C
16 Steps

Factor C_to_B
16 Steps

*alternatives*

((A-C)/16)=F | (F* A over C)=F Step * 16 over Time or distance

(Call slope)
find 16 Fractions of A To C
find 16 Fractions of C to B

For when {(A+B/2)} = C Expressions https://is.gd/ForWhen_ABx2_C

RS

Pixel A to B, Interpolation upscaling

from A1 to B16 ADD Difference of A - B

Red A1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10 : 11 : 12: 13 : 14 : 15 : 16B
Green A1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10 : 11 : 12: 13 : 14 : 15 : 16B
Blue A1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10 : 11 : 12: 13 : 14 : 15 : 16B

Tables can be 16 Wide & 16 Long to advantage ourselves of Byte aligned F16

Pixel A to B, Interpolation upscaling

AAA
ABA
AAA

Example

R,G,B Value of A
R,G,B Value of B
RCv = Value per pixel of 16

Which is higher RA or RB
if RA
RA - RB = RC
If RB
RB - RA = RC

RB{1 to 16} repeat +- RCv
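The pseudo code above could be written out like this; A minimal integer sketch that finds the difference RC per channel, then steps from A towards B in 16 parts. The function names are hypothetical & the floor rounding of each step is an assumption.

```python
def channel_steps(a, b, steps=16):
    """16 integer values from channel value a towards b, ending exactly on b."""
    rc = abs(a - b)                     # RC: the larger minus the smaller
    direction = 1 if b >= a else -1     # step up or down towards b
    return [a + direction * ((rc * k) // steps) for k in range(1, steps + 1)]

def interpolate_rgb(pixel_a, pixel_b, steps=16):
    """Apply the same stepping to R, G & B; returns a list of (r, g, b) tuples."""
    channels = [channel_steps(a, b, steps) for a, b in zip(pixel_a, pixel_b)]
    return list(zip(*channels))
```

Multiplying before dividing keeps the error below one level per step; So the last value always lands on B.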

Sorry about the coding RS

Rupert S

FMA AVX Performance table: 2Flops per Cycle per FMA Unit
Architecture Fast Instructions for FMA

Reference Tables https://www.uio.no/studier/emner/matnat/ifi/IN3200/v19/teaching-material/avx512.pdf

Operators in C
● Arithmetic
a + b, a - b, a*b, a/b, a%b
● Bitwise
a | b, a & b, a ^ b, ~a
● Bit shift
a << b, a >> b (signed), a >> b (unsigned)
● Logical operators
a && b, a || b, !a
● Comparison operators
a == b, a != b, a < b, a <= b, a > b, a >= b
● Ternary operator
x = a ? b : c
● Special functions:
sqrt(x), abs(x), fma(a,b,c), ceil(x), floor(x)

Fast division for constant divisors

Calculate r = a/b where b is a constant
With floating point we precompute (at compile time
or outside of the main loop) the inverse ib = 1.0/b.
r = ib*a
Floating point division with constant divisors
becomes multiplication
With integers the inverse is more complicated
ib,n = get_magic_numbers(b);
r = ib*a >> n

Integer division with constant divisors becomes
multiplication and a bit-shift

Fast Division Examples
● x/3 = x*1431655766/2^32
27*1431655766/2^32 = 9
● x/1000 = x*274877907/2^38
10000*274877907/2^38 = 10
● x/314159 = x*895963435/2^48
7*314159*895963435/2^48 = 7

Dividing integers by a power of two can be done with a bit shift which is very fast.
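The magic number examples above can be reproduced with a few lines; This sketch takes the shift n as an explicit argument (the slide's get_magic_numbers(b) picks n itself, so the two argument form here is an assumption) & uses ib = ceil(2^n / b).

```python
def get_magic_numbers(b, n):
    """Return (ib, n) where ib = ceil(2**n / b) for a constant divisor b."""
    return (-(-(1 << n) // b), n)

def fast_div(a, ib, n):
    """Integer division by the constant: one multiply & one bit-shift."""
    return (a * ib) >> n
```

The result matches a // b only while a stays below a bound that depends on b & n; A larger n widens that range at the cost of a wider multiply.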

RS

https://en.wikipedia.org/wiki/FMA_instruction_set
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
https://en.wikipedia.org/wiki/AArch64#Scalable_Vector_Extension_(SVE)

High-Performance Elliptic Curve Cryptography: A SIMD Approach to Modern Curves
https://www.lasca.ic.unicamp.br/media/publications/FazHernandez_Armando_D.pdf
https://science.n-helix.com/2023/06/map.html
https://science.n-helix.com/2022/04/vecsr.html

https://gpuopen.com/learn/matrix-compendium/matrix-compendium-intro/

Triangle 3D Matrix graphs

_____b

Vector table for audio & video or graphics..

We will use integers for the 3D audio presentation & SiMD fpu for MP4 & AC4 & Alac decompression..

So we will be using a form of float unit called..

RollINT

We are using roll to roll a zero on or off an integer,

Therefore we are able to divide and multiply and add so that..

101-0 > 10.1+0 ; The number of rolled zeros can range practically from 0 to 00000000.

So 10023-000 > 10.023+000

We can then store floating point numbers in integer.

Reference Int & FP Value Sizes; A reminder that Floats are 50% of highest Integer Value,
ROLLInt floats still have an amazing additional value!

https://learn.microsoft.com/en-us/dotnet/standard/numerics

ECC elliptic curves & Gradients : RS

Leveraging FMA fused MUL ADD on Internet & Software ...

For examples:

Gradients vector compression..

Colour A to colour B

Compare dif {A:B}

Transform A over steps B

Same colour ranges {R,G,B}

(A - B) = Dif

Shift B over steps = A

Store Vec VTable = steps

VTable:

Steps S1 to Sn

Colour B1 to Bn + S1 to Sn

S1,Sn

B1,Bn

Same with time & dimensions in the ECC elliptic curve..

S=T*D

Vector= {B1,Bn}

(T*D)+Bn

VTable:

Steps S1 to Sn

Colour B1 to Bn + S1 to Sn

S1,Sn

B1,Bn

Rupert S

Einstein : Quad:20x30 Matrix table

With Einstein Formula being around 20 operations wide, 30 Lines long..
Single Operation Formula Matrix Tables could be popular,

Consequently matrix math : MTU/MAP processor features should be popular...

I take the view that 8 x 30 is about manageable on the Epyc & M2..
Bearing in mind that a 32 Wide x 32 Long Operations SiMD is achievable...

An AVX512 SiMD could run Quad operations (128Bit AVX) x 4,
So 20/4 = 5x; So 6x AVX512(128Bit Operation); Now there is; I believe; 1 AVX core per 2 Core Groups!

So 24 Core has 8x or 4x or 2x (8 or 4 Cores per die unit)!
So 84 Core units should have enough AVX512?

But one Mac M2... :D

In our case Einstein, the table is 20 Wide & 35 Long (roughly)

So : Einstein = Quad:20x35 | Alternative Quad:8x16, More manageable in
SiMD Parallel Executions; Quad:8x16 x 3, ....

One presumes strict aligned multiple multiplication

4x4 Tables are still of utility for Science maths; But we need to get the point across what we need for Einstein!

The Subject of 4x4 tables,

We are obviously looking for more like 16x16 for Physics maths!
The matrix processor is a large data set; Divisible into 4x2 & 4x4 &
8x8 groups for execution speedups,
Aligned Parallel processing....

Aligned Matrix tables need to be larger than 4x4 for Physics &
Chemistry; So a matrix processor ideally can at a minimum:

Matrix Table

x1
16x16

16/2
x2
8x8,8x8
8x8,8x8

8/4
x4
4x4,4x4
4x4,4x4
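The 16x16 to 8x8 to 4x4 division can be sketched as a blocked matrix multiply; The tile size sets how much work each group takes & the result matches the plain multiply. This assumes square matrices whose side is a multiple of the tile; The function names are hypothetical.

```python
def matmul(a, b):
    """Plain reference multiply for square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def blocked_matmul(a, b, tile):
    """Same product, computed tile by tile (tile x tile sub-blocks)."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # One tile-sized unit of work; these are the groups that
                # could be handed to separate SiMD or matrix units.
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        acc = c[i][j]
                        for k in range(k0, k0 + tile):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

With tile=8 a 16x16 product breaks into 8 tile products; With tile=4 it breaks into 64 smaller ones.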

RS

*

Triangle 3D Matrix graphs : a+b+c : Rotational algebra : ax+by+c=0 | e1, e2, e3

https://www.icalculator.com/matrix-calculators.html
https://academic-accelerator.com/encyclopedia/quaternions-and-spatial-rotation
https://stackoverflow.com/questions/tagged/matrix

https://gpuopen.com/learn/matrix-compendium/matrix-compendium-intro/

https://marctenbosch.com/quaternions/
https://arxiv.org/abs/1101.4542

Quaternions > PGA Geometric : a+b+c : Rotational algebra : ax+by+c=0 | e1, e2, e3
https://www.youtube.com/watch?v=0i3ocLhbxJ4
https://www.youtube.com/watch?v=Idlv83CxP-8

Improving Structured Grid-Based Sparse Matrix-Vector Multiplication and Gauss–Seidel Iteration on GPDSP
https://www.mdpi.com/2076-3417/13/15/8952

SiMD Matrix Maths - Performance Portable SIMD Approach - Implementing Block Line Solver For Coupled PDEs
https://www.osti.gov/servlets/purl/1602621

SiMD Matrix Maths - Operations Details HIP AMD
https://rocm.docs.amd.com/_/downloads/en/latest/pdf/

SiMD double tables, M1 Matrix
https://developer.apple.com/documentation/accelerate/working_with_matrices

https://en.wikipedia.org/wiki/Advanced_Vector_Extensions

FMA AVX Performance table: 2Flops per Cycle per FMA Unit
Architecture Fast Instructions for FMA
https://www.uio.no/studier/emner/matnat/ifi/IN3200/v19/teaching-material/avx512.pdf

#RIP (Intro interesting!) Optimizing massively parallel sparse matrix computing on ARM many-core processor
https://www.sciencedirect.com/science/article/abs/pii/S0167819123000418

https://www.gamedeveloper.com/programming/implementing-a-3d-simd-geometry-and-lighting-pipeline
https://developer.apple.com/documentation/accelerate/working_with_matrices

CGAL is a geometry & Matrix Math library for C++; Luckily OpenBLAS is a compatible library & AMD Makes a version in HIP
https://cpp.libhunt.com/cgal-alternatives

Matrix Libs : L1 means compatible with CGAL, A+ means i rate them highly on science community use : RS

CGAL (L1)
GLM (L1)
QuantLib (L1)
Ceres-Solver (L1)

OpenBLAS (A+)
Eigen (A+)
MiraCL (A+)

Github 3D Matrix AVX with alternatives
https://swiftpackageindex.com/fireblade-engine/math

https://github.com/ToruNiina/mave
https://github.com/fireblade-engine/math.git

C++ Matrix Maths

MRPT is Camera & FFMPeg complex install
https://docs.mrpt.org/reference/latest/compiling.html

C++ Matrix Maths : Simple
https://sourceforge.net/projects/arma/

C++ conversions between Numpy arrays and Armadillo matrices; Converts Into Numpy Py not out (needs work)
https://github.com/RUrlus/carma

https://sourceforge.net/software/product/NumPy/
https://sourceforge.net/software/product/NumPy/integrations/

Motivated applications of 3D Matrix Database ML

https://science.n-helix.com/2022/10/ml.html

Matrix-Blas_Libs-Compile
https://is.gd/HPC_HIP_CUDA

Just shows how fast Blas & these NumPy & Arma & Mave are! 1998-man SigRS
Parallel matrix multiplication & diagonalization
https://www-users.york.ac.uk/~mijp1/teaching/grad_HPC_for_MatSci/Lecture4.pdf

Wasm Inefficiency
https://news.ycombinator.com/item?id=37387629

3D Matrix Web Codecs

Are presented as being JIT Compiler re-encoded when required; Frequently WebASM, WebGPU Code, JS...
Audio, Video, Sensation, Code Runtimes.

Web Codecs for devices are a modern concept & are available for common websites such as news & music,
devices such as Alexa Echo & Google Dot & Bluetooth Devices?

Media players & BT devices particularly suffer from small Storage potential!
So Web Codecs downloaded to the device from a source; Such as a smart phone or computer..
Are a clear-minded solution!

JIT Compiler

3D Matrix Tables in FMA, Mul & ADD code to be automatically recompiled locally when required!
Directed to a common API, Direct Compute, WebGPU, WebASM, Jit Compiler OpenCL

Many Operations can be done from unique device specific optimisation; Examples:

API, DirectX & OpenCL & Vulkan & WebGPU & WebASM
Texture & Audio Shaders.
Digital Streaming

Bluetooth NANO SiMD & API
Digital TV in H266, VP9 & AV1,

Locally compiled accelerators should be respected first; Such as the output & input 3D Matrix & CPU & GPU Acceleration engine..

Code can include Matrix converters into common output format such as WebP & Textures & BC, DXT Compression presentation; Vulkan, OpenCL & DirectX & Texture & Audio Shaders.

Java, JS & WebASM are examples with operator mechanisms & JIT Compiler optimisation..
Minimising storage requirements for good compatibility while maximising performance.

RS

Requirements:

https://science.n-helix.com/2022/08/jit-dongle.html
https://science.n-helix.com/2022/06/jit-compiler.html

https://science.n-helix.com/2023/02/smart-compression.html
https://science.n-helix.com/2022/10/ml.html
https://science.n-helix.com/2023/06/map.html

*

TPU & SiMD Parallel wavetables Pre-Calculation Meta-Data : RS

{ For data expansion & Precomputed Upscaling through meta data per frame sequence }
#MetaDATA #PreProcessing Parallel Text loading and machine learning processing : RS 23/07/2023

Pre-calculation table; For Example the Amiga uses tables for maths!
Pi, Common conversion maths & float results in higher precision...

Parallel Text loading and machine learning processing is one of the wonders of TPU & SiMD Parallel wavetables,

Pre Calculate Tables that reduce a workload to a simple process, and use them...

For example if you Upscale a movie & use dynamic settings, Such as:
Localised Sharpening & Selective Gaussian filtering; Such as GIMP's Edge detection Gaussian?
We compress information on the maths of selection..

The edges we selected, The methods we used & if those methods are dynamic then our selections...
Such a method is called a ..

Pre-calculation table; For Example the Amiga uses tables for maths!
Pi, Common conversion maths & float results in higher precision...

Common ones are learned at school
the log tables
Multiplication Tables
Common values such as gravity & Pi

Pre Computation
Upscaling
3D Audio basic resonance profile
Pre Computed values for a realistic world...
Experience & Learning to pre compute values...
This saves effort later in the process
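
The pre-calculation idea can be sketched with a sine table: the values are computed once, and every later call is a lookup rather than a fresh calculation. The table size of 256 and the names are assumptions chosen for illustration.

```python
import math

TABLE_SIZE = 256  # hypothetical table resolution
SINE_TABLE = [math.sin(2 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]

def table_sin(angle):
    """Approximate sin(angle) by nearest-entry lookup in the precomputed table."""
    index = round(angle / (2 * math.pi) * TABLE_SIZE) % TABLE_SIZE
    return SINE_TABLE[index]

# Worst error over one full turn, sampled every 0.01 radians.
worst = max(abs(table_sin(a / 100) - math.sin(a / 100)) for a in range(0, 629))
print(worst)
```

A larger table, or interpolation between neighbouring entries, trades memory for precision.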

This is available to providers & game developers for:

TV Upscaling through Compressed Numeric Add table downloading
All streaming services processing such as netflix, youtube & amazon prime!
Partial pre-computed upscaling for game, application & processing..

Through TopCloud & HPC Pack

Data Stored as meta-data and saves on repeat processing time!
By creatively Pre Computing processes such as 3D Audio, VR Audio, Haptic 3D Maths..
Work such as Decompression & Compiling

Affects the efficiency of any process that will Pre Calculate Tables that reduce a workload to a group of simple processes.

We can majorly improve quality of both visuals & Audio; Any Pre-Calculatable element

The logic is that Upscaling, Colour enhancements & sharpening have pre-calculatable logic,
We can save many seconds of processing per frame,
We can reduce energy footprint
We can improve latency & frame rate
Works for games also,
Education media or Theaters & mass media content such as News & commonly watched content or movies or visited websites or fonts & media

We can improve at a very minimum, Cutscenes & non motional backdrops & tangible Animation repeating assets & Effects...

(c)Rupert S

FMA : Fused Multiply ADD : MUL+ADD & Precision functions

You may be assuming that only modern GPUs such as RTX 2080+ & RX 5700+ have this?

FMA is a feature of the business editions & FX Series on AMD & exists in granite ridge & other Intel,
So FMA F16 is possible with the F32 : F16 conversion features present in for example FX8320E...

So what does this mean? In terms of:

Chrome, which emulates a lot of its GPU functions in CPU..
In terms of Python ML that F16 feature combined with FMA is very helpful in learning & efficiency!

In terms of CPU; mostly using 32Bit, F32, 64Bit, F64 is very helpful; in terms of SiMD,
F16 exists though; Even on the yee FX8320E!

So we can use potentially: Int8, Int32, Int64, F64, F32, F16 & Float 182Bit as with FPU!
Best to do DEEP work with the CPU FPU & SiMD...
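
A minimal sketch of the F32 : F16 conversion idea, using Python's `struct` module to round values to half and single precision. Inputs are stored as F16 and each multiply-add is accumulated in F32. The function names are hypothetical, and the loop only approximates the single rounding of a hardware FMA.

```python
import struct

def to_f16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_f32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot_f16_inputs_f32_accumulate(a, b):
    """Inputs are stored as F16; each multiply-add is accumulated in F32."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f16(x) * to_f16(y))
    return acc

weights = [0.1, 0.2, 0.3]
inputs = [1.0, 2.0, 3.0]
result = dot_f16_inputs_f32_accumulate(weights, inputs)
print(result)  # close to 1.4; the difference comes from F16 storage of the weights
```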

We do have these functions though!, But Deep work FPU 182Bit? CPU! Some GPU have double precision also!

What do we use this variety for? Many things!

Defined by our precision requirements; not all things are INT64 & FPU But not every issue is covered by..
The MP4v, MP4a F16! AC3 & AC4 for example F32; A glass? FPU 182... or many F32 or even more F16 work units.

Rupert S

Exponent factorisation : RS

8Bit, 16Bit, 32Bit, 64Bit Exponent theory.
Available to you-(EF)

A value in 8Bit is no use in a 16 Bit operation... or is it?

Firstly 8 Bit values can be loaded with Zeros into higher math precisions,
In normal maths we use a remainder; So we can load 8Bit values into 32Bit Int & that works...

2 F16 blocks would be 32Bit; As 2 16Bit Blocks? So what use is this ?
in a 64Bit & 32Bit processor storage of FPU-182Bit values is possible ...
32Bit Blocks * 6 with XOR 00
64Bit Blocks * 3 with XOR 00
2 * Largest value...

But parallelising F64 on groups for 182Bit? with multiplications roll left <> Right .. & Additions +- ...
Possible.
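
The block storage described above can be sketched as splitting one wide integer into fixed-size blocks and joining them again. Note that 6 * 32 and 3 * 64 both give 192 bits of storage, so a 182Bit value fits with bits to spare. The helper names and the sample value are assumptions for illustration.

```python
def split_blocks(value, block_bits, count):
    """Split a non-negative integer into `count` blocks, least significant first."""
    if value < 0 or value >> (block_bits * count):
        raise ValueError("value does not fit in the requested blocks")
    mask = (1 << block_bits) - 1
    return [(value >> (block_bits * i)) & mask for i in range(count)]

def join_blocks(blocks, block_bits):
    """Rebuild the integer from its blocks."""
    return sum(block << (block_bits * i) for i, block in enumerate(blocks))

wide = (1 << 181) + 12345          # hypothetical value needing 182 bits
as_32 = split_blocks(wide, 32, 6)  # 6 x 32-bit blocks
as_64 = split_blocks(wide, 64, 3)  # 3 x 64-bit blocks

# An 8-bit value loaded into 32 bits is the same value with zeros above it.
padded = split_blocks(200, 8, 4)
print(as_32, as_64, padded)
```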

But if the resultant is beyond 8Bit ? & we wanted to save as 8Bit?

Factorisation of a 32Bit value into 8Bit is possible; But we need to factor it!
Well:

32Bit to 8Bit is 6:1, So we have to random roll 6 Bits for every 1
We can factor in HighLow with 1 bit or use 8Bit factor 256 & 8Bit Number...

We can Multiply, Add, Subtract or divide or fraction:

256(*/-)1>256, leaving us with a 32Bit value? Well what can we use this for ?

Example complex : N/(240*50); See the maths can roll into 16Bit values..
We can use them, Or load a particular object, Classifier, HASH, AES, ECC...
We can quickly classify as 16Bit resultant & still save as a particular 8Bit value!
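
One possible reading of "8Bit factor 256 & 8Bit Number" is a split by 256: the factor counts whole multiples of 256 and the number holds what is left over. The sketch below assumes that reading; the function names and sample values are hypothetical. It also shows N/(240*50) reducing a larger value to one that fits in 8 bits.

```python
def factor_256(value):
    """Split a 16-bit value into an 8-bit factor (multiples of 256) and an 8-bit number."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("expected a 16-bit value")
    return divmod(value, 256)

def unfactor_256(factor, number):
    """Rebuild the 16-bit value from the two 8-bit parts."""
    return factor * 256 + number

n = 1_000_000
resultant = n // (240 * 50)   # 83: small enough to save as a single 8-bit value
factor, number = factor_256(51234)
print(resultant, factor, number)
```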

Images
Gains
Memories
Load file
load value
Random
Table Value
Compression!

(c)Rupert S,

Reference Int & FP Value Sizes; A reminder that Floats are 50% of highest Integer Value,

ROLLInt floats still have an amazing additional value!

https://learn.microsoft.com/en-us/dotnet/standard/numerics

https://science.n-helix.com/2023/02/smart-compression.html

F16b Adaptive Float value : Texture Color Palette Example : RS

Basic Example of F16b float in action on a colour palette: {F16b,F32b, F64b}

F16b is short remainder F16 & it has 8 Bits of 0.01 point value rather than 16,
So what do we mean ? What is significant about this?

F16b Has 24Bit precision integer with an 8 bit remainder!
So? So 16Bit + 8Bit = 24Bit! & 8bit point value...

In colour representation point values contribute to subtle blending;
So a full 24Bit contributes to 90% of the Color Palettes

So the 24Bit colour palette is 32Bit Colour Minus Alpha;
We can use F16b in HDMI & DisplayPort & inside the GPU & Also for textures & JPG'S..
Thereby i present F16b & F24Bit colours in F16b

This saves all data in single 32bit Spaces & therefore is both faster & higher resolution than comparable float value presentations.
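
A minimal sketch, assuming F16b is read as a 32-bit word holding 24 integer bits and 8 bits of point value (a 24.8 fixed-point layout). This is one interpretation of the description above, not a defined standard, and the function names are hypothetical.

```python
FRACTION_BITS = 8
SCALE = 1 << FRACTION_BITS          # 256 steps per unit
MAX_RAW = (1 << 32) - 1             # everything must fit one 32-bit space

def f16b_encode(value):
    """Pack a non-negative number as 24 integer bits + 8 fraction bits."""
    raw = round(value * SCALE)
    if not 0 <= raw <= MAX_RAW:
        raise ValueError("value out of range for 24.8 format")
    return raw

def f16b_decode(raw):
    """Unpack a 32-bit 24.8 word back into a float."""
    return raw / SCALE

red = f16b_encode(200.5)            # a channel value with a half-step of blend
print(red, f16b_decode(red))
```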

Bound to make a big difference to Blu-ray, but particularly DVD & AC3 & AC4;
F16b Adaptive Float value : Texture Color Palettes Example;

(you can use F16b * R,G,B,A) in HDMI & DisplayPort, Massive colour improvements; Lower RAM Costs

Rupert S

AnPa_Wave - Analogue Pattern Wave Vector SiMD Unit : (c)RS

The base symphony is harmony, In other words waveforms; There are a couple of Simple methods that really work:

High performance Float values F16, F32, F64, FPU

Q-Bit Quantum; All forms of Quantum wave work
Radio waves;
Light patterns
Photon wave patterns; single & multiple
Sound hardware; 1 to 3 Bit DAC; Audio conversions; Sample range
Analogue chips that work on harmony & frequency
SVM Elliptic curve maths
Sin, Arc, Tan, Time, Vector

In essence Harmony & frequency is the equivalent of Complex Elliptic curve maths

A Music note score suffices to specify harmony basics:

Waveform shape in 3D
Harmony / Disharmony
Vibration High / Vibration Low
Power High / Power Low
Volts High / Volts Low
Watts High / Watts Low

(c)Rupert S

https://science.n-helix.com/2023/07/3dchiplet.html

Wonderful Wave-Pattern Analogue waveforms in meta materials - Pattern recognition in reciprocal space with a magnon-scattering reservoir
https://www.nature.com/articles/s41467-023-39452-y.pdf

*

Vectors & maths
https://science.n-helix.com/2022/08/simd.html
https://science.n-helix.com/2022/04/vecsr.html
https://science.n-helix.com/2016/04/3d-desktop-virtualization.html
https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html

https://science.n-helix.com/2023/02/smart-compression.html

Networking & Management
https://science.n-helix.com/2023/06/tops.html
https://science.n-helix.com/2023/06/ptp.html
https://science.n-helix.com/2023/06/map.html

https://science.n-helix.com/2023/02/pm-qos.html
https://science.n-helix.com/2022/08/jit-dongle.html
https://science.n-helix.com/2022/06/jit-compiler.html

https://science.n-helix.com/2022/03/ice-ssrtp.html
https://science.n-helix.com/2022/01/ntp.html

Faster Maths & ML
https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html
https://science.n-helix.com/2021/02/multi-operation-maths.html
https://science.n-helix.com/2021/11/parallel-execution.html
https://science.n-helix.com/2022/12/math-error-solve.html
https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
https://science.n-helix.com/2022/10/ml.html

Focus on Quality
https://science.n-helix.com/2022/09/ovccans.html
https://science.n-helix.com/2022/11/frame-expand-gen-3.html
https://science.n-helix.com/2022/03/fsr-focal-length.html

For when {U, X, Y, Z} = N Expressions https://is.gd/ForWhen_UXYZ_N
For when {(A+B/2)} = C Expressions https://is.gd/ForWhen_ABx2_C

Hallelujah RS Light-Wave SiMD https://www.allaboutcircuits.com/news/lightelligence-reports-worlds-first-optical-network-on-chip-processor/

RS Spectre Mitigations https://science.n-helix.com/2018/01/microprocessor-bug-meltdown.html
ZenBleed Parallel Solvent RS 2023 https://science.n-helix.com/2023/07/zenbleed.html

Core/CPU/GPU security core SSL/TLS BugFix
https://science.n-helix.com/2020/06/cryptoseed.html
https://science.n-helix.com/2019/05/zombie-load.html

Secure Configuration:
https://is.gd/SSL_NetSecurity_NTP_PTP
https://is.gd/EthernetTunnelOpt
https://is.gd/SSL_Optimise

PTP & NTP Improve security WW https://is.gd/PTP_TimeStream

*****

Running Code

https://is.gd/UpscaleWinDL

https://is.gd/HPC_HIP_CUDA

PoCL Source & Code
https://is.gd/LEDSource

PoCL-Direct
https://is.gd/PoCL_Source

X86Features-Emu

https://drive.google.com/file/d/15vXBPLaU9W4ul7lmHZsw1dwVPe3lo-jK/view?usp=usp=sharing

https://www.amd.com/en/developer/rocm-hub/hip-sdk.html#tabs-ddafbba141-item-c6b9ce2aab-tab
https://rocm.docs.amd.com/en/docs-5.5.1/deploy/windows/quick_start.html

AMD 23.Q3Pro_HIP #HPC #DirectML MatrixMathOps 'Release unto me the great! Chobokniki' Thine Prayers Answered https://is.gd/AMD23Q3PRO_HIP
Run the .reg after install; Before reboot https://is.gd/AMDRebarReg

**********

https://en.wikipedia.org/wiki/Cell_(processor)

https://www.khronos.org/news/permalink/ibm-releases-opencl-drivers-for-power6-and-cell-b.e/

Not Accessible
https://www.alphaworks.ibm.com/tech/opencl
**********

AI: Artificial Intelligence
ML: Machine Learning
PULP: Parallel Ultra Low Power

Maths Operations

FMA: Fused Multiply-Add
GEMM: General Matrix Multiply
SIMD: Single Instruction Multiple Data
SIMT: Single Instruction Multiple Thread

SP: Single Precision
DP: Double Precision
FLOPS: Floating Point Operations per Second

Processor Types & RAM

ASIC: Application Specific Integrated Circuit

SoC: System on Chip
PCU: Programmable Computing Unit
NoC: Network on Chip

CPU: Central Processing Unit
VPU: Vector Processing Unit
NPU: Neural Processing Unit
TPU: Tensor Processing Unit
FPGA: Field-Programmable Gate Array

RISC: Reduced Instruction Set Computer
CISC: Complex Instruction Set Computer

NDP: Near Data Processing

PIM: Processing In-Memory
IMC: In-Memory Computing

SRAM: Static Random Access Memory
VRAM: Video Random Access Memory
DRAM: Dynamic Random Access Memory
PCM: Phase Change Memory
BRAM: Block Random Access Memory
RAM: Random Access Memory
RRAM: Resistive RAM

*****

Matrix Array Processor Unit (c)RS

[M.A.P] [=====] [H.P.C] - Matrix Array Processor Unit (c)RS

This document describes the design and implementation of a novel computing device called the Matrix Array Processor Unit (M.A.P.U).

The M.A.P.U is a novel co-processor that can perform high-speed parallel operations on multi-dimensional arrays of data, such as those used in quantum computing, machine learning, and computer graphics.

It performs these high-performance computing tasks using quantum-inspired principles.

The Matrix Array Processor is a type of processor that is designed to handle multi-directional and multi-dimensional arrays per Qbit.

It is used in quantum computers and relies on percentage-based 3D processing to handle all 3D array processing.

The central tasks map to probability over networks and MAP units in arrays.

The M.A.P is composed of multiple interconnected units that can process multi-dimensional arrays in parallel, using a percentage-based 3D processing scheme.

The M.A.P can be integrated with existing CPU, GPU and DPU architectures, as well as with other M.A.P units, to form a scalable and flexible computing platform.

The Matrix Array Processor differs from other processor classes such as:

SIMD (Single Instruction Multiple Data),
SISD (Single Instruction Single Data),
MISD (Multiple Instruction Single Data),
MIMD (Multiple Instruction Multiple Data),
Vector processors,
Systolic Arrays,

In that the Matrix Array Processor is designed to handle multi-directional and multi-dimensional arrays per Qbit...

While other processors are designed to operate efficiently and effectively on large one-dimensional arrays of data called vectors.

The M.A.P.U consists of three main components:

The Matrix Array Processor (M.A.P),
The High Precision Central Core (H.P.C),
The Bus Connections and Networking (=====).

Core Definitions 3D M.A.P:

[H.P.C]:

A high-precision central core that can handle complex tasks such as probability mapping, network routing and memory management.

The H.P.C is the central controller of the M.A.P.U.

It coordinates the execution of tasks across the M.A.P units, assigns probabilities to different outcomes, and handles complex calculations that require high precision or accuracy.

Each [H.P.C] unit can connect to 8 [M.A.P] units and optionally to other [H.P.C] units in different layers of the 3D matrix.

The [H.P.C] can also communicate with external devices such as CPUs, GPUs, DPUs, or networks via the bottom layer of the wafer.

[M.A.P]:

The M.A.P is a specialized processing unit that can execute multiple arithmetic and logical operations on a single array element in one clock cycle.

A unit that can perform arithmetic operations on multi-dimensional arrays using a dot product-like algorithm.

Each M.A.P has 8-way interconnects to communicate with neighbouring M.A.P units, in the same layer or adjacent layers, and with a central [H.P.C] unit.

The M.A.P can also access local cache or RAM for storing intermediate results or constants.
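
The dot product-like scheme can be sketched as one controller dealing slices of two arrays to 8 units, each unit doing multiply-accumulate on its slice, and the controller adding the partial sums. The function names and the interleaved slicing are assumptions made for illustration, not the M.A.P design itself.

```python
def map_unit(a_slice, b_slice):
    """One hypothetical M.A.P unit: multiply-accumulate over its slice."""
    acc = 0
    for x, y in zip(a_slice, b_slice):
        acc += x * y
    return acc

def hpc_dot(a, b, units=8):
    """Hypothetical H.P.C role: deal slices to the units, then add the partial sums."""
    partials = [map_unit(a[u::units], b[u::units]) for u in range(units)]
    return sum(partials)

a = list(range(1, 17))
b = [2] * 16
total = hpc_dot(a, b)
print(total)
```

Each unit only needs its own slice in local cache, which matches the note above about storing intermediate results close to the unit.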

[=====]:

A bus connection that enables data transfer and networking among the M.A.P units and the [H.P.C] units.

The bottom layer of the wafer contains a high-resolution bus that connects to the onboard controllers and networks and the external CPU, GPU and DPU devices.

The ===== supports different communication protocols and topologies, such as mesh, torus, or hypercube.

The ===== also provides fault tolerance and load balancing mechanisms to ensure reliable and efficient performance.

The M.A.P.U is designed to be scalable and modular.

It can be stacked in three dimensions to form a larger array of processors that can handle more complex and diverse tasks.

The M.A.P.U can also be customized for different applications by changing the size, shape, or configuration of the M.A.P units, the H.P.C cores, or the ===== network.

The following diagrams illustrate the structure and functionality of the M.A.P.U.

Top View

[M.A.P][M.A.P][M.A.P]
[M.A.P][H.P.C][M.A.P]
[M.A.P][M.A.P][M.A.P]
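
The top view can be modelled as a 3 x 3 grid in which the central [H.P.C] has 8 surrounding [M.A.P] cells. The sketch below lists the neighbours of any cell; the function name and grid encoding are hypothetical.

```python
def neighbours(row, col, rows, cols):
    """Return the up-to-8 surrounding cells of (row, col) in a rows x cols grid."""
    cells = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            if (dr, dc) != (0, 0) and 0 <= r < rows and 0 <= c < cols:
                cells.append((r, c))
    return cells

GRID = [["M.A.P", "M.A.P", "M.A.P"],
        ["M.A.P", "H.P.C", "M.A.P"],
        ["M.A.P", "M.A.P", "M.A.P"]]

around_hpc = neighbours(1, 1, 3, 3)
print(len(around_hpc), [GRID[r][c] for r, c in around_hpc])
```

Corner and edge cells have fewer in-layer neighbours, which is where links to adjacent layers or a wrap-around topology would come in.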

Side View 3D

[M.A.P][H.P.C][M.A.P]
[M.A.P][=====][M.A.P]
[=====][H.P.C][=====]
[M.A.P][=====][M.A.P]
[=====][H.P.C][=====]
[M.A.P][=====][M.A.P]
[=====][H.P.C][=====]

Each [H.P.C] Central Contains RAM & connections to the 8 [M.A.P] & Optionally to layers above & below in 3D Matrix,
Bottom of wafer contains high resolution bus to onboard controllers & networks & DPU/GPU/CPU's

Array = Matrix Array Processor Unit (c)RS

ffffffff ffffffff ffffffff
........+ ........*+ ........*
........+ ........*+ ........*
........+ ........*+ ........*

f=fp,unit
*=mul
+=add
.=Cache/Ram

Simple absolver table for MUL:ADD : MUL* Only = +0, +- Only = N*1 then +-
% = / 100 + ADD Table {N1 <> N...} : Result!
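
The table above can be read as reducing every operation to one multiply-add, a * b + c: multiply only adds 0, add or subtract only multiplies by 1 first, and a percentage is a multiply by p / 100 followed by the ADD table. A minimal sketch under that reading, with hypothetical function names; Python rounds `a * b + c` twice on floats, so this mirrors the dataflow rather than the single rounding of a hardware FMA.

```python
def mul_add(a, b, c):
    """The one operation the table reduces everything to: a * b + c."""
    return a * b + c

def mul_only(a, b):
    return mul_add(a, b, 0)        # MUL only = + 0

def add_only(n, c):
    return mul_add(n, 1, c)        # +- only = N * 1 then +-

def percent(n, p, table=()):
    """p percent of n (p / 100), plus every entry of an optional ADD table."""
    acc = mul_add(n, p / 100, 0)
    for entry in table:
        acc = add_only(acc, entry)
    return acc

print(mul_only(6, 7), add_only(5, -3), percent(200, 25, [1, 2]))
```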

The M.A.P unit can perform operations on multi-dimensional arrays using a combination of:

Floating-point units (f), Multiplication units (*), Addition units (+) and cache/ram units (.).

The M.A.P unit can support different data types such as DOT4, INT8, INT16, F16, F32 and F64.

The M.A.P co-processor is a cutting-edge technology that can enable new applications in fields such as artificial intelligence, machine learning, scientific computing and more.

(c)Rupert S

References: DOT4, INT8, INT16, F16, F32, F64 (c)Rupert S

https://is.gd/LEDSource

https://science.n-helix.com/2023/06/map.html

https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html
https://science.n-helix.com/2021/02/multi-operation-maths.html
https://science.n-helix.com/2021/11/parallel-execution.html
https://science.n-helix.com/2022/12/math-error-solve.html
https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
https://science.n-helix.com/2022/10/ml.html

Sparse matrix multiplication in SRM array
https://www.science.org/doi/10.1126/sciadv.adf7474

Error Correction Options & Mitigation
https://futurism.com/ibm-breakthrough-quantum-computing

**********

Light Processors (c)Rupert S https://science.n-helix.com

Light processors : Access to advanced : Storage Cache, Random Access RAM Cache & Processor architecture: Starting with SiMD Simple Vector Instruction Set

Complex forms are a goal, Start simple : The world will thank you!
Simple as SiMD appears there are many uses,
Considering that higher instruction sets are delayed by SiMD space & speed priorities..

Array = Matrix Array Processor Unit (c)RS

ffffffff ffffffff ffffffff
........+ ........*+ ........*
........+ ........*+ ........*
........+ ........*+ ........*

f=fp,unit
*=mul
+=add
.=Cache/Ram

Simple absolver table for MUL:ADD : MUL* Only = +0, +- Only = N*1 then +-
% = / 100 + ADD Table {N1 <> N...} : Result!

Cache is also a priority with manyfold application of simple data transfer & buffering to solid storage,
Power outage is our main concern so that we save all our work.

SSD is an obvious solution to backing up speedily,
However we do use RAM Cache for this goal..

The goal of speeding storage access up,
Light does all the work types we need:

List:
Data transit
Cache
Processing via dimensions & signal variance
RAM (Cyclic light transfer) Same principle as fibre optic cable over large distances.

(c)Rupert S https://science.n-helix.com

Quantum ! Light Compute : Reference material : RS

Yes we can solve classic problems with light computers, Light computers perform geometry & quantitative sampling (Comment by inventor) Rupert S

Light Compute : Reference material : RS
https://science.n-helix.com/2012/09/geometric-calculating-machines.html

https://science.n-helix.com/2020/03/single-photon.html

https://science.n-helix.com/2014/07/the-formula-of-geometric-volumes.html

https://science.n-helix.com/2018/07/universeal-algebra-paper.html

https://science.n-helix.com/2018/06/compression-libraries-index-prime.html

https://science.n-helix.com/2013/08/light-theory-on-creation-of-3d-image.html

https://science.n-helix.com/2018/06/uses-for-micro-laser-light-emitting.html

https://science.n-helix.com/2020/04/render.html

https://science.n-helix.com/2019/06/vulkan-stack.html

https://science.n-helix.com/2019/06/kernel.html

https://science.n-helix.com/2019/05/compiler-optimisation.html

https://science.n-helix.com/2018/09/hpc-pack-install-guide.html

https://science.n-helix.com/2020/04/cern.html

"Let's Play" Station NitroMagika_LightCaster

Let's face it, Realtek could well resource the "Original QFFT Audio device & CPU/GPU"

The mic works by calculating angle on a drum...
Light.. and timing & dispersion...
The audio works by QFFT replication of audio function..
The DAC works by quantifying as Analog digital or Metric Matrix..
The CPU/GPU by interpreting the data of logic, Space & timing...

We need to calculate; Quantum is not the necessary feature;

But it is the highlight of our:

Data storage cache.
Our Temporary RAM
Our Data transport..
Of our fusion future.

(c)Rupert S https://science.n-helix.com

"Weedbrook points out that as yet, and in contrast to Google’s Sycamore, the Chinese team’s photonic circuit is not programmable, so at this point “it cannot be used for solving practical problems”."
https://www.nature.com/articles/d41586-020-03434-7

https://scitechdaily.com/ai-boosted-by-parallel-convolutional-light-based-processors/

https://interestingengineering.com/worlds-fastest-most-powerful-neuromorphic-processor-for-ai-unveiled

Physicists in China challenge Google’s ‘quantum advantage’
Photon-based quantum computer does a calculation that ordinary computers might never be able to do.
Philip Ball

This photonic computer performed in 200 seconds a calculation that on an ordinary supercomputer would take 2.5 billion years to complete. Credit: Hansen Zhong

A team in China claims to have made the first definitive demonstration of ‘quantum advantage’ — exploiting the counter-intuitive workings of quantum mechanics to perform computations that would be prohibitively slow on classical computers.

They have used beams of laser light to perform a computation which had been mathematically proven to be practically impossible on normal computers. The team achieved within a few minutes what would take half the age of Earth on the best existing supercomputers. Contrary to Google’s first demonstration of a quantum advantage, performed last year, their version is virtually unassailable by any classical computer. The results appeared in Science on 3 December.

“We have shown that we can use photons, the fundamental unit of light, to demonstrate quantum computational power well beyond the classical counterpart,” says Jian-Wei Pan at the University of Science and Technology of China in Hefei. He adds that the calculation that they carried out — called the boson-sampling problem — is not just a convenient vehicle for demonstrating quantum advantage, but has potential practical applications in graph theory, quantum chemistry and machine learning.

“This is certainly a tour de force experiment, and an important milestone,” says physicist Ian Walmsley at Imperial College London.

Quantum advantage challenged

Teams at both academic and corporate laboratories have been vying to demonstrate quantum advantage (a term that has now largely replaced the earlier ‘quantum supremacy’).

Last year, researchers at Google’s quantum-computing laboratory in Santa Barbara, California, announced the first-ever demonstration of quantum advantage. They used their state-of-the-art Sycamore device, which has 53 quantum bits (qubits) made from superconducting circuits that are kept at ultracold temperatures.

But some quantum researchers contested the claim, on the grounds that a better classical algorithm that would outperform the quantum one could exist. And researchers at IBM claimed that its classical supercomputers could in principle already run existing algorithms to do the same calculations in 2.5 days.

To convincingly demonstrate quantum advantage, it should be unlikely that a significantly faster classical method could ever be found for the task being tested.

The Hefei team, led by Pan and Chao-Yang Lu, chose a different problem for its demonstration, called boson sampling. It was devised in 2011 by two computer scientists, Scott Aaronson and Alex Arkhipov, then at the Massachusetts Institute of Technology in Cambridge. It entails calculating the probability distribution of many bosons — a category of fundamental particle that includes photons — whose quantum waves interfere with one another in a way that essentially randomizes the position of the particles. The probability of detecting a boson at a given position can be calculated from an equation in many unknowns.

200 seconds

But the calculation in this case is a ‘#P-hard problem’, which is even harder than notoriously tricky NP-hard problems, for which the number of solutions increases exponentially with the number of variables. For many tens of bosons, Aaronson and Arkhipov showed that there’s no classical shortcut for the impossibly long calculation.

A quantum computer, however, can sidestep the brute-force calculation by simulating the quantum process directly — allowing bosons to interfere and sampling the resulting distribution. To do this, Pan and colleagues chose to use photons as their qubits. They carried out the task on a photonic quantum computer working at room temperature.

Starting from laser pulses, the researchers encoded the information in the spatial position and the polarization of particular photon states — the orientation of the photons’ electromagnetic fields. These states were then brought together to interfere with one another and generate the photon distribution that represents the output. The team used photodetectors capable of registering single photons to measure that distribution, which in effect encodes the calculations that are so hard to perform classically.

In this way, Pan and colleagues could find solutions to the boson-sampling problem in 200 seconds. They estimate these would take 2.5 billion years to calculate on China’s TaihuLight supercomputer — a quantum advantage of around 10^14.

Practical problems

“This is the first time that quantum advantage has been demonstrated using light or photonics,” says Christian Weedbrook, chief executive of quantum-computing startup Xanadu in Toronto, Canada, which is seeking to build practical quantum computers based on photonics.

Walmsley says this claim of quantum advantage is convincing. “Because [the experiment] hews very closely to the original Aaronson–Arkhipov scheme, it is unlikely that a better classical algorithm can be found,” he says.

However, Weedbrook points out that as yet, and in contrast to Google’s Sycamore, the Chinese team’s photonic circuit is not programmable, so at this point “it cannot be used for solving practical problems”.

But he adds that if the team is able to build an efficient enough programmable chip, several important computational problems could be solved. Among those are predicting how proteins dock to one another and how molecules vibrate, says Lu.

Weedbrook notes that photonic quantum computing started later than the other approaches, but it could now “potentially leap-frog the rest”. At any rate, he adds, “It is only a matter of time before quantum computers will leave classical computers in the dust.”

https://scitechdaily.com/ai-boosted-by-parallel-convolutional-light-based-processors/

"AI Boosted by Parallel Convolutional Light-Based Processors

By EPFL JANUARY 7, 2021

Matrix Multiplications Light Processor

Schematic representation of a processor for matrix multiplications which runs on light. Credit: University of Oxford

The exponential growth of data traffic in our digital age poses some real challenges on processing power. And with the advent of machine learning and AI in, for example, self-driving vehicles and speech recognition, the upward trend is set to continue. All this places a heavy burden on the ability of current computer processors to keep up with demand.

Now, an international team of scientists has turned to light to tackle the problem. The researchers developed a new approach and architecture that combines processing and data storage onto a single chip by using light-based, or “photonic” processors, which are shown to surpass conventional electronic chips by processing information much more rapidly and in parallel.

The scientists developed a hardware accelerator for so-called matrix-vector multiplications, which are the backbone of neural networks (algorithms that simulate the human brain), which themselves are used for machine-learning algorithms. Since different light wavelengths (colors) don’t interfere with each other, the researchers could use multiple wavelengths of light for parallel calculations. But to do this, they used another innovative technology, developed at EPFL, a chip-based “frequency comb,” as a light source.

“Our study is the first to apply frequency combs in the field of artificial neural networks,” says Professor Tobias Kippenberg at EPFL, one of the study’s leads. Professor Kippenberg’s research has pioneered the development of frequency combs. “The frequency comb provides a variety of optical wavelengths that are processed independently of one another in the same photonic chip.”

“Light-based processors for speeding up tasks in the field of machine learning enable complex mathematical tasks to be processed at high speeds and throughputs,” says senior co-author Wolfram Pernice at Münster University, one of the professors who led the research. “This is much faster than conventional chips which rely on electronic data transfer, such as graphic cards or specialized hardware like TPU’s (Tensor Processing Unit).”

After designing and fabricating the photonic chips, the researchers tested them on a neural network that recognizes hand-written numbers. Inspired by biology, these networks are a concept in the field of machine learning and are used primarily in the processing of image or audio data. “The convolution operation between input data and one or more filters — which can identify edges in an image, for example, are well suited to our matrix architecture,” says Johannes Feldmann, now based at the University of Oxford Department of Materials. Nathan Youngblood (Oxford University) adds: “Exploiting wavelength multiplexing permits higher data rates and computing densities, i.e. operations per area of processor, not previously attained.”

“This work is a real showcase of European collaborative research,” says David Wright at the University of Exeter, who leads the EU project FunComp, which funded the work. “Whilst every research group involved is world-leading in their own way, it was bringing all these parts together that made this work truly possible.”

The study is published in Nature this week, and has far-reaching applications: higher simultaneous (and energy-saving) processing of data in artificial intelligence, larger neural networks for more accurate forecasts and more precise data analysis, large amounts of clinical data for diagnoses, enhancing rapid evaluation of sensor data in self-driving vehicles, and expanding cloud computing infrastructures with more storage space, computing power, and applications software.

Reference: “Parallel convolutional processing using an integrated photonic tensor core” by J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stappers, M. Le Gallo, X. Fu, A. Lukashchuk, A. S. Raja, J. Liu, C. D. Wright, A. Sebastian, T. J. Kippenberg, W. H. P. Pernice and H. Bhaskaran, 6 January 2021, Nature."

https://interestingengineering.com/worlds-fastest-most-powerful-neuromorphic-processor-for-ai-unveiled

"A new optical neuromorphic processor developed by Swinburne University of Technology can operate more than 1000 times faster than any previous processor. The processor for artificial intelligence (AI) functions faster than 10 trillion operations per second (TeraOPs/s).

Optical micro-combs

The invention could revolutionize neural networks and neuromorphic processing in general. “This breakthrough was achieved with ‘optical micro-combs', as was our world-record internet data speed reported in May 2020,” said in a statement Swinburne’s Professor David Moss.

Micro-combs are new devices made up of hundreds of infrared lasers all held on a single chip. Compared to other optical sources, they are much smaller, lighter, faster, and cheaper.

The new innovation demonstrated by the Swinburne team uses a single processor while simultaneously interleaving the data in time, wavelength, and spatial dimensions through a single micro-comb chip.

“In the 10 years since I co-invented them, integrated micro-comb chips have become enormously important and it is truly exciting to see them enabling these huge advances in information communication and processing. Micro-combs offer enormous promise for us to meet the world’s insatiable need for information," added Moss.

Co-lead author of the study Dr. Xingyuan (Mike) Xu explained how this innovative use of micro-combs is giving the researchers a glimpse into the processors of the future.

Cost and energy reductions

Distinguished Professor Arnan Mitchell from RMIT University added that the "technology is applicable to all forms of processing and communications" and will result in significant future cost and energy consumption reductions.

“Convolutional neural networks have been central to the artificial intelligence revolution, but existing silicon technology increasingly presents a bottleneck in processing speed and energy efficiency,” said key supporter of the research team, Professor Damien Hicks from Swinburne and the Walter and Eliza Hall Institute.

“This breakthrough shows how a new optical technology makes such networks faster and more efficient and is a profound demonstration of the benefits of cross-disciplinary thinking, in having the inspiration and courage to take an idea from one field and using it to solve a fundamental problem in another.”"

Tuesday, June 13, 2023

Theory of mind - TOPCloud

[Theory of mind - TOPCloud +2021-03 RS]

Theory of mind : LLM:ML & us : RS

Theory of mind : Clearly the Problem Sort Tree & Theory of mind; But also of the industrial age+(stone)
LLM - Large Language Models as tool makers

https://www.youtube.com/watch?v=qWI1AJ2nSDY

To sum the content directly within the Layers of TOPCloud..

Work Unit Cost Average = {

Work Blocks : Work Unit Allocations per task

WATTS,
TIME,
Effort,
Accuracy,

}
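
The Work Unit Cost Average above can be sketched as one record per work block, averaged over a task. This is only an illustration: the units (watts, seconds, an effort & accuracy score between 0 and 1) and the names `WorkBlock` & `work_unit_cost_average` are assumptions, not definitions from TOPCloud.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class WorkBlock:
    """One work unit allocation for a task (field units are assumed)."""
    watts: float      # average power drawn while the block ran
    seconds: float    # wall-clock time taken
    effort: float     # assumed 0..1 score of how hard the block was
    accuracy: float   # assumed 0..1 score of result quality


def work_unit_cost_average(blocks):
    """Average each cost field over all work blocks of a task."""
    if not blocks:
        raise ValueError("at least one work block is required")
    return WorkBlock(
        watts=mean(b.watts for b in blocks),
        seconds=mean(b.seconds for b in blocks),
        effort=mean(b.effort for b in blocks),
        accuracy=mean(b.accuracy for b in blocks),
    )
```

A task made of two blocks is then summarised by a single averaged block, which can be compared between a local run & a cloud run.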

Basic LLM-Hive={

LLM : The Mind or Hive:

Direct knowledge gathering,
Basic Tool use : MathML, PyMath, OpenCL, Programming

Tool making is the stone age step
Tool on tool is Industrial

Too Big To Fail ;-)

}

Rupert S

https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html

https://science.n-helix.com/2022/10/ml.html

https://science.n-helix.com/2023/06/tops.html

https://science.n-helix.com/2022/08/jit-dongle.html
https://science.n-helix.com/2022/06/jit-compiler.html

https://science.n-helix.com/2023/06/map.html
https://science.n-helix.com/2023/06/ptp.html

https://science.n-helix.com/2023/02/smart-compression.html

https://is.gd/TheSelfInSelf

***********

Fully Autonomous NPC : Research Paper - https://arxiv.org/pdf/2304.03442.pdf

Fully Autonomous Real-World Reinforcement Learning with Applications to Mobile Manipulation https://arxiv.org/pdf/2107.13545.pdf

TOPCloud Heuristic Machine Learning
https://is.gd/LEDSource

Fully Autonomous NPCs - Putting "Open World" To Shame (ChatGPT-Powered) : TOPCloud

https://www.youtube.com/watch?v=Se6KFn1Nni4

Autonomous agents are:

Angels In Disguise : Secret underflow missions, Emotive resonances & Shared information such as local logs,
How well do 'Autonomous NPC' handle information about & from others ?

Repeating? Common goals become daily tasks, Heuristics is like this! in Rogue

Expressive? Allow interactions from such functions as Educators & News channels that they watch...
Treat some of the content like dreams or surreal interactions & narrative...
Not all content is believed; Not all dreams were unreal...

Some are made; Some fall!

Life is a function & has mechanics! Not all events suffer from a proof that nothing is or was programmed...

TOPCloud; Outside influences & Larger pools of experience such as dream; role play; Interactions; Moving home to another 'persons device' (Such as Beijing)

Interaction creation & perfection; TOP Cloud.

Example Material : TOPCloud Text Translate & Associate

Soule
https://www.youtube.com/watch?v=KBqPIcQV3hk
https://www.youtube.com/watch?v=ICGuGONrNzk
https://www.youtube.com/watch?v=UorRxnx-dsw

************

TOP BOOSTER Cloud Enemy(tm) Provided by potentially DLSS Cloud Founder :

*
TOPCloud & BlueTooth & Device : Localised & Cloud Computing JITCompiler:

I have to be specific about TOPCloud & BlueTooth; Due to data bandwidth constraints..
Phone & Device direct provision of Computing power to Bluetooth devices is hard!

Bandwidth is often only 250Kb & that is including the Codec data such as SBC & LCPlus & HE AAC 3D Audio!
So we need to save data & also Compute! Presenting TOPCloud..

TOPCloud provides ML TOPS & Computing power to devices through Protocols known as the JITCompiler GPU RTP/RDP Device-Chain Stack

https://science.n-helix.com/2022/06/jit-compiler.html
https://science.n-helix.com/2022/10/ml.html

RS
*

We cannot all Buy a founders GPU But we can all use your Founders Edition low price Cloud plugin for MMO & Online activated play Gaming :
Cloud Enemy(tm) - TENSOR CORE + TOPS + We cannot all buy your cloud GPU Founders edition...

for reasons that AMD & NVidia and ARM & Intel do not directly buy a RTX3080TI Founders edition :p ^^ but we can all use your :

Cloud Enemy(tm):(c)RS TENSOR CORE : All GPU of note have TOPS and obviously we all specialise <3

My proposal is simple : All special console MMO need a 370 Tensor core server side :

Enemy, Friend,Pet, Emoti play(tm)

(Read at the bottom of the post please.) Bear in mind this does not mean NVidia is the best at RayTracing..
But it does mean we can truly afford to activate the full benefits of having ML TOPS..
Mobile phones often only have 4 TOPS or even 2! At the most 10; Specialists like the iPhone reach 20 to 30.

But all could afford a small complement to the Founders Cloud, in that ML is dealt with for the entire MMO by the cloud; That way no one needs to know that ..

MLT_RTP:RS
Machine Learning TOPS RTP Is a protocol specifically for the Mapping & implementation of AI
Upscale your machine parameters with living system ML

Packets are intended to be between 15KB & 1MB light load over 1 minute
256KB to 4MB load over 1 minute..
Containing pre mapped dynamic logic & operations procedure calls that enhance for example:

Game environment
Game AI
Robot logic
Driver logic
NPC Logic

Research & Logistics
Mapping & Terrain
Radar & Drive By Wire
Traffic control & routing
Landing & takeoff
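
The MLT_RTP packet budgets above can be turned into a sustained bitrate with a few lines. This is a sketch only: the function names are made up for illustration, KB & MB are assumed to be binary (1024 based), and because the two ranges overlap between 256KB & 1MB the sketch assumes the light class wins there.

```python
KB = 1024
MB = 1024 * KB

# Load classes taken from the text: bytes sent over one minute.
LIGHT_LOAD = (15 * KB, 1 * MB)
HEAVY_LOAD = (256 * KB, 4 * MB)


def average_kbit_per_second(bytes_per_minute):
    """Average bitrate in kbit/s (1 kbit = 1000 bits) for a per-minute load."""
    return bytes_per_minute * 8 / 60 / 1000


def classify_load(bytes_per_minute):
    """Name the load class for a per-minute byte count (assumed tie-break: light)."""
    if bytes_per_minute < LIGHT_LOAD[0]:
        return "below light"
    if bytes_per_minute <= LIGHT_LOAD[1]:
        return "light"
    if bytes_per_minute <= HEAVY_LOAD[1]:
        return "heavy"
    return "over budget"
```

Read this way, 15KB over a minute averages about 2 kbit/s & 4MB over a minute averages about 559 kbit/s.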

GPU RTP (Complex 3D RTP, Simple message, local cache, Monster cloud render + local)(c)RS
Exists specifically for You the client:

NVidia
Microsoft..
Google
Apple
AMD
Cloud gaming and service providers

Linux VM
Windows VM
Mac VM

Cloud Machine learning at GPU specialist clouds is of very high potency & potential,
But for a $1 a week subscription game like Quake? Very hard & at large cost!

(c)Rupert S https://science.n-helix.com

Cloud Enemy(tm)

Core strategic advice & adaptable SVM CPU <> GPU

SVM/Int List:
Hard mode: Smaller refinement
Advance Hard mode: Micro model save, Micro model regression

Advance BattleMode: Hard mode: Micro model save, Varied challenge (small regression),Indirect reference chat
Advance BattleMode: Hard mode: Micro model save, Varied challenge (small regression),Indirect reference chat,Personal chat
Advance BattleMode: Hard mode:RND resurgence, Micro model save, Varied challenge (small regression),Indirect reference chat,Personal chat

Machine learning,
The Advanced SVM feature Set & Development

CPU lead Advanced SVM/ML potential
GPU refinement & memory Expansion/Expression/Development

SVM/ML Logic for:
Shaders,
Tessellation,
Compression,
PML Vector Ray-Tracing

(c)RS

Raising TOP's is JIT OpenCL

The main process of internally Raising TOP's is JIT OpenCL

https://science.n-helix.com/2022/08/jit-dongle.html
https://science.n-helix.com/2022/06/jit-compiler.html

*
ML_RTP chain events:

TPU Main Machine Learning NPC,

Micro Enactor Scripts and ML (GPU Server Side)

Local Micro Enactor Scripts and ML (Client GPU Side)
*

The concept is to share processing work further down or up the chain:
Display to GPU & then CPU & USB,

If there is a USB JIT Dongle such as compute stick that is in the Monitor USB or in a USB Dock on the HDMI/DisplayPort Cable; Then the JIT Compiler will handle OpenCL work units called Kernels...

The ML RTP protocol sends work packets to servers; Traditionally in online games Scripts run on the server,

MLT_RTP adds depth because the server can run Machine Learning Workloads such as OpenCL JIT & procedural calls to run mobs & pets..

The main process is to have the local computer or device, such as a phone, running small Machine Task Interpreters; MTI are small machine learning routines that run through scripts & diagnose problems with them..

For example MOBS/Allies run into walls; With higher latency localised JIT Compiler Tasks can run the MOB/Ally Locally & not have to download from server so frequently..

So we reduce latency but can still check the Mob/Ally is doing something we want & is not exploited.
We can run 10 Seconds of commands locally; For example on a localised node in Europe while the game runs in Japan...
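
The idea of running a Mob/Ally locally for a short window & then letting the server check the result can be sketched as below. Everything here is hypothetical: `Command`, `run_local_window` & `server_accepts` are illustration names, & the check (stay inside the map, never move faster than allowed) is only one example of what "not exploited" could mean.

```python
from dataclasses import dataclass


@dataclass
class Command:
    dx: float        # movement along x for this command
    dy: float        # movement along y for this command
    seconds: float   # how long the command takes


LOCAL_WINDOW_SECONDS = 10.0  # the 10 second window mentioned in the text


def run_local_window(start, commands, window=LOCAL_WINDOW_SECONDS):
    """Apply commands on the local node until the time window is used up."""
    x, y = start
    elapsed = 0.0
    trace = [(x, y)]
    for cmd in commands:
        if elapsed + cmd.seconds > window:
            break  # the rest waits for the next window
        x += cmd.dx
        y += cmd.dy
        elapsed += cmd.seconds
        trace.append((x, y))
    return trace, elapsed


def server_accepts(trace, bounds, max_step):
    """Server-side check of a local trace: inside the map & no oversized jumps."""
    min_x, min_y, max_x, max_y = bounds
    for (x0, y0), (x1, y1) in zip(trace, trace[1:]):
        if not (min_x <= x1 <= max_x and min_y <= y1 <= max_y):
            return False
        if abs(x1 - x0) > max_step or abs(y1 - y0) > max_step:
            return False
    return True
```

Only the short trace travels to the server, not every frame of the Mob/Ally.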

We can execute the thought processes of the Ally/Mob on the powerful TPU / Tensor Cores / Server F16..

Individually scripting motions for all characters on another node; As in the Physics, Motions & Animations!
TPU are not known for GPU Render capacity & Nodes with both TPU & GPU would be pricey!

But we chain events:

TPU Main Machine Learning NPC,

Micro Enactor Scripts and ML (GPU Server Side)

Local Micro Enactor Scripts and ML (Client GPU Side)

(c)RS

*

Low Latency ALLM Direct Render : GPU RTP & GPU RDP Protocols..
Specifically designed with GPU & Display Connections, Transport & presentation with..

JIT Compiler
https://science.n-helix.com/2022/08/jit-dongle.html
https://science.n-helix.com/2022/06/jit-compiler.html

Compressed Render VECSR https://science.n-helix.com/2022/04/vecsr.html
https://science.n-helix.com/2023/02/smart-compression.html
https://is.gd/LEDSource

*

TOP Cloud Basics for personal help AI

Machine learning from the direction of Alexa, Cortana, Siri, Bard

Local processing requires RAM & processor time? Yes

So we have a planned local process:

2 MB Ram
700 Cross references on topic you ask...
300 Language response process
Optimised server access; Point of view is to isolate the connection below 2Mb/s (Probably 50Kb per response)

Local library of common topics for you!
Local list of items you like; Song types you prefer; Your personal preference over 30 minutes

Local data matrix is optimised for you..

Most tasks are carried out local first,
As you see most requests require less thought & are already optimised for uploading & downloading...
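
A local first lookup can be sketched in a few lines. This assumes the local library is a plain dictionary & that `ask_server` is whatever function reaches the cloud; both are placeholders for illustration, not part of any real assistant API.

```python
def answer(query, local_library, ask_server):
    """Return (reply, source); try the local library before the server."""
    key = query.strip().lower()
    if key in local_library:
        return local_library[key], "local"
    reply = ask_server(key)
    local_library[key] = reply  # remember the reply for next time
    return reply, "server"
```

The server is only contacted for topics that are not already held locally, so repeated requests cost no bandwidth.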

Question is; how much server do we need ? & how personal is it?

Uses of TOP Cloud : Disabled People Basics:
TOP Cloud is well suited to efficiently helping visually impaired & hearing impaired people,
It can provide heuristics that allow colour blind people to see what they need!
It can do many things with a very small bit of time on large TPU & GPU, potentially in 1 Second for many people,

Heuristics is all that we need after logic; & we can filter a video with a colour sensitive persons visual range; basic example & with a single WebASM or WebGPU colour layer; Very low CPU use they See..

Red, Green, Blue; Enhance or tint..
Single colour layer WebGPU, Shader, WebASM.. not actually on the video!
Additive tint.. As to enhance the colour or indicate with another a slight amount truecolour...
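
The additive tint can be sketched on plain RGB tuples. This is a CPU illustration in Python of the per-pixel maths only; a real single colour layer would be a WebGPU shader or WebASM routine, & the names `additive_tint` & `tint_frame` are assumptions.

```python
def additive_tint(pixel, tint, amount):
    """Add `amount` (0..1) of `tint` to an (r, g, b) pixel, clamped to 255."""
    return tuple(min(255, round(c + amount * t)) for c, t in zip(pixel, tint))


def tint_frame(frame, tint, amount):
    """Apply the same additive tint to every pixel of a frame (list of rows)."""
    return [[additive_tint(p, tint, amount) for p in row] for row in frame]
```

The source video stays untouched; only the displayed layer is tinted, for example a slight red boost for a viewer who has trouble seeing red.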

"I see your true colours shining through" TOPCloud

RS

TOPCloud Offload Logic:

In terms of WebASM & WebGPU & MathML; TOPCloud provides sufficient advantages to be considered a core utility..

While Offloading repeating content such as Siteload core stack (Server) & Localising configuration such as Webpage size & DPI & Dynamic font arrangements that require thought.

In terms of Offloaded function & Efficient system load for large configurations..

Especially efficient configurations such as TPU, Coral, GPU work & Cloud CPU that have large optimised stacks & installed drivers.

RS

#Doctors #HuristicLists #CommonMedicalAdvisory #WebMD #CommonPerscriptionAuditingAdvice #Doctors I do not always know where to go!

#HeuristicList

#MD
#TOPCloud
#CommonResource
#DiscreteCosign
#Doctors
#HuristicLists
#CommonMedicalAdvisory
#WebMD
#CommonPerscriptionAuditingAdvice
#InfogramaticSortLists
#CommonErrorsTipNotes
#SugestedStaffLevels
#NonObligatoryMandate

https://is.gd/LEDSource

Rupert S

*

#TheTOPCloudEdit (c)RS : Principle of data saving non localised Machine aided design & workflow (c)RS

We really have to think about all the offloading strategies we can; Our network & storage footprint should be minimal..

To name the philosophy completely we need to start with our most compressible assets!*

Very high precision Float operations
high complexity offloaded ML
Long term strategies; Minutes to hours!

Basic operation to offload are complex ones..

We need multiple shape cuts in a single pass; Preferably vectors!
But those shapes shall be multiple factor complexity!

The offloading of simple operations with KB of image or file per operation has higher latency & bandwidth!
Complex operations also may require that the HPC configuration has the image, video or data..
But we DO NOT Want to transfer GB/s data on presumption if we do not need to!

So our primary source of TOPS performance is complex operations; We do not firstly offload the image, Video, Texture or Complex Vector upload... if we can avoid it.

But we DO Offload:

Vector lists
Sort lists
Memory optimisation lists
Khronos Compressed Vector files
Complex Math rotations & motions
Complex Vectors (in the sense of motion)
Elliptic Curves & SVM Maths
Multiple Dimensional Vector Arrays
Multi point paths & video & 3D Path tracing pre computations
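
Whether an operation is worth offloading can be estimated by comparing transfer plus remote compute time against local compute time. This is a rough sketch; the function name & every parameter are assumptions for illustration.

```python
def should_offload(payload_bytes, local_seconds, remote_seconds,
                   link_bytes_per_second, round_trip_seconds):
    """Offload only if round trip + transfer + remote compute beats local compute."""
    transfer_seconds = payload_bytes / link_bytes_per_second
    return round_trip_seconds + transfer_seconds + remote_seconds < local_seconds
```

A small vector list with heavy maths passes the test easily; a large image with light maths does not, which matches the list above.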

The principle is precision, Because what we do with a Photoshop is map a topography, our 3D Space with a complex compressed interpretation that our Facebook Codec can compose into an image edit

We do the same Topography with cancer cutting surgical equipment, We need a precise CUT but our robot is 32Bit!

Due to complexity we need a larger float value! (example value, Many values exist that we need & Armstrong knows that on Saturn voyage 13)
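
Why a 32Bit float can be too small is easy to show. The sketch below rounds a 64Bit Python float to the nearest 32Bit float using the standard `struct` module; `as_float32` is just a helper name chosen here.

```python
import struct


def as_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# A 32-bit float has a 24-bit significand, so above 2**24 it can no
# longer represent every whole number.
lost = as_float32(16777217.0)   # 2**24 + 1 becomes 16777216.0
```

The same loss applies to fine positions: the further a coordinate is from zero, the coarser the steps a 32Bit value can express.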

TOPClouds non local edit is an example where the function; for example of the Alexa music player...
Is not to send all the data; We Help our local computer think; the same way as a teacher; gives a formula!
We do not need to know the Pythagoras value in full; But our operation may require it!

We do not just need examples of Pi; We need examples of polynomial shapes, Vectors, Concepts & designs, Requiring less data sent & received than the work total cost of transfer to a trained massive network

https://www.youtube.com/watch?v=9ykRV2OMPbE

Rupert S

*

#Sound Strategy game TOPCloud (c)RS

PCM & MP4 are 2D/3D Image so GPU Helps there also with 3D Audio mapping!
Games do not require cloud processing of images & a lot of local strategies are procedural Heuristic

You see RDP has GPU Connect (my innovation I might add) So Bluetooth & Wifi can connect RTP GPU; The port specifics are not particularly important; However a device such as a music streamer can have ML TOP's available locally & from the cloud,

Due to how the TOPCloud strategy works with localised ML TOPS; Not all data has to be sent or received..
For example all Audio 3D Profiles for HQ Room audio can be done within a few MB of data; With some hard work? 150Kb of data & so in reach of phones & mobile!

Gaming is an example here. I give Tic-Tac-Toe as the example, where all that a device like Alexa or a Google smart device has to think is: Which square? but..

No physical picture needs to be sent for the game to be played & if required a small Tic-Tac-Toe Strategy ML is desired locally for a quicker response!
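
How little data the game needs can be shown directly: a whole board fits in 2 bytes. The encoding below (base 3, first cell most significant) is an assumption chosen for illustration, not a TOPCloud wire format.

```python
def encode_board(cells):
    """Pack 9 cells (0 empty, 1 X, 2 O) into a base-3 integer sent as 2 bytes."""
    if len(cells) != 9 or any(c not in (0, 1, 2) for c in cells):
        raise ValueError("expected 9 cells with values 0, 1 or 2")
    value = 0
    for c in cells:
        value = value * 3 + c
    # 3**9 - 1 = 19682, which fits in 16 bits.
    return value.to_bytes(2, "big")


def decode_board(data):
    """Inverse of encode_board: 2 bytes back to a list of 9 cells."""
    value = int.from_bytes(data, "big")
    cells = []
    for _ in range(9):
        value, c = divmod(value, 3)
        cells.append(c)
    return cells[::-1]
```

A single move is smaller still: a square number from 0 to 8 needs only 4 bits.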

You see with a low latency GPU RTP & GPU RDP connection to cloud GPU; Most localised thinking TOPS can be carried out in Seconds if not milliseconds & PCM & MP4 are 2D/3D Image so GPU Helps there also with 3D Audio mapping!

Rupert S

*

Core features of TOPCloud:

RTP ML TOPS are a processors friend

3D audio mapping & spatialization for realistic sound effects
3D Vector Support for various audio formats such as PCM, MP4, OGG, and WAV

Low latency & high bandwidth connection to cloud GPU servers via RDP

Procedural & heuristic algorithms for generating game scenarios & strategies & 3D Audio & Visuals
Localized & cloud-based machine learning models for optimizing game performance & user experience

RTP GPU Connect technology that allows users to access GPU resources from any device with Bluetooth or WiFi

TOPCloud is a revolutionary 'TOPS' way to enjoy & create audio games using your own music & the power of the cloud. Try it today & discover a new dimension of gaming!

https://science.n-helix.com/2022/10/ml.html
https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html

https://science.n-helix.com/2022/08/jit-dongle.html
https://science.n-helix.com/2022/06/jit-compiler.html

https://science.n-helix.com/2023/02/smart-compression.html

*

Scaling; We can classify by colour or creativity. (c)RS

If you use TOPCloud, you can share between different displays in the TOP's Sense..
but mostly you would need cloud presence,

Mostly this would be about making the most out of TOP heavy Business GPU & personal ones in your computer or consoles.

But sharing common tasks such as scaling movies by type or by identifying a single movie to upscale...

Now you might be asking what we would be doing there?
Well a single movie uses the same materials in our ML; We can analyse the class & optimise the scaling by class..

For those familiar with games & FSR; We familiarise our code with a single game!
By doing this we improve our product and can therefore classify by:

Resolution
Style
Speed
Type, FPS for example & RTS

We can classify by colour or creativity...

We do not simply have to roll the dice on General Scaling, We can use classifiers:

Title
Scale
Type
Speed
Frame Rate
Colour & Composure
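
Choosing a scaler setting by class instead of rolling the dice on General Scaling can be sketched as a table lookup with a general fallback. The class fields, the profile values & the names below are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentClass:
    kind: str    # e.g. "FPS", "RTS", "Film"
    style: str   # e.g. "photoreal", "cartoon"
    speed: str   # "slow" or "fast"


# Hypothetical tuning table; the numbers are placeholders, not measured values.
PROFILES = {
    ContentClass("FPS", "photoreal", "fast"): {"sharpen": 0.3, "temporal": True},
    ContentClass("RTS", "cartoon", "slow"): {"sharpen": 0.7, "temporal": False},
}
GENERAL = {"sharpen": 0.5, "temporal": False}


def scaling_profile(content):
    """Return the tuned profile for a known class, else the general one."""
    return PROFILES.get(content, GENERAL)
```

A title that has been analysed once gets its own entry; everything else falls back to the general profile.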

PrePlanning
With the help of #TheTOPCloudEdit & F16 + DOT4 Classification commitments to:

Larger Tables Interpolated & Optimised

Pre planning & Optimisation LUT Mapping,
Colour & Dynamic range,
Dynamic frame rate control & adaptation

Sound Dynamic Range,
Dynamic Volume
Virtual 3D Space

Channel balancing; Before you ask:
Smoothing the 3D Range over each Speaker & Combined audio space,
Space mapping, Head averages such as size & width, Ears & Room Size

Rupert S

Agents
https://www.youtube.com/watch?v=Se6KFn1Nni4
https://www.youtube.com/watch?v=DxxAwDHgQhE

https://science.n-helix.com/2021/10/eccd-vr-3datmos-enhanced-codec.html

https://science.n-helix.com/2023/06/tops.html

Rupert S

LUT Table Example {TOPCloud & TOPCloud Edit}

The significance of LUT Tables; Colour conversion ICC; Is fundamental to how good a monitor or TV Image looks,

But we need to assume that most TV's & Monitors do not have a suitably RAM Loaded GPU;

ICC profiles can by themselves take MB of RAM to load & Up to 256MB of conversion Table!
TOPCloud & TOPCloud Edit allow for parameter offloading,

The basic assumption for offloading is that there is no advantage to offloading a LUT Table to the local GPU?

However TOPCloud allows for 3 fundamentally Simple Concepts to be in play,

1 Use the OpenCL JITCompiler to procedurally unfold & map all LUT Mappings,

2 You can remap to different hardware using the Hardware Abstraction Layer; In fact the JITCompiler makes running the command low latency & super easy!

3 You can even offload to the cloud (in the same town, for example Cloudflare).
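
Procedurally unfolding a LUT means sending a parameter & rebuilding the table on the device. The sketch below does this for a simple 1D gamma table in Python; a real ICC conversion is usually a much larger 3D table & would run as an OpenCL kernel, so treat this purely as an illustration with made-up function names.

```python
def build_gamma_lut(gamma, size=256):
    """Unfold a 1D lookup table from one parameter instead of storing it."""
    top = size - 1
    return [round(top * (i / top) ** gamma) for i in range(size)]


def apply_lut(values, lut):
    """Map each input code value through the table."""
    return [lut[v] for v in values]
```

Only the gamma value has to travel; the 256 entries are recomputed where they are needed.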

RS

****************

Basic Upscaling Kernel Starter Set, Contains a basic set of what we hope to achieve.
Learning from proverb; Future Productions inc

OpenCL Kernel Builder
https://drive.google.com/file/d/1d_bWbZl9fAZXsLbN_jZdqSxdWzraLSIz/view?usp=share_link

Texture Encode Source
https://drive.google.com/file/d/1udWU4slmZkUGcagcJl1KwFWh5FJ5ScoN/view?usp=sharing

FSR Scaler
https://drive.google.com/file/d/1D27MOBYKVkKib1JzP_eFucp8RRrzAhd6/view?usp=share_link

Python ML Image denoisers, Very heavy denoising
https://github.com/cszn/BSRGAN
https://github.com/cszn/SCUNet

Crucial Codec source for projects
H266 https://drive.google.com/file/d/1Zt0CrP5p8ld7xnki1B9X4wz6Opyv13aH/view?usp=share_link
AV1 https://drive.google.com/file/d/179pqqS36v--t_BDjyhe1x_oVeYuxkWBw/view?usp=share_link
AAC https://drive.google.com/file/d/1YJy1yAdmEdjSMhtUjvTEU-y9HqJXFzzN/view?usp=share_link
LC3 https://drive.google.com/file/d/1_Gnf_PLN81YepCugmaRNofib7zLOHBNO/view?usp=share_link
DSC https://drive.google.com/file/d/1hbTFsFqzQTqLbhOaEwY-QkM4y3uAglXX/view?usp=share_link

X86Features-Emu
https://drive.google.com/file/d/15vXBPLaU9W4ul7lmHZsw1dwVPe3lo-jK/view?usp=usp=sharing

PoCL Source & Code
https://is.gd/LEDSource

Linux HPC Node install
https://is.gd/LinuxHPCNode

https://github.com/GPUOpen-LibrariesAndSDKs/RadeonML
https://github.com/GPUOpen-LibrariesAndSDKs/RadeonImageFilter

https://science.n-helix.com/2022/10/ml.html

To Compress using CPU/GPU: MS-OpenCL
https://is.gd/MS_OpenCL
https://is.gd/OpenCL4X64
https://is.gd/OpenCL4ARM

Upscale DL

https://is.gd/DictionarySortJS
https://is.gd/UpscaleWinDL
https://is.gd/HPC_HIP_CUDA

https://is.gd/UpscalerUSB_ROM

https://is.gd/OpenStreamingCodecs

PoCL
https://drive.google.com/file/d/1Cvq9uQlEedwIXaJEMoD_r4lvOXgCy-Ld/view?usp=drive_link

X86Features-Emu
https://drive.google.com/file/d/1iDW0HcpOoJqaSkuZGpHKJfKrI1H68diU/view?usp=sharing

*
https://github.com/ssube/diffusers/tree/feature/onnx-upscale

https://github.com/huggingface/diffusers
https://huggingface.co/ssube/stable-diffusion-x4-upscaler-onnx

https://huggingface.co/uwg/upscaler/tree/main
https://huggingface.co/nvmmonkey/optimal_upscale/tree/main
https://huggingface.co/gmp-dev/gmp-upscaler/tree/main/ESRGAN

Neural Engine
https://github.com/godly-devotion/MochiDiffusion

*

PysicsX
Isaac Gym - Preview Release
https://developer.nvidia.com/isaac-gym

CALM: Conditional Adversarial Latent Models for Directable Virtual Characters
https://github.com/NVlabs/CALM

*

Personality UI : Have a friend

Alpaca Character Generation model
4Bit for speed, But not precise
https://huggingface.co/anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g
trained 3Epoc Higher Precision https://huggingface.co/chavinlo/gpt4-x-alpaca

Base model https://huggingface.co/chavinlo/alpaca-13b
https://github.com/teknium1/GPTeacher

Python WebUI
https://github.com/oobabooga/text-generation-webui
Mac; Mostly MAC but fast
https://github.com/ggerganov/llama.cpp

how to use & personality sets https://discord.com/invite/aitrepreneur-1018992679893340160

On the subject of how deep a personality of 4Bit, 8Bit or 16Bit is, reference:
https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
https://science.n-helix.com/2022/10/ml.html
https://science.n-helix.com/2023/06/tops.html
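
The cost of lower bit depth can be measured with a plain uniform quantiser. This is a simplified sketch: the 4Bit models linked above use more elaborate grouped schemes, & the range of -1 to 1 is an assumption.

```python
def quantize(value, bits, lo=-1.0, hi=1.0):
    """Snap `value` to the nearest of 2**bits evenly spaced levels in [lo, hi]."""
    steps = 2 ** bits - 1
    clamped = min(hi, max(lo, value))
    step = (hi - lo) / steps
    return lo + round((clamped - lo) / step) * step


def max_error(bits, samples=1000):
    """Largest rounding error seen over evenly spaced test values in [-1, 1]."""
    worst = 0.0
    for i in range(samples + 1):
        v = -1.0 + 2.0 * i / samples
        worst = max(worst, abs(v - quantize(v, bits)))
    return worst
```

Each extra bit roughly halves the worst case error, which is the trade between speed & precision noted for the 4Bit model.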

Friday, March 24, 2023

Path-trace-RTDL (c)RS - The combination of Ray Tracing & Path Tracing & FSR_DL; The advantage being a combination of RayTrace CU & General SiMD

Path-trace-RTDL (c)RS

The combination of Ray Tracing & Path Tracing & FSR_DL; The advantage being a combination of RayTrace CU & General SiMD, RS 2023-03 in response to the RS Technology being implemented.

https://science.n-helix.com/2022/03/fsr-focal-length.html

https://science.n-helix.com/2019/06/vulkan-stack.html

https://science.n-helix.com/2022/04/vecsr.html

https://science.n-helix.com/2016/04/3d-desktop-virtualization.html

Path Tracing define: RS

Path tracing is when you take an objective viewpoint; A number of viewpoints to the receptor (Observer, such as gamer or camera).

VP = View Point, Observer is camera, RT Path = RT Core Ray

View point Mesh, That is directional

VP : VP : VP : VP : VP

VP : VP : VP : VP : VP

VP : VP : VP : VP : VP

VP : VP : VP : VP : VP

VP : VP : VP : VP : VP {Forward} VP : VP : VP : VP : VP

VP : VP : VP : VP : VP {Observer} VP : VP : VP : VP : VP

VP : VP : VP : VP : VP {Backward} VP : VP : VP : VP : VP

VP : VP : VP : VP : VP

VP : VP : VP : VP : VP

VP : VP : VP : VP : VP

VP : VP : VP : VP : VP

The location VP initiates a SiMD view directly to & from reflective objects & calculates distortion of view & texture with FSR_DL

In this view I would like you to consider a reflective bounce camera viewpoint & think of the energy that saves you.

RayTracing Define:

Ray is cast from object & calculated to target vector; With Distortion calculation & reflections.

We Combine minimum intersection with the VP; Using a cast Ray; So we know the viewpoint is active,
We trace the route back as the observer & calculate each intersection as an observer...
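
The two basic operations behind a cast ray, finding an intersection & bouncing off it, can be sketched with textbook vector maths. This is the standard ray and sphere test & mirror reflection, not necessarily the exact method meant above; a sphere stands in for the reflective object & rays that start inside it are ignored.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def scale(a, s):
    return tuple(x * s for x in a)


def reflect(direction, normal):
    """Mirror `direction` about a unit `normal`: r = d - 2(d.n)n."""
    return sub(direction, scale(normal, 2.0 * dot(direction, normal)))


def ray_sphere(origin, direction, centre, radius):
    """Distance along a unit-length ray to the nearest hit, or None on a miss."""
    oc = sub(origin, centre)
    b = dot(oc, direction)
    c = dot(oc, oc) - radius * radius
    disc = b * b - c
    if disc < 0:
        return None
    t = -b - math.sqrt(disc)
    return t if t >= 0 else None
```

Each intersection found this way gives a point & a normal, from which the reflected ray towards the next surface or the observer is computed.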

FSR_DL handles surface distortions & fogs of war...

We minimise the viewpoints memory footprint by altering the scale of the viewpoint in respect to the observers screen resolution / Distance .... We can also upscale the pretend frame!

We can cache the frame & discard if we wish!

Raytracing also provides distortion defines for viewpoints & Ray Distortion & Direction Calculations.

RT-Sparse-Field Pre Calculation Cache : RS & Lisa Lue

During the initiation of the frame we calculate polygon placement,
We Cache the metrics & use them for our distance fields.

Long term Non Volatile Cache
Short term recalculation cache
Validate Cache & use for ray tracing RayMarch & lighting.

Real Time Sparse Distance Fields: https://www.youtube.com/watch?v=iY15xhuuHPQ

Distance Fields are defined as Object detection with range finding,
In GPU SiMD we can reduce the Field Multiple Recount,

Low cache containment Serial processing; Is where we have not got all the Polygon distances counted & in cache...

We can however count on the GPU having the Polygon Map in RAM for a small segment of polygons; But due to the fact that we place the polygons in precise locations; We already have Distance.

Distance fields are helpful because; Ray-forwarding (Ray March) does not need to do more than,
Process; Distortion & Viscosity & Density & transparency & Reflection,

But we can do this over larger fields in areas with lower levels of modification property, which counts as a lower required precision!
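
Ray marching over a distance field can be sketched as sphere tracing: at each step the field says how far the ray may safely advance. The sketch assumes a signed distance function for a single sphere; step count, maximum distance & epsilon are arbitrary illustration values.

```python
import math


def sphere_sdf(point, centre, radius):
    """Signed distance from `point` to the surface of a sphere."""
    return math.dist(point, centre) - radius


def ray_march(origin, direction, sdf, max_steps=64, max_distance=100.0,
              epsilon=1e-4):
    """Advance along a unit-length ray by the field distance until a hit or a miss."""
    t = 0.0
    for _ in range(max_steps):
        point = tuple(o + t * d for o, d in zip(origin, direction))
        distance = sdf(point)
        if distance < epsilon:
            return t  # close enough to the surface to count as a hit
        t += distance  # safe step: nothing is nearer than `distance`
        if t > max_distance:
            return None
    return None
```

In open areas the steps are large, which is why empty or low detail fields cost few iterations.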

(c)Rupert S

Path-trace-RTDL : This could be us : Path Tracing all light reflection, Does not require something as high on GPU as RX6500! Can be CPU SiMD/AVX on the Vectors, So can be a regular thing!

We can even super sample our cube maps dynamically; So that we take the vector locations & transform the cube maps into fully RayMaped Polygons.

The results are all about how we plan to Dynamically Optimise & Draw Vectors.

RS

https://drive.google.com/file/d/14gGMWscMeUSRTDQJumclXfD5hDnHtxb2/view?usp=sharing, https://drive.google.com/file/d/15wZotdIXvctqoNQAc8bXwDHZx9w1VBAR/view?usp=sharing, https://drive.google.com/file/d/1ALi7anoOif5XT6VQYiWw_xfXVrrAedhD/view?usp=sharing, https://drive.google.com/file/d/1AsdsW8c4-sKk4asLOTv8ESCCS3u6Y25X/view?usp=sharing, https://drive.google.com/file/d/1H4VkoyuVVfAN2V0KiEF9VXM3OLadmuXt/view?usp=sharing, https://drive.google.com/file/d/1LIf05i_A7omfELolanN0wEwG2HosIiKz/view?usp=sharing, https://drive.google.com/file/d/1Rt1-4_UKodFnbnaHXYnKRh2G6-k0GCzc/view?usp=sharing, https://drive.google.com/file/d/1X8bprVmk8vtfhJxDtd6zKZBOjOL7CDiS/view?usp=sharing, https://drive.google.com/file/d/1czvKdoE0rAJogQMwMCwOUpYe-Dna9gdN/view?usp=sharing

Mine-Craft-PathTrace

Cubic SubSampling reference :

https://science.n-helix.com/2023/03/path-trace.html
https://science.n-helix.com/2023/02/smart-compression.html

In simple principle SubS uses Probable interaction PDF & Ray Boxing (Isolated Cell Cube = [SS]/[SubS]),
We therefore only need to Predict Sample for likely cube overflows into adjacent boxes.

Resampling first; As we are resampling a ray box for probable intersection with our primary target (viewer),
Our motive is that the viewer is the only one to see the rays; Only Science projects need to know all; But not always,

We need a sample that does interact with the Observer/Viewer!
So we simply need a bounding box with a direction mesh (multiply by X) that shows probable cause to interact!

We know that Viewer X is the only person seeing that interaction & So we know that if we point a triangle towards a light source; We directly interact with a subsample array,
We do not need them all!

PDF Similarity is used with the Ray Box to allocate work to probable cause; Located at User interaction AKA Observer/Viewer.

https://gpuopen.com/download/publications/Efficient_Spatial_Resampling_Using_the_PDF_Similarity.pdf
https://gpuopen.com/download/publications/I3D2023_SubspaceCulling_updated.pdf

MultiDimensional Raytracing & 3D Visualisation

Projection Pursuit (PP) based algorithms were shown to be efficient solutions for performing dimensionality reduction on large datasets by searching low-dimensional projections of the data.
Accelerating a Geometrical Approximated PCA Algorithm Using AVX2 and CUDA

https://www.mdpi.com/2072-4292/12/12/1918

Ray Tracing and Volume Rendering Large Molecular Data on Multi-Core and Many-Core Architectures
http://www.sci.utah.edu/~wald/Publications/2013/bnsview/bnsview.pdf

Objective ~= Viewer, Deformation Bounce : Scatter Pattern S{1 : 2 : 3 : 4 } : Repeat

GDC 2023 - Two-Level Radiance Caching for Fast and Scalable Real-Time Global Illumination in Games
https://www.youtube.com/watch?v=1eLz6WpXvQo

The objective is to bounce rays towards the viewer in a probability Oblong (an uneven cube),
What we do is mathematically work out how probable it is that additional light bounces on surface X

/{s}--{surface}
{Light Source}---/ \ / \ {viewer}
\---\{surface}

We can take the surface as a cube; Aligning a common detection point along a flat or low polygon count version of the surface...

Map the rays of light intersecting the surface at low resolution & map the average reflection as with path tracing,
compensating for shape distortion with calculations...

Effectively we treat the light as a polygon & prove probable additional light based on its likelihood of existing,
Low light levels reduce that likelihood; Strong sources of light are more likely to have rays...
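A rough sketch of that likelihood, assuming a mirror-style bounce & a light strength normalised to 0..1. The function `bounce_probability` & its weighting (alignment with the viewer multiplied by strength) are illustrative assumptions, not a formula given in this post.

```python
import math

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reflect(d, n):
    """Mirror unit direction d about unit normal n: r = d - 2(d.n)n."""
    k = 2.0 * dot(d, n)
    return tuple(di - k * ni for di, ni in zip(d, n))

def bounce_probability(light_dir, normal, to_viewer, intensity, max_intensity=1.0):
    """Heuristic 0..1 score: how likely the bounced light is worth tracing."""
    bounced = reflect(normalize(light_dir), normalize(normal))
    # Alignment: 1 when the bounce points straight at the viewer, 0 when it points away.
    alignment = max(0.0, dot(bounced, normalize(to_viewer)))
    # Strength: weak sources lower the score, strong sources raise it.
    strength = min(1.0, max(0.0, intensity / max_intensity))
    return alignment * strength

# Light arrives at 45 degrees onto a flat floor & the viewer sits on the mirror path.
p_bright = bounce_probability((1, -1, 0), (0, 1, 0), (1, 1, 0), intensity=1.0)
p_dim = bounce_probability((1, -1, 0), (0, 1, 0), (1, 1, 0), intensity=0.25)
p_away = bounce_probability((1, -1, 0), (0, 1, 0), (-1, -1, 0), intensity=1.0)
```

Here the dim source scores lower than the bright one, & a bounce that heads away from the viewer scores zero & can be skipped.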

Surface deformations require more effort & we will concentrate more processor cycles on deformed areas such as water ripples,

However we shall calculate the deformation matrix of the surface & therefore average the rays we measure & calculate directions from the deformation bounce.

This works because we calculate distortion from arc, sine, tan, reflection value & the variation in reflection dispersion & opacity.
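A small sketch of averaging the deformation bounce, assuming the deformed surface is given as a list of per-sample normals. The names `average_bounce` & `sample_budget`, the spread measure & the budget constants are all hypothetical choices for illustration.

```python
import math

def _unit(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def _reflect(d, n):
    k = 2.0 * sum(a * b for a, b in zip(d, n))
    return tuple(a - k * b for a, b in zip(d, n))

def average_bounce(light_dir, normals):
    """Return (mean bounce direction, spread) over a patch of surface normals.

    spread is 0 for a flat patch & grows as the bounces fan out.
    """
    d = _unit(light_dir)
    bounces = [_reflect(d, _unit(n)) for n in normals]
    mean = tuple(sum(axis) / len(bounces) for axis in zip(*bounces))
    mean_len = math.sqrt(sum(c * c for c in mean))
    spread = 1.0 - mean_len
    direction = tuple(c / mean_len for c in mean) if mean_len > 0 else mean
    return direction, spread

def sample_budget(spread, base=4, extra=64):
    """Spend more rays where the surface is more deformed (assumed linear rule)."""
    return base + int(round(extra * spread))

# Flat floor versus a gentle ripple, both lit from straight above.
flat_dir, flat_spread = average_bounce((0, -1, 0), [(0, 1, 0)] * 3)
ripple_dir, ripple_spread = average_bounce((0, -1, 0), [(0.1, 1, 0), (-0.1, 1, 0)])
```

The ripple patch still bounces straight up on average, but its larger spread earns it a bigger `sample_budget` than the flat patch.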

Scatter Pattern S{1 : 2 : 3 : 4 } : Repeat

For Surface X{1 : 2 : 3 : 4 } + Light Y{1 : 2 : 3 : 4 } = light Z{1 : 2 : 3 : 4 } + Scatter pattern S{1 : 2 : 3 : 4 }

Y{1 : 2 : 3 : 4 } / X{1 : 2 : 3 : 4 } = Scatter pattern S{1 : 2 : 3 : 4 }
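One way to read the last line, assuming X, Y & S are four-component arrays & that "/" means component-by-component division; both are assumptions, since the notation is not defined further here. The name `scatter_pattern` is hypothetical.

```python
def scatter_pattern(light_y, surface_x):
    """Component-wise S = Y / X; a zero surface term yields 0.0 instead of an error."""
    return [y / x if x != 0 else 0.0 for y, x in zip(light_y, surface_x)]

# Four light components divided by four surface components.
s = scatter_pattern([2.0, 4.0, 6.0, 8.0], [1.0, 2.0, 3.0, 4.0])
```

The same four-wide shape is a natural fit for one SiMD lane group, which suits the Repeat step in the pattern.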

Rupert S

*

PoCL Source & Code
https://is.gd/LEDSource

https://science.n-helix.com/2022/06/jit-compiler.html

https://science.n-helix.com/2022/08/jit-dongle.html

Bus Tec : https://drive.google.com/file/d/1M2ie8Jf_bNJaySNQZ5mqM1fD9SAUOQud/view?usp=sharing

FPGA 'Xilinx Virtex-II' HPC application Multiple-Applications & Image-Net & Matrix-Multiplication - H-SIMD machine _ configurable parallel computing for data-intensive HPC
https://digitalcommons.njit.edu/cgi/viewcontent.cgi?article=1836&context=dissertations

A SIMD architecture for hard real-time systems
https://www.repository.cam.ac.uk/bitstream/handle/1810/315712/dissertation.pdf?sequence=2

Ideal for 4Bit Int4 XBox & Int8 GPU
PULP-NN: accelerating quantized neural networks on parallel ultra-low-power RISC-V processors - Bus-width 8-bit, 4-bit, 2-bit and 1-bit
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6939244/

Audio BT Codec

https://science.n-helix.com/2021/10/he-aacsbc-overlapping-wave-domains.html

DSC, ETC, ASTC & DXT Compression for display frames

https://science.n-helix.com/2022/09/ovccans.html

https://science.n-helix.com/2023/02/smart-compression.html

https://science.n-helix.com/2022/04/vecsr.html

https://science.n-helix.com/2016/04/3d-desktop-virtualization.html

https://science.n-helix.com/2019/06/vulkan-stack.html

https://science.n-helix.com/2019/06/kernel.html

https://science.n-helix.com/2022/03/fsr-focal-length.html

https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html

https://science.n-helix.com/2022/08/simd.html

