Monday, July 24, 2023

ZenBleed

ZenBleed Parallel Solvent RS 2023

ZenBleed, So what about 64Bit to 128Bit bleed in SiMD? Mind you; 'Bound to be One' 20:20pm 24/07/2023 (c)RS

XMM 128Bit YMM 256Bit ZMM 512Bit

My theory involves using higher modes for synchronous packing!

What do I mean?

When you have a full system (processes), 64Bit Processes start packing 128Bit registers! Particularly with Float units 182Bits...

Indeed, an old error is packing 128Bit registers with Float unit (FPU) values with rollover!

So?

Two things first positive:

We can pack FPU Register Values into 256Bit (and Zero : vzeroupper & tzcnt (Trailing Zero Count)),
Enabling us to directly utilise SiMD <> with <> FPU!

We can solve the lower XMM to YMM to ZMM differences! How ?

We Multiple Array fill the next register with at least 2 values!

So ?

Parallel processing!

How ?

XMM-128 | ZMM / 4 or 128 * 4! Parallel!*4 Best!

XMM-128 | YMM / 2 or 128 * 2! Parallel!*2 Best!

YMM-256 | ZMM / 2 or 256 * 2! Parallel!*2 Best!

FPU-182 | YMM or 182 * 1 = Single File FPU <> SiMD |

ZMM / 2 or 182+r * 2! Parallel!*2 Best! = Double File FPU <> SiMD

r = Remainder for vzeroupper | tzcnt

Parallel Operation Principle with CPU Register & OPS division : RS


We will be using the value split:

512/2 = 256*2
256/2 = 128*2
128/2 = 64*2
128/4 = 32*4
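
The value split above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: plain Python integers stand in for registers, and the helper names `lane_count`, `pack_lanes` and `unpack_lanes` are hypothetical, not part of any SiMD API.

```python
def lane_count(register_bits: int, lane_bits: int) -> int:
    """Number of equal lanes of lane_bits that fit in one register."""
    if register_bits % lane_bits != 0:
        raise ValueError("lane width must divide the register width")
    return register_bits // lane_bits


def pack_lanes(values, lane_bits: int) -> int:
    """Pack unsigned values into one wide integer, lane 0 in the low bits."""
    mask = (1 << lane_bits) - 1
    packed = 0
    for index, value in enumerate(values):
        packed |= (value & mask) << (index * lane_bits)
    return packed


def unpack_lanes(packed: int, lane_bits: int, count: int):
    """Split a wide integer back into its lanes."""
    mask = (1 << lane_bits) - 1
    return [(packed >> (i * lane_bits)) & mask for i in range(count)]


if __name__ == "__main__":
    # The four splits listed above: 512/2, 256/2, 128/2, 128/4
    for register_bits, lane_bits in [(512, 256), (256, 128), (128, 64), (128, 32)]:
        lanes = lane_count(register_bits, lane_bits)
        print(f"{register_bits} bits = {lanes} lanes of {lane_bits} bits")
    packed = pack_lanes([1, 2, 3, 4], 32)   # four 32Bit values in 128 bits
    print(unpack_lanes(packed, 32, 4))
```

Every lane in one register has the same width here, which matches the point about keeping the whole branch at a single precision.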

We will therefore be able to use 32Bit, 64Bit, 128Bit, 256Bit, 512Bit values at leisure..
But we have to optimise the entire branch to use a single precision!

Single Type Precision operations remove the effects of C++ Fast-float & Half Precision...

No operation errors.. & Parallel operation

reference (Faster Maths & ML)

(c)Rupert S

< Yes Bug Bounty & Solve Bounty : Bounty Bounty >

https://lock.cmpxchg8b.com/zenbleed.html

Vulnerability

It turns out that with precise scheduling, you can cause some processors to recover from a mis-predicted vzeroupper incorrectly!

This technique is CVE-2023-20593 and it works on all Zen 2 class processors, which includes at least the following products:

AMD Ryzen 3000 Series Processors
AMD Ryzen PRO 3000 Series Processors
AMD Ryzen Threadripper 3000 Series Processors
AMD Ryzen 4000 Series Processors with Radeon Graphics
AMD Ryzen PRO 4000 Series Processors
AMD Ryzen 5000 Series Processors with Radeon Graphics
AMD Ryzen 7020 Series Processors with Radeon Graphics
AMD EPYC “Rome” Processors

Speculation

Hold on, there’s another complication! Modern processors use speculative execution, so sometimes operations have to be rolled back.

What should happen if the processor speculatively executed a vzeroupper, but then discovers that there was a branch misprediction? Well, we will have to revert that operation and put things back the way they were… maybe we can just unset that z-bit?

If we return to the analogy of malloc and free, you can see that it can’t be that simple - that would be like calling free() on a pointer, and then changing your mind!

That would be a use-after-free vulnerability, but there is no such thing as a use-after-free in a CPU… or is there?

RS Spectra Mitigations https://science.n-helix.com/2018/01/microprocessor-bug-meltdown.html
ZenBleed Parallel Solvent RS 2023 https://science.n-helix.com/2023/07/zenbleed.html

Core/CPU/GPU security core SSL/TLS BugFix
https://science.n-helix.com/2020/06/cryptoseed.html
https://science.n-helix.com/2019/05/zombie-load.html


Secure Configuration:
https://is.gd/SecurityHSM
https://is.gd/WebPKI

Open Streaming Codecs 2023 https://is.gd/OpenStreamingCodecs

Vectors & maths
https://science.n-helix.com/2022/08/simd.html
https://science.n-helix.com/2022/04/vecsr.html
https://science.n-helix.com/2016/04/3d-desktop-virtualization.html
https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html
https://science.n-helix.com/2023/02/smart-compression.html

Networking & Management
https://science.n-helix.com/2023/06/tops.html
https://science.n-helix.com/2023/06/ptp.html
https://science.n-helix.com/2023/06/map.html
https://science.n-helix.com/2023/02/pm-qos.html
https://science.n-helix.com/2022/08/jit-dongle.html
https://science.n-helix.com/2022/06/jit-compiler.html
https://science.n-helix.com/2022/03/ice-ssrtp.html
https://science.n-helix.com/2022/01/ntp.html

Faster Maths & ML
https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html
https://science.n-helix.com/2021/02/multi-operation-maths.html
https://science.n-helix.com/2021/11/parallel-execution.html
https://science.n-helix.com/2022/12/math-error-solve.html
https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
https://science.n-helix.com/2022/10/ml.html

Focus on Quality
https://science.n-helix.com/2022/09/ovccans.html
https://science.n-helix.com/2022/11/frame-expand-gen-3.html
https://science.n-helix.com/2022/03/fsr-focal-length.html

https://blog.cloudflare.com/zenbleed-vulnerability/

https://www.theverge.com/2023/7/25/23806705/amd-ryzen-cpu-processor-zenbleed-vulnerability-exploit-bug

************* Reportage >

Introduction

All x86-64 CPUs have a set of 128-bit vector registers called the XMM registers. You can never have enough bits, so recent CPUs have extended the width of those registers up to 256-bit and even 512-bits.

The 256-bit extended registers are called YMM, and the 512-bit registers are ZMM.

These big registers are useful in lots of situations, not just number crunching! They’re even used by standard C library functions, like strcmp, memcpy, strlen and so on.

Let’s take a look at an example. Here are the first few instructions of glibc’s AVX2 optimized strlen:


(gdb) x/20i __strlen_avx2
...
<__strlen_avx2+9>: vpxor xmm0,xmm0,xmm0
...
<__strlen_avx2+29>: vpcmpeqb ymm1,ymm0,YMMWORD PTR [rdi]
<__strlen_avx2+33>: vpmovmskb eax,ymm1
...
<__strlen_avx2+41>: tzcnt eax,eax
<__strlen_avx2+45>: vzeroupper
<__strlen_avx2+48>: ret

The full routine is complicated and handles lots of cases, but let’s step through this simple case. Bear with me, I promise there’s a point!

The first step is to initialize ymm0 to zero, which is done by just xoring xmm0 with itself.

> vpxor xmm0, xmm0, xmm0
vpcmpeqb ymm1, ymm0, [rdi]
vpmovmskb eax, ymm1
tzcnt eax, eax
vzeroupper

Here rdi contains a pointer to our string, so vpcmpeqb will check which bytes in ymm0 match our string, and stores the result in ymm1.

As we’ve already set ymm0 to all zero bytes, only nul bytes will match.

vpxor xmm0, xmm0, xmm0
> vpcmpeqb ymm1, ymm0, [rdi]
vpmovmskb eax, ymm1
tzcnt eax, eax
vzeroupper

Now we can extract the result into a general purpose register like eax with vpmovmskb.

Any nul byte will create a 1 bit, and any other value will create a 0 bit.

vpxor xmm0, xmm0, xmm0
vpcmpeqb ymm1, ymm0, [rdi]
> vpmovmskb eax, ymm1
tzcnt eax, eax
vzeroupper

Finding the first zero byte is now just a case of counting the number of trailing zero bits.

That’s a common enough operation that there’s an instruction for it - tzcnt (Trailing Zero Count).

vpxor xmm0, xmm0, xmm0
vpcmpeqb ymm1, ymm0, [rdi]
vpmovmskb eax, ymm1
> tzcnt eax, eax
vzeroupper

Now we have the position of the first nul byte, in just four machine instructions!
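
The four steps above can be imitated in Python to make the data flow concrete. This is a sketch of what the instructions compute on one 32-byte block, not glibc's code: the function name `strlen_first_block` is hypothetical, and padding short inputs with non-zero bytes is an assumption made so the example stays self-contained.

```python
def strlen_first_block(data: bytes) -> int:
    """Imitate vpxor / vpcmpeqb / vpmovmskb / tzcnt on one 32-byte block.

    Returns the index of the first NUL byte in the block, or 32 when the
    block contains none (tzcnt returns the operand size for a zero input).
    """
    # Assumption: pad short inputs with non-zero bytes so padding never matches.
    block = data[:32].ljust(32, b"\x01")

    zero = 0  # vpxor xmm0, xmm0, xmm0: the comparison value is all zero bytes

    mask = 0
    for index, byte in enumerate(block):
        if byte == zero:            # vpcmpeqb: compare each byte with zero
            mask |= 1 << index      # vpmovmskb: one mask bit per byte

    if mask == 0:
        return 32
    # tzcnt: count trailing zero bits, i.e. the position of the lowest set bit
    return (mask & -mask).bit_length() - 1


if __name__ == "__main__":
    print(strlen_first_block(b"hello\x00world"))
```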

You can probably imagine just how often strlen is running on your system right now, but suffice to say, bits and bytes are flowing into these vector registers from all over your system constantly.

Zeroing Registers

You might have noticed that I missed one instruction, and that’s vzeroupper.

vpxor xmm0, xmm0, xmm0
vpcmpeqb ymm1, ymm0, [rdi]
vpmovmskb eax, ymm1
tzcnt eax, eax
> vzeroupper

You guessed it, vzeroupper will zero the upper bits of the vector registers.

The reason we do this is because if you mix XMM and YMM registers, the XMM registers automatically get promoted to full width. It’s a bit like integer promotion in C.

This works fine, but superscalar processors need to track dependencies so that they know which operations can be parallelized. This promotion adds a dependency on those upper bits, and that causes unnecessary stalls while the processor waits for results it didn’t really need.

These stalls are what glibc is trying to avoid with vzeroupper. Now any future results won’t depend on what those bits are, so we safely avoid that bottleneck!

The Vector Register File

Now that we know what vzeroupper does, how does it do it?

Your processor doesn’t have a single physical location where each register lives, it has what’s called a Register File and a Register Allocation Table. This is a bit like managing the heap with malloc and free, if you think of each register as a pointer. The RAT keeps track of what space in the register file is assigned to which register.

In fact, when you zero an XMM register, the processor doesn’t store those bits anywhere at all - it just sets a flag called the z-bit in the RAT. This flag can be applied to the upper and lower parts of YMM registers independently, so vzeroupper can simply set the z-bit and then release any resources assigned to it in the register file.

Figure (Z-Bit): A register allocation table (left) and a physical register file (right).

Speculation

Hold on, there’s another complication! Modern processors use speculative execution, so sometimes operations have to be rolled back.

What should happen if the processor speculatively executed a vzeroupper, but then discovers that there was a branch misprediction? Well, we will have to revert that operation and put things back the way they were… maybe we can just unset that z-bit?

If we return to the analogy of malloc and free, you can see that it can’t be that simple - that would be like calling free() on a pointer, and then changing your mind!

That would be a use-after-free vulnerability, but there is no such thing as a use-after-free in a CPU… or is there?

Spoiler: yes there is 🙂

Zenbleed Demo

This animation shows why resetting the z-bit is not sufficient.
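
The malloc and free analogy can be played out as a toy model. To be clear, this is only the analogy turned into code: the classes `ToyRegisterFile` and `ToyRat` are hypothetical, and nothing here claims to describe how a Zen 2 core actually manages its register file.

```python
class ToyRegisterFile:
    """Toy model only: slots are handed out and returned like malloc/free."""

    def __init__(self, size: int):
        self.slots = [0] * size
        self.free = list(range(size))

    def alloc(self, value: int) -> int:
        slot = self.free.pop(0)
        self.slots[slot] = value
        return slot

    def release(self, slot: int) -> None:
        # Assumption: the most recently released slot is reused first.
        self.free.insert(0, slot)


class ToyRat:
    """Maps the upper half of one register to a slot, plus a z-bit."""

    def __init__(self, regfile: ToyRegisterFile, upper_value: int):
        self.regfile = regfile
        self.slot = regfile.alloc(upper_value)
        self.z_bit = False

    def vzeroupper(self) -> None:
        self.z_bit = True
        self.regfile.release(self.slot)   # like free(): the slot may be reused

    def naive_rollback(self) -> None:
        self.z_bit = False                # "just unset the z-bit"

    def read_upper(self) -> int:
        return 0 if self.z_bit else self.regfile.slots[self.slot]


def demo() -> int:
    regfile = ToyRegisterFile(4)
    ours = ToyRat(regfile, upper_value=0x1111)
    ours.vzeroupper()             # executed speculatively
    regfile.alloc(0x5EC2E7)       # someone else is handed the released slot
    ours.naive_rollback()         # misprediction discovered, z-bit unset
    return ours.read_upper()      # stale mapping now reads the other value


if __name__ == "__main__":
    print(hex(demo()))
```

In this model the rollback restores the flag but not the ownership of the slot, which is the use-after-free shape the text describes.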

Vulnerability

It turns out that with precise scheduling, you can cause some processors to recover from a mispredicted vzeroupper incorrectly!

This technique is CVE-2023-20593 and it works on all Zen 2 class processors, which includes at least the following products:

AMD Ryzen 3000 Series Processors
AMD Ryzen PRO 3000 Series Processors
AMD Ryzen Threadripper 3000 Series Processors
AMD Ryzen 4000 Series Processors with Radeon Graphics
AMD Ryzen PRO 4000 Series Processors
AMD Ryzen 5000 Series Processors with Radeon Graphics
AMD Ryzen 7020 Series Processors with Radeon Graphics
AMD EPYC “Rome” Processors

The bug works like this: first of all you need to trigger something called the XMM Register Merge Optimization, followed by a register rename and a mispredicted vzeroupper. This all has to happen within a precise window to work.

We now know that basic operations like strlen, memcpy and strcmp will use the vector registers - so we can effectively spy on those operations happening anywhere on the system! It doesn’t matter if they’re happening in other virtual machines, sandboxes, containers, processes, whatever!

This works because the register file is shared by everything on the same physical core. In fact, two hyperthreads even share the same physical register file.

Don’t believe me? Let’s write an exploit 🙂

Exploitation

There are quite a few ways to trigger this, but let’s examine a very simple example.

vcvtsi2s{s,d} xmm, xmm, r64
vmovdqa ymm, ymm
jcc overzero
vzeroupper
overzero:
nop

Here cvtsi2sd is used to trigger the merge optimization. It’s not important what cvtsi2sd is supposed to do, I’m just using it because it’s one of the instructions the manual says uses that optimization.

Then we need to trigger a register rename, vmovdqa will work. If the conditional branch is taken but the CPU predicts the not-taken path, the vzeroupper will be mispredicted and the bug occurs!

Optimization

Exploit Running

It turns out that mis-predicting on purpose is difficult to optimize! It took a bit of work, but I found a variant that can leak about 30 kb per core, per second.

This is fast enough to monitor encryption keys and passwords as users login!

We’re releasing our full technical advisory, along with all the associated code today. Full details will be available in our security research repository.

If you want to test the exploit, the code is available here.

Note that the code is for Linux, but the bug is not dependent on any particular operating system - all operating systems are affected!

Discovery

I found this bug by fuzzing, big surprise 🙂 I’m not the first person to apply fuzzing techniques to finding hardware flaws. In fact, vendors fuzz their own products extensively - the industry term for it is Post-Silicon Validation.

So how come this bug wasn’t found earlier? I think I did a couple of things differently, perhaps with a new perspective as I don’t have an EE background!

Feedback

The best performing fuzzers are guided by coverage feedback. The problem is that there is nothing really analogous to code coverage in CPUs… However, we do have performance counters!

These will let us know when all kinds of interesting architectural events happen.

Feeding this data to the fuzzer lets us gently guide it towards exploring interesting features that we wouldn’t have been able to find by chance alone!

It was challenging to get the details right, but I used this to teach my fuzzer to find interesting instruction sequences. This allowed me to discover features like merge optimization automatically, without any input from me!

Oracle

When we fuzz software, we’re usually looking for crashes. Software isn’t supposed to crash, so we know something must have gone wrong if it does.

How can we know if a CPU is executing a randomly generated program correctly? It might be completely correct for it to crash!

Well, a few solutions have been proposed to this problem. One approach is called reversi. The general idea is that for every random instruction you generate, you also generate the inverse (e.g. ADD r1, r2 → SUB r1, r2). Any deviation from the initial state at the end of execution must have been an error, neat!
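
The reversi idea can be tried on a toy register machine. This is a minimal sketch under stated assumptions: only three operations (add, sub, xor) on 64-bit registers, with the destination and source always different so every step stays invertible. The names `random_program`, `with_inverse` and `reversi_check` are hypothetical.

```python
import random

MASK = (1 << 64) - 1
INVERSE = {"add": "sub", "sub": "add", "xor": "xor"}


def execute(program, regs):
    """Run a list of (op, dst, src) tuples on a copy of the register list."""
    regs = list(regs)
    for op, dst, src in program:
        if op == "add":
            regs[dst] = (regs[dst] + regs[src]) & MASK
        elif op == "sub":
            regs[dst] = (regs[dst] - regs[src]) & MASK
        elif op == "xor":
            regs[dst] ^= regs[src]
    return regs


def random_program(rng, length, register_count):
    program = []
    for _ in range(length):
        # dst != src, otherwise e.g. "xor r1, r1" destroys information
        dst, src = rng.sample(range(register_count), 2)
        program.append((rng.choice(sorted(INVERSE)), dst, src))
    return program


def with_inverse(program):
    """Append the inverse of every instruction, in reverse order."""
    undo = [(INVERSE[op], dst, src) for op, dst, src in reversed(program)]
    return program + undo


def reversi_check(seed: int = 0) -> bool:
    rng = random.Random(seed)
    initial = [rng.getrandbits(64) for _ in range(4)]
    program = with_inverse(random_program(rng, 16, 4))
    # Any deviation from the initial state would indicate an execution error.
    return execute(program, initial) == initial


if __name__ == "__main__":
    print(reversi_check())
```

Even in this toy, keeping every instruction invertible needs a special rule, which hints at why the approach gets complicated for a full CISC instruction set.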

The reversi approach is clever, but it makes generating testcases very complicated for a CISC architecture like x86.

A simpler solution is to use an oracle. An oracle is just another CPU or a simulator that we can use to check the result. If we compare the results from our test CPU to our oracle CPU, any mismatch would suggest that something went wrong.

I developed a new approach with a combination of these two ideas, I call it Oracle Serialization.

Oracle Serialization

As developers we monitor the macro-architectural state, that’s just things like register values. There is also the micro-architectural state which is mostly invisible to us, like the branch predictor, out-of-order execution state and the instruction pipeline.

Serialization lets us have some control over that, by instructing the CPU to reset instruction-level parallelism. This includes things like store/load barriers, speculation fences, cache line flushes, and so on.

The idea of a Serialized Oracle is to generate a random program, then automatically transform it into a serialized form.

A randomly generated sequence of instructions, and the same sequence but with randomized alignment, serialization and speculation fences added.

Random program            Serialized form

movnti [rbp+0x0],ebx      movnti [rbp+0x0],ebx
                          sfence
rcr dh,1                  rcr dh,1
                          lfence
sub r10, rax              sub r10, rax
                          mfence
rol rbx, cl               rol rbx, cl
                          nop
xor edi,[rbp-0x57]        xor edi,[rbp-0x57]

These two programs might have very different performance characteristics, but they should produce identical output. The serialized form can now be my oracle!

If the final states don’t match, then there must have been some error in how they were executed micro-architecturally - that could indicate a bug.

This is exactly how we first discovered this vulnerability, the output of the serialized oracle didn’t match!
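
The comparison step can be sketched with a toy executor. Everything here is an assumption made for illustration: the executor `run`, its `buggy` switch (an injected fault that only triggers when two instructions run back to back without a fence) and the helper names are hypothetical, and this is not the fuzzer described in the article.

```python
import random

MASK = (1 << 64) - 1
FENCES = ("sfence", "lfence", "mfence", "nop")


def serialize(program, rng):
    """Return the same program with a fence (or nop) after every instruction."""
    serialized = []
    for instruction in program:
        serialized.append(instruction)
        serialized.append((rng.choice(FENCES),))
    return serialized


def run(program, regs, buggy=False):
    """Toy executor: fences never change the visible register state."""
    regs = list(regs)
    previous_was_fence = True
    for instruction in program:
        op = instruction[0]
        if op in FENCES:
            previous_was_fence = True
            continue
        _, dst, src = instruction
        if op == "add":
            regs[dst] = (regs[dst] + regs[src]) & MASK
        elif op == "xor":
            regs[dst] ^= regs[src]
        if buggy and not previous_was_fence:
            regs[dst] ^= 1   # injected fault, unreachable in serialized form
        previous_was_fence = False
    return regs


def oracle_mismatch(program, regs, rng, buggy=False) -> bool:
    """True when the plain run and the serialized run end in different states."""
    plain = run(program, regs, buggy)
    oracle = run(serialize(program, rng), regs, buggy)
    return plain != oracle


if __name__ == "__main__":
    program = [("add", 0, 1), ("xor", 1, 2), ("add", 2, 0)]
    print(oracle_mismatch(program, [1, 2, 3], random.Random(0), buggy=False))
    print(oracle_mismatch(program, [1, 2, 3], random.Random(0), buggy=True))
```

A correct executor gives the same final state for both forms, so a mismatch points at how the instructions were executed rather than at the program itself.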

Solution

We reported this vulnerability to AMD on the 15th May 2023.

AMD have released a microcode update for affected processors. Your BIOS or Operating System vendor may already have an update available that includes it.

Workaround

It is highly recommended to use the microcode update.

If you can’t apply the update for some reason, there is a software workaround: you can set the chicken bit DE_CFG[9].

This may have some performance cost.

Linux

You can use msr-tools to set the chicken bit on all cores, like this:

# wrmsr -a 0xc0011029 $(($(rdmsr -c 0xc0011029) | (1<<9)))
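
The arithmetic inside that command is a read, an OR with bit 9, and a write back. As a sketch of just the bit manipulation (it does not touch any MSR; the function names are hypothetical):

```python
DE_CFG_MSR = 0xC0011029   # MSR address used in the wrmsr command above
CHICKEN_BIT = 9           # DE_CFG[9]


def with_chicken_bit(value: int) -> int:
    """Return the register value with bit 9 set, other bits unchanged."""
    return value | (1 << CHICKEN_BIT)


def chicken_bit_is_set(value: int) -> bool:
    """Check whether bit 9 is already set in a value read from the MSR."""
    return bool((value >> CHICKEN_BIT) & 1)


if __name__ == "__main__":
    current = 0x0000000000000040   # made-up example value, not a real reading
    updated = with_chicken_bit(current)
    print(hex(DE_CFG_MSR), hex(current), "->", hex(updated))
    print(chicken_bit_is_set(updated))
```

Because the update is an OR, applying it to a value that already has the bit set changes nothing.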

FreeBSD

On FreeBSD you would use cpucontrol(8).

Others

If you’re using some other operating system and don’t know how to set MSRs, ask your vendor for assistance.

Note that it is not sufficient to disable SMT.

Detection

I am not aware of any reliable techniques to detect exploitation. This is because no special system calls or privileges are required.

It is definitely not possible to detect improper usage of vzeroupper statically, please don’t try!

Conclusion
It turns out that memory management is hard, even in silicon 🙂

Acknowledgements

This bug was discovered by me, Tavis Ormandy from Google Information Security!

I couldn’t have found it without help from my colleagues, in particular Eduardo Vela Nava and Alexandra Sandulescu. I also had help analyzing the bug from Josh Eads.

3DChiplet Side By Side 3D Magic with 3D Trenching

3DChiplet Side By Side 3D Magic with 3D Trenching 2021-2023

3D Fabric 5800X3D is hard in production but the delivery is the problem so ... i have another proposal,

Called 

Side By Side 3D Magic (c)Rupert S


Yes 3D Chips are good for cache, Simply connecting chiplets does not require 3D or 3D Stacking,

Side By Side 3D Magic (c)Rupert S

https://science.n-helix.com

Has Layered Chip wafer & PCB Board with interweaved wires:

Carbon fibers, Copper or aluminum or Iron, Not a problem

Through the PCB Chip board, These micro tunnels provide all the PCI & Chip tunnels that a Board could require!

Layered micro tunnel imprinted PCB can have 3 wires per layer (crosswise, Diagonal & Ordered form)

Additionally tunneling up and down is not a problem for you simply layer a connection point that is welded to the next layer as it is laid on top..

Micro film is available, As this is both electrostatic & noise resistant composite.

Since this is a micro multiformat PCB / Chip fabric, At no time do you have to worry about dampness or heat split when made well.

https://www.youtube.com/watch?v=pBZQeW1eeEw

Example of 3D Layered PCB, A bit too rigid but good for a phone or telescope Board..

Chips can be placed inside if you need to! for space reasons; Embed the chiplet..

PCB is ideal for this task; Common view PCB is large space & coy compact?

3D PCB is a space saver & 3D Network Ethernet/Chip IO memory ops

PCB Wire mesh (internal networks) = - |, PCB Layer = _

______(CHIP With Connect)________
----------|-----|-----|----|----|-----------------
_______\____|___\___\_|___________
--------(cooling & IO Chip)--------------
_______|__|_______|___|___________


***********

07:39 23/07/2023 (c)Rupert S

Circuit 3D Print with laser (c)RS


While trenching semiconductors work, in space (vacuum) electrical energy transfers through vacuum!

So you have to use a resistor material in the trench, this is not impossible if you embed ceramic formulas with a laser!

You can however with this technology go up to 2.7v on 5nm; Because higher voltages are faster & more resistant; this makes sense..

The trench (hole) Formatic processor 3D layering technology with:

Circuit = C, Trench = \_/ , resistor = r, Circuit in trench = c, raised bit Circuit or resistor = /C\

C\r/C C\r/C C\r/C

C\_/C C\_/C C\_/C

C\_/C C\r/C C\_/C

/C\c/C\r/C\r/C\r/C\

The challenge of using traditional circuit printing methods in space is that the vacuum can cause the circuit to degrade over time..

This is because the vacuum can strip away the electrons that carry current in the circuit.

3D laser circuit printing could help to mitigate this problem by creating a very dense and compact circuit. This reduces the surface area of the circuit that is exposed to the vacuum and it helps to protect the circuit from the harsh environment of space.

& Also..

One of the challenges of using trench & processor circuit methods in space is that electrical energy transfers through vacuum; Which makes an empty trench a poor insulator.

This means that you need to use a resistor material in the trench,

It is possible to embed ceramic formulas with a laser; This could be a promising way to create resistors in/for space.

However, 3D laser circuit printing could help to mitigate this problem; As the laser can be used to create a very precise and durable circuit.

This technology is meant for the world but also with spatial integrity for deep space & So functionally Rugged/Rigid in use & Function.

Additional thoughts on the challenges and potential of 3D laser circuit printing for space applications:

Challenges:
The vacuum of space can be very harsh on materials, so it is important to use materials that are resistant to radiation and temperature extremes.

Potential:

3D laser circuit printing could allow for the creation of more complex and efficient circuits.

3D laser circuit printing could make it possible to print circuits on-demand; Which could be a major advantage for space missions.

It could also be used to create circuits that are more resistant to the harsh environment of space.

The lack of gravity can also make it difficult to print precise circuits..

(c)Rupert S

Application 23/07/2023

https://science.n-helix.com/2023/07/3dchiplet.html

https://science.n-helix.com/2023/06/map.html

https://science.n-helix.com/2023/06/ptp.html

https://science.n-helix.com/2023/06/tops.html

https://science.n-helix.com/2022/01/ntp.html


*********************

Tilly Arms; The girl with no arms, sympathetic nerve response & frequency rate : Operation Cyborg RS 2023

Tilly Arms; The girl with no arms

I think that the arms are very good, But she needs more!
Clearly artificial skin in silver would do the trick?

I noticed that she has control of them through her stimulated skin.... at the elbow....
Now i saw a study that clearly would help....

Neurons respond on training to noisy signals- & clear notes+

We can clearly get a sympathetic skin monitor to receive the feelings; By listening to skin cell responses ....

Now i feel that since a 9v battery stings the tongue; 2volts is about a bit too much right on sweaty skin, So 1.8 is around right? Dr

https://www.youtube.com/shorts/pmIoL-Ja_Co

Depending upon how much resistance there is in skin, might even help with Lightning & Shocks...

RS

20:08 23/07/2023 What have we learned; Brain Cells : RS : https://www.youtube.com/watch?v=bEXefdbQDjw

Brain Cells respond to:

{ Clear tones }: well, to { Entropic Noisy tones }: unwell
{ Clean Image } to { Entropic Noisy Image }

Cell electrode networks begin at 0.75cm for tasks like DOOM

Cell inputs are learned,
Dynamic connections form to the electrodes & We use logic on the inputs...

Here the strategy is to use tones & noise to respond to the doom player in motion.

The cell structure is clearly not a problem at 3700 * 4 mm

Rupert S

*

AnPa_Wave - Analogue Pattern Wave Vector SiMD Unit : (c)RS


The base symphony is harmony, In other words waveforms; There are a couple of Simple methods that really work:

High performance Float values F16, F32, F64, FPU

Q-Bit Quantum; All forms of Quantum wave work
Radio waves;
Light patterns
Photon wave patterns; single & multiple
Sound hardware; 1 to 3 Bit DAC; Audio conversions; Sample range
Analogue chips that work on harmony & frequency
SVM Elliptic curve maths
Sin, Arc, Tan, Time, Vector

In essence Harmony & frequency is the equivalent of Complex Elliptic curve maths

A Music note score suffices to specify harmony basics:

Waveform shape in 3D
Harmony / Disharmony
Vibration High / Vibration Low
Power High / Power Low
Volts High / Volts Low
Watts High / Watts Low

(c)Rupert S

https://science.n-helix.com/2023/07/3dchiplet.html

https://science.n-helix.com/2023/06/map.html

Wonderful Wave-Pattern Analogue waveforms in meta materials - Pattern recognition in reciprocal space with a magnon-scattering reservoir
https://www.nature.com/articles/s41467-023-39452-y.pdf

Monday, June 26, 2023

Clock & Low Latency Secure NTP, PTP Video & Audio Sync network card (c)RS



Data Throughput:PTP,NTP,AES - Programmable logic, Why use this instead of a NIC ? or with a nic, Latency RS 2023-06-14 (c)RS

FPGA | FPMG Programmable clocks

PTP Official Clock generator,
In board multiplier,
On Die Cache
Precision enhancement Interpolation circuit
On Die Network translation, IP6 & IP4 with
Output Cache

In the case of low latency networking with ECC & Elliptic Curve integrated security:

Time clock +

Onboard
TPM
Certificate Cache

AES output with certificate (can be static & cached)

Output Cache,
Security layer & IP Translation layer

(c)Rupert S

https://www.youtube.com/watch?v=l3pe_qx95E0 1h:00

https://science.n-helix.com/2022/06/jit-compiler.html

https://science.n-helix.com/2022/10/ml.html

https://science.n-helix.com/2023/06/tops.html

https://is.gd/LEDSource

Clock expander with parallel async gate activation

                / |
{Clock} |< |
               |< |
               |< |
               |< |
               \ |

[C] [E]

               / |
{Clock} |< |
               |< | = [CE]
               |< |
               |< |
               \ |

[CE] + Micro [E]

Value Large F16, F32, F64 & so forth

Interpolator

A
----- = Fraction
B

A = 100 - [Fraction] Until B

Or

100 = [Value]A

0 = [Value]B

100 - [Fraction] (A - B)

Rupert S

Reasoning for the network NTP & PTP Audio & Video Sync device


The Network card & Devices are designed to provide high-precision synchronization for video and audio applications using NTP and PTP protocols..

It features an FPGA-based programmable clock generator that can produce multiple output frequencies and phases with low jitter and high accuracy.

The clock generator also supports NTP & PTP official clock functionality,
Which allows the network card to act as a master or slave clock in an NTP & PTP network.

The network card also has an FPMG circuit that can perform interpolation and scaling operations on the input and output clocks & an on-die cache that can store the clock data and reduce latency.

The network card also has a built-in network translation module that can handle both IPv4 and IPv6 protocols,
& an output cache that can buffer the data packets before sending them to the network.

In addition, the network card has a security layer that integrates ECC and elliptic curve cryptography to protect the data transmission.

The security layer can also generate AES output with certificates that can be static or cached on the network card.

The network card also has a TPM module that can store the certificates and keys securely.

The network card is compatible with various video and audio formats and standards, such as Ethernet, Wifi & Radio, HDMI, DisplayPort, SDI, AES3, etc..

It can also support JIT compilation and machine learning applications using the resources of the FPGA and FPMG circuits.

Research and Development,

Rupert S

PTP Server Clock Sync with NTP https://is.gd/PTP_TimeStream
PTP Server Clock Sync https://is.gd/PTP_Low_Latency_Time


https://is.gd/HPC_PTP_Low_Latency_Network

https://www.linuxfoundation.org/press/announcing-ultra-ethernet-consortium-uec

https://ultraethernet.org/

https://jointdevelopment.org/

NTP64 Server (run after PTP) https://is.gd/NTP_Server

Open Streaming Codecs 2023 https://is.gd/OpenStreamingCodecs

The following diagram illustrates some of the possible components and functions of a programmable logic device for data throughput optimization:(c)RS


|-----------------|    |-----------------|    |-----------------|
| PTP official    |    | In-board        |    | On-die cache    |
| clock generator |----| multiplier      |----|                 |
|-----------------|    |-----------------|    |-----------------|
         |                      |                      |
         V                      V                      V
|-----------------|    |-----------------|    |-----------------|
| Precision       |    | On-die network  |    | Output cache    |
| enhancement     |----| translation     |----|                 |
| interpolation   |    |                 |    |-----------------|
| circuit         |    |-----------------|
|-----------------|
         |
         V
|-----------------|
| Time clock      |
|-----------------|
         |
         V
|-----------------|
| Onboard TPM     |
|-----------------|
         |
         V
|-----------------|
| Certificate     |
| cache           |
|-----------------|
         |
         V
|-----------------|
| AES output with |
| certificate     |
|-----------------|
         |
         V
|-----------------|
| Security layer  |
| and IP          |
| translation     |
| layer           |
|-----------------|

*****

Data Throughput:PTP,NTP,AES Programmable Clock & Event Timer (c)RS

One of the challenges of modern network applications is to achieve high data throughput with low latency and high reliability.

Data throughput is the amount of data that can be transferred over a network in a given time.

Latency is the delay between sending and receiving data.

Reliability is the ability to maintain data integrity and availability.

One way to improve data throughput is to use programmable logic devices, such as field-programmable gate arrays (FPGAs) or field-programmable micro-gate arrays (FPMGs).

These devices can be customized to perform specific functions at high speed and efficiency, such as encryption, compression, filtering, routing, etc.

Programmable logic devices can also be configured to support different network protocols and standards, such as:

Precision Time Protocol (PTP), Network Time Protocol (NTP), and Advanced Encryption Standard (AES).

PTP is a protocol that synchronizes the clocks of different devices on a network.

It is used for applications that require precise timing and coordination, such as industrial automation, test and measurement, and telecommunications.

PTP can achieve sub-microsecond accuracy over Ethernet networks.

NTP is a protocol that synchronizes the clocks of different devices on a network.

It is used for applications that require moderate accuracy and stability, such as web servers, email servers, and databases.

NTP can achieve millisecond accuracy over Ethernet networks.
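
How a client turns one request and reply into a clock correction can be shown with the standard four-timestamp calculation used by NTP. This is a sketch of that formula only, with a hypothetical function name and made-up timestamps; it assumes the network delay is the same in both directions.

```python
def ntp_offset_and_delay(t1: float, t2: float, t3: float, t4: float):
    """Clock offset and round-trip delay from one NTP exchange.

    t1: client transmit time (client clock)
    t2: server receive time  (server clock)
    t3: server transmit time (server clock)
    t4: client receive time  (client clock)
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2.0   # how far the client clock is behind
    delay = (t4 - t1) - (t3 - t2)            # time spent on the network
    return offset, delay


if __name__ == "__main__":
    # Made-up example: client is 0.5 s behind, 10 ms each way, 2 ms in the server
    offset, delay = ntp_offset_and_delay(100.000, 100.510, 100.512, 100.022)
    print(f"offset = {offset:.3f} s, delay = {delay:.3f} s")
```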

AES is a standard for symmetric-key encryption.

It is used for applications that require data security and confidentiality, such as banking, e-commerce, and government.

AES can encrypt and decrypt data with 128-bit, 192-bit, or 256-bit keys.

Programmable logic devices can be used instead of or with network interface cards (NICs) to improve data throughput.

NICs are hardware components that connect a device to a network.

They are responsible for sending and receiving data packets over the physical layer of the network.

Programmable logic devices can be integrated with NICs or replace them entirely, depending on the application requirements.

For example:

A programmable logic device can be used as a PTP official clock generator providing a reference time for other devices on the network.

It can also implement an in-board multiplier, which increases the clock frequency of the device.

Additionally, it can have an on-die cache, which stores frequently used data for faster access.

A programmable logic device can also include precision enhancement interpolation circuitry, which improves the accuracy of the clock signal by interpolating between two adjacent clock pulses.

Furthermore, it can have an on-die network translation unit, which converts between different network protocols, such as IPv6 and IPv4.

Moreover, it can have an output cache, which buffers the outgoing data packets for smoother transmission.
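
Interpolating between two adjacent clock pulses, as mentioned in the examples above, can be sketched as plain linear interpolation. This is an assumption about the simplest form such a circuit could take, written in software; the function names are hypothetical.

```python
def interpolate_time(tick_a: float, tick_b: float, fraction: float) -> float:
    """Time at a fractional position between two adjacent clock pulses."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return tick_a + (tick_b - tick_a) * fraction


def subdivide(tick_a: float, tick_b: float, steps: int):
    """Split one clock period into equal sub-steps, both ends included."""
    return [interpolate_time(tick_a, tick_b, i / steps) for i in range(steps + 1)]


if __name__ == "__main__":
    # Two pulses 8 ns apart, resolved into four 2 ns steps
    print(subdivide(0.0, 8.0, 4))
```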

In the case of low latency networking with error correction code (ECC) and elliptic curve integrated security, a programmable logic device can also provide additional features.

For example, a programmable logic device can have a time clock module that synchronizes with the PTP official clock generator.

It can also have an onboard trusted platform module (TPM), which provides hardware-based security functions such as key generation and storage.

Additionally, it can have a certificate cache, which stores digital certificates for authentication and encryption.

A programmable logic device can also perform AES output with certificate verification, which encrypts the data packets with AES and attaches a digital signature for integrity checking.

Furthermore, it can have a security layer and an IP translation layer, which provide additional protection and compatibility for the data packets.

These are some of the possible components and functions of a programmable logic device for data throughput optimization.

(c)Rupert S

Friday, June 23, 2023

[M.A.P] [=====] [H.P.C] - Matrix Array Processor Unit (c)RS

Matrix Array Processor Unit (c)RS

*
The M.A.P Processor is ideal as a Tensor Unit, For Small Array Solving; Such as MP3, MP4 & AC4 3D Audio,
The Base Map is simply to Fit a large static conversion M.A.P into the device,
For example a 32Bit Audio Sample Plus 3D Layer for Bluetooth would simply be around 64Bits for Stereo 32Bit Audio MP4; Plus 32Bits for the 3D Map,
The M.A.P Process is not static; But you stick to the maths you wanted.

In parallel instructions, one calls interrupts if bad; IRQ & DMA Notes if you want to have better performance,
But in a processor Internals you have to call the main loops in your App; & OS Task Instruction cache..

Instruct The loop; Don't Interrupt; Stop, Look, Listen! Look, Slowdown, Showtime!

Integer instructions multiple parallel example of The principle of,
M.A.P is based on wide multiple instructions; This suits AVX & SiMD,
Particularly in 16Bit Multi Parallel Instruction Mode

Rupert S

Soft Interrupt IRQ: Faster CPU Cycles: RS

A Soft Interrupt is where you direct the interrupt register to a compiled Code Block..
The code block handles the Wait Queue in a gentle way that allows processing to continue & Ram to be accessed..

While the HDD directly writes the IRQ messages to the Code Block; The Code block is below the size of Cache on the Processor..

In advanced scenarios the Soft Int Caches Read/Write in RAM while Directing DMA & R/W Cached Cycles; Good Bioses & Software do this.

But in a processor Internals you have to call the Main Micro loops (Soft Int) in your App; & OS Task Instruction cache.

RS

Interrupts particularly affect Processor functions such as..
Machine Learning Load & Store of Frames; Also the internet..
In devices such as Network cards, offloading is often required to handle interrupts..

*

VPDM-ST-LRS : Verified Processor Direct Memory Space Transactions Load, Register & Save (c)RS


In Concurrence with DM-TCP & DM-UDP & DM-QUIC Soft Interrupt IRQ

https://www.phoronix.com/news/Linux-Device-Memory-TCP

For SI-IRQ to safely directly write RAM for a SiMD & CPU/TPU; The following protocol is observed:

1 DMA Memory Management Processor, Device Bios/PCI Bus & Network Chipset/Network card..
Shall directly code check incoming traffic; But shall not void ECC Mode error check...

Bear in mind that AES, Common TLS & Packet Compression are in effect!
So you shall be using Networking features directly through the Transparent H.D.L Hardware Device Layer...

In effect the MMU & Network adapter transparently offload directly to Device Topography RAM & Cache!

2 The network card Certifies transactions & offloads security to internal features; Main Certification is still TPM & HMS.

3 You can hand data directly to the Processor if the memory space matches the internet Bit-depth; However this is usually 32Bit as with IPv4; IPv6 addresses are 128Bit (commonly a 64Bit network prefix & a 64Bit interface identifier)..

4 So the MMU & Network chipset work in sync; ECC, Security, TLS, M.S.T: Memory Space Translation...

5 VPDM-ST-LRS : Verified Processor Direct Memory Space Transactions Load, Register & Save (c)RS

So to be clear Automated Load, Register & Save Networking; Yes,
Device Low Level Firmware Translation Transactions; Yes
Processor Direct Memory Space Transactions; No, With Verification? Yes

To stop per Frame IO being a high cost transport processing; We process the entire frame per In/Out,
The same with TCP/UDP/QUIC; We process per whole block of Bits; For example 192Bits (SSL, AES),
Packet containment & control protocols; Mainly because Half packets caused inefficiency!

Rupert S

https://science.n-helix.com/2023/02/pm-qos.html

https://lore.kernel.org/dri-devel/20230710223304.1174642-1-almasrymina@google.com/

https://is.gd/HPC_PTP_Low_Latency_Network

https://www.linuxfoundation.org/press/announcing-ultra-ethernet-consortium-uec

https://ultraethernet.org/

https://jointdevelopment.org/

*

Embedded Hardened Pointer Table Cache for 3D Chips : RS


Based on PCI Edge RAM, Internal Loop Dynamic RAM; With internalised DMA Memory transfers..

In the process the feature has the ability to set a page table; 1MB, 2MB, 4MB, 16MB > 1TB; The RAM can be internally written to without invoking ALU or OS,

Pages are allocated; The GPU is an example; Physical pages are allocated in RAM that is directly Set by OS & Firmware/ROM Parameters...

Internal access to the RAM is set within the page allocation set; But all internal mapping & paging is done directly & through ALU & Memory Management Unit MMU.

With 1MB Cache set aside per feature; Not entirely unreasonable these days...

Most of a process such as SiMD can be carried out on internal loops..

Depending on Cache/RAM Space; Based on PCI Edge RAM

Internal DataSet Size based on Dynamic RAM Variable; That is set per USE &Or Per Settings or application,

That being said; RAM Allocations best be per session & directly after Setting is changed on reboot or refresh, Load & unload cycling.

Rupert S

*

Gather/Scatter Microcode no-overload ALU or Data/Code Cache, Just L3/RAM


When we look at the Instructions of the SiMD; We could see potential in them to further improve the Gather/Scatter Instructions; Although it has to be said that the instructions are well optimised!
Like much pre-Fetching Assembly code from earlier years, they are well created & quick!

But we can do several things with them; So what ?

We can directly fetch the Cache in the code & Link to cache locations using linking (if we have enough & we do at L3/L2)

We can make a Hardlink table in cache(L3) for load and save processing (64Kb, Including header)

We can directly invoke pre-fetch with a system call (With SoftLink Pointer Tables)

We can in-cache modify (if a directive is singular in a chain of a, b, c, d)
We can individually SysCall a direct load of a single {a, b, c, d} statement & not reload it all...

For this we need a matrix table in L3 RAM; We can do this if we keep the table under 512KB,
But we do not intend to be selfish & RAM is fast these days! So we can directly load a single matrix Element {a, b, c, d} & not refresh the loading cycle for the code...

Thus we do not have to overload ALU or Data/Code Cache, Just L3/RAM

Rupert S


*

Temporary HardLinking in Prefetching Matrix instructions,

Gather/Scatter operations of localised random scattering of information to ram & retrieval

Gather
for (i = 0; i < N; ++i)
x[i] = y[idx[i]];

Scatter
for (i = 0; i < N; ++i)
y[idx[i]] = x[i];
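
A minimal runnable version of the two loops above, written in Python for illustration. The sample arrays and the choice to zero-fill untouched scatter slots are assumptions of this sketch.

```python
def gather(y, idx):
    """x[i] = y[idx[i]]: collect scattered elements into a dense list."""
    return [y[i] for i in idx]

def scatter(x, idx, size):
    """y[idx[i]] = x[i]: write dense elements back to scattered positions."""
    y = [0] * size            # untouched positions stay 0 in this sketch
    for value, i in zip(x, idx):
        y[i] = value
    return y

y = [10, 20, 30, 40, 50]
idx = [4, 0, 2]
x = gather(y, idx)
back = scatter(x, idx, len(y))
```

Gather reads from random addresses & writes densely; Scatter reads densely & writes to random addresses, so the cache cost sits on opposite sides.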

Firstly I read about statistical gathering & Seeding; Pre-Fetching is a method of anticipating & preloading data,
So what do I want to do ? In Vector Matrix Prefetch Logical Gather

Potentially I would like to use:

Softlink (ram retrieval & multiple value)
HardLink (maths)
Prefetching logic {such as,

Run length prefetching,
Follow & Forward loading Cache,
Entire instruction load & Timing Pre-fetch & Statistic for Loop time & load frequency
}

So on any potential layout for SiMD Matrix a most likely configuration is:

A B C : FMA
A B = C : Mul or ADD

So a logical statement is, A, B Gather/Seed C; Directly logical AKA Prefetch
A B C D; Logical fields of prefetch are localised to parameter...

Only likely to draw data from a specific subset of points,
Byte Swapping is obviously A1 B1,2,3

Most specifically if the command is a hardlink With A B C; Then most likely Storage is directly linked; Like a HardLink on a HDD in NT,

The hard link is direct value fetching from a specific Var table & most likely a sorted list!
If the list is not sorted; We are probably sorting the list..

If we do not HardLink data in a matrix (Example):

Var = V+n, Table
   a   b   c   d
1 [V1][V1][V1][V1]
2 [V2][V2][V2][V2]
3 [V3][V3][V3][V3]
4 [V4][V4][V4][V4]

A Matrix HardLink is a temporary Table specific logical reading of instructions & direct memory load and save,
Registers {A,B,C,D}=v{1,2,3,4}..

Directly read direct memory table logic & optimise resulting likely storage or retrieval locations & Soft Link (pointer table)

Solutions include multiple Gather/Scatter & 'Gather/Scatter Stride' Cube Block multi load/save..
Logical Cache Storage History Pointer Table, Group Sorted RAM Save/Load by classification {A,B,C,D}=v{1,2,3,4}
When X + Xa + Xb + Xc, When Y + a b c, When Y or X Prefetch Pointer Table + Data { a, b, c }

Example Gather/Scatter logical multiple

var pointer [p1] {a ,b, c, d}
var pointer [p2] {1 ,2, 3, 4}

Gather
for (i = 0; i < N; ++i)
x[i] = y[idx[i]];
fetch y {p1, p2}; {a, b, c, d}:{1 ,2, 3, 4}

Scatter
for (i = 0; i < N; ++i)
y[idx[i]] = x[i];
send x {p1, p2}; {a, b, c, d}:{1 ,2, 3, 4}
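
One possible reading of the `fetch y {p1, p2}` notation above is a single gather pass that pulls the same index list from several pointer tables at once. The sketch below assumes that reading; `gather_multi` and the sample index list are hypothetical.

```python
def gather_multi(tables, idx):
    """Gather the same index list from every table in one pass over idx."""
    out = [[] for _ in tables]
    for i in idx:
        for lane, table in enumerate(tables):
            out[lane].append(table[i])
    return out

p1 = ["a", "b", "c", "d"]     # var pointer [p1]
p2 = [1, 2, 3, 4]             # var pointer [p2]
letters, numbers = gather_multi([p1, p2], [3, 0, 2])
```

Walking the index list once for all tables is what lets the pointer table be read a single time per group.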
 
Rupert S : Reference https://en.wikipedia.org/wiki/Gather/scatter_(vector_addressing)

*

FMA is a Matrix SiMD feature & is common to ARM & AMD, CPU & GPU

Phone SIM cards can use FMA for GSM network acceleration,

We can use FMA (fused MUL ADD) for elliptic curve encryption to multiply Time * curve & ADD AES encryption in the form of a time model & 3D dimensions,

Therefore we can use FMA to calculate the room area & add audio reverberation matrix as volume levels over time..

FMA as a basic GPU..

We can convert adder & fused MUL ADD ML,

Use all 3 types on integer function of CPU & internal GPU on echo dot type devices with internal GPU and CPU.. FPGA design.

Rupert S

*

Pre-Fetching; Statistically Ordered Gather/Scatter & The Scatter/Gather Commands


(SiMD) The gather/scatter commands may seem particularly random?
But we can use this in machine learning:

Gather
The equivalent of Gathering a group of factors or memories into a group & thinking about them in the context of our code! (our thought rules),

Scatter
Now if we think about scatter; we have to limit the radius of our thought to a small area of brain matter (or ram)... Or the process will leave us "Scatter-Brained"

Statistical Pre-Fetching:

Ordered Scatter
When you know approximately where to scatter

Ordered Gather
Where you know approximately where to gather

Free Thought
So now we can associate scatter & gather as a form of free thought? Yes but chaotic...
So we add order to that chaos! We limit the scattering to a single field.

Stride
Stride is the equivalent of following a line in the field; Do we also gather &Or Scatter while we stride ?
Do we simply stride a field?

Now to answer this question we simply have to denote motive!
In seeding we can scatter; Will we do better with an Ordered Scatter ? Yes we could!

Statistically Ordered Gather/Scatter & The Scatter/Gather Commands
Pre-Fetched
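
A hedged sketch of an Ordered Gather: the indices are visited in ascending address order (friendlier to a prefetcher), but the results are returned in the order the caller asked for. `ordered_gather` is a hypothetical name & the sorting strategy is an assumption, not a description of any hardware instruction.

```python
def ordered_gather(y, idx):
    """Gather y[idx[k]] while touching memory in ascending index order."""
    order = sorted(range(len(idx)), key=lambda k: idx[k])   # positions sorted by address
    out = [None] * len(idx)
    for k in order:
        out[k] = y[idx[k]]          # result lands in the caller's original slot
    return out, [idx[k] for k in order]

y = list(range(100, 110))
values, visit_order = ordered_gather(y, [7, 2, 9, 2, 0])
```

The visit order is a single forward sweep over the field, which is the "stride while we gather" idea expressed in code.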

Rupert S

*

Multi-line Packed-Bit Int SiMD Maths : Relevance HDR, WCG, ML Machine Learning (Most advantaged ADDER Maths)


The rules for multiple Maths operations with lower Bit widths packed into SiMD: 256Bit is the example, But 64Bit, 128Bit & 512Bit can also be used

In all methods you use packed bits per save, so single line save or load, Parallel, No ram thrashing.

You cannot flow a 16Bit block into another segment (the next 16Bit block)

You can however use a 9th bit as a separator; Rolling an addition's carry into that extra bit means a more accurate result!
In 32Bit you do 3 * 8Bit & 1 * 4Bit; In this example the 4Bit op has 5Bit results & the 8Bit ops have 9Bit results..
This is preferable!
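
The 32Bit split described here (3 * 8Bit values with 9Bit results plus 1 * 4Bit value with a 5Bit result, 27 + 5 = 32) can be sketched as follows. This is a minimal illustration assuming one spare guard bit above each field; the field layout and the names `pack` & `unpack_sums` are hypothetical.

```python
# (bit offset, value width); every field has one guard bit directly above it
FIELDS = [(0, 4), (5, 8), (14, 8), (23, 8)]

def pack(values):
    """Pack one 4-bit and three 8-bit values into a single 32-bit word."""
    word = 0
    for (offset, width), v in zip(FIELDS, values):
        assert 0 <= v < (1 << width)
        word |= v << offset
    return word

def unpack_sums(word):
    """Read each field together with its guard bit (5-bit and 9-bit results)."""
    return [(word >> offset) & ((1 << (width + 1)) - 1) for offset, width in FIELDS]

a = pack([15, 255, 100, 7])
b = pack([15, 255, 200, 9])
total = (a + b) & 0xFFFFFFFF      # one 32-bit ADD adds all four lanes
sums = unpack_sums(total)
```

Because every carry lands in its own guard bit, one ordinary 32Bit ADD produces all four sums without any lane touching its neighbour.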

2Bit, 3Bit, 4Bit Operation 1 , 8Bit Operations 3: Table

32Bit
4 : 1, 8 : 3

64Bit
4 : 2, 8 : 6
2 : 1, 7 : 8
3 : 1, 8 : 1, 16 : 3

Addition is the only place where 16Bit * 4 = 64Bit works easily; But when you ADD or subtract you can only roll to the boundary of each 16Bit segment & not into the higher or lower segment.
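
When there is no spare separator bit, a well-known software trick keeps carries inside each 16Bit segment: add the low 15 bits of every lane normally, then fold the top bits back in with XOR. The sketch below is an illustration of that trick in plain integers, not a claim about how any SiMD unit is wired; the names are assumptions.

```python
LANES_HIGH = 0x8000_8000_8000_8000   # top bit of each 16-bit lane
MASK64 = 0xFFFF_FFFF_FFFF_FFFF

def add_16x4(a, b):
    """Add four 16-bit lanes packed in 64 bits; each lane wraps, no carry crosses lanes."""
    low = (a & ~LANES_HIGH & MASK64) + (b & ~LANES_HIGH & MASK64)  # low 15 bits per lane
    return (low ^ ((a ^ b) & LANES_HIGH)) & MASK64                 # restore the top bits

def lanes(word):
    """Split a 64-bit word into four 16-bit lanes, lowest lane first."""
    return [(word >> (16 * i)) & 0xFFFF for i in range(4)]

a = 0x0001_FFFF_8000_1234
b = 0x0001_0001_8000_0001
result = lanes(add_16x4(a, b))
```

Lanes that overflow simply wrap to 0 here; the neighbouring lane is never disturbed.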

A: In order to multiply you need adaptable rules to division & multiply
B: you need a dividable Maths unit with And OR & Not gates to segment the registered Mul SiMD Unit..

In the case of + * you need to use single line rule addition (no over flow per pixel)..
& Either Many AND-OR / Not gate layer or Parallel 16Bit blocks..

You can however, painful as it is, Multi Load & Zero the remainder registers, &Or XOR or NOT the remainder to 00000 on higher depth instructions & so remain pure!

8Bit blocks are a bit small and we use HDR & WCG, So mostly pointless!

We can however 8Bit Write a patch of palette & sub divide our colour palette & Light Shadow Curves in anything over 8Bit depth colour,

In the case of Intel 8Bit * 8 Inferencing unit : 16 Bit Colour is probably (WCG 8 * 8) + (HDR 8 * 8) Segments,

In any case Addition is fortunately what we need! so with ADD we can use SiMD & Integer Today.

Rupert S

https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html

https://science.n-helix.com/2021/11/parallel-execution.html

https://science.n-helix.com/2022/10/ml.html

https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html

https://science.n-helix.com/2023/06/map.html

*

M.A.P NPU Matrix Processor Dimensional construct (c)RS


Primary reason for expansion of function data sets: 2D, 3D,< nD

P.D.C is a worker thread parallel 2D or 3D Grid,
Utilising QQ & A, B,C Array maths allows us to collapse or expand dimensions in a flexible way,

The same principles as SVM (S.V.M SiMD Vector matrix) can be used to culminate or expand dimensions...

That way a M.A.P Processor can expand or collapse all mathematical constructs,
We can therefore use all mathematical & statistical arrays for machine Learning & Maths.

RS

*

The Subject of 4x4 tables,

We are obviously looking for more like 16x16 for Physics maths!
The matrix processor is a large data set; Divisible into 4x2 & 4x4 & 8x8 groups for execution speedups,
Aligned Parallel processing....

Aligned Matrix tables need to be larger than 4x4 for Physics & Chemistry; So a matrix processor ideally can at a minimum:

Matrix Table

x1
16x16

16/2
x2
8x8,8x8
8x8,8x8

8/4
x4
4x4,4x4
4x4,4x4
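
The division in the table above can be sketched in a few lines: a 16x16 matrix splits into four 8x8 tiles, and each 8x8 tile splits into four 4x4 tiles. `split_tiles` is a hypothetical helper and the matrix contents are sample data.

```python
def split_tiles(matrix, tile):
    """Split a square matrix (list of rows) into tile x tile blocks, row-major."""
    n = len(matrix)
    assert n % tile == 0
    return [
        [row[c:c + tile] for row in matrix[r:r + tile]]
        for r in range(0, n, tile)
        for c in range(0, n, tile)
    ]

m16 = [[r * 16 + c for c in range(16)] for r in range(16)]
tiles8 = split_tiles(m16, 8)         # four 8x8 tiles
tiles4 = split_tiles(tiles8[0], 4)   # the first 8x8 tile as four 4x4 tiles
```

Each tile is independent, so the groups can be issued to aligned parallel units in any order.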

RS

*

Matrix Method (c)RS


Any GPU & CPU SiMD can do a form of Matrix maths in an Array Parallel Load & Run as consecutive tasks..

Like So

Matrix Formulas : (c)RS

SiMD Array A to X, Usually 8, 16, 32, 64 Parallel Groups

Grouped Parallel Runs
A 1, 2, 3, N
B 1, 2, 3, N
to
Y 1, 2, 3, N
X 1, 2, 3, N
Run 1 {A1, B1 to X1, Y1} Run 2+ {A2, B2 to X2, Y2}++ {An, Bn to Xn, Yn}

Matrix Processor Method Synchronous Cube Map Usually 8x8, 16x16, 32x32, 64x64 Parallel Quad++ Groups

2D:3D Cube

A 1, 2, 3, N
B 1, 2, 3, N
C 1, 2, 3, N
D 1, 2, 3, N

Run 1 2D:3D Cube {
A 1, 2, 3, N
B 1, 2, 3, N
C 1, 2, 3, N
D 1, 2, 3, N
};

Run N 2D:3D Cube {
A 1, 2, 3, N
B 1, 2, 3, N
C 1, 2, 3, N
D 1, 2, 3, N
}

Rupert Summerskill

*

SiMD Matrix maths begins with a 3D graph,


a
|___c
 \
   b

The graph's principle is 3 dimensions; We can use more dimensions, but on paper we need to represent them in colours, so that all dimensions beyond the 3 that we can draw are represented.

In algebra we represent 3+ dimensions with small glyphs next to each letter that represents our maths operation theoretical number.

During operation of computation we maintain in memory the specific dimensions interactions and interplay of complex matrix maths.

Rupert S

Numbers example 4D matrix

I love you 2, I love you 3, I love you 4 the ends of time... To be continued...

JN

*

Directed Matrix Principle : RS


Matrix Principle directed at traditional parallel Integer & SiMD Instruction groups

The main problem with 32KB L1 tables is cache filling & domination of CPU/GPU by single program instruction groups..

Instruction cache is the primary challenge; Because Instruction cache L1 is commonly 32KB; Data cache 64KB,
L2 is 512KB to 4MB; L3 4MB to 16MB (can be more on Epyc)..

Optimised instruction groups by instruction, SiMD multiprocessing thread count:

Firstly requirements: (32KB instruction Cache L1, 512KB L2, 8MB L3)

L1 Instruction Group 32KB
L2 running group 512KB
L3 RAM & storage direct fetching 8MB

8KB core table for group threading,
24KB of grouped & Synchronised instructions

Data work Groups 512KB L2 / 64 Instruction Group sets (L1 32KB Table),
So Main instruction groups from L1 with larger data sets.

L3 4MB to 8MB of data & instruction caching load (directed from L1 & funneled into L2)

Instructions are cross threaded directly through L3 & L2 synchronised Load, Run & Save,

Optimised instruction groups by instruction, SiMD multiprocessing thread count.

Rupert S

*

Parallel Arrays : Matrix forms : RS


Matrix processor is a feature that will be more common & is relatively similar to an Abacus with a multiple array of + & * Operators..

Now a Matrix Array is X1 > Xn & Y1 > Yn

Commonly an array of 16 x 16 but can be 8 x 8 or 4 x 4,

Now we can perform such operations as Relativity & String theory on a lattice & that is very fast!

We can also perform these functions on SiMD, AVX in parallel; Such that 256Bit SiMD is 32Bit x 8 Parallel & so forth

Parallel
a : 64Bit
b : 64Bit
c : 64Bit
d : 64Bit

Matrix
a1a2a3a4
b1b2b3b4
c1c2c3c4
d1d2d3d4

Now we can see that we can perform a matrix operation such as lattice with both SiMD & SiMD-Matrix,

We can also see that a Matrix shall & can present our solution & that SiMD can also!
But we need Long operation SiMD or many passes to complete our operations; If Larger than our size..
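
The lane & pass arithmetic behind this can be written out as a small worked sketch; the function names are assumptions, and the 16 x 16 example simply reuses the array size mentioned above.

```python
import math

def lanes(register_bits, element_bits):
    """How many elements fit side by side in one SIMD register."""
    return register_bits // element_bits

def passes(total_elements, register_bits, element_bits):
    """How many register-wide passes are needed to touch every element once."""
    return math.ceil(total_elements / lanes(register_bits, element_bits))

n_lanes = lanes(256, 32)                 # 256Bit SiMD = 32Bit x 8 Parallel
n_passes = passes(16 * 16, 256, 32)      # a 16 x 16 array of 32Bit values
```

So a 16 x 16 array of 32Bit values needs 32 passes on 256Bit SiMD & half that on 512Bit.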

We can also therefore most likely..

Use the AES-NI S-Box & SVE & Matrix & SiMD to our advantage for many Lattice operations.

Multiplier Matrix Accelerated Encryption, Like i said A Parallel SiMD array may do the same; If all memory arrays are connected by a single RAM/Cache ALU Node,

As stated Parallel Arrays & Parallel Matrix Arrays.

Rupert Summerskill

https://science.n-helix.com/2023/06/map.html

https://science.n-helix.com/2022/03/ice-ssrtp.html

Bluetooth LE Protocol
https://drive.google.com/file/d/17csRnAfdceZiTSnQZvhaLqLSwL__zsIG/view?usp=sharing

*

Examples of Parallel execution pipeline : Parallel arrays:


Crypto lattice, Kyber/ML-KEM, AES : Parallelised Lattices, 8x & 16x Parallel SiMD F16/32/64/128/192/256Bit

parameterisation of groups of 4x Parallel SiMD F16 & 8x Parallel SiMD F16

Parallelised motion & Video/Audio Deblocking/Blocking

8x8 16x16 quantisation of video is common in VVC & H265 & H264 & JPEG & MP3, MP4a & AAC,
Suggested parameterisation of 4x Parallel SiMD F16

8x8 16x16 quantisation of video is common in HDR VVC & H265 & H264 & JPEG & MP3, MP4a & AAC & AC3 & AC4,
Suggested parameterisation of 4x Parallel SiMD F32

Shapes in motion 2D : 4x per Cube in motion,
Shapes in motion 2D : 6x per Texture Shaded Cube in motion,

Shapes in motion 3D : 6x per Cube in motion,
Shapes in motion 3D : 8x per Texture Shaded Cube in motion,

RS

*

Number relativity, Bit precision: RS


In gaming a player has access to palette of 16bit FFFFFFFFFFFFFFFFFFFF.FFFFFFFF BF16 F=16 HEX; In 32bit memory storage.

Average gamers recognise maybe 32000 colours directly,

Colour rich artist colourist's recognise almost 6000000 colours  TOPCloud.

Variety is king & queen of experience,
Artists specialist recognises more colours than a basic gamer or graphics artist in vectors..

Matrix maths operations precision is relative to hardware,
XBox 4bit FFFF, PLAYSTATION 8Bit FFFFFFFF

RollINT precision 1 to 4 bit + integer -1 to 4 bit F, FFFF, FFF+.F Xbox Or FFFFFFF+.F Ps

Bit precision is relative to your experience!

Rupert S

*

RollINT - Machine Learning for Console & Computer : RS

With True Value memory/Operation cache...

Application of RollINT to machine learning with definition,
A Playstation APU has 8Bit Integers for inference; XBox 4Bit..

In order to describe 4Bit as float; You would need to define 3Bit & 1Bit R remainder,
So how does this work?

In loading value the first 3Bit is the value & the 4th bit is remainder & when you load the value stored..

You fetch 3Bit as the value & 1 Bit as the remainder; Example:

FFFe > Value FFF &R e, So the value is FFF.e not FFFe
you can do multiple data type operations in this method; For example:

FFde = FF & de or FF.de or you could do Ffde & mean F.fde; Useful for definitions of Pi,

For example Pi in 4Bit (8Bits Preferred); Commonly used by kids at school!,

However you convert the stored 4Bit Pi to a fully accurate value on FPU & SiMD execution by loading pre-stored true value.
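
The value & remainder split can be sketched as below. This assumes the examples are read as hex digits (FFFe = value FFF, remainder e) and that the remainder is a base-16 fraction; `split_roll` and `as_real` are hypothetical names.

```python
def split_roll(word, remainder_digits=1):
    """Split a hex word into (value, remainder): 0xFFFE -> (0xFFF, 0xE)."""
    shift = 4 * remainder_digits
    return word >> shift, word & ((1 << shift) - 1)

def as_real(word, remainder_digits=1):
    """Interpret the remainder digits as a base-16 fraction: FFF.e"""
    value, rem = split_roll(word, remainder_digits)
    return value + rem / (16 ** remainder_digits)

v, r = split_roll(0xFFFE)      # value FFF, remainder e
x = as_real(0xFFFE)            # FFF.e
y = as_real(0xFFDE, 2)         # FF.de
```

The stored word stays a plain integer; only the load step decides where the point sits.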

RollINT

We are using roll to roll a zero on or off an integer,

Therefore we are able to divide and multiply and add so that..

101-0 > 10.1+0; The number of rolled zeros can practically range from 0 to 00000000.

So 10023-000 > 10.023+000

We can then store floating point numbers in integers.
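
A minimal sketch of the roll idea in decimal, assuming a number is held as an integer plus a count of rolled decimal places (10.023 becomes 10023 with roll 3). The helper names are hypothetical & only non-negative values are handled.

```python
def roll_encode(text):
    """'10.023' -> (10023, 3): an integer plus the number of rolled decimal places."""
    whole, _, frac = text.partition(".")
    return int(whole + frac), len(frac)

def roll_add(a, b):
    """Add two (integer, roll) pairs using integer arithmetic only."""
    (ia, ra), (ib, rb) = a, b
    r = max(ra, rb)
    return ia * 10 ** (r - ra) + ib * 10 ** (r - rb), r

def roll_decode(pair):
    """(20123, 3) -> '20.123'"""
    i, r = pair
    s = str(i).rjust(r + 1, "0")
    return s[:len(s) - r] + ("." + s[len(s) - r:] if r else "")

total = roll_add(roll_encode("10.023"), roll_encode("10.1"))
text = roll_decode(total)
```

No float unit is touched until (and unless) the final value is decoded for display.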

(C) Rupert S,

Reference Int & FP Value Sizes; A reminder that Floats are 50% of highest Integer Value,
ROLLInt floats still have an amazing additional value!

https://learn.microsoft.com/en-us/dotnet/standard/numerics

*

RollINT : The Float Perfectionist


Playstation & XBox are primary examples where the Int8 unit could do a RollINT Floating point operation for machine learning that is specific to float FPU Solves,

Edge detection, Sharpening & Adaptive Contrast & Colour HDR..

Depending if you directly roll on SiMD & FPU then you can still sharpen with the bF16 & half precision FPU/SiMD Maths operations on the final run!

Imagine Luke SkyWalkers final Torpedo Salvo as FPU/SiMD Vectors DT

RS

*

Scaler is an argument for the role of RollINT & also a pointer to method


RollINT : A Float view of machine learning,
Essentially the core issue is the role float may play in a result...

Does the human mind not use a common integer format with a small float remainder?
Potential for this configuration is mainly because Integer values are in the main Substantive information..

Float value (the sub decimal place below 0.); Is in essence a precise small value of high importance to skills such as jumping, Running, Motions & skill actions like shooting..

Integer is the majority of action related to large steps; Particularly because people have the capacity to change from Meter to Centimetre to Millimetre,

Justifications for Float values diminish if you have scalar units such as the meter, the Yard, foot, Inch & 16th!

However; As may be pointed out, Roll Scalar? Is a form of floating unit expression; If Scalar measurements are regarded in terms of statics; Then Yes Integer:{Meter}; FPU:{cm, mm} is a float value!

Nonetheless Scaler is an argument for the role of RollINT & also a pointer to method..

Scaling you see; is everything to detail; If you want to see this? Magnify or Zoom & Wide angle!
We further scale; By hitboxing our ML; In other words by training the AI on Centric value rewards..

AI Content:

{Content value reward targets};
{Centric Core values};

Return = Value;
end = infinite
Test Loop {AI C, End}; Begin

Epochs = {Satisfied End}

Rupert S

*

Float & Integer : RollINT : In Depth Analytics

RollINT List

Floats with small precision values : RollINT

Dreams have 'Small Randoms', Minor details make a true reality

(OS & Chrome Example)
The size of frames & text alignment
Main colour groups for desktop & browser colours : FFFFFF.FF
Frames forward & backward with submenus are worthy of low precision floats : FFF.F 300 Frames 16 sub allocated positions inside frame:{SubFrame}

Both low & high precision

High Efficiency ZLib, GZip Ram compression
Localised Error correction

Colour depth & contrast HDR, Low error rate/Higher

RS

*

RollINT Versus Metric principle of float reduction : RS

Scale correctly & avoid that FPU being needed

Scale correctly first; Example mouse is Millimetre & Micrometre & Large scale Centimetre,
Photon Microscope is Picometre, Millimetre, Centimetre,
Telescope is Kilometre, Metre, Millimetre..
Screens UpScale & Zoom, Do we need to rescale our measurement ?

https://learn.microsoft.com/en-us/dotnet/standard/numerics

X+- , Y+- 2D+- central point measurements
Type : Bytes : Minimum : Maximum
Int16 : 2 : -32,768 : 32,767
Int32 : 4 : -2,147,483,648 : 2,147,483,647
Int64 : 8 : -9,223,372,036,854,775,808 : 9,223,372,036,854,775,807 (might want to use floats; A lot quicker)

Precision Floats
Type : Bytes : Approximate range
16Bit Half : 2 : ±65,504
32Bit Single : 4 : ±3.4 × 10^38
64Bit Double : 8 : ±1.7 × 10^308
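
The ranges in the table can be re-derived rather than memorised. The short check below uses only the Python standard library; `int_range` is a hypothetical helper name.

```python
import struct
import sys

def int_range(bits):
    """Minimum and maximum of a two's-complement signed integer of the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

int16 = int_range(16)
int32 = int_range(32)
int64 = int_range(64)
# 0x7BFF is the largest finite 16-bit half float; "<e" decodes a little-endian half
half_max = struct.unpack("<e", bytes.fromhex("ff7b"))[0]
double_max = sys.float_info.max
```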

The main attack Vector being mice & touchscreens & utility scopes & measuring devices...
We wanted DPI without stress!

A range of options exist when using RollINT; The idea is to Roll a float on operation; To be fair hardware like the Amiga has the concept of Integer operation with a float as the final result..

However that option Is "the Final result" & does not mean that you could use RollINT to make a repeated Float maths for applications..

However RollINT could be used 2 Significant ways:

You could use FPU on the result (Previous integer operations save FPU for other tasks)
You could receive an Integer result from the float operation (Final float value on multiple operations not important to you?)

Perform Metrification & therefore avoid float value use; for example expand the data into a higher precision mode,

The principle of the Metric system is to use sub parts to reduce the necessity of floats : Metre, Centimetre, Millimetre, KG, Gram, Ounce..

So avoiding a floating unit..

The method is multiple operations, Large, Small, Smaller & can in reality be repeated down to picometer or tiny weights...

This method is multiple operation rounds,
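
The Large, Small, Smaller rounds can be sketched with integer division only. The unit table & the choice of micrometres as the base unit are assumptions of this example; `metrify` is a hypothetical name.

```python
# unit sizes expressed in micrometres, largest first
UNITS = [("m", 1_000_000), ("cm", 10_000), ("mm", 1_000), ("um", 1)]

def metrify(micrometres):
    """Break an integer length into whole metres, centimetres, millimetres and micrometres."""
    parts = {}
    for name, size in UNITS:
        parts[name], micrometres = divmod(micrometres, size)   # one round per unit
    return parts

parts = metrify(1_234_567)
```

Each round is one integer divide with remainder, so 1.234567 m never has to exist as a float.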

RollINT & FPU Avoid rounds of CPU Cycles; But options exist.

RS

*

As you know the Matrix Array Processor is now frequent with Intel, Mac M1 & M2, AMD & NVidia Versions..

Quantum computers rely on Multi-Directional & Multi-Dimensional Arrays per Qbit!

Well this is a design structure for a Multi-Array Multi-Connection Matrix Array Processor..

The principle is basically quite logical!

Multi-Array Multi-Connection Matrix Array Co-Processor - Quanta Light Compute 2023-06-23

Percentage based 3D Processing to handle all 3D Array processing,

Central [H.P.C] Tasks map to probability over Networks [=====] & [M.A.P] Units in arrays

Table define

{

[M.A.P] = M.A.P , M.A.P 8 Way interconnect,
[H.P.C] = M.A.P High Precision Central Core,
[=====] = Bus Connections & networking

}

Top View

[M.A.P][M.A.P][M.A.P]
[M.A.P][H.P.C][M.A.P]
[M.A.P][M.A.P][M.A.P]

Side View 3D

[M.A.P][H.P.C][M.A.P]
[M.A.P][=====][M.A.P]
[=====][H.P.C][=====]
[M.A.P][=====][M.A.P]
[=====][H.P.C][=====]
[M.A.P][=====][M.A.P]
[=====][H.P.C][=====]

Each [H.P.C] Central Contains RAM & connections to the 8 [M.A.P] & Optionally to layers above & below in 3D Matrix,
Bottom of wafer contains high resolution bus to onboard controllers & networks & DPU/GPU/CPU's

Array = Matrix Array Processor Unit (c)RS

ffffffff ffffffff ffffffff
........+ ........*+ ........*
........+ ........*+ ........*
........+ ........*+ ........*

f=fp,unit
*=mul
+=add
.=Cache/Ram

Simple absolver table for MUL:ADD : MUL* Only = +0, +- Only = N*1 then +-
% = / 100 + ADD Table {N1 <> N...} : Result!
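
One way to read the absolver table is that a single multiply-add unit covers every case: MUL only adds 0, ADD or subtract only multiplies by 1 first, and % divides the product by 100. The sketch below assumes that reading; all function names are hypothetical.

```python
def fma(a, b, c):
    """The only operation assumed to exist in hardware here: a * b + c."""
    return a * b + c

def mul_only(a, b):
    return fma(a, b, 0)        # MUL* only = +0

def add_only(n, c):
    return fma(n, 1, c)        # + only = N*1 then +

def sub_only(n, c):
    return fma(n, 1, -c)       # - only = N*1 then -

def percent(n, p):
    return fma(n, p, 0) / 100  # % = / 100

results = [mul_only(6, 7), add_only(6, 7), sub_only(6, 7), percent(200, 15)]
```

Note that a true hardware FMA also rounds only once; that detail is not modelled by plain Python arithmetic.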

(c)Rupert S

SiMD:CMA (c)RS


Standard SiMD Features, Byte Swap, ADD,MUL[SSimd]
8 x Cache,Mul,ADD: [8xCMA]

[SSimd]
[8xCMA][8xCMA][8xCMA][8xCMA]

[SSimd] is additional features accessed by register poke, Standard Operation is CMA & RAM
[8xCMA] is used as RAM in most SiMD Operations & MUL+ADD, ADD, MUL

In SiMD Ops
On RAM upto 3x F16 can be stored (3xF16, F32 + F16, F48, F24x2)

MUL or ADD Operations can be {F16:F16:F16, F32 *+- F16, F24 *+- F24}
Operations are saved to Master Cache & sent to RAM or other functions & can be {F16, F24, F32, F48},
Because master cache is a full buffer; you have to save it first! before reuse!

Design uses the M.A.P basic MUL+ADD & RAM

(c)Rupert S

References: DOT4, INT8, INT16, F16, F32, F64 (c)Rupert S
https://science.n-helix.com/2023/02/pm-qos.html

https://science.n-helix.com/2023/07/3dchiplet.html

Nx-DeepMatrix Engines
https://www.nextplatform.com/2023/08/02/unleashing-an-open-source-torrent-on-cpus-and-ai-engines/
https://idstch.com/geopolitics/next-generation-neuromorphic-chips-bringing-deep-learning-from-cloud-to-iot-edge-devices-and-mobiles/
https://www.backblaze.com/blog/ai-101-gpu-vs-tpu-vs-npu/

Experimental CPU Proof : A proposal for an Open RISC V Processor, Statistical diagrams of function & graphs with function use under load...
https://www.researchgate.net/publication/373403576_Design_of_a_High_Performance_Vector_Processor_Based_on_RISIC-V_Architecture

ML Batch Matrix MAP in FPGA
https://drive.google.com/file/d/1hdxeK1r8LIhvpn7poOm3MfXmGr9Tq-ni/view?usp=sharing

ML Compressed Dynamic16bit-8Bit - Hardware-friendly compression and hardware acceleration for ML Transformer
https://aimspress.com/article/doi/10.3934/era.2022192

Matrix Processors - Memory & command - All-Digital Compute-In-Memory FPGA Architecture for Deep Learning Acceleration
https://dl.acm.org/doi/pdf/10.1145/3640469

Matrix Processors - Inline Ram & Command { CMD : RAM }:{NET}
https://www.xilinx.com/content/dam/xilinx/support/documents/white_papers/wp506-ai-engine.pdf
https://www.xilinx.com/content/dam/xilinx/support/documents/white_papers/EW2020-Deep-Learning-Inference-AICore.pdf

***

Cooperative Matrix Math : RS


Cooperative Matrix is a Math type where you formulate a Grid of numbers & math notations & solve them in sync,

The consequence for you is that the maths is both Faster; More Complex But also easier to correct for errors...

Usually Matrix Maths is used for Algebra, Image & 3D Mapping ML; Such as to see, Maps & Dungeons, Water tables, Technology Development.

Matrix

Var = V+n, Table
   a   b   c   d
1 [V1][V1][V1][V1]
2 [V2][V2][V2][V2]
3 [V3][V3][V3][V3]
4 [V4][V4][V4][V4]

There are 3 main ways for matrix maths:

V1a {/,*,+,-},Value, %, Fraction V1b, V2a, V2b : In effect a dither map or calculation; So connected.
Vector groups {V1a<>z} Maths to {V2a<>z} to {V3a<>z} to {V4a<>z} & more ..

Sorted by Type of operation example
M = Multi Complex Operations In Groups
   a    b    c    d
1 [V1]+[V1]+[V1]+[V1]
2 [V2]*[V2]*[V2]*[V2]
3 [V3]/[V3]/[V3]/[V3]
4 [V4]M[V4]M[V4]M[V4]

Refer to : Var = V+n, Table
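
The "sorted by type of operation" table can be sketched as one operator per row, folded across that row's columns a to d. Row 4 (M, multi complex operations) is left out because the text does not define it; `ROW_OPS`, `solve_rows` and the sample values are assumptions.

```python
from functools import reduce
import operator

# row number -> the operator applied across that row's columns
ROW_OPS = {1: operator.add, 2: operator.mul, 3: operator.truediv}

def solve_rows(table):
    """table maps row number -> list of values; each row is folded with its own operator."""
    return {row: reduce(ROW_OPS[row], values) for row, values in table.items()}

table = {1: [1, 2, 3, 4], 2: [1, 2, 3, 4], 3: [64, 2, 4, 2]}
results = solve_rows(table)
```

Because every row uses a single operator type, each row maps cleanly onto one group of identical SiMD lanes.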


Matrix Accumulator Header Matrix : {MAHM}
SiMD Wave : 32, 64 Group with finalised result + ALU : Work Group Wave Matrix : {WGWM}
Wave Matrix Accumulator Cube : {WMAC}

{MAHM}
{WMAC},{WMAC}
{WMAC},{WMAC}

{MAHM}
{WGWM},{WGWM}
{WGWM},{WGWM}

{MAHM}
{WGWM},{WGWM}
{WMAC},{WMAC}

CTP-HTM : CPU, TPU, Processor Hypervisor Thread Management : RS

Parallel Group Threads:

Work groups are aligned by:

Work Group Size (aligned by Bit):

Memory Range {Half Float, b16Bit,b32Bit, 16Bit,32Bit , Double Float}
Aligned Cluster Size,
Bit-depth & Length of code

The logic is that Parallel Group Threads with the same Code complexity & Size should finish around the same time,
They also typically require the same processor priority so that system tasks have Runtime Availability.
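
A small sketch of the grouping logic, assuming each task is described by an element Bit width & a code length; tasks that share both are expected to finish around the same time. The task tuples and `group_tasks` are hypothetical.

```python
from collections import defaultdict

def group_tasks(tasks):
    """Group tasks by (element bits, code length) so each group holds similar-cost work."""
    groups = defaultdict(list)
    for name, element_bits, code_length in tasks:
        groups[(element_bits, code_length)].append(name)
    return dict(groups)

tasks = [("blur", 16, 128), ("fft", 32, 512), ("sharpen", 16, 128)]
groups = group_tasks(tasks)
```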

RS

Guide to Cooperative Matrix Math : RS

Base principle of the Matrix & Graph goes beyond Accumulation of numbers..
I am reminded by Microsoft's dev post of Excel & Spreadsheet applications..

Yes they Graph/Matrix; But math solves require it! For example the Acidity/Alkaline matrix with Protons & Electrons,

However a more sophisticated form is algebra; But you have to simplify the Algebra & put that in a table..
Einstein, Schrödinger, Physics, Chemistry & DNA By connection...

Algebra is the main reason we would use Float : {bF16 <> bF32} {Single Precision <> Double Precision} SiMD,
The chief objective is the solve; Complex SiMD offer the answer of flexibility..
MUL:DIV ADD

Simple absolver table for MUL:ADD : MUL* Only = +0, +- Only = N*1 then +-
% = / 100 + ADD Table {N1 <> N...} : Result!
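
A minimal sketch of the absolver table above, assuming a plain `a*b+c` helper; The name `fma` and the three wrappers are hypothetical, & Python rounds the multiply and the add separately, so this only shows the shape of the idea, not a hardware fused operation:

```python
def fma(a, b, c):
    """Multiply-add a*b+c; a stand-in for a hardware FMA (not fused here)."""
    return a * b + c

def mul_only(a, b):
    # MUL only: feed +0 as the addend
    return fma(a, b, 0.0)

def add_only(n, c):
    # ADD or SUB only: multiply by 1, then add (pass a negative c to subtract)
    return fma(n, 1.0, c)

def percent(n, p):
    # % : scale by p/100 through the same multiply-add unit
    return fma(n, p / 100.0, 0.0)

print(mul_only(3.0, 4.0), add_only(5.0, -2.0), percent(200.0, 15.0))
```

Every operation goes through the same unit, so a table of mixed MUL & ADD rows can be issued as one uniform stream.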

(c)Rupert S

Graph Accumulator Multiply ADD - Cooperative Matrix


SDK Sample : https://github.com/ROCmSoftwarePlatform/rocWMMA

https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix-cores-readme/

https://paperswithcode.com/paper/a-survey-on-deep-learning-hardware/review/

AMD 23.Q3Pro_HIP #HPC #DirectML MatrixMathOps 'Release unto me the great! Chobokniki' Thine Prayers Answered https://is.gd/AMD23Q3PRO_HIP
Run the .reg after install; Before reboot https://is.gd/AMDRebarReg

*

Inference & FMA De-Block Styles


For upscaling matrix: MMX+ & SiMD
16x16 Block as used just about in HD,
8x8 Blocks Certainly NTSC, PAL, JP_NTSC!,
Very usable for deblocking JPG,
16x16 & 8x8 is very good for Inferencing active on Scaling & Deblocking..

4x4 for main Inference XBox & 8x8 for PS5..
XBox can use (4x4)x4 for 8x8 & (4x4)x16 for 16x16; Very powerful!
PS5 can use (8x8)x1 or x2 for 8x8 & (8x8)x4 (x8 for additional processing) for 16x16; Very powerful!
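
The block counts above can be checked with a small sketch; Assuming a block is a plain list of rows, & with `split_into_tiles` as a hypothetical helper name:

```python
def split_into_tiles(block, tile):
    """Split a square block (list of rows) into tile x tile sub-blocks, row-major."""
    n = len(block)
    assert n % tile == 0, "block size must be a multiple of the tile size"
    tiles = []
    for r in range(0, n, tile):
        for c in range(0, n, tile):
            tiles.append([row[c:c + tile] for row in block[r:r + tile]])
    return tiles

block16 = [[r * 16 + c for c in range(16)] for r in range(16)]
block8 = [row[:8] for row in block16[:8]]

tiles4_of_16 = split_into_tiles(block16, 4)  # (4x4)x16 covers 16x16
tiles8_of_16 = split_into_tiles(block16, 8)  # (8x8)x4 covers 16x16
tiles4_of_8 = split_into_tiles(block8, 4)    # (4x4)x4 covers 8x8
print(len(tiles4_of_16), len(tiles8_of_16), len(tiles4_of_8))
```

Each tile is independent, so the tiles can be handed to separate inference or deblocking units.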

The table solves common issues with 4Bit & 8Bit direct loading of colour tables of the F16 Types..
16Bit is a bit more common in older hardware & luckily quite a lot more flexible!
But 8Bit & 4Bit inferencing have a number of uses...

Indirect load through F16 Register can work by sideloading the operation; With Inferencing Sub routine coding & Returns,
Processing the actual inference but losing data store & returns just information..

Sub Routine INT8 & INT4 can:
Directly manipulate a small palette; Scoped Palette,
Single channel colour or multiple operations..
Load, Store & Save

Inference & FMA De-Block Styles List

(4x4)x4
(4x4)x8
(4x4)x16 + processing
(4x4)x32 +++ processing

(8x8)x4
(8x8)x8 + processing
(8x8)x16 + processing

(16x16)x1 + processing
(16x16)x2 ++ processing
(16x16)x4 +++ processing

8:4Bit Concepts: 65535/255=8Bit 65535/16=4Bit

16bit/4bit : 4Bit colour palette, But we can fraction 16Bit/4bit in essence 16/4! 65535/16; Compression Shapes & Gradients.
Polygon, Shadow, Contact
Alpha Channel 2Bit, 4Bit
Grayscale edge define sharpening
Single Colour Edge detect
Shape Fill in Alpha 10,10,10,2
Xor, Pattern, Shading, Shader, Cull, Shape & Depth Compare after define
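
One possible reading of the 8:4Bit fractioning, as a sketch; The function names are hypothetical, & the 4Bit path keeps the top 4 bits (16 levels) as an assumption about what 65535/16 is meant to achieve:

```python
def to_8bit(v16):
    # 65535 / 255 = 257, so integer division by 257 maps 0..65535 onto 0..255
    return v16 // 257

def to_4bit(v16):
    # keep the top 4 of 16 bits: 16 levels (0..15)
    return v16 >> 12

def from_8bit(v8):
    return v8 * 257    # 255 * 257 = 65535

def from_4bit(v4):
    return v4 * 4369   # 15 * 4369 = 65535

print(to_8bit(65535), to_4bit(65535), from_8bit(255), from_4bit(15))
```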

For when {U, X, Y, Z} = N Expressions https://is.gd/ForWhen_UXYZ_N
For when {(A+B/2)} = C Expressions https://is.gd/ForWhen_ABx2_C

(c)RS

*

An example use of FMA Cooperative Matrix


In the example we use a formula like (U/X²)+(U/Y²)+(U/Z²)
Firstly the x²,y²,z² are MUL, So we need a * table or maybe with FMA we can use a (MUL)+0 ?
My primary observation is that we can use 2 methods:

MUL (U/X²), (U/Y²), (U/Z²) in tables, I suggest 3 * or FMA (MUL)+0
Or we can perform tables in order but complete all the MUL operations in Sync & then ADD with FMA,
Sync : (U/X²)+(U/Y²)+(U/Z²) to (Un/X²)+(Un/Y²)+(Un/Z²)

F1 = First Operation F2 = Second operation R = Result {R1:R3 = R4}

F1
R1=(U/X²) R2=(U/Y²) R3=(U/Z²)
F2
R1 + R2 + R3 = R4

So we have an example where MUL & then ADD is usable; But we could use Synced FMA
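
Both routes can be sketched side by side; A rough illustration only, assuming scalar inputs & a hypothetical `fma` helper that is a plain `a*b+c` (not fused in Python):

```python
def fma(a, b, c):
    """Stand-in for a hardware fused multiply-add."""
    return a * b + c

def two_phase(u, x, y, z):
    # F1: the three terms are independent of each other, so they can run in sync
    r1, r2, r3 = (u / (v * v) for v in (x, y, z))
    # F2: ADD the partial results, R1 + R2 + R3 = R4
    return r1 + r2 + r3

def chained_fma(u, x, y, z):
    # reciprocal squares first; each step is then U * inverse + accumulator
    inverses = [1.0 / (v * v) for v in (x, y, z)]
    acc = 0.0
    for inv in inverses:
        acc = fma(u, inv, acc)
    return acc

print(two_phase(8.0, 1.0, 2.0, 4.0), chained_fma(8.0, 1.0, 2.0, 4.0))
```

The first form suits a table of MUL rows followed by one ADD row; The second keeps everything on the multiply-add unit.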

For when {U, X, Y, Z} = N Expressions https://is.gd/ForWhen_UXYZ_N

RS 

Brilliant examples of matrix maths
https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-finite-difference-docs-laplacian_part1/

VXEdDSA & XEdDSA & X25519 & X448
https://signal.org/docs/specifications/xeddsa/

SiMD-Matrix Maths example - Wave retrieval from quad-polarized Chinese Gaofen-3 SAR image using an improved tilt modulation transfer function
https://www.tandfonline.com/doi/full/10.1080/10095020.2023.2239849?src=
https://drive.google.com/file/d/1uN047PvBJhFkcdNJKqx6cBZ9vnAxcjPj/view?usp=drive_link

SiMD-Matrix Maths example D-Waves
https://drive.google.com/file/d/15iPy-Z24GsbcUdEycOfS1819Fdf0sWoE/view?usp=drive_link

*****

High speed Per operation Cycle operations of D R² Pi


An A[diameter]*B²[Pi] (D * R²) operation is 2 Cycles; this specialised Arc, Sin, Tan operation can be accomplished a couple of ways in a single cycle,

Options table : D R² Pi

Firstly by sideways memory load in lower Single Precision to double precision output in a SiMD

You need to pre cache R²
You can use the same value for R or for D &or both
You can pre cache all static D &or R, So you can vary either D or R & single cycle
You need to perform 2 operations , Diameter & R² & obviously they are relational!

For examples:

R = Atom Zinc (standard size!) Cache D R
You move a compass but the needle is the same size! Cache D
You draw faces but the width is the same, Cache D
You draw faces but the Shape is the same but size is not! Cache R

Rupert S

**********

How you use FMA, Basic MUL+ADD examples first & then Mul & ADD


Firstly in video,
MUL a float set A * B + C
Video Upscaling basic A:Pixel * B:PixelDiffRightPixel + C:RightPixel,
Do that 16 Times per pixel pair and you have 16*Interpolate, So a 16* Data set Wave!
You could obviously use a 32* Wave SiMD & do 4x8; So 4 Pixel groups per Wave.

So for example you can ADD Log Gamma or other simple values, In A * B + C,
Pixel Values or whatever, You can use Point float 0.001 for example to do division on floats.

For all personal maths that you imagine:
Simple absolver table for MUL:ADD : MUL* Only = +0, +- Only = N*1 then +-
% = / 100 + ADD Table {N1 <> N...} : Result!

Interpolation & smoothing :

The method i am thinking of is ADD Mul/Div : Edge Left A+B Edge Right = C Center, (A to C)<>(C to A)

(A+B)/2 = C

Factor A_to_C
16 Steps

Factor C_to_B
16 Steps

*alternatives*

((A-C)/16)=F | (F* A over C)=F Step * 16 over Time or distance

(Call slope)
find 16 Fractions of A To C
find 16 Fractions of C to B

For when {(A+B/2)} = C Expressions https://is.gd/ForWhen_ABx2_C

RS

Pixel A to B, Interpolation upscaling


from A1 to B16 ADD Difference of A - B

Red A1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10 : 11 : 12: 13 : 14 : 15 : 16B
Green A1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10 : 11 : 12: 13 : 14 : 15 : 16B
Blue A1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10 : 11 : 12: 13 : 14 : 15 : 16B

Tables can be 16 Wide & 16 Long to take advantage of Byte aligned F16

Pixel A to B, Interpolation upscaling

AAA
ABA
AAA

Example

R,G,B Value of A
R,G,B Value of B
RCv = Value per pixel of 16

Which is higher RA or RB
if RA
RA - RB = RC
If RB
RB - RA = RC

RB{1 to 16} repeat +- RCv
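
The steps above can be sketched as runnable code; A minimal version assuming 8Bit R,G,B tuples, with `channel_steps` & `interpolate_rgb` as hypothetical names:

```python
def channel_steps(a, b, steps=16):
    """Return `steps` values moving from a towards b, ending on b."""
    rc = abs(a - b)               # RC: difference between the two values
    rcv = rc / steps              # RCv: change per step
    sign = 1 if b >= a else -1    # add when B is higher, subtract when A is higher
    return [a + sign * rcv * i for i in range(1, steps + 1)]

def interpolate_rgb(pixel_a, pixel_b, steps=16):
    per_channel = [channel_steps(a, b, steps) for a, b in zip(pixel_a, pixel_b)]
    # regroup the three channel lists into (R, G, B) pixels
    return [tuple(round(ch[i]) for ch in per_channel) for i in range(steps)]

ramp = interpolate_rgb((0, 160, 255), (160, 0, 255))
print(ramp[0], ramp[-1])
```

Each step is `A + sign * RCv * i`, which is one multiply-add per channel, so the 16 steps fit a 16 wide table row.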

Sorry about the coding RS

Rupert S

*

FMA AVX Performance table: 2Flops per Cycle per FMA Unit
Architecture Fast Instructions for FMA


Reference Tables https://www.uio.no/studier/emner/matnat/ifi/IN3200/v19/teaching-material/avx512.pdf

Operators in C
● Arithmetic
a + b, a – b, a*b, a/b, a%b
● Bitwise
a | b, a & b, a ^ b, ~a
● Bit shift
a << b, a >> b (signed), a >> b (unsigned)
● Logical operators
a && b, a || b, !a
● Comparison operators
a == b, a != b, a < b, a <= b, a > b, a >= b
● Ternary operator
x = a ? b : c
● Special functions:
sqrt(x), abs(x), fma(a,b,c), ceil(x), floor(x)

Fast division for constant divisors

Calculate r = a/b where b is a constant
With floating point we precompute (at compile time
or outside of the main loop) the inverse ib = 1.0/b.
r = ib*a
Floating point division with constant divisors
becomes multiplication
With integers the inverse is more complicated
ib,n = get_magic_numbers(b);
r = ib*a >> n

Integer division with constant divisors becomes
multiplication and a bit-shift

Fast Division Examples
● x/3 = x*1431655766/2^32
27*1431655766/2^32 = 9
● x/1000 = x*274877907/2^38
10000*274877907/2^38 = 10
● x/314159 = x*895963435/2^48
7*314159*895963435/2^48 = 7

Dividing integers by a power of two can be done with a bit shift which is very fast.
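
The magic numbers from the examples can be checked directly; A small sketch for non-negative values below 2^31, where `div_by_const` & the `MAGIC` table are hypothetical names & the constants are the ones listed above:

```python
def div_by_const(x, magic, shift):
    """Integer division by a constant as one multiply and one right shift."""
    return (x * magic) >> shift

# divisor -> (magic multiplier, shift), taken from the examples above
MAGIC = {
    3: (1431655766, 32),
    1000: (274877907, 38),
    314159: (895963435, 48),
}

for divisor, (magic, shift) in MAGIC.items():
    for x in (0, 1, 27, 10000, 7 * 314159, 2**31 - 1):
        assert div_by_const(x, magic, shift) == x // divisor

print(div_by_const(27, *MAGIC[3]), div_by_const(10000, *MAGIC[1000]))
```

The shift grows with the divisor so that the rounding error of the magic number stays below one unit of the result over the input range.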

RS


High-Performance Elliptic Curve Cryptography: A SIMD Approach to Modern Curves
https://www.lasca.ic.unicamp.br/media/publications/FazHernandez_Armando_D.pdf
https://science.n-helix.com/2023/06/map.html
https://science.n-helix.com/2022/04/vecsr.html

https://gpuopen.com/learn/matrix-compendium/matrix-compendium-intro/

*

Triangle 3D Matrix graphs


C
|
|
_____b
\
  \
    A

Vector table for audio & video or graphics..

We will use integers for the 3D audio presentation & SiMD fpu for MP4 & AC4 & Alac decompression..

RS

So we will be using a form of float unit called..

RollINT

We are using roll to roll a zero on or off an integer,

Therefore we are able to divide and multiply and add so that..

101-0 > 10.1+0 No can range practically from 0 to 00000000 practically.

So 10023-000 > 10.023+000

We can then store floating point numbers in integer.
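
One way to read RollINT is as an integer plus a count of rolled decimal places; The sketch below assumes that reading, & the class `RollInt` with its fields is hypothetical, not a defined format:

```python
from fractions import Fraction  # only used to show the exact value

class RollInt:
    """Hypothetical rolled integer: value = digits / 10**roll."""

    def __init__(self, digits, roll=0):
        self.digits = digits
        self.roll = roll

    def __mul__(self, other):
        # multiply the integers, add the rolls
        return RollInt(self.digits * other.digits, self.roll + other.roll)

    def __add__(self, other):
        # align both operands to the larger roll, then add as integers
        roll = max(self.roll, other.roll)
        a = self.digits * 10 ** (roll - self.roll)
        b = other.digits * 10 ** (roll - other.roll)
        return RollInt(a + b, roll)

    def value(self):
        return Fraction(self.digits, 10 ** self.roll)

a = RollInt(101, 1)      # 10.1
b = RollInt(10023, 3)    # 10.023
total = a + b            # 20.123 stored as 20123 with roll 3
product = a * b          # 101.2323 stored as 1012323 with roll 4
print(total.digits, total.roll, product.digits, product.roll)
```

All arithmetic stays in integer units; Only the roll count says where the point sits.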

(C) Rupert S,

Reference Int & FP Value Sizes; A reminder that Floats are 50% of highest Integer Value,
ROLLInt floats still have an amazing additional value!

https://learn.microsoft.com/en-us/dotnet/standard/numerics

*

ECC elliptic curves & Gradients : RS


Leveraging FMA fused MUL ADD on Internet & Software ...

For examples:

Gradients vector compression..

Colour A to colour B

Compare dif {A:B}
Transform A over steps B

Same colour ranges {R,G,B}

(A - B) = Dif
Shift B over steps = A

Store Vec VTable = steps

VTable:

Steps S1 to Sn

Colour B1 to Bn + S1 to Sn

S1,Sn
B1,Bn
B1,Bn
B1,Bn

Same with time & dimensions in the ECC elliptic curve..

S=T*D
Vector= {B1,Bn}

(T*D)+Bn

VTable:

Steps S1 to Sn

Colour B1 to Bn + S1 to Sn

S1,Sn
B1,Bn
B1,Bn
B1,Bn

Rupert S

*

Einstein : Quad:20x30 Matrix table


With Einstein Formula being around 20 operations wide, 30 Lines long..
Single Operation Formula Matrix Tables could be popular,

Consequently matrix math : MTU/MAP processor features should be popular...

I take the view that 8 x 30 is about manageable on the Epyc & M2..
Bearing in mind that a 32 Wide x 32 Long Operations SiMD is achievable...

An AVX512 SiMD could run Quad operations (128Bit AVX) x 4,
So 20/4 = 5x; So 6x AVX512(128Bit Operation); Now there is; I believe; 1 AVX core per 2 Core Groups!

So 24 Core has 8x or 4x or 2x (8 or 4 Cores per die unit)!
So 84 Core units should have enough AVX512?

But one Mac M2... :D

In our case Einstein, the table is 20 Wide & 35 Long (roughly)

So : Einstein = Quad:20x35 | Alternative Quad:8x16, More manageable in
SiMD Parallel Executions; Quad:8x16 x 3, ....

One presumes strict aligned multiple multiplication

4x4 Tables are still of utility for Science maths; But we need
to get the point across what we need for Einstein!

The Subject of 4x4 tables,

We are obviously looking for more like 16x16 for Physics maths!
The matrix processor is a large data set; Divisible into 4x2 & 4x4 &
8x8 groups for execution speedups,
Aligned Parallel processing....

Aligned Matrix tables need to be larger than 4x4 for Physics &
Chemistry; So a matrix processor ideally can at a minimum:

Matrix Table

x1
16x16

16/2
x2
8x8,8x8
8x8,8x8

8/4
x4
4x4,4x4
4x4,4x4
https://gpuopen.com/learn/matrix-compendium/matrix-compendium-intro/
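
The 16x16 > 8x8 > 4x4 split can be sketched as a tiled multiply; A plain Python illustration with integer matrices, where `blocked_matmul` is a hypothetical name & the loops stand in for what a matrix unit would do per tile:

```python
def matmul(a, b):
    """Reference multiply for comparison."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def blocked_matmul(a, b, tile):
    """Multiply square matrices tile by tile; each tile product is separate work."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        acc = c[i][j]
                        for k in range(k0, k0 + tile):
                            acc += a[i][k] * b[k][j]  # multiply-accumulate
                        c[i][j] = acc
    return c

a = [[(i + j) % 7 for j in range(16)] for i in range(16)]
b = [[(i * j) % 5 for j in range(16)] for i in range(16)]
for tile in (16, 8, 4):
    assert blocked_matmul(a, b, tile) == matmul(a, b)
```

The result is the same for every tile size; The tile size only decides how the work is grouped for parallel units.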

https://marctenbosch.com/quaternions/
https://arxiv.org/abs/1101.4542

Quaternions > PGA Geometric : a+b+c : Rotational algebra : ax+by+c=0 | e1, e2, e3
https://www.youtube.com/watch?v=0i3ocLhbxJ4
https://www.youtube.com/watch?v=Idlv83CxP-8

Improving Structured Grid-Based Sparse Matrix-Vector Multiplication and Gauss–Seidel Iteration on GPDSP
https://www.mdpi.com/2076-3417/13/15/8952

SiMD Matrix Maths - Performance Portable SIMD Approach - Implementing Block Line Solver For Coupled PDEs
https://www.osti.gov/servlets/purl/1602621

SiMD Matrix Maths - Operations Details HIP AMD
https://rocm.docs.amd.com/_/downloads/en/latest/pdf/

SiMD double tables, M1 Matrix
https://developer.apple.com/documentation/accelerate/working_with_matrices


FMA AVX Performance table: 2Flops per Cycle per FMA Unit
Architecture Fast Instructions for FMA
https://www.uio.no/studier/emner/matnat/ifi/IN3200/v19/teaching-material/avx512.pdf

#RIP (Intro interesting!) Optimizing massively parallel sparse matrix computing on ARM many-core processor
https://www.sciencedirect.com/science/article/abs/pii/S0167819123000418

https://www.gamedeveloper.com/programming/implementing-a-3d-simd-geometry-and-lighting-pipeline
https://developer.apple.com/documentation/accelerate/working_with_matrices

CGAL is a computational geometry library for C++; Luckily OpenBLAS is a compatible library & AMD Makes a version in HIP
https://cpp.libhunt.com/cgal-alternatives

Matrix Libs : L1 means compatible with CGAL, A+ means i rate them highly on science community use : RS

CGAL (L1)
GLM (L1)
QuantLib (L1)
Ceres-Solver (L1)

OpenBLAS (A+)
Eigen (A+)
MiraCL (A+)

C++ Matrix Maths

MRPT is Camera & FFMPeg complex install
https://docs.mrpt.org/reference/latest/compiling.html

C++ Matrix Maths : Simple
https://sourceforge.net/projects/arma/

C++ conversions between Numpy arrays and Armadillo matrices; Converts Into Numpy Py not out (needs work)
https://github.com/RUrlus/carma

https://sourceforge.net/software/product/NumPy/
https://sourceforge.net/software/product/NumPy/integrations/

Motivated applications of 3D Matrix Database ML

RS

Just shows how fast Blas & these NumPy & Arma & Mave are! 1998-man SigRS
Parallel matrix multiplication & diagonalization
https://www-users.york.ac.uk/~mijp1/teaching/grad_HPC_for_MatSci/Lecture4.pdf

Wasm Inefficiency
https://news.ycombinator.com/item?id=37387629

*

3D Matrix Web Codecs


Are presented as being JIT Compiler re-encoded when required; Frequently WebASM, WebGPU Code, JS...
Audio, Video, Sensation, Code Runtimes.

Web Codecs for devices are a modern concept & are available for common websites such as news & music,
devices such as Alexa Echo & Google Dot & Bluetooth Devices?

Media players & BT devices particularly suffer from small Storage potential!
So Web Codecs downloaded to the device from a source; Such as a smart phone or computer..
Are a clear-minded solution!

JIT Compiler

3D Matrix Tables in FMA, Mul & ADD code to be automatically recompiled locally when required!
Directed to a common API, Direct Compute, WebGPU, WebASM, Jit Compiler OpenCL

Many Operations can be done from unique device specific optimisation; Examples:

API, DirectX & OpenCL & Vulkan & WebGPU & WebASM
Texture & Audio Shaders.
Digital Streaming

Bluetooth NANO SiMD & API
Digital TV in H266, VP9 & AV1,

Locally compiled accelerators should be respected first; Such as the output & input 3D Matrix & CPU & GPU Acceleration engine..

Code can include Matrix converters into common output format such as WebP & Textures & BC, DXT Compression presentation; Vulkan, OpenCL & DirectX & Texture & Audio Shaders.

Java, JS & WebASM are examples with operator mechanisms & JIT Compiler optimisation..
Minimising storage requirements for good compatibility while maximising performance.

RS

Requirements:

https://science.n-helix.com/2022/08/jit-dongle.html
https://science.n-helix.com/2022/06/jit-compiler.html

https://science.n-helix.com/2023/02/smart-compression.html
https://science.n-helix.com/2022/10/ml.html
https://science.n-helix.com/2023/06/map.html

*

TPU & SiMD Parallel wavetables Pre-Calculation Meta-Data : RS

{ For data expansion & Precomputed Upscaling through meta data per frame sequence }
#MetaDATA #PreProcessing Parallel Text loading and machine learning processing : RS 23/07/2023

Pre-calculation table; For Example the Amiga uses tables for maths!
Pi, Common conversion maths & float results in higher precision...

Parallel Text loading and machine learning processing is one of the wonders of TPU & SiMD Parallel wavetables,

Pre Calculate Tables that reduce a workload to simple process. and use...

For example if you Upscale a movie & use dynamic settings, Such as:
Localised Sharpening & Selective Gaussian filtering; Such as Gimps Edge detection Gaussian?
We compress information on the maths of selection..

The edges we selected, The methods we used & if those methods are dynamic then our selections...
Such a method is called a ..

Pre-calculation table; For Example the Amiga uses tables for maths!
Pi, Common conversion maths & float results in higher precision...

Common ones are learned at school
the log tables
Multiplication Tables
Common values such as gravity & Pi

Pre Computation
Upscaling
3D Audio basic resonance profile
Pre Computed values for a realistic world...
Experience & Learning to pre compute values...
This saves effort later in the process
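
A pre-calculation table in its simplest form, as a sketch; The table size of 256 & the name `fast_sin` are assumptions chosen for illustration:

```python
import math

TABLE_SIZE = 256  # assumed size; a larger table gives finer steps
SINE_TABLE = [math.sin(2 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]

def fast_sin(angle):
    """Look up sin(angle) from the precomputed table (nearest entry)."""
    index = round(angle / (2 * math.pi) * TABLE_SIZE) % TABLE_SIZE
    return SINE_TABLE[index]

print(fast_sin(math.pi / 2), fast_sin(1.0), math.sin(1.0))
```

The table is built once; Every later call is an index calculation & a load, at the cost of an error bounded by half a table step.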

This is available to providers & game developers for:

TV Upscaling through Compressed Numeric Add table downloading
All streaming services processing such as netflix, youtube & amazon prime!
Partial pre-computed upscaling for game, application & processing..

Through TopCloud & HPC Pack

Data Stored as meta-data and saves on repeat processing time!
By creatively Pre Computing processes such as 3D Audio, VR Audio, Haptic 3D Maths..
Work such as Decompression & Compiling

Affects the efficiency of any process that will Pre Calculate Tables that reduce a workload to a group of simple processes.

We can majorly improve quality of both visuals & Audio; Any Pre-Calculatable element

The logic is that Upscaling, Colour enhancements & sharpening have pre-calculatable logic,
We can save many seconds of processing per frame,
We can reduce energy footprint
We can improve latency & frame rate
Works for games also,
Education media or Theaters & mass media content such as News & commonly watched content or movies or visited websites or fonts & media

We can improve at a very minimum, Cutscenes & non motional backdrops & tangible Animation repeating assets & Effects...

(c)Rupert S

FMA : Fused Multiply ADD : MUL+ADD & Precision functions


You may be assuming that only modern GPUs such as RTX 2080+ & RX 5700+ have this?

FMA is a feature of the business editions & FX Series on AMD & exists in granite ridge & other Intel,
So FMA F16 is possible with the F32 : F16 conversion features present in for example FX8320E...
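
The F32 : F16 conversion route can be imitated in software; A sketch using the half precision format code `'e'` of Python's `struct` module, where `to_f16` & `fma_f16` are hypothetical names & the multiply-add runs at Python's double precision rather than in a real SiMD unit:

```python
import struct

def to_f16(x):
    """Round a Python float to the nearest IEEE half precision value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def fma_f16(a, b, c):
    # widen the F16 inputs, multiply-add at higher precision, narrow once at the end
    return to_f16(to_f16(a) * to_f16(b) + to_f16(c))

print(to_f16(0.1), fma_f16(1.5, 2.0, 0.25))
```

Values stay in F16 for storage & only widen for the arithmetic, which is the pattern the conversion instructions allow on a CPU.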

So what does this mean? In terms of:

Chrome that Emulates a lot of its GPU functions in CPU..
In terms of Python ML that F16 feature combined with FMA is very helpful in learning & efficiency!

In terms of CPU; mostly using 32Bit, F32, 64Bit, F64 is very helpful; in terms of SiMD,
F16 exists though; Even on the yee FX8320E!

So we can use potentially: Int8, Int32, Int64, F64, F32, F16 & Float 182Bit as with FPU!
Best to do DEEP work with the CPU FPU & SiMD...

We do have these functions though!, But Deep work FPU 182Bit? CPU! Some GPU have double precision also!

What do we use this variety for? Many things!

Defined by our precision requirements; not all things are INT64 & FPU But not every issue is covered by..
The MP4v, MP4a F16! AC3 & AC4 for example F32; A glass? FPU 182... or many F32 or even more F16 work units.

Rupert S

Exponent factorisation : RS

8Bit, 16Bit, 32Bit, 64Bit Exponent theory.
Available to you-(EF)

A value in 8Bit is no use in a 16 Bit operation... or is it?

Firstly 8 Bit values can be loaded with Zeros into higher math precisions,
In normal maths we use a remainder; So we can load 8Bit values into 32Bit Int & that works...

2 F16 blocks would be 32Bit; As 2 16Bit Blocks? So what use is this ?
in a 64Bit & 32Bit processor storage of FPU-182Bit values is possible ...
32Bit Blocks * 6 with XOR 00
64Bit Blocks * 3 with XOR 00
2 * Largest value...
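
The block storage above can be sketched with plain integers; Assuming the wide value is treated as a raw bit pattern, with `split_blocks` & `join_blocks` as hypothetical helper names:

```python
def split_blocks(value, block_bits, count):
    """Split a non-negative integer into `count` blocks, least significant first."""
    mask = (1 << block_bits) - 1
    assert value >> (block_bits * count) == 0, "value does not fit"
    return [(value >> (block_bits * i)) & mask for i in range(count)]

def join_blocks(blocks, block_bits):
    """Rebuild the wide value from its blocks."""
    return sum(b << (block_bits * i) for i, b in enumerate(blocks))

wide = (1 << 182) - 1               # largest 182Bit pattern
as64 = split_blocks(wide, 64, 3)    # 3 x 64 = 192 bits, top 10 bits stay zero
as32 = split_blocks(wide, 32, 6)    # 6 x 32 = 192 bits, same padding
print(len(as64), len(as32), join_blocks(as64, 64) == wide)
```

Both layouts hold 192 bits, so the 182Bit pattern fits with 10 zero bits of padding in the top block.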

But parallelising F64 on groups for 182Bit? with multiplications roll left <> Right .. & Additions +- ...
Possible.

But if the resultant is beyond 8Bit ? & we wanted to save as 8Bit?

Factorisation of a 32Bit value into 8Bit is possible; But we need to factor it!
Well:

32Bit to 8Bit is 6:1, So we have to random roll 6 Bits for every 1
We can factor in HighLow with 1 bit or use 8Bit factor 256 & 8Bit Number...

We can Multiply, Add, Subtract or divide or fraction:

256(*/-)1>256, leaving us with a 32Bit value? Well what can we use this for ?

Example complex : N/(240*50); See the maths can roll into 16Bit values..
We can use them, Or load a particular object, Classifier, HASH, AES, ECC...
We can quickly classify as 16Bit resultant & still save as a particular 8Bit value!

Images
Gains
Memories
Load file
load value
Random
Table Value
Compression!

(c)Rupert S,

Reference Int & FP Value Sizes; A reminder that Floats are 50% of highest Integer Value,
ROLLInt floats still have an amazing additional value!


https://science.n-helix.com/2023/02/smart-compression.html

F16b Adaptive Float value : Texture Color Palette Example : RS



Basic Example of F16b float in action on a colour palette: {F16b,F32b, F64b}

F16b is short remainder F16 & it has 8 Bits of 0.01 point value rather than 16,
So what do we mean ? What is significant about this?

F16b Has 24Bit precision integer with an 8 bit remainder!
So? So 16Bit + 8Bit = 24Bit! & 8bit point value...

In colour representation point values contribute to subtle blending;
So a full 24Bit contributes to 90% of the Color Palettes

So the 24Bit colour palette is 32Bit Colour Minus Alpha;
We can use F16b in HDMI & DisplayPort & inside the GPU & Also for textures & JPGs..
Thereby i present F16b & F24Bit colours in F16b

This saves all data in single 32bit Spaces & therefore is both faster & higher resolution than comparable float value presentations.
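
One possible reading of the layout is a 32Bit word with a 24Bit integer part & an 8Bit point value; The sketch below assumes that reading, for non-negative values only, & all names in it are hypothetical:

```python
FRACTION_BITS = 8  # assumed: 8 bits of point value below a 24 bit integer part

def encode_24_8(x):
    """Pack a non-negative value into 32 bits: 24 bit integer part, 8 bit fraction."""
    raw = round(x * (1 << FRACTION_BITS))
    assert 0 <= raw < (1 << 32), "value out of range"
    return raw

def decode_24_8(raw):
    return raw / (1 << FRACTION_BITS)

def split_24_8(raw):
    # (integer part, fraction byte)
    return raw >> FRACTION_BITS, raw & 0xFF

raw = encode_24_8(200.5)
print(raw, decode_24_8(raw), split_24_8(raw))
```

The fraction byte carries the subtle blending steps while the integer part addresses the palette, & both travel in one 32Bit space.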

Bound to make a big difference to Blu-ray, but particularly DVD & AC3 & AC4;
F16b Adaptive Float value : Texture Color Palettes Example;

(you can use F16b * R,G,B,A) in HDMI & DisplayPort, Massive colour improvements; Lower RAM Costs

Rupert S

AnPa_Wave - Analogue Pattern Wave Vector SiMD Unit : (c)RS


The base symphony is harmony, In other words waveforms; There are a couple of Simple methods that really work:

High performance Float values F16, F32, F64, FPU

Q-Bit Quantum; All forms of Quantum wave work
Radio waves;
Light patterns
Photon wave patterns; single & multiple
Sound hardware; 1 to 3 Bit DAC; Audio conversions; Sample range
Analogue chips that work on harmony & frequency
SVM Elliptic curve maths
Sin, Arc, Tan, Time, Vector

In essence Harmony & frequency is the equivalent of Complex Elliptic curve maths

A Music note score suffices to specify harmony basics:

Waveform shape in 3D
Harmony / Disharmony
Vibration High / Vibration Low
Power High / Power Low
Volts High / Volts Low
Watts High / Watts Low

(c)Rupert S

https://science.n-helix.com/2023/07/3dchiplet.html

Wonderful Wave-Pattern Analogue waveforms in meta materials - Pattern recognition in reciprocal space with a magnon-scattering reservoir
https://www.nature.com/articles/s41467-023-39452-y.pdf

*

Vectors & maths
https://science.n-helix.com/2022/08/simd.html
https://science.n-helix.com/2022/04/vecsr.html
https://science.n-helix.com/2016/04/3d-desktop-virtualization.html
https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html
https://science.n-helix.com/2023/02/smart-compression.html

Networking & Management
https://science.n-helix.com/2023/06/tops.html
https://science.n-helix.com/2023/06/ptp.html
https://science.n-helix.com/2023/06/map.html
https://science.n-helix.com/2023/02/pm-qos.html
https://science.n-helix.com/2022/08/jit-dongle.html
https://science.n-helix.com/2022/06/jit-compiler.html
https://science.n-helix.com/2022/03/ice-ssrtp.html
https://science.n-helix.com/2022/01/ntp.html

Faster Maths & ML
https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html
https://science.n-helix.com/2021/02/multi-operation-maths.html
https://science.n-helix.com/2021/11/parallel-execution.html
https://science.n-helix.com/2022/12/math-error-solve.html
https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
https://science.n-helix.com/2022/10/ml.html

Focus on Quality
https://science.n-helix.com/2022/09/ovccans.html
https://science.n-helix.com/2022/11/frame-expand-gen-3.html
https://science.n-helix.com/2022/03/fsr-focal-length.html

For when {U, X, Y, Z} = N Expressions https://is.gd/ForWhen_UXYZ_N
For when {(A+B/2)} = C Expressions https://is.gd/ForWhen_ABx2_C

Hallelujah RS Light-Wave SiMD https://www.allaboutcircuits.com/news/lightelligence-reports-worlds-first-optical-network-on-chip-processor/

RS Spectra Mitigations https://science.n-helix.com/2018/01/microprocessor-bug-meltdown.html
ZenBleed Parallel Solvent RS 2023 https://science.n-helix.com/2023/07/zenbleed.html

Core/CPU/GPU security core SSL/TLS BugFix
https://science.n-helix.com/2020/06/cryptoseed.html
https://science.n-helix.com/2019/05/zombie-load.html

Secure Configuration:
https://is.gd/SSL_NetSecurity_NTP_PTP
https://is.gd/EthernetTunnelOpt
https://is.gd/SSL_Optimise

PTP & NTP Improve security WW https://is.gd/PTP_TimeStream

*****

Running Code

https://is.gd/UpscaleWinDL

https://is.gd/HPC_HIP_CUDA

PoCL Source & Code
https://is.gd/LEDSource

PoCL-Direct
https://is.gd/PoCL_Source

X86Features-Emu
https://drive.google.com/file/d/15vXBPLaU9W4ul7lmHZsw1dwVPe3lo-jK/view?usp=usp=sharing

https://www.amd.com/en/developer/rocm-hub/hip-sdk.html#tabs-ddafbba141-item-c6b9ce2aab-tab
https://rocm.docs.amd.com/en/docs-5.5.1/deploy/windows/quick_start.html

AMD 23.Q3Pro_HIP #HPC #DirectML MatrixMathOps 'Release unto me the great! Chobokniki' Thine Prayers Answered https://is.gd/AMD23Q3PRO_HIP
Run the .reg after install; Before reboot https://is.gd/AMDRebarReg

**********
https://en.wikipedia.org/wiki/Cell_(processor)

https://www.khronos.org/news/permalink/ibm-releases-opencl-drivers-for-power6-and-cell-b.e/

Not Accessible
https://www.alphaworks.ibm.com/tech/opencl
**********

AI: Artificial Intelligence
ML: Machine Learning
PULP: Parallel Ultra Low Power

ML Network Types


DNN: Deep Neural Network
CNN: Convolutional Neural Network
QML: Quantum Machine Learning
QPU: Quantum Processing Unit

RNN: Recurrent Neural Network
SNN: Spiking Neural Network
MLP: Multi-Layer Perceptron

NN: Neural Network
TNN: Ternary Neural Network
QNN: Quantized Neural Network

HDL: Hardware Description Language
HLS: High Level Synthesis

Maths Operations


FMA: Fused Multiply-Add
GEMM: General Matrix Multiply
SIMD: Single Instruction Multiple Data
SIMT: Single Instruction Multiple Thread

SP: Single Precision
DP: Double Precision
FLOPS: Floating Point Operations per Second

Processor Types & RAM

ASIC: Application Specific Integrated Circuit

SoC: System on Chip
PCU: Programmable Computing Unit
NoC: Network on Chip

CPU: Central Processing Unit
VPU: Vector Processing Unit
NPU: Neural Processing Unit
TPU: Tensor Processing Unit
FPGA: Field-Programmable Gate Array

RISC: Reduced Instruction Set Computer
CISC: Complex Instruction Set Computer

NDP: Near Data Processing

PIM: Processing In-Memory
IMC: In-Memory Computing

SRAM: Static Random Access Memory
VRAM: Video Random Access Memory
DRAM: Dynamic Random Access Memory
PCM: Phase Change Memory
BRAM: Block Random Access Memory
RAM: Random Access Memory
RRAM: Resistive RAM

*****

Matrix Array Processor Unit (c)RS


[M.A.P] [=====] [H.P.C] - Matrix Array Processor Unit (c)RS

This document describes the design and implementation of a novel computing device called the Matrix Array Processor Unit (M.A.P.U).

The M.A.P.U is a novel co-processor that can perform high-speed parallel operations on multi-dimensional arrays of data, such as those used in quantum computing, machine learning, and computer graphics.

It performs high-performance computing tasks using quantum-inspired principles.

The Matrix Array Processor is a type of processor that is designed to handle multi-directional and multi-dimensional arrays per Qbit.

It is used in quantum computers and relies on percentage-based 3D processing to handle all 3D array processing.

The central tasks map to probability over networks and MAP units in arrays.

The M.A.P is composed of multiple interconnected units that can process multi-dimensional arrays in parallel, using a percentage-based 3D processing scheme.

The M.A.P can be integrated with existing CPU, GPU and DPU architectures, as well as with other M.A.P units, to form a scalable and flexible computing platform.

The difference between the Matrix Array Processor and other processors such as:

SIMD (Single Instruction Multiple Data),
SISD (Single Instruction Single Data),
MISD (Multiple Instruction Single Data),
MIMD (Multiple Instruction Multiple Data),
Vector processors,
Systolic Arrays,

Is that the Matrix Array Processor is designed to handle multi-directional and multi-dimensional arrays per Qbit...

While other processors are designed to operate efficiently and effectively on large one-dimensional arrays of data called vectors

The M.A.P.U consists of three main components:

The Matrix Array Processor (M.A.P),
The High Precision Central Core (H.P.C),
The Bus Connections and Networking (=====).

Core Definitions 3D M.A.P:

[H.P.C]:

A high-precision central core that can handle complex tasks such as probability mapping, network routing and memory management.

The H.P.C is the central controller of the M.A.P.U.

It coordinates the execution of tasks across the M.A.P units, assigns probabilities to different outcomes, and handles complex calculations that require high precision or accuracy.

Each [H.P.C] unit can connect to 8 [M.A.P] units and optionally to other [H.P.C] units in different layers of the 3D matrix.

The [H.P.C] can also communicate with external devices such as CPUs, GPUs, DPUs, or networks via the bottom layer of the wafer.

[M.A.P]:

The M.A.P is a specialized processing unit that can execute multiple arithmetic and logical operations on a single array element in one clock cycle.

A unit that can perform arithmetic operations on multi-dimensional arrays using a dot product-like algorithm.

Each M.A.P has 8-way interconnects to communicate with neighboring M.A.P units, in the same layer or adjacent layers, and with a central [H.P.C] unit.

The M.A.P can also access local cache or RAM for storing intermediate results or constants.

[=====]:

A bus connection that enables data transfer and networking among the M.A.P units and the [H.P.C] units.

The bottom layer of the wafer contains a high-resolution bus that connects to the onboard controllers and networks and the external CPU, GPU and DPU devices.

The ===== supports different communication protocols and topologies, such as mesh, torus, or hypercube.

The ===== also provides fault tolerance and load balancing mechanisms to ensure reliable and efficient performance.

The M.A.P.U is designed to be scalable and modular.

It can be stacked in three dimensions to form a larger array of processors that can handle more complex and diverse tasks.

The M.A.P.U can also be customized for different applications by changing the size, shape, or configuration of the M.A.P units, the H.P.C cores, or the ===== network.

The following diagrams illustrate the structure and functionality of the M.A.P.U.

Top View

[M.A.P][M.A.P][M.A.P]
[M.A.P][H.P.C][M.A.P]
[M.A.P][M.A.P][M.A.P]


Side View 3D


[M.A.P][H.P.C][M.A.P]
[M.A.P][=====][M.A.P]
[=====][H.P.C][=====]
[M.A.P][=====][M.A.P]
[=====][H.P.C][=====]
[M.A.P][=====][M.A.P]
[=====][H.P.C][=====]

Each [H.P.C] Central Contains RAM & connections to the 8 [M.A.P] & Optionally to layers above & below in 3D Matrix,
Bottom of wafer contains high resolution bus to onboard controllers & networks & DPU/GPU/CPU's

Array = Matrix Array Processor Unit (c)RS

ffffffff ffffffff ffffffff
........+ ........*+ ........*
........+ ........*+ ........*
........+ ........*+ ........*

f=fp,unit
*=mul
+=add
.=Cache/Ram

Simple absolver table for MUL:ADD : MUL* Only = +0, +- Only = N*1 then +-
% = / 100 + ADD Table {N1 <> N...} : Result!
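One possible reading of the MUL:ADD table, sketched in Python: a single multiply-add unit computes a*b + c, a multiply-only operation adds 0, and an add or subtract-only operation first multiplies by 1. The percentage line is read here as multiply then divide by 100. All function names are hypothetical, and this interpretation of the table is an assumption.

```python
def mul_add(a, b, c):
    """The one combined operation: multiply, then add."""
    return a * b + c


def mul_only(a, b):
    """MUL only = +0."""
    return mul_add(a, b, 0)


def add_only(n, c):
    """+ only = N*1 then +."""
    return mul_add(n, 1, c)


def sub_only(n, c):
    """- only = N*1 then -."""
    return mul_add(n, 1, -c)


def percent_of(n, p):
    """% read as: multiply by p, add 0, then divide by 100."""
    return mul_add(n, p, 0) / 100
```

With this reading every operation in the table passes through the same unit, so a row of identical MUL+ADD cells can serve all four cases.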

The M.A.P unit can perform operations on multi-dimensional arrays using a combination of:

Floating-point units (f), Multiplication units (*), Addition units (+) and cache/ram units (.).

The M.A.P unit can support different data types such as DOT4, INT8, INT16, F16, F32 and F64.
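A hedged sketch of what a DOT4 operation commonly means on current hardware: four INT8 pairs are multiplied and summed into a wider INT32 accumulator. The source does not define DOT4, so this meaning, the wrap-around behaviour and the names `to_int8` and `dot4_int8` are assumptions for illustration.

```python
def to_int8(v):
    """Wrap an integer into the signed 8-bit range [-128, 127]."""
    return ((v + 128) % 256) - 128


def dot4_int8(a, b, acc=0):
    """Four INT8 multiplies summed into a signed 32-bit accumulator."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("DOT4 takes exactly four elements per operand")
    total = acc + sum(to_int8(x) * to_int8(y) for x, y in zip(a, b))
    # Wrap the result into the signed 32-bit range
    return ((total + 2**31) % 2**32) - 2**31


result = dot4_int8([1, 2, 3, 4], [5, 6, 7, 8])
```

Chaining calls through `acc` lets longer vectors be processed four elements at a time.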

The M.A.P co-processor is a cutting-edge technology that can enable new applications in fields such as artificial intelligence, machine learning, scientific computing and more.

(c)Rupert S

References: DOT4, INT8, INT16, F16, F32, F64 (c)Rupert S

https://is.gd/LEDSource

https://science.n-helix.com/2023/06/map.html

https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html
https://science.n-helix.com/2021/02/multi-operation-maths.html
https://science.n-helix.com/2021/11/parallel-execution.html
https://science.n-helix.com/2022/12/math-error-solve.html
https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
https://science.n-helix.com/2022/10/ml.html

Sparse matrix multiplication in SRM array
https://www.science.org/doi/10.1126/sciadv.adf7474

Error Correction Options & Mitigation
https://futurism.com/ibm-breakthrough-quantum-computing

**********


Light Processors (c)Rupert S https://science.n-helix.com


Light processors : Access to advanced : Storage Cache, Random Access RAM Cache & Processor architecture: Starting with SiMD Simple Vector Instruction Set

Complex forms are a goal, Start simple : The world will thank you!
Simple as SiMD appears there are many uses,
Considering that higher instruction sets are delayed by SiMD space & speed priorities..

Array = Matrix Array Processor Unit (c)RS

ffffffff ffffffff ffffffff
........+ ........*+ ........*
........+ ........*+ ........*
........+ ........*+ ........*

f=fp,unit
*=mul
+=add
.=Cache/Ram

Simple absolver table for MUL:ADD : MUL* Only = +0, +- Only = N*1 then +-
% = / 100 + ADD Table {N1 <> N...} : Result!

Cache is also a priority, with manifold applications in simple data transfer & buffering to solid storage,
Power outage is our main concern, so we must be able to save all our work.

SSD is an obvious solution to backing up speedily,
However we do use RAM Cache for this goal..

For the goal of speeding up storage access,
Light does all the types of work we need:

List:
Data transit
Cache
Processing via dimensions & signal variance
RAM (Cyclic light transfer) Same principle as fibre optic cable over large distances.
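To give a feel for the cyclic light RAM idea, the sketch below estimates how many bits are in flight inside a loop of optical fibre at any moment: bit rate multiplied by the time light takes to travel the loop. The refractive index of 1.47 is a typical figure for silica fibre and is an assumption, as are the function names and the example loop length and bit rate.

```python
SPEED_OF_LIGHT_M_S = 299_792_458   # speed of light in vacuum
FIBRE_REFRACTIVE_INDEX = 1.47      # assumed typical value for silica fibre


def loop_delay_seconds(length_m, n=FIBRE_REFRACTIVE_INDEX):
    """Time for light to travel once around a fibre loop."""
    return length_m * n / SPEED_OF_LIGHT_M_S


def stored_bits(length_m, bit_rate_hz, n=FIBRE_REFRACTIVE_INDEX):
    """Whole bits in flight in the loop: bit rate times loop delay."""
    return int(bit_rate_hz * loop_delay_seconds(length_m, n))


# Hypothetical example: a 1 km loop driven at 10 Gbit/s
bits = stored_bits(1_000, 10e9)
```

Under these assumptions a 1 km loop holds tens of thousands of bits, so capacity scales with both loop length and bit rate, and the data has to be regenerated each cycle to persist.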

(c)Rupert S https://science.n-helix.com

Quantum ! Light Compute : Reference material : RS

Yes we can solve classic problems with light computers, Light computers perform geometry & quantitative sampling (Comment by inventor) Rupert S

Light Compute : Reference material : RS
https://science.n-helix.com/2012/09/geometric-calculating-machines.html

https://science.n-helix.com/2020/03/single-photon.html

https://science.n-helix.com/2014/07/the-formula-of-geometric-volumes.html

https://science.n-helix.com/2018/07/universeal-algebra-paper.html

https://science.n-helix.com/2018/06/compression-libraries-index-prime.html

https://science.n-helix.com/2013/08/light-theory-on-creation-of-3d-image.html

https://science.n-helix.com/2018/06/uses-for-micro-laser-light-emitting.html

https://science.n-helix.com/2020/04/render.html

https://science.n-helix.com/2019/06/vulkan-stack.html

https://science.n-helix.com/2019/06/kernel.html

https://science.n-helix.com/2019/05/compiler-optimisation.html

https://science.n-helix.com/2018/09/hpc-pack-install-guide.html

https://science.n-helix.com/2020/04/cern.html

"Let's Play" Station NitroMagika_LightCaster

Let's face it, Realtek could well resource the "Original QFFT Audio device & CPU/GPU"

The mic works by calculating angle on a drum...
Light.. and timing & dispersion...
The audio works by QFFT replication of audio function..
The DAC works by quantifying as Analog digital or Metric Matrix..
The CPU/GPU by interpreting the data of logic, Space & timing...

We need to calculate; Quantum is not the necessary feature,

But it is the highlight of our:

Data storage cache.
Our Temporary RAM
Our Data transport..
Of our fusion future.

(c)Rupert S https://science.n-helix.com

"Weedbrook points out that as yet, and in contrast to Google’s Sycamore, the Chinese team’s photonic circuit is not programmable, so at this point “it cannot be used for solving practical problems”."
https://www.nature.com/articles/d41586-020-03434-7

https://scitechdaily.com/ai-boosted-by-parallel-convolutional-light-based-processors/

https://interestingengineering.com/worlds-fastest-most-powerful-neuromorphic-processor-for-ai-unveiled

Physicists in China challenge Google’s ‘quantum advantage’
Photon-based quantum computer does a calculation that ordinary computers might never be able to do.
Philip Ball


This photonic computer performed in 200 seconds a calculation that on an ordinary supercomputer would take 2.5 billion years to complete. Credit: Hansen Zhong

A team in China claims to have made the first definitive demonstration of ‘quantum advantage’ — exploiting the counter-intuitive workings of quantum mechanics to perform computations that would be prohibitively slow on classical computers.

They have used beams of laser light to perform a computation which had been mathematically proven to be practically impossible on normal computers. The team achieved within a few minutes what would take half the age of Earth on the best existing supercomputers. Contrary to Google’s first demonstration of a quantum advantage, performed last year, their version is virtually unassailable by any classical computer. The results appeared in Science on 3 December [1].

“We have shown that we can use photons, the fundamental unit of light, to demonstrate quantum computational power well beyond the classical counterpart,” says Jian-Wei Pan at the University of Science and Technology of China in Hefei. He adds that the calculation that they carried out — called the boson-sampling problem — is not just a convenient vehicle for demonstrating quantum advantage, but has potential practical applications in graph theory, quantum chemistry and machine learning.

“This is certainly a tour de force experiment, and an important milestone,” says physicist Ian Walmsley at Imperial College London.

Quantum advantage challenged

Teams at both academic and corporate laboratories have been vying to demonstrate quantum advantage (a term that has now largely replaced the earlier ‘quantum supremacy’).

Last year, researchers at Google’s quantum-computing laboratory in Santa Barbara, California, announced the first-ever demonstration of quantum advantage. They used their state-of-the-art Sycamore device, which has 53 quantum bits (qubits) made from superconducting circuits that are kept at ultracold temperatures [2].

But some quantum researchers contested the claim, on the grounds that a better classical algorithm that would outperform the quantum one could exist [3]. And researchers at IBM claimed that its classical supercomputers could in principle already run existing algorithms to do the same calculations in 2.5 days.

To convincingly demonstrate quantum advantage, it should be unlikely that a significantly faster classical method could ever be found for the task being tested.

The Hefei team, led by Pan and Chao-Yang Lu, chose a different problem for its demonstration, called boson sampling. It was devised in 2011 by two computer scientists, Scott Aaronson and Alex Arkhipov [4], then at the Massachusetts Institute of Technology in Cambridge. It entails calculating the probability distribution of many bosons (a category of fundamental particle that includes photons) whose quantum waves interfere with one another in a way that essentially randomizes the position of the particles. The probability of detecting a boson at a given position can be calculated from an equation in many unknowns.

200 seconds

But the calculation in this case is a ‘#P-hard problem’, which is even harder than notoriously tricky NP-hard problems, for which the number of solutions increases exponentially with the number of variables. For many tens of bosons, Aaronson and Arkhipov showed that there’s no classical shortcut for the impossibly long calculation.

A quantum computer, however, can sidestep the brute-force calculation by simulating the quantum process directly — allowing bosons to interfere and sampling the resulting distribution. To do this, Pan and colleagues chose to use photons as their qubits. They carried out the task on a photonic quantum computer working at room temperature.

Starting from laser pulses, the researchers encoded the information in the spatial position and the polarization of particular photon states — the orientation of the photons’ electromagnetic fields. These states were then brought together to interfere with one another and generate the photon distribution that represents the output. The team used photodetectors capable of registering single photons to measure that distribution, which in effect encodes the calculations that are so hard to perform classically.

In this way, Pan and colleagues could find solutions to the boson-sampling problem in 200 seconds. They estimate these would take 2.5 billion years to calculate on China’s TaihuLight supercomputer, a quantum advantage of around 10^14.
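The order of magnitude of that advantage can be checked with a few lines of arithmetic. This is only a sanity check on the two figures quoted above (200 seconds and 2.5 billion years); the year length of 365.25 days is an assumption.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600    # assumed average year length

photonic_seconds = 200                   # quoted run time of the photonic device
classical_seconds = 2.5e9 * SECONDS_PER_YEAR  # quoted 2.5 billion years

ratio = classical_seconds / photonic_seconds
exponent = math.floor(math.log10(ratio))  # order of magnitude of the speed-up
```

The ratio comes out at a few times 10^14, which agrees with the figure of around 10^14.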

Practical problems

“This is the first time that quantum advantage has been demonstrated using light or photonics,” says Christian Weedbrook, chief executive of quantum-computing startup Xanadu in Toronto, Canada, which is seeking to build practical quantum computers based on photonics.

Walmsley says this claim of quantum advantage is convincing. “Because [the experiment] hews very closely to the original Aaronson–Arkhipov scheme, it is unlikely that a better classical algorithm can be found,” he says.

However, Weedbrook points out that as yet, and in contrast to Google’s Sycamore, the Chinese team’s photonic circuit is not programmable, so at this point “it cannot be used for solving practical problems”.

But he adds that if the team is able to build an efficient enough programmable chip, several important computational problems could be solved. Among those are predicting how proteins dock to one another and how molecules vibrate, says Lu.

Weedbrook notes that photonic quantum computing started later than the other approaches, but it could now “potentially leap-frog the rest”. At any rate, he adds, “It is only a matter of time before quantum computers will leave classical computers in the dust.”

https://scitechdaily.com/ai-boosted-by-parallel-convolutional-light-based-processors/

"AI Boosted by Parallel Convolutional Light-Based Processors

By EPFL JANUARY 7, 2021

Matrix Multiplications Light Processor

Schematic representation of a processor for matrix multiplications which runs on light. Credit: University of Oxford

The exponential growth of data traffic in our digital age poses some real challenges on processing power. And with the advent of machine learning and AI in, for example, self-driving vehicles and speech recognition, the upward trend is set to continue. All this places a heavy burden on the ability of current computer processors to keep up with demand.

Now, an international team of scientists has turned to light to tackle the problem. The researchers developed a new approach and architecture that combines processing and data storage onto a single chip by using light-based, or “photonic” processors, which are shown to surpass conventional electronic chips by processing information much more rapidly and in parallel.

The scientists developed a hardware accelerator for so-called matrix-vector multiplications, which are the backbone of neural networks (algorithms that simulate the human brain), which themselves are used for machine-learning algorithms. Since different light wavelengths (colors) don’t interfere with each other, the researchers could use multiple wavelengths of light for parallel calculations. But to do this, they used another innovative technology, developed at EPFL, a chip-based “frequency comb,” as a light source.

“Our study is the first to apply frequency combs in the field of artificial neural networks,” says Professor Tobias Kippenberg at EPFL, one of the study’s leads. Professor Kippenberg’s research has pioneered the development of frequency combs. “The frequency comb provides a variety of optical wavelengths that are processed independently of one another in the same photonic chip.”

“Light-based processors for speeding up tasks in the field of machine learning enable complex mathematical tasks to be processed at high speeds and throughputs,” says senior co-author Wolfram Pernice at Münster University, one of the professors who led the research. “This is much faster than conventional chips which rely on electronic data transfer, such as graphic cards or specialized hardware like TPU’s (Tensor Processing Unit).”

After designing and fabricating the photonic chips, the researchers tested them on a neural network that recognizes hand-written numbers. Inspired by biology, these networks are a concept in the field of machine learning and are used primarily in the processing of image or audio data. “The convolution operation between input data and one or more filters, which can identify edges in an image, for example, are well suited to our matrix architecture,” says Johannes Feldmann, now based at the University of Oxford Department of Materials. Nathan Youngblood (Oxford University) adds: “Exploiting wavelength multiplexing permits higher data rates and computing densities, i.e. operations per area of processor, not previously attained.”

“This work is a real showcase of European collaborative research,” says David Wright at the University of Exeter, who leads the EU project FunComp, which funded the work. “Whilst every research group involved is world-leading in their own way, it was bringing all these parts together that made this work truly possible.”

The study is published in Nature this week, and has far-reaching applications: higher simultaneous (and energy-saving) processing of data in artificial intelligence, larger neural networks for more accurate forecasts and more precise data analysis, large amounts of clinical data for diagnoses, enhancing rapid evaluation of sensor data in self-driving vehicles, and expanding cloud computing infrastructures with more storage space, computing power, and applications software.

Reference: “Parallel convolutional processing using an integrated photonic tensor core” by J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stappers, M. Le Gallo, X. Fu, A. Lukashchuk, A. S. Raja, J. Liu, C. D. Wright, A. Sebastian, T. J. Kippenberg, W. H. P. Pernice and H. Bhaskaran, 6 January 2021, Nature."

https://interestingengineering.com/worlds-fastest-most-powerful-neuromorphic-processor-for-ai-unveiled

"A new optical neuromorphic processor developed by Swinburne University of Technology can operate more than 1000 times faster than any previous processor. The processor for artificial intelligence (AI) functions faster than 10 trillion operations per second (TeraOPs/s).

Optical micro-combs

The invention could revolutionize neural networks and neuromorphic processing in general. “This breakthrough was achieved with ‘optical micro-combs', as was our world-record internet data speed reported in May 2020,” said in a statement Swinburne’s Professor David Moss.

Micro-combs are new devices made up of hundreds of infrared lasers all held on a single chip. Compared to other optical sources, they are much smaller, lighter, faster, and cheaper.

The new innovation demonstrated by the Swinburne team uses a single processor while simultaneously interleaving the data in time, wavelength, and spatial dimensions through a single micro-comb chip.

“In the 10 years since I co-invented them, integrated micro-comb chips have become enormously important and it is truly exciting to see them enabling these huge advances in information communication and processing. Micro-combs offer enormous promise for us to meet the world’s insatiable need for information," added Moss.

Co-lead author of the study Dr. Xingyuan (Mike) Xu explained how this innovative use of micro-combs is giving the researchers a glimpse into the processors of the future.

Cost and energy reductions

Distinguished Professor Arnan Mitchell from RMIT University added that the "technology is applicable to all forms of processing and communications" and will result in significant future cost and energy consumption reductions.

“Convolutional neural networks have been central to the artificial intelligence revolution, but existing silicon technology increasingly presents a bottleneck in processing speed and energy efficiency,” said key supporter of the research team, Professor Damien Hicks from Swinburne and the Walter and Eliza Hall Institute.

“This breakthrough shows how a new optical technology makes such networks faster and more efficient and is a profound demonstration of the benefits of cross-disciplinary thinking, in having the inspiration and courage to take an idea from one field and using it to solve a fundamental problem in another.”"