Monday, July 24, 2023

ZenBleed

ZenBleed Parallel Solvent RS 2023

ZenBleed, So what about 64Bit to 128Bit bleed in SiMD? Mind you; 'Bound to be One' 20:20pm 24/07/2023 (c)RS

XMM 128Bit YMM 256Bit ZMM 512Bit

My theory involves using higher modes for synchronous packing!

What do I mean?

When you have a full system (processes), 64Bit Processes start packing 128Bit registers! Particularly with Float units 182Bits...

Indeed, an old error is packing 128Bit registers with Float unit (FPU) values with rollover!

So?

Two things first positive:

We can pack FPU Register Values into 256Bit (and Zero : vzeroupper & tzcnt (Trailing Zero Count)),
Enabling us to directly utilise SiMD <> with <> FPU!

We can solve the lower XMM to YMM to ZMM differences! How ?

We array-fill the next (wider) register with multiple values, at least 2!

So ?

Parallel processing!

How ?

XMM-128 | ZMM / 4 or 128 * 4! Parallel!*4 Best!

XMM-128 | YMM / 2 or 128 * 2! Parallel!*2 Best!

YMM-256 | ZMM / 2 or 256 * 2! Parallel!*2 Best!

FPU-182 | YMM or 182 * 1 = Single File FPU <> SiMD |

ZMM / 2 or 182+r * 2! Parallel!*2 Best! = Double File FPU <> SiMD

r = Remainder for vzeroupper | tzcnt
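
As a rough illustration of the packing idea above, the sketch below models a wide register as a Python integer and packs several narrower values into it as independent lanes. The names pack_lanes and unpack_lanes are hypothetical, and real code would use SIMD instructions rather than Python integers; this only shows the lane arithmetic.

```python
def pack_lanes(values, lane_bits):
    """Pack values into one wide integer, lane 0 in the lowest bits."""
    mask = (1 << lane_bits) - 1
    packed = 0
    for index, value in enumerate(values):
        if value < 0 or value > mask:
            raise ValueError("value does not fit in one lane")
        packed |= value << (index * lane_bits)
    return packed


def unpack_lanes(packed, lane_bits, lane_count):
    """Split a wide integer back into lane_count values of lane_bits each."""
    mask = (1 << lane_bits) - 1
    return [(packed >> (i * lane_bits)) & mask for i in range(lane_count)]


# Four 128-bit (XMM-sized) values fill one 512-bit (ZMM-sized) register.
xmm_values = [1, 2, 3, (1 << 128) - 1]
zmm = pack_lanes(xmm_values, 128)
```

Unpacking with the same lane width returns the original four values, which is the "Parallel!*4" case in the list above.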

Parallel Operation Principle with CPU Register & OPS division : RS


We will be using the value split:

512/2 = 256*2
256/2 = 128*2
128/2 = 64*2
128/4 = 32*4
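
The value split can be checked with a few lines of arithmetic. This is a minimal sketch; the REGISTER_BITS table and the lanes helper are illustrative names, not part of any library.

```python
REGISTER_BITS = {"XMM": 128, "YMM": 256, "ZMM": 512}


def lanes(register_bits, value_bits):
    """Number of whole values of value_bits that fit in register_bits."""
    if register_bits % value_bits != 0:
        raise ValueError("value width must divide register width")
    return register_bits // value_bits


def split_table():
    """For each register, map every usable value width to its lane count."""
    widths = (32, 64, 128, 256, 512)
    return {
        name: {w: lanes(bits, w) for w in widths if w <= bits}
        for name, bits in REGISTER_BITS.items()
    }
```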

We will therefore be able to use 32Bit, 64Bit, 128Bit, 256Bit, 512Bit values at leisure..
But we have to optimise the entire branch to use a single precision!

Single-type precision operations remove the effects of C++ Fast-float & Half Precision...

No operation errors.. & Parallel operation

reference (Faster Maths & ML)

(c)Rupert S

< Yes Bug Bounty & Solve Bounty : Bounty Bounty >

https://lock.cmpxchg8b.com/zenbleed.html

RS Spectra Mitigations https://science.n-helix.com/2018/01/microprocessor-bug-meltdown.html
ZenBleed Parallel Solvent RS 2023 https://science.n-helix.com/2023/07/zenbleed.html

Core/CPU/GPU security core SSL/TLS BugFix
https://science.n-helix.com/2020/06/cryptoseed.html
https://science.n-helix.com/2019/05/zombie-load.html

Vectors & maths
https://science.n-helix.com/2022/08/simd.html
https://science.n-helix.com/2022/04/vecsr.html
https://science.n-helix.com/2016/04/3d-desktop-virtualization.html
https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html
https://science.n-helix.com/2023/02/smart-compression.html

Networking & Management
https://science.n-helix.com/2023/06/tops.html
https://science.n-helix.com/2023/06/ptp.html
https://science.n-helix.com/2023/06/map.html
https://science.n-helix.com/2023/02/pm-qos.html
https://science.n-helix.com/2022/08/jit-dongle.html
https://science.n-helix.com/2022/06/jit-compiler.html
https://science.n-helix.com/2022/03/ice-ssrtp.html
https://science.n-helix.com/2022/01/ntp.html

Faster Maths & ML
https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html
https://science.n-helix.com/2021/02/multi-operation-maths.html
https://science.n-helix.com/2021/11/parallel-execution.html
https://science.n-helix.com/2022/12/math-error-solve.html
https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
https://science.n-helix.com/2022/10/ml.html

Focus on Quality
https://science.n-helix.com/2022/09/ovccans.html
https://science.n-helix.com/2022/11/frame-expand-gen-3.html
https://science.n-helix.com/2022/03/fsr-focal-length.html

https://blog.cloudflare.com/zenbleed-vulnerability/

https://www.theverge.com/2023/7/25/23806705/amd-ryzen-cpu-processor-zenbleed-vulnerability-exploit-bug

************* Reportage >

Introduction

All x86-64 CPUs have a set of 128-bit vector registers called the XMM registers. You can never have enough bits, so recent CPUs have extended the width of those registers up to 256-bit and even 512-bits.

The 256-bit extended registers are called YMM, and the 512-bit registers are ZMM.

These big registers are useful in lots of situations, not just number crunching! They’re even used by standard C library functions, like strcmp, memcpy, strlen and so on.

Let’s take a look at an example. Here are the first few instructions of glibc’s AVX2 optimized strlen:


(gdb) x/20i __strlen_avx2
...
<__strlen_avx2+9>: vpxor xmm0,xmm0,xmm0
...
<__strlen_avx2+29>: vpcmpeqb ymm1,ymm0,YMMWORD PTR [rdi]
<__strlen_avx2+33>: vpmovmskb eax,ymm1
...
<__strlen_avx2+41>: tzcnt eax,eax
<__strlen_avx2+45>: vzeroupper
<__strlen_avx2+48>: ret

The full routine is complicated and handles lots of cases, but let’s step through this simple case. Bear with me, I promise there’s a point!

The first step is to initialize ymm0 to zero, which is done by just xoring xmm0 with itself.

> vpxor xmm0, xmm0, xmm0
vpcmpeqb ymm1, ymm0, [rdi]
vpmovmskb eax, ymm1
tzcnt eax, eax
vzeroupper

Here rdi contains a pointer to our string, so vpcmpeqb will check which bytes in ymm0 match our string, and store the result in ymm1.

As we’ve already set ymm0 to all zero bytes, only nul bytes will match.

vpxor xmm0, xmm0, xmm0
> vpcmpeqb ymm1, ymm0, [rdi]
vpmovmskb eax, ymm1
tzcnt eax, eax
vzeroupper

Now we can extract the result into a general purpose register like eax with vpmovmskb.

Any nul byte will create a 1 bit, and any other value will create a 0 bit.

vpxor xmm0, xmm0, xmm0
vpcmpeqb ymm1, ymm0, [rdi]
> vpmovmskb eax, ymm1
tzcnt eax, eax
vzeroupper

Finding the first zero byte is now just a case of counting the number of trailing zero bits.

That’s a common enough operation that there’s an instruction for it - tzcnt (Trailing Zero Count).

vpxor xmm0, xmm0, xmm0
vpcmpeqb ymm1, ymm0, [rdi]
vpmovmskb eax, ymm1
> tzcnt eax, eax
vzeroupper

Now we have the position of the first nul byte, in just four machine instructions!
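
The four steps can be emulated in software to see what each instruction contributes. The sketch below is a simplified model that assumes a single 32-byte block; the Python functions only borrow the instruction names and are not bindings to the real instructions.

```python
def vpcmpeqb(block, zero=0):
    """Byte-wise compare: 0xFF where the byte equals zero, else 0x00."""
    return bytes(0xFF if b == zero else 0x00 for b in block)


def vpmovmskb(vec):
    """Collect the top bit of every byte into a mask (byte 0 -> bit 0)."""
    mask = 0
    for i, b in enumerate(vec):
        mask |= (b >> 7) << i
    return mask


def tzcnt(mask, width=32):
    """Count trailing zero bits; returns width when the mask is zero."""
    if mask == 0:
        return width
    return (mask & -mask).bit_length() - 1


def strlen_first_block(block):
    """Position of the first nul byte in a 32-byte block (32 if none)."""
    if len(block) != 32:
        raise ValueError("this sketch only handles one 32-byte block")
    return tzcnt(vpmovmskb(vpcmpeqb(block)))


example = b"hello\x00" + b"x" * 26
length = strlen_first_block(example)
```

The real glibc routine also handles alignment, page boundaries and longer strings, which this sketch leaves out.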

You can probably imagine just how often strlen is running on your system right now, but suffice to say, bits and bytes are flowing into these vector registers from all over your system constantly.

Zeroing Registers

You might have noticed that I missed one instruction, and that’s vzeroupper.

vpxor xmm0, xmm0, xmm0
vpcmpeqb ymm1, ymm0, [rdi]
vpmovmskb eax, ymm1
tzcnt eax, eax
> vzeroupper

You guessed it, vzeroupper will zero the upper bits of the vector registers.

The reason we do this is because if you mix XMM and YMM registers, the XMM registers automatically get promoted to full width. It’s a bit like integer promotion in C.

This works fine, but superscalar processors need to track dependencies so that they know which operations can be parallelized. This promotion adds a dependency on those upper bits, and that causes unnecessary stalls while the processor waits for results it didn’t really need.

These stalls are what glibc is trying to avoid with vzeroupper. Now any future results won’t depend on what those bits are, so we safely avoid that bottleneck!

The Vector Register File

Now that we know what vzeroupper does, how does it do it?

Your processor doesn’t have a single physical location where each register lives, it has what’s called a Register File and a Register Allocation Table. This is a bit like managing the heap with malloc and free, if you think of each register as a pointer. The RAT keeps track of what space in the register file is assigned to which register.

In fact, when you zero an XMM register, the processor doesn’t store those bits anywhere at all - it just sets a flag called the z-bit in the RAT. This flag can be applied to the upper and lower parts of YMM registers independently, so vzeroupper can simply set the z-bit and then release any resources assigned to it in the register file.
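
A toy model may help to picture this. The class below is only a sketch of the idea in the paragraph above, not a description of the real hardware: the class name, the dictionary layout and the slot numbering are all assumptions made for illustration.

```python
class RegisterAllocationTable:
    """Toy RAT: maps (register, half) to a physical slot plus a z-bit."""

    def __init__(self, physical_slots):
        self.physical = [0] * physical_slots          # the register file
        self.free_slots = list(range(physical_slots))
        self.entries = {}                             # (reg, half) -> entry

    def _entry(self, reg, half):
        return self.entries.setdefault((reg, half), {"slot": None, "z": True})

    def write(self, reg, half, value):
        entry = self._entry(reg, half)
        if entry["slot"] is None:
            entry["slot"] = self.free_slots.pop(0)
        self.physical[entry["slot"]] = value
        entry["z"] = False

    def zero(self, reg, half):
        """Zeroing stores nothing: set the z-bit and release the slot."""
        entry = self._entry(reg, half)
        if entry["slot"] is not None:
            self.free_slots.append(entry["slot"])
            entry["slot"] = None
        entry["z"] = True

    def read(self, reg, half):
        entry = self._entry(reg, half)
        return 0 if entry["z"] else self.physical[entry["slot"]]

    def vzeroupper(self, regs):
        for reg in regs:
            self.zero(reg, "upper")


rat = RegisterAllocationTable(physical_slots=4)
rat.write("ymm0", "lower", 0xAAAA)
rat.write("ymm0", "upper", 0xBBBB)
rat.vzeroupper(["ymm0"])
```

After vzeroupper the upper half reads as zero and its slot is back on the free list, yet the old value is still sitting in the register file; no zero bits were ever written.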

Figure (Z-Bit): A register allocation table (left) and a physical register file (right).

Speculation

Hold on, there’s another complication! Modern processors use speculative execution, so sometimes operations have to be rolled back.

What should happen if the processor speculatively executed a vzeroupper, but then discovers that there was a branch misprediction? Well, we will have to revert that operation and put things back the way they were… maybe we can just unset that z-bit?

If we return to the analogy of malloc and free, you can see that it can’t be that simple - that would be like calling free() on a pointer, and then changing your mind!

That would be a use-after-free vulnerability, but there is no such thing as a use-after-free in a CPU… or is there?

Spoiler: yes there is 🙂
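
The malloc and free analogy can be made concrete with a toy model. Everything here is an assumption for illustration (a one-slot register file shared by two contexts, and a recovery step that only flips the z-bit back); it is not how the real silicon is built, and it is not exploit code.

```python
class RegisterFile:
    """Shared physical storage with a free list, like a tiny heap."""

    def __init__(self, slots):
        self.data = [0] * slots
        self.free = list(range(slots))

    def allocate(self, value):
        slot = self.free.pop()
        self.data[slot] = value
        return slot

    def release(self, slot):
        # Like free(): the slot becomes reusable, its contents stay put.
        self.free.append(slot)


class UpperHalf:
    """Toy RAT entry for the upper half of one YMM register."""

    def __init__(self, regfile):
        self.regfile = regfile
        self.slot = None
        self.zbit = True

    def write(self, value):
        if not self.zbit:
            self.regfile.release(self.slot)
        self.slot = self.regfile.allocate(value)
        self.zbit = False

    def read(self):
        return 0 if self.zbit else self.regfile.data[self.slot]

    def vzeroupper(self):
        if not self.zbit:
            self.regfile.release(self.slot)
        self.zbit = True

    def naive_rollback(self):
        # Buggy recovery: the flag is restored, but the slot was already freed.
        self.zbit = False


regfile = RegisterFile(slots=1)
attacker = UpperHalf(regfile)
victim = UpperHalf(regfile)

attacker.write(0x1111)        # attacker owns the only slot
attacker.vzeroupper()         # speculative zeroing releases the slot
victim.write(0xC0FFEE)        # victim's data lands in the reused slot
attacker.naive_rollback()     # misprediction: only the z-bit is unset
leaked = attacker.read()      # attacker now reads the victim's value
```

The stale slot number in the attacker's entry plays the role of the dangling pointer.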

Figure (Zenbleed Demo): This animation shows why resetting the z-bit is not sufficient.

Vulnerability

It turns out that with precise scheduling, you can cause some processors to recover from a mispredicted vzeroupper incorrectly!

This technique is CVE-2023-20593 and it works on all Zen 2 class processors, which includes at least the following products:

AMD Ryzen 3000 Series Processors
AMD Ryzen PRO 3000 Series Processors
AMD Ryzen Threadripper 3000 Series Processors
AMD Ryzen 4000 Series Processors with Radeon Graphics
AMD Ryzen PRO 4000 Series Processors
AMD Ryzen 5000 Series Processors with Radeon Graphics
AMD Ryzen 7020 Series Processors with Radeon Graphics
AMD EPYC “Rome” Processors

The bug works like this: first of all you need to trigger something called the XMM Register Merge Optimization, followed by a register rename and a mispredicted vzeroupper. This all has to happen within a precise window to work.

We now know that basic operations like strlen, memcpy and strcmp will use the vector registers - so we can effectively spy on those operations happening anywhere on the system! It doesn’t matter if they’re happening in other virtual machines, sandboxes, containers, processes, whatever!

This works because the register file is shared by everything on the same physical core. In fact, two hyperthreads even share the same physical register file.

Don’t believe me? Let’s write an exploit 🙂

Exploitation

There are quite a few ways to trigger this, but let’s examine a very simple example.

vcvtsi2s{s,d} xmm, xmm, r64
vmovdqa ymm, ymm
jcc overzero
vzeroupper
overzero:
nop

Here cvtsi2sd is used to trigger the merge optimization. It’s not important what cvtsi2sd is supposed to do, I’m just using it because it’s one of the instructions the manual says uses that optimization.

Then we need to trigger a register rename, vmovdqa will work. If the conditional branch is taken but the CPU predicts the not-taken path, the vzeroupper will be mispredicted and the bug occurs!

Optimization

Figure: Exploit Running.

It turns out that mis-predicting on purpose is difficult to optimize! It took a bit of work, but I found a variant that can leak about 30 kb per core, per second.

This is fast enough to monitor encryption keys and passwords as users login!

We’re releasing our full technical advisory, along with all the associated code today. Full details will be available in our security research repository.

If you want to test the exploit, the code is available here.

Note that the code is for Linux, but the bug is not dependent on any particular operating system - all operating systems are affected!

Discovery

I found this bug by fuzzing, big surprise 🙂 I’m not the first person to apply fuzzing techniques to finding hardware flaws. In fact, vendors fuzz their own products extensively - the industry term for it is Post-Silicon Validation.

So how come this bug wasn’t found earlier? I think I did a couple of things differently, perhaps with a new perspective as I don’t have an EE background!

Feedback

The best performing fuzzers are guided by coverage feedback. The problem is that there is nothing really analogous to code coverage in CPUs… However, we do have performance counters!

These will let us know when all kinds of interesting architectural events happen.

Feeding this data to the fuzzer lets us gently guide it towards exploring interesting features that we wouldn’t have been able to find by chance alone!
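
The feedback loop can be sketched in a few lines. This is a toy, not the author's fuzzer: the instruction set is made up, and toy_counters merely stands in for real performance counters by reporting which adjacent opcode pairs a program contains.

```python
import random

OPCODES = ["add", "sub", "mov", "xor", "shl"]  # hypothetical toy instruction set


def toy_counters(program):
    """Stand-in for performance counters: the set of adjacent opcode pairs."""
    return {(a, b) for a, b in zip(program, program[1:])}


def fuzz(rounds, seed=0):
    """Keep only mutated programs that trigger an event not seen before."""
    rng = random.Random(seed)
    corpus = [[rng.choice(OPCODES)]]
    seen = set()
    for _ in range(rounds):
        child = list(rng.choice(corpus))
        if len(child) < 8 and rng.random() < 0.7:
            # Grow the program by one random instruction.
            child.insert(rng.randrange(len(child) + 1), rng.choice(OPCODES))
        else:
            # Replace one instruction in place.
            child[rng.randrange(len(child))] = rng.choice(OPCODES)
        events = toy_counters(child)
        if not events <= seen:   # new event observed: keep this input
            seen |= events
            corpus.append(child)
    return corpus, seen


corpus, seen = fuzz(rounds=2000)
```

Inputs that add nothing new are discarded, so the corpus drifts towards programs that exercise events the fuzzer has not yet seen.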

It was challenging to get the details right, but I used this to teach my fuzzer to find interesting instruction sequences. This allowed me to discover features like merge optimization automatically, without any input from me!

Oracle

When we fuzz software, we’re usually looking for crashes. Software isn’t supposed to crash, so we know something must have gone wrong if it does.

How can we know if a CPU is executing a randomly generated program correctly? It might be completely correct for it to crash!

Well, a few solutions have been proposed to this problem. One approach is called reversi. The general idea is that for every random instruction you generate, you also generate the inverse (e.g. ADD r1, r2 → SUB r1, r2). Any deviation from the initial state at the end of execution must have been an error, neat!
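
A minimal sketch of the reversi idea, assuming a made-up three-instruction machine with 64-bit wrap-around registers. The helper names are hypothetical and this is not the published reversi tool.

```python
import random

MASK = (1 << 64) - 1

# Toy instructions over 64-bit registers, and the inverse of each one.
FORWARD = {
    "add": lambda a, b: (a + b) & MASK,
    "sub": lambda a, b: (a - b) & MASK,
    "xor": lambda a, b: a ^ b,
}
INVERSE = {"add": "sub", "sub": "add", "xor": "xor"}


def generate(rng, length, regs):
    """Random program; dst and src differ so every step stays invertible."""
    program = []
    for _ in range(length):
        op = rng.choice(sorted(FORWARD))
        dst, src = rng.sample(regs, 2)
        program.append((op, dst, src))
    return program


def with_inverse(program):
    """Append the inverse of every instruction, in reverse order."""
    undo = [(INVERSE[op], dst, src) for op, dst, src in reversed(program)]
    return program + undo


def run(program, state):
    state = dict(state)
    for op, dst, src in program:
        state[dst] = FORWARD[op](state[dst], state[src])
    return state


rng = random.Random(1)
regs = ["r1", "r2", "r3"]
initial = {r: rng.getrandbits(64) for r in regs}
program = generate(rng, 50, regs)
final = run(with_inverse(program), initial)
```

On a correct executor final equals initial; any difference would point at an execution error.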

The reversi approach is clever, but it makes generating testcases very complicated for a CISC architecture like x86.

A simpler solution is to use an oracle. An oracle is just another CPU or a simulator that we can use to check the result. If we compare the results from our test CPU to our oracle CPU, any mismatch would suggest that something went wrong.

I developed a new approach with a combination of these two ideas; I call it Oracle Serialization.

Oracle Serialization

As developers we monitor the macro-architectural state, that’s just things like register values. There is also the micro-architectural state which is mostly invisible to us, like the branch predictor, out-of-order execution state and the instruction pipeline.

Serialization lets us have some control over that, by instructing the CPU to reset instruction-level parallelism. This includes things like store/load barriers, speculation fences, cache line flushes, and so on.

The idea of a Serialized Oracle is to generate a random program, then automatically transform it into a serialized form.

A randomly generated sequence of instructions, and the same sequence but with randomized alignment, serialization and speculation fences added.

Random program            Serialized form
movnti [rbp+0x0],ebx      movnti [rbp+0x0],ebx
                          sfence
rcr dh,1                  rcr dh,1
                          lfence
sub r10, rax              sub r10, rax
                          mfence
rol rbx, cl               rol rbx, cl
                          nop
xor edi,[rbp-0x57]        xor edi,[rbp-0x57]

These two programs might have very different performance characteristics, but they should produce identical output. The serialized form can now be my oracle!

If the final states don’t match, then there must have been some error in how they were executed micro-architecturally - that could indicate a bug.
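
The comparison can be sketched with a toy executor. The fault modelled here (a sub directly after an add giving a wrong result) is entirely made up to show how a mismatch surfaces; the function names are hypothetical too.

```python
import random

MASK = (1 << 64) - 1
# Fence mnemonics; "nop" stands in for alignment padding.
FENCES = ("sfence", "lfence", "mfence", "nop")


def serialize(program, rng):
    """Insert a randomly chosen fence (or nop) after every instruction."""
    out = []
    for ins in program:
        out.append(ins)
        out.append((rng.choice(FENCES),))
    return out


def run(program, state, buggy=False):
    """Toy executor. With buggy=True, a 'sub' that directly follows an
    'add' is computed incorrectly (a made-up micro-architectural fault)."""
    state = dict(state)
    previous = None
    for ins in program:
        op = ins[0]
        if op in FENCES:
            previous = op
            continue
        _, dst, src = ins
        if op == "add":
            state[dst] = (state[dst] + state[src]) & MASK
        elif op == "sub":
            result = (state[dst] - state[src]) & MASK
            if buggy and previous == "add":
                result ^= 1
            state[dst] = result
        previous = op
    return state


def mismatch(program, state, rng, buggy):
    """True when the plain and serialized runs end in different states."""
    plain = run(program, state, buggy)
    oracle = run(serialize(program, rng), state, buggy)
    return plain != oracle


program = [("add", "r1", "r2"), ("sub", "r1", "r3")]
state = {"r1": 10, "r2": 3, "r3": 4}
rng = random.Random(0)
```

On the healthy executor both forms agree, while the faulty one disagrees with its own serialized run, which is the signal the oracle looks for.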

This is exactly how we first discovered this vulnerability, the output of the serialized oracle didn’t match!

Solution

We reported this vulnerability to AMD on the 15th May 2023.

AMD have released a microcode update for affected processors. Your BIOS or Operating System vendor may already have an update available that includes it.

Workaround

It is highly recommended to use the microcode update.

If you can’t apply the update for some reason, there is a software workaround: you can set the chicken bit DE_CFG[9].

This may have some performance cost.

Linux

You can use msr-tools to set the chicken bit on all cores, like this:

# wrmsr -a 0xc0011029 $(($(rdmsr -c 0xc0011029) | (1<<9)))
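
The arithmetic inside that command is a read-modify-write that sets bit 9 and leaves every other bit alone. The sketch below only shows that calculation and does not touch any hardware; the helper names are illustrative.

```python
DE_CFG_MSR = 0xC0011029   # MSR address taken from the wrmsr command above
CHICKEN_BIT = 9           # DE_CFG[9]


def with_chicken_bit(current_value):
    """Value to write back: the current value with bit 9 set."""
    return current_value | (1 << CHICKEN_BIT)


def is_chicken_bit_set(value):
    """True when bit 9 is already set in the given MSR value."""
    return bool(value & (1 << CHICKEN_BIT))
```

Applying with_chicken_bit twice gives the same result as applying it once, so re-running the command is harmless.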

FreeBSD

On FreeBSD you would use cpucontrol(8).

Others

If you’re using some other operating system and don’t know how to set MSRs, ask your vendor for assistance.

Note that it is not sufficient to disable SMT.

Detection

I am not aware of any reliable techniques to detect exploitation. This is because no special system calls or privileges are required.

It is definitely not possible to detect improper usage of vzeroupper statically, please don’t try!

Conclusion
It turns out that memory management is hard, even in silicon 🙂

Acknowledgements

This bug was discovered by me, Tavis Ormandy from Google Information Security!

I couldn’t have found it without help from my colleagues, in particular Eduardo Vela Nava and Alexandra Sandulescu. I also had help analyzing the bug from Josh Eads.

3DChiplet Side By Side 3D Magic with 3D Trenching

3DChiplet Side By Side 3D Magic with 3D Trenching 2021-2023

3D Fabric (5800X3D) is hard in production, but the delivery is the problem, so I have another proposal, called:

Side By Side 3D Magic (c)Rupert S


Yes 3D Chips are good for cache, Simply connecting chiplets does not require 3D or 3D Stacking,

Side By Side 3D Magic (c)Rupert S

https://science.n-helix.com

Has Layered Chip wafer & PCB Board with interweaved wires:

Carbon fibers, Copper or aluminum or Iron, Not a problem

Through the PCB Chip board, These micro tunnels provide all the PCI & Chip tunnels that a Board could require!

Layered micro tunnel imprinted PCB can have 3 wires per layer (crosswise, Diagonal & Ordered form)

Additionally, tunneling up and down is not a problem, for you simply layer a connection point that is welded to the next layer as it is laid on top..

Micro film is available, As this is both electrostatic & noise resistant composite.

Since this is a micro multiformat PCB / Chip fabric, At no time do you have to worry about dampness or heat split when made well.

https://www.youtube.com/watch?v=pBZQeW1eeEw

Example of 3D Layered PCB, A bit too rigid but good for a phone or telescope Board..

Chips can be placed inside if you need to! for space reasons; Embed the chiplet..

PCB is ideal for this task; Common view PCB is large space & coy compact?

3D PCB is a space saver & 3D Network Ethernet/Chip IO memory ops

PCB Wire mesh (internal networks) = - |, PCB Layer = _

______(CHIP With Connect)________
----------|-----|-----|----|----|-----------------
_______\____|___\___\_|___________
--------(cooling & IO Chip)--------------
_______|__|_______|___|___________


***********

07:39 23/07/2023 (c)Rupert S

Circuit 3D Print with laser (c)RS


While trenching semiconductors work, in space (vacuum) electrical energy transfers through vacuum!

So you have to use a resistor material in the trench, this is not impossible if you imbed ceramic formulas with a laser!

You can, however, with this technology go up to 2.7v on 5nm; because higher voltages are faster & more resistant, this makes sense..

The trench (hole) Formatic processor 3D layering technology with:

Circuit = C, Trench = \_/ , resistor = r, Circuit in trench = c, raised bit Circuit or resistor = /C\

C\r/C C\r/C C\r/C

C\_/C C\_/C C\_/C

C\_/C C\r/C C\_/C

/C\c/C\r/C\r/C\r/C\

The challenge of using traditional circuit printing methods in space is that the vacuum can cause the circuit to degrade over time..

This is because the vacuum can strip away the electrons that carry current in the circuit.

3D laser circuit printing could help to mitigate this problem by creating a very dense and compact circuit. This reduces the surface area of the circuit that is exposed to the vacuum and it helps to protect the circuit from the harsh environment of space.

& Also..

One of the challenges of using trench & processor circuit methods in space is that electrical energy transfers through the vacuum.

This means that you need to use a resistor material in the trench,

It is possible to imbed ceramic formulas with a laser; This could be a promising way to create resistors in/for space.

However, 3D laser circuit printing could help to mitigate this problem; As the laser can be used to create a very precise and durable circuit.

This technology is meant for the world but also with spatial integrity for deep space & So functionally Rugged/Rigid in use & Function.

Additional thoughts on the challenges and potential of 3D laser circuit printing for space applications:

Challenges:

The vacuum of space can be very harsh on materials, so it is important to use materials that are resistant to radiation and temperature extremes.

The lack of gravity can also make it difficult to print precise circuits.

Potential:

3D laser circuit printing could allow for the creation of more complex and efficient circuits.

3D laser circuit printing could make it possible to print circuits on-demand; Which could be a major advantage for space missions.

It could also be used to create circuits that are more resistant to the harsh environment of space.

(c)Rupert S

Application 23/07/2023

https://science.n-helix.com/2023/07/3dchiplet.html

https://science.n-helix.com/2023/06/map.html

https://science.n-helix.com/2023/06/ptp.html

https://science.n-helix.com/2023/06/tops.html

https://science.n-helix.com/2022/01/ntp.html


*********************

Tilly Arms; The girl with no arms, sympathetic nerve response & frequency rate : Operation Cyborg RS 2023

Tilly Arms; The girl with no arms

I think that the arms are very good, But she needs more!
Clearly artificial skin in silver would do the trick?

I noticed that she has control of them through her stimulated skin.... at the elbow....
Now I saw a study that clearly would help....

Neurons respond on training to noisy signals- & clear notes+

We can clearly get a sympathetic skin monitor to receive the feelings; By listening to skin cell responses ....

Now I feel that, since a 9v battery stings the tongue, 2 volts is a bit too much on sweaty skin, so 1.8 is around right? Dr

https://www.youtube.com/shorts/pmIoL-Ja_Co

Depending upon how much resistance there is in skin, might even help with Lightning & Shocks...

RS

20:08 23/07/2023 What have we learned; Brain Cells : RS : https://www.youtube.com/watch?v=bEXefdbQDjw

Brain Cells respond to:

{ Clear tones }: well, to { Entropic Noisy tones }: unwell
{ Clean Image } to { Entropic Noisy Image }

Cell electrode networks begin at 0.75cm for tasks like DOOM

Cell inputs are learned,
Dynamic connections form to the electrodes & We use logic on the inputs...

Here the strategy is to use tones & noise to respond to the doom player in motion.

The cell structure is clearly not a problem at 3700 * 4 mm

Rupert S

*

AnPa_Wave - Analogue Pattern Wave Vector SiMD Unit : (c)RS


The base symphony is harmony, In other words waveforms; There are a couple of Simple methods that really work:

High performance Float values F16, F32, F64, FPU

Q-Bit Quantum; All forms of Quantum wave work
Radio waves;
Light patterns
Photon wave patterns; single & multiple
Sound hardware; 1 to 3 Bit DAC; Audio conversions; Sample range
Analogue chips that work on harmony & frequency
SVM Elliptic curve maths
Sin, Arc, Tan, Time, Vector

In essence Harmony & frequency is the equivalent of Complex Elliptic curve maths

A Music note score suffices to specify harmony basics:

Waveform shape in 3D
Harmony / Disharmony
Vibration High / Vibration Low
Power High / Power Low
Volts High / Volts Low
Watts High / Watts Low

(c)Rupert S

https://science.n-helix.com/2023/07/3dchiplet.html

https://science.n-helix.com/2023/06/map.html

Wonderful Wave-Pattern Analogue waveforms in meta materials - Pattern recognition in reciprocal space with a magnon-scattering reservoir
https://www.nature.com/articles/s41467-023-39452-y.pdf