Sunday, November 21, 2021

Monticarlo Workload Selector

Cash_Bo_Montin Selector (c)Rupert S for Cache & System Operations Optimisation & Compute

CBoMontin Processor Scheduler - Good for consoles & RT Kernels (For HTTP+JS HyperThreading)

*

Cache Loaded Runtime : CLR


OpenCL JIT Compiler inclusion as main loadable object compiler,

Mainly because the CBoMontin Processor Scheduler is intended to run in cache; We need to optimise the scheduler for each Processor Cache size & depth,

Ordering instructions from inside the Processor Cache requires optimised code; We create our task list interfaces (UDP & TCP Port approximates) inside the cache..

We prefetch our workloads from kernel space & user space & order them into our processor workflows,

The main process polls priority & nice values for each task & can select the processing order..

We would be prioritising the tasks onto the same processor as the parent task if those tasks are in the same application..

For that we would have to know if the task requires out of order execution or in order; Tasks such as video rendering can afford to have Audio & Video on two threads; However time stamps will be required to be precise!
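The polling & ordering above can be sketched as follows; a minimal illustration, where `Task`, its field names & the pinning rule are assumptions for illustration rather than the actual scheduler interface:

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    app: str         # owning application
    priority: int    # higher = scheduled earlier
    nice: int        # lower = more favoured within a priority level
    parent_cpu: int  # processor the parent task runs on

def order_tasks(tasks):
    # The main process polls priority & nice for each task and
    # selects the processing order from both values.
    ordered = sorted(tasks, key=lambda t: (-t.priority, t.nice))
    # Tasks in the same application stay on the parent task's processor.
    return [(t.name, t.parent_cpu) for t in ordered]
```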

The actual Selector is compiled optimally based on:

Processor Cache size

Instruction cache size
Data cache size
Processor Thread count

Available task queues
Optimal Queue Size
Optimal Task size

Priority sort based on applied function groups combined with optimised processor selection,
Processor function optimisations
Processor Features list & preference sorting optimisation

Preferred thread & processor for sustained & fast function & reduced processor to processor transfers..

From that we compile our Cache Loaded Runtime & optimise our Processor, Process & priority.
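One hypothetical way the parameter list above could feed that compile step, deriving queue geometry from cache size & thread count; the 256 KB list budget & 64-byte task entry are illustrative numbers only:

```python
def selector_parameters(l3_bytes, threads, task_entry_bytes=64):
    # Illustrative: budget a slice of L3 for the resident task list
    # and derive one queue per hardware thread from it.
    list_budget = min(l3_bytes, 256 * 1024)  # cap the resident list
    tasks_per_queue = list_budget // (threads * task_entry_bytes)
    return {"queues": threads, "tasks_per_queue": tasks_per_queue}
```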

RS

QoS To Optimise the routing: Task Management To optimise the process
https://science.n-helix.com/2021/11/monticarlo-workload-selector.html
https://science.n-helix.com/2023/02/pm-qos.html

Transparent Task Sharing Protocols
https://science.n-helix.com/2022/08/jit-dongle.html
https://science.n-helix.com/2022/06/jit-compiler.html

*

Monticarlo Workload Selector


CPU, GPU, APU, SPU, ROM, Kernel & Operating system :

CPU/GPU/Chip/Kernel Cache & Thread Work Operations management

In/out Memory operations & CU feature selection are ordered into groups based on:

CU Selection is preferred by Chip features used by code & Cache in-lining in the same group.

Global Use (In application or common DLL) Group Core CU
Localised Thread group, Sub prioritised to Sub CU in location of work use
Prioritised to local CU with Chip feature available & with lower utilisation (lowers latency)
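A minimal sketch of that CU preference order, assuming a simple list-of-dicts description of the CUs; the keys `group`, `features` & `load` are invented for illustration:

```python
def select_cu(cus, feature, parent_group):
    # Prefer CUs in the parent's group (cache in-lining), restricted to
    # CUs with the required chip feature, then lowest utilisation
    # (lowers latency).
    usable = [c for c in cus if feature in c["features"]]
    return min(usable,
               key=lambda c: (c["group"] != parent_group, c["load"]))["id"]
```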

{ Monticarlos In/Out }
{ System input load Predictable Statistic analysis }
{ Monticarlo Assumed averages per task }
{ System: IO, IRQ, DMA, Data Motion }

{ Process by Advantage }
{ Process By Task FeatureSet }
{ Process by time & Tick & Clock Cycle: Estimates }
{ Monticarlos Out/In }

Random task & workload optimiser ,
Task & Workload Assignment Requestor,
Pointer Allocator,
Cache RAM Allocation System.

Multithreaded pointer Cache Object tasks & management.

{SEV_TDL_TDX Kernel Interaction mount point: Input & Output by SSL Code Class}:
{Code Runtime Classification & Arch:Feature & Location Store: Kernel System Interaction Cache Flow Buffer}
https://is.gd/SEV_SSLSecureCore
https://is.gd/SSL_DRM_CleanKernel
*

Based upon the fact that you can input Monticarlo Semi Random Ordered workloads into the core process:

*Core Process Instruction*

CPU, Cache, Light memory load job selector
Resident in Cache L3 for 256KB+- Cache list + Code 4KB L2 with list access to L3

L2:L3 <> L1 Data + Instruction

*formula*


(c)RS 12:00 to 14:00 Haptic & 3D Audio : Group Cluster Thread SPU:GPU CU

Merge = "GPU+CPU SiMD" 3D Wave (Audio 93% * Haptic 7%)

Grouping selector
3D Wave selector

Group Property value A = Audio S=Sound G=Geometry V=Video H=Haptic B=Both BH=BothHaptic

CPU Int : ID+ (group of)"ASGVH"

Float ops FPU Light localised positioning 8 thread

Shader ID + Group 16 Blocks
SiMD/AVX Big Group 2 Cycle
GPU CU / Audio CU (Localised grouping MultiThreads)

https://www.youtube.com/watch?v=cJkx-OLgLzo

*

Task & Workload Assignment Requestor : Memory & Power


We have to bear in mind power requirements & task persistence in the Task & Workload Assignment Requestor :

Knowledge of the operating system's requirements:
Latency list in groups { high processor load requirements > Low processor load requirements } : { latency Estimates }
Ram load , Store & clear {high burst : 2ns < 15ns } GB/s Ordered
Ram load , Store & clear {high burst : 5ns < 20ns } MB/s Disordered

GPU Ram load , Store & clear {high burst : 2ns < 15ns } GB/s Ordered
AUDIO Ram load , Store & clear {high burst : 1ns < 15ns } MB/s Disordered

AUDIO Ram load , Store & clear {high burst : 1ns < 15ns } MB/s Ordered
AUDIO Ram load , Store & clear {high burst : 1ns < 15ns } KB/s Disordered

Network load , Send & Receive {Medium burst : 2ns < 15ns } GB/s Ordered
Network load , Send & Receive {high burst : 1ns < 20ns } MB/s Disordered
Hard drive management & storage {medium : 15ns SSD < 40ns HDD}
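The latency list above could be held as a lookup table & used to order requests, high processor load requirements before low; a sketch with illustrative keys & the windows copied from the list:

```python
# Latency windows in ns, copied from the list above; keys are illustrative.
LATENCY_NS = {
    ("ram", "ordered"):        (2, 15),
    ("ram", "disordered"):     (5, 20),
    ("gpu_ram", "ordered"):    (2, 15),
    ("audio_ram", "ordered"):  (1, 15),
    ("network", "ordered"):    (2, 15),
    ("network", "disordered"): (1, 20),
    ("storage_ssd", "any"):    (15, 15),
    ("storage_hdd", "any"):    (40, 40),
}

def latency_order(requests):
    # Group high processor load requirements before low, then order by
    # the lower bound of the latency estimate.
    return sorted(requests,
                  key=lambda r: (-r["load"], LATENCY_NS[r["kind"]][0]))
```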

*

Also Good for disassociated Asymmetric cores; Since these pose a significant challenge to most software,
However categorising by Processor function yields remarkable classification abilities:

Processor Advanced Instruction set
Core speed
Importance

Location in association with a group of baton passing & interthread messaging & cache,
Symmetry classed processes & threads.
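A sketch of that classification for asymmetric cores, assuming each core is described by its instruction set, clock & importance; all field names are illustrative:

```python
def classify_cores(cores, needed_isa):
    # Categorise by Processor Advanced Instruction set, Core speed &
    # Importance; cores lacking the needed instruction set sort last.
    return sorted(cores,
                  key=lambda c: (needed_isa not in c["isa"],
                                 -c["ghz"], -c["importance"]))
```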

*

Bo-Montin Workload Compute :&: Hardware Accelerated Audio : 3D Audio Dolby NR & DTS


Hardware Accelerated Audio : 3D Audio Dolby NR & DTS : Project Acoustics : Strangely enough ....
Be more positive about the Audio Block : Dolby & DTS will use it & thereby it appears in games!

Workload Compute : Where you optimise workload lists through SiMD Maths to HASH subtasks into new GPU workloads,

Simply utilize Direct ML to anticipate future motion vectors (As with video)

OpenCL & Direct Compute : Lists & Compute RAM Loads and Shaders to load...

DMA & Reversed DMA (From GPU to & from RAM)
ReBAR to vector compressed textures without intervention of one processor or another...

Compression Block :
KRAKEN & BC Compression & Decompression
&
SiMD Direct Compressed Load using the Cache Block per SiMD Work Group.

Shaders Optimised & compiled in FPU & SiMD Code form for GPU: Compiling Methods:

In advance load & compile : BRT : Before Run Time : task load optimised & ordered Task Executor : Bo-Montin Scheduler

GPU SiMD & FPU (micro 128KB Block encoder : decoder : compiler)
CPU SiMD & FPU (micro 128KB Block encoder : decoder : compiler)

JIT : Just in Time task load optimised & ordered Task Executor : Bo-Montin Scheduler

load & compile :

GPU SiMD & FPU (micro 128KB Block encoder : decoder : compiler)
CPU SiMD & FPU (micro 128KB Block encoder : decoder : compiler)


*

Task manager opportunistically &or Systematic Resource Allocation (c)RS


We also need a direct transport tunnel for data between GPUs of different types,

Firstly my experience is as follows:

I have a RX280x & RX560 & Intel® Movidius™ Neural Compute SDK Python API v2 & both do Python work! When I have this configuration the RX280x is barely used unless clearly utilized independently!

The Task manager & Python needs to directly transfer workloads a processor tasks between each system processor,

Not limited to the primary Processor (4Ghz FX8320E) & the AVX supporting Movidius & to & from the RX280 & RX560; Both however support direct Video rendering & Encoding through DX12,

However the RX6500 does not directly support the AMD Hardware Encode under DX12.1 (New Version 2022-04-21)

& That RX560 comes in handy! if the Video rendering work is directly transferred to RX560 or RX280x & Encoded there!

Therefore I clearly see 2 examples.. & there are more!

Clearly Movidius is advantaged for scaler work on behalf of the Python process & in addition the Upscaling RSR & Dynamic Resolution; We do however need the Task manager to opportunistically or systematically plan the use of resources & even the processor could offload AVX Work.

No-one has this planned & We DO.

*

PM-QoS - Processor Model QoS Tree for TCP, UDP & QUIC


The Method of PM-QoS Roleplayed in a way that Firmware & CPU Prefetch ML Coders can understand.

Environment:
https://science.n-helix.com/2021/11/monticarlo-workload-selector.html
https://science.n-helix.com/2023/02/pm-qos.html
https://science.n-helix.com/2022/03/security-aspect-leaf-hash-identifiers.html


Multiple Busses &or Processor Features in an Open Compute environment with competitive task scheduling

[Task Scheduler] Monticarlo-Workload-Selector

We prioritise data traffic by importance & Need to ensure that all CPU Functions are used...

In the case of a Chiplet GPU We need to assign function groups to CU & QoS is used to assess available Multiple BUSS Capacities over competing merits,
[Merits : Buss Data Capacity, Buss Cycles, Available Features, Function Endpoint]

PM-QoS is a way of Prioritising Buss traffic to processor functions & RAM & Storage Busses that:

States a data array such as:

Buss Width

Divisibility (Example: where you transform a 128Bit buss into 32Bit x 4 Data motions and synchronize the transfers),

Data Transfer Cycles Available

Used Data Rate / Total Data Throughput Rate = N
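The two calculations above, as a direct sketch; the function names are illustrative:

```python
def qos_ratio(used_rate, total_rate):
    # N = Used Data Rate / Total Data Throughput Rate
    return used_rate / total_rate

def split_bus(width_bits, lane_bits=32):
    # Divisibility: a 128Bit buss becomes 32Bit x 4 synchronized
    # data motions.
    assert width_bits % lane_bits == 0
    return width_bits // lane_bits
```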

(c)Rupert S https://science.n-helix.com

Kernel Computation Resources Management :

OpenCL, Direct Compute, Compute Shaders & MipMaps :

Optimisation of all system resource use & management 2022 HPC RS

On the matter of Asymmetric GPU / CPU configuration, As in when 2 GPU are not of the same Class or from different providers,

Such a situation is when the motherboard is NVidia & the GPU is AMD for example.

We need both to work, So how?

Firstly the kind of work matters: Operating System Managed Workload Scheduler : Open CL & Direct X as examples:

Firstly PCI 1+ has DMA Transfers of over 500MB/s so data transfer is not a problem,
Secondly DMA is card based; So a shader can transfer work.
Third the memory transfer can be compressed; Does not need to transition mainly through the CPU..
No Cache Issue; Same for Audio Bus

MipMaping is an example with a low PCI to PCI DMA Transfer cost,
But Shaders & OpenCL or Direct Compute are primary examples,
(Direct Compute & OpenCL workloads are cross compatible & convertible)

Exposing a system's potential does require that a DX11 card be utilized for MipMaps or Texture Storage & operations; Within the capacities of DirectX 11, 12, 12.1 as and when compatible..

Optimisation of all system resource use & management 2022 HPC

Rupert S

*

Innate Smart Access (c)RS


The Smart-access features require 3 things:
[Innate Compression, Decompression, QoS To Optimise the routing, Task Management To optimise the process] : Task Managed Transfer : DMA:PIO : Transparent Task Sharing Protocols

The following is the initiation of the Smart-access Age

https://science.n-helix.com/2023/02/smart-compression.html

QoS To Optimise the routing:Task Management To optimise the process
https://science.n-helix.com/2021/11/monticarlo-workload-selector.html
https://science.n-helix.com/2023/02/pm-qos.html

Transparent Task Sharing Protocols
https://science.n-helix.com/2022/08/jit-dongle.html
https://science.n-helix.com/2022/06/jit-compiler.html

Innate Compression, Decompression
https://science.n-helix.com/2022/03/ice-ssrtp.html
https://science.n-helix.com/2022/09/ovccans.html
https://science.n-helix.com/2022/08/simd.html


 
*

EMS Leaf Allocations & Why we find them useful: (c)RS https://science.n-helix.com


Memory clear through page Voltage removal..

Systematic Cache randomisation flipping (On RAM Cache Directs syncable (RAND Static, Lower quality RAND)) (Why not DEV Write 8 x 16KB (Aligned Streams (2x)) L2 CACHE Reasons)

Anyway in order to do this we Allocate Leaf Pages or Large Pages...
De Allocation invokes scrubbing or VOID Call in the case of a VM.

So in our case VT86 Instructions are quite useful in a Hypervisor;
&So Hypervisor from kernel = WIN!

(c)Rupert S Reference T Clear

*

Atomic: Add custom atomic.h implementation

Now we can use Statistic variance Atomic Counters inside loops with SipHash 32Bit value hashes to add variances to dev/random & quite significantly increase motion in the pool,

But use Main thread interactions with average micro loops to reduce the overall HASH turnover rate..

Modification of the additive kind ADDs to the pre-published value & additionally passes CPU Activity count numbers to the statistic pool; In the same loop main thread.
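A user-space sketch of the idea, assuming a plain counter stands in for the atomic one & blake2s (4-byte digest) stands in for a SipHash-style 32-bit hash:

```python
import hashlib
from itertools import count

_counter = count()  # stand-in for the atomic counter in the loop

def pool_contribution(cpu_activity):
    # Mix the counter value & a CPU activity count into a 32-bit hash
    # and hand the result to the entropy pool; blake2s is used here as
    # a stand-in keyed-style hash, not the kernel's actual SipHash.
    n = next(_counter)
    data = n.to_bytes(8, "little") + cpu_activity.to_bytes(8, "little")
    return int.from_bytes(hashlib.blake2s(data, digest_size=4).digest(),
                          "little")
```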

Rupert S

Atomics & Reference PID/TSC/LeafBlend

https://science.n-helix.com/2022/03/security-aspect-leaf-hash-identifiers.html
Atomics https://lkml.org/lkml/2022/4/12/84
RDPID https://lkml.org/lkml/2022/4/12/143
Opening Time Security Layering Reference PID with RDPID LeafHASH
https://lkml.org/lkml/2022/4/12/300

*

If you could "Decode" Win DLL & particularly the Compiler code plug-in! You could use these on console :

https://bit.ly/DJ_EQ
https://bit.ly/VESA_BT

https://www.youtube.com/watch?v=cJkx-OLgLzo

High performance firmware:



https://is.gd/SEV_SSLSecureCore
https://is.gd/SSL_DRM_CleanKernel



*
More on HRTF 3D Audio

TERMINATOR Interview #Feeling https://www.youtube.com/watch?v=srksXVEkfAs & Yes you want that Conan to sound right in 3D HRTF

Cyberpunk 2077 HDR : THX, DTS, Dolby : Haptic response so clear you can feel the 3D SOUND




*

AES RAND*****


If we had a front door & a back door & we said, "That door is only available exclusively to us", Someone would still want to use our code!
AES is good for one thing! Stopping Cyber Crime!
God Save us from total anarchistic cynicism

Rupert S

/*
  * This function will use the architecture-specific hardware random
- * number generator if it is available.  The arch-specific hw RNG will
- * almost certainly be faster than what we can do in software, but it
- * is impossible to verify that it is implemented securely (as
- * opposed, to, say, the AES encryption of a sequence number using a
- * key known by the NSA).  So it's useful if we need the speed, but
- * only if we're willing to trust the hardware manufacturer not to
- * have put in a back door.
- *
- * Return number of bytes filled in.
+ * number generator if it is available. It is not recommended for
+ * use. Use get_random_bytes() instead. It returns the number of
+ * bytes filled in.
  */

https://lore.kernel.org/lkml/20220209135211.557032-1-Jason@zx2c4.com/t/


RAND : Callback & spinlock

Callback & spinlock are not just Linux : Best we hash &or Encrypt several sources (if we have them)
If we have a pure source of Random.. we like the purity! but 90% of the time we like to hash them all together & keep the quality & source integrally variable to improve complexity.
Rupert S
https://www.spinics.net/lists/linux-crypto/msg61312.html

'function gets random data from the best available source. The current code has a sequence in several places that calls one or more of arch_get_random_long() or related functions, checks the return value(s) and on failure falls back to random_get_entropy(). get_source_long() is intended to replace all such sequences. This is better in several ways. In the fallback case it gives much more random output than random_get_entropy(). It never wasted effort by calling arch_get_random_long() et al. when the relevant config variables are not set. When it does use arch_get_random_long(), it does not deliver raw output from that function but masks it by mixing with stored random data.'
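In that spirit, a user-space sketch (not the kernel code itself) of hashing several sources together so the mix is no weaker than its strongest input; the source choices here are illustrative:

```python
import hashlib
import os
import time

def mixed_bytes(n, extra_sources=()):
    # Hash all available sources together; keeping them in one digest
    # means the output quality is at least that of the best source.
    h = hashlib.sha256()
    h.update(os.urandom(32))                                # OS pool
    h.update(time.perf_counter_ns().to_bytes(8, "little"))  # timing jitter
    for src in extra_sources:                               # e.g. a hw RNG
        h.update(src)
    return h.digest()[:n]
```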

RAND : Callback & spinlock : Code Method


Spinlock IRQ Interrupted upon RAND Pool Transfer > Why not Use DMA Transfer & Memory Buffer Merge with SiMD : AVX Byte Swapping & Merge into present RAM Buffer or Future location with Memory location Fast Table.

Part of Bo-Montin Selector Code:

(CPU & Thread Synced & on same CPU)

(Thread 1 : cpu:1:2:3:4)
(RAND)
(Buffer 1) > SiMD cache & Function :

(Thread 2 : cpu:1:2:3:4)
(Memory Location Table : EMS:XMS:32Bit:64Bit)
(Selection Buffer & Transfer)

(Buffer 1) (Buffer 2) (Buffer 3)
(Entropy Sample : DieHARD : Small)

Rupert S

https://lore.kernel.org/all/20220211011446.392673-1-Jason@zx2c4.com/

Random Initiator : Linus' 50ee7529ec45


Since Linus' 50ee7529ec45 ("random: try to actively add entropy rather than passively wait for it"), the RNG does a haveged-style jitter dance around the scheduler, in order to produce entropy

The key is to initialize with a SEED key; To avoid the seed needing to be replaced too often we Encipher it in a set order with an additive key..

to create the perfect circumstances we utilize 2 seeds:
AES/SHA2/PolyCHA

Initiator math key CH1:8Bit to 32Bit High quality HASH Cryptic
& Key 2 CrH

8Bit to 256Bit : Stored HASH Cryptic

We operate maths on the differential and Crypto the HASH :
AES/SHA2/PolyCHA
CrH 'Math' CH1(1,2,3>)

AES/SHA2/PolyCHA > Save to /dev/random & use

We may also use the code directly to do unique HASH RAND & therefore keep crucial details personal or per application & MultiThreads &or CPU & GPU & Task.
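A minimal sketch of the two-seed scheme, assuming the stored key CrH is a byte string, the math key CH1 a 32-bit integer advanced by a step counter, & SHA-2 as the chosen hash of the AES/SHA2/PolyCHA options:

```python
import hashlib

def seeded_output(crh: bytes, ch1: int, step: int) -> bytes:
    # CrH 'Math' CH1(1,2,3>) : advance the math key by the step counter,
    # combine with the stored seed, then hash; the additive key lets the
    # seed survive many outputs without replacement.
    advanced = (ch1 + step) & 0xFFFFFFFF
    return hashlib.sha256(crh + advanced.to_bytes(4, "little")).digest()
```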

Rupert S

(Spectre & Retpoline Ablation) PreFETCH Statistical Load Adaptive CPU Optimising Task Manager ML(c)RS 2022


Come to think of it, Light encryption 'In State' may be possible in the Cache L3 (the main problem with retpoline) & L2 (secondary) : How?

PFIO_Pol & GPIO Combined with PSLAC TaskManager (CBo_Montin) Processor, Kernel, UserSpace.
 
Byte Swapping for example, or a 16b instruction; If a lightly used instruction is used
(one that is under utilized)
Other XOR SiMD instructions can potentially be used to pre load L2 & L1 Instruction & Data.

Spectre & Retpoline 1% CPU Hit : 75% improved Security : ALL CPU & GPU Processor Types Compatible.

In Terms of passwords & SSL Certificate loads only, The Coding would take 20Minutes & consume only 0.1% of total CPU Time.
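One way to read the light 'In State' idea: mask each cached word with XOR plus a byte swap (both cheap, lightly used operations), with the inverse order restoring the word; purely illustrative:

```python
def mask_word(word: int, key: int) -> int:
    # XOR then byte-swap a 32-bit word while it sits in cache.
    x = (word ^ key) & 0xFFFFFFFF
    return int.from_bytes(x.to_bytes(4, "little"), "big")  # bswap

def unmask_word(masked: int, key: int) -> int:
    # Byte-swap back, then XOR with the same key to recover the word.
    x = int.from_bytes(masked.to_bytes(4, "little"), "big")  # bswap
    return (x ^ key) & 0xFFFFFFFF
```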


HASH Example

https://lkml.org/lkml/2022/3/17/120
https://lkml.org/lkml/2022/3/17/119
https://lkml.org/lkml/2022/3/17/116
https://lkml.org/lkml/2022/3/17/115
https://lkml.org/lkml/2022/3/17/118

https://science.n-helix.com/2022/02/interrupt-entropy.html
In reference to : https://science.n-helix.com/2021/11/monticarlo-workload-selector.html

CPU Statistical load debug 128 Thread :
https://lkml.org/lkml/2022/3/17/243

PFIO_Pol Generic Processor Function IO & Feature Statistics polling + CPUFunctionClass.h + VCache Memory Table Secure HASH


GPIO: Simple logic analyzer using polling : Prefer = Precise Core VClock + GPIO + Processor Function IO & Feature Statistics polling

https://lkml.org/lkml/2022/3/17/216
https://lkml.org/lkml/2022/3/17/215
