Monday, August 29, 2022

JIT Compiler Dongle - The Connection HPC 2022 RS

JIT Compiler Dongle - The Connection HPC 2022 RS (c)Rupert S

The JIT Compiler Dongle makes 100% sense, & since it has no problem acting like a printer, it can in fact interface with all printers & offload tasks,

However, in the High Performance Computing mode of operation the USB dongle acts as the central processor from the device side; that is to say, for the device, such as the printer or the display...

You can supply a full workload to the dongle & of course it will complete the task with no need of assistance from the computer or the device.

The JIT Compiler comes into its own on two fronts:

Compatibility between processor types.

Aiding a device in processing &or passing work to that device to run; work that is shared &, if required, workloads are passed back & forth & shared,

Shared & optimised...

The final results, for example, are PostScript? No problem!
The final results, for example, are directly compute-optimised printer jet algorithms? No problem!
The task needs to compute specifics for a DisplayPort LED layout? No problem!

The device is powerful, so share: the JIT Compiler is there for real offloading & task management & runtime.
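The host-side scheduling this implies can be sketched in a few lines. Purely illustrative: the dongle protocol, the function names & the load model here are all assumptions, not part of the design above.

```python
# Hypothetical host<->dongle work sharing (all names invented for illustration).

def choose_target(host_load, dongle_load, dongle_capable=True):
    """Place the next work item on the less-loaded side; fall back to
    the host if the dongle cannot run this task type."""
    if not dongle_capable:
        return "host"
    return "dongle" if dongle_load <= host_load else "host"

def share_workload(costs, dongle_capable=True):
    """Split a list of task costs between host & dongle, greedy by load."""
    load = {"host": 0, "dongle": 0}
    for c in costs:
        side = choose_target(load["host"], load["dongle"], dongle_capable)
        load[side] += c
    return load

# Four equal tasks end up evenly shared back & forth:
print(share_workload([3, 3, 3, 3]))  # {'host': 6, 'dongle': 6}
```

The greedy split is the simplest possible policy; a real runtime would also weigh transfer cost & device capability per task.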

Functional Processing Dongle Classification USB3.1+ & HDMI & DisplayPort (c)RS

Theory 1 Printer

Itinerary:

Printers of a good design but low manufacturing cost of PCB printed circuits have a printhead controller,

But no PostScript processor; they do have a print dither controller, & the programmable version needs to interface with the CPU on the printing device,

Print controlling is a viable dongle role & so is cache, but a workload cache has to have a reason!

That reason, here given, is the JIT dongle, which is able to interface with both the web print protocol & IDF printing firmware.

But here we have PostScript input into the JIT Compiler's kernel & output in terms of jet vectors & line-by-line bitmap HDR & head-motion calculations,

We can also tick the box on PostScript offloading on functioning PostScript printers; but we prefer to offload JIT for speed & size.

Vectors & curves & lines & Cache.
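The "line by line" half of that output can be illustrated with a classic Bresenham walk: a generic rasterizer sketch that turns one vector into the per-line dots a print head or line-draw would emit (the actual jet-vector & head-motion math is not specified in this post).

```python
def rasterize_line(x0, y0, x1, y1):
    """Classic integer Bresenham: return the (x, y) pixels of a line,
    i.e. the line-by-line dots a print head or display line-draw emits."""
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    points = []
    while True:
        points.append((x0, y0))
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:          # step in x
            err += dy
            x0 += sx
        if e2 <= dx:          # step in y
            err += dx
            y0 += sy
    return points

print(rasterize_line(0, 0, 3, 3))  # [(0, 0), (1, 1), (2, 2), (3, 3)]
```

Integer-only stepping like this is exactly the kind of small, repetitive workload that offloads cleanly.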

Theory 2 Screen

Itinerary: as for printers, but also VESA & line-by-line screen print & VESA vectors & DisplayPort active displays,

Cable Active displays require the GPU to draw the screen & calculate the Line Draw!

The dongle activates like a screen with a processor & carries the screen processing out; instead of a smartwatch or small phone that does not have a good capacity for computer-led active display enhancements.

Theory 3 Hard Drives & controller such as network cards & plugs for PCI

Adapting to caching & processing storage or network data-throughput commands, while at the same time being functionally responsive to system command & update, makes the JIT dongle stand out at the head of both speed & function...

Network cards can send offloading tasks to the PCI socket & the plug will process them.

Hard-drives can request processing & it shall be done.

Motherboard ROMs & hardware can request IO & DMA translation; all code install is done by the OS & BIOS/firmware.

Offloading can happen from socket to motherboard & USB socket & UART..

All is done & adapts to Job & function in host.

The 8M motherboard & OS verify the dongle, license the dongle for the user..
& run commands! Any chipset, any maker & every dongle, by firmware/BIOS.
What the unit constitutes is a functional task offloader for OS & BIOS/firmware.

The utility is eternal & the functions creative & secure & licensed/Certificate verified.

Any Motherboard can be improved with the right Firmware & Plugin /+ device.

(c)RS

*****

Technology Demonstration https://is.gd/DongleTecDemo

Combining JIT PoCL with SiMD & vector instruction optimisation, we create a standard model of literally frame-printed vectors:

VecSR directly draws a frame using our display's highest floating-point math & vector processor instructions, lowering data costs in visual presentation & printing.

(documents) JIT & OpenCL & Codec : https://is.gd/DisplaySourceCode

Include vector today *important* RS https://vesa.org/vesa-display-compression-codecs/

https://science.n-helix.com/2022/06/jit-compiler.html

https://science.n-helix.com/2022/04/vecsr.html

https://science.n-helix.com/2016/04/3d-desktop-virtualization.html

https://science.n-helix.com/2019/06/vulkan-stack.html

https://science.n-helix.com/2019/06/kernel.html

https://science.n-helix.com/2022/03/fsr-focal-length.html

https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html

https://science.n-helix.com/2022/08/simd.html

Sunday, August 14, 2022

SiMD Chiplet Fast compression & decompression (c)RS



*
Subject: SiMD Compression / Decompression chip of 2mm on side of die Chiplet (c)RS


An additional CPU & APU compression / decompression chip of 2mm is to feature on chiplet console APUs; this is planned so that the chiplet does not require modification to the console APU,

Additionally, it is to feature pin-access Direct Discrete DMA for storage:

https://www.youtube.com/watch?v=1GvUdPn5QLg

*

Configuration of SiMD : Huffman & Compression : RS

To pack the majority of textures to 47 bit, one presumes a familiarity with Huffman codecs & the chaotic wavelets these present...

AVX256 Tasks x 4 = 64Bit
SiMD 16Bit x 2 = 32Bit / Alignment with AVX == x8
SiMD 32Bit x 2 = 64Bit / Alignment with AVX == x4

Closest to 47 = 40Bit Op x 2 (2.5Oe) | 80Bit/2 | 2 op x (1.5Oe)

So 40 Bit x2 parallel 6 Lanes
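The lane arithmetic above can be checked directly; a trivial sketch (register width divided by element width, integer division):

```python
def lanes(register_bits, element_bits):
    """How many elements of a given width fit in one SIMD register."""
    return register_bits // element_bits

# The 256-bit AVX alignments from the table above:
assert lanes(256, 64) == 4    # AVX256 tasks x 4 = 64Bit
assert lanes(256, 32) == 8    # SiMD 16Bit x 2 = 32Bit, alignment x8
assert lanes(256, 16) == 16
assert lanes(256, 40) == 6    # 40Bit x2 parallel, 6 lanes (16 bits spare)
```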

So on operation terms of precision :
32Bit Satisfies HDR,
40Bit Very much satisfies HDR,

16Bit satisfies JPG (basic)
64Bit satisfies LUT & Wide Gamut HDR Pro Rendering
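One reason 40Bit "very much satisfies HDR": four 10-bit channels pack exactly into 40 bits, i.e. 5 bytes per pixel. A minimal sketch, assuming an RGBA 10:10:10:10 layout (the layout is an assumption for illustration):

```python
def pack_rgba10(r, g, b, a):
    """Pack four 10-bit channels into one 40-bit value (5 bytes)."""
    for c in (r, g, b, a):
        assert 0 <= c < 1024, "each channel is 10 bits"
    return (r << 30) | (g << 20) | (b << 10) | a

def unpack_rgba10(v):
    """Recover the four 10-bit channels from a 40-bit pixel."""
    return ((v >> 30) & 0x3FF, (v >> 20) & 0x3FF, (v >> 10) & 0x3FF, v & 0x3FF)

pixel = pack_rgba10(1023, 512, 0, 3)
assert pixel.bit_length() <= 40           # fits the 40-bit op
assert unpack_rgba10(pixel) == (1023, 512, 0, 3)
```

At 32Bit the same layout would leave only 8 bits per channel plus alpha, which is why 40Bit is the comfortable HDR fit.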

*
Drill texture & image format (with contrast & depth enhancement)

https://drive.google.com/file/d/1G71Vd9d3wimVi8OkSk7Jkt6NtPB64PCG/view?usp=sharing
https://drive.google.com/file/d/1u2Qa7OVbSKIpwn24I7YDbwp2xdbjIOEo/view?usp=sharing

https://science.n-helix.com/2022/08/simd.html

Research topic RS : https://is.gd/Dot5CodecGPU https://is.gd/CodecDolby https://is.gd/CodecHDR_WCG https://is.gd/HPDigitalWavelet https://is.gd/DisplaySourceCode

*

GPU acceleration process : Huffman (c)RS


In the case of the dictionary we create a cubic array: a 16-parallel integer cube, 32 with SiMD,

The FPU is used to compress the core elliptical curve with SVM matrixing in 3D to 5D for files of 8MB; the FPU is inherently good versus crystalline structure. We use the SiMD for comparative matrix & byte-swap similarity.

It is always worth remembering that comparative operations are one of the most fundamental SiMD functions; but multiply, add & divide also exist within SiMD,
Functional FPU code can always use arrays of SiMD to handle chaotic play in the field..

A main example in Huffman is the variance of a wavelet from the main path;
Routes through the main wavelet types are handled by table (on the Amiga, for example) &or FPU!
Micro changes make SiMD viable; on the same principle as a hive & her ants.

Inherent expansion doubles the expected SiMD use; ideally 2MB RAM per cube.
Taking advantage of a known quantity & precision, we code-block by 16Bit to 128Bit segments.

Self-correction allows us to cube Huffman decode into blocks; we parallelize the blocks,
To (additionally) handle error, we block the original compression.

"We also use fine-grained locking for the frequency dictionary, individually locking each key-value pair. Once the symbol codes have been determined, each symbol is replaced by its code, and all symbols are so processed in parallel.

Decompression is inherently sequential, and hence much harder to parallelize. In this case, we take advantage of the self-synchronizing property of Huffman coding, which allows us to start at an arbitrary point"
Huffman source, Requires analysis https://github.com/catid/Zpng
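The self-synchronizing property the quote relies on can be seen with a toy prefix code (a = 0, b = 10, c = 11; invented for illustration, not Zpng's actual tables). Resynchronisation is not guaranteed for every code, but for this one a decoder started mid-symbol falls back into step after one bogus symbol:

```python
# Toy Huffman-style prefix code, invented for illustration.
CODE = {"a": "0", "b": "10", "c": "11"}
DECODE = {v: k for k, v in CODE.items()}

def encode(text):
    return "".join(CODE[ch] for ch in text)

def decode(bits):
    """Greedy prefix decode; incomplete trailing bits are dropped."""
    out, cur = [], ""
    for bit in bits:
        cur += bit
        if cur in DECODE:
            out.append(DECODE[cur])
            cur = ""
    return out

bits = encode("abcab")                    # "01011010"
assert decode(bits) == list("abcab")
# Start two bits in, mid-symbol: one bogus symbol, then the decoder
# re-synchronises & the rest of the stream decodes correctly.
assert decode(bits[2:]) == ["a", "c", "a", "b"]
```

This is what lets block-parallel decode start "at an arbitrary point" & trust the offset to correct itself.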

https://vignan.ac.in/pgr20/20ES011.pdf
https://bestofgithub.com/repo/Better-lossless-compression-than-PNG-with-a-simpler-algorithm

ZPNG
Faster than PNG and compresses better for photographic images. This compressor often takes less than 6% of the time of a PNG compressor.
https://github.com/catid/Zpng
*

SiMD Chiplet Fast compression & decompression (c)RS


3 proposals


https://is.gd/BTSource

LZ77:
https://github.com/jearmoo/parallel-data-compression

The FastPFOR C++ library : Fast integer compression :
https://github.com/lemire/FastPFor

SIMDCompressionAndIntersection
C/C++ library for fast compression and intersection of lists of sorted integers using SIMD instructions : https://github.com/lemire/SIMDCompressionAndIntersection

Compressor Improvements and LZSSE2 vs LZSSE8
http://conorstokes.github.io/compression/2016/02/24/compressor-improvements-and-lzsse2-vs-lzsse8
http://conorstokes.github.io/compression/2016/02/15/an-LZ-codec-designed-for-SSE-decompression

Compression Science Docs


A General SIMD-based Approach to Accelerating Compression
Algorithms
https://arxiv.org/ftp/arxiv/papers/1502/1502.01916.pdf

SIMD Compression and the Intersection of Sorted Integers
http://boytsov.info/pubs/simdcompressionarxiv.pdf

Fast Integer Compression using SIMD Instructions
https://www.uni-mannheim.de/media/Einrichtungen/dws/Files_People/Profs/rgemulla/publications/schlegel10compression.pdf

Fast integer compression using SIMD instructions
https://www.researchgate.net/publication/220706907_Fast_integer_compression_using_SIMD_instructions

*****



https://jearmoo.github.io/parallel-data-compression/

GO

https://github.com/zentures/encoding

http://zhen.org/blog/benchmarking-integer-compression-in-go/

https://github.com/golang/snappy

The FastPFOR C++ library : Fast integer compression

What is this?

A research library with integer compression schemes. It is broadly applicable to the compression of arrays of 32-bit integers where most integers are small. The library seeks to exploit SIMD instructions (SSE) whenever possible.

This library can decode at least 4 billion compressed integers per second on most desktop or laptop processors. That is, it can decompress data at a rate of 15 GB/s. This is significantly faster than generic codecs like gzip, LZO, Snappy or LZ4.
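FastPFOR itself relies on SIMD bit packing (PFOR), but the underlying bet ("most integers are small, so spend fewer bytes on them") can be shown with plain variable-byte coding; a sketch of the idea, not FastPFOR's actual scheme:

```python
def varbyte_encode(nums):
    """Variable-byte coding: 7 data bits per byte; the high bit marks
    the final byte of each value."""
    out = bytearray()
    for n in nums:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)

def varbyte_decode(data):
    """Invert varbyte_encode."""
    nums, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:                       # final byte of this value
            nums.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return nums

small = [3, 200, 70000]
packed = varbyte_encode(small)
assert varbyte_decode(packed) == small
assert len(packed) < 4 * len(small)        # beats fixed 32-bit storage here
```

Small values take 1 byte instead of 4; SIMD schemes like PFOR push the same idea to bit granularity, many integers at a time.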

https://github.com/lemire/FastPFor

https://github.com/lemire/FastPFor/archive/refs/tags/v0.1.8.zip

https://github.com/lemire/FastPFor/archive/refs/tags/v0.1.8.tar.gz

Java May have a use in JS ôo
https://github.com/lemire/JavaFastPFOR

https://github.com/lemire/JavaFastPFOR/blob/master/benchmarkresults/benchmarkresults_icore7_10may2013.txt

*****

SIMDCompressionAndIntersection


C/C++ library for fast compression and intersection of lists of sorted integers using SIMD instructions : https://github.com/lemire/SIMDCompressionAndIntersection


As the name suggests, this is a C/C++ library for fast compression and intersection of lists of sorted integers using SIMD instructions. The library focuses on innovative techniques and very fast schemes, with particular attention to differential coding. It introduces new SIMD intersections schemes such as SIMD Galloping.

This library can decode at least 4 billion compressed integers per second on most desktop or laptop processors. That is, it can decompress data at a rate of 15 GB/s. This is significantly faster than generic codecs like gzip, LZO, Snappy or LZ4.
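The differential coding the library pays "particular attention to" is simple to sketch: a sorted list is stored as its first value plus gaps, and the gaps are small integers that compress well with the schemes above:

```python
def deltas(sorted_ids):
    """Differential coding: store the first value, then the gaps."""
    return [sorted_ids[0]] + [b - a for a, b in zip(sorted_ids, sorted_ids[1:])]

def undeltas(gaps):
    """Prefix-sum the gaps back into the original sorted list."""
    out, acc = [], 0
    for g in gaps:
        acc += g
        out.append(acc)
    return out

ids = [1000, 1002, 1007, 1100]
assert deltas(ids) == [1000, 2, 5, 93]     # gaps are small
assert undeltas(deltas(ids)) == ids
```

The prefix sum on decode is the sequential part that SIMD schemes work hard to vectorise.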

*****LZ77*****

Principally an order & load+Vec https://github.com/jearmoo/parallel-data-compression

https://jearmoo.github.io/parallel-data-compression/


Summary of What We Completed

We have written and optimized the sequential version of the Huffman encoding and decoding algorithms, and tested it. For the parallel CPU version of this, we were debating between SIMD intrinsics and ISPC, and OpenMP.

However, Huffman coding compression and decompression doesn’t seem to have a workload that can appropriately use SIMD. This is because there is no elegant way of dealing with bits instead of bytes in SIMD. Moreover, different bytes compress to a different number of bits (there is no fixed mapping of input vector size to output vector size), which makes byte alignment in SIMD very difficult (for example, the compressed form for a random 4 byte input could range from 2 to 4 bytes). This is much worse for decompression, where resolving bit-level conflicts (where a specific encoding spreads over 2 bytes) is almost impossible and might actually result in the algorithm being slower than the sequential version. Therefore, we decided to focus on OpenMP.

For compression, we first sort the array in parallel, to minimize number of concurrent updates to the shared frequency dictionary, reducing contention and false sharing. We also use fine-grained locking for the frequency dictionary, individually locking each key-value pair. Once the symbol codes have been determined, each symbol is replaced by its code, and all symbols are so processed in parallel.

Decompression is inherently sequential, and hence much harder to parallelize. In this case, we take advantage of the self-synchronizing property of Huffman coding, which allows us to start at an arbitrary point in the encoded bits, and assume that at some point, the offset in bits will correct itself, resulting in the correct output thereafter.

We read about the LZ77 algorithm and explored the different variants of the algorithm. We also explored different ways to parallelize LZ77. One naive approach is running the LZ77 algorithm along different segments of the data. This approach could output the same result as the sequential implementation if we use a fixed size sliding window and reread over some of the data. Another approach is the one outlined in Practical Parallel Lempel-Ziv Factorization which uses an unbounded sliding window and employs the use of prefix sums and segment trees to calculate the Lempel-Ziv factorization in parallel.
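A minimal greedy LZ77 tokenizer (offset, length, next-char triples over a small sliding window; matches here may not overlap the cursor, a simplification relative to real LZ77) shows what such a factorization operates on:

```python
def lz77_tokens(data, window=16):
    """Greedy LZ77: emit (offset, length, next_char) triples.
    Simplification: a match may not run past the current position."""
    i, tokens = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
                if j + k >= i:            # stop at the cursor (no overlap)
                    break
            if k > best_len:
                best_off, best_len = i - j, k
        nxt = data[i + best_len] if i + best_len < len(data) else ""
        tokens.append((best_off, best_len, nxt))
        i += best_len + 1
    return tokens

def lz77_expand(tokens):
    """Invert lz77_tokens: copy `length` chars from `offset` back, append next_char."""
    out = ""
    for off, length, nxt in tokens:
        for _ in range(length):
            out += out[-off]
        out += nxt
    return out

for s in ("abababab", "to be or not to be"):
    assert lz77_expand(lz77_tokens(s)) == s
```

Running this per segment is exactly the naive parallel approach above; the fixed window means each segment only needs a little rereading of its neighbour's tail to match the sequential output.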

Update on Deliverables

Our sequential implementations are close to finished, and we have some idea of how to parallelize the algorithms. Our goal for the checkpoint was to have both of these parts finished, but we have not completely met the goal. We may pivot and work on parallelizing the compression and decompression of the Huffman coding algorithm and drop the LZ77 part of the project altogether.

Our new goals:

Parallelize the Huffman Coding compression.
Parallelize the Huffman Coding decompression or LZ77 compression

Hope to achieve:
Both parts of part 2 in our new goals.

*****ZPNG*****


Huffman source, Requires analysis https://github.com/catid/Zpng

Small experimental lossless photographic image compression library with a C API and command-line interface.

It's much faster than PNG and compresses better for photographic images. This compressor often takes less than 6% of the time of a PNG compressor and produces a file that is 66% of the size. It was written in just 500 lines of C code thanks to Facebook's Zstd library.

The goal was to see if I could create a better lossless compressor than PNG in just one evening (a few hours) using Zstd and some past experience writing my GCIF library. Zstd is magical.

I'm not expecting anyone else to use this, but feel free if you need some fast compression in just a few hundred lines of C code.

**************************

Main interpolation references:


Interpolation https://drive.google.com/file/d/1dn0mdYIHsbMsBaqVRIfFkZXJ4xcW_MOA/view?usp=sharing

ICC & FRC https://drive.google.com/file/d/1vKZ5Vvuyaty5XiDQvc6LeSq6n1O3xsDl/view?usp=sharing

FRC Calibration >

FRC_FCPrP(tm):RS (Reference)

https://drive.google.com/file/d/1hEU6D2nv03r3O_C-ZKR_kv6NBxcg1ddR/view?usp=sharing

FRC & AA & Super Sampling (Reference)

https://drive.google.com/file/d/1AMR0-ftMQIIC2ONnPc_gTLN31zy-YX4d/view?usp=sharing

Audio 3D Calibration

https://drive.google.com/file/d/1-wz4VFZGP5Z-1lG0bEe1G2MRTXYIecNh/view?usp=sharing

2: We use a reference palette to get the best out of our LED; such a reference palette is:

Rec709 Profile in effect : use today! https://is.gd/ColourGrading

Rec709 <> Rec2020 ICC 4 Million Reference Colour Profile : https://drive.google.com/file/d/1sqTm9zuY89sp14Q36sTS2hySll40DilB/view?usp=sharing

For Broadcasting, TV, Monitor & Camera https://is.gd/ICC_Rec2020_709

ICC Colour Profiles for compatibility: https://drive.google.com/file/d/1sqTm9zuY89sp14Q36sTS2hySll40DilB/view?usp=sharing

https://is.gd/BTSource

Colour Profile Professionally

https://displayhdr.org/guide/
https://www.microsoft.com/store/apps/9NN1GPN70NF3

*Files*

This one will suit a dedicated ARM machine in body armour 'mental state': ARM router & TV https://drive.google.com/file/d/102pycYOFpkD1Vqj_N910vennxxIzFh_f/view?usp=sharing

Android & Linux ARM processor configurations; routers' & TVs' upgrade files. Update & improve:
https://drive.google.com/file/d/1JV7PaTPUmikzqgMIfNRXr4UkF2X9iZoq/

Provenance: https://www.virustotal.com/gui/file/0c999ccda99be1c9535ad72c38dc1947d014966e699d7a259c67f4df56ec4b92/
https://www.virustotal.com/gui/file/ff97d7da6a89d39f7c6c3711e0271f282127c75174977439a33d44a03d4d6c8e/

Python Deep Learning: configurations

AndroLinuxML : https://drive.google.com/file/d/1N92h-nHnzO5Vfq1rcJhkF952aZ1PPZGB/view?usp=sharing

Linux : https://drive.google.com/file/d/1u64mj6vqWwq3hLfgt0rHis1Bvdx_o3vL/view?usp=sharing

Windows : https://drive.google.com/file/d/1dVJHPx9kdXxCg5272fPvnpgY8UtIq57p/view?usp=sharing