Monday, August 29, 2022

JIT Compiler Dongle - The Connection HPC 2022 RS

JIT Compiler Dongle - The Connection HPC 2022 RS (c)Rupert S

The JIT Compiler Dongle makes 100% sense, & since it has no problem acting like a printer, it can in fact interface with all printers & offload tasks,

However, in the High Performance Computing mode of operation the USB dongle acts as the central processor from the device side; that is to say, for the device, such as the printer or the display...

You can supply a full workload to the dongle & of course it will complete the task with no need of assistance from the computer or the device.

The JIT Compiler comes into its own on two fronts:

Compatibility between processor types.

Aiding a device in processing &or passing work to that device to run; work that is shared &, if required, workloads are passed back & forth & shared,

Shared & optimised...

The final results, for example, are PostScript? No problem!
The final results, for example, are directly compute-optimised printer jet algorithms? No problem!
The task needs to compute specifics for a DisplayPort LED layout? No problem!

The device is powerful, so share: the JIT Compiler is there for real offloading & task management & runtime.
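The host-side scheduling this implies can be sketched in a few lines. Purely illustrative: the dongle protocol, the function names & the load model here are all assumptions, not part of the design above.

```python
# Hypothetical host<->dongle work sharing (all names invented for illustration).

def choose_target(host_load, dongle_load, dongle_capable=True):
    """Place the next work item on the less-loaded side; fall back to
    the host if the dongle cannot run this task type."""
    if not dongle_capable:
        return "host"
    return "dongle" if dongle_load <= host_load else "host"

def share_workload(costs, dongle_capable=True):
    """Split a list of task costs between host & dongle, greedy by load."""
    load = {"host": 0, "dongle": 0}
    for c in costs:
        side = choose_target(load["host"], load["dongle"], dongle_capable)
        load[side] += c
    return load

# Four equal tasks end up evenly shared back & forth:
print(share_workload([3, 3, 3, 3]))  # {'host': 6, 'dongle': 6}
```

The greedy split is the simplest possible policy; a real runtime would also weigh transfer cost & device capability per task.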

Functional Processing Dongle Classification USB3.1+ & HDMI & DisplayPort (c)RS

Theory 1 Printer

Itinerary:

Printers of a good design but low manufacturing cost of PCB printed circuits have a printhead controller,

But no PostScript processor; they do have a print dither controller, & the programmable version needs to interface with the CPU on the printing device,

Print controlling is a viable dongle role & so is cache, but a workload cache has to have a reason!

That reason, here given, is the JIT dongle, which is able to interface with both the web print protocol & IDF printing firmware.

But here we have PostScript input into the JIT Compiler's kernel & output in terms of jet vectors & line-by-line bitmap HDR & head-motion calculations,

We can also tick the box on PostScript offloading on functioning PostScript printers; but we prefer to offload JIT for speed & size.

Vectors & curves & lines & Cache.
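The "line by line" half of that output can be illustrated with a classic Bresenham walk: a generic rasterizer sketch that turns one vector into the per-line dots a print head or line-draw would emit (the actual jet-vector & head-motion math is not specified in this post).

```python
def rasterize_line(x0, y0, x1, y1):
    """Classic integer Bresenham: return the (x, y) pixels of a line,
    i.e. the line-by-line dots a print head or display line-draw emits."""
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    points = []
    while True:
        points.append((x0, y0))
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:          # step in x
            err += dy
            x0 += sx
        if e2 <= dx:          # step in y
            err += dx
            y0 += sy
    return points

print(rasterize_line(0, 0, 3, 3))  # [(0, 0), (1, 1), (2, 2), (3, 3)]
```

Integer-only stepping like this is exactly the kind of small, repetitive workload that offloads cleanly.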

Theory 2 Screen

Itinerary: as for printers, but also VESA & line-by-line screen print & VESA vectors & DisplayPort active displays,

Cable Active displays require the GPU to draw the screen & calculate the Line Draw!

The dongle activates like a screen with a processor & carries the screen processing out; instead of a smartwatch or small phone that does not have a good capacity for computer-led active display enhancements.

Theory 3 Hard Drives & controller such as network cards & plugs for PCI

Adapting to caching & processing storage or network data-throughput commands, while at the same time being functionally responsive to system command & update, makes the JIT dongle stand out at the head of both speed & function...

Network cards can send offloading tasks to the PCI socket & the plug will process them.

Hard-drives can request processing & it shall be done.

Motherboard ROMs & hardware can request IO & DMA translation; all code install is done by the OS & BIOS/firmware.

Offloading can happen from socket to motherboard & USB socket & UART..

All is done & adapts to Job & function in host.

The 8M motherboard & OS verify the dongle, license the dongle for the user..
& run commands! Any chipset, any maker & every dongle, by firmware/BIOS.
What the unit constitutes is a functional task offloader for OS & BIOS/firmware.

The utility is eternal & the functions creative & secure & licensed/Certificate verified.

Any Motherboard can be improved with the right Firmware & Plugin /+ device.

(c)RS

*****

Technology Demonstration https://is.gd/DongleTecDemo

Combining JIT PoCL with SiMD & vector instruction optimisation, we create a standard model of literally frame-printed vectors:

VecSR directly draws a frame using our display's highest floating-point math & vector processor instructions, lowering data costs in visual presentation & printing.

(documents) JIT & OpenCL & Codec : https://is.gd/DisplaySourceCode

Include vector today *important* RS https://vesa.org/vesa-display-compression-codecs/

https://science.n-helix.com/2022/06/jit-compiler.html

https://science.n-helix.com/2022/04/vecsr.html

https://science.n-helix.com/2016/04/3d-desktop-virtualization.html

https://science.n-helix.com/2019/06/vulkan-stack.html

https://science.n-helix.com/2019/06/kernel.html

https://science.n-helix.com/2022/03/fsr-focal-length.html

https://science.n-helix.com/2018/01/integer-floats-with-remainder-theory.html

https://science.n-helix.com/2022/08/simd.html

Sunday, August 14, 2022

SiMD Chiplet Fast compression & decompression (c)RS



*
Subject: SiMD Compression / Decompression chip of 2mm on side of die Chiplet (c)RS


An additional CPU & APU compression / decompression chip of 2mm is to feature on chiplet console APUs; this is planned so that the chiplet does not require modification to the console APU,

Additionally, it is to feature pin-access Direct Discrete DMA for storage:

https://www.youtube.com/watch?v=1GvUdPn5QLg

*

Configuration of SiMD : Huffman & Compression : RS

To pack the majority of textures to 47 bit, one presumes a familiarity with Huffman codecs & the chaotic wavelets these present...

AVX256 Tasks x 4 = 64Bit
SiMD 16Bit x 2 = 32Bit / Alignment with AVX == x8
SiMD 32Bit x 2 = 64Bit / Alignment with AVX == x4

Closest to 47 = 40Bit Op x 2 (2.5Oe) | 80Bit/2 | 2 op x (1.5Oe)

So 40 Bit x2 parallel 6 Lanes
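The lane arithmetic above can be checked directly; a trivial sketch (register width divided by element width, integer division):

```python
def lanes(register_bits, element_bits):
    """How many elements of a given width fit in one SIMD register."""
    return register_bits // element_bits

# The 256-bit AVX alignments from the table above:
assert lanes(256, 64) == 4    # AVX256 tasks x 4 = 64Bit
assert lanes(256, 32) == 8    # SiMD 16Bit x 2 = 32Bit, alignment x8
assert lanes(256, 16) == 16
assert lanes(256, 40) == 6    # 40Bit x2 parallel, 6 lanes (16 bits spare)
```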

So on operation terms of precision :
32Bit Satisfies HDR,
40Bit Very much satisfies HDR,

16Bit satisfies JPG (basic)
64Bit satisfies LUT & Wide Gamut HDR Pro Rendering
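One reason 40Bit "very much satisfies HDR": four 10-bit channels pack exactly into 40 bits, i.e. 5 bytes per pixel. A minimal sketch, assuming an RGBA 10:10:10:10 layout (the layout is an assumption for illustration):

```python
def pack_rgba10(r, g, b, a):
    """Pack four 10-bit channels into one 40-bit value (5 bytes)."""
    for c in (r, g, b, a):
        assert 0 <= c < 1024, "each channel is 10 bits"
    return (r << 30) | (g << 20) | (b << 10) | a

def unpack_rgba10(v):
    """Recover the four 10-bit channels from a 40-bit pixel."""
    return ((v >> 30) & 0x3FF, (v >> 20) & 0x3FF, (v >> 10) & 0x3FF, v & 0x3FF)

pixel = pack_rgba10(1023, 512, 0, 3)
assert pixel.bit_length() <= 40           # fits the 40-bit op
assert unpack_rgba10(pixel) == (1023, 512, 0, 3)
```

At 32Bit the same layout would leave only 8 bits per channel plus alpha, which is why 40Bit is the comfortable HDR fit.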

*
Drill texture & image format (with contrast & depth enhancement)

https://drive.google.com/file/d/1G71Vd9d3wimVi8OkSk7Jkt6NtPB64PCG/view?usp=sharing
https://drive.google.com/file/d/1u2Qa7OVbSKIpwn24I7YDbwp2xdbjIOEo/view?usp=sharing

https://science.n-helix.com/2022/08/simd.html

Research topic RS : https://is.gd/Dot5CodecGPU https://is.gd/CodecDolby https://is.gd/CodecHDR_WCG https://is.gd/HPDigitalWavelet https://is.gd/DisplaySourceCode

*

GPU acceleration process : Huffman (c)RS


In the case of the dictionary we create a cubic array: a 16-parallel integer cube, 32 with SiMD,

The FPU is used to compress the core elliptical curve with SVM matrixing in 3D to 5D for files of 8MB; the FPU is inherently good versus crystalline structure. We use the SiMD for comparative matrix & byte-swap similarity.

It is always worth remembering that comparative operations are one of the most fundamental SiMD functions; but multiply, add & divide also exist within SiMD,
Functional FPU code can always use arrays of SiMD to handle chaotic play in the field..

A main example in Huffman is the variance of a wavelet from the main path;
Routes through the main wavelet types are handled by table (on the Amiga, for example) &or FPU!
Micro changes make SiMD viable; on the same principle as a hive & her ants.

Inherent expansion doubles the expected SiMD use; ideally 2MB RAM per cube.
Taking advantage of a known quantity & precision, we code-block by 16Bit to 128Bit segments.

Self-correction allows us to cube Huffman decode into blocks; we parallelize the blocks,
To (additionally) handle error, we block the original compression.

"We also use fine-grained locking for the frequency dictionary, individually locking each key-value pair. Once the symbol codes have been determined, each symbol is replaced by its code, and all symbols are so processed in parallel.

Decompression is inherently sequential, and hence much harder to parallelize. In this case, we take advantage of the self-synchronizing property of Huffman coding, which allows us to start at an arbitrary point"
Huffman source, Requires analysis https://github.com/catid/Zpng
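The self-synchronizing property the quote relies on can be seen with a toy prefix code (a = 0, b = 10, c = 11; invented for illustration, not Zpng's actual tables). Resynchronisation is not guaranteed for every code, but for this one a decoder started mid-symbol falls back into step after one bogus symbol:

```python
# Toy Huffman-style prefix code, invented for illustration.
CODE = {"a": "0", "b": "10", "c": "11"}
DECODE = {v: k for k, v in CODE.items()}

def encode(text):
    return "".join(CODE[ch] for ch in text)

def decode(bits):
    """Greedy prefix decode; incomplete trailing bits are dropped."""
    out, cur = [], ""
    for bit in bits:
        cur += bit
        if cur in DECODE:
            out.append(DECODE[cur])
            cur = ""
    return out

bits = encode("abcab")                    # "01011010"
assert decode(bits) == list("abcab")
# Start two bits in, mid-symbol: one bogus symbol, then the decoder
# re-synchronises & the rest of the stream decodes correctly.
assert decode(bits[2:]) == ["a", "c", "a", "b"]
```

This is what lets block-parallel decode start "at an arbitrary point" & trust the offset to correct itself.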

https://vignan.ac.in/pgr20/20ES011.pdf
https://bestofgithub.com/repo/Better-lossless-compression-than-PNG-with-a-simpler-algorithm

ZPNG
Faster than PNG and compresses better for photographic images. This compressor often takes less than 6% of the time of a PNG compressor.
https://github.com/catid/Zpng
*

SiMD Chiplet Fast compression & decompression (c)RS


3 proposals


https://is.gd/BTSource

LZ77:
https://github.com/jearmoo/parallel-data-compression

The FastPFOR C++ library : Fast integer compression :
https://github.com/lemire/FastPFor

SIMDCompressionAndIntersection
C/C++ library for fast compression and intersection of lists of sorted integers using SIMD instructions : https://github.com/lemire/SIMDCompressionAndIntersection

Compressor Improvements and LZSSE2 vs LZSSE8
http://conorstokes.github.io/compression/2016/02/24/compressor-improvements-and-lzsse2-vs-lzsse8
http://conorstokes.github.io/compression/2016/02/15/an-LZ-codec-designed-for-SSE-decompression

Compression Science Docs


A General SIMD-based Approach to Accelerating Compression
Algorithms
https://arxiv.org/ftp/arxiv/papers/1502/1502.01916.pdf

SIMD Compression and the Intersection of Sorted Integers
http://boytsov.info/pubs/simdcompressionarxiv.pdf

Fast Integer Compression using SIMD Instructions
https://www.uni-mannheim.de/media/Einrichtungen/dws/Files_People/Profs/rgemulla/publications/schlegel10compression.pdf

Fast integer compression using SIMD instructions
https://www.researchgate.net/publication/220706907_Fast_integer_compression_using_SIMD_instructions

*****



https://jearmoo.github.io/parallel-data-compression/

GO

https://github.com/zentures/encoding

http://zhen.org/blog/benchmarking-integer-compression-in-go/

https://github.com/golang/snappy

The FastPFOR C++ library : Fast integer compression

What is this?

A research library with integer compression schemes. It is broadly applicable to the compression of arrays of 32-bit integers where most integers are small. The library seeks to exploit SIMD instructions (SSE) whenever possible.

This library can decode at least 4 billion compressed integers per second on most desktop or laptop processors. That is, it can decompress data at a rate of 15 GB/s. This is significantly faster than generic codecs like gzip, LZO, Snappy or LZ4.
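FastPFOR itself relies on SIMD bit packing (PFOR), but the underlying bet ("most integers are small, so spend fewer bytes on them") can be shown with plain variable-byte coding; a sketch of the idea, not FastPFOR's actual scheme:

```python
def varbyte_encode(nums):
    """Variable-byte coding: 7 data bits per byte; the high bit marks
    the final byte of each value."""
    out = bytearray()
    for n in nums:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)

def varbyte_decode(data):
    """Invert varbyte_encode."""
    nums, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:                       # final byte of this value
            nums.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return nums

small = [3, 200, 70000]
packed = varbyte_encode(small)
assert varbyte_decode(packed) == small
assert len(packed) < 4 * len(small)        # beats fixed 32-bit storage here
```

Small values take 1 byte instead of 4; SIMD schemes like PFOR push the same idea to bit granularity, many integers at a time.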

https://github.com/lemire/FastPFor

https://github.com/lemire/FastPFor/archive/refs/tags/v0.1.8.zip

https://github.com/lemire/FastPFor/archive/refs/tags/v0.1.8.tar.gz

Java May have a use in JS ôo
https://github.com/lemire/JavaFastPFOR

https://github.com/lemire/JavaFastPFOR/blob/master/benchmarkresults/benchmarkresults_icore7_10may2013.txt

*****

SIMDCompressionAndIntersection


C/C++ library for fast compression and intersection of lists of sorted integers using SIMD instructions : https://github.com/lemire/SIMDCompressionAndIntersection


As the name suggests, this is a C/C++ library for fast compression and intersection of lists of sorted integers using SIMD instructions. The library focuses on innovative techniques and very fast schemes, with particular attention to differential coding. It introduces new SIMD intersections schemes such as SIMD Galloping.

This library can decode at least 4 billion compressed integers per second on most desktop or laptop processors. That is, it can decompress data at a rate of 15 GB/s. This is significantly faster than generic codecs like gzip, LZO, Snappy or LZ4.
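The differential coding the library pays "particular attention to" is simple to sketch: a sorted list is stored as its first value plus gaps, and the gaps are small integers that compress well with the schemes above:

```python
def deltas(sorted_ids):
    """Differential coding: store the first value, then the gaps."""
    return [sorted_ids[0]] + [b - a for a, b in zip(sorted_ids, sorted_ids[1:])]

def undeltas(gaps):
    """Prefix-sum the gaps back into the original sorted list."""
    out, acc = [], 0
    for g in gaps:
        acc += g
        out.append(acc)
    return out

ids = [1000, 1002, 1007, 1100]
assert deltas(ids) == [1000, 2, 5, 93]     # gaps are small
assert undeltas(deltas(ids)) == ids
```

The prefix sum on decode is the sequential part that SIMD schemes work hard to vectorise.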

*****LZ77*****

Principally an order & load+Vec https://github.com/jearmoo/parallel-data-compression

https://jearmoo.github.io/parallel-data-compression/


Summary of What We Completed

We have written and optimized the sequential version of the Huffman encoding and decoding algorithms, and tested it. For the parallel CPU version of this, we were debating between SIMD intrinsics and ISPC, and OpenMP.

However, Huffman coding compression and decompression doesn’t seem to have a workload that can appropriately use SIMD. This is because there is no elegant way of dealing with bits instead of bytes in SIMD. Moreover, different bytes compress to a different number of bits (there is no fixed mapping of input vector size to output vector size), which makes byte alignment in SIMD very difficult (for example, the compressed form for a random 4 byte input could range from 2 to 4 bytes). This is much worse for decompression, where resolving bit-level conflicts (where a specific encoding spreads over 2 bytes) is almost impossible and might actually result in the algorithm being slower than the sequential version. Therefore, we decided to focus on OpenMP.

For compression, we first sort the array in parallel, to minimize number of concurrent updates to the shared frequency dictionary, reducing contention and false sharing. We also use fine-grained locking for the frequency dictionary, individually locking each key-value pair. Once the symbol codes have been determined, each symbol is replaced by its code, and all symbols are so processed in parallel.

Decompression is inherently sequential, and hence much harder to parallelize. In this case, we take advantage of the self-synchronizing property of Huffman coding, which allows us to start at an arbitrary point in the encoded bits, and assume that at some point, the offset in bits will correct itself, resulting in the correct output thereafter.

We read about the LZ77 algorithm and explored the different variants of the algorithm. We also explored different ways to parallelize LZ77. One naive approach is running the LZ77 algorithm along different segments of the data. This approach could output the same result as the sequential implementation if we use a fixed size sliding window and reread over some of the data. Another approach is the one outlined in Practical Parallel Lempel-Ziv Factorization which uses an unbounded sliding window and employs the use of prefix sums and segment trees to calculate the Lempel-Ziv factorization in parallel.
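A minimal greedy LZ77 tokenizer (offset, length, next-char triples over a small sliding window; matches here may not overlap the cursor, a simplification relative to real LZ77) shows what such a factorization operates on:

```python
def lz77_tokens(data, window=16):
    """Greedy LZ77: emit (offset, length, next_char) triples.
    Simplification: a match may not run past the current position."""
    i, tokens = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
                if j + k >= i:            # stop at the cursor (no overlap)
                    break
            if k > best_len:
                best_off, best_len = i - j, k
        nxt = data[i + best_len] if i + best_len < len(data) else ""
        tokens.append((best_off, best_len, nxt))
        i += best_len + 1
    return tokens

def lz77_expand(tokens):
    """Invert lz77_tokens: copy `length` chars from `offset` back, append next_char."""
    out = ""
    for off, length, nxt in tokens:
        for _ in range(length):
            out += out[-off]
        out += nxt
    return out

for s in ("abababab", "to be or not to be"):
    assert lz77_expand(lz77_tokens(s)) == s
```

Running this per segment is exactly the naive parallel approach above; the fixed window means each segment only needs a little rereading of its neighbour's tail to match the sequential output.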

Update on Deliverables

Our sequential implementations are close to finished, and we have some idea of how to parallelize the algorithms. Our goal for the checkpoint was to have both of these parts finished, but we have not completely met the goal. We may pivot and work on parallelizing the compression and decompression of the Huffman coding algorithm and drop the LZ77 part of the project altogether.

Our new goals:

Parallelize the Huffman Coding compression.
Parallelize the Huffman Coding decompression or LZ77 compression

Hope to achieve:
Both parts of part 2 in our new goals.

*****ZPNG*****


Huffman source, Requires analysis https://github.com/catid/Zpng

Small experimental lossless photographic image compression library with a C API and command-line interface.

It's much faster than PNG and compresses better for photographic images. This compressor often takes less than 6% of the time of a PNG compressor and produces a file that is 66% of the size. It was written in just 500 lines of C code thanks to Facebook's Zstd library.

The goal was to see if I could create a better lossless compressor than PNG in just one evening (a few hours) using Zstd and some past experience writing my GCIF library. Zstd is magical.

I'm not expecting anyone else to use this, but feel free if you need some fast compression in just a few hundred lines of C code.

**************************

Main interpolation references:


Interpolation https://drive.google.com/file/d/1dn0mdYIHsbMsBaqVRIfFkZXJ4xcW_MOA/view?usp=sharing

ICC & FRC https://drive.google.com/file/d/1vKZ5Vvuyaty5XiDQvc6LeSq6n1O3xsDl/view?usp=sharing

FRC Calibration >

FRC_FCPrP(tm):RS (Reference)

https://drive.google.com/file/d/1hEU6D2nv03r3O_C-ZKR_kv6NBxcg1ddR/view?usp=sharing

FRC & AA & Super Sampling (Reference)

https://drive.google.com/file/d/1AMR0-ftMQIIC2ONnPc_gTLN31zy-YX4d/view?usp=sharing

Audio 3D Calibration

https://drive.google.com/file/d/1-wz4VFZGP5Z-1lG0bEe1G2MRTXYIecNh/view?usp=sharing

2: We use a reference palette to get the best out of our LED; such a reference palette is:

Rec709 Profile in effect : use today! https://is.gd/ColourGrading

Rec709 <> Rec2020 ICC 4 Million Reference Colour Profile : https://drive.google.com/file/d/1sqTm9zuY89sp14Q36sTS2hySll40DilB/view?usp=sharing

For Broadcasting, TV, Monitor & Camera https://is.gd/ICC_Rec2020_709

ICC Colour Profiles for compatibility: https://drive.google.com/file/d/1sqTm9zuY89sp14Q36sTS2hySll40DilB/view?usp=sharing

https://is.gd/BTSource

Colour Profile Professionally

https://displayhdr.org/guide/
https://www.microsoft.com/store/apps/9NN1GPN70NF3

*Files*

This one will suit a dedicated ARM machine in body armour 'mental state': ARM router & TV https://drive.google.com/file/d/102pycYOFpkD1Vqj_N910vennxxIzFh_f/view?usp=sharing

Android & Linux ARM processor configurations; routers' & TVs' upgrade files. Update & improve:
https://drive.google.com/file/d/1JV7PaTPUmikzqgMIfNRXr4UkF2X9iZoq/

Provenance: https://www.virustotal.com/gui/file/0c999ccda99be1c9535ad72c38dc1947d014966e699d7a259c67f4df56ec4b92/
https://www.virustotal.com/gui/file/ff97d7da6a89d39f7c6c3711e0271f282127c75174977439a33d44a03d4d6c8e/

Python Deep Learning: configurations

AndroLinuxML : https://drive.google.com/file/d/1N92h-nHnzO5Vfq1rcJhkF952aZ1PPZGB/view?usp=sharing

Linux : https://drive.google.com/file/d/1u64mj6vqWwq3hLfgt0rHis1Bvdx_o3vL/view?usp=sharing

Windows : https://drive.google.com/file/d/1dVJHPx9kdXxCg5272fPvnpgY8UtIq57p/view?usp=sharing