Sunday, August 14, 2022

SiMD Chiplet Fast compression & decompression (c)RS


*
Subject: SiMD Compression / Decompression chip of 2mm on side of die Chiplet (c)RS

An additional CPU & APU Compression / Decompression chip of 2mm is to
feature on chiplet console APUs; it is planned as a separate Chiplet so
that the console APU itself does not require modification.

Additionally, it is to feature pin access: Direct Discrete DMA for storage :

https://www.youtube.com/watch?v=1GvUdPn5QLg

*

Configuration of SiMD : Huffman & Compression : RS

Packing the majority of textures to 47Bit presumes a familiarity with Huffman codecs & the chaotic wavelets they present...

AVX256 / 4 Tasks = 64Bit each
SiMD 16Bit x 2 = 32Bit / Alignment with AVX == x8
SiMD 32Bit x 2 = 64Bit / Alignment with AVX == x4

Closest to 47 = 40Bit Op x 2 (2.5Oe) | 80Bit/2 | 2 op x (1.5Oe)

So 40 Bit x2 parallel 6 Lanes
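
The lane counts above can be checked with a small sketch. It assumes a 256Bit register (the AVX2 width) and only models the bit budget: AVX has no native 40Bit lane type, so the function names here are hypothetical helpers, not intrinsics.

```python
REGISTER_BITS = 256  # assumed register width (AVX2)


def lanes_per_register(lane_bits, register_bits=REGISTER_BITS):
    """Number of whole lanes of lane_bits that fit in one register."""
    return register_bits // lane_bits


def pack_lanes(values, lane_bits):
    """Pack unsigned values into one integer, lane 0 in the lowest bits."""
    mask = (1 << lane_bits) - 1
    word = 0
    for i, v in enumerate(values):
        if v < 0 or v > mask:
            raise ValueError("value does not fit in lane")
        word |= v << (i * lane_bits)
    return word


def unpack_lanes(word, lane_bits, count):
    """Inverse of pack_lanes: read count lanes back out of the word."""
    mask = (1 << lane_bits) - 1
    return [(word >> (i * lane_bits)) & mask for i in range(count)]


if __name__ == "__main__":
    for bits in (16, 32, 40, 64):
        print(bits, "Bit lanes per register:", lanes_per_register(bits))
```

Six 40Bit lanes use 240 of the 256 bits, which leaves 16 bits spare per register.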

So on operation terms of precision :
32Bit Satisfies HDR,
40Bit Very much satisfies HDR,

16Bit satisfies JPG (basic)
64Bit satisfies LUT & Wide Gamut HDR Pro Rendering

*
Drill texture & image format (with contrast & depth enhancement)

https://drive.google.com/file/d/1G71Vd9d3wimVi8OkSk7Jkt6NtPB64PCG/view?usp=sharing
https://drive.google.com/file/d/1u2Qa7OVbSKIpwn24I7YDbwp2xdbjIOEo/view?usp=sharing

https://science.n-helix.com/2022/08/simd.html

Research topic RS : https://is.gd/Dot5CodecGPU https://is.gd/CodecDolby https://is.gd/CodecHDR_WCG https://is.gd/HPDigitalWavelet https://is.gd/DisplaySourceCode

*

GPU acceleration process : Huffman (c)RS


In the case of a dictionary we create a cubic array: 16 parallel Integer cubes, 32 SiMD.

The FPU is used to compress the core elliptical curve with SVM Matrixing in 3D to 5D for files of 8Mb; the FPU is inherently good versus Crystalline structure. We use the SiMD for comparative matrix & byte swap similarity.

It is always worth remembering that comparative operations are among the most fundamental SiMD functions, but multiply, ADD & divide also exist within SiMD.
Functional FPU code can always use arrays of SiMD to handle chaotic play in the field.

A main example in Huffman is the variance of a wavelet from the main path.
Routes through the main wavelet types are handled by table (on the Amiga for example) and/or FPU!
Micro changes make SiMD viable, on the same principle as a Hive & her ants.

Inherent expansion doubles the expected SiMD use; ideally 2MB RAM per cube.
Taking advantage of a known quantity & precision, we code-block by 16Bit to 128Bit segments.

Self correction allows us to Cube Huffman Decode into blocks; we parallelize the blocks.
To (additionally) handle error we block the original compression.
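
A minimal sketch of the blocking idea, assuming one shared Huffman table and a block size of 16 bytes (both are assumptions for illustration). Each block is encoded on its own, so blocks can be decoded in parallel and an error stays inside one block. Bit strings are kept as Python text for clarity, not speed.

```python
import heapq
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def build_code(data):
    """Build a Huffman table {byte: bitstring} from byte frequencies."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:  # a single symbol still needs a 1-bit code
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie breaker, {symbol: partial code})
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encode_blocks(data, code, block_size=16):
    """Encode every block_size-byte block to its own bitstring."""
    return ["".join(code[b] for b in data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def decode_block(bits, table):
    """Decode one block with a {bitstring: byte} table."""
    out, cur = bytearray(), ""
    for bit in bits:
        cur += bit
        if cur in table:
            out.append(table[cur])
            cur = ""
    return bytes(out)


def decode_blocks(blocks, code):
    """Decode all blocks concurrently; no block depends on another."""
    table = {c: s for s, c in code.items()}
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(lambda b: decode_block(b, table), blocks))
```

The cost of blocking is that every block boundary has to be stored (or padded to), so smaller blocks trade compression ratio for parallelism.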

"We also use fine-grained locking for the frequency dictionary, individually locking each key-value pair. Once the symbol codes have been determined, each symbol is replaced by its code, and all symbols are so processed in parallel.

Decompression is inherently sequential, and hence much harder to parallelize. In this case, we take advantage of the self-synchronizing property of Huffman coding, which allows us to start at an arbitrary point in the encoded bits"
(Quoted from https://jearmoo.github.io/parallel-data-compression/)
Huffman source, Requires analysis https://github.com/catid/Zpng

https://vignan.ac.in/pgr20/20ES011.pdf
https://bestofgithub.com/repo/Better-lossless-compression-than-PNG-with-a-simpler-algorithm

ZPNG
faster than PNG and compresses better for photographic images. This compressor often takes less than 6% of the time of a PNG compressor
https://github.com/catid/Zpng
*

SiMD Chiplet Fast compression & decompression (c)RS


3 proposals


https://is.gd/BTSource

LZ77:
https://github.com/jearmoo/parallel-data-compression

The FastPFOR C++ library : Fast integer compression :
https://github.com/lemire/FastPFor

SIMDCompressionAndIntersection
C/C++ library for fast compression and intersection of lists of sorted integers using SIMD instructions : https://github.com/lemire/SIMDCompressionAndIntersection

Compressor Improvements and LZSSE2 vs LZSSE8
http://conorstokes.github.io/compression/2016/02/24/compressor-improvements-and-lzsse2-vs-lzsse8
http://conorstokes.github.io/compression/2016/02/15/an-LZ-codec-designed-for-SSE-decompression

Compression Science Docs


A General SIMD-based Approach to Accelerating Compression Algorithms
https://arxiv.org/ftp/arxiv/papers/1502/1502.01916.pdf

SIMD Compression and the Intersection of Sorted Integers
http://boytsov.info/pubs/simdcompressionarxiv.pdf

Fast Integer Compression using SIMD Instructions
https://www.uni-mannheim.de/media/Einrichtungen/dws/Files_People/Profs/rgemulla/publications/schlegel10compression.pdf

Fast integer compression using SIMD instructions
https://www.researchgate.net/publication/220706907_Fast_integer_compression_using_SIMD_instructions

*****


https://jearmoo.github.io/parallel-data-compression/

GO

https://github.com/zentures/encoding

http://zhen.org/blog/benchmarking-integer-compression-in-go/

https://github.com/golang/snappy

The FastPFOR C++ library : Fast integer compression

What is this?

A research library with integer compression schemes. It is broadly applicable to the compression of arrays of 32-bit integers where most integers are small. The library seeks to exploit SIMD instructions (SSE) whenever possible.

This library can decode at least 4 billions of compressed integers per second on most desktop or laptop processors. That is, it can decompress data at a rate of 15 GB/s. This is significantly faster than generic codecs like gzip, LZO, Snappy or LZ4.
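
To show why arrays of small integers compress well, here is a minimal bit-packing sketch. It is not FastPFor's API or its actual scheme; the function names are hypothetical, and it simply stores every value of a block with the bit width of the largest value.

```python
def pack_block(values):
    """Pack non-negative ints using the bit width of the largest value."""
    width = max(max(values).bit_length(), 1) if values else 1
    word = 0
    for i, v in enumerate(values):
        word |= v << (i * width)
    nbytes = (width * len(values) + 7) // 8
    return width, len(values), word.to_bytes(nbytes, "little")


def unpack_block(width, count, payload):
    """Inverse of pack_block."""
    word = int.from_bytes(payload, "little")
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(count)]


if __name__ == "__main__":
    values = [3, 1, 7, 0, 5, 2, 6, 4]
    width, count, payload = pack_block(values)
    print(width, "bits per value,", len(payload), "bytes instead of", 4 * count)
```

Eight values below 8 need 3 bits each, so the block takes 3 bytes instead of the 32 bytes of eight 32-bit integers. A SIMD implementation would do the same shifts and masks on several lanes at once.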

https://github.com/lemire/FastPFor

https://github.com/lemire/FastPFor/archive/refs/tags/v0.1.8.zip

https://github.com/lemire/FastPFor/archive/refs/tags/v0.1.8.tar.gz

Java: may have a use in JS
https://github.com/lemire/JavaFastPFOR

https://github.com/lemire/JavaFastPFOR/blob/master/benchmarkresults/benchmarkresults_icore7_10may2013.txt

*****

SIMDCompressionAndIntersection


C/C++ library for fast compression and intersection of lists of sorted integers using SIMD instructions : https://github.com/lemire/SIMDCompressionAndIntersection


As the name suggests, this is a C/C++ library for fast compression and intersection of lists of sorted integers using SIMD instructions. The library focuses on innovative techniques and very fast schemes, with particular attention to differential coding. It introduces new SIMD intersections schemes such as SIMD Galloping.

This library can decode at least 4 billions of compressed integers per second on most desktop or laptop processors. That is, it can decompress data at a rate of 15 GB/s. This is significantly faster than generic codecs like gzip, LZO, Snappy or LZ4.
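
The two ideas named above, differential coding and galloping intersection, can be sketched in scalar form. This is an illustration only, not the library's SIMD code or its API; the function names are hypothetical.

```python
from bisect import bisect_left


def delta_encode(sorted_values):
    """Store each value as the gap from its predecessor (first gap is from 0)."""
    prev, out = 0, []
    for v in sorted_values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas):
    """Rebuild the sorted values with a running sum."""
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out


def gallop_intersect(small, large):
    """Intersect two sorted lists, galloping through the larger one."""
    out, lo, n = [], 0, len(large)
    for x in small:
        step = 1
        # Double the step until large[lo + step] >= x or the list ends.
        while lo + step < n and large[lo + step] < x:
            step *= 2
        hi = min(lo + step + 1, n)
        # Binary search inside the bracket found by galloping.
        lo = bisect_left(large, x, lo + step // 2, hi)
        if lo < n and large[lo] == x:
            out.append(x)
    return out
```

Sorted lists give small gaps, and small gaps are exactly what a bit-packing integer codec stores cheaply. Galloping pays off when one list is much shorter than the other.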

*****LZ77*****

Principally an order & load+Vec https://github.com/jearmoo/parallel-data-compression

https://jearmoo.github.io/parallel-data-compression/


Summary of What We Completed

We have written and optimized the sequential version of the Huffman encoding and decoding algorithms, and tested it. For the parallel CPU version of this, we were debating between SIMD intrinsics and ISPC, and OpenMP.

However, Huffman coding compression and decompression doesn’t seem to have a workload that can appropriately use SIMD. This is because there is no elegant way of dealing with bits instead of bytes in SIMD. Moreover, different bytes compress to a different number of bits (there is no fixed mapping of input vector size to output vector size), which makes byte alignment in SIMD very difficult (for example, the compressed form for a random 4 byte input could range from 2 to 4 bytes). This is much worse for decompression, where resolving bit-level conflicts (where a specific encoding spreads over 2 bytes) is almost impossible and might actually result in the algorithm being slower than the sequential version. Therefore, we decided to focus on OpenMP.

For compression, we first sort the array in parallel, to minimize number of concurrent updates to the shared frequency dictionary, reducing contention and false sharing. We also use fine-grained locking for the frequency dictionary, individually locking each key-value pair. Once the symbol codes have been determined, each symbol is replaced by its code, and all symbols are so processed in parallel.
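
The project above sorts the input and locks each dictionary entry. A different and simpler way to avoid contention on the frequency dictionary is sketched here as an assumption of mine, not the project's method: every worker counts its own chunk privately and the partial counts are merged at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def chunked(data, n_chunks):
    """Split data into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_frequencies(data, n_chunks=4):
    """Count symbol frequencies with one private Counter per chunk."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        for partial in pool.map(Counter, chunked(data, n_chunks)):
            total += partial  # merge step: no shared writes during counting
    return total


if __name__ == "__main__":
    print(parallel_frequencies(b"abracadabra"))
```

In CPython the threads mostly show the structure rather than a speed-up, because of the global interpreter lock; the same split-count-merge shape carries over to OpenMP or SIMD histograms.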

Decompression is inherently sequential, and hence much harder to parallelize. In this case, we take advantage of the self-synchronizing property of Huffman coding, which allows us to start at an arbitrary point in the encoded bits, and assume that at some point, the offset in bits will correct itself, resulting in the correct output thereafter.
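
The self-synchronizing behaviour can be seen with a small sketch. The four-symbol prefix code below is a hypothetical example, not taken from the project. Decoding starts at a wrong bit offset, emits one or two wrong symbols, then lands on a true code boundary and is correct from there on. Resynchronization is not guaranteed for every code or input, which is why the quoted text only assumes it happens at some point.

```python
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}  # hypothetical prefix code
DECODE = {bits: sym for sym, bits in CODE.items()}


def encode(message):
    """Concatenate the code of every symbol into one bitstring."""
    return "".join(CODE[s] for s in message)


def decode_from(bits, start):
    """Decode from bit offset start; return [(symbol, end_offset), ...]."""
    out, cur = [], ""
    for pos in range(start, len(bits)):
        cur += bits[pos]
        if cur in DECODE:
            out.append((DECODE[cur], pos + 1))
            cur = ""
    return out


def resync_point(bits, start):
    """First offset where a decode begun at start hits a true boundary."""
    true_ends = {end for _, end in decode_from(bits, 0)}
    for _, end in decode_from(bits, start):
        if end in true_ends:
            return end
    return None


if __name__ == "__main__":
    bits = encode("cabdacba")
    for start in (0, 1, 7):
        symbols = "".join(s for s, _ in decode_from(bits, start))
        print(start, symbols, resync_point(bits, start))
```

A parallel decoder can use this by letting each worker start at a guessed offset and overlap slightly into its neighbour's range, keeping only the output after the point where the two decodes agree.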

We read about the LZ77 algorithm and explored the different variants of the algorithm. We also explored different ways to parallelize LZ77. One naive approach is running the LZ77 algorithm along different segments of the data. This approach could output the same result as the sequential implementation if we use a fixed size sliding window and reread over some of the data. Another approach is the one outlined in Practical Parallel Lempel-Ziv Factorization which uses an unbounded sliding window and employs the use of prefix sums and segment trees to calculate the Lempel-Ziv factorization in parallel.
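
A minimal sketch of the naive segment approach, assuming a fixed sliding window and (offset, length, next byte) triples. The window size, match limit and function names are my own illustrative choices. Segments are compressed independently here, without the reread over the previous segment that the text mentions, so matches across segment boundaries are lost.

```python
def lz77_compress(data, window=32, max_len=15):
    """Return a list of (offset, length, next_byte) triples."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # Stop one byte early so a literal always follows the match.
            while (length < max_len and i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out


def lz77_decompress(triples):
    """Rebuild the bytes; byte-wise copy permits overlapping matches."""
    out = bytearray()
    for off, length, nxt in triples:
        for _ in range(length):
            out.append(out[-off])
        out.append(nxt)
    return bytes(out)


def compress_segments(data, segment_size, **kw):
    """Compress fixed-size segments independently (parallelizable)."""
    return [lz77_compress(data[i:i + segment_size], **kw)
            for i in range(0, len(data), segment_size)]


def decompress_segments(segments):
    """Decompress every segment and join the results in order."""
    return b"".join(lz77_decompress(s) for s in segments)
```

Each call in compress_segments is independent, so the list comprehension could be handed to a thread or process pool without changing the output.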

Update on Deliverables

Our sequential implementations are close to finished, and we have some idea of how to parallelize the algorithms. Our goal for the checkpoint was to have both of these parts finished, but we have not completely met the goal. We may pivot and work on parallelizing the compression and decompression of the Huffman coding algorithm and drop the LZ77 part of the project altogether.

Our new goals:

Parallelize the Huffman Coding compression.
Parallelize the Huffman Coding decompression or LZ77 compression

Hope to achieve:
Both parts of part 2 in our new goals.

*****ZPNG


Huffman source, Requires analysis https://github.com/catid/Zpng

Small experimental lossless photographic image compression library with a C API and command-line interface.

It's much faster than PNG and compresses better for photographic images. This compressor often takes less than 6% of the time of a PNG compressor and produces a file that is 66% of the size. It was written in just 500 lines of C code thanks to Facebook's Zstd library.
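
The general filter-then-compress shape can be sketched as follows. This is not Zpng's code: the left-neighbour filter is an illustrative assumption, and zlib stands in for Zstd only because it ships with Python.

```python
import zlib


def filter_left(pixels, width, channels):
    """Replace each byte by its difference from the pixel to the left (mod 256)."""
    stride = width * channels
    out = bytearray(len(pixels))
    for row in range(0, len(pixels), stride):
        for x in range(stride):
            left = pixels[row + x - channels] if x >= channels else 0
            out[row + x] = (pixels[row + x] - left) & 0xFF
    return bytes(out)


def unfilter_left(filtered, width, channels):
    """Inverse of filter_left: running sum along each row."""
    stride = width * channels
    out = bytearray(len(filtered))
    for row in range(0, len(filtered), stride):
        for x in range(stride):
            left = out[row + x - channels] if x >= channels else 0
            out[row + x] = (filtered[row + x] + left) & 0xFF
    return bytes(out)


def compress_image(pixels, width, channels):
    """Filter, then hand the residuals to a general-purpose compressor."""
    return zlib.compress(filter_left(pixels, width, channels), 6)


def decompress_image(blob, width, channels):
    """Decompress, then undo the filter."""
    return unfilter_left(zlib.decompress(blob), width, channels)
```

The input length must be a whole number of rows (width x channels bytes each). On smooth photographic content the residuals cluster near zero, which is what lets the entropy stage do most of the work.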

The goal was to see if I could create a better lossless compressor than PNG in just one evening (a few hours) using Zstd and some past experience writing my GCIF library. Zstd is magical.

I'm not expecting anyone else to use this, but feel free if you need some fast compression in just a few hundred lines of C code.

**************************

Main interpolation references:


Interpolation https://drive.google.com/file/d/1dn0mdYIHsbMsBaqVRIfFkZXJ4xcW_MOA/view?usp=sharing

ICC & FRC https://drive.google.com/file/d/1vKZ5Vvuyaty5XiDQvc6LeSq6n1O3xsDl/view?usp=sharing

FRC Calibration >

FRC_FCPrP(tm):RS (Reference)

https://drive.google.com/file/d/1hEU6D2nv03r3O_C-ZKR_kv6NBxcg1ddR/view?usp=sharing

FRC & AA & Super Sampling (Reference)

https://drive.google.com/file/d/1AMR0-ftMQIIC2ONnPc_gTLN31zy-YX4d/view?usp=sharing

Audio 3D Calibration

https://drive.google.com/file/d/1-wz4VFZGP5Z-1lG0bEe1G2MRTXYIecNh/view?usp=sharing

2: We use a reference palette to get the best out of our LED; Such a reference palette is:

Rec709 Profile in effect : use today! https://is.gd/ColourGrading

Rec709 <> Rec2020 ICC 4 Million Reference Colour Profile : https://drive.google.com/file/d/1sqTm9zuY89sp14Q36sTS2hySll40DilB/view?usp=sharing

For Broadcasting, TV, Monitor & Camera https://is.gd/ICC_Rec2020_709

ICC Colour Profiles for compatibility: https://drive.google.com/file/d/1sqTm9zuY89sp14Q36sTS2hySll40DilB/view?usp=sharing

https://is.gd/BTSource

Colour Profile Professionally

https://displayhdr.org/guide/
https://www.microsoft.com/store/apps/9NN1GPN70NF3

*Files*

This one will suit a Dedicated ARM Machine in body armour 'mental state', ARM Router & TV https://drive.google.com/file/d/102pycYOFpkD1Vqj_N910vennxxIzFh_f/view?usp=sharing

Android & Linux ARM Processor configurations; upgrade files for routers & TVs, Update & improve
https://drive.google.com/file/d/1JV7PaTPUmikzqgMIfNRXr4UkF2X9iZoq/

Provenance: https://www.virustotal.com/gui/file/0c999ccda99be1c9535ad72c38dc1947d014966e699d7a259c67f4df56ec4b92/
https://www.virustotal.com/gui/file/ff97d7da6a89d39f7c6c3711e0271f282127c75174977439a33d44a03d4d6c8e/

Python Deep Learning: configurations

AndroLinuxML : https://drive.google.com/file/d/1N92h-nHnzO5Vfq1rcJhkF952aZ1PPZGB/view?usp=sharing

Linux : https://drive.google.com/file/d/1u64mj6vqWwq3hLfgt0rHis1Bvdx_o3vL/view?usp=sharing

Windows : https://drive.google.com/file/d/1dVJHPx9kdXxCg5272fPvnpgY8UtIq57p/view?usp=sharing
