DSC & Codec Direct Write Chunk Allocator: SMT & Hyper Threading : (c)RS 2025
To take advantage of a DSC screen write that follows Dual Blend, the frame is written as multiple blocks per group of scan-lines,..
According to codec development & PAL / NTSC screen sizes, the estimated optimum block sizes are 8x8 & 16x16,..
AMD & Intel CPUs historically went about allocating two threads per core differently: AMD mostly used SMT & Intel used Hyper-Threading (Intel's name for its own SMT implementation),..
These days both use SMT of various forms; with asymmetric core sizes (mixed large & small cores), Intel & ARM often cannot align SMT,..
By my reasoning, however, SMT works fine when threads are allocated to identical cores on the same CU, aligned by speed & feature..
What is all the SMT & Hyper threading Invention about then RS?
We are making a block allocator that runs on Hyper-Threading / SMT threads in multiple groups
PAL / NTSC : HD, 4K, 8K : HDR & WCG
[16x16] , [16x16] , [16x16] , [16x16] , ..
[16x16] , [16x16] , [16x16] , [16x16] , ..
[16x16] , [16x16] , [16x16] , [16x16] , ..
[16x16] , [16x16] , [16x16] , [16x16] , ..
The screen can be drawn in cubic measurements as planned in DualBlend & sent to the screen surface as texture blocks.. known as Cube-Maps,..
Latency will be low & allow us to render the screen from both the CPU & GPU
CPU SMT parallel render blocks:
A: 1, 2
B: 1, 2
GPU SiMD 2D Layer parallel render blocks:
A: 1, 2, 3, 4
B: 1, 2, 3, 4
C: 1, 2, 3, 4
D: 1, 2, 3, 4
We will be rendering the CPU into the GPU layer when we need to!
We will be rendering Audio & Graphics using SMT & parallel Compute Shading,..
With rasterization from both to final frames on GPU that are directed to the display compressed from GPU Pixel-Shaders.
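A minimal sketch of the block allocation described above, assuming 16x16 tiles and a simple round-robin deal of tiles into group/lane slots. The names `tile_grid` and `allocate` and the round-robin policy are illustrative assumptions, not a defined API; a real scheduler would also weigh cache locality and core alignment.

```python
from math import ceil

TILE = 16  # tile edge in pixels; the text also mentions 8x8


def tile_grid(width, height, tile=TILE):
    """Return (tile_x, tile_y) coordinates covering the frame, row-major."""
    cols = ceil(width / tile)
    rows = ceil(height / tile)
    return [(tx, ty) for ty in range(rows) for tx in range(cols)]


def allocate(tiles, groups, lanes):
    """Deal tiles round-robin into groups x lanes work slots (assumed policy).

    groups=2, lanes=2 mirrors the CPU layout (A: 1, 2 / B: 1, 2);
    groups=4, lanes=4 mirrors the GPU layout (A..D: 1, 2, 3, 4).
    """
    slots = {(chr(ord("A") + g), lane + 1): []
             for g in range(groups) for lane in range(lanes)}
    keys = list(slots)
    for i, t in enumerate(tiles):
        slots[keys[i % len(keys)]].append(t)
    return slots


tiles = tile_grid(1920, 1080)              # HD frame: 120 x 68 tiles
cpu = allocate(tiles, groups=2, lanes=2)   # 4 SMT work slots
gpu = allocate(tiles, groups=4, lanes=4)   # 16 SiMD work slots
print(len(tiles), len(cpu[("A", 1)]), len(gpu[("D", 4)]))  # 8160 2040 510
```

The 1080 line frame does not divide evenly by 16, so the last tile row is partial; the sketch still counts it as a full tile to keep the allocation uniform.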
Rupert S
*
Texture & codec formats such as BC, DXT, ETC2, VP9, VVC, H265, H264, H263, JPG & PNG : PNG is an open standard : Nothing wrong with using Colour Table Interpolation : (c)RS
https://www.w3.org/TR/png-3/#4Concepts.Scaling
Colour Table Interpolation, What is it & how we use it,
What we have is 4 layers of colour RGBA & it is to be done 2 ways,..
R Red
G Green
B Blue
A Alpha
I Interleave properties & compression standard bits,
Storage intentions, 32Bit values composed of 1 to 8Bit values in DOT
4 layers
R, R, R
G, G, G
B, B, B
A, A, A
I, I, I
High profile alteration & single colour matrix compression, Fast to compress in 4 streams = 2 SMT threads or 4 parallel SiMD & pixel line scan compression,..
RGB, RGB, RGB
A , A , A
I , I , I
Pixel Matrix
[], [], []
[], [], []
[], [], []
Compact pixel arrays that compress fast on large bit depth arrays such as 256Bit AVX & 64Bit Integers & FP on CPU,..
Interlacing is done with an additional layer containing multiple properties per pixel, Or alternatively very low bit weight feature sets,..
Allows blending of colours to averages over 1x1 to 32x32 pixel blocks, Compression bit properties are an example use.
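A small sketch of the block-average blend on one planar channel, assuming a flat row-major list of 8Bit samples. The function name `block_average` and the integer (floor) averaging are assumptions for illustration; the same loop would run once per plane (R, G, B, A) on separate threads or SiMD lanes.

```python
def block_average(plane, width, height, block):
    """Average one channel plane over block x block pixel cells.

    plane is a flat row-major list of samples; partial cells at the right and
    bottom edges are averaged over the pixels they actually contain.
    """
    out = []
    for by in range(0, height, block):
        for bx in range(0, width, block):
            total = count = 0
            for y in range(by, min(by + block, height)):
                for x in range(bx, min(bx + block, width)):
                    total += plane[y * width + x]
                    count += 1
            out.append(total // count)
    return out


red = [0,  0,  255, 255,
       0,  0,  255, 255,
       10, 20, 30,  40,
       50, 60, 70,  80]

print(block_average(red, 4, 4, 2))  # [0, 255, 35, 55]
print(block_average(red, 4, 4, 4))  # [86]
```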
Rupert S
*
Planar Data Types for limited size SiMD with large parallelism:(c)RS
Defining 8Bit & 16Bit SiMD & Matrix as capable of applying a gradated & skillful response to RGBA & RGB+BW 8,8,8,8, 10,10,10,2 & yes 565 RGB,
We observe that 8 bit & 16 Bit SiMD have limited bit-depth in maximum Byte size:
Console, EdgeTPU & Intel's Xe graphics architecture
Xe Vector Engine (XVE)
Xe3 XVEs can run 10 threads concurrently
https://old.chipsandcheese.com/2025/03/19/looking-ahead-at-intels-xe3-gpu-architecture/
https://www.intel.com/content/www/us/en/developer/articles/technical/xess-sr-developer-guide.html
We would rather handle data planar in FP8 & Int8 8,8,8,8 & have a total precision of 32Bit HDR & variously FP16 & Int16 10,10,10,2 & 16,16,16,16
Handling logic of Planar & Combined Byte Colour & Pixel handling..
Various 4bit & 8Bit & so on inferencing enabled colour packing systems,..
These allow systems such as Intel, AMD, NPU & GPU to use 4Bit & 8Bit & 16Bit packed SiMD,..
Packed SiMD are parallel in nature, But they require colour systems.
111 & 1111 & 11111
222 & 2222 & 22222
444 & 4444 & 44444
888 & 8888 & 88888
& so on
Example low bit Alpha & BW
5551 means 5,5,5 bits of colour & 1 bit of Alpha. What do we do with a 1 bit BW or Alpha? Map it to 75%, 50% or 25% BW, to transparency, or to a shader-set level!
4 layers handled Planar, Example for fast parallel SiMD
R, R, R
G, G, G
B, B, B
A, A, A
I, I, I
for 8bit:
8,8,8,8
With maths to solve:
2321, 2222 4bit + RG, RB, RA, GA, BA 8Bit
565 + RG, RB, RA, GA, BA for half precision
565 & 8,8,8,8 & 10,10,10,2 RGBA for single & double precision
10,10,10,2 & 16,16,16,16 RGBA Double Precision
& Combined Bytes for higher precision Powerful SiMD
RGB, RGB, RGB
A , A , A
I , I , I
2321, 2222 8Bit
565 + RG, RB, RA, GA, BA for half precision
565 & 8,8,8,8 & 10,10,10,2 RGBA for single precision
10,10,10,2 & 16,16,16,16 RGBA Double Precision
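A hedged sketch of how the bit layouts above (565, 5551, 8,8,8,8 & 10,10,10,2) can be packed into & unpacked from one integer word. The helper names `pack` & `unpack` and the channel order (first channel in the most significant bits) are assumptions; real texture formats differ in byte & channel order.

```python
RGB565 = (5, 6, 5)
RGBA5551 = (5, 5, 5, 1)
RGBA8888 = (8, 8, 8, 8)
RGBA1010102 = (10, 10, 10, 2)


def pack(values, widths):
    """Pack channel values into one integer, first channel in the top bits."""
    word = 0
    for v, w in zip(values, widths):
        if not 0 <= v < (1 << w):
            raise ValueError(f"{v} does not fit in {w} bits")
        word = (word << w) | v
    return word


def unpack(word, widths):
    """Inverse of pack: split a word back into its channel values."""
    values = []
    for w in reversed(widths):
        values.append(word & ((1 << w) - 1))
        word >>= w
    return values[::-1]


print(hex(pack((31, 0, 0), RGB565)))        # 0xf800
print(unpack(0xF800, RGB565))               # [31, 0, 0]
word = pack((1023, 512, 0, 3), RGBA1010102)
print(unpack(word, RGBA1010102))            # [1023, 512, 0, 3]
print(sum(RGBA1010102), sum(RGBA8888))      # 32 32
```

Both 8,8,8,8 & 10,10,10,2 fill exactly one 32Bit word, which is why either can be handled as a single combined value or split into planes.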
The status of Planar versus block solve is an issue that depends on what you wish to do!
Single channel compression is a first-tier example, where single colour blends & compression are smoother but require larger parallel arrangements,..
Micro-block planar has memory overhead, But not over a large field array.
Merged RGB allows same block larger cycles & more efficient RAM arrays
(c)RS
*
DSC YCbCr Acceleration : Method
Y is significantly more important than CbCr according to Wikipedia & Bard,.. My basic thought is that Cb & Cr are referenced in 8 bit,
I am less than convinced that we need YCbCr to be all 8 bit these days,.. Because of HDR,.. Now to be clear, the DSC display codec is defined through that 8Bit pinhole,..
As a user of YCbCr myself, through the display settings in AMD's control panel, I have tested RGB versus YCbCr over & over with a colour monitor DataSpyder 48Bit & the difference in 10 Bit mode is clearly very small!
The composition of YCbCr is clearly good for most colours & the small difference from RGB in 10Bit means that you have more bandwidth to spare,..
For example, in HDMI 2 mode (at 4K 60Hz) RGB is set to 8Bit, With YCbCr 4:2:2 the mode is 12Bit,.. There is a clear advantage to YCbCr modes being able to set 4:2:2! Simple!
My first method involves having FP16 & FP8 in the SiMD line:
FP16:Y, FP8:Cb&Cr
Clearly this is faster; the HDR range is higher & the WCG remains approximately the same apart from green, & that is faster!
All FP16: YCbCr is a much deeper data usage on the HDMI & DP cable, But at up to 80Gbit/s (DisplayPort UHBR20; HDMI 2.1 reaches 48Gbit/s) .. Why not enjoy rich HDR & WCG!
FP16 with FP8 still offers more to the user than all FP8 YCbCr that is used by default! & still only uses 1/3 more data!..
& Is much richer..
Now I was saying FP8 but more likely it is INT8!,.. We could improve this situation if integer is required..
Int16: Y & Int8: CbCr , Again improving Y improves the HDR level & improves average colour differences on both Cb & Cr & Y,..
Permission to use Int16 for all and we get : INT16: YCbCr, But again this value does double bandwidth requirements,..
But again! With 80Gbit/s class HDMI & DP links & Again .. Maybe only 4K @ 120Hz,
Because yes we wanted a richer experience & in any case.. we are using standard LED for TV.
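The bandwidth claims above can be checked with a short worked calculation. This is a sketch under stated assumptions: the helper names are hypothetical, the bit counts apply equally to FP or Int samples, and the payload figure counts active pixels only (no blanking, no link coding overhead, no DSC).

```python
from fractions import Fraction


def bits_per_pixel(y_bits, c_bits, subsampling=(4, 4, 4)):
    """Average bits per pixel for YCbCr with J:a:b chroma subsampling."""
    j, a, b = subsampling
    luma = 2 * j              # luma samples in a J x 2 pixel cell
    chroma = 2 * (a + b)      # Cb samples plus Cr samples in the same cell
    return Fraction(luma * y_bits + chroma * c_bits, luma)


def payload_gbit_per_s(width, height, hz, bpp):
    """Uncompressed active-pixel payload in Gbit/s."""
    return float(width * height * hz * bpp) / 1e9


all_8 = bits_per_pixel(8, 8)      # 8Bit Y, 8Bit Cb & Cr
mixed = bits_per_pixel(16, 8)     # 16Bit Y, 8Bit Cb & Cr
all_16 = bits_per_pixel(16, 16)   # 16Bit everything

print(all_8, mixed, all_16)              # 24 32 48
print(mixed / all_8, all_16 / all_8)     # 4/3 2
print(bits_per_pixel(12, 12, (4, 2, 2))) # 24
print(payload_gbit_per_s(3840, 2160, 120, all_16))
```

16Bit Y with 8Bit CbCr costs 4/3 of the all-8Bit payload (1/3 more), all-16Bit costs double, & 12Bit 4:2:2 costs the same 24 bits per pixel as 8Bit 4:4:4. The last line comes to roughly 47.8 Gbit/s of pixel payload for 4K @ 120Hz at 48 bits per pixel.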
The 2 methods we would be using are:
4 layers handled Planar, Example for fast parallel SiMD
R, R, R
G, G, G
B, B, B
A, A, A
I, I, I
Combined Bytes for higher precision Powerful SiMD
RGB, RGB, RGB
A , A , A
I , I , I
Planar being more natural to YCbCr,.. Because they begin planar due to the maths we use!
https://en.wikipedia.org/wiki/YCbCr
(c)Rupert S
*
Planar Colour Expansion bits in RGB (c)RS
XBox 4bit SiMD, 8Bit PS5 & RX GPU & Intel 8Bit XMM 8 x parallel SiMD, RTX Mul Matrix & NPU's
Now the exact reasoning behind the 8888 RGB+BW mode may come as a surprise to you but I have experience with VGA & Scart cables and they have 3 Colour pins & one BW,..
Now they have both digital & Analogue & there are merits to both,
Jagged Digital is sharper digital,.. Analogue is naturally blended in the form of non digital blending,..
But 4 Pin RGB+BW is my own system of use & I made cables comply with my theory at university..
I made them for my friends & family & they worked on PS2, PS1, Nintendo 32 & PC's
But yes ok, 4 x 8Bit channels, Is that relevant today? We have 10Bit! Yes it is,.. You see, Black & White adds an effect we call HDR to a display,..
BW channel adds a lot of contrast & sharp black edges that we call .. Clean Image Generation,..
Now HDMI & DisplayPort both output to VGA & SVGA on demand, So the BW channel is still active,..
We can use the 4 colour system & produce a very active HDR, WCG will require the use of supplements to the standard ..
Such as 10Bit! Yes we have the principles & We have methods..
4 Bit Inferencing & 8Bit inferencing such as the TPU 5e are to be used to handle video,..
4 Bit tops are a challenge to produce HDR & WCG & Planar Texture formats are our usable function call,..
Format examples:
16Bit, 8Bit & 4Bit multi thread, combined endpoint
2, 2, 2, 2 , 2x 4Bit mode or 1x 8Bit
4, 4, 4, 4 , RGBA & RGB+BW
4, 4, 4, 2, 2 , RGBA+BW
8, 8, 8, 4, 4 , RGBA+BW
8, 8, 8, 8 , RGBA, RGB+BW
Alternative additional colour format examples, I do not wish to iterate every conclusive answer..
4, 4, 4, +1r, 1g, 1b + BW or A or BW + A
8, 8, 8, +2r, 2g, 2b + BW or A or 1, 1 BW + A
& There you go! Now you may be wondering, But TOPS-heavy systems.. being unable to do art ? No way!
Rupert S
*
Fetch Cycles & SiMD : Base texture awareness.. (c)RS
Primarily being aware that the base texture is going to be codified in either..
Planar data type: per channel R, G, B, BW, with 4x & 5x channel parallel processing, to handle data larger than the total data width, in layers
Grouped data: where you grab an array that includes as much of the data as possible in a single channel, F16, F32, F64 data types when given 8 Bit & 10 Bit data
As stated the reasoning for planar handling is for the 4Bit & 8Bit & F16 SiMD being unable to process it all in a single pass..
Planar handling of data is aimed at parallel SiMD & multiple passes by processor (the processor is fast!)
Single pass data handling is normal for 32Bit processors, When handling 8Bit Data, 24Bit & 32Bit total size..
64Bit processors can single pass most Data Types such as 8Bit & 10Bit & only have to worry about planar handling for 16Bit per channel data..
Your motives for handling data Planar are the clear advantages of Single channel data processing & parallelism,..
When you smooth single channel data, You have a very smooth blend, When you sharpen it,..
The data is very pure!
64Bit & 32Bit SiMD; Block data handling for processing has advantages..
Single data passes require fewer fetches, Planar data can require more fetches per cycle,..
Smooths & sharpens involve a single pass that includes all channels, That can be good!
So planar fetching is 3, 4 or 5 passes, You can group them in DMA,..
Single fetching with 64Bit processors requires fewer fetch calls in the stack.
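A rough counting sketch of the fetch trade-off above, assuming every channel has the same bit width & that a "fetch" is one register-sized load. The function `fetch_plan` & its cost model are illustrative assumptions; real hardware adds caches, DMA grouping & alignment rules.

```python
from math import ceil


def fetch_plan(pixels, channel_bits, channels, register_bits, planar):
    """Return (passes, loads) needed to read `pixels` pixels once.

    Planar: one pass per channel plane, each plane rounded up to whole loads.
    Packed: one pass over the combined pixel data.
    """
    if planar:
        per_plane = ceil(pixels * channel_bits / register_bits)
        return channels, channels * per_plane
    return 1, ceil(pixels * channel_bits * channels / register_bits)


# A 16x16 block of 8,8,8,8 pixels on 64Bit registers
print(fetch_plan(256, 8, 4, 64, planar=True))   # (4, 128)
print(fetch_plan(256, 8, 4, 64, planar=False))  # (1, 128)

# A micro-block of 5 pixels: planar rounds up once per plane
print(fetch_plan(5, 8, 4, 64, planar=True))     # (4, 4)
print(fetch_plan(5, 8, 4, 64, planar=False))    # (1, 3)
```

Under this model the total loads match on a full block & planar only costs extra passes; on a micro-block the per-plane rounding makes planar more expensive, which fits the point that the planar overhead fades over a large array.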
Rupert S
*
Colour Definition, 8 Bit & 32Bit & 64Bit quantification (c)RS
The other day I was writing about 8 Bit in terms of colour & saying the big issue with 8Bit SiMD such as Intel & AMD & NVidia have as of 2024 is defining colours in HDR & WCG
The prime colour palette of 10, 10, 10, 2 colour presents no issue to 32Bit Integer on ARM & CPU processors,..
Indeed 32 bit data types are perfect for 32Bit Integers & floats, Indeed my primary statement is that in terms of 10Bit, 32Bit is perfect,..
Indeed a 32 Bit type such as 9, 9, 9, 5 : RGB+BW is perfect for many scenarios,..
But as we can see 9 bits per colour & 5 Bits for BW presents quite a large palette,..
My argument for the 10, 10, 10, 2 RGB+BW palette presents quite an argument to bard, Because bard thinks that 2 bits of BW probably presents nothing much to define!
However my data set goes like this, The 2 bit represents a total of 4 states,..
That is 4 Defining variables in light to dark palette,.. 4 levels of light to dark..
So 10, 10, 10 = 30 Bit & Multiply 30 Bit * 4 Versions! Sounds like a lot doesn't it!...
Not convinced yet ? The 30Bit is still controlled by the shade of light it produces..
Gamma curving the palette of the 30 Bit produces a variance in light levels over colour palette ..
Combine this with 4 levels (2 bits) of BW & that is quite good.
9,9,9,5 presents the next level in light & dark in 32Bit, As you think about it,..
Presenting the case where the colour brightness presents a total of 32 variations (5 bits) in level of brightness!
8,8,8,8 RGB+BW presents 256 levels (8 bits) of BW & yet presents a total of 32Bit..
So presenting a.. 2 operations per pixel mode should be no issue? Could we do that ?
We could present colour palettes with 2 x 32 Bit operations.. Like so:
8,8,8,8 or 9,9,9,5 or 10, 10,10, 2 & an additional operation of one of those... with additive LUT,..
In terms of screen Additive LUT ADDS 2 potential values per frame & effectively refreshes the LED 2x per refresh cycle (additive),..
Our approach to 8Bit would be the same,.. Primarily for 8Bit palette we would use 4 x operation,..
On single pure channels R , G, B, BW
Grouped 8Bit such as intel has could operate on the 4 channels in 8Bit per colour & 8Bit BW,..
Presenting the 8,8,8,8 channel arrangement = 32Bit,..
& there is our solution, Multiple refreshes per luminance cycle of LED for 32Bit * many & singularly presents an argument of how to page flip..
8Bit SiMD
32Bit
64Bit
For a total High complexity LUT package for LED
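A small sketch of the level counts & of one possible reading of the 2-operation additive mode, assuming the display sums two sub-frame values per refresh. The names `levels` & `split_additive` are hypothetical & the even split is an assumption; the text does not fix how the additive LUT divides a value.

```python
def levels(bits):
    """Number of distinct states a BW or colour field of `bits` bits can hold."""
    return 1 << bits


def split_additive(target, op_bits=8):
    """Split `target` into two op_bits-wide values whose sum equals `target`."""
    max_op = (1 << op_bits) - 1
    if not 0 <= target <= 2 * max_op:
        raise ValueError("target is out of range for two additive operations")
    return (target + 1) // 2, target // 2


print(levels(2), levels(5), levels(8))  # 4 32 256
print(split_additive(301))              # (151, 150)
print(split_additive(510))              # (255, 255)
```

With two 8Bit operations per pixel the summed range is 0 to 510, so 511 levels, just under 9 bits per channel.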
(c)Rupert S
*****
A data processing strategy for modern GPUs and NPUs, focusing on the efficient use of wide, lower-precision SiMD (Single Instruction, Multiple Data) units,
such as those found in Console, EdgeTPU & Intel's Xe graphics architecture.
The core proposal is to use planar data layouts for color information to maximize the parallelism of hardware that excels at 8-bit and 16-bit operations.
The Challenge: Limited Bit-Depth in Wide SiMD
Modern processors, particularly GPUs like Intel Xe and various NPUs (Neural Processing Units),
achieve high performance through massive parallelism.
They use wide SiMD vector engines that can perform the same operation on many pieces of data simultaneously.
However, these execution units often operate most efficiently on smaller data types, such as 8-bit integers (Int8) or 8-bit floating-point numbers (FP8)..
This presents a challenge when working with standard, high-precision color formats like 32-bit RGBA (8,8,8,8) or higher-dynamic-range formats (10,10,10,2, 16,16,16,16).
The traditional method of storing pixel data is packed or interleaved, where all the color components for a single pixel are stored together in memory:
[R1, G1, B1, A1], [R2, G2, B2, A2], [R3, G3, B3, A3], ...
This layout is inefficient for wide, 8-bit SiMD units because the processor must de-interleave the data before it can perform parallel operations on a single color channel.
The Solution: Planar Data Layouts
The proposed solution is to organize data in a planar format..
In this layout, all data for a single channel is stored contiguously in memory, creating separate "planes" for each component.
For a series of RGBA pixels, the memory would be organized as:
Red Plane: [R1, R2, R3, R4, ...]
Green Plane: [G1, G2, G3, G4, ...]
Blue Plane: [B1, B2, B3, B4, ...]
Alpha Plane: [A1, A2, A3, A4, ...]
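As a minimal sketch of the two layouts, the following converts interleaved samples to planes and back, then edits one plane in isolation. The function names `to_planar` and `to_packed` are illustrative, and plain Python lists stand in for the SiMD registers a real implementation would use.

```python
def to_planar(packed, channels=4):
    """Split interleaved samples [R1, G1, B1, A1, R2, ...] into one list per channel."""
    if len(packed) % channels:
        raise ValueError("sample count is not a multiple of the channel count")
    return [list(packed[c::channels]) for c in range(channels)]


def to_packed(planes):
    """Re-interleave per-channel planes into [R1, G1, B1, A1, R2, ...]."""
    return [sample for pixel in zip(*planes) for sample in pixel]


pixels = [10, 20, 30, 255,   11, 21, 31, 255,   12, 22, 32, 255]
r, g, b, a = to_planar(pixels)
b = [min(255, v + 100) for v in b]   # touch the blue plane only
result = to_packed([r, g, b, a])

print(r, g, b)   # [10, 11, 12] [20, 21, 22] [130, 131, 132]
print(result)    # [10, 20, 130, 255, 11, 21, 131, 255, 12, 22, 132, 255]
```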
Advantages of the Planar Approach
Maximized Parallelism: A wide SiMD engine can load a large, contiguous block from a single plane (e.g., 64 red values) and process them all in a single instruction..
This perfectly aligns with the hardware's capabilities, such as an Intel XVE running multiple threads concurrently.
Effective Precision: By processing each 8-bit or 16-bit plane separately, the results can be combined later to achieve full 32-bit or 64-bit precision.
This allows limited-bit-depth hardware to deliver a "gradated & skillful response" to high-precision color spaces.
Efficiency in Compression: This model is highly effective for tasks like video compression (codecs) and Display Stream Compression (DSC).
Single-channel operations, such as applying a blend or a filter to only the blue channel, become trivial and highly efficient without disturbing the other color data.
Trade-Offs
Packed/Merged Layout: A traditional packed layout (RGB, RGB, ...) can be more efficient for RAM access patterns when the computation requires all channels of a single pixel at once..
Data locality for a single pixel is higher.
Planar Layout: The planar approach may have some memory management overhead and can be less cache-friendly if an algorithm frequently needs to access all R, G, B, and A components of the same pixel.
Ultimately, as the text notes, the choice between a planar or packed ("block solve") approach depends entirely on the specific application and computational workload..
For tasks that are inherently parallel across a single channel, the planar model offers a significant performance advantage on modern hardware.
RS
*
By palette we don't have to mean, Small 256 Palette, Because a potent Palette is per channel,
So a combination of combined Paletted RGB+BW & a 256 x ( R, G, B, BW ) palette
Paletted is not ideal for DSC Display Compression I suppose, But 256 x ( R, G, B, BW ) palette may be!
RS
# Palette-Based Compression in DSC: Feasibility and Trade-Offs
---
## Why a Small 256-Entry Palette Isn’t Ideal for DSC
Using a tiny, per-block palette in a DSC stream runs into several hurdles:
- **Table Overhead Per Block**
The chunk allocator here feeds DSC in 16×16 (or 8×8) pixel blocks. Inserting a 256-entry RGBA palette for each block adds about 1 KB of table data, wiping out any payload savings.
- **Algorithm Mismatch**
DSC’s entropy and delta predictors expect raw pixel values. Introducing indexed lookups breaks the existing residual-coding pipeline, forcing a more complex, hybrid encoder/decoder.
- **Latency & Complexity**
Carrying palette tables through low-latency display paths (DP, HDMI) demands extra handshakes and metadata flags, risking frame drops or increased micro-stalls.
---
## The Per-Channel Palette Alternative
Instead of one big RGBA table, you could maintain four smaller tables, one each for R, G, B, and a BW/Alpha plane. Each table is only 256 bytes, but the total still suffers:
| Channel | Palette Entries | Table Size (bytes) | Index Bits per Pixel |
|---------------|-----------------|--------------------|----------------------|
| Red | 256 | 256 × 1 = 256 | 8 |
| Green | 256 | 256 | 8 |
| Blue | 256 | 256 | 8 |
| BW/Alpha | 256 | 256 | 8 |
| **Total** | — | **1 024** | **32** |
- Even split across channels, you still carry ~1 KB of table per 16×16 block.
- You’ve replaced 1 024 bytes of raw RGBA data (16×16×4 bytes) with 1 024 bytes of tables plus 1 024 bytes of indices (2 048 bytes total), clearly a net expansion.
---
## When a Palette Might Make Sense
1. **Global or Frame-Level Palettes**
Maintain a single palette for the entire frame or scene region rather than per block. Overhead amortizes over millions of pixels.
2. **Dynamic Colour-Index Mode**
Switch to an indexed-colour slice only when a scene region contains very few distinct hues (e.g., UI overlays or simple graphics).
3. **Palette as Side-Channel Metadata**
Send palette updates out-of-band (e.g., via ancillary pixel streams) so the main DSC pipeline remains untouched.
---
## Some Alternatives for DSC
- **Bit-Depth Adaptation**
Use 10-bit Y + 8-bit Cb/Cr in DSC’s native YUV modes..
You get finer luma precision where it matters without palette overhead.
- **Adaptive Block Predictors**
Leverage multiple prediction formulas per slice (flat, gradient, palette-inspired pre-clustering) within DSC’s existing framework.
- **Region-Based Coding**
For UI or text overlays, switch to simple RLE or LZ-based slices and fall back to full DSC for photographic content.
---
While per-block palettization sounds attractive for highly quantized scenes, without further work it may clash with DSC’s low-latency, high-throughput goals.
Instead, consider global or dynamic palette modes and lean on DSC’s built-in bit-depth and predictor flexibility for bandwidth-efficient, artifact-free streaming.
RS
*
# Evaluating a 256×(R, G, B, BW) Palette for DSC
---
## Why a Per-Channel, 256-Entry Table Looks Promising
By splitting your palette into four 256-entry tables (one for R, G, B, and a BW/alpha channel), you:
- Gain finer quantization control on each colour axis
- Can independently optimize the BW plane for transparency or interlacing flags
- Keep index-stream logic simple: 8 bits per plane
Compared to a single 256-entry RGBA palette, the table size is the same (1 024 bytes), but you trade a larger index stream (32 bits per pixel instead of 8) for per-component precision.
---
## Overhead Analysis
| Scope | Table Size | Pixels Covered | Bytes per Pixel (metadata) |
|------------------------|-------------------|----------------------|----------------------------|
| Per-Block (16×16) | 4 × 256 = 1 024 B | 256 pixels | 1 024 B / 256 ≈ 4 B |
| Per-Row (1 × 1 024 px) | 1 024 B | 1 024 pixels | 1 B |
| Per-Frame (4K UHD)     | 1 024 B           | ~8.3 M pixels        | ≈ 0.00012 B                |
- **Per-block** overhead (∼4 B/pixel) nullifies any compression gains.
- **Per-row** or **per-frame** palettes amortize table cost dramatically.
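The amortization in the table can be reproduced with a few lines. This is a sketch assuming one-byte palette entries and a 3840×2160 frame for 4K UHD; `table_bytes` and `metadata_per_pixel` are illustrative names.

```python
def table_bytes(channels=4, entries=256, entry_bytes=1):
    """Total size of the per-channel palette tables in bytes."""
    return channels * entries * entry_bytes


def metadata_per_pixel(pixels_covered, **table_kwargs):
    """Table bytes carried per pixel when one table set covers `pixels_covered`."""
    return table_bytes(**table_kwargs) / pixels_covered


UHD_PIXELS = 3840 * 2160  # 8 294 400 pixels

print(metadata_per_pixel(16 * 16))      # 4.0  (per 16x16 block)
print(metadata_per_pixel(1024))         # 1.0  (per 1 024 pixel row)
print(metadata_per_pixel(UHD_PIXELS))   # about 0.00012 (per frame)
print(metadata_per_pixel(UHD_PIXELS, channels=2))  # Cb and Cr tables only
```

The per-frame figures are tiny compared with the pixel payload itself, which is why scope matters far more than table layout.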
---
## A More Practical Hybrid
1. **Luma-Raw + Chroma-Paletted**
- Keep Y (luma) as 10–12 bit raw samples—no palette.
- Use two 256-entry tables for Cb and Cr only.
- Metadata: 2 × 256 = 512 B per frame → ≈ 0.00006 B/pixel on 4K.
2. **Dynamic Segment Palettes**
- Divide the frame into large macro-regions (e.g., UI vs. video).
- Assign each region its own per-channel tables.
- Only send tables when the region’s palette changes.
3. **Palette-As-Predictor**
- Integrate palette lookup into DSC’s delta predictors:
- Predict chroma from previous indexed value
- Encode only small residuals
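The palette-as-predictor idea can be sketched with plain previous-value prediction over the index stream. This is an illustrative DPCM-style stand-in, not DSC's actual predictor set; the function names and the modulo-256 wrap are assumptions.

```python
def encode_residuals(indices, modulo=256):
    """Replace each palette index with its difference from the previous index."""
    prev = 0
    out = []
    for idx in indices:
        out.append((idx - prev) % modulo)
        prev = idx
    return out


def decode_residuals(residuals, modulo=256):
    """Rebuild the index stream by accumulating the residuals."""
    prev = 0
    out = []
    for r in residuals:
        prev = (prev + r) % modulo
        out.append(prev)
    return out


cb_indices = [100, 101, 101, 103, 102]
residuals = encode_residuals(cb_indices)
print(residuals)                    # [100, 1, 0, 2, 255]
print(decode_residuals(residuals))  # [100, 101, 101, 103, 102]
```

Smooth chroma yields residuals clustered near 0 (with 255 standing for -1 under the wrap), which is the kind of distribution an entropy coder handles well.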
---
## Next Steps
- **Prototype & Measure**: Simulate luma-raw + chroma-palette streams in your DSC pipeline.
- **Perceptual Testing**: Run A/B tests on HDR/WCG content to find acceptable Cb/Cr quantization.
- **Adaptive Schemes**: Trigger palette mode only when the chroma variance falls below a threshold.
By offloading only chroma into 256-entry per-channel palettes and keeping luma untouched,
you preserve visual fidelity where it counts, slash metadata overhead, and slot neatly into DSC’s low-latency compressor.
Let’s experiment with these hybrids and see which gives you the sweetest bandwidth-quality balance!
RS
*
# Colour Table Interpolation: What It Is and How to Use It
---
## Definition of Colour Table Interpolation
Colour table (palette) interpolation refers to taking a discrete set of palette entries—each an RGBA quadruple—and computing intermediate colours by mathematically blending neighbouring entries when you scale or transform an image.
Instead of re-sampling raw RGB pixels, you:
- Map each pixel to a palette index
- Interpolate between the palette colours of neighbouring pixels based on fractional positions
- Produce smooth gradients or zoomed views while storing only indexed data
---
## How PNG Uses It (per W3C PNG-3 §4)
1. **Palette Image**
- Image data consists of 1–8 bit indices into a palette table of up to 256 RGBA entries.
2. **Scaling Modes**
- **Nearest-neighbour**: replicate the nearest palette entry—fast but blocky.
- **Bilinear**: blend the four nearest palette entries proportionally by distance—smooth gradients.
- **Bicubic**: higher-order blend for ultra-smooth scaling (less common in PNG implementations).
3. **Workflow**
- Read index stream
- For each target pixel, compute source coordinate → fractional pixel offsets
- Retrieve the palette entries of the neighbouring source pixels and apply weighted blend
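
The workflow above can be sketched in plain Python as follows. The function name, the corner-aligned coordinate mapping and the round-half-up step are assumptions made for the example; a real renderer may map coordinates differently.

```python
def bilinear_scale_indexed(indices, width, height, palette, out_w, out_h):
    """Upscale an indexed image; returns rows of blended (R, G, B, A) tuples."""

    def source_coord(o, out_size, src_size):
        # Map an output coordinate onto the source grid (corner-aligned).
        return o * (src_size - 1) / (out_size - 1) if out_size > 1 else 0.0

    rows = []
    for oy in range(out_h):
        sy = source_coord(oy, out_h, height)
        y0 = int(sy)
        y1 = min(y0 + 1, height - 1)
        dy = sy - y0
        row = []
        for ox in range(out_w):
            sx = source_coord(ox, out_w, width)
            x0 = int(sx)
            x1 = min(x0 + 1, width - 1)
            dx = sx - x0
            # Palette lookup first, then blend the four looked-up colours.
            c00 = palette[indices[y0 * width + x0]]
            c10 = palette[indices[y0 * width + x1]]
            c01 = palette[indices[y1 * width + x0]]
            c11 = palette[indices[y1 * width + x1]]
            row.append(tuple(
                int(a * (1 - dx) * (1 - dy) + b * dx * (1 - dy)
                    + c * (1 - dx) * dy + d * dx * dy + 0.5)
                for a, b, c, d in zip(c00, c10, c01, c11)
            ))
        rows.append(row)
    return rows


# A 2x1 black/white indexed image stretched to 3x1 gains a mid-grey pixel.
palette = [(0, 0, 0, 255), (255, 255, 255, 255)]
scaled = bilinear_scale_indexed([0, 1], 2, 1, palette, 3, 1)
```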
---
## Integrating with Your DSC Chunk Allocator
When you organise your screen into 8×8 or 16×16 blocks and stream them via DSC:
1. **Build or Update Palette per Block**
- Analyse each block’s RGBA distribution
- Generate a localized palette (≤256 entries) to minimise index bit-depth
2. **Planar Stream Layout**
- Separate planes:
- R-plane (8 bits)
- G-plane (8 bits)
- B-plane (8 bits)
- A-plane (8 bits)
- I-plane (interleaved properties, compression flags)
3. **SMT/SiMD Parallelisation**
- **CPU SMT**: assign two threads, each handling half of the block's scan-lines of indices and interpolating palette lookups
- **GPU SiMD**: pack four scan-line segments per warp/wavefront, use texture units for bilinear fetch of palette entries (filter the looked-up colours, not the raw indices)
4. **Interpolation Kernel**
- Precompute blend weights for each fractional offset
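- For each output pixel at source position `(x + δx, y + δy)`:
- Fetch the palette entries of the four neighbouring source pixels: `P[I[x, y]]`, `P[I[x+1, y]]`, `P[I[x, y+1]]`, `P[I[x+1, y+1]]`, where `I` is the index plane
- Compute (R0 to R3 are the red components of those four entries, in that order):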
```
R_out = R0*(1−δx)*(1−δy) + R1*δx*(1−δy) + R2*(1−δx)*δy + R3*δx*δy
```
- Repeat for G, B, A
5. **Compression and Write-Out**
- Store interpolated RGBA in planar buffers
- Apply your block-based DSC compressor
- Enqueue compressed blocks for CPU→GPU transfer
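
Steps 2 and 3 above can be sketched as a two-worker split of a block's scan-lines followed by planar packing. The helper names are hypothetical, and Python's thread pool here only illustrates how the work is divided; it does not show real SMT speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def lookup_rows(indices, palette, width, row_start, row_end):
    """Resolve palette indices to (R, G, B, A) tuples for a band of scan-lines."""
    return [palette[i] for i in indices[row_start * width:row_end * width]]


def block_to_planes(indices, palette, width, height, workers=2):
    """Split the block's scan-lines across workers, then emit R, G, B, A planes."""
    bounds = [height * k // workers for k in range(workers + 1)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        bands = pool.map(
            lambda k: lookup_rows(indices, palette, width, bounds[k], bounds[k + 1]),
            range(workers),
        )
        pixels = [px for band in bands for px in band]  # map() keeps band order
    r, g, b, a = (list(channel) for channel in zip(*pixels))
    return r, g, b, a


# 2x2 block, two palette entries: each worker resolves one scan-line.
palette = [(1, 2, 3, 4), (5, 6, 7, 8)]
planes = block_to_planes([0, 1, 1, 0], palette, width=2, height=2)
```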
---
## Example Block-Level Workflow
| Step | Task | Parallelism |
|-----------------------------|-----------------------------------------------------------|---------------|
| 1. Palette Analysis         | Cluster a 16×16 block's colours into a ≤256-entry palette | CPU×2 SMT     |
| 2. Index Mapping | Map each pixel to nearest palette index | GPU SIMD×4 |
| 3. Weight Computation | Compute δx, δy blend weights for each target resolution | GPU SIMD×4 |
| 4. Palette Interpolation | Blend 2×2 palette entries for each output pixel | GPU SIMD×4 |
| 5. Planar Buffer Packing | Write R/G/B/A/I planes into 5 separate streams | CPU SMT+GPU |
| 6. DSC Compression & Encode | Compress each plane per block and packetise for transfer | CPU×2 SMT |
---
## Benefits of This Approach
- Low memory footprint per block (palette + indices vs. full RGBA), provided the palette is much smaller than the block's pixel count.
- High compression ratios due to indexed streams.
- Smooth scaling with bilinear or higher-order filters.
- Easily parallelised across SMT threads and GPU SIMD lanes.
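
The footprint benefit depends on how large the palette is relative to the block. A rough sketch of the arithmetic, assuming 4 bytes per RGBA palette entry and tightly packed index bits (the function name is hypothetical): a 16×16 block has only 256 pixels, so a 16-entry palette saves space while a full 256-entry palette costs more than the raw block.

```python
def block_footprint(block_w, block_h, palette_entries, bytes_per_entry=4):
    """Return (raw_bytes, indexed_bytes) for one block."""
    pixels = block_w * block_h
    index_bits = max(1, (palette_entries - 1).bit_length())
    index_bytes = (pixels * index_bits + 7) // 8  # round up to whole bytes
    palette_bytes = palette_entries * bytes_per_entry
    raw_bytes = pixels * bytes_per_entry
    return raw_bytes, palette_bytes + index_bytes


small = block_footprint(16, 16, 16)    # 16-entry palette, 4-bit indices
full = block_footprint(16, 16, 256)    # 256-entry palette, 8-bit indices
```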
---
With colour table interpolation feeding your DSC pipeline, you get the smoothness of true-colour scaling with the bandwidth savings of indexed images—perfect for low-latency, high-resolution HDR streaming.
RS
*
# Colour Table Interpolation: Expanding Palette Strategies
---
## Definition of Colour Table Interpolation
Colour table interpolation blends discrete palette entries—each an RGBA or multi-channel tuple—to produce intermediate colours during scaling or transformation.
Rather than process full-precision pixels, you index into palettes and compute weighted blends, achieving smooth results with much less stored data.
---
## How PNG Uses It (per W3C PNG-3 §4)
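1. Palette image data carries 1-, 2-, 4- or 8-bit indices into a table of up to 256 entries (RGB, with optional per-entry alpha).
2. Scaling modes, which are a renderer choice rather than part of the PNG format, include nearest-neighbour (fast but blocky), bilinear (smooth 2×2 blend), and bicubic (higher-order smoothness).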
3. Workflow:
- Read the index stream
- For each target pixel, compute source coordinates → fractional offsets
- Fetch neighbouring palette entries and apply weighted blending
---
## Potent Palettes: Channel-Wise vs. Combined
Palettes need not be a single 256-entry RGBA table. You can instead:
- Use **per-channel palettes**: separate tables (e.g., up to 256 entries) for Red, Green, Blue, and a BW/Alpha channel.
- Use a **combined RGBA palette**: 256 entries where each entry holds R, G, B, BW values.
- Employ a **hybrid** mix: smaller per-channel palettes plus a tiny combined palette for cross-channel nuances.
| Palette Scheme         | Entries                             | Index Bits per Plane | Total Bits per Pixel |
|------------------------|-------------------------------------|----------------------|----------------------|
| Combined RGBA          | 256 × (R,G,B,BW)                    | 8                    | 8                    |
| Per-Channel            | 256 × R, 256 × G, 256 × B, 256 × BW | 8 each               | 32                   |
| Hybrid (e.g., 64 each) | 64 × R, G, B, BW                    | 6 each               | 24                   |
| Paletted RGB+BW        | 256 × (R,G,B,BW)                    | 8                    | 8                    |
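
The bit counts in the table follow from the index width needed to address each table, multiplied by the number of index planes. A small sketch with hypothetical helper names:

```python
def index_bits(entries):
    """Bits needed to address a table with this many entries."""
    return max(1, (entries - 1).bit_length())


def bits_per_pixel(entries_per_table, index_planes):
    """Total index bits per pixel across all index planes."""
    return index_bits(entries_per_table) * index_planes


schemes = {
    "Combined RGBA": bits_per_pixel(256, 1),     # one index selects R,G,B,BW
    "Per-Channel": bits_per_pixel(256, 4),       # one index per channel
    "Hybrid (64 each)": bits_per_pixel(64, 4),   # smaller tables, 6-bit indices
}
```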
---
## Integrating Palettes with Your DSC Chunk Allocator
When streaming 8×8 or 16×16 blocks via DSC:
1. Build per-block palettes
- For each colour plane—R, G, B, BW/Alpha—cluster the most frequent values into a small table (≤256 entries).
2. Planar stream layout
- R-plane indices, G-plane indices, B-plane indices, BW/Alpha-plane indices, plus an I-plane for interleaved properties.
3. SMT/SiMD parallelisation
- CPU SMT: two threads handle separate halves of a block’s index planes and palette updates.
- GPU SIMD: pack four scan-line segments per warp, leveraging texture units for bilinear palette fetches.
4. Interpolation kernel
- Precompute δx/δy blend weights
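- For each output pixel at source position (x + δx, y + δy), where R00, R10, R01, R11 are the red values of the palette entries indexed by the four neighbouring source pixels: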
```
R_out = R00·(1−δx)(1−δy) + R10·δx(1−δy) + R01·(1−δx)δy + R11·δx·δy
```
- Repeat for G, B, BW/Alpha
5. Compress and write out
- Store blended planes in planar buffers
- Apply your block-based DSC compressor
- Enqueue for CPU→GPU transfer
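
Step 1 above (a per-plane palette built from the most frequent values) might look like the sketch below. The function names are hypothetical, and keeping the top values by frequency is only one simple stand-in for clustering; mapping uses the nearest palette value.

```python
from collections import Counter


def build_channel_palette(plane, max_entries=256):
    """Keep the most frequent sample values of one plane, sorted ascending."""
    common = Counter(plane).most_common(max_entries)
    return sorted(value for value, _ in common)


def map_to_palette(plane, palette):
    """Replace each sample by the index of its nearest palette value."""
    return [
        min(range(len(palette)), key=lambda k: abs(palette[k] - value))
        for value in plane
    ]


red_plane = [10, 10, 10, 200, 200, 12]
red_palette = build_channel_palette(red_plane, max_entries=2)
red_indices = map_to_palette(red_plane, red_palette)  # 12 snaps to entry 10
```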
---
## Benefits of Channel-Wise and Hybrid Palettes
- Greater quantization control per colour channel.
- Potentially lower per-pixel index bits in hybrid schemes.
- Smooth scaling and colour fidelity with minimal data overhead.
- Easily parallelised across SMT threads and GPU SIMD lanes.
---
By treating each colour channel—or combining them thoughtfully—you can tailor palette size and precision to your block-allocator, maximizing compression and visual quality for low-latency HDR streaming.
RS
*****
Dual Blend & DSC low Latency Connection Proposal - texture compression formats available (c)RS
https://is.gd/TV_GPU25_6D4
Reference
https://is.gd/SVG_DualBlend https://is.gd/MediaSecurity https://is.gd/JIT_RDMA
https://is.gd/PackedBit https://is.gd/BayerDitherPackBitDOT
https://is.gd/QuantizedFRC https://is.gd/BlendModes https://is.gd/TPM_VM_Sec
https://is.gd/IntegerMathsML https://is.gd/ML_Opt https://is.gd/OPC_ML_Opt
https://is.gd/OPC_ML_QuBit https://is.gd/QuBit_GPU https://is.gd/NUMA_Thread
On the subject of how deep a personality of 4Bit, 8Bit, 16Bit is reference:
https://science.n-helix.com/2021/03/brain-bit-precision-int32-fp32-int16.html
https://science.n-helix.com/2022/10/ml.html
https://science.n-helix.com/2025/07/layertexture.html
https://youtu.be/3c-jU3Ynpkg