Wednesday, January 17, 2018

integer floats with remainder theory

integer floats with remainder theory - copyright RS

The relevance of integer floats is that we can do 2 things:

Float on Integer instruction sets at half resolution:
32Bit Int > 16Bit.16Bit
> 24Bit.8Bit
> 28Bit.4Bit

Remainder theorem is the capacity to convert back and forth with data.

RAM/Memory and hard drive storage are major components & we also need to consider compression & color formats like DOT5

Machine learning & Server improvement: Files attached please utilise.

Integer_Float Remainder op Form 1:(c)RS

The float is formed of 2 integers...

One being the integer and the remainder being the floating component....

Thus we need two integers per float; for example, two 32Bit integers will make one single float instruction....

integer A : Remainder B

A + B = float
(A + B) x (A²+B²)

= float C dislocating A and B by a certain number of places = a float that travels as the integer.
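A minimal sketch of the integer : remainder pairing described above, assuming the remainder B holds a fixed number of decimal places. The names `to_pair`, `from_pair` and `PLACES` are hypothetical and only there for illustration; digits beyond `PLACES` are simply truncated in this sketch.

```python
from decimal import Decimal

PLACES = 2  # assumption: the remainder holds 2 decimal places


def to_pair(value, places=PLACES):
    """Split a decimal string into (integer A, remainder B)."""
    # Shift the decimal point right by `places`, then work in pure integers.
    scaled = int(Decimal(value).scaleb(places))
    return divmod(scaled, 10 ** places)


def from_pair(a, b, places=PLACES):
    """Rebuild the float value from integer A and remainder B."""
    # Shift the remainder back right by `places` and add it to the integer.
    return Decimal(a) + Decimal(b).scaleb(-places)


pair = to_pair("15.05")   # integer 15, remainder 5
value = from_pair(*pair)  # back to 15.05
```

Both halves are ordinary integers, so they can sit in integer registers and be converted back only when a float is needed.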

Expansion data sets:

A1 : B1
A2 : B2
Ar : Br

F1 : Bf1
F2 : Bf2
Fr : Bfr

A : Integer
F : Float
r : Remainder

The data set expansion can be infinite and the expansion of the data set doubles the precision,
With the remainder... infinite computation = infinite precision.

Not only that but the computation can be executed as an Integer or as a float or indeed with both.
Relevance is that on computers there are a lot of integer registers; Float also..
Also the data can be compressed in RAM without using larger buffer widths.

copyright Rupert Summerskill

COP-Roll : (c)Rupert S

ROLL Operation Syntax : RS :(Integer & Float)

Processing Cache (displacement) Operation Roll Arithmetic Maths : For
Multiplication, Division, Addition & Subtraction : P-COR-SAM

Addressable by Compiler update, Firmware update, CPU/GPU Rules
Firmware, Bios, Operating System & Program/Machine Learning.

Machine Learning will considerably enhance Cache & Processor routine
operations efficiency & make rules for all developers & firmware

AI Machine Learning Optimization :

In a single loop a multiply at a floating point precision of under 1, for example 0.00001, requires that:

In Integer float :

Multiply of a sum such as 15.05 * 3 is 2 operations:
(15 x 3) + ((roll 0.05 left 2 places) x 3) = 45 + R, where R = (5 x 3) = 15; rolling R back right 2 places gives 45.15

In other words : 2 storage values R remainder (the float component) & the number,
However multiplication of a float such as 0.01 is a division in one example & a multiply roll in another,
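The 15.05 * 3 example can be sketched as two integer multiplies plus a carry from the remainder into the integer. This is only an illustration under the assumption of a 2 place decimal remainder; `mul_pair` is a hypothetical helper name.

```python
def mul_pair(a, b, k, places=2):
    """Multiply the value (a + b / 10**places) by an integer k.

    a is the integer part and b the remainder rolled left by `places`.
    """
    base = 10 ** places
    # The remainder product may overflow its places; carry that into a.
    carry, rem = divmod(b * k, base)
    return a * k + carry, rem


result = mul_pair(15, 5, 3)     # 15.05 * 3 -> integer 45, remainder 15
rolled = mul_pair(15, 50, 3)    # 15.50 * 3 -> integer 46, remainder 50
```

The second call shows why the carry matters: 50 x 3 = 150 no longer fits in 2 places, so 1 moves over to the integer.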

Roll is a memory operation in CPU terms & is a single processor loop push

In all operations where division is banned we have to decide whether the operation is multiples or division of base value 10 or 1,10,100>,

Such an operation can be carried out by addition or subtraction or roll, Values such as 200* ,
Require multiple additions under the multiply is banned principle.

Multiple sets of memory arrays in a series parallel is the equivalent of multiplication through addition,

Subtraction through addition requires inverting the power phase of a single component array.

Thus we are able to do all sums with addition and subtraction; traditional math solves have done this before,
Roll operations are our fast way to multiply;

However arrays of addition & subtraction are a (logical fast loop)..
Full Operation in a single cycle, Because there is no sideways roll.
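As a rough sketch of multiply through roll and addition, and of subtraction through addition: the code below assumes a 32Bit register and reads "inverting" as taking the two's complement, which is an interpretation and not something the text states. The names `mul_shift_add`, `sub_via_add` and `MASK` are hypothetical.

```python
MASK = 0xFFFFFFFF  # assumption: 32-bit register width


def mul_shift_add(x, y):
    """Multiply two non-negative integers using only roll (shift) and add."""
    result = 0
    while y:
        if y & 1:        # lowest bit of y set: add the current rolled x
            result += x
        x <<= 1          # roll x left one place (x * 2)
        y >>= 1          # roll y right one place
    return result


def sub_via_add(x, y):
    """Compute x - y modulo 2**32 by adding the inverted y plus 1."""
    inverted = (~y & MASK) + 1   # two's complement of y
    return (x + inverted) & MASK


product = mul_shift_add(200, 7)   # 200 + 400 + 800
difference = sub_via_add(10, 3)
```

Here 200 * 7 needs three additions and three rolls instead of a multiply instruction.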

However direct memory displacement between 010100 & 101000 can use a cache to displace a 1,
An arrangement such as a 4 digit displacement cache can roll the operation on memory transfer.

Displace on operation (cycle 2) does minimize operations.

Having that cache further up the binary pipeline does reduce the number of roll cache modifier buffers that we need,

However the time we save & the time we lose & the CPU space we lose or gain.. depends specifically on how limited the Roll Cache is.

Integer_Float Remainder op Form 2:(c)RS

32Bit (2x16Bit) is the most logical for 32Bit registers
64Bit (2x32Bit) is the most logical for 64Bit registers

Byte Swap operation
Byte Inversion operation

For example DWord: 8

2 x DWord: 8 Bit Integer & 8 Bit 4 roll places & 4 Bit Value.
Displacing the value 4 bits in 8 makes the value an integer,
Alternatively Adaptive maths adds 0 as for example multiplication & removes it afterwards..
The usage of adaptation takes the second DWord & effectively makes it an accurate remainder.

In that example I believe one less operation is needed in the 16Bit example,

Operation example 2 uses an embedded multiply x 10 & divide after (to get resulting float)

32Bit memory space: 2x 16Bit Value, 1 Integer 16Bit & 1 0. value,
That can effectively be displaced 16 binary places
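One way to picture the 2 x 16Bit layout is a 32Bit word with the integer in the high half and the 0. value in the low half. The sketch below assumes the fraction is a binary fraction out of 65536 and that values are unsigned; `pack`, `unpack`, `to_float` and `FRAC_BITS` are hypothetical names.

```python
FRAC_BITS = 16  # assumption: low 16 bits hold the fraction, high 16 the integer


def pack(integer, fraction):
    """Store a 16-bit integer and a 16-bit fraction in one 32-bit word."""
    return ((integer & 0xFFFF) << FRAC_BITS) | (fraction & 0xFFFF)


def unpack(word):
    """Split a 32-bit word back into (integer, fraction)."""
    return word >> FRAC_BITS, word & 0xFFFF


def to_float(word):
    """Displace the whole word 16 binary places to read it as a float."""
    return word / (1 << FRAC_BITS)


word = pack(15, 0x8000)   # integer 15, fraction 0x8000 / 0x10000 = 0.5
```

Because the word is a plain integer, addition and subtraction of two packed words work directly with the normal integer ADD.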

The maths displayed above requires inverting Multiply & Division,
For Mul & Div Ops on the remainder; However it does not when the result is finally used:
In the FLOAT Unit (FPU) for large precision maths

This allows fully Integer CPU to do Float maths and store them as integer..
Both allowing fully the use of all registers & also storage as purely Integer_Float,
It also allows Full cache usage for SiMD,AVX & Vector Units.

Byte Inversion simply allows Byte Swap & Inversion to fully realise performance improvements..
& Also Byte Inversion maths.

SiMD,AVX,Vector : ByteSwap,Invert,Mul,Div etcetera Ergo Float compatible & Acceleration
Float : High Precision finalisation .. Lower Frequency = More potential
Integer + Byte Functions : Pure Acceleration with minimal loss Core Function utilisation

This is all algebra; Categorically.

(c) Rupert S

Optimisation & Use:


Multi-line Packed-Bit Int SiMD Maths : Relevance HDR, WCG, ML Machine Learning (Most advantaged ADDER Maths)

The rules of multiple Maths with lower Bit widths into SiMD 256Bit (example) 64Bit & 128Bit & 512Bit can be used

In all methods you use packed bits per save, so single line save or load, Parallel, No ram thrashing.

You cannot flow a 16Bit block into another segment (the next 16Bit block)

You can however use the 9th bit as a separator & rolling an addition into that next bit means a more accurate result!
In 32Bit you do 3 * 8Bit & 1 * 4Bit; in this example the 4Bit op has 5Bit results & the 8Bit ops have 9Bit results..
This is preferable!
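The 3 * 8Bit & 1 * 4Bit layout can be sketched with one spare separator bit above each value, so a single 32Bit ADD adds all four lanes and each carry lands in its own separator. The lane order and offsets in `LANES` are an assumption, as are the helper names.

```python
# Hypothetical layout, low to high: (value bits, bit offset).
# Each lane keeps 1 spare separator bit above it: 5 + 9 + 9 + 9 = 32 bits.
LANES = [(4, 0), (8, 5), (8, 14), (8, 23)]


def pack_lanes(values):
    """Place one 4-bit and three 8-bit values into a 32-bit word."""
    word = 0
    for (bits, offset), v in zip(LANES, values):
        if not 0 <= v < (1 << bits):
            raise ValueError("value does not fit its lane")
        word |= v << offset
    return word


def unpack_lanes(word):
    """Read each lane back including its separator bit (5 or 9 bits)."""
    return [(word >> offset) & ((1 << (bits + 1)) - 1) for bits, offset in LANES]


a = pack_lanes([15, 255, 200, 1])
b = pack_lanes([15, 255, 100, 2])
total = (a + b) & 0xFFFFFFFF   # one 32-bit ADD for all four lanes
sums = unpack_lanes(total)     # [30, 510, 300, 3]
```

Even the worst case 255 + 255 = 510 fits in 9 bits, so no lane flows into its neighbour.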

2Bit, 3Bit, 4Bit Operation 1 , 8Bit Operations 3: Table

4 : 1, 8 : 3

4 : 2, 8 : 6
2 : 1, 7 : 8
3 : 1, 8 : 1, 16 : 3

Addition is the only place where 16Bit * 4 = 64Bit works easily, but when you ADD or - you can only roll to the lowest boundary of each 16Bit segment & not into the higher or lower segment.
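For 4 * 16Bit lanes with no separator bits, a common software trick keeps carries inside each segment by adding the low 15 bits normally and handling the top bit of each lane with XOR. This is a generic packed-add sketch, not necessarily the method intended above; `add_4x16`, `HIGH` and `LOW` are hypothetical names, and each lane wraps around on overflow.

```python
HIGH = 0x8000_8000_8000_8000  # top bit of each 16-bit lane
LOW = 0x7FFF_7FFF_7FFF_7FFF   # lower 15 bits of each lane


def add_4x16(a, b):
    """Add four 16-bit lanes held in 64-bit words; no carry crosses a lane."""
    # Adding only the low 15 bits means a carry can reach bit 15 at most.
    low_sum = (a & LOW) + (b & LOW)
    # Add the top bits without carry-out by XORing them in.
    return low_sum ^ ((a ^ b) & HIGH)


a = 0xFFFF_0001_8000_7FFF
b = 0x0001_0001_8000_0001
total = add_4x16(a, b)   # lanes: 0000, 0002, 0000, 8000
```

The first and third lanes overflow and wrap to 0000 without disturbing the lanes next to them.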

A: In order to multiply you need adaptable rules for division & multiply
B: You need a dividable Maths unit with AND, OR & NOT gates to segment the registered Mul SiMD Unit..

In the case of + * you need to use single line rule addition (no overflow per pixel)..
& Either many AND-OR / NOT gate layers or Parallel 16Bit blocks..

You can however, painful as it is, Multi Load & Zero the remainder registers & AND, OR, XOR or NOT the remainder 00000 on higher depth instructions & so remain pure!

8Bit blocks are a bit small and we use HDR & WCG, So mostly pointless!

We can however 8Bit Write a patch of palette & sub divide our colour palette & Light Shadow Curves in anything over 8Bit depth colour,

In the case of the Intel 8Bit * 8 Inferencing unit : 16Bit Colour is probably in (WCG 8 * 8) + (HDR 8 * 8) Segments,

In any case Addition is fortunately what we need! so with ADD we can use SiMD & Integer Today.

Rupert S


Main Operation solves: Bit-Depth Conversions & Operations

The storage of multiple bit operations with Sync Read & Write,
The purpose of this is to Read, Write & Store Operations on:

F16, F32, F64

In RAM of 32Bit, 64Bit, 128Bit

Values Storage Table

32Bit = [16bit:16Bit]
32Bit = [8bit:8Bit:8bit:8Bit]
32Bit = [4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit]

64Bit = [32bit:32Bit]
64Bit = [16bit:16Bit:16bit:16Bit]
64Bit = [8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit]
64Bit = [4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit]

128Bit = [64bit:64Bit]
128Bit = [32bit:32Bit:32bit:32Bit]
128Bit = [16bit:16Bit:16bit:16Bit:16bit:16Bit:16bit:16Bit]
128Bit = [8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit:8bit:8Bit]
128Bit = [4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit:4bit:4Bit]
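Any row of the Values Storage Table can be sketched with one pair of helpers that pack equal width fields into a single word. The field order (first value in the lowest bits) and the names `pack_fields` and `unpack_fields` are assumptions for illustration.

```python
def pack_fields(values, field_bits):
    """Pack equal-width unsigned fields into one word, first value lowest."""
    mask = (1 << field_bits) - 1
    word = 0
    for i, v in enumerate(values):
        word |= (v & mask) << (i * field_bits)
    return word


def unpack_fields(word, field_bits, count):
    """Read `count` fields of `field_bits` bits back out of a word."""
    mask = (1 << field_bits) - 1
    return [(word >> (i * field_bits)) & mask for i in range(count)]


w32 = pack_fields([0x12, 0x34, 0x56, 0x78], 8)   # 32Bit = [8bit:8Bit:8bit:8Bit]
w32_nibbles = pack_fields(range(8), 4)           # 32Bit = 8 x 4Bit
w64 = pack_fields([1, 2, 3, 4], 16)              # 64Bit = 4 x 16Bit
```

The same two functions cover the 128Bit rows as well, since Python integers are not limited to 64 bits.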

Bear in mind that Integer 64Bit is 2 x 32Bit on AMD; So you can compute 2 operations at 32Bit per 64Bit operation,
Some 64Bit units are only 64Bit; So we need to know how many!

32Bit operations are fine! & Conversion of 16Bit value ranges into 32Bit Operations can still be within range of 16Bit Storage..
If we stick within the 16Bit value range on Multiply & ADD,
We can therefore simply post a 16Bit value range data set & expect to be able to Store 16Bit!
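A small sketch of the range rule: do the Multiply & ADD in a wider intermediate and check whether the result still fits 16Bit storage. Signed 16Bit values are assumed here, and `mul_add_16` is a hypothetical helper name.

```python
INT16_MIN, INT16_MAX = -32768, 32767  # assumption: signed 16-bit storage


def mul_add_16(a, b, c):
    """Compute a * b + c in a wide intermediate and report if it fits 16 bits."""
    wide = a * b + c                       # fits easily in a 32-bit operation
    fits = INT16_MIN <= wide <= INT16_MAX  # safe to store back as 16-bit?
    return wide, fits


ok = mul_add_16(100, 200, 50)    # 20050 stays inside the 16-bit range
too_big = mul_add_16(300, 200, 0)  # 60000 does not
```

When the flag is false the data set would need rescaling or a wider store before posting it as 16Bit.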

The simple method is to store 2 16Bit values in the same 32Bit table; like [16bit:16Bit] = 32Bit

With this we can Load, Store, Run & Save 8bit INT8 operations in 32Bit devices such as Alexa as 8bit x 4 = 32Bit, So we don't Waste RAM or resources!

But we still have access to 32Bit RAM Paging; But with values loaded in 4Bit, 8Bit, 16Bit, 32Bit & so on.

With NANO Android on F16 & F32 & MIPS the same & AMD, Intel, NVidia,
Learning F16 offers considerable value for performance with 16M Values!


Direct DMA 32Bit & 64Bit RAM : Multiple Sync 16Bit Texture:

A good example of where 8Bit & 16Bit Value load works well is in the case of the texture,
To load 4 x 16Bit into a single 64Bit Cache:

32Bit RAM = 16Bit, 16Bit
64Bit RAM = 16Bit, 16Bit, 16Bit, 16Bit
128Bit RAM = 16Bit, 16Bit, 16Bit, 16Bit, 16Bit, 16Bit, 16Bit, 16Bit

In the case of direct DMA, you would be aware that you have,
128Bit, 192Bit Bus on GPU
32Bit & 64Bit on CPU

So a direct 4 * 32Bit or 2 * 64Bit Cache load is a logically fast method to DMA directly from Cache to GPU!
In short you convert 8 x 16Bit into a 2x 64Bit DMA push; Which is very fast!
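The 8 x 16Bit into 2 x 64Bit conversion can be sketched with Python's `struct` module, which reinterprets the same 16 bytes at a different width. Little-endian byte order is an assumption, and the function names are hypothetical; a real DMA push would be done by the driver or hardware, not by this code.

```python
import struct


def texels_to_words(texels):
    """Reinterpret eight 16-bit values as two 64-bit words (little-endian)."""
    raw = struct.pack("<8H", *texels)   # 8 x 16 bits = 16 bytes
    return struct.unpack("<2Q", raw)    # the same 16 bytes as 2 x 64 bits


def words_to_texels(words):
    """Reverse step: two 64-bit words back into eight 16-bit values."""
    raw = struct.pack("<2Q", *words)
    return list(struct.unpack("<8H", raw))


texels = [1, 2, 3, 4, 5, 6, 7, 8]
words = texels_to_words(texels)   # two 64-bit values ready for one wide transfer
```

No arithmetic is done on the data; only the width of each load or store changes.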

You can do the same with batches of vertices in many storage sizes.



On the subject of how deep a personality of 4Bit, 8Bit, 16Bit is reference:


HDR Pure flow (c)RS
Data is converted from the OS to the GPU as compressed & optimized memory data, then into dithered & optimized, smooth, precise rendering on every compatible monitor and other device..
The reason we do this is flow control and optimization of the final output of the devices; also the main chunk of data the OS uses is transparently the best,
In 5D, 4D, 3D & 2D data, and can thusly be pre compressed, cache optimized & rendered.
