Half-precision floating point in Java

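• Keep the numbers in half-precision format and compute using integer arithmetic and bit twiddling (as MicroFloat does for single and double precision)
• Perform all computations in single or double precision and convert to and from half precision for transmission (in which case what I need are well-tested conversion functions)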

## Best answer

You can use `Float.intBitsToFloat()` and `Float.floatToIntBits()` to convert them to and from primitive `float` values. If you can live with truncated precision (as opposed to rounding), the conversion can be implemented with just a few bit shifts.
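As a rough illustration of the truncating variant mentioned above, the sketch below rebiases the exponent and drops the low 13 mantissa bits. The class and method names are hypothetical. It assumes finite inputs whose magnitude lies in the normal half-precision range (2^-14 up to 65504); zero, subnormals, overflow, Inf and NaN are deliberately not handled.

```java
// Hypothetical helper, not part of the original answer.
final class TruncatingHalf {

    // float -> half by truncation. Only valid for finite values whose
    // magnitude is within the normal half range [2^-14, 65504].
    static int floatToHalfTruncated(float f) {
        int bits = Float.floatToIntBits(f);
        int sign = (bits >>> 16) & 0x8000;            // sign: bit 31 -> bit 15
        // Rebias the exponent (127 -> 15) by subtracting 112 << 23,
        // then drop the 13 mantissa bits that do not fit.
        int expMant = ((bits & 0x7fffffff) - 0x38000000) >>> 13;
        return sign | expMant;
    }

    // half -> float under the same restriction (normal values only).
    static float halfToFloat(int h) {
        int sign = (h & 0x8000) << 16;                // sign: bit 15 -> bit 31
        int expMant = ((h & 0x7fff) << 13) + 0x38000000; // widen and rebias
        return Float.intBitsToFloat(sign | expMant);
    }
}
```

Because the discarded bits are simply cut off, a value just below a representable half value maps to the next smaller one rather than to the nearest.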

```
// ignores the higher 16 bits
public static float toFloat( int hbits )
{
    int mant = hbits & 0x03ff;            // 10 bits mantissa
    int exp =  hbits & 0x7c00;            // 5 bits exponent
    if( exp == 0x7c00 )                   // NaN/Inf
        exp = 0x3fc00;                    // -> NaN/Inf
    else if( exp != 0 )                   // normalized value
    {
        exp += 0x1c000;                   // exp - 15 + 127
        if( mant == 0 && exp > 0x1c400 )  // smooth transition
            return Float.intBitsToFloat( ( hbits & 0x8000 ) << 16
                                            | exp << 13 | 0x3ff );
    }
    else if( mant != 0 )                  // && exp==0 -> subnormal
    {
        exp = 0x1c400;                    // make it normal
        do {
            mant <<= 1;                   // mantissa * 2
            exp -= 0x400;                 // decrease exp by 1
        } while( ( mant & 0x400 ) == 0 ); // while not normal
        mant &= 0x3ff;                    // discard subnormal bit
    }                                     // else +/-0 -> +/-0
    return Float.intBitsToFloat(          // combine all parts
        ( hbits & 0x8000 ) << 16          // sign  << ( 31 - 15 )
        | ( exp | mant ) << 13 );         // value << ( 23 - 10 )
}
```
```
// returns all higher 16 bits as 0 for all results
public static int fromFloat( float fval )
{
    int fbits = Float.floatToIntBits( fval );
    int sign = fbits >>> 16 & 0x8000;          // sign only
    int val = ( fbits & 0x7fffffff ) + 0x1000; // rounded value

    if( val >= 0x47800000 )               // might be or become NaN/Inf
    {                                     // avoid Inf due to rounding
        if( ( fbits & 0x7fffffff ) >= 0x47800000 )
        {                                 // is or must become NaN/Inf
            if( val < 0x7f800000 )        // was value but too large
                return sign | 0x7c00;     // make it +/-Inf
            return sign | 0x7c00 |        // remains +/-Inf or NaN
                ( fbits & 0x007fffff ) >>> 13; // keep NaN (and Inf) bits
        }
        return sign | 0x7bff;             // unrounded not quite Inf
    }
    if( val >= 0x38800000 )               // remains normalized value
        return sign | val - 0x38000000 >>> 13; // exp - 127 + 15
    if( val < 0x33000000 )                // too small for subnormal
        return sign;                      // becomes +/-0
    val = ( fbits & 0x7fffffff ) >>> 23;  // tmp exp for subnormal calc
    return sign | ( ( fbits & 0x7fffff | 0x800000 ) // add subnormal bit
        + ( 0x800000 >>> val - 102 )     // round depending on cut off
        >>> 126 - val );   // div by 2^(1-(exp-127+15)) and >> 13 | exp=0
}
```
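When testing bit-level code like this, it can help to have a slow but obvious decoder to compare against. The sketch below (hypothetical class and method names) decodes a half value directly from the IEEE 754 binary16 definition using `Math.scalb`. Note that it will not match `toFloat()` bit for bit on normal values with a zero mantissa, because the "smooth transition" lines there deliberately set the low 10 bits of the result (for example, `toFloat(0x3c00)` yields the bit pattern `0x3f8003ff` rather than exactly 1.0f).

```java
// Hypothetical reference decoder for cross-checking; not from the answer.
final class HalfReference {

    // Decodes the low 16 bits of hbits as an IEEE 754 binary16 value.
    static float decode(int hbits) {
        int sign = (hbits >>> 15) & 1;
        int exp  = (hbits >>> 10) & 0x1f;   // 5-bit biased exponent
        int mant = hbits & 0x3ff;           // 10-bit mantissa
        float magnitude;
        if (exp == 0x1f) {
            // All exponent bits set: Inf if mantissa is zero, else NaN.
            magnitude = (mant == 0) ? Float.POSITIVE_INFINITY : Float.NaN;
        } else if (exp == 0) {
            // Zero or subnormal: mant * 2^-24, no implicit leading 1.
            magnitude = Math.scalb((float) mant, -24);
        } else {
            // Normal: (1024 + mant) * 2^(exp - 15 - 10).
            magnitude = Math.scalb((float) (mant | 0x400), exp - 25);
        }
        return sign == 0 ? magnitude : -magnitude;
    }
}
```

Since every half value is exactly representable as a `float`, the arithmetic here is exact; the decoder is just far slower than the shift-based version.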

The conversion functions include two extensions. The first is these two lines in the `toFloat()` function:

```
if( mant == 0 && exp > 0x1c400 )  // smooth transition
    return Float.intBitsToFloat( ( hbits & 0x8000 ) << 16 | exp << 13 | 0x3ff );
```

```
covered number space on either side of the returned value:
       6.0E-8             #######                  ##########
       4.5E-8             |                       #
       3.0E-8     #########               ########
```

The second extension is in the `fromFloat()` function:

```
    {                                     // avoid Inf due to rounding
        if( ( fbits & 0x7fffffff ) >= 0x47800000 )
        ...
        return sign | 0x7bff;             // unrounded not quite Inf
    }
```
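To see which inputs this second extension affects, the sketch below (hypothetical names) replays the two comparisons from the excerpt. A value qualifies when its unrounded magnitude is still below `0x47800000` (65536.0f) but adding the rounding constant `0x1000` reaches that threshold. According to the excerpt, such values return `sign | 0x7bff`, the largest finite half value (65504), instead of becoming Inf.

```java
// Hypothetical demo of the boundary handled by the second extension.
final class RoundingOverflowDemo {

    // True when f would turn into Inf only because of the rounding step:
    // the raw magnitude is below 2^16, the rounded one is not.
    static boolean overflowsOnlyByRounding(float f) {
        int abs = Float.floatToIntBits(f) & 0x7fffffff; // strip the sign
        return abs < 0x47800000 && abs + 0x1000 >= 0x47800000;
    }
}
```

Under these assumptions the affected magnitudes run from 65520.0f up to, but not including, 65536.0f; anything at or above 65536.0f still becomes Inf.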

I tried to optimize the path for normal values in the `fromFloat()` function as much as possible, which made it a bit less readable due to the use of precomputed and unshifted constants. I didn't put as much effort into `toFloat()` since it would not exceed the performance of a lookup table anyway. So if speed really matters, you could use the `toFloat()` function only to fill a static lookup table with 0x10000 elements and then use this table for the actual conversion. This is about 3 times faster with a current x64 server VM and about 5 times faster with the x86 client VM.
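A minimal sketch of that lookup-table idea follows. The `HalfLookup` class and its `HalfDecoder` callback are hypothetical names introduced here so the example stands alone; in practice you would pass the `toFloat()` function from above as the decoder and keep the resulting instance in a `static final` field.

```java
// Hypothetical wrapper; illustrates the table idea, not the author's code.
final class HalfLookup {

    /** Any half-to-float routine, for example the toFloat() shown earlier. */
    interface HalfDecoder {
        float decode(int hbits);
    }

    // One float per possible 16-bit pattern: 0x10000 entries, 256 KiB.
    private final float[] table = new float[0x10000];

    HalfLookup(HalfDecoder decoder) {
        for (int h = 0; h < table.length; h++) {
            table[h] = decoder.decode(h);   // slow path runs once per pattern
        }
    }

    // Table-based conversion; the mask ignores the higher 16 bits.
    float toFloat(int hbits) {
        return table[hbits & 0xffff];
    }
}
```

Assuming the conversion functions live in a class called `HalfFloat` (an illustrative name), the table would be built once with `new HalfLookup(HalfFloat::toFloat)`. Whether the table beats the computed version on a given VM is something to measure rather than assume.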
