Half-precision floating point in Java

Is there a Java library anywhere that can perform computations on IEEE 754 half-precision numbers, or convert them to and from double precision?

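Either of these approaches would be suitable:

  • Keep the numbers in half-precision format and compute using integer arithmetic and bit twiddling (as MicroFloat does for single and double precision)
  • Perform all computations in single or double precision, converting to and from half precision for transmission (in which case what I need is well-tested conversion functions)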

Edit: the conversion needs to be 100% accurate. The input files contain lots of NaNs, infinities and subnormals.

Related question, but for JavaScript only: Decompressing Half Precision Floats in Javascript

Best answer

You can use Float.intBitsToFloat() and Float.floatToIntBits() to convert them to and from primitive float values. If you can live with truncated precision (as opposed to rounding), the conversion should be possible to implement with just a few bit shifts.
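
As a rough illustration of the "few bit shifts" idea, here is a minimal sketch of a truncating conversion. It is a deliberate simplification and not the tested version given further down: it assumes the value's magnitude lies in the normal half-precision range (from 2^-14 up to, but not including, 65536), and the class and method names are made up for this example.

```java
// Sketch only: truncating conversion for values in the NORMAL half range.
// Zeros, subnormals, NaN, Inf and out-of-range values are NOT handled here.
final class TruncatingHalf {

    // float -> half bits, dropping the low 13 mantissa bits (no rounding)
    static int fromFloatTruncating(float f) {
        int fbits = Float.floatToIntBits(f);
        int sign = (fbits >>> 16) & 0x8000;        // move sign to bit 15
        int abs = fbits & 0x7fffffff;              // exponent and mantissa
        // rebias the exponent by 127 - 15 = 112 (112 << 23 == 0x38000000),
        // then shift the 23-bit mantissa down to 10 bits
        return sign | ((abs - 0x38000000) >>> 13);
    }

    // half bits -> float, exact for normal half values
    static float toFloatNormal(int hbits) {
        int sign = (hbits & 0x8000) << 16;         // move sign to bit 31
        // rebias the exponent by 112 (112 << 10 == 0x1c000), then widen
        int abs = ((hbits & 0x7fff) + 0x1c000) << 13;
        return Float.intBitsToFloat(sign | abs);
    }

    public static void main(String[] args) {
        int h = fromFloatTruncating(1.5f);
        System.out.printf("0x%04x -> %s%n", h, toFloatNormal(h));
    }
}
```

Because the low bits are simply cut off, results are biased toward zero, which is why the full version below adds rounding and handles the special cases separately.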

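I have now put a bit more effort into it, and it turned out to be not quite as simple as I expected at the beginning. This version has been tested and verified in every aspect I could think of, and I'm very confident that it produces exact results for all possible input values. It supports exact rounding and subnormal conversion in either direction.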

// ignores the higher 16 bits
public static float toFloat( int hbits )
{
    int mant = hbits & 0x03ff;            // 10 bits mantissa
    int exp =  hbits & 0x7c00;            // 5 bits exponent
    if( exp == 0x7c00 )                   // NaN/Inf
        exp = 0x3fc00;                    // -> NaN/Inf
    else if( exp != 0 )                   // normalized value
    {
        exp += 0x1c000;                   // exp - 15 + 127
        if( mant == 0 && exp > 0x1c400 )  // smooth transition
            return Float.intBitsToFloat( ( hbits & 0x8000 ) << 16
                                            | exp << 13 | 0x3ff );
    }
    else if( mant != 0 )                  // && exp==0 -> subnormal
    {
        exp = 0x1c400;                    // make it normal
        do {
            mant <<= 1;                   // mantissa * 2
            exp -= 0x400;                 // decrease exp by 1
        } while( ( mant & 0x400 ) == 0 ); // while not normal
        mant &= 0x3ff;                    // discard subnormal bit
    }                                     // else +/-0 -> +/-0
    return Float.intBitsToFloat(          // combine all parts
        ( hbits & 0x8000 ) << 16          // sign  << ( 31 - 15 )
        | ( exp | mant ) << 13 );         // value << ( 23 - 10 )
}
// returns all higher 16 bits as 0 for all results
public static int fromFloat( float fval )
{
    int fbits = Float.floatToIntBits( fval );
    int sign = fbits >>> 16 & 0x8000;          // sign only
    int val = ( fbits & 0x7fffffff ) + 0x1000; // rounded value

    if( val >= 0x47800000 )               // might be or become NaN/Inf
    {                                     // avoid Inf due to rounding
        if( ( fbits & 0x7fffffff ) >= 0x47800000 )
        {                                 // is or must become NaN/Inf
            if( val < 0x7f800000 )        // was value but too large
                return sign | 0x7c00;     // make it +/-Inf
            return sign | 0x7c00 |        // remains +/-Inf or NaN
                ( fbits & 0x007fffff ) >>> 13; // keep NaN (and Inf) bits
        }
        return sign | 0x7bff;             // unrounded not quite Inf
    }
    if( val >= 0x38800000 )               // remains normalized value
        return sign | val - 0x38000000 >>> 13; // exp - 127 + 15
    if( val < 0x33000000 )                // too small for subnormal
        return sign;                      // becomes +/-0
    val = ( fbits & 0x7fffffff ) >>> 23;  // tmp exp for subnormal calc
    return sign | ( ( fbits & 0x7fffff | 0x800000 ) // add subnormal bit
         + ( 0x800000 >>> val - 102 )     // round depending on cut off
      >>> 126 - val );   // div by 2^(1-(exp-127+15)) and >> 13 | exp=0
}

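Compared to the book, I implemented two small extensions. The general precision of 16-bit floats is rather low, which can make the inherent anomalies of floating point formats visually perceptible, whereas with larger floating point types they usually go unnoticed thanks to their ample precision.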

The first one is these two lines in the toFloat() function:

if( mant == 0 && exp > 0x1c400 )  // smooth transition
    return Float.intBitsToFloat( ( hbits & 0x8000 ) << 16 | exp << 13 | 0x3ff );

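Floating point numbers in the normal range of the type adapt the exponent, and thus the precision, to the magnitude of the value. But this is not a smooth adaptation; it happens in steps: switching to the next higher exponent halves the precision. The precision then stays the same for all mantissa values until the next jump to the next higher exponent. The extension code above makes these transitions smoother by returning a value that lies in the geometric center of the 32-bit float range covered by this particular half-float value. Every normal half-float value maps to exactly 8192 32-bit float values, and the returned value is supposed to be exactly in the middle of them. But at a transition of the half-float exponent, the lower 4096 values have twice the precision of the upper 4096 values and thus cover a number space only half as large as that on the other side. All 8192 of these 32-bit float values map to the same half-float value, so converting a half float to 32 bits and back yields the same half-float value no matter which of the 8192 intermediate 32-bit values was chosen. The extension results in something like a smoother half step by a factor of sqrt(2) at the transition, as shown in the right picture below, while the left picture is meant to visualize the sharp step by a factor of two without anti-aliasing. You can safely remove these two lines from the code to get the standard behavior.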

covered number space on either side of the returned value:
       6.0E-8             #######                  ##########
       4.5E-8             |                       #
       3.0E-8     #########               ########
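
For comparison, the standard behavior mentioned above can be sketched by dropping the extension from toFloat(). The class name HalfStandard and the method name toFloatStandard are invented for this sketch; the body is otherwise the toFloat() code from this answer. One consequence worth noting: with the extension in place, a half value with a zero mantissa such as 1.0 (0x3c00) converts to float bits 0x3f8003ff, slightly above 1.0f, whereas the variant below returns exactly 1.0f.

```java
// Sketch: toFloat() from this answer with the "smooth transition"
// extension removed. Names are hypothetical.
final class HalfStandard {

    // ignores the higher 16 bits of hbits
    static float toFloatStandard(int hbits) {
        int mant = hbits & 0x03ff;                 // 10 bits mantissa
        int exp = hbits & 0x7c00;                  // 5 bits exponent
        if (exp == 0x7c00) {                       // NaN/Inf
            exp = 0x3fc00;                         // -> NaN/Inf
        } else if (exp != 0) {                     // normalized value
            exp += 0x1c000;                        // exp - 15 + 127
        } else if (mant != 0) {                    // subnormal
            exp = 0x1c400;                         // make it normal
            do {
                mant <<= 1;                        // mantissa * 2
                exp -= 0x400;                      // decrease exp by 1
            } while ((mant & 0x400) == 0);         // while not normal
            mant &= 0x3ff;                         // discard subnormal bit
        }                                          // else +/-0 -> +/-0
        return Float.intBitsToFloat(
            (hbits & 0x8000) << 16                 // sign  << (31 - 15)
            | (exp | mant) << 13);                 // value << (23 - 10)
    }

    public static void main(String[] args) {
        // Half 1.0 now maps to exactly 1.0f.
        System.out.println(toFloatStandard(0x3c00) == 1.0f);
    }
}
```

Whether the smoothed or the exact result is preferable depends on the use case; when the floats are used in further arithmetic or compared against exact constants, the variant without the extension is probably the safer choice.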

The second extension is in the fromFloat() function:

    {                                     // avoid Inf due to rounding
        if( ( fbits & 0x7fffffff ) >= 0x47800000 )
...
        return sign | 0x7bff;             // unrounded not quite Inf
    }

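This extension slightly extends the number range of the half-float format by saving some 32-bit values from being promoted to Infinity. The affected values are those that would have been smaller than Infinity without rounding and would become Infinity only because of the rounding. You can safely remove the lines shown above if you don't want this extension.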
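
To make the affected range concrete, the following sketch classifies a float according to the constants used in fromFloat(). The class, constant and method names are invented for this illustration. The largest finite half value is 65504; floats with a magnitude from 65520 (the rounding midpoint) up to, but not including, 65536 are the ones the extension clamps to 0x7bff instead of letting them round up to Infinity.

```java
// Sketch: which float values does the second extension keep finite?
// Names are hypothetical; the bit patterns follow from fromFloat().
final class HalfOverflowBoundary {
    static final int MAX_HALF_BITS = 0x477fe000;   // 65504.0f, largest finite half
    static final int MIDPOINT_BITS = 0x477ff000;   // 65520.0f = 0x47800000 - 0x1000
    static final int OVERFLOW_BITS = 0x47800000;   // 65536.0f, always becomes Inf

    // true if fromFloat() returns sign | 0x7bff only because of the extension
    static boolean keptFiniteByExtension(float f) {
        int abs = Float.floatToIntBits(f) & 0x7fffffff; // drop the sign bit
        return abs >= MIDPOINT_BITS && abs < OVERFLOW_BITS;
    }

    public static void main(String[] args) {
        System.out.println(Float.intBitsToFloat(MAX_HALF_BITS));
        System.out.println(Float.intBitsToFloat(MIDPOINT_BITS));
        System.out.println(keptFiniteByExtension(65530.0f));
    }
}
```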

I tried to optimize the path for normal values in the fromFloat() function as much as possible, which made it a bit less readable due to the use of precomputed and unshifted constants. I didn't put as much effort into toFloat() since it would not exceed the performance of a lookup table anyway. So if speed really matters, you could use the toFloat() function only to fill a static lookup table with 0x10000 elements and then use this table for the actual conversion. This is about 3 times faster with a current x64 server VM and about 5 times faster with the x86 client VM.
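
The lookup table idea could look roughly like the sketch below. The class and method names are invented, and to keep the example self-contained the table is filled by a simple arithmetic stand-in decoder (decodeSlow, an assumption of this sketch) rather than by the toFloat() function above. In practice you would call toFloat() in the static initializer instead.

```java
// Sketch: precomputed half -> float table with 0x10000 entries (256 KiB).
final class HalfLookupTable {
    private static final float[] TABLE = new float[0x10000];

    static {
        for (int h = 0; h < TABLE.length; h++) {
            TABLE[h] = decodeSlow(h);   // replace with toFloat(h) from the answer
        }
    }

    // fast path: a single array read; ignores the higher 16 bits
    static float toFloat(int hbits) {
        return TABLE[hbits & 0xffff];
    }

    // Stand-in decoder using plain arithmetic. It is only used to fill the
    // table and does not preserve NaN payload bits.
    static float decodeSlow(int hbits) {
        float sign = (hbits & 0x8000) != 0 ? -1.0f : 1.0f;
        int exp = (hbits >>> 10) & 0x1f;
        int mant = hbits & 0x3ff;
        if (exp == 0x1f) {
            return mant == 0 ? sign * Float.POSITIVE_INFINITY : Float.NaN;
        }
        if (exp == 0) {
            return sign * mant * 0x1p-24f;          // zero or subnormal
        }
        // (1 + mant/1024) * 2^(exp-15) == (1024 + mant) * 2^(exp-25)
        return sign * (mant | 0x400) * Math.scalb(1.0f, exp - 25);
    }

    public static void main(String[] args) {
        System.out.println(toFloat(0x3c00));
        System.out.println(toFloat(0x7bff));
    }
}
```

The table costs 256 KiB of memory and a one-time initialization pass, which is usually a reasonable trade when many values are converted.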

I hereby place this code in the public domain.
