How can I extend the range of the US-ASCII charset without breaking ASCII-based code?

I'm opening a file with

private String getStringFromFile(File file) {
    try {
        return Files.readString(Paths.get(file.getPath()), StandardCharsets.US_ASCII);
    }
    catch (Exception e) {
        System.out.println("Error while reading: " + file.getName());
        return "";
    }
}

and even though the file seems to be clearly ASCII compatible, I'm getting Error while reading: fileName.

The file looks like this:

[screenshot of the file contents]

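If I manually delete the headers (the part with the square brackets) before opening the file, the code works (I'm going to strip them in code later anyway). Is there a way to extend the range of the charset for this rare exception without breaking code that only works with ASCII?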

Here's the file in pgn (it can be opened as txt).


The file is almost ASCII. The problem is the quote character in "Côte d’Ivoire".

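The file contains a 0x92 byte. In Windows code page 1252 (Western European languages), that byte is the Unicode character U+2019 RIGHT SINGLE QUOTATION MARK (’).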
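To see why the read fails, here is a minimal sketch that decodes a three-byte sample strictly as US-ASCII, which is how Files.readString treats bad input. The class name Byte92Demo, the helper isStrictAscii and the sample bytes are made up for illustration; they are not from the question.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Byte92Demo {
    // Hypothetical helper: strict US-ASCII decoding that reports bad bytes
    // instead of replacing them. Returns false if any byte is >= 0x80.
    static boolean isStrictAscii(byte[] bytes) {
        try {
            StandardCharsets.US_ASCII.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Made-up sample: 'd', the offending 0x92 byte, 'I'.
        byte[] sample = {'d', (byte) 0x92, 'I'};
        System.out.println(isStrictAscii(sample)); // false

        // windows-1252 is optional in a JVM, so check before using it.
        if (Charset.isSupported("windows-1252")) {
            String s = new String(sample, Charset.forName("windows-1252"));
            System.out.printf("U+%04X%n", (int) s.charAt(1)); // U+2019
        }
    }
}
```

A single byte above 0x7F is enough to make the strict decoder give up, which is why removing the header by hand makes the original code work.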

The problem is that code page 1252 is a slight variation of ISO-8859-1: it uses the 0x80-0x9F range, which ISO-8859-1 reserves for control codes, for some common characters such as the euro sign and the left and right quotation marks. It is also not among the charsets that every Java implementation is required to provide (the constants in StandardCharsets).
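Since windows-1252 is not guaranteed to exist, availability can be checked at run time. The sketch below assumes that falling back to ISO-8859-1 is acceptable: that charset is always present and never rejects a byte, but it decodes 0x92 to a control character rather than a quote. CharsetPicker and lenientWesternCharset are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetPicker {
    // Returns windows-1252 when this JVM provides it, otherwise ISO-8859-1
    // (an assumed fallback: always available, maps every byte to a char).
    static Charset lenientWesternCharset() {
        return Charset.isSupported("windows-1252")
                ? Charset.forName("windows-1252")
                : StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        Charset cs = lenientWesternCharset();
        byte[] sample = {'d', (byte) 0x92, 'I'}; // made-up sample bytes
        String text = new String(sample, cs);
        System.out.println(cs.name() + ": " + text.length() + " chars decoded");
    }
}
```

The returned charset could then be passed to Files.readString in place of StandardCharsets.US_ASCII; pure ASCII files decode identically under either choice.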

How to fix:

  • if your system supports the windows-1252 (alias cp1252) charset, use it.
  • otherwise, use a FilterInputStream to replace the non-ASCII bytes, for example with a space (ASCII 0x20) or via a custom Map (0x92 -> 0x27 to replace the RIGHT SINGLE QUOTATION MARK (’) with a plain APOSTROPHE (')). After that, the InputStreamReader will give you the expected characters.
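The second option could look roughly like the following sketch. The class AsciiSanitizingInputStream and its readSanitized helper are hypothetical names, and the replacement table (0x92 -> 0x27, everything else non-ASCII -> space) is just the example mapping from the list above. It assumes Java 10 or later for Reader.transferTo.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.Map;

class AsciiSanitizingInputStream extends FilterInputStream {
    private final Map<Integer, Integer> replacements; // non-ASCII byte -> ASCII byte
    private final int fallback;                       // used for unlisted bytes

    AsciiSanitizingInputStream(InputStream in, Map<Integer, Integer> replacements, int fallback) {
        super(in);
        this.replacements = replacements;
        this.fallback = fallback;
    }

    // ASCII bytes and the end-of-stream marker (-1) pass through unchanged.
    private int sanitize(int b) {
        return b < 0x80 ? b : replacements.getOrDefault(b, fallback);
    }

    @Override
    public int read() throws IOException {
        return sanitize(super.read());
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        for (int i = off; i < off + n; i++) { // no iterations when n == -1
            buf[i] = (byte) sanitize(buf[i] & 0xFF);
        }
        return n;
    }

    // Reads the whole stream as US-ASCII after sanitizing it.
    static String readSanitized(InputStream raw) throws IOException {
        try (Reader reader = new InputStreamReader(
                new AsciiSanitizingInputStream(raw, Map.of(0x92, 0x27), 0x20),
                StandardCharsets.US_ASCII)) {
            StringWriter out = new StringWriter();
            reader.transferTo(out);
            return out.toString();
        }
    }
}
```

In the question's method, this could be called as AsciiSanitizingInputStream.readSanitized(Files.newInputStream(file.toPath())). Code downstream still sees only ASCII characters, at the cost of losing the original non-ASCII bytes.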