I'm opening a file with:
    private String getStringFromFile(File file) {
        try {
            return Files.readString(Paths.get(file.getPath()), StandardCharsets.US_ASCII);
        }
        catch (Exception e) {
            System.out.println("Error while reading: " + file.getName());
            return "";
        }
    }
and even though the file seems to be clearly ASCII-compatible, I'm getting "Error while reading: fileName".
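The file looks like this:
If I manually delete the headers (the parts in square brackets) before opening the file, the code works (I'm going to strip them in code later anyway). Is there a way to extend the range of the charset to cover this rare exception without breaking code that only works with ASCII?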
Here's the file in pgn (it can be opened as txt).
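The file is almost ASCII. The problem is the quote character in "Côte d’Ivoire".
The file contains one 0x92 byte. In Windows code page 1252 (Western European languages), that byte is the Unicode character U+2019 RIGHT SINGLE QUOTATION MARK.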
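To see the effect of that single byte in isolation, here is a minimal sketch that decodes it strictly, first as US-ASCII and then as code page 1252. The class name `Byte92Demo`, the helper `decodeStrict`, and the three sample bytes are made up for illustration. It also assumes the JVM ships the `windows-1252` charset, which Java does not guarantee.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Byte92Demo {
    // Decodes strictly: a new CharsetDecoder reports malformed or unmappable
    // input by throwing instead of silently substituting a character.
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder().decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample: 'd', the problematic 0x92 byte, 'I'.
        byte[] bytes = {'d', (byte) 0x92, 'I'};

        try {
            decodeStrict(bytes, StandardCharsets.US_ASCII);
        } catch (CharacterCodingException e) {
            // 0x92 is outside the 7-bit ASCII range, so strict decoding fails.
            System.out.println("US-ASCII rejects the input: " + e);
        }

        // Assumption: windows-1252 is available on this JVM.
        String text = decodeStrict(bytes, Charset.forName("windows-1252"));
        System.out.printf("U+%04X%n", (int) text.charAt(1)); // prints U+2019
    }
}
```

`Files.readString` decodes strictly in the same way, which is consistent with the exception swallowed by the `catch` block in the question.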
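The problem is that code page 1252 is a slight variation of ISO-8859-1: it uses positions that ISO-8859-1 does not assign to printable characters (0x80 to 0x9F) for some common characters such as the euro symbol (€) and the left and right quotation marks. It is also not among the charsets that every Java platform is required to provide (the ones listed in StandardCharsets).

How to fix: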
- If your Java system knows the win1252 or cp1252 charset, use it.
- Use a FilterInputStream to replace the non-ASCII characters, for example with a space (ASCII 0x20) or from a custom Map (0x92 -> 0x27 to replace the RIGHT SINGLE QUOTATION MARK (’) with a simple APOSTROPHE (')). After that, the InputStreamReader will give you the expected characters.
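Both options can be sketched roughly as follows. Everything named here (`PgnTextReader`, `AsciiSanitizingInputStream`, the method names, the temporary sample file) is hypothetical and only illustrates the idea. The first method assumes the JVM provides the charset under its Java name `windows-1252` (`Cp1252` is an alias on typical JDKs). The second assumes a space is an acceptable stand-in for any non-ASCII byte that has no entry in the map.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

class PgnTextReader {
    // Option 1: decode as code page 1252. Charset.forName throws
    // UnsupportedCharsetException if this JVM does not provide it.
    static String readAsWindows1252(Path path) throws IOException {
        return Files.readString(path, Charset.forName("windows-1252"));
    }

    // Option 2: rewrite non-ASCII bytes first, then decode as plain US-ASCII.
    static String readAsSanitizedAscii(Path path) throws IOException {
        Map<Integer, Integer> replacements = Map.of(0x92, 0x27); // ’ -> '
        try (Reader reader = new InputStreamReader(
                new AsciiSanitizingInputStream(Files.newInputStream(path), replacements),
                StandardCharsets.US_ASCII)) {
            StringWriter out = new StringWriter();
            reader.transferTo(out);
            return out.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample file containing the problematic 0x92 byte.
        Path sample = Files.createTempFile("sample", ".pgn");
        Files.write(sample, new byte[] {'d', (byte) 0x92, 'I'});
        System.out.println(readAsSanitizedAscii(sample)); // prints d'I
        Files.delete(sample);
    }
}

// Bytes below 0x80 pass through unchanged; any other byte is looked up in
// the replacement map and falls back to a space (0x20) when it has no entry.
class AsciiSanitizingInputStream extends FilterInputStream {
    private final Map<Integer, Integer> replacements;

    AsciiSanitizingInputStream(InputStream in, Map<Integer, Integer> replacements) {
        super(in);
        this.replacements = replacements;
    }

    private int sanitize(int b) {
        if (b < 0x80) {
            return b; // ASCII byte, or -1 signalling end of stream
        }
        return replacements.getOrDefault(b, 0x20);
    }

    @Override
    public int read() throws IOException {
        return sanitize(super.read());
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // When n is -1 (end of stream) the loop body never runs.
        for (int i = off; i < off + n; i++) {
            buf[i] = (byte) sanitize(buf[i] & 0xFF);
        }
        return n;
    }
}
```

The two methods do not return the same text: the first keeps the typographic quote as U+2019, while the second yields pure ASCII, which may suit downstream code that was written with ASCII only in mind.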