I am currently trying to parse UnicodeData.txt (format: ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html). However, I run into a problem when I try to read a line like the following:
something;123D;;LINE TABULATION;
I try to get the data from the fields with code such as the following, where the variable in holds the current line. The problem is that fields[3] is not getting filled in, and sscanf is returning 2.
char fields[4][256];
/* four conversions for four buffers; the third one fails on the empty field */
sscanf(in, "%[^;];%[^;];%[^;];%[^;];",
       fields[0], fields[1], fields[2], fields[3]);
I know this is the correct implementation of scanf(), but is there a way to get this to work, short of making my own scanf()?
I don't think sscanf will do what you need: the sscanf format %[^;] matches a non-empty sequence of non-semicolon characters, so it fails on an empty field. The alternative would be using readline with the separator being ';', like:

(I have only now seen that you took the c++ tag off this question. If your problem is C-only, I have another solution, below.)

Just in case you would like to consider the following alternative, using sscanf calls and the "%n" format specifier, which stores the number of characters read so far into an integer:

On every cycle, it reads a maximum of 255 characters into the corresponding fields[i], until it encounters a delimiter semicolon. After reading them, it stores how many characters it has read into n, which had been zeroed (oh my...) beforehand. It then advances the pointer into the string by the number of characters read, plus one for the delimiter semicolon.
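As a minimal sketch of the loop described above (not the answerer's exact program), the following C helper applies the same sscanf and "%n" idea. The names split_with_n, NFIELDS and FIELD_LEN are assumptions made for this illustration, and the demonstration printf calls are left out.

```c
#include <stdio.h>

#define NFIELDS   4    /* assumed number of fields to extract */
#define FIELD_LEN 256  /* assumed buffer size per field */

/* Splits line on ';' into at most NFIELDS fields.
   Returns the number of fields stored; an empty field is stored as "". */
int split_with_n(const char *line, char fields[NFIELDS][FIELD_LEN])
{
    const char *p = line;
    int count = 0;

    while (count < NFIELDS && *p != '\0') {
        int n = 0;                /* stays 0 when the field is empty */
        fields[count][0] = '\0';  /* a failed match leaves the buffer untouched */

        /* read up to 255 non-';' characters, then record how many were read */
        sscanf(p, "%255[^;]%n", fields[count], &n);

        p += n;                   /* skip the characters of this field */
        if (*p == ';')
            p++;                  /* skip the delimiter, if there is one */
        count++;
    }
    return count;
}
```

Called on the line from the question, this should leave fields[2] empty and put LINE TABULATION in fields[3]. Unlike the description above, the sketch skips the semicolon only when one is actually present, so it does not step past the end of a line that lacks a trailing delimiter.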
The printf for the return value of sscanf, and the printing of the result, are just for demonstration purposes. You can see the code working on http://codepad.org/kae8smPF without the getchar(); and with the for declaration moved outside for C90 compliance.

scanf does not handle "empty" fields, so you will have to parse the line on your own. The following solution is based on strchr rather than the quite slow sscanf.
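A strchr-based splitter along these lines might look as follows. This is a sketch under stated assumptions rather than the answerer's code: the signature of parse, the FIELD_LEN constant and the truncation of over-long fields are all choices made for this illustration.

```c
#include <string.h>

#define FIELD_LEN 256  /* assumed buffer size per field */

/* Copies the ';'-separated fields of str into fields[0..max_fields-1].
   Returns the number of fields stored; an empty field is stored as "". */
int parse(const char *str, char fields[][FIELD_LEN], int max_fields)
{
    int count = 0;

    while (count < max_fields) {
        const char *end = strchr(str, ';');  /* next delimiter, or NULL */
        size_t len = end ? (size_t)(end - str) : strlen(str);

        if (len >= FIELD_LEN)
            len = FIELD_LEN - 1;             /* truncate an over-long field */
        memcpy(fields[count], str, len);
        fields[count][len] = '\0';
        count++;

        if (end == NULL)
            break;                           /* that was the last field */
        str = end + 1;                       /* continue after the ';' */
    }
    return count;
}
```

With the line from the question and max_fields set to 5, the sketch stores something, 123D, an empty string, LINE TABULATION and a final empty string.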
The function parse extracts fields from the input str, separated by semicolons. Four semicolons give five fields, some or all of which can be blank. No provision is made for escaping the semicolons.

The output of the test program is:

Update: a slightly shorter version, which does not need an extra array to store the start of each substring, is: