How to get scanf to continue past an empty scanset

I am currently trying to parse UnicodeData.txt, which has this format: ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html However, I am hitting a problem when I try to read a line like the following:

something;123D;;LINE TABULATION;

I try to get the data from the fields with code such as the following. The problem is that fields[2] and fields[3] are not getting filled in, and sscanf returns 2. Here in is the current line.

char fields[4][256];
sscanf(in, "%[^;];%[^;];%[^;];%[^;];",
    fields[0], fields[1], fields[2], fields[3]);

I know this is the correct behavior for scanf(), but is there a way to get this to work, short of writing my own scanf()?
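
For reference, here is a minimal sketch that reproduces the behavior. The name count_converted is hypothetical, and the 255 width limits are an addition to guard the 256-byte buffers; neither is part of the code above.

```c
#include <stdio.h>

/* Hypothetical helper: returns how many of the four fields sscanf converted. */
static int count_converted(const char *in)
{
    char fields[4][256];

    /* %255[^;] needs at least one non-';' character, so an empty
       field is a matching failure and scanning stops right there. */
    return sscanf(in, "%255[^;];%255[^;];%255[^;];%255[^;];",
                  fields[0], fields[1], fields[2], fields[3]);
}
```

Called on the sample line, this should return 2, because the third field is empty and the third conversion fails.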

Comments
  • 木叶 replied:

    I don't think sscanf will do what you need: the sscanf format %[^;] matches only a non-empty sequence of non-semicolon characters. An alternative is std::getline with ';' as the delimiter:

    #include <iostream>
    #include <sstream>
    #include <string>
    
    int main() {
      using namespace std;
      istringstream i { "something;123D;;LINE TABULATION;\nsomething;123D;;LINE TABULATION;\nsomething;123D;;LINE TABULATION;\n" };
      string a, b, c, d, newline;
      while( getline(i, a, ';') && getline(i, b, ';') && getline(i, c, ';') && getline (i, d, ';') && getline(i, newline) )
        cout << d << ',' << c << '-' << b << ':' << a << endl; 
    }
    

    (I only just noticed that you removed the c++ tag from this question. If your problem is C-only, here is another solution. Note that strtok_r is POSIX, not standard C:)

    #include <string.h>
    #include <stdio.h>
    
    int main() {
      typedef char buffer[2048];
      buffer line;
      while( fgets(line, sizeof(line), stdin) != NULL ) {
        printf("(%s)\n", line);
        char *end = line;
        char *s1 = *end == ';' ? (*end = '\0'), end++ : strtok_r(end, ";", &end);
        char *s2 = *end == ';' ? (*end = '\0'), end++ : strtok_r(end, ";", &end);
        char *s3 = *end == ';' ? (*end = '\0'), end++ : strtok_r(end, ";", &end);
        char *s4 = *end == ';' ? (*end = '\0'), end++ : strtok_r(end, ";", &end);
        printf("[%s][%s][%s][%s]\n", s4, s3, s2, s1);
      }
    }
    
  • 痛苦的/呻吟 replied:

    As an alternative, consider sscanf with the "%n" format specifier, which stores the number of characters read so far into an int:

    #include <stdio.h>
    #define N 4
    
    int main( ){
    
        char * str = "something;123D;;LINE TABULATION;";
        char * wanderer = str;
        char fields[N][256] = { 0 };
        int n;
    
        for ( int i = 0; i < N; i++ ) {
            n = 0;
            printf( "%d ", sscanf( wanderer, "%255[^;]%n", fields[i], &n ) );
            wanderer += n + 1;
        }
    
        putchar( 10 );
    
        for ( int i = 0; i < N; i++ )
            printf( "%d: %s\n", i, fields[i] );
    
        getchar( );
        return 0;
    }
    

    On each iteration, it reads at most 255 characters into the corresponding fields[i], stopping at the delimiter semicolon ;. The %n then stores the number of characters read so far in n. Since n is zeroed beforehand, it stays 0 when the field is empty and the %[^;] conversion fails before %n is reached.

    It then advances the pointer into the string by the number of characters read, plus one for the delimiting semicolon.

    The printf of sscanf's return value and the printing of the results are for demonstration purposes only. You can see the code working at http://codepad.org/kae8smPF, without the getchar() and with the for-loop declarations moved outside the loops for C90 compliance.
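
The advance rule described above can be wrapped in a small function. This is only a sketch: split_fields is a hypothetical name, and unlike the program above it checks for the delimiter before stepping over it, so a line without a trailing semicolon does not push the pointer past the terminating NUL.

```c
#include <stdio.h>

/* Hypothetical helper: splits a semicolon-separated line into at most
   max fields of up to 255 characters each; returns the field count. */
static int split_fields(const char *line, char fields[][256], int max)
{
    const char *p = line;
    int count = 0;

    while (count < max) {
        int n = 0;                  /* stays 0 if the field is empty */

        fields[count][0] = '\0';    /* an empty field yields "" */
        (void)sscanf(p, "%255[^;]%n", fields[count], &n);
        count++;

        p += n;                     /* step over the field itself */
        if (*p != ';')              /* end of string, or field was truncated */
            break;
        p++;                        /* step over the delimiter */
    }
    return count;
}
```

With max set to 4, the sample line gives four fields, the third of which is the empty string. A field longer than 255 characters is truncated and ends the scan early, which is a simplification of this sketch.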

  • 御林军 replied:

    scanf does not handle "empty" fields. So you will have to parse it on your own.

    The following solution is:

    • fast, as it uses strchr rather than the quite slow sscanf
    • flexible, as it will detect an arbitrary number of fields, up to a given maximum.

    The function parse extracts fields from the input str, separated by semi-colons. Four semi-colons give five fields, some or all of which can be blank. No provision is made for escaping the semi-colons.

    #include <stdio.h>
    #include <string.h>
    
    static int parse(char *str, char *out[], int max_num) {
        int num = 0;
        out[num++] = str;
        while (num < max_num && str && (str = strchr(str, ';'))) {
            *str = 0;           // nul-terminate previous field
            out[num++] = ++str; // save start of next field
        }
        return num;
    }
    
    int main(void) {
        char test[] = "something;123D;;LINE TABULATION;";
        char *field[99];
        int num = parse(test, field, 99);
        int i;
        for (i = 0; i < num; i++)
            printf("[%s]", field[i]);
        printf("\n");
        return 0;
    }
    

    The output of this test program is:

    [something][123D][][LINE TABULATION][]
    

    Update: a slightly shorter version, which does not need an extra array to store the start of each substring:

    #include <stdio.h>
    #include <string.h>
    
    static int replaceSemicolonsWithNuls(char *p) {
        int num = 0;
        while ((p = strchr(p, ';'))) {
            *p++ = 0;
            num++; 
        }
        return num;
    }
    
    int main(void) {
        char test[] = "something;123D;;LINE TABULATION;";
        int num = replaceSemicolonsWithNuls(test);
        int i;
        char *p = test;
        for (i = 0; i < num; i++, p += strlen(p) + 1)
            printf("[%s]", p);
        printf("\n");
        return 0;
    }
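
Both versions above overwrite the semicolons in the input buffer. If the line must stay intact (for example, when it is a string literal), one possible sketch uses strcspn to measure each field and copies it out. The name copy_field and its signature are hypothetical, chosen only for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copies field number `index` (0-based) of a
   semicolon-separated line into out, without modifying line.
   Returns 1 on success, 0 if the line has fewer fields. */
static int copy_field(const char *line, int index, char *out, size_t out_size)
{
    size_t len;

    if (out_size == 0)
        return 0;
    while (index-- > 0) {
        line += strcspn(line, ";");   /* jump to the next ';' or the end */
        if (*line != ';')
            return 0;                 /* ran out of fields */
        line++;                       /* step over the delimiter */
    }
    len = strcspn(line, ";");         /* length of the wanted field */
    if (len >= out_size)
        len = out_size - 1;           /* truncate over-long fields */
    memcpy(out, line, len);
    out[len] = '\0';
    return 1;
}
```

Empty fields come back as empty strings, so the third field of the sample line is "" and the fourth is "LINE TABULATION". The trade-off is that each call rescans the line from the start, so this suits occasional lookups better than extracting every field.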