How to create a query in SQL to cut sentences into words and add their frequency to a new table

I am trying to write a query that I am not sure is possible. I have a table called sentences, which contains id, sentence, and verified columns, as shown in the image below.

[image: the sentences table]

I have another table, word count, which contains id, word, and frequency. I would like that table to be updated accordingly whenever a sentence is inserted or deleted, or to be updated once a day, since there could be a lot of sentences.

My expected output is shown in the image below.

[image: the expected word-count output]

Any idea whether this is doable? Can anyone help?

Comments
  • wearum replied:

    For recent MySQL versions (8.0.4 and later), you can use

    SELECT ROW_NUMBER() OVER (ORDER BY COUNT(word) DESC, word) wid, word, COUNT(word) freq 
    FROM sentencess 
    CROSS JOIN JSON_TABLE( CONCAT('["', REPLACE(sentencess.sent, ' ', '","'), '"]'),
                           "$[*]" COLUMNS( word VARCHAR(254) PATH "$" )
                         ) AS jsontable
    GROUP BY word
    ORDER BY freq DESC, word;
    

    fiddle

    PS. I cannot reproduce the output ordering because I cannot understand the ordering criteria within the freq groups.
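
    If the counts also need to land in the word-count table that the question asks for, one option (my sketch, not part of this answer; it assumes a target table wordcount(wid, word, freq) shaped like the expected output) is to wrap the same query in a periodic truncate-and-insert:

    -- wordcount(wid, word, freq) is an assumed target table matching the expected output
    TRUNCATE TABLE wordcount;
    INSERT INTO wordcount (wid, word, freq)
    SELECT ROW_NUMBER() OVER (ORDER BY COUNT(word) DESC, word) wid, word, COUNT(word) freq
    FROM sentencess
    CROSS JOIN JSON_TABLE( CONCAT('["', REPLACE(sentencess.sent, ' ', '","'), '"]'),
                           "$[*]" COLUMNS( word VARCHAR(254) PATH "$" )
                         ) AS jsontable
    GROUP BY word;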

  • lipsum replied:

    Based on this DBA Stack Exchange post I could imagine something like the following.

    Basic steps:

    1. Create a table to contain the list of words (word_index in my example)
    2. Create a table to contain the word counts (word_count in my example)
    3. Create a stored procedure that splits the sentences into words on SPACE (it might have to be tweaked to also allow other whitespace such as line wraps) and writes them into the word_index table
    4. Calculate the statistics and write the results into word_count

    Code, step by step:

    Create word_index:

    CREATE TABLE IF NOT EXISTS `word_index` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `word` varchar(150) DEFAULT NULL,
      PRIMARY KEY (`id`)
    );
    

    Create word_count:

    CREATE TABLE IF NOT EXISTS `word_count` (
      `word` varchar(150) NOT NULL,
      `occurrences` int(11) DEFAULT NULL,
      PRIMARY KEY (`word`)
    );
    

    Create the SPLIT_STRING helper function and the transfer_cell procedure that transfers the split words into the target table:

    DELIMITER //
    CREATE FUNCTION `SPLIT_STRING`(val TEXT, delim VARCHAR(12), pos INT) RETURNS text CHARSET latin1
    BEGIN
            DECLARE output TEXT;
            SET output = REPLACE(SUBSTRING(SUBSTRING_INDEX(val, delim, pos), CHAR_LENGTH(SUBSTRING_INDEX(val, delim, pos - 1)) + 1), delim, '');
            IF output = '' THEN
                SET output = null;
            END IF;
            RETURN output;
        END//
    DELIMITER ;
    
    -- Dumping structure for procedure test.TRANSFER_CELL
    DELIMITER //
    CREATE PROCEDURE `transfer_cell`()
    BEGIN
            DECLARE i INTEGER;
            SET i = 1;
            REPEAT
                INSERT INTO word_index (word)
                SELECT SPLIT_STRING(sent, ' ', i)
                FROM sentences
                WHERE SPLIT_STRING(sent, ' ', i) IS NOT NULL;
                SET i = i + 1;
            UNTIL ROW_COUNT() = 0
            END REPEAT;
        END//
    DELIMITER ;
    

    That's the basic setup. Please note that I used the table name sentences rather than sentencess with a double s.

    Update the statistics:

    TRUNCATE TABLE word_index;
    TRUNCATE TABLE word_count;
    
    CALL transfer_cell();
    INSERT INTO word_count
      SELECT word, COUNT(1) occurrences FROM word_index 
      GROUP BY word;
    

    Result:

    Here is a screenshot of the result of the steps above:

    Result image
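
    Since the question also mentions refreshing once a day, the refresh above could be automated with a MySQL scheduled event. This is only a sketch under the assumption that the event scheduler is enabled (event_scheduler = ON); the event name refresh_word_count is made up:

    DELIMITER //
    -- refresh_word_count is a hypothetical name; the event scheduler must be enabled
    CREATE EVENT refresh_word_count
    ON SCHEDULE EVERY 1 DAY
    DO
    BEGIN
        -- rebuild the helper tables from scratch, then recompute the counts
        TRUNCATE TABLE word_index;
        TRUNCATE TABLE word_count;
        CALL transfer_cell();
        INSERT INTO word_count
          SELECT word, COUNT(1) occurrences FROM word_index
          GROUP BY word;
    END//
    DELIMITER ;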

  • 花裤衩 replied:

    Perl, PHP, and the like have much more robust regexp engines for splitting. I would use one of them, not SQL.

    I would use batch inserts:

    INSERT INTO words (word, ct)
        VALUES ('this', 1), ('that', 1), ...   -- about 100 words at a time
        ON DUPLICATE KEY UPDATE ct = ct + VALUES(ct);
    
    CREATE TABLE words (
        word VARCHAR(66) NOT NULL,
        ct MEDIUMINT UNSIGNED NOT NULL,
        PRIMARY KEY(word)
    ) ENGINE=InnoDB;
    

    I see no need for having words and counts in separate tables, nor any need for an AUTO_INCREMENT for a "word_id". The word is a perfectly good "natural PK". However, you should decide what to do about case folding and accent stripping.
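
    As a hedged illustration of the case-folding point (my assumption about one reasonable choice, not something this answer prescribes), an accent- and case-insensitive collation on the primary key makes 'Alex', 'alex', and 'aléx' collapse into the same row:

    CREATE TABLE words (
        -- utf8mb4_0900_ai_ci (MySQL 8.0+) compares case- and accent-insensitively,
        -- so the duplicate-key logic above folds 'Alex' onto 'alex' automatically
        word VARCHAR(66) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
        ct MEDIUMINT UNSIGNED NOT NULL,
        PRIMARY KEY(word)
    ) ENGINE=InnoDB;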

    As for splitting into words... double quotes and certain other characters are clearly word boundaries. But some characters are ambiguous:

    ' -- part of a contraction or a quote?
    . -- abbreviation or end of a sentence

    Etc.
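
    For what it is worth, here is a rough sketch of one pre-processing option (my assumption, using MySQL 8.0's REGEXP_REPLACE with a deliberately simplistic character class; apostrophes and periods still need the judgment calls described above):

    -- collapse every run of characters outside a crude "word character" set into a space
    SELECT REGEXP_REPLACE(sent, '[^a-zA-Z0-9 ]+', ' ') AS cleaned
    FROM sentencess;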

  • 习惯性隐身 replied:

    If you are running MySQL 8.0, I would recommend a recursive common table expression for this. The idea is to iteratively walk through each message and split it into words along the way. All that is left to do then is to aggregate.

    with recursive cte as (
        select 
            substring(concat(sent, ' '), 1, locate(' ', sent)) word,
            substring(concat(sent, ' '), locate(' ', sent) + 1) sent
        from messages
        union all
        select 
            substring(sent, 1, locate(' ', sent)) word,
            substring(sent, locate(' ', sent) + 1) sent
        from cte
        where locate(' ', sent) > 0
    )
    select row_number() over(order by count(*) desc, word) wid, word, count(*) freq
    from cte 
    group by word
    order by wid
    

    In earlier versions, you could emulate the same behavior with a table of numbers (a rough sketch follows after the demo link below).

    Demo on DB Fiddle
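
    For completeness, a rough sketch of the numbers-table variant mentioned above (my assumptions: a helper table numbers(n) pre-filled with 1, 2, 3, ... up to the largest number of words in any sentence, and the same messages/sent names used in the query above):

    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(m.sent, ' ', n.n), ' ', -1) AS word,
           COUNT(*) AS freq
    FROM messages m
    JOIN numbers n
      -- one row per word: n runs from 1 up to (number of spaces + 1)
      ON n.n <= CHAR_LENGTH(m.sent) - CHAR_LENGTH(REPLACE(m.sent, ' ', '')) + 1
    GROUP BY word
    ORDER BY freq DESC, word;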

    Sample data:

    sent                       | verified
    :------------------------- | -------:
    hello my name is alex      |     null
    hey alin and alex I'm tom  |     null
    hello alex my name is alin |     null

    Result:

    wid | word  | freq
    --: | :---- | ---:
      1 | alex  |    3
      2 | alin  |    2
      3 | hello |    2
      4 | is    |    2
      5 | my    |    2
      6 | name  |    2
      7 | and   |    1
      8 | hey   |    1
      9 | I'm   |    1
     10 | tom   |    1

    As for keeping the query results in a separate table, that may be more complicated than you think: you would need to insert, delete, or update the target table according to changes in the original table, which cannot be done in a single statement in MySQL. Also, maintaining an up-to-date flag in the original table creates a race condition, where changes might happen while the target table is being updated.

    A simpler option would be to put the query in a view, so you get an always-up-to-date perspective on your data. For this, you can just wrap the above query in a create view statement, like:

    create view words_view as < above query >;
    

    If performance becomes an issue, you could also periodically truncate and re-fill a words table:

    truncate table words;
    insert into words < above query >;
    

  • 黑色爱情 replied:

    Warning: this is T-SQL, not MySQL.

    -- 1. To create a function that splits the sentence into words, and returns the Words Table
    -- 2. To insert into your Result Table all of the words Table results
    -- 3. Calculate the Frequency
    
    ----------[ The Split Function ]
    CREATE FUNCTION dbo.udf_SplitString 
            (
                    @Sentence   varchar(max)
                ,   @Separator  char(1) 
            )
            RETURNS @WordList TABLE (Word varchar(50)) 
        AS
            BEGIN
                SET @Separator  =   ISNULL(@Separator, ' ') 
    
                DECLARE @Word   varchar(50)
    
                SET @Sentence = LTRIM(@Sentence) + @Separator -- Make sure last word has a separator after. Also get rid of leading spaces.
    
                WHILE   (CHARINDEX(@Separator, @Sentence) > 0)
                    BEGIN
                        SET @Word = SUBSTRING(@Sentence, 1, CHARINDEX(@Separator, @Sentence) - 1)
                        INSERT INTO @WordList   SELECT LTRIM(@Word)
                        -- Remove word added to the List from the sentence.
                        SET @Sentence =  SUBSTRING(@Sentence,   CHARINDEX(@Separator, @Sentence) + 1,   LEN(@Sentence))
                        SET @Sentence =  LTRIM(@Sentence)           
                END                 
                RETURN
            END 
    
    ----------[ The Script ]
    DECLARE     @SentenceList   TABLE   (Sentence varchar(max))
    INSERT INTO @SentenceList   VALUES
                ('hello my name is alex')
            ,   ('hey alin and alex I''m tom')  
            ,   ('hello alex my name is alin')
    
    DECLARE     @WordList   TABLE   (Word varchar(50))
    
    INSERT INTO @WordList   
    SELECT  
            W.Word
    FROM        @SentenceList   S
    CROSS APPLY (
                    SELECT Word FROM dbo.udf_SplitString(S.Sentence, ' ')
                ) W 
    
    SELECT 
            ID  =   ROW_NUMBER() OVER(ORDER BY SUM(1) DESC, Word)
        ,   Word
        ,   Frequency   =   SUM(1)
    FROM @WordList
    GROUP BY Word
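
    If the frequencies should end up in a physical table rather than just a result set, one possible follow-up (my sketch; the table name dbo.WordFrequency is an assumption, not part of this answer) is a SELECT ... INTO at the end of the script:

    ----------[ Optional: persist the result ]
    SELECT 
            ID  =   ROW_NUMBER() OVER(ORDER BY SUM(1) DESC, Word)
        ,   Word
        ,   Frequency   =   SUM(1)
    INTO    dbo.WordFrequency   -- hypothetical target table, created by SELECT ... INTO
    FROM @WordList
    GROUP BY Word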