For recent MySQL versions (8.0.4 and later) you can use JSON_TABLE:
SELECT ROW_NUMBER() OVER (ORDER BY COUNT(word) DESC, word) wid, word, COUNT(word) freq
FROM sentencess
CROSS JOIN JSON_TABLE( CONCAT('["', REPLACE(sentencess.sent, ' ', '","'), '"]'),
"$[*]" COLUMNS( word VARCHAR(254) PATH "$" )
) AS jsontable
GROUP BY word
ORDER BY freq DESC, word;
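To try the query, a hypothetical sentencess table can be set up along these lines (the sample sentences are borrowed from the T-SQL example further down this page; adjust the column name sent to your real schema):
CREATE TABLE sentencess (sent VARCHAR(1000));
INSERT INTO sentencess VALUES
('hello my name is alex'),
('hey alin and alex I''m tom'),
('hello alex my name is alin');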
PS. I cannot reproduce the output ordering because I cannot understand the ordering criteria within freq groups.
Based on this DBA Stack Exchange post I could imagine something like the following.
Basic steps:
Create a table to contain a list of words (word_index in my example)
Create a table to contain the word counts (word_count in my example)
Create a stored procedure to split the sentences into words on SPACE (it might have to be tweaked to also allow other whitespace such as line breaks; see the normalization sketch after the basic setup below) and write them into the word_index table
Calculate the statistics and write them into word_count
Step-by-step code:
Create word_index:
CREATE TABLE IF NOT EXISTS `word_index` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`word` varchar(150) DEFAULT NULL,
PRIMARY KEY (`id`)
);
Create word_count:
CREATE TABLE IF NOT EXISTS `word_count` (
`word` varchar(150) NOT NULL,
`occurrences` int(11) DEFAULT NULL,
PRIMARY KEY (`word`)
);
Create procedure transfer_cell to transfer split words into target table:
DELIMITER //
CREATE FUNCTION `SPLIT_STRING`(val TEXT, delim VARCHAR(12), pos INT) RETURNS text CHARSET latin1
BEGIN
DECLARE output TEXT;
-- Take the first `pos` delimited chunks, then cut away the first `pos - 1` of them,
-- leaving only the pos-th word ('' once pos runs past the last word).
SET output = REPLACE(SUBSTRING(SUBSTRING_INDEX(val, delim, pos), CHAR_LENGTH(SUBSTRING_INDEX(val, delim, pos - 1)) + 1), delim, '');
IF output = '' THEN
SET output = null;
END IF;
RETURN output;
END//
DELIMITER ;
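For example, SPLIT_STRING('hello my name is alex', ' ', 2) returns 'my', and any position past the last word yields NULL; the procedure below relies on that NULL to know when a sentence is exhausted.
SELECT SPLIT_STRING('hello my name is alex', ' ', 2) AS second_word;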
-- Dumping structure for procedure test.TRANSFER_CELL
DELIMITER //
CREATE PROCEDURE `transfer_cell`()
BEGIN
DECLARE i INTEGER;
DECLARE rows_inserted INTEGER;
SET i = 1;
-- Keep pulling the i-th word out of every sentence until a pass inserts no rows.
REPEAT
INSERT INTO word_index (word)
SELECT SPLIT_STRING(sent, ' ', i)
FROM sentences
WHERE SPLIT_STRING(sent, ' ', i) IS NOT NULL;
-- Capture ROW_COUNT() immediately after the INSERT, before the next SET can overwrite it.
SET rows_inserted = ROW_COUNT();
SET i = i + 1;
UNTIL rows_inserted = 0
END REPEAT;
END//
DELIMITER ;
That's the basic setup. Please note that I used table sentences rather than sentencess with double s.
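A possible sketch of the whitespace tweak mentioned in the steps above, assuming the raw text lives in sentences.sent: normalize tabs and line breaks to plain spaces before calling transfer_cell(), so SPLIT_STRING only ever has to deal with single-space delimiters.
UPDATE sentences
SET sent = REPLACE(REPLACE(REPLACE(sent, '\r', ' '), '\n', ' '), '\t', ' ');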
Update the statistics:
TRUNCATE TABLE word_index;
TRUNCATE TABLE word_count;
CALL transfer_cell();
INSERT INTO word_count
SELECT word, COUNT(1) occurrences FROM word_index
GROUP BY word;
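As a usage sketch (assuming MySQL 8.0 for ROW_NUMBER), the counts can then be read back in the same wid/word/freq shape as the JSON_TABLE query at the top:
SELECT ROW_NUMBER() OVER (ORDER BY occurrences DESC, word) wid, word, occurrences freq
FROM word_count
ORDER BY wid;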
Perl, PHP, etc. have much more powerful regexp engines for splitting. I would use one of them, not SQL.
I would use batched INSERTs:
INSERT INTO words (word, ct)
VALUES ('this', 1), ('that', 1), ... -- about 100 words at a time
ON DUPLICATE KEY UPDATE ct = ct + VALUES(ct);  -- accumulate the count instead of overwriting it
CREATE TABLE words (
word VARCHAR(66) NOT NULL,
ct MEDIUMINT UNSIGNED NOT NULL,
PRIMARY KEY(word)
) ENGINE=InnoDB;
I see no need for having words and counts in separate tables, nor any need for an AUTO_INCREMENT for a "word_id". The word is a perfectly good "natural PK". However, you should decide what to do about case folding and accent stripping.
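A hedged variation on the table above, assuming you want 'Resume', 'resume' and 'résumé' counted as one word: declare the primary key with MySQL 8.0's accent- and case-insensitive collation, so folding happens at comparison time.
CREATE TABLE words (
word VARCHAR(66) NOT NULL COLLATE utf8mb4_0900_ai_ci,
ct MEDIUMINT UNSIGNED NOT NULL,
PRIMARY KEY(word)
) ENGINE=InnoDB;
Alternatively, apply LOWER() (and whatever accent stripping you choose) when inserting, so the stored spelling itself is canonical.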
As for splitting into words... double quotes and some other characters are clearly word boundaries, but some characters are ambiguous:
' -- part of a contraction, or a quote?
. -- an abbreviation, or the end of a sentence?
Etc.
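A minimal sketch of one possible splitting policy, assuming MySQL 8.0's REGEXP_REPLACE and the sentencess table from above: treat obvious punctuation as word boundaries but keep apostrophes, so contractions such as I'm survive intact.
SELECT REGEXP_REPLACE(sent, '[".,;:!?()]+', ' ') AS cleaned
FROM sentencess;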
If you are running MySQL 8.0, I would recommend a recursive common table expression for this. The idea is to iteratively walk through each message, splitting it into words along the way; all that is left to do then is aggregate. In earlier versions, you can emulate the same behavior with a table of numbers.
with recursive cte as (
-- anchor member: split off the first word of each message and keep the rest
select
substring(concat(sent, ' '), 1, locate(' ', sent)) word,
substring(concat(sent, ' '), locate(' ', sent) + 1) sent
from messages
union all
-- recursive member: keep splitting the remainder while it still contains a space
select
substring(sent, 1, locate(' ', sent)) word,
substring(sent, locate(' ', sent) + 1) sent
from cte
where locate(' ', sent) > 0
)
select row_number() over(order by count(*) desc, word) wid, word, count(*) freq
from cte
group by word
order by wid
When it comes to persisting the query results in a separate table, it is probably more complicated than you might think: you would need to insert, delete, or update the target table according to changes in the original table, and that cannot be done in a single statement in MySQL. In addition, keeping an "is current" flag in the original table creates a race condition, since changes may happen while the target table is being updated.
A simpler option would be to put the query in a view, so you get an always-up-to-date perspective on your data. For this, you can just wrap the above query in a create view statement, like:
create view words_view as < above query >;
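For example, assuming the view above, an always-fresh top-10 list is then just:
SELECT wid, word, freq FROM words_view ORDER BY wid LIMIT 10;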
If performance becomes an issue, you could also truncate and refill a words table on a regular basis:
truncate table words;
insert into words < above query >;
Warning: this is T-SQL, not MySQL.
-- 1. Create a function that splits a sentence into words and returns them as a table
-- 2. Insert all of that table's rows into your result table
-- 3. Calculate the frequency
----------[ The Split Function ]
CREATE FUNCTION dbo.udf_SplitString
(
@Sentence varchar(max)
, @Separator char(1)
)
RETURNS @WordList TABLE (Word varchar(50))
AS
BEGIN
SET @Separator = ISNULL(@Separator, ' ')
DECLARE @Word varchar(50)
SET @Sentence = LTRIM(@Sentence) + @Separator -- Make sure last word has a separator after. Also get rid of leading spaces.
WHILE (CHARINDEX(@Separator, @Sentence) > 0)
BEGIN
SET @Word = SUBSTRING(@Sentence, 1, CHARINDEX(@Separator, @Sentence) - 1)
INSERT INTO @WordList SELECT LTRIM(@Word)
-- Remove word added to the List from the sentence.
SET @Sentence = SUBSTRING(@Sentence, CHARINDEX(@Separator, @Sentence) + 1, LEN(@Sentence))
SET @Sentence = LTRIM(@Sentence)
END
RETURN
END
----------[ The Script ]
DECLARE @SentenceList TABLE (Sentence varchar(max))
INSERT INTO @SentenceList VALUES
('hello my name is alex')
, ('hey alin and alex I''m tom')
, ('hello alex my name is alin')
DECLARE @WordList TABLE (Word varchar(50))
INSERT INTO @WordList
SELECT
W.Word
FROM @SentenceList S
CROSS APPLY (
SELECT Word FROM dbo.udf_SplitString(S.Sentence, ' ')
) W
SELECT
ID = ROW_NUMBER() OVER(ORDER BY SUM(1) DESC, Word)
, Word
, Frequency = SUM(1)
FROM @WordList
GROUP BY Word
ORDER BY ID