SET cz.storage.parquet.inverted.index.similarity.bm25={"k1": 1.2, "b": 0.75}
to control how search results are scored and ranked.
Example:
Step 1: Data preparation
-- 1. CREATE TEST TABLE
CREATE TABLE bm25_demo_table (
    id INT,
    title STRING,
    content_en STRING,
    content_cn STRING,
    doc_type STRING
);
-- 2. Create index and insert data
CREATE INVERTED INDEX idx_content_cn_score ON TABLE bm25_demo_table(content_cn)
PROPERTIES("analyzer"="chinese", "support_score"="true")
;
CREATE INVERTED INDEX idx_content_en_score ON TABLE bm25_demo_table(content_en)
PROPERTIES("analyzer"="english", "support_score"="true")
;
INSERT INTO bm25_demo_table VALUES
-- Short documents: high keyword density
(1, 'AI简介', 'AI technology.', '人工智能技术。', 'short'),
(2, 'AI应用', 'AI AI applications in business.', '人工智能人工智能在商业中的应用。', 'short'),
-- Medium documents: medium keyword density
(3, 'AI发展历史', 'The history of AI technology spans decades. AI researchers developed machine learning algorithms. Modern AI systems use deep learning techniques.', '人工智能技术的发展历史跨越数十年。人工智能研究者开发了机器学习算法。现代人工智能系统使用深度学习技术。', 'medium'),
(4, 'AI实际应用', 'AI technology revolutionizes industries. Companies implement AI solutions for automation. AI chatbots improve customer service efficiency.', '人工智能技术革命性地改变了各行各业。公司实施人工智能解决方案进行自动化。人工智能聊天机器人提高客户服务效率。', 'medium'),
-- Long documents: low keyword density, but the keyword appears several times
(5, 'AI技术综述', 'Artificial intelligence represents one of the most transformative technologies of our time. AI systems can process vast amounts of data, recognize patterns, and make predictions with remarkable accuracy. The field of AI encompasses machine learning, deep learning, natural language processing, and computer vision. Modern AI applications span across healthcare, finance, transportation, and entertainment industries. As AI technology continues to evolve, researchers are exploring new frontiers in artificial general intelligence and quantum computing integration with AI systems.', '人工智能代表了我们时代最具变革性的技术之一。人工智能系统可以处理大量数据,识别模式,并以惊人的准确性进行预测。人工智能领域包括机器学习、深度学习、自然语言处理和计算机视觉。现代人工智能应用跨越医疗保健、金融、交通运输和娱乐行业。随着人工智能技术的不断发展,研究人员正在探索通用人工智能和量子计算与人工智能系统集成的新前沿。', 'long'),
-- Control document: does not contain the target keyword
(6, '区块链技术', 'Blockchain technology provides decentralized solutions. Cryptocurrency mining requires significant computational power. Smart contracts automate business processes.', '区块链技术提供去中心化解决方案。加密货币挖矿需要大量计算能力。智能合约自动化业务流程。', 'control'),
-- High-frequency keyword document
(7, 'AI密集讨论', 'AI AI AI is everywhere. AI development, AI research, AI implementation, AI optimization, AI performance, AI scalability, AI security, AI ethics, AI governance, AI regulation.', '人工智能人工智能人工智能无处不在。人工智能开发、人工智能研究、人工智能实施、人工智能优化、人工智能性能、人工智能可扩展性、人工智能安全、人工智能伦理、人工智能治理、人工智能监管。', 'high_freq'),
-- Low-frequency but exact match
(8, '精确AI定义', 'The definition of AI varies among experts.', '人工智能的定义在专家中各不相同。', 'precise')
;
Note: if the data is inserted before the index is created, run BUILD INDEX so that the index covers the existing data:
BUILD INDEX idx_content_cn_score ON bm25_demo_table;
BUILD INDEX idx_content_en_score ON bm25_demo_table;
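The test set contains several types of documents:
Short documents (2-20 characters): high keyword density
Medium documents (100-200 characters): medium keyword density
Long documents (500+ characters): low keyword density, but the keyword appears several times
High-frequency document: the keyword is repeated many times
Exact match: few keyword occurrences, but an exact match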
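To make the density labels concrete, the following sketch computes the number of "AI" occurrences per 100 characters for each matching document. It uses the len and ai_count values that the Step 2 queries report for content_en. The per-100-characters metric is only an illustrative choice, not something the engine computes.

```python
# (id, doc_type, len, ai_count) as reported by the Step 2 result tables.
ROWS = [
    (1, "short", 14, 1),
    (2, "short", 31, 2),
    (3, "medium", 145, 3),
    (4, "medium", 138, 3),
    (5, "long", 580, 5),
    (7, "high_freq", 174, 13),
    (8, "precise", 42, 1),
]


def density_per_100_chars(length, count):
    """Keyword occurrences per 100 characters (illustrative metric)."""
    return 100.0 * count / length


density = {doc_id: density_per_100_chars(length, count)
           for doc_id, _, length, count in ROWS}

for doc_id, doc_type, length, count in ROWS:
    print(f"{doc_id} {doc_type:<9} {density[doc_id]:.2f}")
```

By this measure the short and high-frequency documents are the densest, and the long document is the sparsest even though it contains the keyword five times.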
Step 2: Search verification
English search test
Case 1: default parameters (k1=1.2, b=0.75) - balanced configuration
Search keyword: "AI"
SELECT score(),
       id,
       title,
       doc_type,
       LENGTH(content_en) AS len,
       REGEXP_COUNT(content_en, 'AI') AS ai_count,
       SUBSTRING(content_en, 1, 50) AS preview
FROM bm25_demo_table
WHERE match_any(content_en, 'AI')
ORDER BY score() DESC, id
LIMIT 50;
| score() | id | title | doc_type | len | ai_count | preview |
|---|---|---|---|---|---|---|
| 0.16484952 | 7 | AI密集讨论 | high_freq | 174 | 13 | AI AI AI is everywhere. AI development, AI researc |
| 0.14495456 | 2 | AI应用 | short | 31 | 2 | AI AI applications in business. |
| 0.13709007 | 4 | AI实际应用 | medium | 138 | 3 | AI technology revolutionizes industries. Companies |
| 0.13152356 | 1 | AI简介 | short | 14 | 1 | AI technology. |
| 0.13141003 | 3 | AI发展历史 | medium | 145 | 3 | The history of AI technology spans decades. AI res |
| 0.1138232 | 8 | 精确AI定义 | precise | 42 | 1 | The definition of AI varies among experts. |
| 0.10628956 | 5 | AI技术综述 | long | 580 | 5 | Artificial intelligence represents one of the most |
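Key observations:
The high-frequency document ranks first, as expected.
Short documents gain a clear advantage from length normalization: id 2 ranks second with only two matches, and id 1 outranks id 3 with a single match.
The long document is penalized for its length and ranks last, even though it has more matches than every document except id 7.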
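The ranking in this table can be reproduced outside the engine. The sketch below rests on assumptions and does not describe the engine's internals: it assumes a Lucene-style BM25 variant, score = idf * tf / (tf + k1 * (1 - b + b * dl / avgdl)) with idf = ln(1 + (N - n + 0.5) / (n + 0.5)), and it takes the document length dl to be the number of whitespace-separated words in content_en, counted by hand for this data set. Under those assumptions the computed values agree with the score() column to about six decimal places.

```python
import math

K1, B = 1.2, 0.75  # the default parameters used in this case

# id -> (tf of "AI", ASSUMED length in whitespace-separated words).
# Doc 6 has no match (tf = 0) but still contributes to the average length.
DOCS = {
    1: (1, 2), 2: (2, 5), 3: (3, 20), 4: (3, 16),
    5: (5, 76), 6: (0, 16), 7: (13, 25), 8: (1, 7),
}


def bm25(tf, dl, avgdl, n_docs, n_match, k1=K1, b=B):
    """Assumed Lucene-style BM25 for a single query term."""
    idf = math.log(1 + (n_docs - n_match + 0.5) / (n_match + 0.5))
    length_norm = 1 - b + b * dl / avgdl
    return idf * tf / (tf + k1 * length_norm)


n_docs = len(DOCS)
n_match = sum(1 for tf, _ in DOCS.values() if tf > 0)
avgdl = sum(dl for _, dl in DOCS.values()) / n_docs

scores = {doc_id: bm25(tf, dl, avgdl, n_docs, n_match)
          for doc_id, (tf, dl) in DOCS.items() if tf > 0}

# Same ordering as the query: score descending, then id ascending.
ranking = sorted(scores, key=lambda d: (-scores[d], d))

for doc_id in ranking:
    print(doc_id, round(scores[doc_id], 6))
```

In this model the long document (76 words against an average of about 21) has the largest length normalization factor, which is why its five matches score lower than a single match in a two-word document.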
Case 2: b=0.0 - ignore document length
SET cz.storage.parquet.inverted.index.similarity.bm25={"k1": 1.2, "b": 0.0};
SELECT score(),
       id,
       title,
       doc_type,
       LENGTH(content_en) AS len,
       REGEXP_COUNT(content_en, 'AI') AS ai_count,
       SUBSTRING(content_en, 1, 50) AS preview
FROM bm25_demo_table
WHERE match_any(content_en, 'AI')
ORDER BY score() DESC, id
LIMIT 10;
| score() | id | title | doc_type | len | ai_count | preview |
|---|---|---|---|---|---|---|
| 0.16691414 | 7 | AI密集讨论 | high_freq | 174 | 13 | AI AI AI is everywhere. AI development, AI researc |
| 0.14703354 | 5 | AI技术综述 | long | 580 | 5 | Artificial intelligence represents one of the most |
| 0.13022971 | 3 | AI发展历史 | medium | 145 | 3 | The history of AI technology spans decades. AI res |
| 0.13022971 | 4 | AI实际应用 | medium | 138 | 3 | AI technology revolutionizes industries. Companies |
| 0.11395099 | 2 | AI应用 | short | 31 | 2 | AI AI applications in business. |
| 0.08287345 | 1 | AI简介 | short | 14 | 1 | AI technology. |
| 0.08287345 | 8 | 精确AI定义 | precise | 42 | 1 | The definition of AI varies among experts. |
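Key observations:
The long document benefits most: it jumps from last place to second.
Short documents lose their edge: the length advantage is removed entirely.
Ranking follows term frequency alone, so documents with the same ai_count receive identical scores.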
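With b = 0 the document length drops out of the score. The sketch below assumes a Lucene-style formula, score = idf * tf / (tf + k1) with idf = ln(1 + (N - n + 0.5) / (n + 0.5)). This form is inferred from the numbers in the table and is not confirmed engine behavior. Under that assumption the seven scores can be recomputed from the ai_count column alone, and the result shows how k1 saturates repeated terms.

```python
import math

K1 = 1.2
N_DOCS, N_MATCH = 8, 7  # 8 rows inserted, 7 of them contain "AI"

# ASSUMED idf form: ln(1 + (N - n + 0.5) / (n + 0.5)).
IDF = math.log(1 + (N_DOCS - N_MATCH + 0.5) / (N_MATCH + 0.5))


def score_without_length(tf, k1=K1):
    """Assumed BM25 score when b = 0: document length no longer appears."""
    return IDF * tf / (tf + k1)


# ai_count per document id, taken from the result table.
AI_COUNT = {7: 13, 5: 5, 3: 3, 4: 3, 2: 2, 1: 1, 8: 1}

scores = {doc_id: score_without_length(tf) for doc_id, tf in AI_COUNT.items()}

# Same ordering as the query: score descending, then id ascending.
for doc_id, s in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
    print(doc_id, round(s, 6))

# Saturation: 13 occurrences versus 1 occurrence.
saturation_ratio = scores[7] / scores[1]
```

Thirteen occurrences score only about twice as high as a single occurrence, because tf / (tf + k1) approaches 1 as tf grows. The ties between ids 3 and 4 and between ids 1 and 8 follow directly from their equal term frequencies.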
Chinese search test
Case 1: default parameters (k1=1.2, b=0.75) - balanced configuration
SET cz.storage.parquet.inverted.index.similarity.bm25={"k1": 1.2, "b": 0.75};
SELECT score(),
       id,
       title,
       doc_type,
       LENGTH(content_cn) AS len,
       REGEXP_COUNT(content_cn, '人工智能') AS ai_count,
       SUBSTRING(content_cn, 1, 50) AS preview
FROM bm25_demo_table
WHERE match_any(content_cn, '人工智能')
ORDER BY score() DESC, id
LIMIT 10;