Merge pull request #32 from whyuds/main
jerryjzhang authored Sep 14, 2024
2 parents cf80a8d + dde79f0 commit f208ea8
Showing 8 changed files with 1,000 additions and 0 deletions.
66 changes: 66 additions & 0 deletions puml/Correct阶段_GrammarCorrector_doCorrect.puml
@@ -0,0 +1,66 @@
@startuml
!define PROJECT_DIR ..
!define HEADLESS_DIR PROJECT_DIR\headless\chat\src\main\java\com\tencent\supersonic\headless
!define COMMON_DIR PROJECT_DIR\common\src\main\java\com\tencent\supersonic\common
!define SchemaCorrector_DIR HEADLESS_DIR\chat\corrector\SchemaCorrector.java
!define GrammarCorrector_DIR HEADLESS_DIR\chat\corrector\GrammarCorrector.java
!define SelectCorrector_DIR HEADLESS_DIR\chat\corrector\SelectCorrector.java
!define WhereCorrector_DIR HEADLESS_DIR\chat\corrector\WhereCorrector.java
!define GroupByCorrector_DIR HEADLESS_DIR\chat\corrector\GroupByCorrector.java
!define AggCorrector_DIR HEADLESS_DIR\chat\corrector\AggCorrector.java
!define HavingCorrector_DIR HEADLESS_DIR\chat\corrector\HavingCorrector.java

participant Actor
Actor -> GrammarCorrector : [[GrammarCorrector_DIR#doCorrect doCorrect]]
activate GrammarCorrector
GrammarCorrector -> SelectCorrector : [[SelectCorrector_DIR#doCorrect doCorrect]]
note right
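Corrects the SELECT clause of the semantic SQL, for example in the following cases:
Missing SELECT fields: the number of aggregated fields and selected fields do not match.
DETAIL mode with the wildcard *: when * is used in a DETAIL query, add the default metrics and dimensions.
ORDER BY fields in SELECT: a configuration item decides whether ORDER BY fields are added to SELECT.
GROUP BY fields in SELECT: ensure that every field in the GROUP BY clause appears in SELECT.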
end note
activate SelectCorrector
SelectCorrector --> GrammarCorrector
deactivate SelectCorrector
GrammarCorrector -> WhereCorrector : [[WhereCorrector_DIR#doCorrect doCorrect]]
note right
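1. Map enum values generated by the LLM, e.g. where 等级=2级 => 等级=2 (等级 means "level"; the display value 2级, "level 2", becomes 2)
2. Add filter conditions from QueryFilter (usually not triggered when the LLM is used instead of rules)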
end note
activate WhereCorrector
WhereCorrector --> GrammarCorrector
deactivate WhereCorrector
GrammarCorrector -> GroupByCorrector : [[GroupByCorrector_DIR#doCorrect doCorrect]]
note right
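Decides whether a GROUP BY needs to be added. Conditions:
* The query type must be {@link QueryType#METRIC}.
* The corrected SQL query (S2SQL) must not contain a `DISTINCT` clause.
* The select field list and the dimension list must not be empty.
* The select fields must not consist of only a single date dimension (e.g. {@link TimeDimensionEnum#DAY}).
* The corrected SQL query (S2SQL) must not already contain a `GROUP BY` clause.
* The optional configuration item (`s2.corrector.additional.information`) is present and set to `true`.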
end note
activate GroupByCorrector
GroupByCorrector --> GrammarCorrector
deactivate GroupByCorrector
GrammarCorrector -> AggCorrector : [[AggCorrector_DIR#doCorrect doCorrect]]
note right
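If a metric in the LLM-generated SQL lacks an aggregate function while a GROUP BY is present,
correct it using the aggregate function configured for that metric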
end note
activate AggCorrector
AggCorrector --> GrammarCorrector
deactivate AggCorrector
GrammarCorrector -> HavingCorrector : [[HavingCorrector_DIR#doCorrect doCorrect]]
note right
1. When a GROUP BY is present, move filter conditions on aggregated metrics into HAVING
2. Also add the aggregate functions used in HAVING to SELECT
end note
activate HavingCorrector
HavingCorrector --> GrammarCorrector
deactivate HavingCorrector
GrammarCorrector -> GrammarCorrector : [[GrammarCorrector_DIR#removeSameFieldFromSelect removeSameFieldFromSelect]]
return
@enduml
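The GroupByCorrector note in the diagram above lists the conditions under which a GROUP BY is added. A minimal Python sketch of that check follows. It assumes the conditions are combined with a logical AND, which the note does not state explicitly. `QueryState`, `needs_group_by`, and the field names are hypothetical and are not taken from the SuperSonic code base.

```python
import re
from dataclasses import dataclass, field


@dataclass
class QueryState:
    """Hypothetical container for the inputs the note mentions."""
    query_type: str                      # e.g. "METRIC" or "DETAIL"
    sql: str                             # the corrected S2SQL text
    select_fields: list
    dimensions: list
    date_dimensions: set = field(default_factory=lambda: {"day", "week", "month"})
    additional_information: bool = True  # stands in for s2.corrector.additional.information


def needs_group_by(state: QueryState) -> bool:
    """Return True only if every condition from the note holds (assumed AND)."""
    if state.query_type != "METRIC":
        return False
    # The SQL must not contain DISTINCT or an existing GROUP BY.
    if re.search(r"\bdistinct\b", state.sql, re.IGNORECASE):
        return False
    if re.search(r"\bgroup\s+by\b", state.sql, re.IGNORECASE):
        return False
    # Select fields and dimensions must both be non-empty.
    if not state.select_fields or not state.dimensions:
        return False
    # A select list that is only a single date dimension is excluded.
    if len(state.select_fields) == 1 and state.select_fields[0] in state.date_dimensions:
        return False
    return state.additional_information
```

A keyword search with regular expressions is only a stand-in here; a real corrector would more likely inspect a parsed SQL tree.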
51 changes: 51 additions & 0 deletions puml/Correct阶段_SchemaCorrector_doCorrect.puml
@@ -0,0 +1,51 @@
@startuml
!define PROJECT_DIR ..
!define HEADLESS_DIR PROJECT_DIR\headless\chat\src\main\java\com\tencent\supersonic\headless
!define COMMON_DIR PROJECT_DIR\common\src\main\java\com\tencent\supersonic\common
!define SchemaCorrector_DIR HEADLESS_DIR\chat\corrector\SchemaCorrector.java

participant Actor
Actor -> SchemaCorrector : [[SchemaCorrector_DIR#doCorrect doCorrect]]
activate SchemaCorrector
SchemaCorrector -> SchemaCorrector : [[SchemaCorrector_DIR#correctAggFunction correctAggFunction]]
note right
Corrects aggregate functions, e.g. select 最多(v) => select max(v) (最多 means "most");
the mapping is defined in AggregateEnum (WHERE, ORDER BY, HAVING, etc. are handled as well)
end note
activate SchemaCorrector
SchemaCorrector --> SchemaCorrector
deactivate SchemaCorrector
SchemaCorrector -> SchemaCorrector : [[SchemaCorrector_DIR#replaceAlias replaceAlias]]
note right
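Corrects aliases in the SQL: when a generated alias differs from the metric field it contains, the alias is removed;
when they are the same, nothing is changed (WHERE, ORDER BY, HAVING, etc. are handled as well)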
end note
activate SchemaCorrector
SchemaCorrector --> SchemaCorrector
deactivate SchemaCorrector
SchemaCorrector -> SchemaCorrector : [[SchemaCorrector_DIR#updateFieldNameByLinkingValue updateFieldNameByLinkingValue]]
note right
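Using the linking information, try to map a field name that the LLM got wrong to the correct one based on its value,
e.g. 等级=张三 => 创作者名=张三 (the field 等级, "level", is replaced by 创作者名, "creator name")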
end note
activate SchemaCorrector
SchemaCorrector --> SchemaCorrector
deactivate SchemaCorrector
SchemaCorrector -> SchemaCorrector : [[SchemaCorrector_DIR#updateFieldValueByLinkingValue updateFieldValueByLinkingValue]]
note right
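Using the linking information, try to map a field value that the LLM got wrong to the correct one based on the field name,
e.g. 杰伦 => 周杰伦
The linking information is not the same as techValue, so this does not duplicate the value mapping in WhereCorrector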
end note
activate SchemaCorrector
SchemaCorrector --> SchemaCorrector
deactivate SchemaCorrector
SchemaCorrector -> SchemaCorrector : [[SchemaCorrector_DIR#correctFieldName correctFieldName]]
note right
Use case unclear; presumably intended to map an alias to its fieldName
end note
activate SchemaCorrector
SchemaCorrector --> SchemaCorrector
deactivate SchemaCorrector
return
@enduml
83 changes: 83 additions & 0 deletions puml/Mapper阶段_EmbeddingMapper_doMap.puml
@@ -0,0 +1,83 @@
@startuml
!define PROJECT_DIR ..
!define HEADLESS_DIR PROJECT_DIR\headless\chat\src\main\java\com\tencent\supersonic\headless
!define COMMON_DIR PROJECT_DIR\common\src\main\java\com\tencent\supersonic\common
!define EmbeddingMapper_PATH HEADLESS_DIR\chat\mapper\EmbeddingMapper.java
!define EmbeddingMatchStrategy_PATH HEADLESS_DIR\chat\mapper\EmbeddingMatchStrategy.java
!define MetaEmbeddingService_PATH HEADLESS_DIR\chat\knowledge\MetaEmbeddingService.java
!define EmbeddingService_PATH COMMON_DIR\service\impl\EmbeddingServiceImpl.java
!define HanlpHelper_PATH HEADLESS_DIR\chat\knowledge\helper\HanlpHelper.java

participant Actor
Actor -> EmbeddingMapper: [[EmbeddingMapper_PATH#doMap doMap]]
activate EmbeddingMapper
EmbeddingMapper -> EmbeddingMapper : [[EmbeddingMatchStrategy_PATH#getTerms getTerms]]
note right
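In earlier versions, word segmentation was used to speed up the embedding stage
The current version appears to be controlled by the recall text size and step used in detect instead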
end note
EmbeddingMapper -> EmbeddingMatchStrategy : [[EmbeddingMatchStrategy_PATH#getMatches getMatches]]
activate EmbeddingMatchStrategy
EmbeddingMatchStrategy -> EmbeddingMatchStrategy : [[EmbeddingMatchStrategy_PATH#filterByDataSetId filterByDataSetId]]
note right: filter terms
EmbeddingMatchStrategy -> EmbeddingMatchStrategy: [[EmbeddingMatchStrategy_PATH#match match]]
activate EmbeddingMatchStrategy
EmbeddingMatchStrategy -> EmbeddingMatchStrategy : [[EmbeddingMatchStrategy_PATH#detect detect]]
note right
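Split the query text into smaller segments according to the configured text size and step,
then run detection on these segments in batches
EMBEDDING_MAPPER_TEXT_SIZE: text length used for vector recall
EMBEDDING_MAPPER_TEXT_STEP: step length between text segments for vector recall
Note: consider increasing SIZE so that key information is not truncated and its context lost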
end note
activate EmbeddingMatchStrategy
EmbeddingMatchStrategy -> EmbeddingMatchStrategy: [[EmbeddingMatchStrategy_PATH#detectByBatch detectByBatch]]
note right: EMBEDDING_MAPPER_BATCH: number of text requests per vector-recall batch
activate EmbeddingMatchStrategy
EmbeddingMatchStrategy -> EmbeddingMatchStrategy: [[EmbeddingMatchStrategy_PATH#detectByQueryTextsSub detectByQueryTextsSub]]
note right
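1. Build the parameters; the vector-recall similarity threshold and minimum threshold can be tuned
 EMBEDDING_MAPPER_THRESHOLD
 EMBEDDING_MAPPER_THRESHOLD_MIN
2. The number of recalled results can be tuned (per subtext, not overall)
 EMBEDDING_MAPPER_NUMBER
3. Run the recall
4. Build the result
Note: depending on model capability, the threshold can be lowered and the recall count increased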
end note
activate MetaEmbeddingService
EmbeddingMatchStrategy -> MetaEmbeddingService : [[MetaEmbeddingService_PATH#retrieveQuery retrieveQuery]]
note right: add the model id to the Filter
activate EmbeddingService
MetaEmbeddingService -> EmbeddingService: [[EmbeddingService_PATH#retrieveQuery retrieveQuery]]
note right: obtain the embeddingStore and embeddingModel
activate EmbeddingService
EmbeddingService -> EmbeddingService: [[EmbeddingService_PATH#retrieveSingleQuery retrieveSingleQuery]]
note right: build the request, add the filter, and run the query
activate EmbeddingService
activate EmbeddingModel
EmbeddingService -> EmbeddingModel: embed
EmbeddingModel --> EmbeddingService
deactivate EmbeddingModel
EmbeddingService -> EmbeddingService: [[EmbeddingService_PATH#createCombinedFilter createCombinedFilter]]
activate EmbeddingStore
EmbeddingService -> EmbeddingStore: search
"Embedding write flow" --> EmbeddingStore: <size 17>**[[EmbeddingService_PATH#createCombinedFilter embedding write flow]]**</size>
EmbeddingStore --> EmbeddingService
EmbeddingService-->MetaEmbeddingService
deactivate EmbeddingStore
deactivate EmbeddingService
deactivate EmbeddingService
deactivate EmbeddingService
MetaEmbeddingService --> EmbeddingMatchStrategy
deactivate MetaEmbeddingService
deactivate EmbeddingMatchStrategy
deactivate EmbeddingMatchStrategy
deactivate EmbeddingMatchStrategy
EmbeddingMatchStrategy --> EmbeddingMapper
deactivate EmbeddingMatchStrategy
EmbeddingMapper -> EmbeddingMapper : [[HanlpHelper_PATH#transLetterOriginal transLetterOriginal]]
EmbeddingMapper -> EmbeddingMapper : build the result

return
@enduml
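The detect and detectByBatch notes in the diagram above describe splitting the query text by a size and a step, then sending the segments in batches. The following Python sketch shows one way such a sliding window could work. `split_by_window` and `batches` are hypothetical helpers; their parameters merely mirror EMBEDDING_MAPPER_TEXT_SIZE, EMBEDDING_MAPPER_TEXT_STEP, and EMBEDDING_MAPPER_BATCH, and the actual segmentation in EmbeddingMatchStrategy may differ.

```python
def split_by_window(text: str, size: int, step: int) -> list:
    """Cut text into windows of at most `size` characters, advancing by `step`."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    segments = []
    for start in range(0, len(text), step):
        segments.append(text[start:start + size])
        # Stop once a window has reached the end of the text.
        if start + size >= len(text):
            break
    return segments


def batches(items: list, batch_size: int) -> list:
    """Group items into consecutive batches of at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```

With a small `size`, a phrase can be cut across two windows, which is the truncation risk the note warns about when it suggests increasing SIZE.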
86 changes: 86 additions & 0 deletions puml/Mapper阶段_KeywordMapper_doMap.puml
@@ -0,0 +1,86 @@
@startuml
!define PROJECT_DIR ..
!define HEADLESS_DIR PROJECT_DIR\headless\chat\src\main\java\com\tencent\supersonic\headless
!define COMMON_DIR PROJECT_DIR\common\src\main\java\com\tencent\supersonic\common
!define EmbeddingMapper_PATH HEADLESS_DIR\chat\mapper\EmbeddingMapper.java
!define KeywordMapper_PATH HEADLESS_DIR\chat\mapper\KeywordMapper.java

!define EmbeddingMatchStrategy_PATH HEADLESS_DIR\chat\mapper\EmbeddingMatchStrategy.java
!define HanlpDictMatchStrategy_PATH HEADLESS_DIR\chat\mapper\HanlpDictMatchStrategy.java
!define DatabaseMatchStrategy_PATH HEADLESS_DIR\chat\mapper\DatabaseMatchStrategy.java

!define MetaEmbeddingService_PATH HEADLESS_DIR\chat\knowledge\MetaEmbeddingService.java
!define EmbeddingService_PATH COMMON_DIR\service\impl\EmbeddingServiceImpl.java
!define HanlpHelper_PATH HEADLESS_DIR\chat\knowledge\helper\HanlpHelper.java
!define KnowledgeBaseService_PATH HEADLESS_DIR\chat\knowledge\KnowledgeBaseService.java
!define SearchService_PATH HEADLESS_DIR\chat\knowledge\SearchService.java

participant Actor
Actor -> KeywordMapper : [[KeywordMapper_PATH#doMap doMap]]
activate KeywordMapper
KeywordMapper -> KeywordMapper : [[HanlpHelper_PATH#getTerms getTerms]]
note right
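1. Viterbi segmenter
2. Supports index mode, custom dictionaries, term offsets, disabling named-entity recognition, etc.
3. Filters terms by the relevant parts of speech within all datasets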
end note
KeywordMapper -> HanlpDictMatchStrategy : [[HanlpDictMatchStrategy_PATH#getMatches getMatches]]
activate HanlpDictMatchStrategy
HanlpDictMatchStrategy -> HanlpDictMatchStrategy : [[HanlpDictMatchStrategy_PATH#filterByDataSetId filterByDataSetId]]
note right: filter terms by dataset
HanlpDictMatchStrategy -> HanlpDictMatchStrategy : [[HanlpDictMatchStrategy_PATH#match match]]
activate HanlpDictMatchStrategy
HanlpDictMatchStrategy -> HanlpDictMatchStrategy : [[HanlpDictMatchStrategy_PATH#detect detect]]
note right
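Iterate over every possible substring of the text
and detect matches using the provided terms and offsets.
Each matching segment is recorded, and the full result list is returned at the end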
end note
activate HanlpDictMatchStrategy
HanlpDictMatchStrategy -> HanlpDictMatchStrategy: [[HanlpDictMatchStrategy_PATH#detectByStep detectByStep]]
note right: MAPPER_DETECTION_MAX_SIZE: number of prefix/suffix match results returned per detection
HanlpDictMatchStrategy -> KnowledgeBaseService: [[KnowledgeBaseService_PATH#prefixSearch prefixSearch]]
activate KnowledgeBaseService
KnowledgeBaseService -> SearchService: [[SearchService_PATH#prefixSearch prefixSearch]]
activate SearchService
SearchService -> SearchService: [[SearchService_PATH#search search]]
note right
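Searches the dictionary trie (BinTrie) for all entries matching the given key.
The input key is lowercased for case-insensitive matching, and an empty TreeSet is initialized
to hold the matches. A depth-first traversal follows the branch node for each character of the key, level by level,
until all matching entries are found and added to the result set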
end note
SearchService --> KnowledgeBaseService
KnowledgeBaseService --> HanlpDictMatchStrategy
HanlpDictMatchStrategy -> KnowledgeBaseService: [[KnowledgeBaseService_PATH#suffixSearch suffixSearch]]
KnowledgeBaseService -> SearchService: [[SearchService_PATH#suffixSearch suffixSearch]]
SearchService -> SearchService: [[SearchService_PATH#search search]]

SearchService --> KnowledgeBaseService
deactivate SearchService
KnowledgeBaseService --> HanlpDictMatchStrategy
deactivate KnowledgeBaseService
HanlpDictMatchStrategy -> HanlpDictMatchStrategy: merge prefix and suffix match results
HanlpDictMatchStrategy -> HanlpDictMatchStrategy: [[HanlpDictMatchStrategy_PATH#getThresholdMatch getThresholdMatch]]
note right
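Compute the thresholds
MAPPER_NAME_THRESHOLD: text similarity threshold for metric and dimension names
MAPPER_NAME_THRESHOLD_MIN: minimum text similarity threshold for metric and dimension names
MAPPER_VALUE_THRESHOLD: text similarity threshold for dimension values
MAPPER_VALUE_THRESHOLD_MIN: minimum text similarity threshold for dimension values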
end note
HanlpDictMatchStrategy -> HanlpDictMatchStrategy: filter by similarity
HanlpDictMatchStrategy --> KeywordMapper
deactivate HanlpDictMatchStrategy
deactivate HanlpDictMatchStrategy
deactivate HanlpDictMatchStrategy
KeywordMapper -> KeywordMapper : [[KeywordMapper_PATH#convertHanlpMapResultToMapInfo convertHanlpMapResultToMapInfo]]
note right: convert the NLP match results into map results and attach the related element metadata
activate KeywordMapper
KeywordMapper -> KeywordMapper : transLetterOriginal
KeywordMapper -> DatabaseMatchStrategy : [[DatabaseMatchStrategy_PATH#getMatches getMatches]]
activate DatabaseMatchStrategy

DatabaseMatchStrategy --> KeywordMapper

@enduml
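The SearchService note in the diagram above describes a case-insensitive prefix lookup in a trie that collects matches into a sorted set. The Python sketch below illustrates the idea with a plain dictionary-based trie. `Node`, `insert`, and `prefix_search` are hypothetical names; this is not the API of HanLP's BinTrie or of SearchService, and the `limit` parameter only loosely corresponds to MAPPER_DETECTION_MAX_SIZE.

```python
class Node:
    """One trie node: child nodes keyed by character, plus stored entries."""

    def __init__(self):
        self.children = {}
        self.values = set()


def insert(root: Node, key: str, value: str) -> None:
    """Store `value` under the lowercased `key`."""
    node = root
    for ch in key.lower():
        node = node.children.setdefault(ch, Node())
    node.values.add(value)


def prefix_search(root: Node, key: str, limit: int) -> list:
    """Return up to `limit` entries whose key starts with `key`, sorted."""
    node = root
    # Walk down one level per character of the lowercased key.
    for ch in key.lower():
        node = node.children.get(ch)
        if node is None:
            return []
    # Depth-first traversal of the subtree below the prefix.
    results = set()
    stack = [node]
    while stack:
        current = stack.pop()
        results.update(current.values)
        stack.extend(current.children.values())
    # A sorted list stands in for the TreeSet mentioned in the note.
    return sorted(results)[:limit]
```

A suffix search can reuse the same structure by inserting and querying reversed keys, although the diagram does not say how suffixSearch is implemented.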
99 changes: 99 additions & 0 deletions puml/Parse阶段_LLMSqlParser_parse.puml
@@ -0,0 +1,99 @@
@startuml
!define PROJECT_DIR ..
!define HEADLESS_DIR PROJECT_DIR\headless\chat\src\main\java\com\tencent\supersonic\headless
!define COMMON_DIR PROJECT_DIR\common\src\main\java\com\tencent\supersonic\common
!define LLMSqlParser_PATH HEADLESS_DIR\chat\parser\llm\LLMSqlParser.java
!define LLMRequestService_PATH HEADLESS_DIR\chat\parser\llm\LLMRequestService.java
!define OnePassSCSqlGenStrategy_PATH HEADLESS_DIR\chat\parser\llm\OnePassSCSqlGenStrategy.java
!define PromptHelper_PATH HEADLESS_DIR\chat\parser\llm\PromptHelper.java
!define LLMResponseService_PATH HEADLESS_DIR\chat\parser\llm\LLMResponseService.java
!define ExemplarServiceImpl_PATH COMMON_DIR\service\impl\ExemplarServiceImpl.java
!define SqlValidHelper_PATH COMMON_DIR\jsqlparser\SqlValidHelper.java
!define DataSetResolver_PATH HEADLESS_DIR\chat\parser\llm\HeuristicDataSetResolver.java

participant Actor
Actor -> LLMSqlParser : parse
activate LLMSqlParser
LLMSqlParser -> LLMSqlParser : [[LLMRequestService_PATH#isSkip isSkip]]
LLMSqlParser -> LLMSqlParser : [[LLMSqlParser_PATH#tryParse tryParse]]
LLMSqlParser -> LLMRequestService : [[LLMRequestService_PATH#getDataSetId getDataSetId]]
note right
Pick the best-matching dataset:
first by the number of elements that can be mapped,
then by similarity
end note
activate LLMRequestService
LLMRequestService -> DataSetResolver : [[DataSetResolver_PATH#resolve resolve]]
activate DataSetResolver
DataSetResolver -> DataSetResolver : [[DataSetResolver_PATH#selectDataSetBySchemaElementMatchScore selectDataSetBySchemaElementMatchScore]]
activate DataSetResolver
DataSetResolver -> DataSetResolver :[[DataSetResolver_PATH#getDataSetTypeMap getDataSetTypeMap]]
note right: aggregate per dataset and find the maximum similarity
activate DataSetResolver
DataSetResolver --> LLMRequestService
deactivate DataSetResolver
deactivate DataSetResolver
deactivate DataSetResolver
LLMRequestService --> LLMSqlParser
deactivate LLMRequestService
activate LLMSqlParser

LLMSqlParser -> LLMSqlParser : [[LLMSqlParser_PATH#getRecallMaxRetries getRecallMaxRetries]]
LLMSqlParser -> LLMRequestService : [[LLMRequestService_PATH#getLlmReq getLlmReq]]
note right: build the llmReq, which is later used to generate the prompt
activate LLMRequestService
LLMRequestService --> LLMSqlParser
LLMSqlParser -> LLMRequestService : [[LLMRequestService_PATH#runText2SQL runText2SQL]]
note left: 3 retries by default
LLMRequestService -> SqlGenStrategy: [[OnePassSCSqlGenStrategy_PATH#generate generate]]
activate SqlGenStrategy
SqlGenStrategy -> PromptHelper: [[PromptHelper_PATH#getFewShotExemplars getFewShotExemplars]]
note right
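1. Read the configuration; if the few-shot exemplars obtained earlier are insufficient, recall more
2. If the concurrency is > 1, each concurrent call samples its few-shot exemplars at random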
end note
activate PromptHelper
PromptHelper -> ExemplarService: [[PromptHelper_PATH#recallExemplars recallExemplars]]
activate ExemplarService
ExemplarService --> PromptHelper
deactivate ExemplarService
PromptHelper --> SqlGenStrategy
deactivate PromptHelper
SqlGenStrategy -> SqlGenStrategy: [[OnePassSCSqlGenStrategy_PATH#generatePrompt generatePrompt]]
note right
Concatenate the schema, side info (current date, etc.), and few-shot exemplars
If the user has defined a custom prompt template, apply that template
end note
SqlGenStrategy -> SqlGenStrategy: [[OnePassSCSqlGenStrategy_PATH#getChatLanguageModel getChatLanguageModel]]
note right
Get the configured model; 奇智 (Qizhi) is currently integrated
end note
SqlGenStrategy -> ChatLanguageModel: generate
note right: chat model
activate ChatLanguageModel
ChatLanguageModel --> SqlGenStrategy
deactivate ChatLanguageModel



SqlGenStrategy --> LLMRequestService
deactivate SqlGenStrategy
LLMRequestService --> LLMSqlParser
deactivate LLMRequestService

LLMSqlParser -> LLMResponseService : [[LLMResponseService_PATH#getDeduplicationSqlResp getDeduplicationSqlResp]]
note right
Deduplicate the SQL produced by the concurrent calls; of limited value
Note the SQL validity check performed here
end note
activate LLMResponseService
LLMResponseService -> SqlValidHelper: [[SqlValidHelper_PATH#isValidSQL isValidSQL]]
note right: validity check
activate SqlValidHelper
SqlValidHelper --> LLMResponseService
deactivate SqlValidHelper
LLMResponseService --> LLMSqlParser
deactivate LLMResponseService

LLMSqlParser -> LLMSqlParser : [[LLMResponseService_PATH#addParseInfo addParseInfo]]
@enduml
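The getDeduplicationSqlResp step in the diagram above deduplicates the SQL returned by concurrent generations and checks validity. The Python sketch below shows one plausible shape for that step: normalize whitespace, drop invalid candidates, and count duplicates. `deduplicate` and `looks_like_select` are hypothetical; the real validity check is SqlValidHelper#isValidSQL, whose behavior is not reproduced here, and the ordering and counting are assumptions.

```python
from collections import Counter


def looks_like_select(sql: str) -> bool:
    """Placeholder validity check (assumption): accept only SELECT statements."""
    return sql.strip().lower().startswith("select")


def deduplicate(candidates: list, is_valid=looks_like_select) -> list:
    """Return (sql, count) pairs for valid candidates, most frequent first."""
    # Collapse runs of whitespace so trivially different strings compare equal.
    normalized = [" ".join(candidate.split()) for candidate in candidates]
    valid = [sql for sql in normalized if is_valid(sql)]
    return Counter(valid).most_common()
```

The counts could serve as a rough self-consistency weight when several generations agree on the same SQL.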