Spring Data Elasticsearch searchSimilar: issues with related-recommendation search

Date: 2019-10-15 15:38 | Category: Spring, JAVA

> `Page<T> searchSimilar(T entity, String[] fields, Pageable pageable);`

As the name suggests, this method finds content related to the `entity` argument. In practice, though, I ran into two problems:

1. It often returns no related content.
2. The paging parameter has no effect.

I first searched online to see whether anyone had hit the same problems. Strangely, nobody seemed to have mentioned them. Do people rarely use `Spring Data Elasticsearch` and build directly on `Elasticsearch` instead?

With no articles to go on, the only option was to read the source:

```java
@Override
public Page<T> searchSimilar(T entity, String[] fields, Pageable pageable) {
    Assert.notNull(entity, "Cannot search similar records for 'null'.");
    Assert.notNull(pageable, "'pageable' cannot be 'null'");
    MoreLikeThisQuery query = new MoreLikeThisQuery();
    query.setId(stringIdRepresentation(extractIdFromBean(entity)));
    query.setPageable(pageable);
    if (fields != null) {
        query.addFields(fields);
    }
    return elasticsearchOperations.moreLikeThis(query, getEntityClass());
}
```

That is the implementation of the method, but nothing in it explains the problems above, so we have to step into `moreLikeThis`:

```java
@Override
public <T> Page<T> moreLikeThis(MoreLikeThisQuery query, Class<T> clazz) {

    ElasticsearchPersistentEntity persistentEntity = getPersistentEntityFor(clazz);
    String indexName = !StringUtils.isEmpty(query.getIndexName()) ? query.getIndexName() : persistentEntity.getIndexName();
    String type = !StringUtils.isEmpty(query.getType()) ?
            query.getType() : persistentEntity.getIndexType();

    Assert.notNull(indexName, "No 'indexName' defined for MoreLikeThisQuery");
    Assert.notNull(type, "No 'type' defined for MoreLikeThisQuery");
    Assert.notNull(query.getId(), "No document id defined for MoreLikeThisQuery");

    MoreLikeThisQueryBuilder moreLikeThisQueryBuilder = moreLikeThisQuery(
            toArray(new MoreLikeThisQueryBuilder.Item(indexName, type, query.getId())));

    if (query.getMinTermFreq() != null) {
        moreLikeThisQueryBuilder.minTermFreq(query.getMinTermFreq());
    }
    if (query.getMaxQueryTerms() != null) {
        moreLikeThisQueryBuilder.maxQueryTerms(query.getMaxQueryTerms());
    }
    if (!isEmpty(query.getStopWords())) {
        moreLikeThisQueryBuilder.stopWords(toArray(query.getStopWords()));
    }
    if (query.getMinDocFreq() != null) {
        moreLikeThisQueryBuilder.minDocFreq(query.getMinDocFreq());
    }
    if (query.getMaxDocFreq() != null) {
        moreLikeThisQueryBuilder.maxDocFreq(query.getMaxDocFreq());
    }
    if (query.getMinWordLen() != null) {
        moreLikeThisQueryBuilder.minWordLength(query.getMinWordLen());
    }
    if (query.getMaxWordLen() != null) {
        moreLikeThisQueryBuilder.maxWordLength(query.getMaxWordLen());
    }
    if (query.getBoostTerms() != null) {
        moreLikeThisQueryBuilder.boostTerms(query.getBoostTerms());
    }

    return queryForPage(new NativeSearchQueryBuilder().withQuery(moreLikeThisQueryBuilder).build(), clazz);
}
```

The code is clear enough: it copies the settings of the `MoreLikeThisQuery` onto a `MoreLikeThisQueryBuilder`. One thing puzzled me, though. In `searchSimilar` we saw the following:

```java
query.setPageable(pageable); //1
if (fields != null) {
    query.addFields(fields); //2
}
```

1. Sets the paging parameter
2. 
Sets the `fields` used for the related search

Neither value is passed on inside `moreLikeThis`: the fields never reach the `MoreLikeThisQueryBuilder`, and the pageable never reaches the search query. The latter is why paging has no effect. The fix is to change this fragment of the source:

```java
return queryForPage(new NativeSearchQueryBuilder().withQuery(moreLikeThisQueryBuilder).build(), clazz);
```

to:

```java
return queryForPage(new NativeSearchQueryBuilder().withPageable(query.getPageable()).withQuery(moreLikeThisQueryBuilder).build(), clazz);
```

Inside `moreLikeThis` there is no `pageable` variable, so the value is read back from the `query` object, where `searchSimilar` stored it. With the paging parameter set this way, paging takes effect.

Another puzzle: `fields` does not seem to be used either. The `MoreLikeThisQueryBuilder` is created through `moreLikeThisQuery`, whose source is:

```java
/**
 * A more like this query that finds documents that are "like" the provided documents
 * which is checked against the "_all" field.
 * @param likeItems the documents to use when generating the 'More Like This' query.
 */
public static MoreLikeThisQueryBuilder moreLikeThisQuery(Item[] likeItems) {
    return moreLikeThisQuery(null, null, likeItems);
}
```

According to the Javadoc, this `more like this` query is checked against all fields (`_all`). At this point I was genuinely confused: what is the point of the `fields` argument passed in earlier?

Looking further, there are several more overloads:

```java
/**
 * A more like this query that finds documents that are "like" the provided texts or documents
 * which is checked against the fields the query is constructed with.
 *
 * @param fields the field names that will be used when generating the 'More Like This' query.
 * @param likeTexts the text to use when generating the 'More Like This' query.
 * @param likeItems the documents to use when generating the 'More Like This' query.
 */
public static MoreLikeThisQueryBuilder moreLikeThisQuery(String[] fields, String[] likeTexts, Item[] likeItems) {
    return new MoreLikeThisQueryBuilder(fields, likeTexts, likeItems);
}

/**
 * A more like this query that finds documents that are "like" the provided texts or documents
 * which is checked against the "_all" field.
 * @param likeTexts the text to use when generating the 'More Like This' query.
 * @param likeItems the documents to use when generating the 'More Like This' query.
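 */
public static MoreLikeThisQueryBuilder moreLikeThisQuery(String[] likeTexts, Item[] likeItems) {
    return moreLikeThisQuery(null, likeTexts, likeItems);
}

/**
 * A more like this query that finds documents that are "like" the provided texts
 * which is checked against the "_all" field.
 * @param likeTexts the text to use when generating the 'More Like This' query.
 */
public static MoreLikeThisQueryBuilder moreLikeThisQuery(String[] likeTexts) {
    return moreLikeThisQuery(null, likeTexts, null);
}
```

The first of these overloads accepts `fields`. Further down I show how to work around the bug without modifying the source.

That said, of the two problems raised at the start, `no search results` is not caused by the `fields` issue. Going back to `searchSimilar`: the `MoreLikeThisQuery` object only has its `ID`, `Pageable` and `Fields` set. Let's look at the official `Elasticsearch` parameter reference:

```
fields: the fields to match against; defaults to the _all field if omitted
like_text: the text to match
percent_terms_to_match: the percentage of terms that must match; default 0.3
min_term_freq: minimum number of times a term must occur in the document; terms below this are ignored; default 2
max_query_terms: maximum number of query terms allowed in one query; default 25
stop_words: stop words, which are ignored when matching
min_doc_freq: minimum number of documents a term must occur in; terms below this are ignored; default 5
max_doc_freq: maximum number of documents a term may occur in; terms above this are ignored; default unbounded
min_word_len: minimum term length; default 0
max_word_len: maximum term length; default unbounded
boost_terms: term boost; default 1
boost: query boost; default 1
analyzer: the analyzer to use; defaults to the analyzer configured for the field
```

The key is `min_term_freq`. The official default really is 2, and `searchSimilar` does not change it. When I set it to 1, the results were clearly much better. The reason is that a term must occur at least this many times in the source document or it is ignored. With the default of 2, hardly any term in a short document qualifies, which leaves almost nothing to build the query from. That is why related-recommendation searches came back empty most of the time.

Finally, how can both problems be fixed without modifying the source?

```java
public Page<Torrent> searchSimilar(Torrent torrent, String[] fields, Pageable pageable) {
    // Use the overload that accepts fields; Item is MoreLikeThisQueryBuilder.Item.
    MoreLikeThisQueryBuilder moreLikeThisQueryBuilder = QueryBuilders.moreLikeThisQuery(fields,
            new String[] {torrent.getFileName()},
            new Item[] {new Item("dodder", "torrent", torrent.getInfoHash())});
    // Lower the threshold from the default of 2 so that terms occurring once are kept.
    moreLikeThisQueryBuilder.minTermFreq(1);

    // Pass the pageable to the search query so that paging takes effect.
    return elasticsearchTemplate.queryForPage(new NativeSearchQueryBuilder()
            .withPageable(pageable)
            .withQuery(moreLikeThisQueryBuilder).build(), Torrent.class);
}
```

This simply follows the official implementation. To construct the `MoreLikeThisQueryBuilder` I call the `public static MoreLikeThisQueryBuilder moreLikeThisQuery(String[] fields, String[] likeTexts, Item[]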
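likeItems)` static factory method, which takes the `fields` argument. I then set `min_term_freq` to 1 and pass the paging parameter when running the query. This version leaves out many of the parameter settings in the official implementation. The official one is a generic method: `indexName`, `type` and `_ID`, for example, are all read from the `Entity` class, whereas here I hard-coded them. If you have several indexes, I would still recommend modifying the official source, which is more general.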
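To make the effect of `min_term_freq` concrete, here is a minimal plain-Java sketch. It is not Elasticsearch's implementation: `MinTermFreqDemo`, `selectTerms`, the naive tokenizer and the sample file name are all hypothetical, and the real query also applies an analyzer, `min_doc_freq`, `max_query_terms` and the other settings listed above. It only mimics the one rule discussed here, namely that terms occurring fewer than `min_term_freq` times in the source text are dropped.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Illustration only: a simplified stand-in for More Like This term selection. */
public class MinTermFreqDemo {

    /** Returns the terms of {@code text} that occur at least {@code minTermFreq} times. */
    public static Set<String> selectTerms(String text, int minTermFreq) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Naive tokenizer (assumption): lower-case, split on anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        // Keep only the terms that reach the frequency threshold.
        Set<String> selected = new LinkedHashSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minTermFreq) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        String fileName = "Ubuntu 18.04 desktop amd64 iso"; // hypothetical sample text
        System.out.println(selectTerms(fileName, 2)); // prints []
        System.out.println(selectTerms(fileName, 1)); // prints [ubuntu, 18, 04, desktop, amd64, iso]
    }
}
```

With a short, title-like text no term repeats, so a threshold of 2 leaves nothing to query with, while a threshold of 1 keeps all six terms.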