admin

Spring Data Elasticsearch searchSimilar: problems with related-content search
15
2019/10

Page<T> searchSimilar(T entity, String[] fields, Pageable pageable);
As the name suggests, it finds content related to the entity passed in.

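However, I ran into two problems when using it:

  1. It often returns no related content
  2. The paging parameter has no effect

My first step was to search online for anyone who had hit the same problems. To my surprise, nobody seemed to have mentioned them. Do few people use Spring Data Elasticsearch, with most developing directly against Elasticsearch instead?

With no articles to go on, the only option was to read the source code: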

@Override
public Page<T> searchSimilar(T entity, String[] fields, Pageable pageable) {

    Assert.notNull(entity, "Cannot search similar records for 'null'.");
    Assert.notNull(pageable, "'pageable' cannot be 'null'");

    MoreLikeThisQuery query = new MoreLikeThisQuery();
    query.setId(stringIdRepresentation(extractIdFromBean(entity)));
    query.setPageable(pageable);
    if (fields != null) {
        query.addFields(fields);
    }

    return elasticsearchOperations.moreLikeThis(query, getEntityClass());
}

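I started with the implementation of this method, but nothing in the code above seems to explain either problem, so we have to step into the moreLikeThis method: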

@Override
public <T> Page<T> moreLikeThis(MoreLikeThisQuery query, Class<T> clazz) {

    ElasticsearchPersistentEntity persistentEntity = getPersistentEntityFor(clazz);
    String indexName = !StringUtils.isEmpty(query.getIndexName()) ? query.getIndexName()
            : persistentEntity.getIndexName();
    String type = !StringUtils.isEmpty(query.getType()) ? query.getType() : persistentEntity.getIndexType();

    Assert.notNull(indexName, "No 'indexName' defined for MoreLikeThisQuery");
    Assert.notNull(type, "No 'type' defined for MoreLikeThisQuery");
    Assert.notNull(query.getId(), "No document id defined for MoreLikeThisQuery");

    MoreLikeThisQueryBuilder moreLikeThisQueryBuilder = moreLikeThisQuery(
            toArray(new MoreLikeThisQueryBuilder.Item(indexName, type, query.getId())));

    if (query.getMinTermFreq() != null) {
        moreLikeThisQueryBuilder.minTermFreq(query.getMinTermFreq());
    }
    if (query.getMaxQueryTerms() != null) {
        moreLikeThisQueryBuilder.maxQueryTerms(query.getMaxQueryTerms());
    }
    if (!isEmpty(query.getStopWords())) {
        moreLikeThisQueryBuilder.stopWords(toArray(query.getStopWords()));
    }
    if (query.getMinDocFreq() != null) {
        moreLikeThisQueryBuilder.minDocFreq(query.getMinDocFreq());
    }
    if (query.getMaxDocFreq() != null) {
        moreLikeThisQueryBuilder.maxDocFreq(query.getMaxDocFreq());
    }
    if (query.getMinWordLen() != null) {
        moreLikeThisQueryBuilder.minWordLength(query.getMinWordLen());
    }
    if (query.getMaxWordLen() != null) {
        moreLikeThisQueryBuilder.maxWordLength(query.getMaxWordLen());
    }
    if (query.getBoostTerms() != null) {
        moreLikeThisQueryBuilder.boostTerms(query.getBoostTerms());
    }

    return queryForPage(new NativeSearchQueryBuilder().withQuery(moreLikeThisQueryBuilder).build(), clazz);
}

The code is clear: it copies the parameters of the MoreLikeThisQuery onto a MoreLikeThisQueryBuilder. One thing puzzled me, though. In searchSimilar we saw the following:

query.setPageable(pageable);       //1
if (fields != null) {
    query.addFields(fields);   //2
}
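  1. Sets the paging parameter
  2. Sets the fields for the similarity search

Neither value is passed on in moreLikeThis, neither to the MoreLikeThisQueryBuilder nor to the search query built from it. That is why paging has no effect. The fix is to change the following fragment of the source code: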
return queryForPage(new NativeSearchQueryBuilder().withQuery(moreLikeThisQueryBuilder).build(), clazz);

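to:

return queryForPage(new NativeSearchQueryBuilder().withPageable(query.getPageable()).withQuery(moreLikeThisQueryBuilder).build(), clazz);

With the paging parameter set on the search query, paging takes effect. Note that moreLikeThis has no pageable variable of its own; the value has to be read back from the MoreLikeThisQuery, where searchSimilar stored it via setPageable.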
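
To make the effect of the missing withPageable call concrete, here is a toy model in plain Java with no Spring or Elasticsearch dependency. PagingDemo and all of its methods are hypothetical names, and the fallback window of 10 hits starting at offset 0 is an assumption based on the usual Elasticsearch default; it is a sketch of the behavior, not the library's code.

```java
// Toy model only: PagingDemo and its methods are hypothetical, not library code.
class PagingDemo {
    static final int DEFAULT_SIZE = 10; // assumed fallback: first 10 hits

    // Returns {from, size}, the window a search request would ask for.
    static int[] window(Integer page, Integer size) {
        if (page == null || size == null) {
            return new int[] {0, DEFAULT_SIZE}; // no paging forwarded
        }
        return new int[] {page * size, size}; // zero-based page index
    }

    // Mirrors the original moreLikeThis: the caller's paging is dropped.
    static int[] buggyWindow(int page, int size) {
        return window(null, null);
    }

    // Mirrors the patched version: paging reaches the search request.
    static int[] fixedWindow(int page, int size) {
        return window(page, size);
    }

    public static void main(String[] args) {
        int[] buggy = buggyWindow(2, 20);
        int[] fixed = fixedWindow(2, 20);
        System.out.println("buggy: from=" + buggy[0] + ", size=" + buggy[1]);
        System.out.println("fixed: from=" + fixed[0] + ", size=" + fixed[1]);
    }
}
```

Whatever page the caller asks for, the buggy path always requests the same default window, which matches the symptom of a paging parameter that appears to do nothing.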

One more puzzle: fields does not seem to be used either. The MoreLikeThisQueryBuilder is created with the moreLikeThisQuery method, whose source is:

/**
* A more like this query that finds documents that are "like" the provided documents
* which is checked against the "_all" field.
* @param likeItems the documents to use when generating the 'More Like This' query.
*/
public static MoreLikeThisQueryBuilder moreLikeThisQuery(Item[] likeItems) {
    return moreLikeThisQuery(null, null, likeItems);
}

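According to the Javadoc, this more-like-this query is checked against the _all field, that is, every field. At this point I was truly baffled: what is the point of the fields argument passed in earlier?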

Looking further, there are several more overloads:

/**
 * A more like this query that finds documents that are "like" the provided texts or documents
 * which is checked against the fields the query is constructed with.
 *
 * @param fields the field names that will be used when generating the 'More Like This' query.
 * @param likeTexts the text to use when generating the 'More Like This' query.
 * @param likeItems the documents to use when generating the 'More Like This' query.
 */
public static MoreLikeThisQueryBuilder moreLikeThisQuery(String[] fields, String[] likeTexts, Item[] likeItems) {
    return new MoreLikeThisQueryBuilder(fields, likeTexts, likeItems);
}

/**
 * A more like this query that finds documents that are "like" the provided texts or documents
 * which is checked against the "_all" field.
 * @param likeTexts the text to use when generating the 'More Like This' query.
 * @param likeItems the documents to use when generating the 'More Like This' query.
 */
public static MoreLikeThisQueryBuilder moreLikeThisQuery(String[] likeTexts, Item[] likeItems) {
    return moreLikeThisQuery(null, likeTexts, likeItems);
}

/**
 * A more like this query that finds documents that are "like" the provided texts
 * which is checked against the "_all" field.
 * @param likeTexts the text to use when generating the 'More Like This' query.
 */
public static MoreLikeThisQueryBuilder moreLikeThisQuery(String[] likeTexts) {
    return moreLikeThisQuery(null, likeTexts, null);
}

The first method above does accept fields. Later on I explain how to work around the bug above without modifying the source code.
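
The delegation between these overloads can be sketched in plain Java. FieldsDemo and its methods are hypothetical stand-ins, not Elasticsearch code; the fallback to _all when no fields are given is taken from the Javadoc quoted above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model only: FieldsDemo is hypothetical and does not call Elasticsearch.
class FieldsDemo {
    // Stand-in for the three-argument factory: no fields means "_all".
    static List<String> targetFields(String[] fields) {
        if (fields == null || fields.length == 0) {
            return Collections.singletonList("_all");
        }
        return Arrays.asList(fields);
    }

    // Stand-in for moreLikeThisQuery(Item[]): it delegates with fields = null.
    static List<String> targetFieldsForItemsOnly() {
        return targetFields(null);
    }

    public static void main(String[] args) {
        System.out.println(targetFieldsForItemsOnly());
        System.out.println(targetFields(new String[] {"fileName"}));
    }
}
```

Because searchSimilar ends up on the items-only path, whatever fields the caller supplied never influence which fields are searched.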

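That said, of the two problems raised at the start, the missing results are not caused by fields. Back in searchSimilar, the MoreLikeThisQuery object only has its ID, Pageable, and Fields set. Let's look at the parameter descriptions in the official Elasticsearch documentation: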

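fields: the fields to match against; defaults to the _all field if omitted
like_text: the text to match
percent_terms_to_match: the percentage of terms that must match; default 0.3
min_term_freq: the minimum number of times a term must occur in the input document; terms below this value are ignored; default 2
max_query_terms: the maximum number of terms selected for one query; default 25
stop_words: the stop words; stop words are ignored when matching
min_doc_freq: the minimum number of documents a term must occur in; terms below this value are ignored; default 5
max_doc_freq: the maximum number of documents a term may occur in; terms above this value are ignored; default unbounded
min_word_len: the minimum word length; default 0
max_word_len: the maximum word length; default unbounded
boost_terms: the boost applied to terms; default 1
boost: the boost applied to the query; default 1
analyzer: the analyzer to use; defaults to the analyzer configured for the field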

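The key is the min_term_freq parameter. Its official default is 2, and searchSimilar does not change it. When I set it to 1, the results were clearly much better than before. The reason is that any term occurring fewer than min_term_freq times in the source document is ignored. With the default of 2, very few terms qualify, especially in short text, so most related-content searches came back empty.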
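
The effect of min_term_freq on a short text can be seen with a simplified model in plain Java. MinTermFreqDemo is a hypothetical helper, the sample file name is made up, and splitting on whitespace stands in for a real analyzer; the actual term selection in Elasticsearch involves more steps than this.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model only: MinTermFreqDemo is hypothetical, not Elasticsearch code.
class MinTermFreqDemo {
    // Keeps the terms of text that occur at least minTermFreq times.
    // Whitespace tokenization is a stand-in for a real analyzer.
    static List<String> selectTerms(String text, int minTermFreq) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minTermFreq) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String fileName = "ubuntu 18.04 desktop amd64 iso"; // made-up sample
        System.out.println(selectTerms(fileName, 2)); // every term occurs only once
        System.out.println(selectTerms(fileName, 1)); // all terms survive
    }
}
```

With a threshold of 2, a text in which no word repeats contributes no query terms at all, so there is nothing left to match other documents against.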

Finally, here is how to fix both problems without modifying the source code:

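import org.elasticsearch.index.query.MoreLikeThisQueryBuilder;
import org.elasticsearch.index.query.MoreLikeThisQueryBuilder.Item;
import org.elasticsearch.index.query.QueryBuilders;
import org.springframework.data.domain.Page;
import org.springframework.data.domain.Pageable;
import org.springframework.data.elasticsearch.core.query.NativeSearchQueryBuilder;

public Page<Torrent> searchSimilar(Torrent torrent, String[] fields, Pageable pageable) {
    // Use the overload that accepts fields; index "dodder" and type "torrent" are hard-coded.
    MoreLikeThisQueryBuilder moreLikeThisQueryBuilder = QueryBuilders.moreLikeThisQuery(fields,
            new String[] {torrent.getFileName()},
            new Item[] {new Item("dodder", "torrent", torrent.getInfoHash())});
    moreLikeThisQueryBuilder.minTermFreq(1); // lower the default of 2

    // Pass the paging parameter to the search query explicitly.
    return elasticsearchTemplate.queryForPage(new NativeSearchQueryBuilder()
            .withPageable(pageable)
            .withQuery(moreLikeThisQueryBuilder).build(), Torrent.class);
}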

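This simply follows the official implementation. To build the MoreLikeThisQueryBuilder I call the static factory method public static MoreLikeThisQueryBuilder moreLikeThisQuery(String[] fields, String[] likeTexts, Item[] likeItems), which accepts fields. I then set min_term_freq to 1 and, finally, set the paging parameter on the query. The version above omits many of the parameter settings found in the official implementation. The official code is a general-purpose method: indexName, type, and the ID, for example, are all read from the entity class, whereas I hard-coded them. If you have multiple indices, I recommend patching the official source code instead, since that is more general.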

Last modification:October 15th, 2019 at 03:38 pm