NGram analyzer

sorka 2023. 12. 14. 20:59

NGram

To make search fast, Elasticsearch splits text into the terms that will be searched ahead of time and stores them in an inverted index.
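
As a rough illustration of what "storing terms in an inverted index" means, here is a minimal Python sketch. It is a toy model, not how Elasticsearch or Lucene is implemented: build_inverted_index is a hypothetical helper, and whitespace splitting stands in for a real analyzer.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it (toy model)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: whitespace splitting stands in for a real analyzer.
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)

docs = {1: "spring blooms", 2: "bright spring flowers"}
index = build_inverted_index(docs)

print(sorted(index["spring"]))  # [1, 2]
print("pr" in index)            # False: only whole terms were stored
```

The second lookup shows the limitation that the rest of this note addresses: a fragment such as "pr" is not a stored term, so a plain term lookup cannot find it.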

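When you need to search by part of a word rather than by a whole term, you can also split each term into fragments in advance and store those. These fragments are called NGrams (unigram: 1 character, bigram: 2 characters, and so on). The token filter is configured with "type": "ngram"; the camel-case name "nGram" from older versions is deprecated.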

ex. Processing the word "spring" as bigrams yields 5 tokens: "sp", "pr", "ri", "in", "ng". With an ngram token filter, all of these 2-character terms are stored as search tokens, so a search for "pr" matches documents that contain "spring".
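
The bigram split described above can be sketched in a few lines of Python. The function name ngrams is hypothetical and only mirrors the min_gram / max_gram idea; it is not the filter's actual implementation.

```python
def ngrams(term, min_gram, max_gram):
    """Return every substring of term whose length is in [min_gram, max_gram]."""
    grams = []
    for start in range(len(term)):
        for size in range(min_gram, max_gram + 1):
            # Skip grams that would run past the end of the term.
            if start + size <= len(term):
                grams.append(term[start:start + size])
    return grams

print(ngrams("spring", 2, 2))  # ['sp', 'pr', 'ri', 'in', 'ng']
```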

Edge NGram

To store only the ngrams at the front of each term, use the Edge NGram token filter ("type": "edge_ngram"; older versions used the now-deprecated name "edgeNGram").

ex. With the edge ngram options set to "min_gram": 1, "max_gram": 4, analyzing "spring" produces the tokens "s", "sp", "spr", "spri".
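
A minimal Python sketch of the same idea, assuming only prefixes of the term are kept. edge_ngrams is a hypothetical helper name, not part of any Elasticsearch client.

```python
def edge_ngrams(term, min_gram, max_gram):
    """Return the prefixes of term with lengths from min_gram to max_gram."""
    longest = min(max_gram, len(term))  # never read past the end of the term
    return [term[:size] for size in range(min_gram, longest + 1)]

print(edge_ngrams("spring", 1, 4))  # ['s', 'sp', 'spr', 'spri']
```

Prefix-only tokens like these are a common fit for search-as-you-type, where the user always types from the start of a word.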

Shingle

A shingle is a group built from whole words rather than characters ("type": "shingle").

ex. Applying a Shingle token filter to "spring blooms bright flowers" with 2 words per shingle produces 3 shingles: "spring blooms", "blooms bright", "bright flowers".
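
The word-level grouping can be sketched the same way. This is a simplified illustration with a hypothetical shingles helper; the real filter has further options (for example, whether single words are emitted alongside the shingles) that are ignored here.

```python
def shingles(text, size=2):
    """Return every run of `size` consecutive words, joined by a space."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

print(shingles("spring blooms bright flowers"))
# ['spring blooms', 'blooms bright', 'bright flowers']
```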

Applying an NGram Analyzer

Example: split the string "컴퓨터프로그래밍_노래방카페_휴대폰어플_커피숍카페" on the "_" delimiter, then analyze the pieces with an ngram token filter.

PUT ngram_test
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "ngram_analyzer": {
            "type": "custom",
            "tokenizer": "underscore_tokenizer",
            "filter": ["bigram_filter"]
          }
        },
        "filter": {
          "bigram_filter": {
            "type": "ngram",
            "min_gram": 2,
            "max_gram": 2
          }
        }, 
        "tokenizer": {
          "underscore_tokenizer": {
            "type": "pattern",
            "pattern": "_"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "ngram_analyzer"
      }
    }
  }
}

ngram_analyzer: composed of underscore_tokenizer and bigram_filter
- underscore_tokenizer: extracts tokens using "_" as the delimiter
- bigram_filter: splits each token into 2-character grams and stores them as search tokens

Analysis and result

GET ngram_test/_analyze
{
  "analyzer": "ngram_analyzer",
  "text": "컴퓨터프로그래밍_노래방카페_휴대폰어플_커피숍카페"
}


{
  "tokens": [
    {
      "token": "컴퓨",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    },
    {
      "token": "퓨터",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    },
    {
      "token": "터프",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    },
    {
      "token": "프로",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    },
    {
      "token": "로그",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    },
    {
      "token": "그래",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    },
    {
      "token": "래밍",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    },
    {
      "token": "노래",
      "start_offset": 9,
      "end_offset": 14,
      "type": "word",
      "position": 1
    },
    {
      "token": "래방",
      "start_offset": 9,
      "end_offset": 14,
      "type": "word",
      "position": 1
    },
    {
      "token": "방카",
      "start_offset": 9,
      "end_offset": 14,
      "type": "word",
      "position": 1
    },
    {
      "token": "카페",
      "start_offset": 9,
      "end_offset": 14,
      "type": "word",
      "position": 1
    },
    {
      "token": "휴대",
      "start_offset": 15,
      "end_offset": 20,
      "type": "word",
      "position": 2
    },
    {
      "token": "대폰",
      "start_offset": 15,
      "end_offset": 20,
      "type": "word",
      "position": 2
    },
    {
      "token": "폰어",
      "start_offset": 15,
      "end_offset": 20,
      "type": "word",
      "position": 2
    },
    {
      "token": "어플",
      "start_offset": 15,
      "end_offset": 20,
      "type": "word",
      "position": 2
    },
    {
      "token": "커피",
      "start_offset": 21,
      "end_offset": 26,
      "type": "word",
      "position": 3
    },
    {
      "token": "피숍",
      "start_offset": 21,
      "end_offset": 26,
      "type": "word",
      "position": 3
    },
    {
      "token": "숍카",
      "start_offset": 21,
      "end_offset": 26,
      "type": "word",
      "position": 3
    },
    {
      "token": "카페",
      "start_offset": 21,
      "end_offset": 26,
      "type": "word",
      "position": 3
    }
  ]
}
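
The token list above can be cross-checked with a small Python simulation of the two analyzer stages (split on "_", then take bigrams). This is only a sketch: the analyze function is hypothetical, it tracks just the token text and position, and it does not compute offsets.

```python
def analyze(text, delimiter="_", gram_size=2):
    """Split on the delimiter, then emit (gram, position) pairs for each piece."""
    tokens = []
    for position, term in enumerate(text.split(delimiter)):
        # All grams of one piece share the position of that piece.
        for start in range(len(term) - gram_size + 1):
            tokens.append((term[start:start + gram_size], position))
    return tokens

tokens = analyze("컴퓨터프로그래밍_노래방카페_휴대폰어플_커피숍카페")
print(len(tokens))  # 19
print(tokens[7])    # ('노래', 1)
```

The count of 19 matches the response: 7 bigrams from the 8-character first piece, plus 4 from each of the three 5-character pieces.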

Search example and result

This search assumes that a document whose text field holds the sample string has already been indexed into ngram_test; the indexing request is not shown here.

GET ngram_test/_search
{
  "query": {
    "match": {
      "text": "노래"
    }
  }
}



{
  ...
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.4249156,
    "hits": [
      {
        "_index": "ngram_test",
        "_id": "2m8UaIwBamOI6MTqQUbR",
        "_score": 0.4249156,
        "_source": {
          "text": "컴퓨터프로그래밍_노래방카페_휴대폰어플_커피숍카페"
        }
      }
    ]
  }
}
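
Why the query hits can be sketched in Python: a match query analyzes the query string with the field's analyzer and, with the default OR operator, matches when any resulting token is in the index. The bigrams and matches helpers below are hypothetical and ignore scoring entirely.

```python
def bigrams(text, delimiter="_"):
    """Set of 2-character grams from each delimiter-separated piece."""
    return {
        term[i:i + 2]
        for term in text.split(delimiter)
        for i in range(len(term) - 1)
    }

# Tokens stored for the single indexed document (toy stand-in for the index).
indexed = bigrams("컴퓨터프로그래밍_노래방카페_휴대폰어플_커피숍카페")

def matches(query):
    """True if at least one query token is among the indexed tokens."""
    return bool(bigrams(query) & indexed)

print(matches("노래"))  # True
print(matches("음악"))  # False
```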

Comparison with the wildcard query

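Elasticsearch also supports wildcard and regexp (regular expression) queries, which work like LIKE searches in an RDBMS. However, these queries consume a lot of memory and are slow, so they do not take advantage of Elasticsearch's strengths.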

Because wildcard is a term-level query, it looks for the queried keyword pattern among the terms (tokens) stored in the inverted index.

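ngram matches on the indexed tokens, so it can find documents that a wildcard query fails to find.
ngram also responds faster.
The trade-off is that the number of tokens grows, which increases the size of the inverted index.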
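
The trade-off above can be illustrated with a toy Python model. It is not how Lucene evaluates wildcard queries: whole_terms, wildcard_search and build_bigram_index are hypothetical names, and fnmatchcase from the standard library stands in for wildcard matching.

```python
from fnmatch import fnmatchcase

# Toy term dictionary without ngrams: whole term -> ids of documents containing it.
whole_terms = {
    "컴퓨터프로그래밍": {1},
    "노래방카페": {1},
    "휴대폰어플": {1},
    "커피숍카페": {1},
}

def wildcard_search(term_index, pattern):
    """Check the pattern against every term: cost grows with the term count."""
    hits = set()
    for term, doc_ids in term_index.items():
        if fnmatchcase(term, pattern):
            hits |= doc_ids
    return hits

def build_bigram_index(term_index):
    """Precompute bigram -> doc ids so a fragment becomes a direct lookup."""
    index = {}
    for term, doc_ids in term_index.items():
        for i in range(len(term) - 1):
            index.setdefault(term[i:i + 2], set()).update(doc_ids)
    return index

bigram_index = build_bigram_index(whole_terms)

print(wildcard_search(whole_terms, "*노래*"))  # {1}, found by scanning all terms
print(bigram_index.get("노래", set()))          # {1}, found by a single lookup
print(len(whole_terms), len(bigram_index))      # 4 18
```

Both approaches find the document, but the bigram lookup does its work at index time, at the cost of 18 stored terms instead of 4.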
