Hello. I just can't figure out how to search for words containing special characters in Elasticsearch.

For example, I have two documents:

1) We are looking for C++ and C# developers
2) We are looking for C developers

I want to find only the document that contains the word C++.

Here is the code for creating the index, adding the documents, and searching:

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import scan

    ELASTIC_SEARCH_NODES = ['http://localhost:9200']
    INDEX = 'my_index'
    DOC_TYPE = 'material'


    def create_index():
        data = {
            "settings": {
                "analysis": {
                    "analyzer": {
                        "my_analyzer": {
                            "type": "custom",
                            "filter": ["lowercase"],
                            "tokenizer": "whitespace",
                        }
                    }
                }
            }
        }
        print es_client.indices.create(index=INDEX, body=data)


    def create_doc(body):
        if es_client.exists(INDEX, DOC_TYPE, body['docid']):
            es_client.delete(INDEX, DOC_TYPE, body['docid'])
        print es_client.create(index=INDEX, doc_type=DOC_TYPE, body=body, id=body['docid'])


    def find_doc(value):
        results_generator = scan(es_client,
                                 query={"query": {"match_phrase": {"text": value}}},
                                 index=INDEX)
        return results_generator


    if __name__ == '__main__':
        es_client = Elasticsearch(ELASTIC_SEARCH_NODES, verify_certs=True)
        # create_index()

        doc1 = {"docid": 1, 'text': u"We are looking for C developers"}
        doc2 = {"docid": 2, 'text': u"We are looking for C++ and C# developers"}
        # create_doc(doc1)
        # create_doc(doc2)

        for r in find_doc("C++"):
            print r

Search result (escaping the + gives the same result):

    {u'_score': 0.0, u'_type': u'material', u'_id': u'2', u'_source': {u'text': u'We are looking for C++ and C# developers', u'docid': 2}, u'_index': u'my_index'}
    {u'_score': 0.0, u'_type': u'material', u'_id': u'1', u'_source': {u'text': u'We are looking for C developers', u'docid': 1}, u'_index': u'my_index'}

As I understand it, this happens because characters like + and # are dropped during tokenization and are not indexed, so the query actually looks for documents containing the token C:

    curl 'http://localhost:9200/my_index/material/_search?pretty=true' -d '{
      "query": { "match_all": {} },
      "script_fields": {
        "terms": {
          "script": "doc[field].values",
          "params": { "field": "text" }
        }
      }
    }'

The result of the command:

 { "took" : 3, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 2, "max_score" : 1.0, "hits" : [ { "_index" : "my_index", "_type" : "material", "_id" : "2", "_score" : 1.0, "fields" : { "terms" : [ "and", "are", "c", "developers", "for", "looking", "we" ] } }, { "_index" : "my_index", "_type" : "material", "_id" : "1", "_score" : 1.0, "fields" : { "terms" : [ "are", "c", "developers", "for", "looking", "we" ] } }] } } 

How can this be solved? And a second, related question: is it possible to search for non-alphabetic characters on their own, such as % or +?

P.S. I use Elasticsearch 2.3.2 and the Python library elasticsearch==2.3.0.

    1 answer

    All special characters are indexed; there is no need to escape them. In your case, most likely, the standard analyzer was used during indexing rather than your my_analyzer.
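
    A quick way to verify this is to look at the current mapping: if the text field has no "analyzer" entry, the standard analyzer is in effect. A small sketch with the client from the question:

        # Inspect the existing mapping; with dynamic mapping the "text" field
        # has no "analyzer" entry, i.e. the standard analyzer was used.
        mapping = es_client.indices.get_mapping(index=INDEX, doc_type=DOC_TYPE)
        print mapping[INDEX]['mappings'][DOC_TYPE]['properties']['text']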

    You need to add a mapping.

        data = {
            "settings": {
                "analysis": {
                    "analyzer": {
                        "my_analyzer": {
                            "type": "custom",
                            "filter": ["lowercase"],
                            "tokenizer": "whitespace",
                        }
                    }
                }
            },
            "mappings": {
                "material": {
                    "properties": {
                        "docid": {"type": "integer"},
                        "text": {"type": "string", "analyzer": "my_analyzer"}
                    }
                }
            }
        }

    The index will have to be re-created and the documents added again. When searching, you should also use my_analyzer or lowercase the word yourself: "C++" and "c++" are different tokens.
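
    Roughly, the re-creation step could look like this, reusing create_index (now sending the body with "mappings" above), create_doc, and find_doc from the question:

        # Drop the old index and rebuild it so the documents are analyzed
        # by my_analyzer instead of the standard analyzer.
        es_client.indices.delete(index=INDEX, ignore=[404])
        create_index()          # must now send the body that includes "mappings"
        create_doc(doc1)
        create_doc(doc2)
        es_client.indices.refresh(index=INDEX)

        # Lowercase the query (or pass "analyzer": "my_analyzer" inside the
        # match_phrase) so it matches the lowercased "c++" token in the index.
        for r in find_doc("c++"):
            print r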

    You can check which tokens my_analyzer produces for a given string with the following query:

        curl -XPOST "http://localhost:9200/my_index/_analyze?analyzer=my_analyzer&pretty=true" -d 'We are looking for C++ and C# developers'
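
    With the whitespace tokenizer and the lowercase filter this should return the tokens we, are, looking, for, c++, and, c#, developers, i.e. c++ and c# survive as single tokens.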

    You can search for any characters ("%", "+").
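
    For example, once the index is re-created with the mapping above, a standalone symbol is just another whitespace-separated token. A sketch (doc3 and its text are made up purely for illustration):

        # A hypothetical document containing standalone "+" and "%" tokens.
        doc3 = {"docid": 3, 'text': u"Salary bonus + 10 %"}
        create_doc(doc3)
        es_client.indices.refresh(index=INDEX)

        # With the whitespace tokenizer, "+" and "%" are indexed as their own
        # tokens, so both phrase searches should return only doc3.
        for r in find_doc("+"):
            print r
        for r in find_doc("%"):
            print r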