Hello. I can't figure out how to search for words that contain special characters in Elasticsearch.
For example, I have two documents:
1) We are looking for C++ and C# developers
2) We are looking for C developers
I want to find only the document that contains the word C++.
The code for creating the index, documents and search:
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import scan

    ELASTIC_SEARCH_NODES = ['http://localhost:9200']
    INDEX = 'my_index'
    DOC_TYPE = 'material'


    def create_index():
        data = {
            "settings": {
                "analysis": {
                    "analyzer": {
                        "my_analyzer": {
                            "type": "custom",
                            "filter": ["lowercase"],
                            "tokenizer": "whitespace",
                        }
                    }
                }
            }
        }
        print es_client.indices.create(index=INDEX, body=data)


    def create_doc(body):
        if es_client.exists(INDEX, DOC_TYPE, body['docid']):
            es_client.delete(INDEX, DOC_TYPE, body['docid'])
        print es_client.create(index=INDEX, doc_type=DOC_TYPE, body=body, id=body['docid'])


    def find_doc(value):
        results_generator = scan(
            es_client,
            query={"query": {"match_phrase": {"text": value}}},
            index=INDEX
        )
        return results_generator


    if __name__ == '__main__':
        es_client = Elasticsearch(ELASTIC_SEARCH_NODES, verify_certs=True)
        # create_index()
        doc1 = {"docid": 1, 'text': u"We are looking for C developers"}
        doc2 = {"docid": 2, 'text': u"We are looking for C++ and C# developers"}
        # create_doc(doc1)
        # create_doc(doc2)
        for r in find_doc("C++"):
            print r
Search result (escaping the + gives the same result):
    {u'_score': 0.0, u'_type': u'material', u'_id': u'2', u'_source': {u'text': u'We are looking for C++ and C# developers', u'docid': 2}, u'_index': u'my_index'}
    {u'_score': 0.0, u'_type': u'material', u'_id': u'1', u'_source': {u'text': u'We are looking for C developers', u'docid': 1}, u'_index': u'my_index'}
As I understand it, this happens because characters like + and # are dropped when the text is split into tokens, so they are never indexed, and the query is in fact looking for documents that contain the term c. Here is how I checked which terms were indexed:
    curl 'http://localhost:9200/my_index/material/_search?pretty=true' -d '{
      "query": { "match_all": {} },
      "script_fields": {
        "terms": {
          "script": "doc[field].values",
          "params": { "field": "text" }
        }
      }
    }'
The result of the command:
    {
      "took" : 3,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
      },
      "hits" : {
        "total" : 2,
        "max_score" : 1.0,
        "hits" : [ {
          "_index" : "my_index",
          "_type" : "material",
          "_id" : "2",
          "_score" : 1.0,
          "fields" : {
            "terms" : [ "and", "are", "c", "developers", "for", "looking", "we" ]
          }
        }, {
          "_index" : "my_index",
          "_type" : "material",
          "_id" : "1",
          "_score" : 1.0,
          "fields" : {
            "terms" : [ "are", "c", "developers", "for", "looking", "we" ]
          }
        } ]
      }
    }
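To illustrate what I think is happening, here is a rough plain-Python approximation of the two tokenization behaviours. This is only a sketch: it does not call Elasticsearch, the regex is my assumption about how the default analysis treats punctuation, and the function names `approx_standard_tokens` and `approx_whitespace_tokens` are made up for this example.

```python
import re


def approx_standard_tokens(text):
    # Assumption: keep only runs of word characters and lowercase them,
    # which roughly mimics '+' and '#' being dropped at index time.
    return [t.lower() for t in re.findall(r"\w+", text)]


def approx_whitespace_tokens(text):
    # Split on whitespace only, then lowercase; roughly what a whitespace
    # tokenizer followed by a lowercase filter would be expected to do.
    return [t.lower() for t in text.split()]


doc2 = "We are looking for C++ and C# developers"

# Unique, sorted terms, in the same shape as the "terms" field above.
print(sorted(set(approx_standard_tokens(doc2))))
print(sorted(set(approx_whitespace_tokens(doc2))))
```

With the first function, "C++" and "C#" both collapse to "c", which matches the terms actually stored in my index. With the second, "c++" and "c#" survive as separate terms, which is what I expected `my_analyzer` to produce.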
How can this problem be solved? And a second, related question: is it possible to search for only non-alphabetic characters, such as % or +?
P.S. I am using Elasticsearch 2.3.2 and the Python library elasticsearch 2.3.0.