Search mechanisms for site documents (content)

Question

I need to decide how to organize search on a site, hence this question. It may seem rather common, but still.

A couple of years ago I implemented a project on Orchard CMS (.NET MVC). That framework used Lucene.Net for full-text search of site content. From the developer's point of view, hooking up indexing was quite simple and looked something like this:

 OnIndexing<EduPart>((ctx, ep) => ctx.DocumentIndex
     .Add("EduPart_Description", ep.Description).RemoveTags().Analyze()
     .Add("EduPart_Mission", ep.Mission).RemoveTags().Analyze()
 );

The index was stored, I don't even remember where, either in the database or in files. The point is that this search had no external calls or dependencies.

The projects I build are quite small; indexing Wikipedia is not the goal. The main framework I use is CakePHP, but that doesn't matter much, since all popular MVC frameworks implement the same things anyway.

What are the search options, and how applicable are they:

1. Plain use of the LIKE operator. The easiest and least functional way to search content. Yes, you can search titles with it, for example, but it is far from full-text search. I do not consider it a solution.
2. Using the full-text search facilities of the database engine. The problem here (albeit a solvable one) is the use of different DBMSs. In practice I only use MS SQL Server and MySQL in development, and both engines support full-text indexes, although I honestly admit I have never used them. In general, a separate search implementation would have to be written for each DBMS.
3. Using search engines such as ElasticSearch / Sphinx / Solr, etc. These are separate services / daemons accessed through an API. The downside: they may limit the ability to move projects to ordinary hosting. CakePHP itself has a plugin for ElasticSearch which, oddly enough, requires changes in the inheritance hierarchy of the application's models / entities, although it would seem that a behavior reacting to content add / edit / delete events would be enough.
4. The last option (which I originally relied on) was the PHP port of Lucene. It used to be part of Zend Framework and was called Zend.Search. The trouble is that this project is not compatible with PHP 7 and is no longer supported. It looks like the latest PHP version it was written for is 5.3.
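
To make the "one implementation per DBMS" point in option 2 concrete, here is a minimal sketch of how the only dialect-specific part, the WHERE condition, could be isolated. The function name, the `content` column and the `:query` placeholder are assumptions made for illustration; the driver names follow PDO's naming (`mysql`, `sqlsrv`). MySQL needs a FULLTEXT index on the column, and SQL Server needs a full-text catalog and index on the table.

```php
<?php
/**
 * Returns a dialect-specific WHERE fragment for a full-text match.
 * Hypothetical helper: the column name and placeholder are assumptions.
 */
function fullTextCondition(string $driver, string $column = 'content'): string
{
    switch ($driver) {
        case 'mysql':
            // MySQL: MATCH ... AGAINST, requires a FULLTEXT index.
            return "MATCH({$column}) AGAINST (:query IN NATURAL LANGUAGE MODE)";
        case 'sqlsrv':
            // MS SQL Server: FREETEXT, requires a full-text index.
            return "FREETEXT({$column}, :query)";
        default:
            throw new InvalidArgumentException(
                "No full-text syntax known for driver '{$driver}'"
            );
    }
}

echo fullTextCondition('mysql'), PHP_EOL;
echo fullTextCondition('sqlsrv'), PHP_EOL;
```

The rest of the query (table, joins, pagination) can then stay the same for both engines.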

It follows from all this that the task is to find a full-text search solution for the site (for all types of documents / content) that does not use third-party search engines and does not depend on the SQL dialect in use.

Can anyone share experience with this? Are there other solutions to this problem, or libraries that fit these conditions?

There is also something called Pucene, but not much is clear about it yet. - teran

teran · Answer 1 · 2018-04-09T13:15:15

In the end, I settled on full-text search by means of the DBMS. I checked that full-text indexes are allowed on several ordinary hosting providers, although I had doubted it, since, for example, triggers and views are often forbidden.

On the DBMS side, I created an auxiliary table for search. The table has two fields: the document id and its content. The content is concatenated from the fields of the document itself. This effectively duplicates content in the database, but since there is not much of it, this is not critical.

On the application side (CakePHP), I developed a SearchBehavior that is attached to tables, like this:

  $this->addBehavior("Search", [
      'className' => 'ContentManager.Search',
      'indexFields' => [ 'title', 'description', 'Specs.title', 'Edu.title' ],
      'results' => [
          'select' => ['notice'],
          'contains' => ['Specs'],
          'url' => ['_name' => 'profDetails'],
      ]
  ]);

The config lists the fields to concatenate, including fields of associations. For displaying search results, you can specify additional fields to select (besides displayField(), which is used by default) and associations to load. The link to a specific document in the results is built either from a named route (with the id passed as a parameter) or by a callback method.

Standard pagination had to be configured manually. The parameters are available in the controller as $this->params('paging').
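
A rough sketch of how the content for the search table could be concatenated from a list like 'indexFields'. This is not the actual SearchBehavior code: it works on plain arrays rather than CakePHP entities, the function names and the sample document are hypothetical, and stripping tags and collapsing whitespace is an assumption about what is worth storing.

```php
<?php
/**
 * Collects the values found at a dotted path such as 'Specs.title'.
 * An association may hold a single record or a list of records.
 */
function collectValues($node, array $path): array
{
    if ($path === []) {
        return is_scalar($node) ? [(string) $node] : [];
    }
    if (!is_array($node)) {
        return [];
    }
    $key = $path[0];
    if (array_key_exists($key, $node)) {
        return collectValues($node[$key], array_slice($path, 1));
    }
    // Key not found: if this is a list of records, look inside each one.
    $values = [];
    foreach ($node as $index => $child) {
        if (is_int($index) && is_array($child)) {
            $values = array_merge($values, collectValues($child, $path));
        }
    }
    return $values;
}

/** Builds the single text blob stored next to the document id. */
function buildIndexContent(array $document, array $indexFields): string
{
    $parts = [];
    foreach ($indexFields as $field) {
        foreach (collectValues($document, explode('.', $field)) as $value) {
            // Drop markup and collapse whitespace before storing.
            $text = trim(preg_replace('/\s+/', ' ', strip_tags($value)));
            if ($text !== '') {
                $parts[] = $text;
            }
        }
    }
    return implode(' ', $parts);
}

// Hypothetical sample document with one loaded association.
$document = [
    'title' => 'Welder',
    'description' => '<p>Works with <b>metal</b></p>',
    'Specs' => [['title' => 'Arc welding'], ['title' => 'Gas welding']],
];
echo buildIndexContent($document, ['title', 'description', 'Specs.title', 'Edu.title']), PHP_EOL;
```

Fields or associations that are missing from the document (here 'Edu.title') are simply skipped, so the same field list can be reused even when an association is not loaded.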

Documents are added to and removed from the index in the afterSave and afterDelete callbacks. A shell was also developed that can clear the search table and index documents by type or all at once.
A method allowIndex($entity) is defined in the behavior and can be overridden in the concrete Table classes if necessary; it determines whether a specific document should be indexed. This makes it possible to index only some of the documents of a given type rather than all of them.

One nuance remains: the case where a single DB table backs several model classes. In my case this only happens with the site's menu items, so for now only the elements of one of those classes are indexed.
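
The interplay of afterSave, afterDelete and allowIndex can be sketched roughly as follows. This is a framework-free illustration, not the real behavior: the SearchIndex class, its in-memory storage, the has() helper and the 'published' field are all hypothetical, and removing a document that no longer passes the check is an assumption about the desired behavior.

```php
<?php
/**
 * In-memory stand-in for the search table: one row per document,
 * holding the document id and its concatenated content.
 */
class SearchIndex
{
    /** @var array<int, string> content keyed by document id */
    private $rows = [];

    /** @var callable decides whether a document may be indexed */
    private $allowIndex;

    public function __construct(?callable $allowIndex = null)
    {
        // Default rule: index everything; a table can pass its own rule.
        $this->allowIndex = $allowIndex ?? function (array $document): bool {
            return true;
        };
    }

    /** Called after a document is saved. */
    public function afterSave(array $document, string $content): void
    {
        if (($this->allowIndex)($document)) {
            $this->rows[$document['id']] = $content;
        } else {
            // Assumption: a document that no longer qualifies leaves the index.
            unset($this->rows[$document['id']]);
        }
    }

    /** Called after a document is deleted. */
    public function afterDelete(array $document): void
    {
        unset($this->rows[$document['id']]);
    }

    public function has(int $id): bool
    {
        return isset($this->rows[$id]);
    }
}

// Index only published documents (hypothetical rule).
$index = new SearchIndex(function (array $document): bool {
    return !empty($document['published']);
});
$index->afterSave(['id' => 1, 'published' => true], 'Welder');
$index->afterSave(['id' => 2, 'published' => false], 'Draft');
var_dump($index->has(1), $index->has(2));
$index->afterDelete(['id' => 1]);
var_dump($index->has(1));
```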
