The entrance to the program is a set of English letters. There is a dictionary of words.

How to generate such sentences from words so that each of the sentences consists only of those letters that were submitted for input, taking into account the number of repetitions.

Example

Incoming character set: hellomyfriend

The output of the program:

 Feed Hilly Morn Feed Hilly Norm Feed Horny Mill Freehold Nil My Refilled Hon My Defiler Hymn Lo и т.д. 

I managed to choose words that satisfy the input character set. But quickly make proposals of them fail.

I did by iterating over all the words. For the phrase myfavoritegame script worked for about 5 minutes. The site offers are given instantly.

Can you advise something?

  • I also advise you to look at the search system, such as Sphinx - Artem Y
  • one
    The question you have is “How to make the selection of words from the dictionary so that a given phrase will turn out”, and in the example you search for anagrams. Could you clarify your task by some example? - Ruslan
  • Brute force is not needed. You need for each letter to know all the words that contain it. Cycle through all the letters, take one word at a time, give it all out. If you need some optimality, formulate an optimality criterion. - VladD
  • one
    @Sergiks A sentence is just a collection of words from a dictionary. Regarding the speed of work: it is allowed to search for combinations within 5 seconds. On that site I tried to set random combinations of letters - the answer always came quickly. - Groonya
  • one
    @Sergiks Semi-working version here . It doesn’t work well, but long phrases are still relatively long. PS Thanks for the answer! - Groonya

2 answers 2

Suppose you just need to find all possible combinations of words from a given dictionary that satisfy the condition. Without being distracted by the features of the language.

Important to search:

  1. the presence of letters in the word. We exclude the containing letters outside the given. We exclude, as far as compiling a phrase, already used.
  2. count the letters. At any moment in the compilation of a phrase, it is known which words are suitable for what length, or are definitely not suitable.
  3. duplicate letters.

Not important :

  1. the order of the letters in the dictionary word.
  2. meaning of the word.

For speed, you need to make the search for important features as fast as possible, if necessary, removing unimportant aspects.

To quickly find words with a suitable set of letters, but without counting repetitions, you can make a bit index, as suggested by @Mirdin: 26 letters of the English alphabet = 26 bits. Number the words in the dictionary (or simply count the line number as an index) and make a separate index in two columns: the word id is the bit mask of the existing letters. For a dictionary less than 65 thousand words, such an index will "weigh" 6 bytes per word, less than 400k. Can be kept in RAM for near-instant search. So you can quickly find, for example, the first word of a phrase — simply so that the bits of the “extra” letters are turned off.

It is worth making a copy of the dictionary, where the letters of the word are sorted alphabetically , and the words are sorted alphabetically. Those. again a separate index: сортированные_буквы - id_слова . This index will be heavier than the dictionary itself (the number of words * 2 or 3 bytes). In this index, you can quickly find the right words and discard accurately-inappropriate.

Algorithm like this. We are looking for the first word. I want to find the first word of the greatest possible length. Search in length, from larger to smaller. Eats a valid set of letters and length. Found the first word, updated the permissible set of letters and the length of words - look for the next word.

    1. You make a structure (a table in the database, a hash table, a dictionary, and so on, which is there in php) of two write fields: a hash and the actual word. We hammer in this structure with words.
    2. Hash is approximately done like this: the index of the letter in the alphabet is the power of two, all the letters of the word are added (32 bits for the Russian language should suffice). This is the simplest option, it may be necessary to come up with a more complex function.
    3. Having received the word that needs to be engraved - we calculate its hash and look for it in our structure

    UPD: This is for one word, it will be more difficult with phrases, but the principle is the same

    • A repeated letter will ruin the whole raspberry. The word "free", the "e" index 4, the amount will go twice 2 ^ 4, and this is the same as 2 ^ 5, the letter "f". - Sergiks
    • @Sergiks, yes here you are right, I described the simplest case, in practice you will have to use something more complicated. - Mirdin