Comparing strings of arbitrary content and printing a similarity percentage

Question

    #include <stdio.h>
    #include <string.h>

    void process(char *istr, char *S2)
    {
        char *instr;
        instr = strstr(istr,S2);
        if(instr!=NULL)
        {
            printf("true\n");
        }
    }

    void main(void)
    {
        char S1[20], S2[20];
        char sp[10]=" ";
        char *istr;

        printf("Enter S1, S2\n");
        gets(S1);
        gets(S2);
        istr = strtok(S1,sp);
        while(istr != NULL)
        {
            process(istr,S2);
            istr = strtok(NULL,sp);
        }
        return 0;
    }

I need to compare strings of arbitrary content and print a similarity percentage. My approach: read two lines, split the shorter one into words, then pass each word together with the other line to a function; if the word occurs in the line, I increment a counter, and at the end I calculate the percentage of matches. For now I only want to print true when a word from the first line is found in the second. I am writing it this way because afterwards I will solve the same problem with threads. The problem with this code is that it prints true only if the line in which we are looking for a word consists of that word itself. For example, with line 1 (abcd) and line 2 (a) the result is (true), but with line 1 (abcd) and line 2 (ac) the output is empty, although it should be (true true).

The requirement needs to be stated in detail: what exactly is meant by "percentage of similarity"?
After all, it could mean sharing one letter (perhaps several times), sharing two letters, two letters side by side, and so on.
The easiest way, I believe, is to compare the words in the lines and calculate the percentage from the number of matching words and their lengths. It is a task that everyone understands in their own way. I used letters in the example to make it clearer; ideally there will be words instead of letters.
For now I want to print true if a word from the first line is found in the second.
And this code prints true only if the line in which we are looking for a word contains that word and nothing else. For example, with line 1 (aaa bbb ccc ddd) and line 2 (aaa) the result is (true), but with line 1 (aaa bbb ccc ddd) and line 2 (aaa ccc) the output is empty, although it should be (true true).
To spell it out (as I understand it): 2% for one letter, +3% if they are adjacent, and so on.
strstr really does look for a direct occurrence of one string inside another.
Google "Levenshtein distance" and follow the links from there to other similar algorithms.
"to print true if the word from the first line is found in the second" - man strstr

PinkTux · Accepted Answer · 2017-03-07T20:15:37

for example, with line 1 - aaa bbb ccc ddd and line 2 - aaa the result is true, but with line 1 - aaa bbb ccc ddd and line 2 - aaa ccc the output is empty, although it should be (true true).

To begin with, what is obviously needed here is not (true), (true, true), ..., but simply a counter of matches. Unless, of course, you also need to match substrings, positions, the number of occurrences, and so on. That is another story.

The algorithm here is simple: split the string whose words we are looking for into those words (in the simplest case this is done with the strtok() function) and look for each word in the other string using strstr().

    #define DELIMITERS " \t\n\r"

    /* words:  modifiable char array holding the words to look for
       string: the string that is searched */
    size_t count = 0;
    char *word = strtok( words, DELIMITERS );
    while( word )
    {
        if( strstr( string, word ) )
        {
            count++;
            printf( "string \"%s\" contains word \"%s\"\n", string, word );
        }
        else
        {
            printf( "word \"%s\" is not found in string \"%s\"\n", word, string );
        }
        word = strtok( NULL, DELIMITERS );
    }
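The fragment above can be wrapped into a reusable function. What follows is only a minimal sketch: the name count_found_words() and the total output parameter are invented for illustration and are not part of the answer. Note the argument order of strstr(): the string being searched comes first. The question's process() calls strstr(istr, S2), which looks for the whole second line inside a single word, and that is why true appears only when the second line is one word.

```c
#include <stdlib.h>
#include <string.h>

#define DELIMITERS " \t\n\r"

/* Hypothetical helper: counts how many whitespace-separated words of
 * `words` occur somewhere in `string` (substring match, as strstr does).
 * Stores the number of words examined in *total when total is not NULL.
 * Returns -1 if memory allocation fails. */
int count_found_words(const char *words, const char *string, int *total)
{
    /* strtok() writes into its argument, so tokenize a private copy. */
    size_t len = strlen(words);
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return -1;
    memcpy(copy, words, len + 1);

    int found = 0;
    int seen = 0;
    for (char *word = strtok(copy, DELIMITERS); word != NULL;
         word = strtok(NULL, DELIMITERS)) {
        seen++;
        /* strstr(haystack, needle): search for the word inside the string. */
        if (strstr(string, word) != NULL)
            found++;
    }

    free(copy);
    if (total != NULL)
        *total = seen;
    return found;
}
```

For the second example from the question, count_found_words("aaa ccc", "aaa bbb ccc ddd", &total) returns 2 and sets total to 2. One possible percentage, assuming "similarity" means the share of words found, is 100.0 * found / total when total is greater than zero.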

In more complex cases, you first need to decide exactly how "similarity" is measured. Perhaps it makes sense to use algorithms such as Levenshtein distance (one, two, etc.). Or reinvent the wheel. But only after clearly formulating what exactly should be computed.

Vadim Moroz · Answer 2 · 2017-03-07T21:08:24

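    #include <stdio.h>
    #include <string.h>

    /* Returns the length of word s if it occurs in S2, otherwise 0. */
    int process(char *s, char *S2)
    {
        char *instr;
        int col_in = 0;

        instr = strstr(S2,s);
        if(instr!=NULL)
        {
            return (int)strlen(s);
        }
        else
            return col_in;
    }

    int main(void)
    {
        char S1[20], S2[20];
        char sp[10]=" ";
        char *istr;
        int res = 0, col_words = 0;
        double proc;

        printf("Enter S1, S2\n");
        /* fgets() instead of gets(): the read is bounded by the buffer size */
        if(!fgets(S1, sizeof S1, stdin) || !fgets(S2, sizeof S2, stdin))
            return 1;
        S1[strcspn(S1, "\n")] = '\0';   /* strip the trailing newline */
        S2[strcspn(S2, "\n")] = '\0';

        if(strlen(S1)<strlen(S2))
        {
            /* split the shorter line into words, search them in the longer one */
            istr = strtok(S1,sp);
            while(istr != NULL)
            {
                res += process(istr,S2);
                printf("%d\n",res);
                istr = strtok(NULL,sp);
                col_words++;
            }
            /* matched letters plus the spaces between words, relative to
               the longer line; 100.0 avoids integer division */
            proc = (res+col_words-1)*100.0/strlen(S2);
        }else{
            istr = strtok(S2,sp);
            while(istr != NULL)
            {
                res += process(istr,S1);
                printf("%d\n",res);
                istr = strtok(NULL,sp);
                col_words++;
            }
            proc = (res+col_words-1)*100.0/strlen(S1);
        }
        printf("%.1f",proc);
        return 0;
    }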

I solved the problem this way. I am attaching the code in case it comes in handy for someone.

There is too much redundancy here; the code can be cut in half without losing meaning or readability.
Post the code as a separate question with the code-review tag (read the tag description).

user239133 · Answer 3 · 2017-03-07T21:20:40

If the lines really are of arbitrary content, then they should be compared with something like the Levenshtein distance: https://ru.wikipedia.org/wiki/Levenshteyn_Distance
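As a rough illustration of that suggestion, here is a minimal sketch of the classic dynamic-programming Levenshtein distance, kept to a single row of the table. The names levenshtein() and similarity_percent() are made up for this example, and turning the distance into a percentage as 100 * (1 - distance / longer length) is only one possible convention, not something the answer prescribes. Strings are compared byte by byte, so multi-byte encodings such as UTF-8 would need extra handling.

```c
#include <stdlib.h>
#include <string.h>

/* Levenshtein distance between a and b, compared byte by byte.
 * Returns -1 if memory allocation fails. */
long levenshtein(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t *row = malloc((lb + 1) * sizeof *row);
    if (row == NULL)
        return -1;

    /* Row 0: turning "" into the first j bytes of b takes j insertions. */
    for (size_t j = 0; j <= lb; j++)
        row[j] = j;

    for (size_t i = 1; i <= la; i++) {
        size_t diag = row[0];              /* d[i-1][j-1] */
        row[0] = i;                        /* d[i][0]: i deletions */
        for (size_t j = 1; j <= lb; j++) {
            size_t up = row[j];            /* d[i-1][j] */
            size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            size_t best = diag + cost;     /* match or substitution */
            if (up + 1 < best)
                best = up + 1;             /* deletion */
            if (row[j - 1] + 1 < best)
                best = row[j - 1] + 1;     /* insertion */
            row[j] = best;
            diag = up;
        }
    }

    long d = (long)row[lb];
    free(row);
    return d;
}

/* Assumed definition of similarity: 100 * (1 - distance / longer length).
 * Returns -1.0 if memory allocation fails. */
double similarity_percent(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t longest = la > lb ? la : lb;
    if (longest == 0)
        return 100.0;                      /* two empty strings are identical */
    long d = levenshtein(a, b);
    if (d < 0)
        return -1.0;
    return 100.0 * (1.0 - (double)d / (double)longest);
}
```

With this convention, levenshtein("kitten", "sitting") is 3, and similarity_percent("abcd", "abcf") is 75.0 because one of four characters differs.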

If you want to work with words, this is done differently. With your algorithm, the word "нет" ("no") will be detected as contained in a string that contains the word "интернет" ("Internet").

The string must be split into words on whitespace (the result is arrays of words). Then, for inflected languages (such as Russian), apply stemming https://ru.wikipedia.org/wiki/Stemming to the words that are not on the list of exceptions.
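The first half of that recipe, splitting both strings into word arrays and comparing whole words, might look roughly like the sketch below. It is an illustration under several assumptions: the names split_words() and whole_word_match_percent() and the MAX_WORDS limit are invented here, the comparison is exact and case-sensitive, and stemming is not attempted at all.

```c
#include <stdlib.h>
#include <string.h>

#define DELIMITERS " \t\n\r"
#define MAX_WORDS 64   /* arbitrary limit chosen for this sketch */

/* Splits buf in place on whitespace and stores up to max word pointers
 * in out. Returns the number of words stored. */
static size_t split_words(char *buf, char **out, size_t max)
{
    size_t n = 0;
    for (char *w = strtok(buf, DELIMITERS); w != NULL && n < max;
         w = strtok(NULL, DELIMITERS))
        out[n++] = w;
    return n;
}

/* Hypothetical helper: percentage of the words of `a` that also appear
 * as whole words in `b`. Returns -1.0 if memory allocation fails. */
double whole_word_match_percent(const char *a, const char *b)
{
    /* strtok() modifies its input, so work on private copies. */
    char *ca = malloc(strlen(a) + 1);
    char *cb = malloc(strlen(b) + 1);
    if (ca == NULL || cb == NULL) {
        free(ca);
        free(cb);
        return -1.0;
    }
    strcpy(ca, a);
    strcpy(cb, b);

    /* Tokenize one string completely before starting the other:
     * strtok() keeps a single internal position. */
    char *wa[MAX_WORDS], *wb[MAX_WORDS];
    size_t na = split_words(ca, wa, MAX_WORDS);
    size_t nb = split_words(cb, wb, MAX_WORDS);

    size_t found = 0;
    for (size_t i = 0; i < na; i++) {
        for (size_t j = 0; j < nb; j++) {
            if (strcmp(wa[i], wb[j]) == 0) {   /* whole-word comparison */
                found++;
                break;
            }
        }
    }

    free(ca);
    free(cb);
    return na == 0 ? 0.0 : 100.0 * (double)found / (double)na;
}
```

Here whole_word_match_percent("net", "Internet access") gives 0.0, whereas strstr("Internet access", "net") would report a match, which is exactly the false positive described above.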

Twenty lines of code will not be enough for this.
