search for multiline text position

Question

There is a large text file, файл1, e.g.:

 a
 b
 c
 d
 e

There is another text file, файл2, with several lines, e.g.:

 b
 c
 d

How can I use POSIX utilities (or at least GNU utilities, or, as a last resort, C that is as platform-independent as possible) to find the line number in the first file from which the two files match? For the given example it is 2 (starting from the second line, the first file contains exactly the same lines as the second).

So far I have only found a way to check whether the second file is contained in the first:

 $ grep -qzP "$(sed ':a;N;$!ba;s/\n/\\n/g' файл2)" файл1 && echo included

But it does not tell me the line number at which the match begins.

An explanation of the sed program: it replaces each newline in файл2 with the two characters \n (a backslash and n) to build a regular expression for grep. Borrowed from here: How can I replace a newline (\n) using sed?
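
To make the effect of that sed program concrete, here is a minimal sketch that wraps it in a function and feeds it the sample файл2 contents. The name join_lines is hypothetical, and the label syntax with semicolons assumes GNU sed, as in the question.

```shell
# join_lines [FILE...]
# Joins all input lines into one, putting the two characters "\n"
# (backslash, n) where each newline used to be.
# Hypothetical helper; the ':a;N;$!ba' form assumes GNU sed.
join_lines() {
    sed ':a;N;$!ba;s/\n/\\n/g' "$@"
}

# Usage: build the pattern that is passed to grep -P
printf 'b\nc\nd\n' | join_lines
```

Because the result is used as a regular expression, any regex metacharacters in the lines of файл2 would be interpreted by grep rather than matched literally.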

The overall task is to merge two partially overlapping files: an unknown number of lines at the end of one file is repeated at the beginning of the other, and the result should be a single file containing the unique lines from the first, then the common lines, then the unique lines from the second.
It's easier to write it in whatever you know; and if resources matter, C / Asm.
@KoVadim, the intended place of use is a “microcomputer” (like a Raspberry Pi or weaker) or anything running OpenWrt.
So resources (including the size of an interpreter binary) are critical.
// And “C / ASM”, besides the extra time spent on development and debugging, is itself a kind of “overhead” because of the uncertainty about the target architecture (intel / arm / mips).
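
For the overall task described in the question (the tail of one file repeated at the head of the other), a minimal sketch in awk could look like the following. It is not taken from any of the answers: merge_overlap is a hypothetical name, and the sketch assumes both files fit in memory, which may not hold on a very small device.

```shell
# merge_overlap FILE1 FILE2
# Prints FILE1, then the part of FILE2 that does not repeat FILE1's tail.
# Hypothetical helper; both files are held in memory.
merge_overlap() {
    awk -v f2="$2" '
        BEGIN {
            # load the second file into b[1..n2]
            while ((getline line < f2) > 0) b[++n2] = line
        }
        { a[++n1] = $0 }               # load the first file into a[1..n1]
        END {
            max = (n1 < n2) ? n1 : n2
            # try the longest possible overlap first
            for (k = max; k > 0; k--) {
                ok = 1
                for (i = 1; i <= k; i++)
                    if (a[n1 - k + i] != b[i]) { ok = 0; break }
                if (ok) break
            }
            # k is now the overlap length (0 if there is none)
            for (i = 1; i <= n1; i++) print a[i]
            for (i = k + 1; i <= n2; i++) print b[i]
        }
    ' "$1"
}
```

Line order and repeated lines are preserved, because the comparison is positional rather than based on sorting.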

Accepted Answer · 2016-09-06T15:01:38

Using patch

 if line=$(diff -U0 файл2 /dev/null | patch -f --dry-run файл1 - |
           sed -rn 's/^Hunk #1 succeeded at ([0-9]+) .*/\1/p; /FAILED/ q1')
 then
   echo "line ${line:-1}"
 else
   echo "fragment not found"
 fi

We create a patch that deletes every line of the sample file (файл2) and try to apply it to файл1 in dry-run mode. If it applies, patch reports where the hunk matched, provided the offset is nonzero (hence the default of 1 in ${line:-1}). If it does not apply, patch reports a failure.

Using diff.

This was my first version; I am leaving it here just in case. It seems to work, but I do not know how it will behave with very large files:

 diff -U0 файл2 файл1 | sed -rn '1!{/^-/q1}; 3{s/@@ -0,0 \+(1,([0-9]+)|(1)).*/\2\3/p}; /^@@ -[1-9]/ {G; h; /@@\n@@/ q1 }'

Note that here the lines are numbered from 0.

It may take unnecessarily long on a large file, since diff will output every part of файл1 that does not match файл2.

It returns status 1 if there is no match or if the match is only partial; in that case any printed text should be ignored. If the files match from the very beginning, nothing is printed and the status is 0.

Explanations for the program.

diff -U0 produces a patch in unified format without context lines.

The first group catches lines that start with a minus, except for the first line (the --- file header). If such lines are present, diff did not find some line of файл2 in файл1, and the program exits with code 1.

The second group catches the header of the hunk that diff considers to have been added before the lines of файл2, and takes the number of lines of that hunk from it. The header should look like @@ -0,0 +1,n @@ , where n is the number of lines in the hunk; when n equals 1, it is omitted.

If файл1 contains all the lines of файл2 but with other lines between them, diff outputs more than two headers starting with @@ and not a single minus line; the last group of commands tracks this case.
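
The header parsing described above can be tried in isolation. The following is a small sketch, not the answer's exact command: hunk_added_count is a hypothetical name, sed -E is used in place of -r, and the closing " @@" is added to the pattern so that only one alternative can match a given header.

```shell
# hunk_added_count HEADER
# Prints n for a header of the form "@@ -0,0 +1,n @@" and 1 for the
# short form "@@ -0,0 +1 @@"; prints nothing for any other header.
# Hypothetical helper illustrating the alternation (1,([0-9]+)|(1)).
hunk_added_count() {
    printf '%s\n' "$1" |
        sed -En 's/^@@ -0,0 \+(1,([0-9]+)|(1)) @@.*/\2\3/p'
}

hunk_added_count '@@ -0,0 +1,3 @@'   # long form: group 2 holds the count
hunk_added_count '@@ -0,0 +1 @@'     # short form: group 3 holds the 1
```

The alternation sits directly inside the outer group, so its two branches are "1,([0-9]+)" and "(1)". The second branch covers the short header, where there is no comma and no explicit count.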

@alexanderbarakin, yes, I messed up with copying from the terminal.
@alexanderbarakin, any; I have changed it now. It seems partial matches have to be handled as well.
Please explain why there is an alternative |(1) in the construct ([0-9]+)|(1)?
After all, the first part, ([0-9]+), will also match the character 1, so the alternative (1) seems redundant.
Or even replace this pipeline with /dev/null as the second argument to diff.

Answer 2 · 2016-09-06T18:34:15

I came up with an awk variant (скрипт.awk):

 BEGIN {
   n = 1   # initialize the line number
   r = 1   # default return code: 1
 }
 {
   # if the next line could be read from файл2
   if ((getline l < f2) > 0) {
     if (l != $0) {   # and it differs from the current line of файл1
       close(f2)      # close файл2 (the next getline will reopen it)
       n = NR + 1     # the next line of файл1 becomes the candidate start
     }
   } else {           # файл2 has ended
     r = 0            # return code 0 (success)
     print n          # line number where the match began
     nextfile         # stop reading the current file; since it is the
                      # only one (файл1), this means jumping to END
   }
 }
 END {
   exit r   # exit with the chosen return code
 }

Call it like this:

 $ awk -v f2=файл2 -f скрипт.awk файл1

The name of the second file is passed through the variable f2.

It prints the line number in файл1 from which файл2 is entirely contained in файл1, or prints nothing and returns code 1 (“no match”).

It can also be run as a one-liner. In that case it is easier to embed the name of the second file directly in the script code:

 $ awk 'BEGIN{f2="файл2";n=1;r=1}{if((getline l<f2)>0){if(l!=$0){close(f2);n=NR+1}}else{r=0;print n;nextfile}}END{exit r}' файл1

To handle such a situation, the algorithm would have to be made considerably more complicated.
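
As far as I can tell, the streaming script above misses two kinds of input. One is a match that ends exactly on the last line of файл1, because the success branch is only reached when one more line is read. The other is a match that starts inside a failed partial match, for example файл1 = a b b c d with файл2 = b c d. One possible way to cover both, sketched here under the assumption that файл2 is small enough to hold in memory, is to compare файл2 against a sliding window of the most recent lines of файл1. The name find_block is hypothetical.

```shell
# find_block FILE1 FILE2
# Prints the 1-based line number in FILE1 where the first full copy of
# FILE2 starts; exit status 0 if found, 1 otherwise.
# Hypothetical helper; only FILE2 and a window of equal size are buffered.
find_block() {
    awk -v f2="$2" '
        BEGIN {
            # load the pattern lines into pat[0..n2-1]
            while ((getline line < f2) > 0) pat[n2++] = line
            if (n2 == 0) exit 1        # nothing to search for
        }
        {
            win[NR % n2] = $0          # ring buffer with the last n2 lines
            if (NR < n2) next          # window not full yet
            start = NR - n2 + 1        # first line of the current window
            for (i = 0; i < n2; i++)
                if (win[(start + i) % n2] != pat[i]) next
            print start
            found = 1
            exit
        }
        END { exit !found }
    ' "$1"
}
```

The price is up to n2 comparisons per line of файл1 in the worst case, in exchange for never having to re-read файл1.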

Answer 3 · 2016-09-07T08:44:42

To print where the lines common to both files begin in файл1:

 $ diff \
     --old-group-format='' \
     --changed-group-format='' \
     --new-group-format='' \
     --unchanged-group-format='%df'$'\n' файл1 файл2
 2

See: Line Group Formats.

the global task is to merge two partially intersecting files: a certain unknown number of lines at the end of one file is repeated at the beginning of another, and you need to get in one file: unique lines from the first, then common lines, then unique lines from the second.

 #!/bin/bash
 function cleanup {
   rm -rf sorted{1,2}
 }
 trap cleanup EXIT
 # sort, remove duplicates
 sort -u файл1 > sorted1
 sort -u файл2 > sorted2
 # print unique lines from файл1
 comm -23 sorted{1,2}
 # print common lines
 comm -12 sorted{1,2}
 # print unique lines from файл2
 comm -13 sorted{1,2}

In this case sorting is ruled out: 1. the order of the lines matters; 2. many lines can be repeated several times. A rough analogy: partially overlapping log files without timestamps.

To merge files containing common lines:

 $ diff \
     --old-group-format='%<' \
     --new-group-format='%>' \
     old new

About the “global task”: thanks, but in this case sorting is ruled out: 1. the order of the lines matters;
A rough analogy: partially overlapping log files without timestamps.
@alexanderbarakin: I don't see what there is to thank me for if it doesn't suit you.
I misinterpreted the word “unique” (here it means “lines not found in the other file”, not “non-repeating”).
The first of the commands, run on a modified файл1 (with the 3rd line, containing c, duplicated), returns 25.
This value is quite logical (from diff's point of view) but, alas, wrong (as an answer to the question).
@alexanderbarakin c cannot be duplicated (then file2 is not contained in file1).
Ah, I see: the existence of a complete match is not checked.
Well, that's not a problem: the question itself already has a way to check that.
// By the way, if you use not '%df' but (for bash) '%df'$'\n', the answer will be multi-line, and that way you can detect the absence of a full match.

Answer 4 · 2016-09-07T10:02:23

A C version for *nix that uses mmap, for maximum memory savings.

 // avp 2016 find one file (av[2]) into other file (av[1])
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <sys/mman.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <err.h>
 #include <unistd.h>
 #include <fcntl.h>
 #include <sysexits.h>
 #include <errno.h>

 struct fmap {
   char *data;  // stays NULL for empty (filesize == 0) files: they are not mapped
   size_t len;  // filesize from fstat
   int errn;    // errno after mapfile() (0 for an empty file!)
 };

 struct fmap mapfile (const char *fn)
 {
   struct fmap fm = {0, 0, EINVAL}; // positional init so it also compiles as C++
   if (fn) {
     int old = errno;
     errno = 0;
     int fd = open(fn, O_RDONLY);
     struct stat st;
     if (fd > -1) {
       if (fstat(fd, &st) == 0 && (fm.len = st.st_size)) {
         void *p = mmap(NULL, fm.len, PROT_READ, MAP_PRIVATE, fd, 0);
         if (p != MAP_FAILED) {  // mmap signals failure with MAP_FAILED, not NULL
           fm.data = (char *)p;
           madvise(fm.data, fm.len, MADV_SEQUENTIAL);
         }
       }
       close(fd);                // the mapping stays valid after close
     }
     fm.errn = errno;
     errno = old;
   }
   return fm;
 }

 // return the index just past the next '\n' at or after `from` (or p->len)
 static inline size_t skipnl (struct fmap *p, size_t from)
 {
   while (from < p->len && p->data[from++] != '\n')
     ;
   return from;
 }

 int main (int ac, char *av[])
 {
   if (ac != 3)
     errx(EX_USAGE, "search file2 in file1 (by lines) Usage: %s file1 file2", av[0]);
   struct fmap f1 = mapfile(av[1]), f2 = mapfile(av[2]);
   if (!f1.data)
     (errno = f1.errn) ? err(EX_DATAERR, "file %s", av[1])
                       : errx(EX_DATAERR, "empty file %s", av[1]);
   if (!f2.data)
     (errno = f2.errn) ? err(EX_DATAERR, "file %s", av[2])
                       : errx(EX_DATAERR, "empty file %s", av[2]);
   size_t i = 0,
     lineno = 0, // current line number in f1
     n2l = 0;    // number of '\n' in f2 (to adjust lineno when f2 is found in f1)
   for (i = 0; i < f2.len; i++)
     if (f2.data[i] == '\n')
       n2l++;
   // continue search after successful matching from the beginning of new line
   int need_skipnl = f2.data[f2.len - 1] != '\n';
   i = 0;
   while (i < f1.len) {
     // compare only while f2 still fits into the remainder of f1
     if (f2.len <= f1.len - i && memcmp(f1.data + i, f2.data, f2.len) == 0) {
       printf("%ld\n", (long)lineno + 1);
       lineno += n2l;
       i += f2.len;
       if (need_skipnl)
         i = skipnl(&f1, i), lineno++;
     } else
       i = skipnl(&f1, i), lineno++;
   }
   return munmap(f1.data, f1.len) + munmap(f2.data, f2.len) ? EX_OSERR : 0;
 }

It may be worth designing the return codes differently, for example returning 1 if nothing is found (currently, going by /usr/include/sysexits.h on Ubuntu, it returns 0, 64, 65 or 71).

P.S.
Boundary conditions (file sizes that are multiples of the page size) were not tested.
