Remove from a file the lines that contain Unicode symbols (bash)

Question

I have a large file (about 30 GB). I need to remove from it all the lines that contain anything other than a-zA-Z0-9 and the special characters (! @ # $ % ^ & * ( ) and so on).

Please clarify: all lines consisting only of Unicode, or all lines containing at least one Unicode character?
Check whether the file is valid UTF-8: iconv -f UTF-8 your_file > /dev/null; echo $?
But a-zA-Z0-9 and the special characters are themselves Unicode characters too.
@andreymal, this point has already been raised in the discussion.
I think that with this wording there can be a great many solutions, and none of them will satisfy the question as posed.
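
The iconv check from the comment above can also be approximated in Python. This is only a sketch: the function name `is_valid_utf8` and the chunk size are assumptions, not something from the thread. It decodes the stream chunk by chunk with an incremental decoder, so a large file never has to fit in memory.

```python
import codecs
import io
from typing import BinaryIO


def is_valid_utf8(stream: BinaryIO, chunk_size: int = 1 << 20) -> bool:
    """Return True if the whole binary stream decodes as strict UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                # Flush: raises if a multi-byte sequence was left unfinished.
                decoder.decode(b"", final=True)
                return True
            decoder.decode(chunk)
    except UnicodeDecodeError:
        return False


# Small in-memory examples; a real file would be opened with open(path, "rb").
print(is_valid_utf8(io.BytesIO("Привет".encode("utf-8"))))
print(is_valid_utf8(io.BytesIO("ž".encode("cp1250"))))
```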

Accepted Answer · 2018-08-19T16:30:21

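 ~$ grep -Pv '^[a-zA-Z0-9!@#$%^&*()\[\]{}]*$' big_file

Add any characters that are missing to the character class. The pattern is in single quotes so that the shell does not interpret $, \ or ! inside it.

-v inverts the match: lines that match the pattern are not printed.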
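
The same whitelist idea can be sketched outside grep. The Python below is an illustration only, not the method of this answer: the names `ALLOWED`, `is_clean` and `filter_lines` are assumptions, and the character class simply mirrors the one in the grep pattern above. It works on bytes, so the result does not depend on the locale.

```python
import re
from typing import Iterable, Iterator

# Assumed whitelist, mirroring the grep character class;
# extend it with whatever characters are missing.
ALLOWED = re.compile(rb"[a-zA-Z0-9!@#$%^&*()\[\]{}]*")


def is_clean(line: bytes) -> bool:
    """True if the line (without its line ending) has only whitelisted bytes."""
    return ALLOWED.fullmatch(line.rstrip(b"\r\n")) is not None


def filter_lines(lines: Iterable[bytes], keep_clean: bool) -> Iterator[bytes]:
    """Yield the clean lines if keep_clean, otherwise the rest (like grep -v)."""
    for line in lines:
        if is_clean(line) == keep_clean:
            yield line


sample = [b"lol8798\n", "yuUBJH^87\u3100$\n".encode("utf-8")]
print(list(filter_lines(sample, keep_clean=True)))
print(list(filter_lines(sample, keep_clean=False)))
```

Passing `keep_clean=False` corresponds to the grep command with -v; passing `keep_clean=True` keeps only the lines made up entirely of whitelisted characters.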

After some research:

UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). A file that contains only 7-bit ASCII characters therefore has the same encoding under both ASCII and UTF-8.

grep can read binary files as text (the -a option). It turns out you need to exclude the byte values from \x00 to \x7F:

  ~$ grep -Pav '^[\x00-\x7F]*$' big_file
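
The ASCII-compatibility property quoted above is easy to check directly. A small Python sketch (the helper name `is_ascii_line` is hypothetical, and `bytes.isascii` needs Python 3.7 or later): an ASCII character keeps its single-byte value in UTF-8, while every byte of a non-ASCII character is 0x80 or higher, which is why a per-byte range test is enough.

```python
def is_ascii_line(line: bytes) -> bool:
    """True if every byte of the line is in the range 0x00-0x7F."""
    return line.isascii()


# ASCII characters keep their single-byte value in UTF-8.
assert "A".encode("utf-8") == b"\x41"
assert "~".encode("utf-8") == "~".encode("ascii")

# A non-ASCII character becomes several bytes, all of them >= 0x80.
encoded = "Б".encode("utf-8")
print(encoded.hex())
assert all(byte >= 0x80 for byte in encoded)

print(is_ascii_line(b"lol8798\n"), is_ascii_line(encoded + b"\n"))
```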

Yes, this turned out to be an interesting question; it is easy to get lost in the forest of encodings for printable characters.

Working version, based on the key article [4]. Summing up all the above: you need to match the byte values that satisfy the condition [\x00-\x7D], but print everything outside this range.

This assumes, of course, that we are dealing with something like Unicode / UTF-8 / ASCII text, and not a "raw" binary file.

 ~$ cat test-3
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]@abcdefghijklmnopqrstuvwxyz{|}|
 АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя
 ~$ hexdump -C test-3
 00000000  21 22 23 24 25 26 27 28  29 2a 2b 2c 2d 2e 2f 30  |!"#$%&'()*+,-./0|
 00000010  31 32 33 34 35 36 37 38  39 3a 3b 3c 3d 3e 3f 40  |123456789:;<=>?@|
 00000020  41 42 43 44 45 46 47 48  49 4a 4b 4c 4d 4e 4f 50  |ABCDEFGHIJKLMNOP|
 00000030  51 52 53 54 55 56 57 58  59 5a 5b 5c 5d 40 61 62  |QRSTUVWXYZ[\]@ab|
 00000040  63 64 65 66 67 68 69 6a  6b 6c 6d 6e 6f 70 71 72  |cdefghijklmnopqr|
 00000050  73 74 75 76 77 78 79 7a  7b 7c 7d 7c 0a d0 90 d0  |stuvwxyz{|}|....|
 00000060  b0 d0 91 d0 b1 d0 92 d0  b2 d0 93 d0 b3 d0 94 d0  |................|
 00000070  b4 d0 95 d0 b5 d0 81 d1  91 d0 96 d0 b6 d0 97 d0  |................|
 00000080  b7 d0 98 d0 b8 d0 99 d0  b9 d0 9a d0 ba d0 9b d0  |................|
 00000090  bb d0 9c d0 bc d0 9d d0  bd d0 9e d0 be d0 9f d0  |................|
 000000a0  bf d0 a0 d1 80 d0 a1 d1  81 d0 a2 d1 82 d0 a3 d1  |................|
 000000b0  83 d0 a4 d1 84 d0 a5 d1  85 d0 a6 d1 86 d0 a7 d1  |................|
 000000c0  87 d0 a8 d1 88 d0 a9 d1  89 d0 aa d1 8a d0 ab d1  |................|
 000000d0  8b d0 ac d1 8c d0 ad d1  8d d0 ae d1 8e d0 af d1  |................|
 000000e0  8f 0a                                             |..|
 000000e2
 ~$ LC_CTYPE=C grep -Pv "[\x00-\x7D]" test-3
 АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя

Because the question is not fully specified, I added my own criterion: output any line that does NOT contain a character with a value from \x00 to \x7D.
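
For comparison, this criterion can be restated in Python. It is a sketch under assumptions: `LOW_BYTE`, `has_no_low_bytes` and `filter_file` are illustrative names, the 0x00-0x7D range is taken from the grep command as written, and the trailing newline is stripped before testing because grep does not see it either. A binary file object is iterated one line at a time, so a large file is not loaded into memory at once.

```python
import io
import re
from typing import BinaryIO

# Any byte in the range used by the grep command above.
LOW_BYTE = re.compile(rb"[\x00-\x7D]")


def has_no_low_bytes(line: bytes) -> bool:
    """True if the line, minus its trailing newline, has no byte in 0x00-0x7D."""
    return LOW_BYTE.search(line.rstrip(b"\n")) is None


def filter_file(src: BinaryIO, dst: BinaryIO) -> None:
    """Copy to dst only the lines of src that pass has_no_low_bytes."""
    for line in src:  # binary streams are read lazily, line by line
        if has_no_low_bytes(line):
            dst.write(line)


# In-memory stand-in for a two-line file like test-3 (shortened).
data = ("ABC xyz {|}|\n" + "АаБбВв\n").encode("utf-8")
out = io.BytesIO()
filter_file(io.BytesIO(data), out)
print(out.getvalue().decode("utf-8"), end="")
```

With a real file, `open(path, "rb")` and `sys.stdout.buffer` could stand in for the two `BytesIO` objects.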

Links

ASCII and UNICODE specification

(I tested them with a mismatched encoding, that is, a UTF-8 system locale and a CP1250 file) - your RE skips them.
It's ironic that within a few hours each of us arrived at the other's original idea :) The engine eats your nickname when a comment is posted.

Answer 2 · 2018-08-19T16:28:37

Thank you for the question; it is unexpectedly difficult for all its apparent simplicity.

I could not solve it with grep, but a simple Perl script does what is needed:

- myfilter.pl -

 while (<STDIN>) {
     if ($_ !~ m/[^\w\s\/^.@#$%&*(){}\[\],:;?!<>-]/) {
         print "$_";
     }
 }

Usage:

 cat big_file.txt | perl myfilter.pl > filtered_file.txt

Blank lines are kept!

Alexander Prokoshev

I thought about Python 3, where all strings are Unicode by default. The pattern given by the asker can also be Unicode. Everything will depend on the encoding of the saved file. - Hellseher
The "linux" tag is set. In UTF-8, the set described by the asker is single-byte and coincides completely with ASCII. So that is unlikely, provided we are talking about a text file (which is not obvious from the question, I agree). - Alexander Prokoshev
Characters of this kind are let through! Maybe approach it from the opposite direction after all? What is required is to keep only the lines consisting of a-zA-Z0-9 (and the special characters). Not all of these classes have to be present at once: for example, lol8798 (a-z0-9) suits us and we keep this line. But the line yuUBJH ^ 87 ㄀㄀㄀㄀ $ we remove, because it contains the foreign character ㄀ - LorDo
@Hellseher's idea is inherently more correct. It remains to understand why it does not work in its current form... By the way, I checked your line (with the symbol 0x84e3) on my machine, and it is removed for me. Strange... - Alexander Prokoshev
Fixed; it seems to work (checked on kanji). I have updated the answer, please check. - Alexander Prokoshev
