I have a large file (about 30 GB). I need to cut out of it all the lines that contain anything other than a-zA-Z0-9 and the special characters (! @ # $ % ^ & * ( ) and so on).
2 answers
~$ grep -Pv "^[a-zA-Z0-9!@\#\$%\^\&\*\\(\\)\\[\\]\\{\\}]*$" big_file
Add whatever characters are missing to the group.
-v inverts the match: lines that match the pattern are not printed.
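For comparison, here is a minimal Python sketch of the same test, working on raw bytes so that the file's encoding does not matter. The names `ALLOWED_LINE` and `has_outside_group` are made up for illustration, and the character group simply copies the one from the grep command above; it is an assumption about what the asker wants, not a definitive list.

```python
import re

# Same character group as in the grep command above; extend it as needed.
# Hypothetical name, used only for this illustration.
ALLOWED_LINE = re.compile(rb"[a-zA-Z0-9!@#$%^&*()\[\]{}]*")


def has_outside_group(line: bytes) -> bool:
    """True if the line (without its newline) has a byte outside the group."""
    return ALLOWED_LINE.fullmatch(line.rstrip(b"\n")) is None


# Like grep -v with an anchored pattern: keep the lines that do NOT consist
# solely of allowed characters.
sample = [b"lol8798\n", b"a b\n", "Привет\n".encode("utf-8"), b"\n"]
printed = [ln for ln in sample if has_outside_group(ln)]
```

For a real file one would iterate over `open("big_file", "rb")` instead of a list, which reads one line at a time and never loads all 30 GB into memory.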
After some research:
UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility), so a file containing only ASCII characters has the same encoding under both ASCII and UTF-8.
grep can read binary files as text (the -a option). So it turns out you need to exclude the byte values from \x00 to \x7F.
~$ grep -Pav '^[\x00-\x7F]*$' big_file
Yes, this turned out to be an interesting question: it is easy to get lost in the forest of encodings for printable characters.
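The byte-range idea above can be sketched in Python as follows. This is only an illustration under the assumption that the input is UTF-8 or plain ASCII; `is_ascii_line` is a hypothetical helper name.

```python
def is_ascii_line(line: bytes) -> bool:
    """True if every byte of the line is in the range 0x00-0x7F."""
    return all(byte <= 0x7F for byte in line)


# In UTF-8 an ASCII character is one byte below 0x80, while every byte of a
# non-ASCII character is 0x80 or above, so a per-byte check is enough.
samples = ["lol8798", "yuUBJH^87$", "Привет", "čćš"]
encoded = [s.encode("utf-8") for s in samples]
non_ascii = [raw.decode("utf-8") for raw in encoded if not is_ascii_line(raw)]
```

Here `non_ascii` collects the lines that have at least one byte outside the ASCII range, which is what an inverted match on an all-ASCII pattern selects.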
Working version, based on the key article [4]. Summing up all of the above: you need to match the byte values that satisfy the condition [\x00-\x7D], but print the lines whose characters all lie outside this range.
Of course, this assumes we are talking about something like Unicode / UTF-8 / ASCII text, and not a "raw" binary file.
~$ cat test-3
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]@abcdefghijklmnopqrstuvwxyz{|}|
АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя
~$ hexdump -C test-3
00000000  21 22 23 24 25 26 27 28  29 2a 2b 2c 2d 2e 2f 30  |!"#$%&'()*+,-./0|
00000010  31 32 33 34 35 36 37 38  39 3a 3b 3c 3d 3e 3f 40  |123456789:;<=>?@|
00000020  41 42 43 44 45 46 47 48  49 4a 4b 4c 4d 4e 4f 50  |ABCDEFGHIJKLMNOP|
00000030  51 52 53 54 55 56 57 58  59 5a 5b 5c 5d 40 61 62  |QRSTUVWXYZ[\]@ab|
00000040  63 64 65 66 67 68 69 6a  6b 6c 6d 6e 6f 70 71 72  |cdefghijklmnopqr|
00000050  73 74 75 76 77 78 79 7a  7b 7c 7d 7c 0a d0 90 d0  |stuvwxyz{|}|....|
00000060  b0 d0 91 d0 b1 d0 92 d0  b2 d0 93 d0 b3 d0 94 d0  |................|
00000070  b4 d0 95 d0 b5 d0 81 d1  91 d0 96 d0 b6 d0 97 d0  |................|
00000080  b7 d0 98 d0 b8 d0 99 d0  b9 d0 9a d0 ba d0 9b d0  |................|
00000090  bb d0 9c d0 bc d0 9d d0  bd d0 9e d0 be d0 9f d0  |................|
000000a0  bf d0 a0 d1 80 d0 a1 d1  81 d0 a2 d1 82 d0 a3 d1  |................|
000000b0  83 d0 a4 d1 84 d0 a5 d1  85 d0 a6 d1 86 d0 a7 d1  |................|
000000c0  87 d0 a8 d1 88 d0 a9 d1  89 d0 aa d1 8a d0 ab d1  |................|
000000d0  8b d0 ac d1 8c d0 ad d1  8d d0 ae d1 8e d0 af d1  |................|
000000e0  8f 0a                                             |..|
000000e2
~$ LC_CTYPE=C grep -Pv "[\x00-\x7D]" test-3
АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя

Because the question is not fully specified, I added my own criterion: output any line that does NOT contain a character with a value from \x00 to \x7D.
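The same criterion can be tried out in Python on a shortened version of the test-3 data. This is a sketch only: `lines_without_range` is a hypothetical name and the sample lines are abbreviated. Note one consequence of stopping the range at \x7D: the bytes 0x7E (`~`) and 0x7F are not in it, so a line made only of `~` is also printed.

```python
import re

# Any single byte in the range 0x00-0x7D.
IN_RANGE = re.compile(rb"[\x00-\x7D]")


def lines_without_range(data: bytes) -> list:
    """Return the lines of data that contain no byte in 0x00-0x7D."""
    return [ln for ln in data.splitlines() if IN_RANGE.search(ln) is None]


# Shortened stand-ins for the two lines of test-3, plus a line of tildes.
ascii_line = b"!\"#$%&'()*+,-./0123456789:;<=>?@ABC[\\]@abc{|}|"
cyrillic_line = "АаБбВв".encode("utf-8")
data = ascii_line + b"\n" + cyrillic_line + b"\n" + b"~~~\n"
result = lines_without_range(data)
```

The Cyrillic line survives because every byte of its UTF-8 encoding is 0x80 or higher, as the hexdump above shows.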
Links
- https://stackoverflow.com/questions/3752913
- https://unix.stackexchange.com/questions/19907
- https://unix.stackexchange.com/questions/19491
- https://stackoverflow.com/questions/31139737
- ASCII and Unicode specifications
- It does not work. Is there some way to send a file here? - Alexander Prokoshev
- @AlexanderProkoshev you can share a snippet via GitHub. - Hellseher
- Well, I can describe it in words. Take lines with č, ć, š, etc. (I tested with a mismatched encoding, that is, a UTF-8 system locale and a CP1250 file): your RE lets them through. - Alexander Prokoshev
- @AlexanderProkoshev please write with a nickname tag. Yes, I have read up on Unicode in more detail. - Hellseher
- It's ironic that in a few hours each of us came to the original idea of the other :) The engine eats up your nickname when posting a comment. - Alexander Prokoshev
Thank you for the question; it is unexpectedly difficult for all its seeming simplicity.
I could not solve it with grep, but a simple Perl script does what is needed:
- myfilter.pl -
while (<STDIN>) {
    # Print the line only if it has no character outside the allowed set.
    if ($_ !~ m/[^\w\s\/^.\@#\$%&*(){}\[\],:;?!<>-]/) {
        print "$_";
    }
}
Usage:
cat big_file.txt | perl myfilter.pl > filtered_file.txt
Blank lines are kept!
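A rough Python counterpart of this filter, streaming line by line over binary input, might look like the sketch below. The name `filter_stream` is made up, and the character class is copied from the Perl pattern. One assumption to be aware of: with a bytes pattern Python's `\w` and `\s` are ASCII-only, which may not match Perl's behaviour under every locale setting.

```python
import io
import re

# Any byte that is NOT in the allowed set (set copied from the Perl pattern).
DISALLOWED = re.compile(rb"[^\w\s/^.@#$%&*(){}\[\],:;?!<>-]")


def filter_stream(src, dst) -> int:
    """Copy lines without disallowed bytes from src to dst; return the count."""
    kept = 0
    for line in src:  # binary file objects yield one line at a time
        if DISALLOWED.search(line) is None:
            dst.write(line)
            kept += 1
    return kept


# Small in-memory demonstration; for a real run one could pass
# sys.stdin.buffer and sys.stdout.buffer instead.
src = io.BytesIO(b"lol8798\n\n" + "Привет\n".encode("utf-8") + b"a~b\n")
dst = io.BytesIO()
kept = filter_stream(src, dst)
```

As in the Perl version, blank lines pass through, because the newline itself matches `\s`.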
- I thought about Python 3, where all strings are Unicode by default. The pattern given by the author can also be Unicode. Everything will depend on the encoding of the saved file. - Hellseher
- The "linux" tag is set. In UTF-8, the set described by the author is single-byte and coincides completely with ASCII. So that is unlikely to matter, provided we are talking about a text file (this is not obvious from the question, I agree). - Alexander Prokoshev
- Characters like [...] get through! Maybe approach it from the opposite direction after all? We need to keep only the lines consisting of a-zA-Z0-9 (and the special characters). The classes need not all be present at once: for example, lol8798 (a-z0-9) suits us and we keep this line. But the line yuUBJH ^ 87 $ we remove because it contains a foreign character. - LorDo
- @Hellseher's idea is inherently more correct. It remains to understand why it does not work in its current form... By the way, I checked your line (with the symbol 0x84e3) on my machine, and it is removed for me. Strange... - Alexander Prokoshev
- Corrected; it seems to work (checked on kanji). I changed the answer, please check. - Alexander Prokoshev
- iconv -f UTF-8 your_file > /dev/null; echo $? - Hellseher
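The iconv check in the last comment reports, through its exit status, whether the file decodes cleanly as UTF-8. A comparable check can be sketched in Python with an incremental decoder, so that a large file is read in chunks. The function name `is_valid_utf8` and the chunk size are illustrative assumptions.

```python
import codecs
import io


def is_valid_utf8(stream, chunk_size: int = 1 << 20) -> bool:
    """Return True if the whole binary stream decodes as UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            # The decoder buffers a multi-byte character split across chunks.
            decoder.decode(chunk)
        # Flush: raises if the data ends in the middle of a character.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


# The same text saved as UTF-8 and as CP1250 (the mismatch from the comments).
good = io.BytesIO("čćš Привет\n".encode("utf-8"))
bad = io.BytesIO("čćš\n".encode("cp1250"))
good_ok = is_valid_utf8(good, chunk_size=1)
bad_ok = is_valid_utf8(bad)
```

Running such a check first helps tell an encoding mismatch apart from a problem in the regular expression itself.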