How best to break a line into segments

Question

There is a string and length of the segment

my $string = 'abcdef'; my $cutLength = 3;

You need to get a list of strings ( substr there or (( /(?=(.{3}))/g ( /(?=(.{3}))/g )))

 @array = ('abc', 'bcd', 'cde', 'def');

The IRL in the $string text is the size of a small volume of war and peace, and $cutLength several lengths of $cutLength segments.

How to effectively make such sets of pearls?

PinkTux PinkTux one · Answer 1 · 2016-06-27T19:18:28

An example with a map , of course, looks like a “pearl”. But this is not always justified. Take the text more and check:

 use Modern::Perl; use File::Slurp; use Time::HiRes qw/gettimeofday tv_interval/; use constant CUT_LENGTH => 4; my $string = read_file( 'wim.txt' ); my $string_length = length $string; my @data; say "String length: $string_length bytes"; my $tstart = gettimeofday; for( 0 .. ($string_length - CUT_LENGTH) ) { push @data, substr($string, $_, CUT_LENGTH); } say tv_interval( [$tstart] ); $#data = -1; $tstart = gettimeofday; @data = map { substr($string, $_, CUT_LENGTH) } 0..($string_length - CUT_LENGTH); say tv_interval( [$tstart] );

Conclusion:

 String length: 3418597 bytes 1.13446799504089 2.48430399023438

So, the overhead of the map slows down the process by more than two times. This does not mean that the map is evil :) Just when it comes to efficiency, the easiest way to answer the question in each specific case is to take the profile, at least in the most primitive way.

substr mentioned in the question itself; it is the first thing that comes to mind to use.
but in the case when there is a list of lengths of segments when using substr there are nuances, see how many testSubstr1 and testSubstr2 from my code are testSubstr1 .
Once again I can repeat: if the word "profiling" is not familiar - to the morgue ... Well, this is true, not for you, of course, but for completely oligophrenics and fools.
look at the discussion on stackoverflow.com/questions/25006875/… - where the topstarter was offered a way to work with a different string from substr, I expected some such discussion here.

Community spirit ♦ one · Accepted Answer · 2016-06-28T17:01:29

I did a little research and by the results I’ll say that the fastest approaches to solving the problem are

 my @data = $string =~ /(?=(.{$CUT_LENGTH}))/g;

and

 my @array = split //, $string; my $array_length = scalar @array; my @data; for ( 0 .. ($array_length - $CUT_LENGTH) ) { push @data, join '', @array[$_ .. ($_ + $cut_length - 1)]; }

Working with strings through transcoding into UCS-4le is out of competition in time, but, unfortunately, the speed is leveled by recoding back to utf-8 (for further needs).

Code listing

 #!/usr/bin/env perl use strict; use common::sense; use Time::HiRes qw/gettimeofday tv_interval/; use HTML::Strip; use Encode; use Data::Dumper; use File::Slurp; ## use constant STR_REPEAT => 1; ## #testChopReverse(8, 5, 20, 8); say '---'; testArrayJoin1(8, 5, 20, 8); say '---'; testArrayJoin2(8, 5, 20, 8); say '---'; testRegexp2(8, 5, 20, 8); say '---'; testRegexp1(8, 5, 20, 8); say '---'; testRegexp3(8, 5, 20, 8); say '---'; testSubstr2(8, 5, 20, 8); say '---'; testSubstr1(8, 5, 20, 8); say '---'; testSubstr3(8, 5, 20, 8); say '---'; ## sub testSubstr1 { my @q = @_; my $string = getString( STR_REPEAT ); my $string_length = length $string; foreach my $cut_length (@q) { my @data; my $tstart = gettimeofday; for ( 0 .. ($string_length - $cut_length) ) { push @data, substr($string, $_, $cut_length); } my $time = tv_interval([$tstart]); say Dumper ['substr 1', $cut_length, scalar(@data), $time];#, @data[0..3]]; } } sub testSubstr2 { my @q = @_; foreach my $cut_length (@q) { my $string = getString( STR_REPEAT ); my $string_length = length $string; my @data; my $tstart = gettimeofday; for ( 0 .. ($string_length - $cut_length) ) { push @data, substr($string, $_, $cut_length); } my $time = tv_interval([$tstart]); say Dumper ['substr 2', $cut_length, scalar(@data), $time];#, @data[0..3]]; } } sub testSubstr3 { # http://stackoverflow.com/questions/25006875/performance-issue-with-substr-on-a-very-long-utf-8-string my @q = @_; my $string = getString( STR_REPEAT ); $string = Encode::encode('UTF-32LE', $string); my $string_length = length($string) / 4; foreach my $cut_length (@q) { my @data; my $tstart = gettimeofday; for ( 0 .. ($string_length - $cut_length) ) { # my $t = Encode::decode('UTF-32LE', substr($string, $_ * 4, $cut_length * 4)); # my $t = Encode::encode_utf8(Encode::decode('UTF-32LE', substr($string, $_ * 4, $cut_length * 4))); my $t = substr($string, $_ * 4, $cut_length * 4); push @data, $t; } my $time = tv_interval([$tstart]); say Dumper ['substr 3', $cut_length, scalar(@data), $time];#, @data[0..3]]; } } sub testRegexp1 { my @q = @_; my $string = getString( STR_REPEAT ); study $string; foreach my $cut_length (@q) { my $tstart = gettimeofday; my @data = $string =~ /(?=(.{$cut_length}))/g; my $time = tv_interval([$tstart]); say Dumper ['regexp 1', $cut_length, scalar(@data), $time];#, @data[0..3]]; } } sub testRegexp2 { my @q = @_; foreach my $cut_length (@q) { my $string = getString( STR_REPEAT ); study $string; my $tstart = gettimeofday; my @data = $string =~ /(?=(.{$cut_length}))/g; my $time = tv_interval([$tstart]); say Dumper ['regexp 2', $cut_length, scalar(@data), $time];#, @data[0..3]]; } } sub testRegexp3 { my @q = @_; my $string = getString( STR_REPEAT ); study $string; foreach my $cut_length (@q) { my $tstart = gettimeofday; my @data; while ($string =~ /(?=(.{$cut_length}))/g) { push @data, $1; } my $time = tv_interval([$tstart]); say Dumper ['regexp 3', $cut_length, scalar(@data), $time];#, @data[0..3]]; } } sub testArrayJoin1 { my @q = @_; my $string = getString( STR_REPEAT ); my @array = split //, $string; my $array_length = scalar @array; foreach my $cut_length (@q) { my $tstart = gettimeofday; my @data; for ( 0 .. ($array_length - $cut_length) ) { push @data, join '', @array[$_ .. ($_ + $cut_length - 1)]; } my $time = tv_interval([$tstart]); say Dumper ['array join 1', $cut_length, scalar(@data), $time];#, @data[0..3]]; } } sub testArrayJoin2 { my @q = @_; my $string = getString( STR_REPEAT ); study $string; foreach my $cut_length (@q) { my $tstart = gettimeofday; my @array = split //, $string; my @data; while (@array) { push @data, join '', @array[0 .. ($cut_length - 1)]; shift @array; } my $time = tv_interval([$tstart]); say Dumper ['array join 2', $cut_length, scalar(@data), $time];#, @data[0..3]]; } } sub testChopReverse { my @q = @_; foreach my $cut_length (@q) { my $string = getString( STR_REPEAT ); study $string; my $tstart = gettimeofday; my @data; while ($string) { push @data, unpack("(A3)", $string); $string = reverse $string; chop $string; $string = reverse $string; } my $time = tv_interval([$tstart]); say Dumper ['chop reverse', $cut_length, scalar(@data), $time, @data[0..3]]; } } sub getString { my $x = shift; my $string = read_file( 'Vojna i mir. Tom 1.txt' ); if ($x && $x > 1) { $string = join ' ', map {$string} 1 .. $x; } unless (Encode::is_utf8($string)) { $string = Encode::decode_utf8($string); } my $hs = HTML::Strip->new(striptags=>[qw/script iframe frame style link/]); $string = $hs->parse($string); $hs->eof; undef $hs; $string = lc $string; $string =~ s/[[:punct:]]/ /g; $string =~ s/\n|\r/ /gm; $string =~ s/[[:space:]]/ /gm; $string =~ s/^\s+|\s+$//g; $string =~ s|\s{2,}| |g; return $string; } # 1;

Timings

 $VAR1 = [ 'array join 1', 8, 384610, '0.260628008132935' ]; $VAR1 = [ 'array join 1', 5, 384613, '0.193118049865723' ]; $VAR1 = [ 'array join 1', 20, 384598, '0.478055062683105' ]; $VAR1 = [ 'array join 1', 8, 384610, '0.258578112854004' ]; --- $VAR1 = [ 'array join 2', 8, 384617, '0.340556945098877' ]; $VAR1 = [ 'array join 2', 5, 384617, '0.276476087554932' ]; $VAR1 = [ 'array join 2', 20, 384617, '0.558982025344849' ]; $VAR1 = [ 'array join 2', 8, 384617, '0.339319053970337' ]; --- $VAR1 = [ 'regexp 2', 8, 384610, '0.198590089614868' ]; $VAR1 = [ 'regexp 2', 5, 384613, '0.188468908859253' ]; $VAR1 = [ 'regexp 2', 20, 384598, '0.248025941436768' ]; $VAR1 = [ 'regexp 2', 8, 384610, '0.197826884658813' ]; --- $VAR1 = [ 'regexp 1', 8, 384610, '0.198430934890747' ]; $VAR1 = [ 'regexp 1', 5, 384613, '0.183469981155395' ]; $VAR1 = [ 'regexp 1', 20, 384598, '0.249148044815063' ]; $VAR1 = [ 'regexp 1', 8, 384610, '0.205187035507202' ]; --- $VAR1 = [ 'regexp 3', 8, 384610, '0.297225016693115' ]; $VAR1 = [ 'regexp 3', 5, 384613, '0.280919108428955' ]; $VAR1 = [ 'regexp 3', 20, 384598, '0.32734296295166' ]; $VAR1 = [ 'regexp 3', 8, 384610, '0.297125981842041' ]; --- $VAR1 = [ 'substr 2', 8, 384610, '9.308260052948' ]; $VAR1 = [ 'substr 2', 5, 384613, '9.29751593013001' ]; $VAR1 = [ 'substr 2', 20, 384598, '9.29055691123962' ]; $VAR1 = [ 'substr 2', 8, 384610, '9.31138106384277' ]; --- $VAR1 = [ 'substr 1', 8, 384610, '9.29793611442566' ]; $VAR1 = [ 'substr 1', 5, 384613, '43.1393571024628' ]; $VAR1 = [ 'substr 1', 20, 384598, '40.8735769428101' ]; $VAR1 = [ 'substr 1', 8, 384610, '46.9814130999603' ]; --- $VAR1 = [ 'substr 3', 8, 384610, '0.116764969924927' ]; $VAR1 = [ 'substr 3', 5, 384613, '0.120482912780762' ]; $VAR1 = [ 'substr 3', 20, 384598, '0.125965972229004' ]; $VAR1 = [ 'substr 3', 8, 384610, '0.120614910919189' ];

Thanks for the complicity and ideas of the user @squidepps

In general, it is not clear what have Encode and HTML. Under the nitprofom chase - surprised, and very. - PinkTux

edem edem 4,755 2 12 36 · Answer 3 · 2016-06-27T15:47:56

 my $string = 'abcdef'; my $cutLength = 4; my @arr = map { substr($string, $_, $cutLength) } 0..(length ($string) - $cutLength); print join("\n", @arr);

How best to break a line into segments

3 answers 3

More articles: