There is a string and length of the segment

my $string = 'abcdef'; my $cutLength = 3; 

You need to get a list of strings ( substr there or (( /(?=(.{3}))/g ( /(?=(.{3}))/g )))

 @array = ('abc', 'bcd', 'cde', 'def'); 

The IRL in the $string text is the size of a small volume of war and peace, and $cutLength several lengths of $cutLength segments.

How to effectively make such sets of pearls?

    3 answers 3

    An example with a map , of course, looks like a “pearl”. But this is not always justified. Take the text more and check:

     use Modern::Perl; use File::Slurp; use Time::HiRes qw/gettimeofday tv_interval/; use constant CUT_LENGTH => 4; my $string = read_file( 'wim.txt' ); my $string_length = length $string; my @data; say "String length: $string_length bytes"; my $tstart = gettimeofday; for( 0 .. ($string_length - CUT_LENGTH) ) { push @data, substr($string, $_, CUT_LENGTH); } say tv_interval( [$tstart] ); $#data = -1; $tstart = gettimeofday; @data = map { substr($string, $_, CUT_LENGTH) } 0..($string_length - CUT_LENGTH); say tv_interval( [$tstart] ); 

    Conclusion:

     String length: 3418597 bytes 1.13446799504089 2.48430399023438 

    So, the overhead of the map slows down the process by more than two times. This does not mean that the map is evil :) Just when it comes to efficiency, the easiest way to answer the question in each specific case is to take the profile, at least in the most primitive way.

    • substr mentioned in the question itself; it is the first thing that comes to mind to use. but in the case when there is a list of lengths of segments when using substr there are nuances, see how many testSubstr1 and testSubstr2 from my code are testSubstr1 . besides, substr not the fastest. - nörbörnĂ«n
    • Once again I can repeat: if the word "profiling" is not familiar - to the morgue ... Well, this is true, not for you, of course, but for completely oligophrenics and fools. - PinkTux
    • I expected this type of answer. profiling is good when you know what to profile, i.e. several ways to solve the problem. look at the discussion on stackoverflow.com/questions/25006875/… - where the topstarter was offered a way to work with a different string from substr, I expected some such discussion here. - nörbörnĂ«n

    I did a little research and by the results I’ll say that the fastest approaches to solving the problem are

     my @data = $string =~ /(?=(.{$CUT_LENGTH}))/g; 

    and

     my @array = split //, $string; my $array_length = scalar @array; my @data; for ( 0 .. ($array_length - $CUT_LENGTH) ) { push @data, join '', @array[$_ .. ($_ + $cut_length - 1)]; } 

    Working with strings through transcoding into UCS-4le is out of competition in time, but, unfortunately, the speed is leveled by recoding back to utf-8 (for further needs).

    Code listing

     #!/usr/bin/env perl use strict; use common::sense; use Time::HiRes qw/gettimeofday tv_interval/; use HTML::Strip; use Encode; use Data::Dumper; use File::Slurp; ## use constant STR_REPEAT => 1; ## #testChopReverse(8, 5, 20, 8); say '---'; testArrayJoin1(8, 5, 20, 8); say '---'; testArrayJoin2(8, 5, 20, 8); say '---'; testRegexp2(8, 5, 20, 8); say '---'; testRegexp1(8, 5, 20, 8); say '---'; testRegexp3(8, 5, 20, 8); say '---'; testSubstr2(8, 5, 20, 8); say '---'; testSubstr1(8, 5, 20, 8); say '---'; testSubstr3(8, 5, 20, 8); say '---'; ## sub testSubstr1 { my @q = @_; my $string = getString( STR_REPEAT ); my $string_length = length $string; foreach my $cut_length (@q) { my @data; my $tstart = gettimeofday; for ( 0 .. ($string_length - $cut_length) ) { push @data, substr($string, $_, $cut_length); } my $time = tv_interval([$tstart]); say Dumper ['substr 1', $cut_length, scalar(@data), $time];#, @data[0..3]]; } } sub testSubstr2 { my @q = @_; foreach my $cut_length (@q) { my $string = getString( STR_REPEAT ); my $string_length = length $string; my @data; my $tstart = gettimeofday; for ( 0 .. ($string_length - $cut_length) ) { push @data, substr($string, $_, $cut_length); } my $time = tv_interval([$tstart]); say Dumper ['substr 2', $cut_length, scalar(@data), $time];#, @data[0..3]]; } } sub testSubstr3 { # http://stackoverflow.com/questions/25006875/performance-issue-with-substr-on-a-very-long-utf-8-string my @q = @_; my $string = getString( STR_REPEAT ); $string = Encode::encode('UTF-32LE', $string); my $string_length = length($string) / 4; foreach my $cut_length (@q) { my @data; my $tstart = gettimeofday; for ( 0 .. ($string_length - $cut_length) ) { # my $t = Encode::decode('UTF-32LE', substr($string, $_ * 4, $cut_length * 4)); # my $t = Encode::encode_utf8(Encode::decode('UTF-32LE', substr($string, $_ * 4, $cut_length * 4))); my $t = substr($string, $_ * 4, $cut_length * 4); push @data, $t; } my $time = tv_interval([$tstart]); say Dumper ['substr 3', $cut_length, scalar(@data), $time];#, @data[0..3]]; } } sub testRegexp1 { my @q = @_; my $string = getString( STR_REPEAT ); study $string; foreach my $cut_length (@q) { my $tstart = gettimeofday; my @data = $string =~ /(?=(.{$cut_length}))/g; my $time = tv_interval([$tstart]); say Dumper ['regexp 1', $cut_length, scalar(@data), $time];#, @data[0..3]]; } } sub testRegexp2 { my @q = @_; foreach my $cut_length (@q) { my $string = getString( STR_REPEAT ); study $string; my $tstart = gettimeofday; my @data = $string =~ /(?=(.{$cut_length}))/g; my $time = tv_interval([$tstart]); say Dumper ['regexp 2', $cut_length, scalar(@data), $time];#, @data[0..3]]; } } sub testRegexp3 { my @q = @_; my $string = getString( STR_REPEAT ); study $string; foreach my $cut_length (@q) { my $tstart = gettimeofday; my @data; while ($string =~ /(?=(.{$cut_length}))/g) { push @data, $1; } my $time = tv_interval([$tstart]); say Dumper ['regexp 3', $cut_length, scalar(@data), $time];#, @data[0..3]]; } } sub testArrayJoin1 { my @q = @_; my $string = getString( STR_REPEAT ); my @array = split //, $string; my $array_length = scalar @array; foreach my $cut_length (@q) { my $tstart = gettimeofday; my @data; for ( 0 .. ($array_length - $cut_length) ) { push @data, join '', @array[$_ .. ($_ + $cut_length - 1)]; } my $time = tv_interval([$tstart]); say Dumper ['array join 1', $cut_length, scalar(@data), $time];#, @data[0..3]]; } } sub testArrayJoin2 { my @q = @_; my $string = getString( STR_REPEAT ); study $string; foreach my $cut_length (@q) { my $tstart = gettimeofday; my @array = split //, $string; my @data; while (@array) { push @data, join '', @array[0 .. ($cut_length - 1)]; shift @array; } my $time = tv_interval([$tstart]); say Dumper ['array join 2', $cut_length, scalar(@data), $time];#, @data[0..3]]; } } sub testChopReverse { my @q = @_; foreach my $cut_length (@q) { my $string = getString( STR_REPEAT ); study $string; my $tstart = gettimeofday; my @data; while ($string) { push @data, unpack("(A3)", $string); $string = reverse $string; chop $string; $string = reverse $string; } my $time = tv_interval([$tstart]); say Dumper ['chop reverse', $cut_length, scalar(@data), $time, @data[0..3]]; } } sub getString { my $x = shift; my $string = read_file( 'Vojna i mir. Tom 1.txt' ); if ($x && $x > 1) { $string = join ' ', map {$string} 1 .. $x; } unless (Encode::is_utf8($string)) { $string = Encode::decode_utf8($string); } my $hs = HTML::Strip->new(striptags=>[qw/script iframe frame style link/]); $string = $hs->parse($string); $hs->eof; undef $hs; $string = lc $string; $string =~ s/[[:punct:]]/ /g; $string =~ s/\n|\r/ /gm; $string =~ s/[[:space:]]/ /gm; $string =~ s/^\s+|\s+$//g; $string =~ s|\s{2,}| |g; return $string; } # 1; 

    Timings

     $VAR1 = [ 'array join 1', 8, 384610, '0.260628008132935' ]; $VAR1 = [ 'array join 1', 5, 384613, '0.193118049865723' ]; $VAR1 = [ 'array join 1', 20, 384598, '0.478055062683105' ]; $VAR1 = [ 'array join 1', 8, 384610, '0.258578112854004' ]; --- $VAR1 = [ 'array join 2', 8, 384617, '0.340556945098877' ]; $VAR1 = [ 'array join 2', 5, 384617, '0.276476087554932' ]; $VAR1 = [ 'array join 2', 20, 384617, '0.558982025344849' ]; $VAR1 = [ 'array join 2', 8, 384617, '0.339319053970337' ]; --- $VAR1 = [ 'regexp 2', 8, 384610, '0.198590089614868' ]; $VAR1 = [ 'regexp 2', 5, 384613, '0.188468908859253' ]; $VAR1 = [ 'regexp 2', 20, 384598, '0.248025941436768' ]; $VAR1 = [ 'regexp 2', 8, 384610, '0.197826884658813' ]; --- $VAR1 = [ 'regexp 1', 8, 384610, '0.198430934890747' ]; $VAR1 = [ 'regexp 1', 5, 384613, '0.183469981155395' ]; $VAR1 = [ 'regexp 1', 20, 384598, '0.249148044815063' ]; $VAR1 = [ 'regexp 1', 8, 384610, '0.205187035507202' ]; --- $VAR1 = [ 'regexp 3', 8, 384610, '0.297225016693115' ]; $VAR1 = [ 'regexp 3', 5, 384613, '0.280919108428955' ]; $VAR1 = [ 'regexp 3', 20, 384598, '0.32734296295166' ]; $VAR1 = [ 'regexp 3', 8, 384610, '0.297125981842041' ]; --- $VAR1 = [ 'substr 2', 8, 384610, '9.308260052948' ]; $VAR1 = [ 'substr 2', 5, 384613, '9.29751593013001' ]; $VAR1 = [ 'substr 2', 20, 384598, '9.29055691123962' ]; $VAR1 = [ 'substr 2', 8, 384610, '9.31138106384277' ]; --- $VAR1 = [ 'substr 1', 8, 384610, '9.29793611442566' ]; $VAR1 = [ 'substr 1', 5, 384613, '43.1393571024628' ]; $VAR1 = [ 'substr 1', 20, 384598, '40.8735769428101' ]; $VAR1 = [ 'substr 1', 8, 384610, '46.9814130999603' ]; --- $VAR1 = [ 'substr 3', 8, 384610, '0.116764969924927' ]; $VAR1 = [ 'substr 3', 5, 384613, '0.120482912780762' ]; $VAR1 = [ 'substr 3', 20, 384598, '0.125965972229004' ]; $VAR1 = [ 'substr 3', 8, 384610, '0.120614910919189' ]; 

    Thanks for the complicity and ideas of the user @squidepps

    • In general, it is not clear what have Encode and HTML. Under the nitprofom chase - surprised, and very. - PinkTux
     my $string = 'abcdef'; my $cutLength = 4; my @arr = map { substr($string, $_, $cutLength) } 0..(length ($string) - $cutLength); print join("\n", @arr);