Hello!

I have two files with thousands of proteins: first file: protein ID + amino acid sequence; second file: protein ID + nucleotide sequence. And there is my third file - with the positions of the domains of these proteins, which are associated with my proteins and nucleotide files. I connected these three files with this code:

File (1) acid.txt contains:

Enst003

File (2) nucleotides.txt contains:

ENST00000274849 | Q9ULW3 ATGGAGGCAGAGGAATCGGAGAAGGCCGCAACGGAGCAAGAGCCGCTGGAAGGGACAGAA CAGACACTAGATGCGGAGGAGGAGCAGGAGGAATCCGAAGAAGCGGCCTGTGGCAGCAAG AAACGGGTAGTGCCAGGTATTGTGTACCTGGGCCATATCCCGCCGCGCTTCCGGCCCCTG CACGTCCGCAACCTTCTCAGCGCCTATGGCGAGGTCGGACGCGTCTTCTTTCAGGCTGAG GACCGGTTCGTGAGACGCAAGAAGAAGGCAGCAGCAGCTGCCGGAGGAAAAAAGCGGTCC TACACCAAGGACTACACCGAGGGATGGGTGGAGTTCCGTGACAAGCGCATAGCCAAGCGC GTGGCGGCCAGTCTACACAACACGCCTATGGGTGCCCGCAGGCGCAGCCCCTTCCGTTAT GATCTTTGGAACCTCAAGTACTTGCACCGTTTCACCTGGTCCCACTGA

File (3) domain.txt contains:

Q9ULW3; 46 142 

NOTE. These numbers indicate the position of the domain of proteins (that is, my domain is from 46 to 142 number)

 use strict; use Bio::SeqIO; #################################################### #MODULE 1: read protein file, and save it in a hash# #################################################### my %hash1; my $sequence = "acid.txt"; my $multifasta = Bio::SeqIO ->new (-file => "<$sequence",-format=> "fasta"); while (my $seq= $multifasta->next_seq()) { my $na= $seq->display_id; #Saves the ID in $na my $ss = $seq->seq; $hash1{$na} = $ss; } ############################################################# #MODULE 2: read nucleotide file, and save it in another hash# ############################################################# my %hash2; my $genes = "nucleotides.txt"; my $multifasta = Bio::SeqIO ->new (-file => "<$genes",-format=> "fasta"); while (my $seq= $multifasta->next_seq()) { my $na= $seq->display_id; #Saves the ID in $na my $des=$seq->description; my $ss = $seq->seq; $hash2{$na} = $ss; } ##################### #MODULE 3: my $name;# ##################### my $name; # Read from standard input chomp $name; ############################################################################## #MODULE 4: DOMAIN ANNOTATION + RELATED AMINO ACIDS AND NUCLEOTIDES IN COLUMNS# ############################################################################## foreach my $name (keys %hash1) { my $ac = (split(/\s*\|/, $name))[1]; #print "$ac\n" ; #################################################### #MODULE 4.1: DOMAIN ANNOTATION: POSITION OF DOMAINS# #################################################### open(FILE, "<" ,"domain.txt"); my @array = (<FILE>); my @lines = grep (/$ac/, @array); print for @lines; close (FILE); ############################################################ #MODULE 4.2: RELATED AMINO ACIDS AND NUCLEOTIDES IN COLUMNS# ############################################################ my @array1 = split(//, $hash1{$name}, $hash2{$name}); #CUT SEQUENCE my @array2 = unpack("a3" x (length($hash1{$name})),$hash2{$name}); #CUT NUCLEOTIDE SEQUENCE my $number = "$#array1+1"; foreach (my $count = 0; $count <= $number; $count++) { print "$count\t@array1[$count]\t@array2[$count]\n"; } } 

And here is my FILE, which I received after running the script:

 Q9ULW3; 46 142 0 M ATG 1 E GAG 2 A GCA 3 E GAG 4 E GAA 5 S TCG 6 E GAG 7 K AAG 8 A GCC 9 A GCA 10 T ACG 11 E GAG 12 Q CAA 13 E GAG 14 P CCG 15 L CTG 16 E GAA 17 G GGG 18 T ACA 19 E GAA 20 Q CAG 21 T ACA 22 L CTA 23 D GAT 24 A GCG 25 E GAG 26 E GAG 27 E GAG 28 Q CAG 29 E GAG 30 E GAA 31 S TCC 32 E GAA 33 E GAA 34 A GCG 35 A GCC 36 C TGT 37 G GGC 38 S AGC 39 K AAG 40 K AAA 41 R CGG 42 V GTA 43 V GTG 44 P CCA 45 G GGT 46 I ATT 47 V GTG 48 Y TAC 49 L CTG 50 G GGC 51 H CAT 52 I ATC 53 P CCG 54 P CCG 55 R CGC 56 F TTC 57 R CGG 58 P CCC 59 L CTG 60 H CAC 61 V GTC 62 R CGC 63 N AAC 64 L CTT 65 L CTC 66 S AGC 67 A GCC 68 Y TAT 69 G GGC 70 E GAG 71 V GTC 72 G GGA 73 R CGC 74 V GTC 75 F TTC 76 F TTT 77 Q CAG 78 A GCT 79 E GAG 80 D GAC 81 R CGG 82 F TTC 83 V GTG 84 R AGA 85 R CGC 86 K AAG 87 K AAG 88 K AAG 89 A GCA 90 A GCA 91 A GCA 92 A GCT 93 A GCC 94 G GGA 95 G GGA 96 K AAA 97 K AAG 98 R CGG 99 S TCC 100 Y TAC 101 T ACC 102 K AAG 103 D GAC 104 Y TAC 105 T ACC 106 E GAG 107 G GGA 108 W TGG 109 V GTG 110 E GAG 111 F TTC 112 R CGT 113 D GAC 114 K AAG 115 R CGC 116 I ATA 117 A GCC 118 K AAG 119 R CGC 120 V GTG 121 A GCG 122 A GCC 123 S AGT 124 L CTA 125 H CAC 126 N AAC 127 T ACG 128 P CCT 129 M ATG 130 G GGT 131 A GCC 132 R CGC 133 R AGG 134 R CGC 135 S AGC 136 P CCC 137 F TTC 138 R CGT 139 Y TAT 140 D GAT 141 L CTT 142 W TGG 143 N AAC 144 L CTC 145 K AAG 146 Y TAC 147 L TTG 148 H CAC 149 R CGT 150 F TTC 151 T ACC 152 W TGG 153 S TCC 154 H CAC 155 L CTC 

Now I would like to add a new fourth column that will contain “YES” or “NOT” - it depends on if the codons are in the range from 46 to 142 (that is, in the domain) - YES, which are not in the range from 46 142 each (outside domains) - NOT. I would like to receive the ouptut file:

  Q9ULW3; 46 142 0 M ATG NOT 1 E GAG NOT 2 A GCA NOT 3 E GAG NOT 4 E GAA NOT 5 S TCG NOT 6 E GAG NOT 7 K AAG NOT 8 A GCC NOT 9 A GCA NOT 10 T ACG NOT 11 E GAG NOT 12 Q CAA NOT 13 E GAG NOT 14 P CCG NOT 15 L CTG NOT 16 E GAA NOT 17 G GGG NOT 18 T ACA NOT 19 E GAA NOT 20 Q CAG NOT 21 T ACA NOT 22 L CTA NOT 23 D GAT NOT 24 A GCG NOT 25 E GAG NOT 26 E GAG NOT 27 E GAG NOT 28 Q CAG NOT 29 E GAG NOT 30 E GAA NOT 31 S TCC NOT 32 E GAA NOT 33 E GAA NOT 34 A GCG NOT 35 A GCC NOT 36 C TGT NOT 37 G GGC NOT 38 S AGC NOT 39 K AAG NOT 40 K AAA NOT 41 R CGG NOT 42 V GTA NOT 43 V GTG NOT 44 P CCA NOT 45 G GGT NOT 46 I ATT YES 47 V GTG YES 48 Y TAC YES 49 L CTG YES 50 G GGC YES 51 H CAT YES 52 I ATC YES 53 P CCG YES 54 P CCG YES 55 R CGC YES 56 F TTC YES 57 R CGG YES 58 P CCC YES 59 L CTG YES 60 H CAC YES 61 V GTC YES 62 R CGC YES 63 N AAC YES 64 L CTT YES 65 L CTC YES 66 S AGC YES 67 A GCC YES 68 Y TAT YES 69 G GGC YES 70 E GAG YES 71 V GTC YES 72 G GGA YES 73 R CGC YES 74 V GTC YES 75 F TTC YES 76 F TTT YES 77 Q CAG YES 78 A GCT YES 79 E GAG YES 80 D GAC YES 81 R CGG YES 82 F TTC YES 83 V GTG YES 84 R AGA YES 85 R CGC YES 86 K AAG YES 87 K AAG YES 88 K AAG YES 89 A GCA YES 90 A GCA YES 91 A GCA YES 92 A GCT YES 93 A GCC YES 94 G GGA YES 95 G GGA YES 96 K AAA YES 97 K AAG YES 98 R CGG YES 99 S TCC YES 100 Y TAC YES 101 T ACC YES 102 K AAG YES 103 D GAC YES 104 Y TAC YES 105 T ACC YES 106 E GAG YES 107 G GGA YES 108 W TGG YES 109 V GTG YES 110 E GAG YES 111 F TTC YES 112 R CGT YES 113 D GAC YES 114 K AAG YES 115 R CGC YES 116 I ATA YES 117 A GCC YES 118 K AAG YES 119 R CGC YES 120 V GTG YES 121 A GCG YES 122 A GCC YES 123 S AGT YES 124 L CTA YES 125 H CAC YES 126 N AAC YES 127 T ACG YES 128 P CCT YES 129 M ATG YES 130 G GGT YES 131 A GCC YES 132 R CGC YES 133 R AGG YES 134 R CGC YES 135 S AGC YES 136 P CCC YES 137 F TTC YES 138 R CGT YES 139 Y TAT YES 140 D GAT YES 141 L CTT YES 142 W TGG YES 143 N AAC NOT 144 L CTC NOT 145 K AAG NOT 146 Y TAC NOT 147 L TTG NOT 148 H CAC NOT 149 R CGT NOT 150 F TTC NOT 151 T ACC NOT 152 W TGG NOT 153 S TCC NOT 154 H CAC NOT 155 L CTC NOT 

This is an example for one protein, I need to do this for a large number of proteins. Do you have any ideas?

Thank!

  • well, just add print to print in print print "$count\t@array1[$count]\t@array2[$count]\t".(($count>=46 && $count<=142) ? "YES":"NOT" )."\n"; - Mike
  • Thanks for your reply! But, if I need to do this for a large number of proteins, then I would not like to enter all these positions in the domain with numbers, but I would like the code to do this automatically ... I don’t know how to convert before that - Daria
  • it means they should be obtained from domain.txt when you read it and remember it in variables. the truth is not entirely clear what will happen if there are several lines matching the condition of grep. - Mike
  • And how best to do it? Using '@' or '$'? - Daria
  • Ummm which means better. @ values ​​are arrays, $ are scalars. None of them can be better or worse, they have a completely different purpose. and it depends on the answer to my previous question " it is not entirely clear what will happen if there are several lines matching the grep condition in this file ". - Mike

0