Hello!
I have two files with thousands of proteins: first file: protein ID + amino acid sequence; second file: protein ID + nucleotide sequence. And there is my third file - with the positions of the domains of these proteins, which are associated with my proteins and nucleotide files. I connected these three files with this code:
File (1) acid.txt contains:
Enst003
File (2) nucleotides.txt contains:
ENST00000274849 | Q9ULW3 ATGGAGGCAGAGGAATCGGAGAAGGCCGCAACGGAGCAAGAGCCGCTGGAAGGGACAGAA CAGACACTAGATGCGGAGGAGGAGCAGGAGGAATCCGAAGAAGCGGCCTGTGGCAGCAAG AAACGGGTAGTGCCAGGTATTGTGTACCTGGGCCATATCCCGCCGCGCTTCCGGCCCCTG CACGTCCGCAACCTTCTCAGCGCCTATGGCGAGGTCGGACGCGTCTTCTTTCAGGCTGAG GACCGGTTCGTGAGACGCAAGAAGAAGGCAGCAGCAGCTGCCGGAGGAAAAAAGCGGTCC TACACCAAGGACTACACCGAGGGATGGGTGGAGTTCCGTGACAAGCGCATAGCCAAGCGC GTGGCGGCCAGTCTACACAACACGCCTATGGGTGCCCGCAGGCGCAGCCCCTTCCGTTAT GATCTTTGGAACCTCAAGTACTTGCACCGTTTCACCTGGTCCCACTGA
File (3) domain.txt contains:
Q9ULW3; 46 142 NOTE. These numbers indicate the position of the domain of proteins (that is, my domain is from 46 to 142 number)
use strict; use Bio::SeqIO; #################################################### #MODULE 1: read protein file, and save it in a hash# #################################################### my %hash1; my $sequence = "acid.txt"; my $multifasta = Bio::SeqIO ->new (-file => "<$sequence",-format=> "fasta"); while (my $seq= $multifasta->next_seq()) { my $na= $seq->display_id; #Saves the ID in $na my $ss = $seq->seq; $hash1{$na} = $ss; } ############################################################# #MODULE 2: read nucleotide file, and save it in another hash# ############################################################# my %hash2; my $genes = "nucleotides.txt"; my $multifasta = Bio::SeqIO ->new (-file => "<$genes",-format=> "fasta"); while (my $seq= $multifasta->next_seq()) { my $na= $seq->display_id; #Saves the ID in $na my $des=$seq->description; my $ss = $seq->seq; $hash2{$na} = $ss; } ##################### #MODULE 3: my $name;# ##################### my $name; # Read from standard input chomp $name; ############################################################################## #MODULE 4: DOMAIN ANNOTATION + RELATED AMINO ACIDS AND NUCLEOTIDES IN COLUMNS# ############################################################################## foreach my $name (keys %hash1) { my $ac = (split(/\s*\|/, $name))[1]; #print "$ac\n" ; #################################################### #MODULE 4.1: DOMAIN ANNOTATION: POSITION OF DOMAINS# #################################################### open(FILE, "<" ,"domain.txt"); my @array = (<FILE>); my @lines = grep (/$ac/, @array); print for @lines; close (FILE); ############################################################ #MODULE 4.2: RELATED AMINO ACIDS AND NUCLEOTIDES IN COLUMNS# ############################################################ my @array1 = split(//, $hash1{$name}, $hash2{$name}); #CUT SEQUENCE my @array2 = unpack("a3" x (length($hash1{$name})),$hash2{$name}); #CUT NUCLEOTIDE SEQUENCE my $number = "$#array1+1"; foreach (my $count = 0; $count <= $number; $count++) { print "$count\t@array1[$count]\t@array2[$count]\n"; } } And here is my FILE, which I received after running the script:
Q9ULW3; 46 142 0 M ATG 1 E GAG 2 A GCA 3 E GAG 4 E GAA 5 S TCG 6 E GAG 7 K AAG 8 A GCC 9 A GCA 10 T ACG 11 E GAG 12 Q CAA 13 E GAG 14 P CCG 15 L CTG 16 E GAA 17 G GGG 18 T ACA 19 E GAA 20 Q CAG 21 T ACA 22 L CTA 23 D GAT 24 A GCG 25 E GAG 26 E GAG 27 E GAG 28 Q CAG 29 E GAG 30 E GAA 31 S TCC 32 E GAA 33 E GAA 34 A GCG 35 A GCC 36 C TGT 37 G GGC 38 S AGC 39 K AAG 40 K AAA 41 R CGG 42 V GTA 43 V GTG 44 P CCA 45 G GGT 46 I ATT 47 V GTG 48 Y TAC 49 L CTG 50 G GGC 51 H CAT 52 I ATC 53 P CCG 54 P CCG 55 R CGC 56 F TTC 57 R CGG 58 P CCC 59 L CTG 60 H CAC 61 V GTC 62 R CGC 63 N AAC 64 L CTT 65 L CTC 66 S AGC 67 A GCC 68 Y TAT 69 G GGC 70 E GAG 71 V GTC 72 G GGA 73 R CGC 74 V GTC 75 F TTC 76 F TTT 77 Q CAG 78 A GCT 79 E GAG 80 D GAC 81 R CGG 82 F TTC 83 V GTG 84 R AGA 85 R CGC 86 K AAG 87 K AAG 88 K AAG 89 A GCA 90 A GCA 91 A GCA 92 A GCT 93 A GCC 94 G GGA 95 G GGA 96 K AAA 97 K AAG 98 R CGG 99 S TCC 100 Y TAC 101 T ACC 102 K AAG 103 D GAC 104 Y TAC 105 T ACC 106 E GAG 107 G GGA 108 W TGG 109 V GTG 110 E GAG 111 F TTC 112 R CGT 113 D GAC 114 K AAG 115 R CGC 116 I ATA 117 A GCC 118 K AAG 119 R CGC 120 V GTG 121 A GCG 122 A GCC 123 S AGT 124 L CTA 125 H CAC 126 N AAC 127 T ACG 128 P CCT 129 M ATG 130 G GGT 131 A GCC 132 R CGC 133 R AGG 134 R CGC 135 S AGC 136 P CCC 137 F TTC 138 R CGT 139 Y TAT 140 D GAT 141 L CTT 142 W TGG 143 N AAC 144 L CTC 145 K AAG 146 Y TAC 147 L TTG 148 H CAC 149 R CGT 150 F TTC 151 T ACC 152 W TGG 153 S TCC 154 H CAC 155 L CTC Now I would like to add a new fourth column that will contain “YES” or “NOT” - it depends on if the codons are in the range from 46 to 142 (that is, in the domain) - YES, which are not in the range from 46 142 each (outside domains) - NOT. I would like to receive the ouptut file:
Q9ULW3; 46 142 0 M ATG NOT 1 E GAG NOT 2 A GCA NOT 3 E GAG NOT 4 E GAA NOT 5 S TCG NOT 6 E GAG NOT 7 K AAG NOT 8 A GCC NOT 9 A GCA NOT 10 T ACG NOT 11 E GAG NOT 12 Q CAA NOT 13 E GAG NOT 14 P CCG NOT 15 L CTG NOT 16 E GAA NOT 17 G GGG NOT 18 T ACA NOT 19 E GAA NOT 20 Q CAG NOT 21 T ACA NOT 22 L CTA NOT 23 D GAT NOT 24 A GCG NOT 25 E GAG NOT 26 E GAG NOT 27 E GAG NOT 28 Q CAG NOT 29 E GAG NOT 30 E GAA NOT 31 S TCC NOT 32 E GAA NOT 33 E GAA NOT 34 A GCG NOT 35 A GCC NOT 36 C TGT NOT 37 G GGC NOT 38 S AGC NOT 39 K AAG NOT 40 K AAA NOT 41 R CGG NOT 42 V GTA NOT 43 V GTG NOT 44 P CCA NOT 45 G GGT NOT 46 I ATT YES 47 V GTG YES 48 Y TAC YES 49 L CTG YES 50 G GGC YES 51 H CAT YES 52 I ATC YES 53 P CCG YES 54 P CCG YES 55 R CGC YES 56 F TTC YES 57 R CGG YES 58 P CCC YES 59 L CTG YES 60 H CAC YES 61 V GTC YES 62 R CGC YES 63 N AAC YES 64 L CTT YES 65 L CTC YES 66 S AGC YES 67 A GCC YES 68 Y TAT YES 69 G GGC YES 70 E GAG YES 71 V GTC YES 72 G GGA YES 73 R CGC YES 74 V GTC YES 75 F TTC YES 76 F TTT YES 77 Q CAG YES 78 A GCT YES 79 E GAG YES 80 D GAC YES 81 R CGG YES 82 F TTC YES 83 V GTG YES 84 R AGA YES 85 R CGC YES 86 K AAG YES 87 K AAG YES 88 K AAG YES 89 A GCA YES 90 A GCA YES 91 A GCA YES 92 A GCT YES 93 A GCC YES 94 G GGA YES 95 G GGA YES 96 K AAA YES 97 K AAG YES 98 R CGG YES 99 S TCC YES 100 Y TAC YES 101 T ACC YES 102 K AAG YES 103 D GAC YES 104 Y TAC YES 105 T ACC YES 106 E GAG YES 107 G GGA YES 108 W TGG YES 109 V GTG YES 110 E GAG YES 111 F TTC YES 112 R CGT YES 113 D GAC YES 114 K AAG YES 115 R CGC YES 116 I ATA YES 117 A GCC YES 118 K AAG YES 119 R CGC YES 120 V GTG YES 121 A GCG YES 122 A GCC YES 123 S AGT YES 124 L CTA YES 125 H CAC YES 126 N AAC YES 127 T ACG YES 128 P CCT YES 129 M ATG YES 130 G GGT YES 131 A GCC YES 132 R CGC YES 133 R AGG YES 134 R CGC YES 135 S AGC YES 136 P CCC YES 137 F TTC YES 138 R CGT YES 139 Y TAT YES 140 D GAT YES 141 L CTT YES 142 W TGG YES 143 N AAC NOT 144 L CTC NOT 145 K AAG NOT 146 Y TAC NOT 147 L TTG NOT 148 H CAC NOT 149 R CGT NOT 150 F TTC NOT 151 T ACC NOT 152 W TGG NOT 153 S TCC NOT 154 H CAC NOT 155 L CTC NOT This is an example for one protein, I need to do this for a large number of proteins. Do you have any ideas?
Thank!
print "$count\t@array1[$count]\t@array2[$count]\t".(($count>=46 && $count<=142) ? "YES":"NOT" )."\n";- Mike