Show Menu
Cheatography

Awk one-liners for FASTA manipulation version 1.0 Cheat Sheet by

Introd­uction

This cheatsheet contains 10 useful AWK one-liners for manipu­lation of FASTA files. It is created as part of a series to help graduate students and biologists in learning some simple progra­mming scripts. Each oneliner is usually accomp­anied by additional comments which start with a hash ("#"). Runnable codes is available on http:/­/co­de.r­un­nab­le.c­om­/VZ­sPv­rVQ­5Jk­yE_­ru/­awk­-on­e-l­ine­rs-­for­-fa­sta­-ma­nip­ula­tio­n-f­or-­she­ll-­bas­h-a­nd-­bio­inf­orm­atics
Author: Melissa M.L. Wong; Date created: 1 July 2015; Date last modified:6 July 2015; Email: meliss­awo­ngu­km@­gma­il.com

FASTA format is a text-based format for repres­enting either nucleotide sequences or peptide sequences, in which nucleo­tides or amino acids are repres­ented using single­-letter codes. A fasta sequence must start with an arrow (">"), followed by its name and a newline character ("\n­"), and lastly its sequence which can span multiple lines.

1. To find sequences with matching name

awk 'BEGIN{RS=">";FS="\n"}NR>1{if ($1~/name/) print ">"$0}' file.fa

2. To extract sequences using a list

awk 'BEGIN{RS=">";FS="\n"}NR==FNR{a[$1]++}NR>FNR{if ($1 in a && $0!="") printf ">%s",$0}' list file.fa
#The names in the list must start with ">" and each name is separated by a newline ("\n")

3. To join multiple lines into single line

awk 'BEGIN{RS=">";FS="\n"}NR>1{seq="";for (i=2;i<=NF;i++) seq=seq""$i; print ">"$1"\n"seq}' file.fa
#Single line sequence is desirable when a sequence is long and spans many lines. Furthermore, single line sequence is much easier to be manipulated using AWK oneliners as showed in the next few examples.

4. To print specified sequence region

#To print the sequence starting from position 1 until 2213
awk 'BEGIN{RS=">";FS="\n"}NR>1{seq="";for (i=2;i<=NF;i++) seq=seq""$i; print ">"$1"\n"substr(seq,1,2213)}' file.fa
#To print sequence starting from position 399 until 704
awk 'BEGIN{RS=">";FS="\n"}NR>1{seq="";for (i=2;i<=NF;i++) seq=seq""$i; print ">"$1"\n"substr(seq,399,704-399+1)}' file.fa
#To print sequence with matching name from position 399 until 704
awk 'BEGIN{RS=">";FS="\n"}NR>1{seq="";for (i=2;i<=NF;i++) seq=seq""$i; if ($1~/name/) print ">"$1"\n"substr(seq,399,704-399+1)}' file.fa
#Useful to print sequence region when given start position and stop position or length

5. To reformat into 100 characters per line

awk 'BEGIN{RS=">";FS="\n"}NR>1{seq="";for (i=2;i<=NF;i++) seq=seq""$i;a[$1]=seq;b[$1]=length(seq)}END{for (i in a) {k=sprintf("%d", (b[i]/100)+1); printf ">%s\n",i;for (j=1;j<=int(k);j++) printf "%s\n", substr(a[i],1+(j-1)*100,100)}}' fasta.txt

6. To substitute nucleotide sequences

#To substitute small letter with capital letter
awk 'BEGIN{RS=">";FS="\n"}NR>1{printf ">%s\n",$1;for (i=2;i<=NF;i++) {gsub(/c/,"C",$i);gsub(/a/,"A",$i);gsub(/g/,"G",$i);gsub(/t/,"T",$i); printf "%s\n",$i}}' file.fa

7. To convert DNA to RNA

awk 'BEGIN{RS=">";FS="\n"}NR>1{printf ">%s\n",$1;for (i=2;i<=NF;i++) {gsub(/T/,"U",$i); printf "%s\n",$i}}' file.fa

8. To summarize sequence content

awk 'BEGIN{RS=">";FS="\n";print "name\tA\tC\tG\tT\tN\tlength\tGC%"}NR>1{sumA=0;sumT=0;sumC=0;sumG=0;sumN=0;seq="";for (i=2;i<=NF;i++) seq=seq""$i; k=length(seq); for (i=1;i<=k;i++) {if (substr(seq,i,1)=="T") sumT+=1; else if (substr(seq,i,1)=="A") sumA+=1; else if (substr(seq,i,1)=="G") sumG+=1; else if (substr(seq,i,1)=="C") sumC+=1; else if (substr(seq,i,1)=="N") sumN+=1}; print $1"\t"sumA"\t"sumC"\t"sumG"\t"sumT"\t"sumN"\t"k"\t"(sumC+sumG)/k*100}' file.fa
#Calculate number of each nucleotide, total length and GC content

9. To reverse complement nucleotide sequences

awk 'BEGIN{RS=">";FS="\n";a["T"]="A";a["A"]="T";a["C"]="G";a["G"]="C";a["N"]="N"}NR>1{for (i=2;i<=NF;i++) seq=seq""$i;for(i=length(seq);i!=0;i--) {k=substr(seq,i,1);x=x a[k]}; printf ">%s\n%s",$1,x}' file.fa
#This will produce a single line sequence

10. To convert FASTQ to FASTA format

awk 'NR%4==1{print ">"substr($0,2)}NR%4==2{print $0}' file.fq
#print first and second line of every four lines. Replace the first character of the first line with ">".
 

Comments

Hi Melissa, I need to apply your cheat #2 but with a loop for several files containing different lists. So I have sequences.fa, and several files OMA*.fas to recover several fasta alignments. How could I do this?

I tried something like the code below but it didn't work. Maybe the for is no the solution but what would you recommend?

fo file in *.fas; do awk 'NR==1{printf $0"\t";next}{printf /^>/ ? "\n"$0"\t" : $0}' sequences.fa | awk -F"\t" 'BEGIN{while((getline k < "$file")>0)i[k]=1}{gsub("^>","",$0); if(i[$1]){print ">"$1"\n"$2}}' > $(basename $file).nuc; done

Hi drelo, thanks for the 5 star rating. I would modify script #2 to assign OMA*.fas as a variable to awk and redirect awk output to a variable filename.

for i in *.fas; do name=$(basename $i .fas) && awk -v name="$name" 'BEGIN{RS=">";FS="\n"}NR==FNR{a[$1]++}NR>FNR{if ($1 in a && $0!="") print ">"$0 > name".nuc"}' $i sequence.fa; done

Add a Comment

Your Comment

Please enter your name.

    Please enter your email address

      Please enter your Comment.

          More Cheat Sheets by melissamlwong