ITS2 RefSeq Integration Workflow-1

植物ITS2データベースの構築

Duboisらにより提唱された参照用データベース作成フローを一部改変。

1. 植物ITS2領域のDNAデータを取得

NCBIウェブサイトの検索窓から「植物」、「ITS2」などのキーワードでDNAデータを抽出。

((viridiplantae[Organism] AND its2) AND 100:10000000[Sequence Length]) NOT (uncultured OR environmental sample OR incertae sedis OR unverified)

取得した配列数をカウントする。2024年3月6日時点では249,660配列が得られた。

grep -c '>' sequence.fasta　#249660

2. Accession number と紐づいている Taxonomic identifier を得る

最初に1.で得たsequence.fastaからAccession numberを抜き出し、AccessionNumbersというファイルに保存する。

grep ">" sequence.fasta | cut -d ">" -f 2 | cut -d " " -f 1 > AccessionNumbers

次に、アクセッション番号と配列だけが紐づいたテーブルAccessionNumbers_seqs_linking_tableを作成する。一つの配列が複数行に渡り改行で格納されている場合は、後のGlobal_table作成時に結合できない可能性があるため、最初に配列を整形してからAccessionNumbers_seqs_linking_tableを作成する。

awk '/^>/ {if (seq) print seq; seq=""; print; next} {seq=seq $0} END {if (seq) print seq}' sequence.fasta > sequence_temp.fasta
mv sequence_temp.fasta sequence.fasta

paste \
  <(cat AccessionNumbers) \
  <(sed '/^>/d' sequence.fasta) > AccessionNumbers_seqs_linking_table

NCBIからTaxonomic identifierを得る。解凍後は11.5Gになった（ディスク容量に注意）。

wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz

gzip -d nucl_gb.accession2taxid.gz

AccessionNumbersをクエリとしてnucl_gb.accession2taxidから対応するレコードを抜き出して保存する。nucl_gb.accession2taxidにはaccession,accession.version,taxid,gi の４つのデータが格納されている。なお、使用するPCのリソースよって時間がかかるケースがある。

fgrep -w -f AccessionNumbers nucl_gb.accession2taxid > AccessionNumbers_taxids_linking_table

出来上がったAccessionNumbers_taxids_linking_tableファイルのレコード数と、最初に得たsequence.fasta の配列数が一致するか確認する。

wc -l AccessionNumbers_taxids_linking_table 
# 249660 =>OK

必要な列（accession.version, taxid）のみ残し、他は削除する

awk 'BEGIN {FS=OFS="\t"} {print $2,$3}' AccessionNumbers_taxids_linking_table > AccessionNumbers_taxids_linking_table_final

不要な中間ファイルを削除

rm AccessionNumbers
rm AccessionNumbers_taxids_linking_table

3. taxid と対応している生物種名のリストを得る

最初に重複の無いTaxidを得る。生物種の数は86281。

awk -F '\t' '{print $2}' AccessionNumbers_taxids_linking_table_final | sort | uniq > Taxids_uniq

wc -l Taxids_uniq
# 86281

taxidsに対応する生物分類を得る

NCBIからレファレンスファイルをFTPウェブサイトから得る

mkdir taxdump

wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz
mv new_taxdump.tar.gz taxdump/

tar -xvzf taxdump/new_taxdump.tar.gz -C taxdump

rankedlineage.dmpファイルにはタブ文字が含まれているため削除する

sed -i "s/\t//g" taxdump/rankedlineage.dmp

rankedlineage.dmpの並び替えをする

sort -t "|" -k 1b,1 taxdump/rankedlineage.dmp > taxdump/rankedlineage_sorted

タクソノミー情報（rankedlineage_sorted）をTaxids（Taxids_uniq）に結合させる

join -t "|" -1 1 -2 1 -a 1 Taxids_uniq taxdump/rankedlineage_sorted > Taxids_taxonomic_lineages_linking_table

wc -l Taxids_taxonomic_lineages_linking_table

#86281

結合できなかったタクソノミー情報が無いか２列目の空のレコード数を確認する（taxidIdが変更されているものがあるため、マッピングできないレコードがある）。

awk -F '|' '{print $2}' Taxids_taxonomic_lineages_linking_table | grep -c '^$'     

#4

空のタクソノミー情報が4あるため、空のtaxids を抽出⇒照会用URLを作成し照会⇒xmlファイルに結果を保存。

awk -F '|' '$2=="" {print $0}' Taxids_taxonomic_lineages_linking_table > Taxids_not_found

url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&rettype=xml&id="
url+=$(paste -s -d "," Taxids_not_found)

curl $url > Taxids_not_found.xml

xmlファイルから必要な部分を取りだし、空のタクソノミー情報のレコードと入れ替える。xmlファイルの処理のため、pythonコード（irw_1.pynb)の最初のセクションを使用。入力ファイルは①Taxids_not_found、②Taxids_not_found.xml、③Taxids_taxonomic_lineages_linking_table、④AccessionNumbers_taxids_linking_table_finalが必要となる。入力ファイルをirw_1.ipynbと同じフォルダに保存し、同コードを実行する。出力ファイルはUpdated_Taxids_taxonomic_lineages　および Updated_AccessionNumbers_taxids_linking_table

入れ替えによりUpdated_Taxids_taxonomic_lineagesに重複が生じている可能性があるため、重複を削除する。

awk -F'|' '!seen[$1]++' Updated_Taxids_taxonomic_lineages > Updated_Taxids_taxonomic_lineages_tmp

mv Updated_Taxids_taxonomic_lineages_tmp Updated_Taxids_taxonomic_lineages

結合できなかったタクソノミー情報がないか２列目の空のレコードを確認する。

awk -F '|' '{print $2}' Updated_Taxids_taxonomic_lineages| grep -c '^$'     

#0

Qiime2のフォーマットに合う様にUpdated_Taxids_taxonomic_lineages_linking_tableを変える

paste \
  <(awk -F "|" '$2!="" {print $1}' Updated_Taxids_taxonomic_lineages) \
  <(awk -F "|" 'BEGIN {OFS=""} $2!="" {print "k__",$9,"; p__",$8,"; c__",$7,"; o__",$6,"; f__",$5,"; g__",$4,"; s__",$2}' Updated_Taxids_taxonomic_lineages) > Updated_Taxids_taxonomic_lineages_reformatted

RESCRIPtで取得した配列データの構造に合わせるために、種名のアノテーションランクから属名を削除する（例：Acer yangbiense　⇒　yangbiense）

paste -d "" \
  <(awk -F 's__' '{OFS=""} {print $1,"s__"}' Updated_Taxids_taxonomic_lineages_reformatted) \
  <(awk -F 's__' '{print $2}' Updated_Taxids_taxonomic_lineages_reformatted | cut -d " " -f 2-) > Updated_Taxids_taxonomic_lineages_final

元となるファイル①：Updated_AccessionNumbers_taxids_linking_table_final (irw_1.ipynbで生成）

元となるファイル②：Updated_Taxids_taxonomic_lineages_final (前述のUnixコマンドで生成）

元となるファイルをpythonコード(irw_1.ipynb)と同じフォルダに保存し、irw_1.ipynbの２セクション目を実行しAccessionNumbers_taxids_Taxonomic_lineages_linking_table を得る。

4. アクセッション番号、タクソノミーID、分類情報、DNA配列が集約しているglobal table を作成する

Global_tableを生成する

join -t $'\t' -1 1 -2 1 -a 1 \
      <(sort -t $'\t' -k 1b,1 AccessionNumbers_taxids_Taxonomic_lineages_linking_table)\
      <(sort -t $'\t' -k 1b,1 AccessionNumbers_seqs_linking_table) > Global_table

Global_table のレコード例：

AB000322.1 85472 kViridiplantae; pStreptophyta; cMagnoliopsida; oCaryophyllales; fPolygonaceae; gFagopyrum; s__callianthum TCGAAACCTGCCCCAAAGC…

すべてのレコードに空白値がないかを確認する

awk -F '\t' '{print $1}' Global_table | grep -c "^$"       # 0 => Ok
awk -F '\t' '{print $2}' Global_table | grep -c "^$"       # 0 => Ok
awk -F '\t' '{print $3}' Global_table | grep -c "^$"       # 0 => Ok
awk -F '\t' '{print $4}' Global_table | grep -c "^$"       # 0 => Ok

使用するファイルを残して使用しない中間ファイルを削除する

find . -maxdepth 1 ! -name 'Global_table' ! -name 'sequence.fasta' ! -name 'Taxids_not_found.xml' ! -name 'taxdump' ! -name 'nucl_gb.accession2taxid' ! -name '.' -exec rm -rf {} +

5. Qiime2フォーマットのfastaファイルとタクソノミー情報ファイルを作成し、Qiime2にインポートする

Qiime2フォーマットのfastaファイル（メタ情報にバージョン付きアクセッション番号、配列情報）を作成。配列数をカウントすると249660であった。

awk -F '\t' 'BEGIN {OFS=""} {print ">",$1,"\n",$4}' Global_table | sed 's/-//g' > Fasta_file
grep -c '>' Fasta_file 
#249660

タクソノミー情報ファイル（バージョン付きアクセッション番号にタクソノミー情報）を作成

awk 'BEGIN {FS=OFS="\t"} {print $1,$3}' Global_table > Taxonomic_lineages

Qiime２にインポート。もし、Fasta_fileの配列が小文字の場合は大文字に変えておく。

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path Fasta_file \
  --output-path Fasta_file.qza

6. 低クオリティ配列の除去

RESCRIPtプラグインを使い、ホモポリマーが12以上連続する部分を含むシーケンスを除外。（使用するマシンのコア数に応じてjobsの数を変える）。除外後のシーケンス配列を出力し、次の対応するタクソノミー情報の削除の時に利用する。

qiime rescript cull-seqs \
    --i-sequences Fasta_file.qza \
    --p-homopolymer-length 12 \
    --p-n-jobs 8 \
    --o-clean-sequences Fasta_file_tmp.qza

mv Fasta_file_tmp.qza Fasta_file.qza

qiime tools export \
  --input-path Fasta_file.qza \
  --output-path .

grep -c '>' dna-sequences.fasta
# 234688

削除後の配列数が234688となり14972配列除去された

シーケンスがRESCRIPtプラグインで除去された場合には、対応するタクソノミー情報を削除する必要がある。削除後、タクソノミー情報をQiime2で使用するqzaファイルに格納する

fgrep -v -f <(
  cat <(grep ">" dna-sequences.fasta | cut -d ">" -f 2) \
      <(cut -d $'\t' -f 1 Taxonomic_lineages) | sort | uniq -u
) Taxonomic_lineages > Taxonomic_lineages_tmp

mv Taxonomic_lineages_tmp Taxonomic_lineages

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path Taxonomic_lineages \
  --output-path Taxonomic_lineages.qza

7. 同様の塩基配列データが存在した場合は、重複を削除する（オプション）

　DNA配列が全く同じのレコードを削除する。処理後の配列数は182470となり、52218配列削除された。ナイーブベイズ分類器の過学習の回避、結果の信頼性の確保、コンピューターリソースの節約の観点から重複を削除した配列を使用する。一方で、データの重複を保持することで、万が一の際の復旧や結果の信頼性を検証する場合に活用するなど、重複を保持することが目的によって必要に場合もあるtため、実験目的に応じて実行するかどうかを決める必要がある。なお、初期設定では、同じ配列であっても生物名が異なる場合は同一とみなされず、生物種単位で同一の配列が削除される。

qiime rescript dereplicate \
    --i-sequences Fasta_file.qza \
    --i-taxa Taxonomic_lineages.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences Fasta_file_tmp.qza \
    --o-dereplicated-taxa Taxonomic_lineages_tmp.qza

mv Fasta_file_tmp.qza Fasta_file.qza
mv Taxonomic_lineages_tmp.qza Taxonomic_lineages.qza

8. 菌類と思われる配列を植物ITSデータベースから削除する

最初に現在の配列データと分類データを抽出しておく。

qiime tools export \
--input-path Fasta_file.qza \
--output-path . && mv dna-sequences.fasta Exported_fasta_file.fasta

qiime tools export \
--input-path Taxonomic_lineages.qza \
--output-path . && awk 'NR>1' taxonomy.tsv > Exported_taxonomic_lineages.tsv

菌類と疑われる配列を特定するために、RefSeqsとUNITEデータベースの真菌ITS配列に対してブラスト検索する。最初はRefSeqから。

RefSeqからのデータをダウンロード

for i in $(curl -l ftp://ftp.ncbi.nlm.nih.gov/refseq/release/fungi/ | grep "genomic.fna");do wget https://ftp.ncbi.nlm.nih.gov/refseq/release/fungi/$i;done

ファイルを１つに連結。解凍と結合時に一時的にディスク容量が必要になるので注意する。今回は40GBほど必要になった。

gzip -d *.gz

cat *fna > fungi_genomic_refseqs.fna

taxidsの取得

grep ">" fungi_genomic_refseqs.fna | cut -d ">" -f 2 | cut -d " " -f 1 > AccessionNumbers

fgrep -w -f AccessionNumbers nucl_gb.accession2taxid > AccessionNumbers_taxids_linking_table

分類情報を得られなかったtaxidsの取得。

awk -F '\t' '{print $2}' AccessionNumbers_taxids_linking_table > AccessionNumbers_found_in_accession2taxid

cat AccessionNumbers AccessionNumbers_found_in_accession2taxid | sort | uniq -u > AccessionNumbers_not_found

今回は、分類情報が得られなかったtaxidは無かったので、fungi_genomic_refseqs.fna と　AccessionNumbers_taxids_linking_table　でblastデータベースを作成する。もし、分類情報が得られない配列がある場合には、植物の場合同様にxmlファイルから、足りない情報を得る。pythonコードを利用するか、数が少なければ手動で追加しても差し支えない。

BLASTデータベースの作成

makeblastdb \
-in fungi_genomic_refseqs.fna \
-parse_seqids \
-blastdb_version 5 \
-taxid_map AccessionNumbers_taxids_linking_table \
-title "fungi_genomic_refseqs" \
-dbtype nucl

Blast検索を植物ITS2配列に対して行う

blastn \
-db fungi_genomic_refseqs.fna \
-query Exported_fasta_file.fasta \
-num_threads 8 \
-max_target_seqs 1 \
-outfmt "6 qacc sacc evalue bitscore length pident ssciname scomname staxid" \
-out blastn_outfile_fungi_genomic_refseqs

配列の一致度が90％以上で且つ、アラインメントされた部分がクエリ配列の95％以上だった場合にその配列をデータベースから削除する

まずBlast検索の結果ファイルの様式を調整する。

sort -buk 1,1 blastn_outfile_fungi_genomic_refseqs | sed "s/100.000/100/g" > blastn_outfile_fungi_genomic_refseqs_uniq

awk -F '\t' '{print $1}' blastn_outfile_fungi_genomic_refseqs_uniq > AccessionNumbers_in_blastn_outfile

paste \
  <(cat AccessionNumbers_in_blastn_outfile) \
  <(fgrep -w -f AccessionNumbers_in_blastn_outfile Global_table | awk -F '\t' '{print length($4)*0.95}' | cut -d ',' -f 1) > AccessionNumbers_seqs_length_linking_table

join -t $'\t' -1 1 -2 1 -a 1 \
      <(sort -t $'\t' -k 1b,1 blastn_outfile_fungi_genomic_refseqs_uniq)\
      <(sort -t $'\t' -k 1b,1 AccessionNumbers_seqs_length_linking_table) > blastn_outfile_fungi_genomic_refseqs_uniq_withlengthdata

該当した配列のアクセッション番号を取得する。41配列該当した。

awk -F '\t' '$6>=90' blastn_outfile_fungi_genomic_refseqs_uniq_withlengthdata | awk -F '\t' '$5>=$NF' | awk -F '\t' '{print $1}' > Sequences_to_remove_fungi_genomic_refseqs

wc -l Sequences_to_remove_fungi_genomic_refseqs
#41

次にUNITEから菌類ITSのデータを得る。UNITEはウェブサイトからダウンロードする。今回はVersion10.0（2024.04)を使用。ダウンロード後、様式を調整

tar -xzvf sh_general_release_04.04.2024.tgz

paste \
<(grep ">" sh_general_release_dynamic_04.04.2024.fasta | cut -d "|" -f 2) \
<(sed '/^>/d' sh_general_release_dynamic_04.04.2024.fasta) > AccessionNumbers_seqs_linking_table

awk -F '\t' 'BEGIN {OFS=""} {print ">",$1,"\n",$2}' AccessionNumbers_seqs_linking_table > UNITE_fungi_seqs.fasta

UNITEデータベースを作成

makeblastdb \
  -in UNITE_fungi_seqs.fasta \
  -parse_seqids \
  -blastdb_version 5 \
  -title "UNITE_fungi_seqs" \
  -dbtype nucl

Blast検索を行う

blastn \
  -db UNITE_fungi_seqs.fasta \
  -query Exported_fasta_file.fasta \
  -num_threads 8 \
  -max_target_seqs 1 \
  -outfmt "6 qacc sacc evalue bitscore length pident ssciname scomname staxid" \
  -out blastn_outfile_UNITE_fungi

様式を調整する（先ほどのRefseqと同様の操作）。

sort -buk 1,1 blastn_outfile_UNITE_fungi | sed "s/100.000/100/g" > blastn_outfile_UNITE_fungi_uniq

awk -F '\t' '{print $1}' blastn_outfile_UNITE_fungi_uniq > AccessionNumbers_in_blastn_outfile

paste \
  <(cat AccessionNumbers_in_blastn_outfile) \
  <(fgrep -w -f AccessionNumbers_in_blastn_outfile Global_table | awk -F '\t' '{print length($4)*0.95}' | cut -d ',' -f 1) > AccessionNumbers_seqs_length_linking_table

join -t $'\t' -1 1 -2 1 -a 1 \
      <(sort -t $'\t' -k 1b,1 blastn_outfile_UNITE_fungi_uniq)\
      <(sort -t $'\t' -k 1b,1 AccessionNumbers_seqs_length_linking_table) > blastn_outfile_UNITE_fungi_uniq_withlengthdata

該当した配列のアクセッション番号を取得する。17配列ピックアップされた。

awk -F '\t' '$6>=90' blastn_outfile_UNITE_fungi_uniq_withlengthdata | awk -F '\t' '$5>=$NF' | awk -F '\t' '{print $1}' > Sequences_to_remove_UNITE_seqs

wc -l Sequences_to_remove_UNITE_seqs
#17

RefseqとUNITEデータベースで得られた菌類と思われる配列のアクセッション番号を結合し、重複を削除する。42配列得られた。

cat Sequences_to_remove_fungi_genomic_refseqs Sequences_to_remove_UNITE_seqs | sort | uniq > Sequences_to_remove

wc -l Sequences_to_remove
#42

菌類と疑われる配列を作成中の配列データと、生物分類データから削除する。

grep -n -A 1 -f Sequences_to_remove Exported_fasta_file.fasta | \
sed -n 's/^\([0-9]\{1,\}\).*/\1d/p' | \
sed -f - Exported_fasta_file.fasta > Fasta_file_without_fungi

grep -v -f Sequences_to_remove Exported_taxonomic_lineages.tsv > Taxonomic_lineages_without_fungi

grep -c '>' Fasta_file_without_fungi
#182428

9. 誤って生物分類されていると思われる配列の削除

誤って生物名がアノテーションされている配列を除くために、交差検証を行う。自身の配列をクエリ配列としてデータベースに問い合わせ、familyランクで一致する配列が自身の１配列しか存在しない場合、誤ってアノテーションされていると判断する。

植物ITSでのBlastデータベースを作る。Blastデータベースに必要なファイルを生成する。

grep ">" Fasta_file_without_fungi | cut -d ">" -f 2 > AccessionNumbers

fgrep -f AccessionNumbers Global_table | awk 'BEGIN {FS=OFS="\t"} {print $1,$2}' > AccessionNumbers_taxids_linking_table

植物ITSの仮Blastデータベースを作る。

makeblastdb \
  -in Fasta_file_without_fungi \
  -parse_seqids \
  -blastdb_version 5 \
  -taxid_map AccessionNumbers_taxids_linking_table \
  -title "Fasta_file_without_fungi" \
  -dbtype nucl

自身をクエリ配列としてBlast検索を行う。

blastn \
  -db ./Fasta_file_without_fungi \
  -query Fasta_file_without_fungi \
  -num_threads 8 \
  -max_target_seqs 5 \
  -outfmt "6 qacc sacc evalue bitscore length pident ssciname scomname staxid" \
  -out blastn_outfile_leakedCV

予想していた分類結果とそうならなかった分類結果をBlast検索の結果から得る。

マッチング順位上位５の結果を抽出する。最初はアクセッション番号とtaxidIDを得る。

awk -F '\t' '{print $1}' blastn_outfile_leakedCV | sort | uniq > AccessionNumbers_in_blastn_outfile

awk 'BEGIN {FS=OFS="\t"} {print $1,$9}' blastn_outfile_leakedCV > AccessionNumbers_PredictedTaxids_linking_table

次にアクセッション番号１つと、このアクセッション番号に対応する５つのtaxidsIDを得て、さらにアクセッション番号と対応するtaxidsIDを１対１対応にする。データ処理のための行番号もつける。

awk 'seen[$1]++{ $1="" }1' OFS='\t' AccessionNumbers_PredictedTaxids_linking_table \
  | fgrep -w -A 4 -f AccessionNumbers_in_blastn_outfile \
  | sed '/--/d' > AccessionNumbers_PredictedTaxids_linking_table_top5

paste \
  <(awk -F '\t' '{print $1}' AccessionNumbers_PredictedTaxids_linking_table_top5 | awk 'BEGIN {FS=OFS="\t"} NF {p = $0} {print p}') \
  <(awk -F '\t' 'BEGIN {FS=OFS="\t"} {print $2}' AccessionNumbers_PredictedTaxids_linking_table_top5) > AccessionNumbers_PredictedTaxids_linking_table_top5_tmp && mv AccessionNumbers_PredictedTaxids_linking_table_top5_tmp AccessionNumbers_PredictedTaxids_linking_table_top5

awk 'BEGIN {FS=OFS="\t"} {print NR,$0}' AccessionNumbers_PredictedTaxids_linking_table_top5 > AccessionNumbers_PredictedTaxids_linking_table_top5_tmp && mv AccessionNumbers_PredictedTaxids_linking_table_top5_tmp AccessionNumbers_PredictedTaxids_linking_table_top5

familyレベルの該当する生物分類を追加する。

Global_tableからアクセッション番号とfamilyの組み合わせファイル、およびtaxidIDとfamilyの組み合わせファイルを得る。

paste \
  <(awk 'BEGIN {FS=OFS="\t"} {print $1}' Global_table) \
  <(awk 'BEGIN {FS=OFS="\t"} {print $3}' Global_table | awk -F '; ' '{print $5}') > AccessionNumbers_taxonomic_lineages_linking_table

paste \
  <(awk 'BEGIN {FS=OFS="\t"} {print $2}' Global_table) \
  <(awk 'BEGIN {FS=OFS="\t"} {print $3}' Global_table | awk -F '; ' '{print $5}') | sort -buk 1,1 > Taxids_taxonomic_lineages_linking_table

次にBlast検索の結果ファイル（アクセッション番号にtaxidIDが記載）に上記で得られたTaxids_taxonomic_lineages_linking_tableのfamilyを追加する。キーになるのはtaxidIDとなる。

LC_ALL=C join -t $'\t' -1 3 -2 1 -a 1 \
  <(LC_ALL=C sort -t $'\t' -k 3 AccessionNumbers_PredictedTaxids_linking_table_top5) \
  <(LC_ALL=C sort -t $'\t' -k 1 Taxids_taxonomic_lineages_linking_table) > AccessionNumbers_PredictedTaxids_linking_table_top5_tmp && mv AccessionNumbers_PredictedTaxids_linking_table_top5_tmp AccessionNumbers_PredictedTaxids_linking_table_top5

データの様式を調整する。それぞれのアクセッション番号にヒットした５つのfamilyが並ぶレコード形式にする。

sort -n -k 2 AccessionNumbers_PredictedTaxids_linking_table_top5 | awk 'BEGIN {FS=OFS="\t"} {print $3,$4}' > AccessionNumbers_PredictedTaxids_linking_table_top5_tmp && mv AccessionNumbers_PredictedTaxids_linking_table_top5_tmp AccessionNumbers_PredictedTaxids_linking_table_top5

awk 'BEGIN {FS=OFS="\t"} $1 != prev { printf "%s%s", ors, $1; prev=$1; ors=ORS } { printf " %s", $2 } END { print "" }' AccessionNumbers_PredictedTaxids_linking_table_top5 | sed "s/ /\t/g" > AccessionNumbers_PredictedTaxids_linking_table_top5_tmp && mv AccessionNumbers_PredictedTaxids_linking_table_top5_tmp AccessionNumbers_PredictedTaxids_linking_table_top5

さらにBlast検索の結果ではない、Global_tableから得られたfamilyの分類情報を追加する。結果的に、１つのアクセッション番号にBlast検索の結果である５つのfamilyと元々分類されていたfamilyの6つのfamilyが追加されたファイルになる。

join -t $'\t' -1 1 -2 1 -a 1 \
  <(sort -t $'\t' -k 1 AccessionNumbers_PredictedTaxids_linking_table_top5) \
  <(sort -t $'\t' -k 1 AccessionNumbers_taxonomic_lineages_linking_table) -o 1.1,2.2,1.2,1.3,1.4,1.5,1.6 > AccessionNumbers_PredictedTaxids_linking_table_top5_tmp && mv AccessionNumbers_PredictedTaxids_linking_table_top5_tmp AccessionNumbers_PredictedTaxids_linking_table_top5

次に、familyの数をそれぞれのレコードでカウントする。アクセッション番号とカウント数がセットになったファイルが生成される。

awk 'BEGIN {FS=OFS="\t"} { i=$1; $1=""; print i, gsub($2,"")-1 }' AccessionNumbers_PredictedTaxids_linking_table_top5 > Predicted_taxonomy_count

自分自身しかヒットせず１度だけのみ表示された配列を削除する。

awk 'BEGIN {FS=OFS="\t"} $2==1 {print $1}' Predicted_taxonomy_count > Sequences_to_remove

wc -l Sequences_to_remove
#1201

grep -n -A 1 -f Sequences_to_remove Fasta_file_without_fungi | \
sed -n 's/^\([0-9]\{1,\}\).*/\1d/p' | \
sed -f - Fasta_file_without_fungi > NCBI_ITS2_Viridiplantae_fasta_file_2024_03_06.fasta

grep -c '>' NCBI_ITS2_Viridiplantae_fasta_file_2024_03_06.fasta
#181227

grep -v -f Sequences_to_remove Taxonomic_lineages_without_fungi > NCBI_ITS2_Viridiplantae_taxonomic_lineages_2024_03_06.tsv

不要なファイルを削除する。最終的に必要になるファイルは以下の２つのファイルである。

NCBI_ITS2_Viridiplantae_taxonomic_lineages_2024_03_06.tsv, NCBI_ITS2_Viridiplantae_fasta_file_2024_03_06.fasta

加えて任意であるが、元々のデータソースとして

①NCBIから取得した植物ITS２配列 sequence.fasta 、②タクソノミー情報 nucl_gb.accession2taxid、③分類情報が得られなかった際に取得したxmlファイル Taxids_not_found.xml、④菌類データベース(連結後）fungi_refseqs.fna、⑤菌類データベース（連結後）UNITE_fungi_seqs.fasta

重要な中間ファイルとして Global_table、また９のセクションで行ったblast検索のファイル blastn_outfile_leakedCV

これらを保存しておき、他のファイルを削除するコマンドは以下である。

find . -maxdepth 1 ! -name 'NCBI_ITS2_Viridiplantae_taxonomic_lineages_2024_03_06.tsv' \
                 ! -name 'NCBI_ITS2_Viridiplantae_fasta_file_2024_03_06.fasta' \
                 ! -name 'sequence.fasta' \
                 ! -name 'nucl_gb.accession2taxid' \
                 ! -name 'Taxids_not_found.xml' \
                 ! -name 'fungi_refseqs.fna' \
                 ! -name 'UNITE_fungi_seqs.fasta' \
                 ! -name 'Global_table' \
                 ! -name 'blastn_outfile_leakedCV' \
                 ! -name '.' \
                 -exec rm -rf {} +

Qiime2にデータを格納し、qzaファイルとして出力する。

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path NCBI_ITS2_Viridiplantae_fasta_file_2024_03_06.fasta \
  --output-path NCBI_ITS2_Viridiplantae_fasta_file_2024_03_06.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path NCBI_ITS2_Viridiplantae_taxonomic_lineages_2024_03_06.tsv \
  --output-path NCBI_ITS2_Viridiplantae_taxonomic_lineages_2024_03_06.qza

10. 分類器の作成

Qiime2で利用するskilearnベースのナイーブベイズ分類器を作成する。

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads NCBI_ITS2_Viridiplantae_fasta_file_2024_03_06.qza \
  --i-reference-taxonomy NCBI_ITS2_Viridiplantae_taxonomic_lineages_2024_03_06.qza \
  --o-classifier NCBI_ITS2_Viridiplantae_classifier_2024_03_06.qza

11.今回作成した植物ITS2データベースを用いてBlast用DBを作成（オプション）

先ほど、ファイルを一括削除したのでAccessionNumbers_taxids_linking_tableを復活させる。

grep ">" NCBI_ITS2_Viridiplantae_fasta_file_2024_03_06.fasta | cut -d ">" -f 2 > AccessionNumbers

fgrep -w -f AccessionNumbers nucl_gb.accession2taxid > AccessionNumbers_taxids_linking_table

awk 'BEGIN {FS=OFS="\t"} {print $2,$3}' AccessionNumbers_taxids_linking_table > AccessionNumbers_taxids_linking_table_final

Blast検索の結果に、生物種を表示させるデータベースをダウンロードしておく

mkdir taxdump
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz
#指定したフォルダに解凍する。（こちらは弊社のパス例）
tar -xzf taxdb.tar.gz -C /home/bioinsight/DB_new/blast/taxdump

データベース生成時に、taxdumpフォルダを参照するように環境設定しておく。

export BLASTDB=/home/bioinsight/DB_new/blast/taxdump

この設定をシェルセッションが終了しても維持するために、~/.bashrcに追加。

echo 'export BLASTDB=/home/bioinsight/DB_new/blast/taxdump' >> ~/.bashrc
source ~/.bashrc

blastデータベースを生成する

makeblastdb \
  -in NCBI_ITS2_Viridiplantae_fasta_file_2024_03_06.fasta \
  -parse_seqids \
  -blastdb_version 5 \
  -taxid_map AccessionNumbers_taxids_linking_table_final \
  -title "NCBI_ITS2_Viridiplantae_blastdb_2024_03_06" \
  -dbtype nucl