assembly -> chado analysis mapping: we need to download FTP stuff... #28

Closed
bradfordcondon opened this issue Nov 29, 2018 · 4 comments

@bradfordcondon

Child of #15.

The assembly record keeps core field information for the analysis (namely the algorithm) in the FTP file. So, we need to:

  • write an FTP class
  • fetch the files and search them for the algorithm info.
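The "search for the algorithm info" step could look roughly like the sketch below. It is only an illustration: `findReportHeaders()` is a hypothetical helper, not code from this module, and it assumes the report keeps its metadata in header lines of the form `# Key: value`. The fetch itself is left as a comment because it depends on PHP's `ftp://` stream wrapper being enabled.

```php
<?php
/**
 * Hypothetical helper: collect the values of selected "# Key: value"
 * header lines from the text of an assembly report.
 *
 * @param string   $report Full text of the report.
 * @param string[] $wanted Header prefixes to look for, e.g. "# Assembly method:".
 * @return array Map of prefix => list of values found for that prefix.
 */
function findReportHeaders(string $report, array $wanted): array
{
    $found = [];
    foreach (preg_split('/\R/', $report) as $line) {
        // Skip empty lines and anything that is not a "#" header line.
        if ($line === '' || $line[0] !== '#') {
            continue;
        }
        foreach ($wanted as $key) {
            if (strpos($line, $key) === 0) {
                // Keep everything after the prefix as the value.
                $found[$key][] = trim(substr($line, strlen($key)));
            }
        }
    }
    return $found;
}

// Assumed usage (requires allow_url_fopen and the ftp:// wrapper):
// $report = file_get_contents($assemblyReportUrl);
// $info = findReportHeaders($report, ['# Assembly method:']);
```

Keeping the parsing separate from the download means the search logic can be tested against a saved report without touching the network.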
@bradfordcondon

Here's where the FTP info is parsed:

case 'FtpSites':
$list['files'] = $this->processFinalChildren($child, ['type']);
case 'default':

@bradfordcondon bradfordcondon self-assigned this Nov 29, 2018
@bradfordcondon

<FtpSites>
  <FtpPath type="Assembly_rpt">ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/285/GCF_000002285.3_CanFam3.1/GCF_000002285.3_CanFam3.1_assembly_report.txt</FtpPath>
  <FtpPath type="GenBank">ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/002/285/GCA_000002285.2_CanFam3.1</FtpPath>
  <FtpPath type="RefSeq">ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/285/GCF_000002285.3_CanFam3.1</FtpPath>
  <FtpPath type="Stats_rpt">ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/285/GCF_000002285.3_CanFam3.1/GCF_000002285.3_CanFam3.1_assembly_stats.txt</FtpPath>
</FtpSites>

Let's consider each of these:

Assembly_rpt
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/285/GCF_000002285.3_CanFam3.1/GCF_000002285.3_CanFam3.1_assembly_report.txt

Probably what we want.

GenBank (RefSeq is the equivalent GCF directory)
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/002/285/GCA_000002285.2_CanFam3.1

A directory. Handy to show to users, but we won't be querying it.

Stats_rpt
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/285/GCF_000002285.3_CanFam3.1/GCF_000002285.3_CanFam3.1_assembly_stats.txt

A greatly expanded stats report, beyond what's included in the CDATA.

So the Assembly_rpt is the clear winner.
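For illustration, picking the Assembly_rpt entry out of an `<FtpSites>` block could be sketched like this. It uses PHP's SimpleXML extension rather than the module's own `processFinalChildren()` parser, and `ftpPathsByType()` is a hypothetical name, so treat it as a stand-alone example, not as the module's code.

```php
<?php
/**
 * Hypothetical helper: map each FtpPath "type" attribute to its URL.
 *
 * @param string $xml An <FtpSites> element as a string.
 * @return array Map of type => URL; empty if the XML cannot be parsed.
 */
function ftpPathsByType(string $xml): array
{
    $sites = simplexml_load_string($xml);
    if ($sites === false) {
        return [];
    }
    $paths = [];
    foreach ($sites->FtpPath as $path) {
        // Attribute access and element text both need a string cast.
        $paths[(string) $path['type']] = trim((string) $path);
    }
    return $paths;
}

// Assumed usage: only the assembly report is fetched and searched;
// the other paths would just be displayed.
// $reportUrl = ftpPathsByType($xml)['Assembly_rpt'] ?? null;
```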

@bradfordcondon

Here's what we fetch for the assembly method in our sample dataset:

 ["# Assembly method:"]=>
  array(1) {
    [0]=>
    string(68) "FALCON v. 0.5.0; Arcs v. 1.0.1; Links v. 1.8.5; BioNano Solve v. 3.1"
  }
}
.array(1) {
  ["# Assembly method:"]=>
  array(1) {
    [0]=>
    string(15) "FALCON v. 0.4.0"
  }
}
.array(1) {
  ["# Assembly method:"]=>
  array(1) {
    [0]=>
    string(21) "Arachne v. April 2010"
  }
}
.array(1) {
  ["# Assembly method:"]=>
  array(1) {
    [0]=>
    string(18) "SOAPdenovo v. 1.14"
  }
}
.array(1) {
  ["# Assembly method:"]=>
  array(1) {
    [0]=>
    string(28) "Celera Assembler (CA) v. 5.3"
  }
}
.array(1) {
  ["# Assembly method:"]=>
  array(1) {
    [0]=>
    string(14) "MaSuRCA v. 2.3"
  }
}
.array(1) {
  ["# Assembly method:"]=>
  array(1) {
    [0]=>
    string(17) "SOAPdenovo v. 1.1"
  }
}
.array(1) {
  ["# Assembly method:"]=>
  array(1) {
    [0]=>
    string(18) "SOAPdenovo v. 1.05"
  }

In the case of a single version, we could simply split on "v.". For multiples, we could split on ";" and then on "v.", but I'm not sure it's worth it, since the versions get separated from the programs.
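For reference, a split that keeps each version attached to its program might look like the sketch below. `splitAssemblyMethod()` is a hypothetical helper, and it assumes every entry follows the "Program v. Version" pattern seen in the sample strings above; entries without a "v." get a null version.

```php
<?php
/**
 * Hypothetical helper: split an "Assembly method" string into
 * program/version pairs so each version stays with its program.
 *
 * @param string $method e.g. "FALCON v. 0.5.0; Arcs v. 1.0.1"
 * @return array List of ['program' => string, 'programversion' => ?string].
 */
function splitAssemblyMethod(string $method): array
{
    $pairs = [];
    foreach (explode(';', $method) as $entry) {
        $entry = trim($entry);
        if ($entry === '') {
            continue;
        }
        // Split on the first whitespace-delimited "v." only.
        $parts = preg_split('/\s+v\.\s+/', $entry, 2);
        $pairs[] = [
            'program' => trim($parts[0]),
            'programversion' => isset($parts[1]) ? trim($parts[1]) : null,
        ];
    }
    return $pairs;
}
```

A multi-program string would then produce several records rather than one, which is the extra bookkeeping being weighed here.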

@bradfordcondon

Handled. program and programversion are both set to the full assembly method string.
