Create UniProt Mapping file. #691

Open
wants to merge 4 commits into
base: staging
4 changes: 2 additions & 2 deletions app/jobs/generate_tsvs.rb
@@ -32,8 +32,8 @@ def ensure_downloads_directory_exists
end

def public_file_path(filename)
- desination_filename = [filename_prefix, filename].join('-')
- File.join(downloads_dir_path, desination_filename)
+ destination_filename = [filename_prefix, filename].join('-')
+ File.join(downloads_dir_path, destination_filename)
end

def downloads_dir_path
54 changes: 54 additions & 0 deletions app/jobs/generate_uniprot_mapping_tsv.rb
@@ -0,0 +1,54 @@
class GenerateUniprotMappingTsv < GenerateTsvs
def perform
ensure_downloads_directory_exists
tsvs_to_generate.each do |e|
begin
tmp_file = tmp_file(e.file_name)
tmp_file.puts(e.headers.join("\t"))

e.objects.find_each do |object|
Review comment (Member):

You probably don't need the indirection of calling e.objects here, and you can delete def self.objects in the presenter file. This can probably just be Gene.find_each.


row = e.row_from_object(object)
if row[1].is_a?(Array)
Review comment (Member):

I might change this to e.rows_from_object() so that you always get back an Array of rows, and push the logic for handling multiple (or no) UniProt IDs down a level. (See other comment.)
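
To illustrate the suggestion above, here is a minimal sketch of what the writing loop could look like once the presenter always returns an Array of rows. `SketchGene` and `SketchPresenter` are hypothetical stand-ins invented for this example (they are not part of the PR), and a `StringIO` replaces the temp file so the snippet runs on its own.

```ruby
require 'stringio'

# Hypothetical stand-in for the Gene model: a name plus its Swiss-Prot names.
SketchGene = Struct.new(:name, :swissprot_names)

# Hypothetical presenter that follows the rows_from_object idea.
class SketchPresenter
  def self.headers
    %w[civic_name uniprot_name gene_overview]
  end

  # Always returns an Array of rows: empty when there is no mapping,
  # one row per Swiss-Prot name otherwise.
  def self.rows_from_object(gene)
    gene.swissprot_names.map { |name| [gene.name, name, "overview for #{gene.name}"] }
  end
end

# The writer no longer inspects row contents; it only iterates and joins.
def write_tsv(io, presenter, genes)
  io.puts(presenter.headers.join("\t"))
  genes.each do |gene|
    presenter.rows_from_object(gene).each do |row|
      io.puts(row.join("\t"))
    end
  end
  io
end

genes = [
  SketchGene.new('GENE_A', ['P00001']),
  SketchGene.new('GENE_B', []),
  SketchGene.new('GENE_C', ['P00002', 'P00003'])
]
output = write_tsv(StringIO.new, SketchPresenter, genes).string
```

In the job itself the genes would presumably come from `find_each` rather than an in-memory Array; the point of the sketch is only that the Array/"N/A"/String branching disappears from the writer.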

overview_col = e.formatted_overview_col(object)
row[1].each do |r|
new_row = [row[0], r, overview_col].join("\t")
tmp_file.puts(new_row)
end
elsif row[1] == "N/A"
next
else
tmp_file.puts(row.join("\t"))
end
end

tmp_file.close
public_path = public_file_path(e.file_name)
FileUtils.cp(tmp_file.path, public_path)
File.chmod(0644, public_path)
ensure
tmp_file.unlink
end
end
end

def tsvs_to_generate
[UniprotMappingTsvPresenter]
end

def public_file_path(filename)
File.join(downloads_dir_path, filename)
end

def downloads_dir_path
TsvRelease.downloads_path
end

def release_path
''
end

def filename_prefix
''
end
end
34 changes: 34 additions & 0 deletions app/presenters/uniprot_mapping_tsv_presenter.rb
@@ -0,0 +1,34 @@
class UniprotMappingTsvPresenter
def self.objects
Gene.joins(variants: :evidence_items).where("variants.evidence_items.status!='rejected'").distinct
end

def self.headers
[
'civic_name',
'uniprot_name',
'gene_overview'
]
end

def self.row_from_object(gene)
Review comment (Member):

I'd rename this to rows_from_object() and do something along these lines (I haven't tested it, this is just off the top of my head):

swissprot_names = Array(Scrapers::MyGeneInfo.get_swissprot_name(gene))
formatted_overview = formatted_overview_col(gene)
swissprot_names.map do |swissprot_name|
  if swissprot_name == 'N/A'
    nil
  else
    [gene.name, swissprot_name, formatted_overview]
  end
end.compact

That way you have a list of rows for your TSV: compact removes the nils, and the code that actually writes the TSV can be a simple iteration over genes that calls this and writes a line for each row it returns.
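
A runnable variation on the reviewer's snippet, as a sketch only: the real lookup goes through `Scrapers::MyGeneInfo.get_swissprot_name`, which needs network access, so a hypothetical `FAKE_LOOKUP` hash stands in for it here. The gene names and accessions are made up. It also uses `reject` before `map` instead of `map` followed by `compact`, which yields the same rows.

```ruby
# Hypothetical stand-in for the scraper: returns a String, an Array of
# Strings, or 'N/A', mirroring the three cases the job in this PR handles.
FAKE_LOOKUP = {
  'GENE_A' => 'P00001',
  'GENE_B' => 'N/A',
  'GENE_C' => ['P00002', 'P00003']
}.freeze

def rows_from_object(gene_name, overview, lookup: FAKE_LOOKUP)
  # Array() wraps a single String and leaves an Array untouched.
  swissprot_names = Array(lookup.fetch(gene_name, 'N/A'))
  swissprot_names
    .reject { |swissprot_name| swissprot_name == 'N/A' }
    .map { |swissprot_name| [gene_name, swissprot_name, overview] }
end
```

A gene with no mapping then produces an empty Array, so the caller never has to special-case it.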

[
gene.name,
Scrapers::MyGeneInfo.get_swissprot_name(gene),
formatted_overview_col(gene)
]
end

def self.file_name
"UniprotMapping.tsv"
end

def self.formatted_overview_col(gene)
Review comment (Member):

I'd get the counts in the following ways:

eid_count = EvidenceItem.joins(variant: [:gene]).where("evidence_items.status != 'rejected'").where(variant: {gene: gene}).distinct.count
variant_count = gene.variants.joins(:evidence_items).where("evidence_items.status != 'rejected'").distinct.count
assertion_count = gene.assertions.where("status != 'rejected'").distinct.count

You could also invert the logic and do something like this:

Assertion.joins(:gene).where("status != 'rejected'").where(gene: gene).distinct.count

depending on which is clearer to you.
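
To make the difference between the two counting approaches concrete without a database, here is a plain-Ruby sketch. It is not ActiveRecord: `SketchEvidence` and `SketchVariant` are hypothetical Structs, and the status values other than 'rejected' are assumptions made for the example.

```ruby
# Hypothetical in-memory stand-ins for evidence items and variants.
SketchEvidence = Struct.new(:id, :status)
SketchVariant  = Struct.new(:name, :evidence_items)

variants = [
  SketchVariant.new('V1', [SketchEvidence.new(1, 'accepted'), SketchEvidence.new(2, 'rejected')]),
  SketchVariant.new('V2', [SketchEvidence.new(3, 'rejected')]),
  SketchVariant.new('V3', [SketchEvidence.new(4, 'submitted')])
]

# What the loop in the diff computes: every evidence item, rejected or not.
naive_eid_count = variants.sum { |v| v.evidence_items.size }

# What the suggested queries aim for: only non-rejected items, and only
# variants that still have at least one such item.
kept = ->(item) { item.status != 'rejected' }
eid_count = variants.flat_map(&:evidence_items).select(&kept).map(&:id).uniq.size
variant_count = variants.count { |v| v.evidence_items.any?(&kept) }
```

With this data the naive loop counts the rejected items too, while the filtered counts skip them and also drop the variant whose only evidence item was rejected.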


eid_count = 0
gene.variants.each do |vs|
eid_count += vs.evidence_items.size
end
"#{gene.assertions.size} clinical assertion(s) and #{eid_count} evidence item(s) for #{gene.variants.size} variant(s) curated from #{gene.sources.size} published source(s)"
end
end
6 changes: 6 additions & 0 deletions config/scheduled_tasks.yml
@@ -52,6 +52,12 @@ GenerateMonthlyTsvs:
queue: default
class: GenerateMonthlyTsvs

GenerateUniprotMappingTsv:
cron: '0 0 1 * *'
description: Update the UniprotMapping tsv
queue: default
class: GenerateUniprotMappingTsv

GenerateNightlyTsvs:
cron: '0 0 * * *'
description: Update the nightly TSV dumps
19 changes: 19 additions & 0 deletions lib/scrapers/my_gene_info.rb
@@ -36,5 +36,24 @@ def self.extract_entrez_id(data)
def self.extract_official_name(data)
data['hits'].first['name']
end

private
def self.url_for_uniprot(gene_symbol)
"http://mygene.info/v2/query/?q=symbol:#{gene_symbol}&species=human&entrezonly=1&limit=1&fields=uniprot"
end

def self.get_swissprot_name(gene)
resp = Util.make_get_request(url_for_uniprot(gene.name))
data = JSON.parse(resp)
extract_swissprot_name(data)
end

def self.extract_swissprot_name(data)
if data['hits'].first != nil and data['hits'].first['uniprot'] != nil
data['hits'].first['uniprot']['Swiss-Prot']
else
'N/A'
end
end
end
end
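
As a possible hardening of the scraper methods above, the sketch below escapes the gene symbol when building the URL and always returns an Array from the extractor. It is an illustration, not the PR's code: `extract_swissprot_names` (plural) is a hypothetical name, returning `[]` instead of 'N/A' is a deliberate difference, and the sample JSON shape is inferred only from the `data['hits'].first['uniprot']['Swiss-Prot']` accessor in the diff.

```ruby
require 'json'
require 'uri'

# Builds the query string with URI.encode_www_form so the symbol is escaped.
def url_for_uniprot(gene_symbol)
  query = URI.encode_www_form(
    q: "symbol:#{gene_symbol}",
    species: 'human',
    entrezonly: 1,
    limit: 1,
    fields: 'uniprot'
  )
  "http://mygene.info/v2/query/?#{query}"
end

# Hypothetical variant of extract_swissprot_name that normalises the result:
# no hit or no Swiss-Prot entry gives [], a single String gives a one-element
# Array, and an Array is passed through.
def extract_swissprot_names(data)
  first_hit = Array(data['hits']).first
  return [] unless first_hit.is_a?(Hash)

  Array(first_hit.dig('uniprot', 'Swiss-Prot'))
end

# Assumed response shapes, with made-up accessions.
single = JSON.parse('{"hits":[{"uniprot":{"Swiss-Prot":"P00001"}}]}')
multi  = JSON.parse('{"hits":[{"uniprot":{"Swiss-Prot":["P00002","P00003"]}}]}')
empty  = JSON.parse('{"hits":[]}')
```

Note also that a bare `private` does not apply to methods defined with `def self.`; `private_class_method` is the Ruby mechanism for that, although `get_swissprot_name` is called from the presenter and would need to stay public in any case.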