Release the vocabulary/gene map #10

Open
Egiob opened this issue Jul 2, 2024 · 3 comments

Comments

@Egiob

Egiob commented Jul 2, 2024

Hello,
I understand that Nicheformer operates on a vocabulary of 20,310 genes, but I can't find in this repo the map that would convert, say, an Ensembl ID or a gene name to an ID (i.e. a token) in your vocabulary.

Could you provide this gene map please? Or indicate how you constructed it?

Thank you so much.

@yehuicheng2002

@Egiob Hello, have you solved this problem yet?

@dimalvovs

Could it be that the mapping is obtained like this (so that the token 10723 is ENSG00000000003)?

import scanpy as sc

h5ad = sc.read_h5ad("nicheformer/data/model_means/model.h5ad")
h5ad.X
  (0, 10723)	1.0
  (0, 12184)	4.0
  (0, 5297)	1.0
  (0, 17537)	1.0
  (0, 6145)	1.0
  (0, 13799)	1.0
  (0, 3204)	1.0
  (0, 19265)	1.0

h5ad.X.shape
(1, 20310)

h5ad.var
Empty DataFrame
Columns: []
Index: [ENSG00000000003, ENSG00000000005, ENSG00000000419, ENSG00000000457, ENSG00000000460, ENSG00000000938, ENSG00000000971, ENSG00000001036, ENSG00000001084, ENSG00000001167, ENSG00000001460, ENSG00000001461, ENSG00000001497, ENSG00000001561, ENSG00000001617, ENSG00000001626, ENSG00000001629, ENSG00000001630, ENSG00000001631, ENSG00000002016, ENSG00000002330, ENSG00000002549, ENSG00000002586, ENSG00000002587, ENSG00000002726, ENSG00000002745, ENSG00000002746, ENSG00000002822, ENSG00000002834, ENSG00000002919, ENSG00000002933, ENSG00000003056, ENSG00000003096, ENSG00000003137, ENSG00000003147, ENSG00000003249, ENSG00000003393, ENSG00000003400, ENSG00000003402, ENSG00000003436, ENSG00000003509, ENSG00000003756, ENSG00000003987, ENSG00000003989, ENSG00000004059, ENSG00000004139, ENSG00000004142, ENSG00000004399, ENSG00000004455, ENSG00000004468, ENSG00000004478, ENSG00000004487, ENSG00000004534, ENSG00000004660, ENSG00000004700, ENSG00000004766, ENSG00000004776, ENSG00000004777, ENSG00000004779, ENSG00000004799, ENSG00000004809, ENSG00000004838, ENSG00000004846, ENSG00000004848, ENSG00000004864, ENSG00000004866, ENSG00000004897, ENSG00000004939, ENSG00000004948, ENSG00000004961, ENSG00000004975, ENSG00000005001, ENSG00000005007, ENSG00000005020, ENSG00000005022, ENSG00000005059, ENSG00000005073, ENSG00000005075, ENSG00000005100, ENSG00000005102, ENSG00000005108, ENSG00000005156, ENSG00000005175, ENSG00000005187, ENSG00000005189, ENSG00000005194, ENSG00000005206, ENSG00000005238, ENSG00000005243, ENSG00000005249, ENSG00000005302, ENSG00000005339, ENSG00000005379, ENSG00000005381, ENSG00000005421, ENSG00000005436, ENSG00000005448, ENSG00000005469, ENSG00000005471, ENSG00000005483, ...]

[20310 rows x 0 columns]
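
If the token for a gene really is its column position in `model.h5ad`, the lookup table can be built directly from the index shown above. The following is only a sketch of that idea: `build_gene_to_token` and its `offset` parameter are hypothetical names, the three-element list stands in for `list(h5ad.var_names)`, and nothing in this thread confirms whether the model shifts gene tokens to make room for special tokens. Note that in an AnnData object column `i` of `X` corresponds to `var_names[i]`, so under this reading column 10723 would be the 10,724th entry of the index and ENSG00000000003 would be column 0.

```python
from typing import Dict, Iterable


def build_gene_to_token(var_names: Iterable[str], offset: int = 0) -> Dict[str, int]:
    """Map each gene ID to its column position plus an optional offset.

    `offset` is an assumption: use it only if the model turns out to
    reserve the first few token IDs for something other than genes.
    """
    mapping: Dict[str, int] = {}
    for position, gene_id in enumerate(var_names):
        if gene_id in mapping:
            raise ValueError(f"duplicate gene ID: {gene_id}")
        mapping[gene_id] = position + offset
    return mapping


# Stand-in for `list(h5ad.var_names)`: the first three IDs from the output above.
example_var_names = ["ENSG00000000003", "ENSG00000000005", "ENSG00000000419"]

gene_to_token = build_gene_to_token(example_var_names)
token_to_gene = {token: gene for gene, token in gene_to_token.items()}

print(gene_to_token["ENSG00000000419"])  # 2
print(token_to_gene[0])                  # ENSG00000000003
```

With the real file, passing `h5ad.var_names` instead of the example list should give a table with 20,310 entries.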

@dimalvovs

Oh, based on the notebooks (.ipynb) it looks even simpler: we can just take the gene ordering from model.h5ad:

# Loading model with right gene ordering
model = sc.read_h5ad(
    f"{BASE_PATH}/model.h5ad"
)
...
# Concatenation
# Next we concatenate the model and the dissociated object to ensure they are
# in the same order. This ensures we have the same gene ordering in the object.

adata = ad.concat([model, dissociated], join='outer', axis=0)
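
To make the goal of that alignment concrete, here is a plain-Python illustration of putting one cell's counts into a reference gene order. It is not the notebook's code and does not use `ad.concat`: `align_to_reference`, `reference_genes`, and the example cell are all hypothetical. One difference worth noting is that this sketch drops genes that are absent from the reference, whereas an outer join keeps them as extra columns.

```python
from typing import Dict, List, Sequence


def align_to_reference(
    reference_genes: Sequence[str],
    cell_counts: Dict[str, float],
    fill_value: float = 0.0,
) -> List[float]:
    """Return counts ordered like `reference_genes`.

    Genes missing from the cell get `fill_value`; genes that are not in
    the reference are ignored.
    """
    return [cell_counts.get(gene, fill_value) for gene in reference_genes]


# Stand-in for `list(model.var_names)`.
reference_genes = ["ENSG00000000003", "ENSG00000000005", "ENSG00000000419"]

# Hypothetical cell measured on a different gene set, in a different order.
cell = {"ENSG00000000419": 4.0, "ENSG00000000003": 1.0, "ENSG99999999999": 7.0}

aligned = align_to_reference(reference_genes, cell)
print(aligned)  # [1.0, 0.0, 4.0]
```

Once every cell is in the reference order, position `i` in each row refers to the same gene as position `i` in `model.h5ad`.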
