Merge pull request #1 from roldanx/issue-1
Issue 1
roldanx authored Nov 25, 2018
2 parents 28e9370 + c9741de commit 603de5d
Showing 5 changed files with 292 additions and 78 deletions.
@@ -0,0 +1,34 @@
package org.opencb.oskar.spark.commons;


import com.fasterxml.jackson.annotation.JsonInclude;
import com.fasterxml.jackson.databind.MapperFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.SerializationFeature;

import java.io.IOException;

public class PythonUtils {

    private final ObjectMapper objectMapper;

    public PythonUtils() {
        objectMapper = new ObjectMapper()
                .configure(MapperFeature.REQUIRE_SETTERS_FOR_GETTERS, true)
                .configure(SerializationFeature.WRITE_NULL_MAP_VALUES, false)
                .setSerializationInclusion(JsonInclude.Include.NON_DEFAULT);
    }

    public String toJsonString(Object obj) throws IOException {
        return objectMapper.writeValueAsString(obj);
    }

    public Object toJavaObject(String str, String className) throws ClassNotFoundException, IOException {
        return objectMapper.readValue(str, Class.forName(className));
    }

    public <T> T toJavaObject(String str, Class<T> clazz) throws IOException {
        return objectMapper.readValue(str, clazz);
    }

}
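A quick sketch of how this helper might be driven from the Python side through the py4j gateway that PySpark exposes; the access pattern and the Variant class name are illustrative assumptions, not part of this commit:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Instantiate the Java helper through py4j (assumes the Oskar jar is on the Spark classpath).
utils = spark._jvm.org.opencb.oskar.spark.commons.PythonUtils()

# JSON string -> Java object, then back to a JSON string.
# The Variant class name is an assumed example.
variant = utils.toJavaObject('{"chromosome": "22", "start": 16099957}',
                             "org.opencb.biodata.models.variant.Variant")
print(utils.toJsonString(variant))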
224 changes: 184 additions & 40 deletions oskar-spark/src/main/python/notebook/variant_filtering.ipynb
@@ -15,16 +15,9 @@
"First of all we need to import the modules that contain the Oskar API as well as the Spark ones. Also we need to create a new instance of the Oskar object, from which depends the whole functionality, and load our data as \"df\"."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Optional: Create a Temporary View case you plan to acces the data folder from SQL.*"
]
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
@@ -39,12 +32,19 @@
"df.createOrReplaceTempView(\"platinum\")"
]
},
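Most of this setup cell is collapsed in the diff; a minimal sketch of what it likely contains, where the pyoskar import paths and the data path are assumptions (only the createOrReplaceTempView call is visible above):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, array_contains
from pyoskar.core import Oskar            # assumed package layout
from pyoskar.sql import *                 # UDF helpers: genes, sample_data_field, ...

spark = SparkSession.builder.appName("variant_filtering").getOrCreate()
oskar = Oskar(spark)
df = oskar.load("platinum_chr22.parquet")  # assumed data path
df.createOrReplaceTempView("platinum")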
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Optional: Create a Temporary View case you plan to acces the data folder from SQL.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Region filter\n",
"First of all we will execute a variant filtering based on a restricting zone, which in this example we chose as the 22th chromosome and the nucleotides between 17.000.000 and 17.500.000 position"
"First of all we will execute a simple filtering based on a restricted zone for those people who are not familiarizes with spark sintax. In this example we chose the 22th chromosome and the nucleotides between 17.000.000 and 17.500.000 position. This would be two of the statements we could use:"
]
},
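The execution cell for this filter is collapsed below; a minimal sketch of the two statements the text refers to, assuming the usual chromosome/start/end columns of the variant schema:

# DataFrame API:
df.filter((col("chromosome") == "22") &
          (col("start") >= 17000000) & (col("end") <= 17500000)).show(10)

# Equivalent SQL over the "platinum" temporary view:
spark.sql("SELECT id FROM platinum WHERE chromosome = '22' "
          "AND start >= 17000000 AND end <= 17500000").show(10)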
{
@@ -117,8 +117,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Gene filter\n",
"On the second place we would like to execute a filtering which ties up the variants attached to a concrete gene. \"NBEAP3\" was chosen as the target. Here we start preciating the functionality of Oskar API: \"genes\" field is automatically located and easily accesed."
"## Gene\n",
"We may want to execute a filtering which ties up the variants attached to a concrete gene. \"NBEAP3\" was chosen as the target. Here we start preciating the functionality of Oskar API: \"genes\" UDF will automatically locate and acces the field with that same name. We just need to specify the column where we store the gene info: \"annotation\". After that it just lacks to define the filter."
]
},
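The gene-filter cell itself is collapsed below; a sketch of the filter the text describes, built from the genes UDF and array_contains, both of which appear in later cells of this notebook:

df.select(df.id, genes("annotation").alias("genes")) \
  .filter(array_contains("genes", "NBEAP3")) \
  .show(10)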
{
@@ -154,13 +154,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Genotype filter\n",
"In this case we would like to show the way we could get the genotypes corresponding to any sample with our \"sampleDataField\" function."
"## Genotype\n",
"In case we want to get the genotypes corresponding to any sample we could use the \"sample_data_field\" function. We will also need to specify the field where that info is contained (\"studies\") just as \"genes\" function, the sample and the format."
]
},
{
"cell_type": "code",
"execution_count": 20,
"execution_count": 30,
"metadata": {},
"outputs": [
{
@@ -188,12 +188,12 @@
],
"source": [
"# Spark\n",
"df.select(sample_data_field(col(\"studies\"), 'NA12877', 'GT').alias(\"NA12877\")).show(10)"
"df.select(sample_data_field(\"studies\", 'NA12877', 'GT').alias(\"NA12877\")).show(10)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 31,
"metadata": {},
"outputs": [
{
@@ -228,13 +228,13 @@
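The source of this second cell is collapsed; given the "# Spark" label on the previous one, it is presumably the SQL counterpart. A sketch, assuming the Oskar UDFs are also registered for SQL use:

# SQL (assumes sample_data_field is available as a SQL function):
spark.sql("SELECT sample_data_field(studies, 'NA12877', 'GT') AS NA12877 "
          "FROM platinum").show(10)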
"cell_type": "markdown",
"metadata": {},
"source": [
"## Population Frequency filter\n",
"Finally we want to filter our dataframe by using PF field, so we just type \"population_frequency\" and specify the field were we can acces it, the ID of the concrete study we would like to check, and the target population. "
"## Population Frequency\n",
"In order to get the population frequency data we need to use the \"population_frequency\" UDF. Apart from specifying the field were we contain that information, we will need to specify the ID of the concrete study we would like to check and the target population."
]
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 28,
"metadata": {},
"outputs": [
{
@@ -247,13 +247,13 @@
"|22:16054454:C:T| 0.07566695660352707|\n",
"|22:16065809:T:C| 0.14594951272010803|\n",
"|22:16077310:T:A| 0.2338419109582901|\n",
"|22:16080499:A:G| 0.0|\n",
"|22:16084621:T:C| 0.0|\n",
"|22:16091610:G:T|0.003129890421405...|\n",
"|22:16096040:G:A| 0.0|\n",
"|22:16099957:C:T| 0.6782668232917786|\n",
"|22:16100462:A:G| 0.0|\n",
"|22:16105660:G:A| 0.12387744337320328|\n",
"|22:16114913:A:T| 0.11458797007799149|\n",
"|22:16127471:A:-|0.004762233234941959|\n",
"|22:16134019:G:T| 0.03758351877331734|\n",
"|22:16138943:C:G| 0.2861359417438507|\n",
"+---------------+--------------------+\n",
"only showing top 10 rows\n",
"\n"
@@ -269,40 +269,184 @@
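The source of the population-frequency cell above is collapsed; a sketch of a call that would produce such an output, with the GNOMAD_GENOMES study and the ALL population (argument order assumed from the text):

df.select(df.id, population_frequency("annotation", "GNOMAD_GENOMES", "ALL")
          .alias("PF")).show(10)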
"cell_type": "markdown",
"metadata": {},
"source": [
"Otherwise, if we are interested in getting all the Population Frequencies available, we can use this other method named \"population_frequency_as_map\" which will return the whole set as a dictionary format. Then we could apply \"explode\" and convert that dictionary into a new dataframe."
"Otherwise, if we are interested in getting all the Population Frequencies available without caring the population, we can use this other method named \"population_frequency_as_map\" which will return the whole set as a dictionary format. Then we could apply \"explode\" and convert that dictionary into a new dataframe. In this example we have also filtered by variant to get a better-looking output:"
]
},
{
"cell_type": "code",
"execution_count": 43,
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+-----------------------+\n",
"| study|populationFrequenciesDF|\n",
"+--------------------+-----------------------+\n",
"| GNOMAD_GENOMES:OTH| 0.6896551847457886|\n",
"| GNOMAD_GENOMES:ALL| 0.6782668232917786|\n",
"| GNOMAD_GENOMES:AFR| 0.6747621297836304|\n",
"| GNOMAD_GENOMES:NFE| 0.6699579954147339|\n",
"| GNOMAD_GENOMES:MALE| 0.682662844657898|\n",
"| GNOMAD_GENOMES:FIN| 0.6726332306861877|\n",
"| GNOMAD_GENOMES:EAS| 0.9523809552192688|\n",
"| GNOMAD_GENOMES:ASJ| 0.5721649527549744|\n",
"|GNOMAD_GENOMES:FE...| 0.6727412939071655|\n",
"| GNOMAD_GENOMES:AMR| 0.6950549483299255|\n",
"+--------------------+-----------------------+\n",
"+---------------+--------------------+-----------------------+\n",
"| id| study|populationFrequenciesDF|\n",
"+---------------+--------------------+-----------------------+\n",
"|22:16099957:C:T| GNOMAD_GENOMES:OTH| 0.6896551847457886|\n",
"|22:16099957:C:T| GNOMAD_GENOMES:ALL| 0.6782668232917786|\n",
"|22:16099957:C:T| GNOMAD_GENOMES:AFR| 0.6747621297836304|\n",
"|22:16099957:C:T| GNOMAD_GENOMES:NFE| 0.6699579954147339|\n",
"|22:16099957:C:T| GNOMAD_GENOMES:MALE| 0.682662844657898|\n",
"|22:16099957:C:T| GNOMAD_GENOMES:FIN| 0.6726332306861877|\n",
"|22:16099957:C:T| GNOMAD_GENOMES:EAS| 0.9523809552192688|\n",
"|22:16099957:C:T| GNOMAD_GENOMES:ASJ| 0.5721649527549744|\n",
"|22:16099957:C:T|GNOMAD_GENOMES:FE...| 0.6727412939071655|\n",
"|22:16099957:C:T| GNOMAD_GENOMES:AMR| 0.6950549483299255|\n",
"+---------------+--------------------+-----------------------+\n",
"\n"
]
}
],
"source": [
"PF = df.select(df.id, population_frequency_as_map(\"annotation\").alias(\"populationFrequencies\"))\\\n",
" .filter(df.id == \"22:16099957:C:T\")\n",
"PF.select(explode(PF.populationFrequencies).alias(\"study\", \"populationFrequenciesDF\")).show()"
"PF.select(\"id\", explode(PF.populationFrequencies).alias(\"study\", \"populationFrequenciesDF\")).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Consequence type\n",
"Theres two ways we could acces consequence type field. The first one is the \"consequence_types\" UDF."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------------+--------------------+--------------------+\n",
"| id| genes| CT|\n",
"+---------------+--------------------+--------------------+\n",
"|22:16054454:C:T| []|[intergenic_variant]|\n",
"|22:16065809:T:C| [LA16c-4G1.3]|[regulatory_regio...|\n",
"|22:16077310:T:A| [LA16c-4G1.4]|[2KB_upstream_var...|\n",
"|22:16080499:A:G|[LA16c-4G1.4, LA1...|[upstream_gene_va...|\n",
"|22:16084621:T:C| [LA16c-4G1.5]|[TF_binding_site_...|\n",
"|22:16091610:G:T| []|[intergenic_variant]|\n",
"|22:16096040:G:A| [NBEAP3]|[downstream_gene_...|\n",
"|22:16099957:C:T| [NBEAP3]|[2KB_downstream_v...|\n",
"|22:16100462:A:G| [NBEAP3]|[2KB_downstream_v...|\n",
"|22:16105660:G:A| [NBEAP3]|[non_coding_trans...|\n",
"+---------------+--------------------+--------------------+\n",
"only showing top 10 rows\n",
"\n"
]
}
],
"source": [
"df.select(df.id, genes(\"annotation\").alias(\"genes\"), consequence_types(\"annotation\").alias(\"CT\")).show(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The \"consequence_types_by_gene\" UDF, unlike \"consequence_types\", will return the field that contains the consequence types that directly match the specified gene."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------------+--------------------+--------------------+\n",
"| id| genes| CT|\n",
"+---------------+--------------------+--------------------+\n",
"|22:16054454:C:T| []| []|\n",
"|22:16065809:T:C| [LA16c-4G1.3]| []|\n",
"|22:16077310:T:A| [LA16c-4G1.4]| []|\n",
"|22:16080499:A:G|[LA16c-4G1.4, LA1...| []|\n",
"|22:16084621:T:C| [LA16c-4G1.5]| []|\n",
"|22:16091610:G:T| []| []|\n",
"|22:16096040:G:A| [NBEAP3]|[downstream_gene_...|\n",
"|22:16099957:C:T| [NBEAP3]|[2KB_downstream_v...|\n",
"|22:16100462:A:G| [NBEAP3]|[2KB_downstream_v...|\n",
"|22:16105660:G:A| [NBEAP3]|[non_coding_trans...|\n",
"+---------------+--------------------+--------------------+\n",
"only showing top 10 rows\n",
"\n"
]
}
],
"source": [
"df.select(df.id, genes(\"annotation\").alias(\"genes\"), consequence_types_by_gene(\"annotation\", \"NBEAP3\").alias(\"CT\")).show(10, False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we may want to combine that output with a filter like this:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------------+-----------------------------------------------+\n",
"|id |CT |\n",
"+---------------+-----------------------------------------------+\n",
"|22:16105660:G:A|[non_coding_transcript_variant, intron_variant]|\n",
"|22:16112391:G:A|[non_coding_transcript_variant, intron_variant]|\n",
"|22:16114913:A:T|[non_coding_transcript_variant, intron_variant]|\n",
"+---------------+-----------------------------------------------+\n",
"\n"
]
}
],
"source": [
"df.select(df.id, consequence_types_by_gene(\"annotation\", \"NBEAP3\").alias(\"CT\")).filter(array_contains(\"CT\", \"intron_variant\")).show(10, False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Protein substitution score\n",
"Finally this is how we could use \"protein_substitution\" function. We need to know that this UDF gets the MAX and the MIN values between all the posible scores for every variant. A possible execution would look like this:"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------------+--------------+------------+\n",
"| id| polyphen| sift|\n",
"+---------------+--------------+------------+\n",
"|22:25024098:G:A|[0.124, 0.581]| [0.0, 0.12]|\n",
"|22:28378810:T:G|[0.005, 0.005]|[0.37, 0.37]|\n",
"|22:29885644:C:A|[0.001, 0.001]| [1.0, 1.0]|\n",
"|22:32831711:C:G|[0.027, 0.351]|[0.54, 0.78]|\n",
"+---------------+--------------+------------+\n",
"\n"
]
}
],
"source": [
"df.select(\"id\", protein_substitution(\"annotation\", \"polyphen\").alias(\"polyphen\"), protein_substitution(\"annotation\", \"sift\")\\\n",
" .alias(\"sift\")).where(\"polyphen[0]>=0\").show()"
]
}
],