diff --git a/.gitignore b/.gitignore index 613bf65..0049c5e 100644 --- a/.gitignore +++ b/.gitignore @@ -115,9 +115,4 @@ out/ atlassian-ide-plugin.xml # Documentation -docs/examples/find_files -docs/examples/variant -docs/examples/tasks -docs/examples/cli -docs/examples/plugin_system docs/_build/ diff --git a/docs/examples/cli/README.md b/docs/examples/cli/README.md new file mode 100644 index 0000000..0cca1b8 --- /dev/null +++ b/docs/examples/cli/README.md @@ -0,0 +1,4 @@ +# Command-line examples + +- [Introduction](introduction_cli.ipynb) - Description of each command that *OpenVariant* offers through `openvar`. +- [Main](main_cli.ipynb) - A set of examples that can be run in a shell. diff --git a/docs/examples/cli/introduction_cli.ipynb b/docs/examples/cli/introduction_cli.ipynb new file mode 100644 index 0000000..9df6a01 --- /dev/null +++ b/docs/examples/cli/introduction_cli.ipynb @@ -0,0 +1,270 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Introduction" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "**OpenVariant** offers a command-line interface (CLI) for its main tasks, which can also be used from Python scripts.\n", + "\n", + "The following command lists the commands and options that `openvar` provides." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 1, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Usage: openvar [OPTIONS] COMMAND [ARGS]...\n", + "\n", + " 'openvar' is the command-line interface of OpenVariant. 
Parsing and data\n", + " transformation of multiple input formats.\n", + "\n", + "Options:\n", + " --version Show the version and exit.\n", + " -h, --help Show this message and exit.\n", + "\n", + "Commands:\n", + " cat Concatenate parsed files to standard output.\n", + " count Number of rows that matches a specified criterion.\n", + " groupby Group the parsed result for each different value of the specified\n", + " key.\n", + " plugin Actions to execute for a plugin: create.\n" + ] + } + ], + "source": [ + "%%bash\n", + "openvar --help" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "### ***Cat*** command" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 2, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Usage: openvar cat [OPTIONS] [INPUT_PATH]\n", + "\n", + " Print the parsed files on the stdout/\"output\".\n", + "\n", + "Options:\n", + " -w, --where TEXT Filter expression. eg: CHROMOSOME == 4\n", + " -a, --annotations PATH Annotation path. eg: /path/annotation_vcf.yaml\n", + " --header Show the result header.\n", + " -o, --output TEXT File to write the output.\n", + " -h, --help Show this message and exit.\n" + ] + } + ], + "source": [ + "%%bash\n", + "openvar cat --help" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "### ***Count*** command" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 3, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Usage: openvar count [OPTIONS] [INPUT_PATH]\n", + "\n", + " Print on the stdout/\"output\" the number of rows that meets the criteria.\n", + "\n", + "Options:\n", + " -w, --where TEXT Filter expression. 
eg: CHROMOSOME == 4\n", + " -g, --group_by TEXT Key to group rows. eg: COUNTRY\n", + " -a, --annotations PATH Annotation path. eg: /path/annotation_vcf.yaml\n", + " -c, --cores INTEGER Maximum processes to run in parallel.\n", + " -q, --quite Don't show the progress.\n", + " -o, --output TEXT File to write the output.\n", + " -h, --help Show this message and exit.\n" + ] + } + ], + "source": [ + "%%bash\n", + "openvar count --help" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "### ***Group by*** command" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 4, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Usage: openvar groupby [OPTIONS] [INPUT_PATH]\n", + "\n", + " Print on the stdout/\"output\" the parsed files group by a specified field.\n", + "\n", + "Options:\n", + " --header Show the result header.\n", + " --show Show group by each row.\n", + " -w, --where TEXT Filter expression. eg: CHROMOSOME == 4\n", + " -g, --group_by TEXT Key to group rows. eg: COUNTRY\n", + " -s, --script TEXT Filter expression. eg: gzip >\n", + " \\${GROUP_KEY}.parsed.tsv.gz\n", + " -a, --annotations PATH Annotation path. 
eg: /path/annotation_vcf.yaml\n", + " -c, --cores INTEGER Maximum processes to run in parallel.\n", + " -q, --quite Don't show the progress.\n", + " -o, --output TEXT File to write the output.\n", + " -h, --help Show this message and exit.\n" + ] + } + ], + "source": [ + "%%bash\n", + "openvar groupby --help" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "### ***Plugin*** command" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 5, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Usage: openvar plugin [OPTIONS] {create}\n", + "\n", + " Actions to apply on the plugin system.\n", + "\n", + "Options:\n", + " -n, --name TEXT Name of the plugin.\n", + " -d, --directory TEXT Directory to reach the plugin.\n", + " -h, --help Show this message and exit.\n" + ] + } + ], + "source": [ + "%%bash\n", + "openvar plugin --help" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/docs/examples/cli/main_cli.ipynb b/docs/examples/cli/main_cli.ipynb new file mode 100644 index 0000000..f599f3d --- /dev/null +++ b/docs/examples/cli/main_cli.ipynb @@ -0,0 +1,401 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Main tasks" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "We can run 
the main tasks that **OpenVariant** offers as a Python package from a shell command as well. The [Introduction section](./introduction_cli.ipynb) lists the available commands and their options.\n", + "\n", + "The following examples give a short snippet of each task.\n", + "\n", + "--------" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "### ***Cat*** command\n", + "\n", + "A simple case of the command:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 1, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ACAP3\t1p36.33\tMESO\n", + "ACTRT2\t1p36.32\tMESO\n", + "AGRN\t1p36.33\tMESO\n", + "ANKRD65\t1p36.33\tMESO\n", + "ATAD3A\t1p36.33\tMESO\n", + "ATAD3B\t1p36.33\tMESO\n", + "ATAD3C\t1p36.33\tMESO\n", + "AURKAIP1\t1p36.33\tMESO\n", + "B3GALT6\t1p36.33\tMESO\n", + "ACAP3\t1p36.33\tACC\n", + "ACTRT2\t1p36.32\tACC\n", + "AGRN\t1p36.33\tACC\n", + "ANKRD65\t1p36.33\tACC\n", + "ATAD3A\t1p36.33\tACC\n", + "ATAD3B\t1p36.33\tACC\n", + "ATAD3C\t1p36.33\tACC\n", + "AURKAIP1\t1p36.33\tACC\n", + "B3GALT6\t1p36.33\tACC\n" + ] + } + ], + "source": [ + "%%bash\n", + "openvar cat ../datasets/sample2" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Using it with some flags:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 2, + "outputs": [], + "source": [ + "%%bash\n", + "openvar cat ../datasets/sample2 --output ./output_cat_cmd.tsv --header" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "### ***Count*** command\n", + "\n", + "A simple case of the command:" + ], + "metadata": { + 
"collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 3, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TOTAL\t18\n" + ] + } + ], + "source": [ + "%%bash\n", + "openvar count ../datasets/sample2 -q" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Using it with some flags:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 4, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ACAP3\t2\n", + "ACTRT2\t2\n", + "AGRN\t2\n", + "ANKRD65\t2\n", + "ATAD3A\t2\n", + "ATAD3B\t2\n", + "ATAD3C\t2\n", + "AURKAIP1\t2\n", + "B3GALT6\t2\n", + "TOTAL\t18\n" + ] + } + ], + "source": [ + "%%bash\n", + "openvar count ../datasets/sample2 --group_by \"SYMBOL\" --cores 4 -q" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 5, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ACAP3\t1\n", + "ACTRT2\t1\n", + "AGRN\t1\n", + "ANKRD65\t1\n", + "ATAD3A\t1\n", + "ATAD3B\t1\n", + "ATAD3C\t1\n", + "AURKAIP1\t1\n", + "B3GALT6\t1\n", + "TOTAL\t9\n" + ] + } + ], + "source": [ + "%%bash\n", + "openvar count ../datasets/sample2 --group_by \"SYMBOL\" --where \"CANCER == 'ACC'\" --cores 4 -q" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "### ***Group by*** command\n", + "\n", + "A simple case of the command:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 6, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "SYMBOL\tCYTOBAND\tCANCER\n", + "ACC\tACAP3\t1p36.33\tACC\n", + 
"ACC\tACTRT2\t1p36.32\tACC\n", + "ACC\tAGRN\t1p36.33\tACC\n", + "ACC\tANKRD65\t1p36.33\tACC\n", + "ACC\tATAD3A\t1p36.33\tACC\n", + "ACC\tATAD3B\t1p36.33\tACC\n", + "ACC\tATAD3C\t1p36.33\tACC\n", + "ACC\tAURKAIP1\t1p36.33\tACC\n", + "ACC\tB3GALT6\t1p36.33\tACC\n", + "MESO\tSYMBOL\tCYTOBAND\tCANCER\n", + "MESO\tACAP3\t1p36.33\tMESO\n", + "MESO\tACTRT2\t1p36.32\tMESO\n", + "MESO\tAGRN\t1p36.33\tMESO\n", + "MESO\tANKRD65\t1p36.33\tMESO\n", + "MESO\tATAD3A\t1p36.33\tMESO\n", + "MESO\tATAD3B\t1p36.33\tMESO\n", + "MESO\tATAD3C\t1p36.33\tMESO\n", + "MESO\tAURKAIP1\t1p36.33\tMESO\n", + "MESO\tB3GALT6\t1p36.33\tMESO\n" + ] + } + ], + "source": [ + "%%bash\n", + "openvar groupby ../datasets/sample2 -g \"CANCER\" --show --header -q" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Using it with some flags:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 7, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "MESO\tATAD3C\t1p36.33\tMESO\n", + "ACC\tATAD3C\t1p36.33\tACC\n" + ] + } + ], + "source": [ + "%%bash\n", + "openvar groupby ../datasets/sample2 -g \"CANCER\" --where \"SYMBOL == 'ATAD3C'\" --cores 4 --show -q" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 8, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ACC\tATAD3A\t1p36.33\tACC\n", + "ACC\tATAD3B\t1p36.33\tACC\n", + "ACC\tATAD3C\t1p36.33\tACC\n", + "ACC\tB3GALT6\t1p36.33\tACC\n", + "MESO\tATAD3A\t1p36.33\tMESO\n", + "MESO\tATAD3B\t1p36.33\tMESO\n", + "MESO\tATAD3C\t1p36.33\tMESO\n", + "MESO\tB3GALT6\t1p36.33\tMESO\n" + ] + } + ], + "source": [ + "%%bash\n", + "openvar groupby ../datasets/sample2 -g \"CANCER\" --script \"grep -E -i '^(B3|ATA)'\" --show --cores 4 -q" + ], + "metadata": { + 
"collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "\n", + "### ***Plugin*** command\n", + "\n", + "_Plugin_ is slightly different than the others commands.\n", + "\n", + "On this one we have only one single action: `create`." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "%%bash\n", + "openvar plugin create --name add_date --directory ../plugin_system" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/docs/examples/cli/output_cat_cmd.tsv b/docs/examples/cli/output_cat_cmd.tsv new file mode 100644 index 0000000..00a719b --- /dev/null +++ b/docs/examples/cli/output_cat_cmd.tsv @@ -0,0 +1,19 @@ +SYMBOL CYTOBAND CANCER +ACAP3 1p36.33 MESO +ACTRT2 1p36.32 MESO +AGRN 1p36.33 MESO +ANKRD65 1p36.33 MESO +ATAD3A 1p36.33 MESO +ATAD3B 1p36.33 MESO +ATAD3C 1p36.33 MESO +AURKAIP1 1p36.33 MESO +B3GALT6 1p36.33 MESO +ACAP3 1p36.33 ACC +ACTRT2 1p36.32 ACC +AGRN 1p36.33 ACC +ANKRD65 1p36.33 ACC +ATAD3A 1p36.33 ACC +ATAD3B 1p36.33 ACC +ATAD3C 1p36.33 ACC +AURKAIP1 1p36.33 ACC +B3GALT6 1p36.33 ACC diff --git a/docs/examples/find_files/README.md b/docs/examples/find_files/README.md new file mode 100644 index 0000000..88d652f --- /dev/null +++ b/docs/examples/find_files/README.md @@ -0,0 +1,5 @@ +# Find files examples + +- [With directory path](find_files_with_directory_path.ipynb) - Search for all 
possible _input_ files and _annotation_ files + in a directory. +- [With file path](find_files_with_file_path.ipynb) - Search for the _input_ file and _annotation_ file given a file path. diff --git a/docs/examples/find_files/find_files_with_directory_path.ipynb b/docs/examples/find_files/find_files_with_directory_path.ipynb new file mode 100644 index 0000000..a4729b6 --- /dev/null +++ b/docs/examples/find_files/find_files_with_directory_path.ipynb @@ -0,0 +1,174 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "# With directory path" + ] + }, + { + "cell_type": "markdown", + "source": [ + "A simple example of how the find files task works for a directory path." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 1, + "outputs": [], + "source": [ + "from os import getcwd\n", + "from os.path import dirname\n", + "from openvariant import findfiles\n", + "\n", + "dataset_folder = f'{dirname(getcwd())}/datasets/sample1'" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "It will get any file that matches an _annotation_ file found in the same folder or in a child folder.\n", + "\n", + "`findfiles` function parameters:\n", + "\n", + "- `base_path` - Base path of the _input_ folder.\n", + "- `annotation_path` - Path of the _annotation_ file.\n", + "\n", + "As we can see, the output contains two file patterns: `*.vcf.gz` and `*.maf.gz`.\n" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 2, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "File path: /home/dmartinez/openvariant/examples/datasets/sample1/5a3a743.wxs.maf.gz\n", + "Annotation object: \n", + "-------------------------------------\n", + "File path: 
/home/dmartinez/openvariant/examples/datasets/sample1/22f5b2f.wxs.maf.gz\n", + "Annotation object: \n", + "-------------------------------------\n", + "File path: /home/dmartinez/openvariant/examples/datasets/sample1/345c90e.raw_somatic_mutation.vcf.gz\n", + "Annotation object: \n", + "-------------------------------------\n", + "File path: /home/dmartinez/openvariant/examples/datasets/sample1/de46011.raw_somatic_mutation.vcf.gz\n", + "Annotation object: \n", + "-------------------------------------\n", + "File path: /home/dmartinez/openvariant/examples/datasets/sample1/sample1_1/3a70e22.raw_somatic_mutation.vcf.gz\n", + "Annotation object: \n", + "-------------------------------------\n", + "File path: /home/dmartinez/openvariant/examples/datasets/sample1/sample1_1/4c0b87e.raw_somatic_mutation.vcf.gz\n", + "Annotation object: \n", + "-------------------------------------\n" + ] + } + ], + "source": [ + "for file_path, annotation in findfiles(base_path=dataset_folder):\n", + " print(f'File path: {file_path}')\n", + " print(f'Annotation object: {annotation}')\n", + " print(\"-------------------------------------\")" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "In the following example, we will get files with a fixed _Annotation_ file.\n", + "\n", + "All the files that we will be able to detect will follow the pattern described on annotations." 
+ ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 3, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "File path: /home/dmartinez/openvariant/examples/datasets/sample1/5a3a743.wxs.maf.gz\n", + "Annotation object: \n", + "-------------------------------------\n", + "File path: /home/dmartinez/openvariant/examples/datasets/sample1/22f5b2f.wxs.maf.gz\n", + "Annotation object: \n", + "-------------------------------------\n" + ] + } + ], + "source": [ + "annotation_file = f'{dirname(getcwd())}/datasets/sample1/annotation_maf.yaml'\n", + "\n", + "for file_path, annotation in findfiles(base_path=dataset_folder, annotation_path=annotation_file):\n", + " print(f'File path: {file_path}')\n", + " print(f'Annotation object: {annotation}')\n", + " print(\"-------------------------------------\")" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + } + ], + "metadata": { + "celltoolbar": "Tags", + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.12" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} \ No newline at end of file diff --git a/docs/examples/find_files/find_files_with_file_path.ipynb b/docs/examples/find_files/find_files_with_file_path.ipynb new file mode 100644 index 0000000..29e7278 --- /dev/null +++ b/docs/examples/find_files/find_files_with_file_path.ipynb @@ -0,0 +1,197 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# With file path" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "A simple example on 
how the find files task works for a file path." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 3, + "outputs": [], + "source": [ + "from os import getcwd\n", + "from os.path import dirname\n", + "from openvariant import findfiles\n", + "\n", + "dataset_file = f'{dirname(getcwd())}/datasets/sample1/5a3a743.wxs.maf.gz'" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "As in the [directory path examples](find_files_with_directory_path.ipynb), it will get the file and its corresponding _Annotation_ object.\n", + "\n", + "`findfiles` function parameters:\n", + "\n", + "- `base_path` - Base path of the _input_ file.\n", + "- `annotation_path` - Path of the _annotation_ file.\n", + "\n", + "It will only get one file as an _output_: the file that is passed as a parameter. If no _annotation_ file matches the file's pattern, it won't generate any output." 
+ ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 4, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "File path: /home/dmartinez/openvariant/examples/datasets/sample1/5a3a743.wxs.maf.gz\n", + "Annotation object: \n", + "-------------------------------------\n" + ] + } + ], + "source": [ + "for file_path, annotation in findfiles(base_path=dataset_file):\n", + " print(f'File path: {file_path}')\n", + " print(f'Annotation object: {annotation}')\n", + " print(\"-------------------------------------\")" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "It is possible to pass the _annotation_ file as a parameter of the function.\n", + "\n", + "It will generate the same output as it did before, and if the _annotation_ file doesn't match with the _input_ file it will not generate any output." 
+ ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 5, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "File path: /home/dmartinez/openvariant/examples/datasets/sample1/5a3a743.wxs.maf.gz\n", + "Annotation object: \n", + "-------------------------------------\n" + ] + } + ], + "source": [ + "annotation_file = f'{dirname(getcwd())}/datasets/sample1/annotation_maf.yaml'\n", + "\n", + "for file_path, annotation in findfiles(dataset_file, annotation_file):\n", + " print(f'File path: {file_path}')\n", + " print(f'Annotation object: {annotation}')\n", + " print(\"-------------------------------------\")" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "As in the [directory path examples](find_files_with_directory_path.ipynb), it will get the file and its corresponding _Annotation_ object.\n", + "\n", + "It will only get one file as an output: the file that is passed as a parameter. If no _annotation_ file matches the file's pattern, it won't generate any output." 
+ ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 6, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "File path: /home/dmartinez/openvariant/examples/datasets/sample1/5a3a743.wxs.maf.gz\n", + "Annotation object: \n", + "-------------------------------------\n" + ] + } + ], + "source": [ + "for file_path, annotation in findfiles(dataset_file):\n", + " print(f'File path: {file_path}')\n", + " print(f'Annotation object: {annotation}')\n", + " print(\"-------------------------------------\")" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/docs/examples/plugin_system/README.md b/docs/examples/plugin_system/README.md new file mode 100644 index 0000000..0ea62c3 --- /dev/null +++ b/docs/examples/plugin_system/README.md @@ -0,0 +1,3 @@ +# Plugin system examples + +- [Plugin system](plugin_system.ipynb) - A simple example that uses two plugins. 
diff --git a/docs/examples/plugin_system/add_date/__init__.py b/docs/examples/plugin_system/add_date/__init__.py new file mode 100644 index 0000000..344acee --- /dev/null +++ b/docs/examples/plugin_system/add_date/__init__.py @@ -0,0 +1,2 @@ +from .add_date import Add_datePlugin +from .add_date import Add_dateContext diff --git a/docs/examples/plugin_system/add_date/add_date.py b/docs/examples/plugin_system/add_date/add_date.py new file mode 100644 index 0000000..163aa10 --- /dev/null +++ b/docs/examples/plugin_system/add_date/add_date.py @@ -0,0 +1,53 @@ +from datetime import date + +from openvariant.plugins.context import Context +from openvariant.plugins.plugin import Plugin + + +class Add_dateContext(Context): + """ + Add_dateContext class generated by OpenVariant + + Attributes + ------- + row : dict + Row of the input file that is being parsed. + field_name : str + Name of the corresponding column that was described in the annotation schema. + file_path : str + Path of the input file that is being parsed. + """ + + def __init__(self, row: dict, field_name: str, file_path: str) -> None: + super().__init__(row, field_name, file_path) + + +class Add_datePlugin(Plugin): + """ + Add_datePlugin class generated by OpenVariant + + Methods + ------- + run(context: Add_dateContext) + Main method to execute data transformation in each row. + """ + + def run(self, context: Add_dateContext) -> dict: + """ + Data transformation of a single row + + Parameters + ------- + context : Add_dateContext + Representation of the row to be parsed. + + Returns + ------- + float or int or str + The value of the field transformed. 
+ """ + + # This is example code; modify it as you wish + context.row[context.field_name] = str(date.today()) + + return context.row[context.field_name] diff --git a/docs/examples/plugin_system/get_length/__init__.py b/docs/examples/plugin_system/get_length/__init__.py new file mode 100644 index 0000000..930bcb9 --- /dev/null +++ b/docs/examples/plugin_system/get_length/__init__.py @@ -0,0 +1,2 @@ +from .get_length import Get_lengthPlugin +from .get_length import Get_lengthContext diff --git a/docs/examples/plugin_system/get_length/get_length.py b/docs/examples/plugin_system/get_length/get_length.py new file mode 100644 index 0000000..e0a74f2 --- /dev/null +++ b/docs/examples/plugin_system/get_length/get_length.py @@ -0,0 +1,51 @@ +from openvariant.plugins.context import Context +from openvariant.plugins.plugin import Plugin + + +class Get_lengthContext(Context): + """ + Get_lengthContext class generated by OpenVariant + + Attributes + ------- + row : dict + Row of the input file that is being parsed. + field_name : str + Name of the corresponding column that was described in the annotation schema. + file_path : str + Path of the input file that is being parsed. + """ + + def __init__(self, row: dict, field_name: str, file_path: str) -> None: + super().__init__(row, field_name, file_path) + + +class Get_lengthPlugin(Plugin): + """ + Get_lengthPlugin class generated by OpenVariant + + Methods + ------- + run(context: Get_lengthContext) + Main method to execute data transformation in each row. + """ + + def run(self, context: Get_lengthContext) -> dict: + """ + Data transformation of a single row + + Parameters + ------- + context : Get_lengthContext + Representation of the row to be parsed. + + Returns + ------- + float or int or str + The value of the field transformed. 
+ """ + + # This is example code; modify it as you wish + context.row[context.field_name] = str(int(context.row['END']) - int(context.row['START'])) + + return context.row[context.field_name] diff --git a/docs/examples/plugin_system/plugin_system.ipynb b/docs/examples/plugin_system/plugin_system.ipynb new file mode 100644 index 0000000..8469a7c --- /dev/null +++ b/docs/examples/plugin_system/plugin_system.ipynb @@ -0,0 +1,74 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Plugin system example" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 1, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CHROM\tDATE\tSTART\tEND\tLENGTH\n", + "1\t2022-07-29\t2488138\t14192672\t11704534\n", + "1\t2022-07-29\t15296019\t16248794\t952775\n", + "1\t2022-07-29\t16258674\t45800123\t29541449\n", + "1\t2022-07-29\t45805909\t46714227\t908318\n", + "1\t2022-07-29\t46715732\t51434461\t4718729\n", + "1\t2022-07-29\t51436105\t51441361\t5256\n", + "1\t2022-07-29\t53811942\t59249761\t5437819\n", + "1\t2022-07-29\t59250761\t75413622\t16162861\n", + "1\t2022-07-29\t78414455\t147282446\t68867991\n", + "1\t2022-07-29\t150547161\t155880500\t5333339\n", + "1\t2022-07-29\t156785626\t245977996\t89192370\n", + "2\t2022-07-29\t4717089\t28539877\t23822788\n", + "2\t2022-07-29\t29416439\t36353114\t6936675\n", + "2\t2022-07-29\t38467258\t99204005\t60736747\n" + ] + } + ], + "source": [ + "%%bash\n", + "openvar cat ../datasets/sample3 --header" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": 
"ipython2", + "version": "2.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/docs/examples/tasks/README.md b/docs/examples/tasks/README.md new file mode 100644 index 0000000..c3d045d --- /dev/null +++ b/docs/examples/tasks/README.md @@ -0,0 +1,5 @@ +# Task examples + +- [Cat](cat.ipynb) - Bunch of examples for _cat_ task. +- [Count](count.ipynb) - Bunch of examples for _count_ task. +- [Group by](group_by.ipynb) - Bunch of examples for _group by_ task. diff --git a/docs/examples/tasks/cat.ipynb b/docs/examples/tasks/cat.ipynb new file mode 100644 index 0000000..75b9e14 --- /dev/null +++ b/docs/examples/tasks/cat.ipynb @@ -0,0 +1,170 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# _Cat_" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "A simple example where we can find how **cat** task works. This task is able with command-line." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 1, + "outputs": [], + "source": [ + "from os import getcwd\n", + "from os.path import dirname\n", + "from openvariant import cat\n", + "\n", + "dataset_folder = f'{dirname(getcwd())}/datasets/sample2'" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "`cat` task allows us to show on the standard output the parsed result. It has the following parameters:\n", + "\n", + "- `base_path` - Input path to explore and parse.\n", + "- `annotation_path` - Path of the _annotation_ path.\n", + "- `where` - Filter expression.\n", + "- `header_show` - Show header on the result.\n", + "- `output` - File path to save the output result." 
+ ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 2, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "SYMBOL\tCYTOBAND\tCANCER\n", + "ACAP3\t1p36.33\tMESO\n", + "ACTRT2\t1p36.32\tMESO\n", + "AGRN\t1p36.33\tMESO\n", + "ANKRD65\t1p36.33\tMESO\n", + "ATAD3A\t1p36.33\tMESO\n", + "ATAD3B\t1p36.33\tMESO\n", + "ATAD3C\t1p36.33\tMESO\n", + "AURKAIP1\t1p36.33\tMESO\n", + "B3GALT6\t1p36.33\tMESO\n", + "ACAP3\t1p36.33\tACC\n", + "ACTRT2\t1p36.32\tACC\n", + "AGRN\t1p36.33\tACC\n", + "ANKRD65\t1p36.33\tACC\n", + "ATAD3A\t1p36.33\tACC\n", + "ATAD3B\t1p36.33\tACC\n", + "ATAD3C\t1p36.33\tACC\n", + "AURKAIP1\t1p36.33\tACC\n", + "B3GALT6\t1p36.33\tACC\n" + ] + } + ], + "source": [ + "cat(base_path=dataset_folder)" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "One of the parameters on `cat` task is `where`. You will be able to apply a conditional filter. 
The supported operators are:\n", + "\n", + "+ `==` - Equal to.\n", + "+ `!=` - Not equal to.\n", + "+ `<=` - Less than or equal to.\n", + "+ `<` - Less than.\n", + "+ `>=` - Greater than or equal to.\n", + "+ `>` - Greater than.\n", + "\n", + "The following example uses this parameter:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 3, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "SYMBOL\tCYTOBAND\tCANCER\n", + "ATAD3C\t1p36.33\tMESO\n", + "ATAD3C\t1p36.33\tACC\n" + ] + } + ], + "source": [ + "cat(base_path=dataset_folder, where=\"SYMBOL == 'ATAD3C'\")" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/docs/examples/tasks/count.ipynb b/docs/examples/tasks/count.ipynb new file mode 100644 index 0000000..15957ef --- /dev/null +++ b/docs/examples/tasks/count.ipynb @@ -0,0 +1,158 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# _Count_" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "A simple example showing how the **count** task works. This task is also available from the command line."
+ ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 1, + "outputs": [], + "source": [ + "from os import getcwd\n", + "from os.path import dirname\n", + "from openvariant import count\n", + "\n", + "dataset_folder = f'{dirname(getcwd())}/datasets/sample2'\n", + "annotation_path = f'{dirname(getcwd())}/datasets/sample2/annotation.yaml'" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "The `count` task counts the number of rows in the parsed result. It has the following parameters:\n", + "\n", + "- `base_path` - Input path to explore and parse.\n", + "- `annotation_path` - Path of the annotation file.\n", + "- `group_by` - Key to group rows.\n", + "- `where` - Filter expression.\n", + "- `cores` - Maximum processes to run in parallel.\n", + "- `quite` - Do not show progress while parsing is running.\n", + "\n", + "The following example shows a general case of the `count` task:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 2, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Total: 18\n" + ] + } + ], + "source": [ + "result = count(base_path=dataset_folder, annotation_path=annotation_path, quite=True)\n", + "print(f\"Total: {result[0]}\")" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "One of the parameters of the `count` task is `where`, which applies a conditional filter. 
The supported operators are:\n", + "\n", + "+ `==` - Equal to.\n", + "+ `!=` - Not equal to.\n", + "+ `<=` - Less than or equal to.\n", + "+ `<` - Less than.\n", + "+ `>=` - Greater than or equal to.\n", + "+ `>` - Greater than.\n", + "\n", + "The `group_by` parameter groups rows by the distinct values of the given key. The following example uses both parameters:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 3, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Total: 2\n", + "Groups and count: {'MESO': 1, 'ACC': 1}\n" + ] + } + ], + "source": [ + "result = count(base_path=dataset_folder, annotation_path=annotation_path, where=\"SYMBOL == 'ATAD3C'\", group_by=\"CANCER\", quite=True)\n", + "print(f\"Total: {result[0]}\")\n", + "print(f\"Groups and count: {result[1]}\")" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/docs/examples/tasks/group_by.ipynb b/docs/examples/tasks/group_by.ipynb new file mode 100644 index 0000000..3a5c467 --- /dev/null +++ b/docs/examples/tasks/group_by.ipynb @@ -0,0 +1,239 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# _Group by_" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "A simple example showing how the **group by** task works. 
This task is also available from the command line.\n" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 1, + "outputs": [], + "source": [ + "from os.path import dirname\n", + "from os import getcwd\n", + "from openvariant import group_by\n", + "\n", + "dataset_folder = f'{dirname(getcwd())}/datasets/sample2'\n", + "annotation_path = f'{dirname(getcwd())}/datasets/sample2/annotation.yaml'" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "The `group_by` task groups rows by the value of an output field. It has the following parameters:\n", + "\n", + "- `base_path` - Input path to explore and parse.\n", + "- `annotation_path` - Path of the annotation file.\n", + "- `script` - Shell command to execute on the parsed result.\n", + "- `key_by` - Key to group rows.\n", + "- `where` - Filter expression.\n", + "- `cores` - Maximum processes to run in parallel.\n", + "- `quite` - Do not show progress while parsing is running.\n", + "- `header` - Show the header in the result.\n", + "\n", + "The following example shows a general case of the `group by` task:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 2, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Group: MESO\n", + "ACAP3\t1p36.33\tMESO\n", + "ACTRT2\t1p36.32\tMESO\n", + "AGRN\t1p36.33\tMESO\n", + "ANKRD65\t1p36.33\tMESO\n", + "ATAD3A\t1p36.33\tMESO\n", + "ATAD3B\t1p36.33\tMESO\n", + "ATAD3C\t1p36.33\tMESO\n", + "AURKAIP1\t1p36.33\tMESO\n", + "B3GALT6\t1p36.33\tMESO\n", + "\n", + "\n", + "Group: ACC\n", + "ACAP3\t1p36.33\tACC\n", + "ACTRT2\t1p36.32\tACC\n", + "AGRN\t1p36.33\tACC\n", + "ANKRD65\t1p36.33\tACC\n", + "ATAD3A\t1p36.33\tACC\n", + "ATAD3B\t1p36.33\tACC\n", + "ATAD3C\t1p36.33\tACC\n", + "AURKAIP1\t1p36.33\tACC\n", + 
"B3GALT6\t1p36.33\tACC\n", + "\n", + "\n" + ] + } + ], + "source": [ + "for group, values, script_used in group_by(base_path=dataset_folder, annotation_path=annotation_path, script=None, key_by=\"CANCER\", quite=True):\n", + " print(f'Group: {group}')\n", + " for row in values:\n", + " print(row)\n", + " print(\"\\n\")" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "One of the parameters on `count` task is `where`. You will be able to apply a conditional filter. The possible operations can be:\n", + "\n", + "+ `==` - Equal.\n", + "+ `!=` - Not equal.\n", + "+ `<=` - Less or equal than.\n", + "+ `<` - Less than.\n", + "+ `>=` - More or equal than.\n", + "+ `>` - More than.\n", + "\n", + "One example of this parameter is the following one:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 3, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Group: MESO\n", + "ATAD3C\t1p36.33\tMESO\n", + "\n", + "\n", + "Group: ACC\n", + "ATAD3C\t1p36.33\tACC\n", + "\n", + "\n" + ] + } + ], + "source": [ + "for group, values, script_used in group_by(base_path=dataset_folder, annotation_path=annotation_path, script=None,where=\"SYMBOL == 'ATAD3C'\", key_by=\"CANCER\", quite=True):\n", + " print(f'Group: {group}')\n", + " for row in values:\n", + " print(row)\n", + " print(\"\\n\")" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Also, on `group by` task, there is `script` parameter which will allow to the user to execute a command shell on the parsed result. 
The following example counts the characters in each group of the parsed output:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 4, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Group: MESO\n", + "181\n", + "\n", + "\n", + "Group: ACC\n", + "172\n", + "\n", + "\n" + ] + } + ], + "source": [ + "for group, values, script_used in group_by(base_path=dataset_folder, annotation_path=annotation_path, script=\"wc -m\", key_by=\"CANCER\", quite=True):\n", + " print(f'Group: {group}')\n", + " for row in values:\n", + " print(row)\n", + " print(\"\\n\")" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/docs/examples/variant/README.md b/docs/examples/variant/README.md new file mode 100644 index 0000000..4a8bf58 --- /dev/null +++ b/docs/examples/variant/README.md @@ -0,0 +1,4 @@ +# Variant examples + +- [Read](read.ipynb) - Scan and read parsed data. +- [Save](save.ipynb) - Save parsed data to a specific location. 
\ No newline at end of file diff --git a/docs/examples/variant/read.ipynb b/docs/examples/variant/read.ipynb new file mode 100644 index 0000000..ecf3e9e --- /dev/null +++ b/docs/examples/variant/read.ipynb @@ -0,0 +1,363 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Read" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "A simple example of how **Variant** reads a file and how the parsed result can be handled." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 1, + "outputs": [], + "source": [ + "from os import getcwd\n", + "from os.path import dirname\n", + "from openvariant import Annotation, Variant\n", + "\n", + "dataset_file = f'{dirname(getcwd())}/datasets/sample1/22f5b2f.wxs.maf.gz'\n", + "annotation_file = f'{dirname(getcwd())}/datasets/sample1/annotation_maf.yaml'" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "`Annotation` object generated from an _annotation_ file. Parameters:\n", + "\n", + "- `annotation_path` - Path of the _annotation_ file.\n", + "\n", + "`Variant` object to iterate through the parsed file. Parameters:\n", + "\n", + "- `path` - Path of the _input_ file.\n", + "- `annotation` - Annotation object used to parse the _input_ file.\n", + "\n", + "One of the main functions of _Variant_ is `read`. It generates an iterator to scan the parsed file.\n", + "\n", + "`read` function parameters:\n", + "\n", + "- `where` - Filter expression.\n", + "- `group_key` - Key to group rows.\n", + "\n", + "\n", + "This example gets the first 10 lines of the parsed file using an _annotation_ file."
+ ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 2, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Line 0: {'POSITION': '16963', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}\n", + "Line 1: {'POSITION': '17691', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}\n", + "Line 2: {'POSITION': '98933', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}\n", + "Line 3: {'POSITION': '139058', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}\n", + "Line 4: {'POSITION': '186112', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}\n", + "Line 5: {'POSITION': '187146', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}\n", + "Line 6: {'POSITION': '187153', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}\n", + "Line 7: {'POSITION': '187264', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}\n", + "Line 8: {'POSITION': '187323', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}\n", + "Line 9: {'POSITION': '187363', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}\n" + ] + } + ], + "source": [ + "annotation = Annotation(annotation_path=annotation_file)\n", + "result = Variant(path=dataset_file, annotation=annotation)\n", + "\n", + "for n_line, line in enumerate(result.read()):\n", + " print(f'Line {n_line}: {line}')\n", + " if n_line == 9:\n", + " break" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "As we can see in the output each line is a `dict` where the `key` is the field of the parsed result and the `value` is the value in that 
cell.\n", + "\n", + "**Variant** has several attributes that we can explore:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 3, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Headers: ['POSITION', 'DATASET', 'SAMPLE', 'STRAND_REF', 'PLATFORM']\n", + "Input file: /home/dmartinez/openvariant/examples/datasets/sample1/22f5b2f.wxs.maf.gz\n" + ] + } + ], + "source": [ + "print('Headers: ', result.header)\n", + "print('Input file: ', result.path)" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "We can also inspect the _Annotation_ that was used to parse the input file. It exposes the following attributes:\n", + "\n", + "+ _Annotation_ file path - `path`\n", + "+ Format - `format`\n", + "+ Annotations - `annotations`\n", + "+ Columns - `columns`\n", + "+ Delimiter - `delimiter`\n", + "+ Excludes - `excludes`\n", + "+ Patterns - `patterns`\n", + "+ Structure - `structure`" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 4, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'PLATFORM': ('STATIC', 'WGS'), 'POSITION': ('INTERNAL', ['Position', 'Start', 'Start_Position', 'Pos', 'Chromosome_Start', 'POS'], , nan), 'DATASET': ('FILENAME', , re.compile('(.*)')), 'SAMPLE': ('DIRNAME', , re.compile('(.*)')), 'STRAND': ('INTERNAL', ['Strand', 'Chromosome_Strand', ''], , nan), 'STRAND_REF': ('MAPPING', ['STRAND'], {'+': 'POS', '-': 'NEG'})}\n" + ] + } + ], + "source": [ + "print(result.annotation.annotations)" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "One of the parameters of the `read` function is `where`, which applies a conditional filter. 
The supported operators are:\n", + "\n", + "+ `==` - Equal to.\n", + "+ `!=` - Not equal to.\n", + "+ `<=` - Less than or equal to.\n", + "+ `<` - Less than.\n", + "+ `>=` - Greater than or equal to.\n", + "+ `>` - Greater than.\n", + "\n", + "The following example uses this parameter:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 5, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'POSITION': '186112', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}\n" + ] + } + ], + "source": [ + "annotation = Annotation(annotation_path=annotation_file)\n", + "result = Variant(path=dataset_file, annotation=annotation)\n", + "\n", + "for n_line, line in enumerate(result.read(where=\"POSITION == 186112\")):\n", + " print(f'{line}')" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "`read` also accepts a `group_key` parameter, which groups rows by the value of that key." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "**Variant** can be combined with `findfiles`, as the following example shows. It prints the first 3 lines of each input file."
+ ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 6, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "File: 5a3a743.wxs.maf.gz \n", + "\n", + "Line 0: {'POSITION': '65872', 'DATASET': '5a3a743', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}\n", + "Line 1: {'POSITION': '131628', 'DATASET': '5a3a743', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}\n", + "Line 2: {'POSITION': '183697', 'DATASET': '5a3a743', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}\n", + "\n", + "\n", + "File: 22f5b2f.wxs.maf.gz \n", + "\n", + "Line 0: {'POSITION': '16963', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}\n", + "Line 1: {'POSITION': '17691', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}\n", + "Line 2: {'POSITION': '98933', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}\n", + "\n", + "\n", + "File: 345c90e.raw_somatic_mutation.vcf.gz \n", + "\n", + "Line 0: {'POSITION': '10267', 'DATASET': '345c90e', 'PLATFORM': 'WGS', 'INFO': 'WGS:T_C'}\n", + "Line 1: {'POSITION': '10273', 'DATASET': '345c90e', 'PLATFORM': 'WGS', 'INFO': 'WGS:T_C'}\n", + "Line 2: {'POSITION': '10321', 'DATASET': '345c90e', 'PLATFORM': 'WGS', 'INFO': 'WGS:C_T'}\n", + "\n", + "\n", + "File: de46011.raw_somatic_mutation.vcf.gz \n", + "\n", + "Line 0: {'POSITION': '10105', 'DATASET': 'de46011', 'PLATFORM': 'WGS', 'INFO': 'WGS:A_C'}\n", + "Line 1: {'POSITION': '10381', 'DATASET': 'de46011', 'PLATFORM': 'WGS', 'INFO': 'WGS:T_C'}\n", + "Line 2: {'POSITION': '10438', 'DATASET': 'de46011', 'PLATFORM': 'WGS', 'INFO': 'WGS:A_T'}\n", + "\n", + "\n", + "File: 3a70e22.raw_somatic_mutation.vcf.gz \n", + "\n", + "Line 0: {'POSITION': '10033', 'DATASET': '3a70e22', 'PLATFORM': 'WGS', 'INFO': 'WGS:A_C'}\n", + "Line 1: {'POSITION': '10075', 'DATASET': '3a70e22', 
'PLATFORM': 'WGS', 'INFO': 'WGS:A_C'}\n", + "Line 2: {'POSITION': '10087', 'DATASET': '3a70e22', 'PLATFORM': 'WGS', 'INFO': 'WGS:A_C'}\n", + "\n", + "\n", + "File: 4c0b87e.raw_somatic_mutation.vcf.gz \n", + "\n", + "Line 0: {'POSITION': '10105', 'DATASET': '4c0b87e', 'PLATFORM': 'WGS', 'INFO': 'WGS:A_C'}\n", + "Line 1: {'POSITION': '10241', 'DATASET': '4c0b87e', 'PLATFORM': 'WGS', 'INFO': 'WGS:T_C'}\n", + "Line 2: {'POSITION': '10267', 'DATASET': '4c0b87e', 'PLATFORM': 'WGS', 'INFO': 'WGS:T_C'}\n", + "\n", + "\n" + ] + } + ], + "source": [ + "from os.path import basename\n", + "from openvariant import findfiles\n", + "\n", + "dataset_folder = f'{dirname(getcwd())}/datasets/sample1'\n", + "\n", + "for file_path, annotation in findfiles(base_path=dataset_folder):\n", + " result = Variant(path=file_path, annotation=annotation)\n", + "\n", + " n_line = 1\n", + " print('File: ', basename(file_path), '\\n')\n", + " for n_line, line in enumerate(result.read()):\n", + " print(f'Line {n_line}: {line}')\n", + " if n_line == 2:\n", + " print(\"\\n\")\n", + " break" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/docs/examples/variant/save.ipynb b/docs/examples/variant/save.ipynb new file mode 100644 index 0000000..25cad7b --- /dev/null +++ b/docs/examples/variant/save.ipynb @@ -0,0 +1,169 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Save" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "markdown", + 
"source": [ + "A simple example on how **Variant** can save the output.\n" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 1, + "outputs": [], + "source": [ + "from os import getcwd\n", + "from os.path import dirname\n", + "from openvariant import Annotation, Variant\n", + "\n", + "dataset_file = f'{dirname(getcwd())}/datasets/sample1/22f5b2f.wxs.maf.gz'\n", + "annotation_file = f'{dirname(getcwd())}/datasets/sample1/annotation_maf.yaml'" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "`Annotation` object generated from _annotation_ file. Parameters:\n", + "\n", + "- `annotation_path` - Path of _annotation_ file.\n", + "\n", + "`Variant` object to iterate through the parsed file. Parameters:\n", + "\n", + "- `path` - Path of _input_ file.\n", + "- `annotation` - Annotation object which _input_ will be parsed.\n", + "\n", + "One of the main functions of _Variant_ is `read`. It will generate an iterator to scan the parsed file.\n", + "\n", + "`save` function parameters:\n", + "\n", + "- `file_path` - Path where file will be saved.\n", + "- `mode` - Mode to write the _output_.\n", + " - `a` - The cursor starts at the end of the file.\n", + " - `w` - The cursor starts at the begging of the file.\n", + "- `display_header` - It will write the headers on the _output_ file.\n", + "\n", + "\n", + "In this example, it will save the parsed input in an _output_ file." 
+ ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 2, + "outputs": [], + "source": [ + "annotation = Annotation(annotation_file)\n", + "result = Variant(dataset_file, annotation)\n", + "\n", + "output_file = f'{dirname(getcwd())}/datasets/sample1/output.tsv'\n", + "result.save(file_path=output_file, display_header=True)" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "It is also possible to combine the `save` function with `findfiles`, which finds the input files so that the parsed output of each one can be appended to a single file." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 3, + "outputs": [], + "source": [ + "from openvariant import findfiles\n", + "\n", + "output_file_append = f'{dirname(getcwd())}/datasets/sample1/output_append.tsv'\n", + "\n", + "annotation_file = f'{dirname(getcwd())}/datasets/sample1/annotation_maf.yaml'\n", + "annotation = Annotation(annotation_file)\n", + "\n", + "dataset_folder = f'{dirname(getcwd())}/datasets/sample1'\n", + "\n", + "for file_path, _ in findfiles(dataset_folder):\n", + " result = Variant(file_path, annotation)\n", + "\n", + " try:\n", + " result.save(output_file_append, mode=\"a\")\n", + " except NameError:\n", + " pass" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + 
"pygments_lexer": 
"ipython2", + "version": "2.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file