diff --git a/recipes/3p_integrations/data-prep-kit/Data-prep-kit-diagram.png b/recipes/3p_integrations/data-prep-kit/Data-prep-kit-diagram.png new file mode 100644 index 000000000..09288fb40 Binary files /dev/null and b/recipes/3p_integrations/data-prep-kit/Data-prep-kit-diagram.png differ diff --git a/recipes/3p_integrations/data-prep-kit/Readme.md b/recipes/3p_integrations/data-prep-kit/Readme.md new file mode 100644 index 000000000..e69de29bb diff --git a/recipes/3p_integrations/data-prep-kit/end_2_end_code_data_prep_finetuning.ipynb b/recipes/3p_integrations/data-prep-kit/end_2_end_code_data_prep_finetuning.ipynb new file mode 100644 index 000000000..fc123fff6 --- /dev/null +++ b/recipes/3p_integrations/data-prep-kit/end_2_end_code_data_prep_finetuning.ipynb @@ -0,0 +1,891 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "NbF_Zw3KBazf" + }, + "source": [ + "# **Demo on building data prep pipeline for model fine tuning**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uxyvT77U3O_w" + }, + "source": [ + "\n", + " \"Open\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_-NOkuTxiP7r", + "outputId": "043f32fc-c476-433e-86b6-d7e9abd4d285" + }, + "source": [ + "This demo notebook shows how to use [data-prep-kit](https://github.com/IBM/data-prep-kit) to build a data preparation pipeline that can be used for fine tuning or extended pre-training. We will discuss the various data preparation steps to process raw data (code repositories), tokenise it that can then be fine tuned using any popular code models. We will also discuss a novel recipe for semantic ordering of files in a repository which has shown to enhance model training. Please see our [paper](https://arxiv.org/abs/2407.13739) here for more details. For this demo, we will use the [codeparrot/github-code](https://huggingface.co/datasets/codeparrot/github-code) dataset hosted on Hugging Face datasets.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7BrICcwo3O_x" + }, + "source": [ + "## Setup\n", + "\n", + "Install data-prep-toolkit and datasets library. This notebook requires atleast 4 cpus.\n", + "To run on google colab, it is recommended to change the runtime to TPUs to get the required number of cpus.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "STVDCN1l3O_y" + }, + "outputs": [], + "source": [ + "%%capture logpip --no-stderr\n", + "!pip install data-prep-toolkit-transforms-ray==0.2.1.dev1\n", + "!pip install datasets\n", + "!pip install pandas" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8VhIsZViaU2i" + }, + "source": [ + "We use parallel processing capability using Ray, so that beyond the demo, a user can also use this for actual production runs on larger datasets, with minor code changes. Please read [here](https://github.com/IBM/data-prep-kit?tab=readme-ov-file#-about-) on various features of data-prep-kit that includes flexibility of compute to run from laptop to cluster. 
There are three parameters, that the user can change, as per usecase:\n", + "\n", + "`runtime_num_worker`: number of parallel workers to be used\n", + "\n", + "`num_cpus`: number of cpus to be used per worker\n", + "\n", + "`run_locally: True` start a ray cluster for parallel computation\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "J_UbnF9wbj95" + }, + "outputs": [], + "source": [ + "from data_processing_ray.runtime.ray import RayTransformLauncher\n", + "from data_processing.utils import ParamsUtils\n", + "import sys\n", + "import json\n", + "import pandas as pd\n", + "#Default parameters for computation\n", + "worker_options = {\"num_cpus\": 0.8}\n", + "common_config_params = {\n", + " \"run_locally\": True,\n", + " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", + " \"runtime_num_workers\": 2,\n", + " }\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "18EtjZAO3O_0" + }, + "source": [ + "\n", + "\n", + "We will do all the processing in `sample_data` folder. This concludes our setup section." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JQ2duPlp3O_1" + }, + "outputs": [], + "source": [ + "!rm -rf sample_data\n", + "!mkdir -p sample_data\n", + "!mkdir -p sample_data/hf_2_parquet" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ULW-gYFT3O_1" + }, + "source": [ + "## Data Preparation Steps\n", + "\n", + "We now discuss the various data preparation steps to transform the raw data to a tokenised format post cleaning and transforming the data. We use the [parquet data format](https://parquet.apache.org/) for all our operations. This helps to efficiently scale the data for actual production runs, beyond the demo.\n", + "\n", + "1. HuggingFace2Parquet: Read the dataset from HF and convert into parquet format.\n", + "2. Exact Deduplication: Remove exact duplicates.\n", + "3. Fuzzy Deduplication: Remove near duplicates.\n", + "4. Programming Lang Selection: Select the programming languages to be used for the analysis.\n", + "5. Code Quality Annotations: Annotate whether a given code file is of high quality or not using various rules.\n", + "6. Filtering: Filter dataset to retain only programming language of interest.\n", + "7. Semantic Ordering: Organise code files by their semantic dependencies. \n", + "8. Tokenization: Tokenise the data for model fine tuning.\n", + "\n", + "The data processing pipeline is organised such that the output of the previous transform is used as input to the next one. Refer to the papers [here](https://arxiv.org/pdf/2405.04324) and [here](https://arxiv.org/abs/2407.13739) for complete details for each of the above steps." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xliMSdQEEwYx" + }, + "source": [ + "## 1. Huggingface datasets to Parquet\n", + "\n", + "This is the first component of this pipeline. It ingests a dataset `codeparrot/github-code` from huggingface and converts it into\n", + "parquet files for consumption by the next steps in this data processing pipeline.\n", + "\n", + "For this demo we are trying to process a few records. The following fields can be updated in case you want to use more data.\n", + "_total_files_ = 10
\n", + "_rows_per_file_ = 10\n", + "\n", + "The output of this stage of the pipeline would be written to `sample_data/hf_2_parquet`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "wit7ic1GauWN" + }, + "outputs": [], + "source": [ + "import os\n", + "import pyarrow as pa\n", + "import pyarrow.parquet as pq\n", + "\n", + "from datasets import load_dataset\n", + "\n", + "import uuid\n", + "from data_processing.utils import TransformUtils\n", + "from collections import defaultdict\n", + "\n", + "DATASET_NAME='codeparrot/github-code'\n", + "\n", + "ds = load_dataset(DATASET_NAME,\n", + " streaming=True,\n", + " split=\"train\",\n", + " trust_remote_code=True)\n", + "\n", + "def row_mapper(row):\n", + " return {\n", + " 'ext': TransformUtils.get_file_extension(row['path'])[1],\n", + " 'document_id': str(uuid.uuid4())\n", + " }\n", + "\n", + "parquet_data_output = \"sample_data/hf_2_parquet\"\n", + "\n", + "## Converts a subset of a Hugging Face dataset to a Parquet file, optionally mapping and renaming columns.\n", + "def hf_dataset_to_parquet(ds, skip, nrows, file_name, mapper=None, renamed_columns=[]):\n", + " dst_ = ds.skip(skip).take(nrows)\n", + "\n", + " data_dict = defaultdict(list)\n", + "\n", + " dst = dst_.map(mapper)\n", + "\n", + " for data in dst:\n", + " for k, v in data.items():\n", + " data_dict[k].append(v)\n", + "\n", + " for old, new in renamed_columns:\n", + " data_dict[new] = data_dict[old]\n", + " del data_dict[old]\n", + "\n", + " table = pa.Table.from_pydict(data_dict)\n", + " pq.write_table(table, file_name)\n", + "\n", + "\n", + "## Create parquet files\n", + "\n", + "total_files = 20\n", + "rows_per_file = 20\n", + "for num in range(total_files):\n", + " file_name = os.path.join(\n", + " f\"{parquet_data_output}\",\n", + " f\"data_{num}.parquet\"\n", + " )\n", + " print (f\"Writing {file_name}\")\n", + " hf_dataset_to_parquet(ds,\n", + " 1 * rows_per_file,\n", + " rows_per_file,\n", + " file_name=file_name,\n", + " mapper=row_mapper,\n", + " renamed_columns=[(\"code\", \"contents\"),\n", + " (\"path\", \"title\")])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2160M1Da3O_2" + }, + "outputs": [], + "source": [ + "#Function to read parquet files in a directory as pandas dataframe\n", + "from pathlib import Path\n", + "def read_parquet_bulk(dir_path):\n", + " data_dir = Path(dir_path)\n", + " # Get the list of all Parquet files in the directory\n", + " parquet_files = list(data_dir.glob('*.parquet'))\n", + " # Check if the directory contains any Parquet files\n", + " if not parquet_files:\n", + " raise ValueError(f\"No Parquet files found in directory: {dir_path}\")\n", + " # Concatenate all Parquet files into a single DataFrame\n", + " full_df = pd.concat(\n", + " pd.read_parquet(parquet_file)\n", + " for parquet_file in parquet_files\n", + " ).reset_index(drop=True)\n", + "\n", + " return full_df\n", + "\n", + "\n", + "input_df=read_parquet_bulk(parquet_data_output)\n", + "\n", + "print(\"No of rows, No of columns\",input_df.shape)\n", + "print(\"Sample data \\n \")\n", + "input_df.head(1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0ncsFW6E3O_2" + }, + "source": [ + "## 2. Exact deduplication\n", + "\n", + "This step will find exact duplicates in the 'content' column and remove them. This is done by computing SHA256 hash on the code files and remove records having identical hashes.\n", + "\n", + "The transform specific params for exact deduplication are:
\n", + " _ededup_hash_cpu_ - Number of cpus per worker
\n", + " _ededup_num_hashes_ - Number of workers used to store hashes
\n", + " _ededup_doc_column_ - Name of column which has to be checked for deduplication
\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "bRUfjHExbd1g" + }, + "outputs": [], + "source": [ + "import os\n", + "import sys\n", + "from ededup_transform_ray import EdedupRayTransformConfiguration\n", + "\n", + "input_folder = parquet_data_output # Output of previous stage is used as input.\n", + "output_folder = \"sample_data/ededup_out\"\n", + "\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "\n", + "ededup_params = {\n", + " # ededup parameters\n", + " \"ededup_hash_cpu\": 0.5,\n", + " \"ededup_num_hashes\": 2,\n", + " \"ededup_doc_column\": \"contents\",\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf)\n", + "}\n", + "\n", + "params = common_config_params | ededup_params\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "ededup_launcher = RayTransformLauncher(EdedupRayTransformConfiguration())\n", + "ededup_launcher.launch()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "fKk2l_Wt3O_3" + }, + "outputs": [], + "source": [ + "import json\n", + "import pprint\n", + "def read_metadata(path):\n", + " with open(path, 'r') as file:\n", + " metadata = json.load(file)\n", + " pprint.pp(metadata)\n", + "read_metadata(f\"{output_folder}/metadata.json\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6Nt_kmX33O_3" + }, + "source": [ + "## 3. Fuzzy Deduplication\n", + "\n", + "This step will find near duplicates and remove them. The code is broken into two code cells, one for adding document ids to the parquet file and then running fuzzy dedup. Document id addition is a prerequisite for fuzzy dedup.\n", + "\n", + "We first add the document ids as an additional column to the parquet files.
\n", + "_doc_column_ - specifies name of the column containing the document (required for ID generation)
\n", + "_hash_column_ - specifies name of the column created to hold the string document id, if None, id is not generated
\n", + "_int_id_column_ - specifies name of the column created to hold the integer document id, if None, id is not generated
\n", + "At least one of hash_column or int_id_column must be specified.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "H4cYttNlbgf0" + }, + "outputs": [], + "source": [ + "input_folder = \"sample_data/ededup_out\"\n", + "output_folder = \"sample_data/docid_out\"\n", + "\n", + "\n", + "from doc_id_transform_ray import DocIDRayTransformConfiguration\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "\n", + "doc_id_params = {\n", + " # doc id configuration\n", + " \"doc_id_doc_column\": \"contents\",\n", + " \"doc_id_hash_column\": \"hash_column\",\n", + " \"doc_id_int_column\": \"int_id_column\",\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf)\n", + "}\n", + "\n", + "params = doc_id_params | common_config_params\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "launcher = RayTransformLauncher(DocIDRayTransformConfiguration())\n", + "launcher.launch()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "BFhoBCWW3O_4" + }, + "outputs": [], + "source": [ + "input_df=read_parquet_bulk(output_folder)\n", + "input_df.head(1)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "emTz5QoA3O_4" + }, + "outputs": [], + "source": [ + "read_metadata(f\"{output_folder}/metadata.json\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YCBkr7vn3O_4" + }, + "source": [ + "Post adding the document ids, the next step is to run fuzzy deduplication. We apply a two-step method for this: (1) compute MinHashes of all the documents and then utilize Locally Sensitive Hashing (LSH) to group documents based on their MinHash fingerprints, (2) measure Jaccard similarity between each pair of documents\n", + "in the same bucket and annotate documents except one as duplicates based on a similarity\n", + "threshold. \n", + "\n", + "Some important transform specific params are:
\n", + "_fdedup_doc_column_ - Column to be used for deduplication
\n", + "_fdedup_threshold_ - specifies the Jaccard similarity threshold (default is 0.7)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "b11MMQEheO6q" + }, + "outputs": [], + "source": [ + "input_folder = \"sample_data/docid_out\"\n", + "output_folder = \"sample_data/fdedup_out\"\n", + "\n", + "import os\n", + "import sys\n", + "\n", + "from data_processing.utils import ParamsUtils\n", + "from fdedup_transform_ray import FdedupRayTransformConfiguration\n", + "\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "worker_options = {\"num_cpus\": 0.8}\n", + "code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n", + "fdedup_params = {\n", + " # columns used\n", + " \"fdedup_doc_column\": \"contents\",\n", + " \"fdedup_id_column\": \"int_id_column\",\n", + " \"fdedup_cluster_column\": \"hash_column\",\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf)\n", + "}\n", + "\n", + "params = common_config_params| fdedup_params\n", + "\n", + "# Pass commandline params\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# launch\n", + "fdedup_launcher = RayTransformLauncher(FdedupRayTransformConfiguration())\n", + "fdedup_launcher.launch()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "jb0LIS4s3O_5" + }, + "outputs": [], + "source": [ + "read_metadata(f\"{output_folder}/metadata.json\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9TtLkWhx3O_5" + }, + "source": [ + "## 4. Programming Language Selection\n", + "\n", + "This module helps retain the code files for language of interest which can be specified using selected_languages_file. Post this step, a new column is added, that contains the programming language name. One can use the code in the Filtering step to do analytics on how many files are found for which languages and thereby selectively filter.\n", + "\n", + "The important parameters used by this transform are:
\n", + "_lang_allowed_langs_file_key_ - A file with a list of allowed languages.
\n", + "_lang_lang_column_key_ - The name of column which has programming language.
\n", + "_lang_output_column_key_ - The name of annotation column.
\n", + "\n", + "For this demo, we will use this [file](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/proglang_select/python/test-data/languages/allowed-code-languages.txt) to specify languages of interest and the module will add a new column called \"language_of_interest\" which can have two values 0/1. 1 is added for all rows that have code files belonging to programming language specified in the list." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "QGaG8NWUAbAu" + }, + "outputs": [], + "source": [ + "input_folder = \"sample_data/fdedup_out\"\n", + "output_folder = \"sample_data/ps_out\"\n", + "\n", + "# download allowed-code-languages.txt\n", + "\n", + "# Create a file with language of interest\n", + "! echo \"C\" >> allowed-code-languages.txt\n", + "\n", + "selected_languages_file = \"./allowed-code-languages.txt\"\n", + "\n", + "from proglang_select_transform_ray import ProgLangSelectRayConfiguration\n", + "from proglang_select_transform import (\n", + " lang_allowed_langs_file_key,\n", + " lang_lang_column_key,\n", + " lang_output_column_key,\n", + ")\n", + "\n", + "# create parameters\n", + "language_column_name = \"language\"\n", + "annotated_column_name = \"language_of_interest\"\n", + "\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "\n", + "langselect_config = {\n", + " lang_allowed_langs_file_key: selected_languages_file,\n", + " lang_lang_column_key: language_column_name,\n", + " lang_output_column_key: annotated_column_name,\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf)\n", + "}\n", + "\n", + "params = common_config_params| langselect_config\n", + "\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# create launcher\n", + "launcher = RayTransformLauncher(ProgLangSelectRayConfiguration())\n", + "launcher.launch()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "T89oTl4r3O_5" + }, + "outputs": [], + "source": [ + "read_metadata(f\"{output_folder}/metadata.json\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oBK11Bju3O_5" + }, + "outputs": [], + "source": [ + "read_parquet_bulk(output_folder).head(10)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fXUaoXk83O_6" + }, + "source": [ + "## 5. 
Code Quality\n", + "\n", + "We experiment with various code quality metrics but finally retain the four code quality metrics used by (Li et al., 2023) to balance the tradeoff between code quality versus data volume.\n", + "\n", + "Quality metrics\n", + "\n", + "'line_mean': Average of the total line lengths.\n", + "'line_max': Maximum line length present .\n", + "'total_num_lines': Total number of lines present\n", + "'avg_longest_lines': Average of the first n longest lines, where n can be any number you choose.\n", + "'alphanum_frac': Calculates average of alpha numeric with respect to total data\n", + "'char_token_ratio': Computes character/token ratio of the file with tokenizer\n", + "'autogenerated': Check if file is autogenerated by looking for keywords in the first few lines of the file.\n", + "'config_or_test': Check if file is a configuration file or a unit test\n", + "'has_no_keywords': Check if a python file has none of the keywords - for funcion, class, for loop, while loop.\n", + "'has_few_assignments': Check if file uses symbol '=' less than 'minimum' times\n", + "'is_xml': Check if input data is xml content\n", + "'is_html': Check if input data is HTML files based on displayed text VS code ratio" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "TPpftqPl3O_6" + }, + "outputs": [], + "source": [ + "input_folder = \"sample_data/ps_out\"\n", + "output_folder = \"sample_data/cq_out\"\n", + "\n", + "from code_quality_transform_ray import CodeQualityRayTransformConfiguration\n", + "\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "# ??\n", + "\n", + "\n", + "language_column_name = \"language\"\n", + "params = {\n", + " \"cq_contents_column_name\": \"contents\",\n", + " \"cq_language_column_name\": language_column_name,\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf)\n", + "}\n", + "\n", + "params = common_config_params| params\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# create launcher\n", + "launcher = RayTransformLauncher(CodeQualityRayTransformConfiguration())\n", + "# launch\n", + "launcher.launch()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "kGvCtJvF3O_6" + }, + "outputs": [], + "source": [ + "read_metadata(f\"{output_folder}/metadata.json\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "V4KLjEr13O_6" + }, + "outputs": [], + "source": [ + "read_parquet_bulk(output_folder).head(10)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oXu_i9jLAo9H" + }, + "source": [ + "## 6. Filtering\n", + "\n", + "This step can be used to filter the code files based on our chosen conditions. In this demo example, we have only used one annotation of adding programming language names for each code file. To demonstrate the utility, we will use this module to retain only code files of interest." 
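+    ,"\n",
+    "The filter criteria configured in the next cell are, conceptually, equivalent to the following pandas operation (illustration only; the transform itself evaluates the criteria when launched):\n",
+    "\n",
+    "```python\n",
+    "df = read_parquet_bulk('sample_data/cq_out')\n",
+    "kept = df[df['language_of_interest'] == 1].drop(columns=['language_of_interest', 'hash_column'])\n",
+    "```"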
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "OAl7B58oAyZQ" + }, + "outputs": [], + "source": [ + "input_folder = \"sample_data/cq_out\"\n", + "output_folder = \"sample_data/filter_out\"\n", + "\n", + "\n", + "from filter_transform import (\n", + " filter_columns_to_drop_cli_param,\n", + " filter_criteria_cli_param,\n", + " filter_logical_operator_cli_param,\n", + ")\n", + "from filter_transform_ray import FilterRayTransformConfiguration\n", + "\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "\n", + "# This is just an example criteria to filter\n", + "filter_criteria = [\n", + " \"language_of_interest = 1\",\n", + "]\n", + "filter_logical_operator = \"AND\"\n", + "filter_columns_to_drop = [\"language_of_interest\", \"hash_column\"]\n", + "\n", + "filter_params = {\n", + " filter_criteria_cli_param: filter_criteria,\n", + " filter_columns_to_drop_cli_param: filter_columns_to_drop,\n", + " filter_logical_operator_cli_param: filter_logical_operator,\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf)\n", + "}\n", + "\n", + "\n", + "sys.argv = ParamsUtils.dict_to_req(common_config_params| filter_params)\n", + "launcher = RayTransformLauncher(FilterRayTransformConfiguration())\n", + "launcher.launch()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "6xkoyqt23O_7" + }, + "outputs": [], + "source": [ + "read_metadata(f\"{output_folder}/metadata.json\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yzoYG-_g3O_7" + }, + "source": [ + "## 7. Semantic Ordering of Code Files\n", + "\n", + "In this step, we order the code files such that we pack files from the same repository together, arranging them to prioritize semantic dependencies. We identify these dependencies by analyzing file imports and create a directed acyclic graph, where each file is a node and edges represent API imports between files. After breaking any cycles in the graph, we perform a topological sort to establish an ordering of files based on their semantic dependencies. We then organize the files in a repository by placing documentation and build files first, followed by the ordered set of files with semantic dependencies, and finally the remaining non-connected files. These non-connected files are arranged according to their folder structure, using a depth-first search to traverse the repository. Finally, we determine the dominant programming language of a repository based on file extensions and presence of build files, to organise repo-ordered files by programming languages.\n", + "\n", + "\n", + "This transform has following parameters:
\n", + " _repo_lvl_sorting_enabled_ - If True, the repo level output is sorted using _repo_lvl_sorting_algo_
\n", + " _repo_lvl_sorting_algo_ - Select the sorting algorithm to be used for repo level sorting. Use SORT_SEMANTIC_NORMALISED to organise by semantic dependencies or SORT_BY_PATH to arrange files based on folder structure in a repository.
\n", + " _repo_lvl_store_backend_dir_ - Directory to use for local store. Needed only when repo_lvl_store_type=local
\n", + " _repo_lvl_output_by_langs_ - If True, it organises output into folders of programming language.
\n", + " _repo_lvl_combine_rows_ - If True, it combines the contents of repo into a single row.
\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "EuR-2QsX3O_7" + }, + "outputs": [], + "source": [ + "input_folder = \"sample_data/filter_out\"\n", + "output_folder = \"sample_data/rlo_out\"\n", + "\n", + "import tempfile\n", + "from repo_level_order_transform import RepoLevelOrderRayTransformConfiguration\n", + "with tempfile.TemporaryDirectory() as tmpdirname:\n", + "\n", + " # create parameters\n", + " local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + " }\n", + "\n", + " worker_options = {\"num_cpus\": 0.8}\n", + " code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n", + "\n", + " repo_level_params = {\n", + " \"repo_lvl_sorting_algo\": \"SORT_SEMANTIC_NORMALISED\",\n", + " \"repo_lvl_store_type\": \"local\",\n", + " \"repo_lvl_store_backend_dir\": tmpdirname,\n", + " \"repo_lvl_output_by_langs\": True,\n", + " \"repo_lvl_combine_rows\": True,\n", + " \"repo_lvl_sorting_enabled\": True,\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf)\n", + " }\n", + "\n", + "\n", + " sys.argv = ParamsUtils.dict_to_req(d= common_config_params| repo_level_params)\n", + " launcher = RayTransformLauncher(RepoLevelOrderRayTransformConfiguration())\n", + " launcher.launch()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "UT9Sz7lm3O_8" + }, + "outputs": [], + "source": [ + "read_metadata(f\"{output_folder}/metadata.json\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "byK75Kb1A3E7" + }, + "source": [ + "## 8. Tokenization\n", + "\n", + "Next, we tokenize the data to be used for fine tuning.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "kBYg93WMBBq6" + }, + "outputs": [], + "source": [ + "input_folder = \"sample_data/rlo_out\"\n", + "output_folder = \"sample_data/tokenize_out\"\n", + "\n", + "from tokenization_transform_ray import TokenizationRayConfiguration\n", + "\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "\n", + "tf_params= {\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf)\n", + "}\n", + "sys.argv = ParamsUtils.dict_to_req(d=common_config_params| tf_params)\n", + "# create launcher\n", + "launcher = RayTransformLauncher(TokenizationRayConfiguration())\n", + "# Launch the ray actor(s) to process the input\n", + "launcher.launch()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2dP6gkY63PAD" + }, + "outputs": [], + "source": [ + "read_metadata(f\"{output_folder}/metadata.json\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "b7xsQA693PAD" + }, + "outputs": [], + "source": [ + "read_parquet_bulk(f\"{output_folder}/C\").head(5)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xFUrzzjeBFfJ" + }, + "source": [ + "**The data is now ready for extended pretraining or fine tuning using any open source code models.**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "FD81Ojjt3PAE" + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "accelerator": "TPU", + "colab": { + "gpuType": "V28", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, 
+ "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.2" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file