Merge branch 'main' into code_cleanup

cleanlab · Nov 20, 2023 · 37b0bd8
2 parents c228a12 + 9de8f33
Showing 9 changed files with 170 additions and 61 deletions.
diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md
@@ -110,6 +110,25 @@ You can install the git hook scripts with:
 pre-commit install
 ```
 
+## How to build `cleanvision` docs locally?
+
+1. Install the required packages to build the docs:
+```shell
+pip install -r docs/requirements.txt
+```
+2. Install [pandoc](https://pandoc.org/installing.html)
+
+3. Build the docs using `sphinx-build`
+```shell
+sphinx-build docs/source cleanvision-docs
+```
+
+**Note for faster build**: Executing the Jupyter Notebooks (i.e., the .ipynb files) that make up some portion of the docs, such as the tutorials, takes a long time. If you want to skip rendering these, set the environment variable `SKIP_NOTEBOOKS=1`, for example by running `export SKIP_NOTEBOOKS=1` before building.
+
+4. To view the docs, open the file `cleanvision-docs/index.html` in a browser.
+
+
+
 ### EditorConfig
 
 This repo uses [EditorConfig](https://editorconfig.org/) to keep code style

diff --git a/README.md b/README.md
@@ -66,6 +66,7 @@ imagelab.report(issue_types=issue_types)
 - [Additional example notebooks](https://github.com/cleanlab/cleanvision-examples)
 - [Documentation](https://cleanvision.readthedocs.io/)
 - [Blog Post](https://cleanlab.ai/blog/cleanvision/)
+- [FAQ](https://cleanvision.readthedocs.io/en/latest/faq.html)
 
 ## *Clean* your data for better Computer *Vision*
 

diff --git a/...ce/cleanvision/dataset/folder_dataset.rst → ...ce/cleanvision/dataset/fsspec_dataset.rst b/...ce/cleanvision/dataset/folder_dataset.rst → ...ce/cleanvision/dataset/fsspec_dataset.rst
@@ -1,7 +1,7 @@
-Folder Dataset
+Fsspec Dataset
 ==============
 
-.. automodule:: cleanvision.dataset.folder_dataset
+.. automodule:: cleanvision.dataset.fsspec_dataset
    :autosummary:
    :members:
    :undoc-members:

diff --git a/docs/source/cleanvision/dataset/index.rst b/docs/source/cleanvision/dataset/index.rst
@@ -10,7 +10,7 @@ Dataset
 
 .. toctree::
     base_dataset
-    folder_dataset
+    fsspec_dataset
     hf_dataset
     torch_dataset
     utils
diff --git a/docs/source/faq.rst b/docs/source/faq.rst
@@ -0,0 +1,68 @@
+Frequently Asked Questions
+==========================
+
+Answers to frequently asked questions about the `cleanvision <https://github.com/cleanlab/cleanvision/>`_ open-source package.
+
+1. **What kind of machine learning tasks can I use CleanVision for?**
+
+CleanVision is independent of any particular machine learning task: it works directly on images and does not require any labels or metadata to detect issues in the dataset. The issues detected by CleanVision are helpful for all kinds of machine learning tasks.
+
+2. **Can I check for specific issues in my dataset?**
+
+
+Yes, you can specify issues like ``light`` or ``blurry`` in the ``issue_types`` argument when calling ``Imagelab.find_issues``:
+
+.. code-block:: python3
+
+    imagelab.find_issues(issue_types={"light": {}, "blurry": {}})
+
+
+3. **What dataset formats does CleanVision support?**
+
+
+Apart from plain image files stored locally or in the cloud, CleanVision also works with HuggingFace and Torchvision datasets. You can use the dataset objects as is with the ``image_key`` argument.
+
+.. code-block:: python3
+
+    imagelab = Imagelab(hf_dataset=dataset, image_key="image")
+
+For more detailed usage instructions and examples, check the :ref:`tutorials`.
+
+Commonly encountered errors
+---------------------------
+
+- **RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.**
+
+.. code-block:: console
+
+    This probably means that you are not using fork to start your
+    child processes and you have forgotten to use the proper idiom
+    in the main module:
+
+        if __name__ == '__main__':
+            freeze_support()
+            ...
+
+    The "freeze_support()" line can be omitted if the program
+    is not going to be frozen to produce an executable.
+
+    To fix this issue, refer to the "Safe importing of main module"
+    section in https://docs.python.org/3/library/multiprocessing.html
+
+
+The above issue is caused by the ``multiprocessing`` module working differently on macOS and Windows platforms. A detailed discussion of the issue can be found `here <https://github.com/cleanlab/cleanlab/issues/159>`_.
+One fix is to run CleanVision inside an ``if __name__ == "__main__":`` guard, like this:
+
+.. code-block:: python3
+
+    if __name__ == "__main__":
+
+        imagelab = Imagelab(data_path)
+        imagelab.find_issues()
+        imagelab.report()
+
+Alternatively, use ``n_jobs=1`` to disable parallel processing:
+
+.. code-block:: python3
+
+    imagelab.find_issues(n_jobs=1)
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -4,47 +4,50 @@
 
 Documentation
 =======================================
+
 CleanVision automatically detects various issues in image datasets, such as images that are: (near) duplicates, blurry,
 over/under-exposed, etc. This data-centric AI package is designed as a quick first step for any computer vision project
 to find problems in your dataset, which you may want to address before applying machine learning.
 
 
 Installation
-============
-
-To install the latest stable version (recommended):
+------------
 
-.. code-block:: console
+.. tabs::
 
-   $ pip install cleanvision
+   .. tab:: pip
 
+      .. code-block:: bash
 
-To install the bleeding-edge developer version:
+         pip install cleanvision
 
-.. code-block:: console
+      To install the package with all optional dependencies:
 
-   $ pip install git+https://github.com/cleanlab/cleanvision.git
+      .. code-block:: bash
 
-To install with HuggingFace optional dependencies
+         pip install "cleanvision[all]"
 
-.. code-block:: console
+   .. tab:: source
 
-   $ pip install "cleanvision[huggingface]"
+      .. code-block:: bash
 
-To install with Torchvision optional dependencies
+         pip install git+https://github.com/cleanlab/cleanvision.git
 
-.. code-block:: console
+      To install the package with all optional dependencies:
 
-   $ pip install "cleanvision[pytorch]"
+      .. code-block:: bash
 
+         pip install "git+https://github.com/cleanlab/cleanvision.git#egg=cleanvision[all]"
 
 
 
 
-Quickstart
-===========
+How to Use CleanVision
+----------------------
 
-1. Using CleanVision to audit your image data is as simple as running the code below:
+Basic Usage
+^^^^^^^^^^^
+Here's how to quickly audit your image data:
 
 
 .. code-block:: python3
@@ -60,8 +63,9 @@ Quickstart
     # Produce a neat report of the issues found in your dataset
     imagelab.report()
 
-2. CleanVision diagnoses many types of issues, but you can also check for only specific issues:
-
+Targeted Issue Detection
+^^^^^^^^^^^^^^^^^^^^^^^^
+You can also focus on specific issues:
 
 .. code-block:: python3
 
@@ -72,8 +76,9 @@ Quickstart
     # Produce a report with only the specified issue_types
     imagelab.report(issue_types.keys())
 
-3. Run CleanVision on a Hugging Face dataset
-
+Integration with Hugging Face Dataset
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Easily use CleanVision with a Hugging Face dataset:
 
 .. code-block:: python3
 
@@ -90,7 +95,9 @@ Quickstart
 
     imagelab.report()
 
-4. Run CleanVision on a Torchvision dataset
+Integration with Torchvision Dataset
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+CleanVision works smoothly with Torchvision datasets too:
 
 
 .. code-block:: python3
@@ -111,29 +118,32 @@ Quickstart
     imagelab.report()
 
 
-More on how to get started with CleanVision:
-- `Example Python script <https://github.com/cleanlab/cleanvision/blob/main/docs/source/tutorials/run.py>`_
-- `Example Notebooks <https://github.com/cleanlab/cleanvision-examples>`_
-- `How To Contribute <https://github.com/cleanlab/cleanvision/blob/main/CONTRIBUTING.md>`_
+Additional Resources
+--------------------
+- Get started with our `Example Notebook <https://cleanvision.readthedocs.io/en/latest/tutorials/tutorial.html>`_
+- Explore more `Example Notebooks <https://github.com/cleanlab/cleanvision-examples>`_
+- Learn how to contribute in the `Contribution Guide <https://github.com/cleanlab/cleanvision/blob/main/CONTRIBUTING.md>`_
 
 
 .. toctree::
    :hidden:
-   :maxdepth: 1
-   :caption: Getting Started
 
    Quickstart <self>
-.. _api-reference:
 
+
+.. _tutorials:
 .. toctree::
    :hidden:
    :maxdepth: 3
    :caption: Tutorials
+   :name: _tutorials
 
-   tutorials/tutorial.ipynb
+   How to Use CleanVision <tutorials/tutorial.ipynb>
    tutorials/torchvision_dataset.ipynb
    tutorials/huggingface_dataset.ipynb
+   Frequently Asked Questions <faq>
 
+.. _api-reference:
 .. toctree::
    :hidden:
    :maxdepth: 3
@@ -153,3 +163,4 @@ More on how to get started with CleanVision:
    GitHub <https://github.com/cleanlab/cleanvision.git>
    PyPI <https://pypi.org/project/cleanvision/>
    Cleanlab Studio <https://cleanlab.ai/studio/?utm_source=cleanvision&utm_medium=docs&utm_campaign=clostostudio>
+
diff --git a/docs/source/tutorials/tutorial.ipynb b/docs/source/tutorials/tutorial.ipynb
@@ -5,7 +5,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Overview"
+    "# How to Use CleanVision"
    ]
   },
   {
@@ -30,13 +30,13 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {
-    "nbsphinx": "hidden",
-    "tags": []
+    "nbsphinx": "hidden"
    },
    "source": [
+    "Use `pip install cleanvision` to install a stable release of the package.\n",
+    "\n",
     "**After you install these packages, you may need to restart your notebook runtime before running the rest of this notebook.**"
    ]
   },
@@ -72,38 +72,26 @@
     "This notebook uses an example dataset, that you can download using these commands."
    ]
   },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "wget - nc 'https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip'\n",
-    "\n",
-    "unzip -q image_files.zip"
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
-    "nbsphinx": "hidden",
-    "tags": []
+    "nbsphinx": "hidden"
    },
    "outputs": [],
    "source": [
-    "!wget - nc 'https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip'"
+    "!wget -nc 'https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip'\n",
+    "!unzip -q image_files.zip"
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "nbsphinx": "hidden",
-    "tags": []
-   },
-   "outputs": [],
+   "cell_type": "markdown",
+   "metadata": {},
    "source": [
-    "!unzip -q image_files.zip"
+    "```shell\n",
+    "wget -nc 'https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip'\n",
+    "unzip -q image_files.zip\n",
+    "```"
    ]
   },
   {
@@ -804,7 +792,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**Beyond the collection of image files demonstrated here, you can alternatively run CleanVision on: [Hugging Face datasets](https://github.com/cleanlab/cleanvision/blob/main/docs/source/tutorials/huggingface_dataset.ipynb) and [torchvision datasets](https://github.com/cleanlab/cleanvision/blob/main/docs/source/tutorials/torchvision_dataset.ipynb).**"
+    "Beyond the collection of image files demonstrated here, you can alternatively run CleanVision on: [Hugging Face datasets](https://github.com/cleanlab/cleanvision/blob/main/docs/source/tutorials/huggingface_dataset.ipynb), [torchvision datasets](https://github.com/cleanlab/cleanvision/blob/main/docs/source/tutorials/torchvision_dataset.ipynb), as well as [files in cloud storage buckets like S3, GCS, or Azure](https://github.com/cleanlab/cleanvision-examples/blob/main/cloud_dataset.ipynb)."
    ]
   }
  ],

diff --git a/src/cleanvision/utils/viz_manager.py b/src/cleanvision/utils/viz_manager.py
@@ -113,4 +113,4 @@ def plot_image_grid(
             set_image_on_axes(images[i], axes[i], titles[i])
     else:
         set_image_on_axes(images[0], axes, titles[0])
-    plt.show()  # type: ignore
+    plt.show()
diff --git a/tests/test_viz_manager.py b/tests/test_viz_manager.py
@@ -1,7 +1,7 @@
 import pytest
 from PIL import Image
 
-from cleanvision.utils.viz_manager import VizManager
+from cleanvision.utils.viz_manager import VizManager, truncate_titles
 
 
 class TestVizManager:
@@ -30,3 +30,25 @@ def test_individual_images(self, images, title_info):
     )
     def test_image_sets(self, image_sets, title_info_sets):
         VizManager.image_sets(image_sets, title_info_sets, 4, (2, 2))
+
+
+def test_truncate_titles():
+    assert truncate_titles(
+        2,
+        [
+            "/home/usr/proj/dev/product/dataset/images/image_0001.img",
+            "/home/usr/proj/dev/product/dataset/images/image_0002.img",
+        ],
+    ) == ["...es/image_0001.img", "...es/image_0002.img"]
+
+    assert truncate_titles(2, ["image.jpeg", "image2.jpeg"]) == [
+        "image.jpeg",
+        "image2.jpeg",
+    ]
+    assert truncate_titles(
+        2,
+        [
+            "/pictures/mount/image_0001.img",
+            "/home/usr/proj/dev/product/dataset/images/image_0002.img",
+        ],
+    ) == ["/pictures/mount/i...", "/home/usr/proj/de..."]
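
For readers of this diff, the behavior pinned down by `test_truncate_titles` can be illustrated with a small standalone sketch. This is not the implementation in `cleanvision.utils.viz_manager`, which the diff does not show. The function name `truncate_titles_sketch`, the `max_len=20` limit, and the rule for choosing between keeping the head or the tail of a title are all assumptions inferred only from the three assertions in the test.

```python
from typing import List

ELLIPSIS = "..."


def truncate_titles_sketch(titles: List[str], max_len: int = 20) -> List[str]:
    """Hypothetical stand-in for truncate_titles, inferred from the tests only.

    Assumed rule: titles longer than max_len are cut to max_len characters
    including the ellipsis. If cutting from the right would make two titles
    share the same visible head, the tail is kept instead.
    """
    keep = max_len - len(ELLIPSIS)

    # Nothing to do when every title already fits.
    if all(len(t) <= max_len for t in titles):
        return list(titles)

    # If the visible heads collide, the distinguishing part sits at the end.
    heads = {t[:keep] for t in titles}
    keep_tail = len(heads) < len(titles)

    result = []
    for t in titles:
        if len(t) <= max_len:
            result.append(t)
        elif keep_tail:
            result.append(ELLIPSIS + t[-keep:])
        else:
            result.append(t[:keep] + ELLIPSIS)
    return result


if __name__ == "__main__":
    print(truncate_titles_sketch([
        "/home/usr/proj/dev/product/dataset/images/image_0001.img",
        "/home/usr/proj/dev/product/dataset/images/image_0002.img",
    ]))
```

The sketch takes the character limit directly rather than the column count that the real function receives as its first argument, since the mapping from columns to a character limit is not visible in this diff.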