Update README.md
New API explained. Simple showcase.
SaelKimberly authored Jan 16, 2024
1 parent f36d3b2 commit fa42064
Showing 1 changed file with 59 additions and 32 deletions.

Short for Read XLS\[X|B\].

Reading both XLSX and XLSB files, fast and memory-safe, with Python, into PyArrow.

## Description

This module provides a single function, `read`, for reading both `.xlsx` and `.xlsb` files.

```python
import polars as pl
import pandas as pd

from rxls.reader import read

# Any path to an .xlsx or .xlsb file, or a BytesIO with its contents.
some_file = "example.xlsx"

# `read` returns a pyarrow.Table; convert it with either library.
polars_df = pl.from_arrow(read(some_file, header=True))
pandas_df = read(some_file, header=True).to_pandas()
```
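
The positional parameters described under Parameters below also allow selecting a sheet and reading a multi-row header. A minimal sketch, assuming a made-up workbook with a sheet named "Sales" and a two-row header:

```python
from rxls.reader import read

# Hypothetical file and sheet; `sheet` is passed positionally (index or name),
# and `header=2` reads the first two rows as a multi-row header.
table = read("sales_2023.xlsb", "Sales", header=2)
print(table.column_names)
```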

## Parameters:

- Positional:
  - **file**: path to the file, or a `BytesIO` object containing the file.
  - **sheet**: index or name of the sheet to read (default: `0`).
  - **header**:
    - **`bool`**: `True` when a header is present in the sheet and is one row high.
    - **`int`**: `0` when no header is present in the sheet, `N` for a header that is `N` rows high.
    - **`list[str]`**: when no header is present in the sheet, but we know what it should be.
- Keyword-only (see the usage sketches after this list):
  - **dtypes**: Specify datatypes for columns.
    - **`Sequence[pyarrow.DataType]`**: when we know the dtype of every non-empty column.
    - **`dict[str, pyarrow.DataType]`**: to override the dtype of some columns.
    - **`pyarrow.DataType`**: when all columns must have this dtype.
    - **`None`**: to keep the original dtypes of the columns.
  - **skip_cols**: Skip columns by their `0`-based indices (A == 0). Applied at the `reading` step.
  - **skip_rows**: Skip rows at the top of the file. Applied at the `reading` step.
  - **skip_rows_after_header**: Skip rows after the `header`. Applied at the `prepare` step.
  - **take_rows**: Stop reading after this row (`0`-based). Applied at the `reading` step.
  - **take_rows_non_empty**: Keep empty rows in the resulting table. Applied at the `reading` step.
  - **lookup_head**: Regular expression for a smart search of the first `header` row, or the `column` index whose first non-empty cell is the top-level cell of the `header`.
  - **lookup_size**: Number of rows to scan when searching for the `header`. Note: RXLS raises an exception if the `lookup_head` search fails within `lookup_size` rows.
  - **row_filters**: Regular expression(s) for the columns whose content determines which rows are empty and which are not.
  - **row_filters_strategy**: Boolean operator(s) for combining two or more columns given in `row_filters`.
  - **float_precision**: All numeric values in MS Excel are `floating-point` under the hood, so when rounding a whole `float64` column to this precision gives the same result as simply truncating the decimals, that column is converted to `int64`.
  - **datetime_formats**: One or more formats that may appear in columns with `conflicts`.
  - **conflict_resolve**: When a column contains two or more datatypes, this is a `conflict`. When a conflict cannot be resolved, the whole column is converted to `utf-8`. Conflicts may be resolved as:
    - **`no`**: All parts of columns with `conflicts` are converted to `utf8`.
    - **`temporal`**: Try to convert the non-temporal parts of a column that has some temporal parts to temporal (`float64` -> `timestamp`, and `utf8` -> `timestamp` using the default `ISO 8601` formats or those specified in `datetime_formats`).
    - **`numeric`**: Try to convert the non-numeric parts of a column that has some numeric parts to numeric (`utf8` -> `float64`).
    - **`all`**: Use both strategies to resolve conflicts. When some parts of a column are temporal, try to convert all other parts to temporal (this also enables two-step string conversion: `utf8` -> `float64` -> `timestamp`).
  - **utf8_type_infers**: `(WIP)` When a resulting column is `utf-8` and all of its non-null cells match the regular expression for `numeric` values, convert it to `float64` (and possibly to `int64` afterwards).
  - **null_values**: Additional values that should be skipped at the `reading` step (or a `callable` predicate for them).
  - **row_callback**: Any callable that can be invoked without arguments on every row event. Useful for progress tracking.
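
A minimal sketch combining several of the keyword-only parameters above; the file name, sheet layout, regular expression, and column name are made-up examples, and every parameter that is not shown keeps its default value:

```python
import pyarrow as pa

from rxls.reader import read

# Hypothetical report: the header row sits somewhere in the first ten rows and
# its first cell matches "Date"; column A carries no useful data.
table = read(
    "report.xlsb",
    0,                                # first sheet
    header=True,                      # one-row header
    skip_cols=[0],                    # drop column A at the reading step
    lookup_head=r"^Date",             # smart-search for the first header row
    lookup_size=10,                   # raise if no header is found in 10 rows
    dtypes={"Amount": pa.float64()},  # override the dtype of a single column
)
print(table.schema)
```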

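Another sketch, this time for a sheet where a column mixes datatypes; the file name and format string are illustrative, and `tqdm` (an optional dependency, see below) is used only because its `update` method is a callable that works without arguments:

```python
from tqdm import tqdm

from rxls.reader import read

progress = tqdm(desc="rows", unit="row")

table = read(
    "mixed_types.xlsx",
    header=True,
    conflict_resolve="all",        # try both temporal and numeric resolution
    datetime_formats="%d.%m.%Y",   # extra format for utf8 -> timestamp parts
    float_precision=2,             # float64 -> int64 when rounding equals truncation
    row_callback=progress.update,  # invoked on every row event
)
progress.close()
```
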
## Dependencies

### Required:

- **pyarrow**>=`14.0.2`
- **numpy**>=`1.24.4`
- **recordclass**>=`0.21.1`
- **typing_extensions**>=`4.9.0`

### Optional:

- **numba**>=`0.58.1` (increases import time, but reading speed increases 3-4x and more)
- **tbb**>=`2021.11.0` (only together with numba; additional performance gain)
- **polars**>=`0.20.4` (needed to parse timestamps with milliseconds/microseconds/nanoseconds, or AM/PM with a timezone)
- **pandas**>=`2.0.3` (for pyarrow `to_pandas` functionality)
- **tqdm**>=`4.66.1` (fast progress tracking)
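
Since both dataframe libraries are optional, a small sketch of converting the returned `pyarrow.Table` with whichever of them is installed:

```python
from rxls.reader import read

table = read("example.xlsx", header=True)

try:
    import polars as pl

    df = pl.from_arrow(table)
except ImportError:
    df = table.to_pandas()  # requires the optional pandas dependency
```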
