Add configuration to include specific special characters while indexing #779

Merged
merged 5 commits into from
Jan 27, 2025
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -9,6 +9,7 @@

## Unreleased

* Added the "Include Characters" option
* Added the Pagefind Playground
* Reduced filesizes for the Pagefind WebAssembly

23 changes: 20 additions & 3 deletions docs/content/docs/config-options.md
@@ -71,6 +71,23 @@ Note that currently Pagefind only supports lists of options via configuration fi
|---------------------------|------------------------------|---------------------|
| `--exclude-selectors <S>` | `PAGEFIND_EXCLUDE_SELECTORS` | `exclude_selectors` |

### Include characters
Prevents Pagefind from stripping the provided characters when indexing content.
Allows users to search for words including these characters.

See [Indexing special characters](/docs/indexing/#indexing-special-characters) for more documentation.

Care is needed if setting this argument via the CLI, as special characters may be interpreted by your shell.
Configure this via a [configuration file](/docs/config-sources/#config-files) if you encounter issues.

```yml
include_characters: "<>$"
```

| CLI Flag | ENV Variable | Config Key |
|----------------------------|-------------------------------|---------------------|
| `--include-characters <S>` | `PAGEFIND_INCLUDE_CHARACTERS` | `include_characters` |
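
When Pagefind is launched from a build script rather than typed into a terminal, the shell caveat above can also be sidestepped by passing the arguments as an array, so no shell ever parses them. The following is a minimal Node.js sketch, assuming a Unix-like system, Pagefind run through `npx`, and a site directory named `public`. The `pagefindArgs` and `runPagefind` helpers are hypothetical names used only for illustration.

```javascript
const { execFileSync } = require("node:child_process");

// Hypothetical helper: builds the argument list for the Pagefind CLI.
// Each array entry reaches the process verbatim, so "<", ">" and "$"
// need no quoting or escaping.
function pagefindArgs(siteDir, includeCharacters) {
  return ["pagefind", "--site", siteDir, "--include-characters", includeCharacters];
}

// Hypothetical helper: execFileSync does not spawn a shell by default,
// so the included characters are never interpreted as redirects or variables.
function runPagefind(siteDir, includeCharacters) {
  return execFileSync("npx", pagefindArgs(siteDir, includeCharacters), {
    stdio: "inherit",
  });
}

// Example call (left commented out so loading this file has no side effects):
// runPagefind("public", "<>$");
```

For anything typed directly into a terminal, the configuration file shown above remains the simpler route.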

### Glob
Configures the glob used by Pagefind to discover HTML files. Defaults to `**/*.{html}`.
See [Wax patterns documentation](https://github.com/olson-sean-k/wax#patterns) for more details.
@@ -79,7 +96,7 @@ See [Wax patterns documentation](https://github.com/olson-sean-k/wax#patterns) f
|-----------------|-----------------|------------|
| `--glob <GLOB>` | `PAGEFIND_GLOB` | `glob` |

### Force Language
### Force language
Ignores any detected languages and creates a single index for the entire site as the provided language. Expects an ISO 639-1 code, such as `en` or `pt`.

See [Multilingual search](/docs/multilingual/) for more details.
@@ -88,14 +105,14 @@ See [Multilingual search](/docs/multilingual/) for more details.
|---------------------------|---------------------------|------------------|
| `--force-language <LANG>` | `PAGEFIND_FORCE_LANGUAGE` | `force_language` |

### Keep Index URL
### Keep index URL
Keeps `index.html` at the end of search result paths. By default, a file at `animals/cat/index.html` will be given the URL `/animals/cat/`. Setting this option to `true` will result in the URL `/animals/cat/index.html`.

| CLI Flag | ENV Variable | Config Key |
|--------------------|------------------|------------------|
| `--keep-index-url` | `KEEP_INDEX_URL` | `keep_index_url` |

### Write Playground
### Write playground
Writes the Pagefind playground files to `/playground` within your bundle directory. For most sites, this will make the Pagefind playground available at `/pagefind/playground/`.

This defaults to false, so playground files are not written to your live site. Playground files are always available when running Pagefind with `--serve`.
22 changes: 21 additions & 1 deletion docs/content/docs/indexing.md
@@ -92,5 +92,25 @@ Attributes of HTML elements can be added to the search index with the `data-page
```
{{< /diffcode >}}

This attribute takes a comma-separated list of other attributes to include inline with the indexed content.
This attribute takes a comma-separated list of other attributes to include inline with the indexed content.
The above example will be indexed as: `Condimentum Nullam. Image Title. Image Alt. Nullam id dolor id nibh ultricies.`

## Indexing special characters

By default, Pagefind strips most punctuation out of the page when indexing content. Punctuation is also removed from the search term when searching.

For some sites, such as documentation for programming languages, searching for punctuation can be important. In these cases,
the default behavior can be changed using the [Include Characters](/docs/config-options/#include-characters) option when running Pagefind.

For example, given the following HTML:

```html
<p>The &lt;head&gt; tag</p>
```

Pagefind's default indexing would index `the`, `head`, and `tag`,
and a user typing in a search term of `<head>` will have their search adapted to `head`.
While this will still match the correct page, it won't distinguish between this result and a result talking about the head of a git repository.

With the [Include Characters](/docs/config-options/#include-characters) option set to `<>`, Pagefind will instead index `the`, `<head>`, `head`, and `tag`.
A search for `head` will still locate this page, while a search for `<head>` won't be rewritten and will specifically match this page.
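
The indexing behavior described above can be sketched in a few lines of JavaScript. This is an illustrative model only, not Pagefind's actual implementation: the `indexTerms` function is a hypothetical name, and the rule that a word is stored both with and without its included characters is an assumption drawn from the `<head>` example.

```javascript
// Illustrative model only; Pagefind's real indexer may behave differently.
// Hypothetical helper: returns the terms that would be indexed for `text`.
function indexTerms(text, includeCharacters = "") {
  const keep = new Set(includeCharacters);
  const isWordChar = (c) => /[\p{L}\p{N}]/u.test(c);
  const terms = [];

  for (const word of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    // Variant that retains letters, digits, and the explicitly included characters.
    const withIncluded = [...word].filter((c) => isWordChar(c) || keep.has(c)).join("");
    // Variant with all punctuation stripped (the default behavior).
    const stripped = [...word].filter(isWordChar).join("");

    // Store the punctuated form only when it differs from the stripped form.
    if (withIncluded && withIncluded !== stripped) terms.push(withIncluded);
    if (stripped) terms.push(stripped);
  }
  return terms;
}

console.log(indexTerms("The <head> tag"));       // [ 'the', 'head', 'tag' ]
console.log(indexTerms("The <head> tag", "<>")); // [ 'the', '<head>', 'head', 'tag' ]
```

Storing both forms is what lets a plain `head` query keep matching the page while `<head>` becomes a more specific term.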
@@ -0,0 +1,39 @@
name: Character Tests > Pagefind matches custom characters
steps:
  - ref: ./background.toolproof.yml
  - step: I have a "public/page_a/index.html" file with the content {html}
    html: >-
      <!DOCTYPE html><html lang="en"><head></head><body><h1>Talking about @money</h1></body></html>
  - step: I have a "public/page_b/index.html" file with the content {html}
    html: >-
      <!DOCTYPE html><html lang="en"><head></head><body><h1>Configure a^b^c^d</h1></body></html>
  - macro: I run Pagefind with '--include-characters "@^"'
  - step: stdout should contain "Running Pagefind"
  - step: The file "public/pagefind/pagefind.js" should not be empty
  - step: I serve the directory "public"
  - step: In my browser, I load "/"
  - step: In my browser, I evaluate {js}
    js: |-
      let pagefind = await import("/pagefind/pagefind.js");
      let search = await pagefind.search("@");
      let pages = await Promise.all(search.results.map(r => r.data()));

      toolproof.assert_eq(pages.length, 1);
      toolproof.assert_eq(pages[0].url, "/page_a/");
  - step: In my browser, I evaluate {js}
    js: |-
      let pagefind = await import("/pagefind/pagefind.js");
      let search = await pagefind.search("money");
      let pages = await Promise.all(search.results.map(r => r.data()));

      toolproof.assert_eq(pages.length, 1);
      toolproof.assert_eq(pages[0].url, "/page_a/");
  - step: In my browser, I evaluate {js}
    js: |-
      let pagefind = await import("/pagefind/pagefind.js");
      let search = await pagefind.search("a^b^c^d");
      let pages = await Promise.all(search.results.map(r => r.data()));

      toolproof.assert_eq(pages.length, 1);
      toolproof.assert_eq(pages[0].url, "/page_b/");
  - step: In my browser, the console should be empty