Bug when highlighted perfectly adjacent spans #10

Open
idc9 opened this issue Nov 11, 2024 · 1 comment
idc9 commented Nov 11, 2024

When I highlight perfectly adjacent spans, the displayed text gets corrupted. Here I highlighted "Yesterday" and " , at 3pm Emily", and an extra "Yesterday" was inserted between them.

[Screenshot, 2024-11-11 at 9:44:01 AM]
@ballesterosbr

Hi @idc9,

I’m experiencing the same issue as you.

Additionally, I’ve noticed another related problem: when the text spans multiple paragraphs, if you try to label the last word of a paragraph (e.g., finalword\nNew Paragraph), the labeling process will automatically include the first word of the next paragraph, resulting in something like: finalWord Paragraph.

I’ve found a workaround for this issue. It might not be the most elegant solution, since I don’t fully understand the application’s code, but it has been working for me so far. Here are the behaviors I’ve tested:

  • Labeling a word or words with a trailing space.
  • Labeling a word with a leading space.
  • Labeling a word or words with or without symbols at the end (as defined by the regex in the code).
  • Labeling a word at the beginning and at the end of a paragraph, both with and without spaces.
  • Labeling individual symbols.

To test this, I split your text into three paragraphs:

Yesterday, at 3 PM, Emily Johnson and Michael Smith met at the Central Park in New York to discuss the merger between TechCorp and Global Solutions.

The deal, worth approximately 500 million dollars, is expected to significantly impact the tech industry.

Later, at 6 PM, they joined a conference call with the CEO of TechCorp, David Brown, who was in London for a technology summit. During the call, they discussed the market trends in Asia and Europe and planned for the next quarterly meeting, which is scheduled for January 15th, 2024, in Paris.

Here is what I tested:

[image: the test text with the labels applied]

The code I modified is located at:

src/streamlit_annotation_tools/frontend/src/helpers/labelerHelpers.ts

Specifically, I made changes to the adjustSelectionBounds function by adding a set of characters to validate. Please note that the highlightHelpers.ts file has not been modified.

export const adjustSelectionBounds = (
  textContent: string,
  startIndex: number,
  endIndex: number
): { start: number; end: number } => {
  let startAdjustment = 0
  let endAdjustment = 0

  // Punctuation is treated as a word boundary, like spaces and newlines.
  const reStartIndex = /^[!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]/g
  // Extend the start leftward to the beginning of the word, unless the
  // selection already starts on a boundary character. Stop at a boundary
  // character or at the beginning of the text.
  while (
    !(
      textContent.charAt(startIndex) === " " ||
      textContent.charAt(startIndex) === "\n" ||
      textContent.charAt(startIndex).match(reStartIndex) ||
      textContent.charAt(startIndex + startAdjustment - 1) === " " ||
      textContent.charAt(startIndex + startAdjustment - 1) === "\n" ||
      textContent.charAt(startIndex + startAdjustment - 1).match(reStartIndex)
    ) &&
    textContent.charAt(startIndex + startAdjustment - 1) !== ""
  ) {
    startAdjustment -= 1
  }

  const reEndIndex = /[!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]$/g
  // Extend the end rightward to the end of the word, unless the selection
  // already ends on a boundary character. Stop at a boundary character or
  // at the end of the text.
  while (
    !(
      textContent.charAt(endIndex - 1) === " " ||
      textContent.charAt(endIndex - 1) === "\n" ||
      textContent.charAt(endIndex - 1).match(reEndIndex) ||
      textContent.charAt(endIndex + endAdjustment) === " " ||
      textContent.charAt(endIndex + endAdjustment) === "\n" ||
      textContent.charAt(endIndex + endAdjustment).match(reEndIndex)
    ) &&
    textContent.charAt(endIndex + endAdjustment) !== ""
  ) {
    endAdjustment += 1
  }

  return {
    start: startIndex + startAdjustment,
    end: endIndex + endAdjustment,
  }
}
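
For anyone who wants to try the boundary rule outside the app, here is a minimal standalone sketch of the same idea. It is a condensed approximation, not the library's code: the names snapToWordBounds, isDelimiter and DELIMITER are hypothetical, and it assumes that a space, a newline, or one of the punctuation characters from the regex above counts as a boundary.

```typescript
// Hypothetical condensed variant of the boundary check; not the library's code.
// Assumption: space, newline and ASCII punctuation all count as boundaries.
const DELIMITER = /[ \n!"#$%&'()*+,\-.\/:;<=>?@[\]^_`{|}~]/

const isDelimiter = (ch: string): boolean => DELIMITER.test(ch)

function snapToWordBounds(
  text: string,
  start: number,
  end: number
): { start: number; end: number } {
  let s = start
  // Grow leftward only if the selection does not already begin on a delimiter.
  if (!isDelimiter(text.charAt(start))) {
    while (s > 0 && !isDelimiter(text.charAt(s - 1))) s -= 1
  }

  let e = end
  // Grow rightward only if the selection does not already end on a delimiter.
  if (!isDelimiter(text.charAt(end - 1))) {
    while (e < text.length && !isDelimiter(text.charAt(e))) e += 1
  }

  return { start: s, end: e }
}

const sample = "Yesterday, at 3 PM, Emily Johnson"
// A partial selection ("ester") grows to the whole word "Yesterday".
console.log(snapToWordBounds(sample, 1, 6)) // { start: 0, end: 9 }
// ", at 3" starts on "," and is followed by a space, so it is left unchanged.
console.log(snapToWordBounds(sample, 9, 15)) // { start: 9, end: 15 }
// The newline stops the expansion, so the next paragraph is not pulled in.
console.log(snapToWordBounds("finalword\nNew Paragraph", 0, 9)) // { start: 0, end: 9 }
```

The second call is the adjacent-span case from the original report: because the selection starts exactly on the comma, the start is not moved back into "Yesterday".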

To work with this modification, you’ll need to install the library by following these steps:

  1. Clone the GitHub repository.
  2. Make the modification to the labelerHelpers.ts file.
  3. Navigate to the frontend directory:
    cd streamlit-annotation-tools/src/streamlit_annotation_tools/frontend
  4. Install the necessary dependencies:
    npm install
  5. Build the frontend:
    npm run build
  6. Install the library in editable mode from the cloned repository (you can also do this beforehand):
    cd streamlit-annotation-tools
    pip install -e .
  7. Run the application:
    streamlit run your_application.py

I’m also attaching the JSON output generated from the labels in the previous image for reference:

{
    "labels": [
        {
            "start": 0,
            "end": 9,
            "label": "Yesterday"
        },
        {
            "start": 9,
            "end": 15,
            "label": ", at 3"
        },
        {
            "start": 18,
            "end": 20,
            "label": ", "
        },
        {
            "start": 26,
            "end": 45,
            "label": "Johnson and Michael"
        },
        {
            "start": 52,
            "end": 56,
            "label": "met "
        },
        {
            "start": 58,
            "end": 62,
            "label": " the"
        },
        {
            "start": 138,
            "end": 148,
            "label": "Solutions."
        },
        {
            "start": 148,
            "end": 153,
            "label": "\n\nThe"
        },
        {
            "start": 246,
            "end": 256,
            "label": "industry. "
        },
        {
            "start": 258,
            "end": 263,
            "label": "Later"
        },
        {
            "start": 268,
            "end": 272,
            "label": "PM, "
        },
        {
            "start": 361,
            "end": 377,
            "label": "for a technology"
        },
        {
            "start": 401,
            "end": 402,
            "label": ","
        },
        {
            "start": 535,
            "end": 540,
            "label": " 2024"
        },
        {
            "start": 545,
            "end": 551,
            "label": "Paris."
        }
    ]
}
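
A quick way to sanity-check output like this is to confirm that each label's text matches the slice its offsets point at, and that no two spans overlap. The sketch below assumes the offsets are zero-based and end-exclusive, that paragraphs are joined with "\n\n", and that the "label" field holds the highlighted text, which is how the JSON above appears to behave. SpanLabel, findMismatchedLabels and findOverlaps are hypothetical names, not part of the library.

```typescript
// Hypothetical helper types and functions for checking exported labels.
interface SpanLabel {
  start: number
  end: number
  label: string
}

// Labels whose stored text differs from the slice their offsets point at.
function findMismatchedLabels(text: string, labels: SpanLabel[]): SpanLabel[] {
  return labels.filter(({ start, end, label }) => text.slice(start, end) !== label)
}

// Neighbouring labels (sorted by start) that overlap.
// Perfectly adjacent spans (end === next start) are allowed.
function findOverlaps(labels: SpanLabel[]): [SpanLabel, SpanLabel][] {
  const sorted = [...labels].sort((a, b) => a.start - b.start)
  const overlaps: [SpanLabel, SpanLabel][] = []
  let prev: SpanLabel | undefined
  for (const cur of sorted) {
    if (prev !== undefined && cur.start < prev.end) overlaps.push([prev, cur])
    prev = cur
  }
  return overlaps
}

// First paragraph of the test text and the labels that fall inside it.
const firstParagraph =
  "Yesterday, at 3 PM, Emily Johnson and Michael Smith met at the Central Park " +
  "in New York to discuss the merger between TechCorp and Global Solutions."

const firstLabels: SpanLabel[] = [
  { start: 0, end: 9, label: "Yesterday" },
  { start: 9, end: 15, label: ", at 3" },
  { start: 18, end: 20, label: ", " },
  { start: 26, end: 45, label: "Johnson and Michael" },
  { start: 52, end: 56, label: "met " },
  { start: 58, end: 62, label: " the" },
  { start: 138, end: 148, label: "Solutions." },
]

console.log(findMismatchedLabels(firstParagraph, firstLabels).length) // 0
console.log(findOverlaps(firstLabels).length) // 0
```

The first two labels are perfectly adjacent (the first ends at 9, the second starts at 9) and both checks pass for them.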

I hope this explanation and the workaround I’ve shared are clear and helpful. Please let me know if I’ve missed anything or made any mistakes, and feel free to reach out if further clarification is needed.

Best regards,
