Refactor SuggestedResource to leverage new Fingerprint model
Why these changes are being introduced:

We [recently decided](#145) to make a separate Fingerprint
model, associated with Term, as multiple detectors are likely
to use fingerprinting (implemented in #138). We have also begun
to split the ActiveRecord components of detectors into separate
models (implemented for Detector::Journal in #162).

Relevant ticket(s):

* [TCO-111](https://mitlibraries.atlassian.net/browse/TCO-111)
* [TCO-122](https://mitlibraries.atlassian.net/browse/TCO-122)

How this addresses that need:

* Splits the ActiveRecord components of Detector::SuggestedResource
into a separate SuggestedResource model.
* Associates SuggestedResource with Fingerprint, via Term, such
that a suggested resource can have multiple terms and fingerprints.
* Removes the suggested resource dashboard (see side effects).

Side effects of this change:

* Terms that are associated with a suggested resource should not
be destroyed. Rails does not support the restrictive `:dependent`
options (such as `:restrict_with_error`) on `belongs_to` associations,
so this commit instead adds a `before_destroy` hook with a custom
method that aborts the destroy and logs the attempt in Sentry.
* Because administrate does not handle has_many relationships well,
we will need to build a custom dashboard to manage suggested resources.
This is ticketed as [TCO-145](https://mitlibraries.atlassian.net/browse/TCO-145).
Until that UI is ready, we will use the Rails console to make any
requested changes to suggested resources.
jazairi committed Feb 10, 2025
1 parent c94bf23 commit 936fba0
Showing 20 changed files with 214 additions and 325 deletions.
75 changes: 0 additions & 75 deletions app/dashboards/detector/suggested_resource_dashboard.rb

This file was deleted.

89 changes: 3 additions & 86 deletions app/models/detector/suggested_resource.rb
@@ -1,93 +1,10 @@
# frozen_string_literal: true

# == Schema Information
#
# Table name: detector_suggested_resources
#
# id :integer not null, primary key
# title :string
# url :string
# phrase :string
# fingerprint :string
# created_at :datetime not null
# updated_at :datetime not null
#

require 'stringex/core_ext'

class Detector
# Detector::SuggestedResource stores custom hints that we want to send to the
# user in response to specific strings. For example, a search for "web of
# science" should be met with our custom login link to Web of Science via MIT.
class SuggestedResource < ApplicationRecord
before_save :update_fingerprint

def self.table_name_prefix
'detector_'
end

# This exists for the before_save lifecycle hook to call the calculate_fingerprint method, to ensure that these
# records always have a correctly-calculated fingerprint. It has no arguments and returns nothing.
def update_fingerprint
self.fingerprint = Detector::SuggestedResource.calculate_fingerprint(phrase)
end

# This implements the OpenRefine fingerprinting algorithm. See
# https://openrefine.org/docs/technical-reference/clustering-in-depth#fingerprint
#
# @param old_phrase [String] A text string which needs to have its fingerprint calculated. This could either be the
# "phrase" field on the SuggestedResource record, or an incoming search term received from a contributing system.
#
# @return [String] A string of all words in the input, downcased, normalized, and alphabetized.
def self.calculate_fingerprint(old_phrase)
modified_phrase = old_phrase
modified_phrase = modified_phrase.strip
modified_phrase = modified_phrase.downcase

# This removes all punctuation and symbol characters from the string.
modified_phrase = modified_phrase.gsub(/\p{P}|\p{S}/, '')

# Normalize to ASCII (e.g. gödel and godel are liable to be intended to
# find the same thing)
modified_phrase = modified_phrase.to_ascii

# Coercion to ASCII can introduce new symbols, so we remove those now.
modified_phrase = modified_phrase.gsub(/\p{P}|\p{S}/, '')

# Tokenize
tokens = modified_phrase.split

# Remove duplicates and sort
tokens = tokens.uniq
tokens = tokens.sort

# Rejoin tokens
tokens.join(' ')
end

# This replaces all current Detector::SuggestedResource records with a new set from an imported CSV.
#
# @note This method is called by the suggested_resource:reload rake task.
#
# @param input [CSV::Table] An imported CSV file containing all Suggested Resource records. The CSV file must have
# at least three headers, named "Title", "URL", and "Phrase". Please note: these values
# are case sensitive.
def self.bulk_replace(input)
raise ArgumentError.new, 'Tabular CSV is required' unless input.instance_of?(CSV::Table)

# Need to check what columns exist in input
required_headers = %w[Title URL Phrase]
missing_headers = required_headers - input.headers
raise ArgumentError.new, "Some CSV columns missing: #{missing_headers}" unless missing_headers.empty?

Detector::SuggestedResource.delete_all

input.each do |line|
record = Detector::SuggestedResource.new({ title: line['Title'], url: line['URL'], phrase: line['Phrase'] })
record.save
end
end

# Detector::SuggestedResource handles detections for SuggestedResource records.
class SuggestedResource
# Identify any SuggestedResource record whose pre-calculated fingerprint matches the fingerprint of the incoming
# phrase.
#
@@ -98,7 +15,7 @@ def self.bulk_replace(input)
#
# @return [Detector::SuggestedResource] The record whose fingerprint matches that of the search term.
def self.full_term_match(phrase)
- SuggestedResource.where(fingerprint: calculate_fingerprint(phrase))
+ ::SuggestedResource.joins(:fingerprints).where('fingerprints.value = ?', Fingerprint.calculate(phrase))
end

# Look up any matching Detector::SuggestedResource records, building on the full_term_match method. If a match is
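The lookup above matches on a fingerprint value rather than on the raw phrase. `Fingerprint.calculate` is not shown in this diff, so the following is only a minimal plain-Ruby sketch, assuming it follows the same steps as the removed `calculate_fingerprint`. `FingerprintSketch`, `RESOURCES_BY_FINGERPRINT`, and the top-level `full_term_match` are hypothetical names used for illustration, and Unicode NFKD normalization stands in for the stringex `to_ascii` call, which it only approximates.

```ruby
# frozen_string_literal: true

# Hypothetical stand-in for Fingerprint.calculate, which this diff does not show.
# Assumption: it follows the same steps as the removed calculate_fingerprint.
module FingerprintSketch
  PUNCTUATION_AND_SYMBOLS = /\p{P}|\p{S}/

  def self.calculate(phrase)
    cleaned = phrase.strip.downcase.gsub(PUNCTUATION_AND_SYMBOLS, '')

    # Swapped-in technique: NFKD decomposition plus removal of combining marks
    # approximates the stringex `to_ascii` call used in the removed code.
    cleaned = cleaned.unicode_normalize(:nfkd).gsub(/\p{M}/, '')

    # Normalization can introduce new symbols, so strip them a second time.
    cleaned = cleaned.gsub(PUNCTUATION_AND_SYMBOLS, '')

    # Tokenize, drop duplicates, sort, and rejoin.
    cleaned.split.uniq.sort.join(' ')
  end
end

# Hypothetical in-memory index (fingerprint value => resource title). It stands
# in for the join between suggested resources and fingerprints.
RESOURCES_BY_FINGERPRINT = {
  'of science web' => 'Web of Science',
  'jstor' => 'JSTOR'
}.freeze

# Hypothetical analogue of full_term_match: look up by fingerprint, not by phrase.
def full_term_match(phrase)
  RESOURCES_BY_FINGERPRINT[FingerprintSketch.calculate(phrase)]
end

puts FingerprintSketch.calculate('Web of Science') # => of science web
puts full_term_match('science of WEB')             # => Web of Science
```

Because word order, case, and punctuation are discarded before comparison, several distinct phrases can resolve to the same suggested resource.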
33 changes: 33 additions & 0 deletions app/models/suggested_resource.rb
@@ -0,0 +1,33 @@
# frozen_string_literal: true

# SuggestedResource stores custom hints that we want to send to the
# user in response to specific strings. For example, a search for "web of
# science" should be met with our custom login link to Web of Science via MIT.
class SuggestedResource < ApplicationRecord
has_many :terms
has_many :fingerprints, through: :terms, dependent: :nullify

# This replaces all current SuggestedResource records with a new set from an imported CSV.
#
# @note This method is called by the suggested_resource:reload rake task.
#
# @param input [CSV::Table] An imported CSV file containing all Suggested Resource records. The CSV file must have
# at least three headers, named "Title", "URL", and "Phrase". Please note: these values
# are case sensitive.
def self.bulk_replace(input)
raise ArgumentError.new, 'Tabular CSV is required' unless input.instance_of?(CSV::Table)

# Need to check what columns exist in input
required_headers = %w[Title URL Phrase]
missing_headers = required_headers - input.headers
raise ArgumentError.new, "Some CSV columns missing: #{missing_headers}" unless missing_headers.empty?

SuggestedResource.destroy_all

input.each do |line|
record = SuggestedResource.new({ title: line['Title'], url: line['URL'] })
record.save
record.terms.find_or_create_by(phrase: line['Phrase'])
end
end
end
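The header check in `bulk_replace` can be tried outside Rails. This is a minimal sketch using only the standard `csv` library. `missing_headers`, `GOOD_TABLE`, and `BAD_TABLE` are hypothetical names, and the sample rows and example.org URLs are invented for illustration.

```ruby
# frozen_string_literal: true

require 'csv'

REQUIRED_HEADERS = %w[Title URL Phrase].freeze

# Hypothetical helper: returns the required headers absent from the table.
# Header names are compared case-sensitively, as in bulk_replace.
def missing_headers(table)
  raise ArgumentError, 'Tabular CSV is required' unless table.instance_of?(CSV::Table)

  REQUIRED_HEADERS - table.headers
end

# Invented sample data; CSV.parse with headers: true returns a CSV::Table.
GOOD_TABLE = CSV.parse("Title,URL,Phrase\nJSTOR,https://example.org/jstor,jstor\n", headers: true)
BAD_TABLE = CSV.parse("title,URL\nJSTOR,https://example.org/jstor\n", headers: true)

puts missing_headers(GOOD_TABLE).inspect # => []
puts missing_headers(BAD_TABLE).inspect  # => ["Title", "Phrase"]
```

The second table fails on `Title` as well as `Phrase` because the lowercase `title` header does not satisfy the case-sensitive comparison.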
12 changes: 12 additions & 0 deletions app/models/term.rb
@@ -20,8 +20,10 @@ class Term < ApplicationRecord
has_many :categorizations, dependent: :destroy
has_many :confirmations, dependent: :destroy
belongs_to :fingerprint, optional: true
+ belongs_to :suggested_resource, optional: true

before_save :register_fingerprint
+ before_destroy :check_suggested_resource
after_destroy :check_fingerprint_count

scope :categorized, -> { where.associated(:categorizations).distinct }
@@ -104,6 +106,16 @@ def check_fingerprint_count
fingerprint.destroy if fingerprint&.terms&.count&.zero?
end

+ # This is called before_destroy to avoid orphaning SuggestedResource records. Deleting terms should be an unlikely
+ # event, so this should come up rarely. If it does, it warrants the extra care to delete the record manually in the
+ # Rails console.
+ def check_suggested_resource
+ if suggested_resource
+ Sentry.capture_message('Cannot delete term with associated suggested resource')
+ throw :abort
+ end
+ end
+
# This method looks up all current detections for the given term, and assembles their confidence scores in a format
# usable by the calculate_categorizations method. It exists to transform data like:
# [{3=>0.91}, {1=>0.1}] and [{3=>0.95}]
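The `throw :abort` in `check_suggested_resource` halts the destroy because the callback chain runs inside a matching `catch`. The following plain-Ruby sketch imitates that control flow without ActiveRecord. `TermSketch` and `ABORT_LOG` are hypothetical, and the log array stands in for `Sentry.capture_message`.

```ruby
# frozen_string_literal: true

# Hypothetical plain-Ruby stand-in for Term; not an ActiveRecord model.
class TermSketch
  # Stand-in for Sentry: collects the messages that would be reported.
  ABORT_LOG = []

  attr_reader :phrase, :suggested_resource, :destroyed

  def initialize(phrase, suggested_resource: nil)
    @phrase = phrase
    @suggested_resource = suggested_resource
    @destroyed = false
  end

  # Imitates the callback chain: the guard runs inside catch(:abort), and the
  # record is only marked as destroyed if the guard does not throw.
  def destroy
    completed = catch(:abort) do
      check_suggested_resource
      true
    end
    @destroyed = true if completed
    @destroyed
  end

  private

  # Same shape as the guard added to Term in this commit.
  def check_suggested_resource
    return unless suggested_resource

    ABORT_LOG << 'Cannot delete term with associated suggested resource'
    throw :abort
  end
end

free = TermSketch.new('loose phrase')
linked = TermSketch.new('web of science', suggested_resource: 'Web of Science')

puts free.destroy   # => true
puts linked.destroy # => false
```

A term with no suggested resource is destroyed as usual, while a linked term is left in place and the attempt is recorded.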
3 changes: 0 additions & 3 deletions app/views/layouts/_site_nav.html.erb
@@ -19,9 +19,6 @@
<% if can? :view, :playground %>
<%= link_to('Playground', '/playground', class: 'nav-item') %>
<% end %>
- <% if can? :manage, :detector__suggested_resource %>
- <%= link_to('Suggested Resources', admin_detector_suggested_resources_path, class: 'nav-item') %>
- <% end %>
<% if can? :view, Categorization %>
<%= link_to('Categorizations', admin_categorizations_path, class: 'nav-item') %>
<% end %>
5 changes: 0 additions & 5 deletions config/routes.rb
@@ -5,11 +5,6 @@
end

namespace :admin do
- # Lookup-style detector records
- namespace :detector do
- resources :suggested_resources
- end
-
# Knowledge graph models
resources :detectors
resources :detector_categories
10 changes: 10 additions & 0 deletions db/migrate/20250107204800_create_suggested_resources.rb
@@ -0,0 +1,10 @@
class CreateSuggestedResources < ActiveRecord::Migration[7.1]
def change
create_table :suggested_resources do |t|
t.string :title
t.string :url

t.timestamps
end
end
end
18 changes: 18 additions & 0 deletions db/migrate/20250107204813_drop_detector_suggested_resources.rb
@@ -0,0 +1,18 @@
class DropDetectorSuggestedResources < ActiveRecord::Migration[7.1]
def up
drop_table :detector_suggested_resources
end

def down
create_table :detector_suggested_resources do |t|
t.string :title
t.string :url
t.string :phrase
t.string :fingerprint

t.timestamps
end
add_index :detector_suggested_resources, :phrase, unique: true
add_index :detector_suggested_resources, :fingerprint, unique: true
end
end
11 changes: 11 additions & 0 deletions db/migrate/20250107213433_add_suggested_resource_to_terms.rb
@@ -0,0 +1,11 @@
class AddSuggestedResourceToTerms < ActiveRecord::Migration[7.1]
def up
add_reference :terms, :suggested_resource
add_foreign_key :terms, :suggested_resources, on_delete: :nullify
end

def down
remove_foreign_key :terms, :suggested_resources, on_delete: :nullify
remove_reference :terms, :suggested_resource
end
end
21 changes: 10 additions & 11 deletions db/schema.rb

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 3 additions & 3 deletions lib/tasks/suggested_resources.rake
@@ -19,10 +19,10 @@ namespace :suggested_resources do
file = url.open.read.gsub("\xEF\xBB\xBF", '').force_encoding('UTF-8').encode
data = CSV.parse(file, headers: true)

- Rails.logger.info("Record count before we reload: #{Detector::SuggestedResource.count}")
+ Rails.logger.info("Record count before we reload: #{SuggestedResource.count}")

- Detector::SuggestedResource.bulk_replace(data)
+ SuggestedResource.bulk_replace(data)

- Rails.logger.info("Record count after we reload: #{Detector::SuggestedResource.count}")
+ Rails.logger.info("Record count after we reload: #{SuggestedResource.count}")
end
end
16 changes: 13 additions & 3 deletions test/fixtures/fingerprints.yml
@@ -31,9 +31,6 @@ isbn_9781319145446:
journal_nature_medicine:
value: 'medicine nature'

- suggested_resource_jstor:
- value: 'jstor'
-
multiple_detections:
value: '103389fpubh202000014 32154200 a air and doi environmental frontiers health impacts in of pmid pollution public review'

@@ -45,3 +42,16 @@ barcode:

not_a_barcode:
value: '39080678901234 extra some text with'
+
+ jstor:
+ value: 'jstor'
+
+ web_of_science:
+ value: 'of science web'
+
+ web_of_knowledge:
+ value: 'knowledge of web'
+
+ nobel_laureate:
+ value: 'bawendi moungi'
