On a recent project, had to deal with searching of tens of thousands of product descriptions, with a need to find substring matches quickly. The select: statement in Smalltalk works like a SQL table scan – okay for small collections, but becomes seconds+ response time with larger lists.
An effective solution to this is an nGram Dictionary. Strings of words can be broken up into sets of tri-grams, quad-grams, quint-grams, and so on.
My approach to this is a Dictionary indexed by nGram length, each element containing dictionaries of nGram strings of collections of the string objects to be searched. Thus, indexing results as such:
3 -> ana -> ('banana') ban -> ('banana', 'band') 4 -> bana -> ('banana') band -> ('band') anan -> ('banana') 5 -> banan -> ('banana') ...
To instantiate, just NGramDictionary new on: someCollectionofStrings from: 4 to: 10, for example. To search, just foo search: ‘someString’. If someString can be larger than the NGramDictionary size; search: will handle longer string searches by doing a select: scan.
How large to make the nGram Dictionary is a matter of how large and non-unique the string collection may be. I recommend starting at the low end at 3 or 4 – searches of 1 or 2 character strings will return way too many results with most string collections to be useful.
There’s a few other special cases I kept this in this example code, to meet my particular needs; for example, this is not a strict nGram splicing – spaces, numbers and punctuation marks are excluded and become “word” boundaries. But, the results are super-fast, enabling, for example, type-ahead select lists on a web-based form.
Object subclass: #NGramDictionary
instanceVariableNames: 'mySmallest myLargest myCollectionofStrings myNGrams'
classVariableNames: ''
poolDictionaries: ''
category: ''
at: aString
"returns strictly those strings that have an indexed nGram. If the string is larger than
te largest nGram specified in this dictionary, return an empty collection"
|nGram|
nGram := self nGrams at: aString size ifAbsent: [nil].
^nGram notNil
ifTrue: [nGram at: aString ifAbsent: [OrderedCollection new]]
ifFalse: [Bag new]
nGramed: aCollectionofStrings n: length
"private. Return a dictionary of all nGrams or order length for the given collection of strings, excluding nGrams with punctuation or numbers"
|ngram e words |
ngram := Dictionary new.
aCollectionofStrings
do: [:each |
e := each asLowercase.
1 to: e size - length + 1 do: [:i |
words := (e copyFrom: i to: i + length -1) subStrings: '., :;- ()[]!!''&%/#+0123456789'.
(words size > 0 and: [words first size = length]) "split it, save if still the same length"
ifTrue:
[(ngram at: words first ifAbsentPut: [OrderedCollection new]) add: each]
]
].
^ngram
nGrams
"Private. lazily initialized"
^ myNGrams ifNil:
[myNGrams := Dictionary new.
mySmallest to: myLargest do: [ :i |
myNGrams at: i put: (self nGramed: myCollectionofStrings n: i) ].
myNGrams yourself ]
on: aCollectionofStrings from: start to: finish
"initialization method"
myCollectionofStrings := aCollectionofStrings.
mySmallest := start.
myLargest := finish
search: aString
"returns a collection of strings that contain aString. If there's not an nGram available, do a select: , which is slower of course'"
|nGram searchstring|
searchstring := aString asLowercase.
searchstring size <= myLargest
ifTrue:
[nGram := self nGrams at: searchstring size.
^nGram at: searchstring ifAbsent: [OrderedCollection new]]
ifFalse:
[nGram := self nGrams at: myLargest.
^(nGram at: (searchstring copyFrom: 1 to: myLargest) ifAbsent: [OrderedCollection new])
select: [: each | each asLowercase includesSubString: searchstring]
]