# StringMetrics Module

This module contains a set of functions for performing fuzzy string matches. It can be useful when you have 2 columns containing text that is close but not 100% the same. However, since the functions in this module only perform fuzzy matches, there is no guarantee that there will be 100% accuracy in the matches. However, for small groups of string where each string is very different than the other (such as a small group of fairly dissimilar names), these functions can be highly accurate. Finally, some of the functions in this Module will take a long time to calculate for large numbers of cells, as the number of calculations for some functions will grow exponentially, but for small sets of data (such as 100 strings to compare), these functions perform fairly quickly.

### =DAMERAU(string1, string2)

This function takes two strings of any length and calculates the Damerau-Levenshtein Distance between them. Damerau-Levenshtein Distance differs from Levenshtein Distance in that it includes an additional operation, called Transpositions, which occurs when two adjacent characters are swapped. Thus, Damerau-Levenshtein Distance calculates the number of Insertions, Deletions, Substitutions, and Transpositons needed to convert string1 into string2. As a result, this function is good when it is likely that spelling errors have occured between two string where the error is simply a transposition of 2 adjacent characters.

#### Arguments

String: string1 ---> is the first string

String: string2 ---> is the second string that will be compared to the first string

#### Returns

Integer: Returns an integer of the Damerau-Levenshtein Distance between two string

#### Examples

- =DAMERAU("Cat", "Bat") -> 1; Since all that is needed is 1 change (changing the "B" in Bat to "C")
- =DAMERAU("Cat", "Ca") -> 1; Since only one Insertion needs to occur (adding a "t" at the end)
- =DAMERAU("Cat", "Cta") -> 1; Since the "t" and "a" can be transposed as they are adjacent to each other. Notice how LEVENSHTEIN("Cat","Cta")=2 but DAMERAU("Cat","Cta")=1

### =DAM_STR(range1, rangeArray)

This function takes two ranges and calculates the string that is the result of the lowest Damerau-Levenshtein Distance. The first range is a single cell which will be compared to all other cells in the second range and whichever value produces the lowest Damerau-Levenshtein Distance will be returned.

#### Arguments

Range: range1 ---> contains the string we want to find the closest string in the rangeArray to

Range: rangeArray ---> is a range of all strings that will be compared to the string in range1

#### Returns

String: Returns the string that is closest from the rangeArray

###### Warning!

This function will require exponential numbers of calculations for large amounts of strings. In cases where the number of strings are very large (a couple thousand strings for example), a better solution would be to use an external program other than Excel. Also this function will perform well in the case of comparing two lists with the same content but with spelling errors, but in cases where transpositions are unlikely, thus LEV_STR should be used as this function will be slower.#### Examples

- =DAM_STR("Cat", A1:A3) -> "Cta"; Where A1:A3 contains ["Bath", "Hello", "Cta"]; LEV_STR will actually return "Bath" in this case since it comes first in the range and since "Bath" and "Cta" will actually both have a LEV=2, but while "Bath" with have DAM=2, for "Cta" only one operation is required (a single Transposition instead of a Substitution and a Deletion) and thus for "Cta" DAM=1

### =DAM_STR_OPT(range1, rangeArray, numberOfLeftCharactersBound, plusOrMinusLengthBound)

This function is the same as DAM_STR except that it adds two more arguments which can be used to optimize the speed of searches when the number of strings to search is very large. Since the number of calculations will increase exponentially to find the best fit string, this function can exclude a lot of strings that are unlikely to have the lowest Damerau�Levenshtein Distance. The additional two parameters are a parameter that first checks a certain number of characters at the left of the strings and if the strings don't have the same characters on the left, then that string is excluded. The second of the two parameters sets the maximum length difference between the two strings, and if the length of string2 is not within the bounds of string1 length +/- the length bound, then this string is excluded. Setting high values for these parameters will essentially convert this function into a slightly slower version of DAM_STR.

#### Arguments

Range: range1 ---> contains the string we want to find the closest string in the rangeArray to

Range: rangeArray ---> is a range of all strings that will be compared to the string in range1

Long: numberOfLeftCharactersBound ---> is the number of left characters that will be checked first on both strings before calculating their Levenshtein Distance

Variant: plusOrMinusLengthBound ---> is the number plus or minus the length of the first string that will be checked compared to the second string before calculating their Levenshtein Distance

#### Returns

String: Returns the string that is closest from the rangeArray

###### Warning!

This function will work for a much larger number of strings than LEV_STR, however care must be taken before using it. This function will perform well in cases where the group of strings are likely to have a large number of differnces between each individual string and where it is likely that the leftmost charaters of the string will be the same. An example might comparing two sets of company names for the companies in a stock index, as they are likely to be fairly different but likely will have the same leftmost characters between the two lists.#### Examples

- =DAM_STR_OPT("Cat", A1:A3, 1, 2) -> "Car"; Where A1:A3 contains ["Car", "C Programming Langauge", "Dog"]; The calculation won't be performed on "Dog" since "Dog" doesn't start with the character "C", and "C Programming Langauge" won't be calculated either since its length is greating than LEN("Cat") +/- 2 (its length is not between 0-5 characters long).

### =HAMMING(string1, string2)

This function takes two strings of the same length and calculates the Hamming Distance between them. Hamming Distance measures how close two strings are by checking how many Substitutions are needed to turn one string into the other. Lower numbers mean the strings are closer than high numbers.

#### Arguments

String: string1 ---> is the first string

String: string2 ---> is the second string that will be compared to the first string

#### Returns

Integer: Returns an integer of the Hamming Distance between two string

#### Examples

- =HAMMING("Cat", "Bat") -> 1; Since all that is needed is 1 change (changing the "B" in Bat to "C")
- =HAMMING("Cat", "Bag") -> 2; 2 changes are needed, changing the "B" to "C" and the "g" to "t"
- =HAMMING("Cat", "Dog") -> 3; Every single character needs to be substituted in this case

### =LEVENSHTEIN(string1, string2)

This function takes two strings of any length and calculates the Levenshtein Distance between them. Levenshtein Distance measures how close two strings are by checking how many Insertions, Deletions, or Substitutions are needed to turn one string into the other. Lower numbers mean the strings are closer than high numbers. Unlike Hamming Distance, Levenshtein Distance works for strings of any length and includes 2 more operations. However, calculation time will be slower than Hamming Distance for same length strings, so if you know the two strings are the same length, its preferred to use Hamming Distance.

#### Arguments

String: string1 ---> is the first string

String: string2 ---> is the second string that will be compared to the first string

#### Returns

Long: Returns an integer of the Levenshtein Distance between two string

#### Examples

- =LEVENSHTEIN("Cat", "Bat") -> 1; Since all that is needed is 1 change (changing the "B" in Bat to "C")
- =LEVENSHTEIN("Cat", "Ca") -> 1; Since only one Insertion needs to occur (adding a "t" at the end)
- =LEVENSHTEIN("Cat", "Cta") -> 2; Since the "t" in "Cta" needs to be substituted into an "a", and the final character "a" needs to be substituted into a "t"

### =LEV_STR(range1, rangeArray)

This function takes two ranges and calculates the string that is the result of the lowest Levenshtein Distance. The first range is a single cell which will be compared to all other cells in the second range and whichever value produces the lowest Levenshtein Distance will be returned.

#### Arguments

Range: range1 ---> contains the string we want to find the closest string in the rangeArray to

Range: rangeArray ---> is a range of all strings that will be compared to the string in range1

#### Returns

String: Returns the string that is closest from the rangeArray

###### Warning!

This function will require exponential numbers of calculations for large amounts of strings. In cases where the number of strings are very large (a couple thousand strings for example), a better solution would be to use an external program other than Excel.#### Examples

- Where A1:A3 contains ["Bat", "Hello", "Dog"] =LEV_STR("Cat", A1:A3) -> "Bat"; Since "Bat" will have the lowest Levenshtein Distance of all 3 strings when compared to the string "Cat"

### =LEV_STR_OPT(range1, rangeArray, numberOfLeftCharactersBound, plusOrMinusLengthBound)

This function is the same as LEV_STR except that it adds two more arguments which can be used to optimize the speed of searches when the number of strings to search is very large. Since the number of calculations will increase exponentially to find the best fit string, this function can exclude a lot of strings that are unlikely to have the lowest Levenshtein Distance. The additional two parameters are a parameter that first checks a certain number of characters at the left of the strings and if the strings don't have the same characters on the left, then that string is excluded. The second of the two parameters sets the maximum length difference between the two strings, and if the length of string2 is not within the bounds of string1 length +/- the length bound, then this string is excluded. Setting high values for these parameters will essentially convert this function into a slightly slower version of LEV_STR.

#### Arguments

Range: range1 ---> contains the string we want to find the closest string in the rangeArray to

Range: rangeArray ---> is a range of all strings that will be compared to the string in range1

Integer: numberOfLeftCharactersBound ---> is the number of left characters that will be checked first on both strings before calculating their Levenshtein Distance

Integer: plusOrMinusLengthBound ---> is the number plus or minus the length of the first string that will be checked compared to the second string before calculating their Levenshtein Distance

#### Returns

String: Returns the string that is closest from the rangeArray

###### Warning!

This function will work for a much larger number of strings than LEV_STR, however care must be taken before using it. This function will perform well in cases where the group of strings are likely to have a large number of differnces between each individual string and where it is likely that the leftmost charaters of the string will be the same. An example might comparing two sets of company names for the companies in a stock index, as they are likely to be fairly different but likely will have the same leftmost characters between the two lists.#### Examples

- =LEV_STR_OPT("Cat", A1:A3, 1, 2) -> "Car"; Where A1:A3 contains ["Car", "C Programming Langauge", "Dog"]; The calculation won't be performed on "Dog" since "Dog" doesn't start with the character "C", and "C Programming Langauge" won't be calculated either since its length is greating than LEN("Cat") +/- 2 (its length is not between 0-5 characters long).

### =PARTIAL_LOOKUP(range1, rangeArray)

This function takes two ranges and calculates the string that is the closest match. It works similar to a VLOOKUP, expect that it works with partial matches as well as exact matches.

#### Arguments

Range: range1 ---> contains the string we want to find the closest string in the rangeArray to

Range: rangeArray ---> is a range of all strings that will be compared to the string in range1

#### Returns

String: Returns the string that is closest from the rangeArray

###### Note

This function is an alias for DAM_STR, and for a more in-depth explaination of the underlying logic used in the function to calculate the partial lookup, see the DAM_STR function.###### Warning!

This function will require exponential numbers of calculations for large amounts of strings. In cases where the number of strings are very large (more than 1000 strings), a better solution would be to use an external program other than Excel. Also this function will perform well in the case of comparing two lists with the same content but with spelling errors, but in cases where transpositions are unlikely, thus LEV_STR should be used as this function will be slower.#### Examples

- =PARTIAL_LOOKUP("Cta", A1:A3) -> "Cat"; Where A1:A3 contains ["Bath", "Hello", "Cat"];