Fellegi-Sunter Search Module
The Fellegi-Sunter Search Module is a general-purpose Search Module based on the algorithm that Fellegi and Sunter describe in their 1969 paper (see References). Unlike other Mirth Match Search Modules, it does not define the traits that it compares. Instead, it relies on a Matching Configuration, which is created independently for each Entity Domain that uses this Search Module.
Matching Configuration
A Matching Configuration specifies the set of Traits whose values are to be compared to determine the similarity score between a candidate entity and a search entity, and scripts to specify domain-dependent matching operations. An Entity Domain that uses the Fellegi-Sunter Search Module must have at least one Matching Configuration, of which one is currently active.
For each trait to be compared, a Matching Configuration stores:
- The method to be used to compare values of the trait.
- The agreement and non-agreement rates for the trait. These are normally computed by the Closed-Form U, U Estimator or Expectation Maximization calculation, not manually entered.
- An option to use "Trinomial EM" in calculating the agreement and non-agreement rates for this trait.
In addition, a Matching Configuration stores scripts to define the traits used by the Search Module (all the traits, not just the compared traits), to compute the values of derived traits based on source trait values, to generate candidate queries (select which Entities to examine more closely for matches), and to adjust the computed score for each pair of Entities compared. See the next section for details.
Scripting
Matching Configuration scripts are written in JavaScript.
Utility Methods
Every script is passed an instance of MatchingModuleScriptingUtil as the "Util" variable. This class contains utility methods useful in writing matching scripts.
| Method | Description |
|---|---|
| SingleTraitMatchExpressionGroup newSingleTraitMatchExpressionGroup(String id, Map<String, String> values) | Returns a new SingleTraitMatchExpressionGroup from the specified trait identifier and trait values, which can be added to the queries in a Generate Candidate Queries script. |
| List<String> getNicknamesForName(String name) | Returns all known nicknames for the specified first name. |
| List<String> getNamesForNickname(String nickname) | Returns all known names for the specified nickname. |
| List<String> newStringList() | Returns an empty List of Strings. |
| String joinList(List<String>, String) | Returns a String containing the text of the supplied Strings separated using the supplied separator String. |
| List<String> stringAsList(String) | Returns a List of Strings containing only the specified String. |
| Map<String,String> stringAsMap(String key, String value) | Returns a Map with a single entry composed of the specified key and value. |
| boolean isBlank(String) | Returns true if the specified String is empty or contains only whitespace, and false otherwise. |
| boolean isNotEmptyOrNull(String) | Returns true if the specified String is not null, not empty, and contains at least one non-whitespace character, and false otherwise. |
| String getMetaphone(String) | Returns the Metaphone for the specified String value, which is ordinarily a last name. |
| String encode(String[] array) | Returns a String representation of the specified array of Strings that can be decoded back to the same array of Strings by the decode method. This is the encoding method expected by the container support for multivalued blocking traits. |
| String encode(List<String> list) | Returns a String representation of the specified List of Strings that can be decoded back to an array of the same Strings by the decode method. This is the encoding method expected by the container support for multivalued blocking traits. |
| String[] decode(String encoded) | Given a String encoded by the encode method, returns an array of the same Strings as passed to the encode method. |
| Logger getLogger() | Returns the Logger for MatchingModuleScripting. Useful for logging information from scripts to the Mirth Match log file. |
| void debug(String msg) | Logs the specified message to the Mirth Match log file at the FINEST level. |
| String normalizeDOB(String raw) | Performs the same normalization logic on a raw date as that performed by the MHSip date of birth normalization. Attempts to convert date values to the form YYYYMMDD. |
| String normalizeGender(String raw) | Performs the same normalization logic on a raw gender as that performed by the MHSip Search Module. Attempts to convert gender values to M, F or U. |
| NormalName normalizeName(String last, String first, String middle) | Converts the specified raw last name(s), first name and middle name to a standard form. The NormalName returned contains: |
| String removeDiacritics(String src) | Returns a copy of the source string, with all diacritics (accents, etc.) removed from all characters. Since these are often omitted, removing them should be part of normalizing the values for matching and blocking traits. |
| void setScore(FloatScore score, double value) | Sets the value of score to value. |
| void adjustScore(FloatScore score, double delta) | Adds delta to the value of score. |
| float getComponentScore(String id, boolean matching) | Gets the matching module plugin component score with the specified identifier. |
| boolean same(Object left, Object right) | Returns true if the two supplied objects are the same object (including both being null) or are equal. |
After Scoring Script
This script is executed after the search module has computed a similarity score for the search and candidate Entities. It provides an opportunity to adjust the similarity score based on factors the search module could not or did not take into account.
Parameters
| Name | Type | Description |
|---|---|---|
| entity | ScriptableEntity | The candidate Entity that the search Entity is being compared to (read-only). |
| configuration | MatchingConfiguration | The active Matching Configuration for the domain in which the search is occurring. |
| score | Float | The Score value computed by the search algorithm so far. |
| options | ScriptableEntity | The search Entity - the Entity being searched for (read-only). |
| Util | MatchingModuleScriptingUtil | Scripting utility methods. |
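Example

A minimal sketch of an After Scoring script, assuming a hypothetical rule that lowers the score when the genders of the two Entities differ. Only `Util.same` and `Util.adjustScore` come from the utility table above. The `gender` trait identifier, the penalty value, the `getTraitValue` accessor and the stand-in objects in the first half of the block are assumptions, added so the sketch can run outside Mirth Match.

```javascript
// --- Stand-ins so the sketch runs outside Mirth Match (all hypothetical) ---
function makeEntity(traitValues) {
  // getTraitValue is an assumed accessor name, not confirmed by this page.
  return {
    getTraitValue: function (id) {
      return traitValues.hasOwnProperty(id) ? traitValues[id] : null;
    }
  };
}
var Util = {
  same: function (left, right) { return left === right; },
  adjustScore: function (score, delta) { score.value += delta; }
};
var entity = makeEntity({ gender: 'F' });   // candidate Entity
var options = makeEntity({ gender: 'M' });  // search Entity
var score = { value: 10.0 };                // score computed so far

// --- Script body: penalize a gender mismatch when both values are known ---
var GENDER_PENALTY = -5.0; // illustrative value only
var candidateGender = entity.getTraitValue('gender');
var searchGender = options.getTraitValue('gender');
if (candidateGender !== null && searchGender !== null &&
    !Util.same(candidateGender, searchGender)) {
  Util.adjustScore(score, GENDER_PENALTY);
}
```

Inside Mirth Match the container supplies `entity`, `options`, `score` and `Util`, so only the script body would be kept.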
Derive Trait Values Script
This script is executed for each of the candidate Entity and the search Entity before comparing the two. Its job is to fill in the values of derived traits (traits not supplied by the caller, but instead calculated) from source traits (traits that are supplied by the caller).
Parameters
| Name | Type | Description |
|---|---|---|
| configuration | MatchingConfiguration | The active Matching Configuration for the domain in which the search is occurring. |
| entity | ScriptableEntity | The Entity for which derived trait values should be computed. The source trait values are read-only and cannot be modified. The derived trait values are read-write and may be modified. |
| Util | MatchingModuleScriptingUtil | Scripting utility methods. |
Example
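The original example is not available here. The following is a minimal sketch of what a Derive Trait Values script might look like, not the product's own example. It assumes hypothetical source traits `lastName` and `dob`, hypothetical derived traits `lastNameKey` and `dobNormalized`, and hypothetical `getTraitValue`/`setTraitValue` accessors on the Entity. The `Util` methods used are the ones listed in the utility table, replaced here by simplified stand-ins so the sketch can run outside Mirth Match.

```javascript
// --- Stand-ins so the sketch runs outside Mirth Match (all hypothetical) ---
var traitValues = { lastName: 'N\u00fa\u00f1ez', dob: '1980-02-29' }; // "Núñez"
var entity = {
  // getTraitValue/setTraitValue are assumed accessor names.
  getTraitValue: function (id) {
    return traitValues.hasOwnProperty(id) ? traitValues[id] : null;
  },
  setTraitValue: function (id, value) { traitValues[id] = value; }
};
var Util = {
  isNotEmptyOrNull: function (s) {
    return s !== null && s !== undefined && String(s).trim() !== '';
  },
  // Simplified stand-ins; the real methods handle many more input forms.
  removeDiacritics: function (s) {
    return s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
  },
  normalizeDOB: function (raw) { return raw.replace(/-/g, ''); }
};

// --- Script body: fill in derived traits from source traits ---
var lastName = entity.getTraitValue('lastName');
if (Util.isNotEmptyOrNull(lastName)) {
  entity.setTraitValue('lastNameKey', Util.removeDiacritics(lastName).toUpperCase());
}
var dob = entity.getTraitValue('dob');
if (Util.isNotEmptyOrNull(dob)) {
  entity.setTraitValue('dobNormalized', Util.normalizeDOB(dob));
}
```

The script only writes to derived traits, consistent with the source trait values being read-only.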
Generate Candidate Queries Script
This script is called after the derived trait values are calculated for the search Entity, to determine which blocking traits and trait values should be used to find candidate Entities to compare to the search Entity. By default, at entry to the script, the queries parameter will contain queries for all blocking traits associated with the Entity Domain for which the search Entity has a (typically derived) trait value.
Parameters
| Name | Type | Description |
|---|---|---|
| configuration | MatchingConfiguration | The active Matching Configuration for the domain in which the search is occurring. |
| entity | ScriptableEntity | The search Entity - the Entity being searched for (read-only). |
| queries | List<TaggedEISExpressionGroup> | The (modifiable) list of candidate queries. At the start of the script, the list will contain the default queries created by the container based on the "blocking" and "multivalued" values for the traits associated with the Entity Domain. The script may add or remove queries in the list. |
| options | MMSearchContext | The search options for which the candidate queries are being generated. Overlaps entity. |
| Util | MatchingModuleScriptingUtil | Scripting utility methods. |
Example
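The original example is not available here. The following is a minimal sketch of a Generate Candidate Queries script that adds one query per known nickname of the search Entity's first name. The `firstName` trait identifier, the `getTraitValue` accessor, and the interpretation of the `values` map as trait identifier to trait value are assumptions; this page does not specify them. The stand-ins in the first half mimic only the `java.util.List` calls the script uses.

```javascript
// --- Stand-ins so the sketch runs outside Mirth Match (all hypothetical) ---
function javaList(items) {           // mimics the List calls used below
  return {
    add: function (x) { items.push(x); return true; },
    size: function () { return items.length; },
    get: function (i) { return items[i]; }
  };
}
var Util = {
  isNotEmptyOrNull: function (s) {
    return s !== null && s !== undefined && String(s).trim() !== '';
  },
  getNicknamesForName: function (name) {
    return javaList(name === 'WILLIAM' ? ['BILL', 'WILL'] : []);
  },
  stringAsMap: function (key, value) { var m = {}; m[key] = value; return m; },
  newSingleTraitMatchExpressionGroup: function (id, values) {
    return { id: id, values: values };
  }
};
var entity = {
  getTraitValue: function (id) { return id === 'firstName' ? 'WILLIAM' : null; }
};
var queries = javaList([]);          // the container would pre-populate this

// --- Script body: also look for candidates registered under a nickname ---
var firstName = entity.getTraitValue('firstName');
if (Util.isNotEmptyOrNull(firstName)) {
  var nicknames = Util.getNicknamesForName(firstName);
  for (var i = 0; i < nicknames.size(); i++) {
    var values = Util.stringAsMap('firstName', nicknames.get(i));
    queries.add(Util.newSingleTraitMatchExpressionGroup('firstName', values));
  }
}
```

Each added query widens the candidate set, so a script like this trades search time for fewer missed matches.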
Get Traits Script
This script is called each time the Entity Domain is started, including at server startup and when a Matching Configuration is made active for the Entity Domain. It returns the minimum set of traits that should be associated with the Entity Domain.
Parameters
| Name | Type | Description |
|---|---|---|
| configuration | MatchingConfiguration | The active Matching Configuration for the domain in which the search is occurring. |
| Util | MatchingModuleScriptingUtil | Scripting utility methods. |
| traits | TraitList | Represents the list of traits this matching configuration uses. Provides two methods for adding traits to the list. add(String identifier, String label, String description) adds a simple trait with the default value for all options. add(String identifier, String label, String description, Map options) adds a trait that uses the values supplied in options for the options it defines, and the default value for all others. Option values described below as "per domain" are recorded only the first time the trait is associated with the entity domain, which allows the user to change them manually through the UI. Option values not described below as "per domain" are set the first time the trait is associated with the entity domain, and again any time the domain is started (such as at server startup). |
Options
| Name | Type | Default | Per Domain | Meaning |
|---|---|---|---|---|
| datatype | String | "ST" | false | The alias of the data type for this trait. |
| blocking | boolean | false | true | True if this trait is a blocking trait for this entity domain, false if it is not. |
| codeset | String | null | false | The alias of the codeset from which values of this trait are to be taken. |
| parent | String | null | false | The identifier of the parent trait from which the value of this trait is derived. |
| mask | String | null | false | The edit mask for the value of this trait. |
| required | boolean | false | true | True if a value for this trait must be provided when registering a new Entity, false if not. |
| maxLength | Integer | null | false | The maximum allowed length for a value of this trait. |
| minLength | Integer | null | false | The minimum allowed length for a value of this trait. |
| multivalued | boolean | false | true | True if the value of this trait is an encoded list of values. This only affects traits which are also flagged as blocking traits. |
| status | String | Active | false | The alias of the status for this trait. |
Example
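The original example is not available here. The following is a minimal sketch of a Get Traits script using the two `add` methods described in the parameter table and option names from the Options table. The trait identifiers, labels and descriptions are hypothetical, and passing a plain JavaScript object as the `options` Map is an assumption. The `traits` stand-in simply records the calls so the sketch can run outside Mirth Match.

```javascript
// --- Stand-in so the sketch runs outside Mirth Match (hypothetical) ---
var registered = [];
var traits = {
  add: function (identifier, label, description, options) {
    registered.push({
      identifier: identifier,
      label: label,
      description: description,
      options: options || {}
    });
  }
};

// --- Script body ---
// Source trait supplied by the caller; all options keep their defaults.
traits.add('lastName', 'Last Name', 'Family name as supplied by the caller');

// Required source trait with a maximum length.
traits.add('dob', 'Date of Birth', 'Date of birth as supplied by the caller',
           { required: true, maxLength: 10 });

// Derived blocking trait computed from lastName by the Derive Trait Values script.
traits.add('lastNameKey', 'Last Name Key', 'Normalized last name used for blocking',
           { parent: 'lastName', blocking: true });
```

Because `required` and `blocking` are "per domain" options, the values given here would only take effect the first time each trait is associated with the Entity Domain.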
Analysis Workbench
The new Analysis Workbench provides facilities for defining and performing analysis operations on the trait values of the Entities in an Entity Domain or Identifier Domain. An Analysis performs one or more calculations according to a selected Matching Configuration. The selected Matching Configuration can differ from the active Matching Configuration for the Entity Domain, which allows analysis calculations to be performed with, for example, different trait value derivation logic. Analytical calculations are defined by plugins, which can be added, removed and upgraded without restarting Mirth Match.
Blocking
The Blocking calculation examines the values of all traits associated with the Entity Domain that are marked as blocking traits. For each blocking trait it computes and reports:
- Coverage: the percentage of Entities that have a value for the blocking trait. A good blocking trait will have a high coverage.
- Number of distinct values: The number of different values seen for the blocking trait. A good blocking trait will have a large number of distinct values.
- Number of values that are too common: The number of values seen for the blocking trait that appear too many times, with a configurable threshold for "too many." A good blocking trait will have no values that are "too common."
- Average number of candidates: The average number of Entities with a particular value of the blocking trait. A good blocking trait will have a fairly small average number of candidates. If the number is too large, performance will suffer because the search process must compare against too many candidate matches. If the number is too small, however, potential matching Entities may be missed.
- Maximum number of candidates: The maximum number of Entities with a particular value of the blocking trait, over all values. A good blocking trait will have a maximum number of candidates not too much larger than the average number of candidates.
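The statistics in the list above can be illustrated with a small sketch. This is not the plugin's implementation: the function name, the result field names, and the choice to average candidates per distinct value are assumptions made for illustration. A `null` entry stands for an Entity with no value for the blocking trait.

```javascript
// Compute blocking statistics for one trait; null marks a missing value.
function blockingStats(values, tooCommonThreshold) {
  var counts = Object.create(null);
  var present = 0;
  values.forEach(function (v) {
    if (v === null) { return; }
    present += 1;
    counts[v] = (counts[v] || 0) + 1;
  });
  var distinct = Object.keys(counts);
  var max = 0;
  var tooCommon = 0;
  distinct.forEach(function (v) {
    if (counts[v] > max) { max = counts[v]; }
    if (counts[v] > tooCommonThreshold) { tooCommon += 1; }
  });
  return {
    coveragePercent: values.length ? 100 * present / values.length : 0,
    distinctValues: distinct.length,
    tooCommonValues: tooCommon,
    averageCandidates: distinct.length ? present / distinct.length : 0,
    maximumCandidates: max
  };
}

// Eight Entities, two of them without a value; "too common" means more than 2.
var blocking = blockingStats(
  ['S530', 'S530', 'J520', null, 'S530', 'B650', 'J520', null], 2);
```

In this sample the maximum (3) is close to the average (2), which is the pattern the list describes as desirable.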
Closed-Form U
The Closed-Form U calculation examines the values of all Matching Configuration traits (traits that are compared by the search module). For each such trait, it computes and reports the agreement and non-agreement rates using the Closed-Form method. The calculation also updates the Matching Configuration traits with the calculated Agreement and non-Agreement values.
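This page does not define the Closed-Form method. As an illustration only, one closed-form estimate often used for exact-match comparisons is the probability that two Entities drawn at random share a value, computed directly from value counts without enumerating pairs. The sketch below assumes that interpretation; the function name is hypothetical and the plugin may compute its rates differently.

```javascript
// Probability that two Entities drawn at random (without replacement)
// have the same value, computed from value counts alone.
function closedFormAgreementRate(values) {
  var counts = Object.create(null);
  values.forEach(function (v) { counts[v] = (counts[v] || 0) + 1; });
  var n = values.length;
  var agreeingPairs = 0;
  Object.keys(counts).forEach(function (v) {
    // A value seen c times contributes c(c-1)/2 agreeing pairs.
    agreeingPairs += counts[v] * (counts[v] - 1) / 2;
  });
  var totalPairs = n * (n - 1) / 2;
  return totalPairs > 0 ? agreeingPairs / totalPairs : 0;
}

// Two 'M' and three 'F': 1 + 3 agreeing pairs out of 10.
var chanceAgreement = closedFormAgreementRate(['M', 'M', 'F', 'F', 'F']);
```

The cost is linear in the number of Entities, in contrast to calculations that examine every pair.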
Expectation Maximization
The Expectation Maximization calculation examines the values of all Matching Configuration traits (traits that are compared by the search module) for all possible pairs of Entities and calculates the agreement and non-agreement rates and score weights using the Expectation Maximization method. The calculation also updates the Matching Configuration traits with the calculated Agreement and non-Agreement values and score weights. This is the primary method used for determining the Agreement, non-Agreement and score weight values that drive the Fellegi-Sunter Search Module. The generated report includes:
- The lowest possible score value that can be returned by the Fellegi-Sunter search module.
- The highest possible score value that can be returned by the Fellegi-Sunter search module.
- The recommended "auto-link" score.
- The overall agreement rate.
- For each Matching Configuration trait, the calculated agreement and non-agreement rates.
- For each matching vector (combination of trait value matches and mismatches), the assigned score, the number of Entity pairs with the matching vector, and the cumulative total of pair counts (the estimated net population if the score is used as the auto-link score).
Note that because this calculation operates on all possible pairs of Entities, its execution time is proportional to the square of the number of Entities. In other words, doubling the number of Entities to be processed causes the execution time to roughly quadruple.
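To make the score range and the pair count concrete, here is a sketch of the textbook Fellegi-Sunter weights. It is not taken from the module: `m` (probability that a trait agrees among true matches) and `u` (probability that it agrees among non-matches) are textbook notation, how they map to the stored agreement and non-agreement rates is an assumption, and so is the use of base-2 logarithms.

```javascript
// Textbook Fellegi-Sunter weights in bits (log base 2 is an assumption here).
function log2(x) { return Math.log(x) / Math.LN2; }

function traitWeights(m, u) {
  return {
    agree: log2(m / u),                // added when the trait values agree
    disagree: log2((1 - m) / (1 - u))  // added when they disagree (negative if m > u)
  };
}

// Score range over a set of traits, assuming m > u for every trait:
// all traits disagree (lowest) to all traits agree (highest).
function scoreRange(traits) {
  var lowest = 0;
  var highest = 0;
  traits.forEach(function (t) {
    var w = traitWeights(t.m, t.u);
    lowest += w.disagree;
    highest += w.agree;
  });
  return { lowest: lowest, highest: highest };
}

// Number of unordered pairs examined for n Entities: n(n-1)/2.
function pairCount(n) { return n * (n - 1) / 2; }

var range = scoreRange([{ m: 0.9, u: 0.1 }, { m: 0.8, u: 0.2 }]);
var growth = pairCount(2000) / pairCount(1000); // slightly more than 4
```

With these illustrative rates the range is symmetric, about -5.17 to 5.17 bits; real configurations generally are not symmetric.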
Frequency
The Frequency calculation examines the values of all traits in all selected Entities. For each trait, it calculates and reports:
- Null Frequency: the percentage of Entities with no value for the trait.
- Number of Distinct Values: the number of different values seen for the trait.
- Average Value Frequency: the average over all values of the trait of the number of Entities with that trait value.
- Entropy and Maximum Entropy: the average amount of information provided by a particular value of this trait, in bits.
- The value, occurrence count and frequency for a configurable number of the least and most frequent values of the trait.
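The entropy figures can be illustrated with a short sketch. It assumes Shannon entropy over the observed non-null values and takes Maximum Entropy to be the entropy of a uniform distribution over the distinct values (log2 of their count); neither definition is confirmed by this page, and the function and field names are hypothetical.

```javascript
// Frequency statistics for one trait; null marks a missing value.
function frequencyStats(values) {
  var counts = Object.create(null);
  var present = 0;
  values.forEach(function (v) {
    if (v === null) { return; }
    present += 1;
    counts[v] = (counts[v] || 0) + 1;
  });
  var distinct = Object.keys(counts);
  var entropy = 0;
  distinct.forEach(function (v) {
    var p = counts[v] / present;
    entropy -= p * Math.log(p) / Math.LN2;   // -sum of p * log2(p)
  });
  return {
    nullFrequencyPercent: values.length
      ? 100 * (values.length - present) / values.length : 0,
    distinctValues: distinct.length,
    averageValueFrequency: distinct.length ? present / distinct.length : 0,
    entropyBits: entropy,
    maxEntropyBits: distinct.length ? Math.log(distinct.length) / Math.LN2 : 0
  };
}

var frequency = frequencyStats(['A', 'A', 'B', 'C', null, null, 'A', 'B']);
```

Entropy close to the maximum indicates evenly spread values; a much lower figure indicates that a few values dominate.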
Outliers
The Outliers calculation examines the value of a single date trait in all selected Entities. It computes and reports:
- The number of Entities with no value for the trait.
- The number of Entities with a value before a configurable threshold.
- The number of Entities with a value after a configurable threshold.
In addition, it computes and reports the observed and expected frequency of the day of the month. Note that the expected frequency is based only on the number of days in each month, not on real-world demographics.
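A sketch of the day-of-month comparison, assuming every day of a non-leap year is equally likely and that dates are already in the YYYYMMDD form. Ignoring leap years and the function names are assumptions of this sketch, not details taken from the plugin.

```javascript
var DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]; // non-leap year

// Expected share of dates falling on each day of the month (index 1..31),
// if every day of the year were equally likely.
function expectedDayOfMonthFrequency() {
  var expected = [];
  for (var day = 1; day <= 31; day++) {
    var monthsWithDay = 0;
    for (var m = 0; m < DAYS_IN_MONTH.length; m++) {
      if (DAYS_IN_MONTH[m] >= day) { monthsWithDay += 1; }
    }
    expected[day] = monthsWithDay / 365;
  }
  return expected;
}

// Observed share of each day of the month in YYYYMMDD date strings.
function observedDayOfMonthFrequency(dates) {
  var observed = [];
  for (var day = 1; day <= 31; day++) { observed[day] = 0; }
  dates.forEach(function (d) {
    observed[parseInt(d.substring(6, 8), 10)] += 1 / dates.length;
  });
  return observed;
}

var expected = expectedDayOfMonthFrequency();
var observed = observedDayOfMonthFrequency(
  ['19800101', '19900101', '20000215', '19751231']);
```

A day whose observed share is far above its expected share, commonly the 1st, often points to placeholder dates rather than real birthdays.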
Pair Frequency
The Pair Frequency calculation examines the values of two selected traits, and computes and reports on the combinations of values observed. It calculates and reports:
- The number of distinct value pairs observed.
- The average mutual information of the trait pair, in bits.
- A configurable number of the most and least frequently observed value pairs, their occurrence count and frequency.
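Average mutual information can be sketched with the standard formula, the sum over observed pairs of p(x,y) · log2(p(x,y) / (p(x) · p(y))). The function and field names below are hypothetical, and whether the plugin uses exactly this estimator is an assumption.

```javascript
// pairs is an array of [leftValue, rightValue] observations for two traits.
function pairFrequencyStats(pairs) {
  var joint = Object.create(null);
  var left = Object.create(null);
  var right = Object.create(null);
  pairs.forEach(function (p) {
    var key = JSON.stringify(p);
    joint[key] = (joint[key] || 0) + 1;
    left[p[0]] = (left[p[0]] || 0) + 1;
    right[p[1]] = (right[p[1]] || 0) + 1;
  });
  var n = pairs.length;
  var mi = 0;
  Object.keys(joint).forEach(function (key) {
    var p = JSON.parse(key);
    var pxy = joint[key] / n;
    var px = left[p[0]] / n;
    var py = right[p[1]] / n;
    mi += pxy * Math.log(pxy / (px * py)) / Math.LN2;
  });
  return { distinctPairs: Object.keys(joint).length, mutualInformationBits: mi };
}

// Fully dependent traits: knowing one value determines the other (1 bit).
var dependent = pairFrequencyStats([['a', 'x'], ['a', 'x'], ['b', 'y'], ['b', 'y']]);
// Independent traits: every combination equally likely (0 bits).
var independent = pairFrequencyStats([['a', 'x'], ['a', 'y'], ['b', 'x'], ['b', 'y']]);
```

High mutual information between two traits suggests they are redundant for matching purposes, since one largely predicts the other.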
U Estimator
The U Estimator calculation examines the values of all Matching Configuration traits (traits that are compared by the search module) for a "random" subset of pairs of Entities (with a configurable size). For each such trait, it computes and reports the agreement rate, standard deviation and confidence interval. The calculation also updates the Matching Configuration traits with the calculated Agreement values.
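A sketch of estimating an agreement rate from randomly sampled pairs. The exact-equality comparison, the binomial standard error reported as the standard deviation, and the 95% normal-approximation interval are all assumptions; the page does not say how the plugin computes these figures. The optional `random` argument is a hypothetical hook for supplying a repeatable generator.

```javascript
// values holds one trait value per Entity and must contain at least two entries.
function estimateAgreementRate(values, sampleSize, random) {
  random = random || Math.random;
  var agreements = 0;
  for (var k = 0; k < sampleSize; k++) {
    var i = Math.floor(random() * values.length);
    var j = Math.floor(random() * (values.length - 1));
    if (j >= i) { j += 1; }                  // guarantees j differs from i
    if (values[i] === values[j]) { agreements += 1; }
  }
  var rate = agreements / sampleSize;
  var stdError = Math.sqrt(rate * (1 - rate) / sampleSize);
  return {
    rate: rate,
    standardDeviation: stdError,
    confidenceInterval: [
      Math.max(0, rate - 1.96 * stdError),   // 95% interval, normal approximation
      Math.min(1, rate + 1.96 * stdError)
    ]
  };
}

var estimate = estimateAgreementRate(['M', 'F', 'F', 'M', 'F', 'M'], 1000);
```

Sampling keeps the cost proportional to the sample size rather than to the square of the number of Entities, at the price of the uncertainty expressed by the interval.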
References
- Fellegi, Ivan; Sunter, Alan (December 1969). "A Theory for Record Linkage". Journal of the American Statistical Association 64 (328): pp. 1183–1210. doi:10.2307/2286061. JSTOR 2286061.


