Proximity Measure of Information Object Features for Solving the Problem of Their Identification in Information Systems

Imagine you are the manager of a massive, chaotic detective agency. Your job is to track down "suspects" (physical objects like cars, people, or ships) based on reports coming in from different sources.

Here is the problem: You have two detectives, Detective Bob and Detective Alice. They are both looking at the same suspect, a red car.

Bob is using a high-tech laser scanner. He reports: "The car is at mile marker 100.2."
Alice is using a cheap, old map. She reports: "The car is at mile marker 100.5."

In your current system, the computer thinks these are two different cars because the numbers don't match perfectly. So, you end up with two files for one car. This clutters your database, wastes memory, and confuses your team.

This paper proposes a new, smarter way to decide if two reports are actually about the same object, even when the data isn't perfect.

The Old Way: The "Rigid Ruler"

Previously, systems used a "Rigid Ruler" approach.

If the numbers were exactly the same, it was a match.
If they were even slightly different, it was a mismatch.
The Flaw: This ignores reality. No measurement is perfect. A ruler might be slightly bent, or a human might misread a scale. The old system couldn't handle "close enough."

The New Way: The "Fuzzy Detective"

The author, V.V. Yuzefovych, suggests we stop demanding exact matches and instead estimate how likely it is that two reports describe the same thing. Think of it like this:

1. The Quantitative Clues (The Numbers)

Imagine you are trying to guess the temperature.

Detective Bob says: "It's 20°C." He is very confident (his thermometer is precise).
Detective Alice says: "It's 22°C." She is less confident (her thermometer is shaky).

Instead of saying "20 is not 22," the new method asks: "Given how precise each thermometer is, what is the chance that both readings describe the same real temperature?"

It draws a "cloud of uncertainty" around Bob's number (a small, tight cloud because he's precise).
It draws a "foggy cloud" around Alice's number (a big, wide cloud because she's shaky).
If these clouds overlap, the system says, "Hey, there's a good chance they are talking about the same temperature!"
The Magic: The system calculates the probability that the true value lies in the region where the clouds overlap. The larger that probability, the higher the chance both reports describe the same object.

2. The Qualitative Clues (The Descriptions)

Now, imagine the clues aren't numbers, but descriptions.

Bob says: "The car is Red."
Alice says: "The car is Dark Red."

Old systems would say, "Red is not Dark Red. Mismatch!"
The new system uses Fuzzy Logic (think of it as a dimmer switch rather than an on/off light).

It treats "Red" and "Dark Red" not as separate boxes, but as shades on a spectrum.
It asks: "How much does 'Dark Red' look like 'Red'?"
If Alice is unsure about her description (she says, "I think it's Red, but I'm not 100% sure"), the system accounts for that doubt. It widens the "fog" around her description, making it easier to match with Bob's.

The Final Verdict: The "All-or-Nothing" Rule

Once the system checks every clue (location, color, speed, type), it needs to make a final decision.

The paper suggests using a Multiplicative Rule (like a chain).

Imagine a chain where every link represents a clue.
If one link is broken (e.g., the car is definitely Red in one report and definitely Blue in another), the whole chain breaks.
Even if the location and speed match perfectly, if the color is totally wrong, the system concludes: "These are two different cars."

This prevents the system from accidentally merging two totally different objects just because they happened to be in the same neighborhood.

Why This Matters

This new method is like upgrading your detective agency from a rigid robot to a wise human investigator.

No More Clutter: It stops the system from creating duplicate files for the same object.
Handles Mistakes: It understands that humans and machines make errors, and it doesn't panic when data isn't perfect.
Better Decisions: By knowing which objects are truly the same, the system can give you a clearer picture of the world, leading to better decisions.

In short: The paper teaches computers how to say, "These two reports aren't exactly the same, but given the mistakes we know happen, they are almost certainly talking about the same thing."

1. Problem Statement

Information systems often aggregate data regarding physical objects (POs) from multiple independent sources (internal or external). A critical challenge arises when data describing the same physical object arrives as distinct Information Objects (IOs) due to:

Measurement Errors: Quantitative features (e.g., coordinates) are determined with varying precision and errors.
Subjectivity/Uncertainty: Qualitative features (e.g., object type, status) are determined via human reasoning or ordinal scales, introducing non-statistical uncertainty.
Resulting Data Duplication: Without a robust identification mechanism, the system stores these distinct IOs as duplicate records, which inflates data volumes and distorts assessments of how saturated the environment is with objects.

Existing proximity measures (e.g., Euclidean distance, Hamming distance, Jaccard similarity) often fall short because they:

Require normalization of features measured in different units.
Assume a "complete match" is necessary for qualitative features, ignoring the possibility of close but non-identical values due to error margins.
Do not inherently account for the uncertainty (error distribution) of the data sources.

2. Methodology

The author proposes a new quantitative-qualitative proximity measure that explicitly models the probability of two feature values originating from the same physical reality, accounting for source-specific errors. The methodology is divided into three components:

A. Quantitative Features (Probabilistic Approach)

Instead of using linear distance, the author treats measurement errors as probability distributions, assumed to be normal (Gaussian) on the basis of the Central Limit Theorem.

Mechanism: For two measurements $x_i$ and $x_j$ with known root-mean-square errors ($\sigma_i$, $\sigma_j$), the method calculates the probability that the true value lies within the intersection of their error intervals, where each interval is defined by the "three-sigma" rule.
Calculation:
1. Define the intersection interval $[c, d]$, where $c = \max(x_i - 3\sigma_i, x_j - 3\sigma_j)$ and $d = \min(x_i + 3\sigma_i, x_j + 3\sigma_j)$. If the two error intervals do not overlap, the interval is empty and the proximity is zero.
2. Calculate the joint probability $P$ that the true value falls in this interval for both sources.
3. Proximity Measure ($\rho'$): The joint probability itself.
4. Distance Measure ($\rho$): $1 - \rho'$.
Refinement: Identical values reported by high-precision sources should count as "closer" than identical values reported by low-precision sources. To capture this, a correction coefficient ($P_\xi$) is introduced that depends inversely on the error magnitude.
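The calculation steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the author's implementation: it assumes the errors of the two sources are independent, so the joint probability is taken as the product of each source's normal probability mass on $[c, d]$, and it leaves out the correction coefficient $P_\xi$ because its formula is not given in this summary. All function names are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def interval_probability(x: float, sigma: float, c: float, d: float) -> float:
    """Probability mass that N(x, sigma^2) places on the interval [c, d]."""
    return normal_cdf((d - x) / sigma) - normal_cdf((c - x) / sigma)


def quantitative_proximity(x_i: float, sigma_i: float,
                           x_j: float, sigma_j: float) -> float:
    """Proximity rho' of two measurements with known RMS errors.

    Assumption: independent sources, so the joint probability is the
    product of the two per-source probabilities on [c, d].
    """
    # Step 1: intersection of the two three-sigma intervals.
    c = max(x_i - 3 * sigma_i, x_j - 3 * sigma_j)
    d = min(x_i + 3 * sigma_i, x_j + 3 * sigma_j)
    if c >= d:
        return 0.0  # intervals do not overlap

    # Step 2: joint probability that the true value lies in [c, d].
    p_i = interval_probability(x_i, sigma_i, c, d)
    p_j = interval_probability(x_j, sigma_j, c, d)

    # Step 3: the proximity is the joint probability itself.
    return p_i * p_j


def quantitative_distance(x_i: float, sigma_i: float,
                          x_j: float, sigma_j: float) -> float:
    """Step 4: distance rho = 1 - rho'."""
    return 1.0 - quantitative_proximity(x_i, sigma_i, x_j, sigma_j)
```

In this sketch two identical readings score the same proximity whether both sources have an RMS error of 1 or of 10, which is the behavior the correction coefficient is meant to address.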

B. Qualitative Features (Fuzzy Set Approach)

Qualitative features (nominal or ordinal) are formalized as Fuzzy Sets to handle non-statistical uncertainty and human reasoning errors.

Ordinal Scales: Values are mapped to triangular (or Gaussian) membership functions. The "width" of the triangle is determined by a coefficient $k$ representing the perceived error margin.
Nominal Scales: Modeled with a specific membership function allowing for a small degree of error ($\Delta$) even if values differ, acknowledging potential misclassification.
Certainty Levels: The method incorporates linguistic certainty levels (e.g., "Certain," "Probable," "Doubtful") by scaling the membership function values.
Calculation:
1. Construct fuzzy sets for values from Source A and Source B.
2. Compute the intersection of these sets (using the minimum operator).
3. Proximity Measure ($\rho'$): The height of the maximum point of the intersection (possibility degree).
4. Distance Measure ($\rho$): $1 - \rho'$.
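For an ordinal feature, the four steps above might look like the following Python sketch. It is an illustration under assumptions rather than the paper's procedure: values are placed on a numeric ordinal scale, the triangle's half-width is passed in directly (how it is derived from $k$ is not specified here), the certainty level is represented by a hypothetical `scale` factor between 0 and 1, and the height of the intersection is found by sampling the scale on a grid.

```python
def triangular(x: float, center: float, half_width: float,
               scale: float = 1.0) -> float:
    """Triangular membership: `scale` at the center, 0 beyond center +/- half_width."""
    return scale * max(0.0, 1.0 - abs(x - center) / half_width)


def fuzzy_proximity(center_a: float, half_width_a: float,
                    center_b: float, half_width_b: float,
                    scale_a: float = 1.0, scale_b: float = 1.0,
                    samples: int = 2001) -> float:
    """Proximity rho': height of the min-intersection of two fuzzy sets.

    Assumption: the maximum is approximated on an evenly spaced grid
    covering the support of both membership functions.
    """
    lo = min(center_a - half_width_a, center_b - half_width_b)
    hi = max(center_a + half_width_a, center_b + half_width_b)
    step = (hi - lo) / (samples - 1)

    height = 0.0
    for n in range(samples):
        x = lo + n * step
        # Step 2: intersection of the two sets via the minimum operator.
        mu = min(triangular(x, center_a, half_width_a, scale_a),
                 triangular(x, center_b, half_width_b, scale_b))
        # Step 3: keep the highest point of the intersection.
        height = max(height, mu)
    return height


def fuzzy_distance(*args, **kwargs) -> float:
    """Step 4: distance rho = 1 - rho'."""
    return 1.0 - fuzzy_proximity(*args, **kwargs)
```

For two unscaled triangles the grid search can be replaced by the closed form $\max(0, 1 - |a - b| / (w_a + w_b))$, where $a$, $b$ are the centers and $w_a$, $w_b$ the half-widths; the sampled version is kept here because it also works for other membership shapes.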

C. Aggregation (Global IO Proximity)

To determine the overall similarity between two IOs based on a set of features, the paper proposes two aggregation strategies:

Additive: Summing weighted distances (standard approach).
Multiplicative (Proposed for Identification): Multiplying the proximity indicators of individual features.
- Rationale: In identification tasks, a single significant mismatch (e.g., different object type or large coordinate deviation) should invalidate the hypothesis that two IOs are the same. The multiplicative approach ensures that if any feature has zero similarity, the total similarity becomes zero.
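The difference between the two strategies can be seen in a short Python sketch. The feature values and equal weights below are made up for illustration, and the function names are hypothetical.

```python
import math
from typing import Sequence


def multiplicative_proximity(proximities: Sequence[float]) -> float:
    """Overall proximity as the product of per-feature proximities rho'."""
    return math.prod(proximities)


def additive_distance(distances: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Overall distance as a weighted sum of per-feature distances rho."""
    return sum(w * d for w, d in zip(weights, distances))


# Illustrative values: coordinates and speed agree well, object type does not.
proximities = [0.95, 0.90, 0.0]
distances = [1.0 - p for p in proximities]
weights = [1 / 3, 1 / 3, 1 / 3]

overall_proximity = multiplicative_proximity(proximities)  # one zero factor gives zero
overall_distance = additive_distance(distances, weights)   # mismatch is averaged with the rest
```

With these values the multiplicative rule returns zero proximity, rejecting the match outright, while the additive rule reports a moderate distance of about 0.38 that a threshold could still accept.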

3. Key Contributions

Unified Error-Aware Measure: The paper introduces a framework that simultaneously handles quantitative (probabilistic) and qualitative (fuzzy) features without requiring prior normalization or transformation of data units.
Theoretical Validation: The proposed measures are verified against the standard axioms of distance metrics:
- Non-negativity, Symmetry, and Identity: Satisfied by both probabilistic and fuzzy definitions.
- Triangle Inequality: The fuzzy measure satisfies this strictly. The probabilistic measure does not always satisfy it due to non-linear probability growth, but the author argues this is acceptable for identification tasks where physical meaning (probability) outweighs strict metric geometry.
Handling Source Precision: The method dynamically adjusts the "closeness" of two objects based on the known precision ($\sigma$) of the data sources, rather than treating all data points equally.
Multiplicative Convolution: A specific aggregation method is proposed to prevent "averaging out" critical mismatches during object identification.

4. Results and Simulation

The author conducted modeling experiments using planar coordinates (quantitative) and object types (qualitative/nominal) from two sources with varying precision.

Scenario 1 (Lower Precision): Sources with RMSE of 20m and 30m.
Scenario 2 (Higher Precision): Sources with RMSE of 10m and 15m.
Findings:
- The proximity measure increased non-linearly as the linear distance between objects decreased.
- Precision Sensitivity: For objects in very close proximity, the proximity score was higher when derived from high-precision sources (reflecting higher confidence). Conversely, for distant objects, the score dropped more sharply with high-precision sources (as the probability of such a large error decreases).
- Qualitative Impact: A mismatch in object type (qualitative feature) drastically reduced the total similarity score, even if the spatial coordinates were close, demonstrating the effectiveness of the multiplicative approach.

5. Significance and Conclusion

Automation: The proposed measure enables higher levels of automation in data fusion systems by reducing the need for manual intervention to resolve duplicates.
Data Integrity: By accurately identifying and merging IOs, the system eliminates data duplication, leading to more accurate assessments of environmental object saturation and reducing the risk of erroneous decision-making.
Flexibility: The approach is adaptable to various domains (surveillance, environmental monitoring) and does not require the rigid preprocessing steps (normalization) demanded by traditional distance metrics.
Limitation: The method requires a priori knowledge of measurement errors (RMSE) and fuzzy set parameters (error coefficients), which must be specified before implementation.

In summary, Yuzefovych presents a theoretically grounded mathematical framework for identifying information objects that respects the inherent uncertainty and varying precision of multi-source data, offering an error-aware alternative to traditional binary or linear distance measures.