Within our Data Quality pillar, we offer a tool specifically designed to deduplicate datasets containing millions of records. Duplicate data is one of the key reasons data engineers and analysts lose time preparing their reports, and it also acts as noise during the analysis phase, for example when tracking down a specific trail of information.

Whether data is “duplicate” depends on how duplication is defined. The definition can be absolute (e.g. companies with the same VAT number), or it can be a more dynamic set of requirements (e.g. one company that appears through its different employees).
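The two kinds of definitions can be illustrated with a small sketch. The records, field names and grouping logic below are purely illustrative, not part of the D³ product:

```python
# Illustrative records; field names are assumptions for this sketch.
records = [
    {"name": "Acme NV", "vat": "BE0123456789", "contact": "alice@acme.be"},
    {"name": "ACME",    "vat": "BE0123456789", "contact": "bob@acme.be"},
    {"name": "Beta BV", "vat": "BE0987654321", "contact": "carol@beta.nl"},
]

# Absolute definition: the same VAT number means the same company.
def duplicates_by_vat(recs):
    seen = {}
    for r in recs:
        seen.setdefault(r["vat"], []).append(r["name"])
    return {vat: names for vat, names in seen.items() if len(names) > 1}

# Dynamic definition: records sharing a contact e-mail domain are linked,
# so one company "shown via its different employees" is still grouped.
def duplicates_by_domain(recs):
    seen = {}
    for r in recs:
        domain = r["contact"].split("@")[1]
        seen.setdefault(domain, []).append(r["name"])
    return {d: names for d, names in seen.items() if len(names) > 1}

print(duplicates_by_vat(records))     # {'BE0123456789': ['Acme NV', 'ACME']}
print(duplicates_by_domain(records))  # {'acme.be': ['Acme NV', 'ACME']}
```

Note that the dynamic definition catches the same pair here, but it would also link records that share no identifier at all, which is exactly where an absolute definition falls short.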

Solution

Our Data Deduplication solution (D³) is unlike default deduplication tools. Where those focus on an absolute definition of identical data, D³ maps all information and creates connections between relevant data points, building an entire network of data elements. This is not unlike the map of a metro network: all stops are interlinked and accessible via multiple routes, so you don’t have to pass through every stop to find your way; you simply follow the most convenient path. This ensures high performance when processing large datasets, compared to traditional tools that have to scan record by record.

Because of this differentiating network approach, the size of the dataset does not degrade performance, which sets D³ apart from classic SQL-based tools.
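D³’s internals are not public, but the network idea itself can be sketched: records become nodes, a link is added whenever two records share a matching attribute, and duplicate groups are the connected components of the resulting graph. This is a minimal illustration under those assumptions, not the product’s actual implementation:

```python
# Sketch of graph-based duplicate grouping via union-find; illustrative only.
from collections import defaultdict

def duplicate_groups(records, key_fields):
    # Union-find over record indices.
    parent = list(range(len(records)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)

    # Link records that share a value on any key field: one indexing pass
    # per field instead of comparing every record pair.
    for field in key_fields:
        by_value = defaultdict(list)
        for idx, rec in enumerate(records):
            by_value[rec[field]].append(idx)
        for idxs in by_value.values():
            for other in idxs[1:]:
                union(idxs[0], other)

    # Connected components = duplicate groups.
    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return [g for g in groups.values() if len(g) > 1]

recs = [
    {"vat": "BE01", "phone": "111"},
    {"vat": "BE02", "phone": "111"},  # shares phone with record 0
    {"vat": "BE02", "phone": "222"},  # shares VAT with record 1
    {"vat": "BE99", "phone": "999"},  # no links
]
print(duplicate_groups(recs, ["vat", "phone"]))  # [[0, 1, 2]]
```

Records 0 and 2 share nothing directly, yet they end up in one group because they are connected through record 1, much like two metro stops reachable via an interchange.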

Our D³ solution is not an A.I. model, so there is no need to (re)train specific models or to provide sufficient training data. Configuration happens during the network setup, based on the initial parameters, and can be modified when additional datasets become available or when a dataset has structurally changed.

Offerings

D³ can be offered in two ways:

  • Hosted by Vectr
    • API endpoint made available to the customer
    • All infrastructure managed by Vectr
  • On-premises
    • Customer provides the infrastructure
    • Vectr deploys the stand-alone solution on this infrastructure

Implementation possibilities

  • Soon to be available as an add-on to marketing tools (CRM environments such as Salesforce, Dynamics 365, …)
  • External visualization tools such as Linkurious can connect to the D³ engine
  • Thanks to the API endpoint availability, it can be plugged into custom-built applications
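For the custom-application route, integration boils down to posting records to the hosted endpoint. The base URL, path and payload shape below are assumptions for illustration only; the real contract comes from the D³ API documentation:

```python
# Hypothetical sketch of calling a D³-style deduplication endpoint.
# URL, path and JSON field names are assumptions, not the actual D³ API.
import json
from urllib import request

def build_dedup_request(records, base_url="https://d3.example.com/api/v1"):
    payload = json.dumps({"records": records}).encode("utf-8")
    return request.Request(
        f"{base_url}/deduplicate",  # assumed endpoint path
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_dedup_request([{"vat": "BE01"}, {"vat": "BE01"}])
print(req.full_url)  # https://d3.example.com/api/v1/deduplicate
print(req.method)    # POST
```

In a real integration the request would be sent with `urllib.request.urlopen(req)` (or an HTTP client of choice) and the response parsed for the duplicate groups the engine returns.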

Use-cases

Singer is well suited to the digitalization of manual document flows. It can run individually or as part of a set of digitalization tools. Individual uses include the scanning and interpretation of:

  • logistical documents (CMR, order sheets, …)
  • quality documents (automotive, health industry, …)
  • banking documents 
  • public service docs (travel documents, border control, …)

An example of Singer as part of a chain of tools: a first tool scans the document and sends it to Singer for analysis; Singer sends its output to a validation tool; depending on a positive or negative validation, a trigger with this information is sent to a CRM or ERP system.
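That chain can be expressed as a simple pipeline. Every function below is a hypothetical stub standing in for a stage of the chain, not an actual Singer, scanner or CRM API:

```python
# Sketch of the scan -> analyze -> validate -> trigger chain; all stages stubbed.
def scan_document(path):
    # Stand-in for the scanning tool at the front of the chain.
    return {"source": path, "text": "invoice 42"}

def singer_analyze(doc):
    # Stand-in for Singer's interpretation step.
    return {"fields": {"invoice_no": "42"}, "source": doc["source"]}

def validate(result):
    # Stand-in for the validation tool: here, a required field must be present.
    return result["fields"].get("invoice_no") is not None

def run_pipeline(path):
    doc = scan_document(path)
    result = singer_analyze(doc)
    # Positive validation triggers the CRM/ERP; negative goes to manual review.
    if validate(result):
        return ("crm_trigger", result["fields"])
    return ("manual_review", result["source"])

print(run_pipeline("inbox/cmr_001.pdf"))  # ('crm_trigger', {'invoice_no': '42'})
```

The point of the sketch is the branching: only the validation outcome decides whether the extracted information flows onward automatically or is routed back to a human.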

Reference projects

  • Clean-up of large datasets (leads, prize winners, legacy data, …)
  • Reduction in required storage
  • Risk management: identifying fraudulent companies by detecting duplicates across rogue sister establishments (for example: 15+ companies at the same address, with the same board of directors)