Preprocessed Record Similarity API is a high-speed fuzzy matching and deduplication API built for real-world, messy data. It helps you identify near-duplicate records and reconcile entities even when values don’t match exactly because of typos, casing differences, missing punctuation, spacing issues, abbreviations, or minor word-order changes.
Instead of building and tuning your own fuzzy matching pipeline, you send your strings (or records) to the API and get back similarity-scored matches you can trust. Typical outputs include matched pairs (e.g., “Apple” ↔ “apple inc.”), similarity scores, and structured results that are easy to plug into data cleaning workflows, CRMs, ETL jobs, and analytics pipelines.
Common use cases:
Deduplicate lists: find duplicates inside a dataset (all-to-all matching) and return likely duplicate pairs.
Reconcile against a master list: match an incoming list to a canonical set (list-to-master).
CRM and customer data hygiene: clean leads/accounts/companies where duplicates break reporting and outreach.
Entity resolution & record linkage: connect references to the same real-world entity across sources.
Why teams use it:
Works on messy text out of the box (no manual rules for every edge case)
Similarity scores for ranking and thresholds (you choose how strict to be)
Built for scale and automation (designed to run in pipelines, not just one-off scripts)
Dedupe is an all-to-all fuzzy matching endpoint for finding duplicates within a single list of strings. Instead of comparing only two inputs per API call, you send a dataset and it returns similar pairs and/or deduplicated groups across the entire set.
Why you’d use it
Massive speedup: typically ~300× to 1,000× faster than “regular” approaches people try first (pairwise comparisons, looping fuzzy scorers, etc.) once you go beyond tiny lists.
Optional cleanup built-in: you can enable common text cleanup (lowercasing, punctuation removal, token sorting). This saves hours (or days) of development + ongoing maintenance.
Company suffixes handled automatically: common endings like “Inc”, “LLC”, “Ltd”, etc. are stripped so you match the real name.
Benchmarks: similarity-api/blog/speed-benchmarks (1M records in ~7 minutes; faster than common Python fuzzy matching libraries).
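For context, the "regular" approach referred to above usually looks like a nested loop over every pair of strings. The following is a minimal local sketch of that baseline using Python's standard-library difflib. It is not how the API computes similarity (this page does not describe the algorithm), and the function name naive_dedupe_pairs is hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def naive_dedupe_pairs(strings, threshold=0.75):
    """Score every pair of strings; n strings need n*(n-1)/2 comparisons."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(strings), 2):
        # Case-insensitive character-level similarity in [0, 1]
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((i, j, round(score, 3)))
    return pairs


names = ["Microsoft", "Micsrosoft", "Apple Inc", "Apple"]
pairs = naive_dedupe_pairs(names)
comparisons = len(names) * (len(names) - 1) // 2  # 6 comparisons for 4 strings
print(pairs, comparisons)
```

The number of comparisons grows quadratically: 1,000 strings already require 499,500 pair scores, which is why looping approaches slow down quickly beyond small lists.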
Hard limits on Zyla
Max 1,000 strings per request (enforced).
Parameters (POST request)
data (required)
A string containing a JSON array of strings.
Example value for data:
["Acme Inc","ACME LLC","Globex GmbH"]
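Because data must be a single string holding a JSON array, and requests are capped at 1,000 strings, a client may want to check both before sending. A minimal sketch, assuming a hypothetical helper name encode_data; the limit constant is taken from the "Hard limits" note above:

```python
import json

MAX_STRINGS_PER_REQUEST = 1000  # limit stated on this page


def encode_data(strings):
    """Validate a list of strings and return it as a JSON-encoded string."""
    if not all(isinstance(s, str) for s in strings):
        raise TypeError("every item in data must be a string")
    if len(strings) > MAX_STRINGS_PER_REQUEST:
        raise ValueError(
            f"{len(strings)} strings exceeds the limit of {MAX_STRINGS_PER_REQUEST}"
        )
    return json.dumps(strings)


encoded = encode_data(["Acme Inc", "ACME LLC", "Globex GmbH"])
print(encoded)
```

Splitting a larger list into several requests would stay under the cap, but duplicates that fall into different requests would not be compared with each other, so this sketch rejects oversized input rather than chunking it silently.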
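similarity_threshold (optional, 0 to 1, default 0.75)
Higher = stricter matching (fewer pairs). Typical: 0.80–0.90 for company dedupe.
remove_punctuation (optional, true/false, default true)
Removes punctuation differences (e.g., “A.C.M.E.” vs “ACME”).
to_lowercase (optional, true/false, default true)
Makes matching case-insensitive.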
use_token_sort (optional, true/false, default false)
Helps when word order changes (e.g., “Bank of America” vs “America Bank of”).
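To make the three cleanup options concrete, here is a rough local approximation of what they imply. The API's actual normalization rules are not documented on this page, so the function preprocess and its exact behavior (for example, which characters count as punctuation) are assumptions for illustration only.

```python
import string


def preprocess(text, remove_punctuation=True, to_lowercase=True, use_token_sort=False):
    """Approximate the cleanup flags locally (illustrative, not the API's code)."""
    if to_lowercase:
        text = text.lower()
    if remove_punctuation:
        # Drop ASCII punctuation characters such as . , ! -
        text = text.translate(str.maketrans("", "", string.punctuation))
    if use_token_sort:
        # Sort whitespace-separated tokens so word order no longer matters
        text = " ".join(sorted(text.split()))
    return text


print(preprocess("A.C.M.E."))                              # acme
print(preprocess("Bank of America", use_token_sort=True))  # america bank of
print(preprocess("America Bank of", use_token_sort=True))  # america bank of
```

With token sorting enabled, both orderings normalize to the same string, which is why the option helps when word order varies.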
output_format (optional, default string_pairs)
This endpoint can return data in multiple formats. Select one of the following:
string_pairs:
[string_A, string_B, similarity]
index_pairs:
Same as string_pairs, but returns positions in your input list instead of the strings.
[index_A, index_B, similarity]
deduped_strings:
Returns the input strings with duplicates removed (one representative per group).
deduped_indices:
Same as deduped_strings, but returns the indices of the kept items.
membership_map:
Returns an array with one entry per input row, giving the representative index of that row's group. For example, [0,0,0,3,3] means rows 0/1/2 are one group (rep=0) and rows 3/4 are another (rep=3).
row_annotations:
Returns one object per input row with an explanation of what it belongs to (rep row + similarity).
Use when: you want a human-readable, per-row result for debugging or UI display.
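A membership_map result is compact but often needs to be expanded into explicit groups on the client side. A small sketch, assuming the array layout shown in the example above (entry i holds the representative index for row i); the helper name is hypothetical:

```python
from collections import defaultdict


def groups_from_membership_map(membership_map):
    """Map each representative index to the list of row indices in its group."""
    groups = defaultdict(list)
    for row_index, rep_index in enumerate(membership_map):
        groups[rep_index].append(row_index)
    return dict(groups)


groups = groups_from_membership_map([0, 0, 0, 3, 3])
print(groups)  # {0: [0, 1, 2], 3: [3, 4]}
```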
top_k (optional, integer or "all", default "all")
all = find all matches above threshold.
Or an integer (e.g., 50) to limit matches per row (faster, fewer results).
Sample request in Python
import json

import requests

API_KEY = "YOUR_ZYLA_KEY"
URL = "API_URL/dedupe"

# Input list containing typos and company suffixes
data_list = ["Microsoft", "Micsrosoft", "Apple Inc", "Apple", "Google LLC", "9oogle"]

# All options are sent as query-string parameters; data is a JSON-encoded array
params = {
    "data": json.dumps(data_list),
    "similarity_threshold": "0.75",
    "remove_punctuation": "true",
    "to_lowercase": "true",
    "use_token_sort": "false",
    "output_format": "string_pairs",
    "top_k": "all",
}
headers = {"Authorization": f"Bearer {API_KEY}"}

r = requests.post(URL, headers=headers, params=params, timeout=60)
print(r.status_code)  # HTTP status code
print(r.json())       # parsed JSON body
Dedupe - Endpoint Features
| Object | Description |
|---|---|
| data | [Required] JSON array of strings to deduplicate (max 1000). Example: ["a","b","c"] |
| similarity_threshold | [Optional] Similarity cutoff from 0 to 1. Higher values are stricter (fewer matches). Default is 0.75. |
| remove_punctuation | [Optional] If true, punctuation is removed before matching. Default is true. |
| to_lowercase | [Optional] If true, strings are lowercased before matching. Default is true. |
| use_token_sort | [Optional] If true, tokens in each string are sorted before matching. Useful when word order varies. Default is false. |
| output_format | [Optional] Default: string_pairs. Allowed values: index_pairs (list of matches as [i, j, score] where i and j are indices in the input list); string_pairs (list of matches as [string_i, string_j, score] using original strings); deduped_strings (list of strings with duplicates removed, one representative per group); deduped_indices (list of indices representing the deduplicated set, one representative per group); membership_map (array of length N where entry i is the representative index for the group of data[i]); row_annotations (array of objects, one per input row, with fields index, original_string, rep_index, rep_string, similarity_to_rep). |
| top_k | [Optional] Limits how many neighbors are returned per input string. Use all for full dedupe, or a positive integer for top matches per row. |
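Sample request in cURL (the JSON array in data is URL-encoded so that curl does not treat the square brackets as a glob pattern)
curl --location --request POST 'https://zylalabs.com/api/11916/preprocessed+record+similarity+api/22653/dedupe?data=%5B%22Apple%22%2C%20%22appl%21e%22%5D' --header 'Authorization: Bearer YOUR_API_KEY'
Sample response
{"status":"success","response_data":[["Apple","appl!e",1.0]]}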
| Header | Description |
|---|---|
| Authorization | [Required] Should be Bearer access_key. See "Your API Access Key" above when you are subscribed. |
The Dedupe endpoint returns a JSON object containing matched pairs of strings, similarity scores, and optional deduplicated results. The output can be formatted as string pairs, index pairs, or deduplicated strings, depending on the specified configuration.
Key fields in the response data include "status" (indicating success or error) and "response_data," which contains the results formatted according to the user's request, such as matched pairs or deduplicated strings.
Users can customize requests by adjusting the request parameters, such as "similarity_threshold" for match strictness, "remove_punctuation" for preprocessing, and "output_format" to choose the desired result structure.
The response data is organized as an array of results, where each entry corresponds to a match or deduplicated string. Depending on the output format, entries may include original strings, indices, and similarity scores, facilitating easy integration into workflows.
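As an illustration of consuming this structure, the sketch below parses a string_pairs response into tuples. It assumes the payload shape shown in the sample response on this page; the shape of an error payload is not documented here, so the error branch simply reports whatever was returned. The helper name parse_string_pairs is hypothetical.

```python
def parse_string_pairs(payload):
    """Turn a string_pairs response into a list of (a, b, score) tuples."""
    if payload.get("status") != "success":
        # Error payload format is an assumption; surface the raw body
        raise RuntimeError(f"Dedupe request failed: {payload!r}")
    return [(a, b, float(score)) for a, b, score in payload["response_data"]]


sample = {"status": "success", "response_data": [["Apple", "appl!e", 1.0]]}
matches = parse_string_pairs(sample)
print(matches)  # [('Apple', 'appl!e', 1.0)]
```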
Typical use cases include deduplicating customer lists, reconciling records against a master list, cleaning CRM data, and performing entity resolution across different data sources to ensure data integrity and accuracy.
Data accuracy is maintained through advanced fuzzy matching algorithms that account for common data issues like typos and casing differences. The API is designed to handle messy data effectively, ensuring reliable matching results.
Accepted parameter values include "similarity_threshold" (0 to 1), "remove_punctuation" (boolean), "to_lowercase" (boolean), "use_token_sort" (boolean), and "top_k" (integer or "all"). These parameters allow users to tailor the matching process to their specific needs.
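These accepted values can be checked on the client before a request is sent. The following is a sketch only: build_option_params is a hypothetical helper, the defaults mirror those listed in the features table, and the string encoding of booleans ("true"/"false") follows the Python sample request on this page.

```python
ALLOWED_OUTPUT_FORMATS = {
    "string_pairs", "index_pairs", "deduped_strings",
    "deduped_indices", "membership_map", "row_annotations",
}


def build_option_params(similarity_threshold=0.75, remove_punctuation=True,
                        to_lowercase=True, use_token_sort=False,
                        output_format="string_pairs", top_k="all"):
    """Validate option values and return them as query-string parameters."""
    if not 0 <= similarity_threshold <= 1:
        raise ValueError("similarity_threshold must be between 0 and 1")
    if output_format not in ALLOWED_OUTPUT_FORMATS:
        raise ValueError(f"unknown output_format: {output_format!r}")
    is_positive_int = (
        isinstance(top_k, int) and not isinstance(top_k, bool) and top_k > 0
    )
    if top_k != "all" and not is_positive_int:
        raise ValueError('top_k must be "all" or a positive integer')
    return {
        "similarity_threshold": str(similarity_threshold),
        "remove_punctuation": str(bool(remove_punctuation)).lower(),
        "to_lowercase": str(bool(to_lowercase)).lower(),
        "use_token_sort": str(bool(use_token_sort)).lower(),
        "output_format": output_format,
        "top_k": str(top_k),
    }


options = build_option_params(similarity_threshold=0.85, top_k=50)
print(options)
```

The returned dictionary can be merged with the data parameter before the request is made.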
If the Dedupe endpoint returns partial or empty results, check the input data and the request settings. A "similarity_threshold" that is too high filters out genuine near-duplicates, and a small "top_k" limits the matches returned per row. Lowering the "similarity_threshold" or reviewing the input list can help improve results.
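One way to investigate sparse results is to request pairs once with a permissive threshold and then count, on the client, how many pairs survive at stricter cutoffs. A minimal sketch; the scores below are made-up illustrative values, not real API output, and the helper name is hypothetical.

```python
def count_pairs_at_thresholds(pairs, thresholds):
    """Count how many (a, b, score) pairs meet each candidate threshold."""
    return {t: sum(1 for _, _, score in pairs if score >= t) for t in thresholds}


# Illustrative scores only (not produced by the API)
pairs = [
    ("Microsoft", "Micsrosoft", 0.95),
    ("Apple Inc", "Apple", 0.78),
    ("Google LLC", "9oogle", 0.62),
]
counts = count_pairs_at_thresholds(pairs, [0.6, 0.75, 0.9])
print(counts)  # {0.6: 3, 0.75: 2, 0.9: 1}
```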
To obtain your API key, you first need to sign in to your account and subscribe to the API you want to use. Once subscribed, go to your Profile, open the Subscription section, and select the specific API. Your API key will be available there and can be used to authenticate your requests.