The Preprocessed Record Similarity API is a high-speed fuzzy matching and deduplication API built for real-world messy data. It helps you identify near-duplicate records and reconcile entities even when values don't match exactly: typos, capitalization differences, missing punctuation, spacing issues, abbreviations, and minor word-order changes.
Instead of building and tuning your own fuzzy matching pipeline, you send your strings (or records) to the API and get back similarity scores you can trust. Typical output includes matched pairs (e.g., "Apple" ↔ "Apple Inc"), similarity scores, and structured results that plug easily into data-cleaning workflows, CRMs, ETL jobs, and analytics pipelines.
Common use cases:
Deduplicate a list: find duplicates within a dataset (all-to-all matching) and return likely duplicate pairs.
Reconcile against a master list: match an input list against a canonical set (list-to-master).
CRM and customer data hygiene: clean up duplicate leads/accounts/companies that cause problems in reporting and outreach.
Entity resolution and record linkage: connect references to the same real-world entity across multiple sources.
Why teams use it:
Handles messy text out of the box (no hand-written rules for every edge case).
Similarity scores for ranking and thresholding (you choose how strict to be).
Built for scale and automation (designed to run in pipelines, not just one-off scripts).
Dedupe is an all-to-all fuzzy matching endpoint for finding duplicates within a single list of strings. Instead of comparing only two inputs per API call, you send a dataset and it returns similar pairs and/or deduplicated groups across the entire set.
Why you’d use it
Massive speedup: typically ~300× to 1,000× faster than “regular” approaches people try first (pairwise comparisons, looping fuzzy scorers, etc.) once you go beyond tiny lists.
Optional cleanup built-in: you can enable common text cleanup (lowercasing, punctuation removal, token sorting). This saves hours (or days) of development + ongoing maintenance.
Company suffixes handled automatically: common endings like “Inc”, “LLC”, “Ltd”, etc. are stripped so you match the real name.
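To make the suffix handling concrete, here is a minimal local sketch of the kind of normalization the API applies server-side. The suffix list and `strip_company_suffix` helper below are illustrative assumptions; the API's actual suffix list is not published.

```python
import re

# Hypothetical suffix list for illustration; the API's real list is not published.
SUFFIXES = {"inc", "llc", "ltd", "gmbh", "corp", "co"}

def strip_company_suffix(name: str) -> str:
    """Drop punctuation, then remove trailing legal suffixes (case-insensitive)."""
    tokens = re.sub(r"[^\w\s]", "", name).split()
    while tokens and tokens[-1].lower() in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

print(strip_company_suffix("Acme Inc"))    # Acme
print(strip_company_suffix("ACME, LLC"))   # ACME
```

After this normalization, "Acme Inc" and "ACME LLC" both reduce to the bare name, so they compare on the part that actually identifies the company.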
Benchmarks: similarity-api/blog/speed-benchmarks (1M records in ~7 minutes; faster than common Python fuzzy matching libraries).
Hard limits on Zyla
Max 1,000 strings per request (enforced).
Need bigger / unlimited?
Parameters (POST request)
data (required)
A string containing a JSON array of strings.
Example value for data:
["Acme Inc","ACME LLC","Globex GmbH"]
similarity_threshold (optional, 0–1, default 0.75)
Higher = stricter matching (fewer pairs). Typical: 0.80–0.90 for company dedupe.
remove_punctuation (optional, true/false, default true)
Removes punctuation differences (e.g., “A.C.M.E.” vs “ACME”).
to_lowercase (optional, true/false, default true)
Makes matching case-insensitive.
use_token_sort (optional, true/false, default false)
Helps when word order changes (e.g., “Bank of America” vs “America Bank of”).
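A quick local sketch of what token sorting does (the `token_sort` helper below is an illustrative assumption, not the API's code): after sorting tokens, word-order variants normalize to the same string and therefore score as matches.

```python
# Illustrative stand-in for the normalization use_token_sort=true enables server-side.
def token_sort(s: str) -> str:
    """Lowercase, split on whitespace, and rejoin tokens in sorted order."""
    return " ".join(sorted(s.lower().split()))

print(token_sort("Bank of America"))   # america bank of
print(token_sort("America Bank of"))   # america bank of
```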
output_format (optional, default string_pairs)
This endpoint can return data in multiple formats. Please select one of the following:
string_pairs:
Returns matches as [string_A, string_B, similarity].
index_pairs:
Like string_pairs, but returns positions in your input list instead of the strings: [index_A, index_B, similarity].
deduped_strings:
Returns the list with duplicates removed (one representative per group).
deduped_indices:
Like deduped_strings, but returns the indices of the kept items.
membership_map:
Array where entry i is the representative index for row i. For example, [0,0,0,3,3] means rows 0/1/2 are one group (rep=0) and rows 3/4 are another (rep=3).
row_annotations:
Returns one object per input row explaining which group it belongs to (rep row + similarity).
Use when: you want a human-readable, per-row result for debugging or UI display.
top_k (optional, integer or "all", default "all")
all = find all matches above threshold.
Or an integer (e.g., 50) to limit matches per row (faster, fewer results).
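If you need explicit groups on the client side, a membership_map response expands into them in a few lines. This sketch uses the sample map from the docs ([0,0,0,3,3]: rows 0/1/2 share representative 0, rows 3/4 share representative 3).

```python
from collections import defaultdict

# Sample membership_map from the docs above.
membership_map = [0, 0, 0, 3, 3]

# Bucket each row index under its representative index.
groups = defaultdict(list)
for row, rep in enumerate(membership_map):
    groups[rep].append(row)

print(dict(groups))  # {0: [0, 1, 2], 3: [3, 4]}
```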
Sample request in Python
```python
import requests, json

API_KEY = "YOUR_ZYLA_KEY"
URL = "https://zylalabs.com/api/11916/preprocessed+record+similarity+api/22653/dedupe"

data_list = ["Microsoft", "Micsrosoft", "Apple Inc", "Apple", "Google LLC", "9oogle"]

params = {
    "data": json.dumps(data_list),
    "similarity_threshold": "0.75",
    "remove_punctuation": "true",
    "to_lowercase": "true",
    "use_token_sort": "false",
    "output_format": "string_pairs",
    "top_k": "all",
}

headers = {"Authorization": f"Bearer {API_KEY}"}
r = requests.post(URL, headers=headers, params=params, timeout=60)
print(r.status_code)
print(r.json())
```
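Once you have string_pairs results back, you often want duplicate clusters rather than raw pairs. A minimal sketch using union-find, assuming the response shape {"status": ..., "response_data": [[a, b, score], ...]} shown in the sample response below:

```python
def cluster_pairs(pairs):
    """Collapse [a, b, score] match pairs into connected clusters (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, _score in pairs:
        parent[find(a)] = find(b)  # union the two items' roots

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), []).append(x)
    return list(clusters.values())

# Pair list shaped like a string_pairs response_data (scores are illustrative).
sample = [["Microsoft", "Micsrosoft", 0.9], ["Apple Inc", "Apple", 0.89]]
print(cluster_pairs(sample))
```

In a real pipeline you would pass `r.json()["response_data"]` in place of `sample`.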
Dedupe - Endpoint Features

| Parameter | Description |
|---|---|
| data | [Required] JSON array of strings to deduplicate (max 1,000). Example: ["a","b","c"] |
| similarity_threshold | [Optional] Similarity cutoff from 0 to 1. Higher values are stricter (fewer matches). Default is 0.75. |
| remove_punctuation | [Optional] If true, punctuation is removed before matching. Default is true. |
| to_lowercase | [Optional] If true, strings are lowercased before matching. Default is true. |
| use_token_sort | [Optional] If true, tokens in each string are sorted before matching. Useful when word order varies. Default is false. |
| output_format | [Optional] Default: string_pairs. Allowed values: index_pairs (matches as [i, j, score] where i and j are indices in the input list); string_pairs (matches as [string_i, string_j, score] using original strings); deduped_strings (list of strings with duplicates removed, one representative per group); deduped_indices (indices of the deduplicated set, one representative per group); membership_map (array of length N where entry i is the representative index for the group of data[i]); row_annotations (array of objects, one per input row, with fields index, original_string, rep_index, rep_string, similarity_to_rep). |
| top_k | [Optional] Limits how many neighbors are returned per input string. Use all for a full dedupe, or a positive integer for top matches per row. |
```json
{"status":"success","response_data":[["Apple","appl!e",1.0]]}
```

```shell
curl --location --request POST 'https://zylalabs.com/api/11916/preprocessed+record+similarity+api/22653/dedupe?data=["Apple", "appl!e"]' \
  --header 'Authorization: Bearer YOUR_API_KEY'
```
| Header | Description |
|---|---|
| Authorization | [Required] Should be Bearer access_key. After subscribing, see "Your API access key" above. |
No long-term commitment. Upgrade, downgrade, or cancel anytime.
The Dedupe endpoint returns a JSON object containing matched string pairs, similarity scores, and optional deduplication results. Output can be formatted as string pairs, index pairs, or deduplicated strings, depending on the configuration you specify.
Key fields in the response include "status" (indicating success or error) and "response_data", which holds the results formatted according to your request, such as matched pairs or deduplicated strings.
You can customize a request by adjusting parameters such as "similarity_threshold" for match strictness, "remove_punctuation" for preprocessing, and "output_format" to choose the desired result structure.
The response data is organized as an array of results, with each entry corresponding to a match or a deduplicated string. Depending on the output format, entries may include the original strings, indices, and similarity scores, making them easy to integrate into workflows.
Typical use cases include deduplicating customer lists, reconciling records against a master list, cleaning CRM data, and resolving entities across data sources to ensure data integrity and accuracy.
Data accuracy is maintained through fuzzy matching algorithms that account for common data issues such as typos and case differences. The API is designed to handle messy data effectively and return reliable matches.
Accepted parameter values include "similarity_threshold" (0 to 1), "remove_punctuation" (boolean), "to_lowercase" (boolean), "use_token_sort" (boolean), and "top_k" (an integer or "all"). These parameters let you tune the matching process to your specific needs.
If the Dedupe endpoint returns partial or empty results, check the input data for quality issues and verify the similarity threshold: a threshold set too high suppresses matches. Adjusting "similarity_threshold" or reviewing the input list can help improve results.
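One practical way to apply that advice is to retry with progressively lower thresholds and keep the strictest one that yields matches. This is a sketch under assumptions: `pick_threshold` and `run_dedupe` are hypothetical helpers (the stubbed `fake_run` stands in for a real API call shaped like the Python sample above).

```python
def pick_threshold(run_dedupe, thresholds=(0.85, 0.80, 0.75)):
    """Try thresholds from strict to loose; return the first that yields matches."""
    for t in thresholds:
        pairs = run_dedupe(similarity_threshold=t)
        if pairs:  # stop at the strictest threshold that produced results
            return t, pairs
    return None, []

# Stub standing in for a real API call: no matches until threshold <= 0.80.
def fake_run(similarity_threshold):
    return [["Acme Inc", "ACME LLC", 0.82]] if similarity_threshold <= 0.80 else []

print(pick_threshold(fake_run))  # (0.8, [['Acme Inc', 'ACME LLC', 0.82]])
```

In production, `run_dedupe` would wrap the POST request from the sample above and return `response_data`.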
Service level:
100%
Response time:
3,110ms