Doccano Auto-Labeling Configuration
Doccano is an open-source annotation tool. Project: https://github.com/doccano/doccano
Official documentation: https://doccano.github.io/doccano/
It supports JSONL file import/export and REST API-based auto-labeling.
Auto-labeling API reference:
https://blog.csdn.net/weixin_44826203/article/details/125719480
Issues encountered:
Unable to correctly configure the auto-labeling API
The root cause is a frontend bug in the current version of Doccano. See https://github.com/doccano/doccano/issues/2281
Workaround: access the Django admin interface at http://x.x.x.x:8000/admin/ and configure it manually.
Model attrs:{"url": "http://x.x.x.x:5739", "body": {"text": "{{ text }}"}, "method": "POST", "params": {}, "headers": {}}
Template:
[
  {% for entity in input %}
  {
    "start_offset": {{ entity.start_offset }},
    "end_offset": {{ entity.end_offset }},
    "label": "{{ entity.label }}"
  }{% if not loop.last %},{% endif %}
  {% endfor %}
]
Label mapping:{"label1":"match label","label2":"match label2"}
# label1: your configured label_span name
# match label: entity class name returned by the interface
With this configuration in place, the API backend receives requests and processes them normally, but the Doccano frontend still fails to auto-label. The root cause is unclear: either some parameter is still misconfigured (hard to diagnose because of the poor quality of the Doccano frontend), or Doccano is not receiving the returned data.
Debugging approach:
- On the Doccano machine, capture traffic to the API endpoint to verify whether data is being received.
- Check Doccano-related logs.
- Read the source code (at this point, switching to another tool or annotating manually may be more practical).
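One way to carry out the first check without a packet sniffer is to point the auto-labeling URL at a throwaway server that records every request it receives. The following is a minimal sketch using only the Python standard library. The names `EchoHandler`, `start_echo_server`, and `received`, as well as the empty-list reply, are illustrative assumptions and not part of Doccano.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # request bodies seen so far (hypothetical helper state)


class EchoHandler(BaseHTTPRequestHandler):
    """Log every POST body and answer with an empty JSON list."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8", errors="replace")
        received.append(body)
        print(f"{self.client_address[0]} POST {self.path} body={body}")
        payload = json.dumps([]).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # silence the default access log; do_POST prints its own line


def start_echo_server(host="0.0.0.0", port=5739):
    """Start the server in a background thread and return it."""
    server = HTTPServer((host, port), EchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_echo_server()` in an interactive session (with the real labeling service stopped, so the port is free) and then clicking Test in Doccano should print one line per request. If nothing is printed, the request is probably not leaving Doccano at all.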
Solution: install an older version of Doccano.
docker pull doccano/doccano:1.8.3
docker container create --name doccano_183 \
  -e "ADMIN_USERNAME=admin" \
  -e "ADMIN_EMAIL=admin@example.com" \
  -e "ADMIN_PASSWORD=password" \
  -v doccano-db:/data \
  -p 8002:8000 doccano/doccano:1.8.3
docker container start doccano_183
# List available tags
curl -s https://registry.hub.docker.com/v2/repositories/doccano/doccano/tags | jq '.results[].name'
Auto-Labeling
Named entity recognition interface:
from flask import Flask, request, jsonify
import re
import regex  # third-party module: supports \X and variable-length lookbehind

app = Flask(__name__)


def load_common_words(file_path):
    """Read one word per line, skipping blank lines."""
    with open(file_path, 'r', encoding='utf-8') as file:
        words = [line.strip() for line in file if line.strip()]
    return words


words = load_common_words('common_words.txt')
# Define regex patterns
# Priority patterns, matched first
patterns_mc = [
    ('Phone', r'(?<=\+86[-\s]?)1[3-9]\d{9}|(?<=\+852[-\s]?)(?:4|5|6|7|8|9)\d{7}|(?<=\+886[-\s]?)09\d{8}|(?<=\+853[-\s]?)6\d{7}'),
    ('TG', r'@[a-zA-Z][a-zA-Z0-9_]{4,31}'),
    ('Mail', r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'),
    ('QQ', r'(?<=QQ[ ]?|qq[ ]?|Qq[ ]?|qQ[ ]?)[1-9][0-9]{6,10}'),
    ('ID', r'(\d{6}(?:\d{8}|\d{6})\d{3}(?:\d|X))'),
    ('Landline_Number', r'0\d{2,3}-\d{7,8}'),
]
if words:  # an empty word list would produce a pattern that matches the empty string
    patterns_mc.append(('Common_words', r'\b(' + '|'.join(re.escape(word) for word in words) + r')\b'))

# Secondary patterns, matched only where no priority pattern has matched
patterns = [
    ('Phone', r'1[3-9]\d{9}|(?:4|5|6|7|8|9)\d{7}|09\d{8}|6\d{7}'),  # mainland China / Hong Kong / Taiwan / Macau phone numbers written without a country code
    ('QQ', r'[1-9][0-9]{6,10}'),  # QQ number, constrained to 7-11 digits
    ('WX', r'[a-zA-Z][-_a-zA-Z0-9]{5,19}')  # WeChat ID pattern
]
def extract_labels(text):
    """Return non-overlapping entity spans, with offsets counted in grapheme clusters."""
    results = []
    grapheme_clusters = list(regex.finditer(r'\X', text))
    # Map string offsets to grapheme cluster indices (end index is exclusive)
    start_index = {m.start(): i for i, m in enumerate(grapheme_clusters)}
    end_index = {m.end(): i + 1 for i, m in enumerate(grapheme_clusters)}
    matched_positions = [False] * len(grapheme_clusters)  # track which grapheme clusters have been matched

    def run_pass(pattern_list):
        matches = []
        for label, pattern in pattern_list:
            for match in regex.finditer(pattern, text):
                if match.end() > match.start():  # ignore empty matches
                    matches.append((label, match.start(), match.end()))
        # Sort by match length, longest first
        matches.sort(key=lambda x: x[2] - x[1], reverse=True)
        for label, start, end in matches:
            # Find the grapheme cluster range for the match
            first, last = start_index.get(start), end_index.get(end)
            if first is None or last is None:
                continue  # match boundary falls inside a grapheme cluster
            # Accept the match only if no cluster in the range is already taken
            if not any(matched_positions[first:last]):
                results.append({"label": label, "start_offset": first, "end_offset": last})
                # Mark the whole matched range, including its last cluster
                matched_positions[first:last] = [True] * (last - first)

    run_pass(patterns_mc)  # priority matching pass
    run_pass(patterns)     # secondary matching pass
    return results


@app.route('/', methods=['POST'])
def get_result():
    text = request.json['text']
    print(text)
    results = extract_labels(text)
    return jsonify(results)


if __name__ == '__main__':
    # Be careful not to conflict with existing ports
    # host=0.0.0.0 means the service is accessible from any machine on the network
    # When accessing from another machine, use the actual IP address
    app.run(host='0.0.0.0', port=5739)
Test:
curl -X POST http://x.x.x.x:5739 -H "Content-Type: application/json" -d '{"text":"这是一个测试文本,包含中国大陆手机号:13912345678,香港手机号:51234567,澳门手机号:61234567,台湾手机号:0912345678"}'
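The same check can be scripted in Python, which is convenient when testing many sample sentences. This is a minimal sketch using only the standard library. The function names `build_request` and `fetch_labels` are hypothetical, and the URL must be replaced with the address of your own service.

```python
import json
import urllib.request


def build_request(url, text):
    """Build the same POST request that the curl command sends."""
    data = json.dumps({"text": text}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_labels(url, text, timeout=10):
    """Send the request and return the decoded JSON list of entities."""
    with urllib.request.urlopen(build_request(url, text), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the Flask service running, `fetch_labels("http://x.x.x.x:5739", "手机号:13912345678")` should return a list of objects with `label`, `start_offset`, and `end_offset` keys.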
Now that we have the Doccano annotation platform and an auto-labeling interface, the next step is to connect them.
Log into the annotation system with the admin account. Click Settings in the lower-left corner, then select Auto Labeling. In the dialog that appears, choose Custom REST Request.
Click Next and enter the address of the auto-labeling service (your IP + port).
Leave Params and Headers empty. In Body, fill in:
Key: text
Value: {{ text }}
Note: the value contains two spaces, one between each pair of braces and the word text.
After filling this in, you can test the interface by entering a sample sentence and clicking Test. If a valid result is returned, the interface is working correctly. Otherwise, trace back through the previous steps.
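The spacing note above makes sense if the placeholder is matched as a literal string. The sketch below emulates that behavior in plain Python; it is an assumption about how the substitution works, not Doccano's actual code, and `PLACEHOLDER` and `fill_body` are hypothetical names.

```python
import copy

PLACEHOLDER = "{{ text }}"  # assumed literal placeholder, spaces included


def fill_body(body, text):
    """Return a copy of body with every value equal to PLACEHOLDER replaced by text."""
    filled = copy.deepcopy(body)
    for key, value in filled.items():
        if isinstance(value, dict):
            filled[key] = fill_body(value, text)  # handle nested objects
        elif value == PLACEHOLDER:
            filled[key] = text
    return filled
```

Under this assumption, `fill_body({"text": "{{ text }}"}, "hello")` yields `{"text": "hello"}`, while a value typed without the spaces is sent to the service unchanged.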
Click Next and add the following template at the indicated location:
[
  {% for entity in input %}
  {
    "start_offset": {{ entity.start_offset }},
    "end_offset": {{ entity.end_offset }},
    "label": "{{ entity.label }}"
  }{% if not loop.last %},{% endif %}
  {% endfor %}
]
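The template loops over the service response (available as `input`) and emits one JSON object per entity, separated by commas. A plain-Python emulation of that transformation can help when checking what the template should produce for a given response. This is a sketch for illustration only; `render_mapping` is a hypothetical name and no template engine is involved.

```python
import json


def render_mapping(api_response):
    """Emulate the template: keep only the three fields of each entity."""
    spans = []
    for entity in api_response:  # corresponds to {% for entity in input %}
        spans.append({
            "start_offset": int(entity["start_offset"]),
            "end_offset": int(entity["end_offset"]),
            "label": str(entity["label"]),
        })
    return json.dumps(spans, ensure_ascii=False)
```

If the service response lacks one of the three keys, this function raises `KeyError`, which is a quick way to spot a response shape that the template cannot handle.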
The final step is to establish label mappings from the interface to the annotation platform. This maps entity types returned by the interface to the labels created in the annotation platform. For example, if the interface defines a type 时间 (time) but the platform label is named 时间日期 (date/time), you need to create a mapping between them. Create all required mappings.
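The effect of a mapping can be sketched as a dictionary lookup applied to each returned entity. The function name `apply_label_mapping` is hypothetical, and the direction of the dictionary (interface type as key, platform label as value) is an assumption that follows the example in the paragraph above.

```python
def apply_label_mapping(spans, mapping):
    """Rename labels that appear in mapping; leave all other labels unchanged."""
    return [
        {**span, "label": mapping.get(span["label"], span["label"])}
        for span in spans
    ]


# Example: the interface returns 时间, the platform label is 时间日期
spans = [{"start_offset": 0, "end_offset": 2, "label": "时间"}]
mapped = apply_label_mapping(spans, {"时间": "时间日期"})
```

Labels without a mapping entry pass through unchanged, so they only produce annotations if a platform label with exactly the same name exists.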
Finally, click Test → Finish. Setup is complete.
Adding Annotator Users
Access the Django admin interface at <your-ip>:<annotation-service-port>/admin/, for example:
111.222.33.44:1234/admin/
In the admin panel, click Add under Users to create annotator accounts.