Doccano Auto-Labeling Configuration

Doccano is an open-source annotation tool. Project: https://github.com/doccano/doccano

Official documentation: https://doccano.github.io/doccano/

It supports JSONL import/export and auto-labeling through a REST API.

Auto-labeling API reference:

https://blog.csdn.net/weixin_44826203/article/details/125719480

Issues encountered:

Unable to correctly configure the auto-labeling API

The root cause is a frontend bug in the current version of Doccano. See https://github.com/doccano/doccano/issues/2281

Workaround: access the Django admin interface at http://x.x.x.x:8000/admin/ and configure it manually.

Model attrs: {"url": "http://x.x.x.x:5739", "body": {"text": "{{ text }}"}, "method": "POST", "params": {}, "headers": {}}

Template:[
    {% for entity in input %}
        {
            "start_offset": {{ entity.start_offset }},
            "end_offset": {{ entity.end_offset}},
            "label": "{{ entity.label }}"
        }{% if not loop.last %},{% endif %}
    {% endfor %}
]

Label mapping: {"label1": "match label", "label2": "match label2"}
# label1: your configured label_span name
# match label: entity class name returned by the interface

With this configuration in place, the API backend receives requests and processes them normally, but the Doccano frontend still fails to auto-label. The root cause is unclear: either some parameter is still misconfigured (difficult to diagnose given the poor quality of the Doccano frontend), or Doccano is not receiving the returned data.

Debugging approach:

  • On the Doccano machine, capture traffic to the API endpoint to verify whether data is being received.
  • Check Doccano-related logs.
  • Read the source code (at this point, switching to another tool or annotating manually may be more practical).
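
When checking whether the returned data is the problem, it can help to confirm that the captured response has the shape the mapping template reads. The following is a minimal sketch, not part of Doccano: check_spans is a hypothetical helper, and text_length is assumed to be measured in the same unit the service uses for its offsets.

```python
def check_spans(payload, text_length):
    """Return a list of problems found in an auto-labeling response.

    An empty list means the payload is a list of objects, each with
    integer start_offset/end_offset and a string label.
    """
    if not isinstance(payload, list):
        return ["top level is not a JSON array"]
    problems = []
    for i, item in enumerate(payload):
        if not isinstance(item, dict):
            problems.append(f"item {i}: not an object")
            continue
        missing = [k for k in ("start_offset", "end_offset", "label") if k not in item]
        if missing:
            problems.append(f"item {i}: missing {', '.join(missing)}")
            continue
        start, end = item["start_offset"], item["end_offset"]
        if not (isinstance(start, int) and isinstance(end, int)):
            problems.append(f"item {i}: offsets are not integers")
        elif not 0 <= start < end <= text_length:
            problems.append(f"item {i}: offsets {start}-{end} outside 0-{text_length}")
        if not isinstance(item["label"], str):
            problems.append(f"item {i}: label is not a string")
    return problems
```

Feed it the parsed JSON body from the captured traffic. An empty result suggests the service output is well-formed and the problem lies elsewhere.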

Solution: install an older version of Doccano.

docker pull doccano/doccano:1.8.3
docker container create --name doccano_183 \
  -e "ADMIN_USERNAME=admin" \
  -e "ADMIN_EMAIL=admin@example.com" \
  -e "ADMIN_PASSWORD=password" \
  -v doccano-db:/data \
  -p 8002:8000 doccano/doccano:1.8.3

docker container start doccano_183

# List available tags
curl -s https://registry.hub.docker.com/v2/repositories/doccano/doccano/tags | jq '.results[].name'

Auto-Labeling

Named entity recognition interface:

from flask import Flask, request, jsonify
import re
import regex  # third-party package (pip install regex); provides \X and variable-length lookbehind

app = Flask(__name__)

def load_common_words(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        words = [line.strip() for line in file if line.strip()]
    return words

words = load_common_words('common_words.txt')

# Define regex patterns
patterns_mc = [
    ('Phone', r'(?<=\+86[-\s]?)1[3-9]\d{9}|(?<=\+852[-\s]?)(?:4|5|6|7|8|9)\d{7}|(?<=\+886[-\s]?)09\d{8}|(?<=\+853[-\s]?)6\d{7}'),
    ('TG', r'@[a-zA-Z][a-zA-Z0-9_]{4,31}'),
    ('Mail', r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'),
    ('QQ', r'(?<=QQ[ ]?|qq[ ]?|Qq[ ]?|qQ[ ]?)[1-9][0-9]{6,10}'),
    ('ID', r'(\d{6}(?:\d{8}|\d{6})\d{3}(?:\d|X))'),
    ('Landline_Number', r'0\d{2,3}-\d{7,8}'),
    ('Common_words', r'\b(' + '|'.join(re.escape(word) for word in words) + r')\b')
]

patterns = [
    ('Phone', r'1[3-9]\d{9}|(?:4|5|6|7|8|9)\d{7}|09\d{8}|6\d{7}'),  # mainland China / Hong Kong / Taiwan / Macau phone numbers without a country code prefix
    ('QQ', r'[1-9][0-9]{6,10}'),  # QQ number, constrained to 7-11 digits
    ('WX', r'[a-zA-Z][-_a-zA-Z0-9]{5,19}')  # WeChat ID pattern
]

def extract_labels(text):
    results = []
    grapheme_clusters = list(regex.finditer(r'\X', text))
    matched_positions = [False] * len(grapheme_clusters)  # track which grapheme clusters have been matched

    all_matches_mc = []
    all_matches = []
    
    # Priority matching pass
    for label, pattern in patterns_mc:
        for match in regex.finditer(pattern, text):
            start, end = match.start(), match.end()
            all_matches_mc.append((label, start, end))
            
    all_matches_mc.sort(key=lambda x: x[2] - x[1], reverse=True)
    
    for label, start, end in all_matches_mc:
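        # Find the grapheme cluster range for the match (end_cluster is the index of the last cluster)
        start_cluster = next((i for i, m in enumerate(grapheme_clusters) if m.start() == start), None)
        end_cluster = next((i for i, m in enumerate(grapheme_clusters) if m.end() == end), None)
        # Skip empty matches and matches whose boundaries fall inside a grapheme cluster
        if start_cluster is None or end_cluster is None or end_cluster < start_cluster:
            continue
        # Check if any grapheme cluster in the range, including the last one, has already been matched
        if not any(matched_positions[start_cluster:end_cluster + 1]):
            results.append({
                "label": label,
                "start_offset": start_cluster,
                "end_offset": end_cluster + 1  # exclusive end
            })
            # Mark the whole matched range
            for i in range(start_cluster, end_cluster + 1):
                matched_positions[i] = True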
    
    # Secondary matching pass
    for label, pattern in patterns:
        for match in regex.finditer(pattern, text):
            start, end = match.start(), match.end()
            all_matches.append((label, start, end))

    # Sort by match length, longest first
    all_matches.sort(key=lambda x: x[2] - x[1], reverse=True)

    for label, start, end in all_matches:
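        # Find the grapheme cluster range for the match (end_cluster is the index of the last cluster)
        start_cluster = next((i for i, m in enumerate(grapheme_clusters) if m.start() == start), None)
        end_cluster = next((i for i, m in enumerate(grapheme_clusters) if m.end() == end), None)
        # Skip empty matches and matches whose boundaries fall inside a grapheme cluster
        if start_cluster is None or end_cluster is None or end_cluster < start_cluster:
            continue
        # Check if any grapheme cluster in the range, including the last one, has already been matched
        if not any(matched_positions[start_cluster:end_cluster + 1]):
            results.append({
                "label": label,
                "start_offset": start_cluster,
                "end_offset": end_cluster + 1  # exclusive end
            })
            # Mark the whole matched range
            for i in range(start_cluster, end_cluster + 1):
                matched_positions[i] = True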

    return results

@app.route('/', methods=['POST'])
def get_result():
    text = request.json['text']
    print(text)
    results = extract_labels(text)
    return jsonify(results)

if __name__ == '__main__':
    # Be careful not to conflict with existing ports
    # host=0.0.0.0 means the service is accessible from any machine on the network
    # When accessing from another machine, use the actual IP address
    app.run(host='0.0.0.0', port=5739)
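
The overlap handling in extract_labels (longest match first, later matches dropped if they touch an already labeled position) can be looked at in isolation. The sketch below is a simplified stand-in that uses only the standard library: select_non_overlapping is a hypothetical name, spans are plain integer offsets with an exclusive end, and the two priority passes of the service are collapsed into one.

```python
def select_non_overlapping(matches):
    """Keep the longest matches first and drop any match that overlaps a kept one.

    matches: iterable of (label, start, end) tuples, with end exclusive.
    """
    occupied = set()  # positions already covered by a kept match
    kept = []
    # sorted() is stable, so equally long matches keep their input order.
    for label, start, end in sorted(matches, key=lambda m: m[2] - m[1], reverse=True):
        span = range(start, end)
        if occupied.isdisjoint(span):
            kept.append((label, start, end))
            occupied.update(span)
    return kept
```

Because ties keep their input order, listing the Phone pattern before the QQ pattern means an 11-digit number matched by both is labeled Phone.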

Test:

curl -X POST http://x.x.x.x:5739 -H "Content-Type: application/json" -d '{"text":"这是一个测试文本,包含中国大陆手机号:13912345678,香港手机号:51234567,澳门手机号:61234567,台湾手机号:0912345678"}'
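
The same check can be scripted in Python with the standard library. This is a sketch under the assumption that the service is reachable at the URL you pass in; build_request and label_text are illustrative names, not part of Doccano or Flask.

```python
import json
import urllib.request


def build_request(url, text):
    """Build the same POST request as the curl command, without sending it."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def label_text(url, text, timeout=10):
    """Send the request and return the decoded JSON list of spans."""
    with urllib.request.urlopen(build_request(url, text), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, label_text("http://x.x.x.x:5739", "手机号:13912345678") should return a list of objects with label, start_offset and end_offset once the service is running.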

Now that we have the Doccano annotation platform and an auto-labeling interface, the next step is to connect them.

Log into the annotation system with the admin account. Click Settings in the lower-left corner, then select Auto Labeling. In the dialog that appears, choose Custom REST Request.

Click Next and enter the address of the auto-labeling service (your IP + port).

Leave Params and Headers empty. In Body, fill in:

Key: text

Value: {{ text }}

Note: in Value, there is one space on each side of text, between it and the surrounding braces (two spaces in total).
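
One plausible reason the spacing matters is that the placeholder is compared as an exact string. The sketch below illustrates that assumption only; fill_body is a hypothetical stand-in and is not taken from Doccano's code.

```python
PLACEHOLDER = "{{ text }}"  # one space on each side, as entered in the Body value


def fill_body(body, text):
    """Return a copy of body with the placeholder in each string value replaced by text.

    Assumption: the placeholder is matched as an exact string.
    """
    return {
        key: value.replace(PLACEHOLDER, text) if isinstance(value, str) else value
        for key, value in body.items()
    }
```

Under this assumption, a value typed as {{text}} would be sent to the service unchanged instead of being replaced by the document text.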

After filling this in, you can test the interface by entering a sample sentence and clicking Test. If a valid result is returned, the interface is working correctly. Otherwise, trace back through the previous steps.

Click Next and add the following template at the indicated location:

[
    {% for entity in input %}
        {
            "start_offset": {{ entity.start_offset }},
            "end_offset": {{ entity.end_offset}},
            "label": "{{ entity.label }}"
        }{% if not loop.last %},{% endif %}
    {% endfor %}
]
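
To make the effect of the template concrete, here is a rough Python equivalent. It assumes that input in the template is the parsed JSON list returned by the service; render_template is an illustrative name, and the real rendering is done by Doccano's template engine, not by this function.

```python
import json


def render_template(interface_response):
    """Mirror the mapping template: copy three fields from each entity."""
    spans = [
        {
            "start_offset": entity["start_offset"],
            "end_offset": entity["end_offset"],
            "label": entity["label"],
        }
        for entity in interface_response
    ]
    return json.dumps(spans, ensure_ascii=False)
```

Any extra fields returned by the service are dropped, and an empty response renders as an empty JSON array.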

The final step is to establish label mappings from the interface to the annotation platform. This maps entity types returned by the interface to the labels created in the annotation platform. For example, if the interface defines a type 时间 (time) but the platform label is named 时间日期 (date/time), you need to create a mapping between them. Create all required mappings.
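
Conceptually, the mapping step is a lookup from the type returned by the interface to the label defined in the platform. The sketch below only illustrates that idea with the 时间 example from above; apply_label_mapping is a hypothetical helper and says nothing about how Doccano stores the mapping internally.

```python
def apply_label_mapping(spans, mapping):
    """Rename each span's label via mapping; labels without an entry are kept as-is."""
    return [
        {**span, "label": mapping.get(span["label"], span["label"])}
        for span in spans
    ]


spans = [
    {"start_offset": 0, "end_offset": 2, "label": "时间"},
    {"start_offset": 5, "end_offset": 16, "label": "Phone"},
]
mapped = apply_label_mapping(spans, {"时间": "时间日期"})
```

Here the first span ends up with the platform label 时间日期, while Phone is left unchanged because it has no mapping entry.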

Finally, click Test to check the mappings, then click Finish. Setup is complete.

Adding Annotator Users

Access the Django admin interface at <your-ip>:<annotation-service-port>/admin/, for example: 111.222.33.44:1234/admin/

In the admin panel, click Add under Users to create annotator accounts.