A BERT-powered bot that helps you identify users.
The goal of this project is to identify users who publish messages with the "SunXiaoChuan" pattern.
In November 2019, a new wave of a troll army named "Sunxiaochuan 258" reached Chinese-speaking Twitter users. Where they came from, how they are organized, and what their background is remain unknown. However, their language behavior is very similar, which made this a good opportunity to learn how to use NLP with deep learning to identify them.
The training set consists of 20,000 tweets from "Sun XiaoChuan" accounts and their follower network, and 20,000 tweets from normal Twitter users.
The crawler scripts are tools/fetch.py and tools/tweets.py.
Download Training Set: https://drive.google.com/file/d/1pM9Gp5QXIoLDKb9L4ertM0D8RZRjfVC7/view
Fine-Tuning the language model
- BERT-Base, Chinese
- BERT-Base, Chinese 82.4%
- Chinese-BERT-wwm 83.6%
Sunxiaochuan positives: result_samples/positive
Sunxiaochuan negatives: result_samples/negative
How to use
- Clone this repository
- Download the fine-tuned model: https://drive.google.com/uc?id=1DcvRmZceOewUiY-7gsqKuYQzn_xUHShN&export=download and unpack it into the model directory
- pip install -r requirements.txt, or pip install -r requirements-gpu.txt for GPU support
- cp config.json.sample config.json
- python server.py
- Open http://127.0.0.1:5002 in a web browser for the web UI.
- API: http://localhost:5002/api/iden?screen_name=a_screen_name
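The API endpoint above can also be called from a script. The following is a minimal sketch using only the Python standard library, assuming the server is running locally on port 5002 as in the steps above. The helper names `build_iden_url` and `identify` are hypothetical and not part of this repository. The response format is not documented here, so the sketch returns the raw response body without parsing it.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Address taken from the "How to use" steps; change it if the server runs elsewhere.
BASE_URL = "http://localhost:5002"


def build_iden_url(screen_name: str, base_url: str = BASE_URL) -> str:
    """Build the /api/iden URL for a Twitter screen name (hypothetical helper)."""
    query = urlencode({"screen_name": screen_name})
    return f"{base_url}/api/iden?{query}"


def identify(screen_name: str, timeout: float = 30.0) -> str:
    """Call the API and return the raw response body as text (hypothetical helper).

    The response schema is an unknown here, so nothing is parsed or assumed.
    """
    with urlopen(build_iden_url(screen_name), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With the server started via python server.py, calling `print(identify("a_screen_name"))` should print whatever the endpoint returns for that user.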