Twitter Name Embeddings Dataset

Your name reveals a lot about you: your gender, ethnicity and so on. It has been shown that name embeddings are more effective in representing names than traditional substring features.

These name embeddings are essentially 100-dimensional vectors that encode demographic signals including gender, ethnicity, nationality and so on. They are often used as features for downstream tasks. NamePrism, for example, uses name embeddings to predict nationalities, and its open API is often used by the social science research community to study racial discrimination.
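As a rough illustration of how such vectors might serve as features, the sketch below looks up a first name and a last name in an embedding table and concatenates the two vectors into one feature vector. Everything here is an assumption made for illustration: the table is filled with random stand-in values rather than real embeddings, the function names are hypothetical, and concatenation is just one simple way to combine name parts, not necessarily the approach used by the authors or by NamePrism.

```python
import math
import random

DIM = 100  # dimensionality stated in the dataset description


def toy_embeddings(names, dim=DIM, seed=0):
    """Build random stand-in vectors; the real dataset would supply these."""
    rng = random.Random(seed)
    return {name: [rng.gauss(0.0, 1.0) for _ in range(dim)] for name in names}


def full_name_features(first, last, table, dim=DIM):
    """Concatenate first- and last-name vectors; unknown parts become zeros."""
    zero = [0.0] * dim
    return table.get(first.lower(), zero) + table.get(last.lower(), zero)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical lookup table keyed by lowercase name part.
table = toy_embeddings(["maria", "garcia", "john", "smith"])

features = full_name_features("Maria", "Garcia", table)
print(len(features))  # 200: two 100-dimensional parts concatenated
```

The resulting 200-dimensional vector could then be passed to any standard classifier. Mapping unknown name parts to zero vectors is a simple fallback chosen for this sketch, not a recommendation from the dataset authors.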

This Twitter embedding dataset is trained on large-scale public Twitter data and contains three million embeddings for unique name parts (first or last names). Its performance was tested against existing (but private) email name embeddings. It outperforms the email embeddings in encoding gender signals while achieving comparable F1 scores in nationality prediction (more details).

We are excited to share this dataset with the research community. However, its use is strictly limited to non-commercial research projects. Given the nature of the dataset, the request process involves additional scrutiny.

To obtain this dataset, please follow the steps below:

Requesting Dataset:

Step 1: Read and complete the Dataset Request Form;

Step 2: Send an email from your official account to Prof. Steven Skiena and Dr. Junting Ye to get a valid Dataset Request ID;

Step 3: Return the Term of Use Form via email to get a copy of the dataset. The form must be signed by an individual holding a long-term research position, e.g. a professor or equivalent.

Please note that each copy of the embeddings will be individually "watermarked" to prevent unauthorized distribution.

Contact:

Prof. Steven Skiena: skiena at cs dot stonybrook dot edu;
Dr. Junting Ye: juyye at cs dot stonybrook dot edu;

Citation:

The Secret Lives of Names? Name Embeddings from Social Media
Junting Ye, Steven Skiena.
ACM SIGKDD, Anchorage, Alaska, Aug. 2019.