Age and Gender Identification in Unbalanced Social Media

Diversity

Abstract:

People share large amounts of information on social media in the form of videos, news, photos, posts, likes, and so on. This information reflects the opinions, emotions, and preferences of the users. For example, Pinterest is a popular social network where users express their interests in the form of pins, which are information units consisting of a short text comment and an image. In this research, we study the problem of building a model that characterizes Pinterest users by two demographic variables, age and gender, using the textual information they post on the network. To that end, we introduce a dataset of the English texts from 548,761 pins belonging to 264 users. The dataset is imbalanced and reflects the actual distribution of gender and age in the social network, with a dominant presence of women over men and of middle-aged users over young users. With this dataset, we conducted experiments using a range of machine learning models, a variety of features, and a set of performance metrics. Our results provide insights into the problem.
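To illustrate why an imbalanced dataset calls for more than one performance metric, the following minimal sketch compares accuracy with macro-averaged F1 for a classifier that always predicts the majority class. The abstract does not name the metrics used in the study, so the choice of these two is an assumption, and the toy labels are hypothetical rather than drawn from the Pinterest dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels (not from the paper): 9 "female" users, 1 "male" user.
y_true = ["female"] * 9 + ["male"]
# A majority-class baseline that ignores the input entirely.
y_pred = ["female"] * 10

print(f"accuracy: {accuracy(y_true, y_pred):.3f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")
```

On these toy labels the baseline looks strong by accuracy while its macro F1 is much lower, because it never identifies the minority class. This is one reason a set of metrics is more informative than accuracy alone on data like this.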
