AG News 数据集是一款用于文本分类任务的数据集合,包含大约12万条新闻文章样本,涵盖4个主要类别。
AG News Topic Classification Dataset Version 3, Updated 09/09/2015
ORIGIN:
AG is a collection of more than one million news articles gathered from over two thousand sources by ComeToMyHead in more than one year. ComeToMyHead has been an academic news search engine since July 2004.
The dataset is provided for research purposes, such as data mining (clustering and classification), information retrieval (ranking and searching), XML processing, data compression, data streaming, and other non-commercial activities.
DESCRIPTION:
The AGs news topic classification dataset was created by selecting the four largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and there are a total of 7,600 testing samples.
The file classes.txt lists all classes corresponding to each label.
Files named train.csv and test.csv contain the training and test data respectively as comma-separated values. Each row in these files consists of three columns: class index (1 to 4), title, and description. The titles and descriptions are enclosed within double quotes (). Any internal double quote is represented by two consecutive double quotes () while new lines are denoted by a backslash followed with an n character (\n).