
So, it has been a year and a half since my last post. Even though I did update my page to be a blog from the root, shame on me.

This blog post, however, is not related to what I did in the previous ones. I promise someday I will continue with my Python to Scala tutorials, but for now you’ll have to settle for this.

Since I am a PhD student in Natural Language Processing and a native speaker of Spanish, I like to do my research in this language. The problem is that Spanish, unlike English, doesn’t have that many resources.

In the last year I have been working and researching in the fields of deep learning and word embeddings. The problem with word embeddings, especially those generated by neural network methods like word2vec, is that they require a great amount of unannotated data.
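
To give an idea of what that training looks like, here is a minimal sketch using gensim’s Word2Vec. The file path and hyperparameters are illustrative, not the exact ones used for this release, and the parameter names follow the current gensim 4.x API:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence streams a plain-text corpus (one sentence per line),
# so the whole corpus never has to fit in memory.
sentences = LineSentence("spanish_corpus.txt")  # hypothetical path

model = Word2Vec(
    sentences,
    vector_size=300,  # dimensionality of the embeddings
    window=5,         # context window size
    min_count=5,      # ignore words rarer than this
    workers=4,        # parallel training threads
)

# Save the resulting vectors in word2vec's binary format
model.wv.save_word2vec_format("spanish_vectors.bin", binary=True)
```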

Most of the work I have seen on creating Spanish word embeddings uses Wikipedia, which is a big corpus, but not that big. So I decided to contribute to the world of word embeddings by first releasing a corpus big enough to train some decent word embeddings, and then by releasing some embeddings trained on my own.

This is why I am now releasing the Spanish Billion Words Corpus and Embeddings, a resource for the Spanish language that offers a big corpus (of nearly 1.5 billion words) and a set of word vectors (or embeddings) trained on this corpus.
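
If you want to play with the embeddings, gensim makes loading them straightforward. A minimal sketch, assuming the vectors are in word2vec’s binary format (the file name here is hypothetical; check the release page for the actual files):

```python
from gensim.models import KeyedVectors

# Load the pre-trained vectors (hypothetical file name)
vectors = KeyedVectors.load_word2vec_format("SBW-vectors.bin", binary=True)

# Nearest neighbors in the embedding space
print(vectors.most_similar("reina", topn=5))

# Cosine similarity between two words
print(vectors.similarity("rey", "reina"))
```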

Feel free to use it, as it is released under a Creative Commons BY-SA license.
