    import numpy as np
    import codecs

    # Read the corpus: one text fragment per line
    fileObj = codecs.open("fragments.txt", "r", "utf_8_sig")
    text = fileObj.read()
    fileObj.close()
    strings = text.split('\n')

    # Character -> integer code mapping saved earlier with np.save
    char_dictionary = np.load('bag_of_characters.npy').item()

    # Codes for the five special symbols that follow the dictionary
    START_SYM = 165
    END_SYM = 166
    UNK = 167
    SPACE_SYM = 168
    CHARACTER_SYM = 169

    train_size = len(strings)
    dict_len = len(char_dictionary) + 5  # dictionary plus the 5 special symbols
    MAX_LENGTH = 130                     # longest line, in characters

    def char_to_code(char):
        # Fall back to UNK for characters missing from the dictionary
        if char in char_dictionary:
            return char_dictionary[char]
        return UNK

    # Dense one-hot tensor: (line, position, character code)
    data = np.zeros(shape=(train_size, MAX_LENGTH, dict_len), dtype='float32')
    for i in range(train_size):
        for pos, char in enumerate(strings[i]):
            data[i, pos, char_to_code(char)] = 1

I am trying to convert strings into a numpy array using one-hot encoding at the character level. The array holds 1.6 million lines; when I try to convert more than roughly 200k of them, I get a MemoryError. My question: is this the physical limit of my PC (8 GB of RAM), or is there a way around this error? Thanks.
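For scale, back-of-the-envelope arithmetic on the shapes in the code above shows why roughly 200k lines fail; the dict_len value here is an assumption (about 165 dictionary characters plus the 5 special symbols, since START_SYM is 165):

    # Approximate size of the dense one-hot tensor at the point of failure
    train_size = 200000  # lines being converted
    MAX_LENGTH = 130     # positions per line
    dict_len = 170       # assumed: ~165 dictionary chars + 5 special symbols
    bytes_per_cell = 4   # float32

    print(train_size * MAX_LENGTH * dict_len * bytes_per_cell / 1024.0 ** 3)
    # ~16.5 GiB -- well beyond 8 GB of RAM, so the MemoryError is expected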

  • If it is Python 2, try replacing range(train_size) with xrange(train_size) - Enikeyschik
  • Can you explain why you are doing "one-hot-encoding" at the level of single characters? What does it give you? - MaxU
  • If you google "seq2seq character-level": to put it simply, I'm interested in sequences of characters - Aliaksandr Nazarau
  • @AliaksandrNazarau, I've added that option ... - MaxU

1 answer

Why reinvent the wheel? Use sklearn.feature_extraction.text.CountVectorizer:

Example:

    from sklearn.feature_extraction.text import CountVectorizer
    from nltk import sent_tokenize

    # Sample data: the text of the question itself (in Russian)
    text = """Пытаюсь преобразовать стоки в numpy массив с помощью one-hot-encoding на character уровне. В массиве 1.6кк строк при попытке преобразовать более +-200к строк выдает memory error. У меня вопрос в том, это физический предел моего ПК(8 гб) или можно как то обойти эту ошибку? Спасибо."""

    # Split into sentences and vectorize (word-level by default)
    sents = sent_tokenize(text)
    vect = CountVectorizer()
    X_vect = vect.fit_transform(sents)

Result:

    In [6]: X_vect.A
    Out[6]:
    array([[0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
           [1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 2, 1, 1, 0, 1, 1, 1],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]],
          dtype=int64)

    In [7]: vect.get_feature_names()
    Out[7]:
    ['200к', '6кк', 'character', 'encoding', 'error', 'hot', 'memory', 'numpy',
     'one', 'более', 'вопрос', 'выдает', 'гб', 'или', 'как', 'массив', 'массиве',
     'меня', 'моего', 'можно', 'на', 'обойти', 'ошибку', 'пк', 'помощью',
     'попытке', 'предел', 'преобразовать', 'при', 'пытаюсь', 'спасибо', 'стоки',
     'строк', 'то', 'том', 'уровне', 'физический', 'это', 'эту']

PS: CountVectorizer returns a sparse matrix by default; it usually takes several orders of magnitude less memory than a regular dense NumPy array.

    In [9]: X_vect
    Out[9]:
    <3x39 sparse matrix of type '<class 'numpy.int64'>'
        with 40 stored elements in Compressed Sparse Row format>

    In [10]: type(X_vect)
    Out[10]: scipy.sparse.csr.csr_matrix
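To make that claim measurable, the CSR buffers can be compared against a dense copy; a quick check using the X_vect built above (the gap is small on this tiny example, but it grows with the number of rows):

    # Bytes actually stored by the CSR matrix vs. its dense equivalent
    sparse_bytes = X_vect.data.nbytes + X_vect.indices.nbytes + X_vect.indptr.nbytes
    dense_bytes = X_vect.toarray().nbytes

    print(sparse_bytes)  # ~500 bytes: only the 40 nonzero entries are stored
    print(dense_bytes)   # 3 * 39 * 8 = 936 bytes; scales as rows * vocabulary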

UPDATE: one-hot encoding at the level of single characters:

    In [22]: vect = CountVectorizer(token_pattern=r'(?u)[\w\d]{1}')

    In [23]: X_vect = vect.fit_transform(sents)

    In [24]: ''.join(vect.get_feature_names())
    Out[24]: '01268acdeghimnoprtuyабвгдежзийклмнопрстуфчшщыьэюя'

    In [25]: X_vect
    Out[25]:
    <3x49 sparse matrix of type '<class 'numpy.int64'>'
        with 77 stored elements in Compressed Sparse Row format>
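As an aside, CountVectorizer also has a built-in character analyzer that achieves the same effect without a custom token_pattern; a small sketch (note that, unlike the pattern above, it also counts spaces and punctuation):

    # analyzer='char' tokenizes directly into single characters
    vect_char = CountVectorizer(analyzer='char')
    X_char = vect_char.fit_transform(sents)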
  • Thanks, a good option, but I'm not sure a sparse matrix is suitable for training an RNN. At least the Keras tutorials used one-hot encoding - Aliaksandr Nazarau
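On the sparse-matrix concern in that last comment: Keras layers do expect dense input, but the matrix can be densified one batch at a time, so the full dense array never has to exist in memory. A minimal sketch, where batch_size, y, and the model are hypothetical:

    def batch_generator(X_sparse, y, batch_size=128):
        # Yield dense batches, densifying only batch_size rows at a time
        n = X_sparse.shape[0]
        while True:
            for start in range(0, n, batch_size):
                stop = min(start + batch_size, n)
                yield X_sparse[start:stop].toarray(), y[start:stop]

    # Hypothetical usage with a compiled Keras model:
    # model.fit_generator(batch_generator(X_vect, y), steps_per_epoch=..., epochs=...)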