I'm trying to read the ports file from IANA. It is stored as UTF-8 without a BOM, but on one of the lines the readline() function fails with this error:

'charmap' codec can't decode byte 0x98 in position 7938: character maps to <undefined>

The line in the file looks like this:

# Jim Harlan <jimh&infowest.com>

What workaround could I use for this? Or is there a proper solution?

UPD

Deleting this line will do as a workaround (for some reason it is the only such line in the file), but only while debugging, because otherwise my partners will tear my hair out. Here is the code I use for this operation:

    try:
        file = open(path, 'r')
        while True:
            line = file.readline()
            if not line:
                break
            print(line)
    finally:
        file.close()

    4 answers

    Try using the standard-library codecs module:

     import codecs

     fileObj = codecs.open("someFilePath", "r", "utf_8_sig")
     text = fileObj.read()  # or read line by line
     fileObj.close()
    • Now the error shows up even earlier: 'charmap' codec can't encode characters in position 29-30: character maps to <undefined> - Dex
    • Added some corrections to the question. - Dex
    • For UTF-8 with a BOM you need to change the encoding in open() to "utf_8_sig" - rnd_d
    • I'd say it's 50/50. The problem with the first file was solved by deleting the offending line. The new file is in a different format. So perhaps you are right and it was just a fluke. But you get an upvote anyway :) - Dex
    • Do not use codecs, which may not work correctly with universal newlines mode. Use io.open() instead. - jfs
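The difference between the "utf-8" and "utf-8-sig" codec names discussed in these comments can be sketched as follows. This is only an illustration: the temporary file and the names sample_path, with_bom and without_bom are made up for the example and are not from the thread.

```python
import codecs
import io
import os
import tempfile

# Hypothetical sample file: a UTF-8 BOM followed by one line of text.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + u"# ports\n".encode("utf-8"))

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
with io.open(sample_path, encoding="utf-8") as f:
    with_bom = f.read()

# "utf-8-sig" strips a leading BOM if one is present.
with io.open(sample_path, encoding="utf-8-sig") as f:
    without_bom = f.read()

os.remove(sample_path)

print(repr(with_bom))
print(repr(without_bom))
```

Since "utf-8-sig" also reads files that have no BOM, it should be a safe choice when it is unclear whether the file carries one.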

    To read a UTF-8 encoded text file in Python, you can use the io.open() function, which is available as the built-in open() in Python 3:

     #!/usr/bin/env python
     import io

     with io.open(path, encoding='utf-8') as file:
         for line in file:
             process(line)
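As a rough illustration of why the explicit encoding matters: without it, open() falls back to the locale's default encoding. The sketch below assumes that default was cp1251, which the thread never states; it is a guess based on byte 0x98 being undefined in that code page. The sample text is made up.

```python
# The left single quote (U+2018) is encoded in UTF-8 as the bytes E2 80 98.
raw = u"# \u2018quoted\u2019\n".encode("utf-8")

try:
    # Assumed legacy code page (not confirmed by the thread).
    raw.decode("cp1251")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    bad_byte = raw[exc.start:exc.end]  # the byte the codec could not map
    print(exc)

# Decoding the same bytes as UTF-8 works.
decoded = raw.decode("utf-8")
```

Under that assumption the failure is the same kind of 'charmap' error as in the question, while the UTF-8 decode succeeds.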

    If the file may contain decoding errors (the encoding itself is correct, but a few bytes are invalid), you can pass an error handler such as errors='ignore' (or another value, depending on the situation).
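A minimal sketch of how two of the standard error handlers behave, assuming a file that is valid UTF-8 except for one stray byte. The sample content and the names sample_path, ignored and replaced are hypothetical.

```python
import io
import os
import tempfile

# Hypothetical sample: UTF-8 text with one invalid byte (0x98) in the middle.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"# Jim \x98Harlan\n")

# errors="ignore" silently drops bytes that cannot be decoded.
with io.open(sample_path, encoding="utf-8", errors="ignore") as f:
    ignored = f.read()

# errors="replace" substitutes U+FFFD for each undecodable byte.
with io.open(sample_path, encoding="utf-8", errors="replace") as f:
    replaced = f.read()

os.remove(sample_path)

print(repr(ignored))
print(repr(replaced))
```

With 'replace' the damaged spots stay visible in the output, which may be preferable to 'ignore' while debugging.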

    Do not use codecs, which may not work correctly with universal newlines mode.
    You do not need to change your code page to cp65001 to read a UTF-8 file.
    If you want to print Unicode to the Windows console, see "How can I output a Unicode string to a Windows console from Python?"

       file = codecs.open(path, encoding='utf-8', mode='r') 
      • I already tried that, it did not work - Dex
      • 'utf-8', not 'utf-8-sig' - Ali
      • I tried that. There was an earlier answer with utf-8. - Dex

      I kept running into this error, time after time. I found the solution here.

       import codecs

       file = codecs.open("yourFile", "r", "utf-8")
       data = file.read()
       file.close()
      • Run chcp 65001 in the command line

      These simple steps solved the problem.