strife's devLog

300px-Python-logo-notext.svgConverting Large XLS, XLSX files to CSV using Python

I was in troubles to convert big xlsx files in PHP because there is no such a library that could make it as good as possible. What I’ve found out were libraries that could manage xls / xlsx files but it took a lot of time and performance was horrible.

So, not spending too much time to think about it I’ve figured out to leave PHP and try to make it in Python. What I wanted to achive was converting xls files into csv and handle it in PHP.

In Python we got xlrd library that enables us to make this “crazy” thing. Here it is: https://pypi.python.org/pypi/xlrd. You can install it using pip, such like:

But for my case it was better to use VirtualENV which is better solution when someday you’d like to move your software somewhere and you forgot about installing all those stuff. However, each way you chose would be fine till it’s working.

After installing required items all you need is to run the script that converts xls into csv. I have already written simple script that can make it. Please find code below and save it to your local system setting the proper permissions right (chmod +x filename).

convert_xls.py

If you have a python script created in your local filesystem the next step would be manage it from PHP. And this is most simplest solution. Using exec, system, or shell_exec solves the problem.

OR in command line.

I know that solution is not really clean but it works. To have it more elegant I would suggest to create some kind of service in Python that handles it in a good way. I will think about it …

Note: I have also fixed issue with Long numbers in XLS.

Update
If you need to parse big XLS, XLSX files I would suggest not to make in in PHP. Just parse this using JAVA where PHP would be used only for UI.

Any questions, let me know. Please find comment form below.

Share

8 ResponsesLeave one →

  1. This article is very interesting, but is hard to find in search engine.
    I found it on 16 spot. You can reach google top ten easily using one handy wordpress plugin and
    increase targeted traffic many times. Just search in google for:
    Akalatoru’s Rank Plugin

    Reply
  2. Someone

     /  February 22, 2016

    Dude, xrange is undefined!

    Reply
  3. Shehel

     /  August 23, 2016

    undefined because xrange is for python2. You probably have python3 which uses range

    Reply
  4. hossein1221

     /  September 10, 2016

    NameError: name ‘unicodes’ is not defined

    when use str in python3
    =>
    TypeError: ‘str’ does not support the buffer interface

    Reply
  5. willingness

     /  September 25, 2016

    Thx for this great script!

    After converting a lot xls files to csv, I now have encoding trouble with a particular file. The output is:

    *** No CODEPAGE record, no encoding_override: will use ‘ascii’
    ‘ascii’ codec can’t decode byte 0xc3 in position 16: ordinal not in range(128) Traceback (most recent call last):
    File “xls2csv.py”, line 20, in csv_from_excel
    wb = xlrd.open_workbook(xls)
    File “/Library/Python/2.7/site-packages/xlrd/__init__.py”, line 441, in open_workbook
    ragged_rows=ragged_rows,
    File “/Library/Python/2.7/site-packages/xlrd/book.py”, line 107, in open_workbook_xls
    bk.fake_globals_get_sheet()
    File “/Library/Python/2.7/site-packages/xlrd/book.py”, line 687, in fake_globals_get_sheet
    self.get_sheets()
    File “/Library/Python/2.7/site-packages/xlrd/book.py”, line 678, in get_sheets
    self.get_sheet(sheetno)
    File “/Library/Python/2.7/site-packages/xlrd/book.py”, line 669, in get_sheet
    sh.read(self)
    File “/Library/Python/2.7/site-packages/xlrd/sheet.py”, line 1356, in read
    strg = unpack_string(data, 7, bk.encoding or bk.derive_encoding(), lenlen=1)
    File “/Library/Python/2.7/site-packages/xlrd/biffh.py”, line 269, in unpack_string
    return unicode(data[pos:pos+nchars], encoding)
    UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xc3 in position 16: ordinal not in range(128)

    Anyone with an idea? :-)

    Reply
  6. I see you don’t monetize your page, don’t waste your traffic, you can earn extra bucks
    every month because you’ve got hi quality content. If you want to know how to make extra money, search for:
    Boorfe’s tips best adsense alternative

    Reply
  7. Shaul

     /  May 1, 2018

    Thank you
    It help me very much
    I improved the script to support multi sheets too:

    shaul@sm ~12:43:17$
    shaul@sm ~12:43:17$ cat Desktop/convert_xls.py
    #!/usr/bin/python
    # from http://strife.pl/2014/12/converting-large-xls-xlsx-files-to-csv-using-python/

    import logging
    import time
    import traceback
    import xlrd
    import csv
    import sys
    import re

    logging.basicConfig(level=logging.INFO, format=’%(asctime)s %(message)s’)

    def csv_from_excel(sheet_nu=0):
    global wr, wb
    start_time = time.time()
    sh = wb.sheet_by_index(sheet_nu)
    for row in xrange(sh.nrows):
    rowValues = sh.row_values(row)

    newValues = []
    for s in rowValues:
    if isinstance(s, unicode):
    strValue = (str(s.encode(“utf-8″)))
    else:
    strValue = (str(s))
    isInt = bool(re.match(“^([0-9]+)\.0$”, strValue))
    if isInt:
    strValue = int(float(strValue))
    else:
    isFloat = bool(re.match(“^([0-9]+)\.([0-9]+)$”, strValue))
    isLong = bool(re.match(“^([0-9]+)\.([0-9]+)e\+([0-9]+)$”, strValue))
    if isFloat:
    strValue = float(strValue)
    if isLong:
    strValue = int(float(strValue))
    newValues.append(strValue)
    wr.writerow(newValues)
    logging.info(“Finished in %s seconds, Sheet:%s”, time.time() – start_time, sheet_nu)

    if len(sys.argv) < 3:
    print "Usage: "+ sys.argv[0] +" XLS_file CSV_file"
    exit (1)

    xls = sys.argv[1]
    target = sys.argv[2]
    logging.info("Start converting: From '" + xls + "' to '" + target + "'. ")

    try:
    wb = xlrd.open_workbook(xls)
    csvFile = open(target, 'wb')
    wr = csv.writer(csvFile, quoting=csv.QUOTE_ALL)

    for idx in range(wb.nsheets):
    csv_from_excel(idx)
    csvFile.close()

    except Exception as e:
    print (str(e) + " " + traceback.format_exc())

    exit (0)

    shaul@sm ~12:43:19$
    shaul@sm ~12:43:19$

    Reply
  8. Peels help to eliminate from the greater part old amendments dermatological cover. Procedures effective at any time, recommended for skin of any species. After superficial exposure disappear minor wrinkles, skin becomes smooth, without wrinkles. Regeneration – 3 days. deeper peeling well struggles wrinkles (expression, old, smoothing minor scars, eliminates age spots.
    Recovery – 7 days. Deep peeling conducting in the beauty medical facilities. This is most effective procedure, although requires long period regeneration – about thirty days.
    chemical skin peel before and after

    Reply

Leave a Reply