THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Jamie Thomson

This is the blog of Jamie Thomson, a freelance data mangler in London

Why you shouldn't open .csv files in Excel

  1. Open up Notepad
  2. Type the following:
  3. a,"This is some text"
    b,This is some more text

  4. Save the file as "test.csv" 
  5. Open test.csv in Excel
  6. Without changing anything hit CTRL-S
  7. When prompted to keep the workbook in csv format say 'yes'
  8. Close Excel
  9. Open test.csv in Notepad

Are the contents of the file the same as what you originally wrote? No? That's why you shouldn't open .csv files in Excel!

UPDATE, 16th Feb 2012. Today I have been investigating a problem with a file that caused errors during loading. After about an hour of pouring over a SSIS package on a wild goose chase the business analyst sent me this email:

<business user> and I got it to work
He was opening the file in Excel - which hid the fact that it had lots of commas in a row after the data. He looked at it in Notepad, removed the commas, and all was well

 Need I say more!!

Published Wednesday, December 02, 2009 2:55 PM by jamiet
Filed under: ,

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Cliff said:

I disagree. This is why you shouldn't EDIT csv files in excel. Excel works just fine for viewing the files, and is a good tool for reviewing files. Just don't change anything.

December 2, 2009 9:49 AM
 

Liam Westley said:

It can get even worse. In Excel 2000 days (and may still occur today), it was possible to load in a CSV file, make NO edits, save it as CSV and then when you tried to reload it into Excel it would fail with invalid format errors.

That's right, the specification that Excel used when SAVING CSV files is actually different to the specification used when LOADING CSV files.

I can't even start to explain my reaction when I found this out the first time.

December 2, 2009 9:53 AM
 

dewitte said:

Yep - I OPEN them, but I don't SAVE them.  I think it's a great tool to review a CSV file, but really quirky sometimes when it comes to re-saving.

December 2, 2009 9:54 AM
 

Greg Edwards said:

Jamie,

I see the issue. Excel drops the text delimiters (guotes) from the first record. But I'm not sure exactly why it's an problem.

Excel removes the text delimiters when the file is opened, rather than when it's saved. When you save the file, a warning is displayed to let you know that if you choose to keep the CSV file format, some contents may be lost. Seems like Excel is providing lots of disclosure here.

It's also worth noting that Excel removes the text delimiters because they aren't needed. Text delimiters in a CSV file are typically used when there are symbols in the value that are also used as field delimiters (i.e., a comma). For instance, when I changed the first record in the sample CSV file to read

a,"This is some really, really, really long text"

and then attempted the experiement, Excel saved the text delimiters exactly as written, because they're important to maintaining the structure of the CSV file.

Is there a reason you'd need to keep the text delimiters in your original example?

December 2, 2009 9:57 AM
 

jamiet said:

Cliff,

I'm of the opinion that you probably shouldn't run the risk (I admit to having hit CTRL-S through habit in my time when I really really shouldn't have done).

Be safe. Use a text editor! :)

December 2, 2009 10:09 AM
 

jamiet said:

"Is there a reason you'd need to keep the text delimiters in your original example?"

Good god yes. The reason being that that file may be destined to be consumed by some other process and if you've already opened the file in Excel and saved it you've put it into an unexpected state.

For example, I've built a process using SSIS that consumes such csv files that are provided by partner companies. My SSIS package will be expecting the files in a certain format, if you change that format then you run the risk that SSIS won't be able to load them (this is not a limitation of SSIS by the way before someone throwws that little chestnut at me :).

Fundamentally, if you open a file then close it (saving on the way) without making any changes then there should be no changes to the file. The point is, Excel changes the file without you knowing about it.

So, don't run the risk, use a text editor.

December 2, 2009 10:12 AM
 

jamiet said:

I really didn't expect this one to create a debate :)

No matter, if it informs people that there's a problem then its done its job!

December 2, 2009 10:14 AM
 

Adam Machanic said:

There are worse problems than this, due to how Excel does type inference. But the easy work around is to not "Open" the file, but rather "Import" it. Then you have options for configuring correct data types, etc.

December 2, 2009 10:43 AM
 

Owen R said:

Additionally you are moving between having text delimiters in field 1 and not having them in field b, that's a much bigger problem when trying to load a raw file than anything Excel may have "mangled"

This sounsd like a red herring to me.

December 2, 2009 10:57 AM
 

Wes W. said:

Hey Cliff, what happens to your CSV date fields when you open them in Excel?  The answer, "Lord only knows"!

Occasionally I'll use Excel to view CSV files when I need to have it easily sorted into columns or rows, otherwise, Notepad2 gets the most use.

December 2, 2009 11:02 AM
 

Owen R said:

wow, my typing is almost as bad as that example

more coffee please.

December 2, 2009 11:03 AM
 

Alexander Kuznetsov said:

A user typed CNA, which is a big insurance company, and it was automatically changed to Can - someone decided that they know better than the user what the user intends to type, and that someone was wrong. Also there was some time when I was figuring out why my automated process that was opening and parsing xls files was hanging. When I started that hanging process myself, and when that process opened and xls file, I got a pop up window advertising something related to MS, like Live Search or something - someone decided that that advertisement was more important than our work. Folks got rather annoyed and never even tried the stuff which was advertised so aggressively.

December 2, 2009 11:24 AM
 

jamiet said:

Owen,

Bad example you say?

Simple test, wrap the text in both lines in text delimiters and see if the same thing happens. (I'll save you the 20-odd seconds that it'll take you - it does).

So, not a red herring. Again, bad example you say?

-Jamie

December 2, 2009 11:24 AM
 

Jeff Parker said:

Well, your all right and you are all wrong.

There is NO specification on how a csv file, csv editor, or csv reader should work. If you do not like how Excel handles CSV Files then don't use it. In excel the " double quote is a qualifier for text that contains a comma, so for example if you replaced you example with

a,"This is some text"

b,This is some more text

c,"This is more text, it really is"

the c would then have in the second column

This is more text, it really is

If you didn't have the quotes on there c would have 3 columns now.

Then when you saved this back out to CSV in excel your quotes would be back on there. This, is how excel handles this and why. There are options when you import a CSV that allow you to override this, this double quote is the text qualifier, change it when importing it if you want some different behavior and even specify what type each field is. For example one of the things we do is used CSV exports of part numbers, a part number might be 0030040056 and that is a valid text field, however if you just import into excel it thinks it is a number and then you get only 30040056 so you have to tell excel this column is text before you import it.

Excel is quite a good tool for working with CSV files. Just most people don't understand there is no set specification for what a CSV file is and how it is structured.

If you want everything specified and structures exactly the same everywhere and automatically understands types then don't use CSV. Use XML, it has a standard and a specification and everything is handled consistently.

December 2, 2009 11:46 AM
 

Stephen Horne said:

I have run into this problem of Excel destructively manipulating CSV files. The truncation of leading zeros on numbers was probably the most frustrating event.

I am having difficulty locating the information on the Microsoft web site, but I recall that when opening a CSV file, Excel looks at the first 50 rows and does a best guess on the appropriate data type of a particular column. This is why the leading zeros get truncated from numeric-looking data.

My recommendation is to avoid the use of CSVs in any workflow where users of any kind could open to edit data and/or open to confirm data. In my opinion it is far too easy to open to edit or confirm a CSV file in Excel rather than a text editor. Any processes downstream after an Excel edit could either have invalid inputs or cause the process to break. Any open to confirm data in Excel can yield the wrong view of the data contained within the CSV, plus the tendency to want to edit and save the CSV in Excel is great and should be avoided.

Jeff Parker's recommendation of using XML is a very good one. For earlier versions of Excel, the *.xls file extension is actually several different document type formats, including binary and most importantly xml. The xml-flavored *.xls file allows for the defining of column data types and additional formatting. For Excel 2007 you can get more information here: http://msdn.microsoft.com/en-us/library/aa338205.aspx

Bottom Line Suggestion: If you must use CSVs (XML or fixed length is not an option) in some workflow then I strongly recommend *never* opening them with Excel. If you must allow for user editing/confirming of the CSVs then I recommend not allowing access through the file system (and thus Excel) but rather through a WinForms or ASP.NET application or another abstraction.

December 2, 2009 1:14 PM
 

Owen R said:

Jeff -

My point being, if I was dealing with a raw file input with the schema:

a, hello

b, "HELLO"

c, "goodbye"

I'd be more concerned with why my input is mangled BEFORE I inputted it into Excel or any other app. Blaming Excel for unfollowed input schemas is in fact a specious argument.

We are in total agreement with the summation of your findings when you say:

"Excel is quite a good tool for working with CSV files. Just most people don't understand there is no set specification for what a CSV file is and how it is structured.

If you want everything specified and structures exactly the same everywhere and automatically understands types then don't use CSV. Use XML, it has a standard and a specification and everything is handled consistently"

Excel may not be perfect but, if you give it broken input, expect broken output.

December 2, 2009 1:32 PM
 

jamiet said:

Blimey! I didn't expect this much activity on this post. Thanks one and all for your contributions!.

Owen/Jeff,

I quite agree, XML is a great method for passing around structured data.

By way of adding to the debate (and not because I agree with it) I'll point anyone reading this to Don Peterson's diatribe about XML from back in 2003 "Is XML the answer" (http://www.sqlservercentral.com/articles/Miscellaneous/isxmltheanswer/1147/) - a good, reasoned article. We can of course argue about its relevance given its age but nonetheless, worth a read I think!

-Jamie

December 2, 2009 2:11 PM
 

Cliff said:

Jamie,

 True, but if you have alot of records in your csv, a text editor could be a rough time ahead. What I tend to do, is make a copy of the file in excel as a .xls file (or .xslx in 2007). It's habit for me at this point and prevents issues of messing up the format while I investigate. It also allows me to highlight rows of interest when doing data analysis. Right tool for the right job is different for each of us.

Cliff

December 2, 2009 3:02 PM
 

Cliff said:

Wes,

  Microsoft spins a wheel of corruption and spits out whatever format it lands on. =)

Point taken.

Cliff

December 2, 2009 3:08 PM
 

Siddharth said:

Excel does not display leading zeros in numbers. This can be a big misguiding factor at times.

Also I am amazed to see such a heavy discussion on this kind of topic :) Jamie, may be you should convert your blog into a forum :)

December 2, 2009 6:57 PM
 

Siddharth said:

Excel does not display leading zeros in numbers. This can be a big misguiding factor at times.

Also I am amazed to see such a heavy discussion on this kind of topic :) Jamie, may be you should convert your blog into a forum :)

December 2, 2009 7:00 PM
 

Siddharth said:

Excel does not display leading zeros in numbers. This can be a big misguiding factor at times.

Also I am amazed to see such a heavy discussion on this kind of topic :) Jamie, may be you should convert your blog into a forum :)

December 2, 2009 7:00 PM
 

Andy said:

Excel has a few issues that I've seen when getting the data into SSIS, some are technical some are users. It's great because it's flexible but this is also one of it's issues when processing data

It's not WSISG.

It's not relational.

It supports mixed data in columns and then provides meta data based on a sample of the first few rows.

It can mangle dates, particularly when switching between different regional settings.

Formatting can make a 3 row sheet look like a 16000 row sheet.

Users like to put extra rows with comments or messages at the top.

Worksheets get moved about and relabelled.

Users put important information into comments or have significance with regards position or colour of cells such as creating calendar data in excel.

The one thing that I've seen SSIS excel with (pun intended) is converting a pivot chart back to relational data. I unsed an unpivot followed by a conditional split and a merge join.

December 3, 2009 4:23 AM
 

merrillaldrich said:

This is but the tip of the "Why Excel is horrible for data" iceberg. Seriously. It's horrible, aweful in at least 10 ways when put to use by users who are trying to manipulate tables of data rather than spreadsheets. But that's what about half of business users do...

December 3, 2009 2:12 PM
 

Vis Naicker said:

This is when it gets really frutrating.

I have a column with text like "c:\filesJan10"

It is excel proper, so I could format the column to the type of my liking, like Text "Text format cells are treated as text even when a number is in the cell. The cell is displayed exactly as entered."

Lets replace c:\files with blank ''. Excel will change the format to date format. But I explicitly told it the column was text.

This bites me even with a proper excel spreadsheet, not just when using a data table. I can't cut and paste from the same spreadsheet/workbook without excel trying to reformat - ignoring my formatting I just specified.. Where is the freakin' "ON" switch so I can switch this behaviour off?

January 5, 2010 4:40 AM
 

secondpresident said:

Hmm...  Wish I had found this earlier.  :-(

There is a specification for CSV.  It's called plain text.  Text is text.  MS should not be trying to treat it as anything other than text.  Even when you tell Excel all the fields are text fields it still ends up modifying some of the values.

I agree with the last person.  Where is the bloody off switch.  That's all that's needed.  Treat everything as text and all is well.  Adding/removing quotes around a field is not a big deal because good CSV parsers can handle that.  If in doubt, look at the source version of PHP's.  It's a simple nested loop.  But, stripping leading zeroes, changing date formats, etc., is a BIG problem.

This is a bug in Excel.

March 26, 2010 6:22 PM
 

Anil Patil said:

Vis,

Once you format a column or entire sheet as Text, excel preserves that setting and works well with the example you have mentioned here.

Also while copy & pasting such content, if you want to preserve the original formatting use Paste Special and select 'All using source theme'. This is make sure the original content and formatting is transferred to new cells.

April 14, 2011 5:51 PM
 

Rajesh Rampersad said:

I agree. It's just easier to use a CSV editor - much less trouble than excel, and most of the editors are free (or with a trial period). I use Ron's Editor ( http://www.ronsplace.eu/RonsPlace/Products/RonsEditor.aspx ), it's a very good CSV editor and is perfect for my daily work. I work with files thta contain somnetimes up to 500000 data and it's still fast and powerful. I really recommend it.

July 16, 2012 4:22 PM
 

jamiet said:

Just encountered the "stripping of leading zeroes" that Siddharth mentioned above. Case closed as far as I'm concerned :)

September 5, 2012 8:04 AM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement