Choosing file formats
Choosing open file formats is important for data archiving and long-term preservation. It is also worth deciding which formats you will use before you start collecting data: your choice will significantly affect whether you can access your files at a later date, and whether future users can access the data in the long term.
Ideally you should use formats that allow for the long-term preservation and accessibility of your data, but your choice will most likely be determined by a range of factors. These might include...
Factors to consider
- Which formats are best suited for data collection and analysis?
- Which formats have you and your colleagues used in the past?
- Are there any discipline-specific standards or norms?
but also...
- Is the format open or proprietary?
- Does the format allow for the long-term preservation and reuse of the data?
- Is there a risk of file format obsolescence?
- Is the format suitable for conversion?
- Does your chosen repository or data centre place any restrictions on the types of file formats they will accept?
Which file formats are suitable for long term preservation and future reuse?
Open versus closed, proprietary formats
Where possible you should store your data in open or standard rather than closed or proprietary file formats.
With most proprietary formats the specification is privately owned and subject to restrictions due to copyright or other intellectual property rights.
With open formats, however, the specification is publicly available and free for anyone to use. This allows others to develop software that can access the files, and reduces the risk that files become unreadable when particular software or hardware becomes obsolete.
Open formats are also more likely to be backwards compatible with previous versions and well supported by a user community over a longer period of time.
Examples of open formats include CSV, XML, JPEG 2000, OpenDocument Format (ODF), tar and ZIP.
Some proprietary formats have become de facto standards and are also supported by open documentation (e.g. PDF, TIFF), or are widely used and likely to be around for a long time (e.g. the formats used by SPSS and MS Office applications), and are therefore considered acceptable for short-term preservation.
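As a small illustration of why open specifications matter, the sketch below writes and reads a ZIP archive containing a CSV file using only Python's standard library, with no vendor software involved. The file name `readings.csv` and its contents are made up for the example.

```python
import csv
import io
import zipfile

# Build a small ZIP archive in memory holding one CSV file (hypothetical data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("readings.csv", "site,reading\nA1,4.20\nB2,3.75\n")

# Read the data back. Because both formats are openly documented,
# general-purpose libraries can parse them without the original application.
with zipfile.ZipFile(buffer) as archive:
    with archive.open("readings.csv") as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        rows = list(csv.DictReader(text))

for row in rows:
    print(row["site"], row["reading"])
```

The same files could equally be opened in a spreadsheet, a statistics package or another programming language, which is the practical benefit of an open format.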
Lossless versus lossy compression
It is generally best to avoid “lossy” compressed file formats. Lossy compression allows smaller file sizes, but it achieves this by permanently discarding some information, which may turn out to be important. Lossless compression (e.g. PNG, GIF and ZIP files) produces larger files, but every bit of the original data is restored when the file is uncompressed. For image files, TIFF is typically lossless and JPEG is lossy; for audio files, WAV is typically uncompressed and therefore lossless, while MP3 is lossy. For important files, best practice is to keep a master copy in a lossless format.
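The difference can be sketched in a few lines of Python. The lossless half uses the standard `zlib` module (the DEFLATE method also used in ZIP and PNG). The lossy half is only a toy stand-in: `lossy_quantize` is a hypothetical helper that rounds values down to a coarser step, which is not how JPEG or MP3 work internally but shows the same consequence, namely that the original values cannot be recovered.

```python
import zlib


def lossless_round_trip(data: bytes) -> bytes:
    """Compress and decompress with zlib; the result equals the input."""
    compressed = zlib.compress(data, level=9)
    return zlib.decompress(compressed)


def lossy_quantize(samples: list[int], step: int) -> list[int]:
    """Toy lossy step: round each sample down to a multiple of `step`."""
    return [(s // step) * step for s in samples]


# Lossless: every byte of the original is restored.
original = bytes(range(256)) * 64
restored = lossless_round_trip(original)
print(restored == original)

# Lossy: the fine detail is gone and cannot be reconstructed afterwards.
samples = [3, 14, 15, 92, 65]
quantized = lossy_quantize(samples, step=10)
print(quantized)
```

Here 14 and 15 both become 10 after quantization, so no later processing can tell them apart again. This is why a lossless master copy is worth keeping.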
Recommended file formats:
Data conversion and format migration
Sometimes using proprietary or discipline-specific formats is unavoidable, particularly if they are widely used within a discipline or research community (e.g. crystallographic information files (CIF)). The formats you need to collect and analyse your data may not be the most suitable for preserving it beyond the lifetime of your research project. Where possible you should consider converting your data to open formats to allow long-term preservation and access. However, when converting data files from one format to another, always check for errors or loss of information, such as missing fonts, headers and footers, footnotes or hyperlinks, or reduced image resolution, sound quality or colour fidelity. Also, some open formats lack the functionality and formatting of proprietary formats, so it can be a good idea to retain copies of important data in their original format while also using open formats to enable future access and sharing.
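One way to check a conversion is to read both the original and the converted file back and compare their contents, and to record a checksum of the original so that the retained copy can be verified later. The following is a minimal sketch under stated assumptions: it converts a tab-delimited table to CSV, the sample data is invented, and the helper names (`sha256_of`, `convert_tsv_to_csv`, `same_table`) are hypothetical, not part of any repository's tooling.

```python
import csv
import hashlib
import io


def sha256_of(data: bytes) -> str:
    """Return a checksum that can be stored alongside the original file."""
    return hashlib.sha256(data).hexdigest()


def convert_tsv_to_csv(tsv_text: str) -> str:
    """Convert tab-delimited text to comma-separated text."""
    rows = csv.reader(io.StringIO(tsv_text, newline=""), delimiter="\t")
    out = io.StringIO(newline="")
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


def same_table(tsv_text: str, csv_text: str) -> bool:
    """Check that both files hold the same rows and cell values."""
    original = list(csv.reader(io.StringIO(tsv_text, newline=""), delimiter="\t"))
    converted = list(csv.reader(io.StringIO(csv_text, newline="")))
    return original == converted


# Invented sample data; note the comma inside a cell, which must survive.
sample_tsv = "site\treading\tnote\nA1\t4.20\tclear, calm\nB2\t3.75\t\n"
converted = convert_tsv_to_csv(sample_tsv)

print(sha256_of(sample_tsv.encode("utf-8")))
print(same_table(sample_tsv, converted))
```

A comparison like this only covers the cell values. Formatting, embedded objects and metadata would need separate, usually manual, checks.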
The UK Data Archive provides useful guidance and information on managing quality control for data collection, migration and transcription.