Data Migration: Import Data Using Bulk Upload

Praxent

11 years ago

Data Migration: Import Data Using Bulk Upload

By Tone Garot

What is a Bulk Uploader?

A Bulk Uploader is a software feature used to import any regular data (tabular data that fits into a grid/matrix) into a database system. For example, the data may be multiple “person” records, which will become user profiles in a system. Another example may be parts from a schematic to be uploaded into a bill of materials (BOM).

The bulk uploader concept is simple: transfer data, in bulk, into a database system.

Most spreadsheet applications can be used for bulk upload.

The format of the input data is generally Comma Separated Values (CSV), which can be created by spreadsheet software such as MS Excel, OpenOffice Calc, and the cloud-based Google Sheets. Each of these applications allows the entry of tabular data and the ability to export as a CSV file, usually with the file extension .csv.

CSV is the de facto standard for importing data from end users. As such, many programming languages have libraries of functions that meet the CSV specification. Programmers can easily write routines to read and manipulate CSV data, then store the values into a database.

Although a CSV file is a regular text file, it can be re-opened into a spreadsheet application to adjust or add values. Unlike an Excel spreadsheet, however, no formatting (no colors, no bold face, no italics) is saved in a .csv file.

Each row or line in a CSV file is associated with a single data record. In the case of uploading user profiles, each row will correspond to a single person. Each row is comprised of one or more fields separated with commas (thus the ‘C’ in the name CSV). Field values that are text or currency data are surrounded by double quotes to ensure that commas within the values are not misinterpreted as delimiters.

Why Have a Bulk Uploader?

Many software applications lend themselves to include a bulk upload feature. Data in a database is usually stored in tables of specified field names and multiple records. Thus, a bulk upload feature can easily populate the underlying database tables with tabular data. A spreadsheet application gives an easy interface to assemble such data.

Spreadsheet software is readily available and many people are already versed in the concept of building sheets of tabular data. They data may already be on-hand: for example, a roster of club memberships may already be tracked in a spreadsheet.

Save keystrokes, clicks, and mouse movements.

Thus, it may make sense to supply a means to import data into your business systems in bulk. Your end users can import hundreds or more data records rather than manually enter each person record (including name, address, phone, etc.) individually into a form. If a bulk uploader saves dozens of keystrokes, mouse clicks, and scrolling for a single person record, you can readily see the savings of time and hassle on hundreds of records.

Another use for bulk upload is during the transference of data from one system to another. Simply export the data from one system then import it into the other system. There is no need to re-enter hundreds of records manually.

Bulk Uploader Best Practices – A Developer’s Perspective

Over the years, I have written a number of bulk uploaders for various applications in different programming languages. Although the language and specific requirements may vary, there are some concepts that remain constant. First, and foremost, is that all bulk uploaders I have written use CSV, the de facto standard—even when that requires writing my own CSV library of handling functions.

I have come up with some practices that may help you determine what you need in a bulk uploader.

The Template

A bulk uploader should be understandable and simple to use. Therefore, I supply detailed instructions describing the upload process. Additionally, I supply a downloadable template with a “smart” header row. Here is an example:

"*First Name","*Last Name","*Date of Birth (MM/DD/YYYY)","Phone","*Email","Gender (M or F)"  "James","Morrison",08/05/2007,"928-555-1212","morrison@mailinator.com","M"

Required fields get an asterisk (*). Columns that expect a certain format, like the Date of Birth, specify that format in the header line. Columns that expect a specific value, like Gender, have those values explicitly stated.

The header line doesn’t need to look pretty—it needs to provide clarity.

A sample row of data with representative values is also helpful to take out the guesswork.

Flexible Data Pre-Parsing (Cleanup)

To create the best possible user experience, the developer should take pains to anticipate variations in data. Thus, the parsing code should be “smart” and flexible, and it should do data cleanup before validation occurs.

Before getting into specific examples, I will point out two concepts that the pre-parser does.

Trim any leading and trailing spaces
Case sensitive: Although the values you store may be case sensitive, the values you receive should not have to be.

The specific examples described below are not steadfast rules. They are based upon design decisions that should be flushed out by a conversation with the end customer.

Example 1: Dates

Consider a spreadsheet with “dates of birth.” In this scenario, consider that there are hundreds of people in the spreadsheet and, unfortunately, the person(s) who maintained the spreadsheet used multiple date formats. They might have used dashes or slashes or two digit years. This is one of the inherent problems of importing data—inconsistent data.

This is where some smart programming will alleviate some of the burden. Even though the template might state the format to be MM/DD/YYYY, it makes sense for the parsing routine to allow for alternate date formats.

There are, of course, limits. The spreadsheet may have been set up to take text values instead of date values, so someone may have entered the words “didn’t specify” instead of an actual date. There isn’t much the parser can do in such a case, so an error would need to be raised.

Also, certain assumptions must be made for dates. If your application is built in the USA, do you attempt to discern European date styles? What about UNIX timestamps? These are less likely, but valid concerns to consider.

Example 2: Expected Values (Lists)

Gender gets one of two values: “M” or “F.” Since I want to store a capital “M” for males, I will convert any of the following to that value:

M ; m ; male ; Male ; mAle ; etc.

Similarly, I will convert any “Yes” values from:

Y ; y ; yes ; Yes ; yEs ; etc.

On a particular application, I knew that most of the uploaded user profiles would come from the state of Texas. On the “add” form version, I simply used a pull-down with all fifty states to ensure data integrity. For the bulk uploader, I don’t have a pull-down option. So, I convert any of the following values to “TX”:

tx ; TX ; tX ; Texas ; texas ; tExas ; etc.

Example 3: Phone Numbers

Phone numbers can have many permutations. If I am collecting phone numbers predominantly for USA numbers, I first strip out all non-numeric characters. Then, if the result is ten digits, I reformat with dashes, e.g. 928-555-1212. If it is not ten digits, I might simply store whatever was supplied. This way, I capture phone numbers with extensions as well as out-of-country numbers such as +52 444 825 7172.

The phone number is a great example of where a conversation is necessary to ensure business needs and requirements are best met.

Example 4: Zipcodes

If I am considering primarily USA values, I first strip all non-numeric characters. If there are nine digits, I add the appropriate dash, e.g. 54313-0001; otherwise if there are five digits, I just store those.

If the value stripped of non-numeric characters is neither nine or five digits, I often store the
value as presented. It may just be a non-USA postal code.

Example 5: Currency

In a money field, allow the end user to supply a dollar sign ($) and commas. I came across a system recently that literally had this line in their instructions, in red, and bold:

Note that you do not want to use currency symbols like the $ sign as it will not work.

Really? The code to strip out dollar signs and commas is trivial! Why would they cause the end user to do extra work?

Smart Error Reporting (Validation)

Despite best efforts to make the template easy to understand, and efforts to clean anticipated variations in data, the end user will invariably supply (or not supply) data that does not meet the requirements of the system. At the database level, you just cannot stuff a text value into a date field, and you cannot skip a required value. Thus, there needs to be an effective way to let the end user know.

In these cases, I like to report to the user highly informational messages:

Exact row number in the .csv file.
Column header title such as “*Date of Birth” rather than a column number.
The value they supplied, e.g. “45.” This might help them find the value more quickly.
A descriptive error message such as: “ill-formed date of birth” or “date of birth required.”

Final Considerations on Bulk Upload

Personally, I like to ensure that all data validates before I persist records into the database.

It’s usually a good idea to have a mechanism to ensure no duplicate values. Optimally, this will be handled by the API on the back end, and if there is a duplicate, simply return the id to the existing record rather than complain.

Malicious data should be considered from both the front end and the back end. This includes SQL Injections, long strings, and garbage data from hackers.

Finally, use unit tests to automatically check expected outcomes with actual outcomes. Pay particular attention to edge cases. Here are a few examples based upon the above discussion in this BLOG:

ok( bulkValidation.texas('texas') === 'TX', 'texas - texas' );  ok( bulkValidation.texas('tx') === 'TX', 'texas - tx' );  ok( bulkValidation.yesno('y') === 'Yes', 'yesno - Yes' );  ok( bulkValidation.yesno('yippers') === 'Yes', 'yesno - starts with y' ); ok( bulkValidation.yesno('n') === 'No', 'yesno - No' );  ok( bulkValidation.properDate('04/03/2002') === '04/03/2002', 'properDate - MM/DD/YYYY' ); ok( bulkValidation.properDate('04/03/02') === '04/03/2002', 'properDate - MM/DD/YY' );  ok( bulkValidation.properDate('04-03-02') === '04/03/2002', 'properDate - MM-DD-YY' );  ok( bulkValidation.zipcode(bulkValidation.numbers('54313')) === '54313', 'zipcode - 5 digit' );  ok( bulkValidation.zipcode(bulkValidation.numbers('543130000')) === '54313-0000', 'zipcode - 9 digit' );  ok( bulkValidation.zipcode(bulkValidation.numbers('54313abcdefg0000')) === '54313-0000', 'zipcode - extract letters' ); ok( bulkValidation.zipcode(bulkValidation.numbers('123')) === null, 'zipcode - too short' );  ok( bulkValidation.zipcode(bulkValidation.numbers('54313-00000')) === null, 'zipcode - too long' );

Conclusion

Bulk upload is a great way to import data into a software system. It is reasonably easy to use, and saves your end users time. Time and attention to cleanups and smart validation will ensure a positive user experience.