Data for Research

Assure

The Assure stage of the data lifecycle is about data cleaning and quality assurance.

Data Cleaning

Data must be clean before it is useful. Cleaning means deduplicating, normalizing, and correcting your data before you analyze it.

Deduplication: Removal of redundant data

Normalization: Making data uniform across cells (e.g., recording every phone number as 000-000-0000 rather than 0000000000, or writing addresses out in full: Avenue rather than Ave, Road rather than Rd)

Correction: Removing corrupted, inaccurate, or irrelevant records from a dataset
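
The three cleaning steps above can be sketched in a few lines of Python. This is only an illustration, not a recommended workflow: the sample records, the field names (name, phone, address), the abbreviation list, and the rule that a record needs a name and a 10-digit phone number are all assumptions made for the example.

```python
import re

# Hypothetical sample records; names and values are made up for illustration.
records = [
    {"name": "Ann Lee", "phone": "9145551234", "address": "12 Oak Ave"},
    {"name": "Ann Lee", "phone": "914-555-1234", "address": "12 Oak Avenue"},
    {"name": "Bo Chen", "phone": "(914) 555-9876", "address": "7 Mill Rd"},
    {"name": "", "phone": "555", "address": "unknown"},
]

# Assumed abbreviation list; extend it to match your own data.
ABBREVIATIONS = {"Ave": "Avenue", "Rd": "Road"}


def normalize_phone(raw):
    """Return the number as 000-000-0000, or None if it does not have 10 digits."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


def normalize_address(raw):
    """Expand abbreviations that appear as whole words (Ave -> Avenue)."""
    words = [ABBREVIATIONS.get(word.rstrip("."), word) for word in raw.split()]
    return " ".join(words)


def clean(rows):
    """Normalize each record, drop unusable ones, and remove duplicates."""
    cleaned, seen = [], set()
    for row in rows:
        name = row["name"].strip()
        phone = normalize_phone(row["phone"])
        # Correction: drop records with no name or an unusable phone number.
        if not name or phone is None:
            continue
        # Normalization: store every value in one agreed format.
        record = {
            "name": name,
            "phone": phone,
            "address": normalize_address(row["address"]),
        }
        # Deduplication: skip records that match one already kept.
        key = (record["name"].lower(), record["phone"], record["address"].lower())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


cleaned = clean(records)
for record in cleaned:
    print(record)
```

Normalizing before deduplicating matters here: the two Ann Lee records only become recognizable as duplicates once their phone numbers and addresses share a format.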

During collection, make sure that everyone recording data follows the same procedures and standards; consistent entry avoids creating many cleaning tasks later. Secondary datasets may also require some cleaning.
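
One way a team might enforce a shared standard is to check each record against it as it is entered. The sketch below assumes a hypothetical standard with two fields (phone and visit_date) and made-up formats; substitute whatever your own data collection protocol specifies.

```python
import re

# Hypothetical shared standard: each field maps to the pattern its values must match.
STANDARD = {
    "phone": re.compile(r"\d{3}-\d{3}-\d{4}"),
    "visit_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def validate(record):
    """Return a list of problems in one record; an empty list means it passes."""
    problems = []
    for field, pattern in STANDARD.items():
        value = str(record.get(field, ""))
        if not pattern.fullmatch(value):
            problems.append(f"{field}: {value!r} does not match the agreed format")
    return problems


print(validate({"phone": "914-555-1234", "visit_date": "2024-03-01"}))
print(validate({"phone": "9145551234", "visit_date": "3/1/24"}))
```

Catching a mismatch at entry is usually cheaper than repairing it afterward, because the person recording the data can still check the original source.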

A free tool like OpenRefine is helpful for large cleaning tasks.