Updated online document!
Chapter 1 : General Structure
- Is there an established map of the database?
- Is it visible to all users?
- Is there a public strategy of how it is tested for truth?
- Are there known party(ies) responsible for updating it when there are structural changes?
- How often is it updated?
- When is the next scheduled update?
- Is there a change log to track prior changes?
- Is there a group of humans you can go to with questions?
- Is there a group of humans you can go to with general conversational topics?
- The user list from a DBA is great for this.
- Consider having a monthly meeting and yearly summit per database.
- If the source of data is an application or form, is there a map from every entry field to the corresponding field in the database?
- Capacity / Uptime
- How concerned should you be with locking up the database with poorly written queries?
- Who do you go to when it is dead?
- Is the data backed up / archived anywhere?
- For how long?
- Do you remove old values from the production database after archiving?
- Are queries logged?
- How often is the database refreshed, if it is part of an ETL?
- Automatically check, at the appropriate time, if it has been updated. Have a plan in place for when it is inevitably not updated.
- What is the official way to request access to the database?
- Is there a time-out process of old user ID’s?
- Have a plan in place for when an automated procedure’s stored credentials are suddenly invalid.
- Tables
- Is there any rhyme or reason to the naming of tables?
- If there are very similar tables (ie. CUSTOMER, CUSTOMER_NEW, CUSTOMER_NEW_NEW), do the alternates need to exist?
- Is this table “one row that is constantly kept updated” or “one row for each time the data changes”?
- Joins
- Should there be at least one “Child” for every “Parent”?
- Is there a maximum expected “Children” for every “Parent”?
- Should there be “Children” who don’t have “Parents”?
- Is every table relatable to every other table?
- On joining fields
- Do the field types match, or do you need to do a conversion?
- Queries / Processes
- Is the query properly notated?
- With the code removed, the notes should be copy and paste-able to a non-technical business partner.
- The notes should represent reality. Please.
- Is it possible to do an “anti-query”?
- Reverse your filters, but mirror your production process. In high value pipelines, this can be a huge preventative measure for errors, and an easy way to provide "free" extra value.
- Is it possible to do the query by hand on a small subset of records?
- If so, do this periodically. Seriously. I know you got into this line of business to not do this stuff manually, but doing this periodically can help verify all systems are working properly.
- Is the process entirely automated?
- If not, don’t launch.
- This includes any of the tests below that say “periodically”, if possible.
- Is it possible to store the results of every query, along with a timestamp of run (especially as part of an ETL process?)
- Are you choosing a “whitelist” approach, or a “blacklist” approach?
- If there are changes after launch - will you be applying those changes to previous projects, or leaving them in place with old (and potentially inaccurate) data?
Chapter 2 : Data Types
- All
- Are null values allowed?
- Are the fields expected to change over time?
- If the answer is “no”, take hashes of lines at the time of development and test periodically to see if they change.
- If the answer is “yes”, take extracts with timestamps in the name and only use those (instead of production) for reproducibility.
- Expert level - Create a X / Y grid of every field against every other field. Have all parties involved in development (especially the humans that are the source of the data) write assumptions and rules that come up in each cell (if this field is “beer” this other field should be “$5.99”)
- Document all of these.
- If something looks wrong and the customer says “that’s ok”, DOCUMENT THIS
- With audio evidence if possible
- General Text (Free entry)
- If there is a default size, why is the default size set to the size it is?
- Is the field having an identity crisis - does it only contain integers, dates, locations, etc.?
- If so - should it be converted to the correct type?
- What do you do with those that can’t be converted?
- Does the text contain special characters that will cause it to print strangely (line-break)?
- Does the text contain special (or international) characters? Should it?
- Is there a list of “stop words” that should be applied?
- Especially inappropriate words. Don’t let an f-bomb get into a shareholder report.
- Categorical Text (Unique list of values)
- Store the unique values at the time you are developing, and the counts of each value.
- Periodically test for new/missing values. Have a plan in place for when this inevitably happens.
- Periodically test for wild changes in distribution.
- Communicate these to anyone who will listen. This will make you a hero.
- Provide this list to stakeholders during development. This is always interesting, because often it is vastly different than their estimate.
- If there are multiple categorical fields, is there an hierarchy that is documented and should be followed (if veh_type is “car”, then veh_brand can only be “audi”, “ford”, “toyota” etc.)
- Have a plan in place for when this inevitably breaks.
- Is the field indexed for fast grouping?
- If the values are text numbers (one, two, three) - should they be converted to integers (1, 2, 3)?
- Boolean (True/False)
- If these represent a “switch” in the process, should the downstream impact also be represented in data? (if override_price=0 then discount_price should never be null)
- Store the distribution at the time of developing, and periodically test for changes.
- Communicate these to anyone who will listen.
- ID (Keys, AutoNum)
- Is the ID an AutoNum?
- Is the ID a complex key?
- Are duplicates allowed?
- Are you operating under the assumption that the ID for any given record will never change (maintaining an extract or join to a private table)?
- If you remove the ID from the equation, are there any two completely identical rows?
- General Number/Integer
- Currency
- General Contact Information
- Phone Numbers
- Mailing Address
- Sensitive PII
- Financial Information