All Posts By

Vitaly Tev

Intro to Machine Learning

By Machine Learning

You’ve probably heard of AI, or Artificial Intelligence.  A good way to define AI is a machine mimicking or displaying “human intelligence”.  Basically, as it is today, AI is a computer that can do a task (or tasks) as well as or better than a human being.  Examples are Style Transfer filters (think of the app on your phone that turns your pictures into Picasso-style abstracts) or some of the real-time language translation applications you see (think of the new headphones that translate via your phone on the fly).  Another area of AI is Machine Learning (ML).  Let’s go through a brief introduction to Machine Learning.

What is Machine Learning (ML)?

So what does Artificial Intelligence have to do with Machine Learning (ML)?  Well, Machine Learning can be defined as an AI approach that learns through experience to recognize patterns in data.  Basically, a computer is taught patterns by example using an algorithm, instead of being given hand-coded rules that are followed in sequence.

The Machine Learning basics are:

  • Start with a training data set
  • Let the ML algorithm learn patterns
  • Provide new data to test the ML algorithm and provide feedback

Using this method, an algorithm that starts out inaccurate can become very accurate at predictions.  From this basic process, ML can be implemented using more advanced techniques that allow recognition of patterns of patterns, helping machine learning algorithms more closely mimic a human brain (such implementations are typically called Deep Learning).
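To make the three steps concrete, here is a minimal sketch in Python.  It assumes a very simple "nearest centroid" algorithm, chosen only because it fits in a few lines; the animals, measurements, and function names are all made up for illustration and are not a specific product or library.

```python
from statistics import mean

def train(examples):
    """Step 2: learn one centroid (average feature vector) per label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: [mean(column) for column in zip(*rows)]
        for label, rows in by_label.items()
    }

def predict(model, features):
    """Return the label whose centroid is closest to the input."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

def accuracy(model, examples):
    """Step 3: fraction of new examples the model labels correctly."""
    correct = sum(predict(model, features) == label for features, label in examples)
    return correct / len(examples)

# Step 1: a (made-up) training data set of (height_cm, weight_kg) -> label
training_data = [
    ((25, 4), "cat"), ((23, 5), "cat"), ((28, 6), "cat"),
    ((60, 30), "dog"), ((55, 25), "dog"), ((65, 35), "dog"),
]
# New data the algorithm has not seen, used to test it
test_data = [((26, 5), "cat"), ((58, 28), "dog")]

model = train(training_data)
print(accuracy(model, test_data))
```

The accuracy on the test data is the feedback: if it is low, you would add more training examples or pick better features and train again.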

How does it work?

As outlined above, machine learning works by training an algorithm to do what you want.  So what does training mean?  The goal is to teach the algorithm the features or properties that are relevant to identifying the information you care about.  Choosing the appropriate features is absolutely vital.

For example, if you want to identify plants and animals, choosing the feature “living” is not particularly helpful.  However, “chlorophyll present” may be a very relevant feature.  That is just one feature.  Ideally, enough features are chosen to allow consistent identification of a plant or animal when presented with examples.  Note, however, that one has to be careful not to provide too many features to train on, as this can lead to the algorithm being unable to generalize when it is used outside of training.
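One rough way to see why “living” is a poor feature is to ask how well each feature, on its own, separates the two classes.  The sketch below does that for a tiny made-up data set; the helper name `single_feature_accuracy` and the scoring rule (guess the most common label for each feature value) are assumptions made for illustration, not a standard method you must use.

```python
from collections import Counter, defaultdict

def single_feature_accuracy(examples, feature):
    """Accuracy of guessing the most common label for each value of one feature."""
    labels_by_value = defaultdict(Counter)
    for features, label in examples:
        labels_by_value[features[feature]][label] += 1
    # For every feature value, count the examples that the majority guess gets right
    correct = sum(counts.most_common(1)[0][1] for counts in labels_by_value.values())
    return correct / len(examples)

# Made-up examples: every organism is living, only plants have chlorophyll
examples = [
    ({"living": True, "chlorophyll_present": True}, "plant"),
    ({"living": True, "chlorophyll_present": True}, "plant"),
    ({"living": True, "chlorophyll_present": False}, "animal"),
    ({"living": True, "chlorophyll_present": False}, "animal"),
]

print(single_feature_accuracy(examples, "living"))               # 0.5
print(single_feature_accuracy(examples, "chlorophyll_present"))  # 1.0
```

A feature that has the same value for every example, like “living” here, does no better than a coin flip, while “chlorophyll present” separates this toy data perfectly.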

From this basic idea of using training data, ML can go into a variety of complex learning paradigms.  But at the core, these are all about the algorithm learning to understand inputs based on learned parameters.

How can it be used?

So with all this learning, what can ML do?  Four common outputs from an ML algorithm are:

  • Predictions of numerical values: Understanding the housing market in a given area and offering predictions for a home newly on sale
  • Classifying inputs: Understanding whether a given input is of class A (plant for example) or class B (animal for example)
  • Finding similar examples: Providing similar items for purchase a la Amazon.
  • Predicting next values in sequences: Helping with processing natural language or allowing a computer to speak
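As a small illustration of the first item, predicting a numerical value, here is a sketch that fits a straight line to home sizes and sale prices and then estimates the price of a new listing.  The numbers are invented and chosen to be perfectly linear so the result is easy to check by hand; real housing data would be noisier and would need more features than size alone.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training data: home size in square meters and sale price
sizes = [100, 150, 200]
prices = [200_000, 300_000, 400_000]

slope, intercept = fit_line(sizes, prices)

# Predict the price of a home newly on sale
new_home_size = 120
predicted_price = slope * new_home_size + intercept
print(round(predicted_price))  # 240000
```

The other three outputs follow the same pattern of learning from examples first and then applying what was learned to new inputs.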

Where is it heading?

The most exciting part of Machine Learning is that we are just getting started.  ML as a concept has been around for many decades, but we’re seeing it evolve at an increasing rate.  This is due to improvements in processing power and access to the cloud.  The immediate payoff is an improvement in the consumer experience.  Everything from Google Photos to Amazon is incorporating ML more and more.  As we become more connected, we are able to have ML assist us in all sorts of activities: better, more contextual shopping assistants, real-time object identification, and real-time language translation.  The possibilities are hard to quantify simply because ML can be pervasive and impactful throughout our lives, individually and as a society!

This article is simply a short introduction to Machine Learning.  How is this relevant for DataGroomr?  We’re taking these concepts and applying them to cleaning your data, so that you don’t need to be an expert to have useful information at your fingertips!  Stay tuned for our next article on Machine Learning and Data!

The Value of Data Accuracy

By Data Cleansing

There is an axiom that has existed in data science for years.  “Garbage in, garbage out.”  Despite the existence of such sound advice, too often it is forgotten.  Organizations routinely use information that, when tested, does not meet the minimum acceptable range of data accuracy, per the Harvard Business Review.  However, creating information that is acceptable has a cost in dollars.  Is it worth the investment?  What are the benefits of having good data?

1. Importance of Data Quality

In the modern business landscape, data feeds directly into revenue.  When you are able to rely on your information because it is clean of duplicates, uses standard formatting, and has correct values in the critical fields, you can make confident decisions that lead to increased revenue.

2. Cost of Bad Data

Data feeds into many business processes.  Bad data means those processes run less efficiently.  Direct mail campaigns are sent to incorrect addresses.  Email campaigns send messages to nonexistent addresses.  Even the best designed campaign will fail to meet expectations without clean information to feed into it.  Good data means saving money on wasted efforts.  Well managed data means campaigns are efficient and targeted, helping to maximize return on a lower investment.

3. Customer Satisfaction

Customer satisfaction is imperative to any business.  By ensuring that records are appropriately clean and accurate, your organization can match different elements together and be sure to deliver what your customers expect at each and every touchpoint.  Your customers will be happier, feeling that your organization is meeting their needs.

4. Impact of Data Accuracy (in time)

One of the sometimes hidden issues of poor data quality is the time spent manually fixing bad data.  Simple steps like standardizing “st” versus “street” can be time consuming and error prone when done manually.  Often the fixes are applied to output records rather than the actual source, which means that the same steps have to be repeated every time the source is used.  Departments will codify manual data correction and normalization as part of standard procedure, instead of expecting clean, correct information!  Applying principles of good data management saves time.

5. Good Data vs Bad Data

Organizations spend money on systems to maximize data accuracy and the value they can get from their information.  But the underlying records don’t always allow these systems to operate as well as possible.  If you can be confident that your records are clean, you can be confident in the results of processing that data through other systems.  The investment in systems that work from your organization’s data will have a greater ROI.


Data is an invaluable resource, but only if you can rely on it.  The examples above are just some of the ways that clean data provides value to organizations.  Achieving this goal sometimes takes a third-party resource to perform data hygiene tasks.  DataGroomr is a data cleansing solution that is simple to use, while leveraging advanced machine learning technology to apply best practices in data cleansing.  Questions?  Reach out to us at or sign up for a free demo at

What is Data Cleansing?

By Data Cleansing

Just about every organization has data today.  But that data can be worth nothing without some form of data quality management.  As part of this process, data cleansing is vital.  But what is data cleansing?  Generally it breaks down into three primary actions:


Duplicate Removal

One of the common issues with data is duplicate records.  These can be due to spelling variations or simple errors in data input.  However, each duplicate can have a significant impact on how meaningful your data is.  Removing these duplicates reduces costs and makes the data more effective to use.  For example, if you have a record for “Steve Wilson” and another for “Stephen Wilson”, the same person may be contacted twice, wasting resources in a direct mail campaign.  At the same time, care has to be taken to prevent the removal of valid records.
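As a rough illustration of how spelling variations can be caught, the sketch below compares names using `difflib.SequenceMatcher` from the Python standard library.  The 0.8 threshold and the function names are assumptions picked for this example; a real tool would likely compare more fields than the name and tune the threshold on real records.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity between two names, from 0.0 (different) to 1.0 (identical)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def likely_duplicates(names, threshold=0.8):
    """Return pairs of names similar enough to be reviewed as possible duplicates."""
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if similarity(first, second) >= threshold:
                pairs.append((first, second))
    return pairs

names = ["Steve Wilson", "Stephen Wilson", "Maria Lopez"]
print(likely_duplicates(names))
# [('Steve Wilson', 'Stephen Wilson')]
```

Note that the function only flags candidate pairs.  Because two different people can have very similar names, the final decision to merge or delete should be confirmed against other fields or by a person, which is how valid records are protected.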


Normalization

Another common issue is normalization of the data.  This can be applied in many ways.  What does normalization actually mean, however?  It’s the process where data is corrected to be consistent across records.  An example is choosing between “st” and “street”.  Either can work, but ideally your data chooses one version and uses that consistently.  This makes running all types of queries or searches easier, and returns results that actually match what you’re looking for.
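Here is a minimal sketch of the “st” versus “street” case.  It assumes you have decided on the long form and on lowercase as your standard; the abbreviation table is a made-up starting point, and it would wrongly expand names such as “St. Louis”, so treat it as an illustration rather than a complete address parser.

```python
import re

# Hypothetical house standard: always use the long form
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}

# Match an abbreviation as a whole word, with an optional trailing period
PATTERN = re.compile(r"\b(" + "|".join(ABBREVIATIONS) + r")\b\.?")

def normalize_address(address):
    """Lowercase, collapse extra whitespace, and expand known abbreviations."""
    cleaned = " ".join(address.lower().split())
    return PATTERN.sub(lambda match: ABBREVIATIONS[match.group(1)], cleaned)

print(normalize_address("123 Main St."))      # 123 main street
print(normalize_address("123  main street"))  # 123 main street
```

Once both variants produce the same string, a search for one finds the other, and the duplicate check has a much easier job.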

Record Completion

The final main component of data cleansing is completion of records.  Oftentimes data is missing from a given record, sometimes vital data.  For example, a given record may have a name, address, and phone number but no email address.  For an email campaign, that record may as well not exist.  During the process of data cleansing, external sources can sometimes be used to fill in these gaps in the data.  Alternatively, a manual process can be undertaken periodically to try to complete the data as well as possible.  Finally, other records in your own data may hold the information that a particular incomplete record is missing.
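The idea of filling gaps from other records can be sketched as follows.  The field names, the sample records, and the rule of never overwriting an existing value are all assumptions made for this example; which source to trust when values conflict is a policy decision the sketch does not try to settle.

```python
def complete_record(record, *sources):
    """Fill empty fields in `record` from `sources` without overwriting existing values."""
    completed = dict(record)  # work on a copy so the original is untouched
    for source in sources:
        for field, value in source.items():
            if not completed.get(field) and value:
                completed[field] = value
    return completed

# Made-up records for the same person from two different systems
crm_record = {
    "name": "Steve Wilson",
    "address": "123 main street",
    "phone": "555-0100",
    "email": "",
}
newsletter_record = {"name": "Stephen Wilson", "email": "swilson@example.com"}

completed = complete_record(crm_record, newsletter_record)
print(completed["email"])  # swilson@example.com
print(completed["name"])   # Steve Wilson
```

With the email address filled in, the record becomes usable for an email campaign again.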


These three concepts together are the core of data cleansing.  To keep good data from turning into bad data, all three should be undertaken periodically.  Unfortunately, doing it once and forgetting about it is not really an option.  As new data is input, the risk of bad data increases yet again, and you could end up in the same spot where you started, or a worse one.  Maintaining the value of your data requires continuous vigilance and an approach that is repeatable.

So now that you know what needs to be done for data cleansing, how do you do it?  Check out for an innovative approach to accomplishing your goals and improving data quality.  We simplify the daunting task of cleaning your data by leveraging Machine Learning (ML) so that you don’t have to be a data expert.  Questions?  Email us at!