The Most Unique Snowflake - Cloudera Blog

Okay, I admit, the title is a little click-baity, but it does hold some truth! I spent the holidays up in the mountains, and if you live in the northern hemisphere like me, you know that means I spent the holidays either celebrating or cursing the snow. When I was a kid, during this time of year we would always do an art project making snowflakes. We’d bust out the scissors, glue, paper, string, and glitter, and go to work. At some point, the teacher would inevitably pull out the big guns and blow our minds with the fact that every snowflake in the entire world for all of time is different and unique (people just love to oversell unimpressive snowflake features).

Now that I’m a grown, mature adult who has everything figured out (pause for laughter), I’ve started to wonder about the uniqueness of snowflakes. We say they’re all unique, but some must be more unique than others. Is there a way we could quantify the uniqueness of snowflakes and thus find the most unique snowflake?

Surely with modern ML technology, a task like this should be not only possible but, dare I say, trivial? It probably sounds like a novel idea to combine snowflakes with ML, but it’s about time someone did. At Cloudera, we provide our customers with an extensive library of prebuilt data science projects (complete with out-of-the-box models and apps) called Applied ML Prototypes (AMPs) to help them move the starting point of their project closer to the finish line.

One of my favorite things about AMPs is that they’re completely open source, meaning anyone can use any part of them to do whatever they want. Yes, they’re full ML solutions that are ready to deploy with a single click in Cloudera Machine Learning (CML), but they can also be repurposed for use in other projects. AMPs are developed by ML research engineers at Cloudera’s Fast Forward Labs, and as a result they’re a great source of ML best practices and code snippets. They’re yet another tool in the data scientist’s toolbox that can make life easier and help deliver projects faster.

Launch the AMP

In this blog we’ll dig into how the Deep Learning for Image Analysis AMP can be reused to find snowflakes that are less similar to one another. If you are a Cloudera customer and have access to CML or Cloudera Data Science Workbench (CDSW), you can start by deploying the Deep Learning for Image Analysis AMP from the “AMPs” tab.

If you do not have access to CDSW or CML, the AMP GitHub repo has a README with instructions for getting up and running in any environment.

Data Acquisition

Once you have the AMP up and running, we can get started. For the most part, we can reuse parts of the existing code. However, because we’re only interested in comparing snowflakes, we need to bring our own dataset consisting solely of snowflakes, and a lot of them.

It turns out that there aren’t very many publicly available datasets of snowflake images. This wasn’t a huge surprise, as photographing individual snowflakes is a manually intensive process with a relatively small return. However, I did find one good dataset from Eastern Indiana University that we’ll use in this tutorial.

You could go through and download each image from the website individually or use some other utility, but I opted to put together a quick notebook to download and store the images in the project directory. You’ll need to place it in the /notebooks subdirectory and run it. The code parses all of the image URLs out of the linked web pages that contain pictures of snowflakes and downloads the images. It creates a new subdirectory called snowflakes in /notebooks/images and populates it with the snowflake images.
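The download notebook itself is not reproduced here, but the URL-parsing step it performs can be sketched with Python’s standard library alone. The class name `ImageLinkParser`, the helper `extract_image_urls`, and the example page address are hypothetical, chosen only for illustration; the real notebook may well work differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkParser(HTMLParser):
    """Collects the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative links against the address of the page
            self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url):
    """Return absolute URLs for all images referenced in an HTML string."""
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    return parser.image_urls


# A tiny inline page so the sketch runs without network access
sample_html = (
    '<html><body><img src="flakes/a.jpg">'
    '<img src="/b.png"><p>snow</p></body></html>'
)
urls = extract_image_urls(sample_html, "http://example.com/gallery/index.html")
print(urls)
```

Each collected URL could then be fetched, for example with `urllib.request.urlretrieve`, and saved into the snowflakes folder.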

Like any good data scientist, we should take some time to explore the dataset. You’ll notice that these images have a consistent format. They have very little color variation and a relatively constant background. A perfect playground for computer vision models.

Repurposing the AMP

Now that we have our data, and it looks to be reasonably suited for image analysis, let’s take a moment to restate our goal. We want to quantify the uniqueness of an individual snowflake. According to its description, Deep Learning for Image Analysis is an AMP that “demonstrates how to build a scalable semantic search solution on a dataset of images.” Traditionally, semantic search is an NLP technique used to extract the contextual meaning of a search term, instead of just matching keywords. This AMP is unique in that it extends that concept to images instead of text to find images that are similar to one another.

The goal of this AMP is largely focused on educating users on how deep learning and semantic search work. Within the AMP there is a notebook located in /notebooks that is titled Semantic Image Search Tutorial. It offers a practical implementation guide for two of the main techniques underlying the overall solution: feature extraction and semantic similarity search. This notebook will be the foundation for our snowflake analysis. Go ahead and open it and run the entire notebook (because it takes a little while), and then we’ll take a look at what it contains.

The notebook is broken down into three main sections:

  1. A conceptual overview of semantic image search
  2. An explanation of extracting features with CNNs and demonstration code
  3. An explanation of similarity search with Facebook’s AI Similarity Search (FAISS) and demonstration code

Notebook Section 1

The first section contains background information on how the end-to-end process of semantic search works. There is no executable code in this section, so there is nothing for us to run or change, but if time permits and the topics are new to you, you should take the time to read it.

Notebook Section 2

Section 2 is where we’ll start to make our changes. In the first cell with executable code, we need to set the variable ICONIC_PATH equal to our new snowflake folder, so change

ICONIC_PATH = "../app/frontend/build/assets/semsearch/datasets/iconic200/"

to

ICONIC_PATH = "./images/snowflakes"

Now run this cell and the next one. You should see an image of a snowflake displayed where before there was an image of a car. The notebook will now use only our snowflake images to perform semantic search.

From here, we can actually run the rest of the cells in section 2 and leave the code as is up until section 3, Similarity Search with FAISS. If you have time though, I’d highly recommend reading the rest of the section to gain an understanding of what is happening. A pre-trained neural network is loaded, feature maps are saved at each layer of the neural network, and the feature maps are visualized for comparison.

Notebook Section 3

Section 3 is where we’ll make most of our changes. Usually with semantic search, you are trying to find things that are very similar to one another, but for our use case we are interested in the opposite: we want to find the snowflakes in this dataset that are the least similar to the others, aka the most unique.

The intro to this section in the notebook does a great job of explaining how FAISS works. In summary, FAISS is a library that allows us to store the feature vectors in a highly optimized database, and then query that database with other feature vectors to retrieve the vector (or vectors) that are most similar. If you want to dig deeper into FAISS, you should read this post from Facebook’s engineering site.
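To make the idea of a flat L2 index concrete, here is a minimal pure-Python sketch of an exhaustive nearest-neighbor search. It is not FAISS code: `squared_l2` and `brute_force_search` are hypothetical names and the toy vectors are made up. Like `faiss.IndexFlatL2`, it ranks candidates by squared Euclidean distance and returns distances and positions side by side.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(database, query, k):
    """Return (distances, indices) of the k stored vectors closest to query."""
    scored = sorted((squared_l2(vec, query), i) for i, vec in enumerate(database))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy "feature vectors"; real ones would have thousands of dimensions
database = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
distances, indices = brute_force_search(database, [0.9, 0.1], k=2)
print(indices, distances)
```

A real index does the same comparison far more efficiently, which is what makes it practical for large image collections.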

One of the lessons that the original notebook focuses on is how the features output from the last convolutional layer are a much more abstract and generalized representation of what features the model deems important, especially when compared to the output of the first convolutional layer. In the spirit of KISS (keep it simple, stupid), we’ll apply this lesson to our analysis and focus only on the feature index of the last convolutional layer, b5c3, in order to find our most unique snowflake.

The code in the first 3 executable cells needs to be slightly altered. We still want to extract the features of each image and then create a FAISS index for the set of features, but we’ll only do this for the features from convolutional layer b5c3.

# Cell 1
# (assumes np and preprocess_input were imported earlier in the notebook)
def get_feature_maps(model, image_holder):
    # Stack the images into one array and preprocess pixel values for VGG
    images = np.asarray(image_holder)
    images = preprocess_input(images)
    # Get feature maps
    feature_maps = model.predict(images)
    # Reshape to flatten each feature tensor into a feature vector
    feature_vector = feature_maps.reshape(feature_maps.shape[0], -1)
    return feature_vector

# Cell 2
all_b5c3_features = get_feature_maps(b5c3_model, iconic_imgs)

# Cell 3
import faiss
feature_dim = all_b5c3_features.shape[1]
b5c3_index = faiss.IndexFlatL2(feature_dim)
# Add the feature vectors to the index; without this step the index stays empty
b5c3_index.add(all_b5c3_features)



Here is where we’ll start deviating significantly from the source material. In the original notebook, the author created a function that lets users select a specific image from each index; the function returns the most similar images from each index and displays them. We’re going to use parts of that code to achieve our new goal, finding the most unique snowflake, but for the purposes of this tutorial you can delete the rest of the cells and we’ll go through what to add in their place.

First off, we’ll create a function that uses the index to retrieve the second most similar feature vector to the one that was selected (because the most similar would be the same image). There also happen to be a couple of duplicate images in the dataset, so if the second most similar feature vector is also an exact match, we’ll use the third most similar.


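def get_most_similar(index, query_vec):
    # The closest match is the query image itself, so look at the second result
    distances, indices = index.search(query_vec, 2)
    if distances[0][1] > 0:
        return distances[0][1], indices[0][1]
    else:
        # The second result is an exact duplicate, so fall back to the third
        distances, indices = index.search(query_vec, 3)
        return distances[0][2], indices[0][2]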


From there it’s just a matter of iterating through each feature vector, searching for the most similar image that isn’t the exact same image, and storing the results in a list:


distance_list = []
for x in range(b5c3_index.ntotal):
    # Slice with x:x+1 to keep the 2D shape that index.search expects
    dist, indic = get_most_similar(b5c3_index, all_b5c3_features[x:x+1])
    distance_list.append([x, dist, indic])

Now we’ll import pandas and convert the list to a dataframe. This gives us a dataframe for the b5c3 layer containing a row for every feature vector in the FAISS index, with the index of the feature vector, the index of the feature vector most similar to it, and the L2 distance between the two. We’re curious about the snowflakes that are farthest from their most similar snowflake, so we end this cell by sorting the dataframe in descending order by the L2 distance.

import pandas as pd
df = pd.DataFrame(distance_list, columns=['index', 'L2', 'similar_index'])
df = df.sort_values('L2', ascending=False)

Let’s take a look at the results by printing out the dataframe, as well as displaying the L2 values in a box-and-whisker plot.



Amazing stuff. Not only did we find the indexes of the snowflakes that are the least similar to their most similar snowflake, but we also have a handful of outliers made evident in the box-and-whisker plot, one of which stands alone.
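A box-and-whisker plot flags outliers with a simple rule: by default, matplotlib draws the whiskers at 1.5 times the interquartile range beyond the quartiles and plots anything outside them as an individual point. Here is a rough sketch of that rule using only the standard library. The helper name `upper_outliers` is hypothetical, the distance values are made-up placeholders rather than results from the snowflake dataset, and quartile conventions can differ slightly between libraries.

```python
import statistics


def upper_outliers(values, whisker=1.5):
    """Return the values lying above Q3 + whisker * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    upper_fence = q3 + whisker * (q3 - q1)
    return [v for v in values if v > upper_fence]


# Placeholder distances: most are clustered, one sits far from the rest
l2_distances = [10.0, 11.0, 12.0, 12.5, 13.0, 14.0, 15.0, 40.0]
outliers = upper_outliers(l2_distances)
print(outliers)
```

Only the upper fence matters here, since we care about snowflakes that are unusually far from their nearest neighbor.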

To finish things up, we should see what these super unique snowflakes actually look like, so let’s display the top 3 most unique snowflakes in a column on the left, along with their most similar snowflake counterparts in the column on the right.

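import matplotlib.pyplot as plt

fig, ax = plt.subplots(nrows=3, ncols=2, figsize=(12, 12))
i = 0
for row in df.head(3).itertuples():
    # column 1: the unique snowflake
    # (the image-drawing lines are missing from this listing; the imshow calls
    # are an assumption that iconic_imgs holds RGB arrays with values 0-255)
    ax[i][0].axis('off')
    ax[i][0].imshow(np.asarray(iconic_imgs[row.index]).astype('uint8'))
    ax[i][0].set_title('Unique Rank: %s' % (i+1), fontsize=12, loc='center')
    ax[i][0].text(0.5, -0.1, 'index = %s' % row.index, size=11, ha='center', transform=ax[i][0].transAxes)
    # column 2: its most similar counterpart
    ax[i][1].axis('off')
    ax[i][1].imshow(np.asarray(iconic_imgs[row.similar_index]).astype('uint8'))
    ax[i][1].set_title('L2 Distance: %s' % (row.L2), fontsize=12, loc='center')
    ax[i][1].text(0.5, -0.1, 'index = %s' % row.similar_index, size=11, ha='center', transform=ax[i][1].transAxes)
    i += 1
fig.subplots_adjust(wspace=-.56, hspace=.5)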


This is why ML methods are so great. No one would ever look at that first snowflake and think, that’s one super unique snowflake, but according to our analysis it’s by far the most dissimilar to its next most similar snowflake.


Now, there are a multitude of tools that you could have used and ML methodologies that you could have leveraged to find a unique snowflake, including one of those overhyped ones. The nice thing about using Cloudera’s Applied ML Prototypes is that we were able to leverage an existing, fully built, functional solution and alter it for our own purposes, resulting in a significantly faster time to insight than if we had started from scratch. That, ladies and gentlemen, is what AMPs are all about!

For your convenience, I’ve made the final resulting notebook available on GitHub here. If you’re interested in finishing projects faster (better question: who isn’t?) you should also take the time to check out what code in the other AMPs could be used in your current projects. Just select the AMP you’re interested in and you’ll see a link to view the source code on GitHub. After all, who wouldn’t be interested in, legally, starting a race closer to the finish line? Take a test drive to try AMPs for yourself.


