
What is similarity search?

Similarity search is used in data retrieval: given a query item (e.g., text or an image), find the K most similar data points in a database.

How can a vector representation be used?

A vector representation (e.g., a CNN feature or a CLIP latent embedding) can be used for similarity search and classification. Modern AI-based high-dimensional vectors are known to be more powerful and flexible than traditional symbolic representations. For example, given only a photo of a building in a city, you can search for that city without knowing its name.

For similarity search, you compute the Euclidean distance between the query vector and the other data points' vectors. For classification, you pick the class whose vector has the highest dot product with the query vector.
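
A minimal sketch of both operations in NumPy (the database, query, and class prototypes below are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 128)).astype("float32")  # database vectors
query = rng.standard_normal(128).astype("float32")       # query vector

# Similarity search: Euclidean distance to every vector, keep the K smallest.
k = 5
dists = np.linalg.norm(db - query, axis=1)
topk = np.argsort(dists)[:k]
print("nearest indices:", topk)

# Classification: pick the class whose prototype vector has the
# highest dot product with the query.
class_protos = rng.standard_normal((10, 128)).astype("float32")
pred = int(np.argmax(class_protos @ query))
print("predicted class:", pred)
```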

What is Faiss? How is it different from other similarity search libraries? What is indexing (index building)?

Faiss is the Facebook AI Similarity Search library, which scales efficient similarity search to billions of vectors. Faiss focuses on compressing the original vectors because it scales up to billions of data points. Indexing (index building) is pre-processing of the dataset that makes similarity search faster or more memory-efficient, normally with some trade-off against accuracy.
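
A minimal sketch of index building and search with Faiss, using the exact (uncompressed) baseline index; the random data and sizes are placeholders:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 128
rng = np.random.default_rng(0)
xb = rng.standard_normal((10_000, d)).astype("float32")  # database
xq = rng.standard_normal((5, d)).astype("float32")       # queries

# "Index building" for the flat index is just storing the raw vectors;
# fancier indexes pre-process (cluster/compress) the data at this step.
index = faiss.IndexFlatL2(d)
index.add(xb)

# Exact search for the 4 nearest neighbors of each query.
D, I = index.search(xq, 4)  # D: squared L2 distances, I: database indices
print(I)
```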

How do you evaluate similarity search?

Normally, there are three metrics: speed, memory, and accuracy. In practice, we evaluate the performance of similarity search by measuring the trade-off between speed and accuracy given a fixed memory budget (e.g., 30 GB of RAM).
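
One common way to measure that trade-off is to take exact search results as ground truth and compute the recall of an approximate index against them while timing the queries. A sketch with a Faiss IVF index (the sizes and parameters here are arbitrary placeholders):

```python
import time
import numpy as np
import faiss

d, nb, nq = 64, 100_000, 100
rng = np.random.default_rng(0)
xb = rng.standard_normal((nb, d)).astype("float32")
xq = rng.standard_normal((nq, d)).astype("float32")

# Ground truth from the exact index.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, gt = flat.search(xq, 1)

# Approximate IVF index: cluster the database into 256 cells,
# then search only a few cells per query.
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 256)
ivf.train(xb)
ivf.add(xb)
ivf.nprobe = 4  # more probes -> slower but more accurate

t0 = time.time()
_, I = ivf.search(xq, 1)
elapsed = time.time() - t0

recall_at_1 = float((I[:, 0] == gt[:, 0]).mean())
print(f"recall@1 = {recall_at_1:.3f}, search time = {elapsed * 1e3:.1f} ms")
```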

How do you search in the index? How do you make it faster?

Mainly through data compression (encoding vectors compactly) and dimensionality reduction, combined with approximate nearest neighbor search or tree structures such as the KD-Tree.
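
As a sketch of the compression side, product quantization (PQ) in Faiss encodes each vector into a few bytes instead of 4·d bytes of raw float32; the sizes below are placeholders:

```python
import numpy as np
import faiss

d = 128
rng = np.random.default_rng(0)
xb = rng.standard_normal((50_000, d)).astype("float32")
xq = rng.standard_normal((3, d)).astype("float32")

# Product quantization: split each vector into m sub-vectors and encode
# each sub-vector with an 8-bit codebook -> m bytes per vector instead
# of 4 * d = 512 bytes of raw float32. Distances become approximate.
m = 16
pq = faiss.IndexPQ(d, m, 8)  # d dims, m sub-quantizers, 8 bits each
pq.train(xb)                 # learn the codebooks
pq.add(xb)                   # store only the compressed codes

D, I = pq.search(xq, 5)
print(I)
```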

What is the difference between K-nearest neighbors (K-NN), the K-Dimensional Tree (KD-Tree), and approximate nearest neighbor (ANN) search?

To return the K most similar data points, brute-force K-NN scans all n points while keeping the best k in a heap, so its time complexity is O(n log k).
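
A sketch of that heap-based scan (toy data; a real implementation would vectorize the distance computation):

```python
import heapq
import numpy as np

def knn_bruteforce(db, query, k):
    """Scan all n points, keeping the k closest in a max-heap of size k:
    O(n log k) time, O(k) extra memory."""
    heap = []  # (-distance, index); the root is the farthest of the k kept
    for i, v in enumerate(db):
        d = float(np.linalg.norm(v - query))
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif d < -heap[0][0]:          # closer than the current farthest
            heapq.heapreplace(heap, (-d, i))
    return sorted((-nd, i) for nd, i in heap)  # (distance, index) pairs

rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 32))
print(knn_bruteforce(db, rng.standard_normal(32), k=3))
```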

Constructing a KD-Tree takes O(n log n) or O(n log² n) depending on how the split points (e.g., medians) are found, and nearest neighbor search ranges from O(log n) (the height of the tree) to O(n). But it is not efficient with high-dimensional data because of the curse of dimensionality: as dimensionality grows, the volume of the space increases exponentially and the data becomes sparse (requiring far more data points). It becomes hard to obtain good split points (e.g., medians), which makes the tree imbalanced and the search inefficient.
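
A sketch with SciPy's KD-Tree illustrating the contrast: queries prune well in low dimensions, while in high dimensions the search degrades toward a full scan (the dimensions here are arbitrary):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# Low-dimensional data: KD-Tree queries stay close to O(log n).
low = rng.standard_normal((10_000, 3))
tree = cKDTree(low)                  # construction ~ O(n log n)
dist, idx = tree.query(low[0], k=5)
print(idx)

# High-dimensional data: still correct, but pruning fails and the
# query approaches a full O(n) scan (curse of dimensionality).
high = rng.standard_normal((10_000, 512))
tree_hd = cKDTree(high)
dist, idx = tree_hd.query(high[0], k=5)
print(idx)
```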

Approximate nearest neighbor search sacrifices accuracy for speed. (Going deeper is beyond the scope of this post...)

 


https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/

 

Faiss: A library for efficient similarity search (engineering.fb.com)

Introduction to K-D Trees

https://www.baeldung.com/cs/k-d-trees#:~:text=Nearest%20Neighbour%20Search,points%20in%20each%20leaf%20node.

How to efficiently find k-nearest neighbours in high-dimensional data? (stackoverflow.com)

Understanding Ranking Loss, Contrastive Loss, Margin Loss, Triplet Loss, Hinge Loss and all those confusing names

https://gombru.github.io/2019/04/03/ranking_loss/

 


 

Interesting plots from a model trained on MNIST with Cross-Entropy Loss, Pairwise Ranking Loss, and Triplet Ranking Loss, plus PyTorch code for those trainings.

https://github.com/adambielski/siamese-triplet

 


 
