Given numpy arrays a and b, it is fairly straightforward find the indices of array a whose elements overlap with the elements of array b using the function numpy.in1d(). However, what if you want to also find the indices associated with elements in b that overlap with a? I’m unable to find a good solution to this, so I wrote a small function to do this. It does brute force matching after getting all the common elements of array a and b. If anyone knows of a better way, please let me know. The code snippet is below.
import numpy as np def overlap(a, b): # return the indices in a that overlap with b, also returns # the corresponding index in b only works if both a and b are unique! # This is not very efficient but it works bool_a = np.in1d(a,b) ind_a = np.arange(len(a)) ind_a = ind_a[bool_a] ind_b = np.array([np.argwhere(b == a[x]) for x in ind_a]).flatten() return ind_a,ind_b
Usage:
import overlap a = np.array([1,2,4,5]) b = np.array([4,6,10,9,1]) ind_a, ind_b = overlap.overlap(a,b)
UPDATE (2016-09-28): Based on comments from Mike (see below), there is different way to do this to account for non-unique elements. This also likely runs faster too. His suggestion is below:
def overlap_mbk(a, b): a1=np.argsort(a) b1=np.argsort(b) # use searchsorted: sort_left_a=a[a1].searchsorted(b[b1], side='left') sort_right_a=a[a1].searchsorted(b[b1], side='right') # sort_left_b=b[b1].searchsorted(a[a1], side='left') sort_right_b=b[b1].searchsorted(a[a1], side='right') # # which values are in b but not in a? # inds_b=(sort_right_a-sort_left_a == 0).nonzero()[0] # # which values are in b but not in a? # inds_a=(sort_right_b-sort_left_b == 0).nonzero()[0] # which values of b are also in a? inds_b=(sort_right_a-sort_left_a > 0).nonzero()[0] # which values of a are also in b? inds_a=(sort_right_b-sort_left_b > 0).nonzero()[0] return a1[inds_a], b1[inds_b]
This isn’t rigorously tested, but using numpy’s searchsorted method can do this quickly for non-unique inputs, too. Here’s some some sample code (along with commented out code that will do the same thing for fining elements not in common) that runs ~3x faster for the example you gave. Hope it’s useful.
%timeit -n 10000 overlap(a,b)
10000 loops, best of 3: 28.3 µs per loop
%timeit -n 10000 overlap_mbk(a,b)
10000 loops, best of 3: 8.68 µs per loop
def overlap_mbk(a, b):
a1=np.argsort(a)
b1=np.argsort(b)
# use searchsorted:
sort_left_a=a[a1].searchsorted(b[b1], side=’left’)
sort_right_a=a[a1].searchsorted(b[b1], side=’right’)
#
sort_left_b=b[b1].searchsorted(a[a1], side=’left’)
sort_right_b=b[b1].searchsorted(a[a1], side=’right’)
# # which values are in b but not in a?
# inds_b=(sort_right_a-sort_left_a == 0).nonzero()[0]
# # which values are in b but not in a?
# inds_a=(sort_right_b-sort_left_b == 0).nonzero()[0]
# which values of b are also in a?
inds_b=(sort_right_a-sort_left_a > 0).nonzero()[0]
# which values of a are also in b?
inds_a=(sort_right_b-sort_left_b > 0).nonzero()[0]
return a1[inds_a], b1[inds_b]
Thanks Mike! It’s great this works faster and for non-unique arrays. For my use case, I needed the elements to match one-to-one, so non-unique values are a problem.