What is Yelp / Near by Recommendation System
- Used to discover nearby places(eg: schools near me), events, restaurants, theaters, etc.
- Yelp App on phone sends (lattitude,longitude) of user’s device and yelp server sends locations within
10 km radius of (lattitude,longitude) to the user.
- (lattitude,longitude) are like coordinates of point on earth.
1. Requirements
Functional
- 1. Add places //POST
- 2. Update Places
- 3. Delete places
- 4. Get Places. Given (longitude,lattitude) nearby places should be listed for given radius. //GET
- 5. POST. Users should be able to give reviews, ratings, open/close times, pictures etc.
Non Functional
- 1. Scalable
Return nearby places without delay. (READ)
- 2. Reliable: Recommendations should be valid(not stale)
- 3. Fast
- 3. Available: On high load, system should be available. (CRUD=READ)
- 4. Automated Review Filters
2. BOE
- Storage Estimates:
- 200M~300M business listings are present on google
- Each place {string name, string address, double lat, double long, `vector
- 300M * 2000k = 60 PB / year
- User Reviews,Photos,Videos will keep on increasing. Suppose becomes 5 times per year. 300PB
- New Businesses also gets added, old fade away. Suppose from last year it becomes twice. 120PB
500 PB/year
5 years = 2 Exabytes
- QPS:
- Assuming 100k user are accessing per sec
- This may increase in peak season, and may decline/remain steady other times
- Bandwidth Estimates:
- 100k req/s.
- 1 GET request = 2KB
- 1 UPDATE request = 4KB //Avg of 2=3KB
Incoming bytes = 100k x 3k = 300MBytes/sec
Outgoing bytes = 100k x (sizeof 1 listing) = 100k x 2000 KB(calculated above) = 20GB/sec
3. System APIs
- 1. add_place = HTTP POST
POST https://url/v1/add_place
header {
Authorization: {Bearer "API_KEY_TOKEN"},
lat: "",
longitude: ""
}
body {
name, description, picture_url, video_url
}
- 2. update_place = HTTP PUT Place/person/thing information
GET https://url/v1/search_place
header {
Authorization: {Bearer "API_KEY_TOKEN"},
sort: optional,
business-id: id //unique key
}
body { //Server will compare the data and update the changed
name, description, picture_url, video_url,
userLocation {
lat: "",
long: "",
},
}
- 4. search_place //HTTP GET. Place/person/thing information
GET https://url/v1/search_place
header {
Authorization: {Bearer "API_KEY_TOKEN"},
userLocation {
lat: "",
long: "",
},
radius: 10km, //default
sort: optional,
category: optional
}
//Returns json:
{
{ //business-1
name: business-name
photo: photo_url,
ratings: array,
open close time: string
},
{
//business-2
},
}
4. Databases
- We will use SQL, not noSQL?
- NoSQL has good use when schema is dynamic, but we have fixed fields to store in DB here.
- 4.1 SQL Tables
Following are entities in system: Place, Place_Owner(who adds entry), Reviews, Photos, Videos
| ownerID(fk)[indexed] | lattitude | longitude | placeID(pk) | Description | categoryId(fk) | lastupdated_timestamp[indexed] |
ownerID: id of person who created the entry
lattitude: Geographic coordinate specifying north–south position of a point on the Earth's surface.
longitude: Geographic coordinate specifying east–west position of a point on the Earth's surface.
- (lattitude, longitude): precise location of features on the surface of the Earth.
placeID(8 bytes): Uniquely identifies a location. LocationId is taken 8 bytes(64 bits) considering future in mind.
- 264 = Huge number of locations
Description(512 bytes)
Category(1 byte): E.g., coffee shop, restaurant, theater, etc.
| ownerID(pk) | name | email | secret_key | placeID(fk) | created_timestamp |
placeId: locations this owner owns
created_timestamp: profile created by owner at this time
| review_id(pk) | creator_name | creator_email | review_description | placeID(fk) | created_timestamp |
photos stored on object store
| photo_id(pk) | active | photo_url | placeID(fk) | created_timestamp | marked_inactive_timestamp |
active: Is photo active on website, ie getting displayed. We will remove photo from db
after 30 days, if marks inactive
photo_url:
videos stored on object store
Reviews stored on object store
| reviewId(pk) | review-text-url | placeid(fk) | stars |
| categoryId(pk) | Category-text |
Benefits of keeping seperate Category table:
- Add/update category without affecting place table
- 4.2 DB Scaling and Partitioning
- Everyday new places(text, photos, videos, reviews) will get added, DB size will increase.
- Given such a huge QPS = 100k req/sec and such a huge data to store 2Exabytes, we would need to partition our DB, we cannot store everything at 1 place.
- Sharding based on location IDs
5. High Level Design
5.1 Authenticating to RESTful API using JWT(Json Web Tokens)
5.2 Reach Microservice from public network
5.3 Basic Design
- System will have Huge data. We assumed 200-300M entries.
- Now if someone want to get nearby places(within 10km), We need to perform following query on each entry of database
(long-10 and long+10) and (lat-10 and lat+10)
- Scanning the entire table and checking if it meets both criteria can be very time-consuming and expensive.
- DB should store region specific data. DB-1(store Europe(Germany)), DB-2(store India(Delhi)) data.
- Service instances should run on specific AWS regions, which will update/query data from local DB.
Quad Tree //Will Choose this | HashMap |
|
---|---|---|
Advantages |
- No problem in non-uniform distributed data. We can increase more DBs on heavy load. - Allows for spatial indexing, enabling efficient range queries and proximity searches. |
- Simple to implement. - Work well if data is uniformly distributed across zipcodes |
Disdvantages |
- More Complex wrt Hash map - Maintenance overhead as the system scales, especially in scenarios where data distribution changes frequently. |
- May not handle non-uniform data distribution effectively.
If certain regions have significantly more data than others, it could lead to load imbalance - Adding or removing servers might be challenging |
- Whole whole world map is divided into grids. 1 grid can reside on multiple servers. Search Complexity for DB is O(logn).
- Dynamically adjust the grid size such that whenever grid gets lot of places(maybe > 500) break it down into 4 dbs and distribute places b/w them.
- Thickly populated areas like San Francisco will have a lot of grids.
//quad means 4, it will always have 4 children
Quad Tree
[root]
/ | \ \
struct gridNode_or_dbNode {
uint32 gridId_or_dbID; //gridId hash gives the DB where (latt-start,long-start,latt-end,long-end) are stored
/*
child start-lat end-lat start-long end-long
1 1 10 101 111
2 11 20 111 121
3 20 30 121 131
4 30 180 131 -
*/
struct child_lat_long {
double child_lattitude-start, child_lattitude-end;
double child_longitude-start, child_longitude-end;
} CLL;
struct CLL[4];
double lattitude-start,lattitude-end;
double longitude-start,longitude-end;
struct grid *gridNode_or_dbNode[4]; //4 children
};
if (incoming_lat, long falls b/w range of any child)
jump to child
- Req-1: Add_place
@startuml actor UserApp as u box data center participant AppServer as as participant Diesel as diesel participant QuadTreeDB as db end box u -> as: POST(lat,lon,name,image) as -> diesel: Get quadtree root diesel -> as: root as -> diesel: insert\n(lat,lon,name,image) diesel -> db: insert into\nappropriate DB @enduml
- Req-4: Search. Hotels Near me
@startuml actor UserApp as u box data center participant AppServer as as participant Diesel as diesel participant QuadTreeDB as db end box u -> as: HTTP GET\nHotels in diameter\nof 10km(lat,lon) as -> diesel: Get quadtree root diesel -> as: root as -> diesel: Traverse to DB\nstoring (lat,lon) diesel -> db: Traverse to DB\nstoring (lat,lon) db -> diesel: db diesel -> as: db as -> diesel: GET Hotels in diameter\nof 10km(lat,lon) diesel -> db: db->diesel: diesel->as: HTTP Response\nbody{..Hotels..} as->u: Hotels @enduml
5.2 Problems in Basic Design
- Non-Functional-Req-1. System should be Scalable, but it is not
Problem-1: Only 1 application server serving all CREATE, READ. Will slow down on load
Solution: Multipod: Multiple instances of app-server behind load-balancer (this is horizontal scaling)
Problem-2: Only 1 DB serving all CREATE, READ. Will slow down on load
Solution: Master-Slave DB or Sharded Database
- Non-Functional-Req-2. Reliable, but it is not
Problem-1: Only 1 application server serving all CREATE, READ. if it fails, system will die
Solution: Multipod: Multiple instances of app-server behind load-balancer (this is horizontal scaling)
Problem-2: Only 1 DB serving all CREATE, READ. Will slow down on load
Solution: Master-Slave DB or Sharded Database
- Non-Functional-Req-3. Fast, but it is not
Problem-1: Only 1 application server serving all CREATE, READ. if it loaded system will be slow
Solution:
- Caching: store frequently accessed data in memory
- CDNs: CDNs to cache static content and serve it from edge servers located closer to users
Problem-2: Only 1 DB serving all CREATE, READ. Will slow down on load
Solution: Master-Slave DB or Sharded Database
Longitude,Lattitude Explained
Longitude (North to South) | Lattitude (East to West) | |
---|---|---|
Lines |
Lines of longitude (also called Meridians) run between the geographic North Pole and the geographic South Pole These are numbered from 1° to 179°. |
Lattitude is imaginary lines that circle Earth’s surface, running east and west parallel to the Equator North Pole Lattitude: 90°(N) South Pole Lattitude: 90°(S) |
Mesurement | In degrees, further broken into minutes, seconds | In degrees, further broken into minutes, seconds |