Near by Recommendation System

What is Yelp / Near by Recommendation System

- Used to discover nearby places(eg: schools near me), events, restaurants, theaters, etc.
- Yelp App on phone sends (lattitude,longitude) of user’s device and yelp server sends locations within 10 km radius of (lattitude,longitude) to the user.
- (lattitude,longitude) are like coordinates of point on earth.

1. Requirements

Functional

CRUD(Create=POST, Read=GET, Update=PUT, Delete=DELETE)

1. Add places //POST

2. Update Places

3. Delete places

4. Get Places. Given (longitude,lattitude) nearby places should be listed for given radius. //GET

5. POST. Users should be able to give reviews, ratings, open/close times, pictures etc.

Non Functional

1. Scalable

Add new places without delay. (CRUD=CREATE)
Return nearby places without delay. (READ)

2. Reliable: Recommendations should be valid(not stale)

3. Fast

3. Available: On high load, system should be available. (CRUD=READ)

4. Automated Review Filters

2. BOE

Storage Estimates:

- 200M~300M business listings are present on google
- Each place {string name, string address, double lat, double long, `vector reviews`, string photos_url} = 2000 KB
- 300M * 2000k = 60 PB / year
- User Reviews,Photos,Videos will keep on increasing. Suppose becomes 5 times per year. 300PB
- New Businesses also gets added, old fade away. Suppose from last year it becomes twice. 120PB
500 PB/year
5 years = 2 Exabytes

QPS:

- Assuming 100k user are accessing per sec
- This may increase in peak season, and may decline/remain steady other times

Bandwidth Estimates:

- 100k req/s.
- 1 GET request = 2KB
- 1 UPDATE request = 4KB //Avg of 2=3KB
Incoming bytes = 100k x 3k = 300MBytes/sec
Outgoing bytes = 100k x (sizeof 1 listing) = 100k x 2000 KB(calculated above) = 20GB/sec

3. System APIs

1. add_place = HTTP POST


                POST https://url/v1/add_place
                header {
                    Authorization: {Bearer "API_KEY_TOKEN"},
                    lat: "",
                    longitude: ""
                }
                body {
                    name, description, picture_url, video_url
                }

2. update_place = HTTP PUT Place/person/thing information


                GET https://url/v1/search_place
                header {
                  Authorization: {Bearer "API_KEY_TOKEN"},
                  sort: optional,
                  business-id: id    //unique key
                }
                body {              //Server will compare the data and update the changed
                  name, description, picture_url, video_url,
                  userLocation {
                    lat: "",
                    long: "",
                  },
                }

4. search_place //HTTP GET. Place/person/thing information


                GET https://url/v1/search_place
                header {
                  Authorization: {Bearer "API_KEY_TOKEN"},
                  userLocation {
                    lat: "",
                    long: "",
                  },
                  radius: 10km,  //default
                  sort: optional,
                  category: optional
                }
                //Returns json:
                {
                  {            //business-1
                    name: business-name
                    photo: photo_url,
                    ratings: array,
                    open close time: string
                  },
                  {
                    //business-2
                  },
                }

4. Databases

We will use SQL, not noSQL?

- NoSQL has good use when schema is dynamic, but we have fixed fields to store in DB here.

4.1 SQL Tables

Following are entities in system: Place, Place_Owner(who adds entry), Reviews, Photos, Videos

1. Place table (indexing for faster searching)


| ownerID(fk)[indexed] | lattitude | longitude | placeID(pk) | Description | categoryId(fk) | lastupdated_timestamp[indexed] |

ownerID: id of person who created the entry
lattitude: Geographic coordinate specifying north–south position of a point on the Earth's surface.
longitude: Geographic coordinate specifying east–west position of a point on the Earth's surface.
    - (lattitude, longitude): precise location of features on the surface of the Earth.
placeID(8 bytes): Uniquely identifies a location. LocationId is taken 8 bytes(64 bits) considering future in mind.
    - 2⁶⁴ = Huge number of locations
Description(512 bytes)
Category(1 byte): E.g., coffee shop, restaurant, theater, etc.

2. Place_owner table


| ownerID(pk) | name | email | secret_key | placeID(fk) | created_timestamp |

placeId: locations this owner owns
created_timestamp: profile created by owner at this time

3. Reviews table


| review_id(pk) | creator_name | creator_email | review_description | placeID(fk) | created_timestamp |

4. Place Photo table

photos stored on object store


| photo_id(pk) | active | photo_url | placeID(fk) | created_timestamp | marked_inactive_timestamp |

active: Is photo active on website, ie getting displayed. We will remove photo from db
    after 30 days, if marks inactive
photo_url:

5. Place video table

videos stored on object store

6. Reviews table

Reviews stored on object store


| reviewId(pk) | review-text-url | placeid(fk) | stars |

7. Category Table


| categoryId(pk) | Category-text |

Benefits of keeping seperate Category table:
- Add/update category without affecting place table

4.2 DB Scaling and Partitioning

Why we need DB Scaling?
- Everyday new places(text, photos, videos, reviews) will get added, DB size will increase.
- Given such a huge QPS = 100k req/sec and such a huge data to store 2Exabytes, we would need to partition our DB, we cannot store everything at 1 place.

We can do:
- Sharding based on location IDs

5. High Level Design

5.1 Authenticating to RESTful API using JWT(Json Web Tokens)

5.2 Reach Microservice from public network

5.3 Basic Design

Why we cannot store places directly on DB, Why we need to arrange DBs QuadTree Format?

- System will have Huge data. We assumed 200-300M entries.
- Now if someone want to get nearby places(within 10km), We need to perform following query on each entry of database (long-10 and long+10) and (lat-10 and lat+10)
- Scanning the entire table and checking if it meets both criteria can be very time-consuming and expensive.
- DB should store region specific data. DB-1(store Europe(Germany)), DB-2(store India(Delhi)) data.
- Service instances should run on specific AWS regions, which will update/query data from local DB.

2 ways in which Databases can be arranged

	Quad Tree //Will Choose this	HashMap
Advantages	- No problem in non-uniform distributed data. We can increase more DBs on heavy load. - Allows for spatial indexing, enabling efficient range queries and proximity searches.	- Simple to implement. - Work well if data is uniformly distributed across zipcodes
Disdvantages	- More Complex wrt Hash map - Maintenance overhead as the system scales, especially in scenarios where data distribution changes frequently.	- May not handle non-uniform data distribution effectively. If certain regions have significantly more data than others, it could lead to load imbalance - Adding or removing servers might be challenging

QuadTree:
- Whole whole world map is divided into grids. 1 grid can reside on multiple servers. Search Complexity for DB is O(logn).
- Dynamically adjust the grid size such that whenever grid gets lot of places(maybe > 500) break it down into 4 dbs and distribute places b/w them.
- Thickly populated areas like San Francisco will have a lot of grids.


      //quad means 4, it will always have 4 children
      Quad Tree
        [root]
        / | \ \

struct gridNode_or_dbNode {
  uint32 gridId_or_dbID;            //gridId hash gives the DB where (latt-start,long-start,latt-end,long-end) are stored

  /*
    child   start-lat   end-lat   start-long  end-long
    1         1           10        101       111
    2         11          20        111       121
    3         20          30        121       131
    4         30          180       131       -
  */
  struct child_lat_long {
    double child_lattitude-start, child_lattitude-end;
    double child_longitude-start, child_longitude-end;
  } CLL;
  struct CLL[4];

  double lattitude-start,lattitude-end;
  double longitude-start,longitude-end;
  struct grid *gridNode_or_dbNode[4];       //4 children
};

if (incoming_lat, long falls b/w range of any child)
  jump to child

Req-1: Add_place

Components in reaching Microservice from public network

@startuml
actor UserApp as u
box data center
participant AppServer as as
participant Diesel as diesel
participant QuadTreeDB as db
end box

u -> as: POST(lat,lon,name,image)
as -> diesel: Get quadtree root
diesel -> as: root
as -> diesel: insert\n(lat,lon,name,image)
diesel -> db: insert into\nappropriate DB

@enduml

Req-4: Search. Hotels Near me

@startuml
actor UserApp as u
box data center
participant AppServer as as
participant Diesel as diesel
participant QuadTreeDB as db
end box

u -> as: HTTP GET\nHotels in diameter\nof 10km(lat,lon)
as -> diesel: Get quadtree root
diesel -> as: root
as -> diesel: Traverse to DB\nstoring (lat,lon)
diesel -> db: Traverse to DB\nstoring (lat,lon)
db -> diesel: db
diesel -> as: db

as -> diesel: GET Hotels in diameter\nof 10km(lat,lon)
diesel -> db:
db->diesel:
diesel->as: HTTP Response\nbody{..Hotels..}
as->u: Hotels

@enduml

5.2 Problems in Basic Design

Non-Functional-Req-1. System should be Scalable, but it is not

Problem-1: Only 1 application server serving all CREATE, READ. Will slow down on load
Solution: Multipod: Multiple instances of app-server behind load-balancer (this is horizontal scaling)

Problem-2: Only 1 DB serving all CREATE, READ. Will slow down on load
Solution: Master-Slave DB or Sharded Database

Non-Functional-Req-2. Reliable, but it is not

Problem-1: Only 1 application server serving all CREATE, READ. if it fails, system will die
Solution: Multipod: Multiple instances of app-server behind load-balancer (this is horizontal scaling)

Problem-2: Only 1 DB serving all CREATE, READ. Will slow down on load
Solution: Master-Slave DB or Sharded Database

Non-Functional-Req-3. Fast, but it is not

Problem-1: Only 1 application server serving all CREATE, READ. if it loaded system will be slow
Solution:
- Caching: store frequently accessed data in memory
- CDNs: CDNs to cache static content and serve it from edge servers located closer to users

Problem-2: Only 1 DB serving all CREATE, READ. Will slow down on load
Solution: Master-Slave DB or Sharded Database

Longitude,Lattitude Explained

	Longitude (North to South)	Lattitude (East to West)
Lines	Lines of longitude (also called Meridians) run between the geographic North Pole and the geographic South Pole These are numbered from 1° to 179°.	Lattitude is imaginary lines that circle Earth’s surface, running east and west parallel to the Equator North Pole Lattitude: 90°(N) South Pole Lattitude: 90°(S)
Mesurement	In degrees, further broken into minutes, seconds	In degrees, further broken into minutes, seconds