통계분석, 머신러닝을 이용한 데이터 분석

OpenStreetMap Data Wrangling with MongoDB

10 Apr 2016

With Seoul map data from OppenStreetMap, I used data munging techniques, such as assessing the quality of the data for validity, accuracy, completeness, consistency and uniformity, to clean the OpenStreetMap data for a part of the world that I care about. I used MongoDB and applied new data schema to the project.

Map Area: Seoul, Republic of Korea

https://www.openstreetmap.org/relation/2297418

https://mapzen.com/data/metro-extracts/#seoul-south-korea

Overview

Problems Encountered in the Map

After initially downloading a sample data of the Seoul area and running it against a provisional p3.py file, I noticed three main problems with the data, which I will discuss in the following order:

Postal Codes

Postal codes in South Korea were 6-digit numeric, and 5-digit numeric system is introduced in August 1, 2015. Postal code strings have several problems, forcing a decision to strip middle character in 6-digit numeric. (“XXX-XXX” -> “XXXXXX”). After standarizing inconsistent postal codes, some “incorrect” postal codes surfaced when grouped together with this aggregator:

	db.seoul.aggregate([{'$match': {'tags.addr:postcode': {'$exists': 1}}}, {'$group': {'count': {'$sum': 1}, '_id': '$tags.addr:postcode'}}, {'$sort': {'count': -1}}])

Here are the top two results, beginning with the highest count:

	{u'_id': u'06321', u'count': 115},
	{u'_id': u'151841', u'count': 54}, …

Considering postal codes, it appears that 25% of documents aren’t in Seoul. Seoul postal codes should begin with “1” (in 6-digit format) and “01”~”09” (in 5-digit format). I found postal code starting with “420” is the most frequently appeared in documents. It is postal code of adjacent city of Seoul. So, I performed another aggregation to check:

	db.seoul.aggregate([{'$match': {'tags.addr:city': {'$exists': 1}}}, {'$group': {'count': {'$sum': 1}, '_id': '$tags.addr:city'}}, {'$sort': {'count': 1}}])

The results:

	{u'_id': u'용인시 (Yongin)', u'count':276},
	{u'_id': u'Seoul ', u'count':234},
	{u'_id': u'연수구 ', u'count':183},
	{u'_id': u'부천시(Bucheon)', u'count':145},
	{u'_id': u'부천시 (Bucheon)' , u'count':145}, ...

These result showed that this dataset include surrounding cities such as “Bucheon”, “Yongin”. Especially, Bucheon’s postal code starts with “420”, so, these postal codes aren’t “incorrect”, but simply unexpected.

Hospitals

Most nodes have “name”, “name:en”, and “name:ko_rm” tags. “name:en” is written in English, and “name:ko_rm” is written as it is pronounced in Korean. I focused hospital nodes to check their name fields:

	db.seoul.find({"tags.amenity": "hospital"})

Here are the two documents:

	{'_id': ObjectId('570a1a2c471c1f11deec607d'),
	...
	 'tags': {'amenity': 'hospital',
	 'name': '푸른정형외과의원 (Pureunjeonghyeongoegwa Clinic)',
	 'name:en': 'Pureunjeonghyeongoegwa Clinic', #Puren Orthopedic Clinic
	 'name:ko': '푸른정형외과의원',
	 'name:ko_rm': 'Pureunjeonghyeongoegwauiwon',
	 'ncat': '병원',
	 'source': 'http://kr.open.gugi.yahoo.com'},
	...
	},
	{'_id': ObjectId('570a1a2c471c1f11deec600b'),
	 ...
	 'tags': {'amenity': 'hospital',
	          'name': '우먼피아여성병원 (Umeonpiayeoseong Hospital)', 
	          'name:en': 'Umeonpiayeoseong Hospital', #Umeonpia Obstetrics & Gynecology Hospital
	          'name:ko': '우먼피아여성병원',
	          'name:ko_rm': 'Umeonpiayeoseongbyeongwon',
	          'ncat': '병원',
	          'source': 'http://kr.open.gugi.yahoo.com'},
	...
	}

“name:en” is translated name, but the documents have one as it is pronounced in Korean. So, it should be updated as I wrote in red. It is more consistent with other documents. There were typing error of “hospital” in amenity, such as “hopital”, and “hostpital”.

I summarized these cases in the following:

	mapping = { "hanuiwon": "Oriental Medicine",
	 "yeoseongbyeongwon" : "Obstetrics & Gynecology",
	 "yeoseonguiwon": "Obstetrics & Gynecology",
	 "yeoseong": "Obstetrics & Gynecology",
	 "soagwa": "Pediatric",
	 "singyeongoegwa" : "Neurosurgery",
	 "singyeong" : "Neurosurgery",
	 "jeonghyeongoegwa" : "Orthopedics",
	 "gajeonguihak": "Family Medicine",
	 "jaehwaluihakgwa": "Rehabilitation Medicine",
	 "tongjeunguihakgwa": "Pain Medicine",
	 "tongjeung": "Pain Medicine",
	 "dentist": "Dental",
	 "hopital": "Hospital",
	 "hostpital": "Hospital"
	 }

And, I updated “name:en” with new name using regex.

	db.seoul.update({'id':d}, {'$set': {"tags":{"name:en":data[d]}}})

The result, edited for readability:

	Gangdongmijeuyeoseong Hospital => Gangdongmijeu Obstetrics & Gynecology Hospital
	Pomijeuyeoseong Hospital => Pomijeu Obstetrics & Gynecology  Hospital
	Gyeonghuibom  Hanuiwon  Gangdongjeom => Gyeonghuibom   Oriental Medicine   Gangdongjeom Clinic
	...

Data Overview

This section contains basic statistics about the dataset and the MongoDB queries used to gather them.

Additional Ideas

Translation guideline suggestion

The translated name field(“name:en”) contains many errors. I cleaned 13%(655/4983) of “name:en” of hospitals. Because map data is gathered throughout the world, the translation-related problem might be common in other languages including Korean. So, I think cleaning up the similar cases is necessary.

Because of name fields are vary in different languages (“name:ko”, “name:fr”, “name:it”, etc.), common translation guideline should be considered beforehand, which might be to manage. If the translation guideline is set up and displayed to users, they might follow the guideline so that name fields are more consistent.

Additional data exploration using MongoDB queries

Registration period of hospitals

	> db.seoul.aggregate( [{ "$match" : {"tags.amenity" : "hospital"} },{"$group":{"_id": "$tags.amenity", "registration_from": { $min: "$timestamp" }, "registration_to": { $max: "$timestamp" }}}])
	{ 
		"result" : [ 
			{ 
				"_id" : "hospital", 
				"registration_from" : "2009-07-28T08:18:46Z", 
				"registration_to" : "2016-03-31T17:21:41Z" 
			} 
		], 
		"ok" : 1 
	} 

How many pediatric and dermatologic hospitals in seoul

	> db.seoul.aggregate( [{ "$match" : { "$or": [{"tags.name:en":{"$regex":"Pediatric"}},{"tags.name:en":{"$regex":"Dermatologic"}}] }}, {"$group":{"_id": "Pediatric and Dermatologic", "count":{"$sum":1}}}]) 
	{ 
		"result" : [ 
			{ 
				"_id" : "Pediatric and Dermatologic", 
				"count" : 415 
			} 
		], 
		"ok" : 1 
	} 

Conclusion

After this review of the map data it’s apparent that the Seoul area is incomplete, though I believe it has been well cleaned for the purposes of this exercise. It surprises me to notice documents are various structures according to area. Other metropolis have different document structure. And, a huge amount of Seoul data is copied from different sources. In the process of copying, some human mistakes seem to be inevitable. By auditing data, frequent mistakes are identified. So, I think it would be possible to prevent the mistakes through showing guidelines to users in advance, or using data processor similar to p3.py. It helps to maintain more consistent data in OpenStreetMap.org.

comments powered by Disqus